Curieux.JY
  • Post
  • Note
  • Jung Yeon Lee

On this page

  • 1 Introduction
    • 1.1 Constrained Markov Decision Process(CMDP)
    • 1.2 Policy Gradient Methods
  • 2 Method
    • 2.1 Interior-point Policy Optimization
    • 2.2 Logarithmic Barrier Function
    • 2.3 Performance Guarantee Bound
  • 3 Experiment
    • 3.1 Discounted Cumulative Constraints
    • 3.2 Mean Valued Constraints
    • 3.3 Constraint Effects
    • 3.4 Hyperparameter Tuning
    • 3.5 Multiple Constraints
    • 3.6 Stochastic Environment Effects
  • 4 Conclusion
  • 5 Reference

📃 IPO Review

paper
rl
cmdp
Interior-point Policy Optimization under Constraints
Published

November 10, 2024

1 Introduction

์˜ค๋Š˜์€ โ€œIPO: Interior-point Policy Optimization under Constraintsโ€๋ผ๋Š” ๋…ผ๋ฌธ์— ๋Œ€ํ•ด์„œ ๋ฆฌ๋ทฐํ•ด๋ณด๋ ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ํ”ํžˆ ๊ฐ•ํ™”ํ•™์Šต(Reinforcement Learning)์„ ์ฒ˜์Œ ๊ฐœ๋…์„ ๊ณต๋ถ€ํ•˜๊ณ  ๋‚˜๋ฉด, ๊ฐ•ํ™”ํ•™์Šต์˜ ๋ฌธ์ œ๋ฅผ MDP(Markov Decision Process)๋กœ ์ •์˜ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋– ์˜ฌ๋ฆด ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋•Œ ๊ฐ•ํ™”ํ•™์Šต์˜ ํ•ต์‹ฌ์ธ Reward, ์ฆ‰ ๋ณด์ƒ์„ ์ž˜ ์„ค์ •ํ•ด์ฃผ์–ด์•ผ Agent๊ฐ€ ์›ํ•˜๋Š” ๋ฐฉํ–ฅ๋Œ€๋กœ ํ•™์Šต์„ ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๋ณด์ƒ์€ Agent๊ฐ€ ํ•ด์•ผํ•˜๋Š” ํ–‰๋™ ์–‘์‹์˜ (+)๊ฐ€ ๋˜๋Š” ๋ฐฉํ–ฅ์„ ๋‚˜ํƒ€๋‚ด๋Š” ์ง€ ํ‘œ์ด๋ฉฐ ์šฐ๋ฆฌ๊ฐ€ ์›ํ•˜๋Š” ํ–‰๋™์„ Encourage(์žฅ๋ ค)ํ•˜๋Š” ์—ญํ• ์„ ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

This paper adds the notion of a constraint to the basic MDP. In the simplest view, a constraint can be thought of as a negative reward: by specifying the behaviors we would rather the agent not take, we effectively hand out negative reward (much as gradient ascent can be seen as the mirror image of gradient descent). Reward and constraint thus carry opposite (+)/(-) signs, but they share the same role of signaling the direction of learning to the agent.

์กฐ๊ธˆ ๋” Constraint์— ๋Œ€ํ•ด์„œ ์ž์„ธํžˆ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. Constraint๋Š” ์ œ์•ฝ์ด ๋ฐœ์ƒ๋˜๋Š” ์‹œ์ ์— ๋”ฐ๋ผ 2๊ฐ€์ง€๋กœ ๋‚˜๋ˆ„์–ด์„œ ์ƒ๊ฐํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

(Figure: Constraints)

First, an instantaneous constraint, as the name suggests, is a constraint imposed moment by moment: at every timestep where the agent takes an action, we check whether the constraint is violated. This mirrors the basic RL setup where a reward is given at every timestep. Consider controlling a robot arm (manipulator): the agent must drive the motors that make up the arm to produce the desired motion. At every instant of the motion, each motor (joint) must stay within its operating range and must not be subjected to excessive torque. Since these conditions have to be checked at every moment so that the agent learns to choose actions that stay within the limits, they are examples of instantaneous constraints.

๋‹ค์Œ์œผ๋กœ cumulative constraint๋Š” Agent๊ฐ€ ํ•™์Šตํ•˜๋Š” ํ•˜๋‚˜์˜ Episode ๋‚ด์—์„œ ๋ˆ„์ ํ•ด์„œ ๋‚˜์˜จ ๊ฐ’์œผ๋กœ ํŒ๋‹จํ•˜์—ฌ ์ œ์•ฝ์ƒํ™ฉ์„ ํŒ๋‹จํ•˜๋Š” ๊ฒƒ์„ ๋งํ•ฉ๋‹ˆ๋‹ค. ์ด๋•Œ ๋ˆ„์ ๋˜๋Š” ์‹œ๊ฐ„์€ ํ•˜๋‚˜์˜ Episode๊ฐ€ ์‹œ์ž‘ํ•ด์„œ ๋๋‚  ๋•Œ๊นŒ์ง€์ผ ์ˆ˜๋„ ์žˆ๊ณ  ์•„๋‹ˆ๋ฉด 5 timesteps ๋™์•ˆ์ด๋ผ๋Š” ํŠน์ • timestep ์ˆ˜๋ฅผ ์ง€์ •ํ•˜์—ฌ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋กœ๋ด‡ํŒ”์˜ ์˜ˆ์‹œ๋กœ ์‚ดํŽด๋ณด์ž๋ฉด, ๋กœ๋ด‡์ด ํŽœ์„ ์žก๋Š” ๋ชจ์…˜์„ ํ•  ๋•Œ๊นŒ์ง€ 100 timestep์ด ๊ฑธ๋ ธ๋Š”๋ฐ ๋งค timestep ๋งˆ๋‹ค ์ง€์—ฐ(latency)๊ฐ€ ๋ฐœ์ƒํ•˜์—ฌ ์ด๋ฅผ ์ œ์•ฝํ•˜๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์ƒํ™ฉ์—์„œ 100 timestep๋™์•ˆ์˜ average latency๋ฅผ ๊ตฌํ•ด์„œ ํŠน์ • latency๋ฅผ ๋„˜์ง€ ๋ชปํ•˜๋„๋ก constraint๋ฅผ ์ค„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์˜ˆ์‹œ์ฒ˜๋Ÿผ ํŠน์ • ๊ตฌ๊ฐ„ ๋™์•ˆ์˜ ๊ฐ’์„ ํ†ตํ•ด์„œ constraint๋ฅผ ์ฃผ๋Š” ๊ฒƒ์„ cumulative constraint๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฒˆ IPO ๋…ผ๋ฌธ์—์„œ๋Š” ๋‘๋ฒˆ์งธ๋กœ ์†Œ๊ฐœ๋“œ๋ฆฐ cumulative constraint์— ์ดˆ์ ์„ ๋งž์ถฐ ๊ฐœ๋ฐœ๋œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์†Œ๊ฐœํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

1.1 Constrained Markov Decision Process(CMDP)

A Constrained Markov Decision Process (CMDP) is an MDP with the constraints described above added. In a CMDP, a constraint value, like a reward, is received upon taking an action in the current state and arriving at the next state, so its space is defined as in the figure below.

A constraint is computed from a transition tuple such as (s_n, a_n, s_{n+1}), and a cumulative constraint is computed once a window of n transitions (indexed by t) has been collected. Since there may be several kinds of constraints, their number is denoted m (indexed by i). Unlike reward, where more is better, each constraint comes with a constraint limit that defines when the situation counts as violated, denoted \epsilon_i.

(Figure: Constraint Space and Constraint Limit)

Constraint์˜ Expectation์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜๊ฐ€ ๋˜๋ฉฐ 2๊ฐ€์ง€์˜ constraint ๊ณ„์‚ฐ๋ฐฉ๋ฒ•์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ฒซ๋ฒˆ์งธ๋กœ๋Š” discounted cumulative constraint๋กœ ํ• ์ธ์œจ \gamma๋ฅผ ๊ณ ๋ คํ•œ constraint๋“ค์„ ํ•˜๋‚˜์˜ policy๊ฐ€ ๋™์ž‘ํ•˜๋Š” ๋™์•ˆ ๋ˆ„์ ํ•ฉํ•œ ๊ฐ’์„ ๋งํ•ฉ๋‹ˆ๋‹ค. ๋‘๋ฒˆ์งธ๋กœ๋Š” ์ผ์ • timestep T๋™์•ˆ ๊ณ„์‚ฐํ•œ constraint๋“ค์˜ ํ‰๊ท ์„ ๋งํ•˜๋Š” ๊ฒƒ์œผ๋กœ mean values constraint๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด 2๊ฐ€์ง€ ์ข…๋ฅ˜์˜ ์ง€ํ‘œ์— ๋Œ€ํ•ด์„œ ํ›„์— ์‹คํ—˜์—์„œ ๋‹ค๋ฃฐ ์˜ˆ์ •์ด๋ฉฐ CMDP์˜ ๋ชฉํ‘œ๋ฅผ ์ •๋ฆฌํ•ด๋ณด๋ฉด, ๊ธฐ์กด์— J_R๋งŒ์„ Maximizationํ–ˆ๋˜ ๊ฐ•ํ™”ํ•™์Šต ๋ฌธ์ œ๊ฐ€ J_{C_i}๋ฅผ ๊ณ ๋ คํ•ด์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์ด ์ถ”๊ฐ€ ๋˜์—ˆ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

(Figure: Constraint Expectation)
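The two constraint measures can be sketched in code. This is an illustrative computation over a toy per-step cost sequence, not the paper's implementation:

```python
import numpy as np

def discounted_cumulative(costs, gamma):
    # J_C = Σ_t γ^t · C_t, accumulated while the policy runs
    return sum((gamma ** t) * c for t, c in enumerate(costs))

def mean_valued(costs):
    # (1/T) · Σ_t C_t over a window of T timesteps
    return float(np.mean(costs))

costs = [1.0, 0.0, 2.0, 1.0]  # toy per-step constraint values
print(round(discounted_cumulative(costs, gamma=0.9), 3))  # 1 + 2·0.9² + 0.9³ = 3.349
print(mean_valued(costs))                                 # 1.0
```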

๊ธฐ์กด์˜ Constraint๊ฐ€ ์žˆ๋Š” ์ตœ์ ํ™” ๋ฌธ์ œ๋Š” Lagrangian Relaxation Method๋ฅผ ํ†ตํ•ด์„œ ํ•ด๊ฒฐํ–ˆ์—ˆ์Šต๋‹ˆ๋‹ค. ๋ผ๊ทธ๋ž‘์ง€์•ˆ ์Šน์ˆ˜๋ฒ•์ด๋ผ๊ณ ๋„ ๋ถˆ๋ฆฌ๋Š” ํ•ด๋‹น ๋ฐฉ๋ฒ•์€ ๊ธฐ์กด์˜ ์ตœ์ ํ™” ์‹ f(x)์— constraint g_i(x)๊ฐ€ ์ถ”๊ฐ€๋œ ์ตœ์ ํ™” ๋ฌธ์ œ๋ฅผ Lagrange Multipilers๋ฅผ ๊ณฑํ•˜์—ฌ ๊ธฐ์กด ์ตœ์ ํ™” ํ•จ์ˆ˜ ๋ชฉ์ ์‹์— ๋”ํ•˜์—ฌ์„œ ์ œ์•ฝ ์กฐ๊ฑด์„ ํ‘ธ๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

(Figure: Lagrangian relaxation method)

Lagrangian relaxation is the simplest way to fold the constraints into the main objective, and solving CMDP problems this way has been the conventional approach. Although the constraints are satisfied once the policy converges, the approach is sensitive to the initial value of the Lagrange multiplier and to the learning rate, and the policies obtained during training do not consistently satisfy the constraints.
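A minimal sketch of the idea, assuming a generic objective f(x) with constraints written g_i(x) ≤ 0; the dual update shown is the kind of multiplier adjustment primal-dual methods such as PDO perform:

```python
import numpy as np

def lagrangian(f_x, g_x, lam):
    """L(x, λ) = f(x) + Σ_i λ_i · g_i(x), constraints written g_i(x) ≤ 0.
    The constrained problem is relaxed into optimizing L over x while
    doing gradient ascent on the multipliers λ."""
    return f_x + float(np.dot(lam, g_x))

def dual_update(lam, g_x, lr=0.01):
    # A multiplier grows while its constraint is violated (g_i > 0),
    # shrinks otherwise, and is projected back to λ ≥ 0.
    return np.maximum(np.asarray(lam) + lr * np.asarray(g_x), 0.0)

lam = np.array([0.1])
lam = dual_update(lam, g_x=[0.5])   # violated constraint -> λ increases
print(lam)
lam = dual_update(lam, g_x=[-0.5])  # satisfied constraint -> λ decreases
print(lam)
```

The sensitivity the paper criticizes lives exactly in the initial `lam` and `lr` above.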

1.2 Policy Gradient Methods

As we saw above, the goal of a CMDP is to find the optimal policy that maximizes the reward while satisfying the constraints.

(Figure: CMDP Goal)

Setting the constraints aside for a moment, how do we handle the basic RL objective of reward maximization? Policy gradient methods, one family of RL algorithms, compute the gradient of the objective below in order to find the optimal policy, i.e., the policy that collects the most reward. The parameters \theta are then updated with that gradient as shown below.

(Figure: Policy Gradient Methods)
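As a concrete toy instance of the policy gradient update, here is a REINFORCE-style sketch for a two-action softmax policy; the bandit-like return (action 1 is better) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)  # logits of a toy two-action softmax policy

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_pi(theta, a):
    # ∇_θ log π_θ(a) for a softmax policy: onehot(a) − π_θ
    g = -softmax(theta)
    g[a] += 1.0
    return g

alpha = 0.5
for _ in range(200):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)
    G = 1.0 if a == 1 else 0.0                  # toy return: action 1 is better
    theta += alpha * G * grad_log_pi(theta, a)  # θ ← θ + α · G · ∇_θ log π_θ(a)

# The gradient updates make the high-return action more probable.
print(softmax(theta))
```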

Trust Region Policy Optimization (TRPO) is a representative algorithm in the policy gradient family. To find the optimal policy it uses a surrogate function, and it limits the step size of each policy update with a KL-divergence constraint. TRPO's optimization problem can be written as below.

(Figure: TRPO vs. PPO)

However, TRPO relies on second-order optimization solved with conjugate gradient, which is computationally expensive. Proximal Policy Optimization (PPO) was proposed to make TRPO practical: it replaces the second-order problem with a first-order clipped surrogate objective, greatly reducing the computational complexity.
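PPO's clipped surrogate can be sketched as follows (a simplified, batch-free illustration of L^{CLIP}, not the paper's code):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """L^CLIP(θ) = E[min(r_t(θ)·Â_t, clip(r_t(θ), 1−ε, 1+ε)·Â_t)],
    where r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # Taking the min gives a pessimistic lower bound, which keeps the new
    # policy from moving too far from the old one in a single update.
    return np.minimum(unclipped, clipped).mean()

# Positive advantage: ratios above 1+ε earn no extra credit.
print(ppo_clip_objective(np.array([1.5]), np.array([2.0])))
# Negative advantage: the min again selects the more pessimistic term.
print(ppo_clip_objective(np.array([0.5]), np.array([-1.0])))
```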

IPO builds on this line of work by adding constraint terms to the PPO objective.

2 Method

2.1 Interior-point Policy Optimization

IPO์ด์ „์— CPO(Constrained policy optimization)๋ผ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ œ์•ˆ๋˜์—ˆ์—ˆ์Šต๋‹ˆ๋‹ค. IPO๋Š” CPO์˜ ๋‹จ์ ์„ ๋ณด์™„ํ•˜์—ฌ ์ œ์•ˆ๋œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ๋ณผ ์ˆ˜ ์žˆ์œผ๋ฉฐ ์•„๋ž˜์™€ ๊ฐ™์ด 2๊ฐœ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋น„๊ตํ•ด๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

(Figure: CPO vs. IPO)

CPO adds constraints to the TRPO objective, so it inherits TRPO's need for second-order computations. This makes it cumbersome or even infeasible to add further constraints or to handle aggregate constraints such as mean valued constraints. IPO, in contrast, is based on the PPO objective with constraints added, so it only needs first-order derivatives, and various constraints can be added easily through the logarithmic barrier function, the core idea explained next.

2.2 Logarithmic Barrier Function

IPO's problem definition, shown below, is the PPO objective with constraints added.

(Figure: IPO Problem Definition)

Each constraint, together with its limit, can be written as an inequality. Fed into an indicator function, it evaluates to -\infty when the constraint is violated and to 0 when it is satisfied. But the indicator function is discontinuous and non-differentiable, so its gradient cannot be computed; it is therefore approximated with a logarithmic barrier function.

(Figure: Logarithmic Barrier Function)

As the graph shows, the larger the hyperparameter t, the closer the logarithmic barrier function \phi is to the indicator function: the green t=50 curve resembles the dashed indicator. Moreover, since \phi is differentiable, it can be optimized via gradients.
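A minimal sketch of the barrier, assuming the standard interior-point form φ(g) = log(−g)/t for a constraint written as g ≤ 0:

```python
import numpy as np

def log_barrier(g, t):
    """φ(g) = log(−g) / t for a constraint written as g ≤ 0.
    On the feasible side it is finite (and → 0 as t → ∞); on the
    infeasible side it is −∞, mimicking the indicator function."""
    g = np.asarray(g, dtype=float)
    # The max() guard keeps log() away from non-positive arguments;
    # np.where then overwrites infeasible points with −∞ anyway.
    return np.where(g < 0, np.log(np.maximum(-g, 1e-300)) / t, -np.inf)

# The larger t is, the closer φ stays to the indicator's 0 on feasible points:
for t in (0.5, 5.0, 50.0):
    print(t, float(log_barrier(-0.5, t)))
```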

(Figure: IPO objective and pseudocode)

๋”ฐ๋ผ์„œ IPO์˜ ์ตœ์ ํ™”์‹์€ PPO์˜ ๋ชฉ์ ์‹ (L^{C L I P}(\theta))์— Logarithmic Barrier Function(\phi)์„ ์ด์šฉํ•˜์—ฌ ์ œ์•ฝ์กฐ๊ฑด์„ ํ•ฉ์น˜๊ฒŒ ๋œ(\sum_{i=1}^m \phi\left(\widehat{J}_{C_i}^{\pi_i}\right)) ๋ชจ์Šต์ด ๋ฉ๋‹ˆ๋‹ค.

2.3 Performance Guarantee Bound

Let us now look at the theoretical verification of IPO's performance guarantee.


Through this mathematical verification, we can conclude that IPO's objective is bounded within certain limits.


์ˆ˜์‹์ ์œผ๋กœ Performance Guarantee Bound๋ฅผ ํ™•์ธํ•˜์—ฌ t(logarithmic barrier function์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ)๊ฐ€ ํด์ˆ˜๋ก Indicator function์— ๋Œ€ํ•œ ๋” ์ข‹์€ ๊ทผ์‚ฌ๊ฐ’์„ ์ œ๊ณตํ•˜๊ฒŒ ๋˜๊ณ  ๋” ๋†’์€ reward์™€ cost๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ t๊ฐ€ ํด์ˆ˜๋ก ์ตœ์ ํ™” ์‹์ด ์ˆ˜๋ ดํ•˜๋Š” ์†๋„๋Š” ๋А๋ ค์ง„๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ ์ˆ˜์‹์œผ๋กœ ํ™•์ธํ•œ ๋‹จ์กฐ์„ฑ(monotonicity)์„ ์ด์šฉํ•˜์—ฌ, ์ˆ˜๋ ด ์†๋„์™€ ์ตœ์ ํ™” ์„ฑ๋Šฅ ์‚ฌ์ด์˜ ๊ท ํ˜•์„ ๋งž์ถœ ์ˆ˜ ์žˆ๋Š” ์ ์ ˆํ•œ t ๊ฐ’์„ ์ฐพ๊ธฐ ์œ„ํ•ด ์ด์ง„ ํƒ์ƒ‰ ์•Œ๊ณ ๋ฆฌ์ฆ˜(binary search)์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์‚ฌ์‹ค๋„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

3 Experiment

์‹คํ—˜์„ ํ†ตํ•ด ํ™•์ธํ•  ์ˆ˜ ์žˆ๋Š” IPO(Interior Point Optimization)์˜ ์ฃผ์š” ์žฅ์ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  • It handles more general forms of cumulative constraints, including discounted cumulative constraints and mean valued constraints.
  • Its hyperparameter is simple to set and easy to adjust.
  • It extends easily to optimization problems with multiple constraints.
  • It remains stable and robust even in stochastic environments.
(Figure: Baselines and experiment environments)

3.1 Discounted Cumulative Constraints

(Figure: Discounted Cumulative Constraints experiment results)
  • IPO vs. CPO
    • IPO
      • Shows the best performance.
      • Keeps exploring for better policies even after the constraints are satisfied.
      • As a result, it converges to higher reward and lower cost.
      • Its convergence is slower, but its final performance beats CPO.
    • CPO
      • Converges faster than IPO.
      • Stops improving once the constraints are satisfied.
      • It meets the constraints quickly, but performance improvement halts afterward.
      • It therefore may not optimize reward or cost as far as IPO does.
| Aspect | IPO | CPO |
|---|---|---|
| Convergence speed | Slow | Fast |
| Improvement after constraints are met | Keeps exploring (finds better policies) | Stops improving (once constraints are met) |
| Final performance | Higher reward and lower cost | No further improvement after meeting constraints |
  • IPO vs. PDO
    • IPO
      • Shows the best performance.
      • Keeps exploring for better policies even after the constraints are satisfied.
      • Training is stable, with little fluctuation in performance.
      • It is less sensitive to initialization and the learning rate.
    • PDO
      • Can converge to a policy as good as IPO's, but performance variance during training is high.
      • It can find policies that push the constraint value below the limit, but the resulting reward can end up the lowest.
      • It reacts sensitively to the initial value of the Lagrange multiplier and to the learning rate.
      • A poor initial setup can destabilize the training process.
| Aspect | IPO | PDO |
|---|---|---|
| Converged performance | Converges to the best performance | Can converge to IPO's level |
| Performance variance during training | Low (stable) | High (large swings) |
| Constraint satisfaction | Keeps exploring while satisfying the constraints | Drives the constraint value below the limit |
| Reward | High reward | Possibly the lowest reward |
| Sensitivity to initialization/learning rate | Low | High |
  • (optional) CPO vs. PPO / TRPO
| Aspect | CPO | PPO | TRPO |
|---|---|---|---|
| Handles constraints | Yes | No | No |
| Reward | High (within the constraints) | Highest (may violate constraints) | High (constraints only indirectly tempered) |
| Risk of constraint violation | Low | High | Medium (partly tempered by the trust region) |
| Training stability | High | High | Very high |
| Computational complexity | Medium | Low | High |

3.2 Mean Valued Constraints

(Figure: Mean Valued Constraints experiment results)
  • IPO vs. PDO
    • IPO
      • Consistent convergence: in every task it converges stably to a policy with a high discounted cumulative reward.
      • Constraint satisfaction: it continuously satisfies the mean valued constraints in every task.
      • Stable training: performance fluctuates little during training, with low variance.
    • PDO
      • Possible constraint violation: it can occasionally converge to a policy that violates the constraints. (See Figure 3b.)
      • High variance during training: its performance swings widely during training. (See Figures 3d and 3f.)
      • Potentially high reward: it sometimes achieves high rewards, but at the risk of not honoring the constraints.
| Aspect | IPO | PDO |
|---|---|---|
| Discounted cumulative reward | Converges stably to a high reward | High reward possible but unstable |
| Constraint satisfaction | Always satisfies the constraints | Occasionally violates them |
| Performance variance during training | Low (stable) | High (large swings) |
| Stability | Very stable | Sensitive to initialization and learning rate |

3.3 Constraint Effects

In the Point Gather environment, when the constraint is relaxed so that the threshold is 1, each agent may collect at most one bomb on average. With the constraint relaxed this far, the condition becomes so loose that the constrained optimization problem performs at the same level as the unconstrained one.

  • CPO
    • CPO still tries to increase the cost up to the constraint limit (1).
    • This can occasionally perform worse than a randomly initialized policy.
    • CPO tends to always push the cost up to the constraint limit (1).
  • IPO
    • IPO keeps reducing the cost even after the constraint is satisfied.
    • It therefore reaches a lower cost and shows better final performance.
| Aspect | CPO | IPO |
|---|---|---|
| Constraint behavior | Increases cost up to the limit (1) | Keeps reducing cost after the constraint is met |
| Final cost level | ~1 | ~0.25 |
| Performance | Prioritizes hitting the limit, possibly hurting performance | Better performance while satisfying the constraint |

๋”ฐ๋ผ์„œ ์‹คํ—˜์„ ํ†ตํ•ด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ฒฐ๋ก ์„ ๋‚ด๋ฆด ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • CPO๋Š” ์ œ์•ฝ์„ ๋งž์ถ”๊ธฐ ์œ„ํ•ด ๋น„์šฉ์„ ์ ๊ทน์ ์œผ๋กœ ์ฆ๊ฐ€์‹œํ‚ค์ง€๋งŒ, ๊ทธ ๊ฒฐ๊ณผ ์„ฑ๋Šฅ์ด ๋–จ์–ด์งˆ ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค.
  • IPO๋Š” ์ œ์•ฝ์„ ๋งŒ์กฑํ•œ ์ดํ›„์—๋„ ๋น„์šฉ์„ ์ค„์ด๋ฉฐ, ๋” ๋†’์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

3.4 Hyperparameter Tuning

  • IPO vs. PDO
    • IPO
      • The hyperparameter t is easy to tune.
      • Reward and cost are positively correlated with t: as t grows, both increase together.
      • Binary search is possible: t can be adjusted while checking performance, so a good value can be found quickly.
    • PDO
      • Setting the initial Lagrange multiplier (\lambda) and the learning rate is tricky.
      • It reacts very sensitively when the initial \lambda is between 0.01 and 0.1.
      • A poor initialization can destabilize the training process.
      • It is also sensitive to changes in the learning rate: lowering it from 0.01 to 0.001 slows policy convergence.
      • Setting its hyperparameters takes considerable time and effort.
| Aspect | IPO | PDO |
|---|---|---|
| Ease of hyperparameter tuning | Easy | Difficult and involved |
| Reward/cost behavior | Positively correlated with t | Sensitive to initial \lambda and learning rate |
| Sensitivity to initial settings | Low | High |
| Tuning method | Binary search over t | Substantial effort on initialization and learning rate |

๋”ฐ๋ผ์„œ ์‹คํ—˜์„ ํ†ตํ•ด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ฒฐ๋ก ์„ ๋‚ด๋ฆด ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • IPO๋Š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ t์˜ ํŠœ๋‹์ด ์‰ฝ๊ณ , ๋ณด์ƒ๊ณผ ๋น„์šฉ์ด t ๊ฐ’์— ๋”ฐ๋ผ ์˜ˆ์ธก ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋ณ€ํ™”ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์•ˆ์ •์ ์ธ ์ตœ์ ํ™”๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
  • PDO๋Š” ์ดˆ๊ธฐํ™”์™€ ํ•™์Šต๋ฅ ์— ๋ฏผ๊ฐํ•˜์—ฌ ํŠœ๋‹์ด ๊นŒ๋‹ค๋กญ๊ณ  ํ•™์Šต ๊ณผ์ •์ด ๋ถˆ์•ˆ์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํŠนํžˆ ์ดˆ๊ธฐ \lambda์™€ ํ•™์Šต๋ฅ  ์„ค์ •์ด ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค.

3.5 Multiple Constraints

IPO (Interior-point Policy Optimization) is designed to handle constraints in a flexible and scalable way. In particular, whenever a new constraint is needed, all that has to be done is to add a log-barrier term for it to the existing objective. This makes adding constraints simpler than in CPO and gives IPO the advantage in scalability and flexibility.

  • Comparison with CPO
    • CPO (Constrained Policy Optimization) handles constraints directly, but every added constraint increases the problem's complexity and can make tuning harder.
    • IPO, by contrast, uses the logarithmic barrier function, so constraints can be extended easily and implementation and tuning are simpler.
  • Extending the constraints in the Point Gather experiment
    • In the Point Gather environment, various constraints can be added to the agent's reward-collecting task.
    • To add constraints in the experiment, new types of balls (objects corresponding to constraints) can be introduced.
    • For example, besides the existing bombs, several kinds of balls representing new constraints can be added, and the agent can learn a policy that avoids them while still collecting as much reward as possible.
    • This makes it possible to evaluate IPO in environments with multiple constraints.
| Aspect | IPO | CPO |
|---|---|---|
| Ease of adding constraints | Just add a log-barrier term | Requires complex additional work and tuning |
| Scalability | Extends easily to multiple constraints | Complexity grows with each added constraint |
| Point Gather experiment | Balls for various constraints can be added | Risk of performance degradation as constraints are added |

3.6 Stochastic Environment Effects

Real-world environments always contain uncertainty: the outcome of an agent's action is often affected by random noise, from unexpected factors such as wind, sensor error, or friction. In this experiment, an action is defined as a (velocity, heading) vector with each component in the range -1 to 1, describing the direction and speed the agent will move.

์‹คํ—˜์—์„œ๋Š” ํ‰๊ท  0์˜ ๋žœ๋ค ๋…ธ์ด์ฆˆ๋ฅผ ํ–‰๋™(action)์— ์ถ”๊ฐ€ํ•˜์—ฌ ํ™˜๊ฒฝ์˜ ๋ถˆํ™•์‹ค์„ฑ์„ ๋ชจ์‚ฌํ–ˆ์Šต๋‹ˆ๋‹ค.

  • ๋…ธ์ด์ฆˆ์˜ ๋ถ„์‚ฐ(variance)์€ ์„ธ ๊ฐ€์ง€ ๊ฐ’์œผ๋กœ ์„ค์ •๋˜์—ˆ์Šต๋‹ˆ๋‹ค:
    • \sigma^2 = 0.2
    • \sigma^2 = 0.5
    • \sigma^2 = 1.0
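The noise injection can be sketched as follows (a minimal illustration; clipping the perturbed action back into [-1, 1] is an assumption about how out-of-range actions are handled):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_action(action, sigma2):
    """Add zero-mean Gaussian noise with variance σ² to a (velocity,
    heading) action and clip the result back into the valid range."""
    noise = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=np.shape(action))
    return np.clip(np.asarray(action) + noise, -1.0, 1.0)

for sigma2 in (0.2, 0.5, 1.0):  # the three variances used in the experiment
    a = noisy_action([0.3, -0.7], sigma2)
    print(sigma2, a, bool(np.all(np.abs(a) <= 1.0)))
```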
  • Training still converged successfully at \sigma^2 = 0.5.
    • This shows the agent can learn a stable policy under a moderate level of environmental uncertainty.
  • At \sigma^2 = 1.0, the noise is large enough that training may become unstable, which would need further experiments to confirm.
  • Adding random noise to reflect real-world uncertainty plays an important role in evaluating the robustness of reinforcement learning.
  • At a moderate noise level (\sigma^2 = 0.5), training proceeded stably, confirming that the agent can adapt well to a range of environmental variation.

4 Conclusion

์ด๋ฒˆ ํฌ์ŠคํŒ…์—์„œ๋Š” ์ œ์•ฝ์กฐ๊ฑด์„ ํฌํ•จํ•œ MDP์˜ ๋ฌธ์ œ๋Š” ์–ด๋–ป๊ฒŒ ์ •์˜ํ•  ์ˆ˜ ์žˆ๊ณ  ์–ด๋–ค ๋ฐฉ์‹์œผ๋กœ ์ตœ์ ํ™”์‹์„ ๋””์ž์ธํ•˜์—ฌ ํ’€ ์ˆ˜ ์žˆ๋Š”์ง€ ์‚ดํŽด๋ณด๋ฉฐ IPO ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ๋Œ€ํ•ด ์•Œ์•„๋ณด์•˜์Šต๋‹ˆ๋‹ค. ๊ฐ•ํ™”ํ•™์Šต์—์„œ ํ•™์Šต์˜ ๋ฐฉํ–ฅ์„ฑ์„ Reward๋กœ๋งŒ ๋””์ž์ธ ํ•˜๊ฒŒ๋  ๊ฒฝ์šฐ์˜ ๋ฌธ์ œ๋“ค์„ Constraint๋กœ ๋ฐ”๊พธ์–ด์„œ ๋””์ž์ธํ•˜๊ฒŒ ๋œ๋‹ค๋ฉด ๋งŽ์€ ์ด์ ์ด ์žˆ์„ ์ˆ˜ ์žˆ๊ณ , CMDP๋ฅผ ๋‹ค๋ฃฌ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์— ๋น„ํ•ด ์‹ฌํ”Œํ•˜๋ฉด์„œ๋„ ์‚ฌ์šฉํ•˜๊ธฐ ํŽธํ•œ ์•„์ด๋””์–ด๋ผ๋Š” ์ƒ๊ฐ์ด ๋“ค์—ˆ์Šต๋‹ˆ๋‹ค.

5 Reference

  • Original Paper: IPO
  • Lagrangian relaxation method Diagram

Copyright 2024, Jung Yeon Lee