Curieux.JY
  • JungYeon Lee

📃 Combining MPC & RL

rl
mpc
quadruped
locomotion
Combining Model-Predictive Control and Predictive Reinforcement Learning for Stable Quadrupedal Robot Locomotion
Published

April 20, 2026

  • Paper Link (arXiv:2307.07752)
  1. 🐾 The paper proposes a hybrid controller that combines model predictive control (MPC) with predictive reinforcement learning (RQL) to generate stable gaits for a quadruped robot.
  2. 🤖 The RQL scheme uses a Q-function modeled by a neural network as the terminal cost of the MPC prediction horizon. This eases the computational burden and outperforms MPC, particularly at short horizons.
  3. ✨ In experiments, RQL reaches a much lower accumulated running cost and orientation error than MPC even at a short prediction horizon (N=2), showing that it balances real-time online control capability against computational efficiency.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

The paper presents a hybrid control method that combines Model-Predictive Control (MPC) and Predictive Reinforcement Learning (RL) to generate stable gaits for quadruped robots. Quadruped robots offer high mobility and maneuverability, but their complex mechanical structure and many degrees of freedom make them hard to control.

Conventional MPC is a powerful method that predicts future states from a system model and constraints and computes the optimal control input. Its limitations are a short planning horizon, possible convergence to local optima, model error, and no account of future replanning. To address these, prior work has integrated MPC with Whole-Body Control (WBC), with learning-based methods, and with Control Lyapunov Functions (CLF).

Reinforcement learning (RL) adapts purely from experience and performs very well on complex problems. Applied to complex systems such as robot control, however, it requires costly simulation and experiments and is itself complex.

To address MPC's short prediction horizon, this work proposes a hybrid approach called Roll-out Q-Learning (RQL). It adds a tail cost in the form of a Q-function to the MPC cost function, which implicitly extends the prediction horizon. The Q-function is modeled by a neural network to ease the computational complexity.

2. System dynamics:

The system dynamics are defined on the Unitree A1 robot model. The robot is treated as a single rigid body acted on by forces at its contact points; leg dynamics are neglected because the legs are light relative to the main body. In world coordinates the rigid-body dynamics are:

\ddot{p} = \sum_{i=1}^{4} \frac{f_i}{m} - g \quad (1)

\frac{d}{dt} (I\omega) = \sum_{i=1}^{4} r_i \times f_i \quad (2)

\dot{R} = \omega \times R \quad (3)

Here \ddot{p} is the second derivative of the robot position p, f_i is the i-th ground reaction force, r_i is the corresponding lever arm, m is the total mass of the robot, g is the gravitational acceleration, I is the inertia tensor, R is the rotation matrix, and \omega is the angular velocity. The robot's orientation is given by the Euler angles \Theta = [\phi, \theta, \psi], and the full dynamics are written as the state-space model:

\frac{d}{dt} \begin{bmatrix} p \\ \Theta \\ v \\ \omega_B \end{bmatrix} = \begin{bmatrix} v \\ J^{-1}\omega_B \\ \sum_{i=1}^{4} \frac{f_i}{m} - g \\ I_B^{-1} (R^T \sum_{i=1}^{4} r_i \times f_i - \omega_B \times I_B \omega_B) \end{bmatrix} \quad (6)

Here J^{-1} is the matrix that maps the body-frame angular velocity \omega_B to the rates of change of the Euler angles, and I_B is the inertia tensor in the body frame. The state x, action u, and lever parameters \vartheta are defined as:

x := [p \ \Theta \ v \ \omega_B]^T \quad (7)

u := [f_1 \ f_2 \ f_3 \ f_4]^T \quad (8)

\vartheta := [r_1 \ r_2 \ r_3 \ r_4]^T \quad (9)

The dynamics can therefore be written as \dot{x} = f(x, \vartheta, u).

3. Methods:

3.1 Model Predictive Control (MPC):

MPC finds the control inputs that minimize the following cost function J_{MPC}:

\min_{\{u_{i|k}\}_i^N} J_{MPC}(x_0, \{x_{des,i|k}\}_i^N | \{u_{i|k}\}_i^N) := \min_{\{u_{i|k}\}_i^N} \sum_{i=1}^N \gamma^{i-1}r(\hat{x}_{i|k}, x_{des,i|k}, u_{i|k}) \quad (11)

The constraints are:

  • \hat{x}_{0|k} = x_0 (initial state)
  • \hat{x}_{i+1|k} = \Phi(\delta, \hat{x}_{i|k}, \vartheta_{i|k}, u_{i|k}) (system dynamics; \Phi is a numerical integration scheme, here the explicit Euler scheme \Phi(\delta, \hat{x}_{i|k}, \vartheta_{i|k}, u_{i|k}) = \hat{x}_{i|k} + \delta f(\hat{x}_{i|k}, \vartheta_{i|k}, u_{i|k}))
  • C_{i|k}u_{i|k} = 0 (contact schedule constraint: the force is zero during the swing phase)
  • Du_{i|k} \le 0 (friction cone constraint, which prevents sliding: -\mu f_z \le f_x \le \mu f_z, -\mu f_z \le f_y \le \mu f_z)

Here \gamma is the discount factor, N is the prediction horizon, and r is the running cost. At every time step the MPC algorithm takes the current state, computes the optimal action sequence, and applies the first action to the system.
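The two force constraints above can be written as small matrices. The following is a minimal NumPy sketch, not the paper's implementation: it assumes the action stacks the four foot forces as (f_x, f_y, f_z) triples, and the helper names `friction_cone_matrix` and `contact_selector` are hypothetical. The value mu = 0.3 is the friction coefficient quoted in the Ring Review of this post.

```python
import numpy as np

def friction_cone_matrix(mu: float, n_feet: int = 4) -> np.ndarray:
    """Build D so that D @ u <= 0 encodes |f_x| <= mu*f_z and |f_y| <= mu*f_z per foot."""
    # Rows for one foot with force (f_x, f_y, f_z):
    #   f_x - mu*f_z <= 0,  -f_x - mu*f_z <= 0,  f_y - mu*f_z <= 0,  -f_y - mu*f_z <= 0
    d_foot = np.array([
        [1.0, 0.0, -mu],
        [-1.0, 0.0, -mu],
        [0.0, 1.0, -mu],
        [0.0, -1.0, -mu],
    ])
    return np.kron(np.eye(n_feet), d_foot)

def contact_selector(in_contact) -> np.ndarray:
    """Build C so that C @ u = 0 forces the swing-leg forces to zero."""
    swing = np.array([0.0 if c else 1.0 for c in in_contact])
    return np.kron(np.diag(swing), np.eye(3))

mu = 0.3
D = friction_cone_matrix(mu)
C = contact_selector([True, False, False, True])  # feet 2 and 3 are in swing

u = np.zeros(12)
u[0:3] = [1.0, -2.0, 10.0]   # foot 1: tangential forces within mu * f_z = 3.0
u[9:12] = [0.0, 0.0, 10.0]   # foot 4: purely vertical force
feasible = bool(np.all(D @ u <= 1e-12) and np.allclose(C @ u, 0.0))
```

Adding the first two rows of a foot's block gives -2 mu f_z <= 0, so this form of D also implies that the vertical force cannot pull on the ground.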

3.2 Roll-Out Q-Learning (RQL):

RQL adds a Q-function term to the MPC cost function to approximate the terminal cost at the end of the prediction horizon N:

\min_{\{u_{i|k}\}_i^N} J_{RQL}^a(x_0, \{x_{des,i|k}\}_i^N | \{u_{i|k}\}_i^N; w_k) := \min_{\{u_{i|k}\}_i^N} \left( \sum_{i=1}^{N-1} \gamma^{i-1}r(\hat{x}_{i|k}, x_{des,i|k}, u_{i|k}) + \hat{Q}(\hat{x}_{N|k}, x_{des,N|k}, u_{N|k}; w_k) \right) \quad (23)

The constraints are the same as for MPC. The Q-function \hat{Q}(x_k, x_{des,k}, u_k; w_k) is updated at every time step by minimizing the following loss:

J_k^c := \frac{1}{2} \sum_{i=k}^{k+M-1} e_i^2(w) \quad (21)

e_k(w) := \hat{Q}(x_k, x_{des,k}, u_k; w) - r(x_k, x_{des,k}, u_k) - \hat{Q}(x_{k+1}, x_{des,k+1}, u_{k+1}; w_{prev}) \quad (22)

Here M is the buffer size (M=500) and w_k are the weights of the Q-function neural network. The Q-function model is defined as:

\hat{Q}(x_k, x_{des,k}, u_k; w) := z_k^T A z_k \quad (30)

z_k := \begin{bmatrix} x_k - x_{des,k} \\ \sum_{i=1}^4 f_{i,k} - mg \end{bmatrix} \quad (31)

Here A is a diagonal matrix with the Q-function weights w on its diagonal. The model is designed so that the Q-function is zero when the robot stands at the desired position (x_k = x_{des,k}) and applies the "ideal" force (the forces sum to mg). The model is kept simple on purpose, for computational efficiency, and is still expected to be effective as a terminal cost.
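The Q-function model (30)-(31) is small enough to write out directly. This is a sketch under stated assumptions: the state has 12 components (p, Theta, v, omega_B), gravity acts along -z so the weight vector is (0, 0, mg), and the mass value is a placeholder rather than an A1 parameter. The function names are hypothetical.

```python
import numpy as np

def q_features(x, x_des, forces, mass, g=9.81):
    """z_k from (31): state error stacked with the net force minus the weight."""
    # Assumption: gravity acts along -z, so standing requires a net force (0, 0, m*g).
    weight = np.array([0.0, 0.0, mass * g])
    net_force_error = np.sum(forces, axis=0) - weight
    return np.concatenate([x - x_des, net_force_error])

def q_hat(x, x_des, forces, w, mass, g=9.81):
    """Q_hat = z^T A z with A = diag(w), as in (30)."""
    z = q_features(x, x_des, forces, mass, g)
    return float(z @ (w * z))

mass = 12.0                      # placeholder mass, not taken from the paper
x_des = np.zeros(12)
standing = np.tile([0.0, 0.0, mass * 9.81 / 4], (4, 1))  # each foot carries m*g/4
w = np.ones(15)                  # 12 state-error weights + 3 force-error weights

q_ideal = q_hat(x_des, x_des, standing, w, mass)       # on target with ideal forces
q_off = q_hat(x_des + 0.1, x_des, standing, w, mass)   # every state component off by 0.1
```

With the robot on target and the forces summing to the weight, `q_ideal` is zero up to floating-point rounding, which is the design property described above; any state or force error makes the value positive when the weights are positive.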

4. Experimental setup and results:

The experiments were run in a simulation of the Unitree A1 robot using the rcognita, ROS, and Quad-SDK frameworks. The running cost r(x, x_{des}, u) is defined from the state error e_x = x - x_{des} and the action error e_u = u - u_{des}:

r(x, x_{des}, u) := e_x^T P_x e_x + e_u^T P_u e_u \quad (29)

P_x and P_u are diagonal matrices. u_{des} is the reference ground reaction force that keeps the robot standing ([mg/4 \ mg/4 \ mg/4 \ mg/4]^T).

The results are as follows:

  • Short prediction horizon (N=2): RQL markedly reduces the Z-axis position error and the orientation errors (roll, pitch) compared with MPC. The roll error in particular drops by almost a factor of 10. The mean running cost of RQL is about 3 times lower than that of MPC. The reason is that the Q-function plays a dominant role in the cost function, which lets RQL outperform MPC.
  • Long prediction horizon (N=5): the two controllers perform almost identically, because the Q-function matters less relative to the other terms of the cost function as the horizon grows.
  • Accumulated running cost: RQL with a short horizon achieves a lower accumulated running cost than MPC with a long horizon.
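The running cost (29) is a pair of weighted squared errors. A minimal sketch, assuming diagonal weights passed as vectors, a 12-dimensional action in which each foot's reference force is purely vertical, and placeholder weight values that are not the paper's:

```python
import numpy as np

def running_cost(x, x_des, u, u_des, p_x_diag, p_u_diag):
    """r = e_x^T P_x e_x + e_u^T P_u e_u for diagonal P_x and P_u."""
    e_x = x - x_des
    e_u = u - u_des
    return float(e_x @ (p_x_diag * e_x) + e_u @ (p_u_diag * e_u))

mass, g = 12.0, 9.81                               # placeholder values
x_des = np.zeros(12)
u_des = np.tile([0.0, 0.0, mass * g / 4], 4)       # each foot carries m*g/4 vertically
p_x_diag = np.full(12, 50.0)                       # hypothetical state weights
p_u_diag = np.full(12, 1e-3)                       # hypothetical action weights

x = x_des.copy()
x[2] += 0.1                                        # body 0.1 m off in z
r_standing = running_cost(x_des, x_des, u_des, u_des, p_x_diag, p_u_diag)
r_offset = running_cost(x, x_des, u_des, u_des, p_x_diag, p_u_diag)
```

`r_standing` is zero because both errors vanish, and `r_offset` equals the z weight times the squared offset, 50 * 0.1^2.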

5. Concluding remarks:

The experiments show a substantial advantage for RQL at low prediction horizons even though a simple linear Q-function model (31) is used. Given how strongly nonlinear the robot is, the authors point out that the Q-function approximation should be nonlinear, and they expect a more flexible, nonlinear Q-function model to give better results at higher horizons as well. This suggests that combining MPC and RL helps balance online control capability against computational complexity in robot control.


🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Introduction

Modern quadruped robots are agile and can cross rough terrain, which suits them to applications such as inspection and delivery. Their complex mechanical structure and many degrees of freedom, however, make them hard to control efficiently in dynamically changing environments.

  • MPC works over a finite prediction horizon and handles complex constraints efficiently, so it has been applied widely, from industry and indoor microclimate control to quadruped robots. Traditional MPC is nevertheless limited by a short planning horizon, convergence to local optima, errors in the dynamics model, and no account of future replanning. The short horizon is a consequence of computational complexity, and it can be eased by approximating the parameters of the objective with learning-based methods.
  • RL has been applied to complex problems such as quadruped control, where it has learned a state estimator and a policy at the same time and has adapted to rough terrain and to changes in the dynamics.

There are two ways to combine RL with MPC to ease the short-horizon problem: (1) learn the horizon length itself, or (2) implicitly account for a longer prediction interval inside the cost-function optimization. The motivation for the second is clear: if the terminal cost is accurate enough with respect to the infinite-horizon solution, a short prediction horizon can still perform well. Learning a good terminal cost is therefore a reasonable goal, and one realization of it is Roll-Out RL (RQL). RQL can be seen as an extension or enhancement of traditional MPC.

Because RQL had given good results on simple models, the authors apply it to a more complex system, quadruped locomotion, and compare it against traditional MPC as the baseline. The key observation is that at short prediction horizons RQL beats MPC in accumulated running cost, which suggests that the stacked approach (MPC + learned tail cost) may be a strong alternative to pure RL for quadruped robots.

Methods

flowchart LR
    subgraph SYS["System Dynamics (Unitree A1)"]
        DYN["ẋ = f(x, ϑ, u)<br/>x=[p,Θ,v,ω_B], u=[f₁..f₄]"]
    end
    subgraph MPC["MPC"]
        JM["min Σ γ^(i-1) r(x̂,x_des,u)<br/>over horizon N"]
    end
    subgraph RQL["Roll-Out Q-Learning"]
        JR["min Σ_(i=1)^(N-1) γ^(i-1) r(...)<br/>+ Q̂(x̂_N, x_des, u_N; w)<br/>(learned tail cost)"]
        CRITIC["critic update:<br/>least squares over buffer (M=500)<br/>w ← argmin_w J_k^c"]
    end
    DYN --> MPC
    DYN --> RQL
    CRITIC -.->|Q-function weights w| JR
    MPC -->|constraints: contact schedule,<br/>friction cone μ=0.3| OUT["apply first action u*_1"]
    RQL --> OUT

System dynamics

With the Unitree A1 as the baseline environment, the robot is modeled as a single rigid body acted on by forces at its contact points (the legs are about 10% of the total mass and are neglected). The rigid-body dynamics in world coordinates are

\ddot p = \sum_{i=1}^{4}\frac{f_i}{m} - g, \qquad \frac{d}{dt}(\mathcal I\omega) = \sum_{i=1}^{4} r_i \times f_i, \qquad \dot R = \omega \times R

Orientation is represented by the Euler angles \Theta=[\phi,\theta,\psi] (roll/pitch/yaw). The model is more precise than earlier work in that it keeps the R_y and R_x rotations and the full angular-velocity terms that earlier work neglected. With the state, action, and levers defined as

x := [p\ \ \Theta\ \ v\ \ \omega_B]^T, \quad u := [f_1\ f_2\ f_3\ f_4]^T, \quad \vartheta := [r_1\ r_2\ r_3\ r_4]^T

the dynamics reduce to \dot x = f(x, \vartheta, u). In other words, the ground reaction forces are the actions.
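To make \dot x = f(x, \vartheta, u) concrete, here is a minimal NumPy sketch of the single-rigid-body right-hand side. It is not the authors' code. It assumes a ZYX (yaw-pitch-roll) Euler convention for both the rotation matrix and the rate matrix J^{-1}, gravity along -z, levers and forces given as 4x3 arrays in world coordinates, and placeholder mass, inertia, and foot positions.

```python
import numpy as np

def rotation_zyx(euler):
    """Body-to-world rotation R = Rz(yaw) @ Ry(pitch) @ Rx(roll) (assumed convention)."""
    phi, theta, psi = euler
    c, s = np.cos, np.sin
    rz = np.array([[c(psi), -s(psi), 0.0], [s(psi), c(psi), 0.0], [0.0, 0.0, 1.0]])
    ry = np.array([[c(theta), 0.0, s(theta)], [0.0, 1.0, 0.0], [-s(theta), 0.0, c(theta)]])
    rx = np.array([[1.0, 0.0, 0.0], [0.0, c(phi), -s(phi)], [0.0, s(phi), c(phi)]])
    return rz @ ry @ rx

def euler_rate_matrix(euler):
    """J^{-1}: body-frame angular velocity -> Euler-angle rates (singular at pitch = +/-90 deg)."""
    phi, theta, _ = euler
    c, s, t = np.cos, np.sin, np.tan
    return np.array([
        [1.0, s(phi) * t(theta), c(phi) * t(theta)],
        [0.0, c(phi), -s(phi)],
        [0.0, s(phi) / c(theta), c(phi) / c(theta)],
    ])

def srb_dynamics(x, levers, forces, mass, inertia_body, g=9.81):
    """x = [p, Theta, v, omega_B]; returns dx/dt for the single rigid body."""
    theta, v, omega_b = x[3:6], x[6:9], x[9:12]
    rot = rotation_zyx(theta)
    torque_world = np.cross(levers, forces).sum(axis=0)        # sum_i r_i x f_i
    dv = forces.sum(axis=0) / mass - np.array([0.0, 0.0, g])   # sum_i f_i / m - g
    domega = np.linalg.solve(
        inertia_body,
        rot.T @ torque_world - np.cross(omega_b, inertia_body @ omega_b),
    )
    return np.concatenate([v, euler_rate_matrix(theta) @ omega_b, dv, domega])

mass = 12.0                                   # placeholder, not an A1 parameter
inertia_body = np.diag([0.1, 0.3, 0.4])       # placeholder inertia
levers = np.array([[0.2, 0.1, -0.3], [0.2, -0.1, -0.3],
                   [-0.2, 0.1, -0.3], [-0.2, -0.1, -0.3]])
forces = np.tile([0.0, 0.0, mass * 9.81 / 4], (4, 1))
x_dot = srb_dynamics(np.zeros(12), levers, forces, mass, inertia_body)
```

With symmetric feet each carrying a quarter of the weight, the torques cancel and the net force balances gravity, so `x_dot` is zero up to rounding: the standing pose is an equilibrium of this model.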

Model Predictive Control

The predictive controller minimizes the cost J_{MPC}.

\min_{\{u_{i|k}\}} J_{MPC} = \min \sum_{i=1}^{N} \gamma^{i-1} r(\hat x_{i|k}, x_{\text{des},i|k}, u_{i|k})

\text{s.t.}\quad \hat x_{0|k}=x_0,\quad \hat x_{i+1|k}=\Phi(\delta,\hat x_{i|k},\vartheta_{i|k},u_{i|k}),\quad C_{i|k}u_{i|k}=0,\quad Du_{i|k}\le 0

Two constraints are central.

  1. Contact schedule constraint: a leg in the swing phase exerts zero force; only legs in contact can produce force.
  2. Friction cone constraint: prevents slipping. For the friction coefficient \mu=0.3, -\mu f_z \le f_x \le \mu f_z and -\mu f_z \le f_y \le \mu f_z.

The state transition applies explicit Euler integration to the known dynamics: \Phi = \hat x_{i|k} + \delta f(\hat x_{i|k}, \vartheta_{i|k}, u_{i|k}). At every step the optimization is solved and only the first action u^*_{1|k} is applied (Algorithm 1).
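The cost J_{MPC} for one candidate action sequence can be evaluated by rolling the model forward with the explicit Euler step. The sketch below only evaluates the objective; a real controller would hand it to a constrained optimizer, which is omitted here. It assumes the first predicted state is the current state, and it uses a toy double integrator and a toy cost (both hypothetical) instead of the robot model.

```python
import numpy as np

def euler_step(f, x, lever, u, delta):
    """Phi(delta, x, lever, u) = x + delta * f(x, lever, u)."""
    return x + delta * f(x, lever, u)

def mpc_cost(f, x0, x_des_seq, lever_seq, u_seq, running_cost, gamma, delta):
    """Sum of gamma^(i-1) * r(x_hat_i, x_des_i, u_i) over the N-step horizon."""
    x_hat = np.asarray(x0, dtype=float)
    total = 0.0
    # enumerate starts at 0, so gamma**i here equals gamma^(i-1) for i = 1..N.
    for i, (x_des, lever, u) in enumerate(zip(x_des_seq, lever_seq, u_seq)):
        total += gamma**i * running_cost(x_hat, x_des, u)
        x_hat = euler_step(f, x_hat, lever, u, delta)
    return total

def toy_f(x, lever, u):
    """Toy double integrator: position' = velocity, velocity' = u (lever unused)."""
    return np.array([x[1], u[0]])

def toy_cost(x, x_des, u):
    e = x - x_des
    return float(e @ e + 0.1 * (u @ u))

x_des_seq = [np.zeros(2), np.zeros(2)]          # horizon N = 2
lever_seq = [None, None]
u_seq = [np.array([-1.0]), np.array([0.0])]
j_mpc = mpc_cost(toy_f, [1.0, 0.0], x_des_seq, lever_seq, u_seq,
                 toy_cost, gamma=0.5, delta=0.1)
```

For this toy input the two stage costs are 1.1 and 1.01, so the discounted sum is 1.1 + 0.5 * 1.01 = 1.605. In a receding-horizon loop, only the first element of the minimizing sequence would be applied before the problem is solved again.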

Roll-Out Q-Learning (core)

RQL uses value-iteration Q-learning. The action is chosen as u_k \leftarrow \arg\min_u \hat Q(x_k, x_{\text{des},k}, u; w_k), and the Q-function weights w are updated by least squares on a TD-style error over a buffer (size M=500).

w_k \leftarrow \arg\min_w J_k^c, \qquad J_k^c := \frac{1}{2}\sum_{i=k}^{k+M-1} e_i^2(w)

e_k(w) := \hat Q(x_k, x_{\text{des},k}, u_k; w) - r(x_k, x_{\text{des},k}, u_k) - \hat Q(x_{k+1}, x_{\text{des},k+1}, u_{k+1}; w_{\text{prev}})
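Because the Q-function model used in this work is linear in its weights (\hat Q = \sum_j w_j z_j^2), minimizing J_k^c over a buffer is an ordinary linear least-squares problem. The sketch below uses the unconstrained closed-form solve as an illustration; whether the authors use this solver, bounds on w, or a numerical optimizer is not stated here, so treat the solver choice and the function name `critic_update` as assumptions. The data are synthetic.

```python
import numpy as np

def critic_update(z_buf, r_buf, w_prev):
    """Minimise 1/2 * sum_i e_i(w)^2 with
    e_i(w) = Q_hat(z_i; w) - r_i - Q_hat(z_{i+1}; w_prev),  Q_hat(z; w) = sum_j w_j z_j^2.
    """
    phi = z_buf[:-1] ** 2                        # features of (x_i, u_i), shape (M, n)
    phi_next = z_buf[1:] ** 2                    # features of (x_{i+1}, u_{i+1})
    target = r_buf[:-1] + phi_next @ w_prev      # r_i + Q_hat at the next step, old weights
    w, *_ = np.linalg.lstsq(phi, target, rcond=None)
    return w

# Synthetic buffer: M = 500 transitions need 501 consecutive samples.
rng = np.random.default_rng(0)
z_buf = rng.normal(size=(501, 3))
w_prev = np.array([1.0, 2.0, 3.0])
w_true = np.array([0.5, 1.5, 2.5])
# Choose costs so that w_true makes every TD error exactly zero.
r_buf = np.zeros(501)
r_buf[:-1] = (z_buf[:-1] ** 2) @ w_true - (z_buf[1:] ** 2) @ w_prev

w_new = critic_update(z_buf, r_buf, w_prev)
```

Since the synthetic costs were built so that `w_true` zeroes every error, the least-squares solution recovers `w_true`. Note that the next-step term uses the previous weights w_prev, so it enters only the target, not the regression matrix.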

The actor update is where RQL decisively differs from MPC.

\min_{\{u_{i|k}\}} J_{RQL}^a = \min \underbrace{\sum_{i=1}^{N-1} \gamma^{i-1} r(\hat x_{i|k}, x_{\text{des},i|k}, u_{i|k})}_{\text{running cost over the short horizon } N-1} + \underbrace{\hat Q(\hat x_{N|k}, x_{\text{des},N|k}, u_{N|k}; w_k)}_{\text{learned tail/terminal cost}}

In short, RQL = "a short MPC cost + the learned Q-function as terminal cost". The remaining constraints and structure are the same as in MPC. Because the Q-function approximates what lies beyond the end of the horizon, a short N gets the effect of a long horizon (Algorithm 2).
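The RQL objective differs from the MPC objective only in its last term. A minimal sketch, assuming the first predicted state is the current state and using a toy double integrator, toy running cost, and toy diagonal Q-function (all hypothetical stand-ins, not the robot model or learned weights):

```python
import numpy as np

def rql_cost(f, x0, x_des_seq, lever_seq, u_seq, running_cost, q_hat, gamma, delta):
    """Running cost over the first N-1 steps plus Q_hat at step N."""
    x_hat = np.asarray(x0, dtype=float)
    total = 0.0
    n = len(u_seq)
    for i in range(n - 1):
        total += gamma**i * running_cost(x_hat, x_des_seq[i], u_seq[i])
        x_hat = x_hat + delta * f(x_hat, lever_seq[i], u_seq[i])   # explicit Euler step
    # The learned Q-function replaces the last running-cost term (not discounted here,
    # following the formula above).
    return total + q_hat(x_hat, x_des_seq[-1], u_seq[-1])

def toy_f(x, lever, u):
    """Toy double integrator (lever unused)."""
    return np.array([x[1], u[0]])

def toy_cost(x, x_des, u):
    e = x - x_des
    return float(e @ e + 0.1 * (u @ u))

w = np.array([2.0, 2.0, 1.0])   # hypothetical critic weights

def toy_q(x, x_des, u):
    z = np.concatenate([x - x_des, u])
    return float(z @ (w * z))    # z^T diag(w) z

x_des_seq = [np.zeros(2), np.zeros(2)]          # horizon N = 2
u_seq = [np.array([-1.0]), np.array([0.0])]
j_rql = rql_cost(toy_f, [1.0, 0.0], x_des_seq, [None, None], u_seq,
                 toy_cost, toy_q, gamma=0.5, delta=0.1)
```

Here the single running-cost term is 1.1, the predicted terminal state is [1.0, -0.1], and the Q-term is 2 * 1.0 + 2 * 0.01 = 2.02, giving 3.12. With N = 2 the Q-term is most of the objective, which is consistent with the observation that the Q-function dominates at short horizons.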

Q-function and running cost design

The running cost is a quadratic form.

r(x, x_{\text{des}}, u) := e_x^T P_x e_x + e_u^T P_u e_u

Here e_x = x - x_{\text{des}}, e_u = u - u_{\text{des}}, and u_{\text{des}} = [\tfrac{mg}{4}\ \tfrac{mg}{4}\ \tfrac{mg}{4}\ \tfrac{mg}{4}]^T (the reference reaction forces that are enough to hold a standing posture). Standing exactly at the reference therefore gives r=0. For computational efficiency the Q-function is a simple model, linear in its weights and quadratic in the error.

\hat Q(x_k, x_{\text{des},k}, u_k; w) := z_k^T A z_k, \qquad z_k := \begin{bmatrix} x_k - x_{\text{des},k} \\ \sum_{i=1}^{4} f_{i,k} - mg \end{bmatrix}

A is a diagonal weight matrix. The model is built so that Q is zero when the robot stands at the target position with the "ideal force" (\sum f = mg), in the expectation that even a simple model pays off as a terminal cost.
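Expanding the quadratic form makes clear why the model can be described as both linear and quadratic. With A diagonal, and writing n for the dimension of z_k (a symbol introduced here only for this expansion):

```latex
\hat Q(x_k, x_{\text{des},k}, u_k; w)
  = z_k^T A z_k
  = \sum_{j=1}^{n} w_j \, z_{k,j}^2,
\qquad A = \operatorname{diag}(w_1, \dots, w_n)
```

The value is quadratic in the error vector z_k but linear in the weights w, so fitting w is a linear regression on the squared components of z_k, and z_k = 0 gives \hat Q = 0 for any w.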

Experimental setup

The setup is built on three frameworks: rcognita (a Python package for hybrid simulation of RL agents), ROS, and Quad-SDK (a planner and simulator for the A1). ROS connects the controller (rcognita) to Quad-SDK.

Experiments

MPC and RQL are compared at short and long prediction horizons (Fig. 3).

Short horizon (N=2)

  • x-axis position error: minimal for both methods, because the P_x matrix puts a high weight on the x-axis error and so gives it top priority.
  • Orientation error: MPC shows a large orientation error at the short horizon, whereas RQL reduces the roll error by about a factor of 10. The Q-function plays a dominant role in the cost function, which is why RQL outperforms MPC.
  • Running cost: the mean running cost of RQL is about 3 times lower than that of MPC.

Long horizon (N=5)

As the horizon grows, the two controllers perform almost identically. At long horizons the Q-function carries less weight relative to the running-cost terms that precede it in the cost function.

Horizon length vs accumulated cost (Fig. 4)

  • For MPC the accumulated running cost rises sharply as the horizon shortens (about 5\times10^6 at N=2).
  • For RQL it stays comparatively low and flat regardless of the horizon.
  • Worth noting: short-horizon RQL (N=2) has a lower accumulated cost than long-horizon MPC (N=5). As the horizon grows, the Q-function matters less and the two methods converge.

Critical discussion

Strengths

  • Clear problem statement and solution. The construction "MPC's short-horizon limitation → a learned Q-function as terminal cost" is clean. Sidestepping a computational load that grows exponentially by using a short horizon plus a learned terminal cost is a practical idea.
  • No pretraining, live operation. The controller updates its critic online without the expensive pretraining of pure RL, which is an advantage for deployment on robots.
  • Quantitative advantage. The roughly 10x improvement in roll error and 3x improvement in running cost at the short horizon, together with the "short RQL beats long MPC" result, clearly show the value of the stacked approach.
  • Reproducible setup. Building the A1 simulation from rcognita, ROS, and Quad-SDK makes the comparison transparent.

Weaknesses and limitations

  • Inherent limits of the linear Q-function (acknowledged by the authors). The experiments rely on a simple Q-function that is linear in its weights (a quadratic form). Capturing the strong nonlinearity of the system requires a nonlinear Q-function, and only then can further gains be expected at long horizons. This is also why the current gains are confined to short horizons.
  • Simulation only. The evaluation stays within the Quad-SDK simulation; there is no validation on real A1 hardware and no analysis of the sim-to-real gap (outside the scope of the paper).
  • Single robot, limited scenarios. With one robot (the A1) and fairly standardized gait scenarios, generalization to rough terrain, disturbances, and varied gaits needs further verification (speculation).
  • A single MPC baseline. A comparison with other MPC+RL variants, such as learned horizon length, would have made the advantage clearer.

Summary and conclusion

The paper combines MPC with predictive RL (Roll-Out Q-Learning) for stable quadruped locomotion. The core idea is to attach a neural-network Q-function as a tail (terminal) cost to the end of a short-horizon MPC cost. This avoids the computational load that grows exponentially with the horizon and achieves stable locomotion even at short horizons where nominal MPC fails.

In numbers: at the short horizon (N=2), RQL reduces the roll error by about a factor of 10 and the mean running cost by about a factor of 3, and short-horizon RQL has a lower accumulated cost than long-horizon MPC. As the horizon grows, the Q-function carries less weight and the two methods converge.

From a practical standpoint, the value of this work is that it shows a learned terminal cost can make up for MPC's short-horizon limitation and give stable locomotion live, without pretraining. The limitations (a linear Q-function, simulation only) are clear, but the MPC + learned tail cost (RQL) hybrid is a promising direction for balancing online control capability against computational complexity.

Copyright 2026, JungYeon Lee