

📃DiffMimic

differentiable-physics
motion-mimicking
character-animation
rl
DiffMimic: Efficient Motion Mimicking with Differentiable Physics
Published

April 15, 2026

  • Paper Link (arXiv:2304.03274)
  • Code
  • Demo
  1. 🚀 To address the inefficiency of existing RL-based motion mimicking, the paper proposes DiffMimic, which uses Differentiable Physics Simulators (DPS) to recast a complex policy learning problem as a simple state matching problem.
  2. 💡 DiffMimic optimizes the policy directly with the analytical gradients of the DPS, which gives much faster and more stable convergence than RL-based methods. It also introduces a Demonstration Replay mechanism to avoid local optima and to stabilize gradient propagation over long horizons.
  3. ⏱️ Extensive experiments show better sample and time efficiency than existing methods such as DeepMimic. In particular, a difficult motion such as Backflip is learned in just 10 minutes and can be cycled after 3 hours of training.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

DiffMimic proposes an efficient method that uses Differentiable Physics Simulators (DPS) for motion mimicking, a core task in physics-based character animation. Existing motion mimicking methods are mostly based on Reinforcement Learning (RL) and therefore suffer from heavy reward engineering, high variance, slow convergence, and hard exploration. In particular, imitating even a simple motion sequence takes tens of hours or days of training, which limits scalability.

To address these problems, DiffMimic reformulates motion mimicking as a much simpler state matching problem rather than a complex policy learning problem. The key idea is to train the policy stably with the analytical gradients provided by the DPS, which carry ground-truth physical priors. This enables much faster and more stable convergence than RL-based methods.

Key methodology:

  1. ํ™˜๊ฒฝ ์„ค์ •:
    • Brax ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ™˜๊ฒฝ์„ ๊ตฌ์ถ•ํ•˜๋ฉฐ, DeepMimic์˜ ์„ค๊ณ„๋ฅผ ๋”ฐ๋ฅด๋Š” 13๊ฐœ์˜ ๋งํฌ์™€ 34๊ฐœ์˜ ์ž์œ ๋„(degrees of freedom)๋ฅผ ๊ฐ€์ง„ ํœด๋จธ๋…ธ์ด๋“œ(humanoid) ์บ๋ฆญํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • ์ƒํƒœ(state) s๋Š” ๋ชจ๋“  ๋งํฌ์˜ ์ „์—ญ ์œ„์น˜ p, ํšŒ์ „ q, ์„ ํ˜• ์†๋„ \dot{p}, ๊ฐ์†๋„ \dot{q}, ๊ทธ๋ฆฌ๊ณ  ํƒ€์ž„์Šคํƒฌํ”„ ์—ญํ• ์„ ํ•˜๋Š” ์œ„์ƒ ๋ณ€์ˆ˜(phase variable) \varphi๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค: s := \{p, q, \dot{p}, \dot{q}, \varphi\}.
    • PD ์ปจํŠธ๋กค๋Ÿฌ(PD controller)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์บ๋ฆญํ„ฐ๋ฅผ ๊ตฌ๋™ํ•˜๋ฉฐ, ์ •์ฑ… ๋„คํŠธ์›Œํฌ๋Š” ๊ฐ ์กฐ์ธํŠธ์˜ ๋ชฉํ‘œ ๊ฐ๋„๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
  2. ๋ชจ์…˜ ๋ฏธ๋ฏนํ‚น์„ ์œ„ํ•œ ๋ฏธ๋ถ„ ๊ฐ€๋Šฅํ•œ ๋ฌผ๋ฆฌ ํ™œ์šฉ:
    • DiffMimic์€ ์ •์ฑ… ๋กค์•„์›ƒ(policy rollout)๊ณผ ์ฐธ์กฐ ๋ชจ์…˜(reference motion) ๊ฐ„์˜ ๊ฑฐ๋ฆฌ๋ฅผ ์ง์ ‘ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค.
    • ์†์‹ค ํ•จ์ˆ˜ L์€ ๋กค์•„์›ƒ ๊ถค์ (s_t)๊ณผ ์ฐธ์กฐ ๊ถค์ (\hat{s}_t) ๊ฐ„์˜ ๋‹จ๊ณ„๋ณ„(step-wise) L_2 ๊ฑฐ๋ฆฌ์˜ ํ•ฉ์œผ๋กœ ์ •์˜๋ฉ๋‹ˆ๋‹ค: L = \sum_{t=1}^T \|s_t - \hat{s}_t\|^2_2
    • ์—ฌ๊ธฐ์„œ \|s_t - \hat{s}_t\|^2_2๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ „์—ญ ์œ„์น˜, ํšŒ์ „(6D ํšŒ์ „ ํ‘œํ˜„ ์‚ฌ์šฉ), ์„ ํ˜• ์†๋„, ๊ฐ์†๋„์— ๋Œ€ํ•œ ๊ฐ€์ค‘ ํ•ฉ์œผ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค: \|s_t - \hat{s}_t\|^2_2 = \frac{1}{\|J\|}\sum_{j \in J} w_p(p_j - \hat{p}_j)^2 + w_r(q_j - \hat{q}_j)^2 + w_v(\dot{p}_j - \hat{\dot{p}}_j)^2 + w_a(\dot{q}_j - \hat{\dot{q}}_j)^2 p_j, \hat{p}_j๋Š” J๋ฒˆ์งธ ์กฐ์ธํŠธ์˜ ์ „์—ญ ์œ„์น˜, q_j, \hat{q}_j๋Š” ์ „์—ญ ํšŒ์ „, \dot{p}_j, \hat{\dot{p}}_j๋Š” ์„ ํ˜• ์†๋„, \dot{q}_j, \hat{\dot{q}}_j๋Š” ๊ฐ์†๋„์ž…๋‹ˆ๋‹ค. w_p, w_r, w_v, w_a๋Š” ๊ฐ€์ค‘์น˜์ž…๋‹ˆ๋‹ค.
    • DPS๋Š” ๋™์  ์‹œ์Šคํ…œ์˜ ์ „์ด ํ•จ์ˆ˜(transition function) T ์—ญํ• ์„ ํ•˜๋ฉฐ, s_{t+1} = T(s_t, a_t)์™€ ๊ฐ™์ด ๋‹ค์Œ ์ƒํƒœ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. DPS๊ฐ€ ์™„์ „ ๋ฏธ๋ถ„ ๊ฐ€๋Šฅํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ์†์‹ค ํ•จ์ˆ˜์—์„œ ํ–‰๋™ a_t์™€ ์ƒํƒœ s_t์— ๋Œ€ํ•œ ๊ธฐ์šธ๊ธฐ๋ฅผ ์ง์ ‘ ์œ ๋„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค: \frac{\partial L}{\partial a_t} = \left(\frac{\partial L}{\partial T(s_t, a_t)}\right) \left(\frac{\partial T(s_t, a_t)}{\partial a_t}\right) \frac{\partial L}{\partial s_t} = \left(\frac{\partial L}{\partial T(s_t, a_t)}\right) \left(\frac{\partial T(s_t, a_t)}{\partial s_t}\right) ์ด ๊ธฐ์šธ๊ธฐ๋Š” ์ „์ฒด ๊ถค์ ์— ๊ฑธ์ณ ์žฌ๊ท€์ ์œผ๋กœ ์—ญ์ „ํŒŒ(backpropagated)๋˜์–ด ์ •์ฑ…์„ ์ตœ์ ํ™”ํ•ฉ๋‹ˆ๋‹ค.
  3. ๋ฐ๋ชจ ์žฌํ˜„ (Demonstration Replay) ๋ฉ”์ปค๋‹ˆ์ฆ˜:
    • DPS๋ฅผ ์‚ฌ์šฉํ•œ ์ •์ฑ… ํ•™์Šต์€ ์žฅ๊ธฐ๊ฐ„ ๊ถค์ ์—์„œ ๊ธฐ์šธ๊ธฐ ํญ์ฃผ/์†Œ์‹ค(exploding/vanishing gradients) ๋ฌธ์ œ, ์ง€์—ญ ์ตœ์ ์ (local optima)์— ๊ฐ‡ํžˆ๋Š” ๋ฌธ์ œ, ๊ทธ๋ฆฌ๊ณ  ์ ‘์ด‰์ด ํ’๋ถ€ํ•œ(contact-rich) ํ™˜๊ฒฝ์—์„œ ๋…ธ์ด์ฆˆ๊ฐ€ ๋งŽ๊ฑฐ๋‚˜ ์ž˜๋ชป๋œ ๊ธฐ์šธ๊ธฐ ๋ฌธ์ œ์— ์ง๋ฉดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    • ์ด๋Ÿฌํ•œ ๋ฌธ์ œ์™€ ๋กค์•„์›ƒ ๊ถค์ ์ด ์ฐธ์กฐ ๊ถค์ ์—์„œ ๋ฒ—์–ด๋‚˜๋Š” ๋ถ„ํฌ ๋ณ€ํ™”(distributional shift)๋ฅผ ์™„ํ™”ํ•˜๊ธฐ ์œ„ํ•ด Demonstration Replay๊ฐ€ ๋„์ž…๋ฉ๋‹ˆ๋‹ค.
    • Demonstration Replay๋Š” ์‹œ๋ฎฌ๋ ˆ์ด์…˜๋œ ์ƒํƒœ(s_t)์™€ ์ฐธ์กฐ ์ƒํƒœ(\hat{s}_t) ๊ฐ„์˜ ํฌ์ฆˆ ์˜ค๋ฅ˜(pose error)๊ฐ€ ํŠน์ • ์ž„๊ณ„๊ฐ’ \epsilon์„ ์ดˆ๊ณผํ•  ๊ฒฝ์šฐ, ํ˜„์žฌ ์‹œ๋ฎฌ๋ ˆ์ด์…˜๋œ ์ƒํƒœ๋ฅผ ํ•ด๋‹น ์ฐธ์กฐ ์ƒํƒœ๋กœ ๋Œ€์ฒดํ•˜์—ฌ ๋กค์•„์›ƒ์„ ์•ˆ๋‚ดํ•ฉ๋‹ˆ๋‹ค: s_{t+1} = \begin{cases} T(s_t, a_t), \quad a_t \sim \pi_\theta(a|s_t) & \text{if } \|s_t - \hat{s}_t\|^2_2 < \epsilon \\ T(\hat{s}_t, a_t), \quad a_t \sim \pi_\theta(a|\hat{s}_t) & \text{otherwise} \end{cases} ์ด ๋ฉ”์ปค๋‹ˆ์ฆ˜์€ ์ •์ฑ…์˜ ํ•™์Šต์„ ์•ˆ์ •ํ™”ํ•˜๊ณ  ๋” ๋ถ€๋“œ๋Ÿฌ์šด ๊ธฐ์šธ๊ธฐ ์ถ”์ •(smoother gradient estimation)์„ ์ œ๊ณตํ•˜์—ฌ ์ง€์—ญ ์ตœ์ ์ ์—์„œ ๋ฒ—์–ด๋‚˜ ๋” ์ถฉ์‹คํ•˜๊ฒŒ ์ฐธ์กฐ ๋ชจ์…˜์„ ๋ชจ๋ฐฉํ•˜๋„๋ก ๋•์Šต๋‹ˆ๋‹ค.

Experimental results:

DiffMimic shows better sample efficiency and time efficiency than existing RL-based methods such as DeepMimic, AMP, and Spacetime Bound. In particular, it learns the challenging Backflip motion in just 10 minutes and can perform it repeatedly after 3 hours of training, whereas existing methods need about a day before they can perform Backflip cyclically. Qualitative and quantitative analyses also show that Demonstration Replay stabilizes policy learning and improves performance, and that the Demonstration Replay (Threshold) variant in particular reproduces the demonstration with higher fidelity.

Ultimately, DiffMimic offers a new starting point for motion mimicking with DPS, and it is hoped that the approach can later be applied to more complex differentiable animation systems such as differentiable clothes simulation.


🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Introduction

Motion mimicking is the task of finding a policy that generates control signals to reproduce a demonstrated motion trajectory. It is a foundation of physics-based character animation and a prerequisite for applications such as control stylization and skill composition. Despite considerable recent progress, most existing methods adopt RL and alternate between learning a reward function and a control policy. This approach has two chronic problems.

  1. Scalability: imitating even a single motion takes tens of hours to days.
  2. Dependence on reward design: performance depends heavily on the quality of a carefully designed or learned reward function, which makes it hard to generalize to complex real-world applications.

Meanwhile, differentiable physics simulators (DPS) have produced impressive results in robot control and graphics. A DPS treats physics operators as a differentiable computation graph and propagates gradients directly from the objective (that is, the reward) to the control policy. There is no need to alternate between learning a reward function and a policy, so control policy learning can be solved as direct and efficient optimization.

DPS is not a cure-all, however. Even with analytical environment gradients, optimization easily falls into local optima, especially in contact-rich physical systems that produce stiff and discontinuous gradients. Over long trajectories, the numerical gradients can also vanish or explode along the backpropagation path.

DiffMimic in one sentence: it reformulates motion mimicking as a state matching problem, propagates the gradient of the trajectory distance directly to the policy through the differentiable dynamics of the DPS so that first-order gradients greatly improve sample efficiency, and uses Demonstration Replay to stabilize the long-horizon and local-minima problems. DiffMimic is the first work to apply DPS to motion mimicking, and it releases its simulator as a standard benchmark.

Method

flowchart LR
    REF["Reference Trajectory<br/>ŝ_0 → ŝ_1 → ... → ŝ_T"]
    subgraph ROLL["Learner Rollout (Brax, differentiable)"]
        S0["s_0"] -->|a_0~π_θ| S1["s_1"]
        S1 -->|a_1| S2["s_2"]
        S2 -.->|Demo Replay:<br/>replace with ŝ if error is large| S3["ŝ_3"]
        S3 -->|a_3| S4["s_4"]
    end
    REF -->|step-wise L2| LOSS["L = Σ ‖s_t − ŝ_t‖²"]
    ROLL --> LOSS
    LOSS -->|analytical gradient<br/>∇_θ L (BPTT)| POLICY["Update policy π_θ"]
    POLICY -.-> ROLL

Imitation environment in a differentiable physics engine

The environment is built with Brax. The character is a humanoid designed after DeepMimic, with 13 links, 34 degrees of freedom, a mass of 45 kg, and a height of 1.62 m. Every link can make contact with the floor, simulation is accelerated through GPU parallelization, and the simulator updates at 480 FPS. Joint limits are relaxed for smoother gradient propagation, while parameters such as the friction coefficient are kept the same as in DeepMimic.

State and action. The state holds the position p, rotation q, linear velocity \dot p, and angular velocity \dot q of every link in local coordinates, plus a phase variable \phi \in [0,1] that serves as a timestamp: s := \{p, q, \dot p, \dot q, \phi\}. The character is driven by a PD controller; for a target angle \tilde q, the torque is

\tau = k_p(\tilde q - q) + k_d(\dot{\tilde q} - \dot q)

The policy network predicts the PD target angle of each joint at 30 FPS (k_p and k_d are the same as in DeepMimic).
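As a minimal sketch of the PD rule above, the following plain-Python snippet evaluates the torque per joint and works out how many simulator steps elapse per policy action, given the 480 FPS and 30 FPS rates quoted in the text. The gains and joint readings are hypothetical values chosen for illustration, and `pd_torque` is an illustrative helper, not a Brax function.

```python
def pd_torque(kp, kd, q_target, q, qd_target, qd):
    """Per-joint PD torque: tau = kp * (q_target - q) + kd * (qd_target - qd)."""
    return [
        kp_j * (qt_j - q_j) + kd_j * (qdt_j - qd_j)
        for kp_j, kd_j, qt_j, q_j, qdt_j, qd_j in zip(kp, kd, q_target, q, qd_target, qd)
    ]


SIM_FPS = 480     # simulator update rate quoted in the text
POLICY_FPS = 30   # rate at which the policy emits PD target angles
SUBSTEPS = SIM_FPS // POLICY_FPS  # simulator steps that elapse per policy action

# Hypothetical gains and joint readings for two joints (illustrative only).
tau = pd_torque(
    kp=[100.0, 50.0], kd=[10.0, 5.0],
    q_target=[0.5, -0.2], q=[0.3, 0.0],
    qd_target=[0.0, 0.0], qd=[1.0, -2.0],
)
print(SUBSTEPS, tau)
```

The first joint is behind its target and moving, so the proportional and damping terms partly cancel; the second joint's two terms cancel exactly.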

Motion mimicking with differentiable physics

Motion mimicking ultimately means matching the policy rollout to the reference motion. The goal itself is simple, but designing a reward that elicits "walk" or "do a backflip" is hard. DiffMimic's insight is that the task becomes surprisingly easy with analytical gradients.

In each iteration, the state is initialized to the first reference state, the policy is rolled out in parallel environments up to the maximum episode length, and the step-wise L2 distance between the rollout trajectory and the reference trajectory is computed.

\mathcal{L} = \sum_{t=1}^{T} \lVert s_t - \hat s_t \rVert_2^2

\lVert s_t - \hat s_t \rVert_2^2 \triangleq \frac{1}{|J|}\sum_{j\in J} \left[ w_p(p_j - \hat p_j)^2 + w_r(q_j - \hat q_j)^2 + w_v(\dot p_j - \hat{\dot p}_j)^2 + w_a(\dot q_j - \hat{\dot q}_j)^2 \right]

This is a weighted sum of the position, rotation, linear velocity, and angular velocity errors (rotations use the 6D representation, which is better suited to gradient-based optimization than quaternions). The weights w_p, w_r, w_v, w_a only need rough tuning so that the terms have comparable magnitudes, in contrast to the elaborate reward design that RL requires.
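The loss can be sketched in a few lines of plain Python. This is only an illustration of the weighted, joint-averaged sum described above: the dictionary layout, the key names (`p`, `q`, `p_dot`, `q_dot`), and the weight values are assumptions made for the example, not the paper's data structures.

```python
def sq_err(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def step_loss(state, ref, w_p=1.0, w_r=1.0, w_v=1.0, w_a=1.0):
    """Weighted state-matching error for one time step, averaged over joints.

    state and ref map "p", "q", "p_dot", "q_dot" to lists of per-joint vectors.
    """
    n_joints = len(state["p"])
    total = 0.0
    for j in range(n_joints):
        total += w_p * sq_err(state["p"][j], ref["p"][j])
        total += w_r * sq_err(state["q"][j], ref["q"][j])
        total += w_v * sq_err(state["p_dot"][j], ref["p_dot"][j])
        total += w_a * sq_err(state["q_dot"][j], ref["q_dot"][j])
    return total / n_joints


def trajectory_loss(states, refs, **weights):
    """Sum of per-step losses over a rollout and its reference."""
    return sum(step_loss(s, r, **weights) for s, r in zip(states, refs))


# Two joints; only the second joint's position is off, by 1 m along x.
ref = {"p": [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
       "q": [[1.0, 0.0, 0.0, 0.0, 1.0, 0.0]] * 2,   # 6D rotation vectors
       "p_dot": [[0.0] * 3] * 2,
       "q_dot": [[0.0] * 3] * 2}
state = dict(ref, p=[[0.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
print(step_loss(state, ref, w_p=2.0))
```

With a position weight of 2 and a single 1 m error spread over two joints, the step loss is 2 × 1 / 2. The other three terms vanish because those quantities match the reference.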

The DPS plays the role of the transition function \mathcal T (s_{t+1} = \mathcal T(s_t, a_t)) and is fully differentiable, so gradients can be derived directly from the loss with respect to both the current action a_t and the state s_t.

\frac{\partial \mathcal L}{\partial a_t} = \left(\frac{\partial \mathcal L}{\partial \mathcal T(s_t, a_t)}\right)\left(\frac{\partial \mathcal T(s_t, a_t)}{\partial a_t}\right), \qquad \frac{\partial \mathcal L}{\partial s_t} = \left(\frac{\partial \mathcal L}{\partial \mathcal T(s_t, a_t)}\right)\left(\frac{\partial \mathcal T(s_t, a_t)}{\partial s_t}\right)

Applying this recursively propagates the gradient across the entire trajectory (BPTT). Unlike earlier approaches that rely on a learned world model, an off-the-shelf DPS encodes the actual physics of the system and therefore gives more reliable and interpretable gradients.
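To make the recursion concrete, here is a hand-written BPTT pass for a toy scalar system. The linear transition T(s, a) = A·s + B·a and its constants are stand-ins chosen for this example and are not Brax dynamics. In practice an autodiff framework would produce these gradients automatically; the sketch only shows how the two chain-rule factors are applied step by step.

```python
def rollout(s0, actions, A=0.9, B=0.1):
    """Toy differentiable transition T(s, a) = A*s + B*a applied step by step."""
    states = [s0]
    for a in actions:
        states.append(A * states[-1] + B * a)
    return states


def loss(states, ref):
    """L = sum over t = 1..T of (s_t - ref_t)^2."""
    return sum((s - r) ** 2 for s, r in zip(states[1:], ref[1:]))


def action_grads(states, ref, A=0.9, B=0.1):
    """Backpropagate through time: return dL/da_t for every t."""
    T = len(states) - 1
    grads = [0.0] * T
    carry = 0.0  # gradient arriving at s_{t+1} from all later steps
    for t in reversed(range(T)):
        dL_dnext = 2.0 * (states[t + 1] - ref[t + 1]) + carry  # dL/dT(s_t, a_t)
        grads[t] = dL_dnext * B   # times dT/da_t
        carry = dL_dnext * A      # times dT/ds_t, passed to the previous step
    return grads


ref = [0.0, 0.1, 0.2, 0.3]       # hypothetical reference trajectory
actions = [0.5, -0.3, 0.8]       # hypothetical open-loop actions
states = rollout(ref[0], actions)
print(action_grads(states, ref))
```

A quick way to check such a pass is to compare each entry against a central finite difference of `loss` with respect to the corresponding action.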

Demonstration Replay (key idea)

Policy learning with a DPS faces three well-known difficulties: (1) exploding or vanishing gradients over long horizons, (2) stagnation in local minima, and (3) noisy or incorrect gradients.

The high non-convexity of the motion mimicking task makes these worse. For example, when learning Backflip, the policy tends to settle into an easy posture that props the body up with the arms instead of exploring the more dynamic motion of flipping through the air. Meanwhile, naively truncating BPTT to a short window (for example, 10-step truncation) introduces discontinuities into the trajectory and leads to an even worse local optimum, because the actions are strongly interdependent (for instance, how to rotate while airborne).

Conventional teacher forcing (Williams & Zipser 1989) randomly replaces rollout states with reference states (with ratio \gamma, via a Bernoulli draw).

s_{t+1} = \begin{cases} \mathcal T(s_t, a_t), \ a_t \sim \pi_\theta(a|s_t) & \text{if } b=0,\ b\sim\text{Bernoulli}(\gamma) \\ \mathcal T(\hat s_t, a_t), \ a_t \sim \pi_\theta(a|\hat s_t) & \text{otherwise} \end{cases}

However, random replacement may improve the result globally but does not guarantee faithful imitation at every frame (some frames show awkward poses or large errors).

DiffMimic's Demonstration Replay (demonstration-guided exploration) replaces only the states that have drifted too far from the reference, as judged by a threshold \epsilon.

s_{t+1} = \begin{cases} \mathcal T(s_t, a_t), \ a_t \sim \pi_\theta(a|s_t) & \text{if } \lVert s_t - \hat s_t \rVert_2^2 < \epsilon \\ \mathcal T(\hat s_t, a_t), \ a_t \sim \pi_\theta(a|\hat s_t) & \text{otherwise} \end{cases}

Because the replacement criterion depends on how well the current rollout performs, the replacement frequency adjusts itself dynamically during training. Empirically, this gives smoother gradient estimates and greatly stabilizes policy learning.
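The two replacement rules can be compared side by side in a small rollout loop. The sketch below is an illustration under simplifying assumptions: the policy is deterministic, the state is a 1-D tuple, and `rollout_with_replay`, `policy`, and `transition` are hypothetical names rather than functions from the DiffMimic code.

```python
import random


def sq_dist(s, s_ref):
    """Squared Euclidean distance, standing in for the state-matching error."""
    return sum((x - y) ** 2 for x, y in zip(s, s_ref))


def rollout_with_replay(policy, transition, ref, mode="threshold",
                        eps=0.1, gamma=0.5, rng=None):
    """Roll out from ref[0]; return the visited states and per-step swap flags."""
    rng = rng or random.Random(0)
    s = ref[0]
    states, swapped = [s], []
    for t in range(len(ref) - 1):
        if mode == "threshold":
            swap = sq_dist(s, ref[t]) >= eps   # too far from the reference
        else:
            swap = rng.random() < gamma        # teacher forcing: b ~ Bernoulli(gamma)
        src = ref[t] if swap else s            # state the policy and simulator see
        s = transition(src, policy(src))
        states.append(s)
        swapped.append(swap)
    return states, swapped


# Toy 1-D system (hypothetical): an untrained policy outputs 0 and the state drifts.
ref = [(0.0,)] * 6
policy = lambda s: 0.0
transition = lambda s, a: (s[0] + a + 0.2,)
states, swapped = rollout_with_replay(policy, transition, ref, eps=0.1)
print(swapped)
```

In this toy run the drifting state is reset whenever its squared error reaches \epsilon, so swaps happen only where the rollout has actually diverged. A better policy would trigger fewer swaps, which is the self-adjusting behaviour described above. Because the reference state does not depend on the policy parameters, a swap also ends the backpropagation path at that step.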

Experiments

Experiments run on a single V100 GPU with an Intel Xeon E5-2680. The main metric is the average pose error (relative position error with respect to the root joint, in meters), with DTW applied to synchronize the rollout with the reference. The baselines are DeepMimic (RL with carefully engineered rewards), Spacetime Bound (hyperparameter search for DeepMimic), and AMP (Adversarial Motion Prior).
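For readers unfamiliar with DTW, the sketch below shows one way a DTW-aligned average pose error could be computed. Averaging over the length of the alignment path is an assumption of this example, and `frame_error` and `dtw_pose_error` are hypothetical helpers. The paper's exact normalization is not stated in this review.

```python
import math


def frame_error(pose_a, pose_b):
    """Mean Euclidean distance (m) between matching root-relative joint positions."""
    return sum(math.dist(a, b) for a, b in zip(pose_a, pose_b)) / len(pose_a)


def dtw_pose_error(rollout, reference):
    """Average frame error along the cheapest DTW alignment of two pose sequences."""
    n, m = len(rollout), len(reference)
    inf = float("inf")
    # best[i][j] = (accumulated error, path length) aligning the first i and j frames
    best = [[(inf, 0)] * (m + 1) for _ in range(n + 1)]
    best[0][0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_error(rollout[i - 1], reference[j - 1])
            prev = min(best[i - 1][j], best[i][j - 1], best[i - 1][j - 1])
            best[i][j] = (prev[0] + d, prev[1] + 1)
    total, steps = best[n][m]
    return total / steps


# Hypothetical two-joint poses: the root stays put while the second joint moves along x.
A = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
B = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
C = [(0.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
rollout = [A, B, C]
delayed_reference = [A, A, B, C]   # same motion, one frame late
print(dtw_pose_error(rollout, delayed_reference))
```

The point of the alignment is that a rollout performing the same motion slightly early or late is not penalized for timing alone, whereas a naive frame-by-frame comparison would report a non-zero error here.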

Sample efficiency: the power of analytical gradients (Table 2)

Number of samples needed before the policy can roll out for 20 seconds without falling (in units of 10^6, with the change relative to DeepMimic in parentheses):

| Motion | DeepMimic | Spacetime Bound | Ours |
|---|---|---|---|
| Back-Flip | 31.18 | 41.20 (+32.1%) | 14.88 (-52.2%) |
| Cartwheel | 30.45 | 17.35 (-43.0%) | 13.92 (-54.2%) |
| Walk | 23.80 | 4.08 (-79.5%) | 7.92 (-66.7%) |
| Run | 19.31 | 4.11 (-78.7%) | 8.16 (-57.7%) |
| Jump | 25.65 | 41.63 (+77.8%) | 5.28 (-79.4%) |
| Dance | 24.59 | 10.00 (-59.3%) | 16.56 (-32.6%) |

DiffMimic is consistently more sample-efficient than DeepMimic. The analytical gradients of the DPS allow the policy gradient to be computed from few samples, whereas RL needs large batches for a decent estimate. Spacetime Bound is unstable, needing more samples than DeepMimic even on a simple task such as Jump, while DiffMimic is stable and consistent across tasks.

Motion quality (Table 1)

In average pose error over 12 motions, DiffMimic consistently outperforms AMP and is on par with DeepMimic. Notably, DiffMimic sees only 4-second rollouts during training yet matches the performance of DeepMimic's 20-second cyclic rollouts, which demonstrates stable and faithful reproduction of the reference.

| Motion | DeepMimic | AMP | Ours |
|---|---|---|---|
| Back-Flip | 0.076 | 0.150 | 0.097 |
| Jump | 0.033 | 0.083 | 0.025 |
| Run | 0.028 | 0.056 | 0.039 |
| Side-Flip | 0.244 | 0.124 | 0.069 |
| Walk | 0.018 | 0.030 | 0.017 |

Time efficiency

Computing analytical gradients takes longer than estimating gradients, so sample counts alone would favor DiffMimic and a wall-clock comparison is the fair one. Compared with AMP, which uses GPU parallelization, DiffMimic reaches similar performance in half the training time. It learns Backflip in 10 minutes and can cycle it after 3 hours (14.88M samples).

Ablation: Truncation length

Propagating gradients through the full trajectory makes learning difficult because of its length, but simply truncating to 10 steps introduces discontinuities into the trajectory and produces even worse results (Fig. 7a-b). This is because the motion is strongly interdependent, and it shows the need for a better strategy, namely Demonstration Replay.

Ablation: Demonstration Replay

Three variants are compared: Full Horizon Gradient (no replacement), Demo Replay (Random) (teacher-forcing-style random replacement), and Demo Replay (Threshold) (error-based replacement).

  • Full Horizon Gradient falls into a local minimum and learns to bend over and prop itself up with its arms instead of doing a backflip (Fig. 6b). Both replay variants succeed at the airborne backflip.
  • Random vs Threshold: the average errors are similar, but Threshold has a lower per-frame maximum error and reproduces the reference more faithfully (Fig. 8). Random lowers the overall average yet leaves large errors in some frames. In other words, merely reducing the average pose error is not enough; fine-grained guidance based on the current policy's performance (the threshold) is needed.
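The gap between average and worst-case error is easy to see with made-up numbers. The two error sequences below are hypothetical and are not taken from Fig. 8. They are constructed to share the same mean while differing in their worst frame.

```python
def summarize(frame_errors):
    """Return (mean, max) of a list of per-frame pose errors."""
    return sum(frame_errors) / len(frame_errors), max(frame_errors)


# Hypothetical per-frame pose errors in meters (illustrative, not from the paper).
random_like = [0.02, 0.02, 0.02, 0.18, 0.02, 0.02]
threshold_like = [0.05, 0.04, 0.05, 0.05, 0.04, 0.05]

for name, errs in [("random", random_like), ("threshold", threshold_like)]:
    mean, worst = summarize(errs)
    print(f"{name:9s} mean={mean:.3f} max={worst:.3f}")
```

Both sequences average to the same value, but only the first contains a frame with a visibly broken pose. This is why the comparison above looks at the per-frame maximum as well as the mean.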

Critical discussion

Strengths

  • An elegant reformulation. Recasting "complex policy learning → simple state matching" and combining it with the differentiable dynamics of the DPS removes almost all of the reward design burden while dramatically improving sample and time efficiency. Learning Backflip in 10 minutes is a striking result.
  • A practical answer to the difficulties of DPS training. The core contribution is handling long-horizon gradient problems and local minima with dynamic Demonstration Replay rather than simple truncation. The analysis of Random vs Threshold in terms of per-frame error is also convincing.
  • Fair and multi-angle comparison. Sample efficiency (vs DeepMimic/Spacetime Bound) and time efficiency (vs AMP) are measured separately, and the method is validated broadly on 12 motions and 8 learning curves.
  • Benchmark release. Releasing a standard simulator for DPS-based motion mimicking lays the groundwork for follow-up research.

Weaknesses and limitations

  • Key limitation acknowledged by the authors: the evaluated tasks are relatively short and involve no interaction with other objects. Behavior in dynamical systems with multiple bodies and more complex contact remains open.
  • Dependence on the DPS and simulator. The success rests on Brax's differentiability and on simulator settings such as relaxed joint limits (for smoother gradients). Transfer to stiffer, more discontinuous real contact or to other engines needs further validation.
  • Quality is on par with DeepMimic. Efficiency is far ahead, but pose error itself is similar to DeepMimic or slightly worse on some tasks, so the result is closer to "similar quality, much faster" than to "more accurate".
  • Tuning the threshold \epsilon. The setting of \epsilon, which is central to Demonstration Replay, may differ from task to task, and the automatic adjustment applies only to the replacement frequency (speculation: sensitivity analysis of the threshold itself appears limited).

Summary and conclusion

DiffMimic solves motion mimicking for physics-based characters with state matching on differentiable physics (DPS), in place of RL with its reward design and low sample efficiency. The core is (1) propagating the analytical gradient of the trajectory distance directly to the policy through the DPS dynamics, and (2) stabilizing long-horizon and local-minima issues with Demonstration Replay, which inserts reference states based on the error.

In numbers: up to 79% fewer samples than DeepMimic, half the wall-clock time of AMP, and Backflip learned in 10 minutes and cycled after 3 hours (single V100), while consistently outperforming AMP on 12 motions and matching DeepMimic in quality. The paper also shows that the threshold variant of Demonstration Replay imitates more faithfully frame by frame than the random variant.

From a practical standpoint, the value of this work lies in being the first to show that "difficult motions can be learned in minutes from the analytical gradients of differentiable physics, without reward design." The limitations of short tasks and no object interaction remain, but the framework of state matching plus Demonstration Replay offers a promising starting point toward differentiable animation (for example, differentiable clothes simulation).

Copyright 2026, JungYeon Lee