📃 HandelBot Review

dexterous manipulation
sim2real
Real-World Piano Playing via Fast Adaptation of Dexterous Robot Policies
Published

March 22, 2026

  • Paper Link

  • Project Link

  • Amber Xie, Haozhi Qi, Dorsa Sadigh

  1. 🎹 HandelBot is the first learning-based system for bimanual piano playing, a task in which precise real-world dexterity is hard to achieve because of the sim-to-real gap.
  2. ✨ To adapt a simulation-trained policy quickly with real-world data, the system uses a two-stage pipeline: structured trajectory refinement first corrects spatial alignment, and residual reinforcement learning then learns fine-grained corrective actions.
  3. ✅ HandelBot achieves successful real-world execution on five diverse songs and, with less than 30 minutes of physical interaction data, performs 1.8x better than direct deployment of the simulation policy.

🔍 Ping Review

🔍 Ping: a light tap on the surface. Get the gist in seconds.

The HandelBot paper proposes a framework for the hard problem of dexterous real-world piano playing with multi-fingered robot hands. The work focuses on overcoming the failures caused by the sim-to-real gap when a policy trained in simulation ($\pi_{sim}$) is deployed directly in the real world on a task that demands millimeter-scale precision.

I. Introduction and Background

Earlier robot piano-playing systems relied on dedicated hardware and hand-engineered controllers. Recent learning-based approaches have achieved impressive dexterous piano playing in simulation with general-purpose robot hardware, but sim-to-real transfer to the real world remained unexplored. HandelBot closes this gap, focusing specifically on bimanual piano playing. The system combines strong pretraining in simulation with residual reinforcement learning in the real world to make complex two-handed piano playing possible.

II. HandelBot Core Methodology

HandelBot adapts a simulation-trained policy to real-world piano playing in two stages, policy refinement (B) and residual reinforcement learning (C), starting from a policy trained in simulation (A).

A. Reinforcement Learning in Simulation (RL in Simulation)

The first step is to learn the core piano-playing behavior in a simulated environment.

  • Reward Design: Following the design of RoboPianist[1], the reward consists of a key press reward for playing the target notes, a dense fingering reward for being near the correct keys, and an energy penalty. In the Appendix, the key press reward is modified to $0.7 \cdot \left(\frac{1}{K}\sum_{i} g(\lVert k^s_i - 1 \rVert^2)\right) + 0.3 \cdot (1 - \mathbf{1}_{\{\text{false positive}\}})$, reflecting the fact that pressing a wrong key is almost unavoidable in the real environment.
  • Observations and Actions: The observation space includes robot proprioception, the current piano activation, the target piano activation, and the active fingers. The action space is delta joint positions, that is, low-level control commands for the robot hand. For the Tesollo DG-5F hand in particular, the last joint angle is fixed at 1 radian, which shrinks the action space and encourages pressing keys with the fingertips. The wrist trajectory is scripted from the score; when several notes occur at the same time, it is aggregated using the mean Y position and the minimum X position.
  • Policy Learning: The policy $\pi_{sim}$ is trained with the PPO [68] algorithm, exploiting fast parallel rollouts and dense reward signals. This $\pi_{sim}$ performs strongly in simulation, but in the real world its performance degrades because of mismatches in the controller and in the key-press dynamics of the piano.
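
The modified key press reward above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tolerance function `g` is replaced by a hypothetical Gaussian-shaped stand-in, the 0.5 activation threshold for detecting a false positive is assumed, and the names `tolerance` and `key_press_reward` are not from the paper.

```python
import math

def tolerance(sq_err, margin=0.5):
    """Hypothetical stand-in for g: 1.0 at zero error, decaying smoothly."""
    return math.exp(-sq_err / (2.0 * margin ** 2))

def key_press_reward(key_states, target_keys, press_threshold=0.5):
    """key_states: activation in [0, 1] for every key.
    target_keys: set of key indices that should be pressed at this step."""
    if target_keys:
        # Mean tolerance over the K target keys, comparing each state to 1.
        press_term = sum(
            tolerance((key_states[i] - 1.0) ** 2) for i in target_keys
        ) / len(target_keys)
    else:
        press_term = 1.0  # assumption: nothing to press counts as satisfied
    # A false positive is any non-target key that is (assumed) pressed.
    false_positive = any(
        s > press_threshold
        for i, s in enumerate(key_states)
        if i not in target_keys
    )
    return 0.7 * press_term + 0.3 * (1.0 - float(false_positive))
```

With the 0.7 / 0.3 weighting, a step that hits every target key but also brushes a neighbor still earns most of the reward, which is the behavior the text describes for the real environment.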

B. Policy Refinement

Before running residual RL, a lightweight policy refinement procedure is applied in the real world. Its goal is to modify the initial open-loop trajectory $\tau^0 = (s^0_0, ..., s^0_T)$ obtained from the simulation-trained $\pi_{sim}$ and produce $\tau^* = (s^*_0, ..., s^*_T)$.

  • Lateral Joint Correction: Domain knowledge (key geometry, hand kinematics) is used to fix consistent lateral biases and contact misalignments.
    • $\pi_{sim}$ is executed open-loop on the real robot, and at each time step $t$ the procedure records (i) the target note and its assigned finger, and (ii) the set of keys actually pressed, $K^{\text{press}}_t$.
    • For each finger, the pressed key closest to the target, $k^{\text{press}}_t$, is identified. If $k^{\text{press}}_t$ differs from the target $k^{\text{target}}_t$, a signed directional error is computed: $\Delta_t = \begin{cases} +\delta & \text{if } k^{\text{press}}_t < k^{\text{target}}_t \\ -\delta & \text{if } k^{\text{press}}_t > k^{\text{target}}_t \\ 0 & \text{otherwise} \end{cases}$ where $\delta$ is the step size that controls how much the lateral finger joint is adjusted.
  • Iterative Updates: The correction is applied iteratively, alternating between executing the trajectory and updating it. $\delta$ starts at a large value and is annealed at every iteration to avoid oscillation and help smooth convergence. A small correction term such as $0.3\Delta_t$ is added to adjacent fingers to encourage spatial separation. When several keys are pressed, the active fingers on the left are assumed to press the lower-pitched keys and the active fingers on the right the higher-pitched keys.
  • Chunked Updates: Updates are made over temporal chunks of length $K$ rather than at every time step. For smoother motion, fingertip errors up to $t+K+L$ are taken into account, which promotes anticipatory spatial adjustments. $\Delta^{\text{chunk}}_t$ is computed as: $\Delta^{\text{chunk}}_t = \frac{1}{K+L} \sum_{j=t}^{t+K+L} \Delta_j$. At the end of this iterative process, the trajectory with the best F1 score is saved as the refined trajectory ($\tau^*$).
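
The refinement steps above (signed error, chunk averaging, annealed step size) can be sketched as follows. This is a minimal illustration, not the authors' code: it tracks a single lateral joint, assumes a positive joint offset moves the fingertip toward higher key indices, truncates the look-ahead window at the end of the trajectory, and uses a hypothetical geometric decay for annealing. All function names are made up for the example.

```python
def signed_error(pressed_key, target_key, delta):
    """Signed directional error for one finger at one time step."""
    if pressed_key is None or pressed_key == target_key:
        return 0.0
    return delta if pressed_key < target_key else -delta

def chunk_corrections(errors, K, L):
    """One correction per chunk start t = 0, K, 2K, ...
    Sums errors over the inclusive window [t, t+K+L] and divides by K+L,
    exactly as the formula in the text is written (the inclusive window
    holds K+L+1 steps). The window is cut at the trajectory end (assumption)."""
    return [
        sum(errors[t : t + K + L + 1]) / (K + L)
        for t in range(0, len(errors), K)
    ]

def refine_once(lateral_traj, pressed, targets, delta, K, L):
    """Apply one refinement pass to a single lateral joint trajectory."""
    errors = [signed_error(p, g, delta) for p, g in zip(pressed, targets)]
    corr = chunk_corrections(errors, K, L)
    # Every step inside a chunk receives that chunk's correction.
    return [q + corr[t // K] for t, q in enumerate(lateral_traj)]

def annealed_step_sizes(delta0, decay, n_iters):
    """Hypothetical schedule: start large, shrink geometrically."""
    return [delta0 * decay ** i for i in range(n_iters)]
```

In a full loop, each pass would execute the current trajectory on the robot, read the pressed keys back from MIDI, call `refine_once` with the next value from `annealed_step_sizes`, and keep the trajectory with the best F1 score.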

C. Real-World Residual Reinforcement Learning

A residual reinforcement learning framework is adopted to fine-tune the open-loop trajectory $s^*_0, ..., s^*_T$ obtained in the policy refinement stage.

  • Residual Policy Formulation: The residual policy $\pi_{res}$ outputs an additive correction to the base action: $\hat{s}_{t+1} = \pi_{res}(o_t) + s^*_{t+1}$, where $o_t$ is the real-world observation at time $t$ and $s^*_{t+1}$ is the next state of the open-loop trajectory. The output of $\pi_{res}$ is restricted to small perturbations, which allows safer exploration and faster learning.
  • Residual RL Objective: In the real world, only the key press reward signal derived from the piano's MIDI output is used (the same one used in simulation). $\pi_{res}$ is trained with reinforcement learning to maximize the expected reward under real-world dynamics.
  • Guided Noise: The TD3 [65] algorithm is used, and a noise term is added to the sampled action. Motivated by the lateral adjustments used in policy refinement, the direction of the noise $\epsilon \sim \mathcal{N}(0,1)$ is steered toward the correct lateral motion. With probability $\Pr(\text{guided noise}) = 0.5$, the sign of the noise on the relevant lateral joint is changed to match the sign of $\Delta_t$, producing $\hat{\epsilon}$. The final action is $a = \mu_\theta(o) + \text{clip}(\hat{\epsilon}, -0.5, 0.5)$. This is a lightweight heuristic that steers exploration toward pressing the correct key.
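
A minimal sketch of the guided-noise exploration and the residual composition described above, using only the Python standard library. The per-joint encoding of the sign of $\Delta_t$ (`+1`, `-1`, or `0` for joints with no lateral error) and all function names are assumptions made for illustration; a real TD3 implementation would apply this inside its action-sampling step.

```python
import math
import random

def guided_noise(eps, delta_sign, p_guided=0.5, rng=random):
    """With probability p_guided, flip each noise component so its sign
    matches delta_sign; components with delta_sign == 0 are left as sampled."""
    if rng.random() >= p_guided:
        return list(eps)
    return [
        math.copysign(abs(e), s) if s != 0 else e
        for e, s in zip(eps, delta_sign)
    ]

def explore_action(mu, delta_sign, p_guided=0.5, clip=0.5, rng=random):
    """a = mu(o) + clip(eps_hat, -clip, clip), with eps ~ N(0, 1) per joint."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    eps_hat = guided_noise(eps, delta_sign, p_guided, rng)
    return [m + max(-clip, min(clip, e)) for m, e in zip(mu, eps_hat)]

def residual_target(residual, s_star_next):
    """s_hat_{t+1} = pi_res(o_t) + s*_{t+1}, element-wise."""
    return [r + s for r, s in zip(residual, s_star_next)]
```

Passing `rng=random.Random(seed)` makes the sampling reproducible, which helps when comparing `p_guided` settings such as 0, 0.5, and 1.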

III. Experimental Results

HandelBot was evaluated on a bimanual robot system across five diverse songs (Twinkle Twinkle, Ode to Joy, Hot Cross Buns, Fur Elise, Prelude in C).

  • Hardware Setup: Tesollo DG-5F hands are mounted on a Franka Emika Panda arm and an FR3 arm. A MIDI keyboard detects which notes were pressed, and this signal is used to compute the reward.
  • Safety and Deployment: A safety layer is added using PyRoki [67]. Policy actions are generated at 10 Hz and then linearly interpolated to 80 Hz. The arms are controlled at 100 Hz with the Polymetis controller.
  • Main Results (Fig. 4): HandelBot consistently achieved the highest F1 score on every evaluated song. Methods that use only simulation data (for example, $\pi_{sim}$(CL) and $\pi_{sim}$) performed markedly worse because of the sim-to-real gap. Policy refinement is effective at aligning finger presses directly with the correct target keys, and residual RL improves performance substantially by resolving the remaining errors and adapting to the physical dynamics.
  • Importance of Residual RL (Table I, II): Training residual RL on top of a better-initialized trajectory leads to a higher F1 score (refined trajectory > $\pi_{sim}$ > no initialization). This supports the hypothesis that the refined policy shrinks the exploration space and leads to more stable and efficient training. A low RL discount factor $\gamma$ lowers the F1 score and makes the motion erratic. For guided noise, the default setting ($\Pr(\text{guided noise}) = 0.5$) performed similarly to $\Pr(\text{guided noise}) = 0$, but always sampling guided noise degraded performance, presumably because the biased finger exploration hinders learning from suboptimal data.
  • Closed-Loop Sim-to-Real (Table I): Hybrid execution runs the simulation environment in parallel with the real environment and uses the simulated observations, which mitigates the sim-to-real gap. Hybrid execution improved on direct transfer, but a large performance gap remained relative to HandelBot and the other methods that use real-world data.
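
The 10 Hz to 80 Hz linear interpolation mentioned under Safety and Deployment can be illustrated with a short sketch. It assumes each policy action is a list of joint targets and that 8 evenly spaced sub-steps are emitted per policy step; the function name `upsample_linear` is hypothetical and this is not the deployed controller code.

```python
def upsample_linear(waypoints, factor=8):
    """Linearly interpolate between consecutive joint-target waypoints.
    waypoints: list of equal-length lists, one per policy step (e.g. 10 Hz).
    factor: sub-steps per policy step (8 takes 10 Hz to 80 Hz)."""
    if not waypoints:
        return []
    out = []
    for a, b in zip(waypoints, waypoints[1:]):
        for k in range(factor):
            alpha = k / factor  # 0, 1/factor, ..., (factor-1)/factor
            out.append([x + alpha * (y - x) for x, y in zip(a, b)])
    out.append(list(waypoints[-1]))  # finish exactly on the last waypoint
    return out
```

For two waypoints one policy step apart, this yields `factor + 1` commands, with the first and last equal to the original targets.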

IV. Conclusion and Limitations

HandelBot addresses the extreme precision requirements of robot piano playing through reinforcement learning in simulation, policy refinement, and residual reinforcement learning. The method shows that a brittle, imperfect simulation policy can be turned into a much more robust piano-playing robot using as little as 30 minutes of real-world data.

Limitations:

  • HandelBot relies on scripted end-effector motion and a fixed orientation, which requires manual tuning each time. Residual RL over the end-effector motion could reduce this problem.
  • Because of these constraints, using the thumb and little finger is difficult, so the evaluation covered only relatively simple songs. Future work could explore rotation or learned motions for more complex pieces.
  • The policy refinement stage relies on human-guided heuristics. These suit piano playing but may be hard to apply directly to other tasks. Large models such as vision-language models might nevertheless make policy refinement possible for other tasks as well.

🔔 Ring Review

🔔 Ring: an idea that echoes. Grasp the core and its value.

  • In the structured refinement stage, lateral finger joints are adjusted based on physical rollouts to correct spatial misalignment
  • Residual RL autonomously learns fine-grained corrective actions
  • Bimanual piano playing, which demands millimeter-level precision, is demonstrated successfully across five songs
  • A practical approach is presented for overcoming the sim2real gap through fast adaptation

Copyright 2026, JungYeon Lee