Curieux.JY
  • JungYeon Lee

📃 HandelBot Review

dexterous manipulation
sim2real
Real-World Piano Playing via Fast Adaptation of Dexterous Robot Policies
Published

March 22, 2026

  • Paper Link
  • Project Link
  1. 🎹 HandelBot is the first learning-based system for bimanual piano playing, a task where the sim-to-real gap makes precise real-world dexterity hard to achieve.
  2. ✨ To adapt a simulation-trained policy quickly with real-world data, the system uses a two-stage pipeline: structured trajectory refinement first corrects spatial alignment, and residual reinforcement learning then learns fine corrective actions.
  3. ✅ HandelBot achieves successful real-world execution on five different songs and, with less than 30 minutes of physical interaction data, outperforms direct deployment of the simulation policy by 1.8×.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

The HandelBot paper proposes a framework for dexterous real-world piano playing with multi-fingered robot hands. It focuses on overcoming the failures caused by the sim-to-real gap when a simulation-trained policy ($\pi_{sim}$) is deployed directly in the real world on a task that demands millimeter-scale precision.

I. Introduction and Background

Earlier robot piano systems relied on dedicated hardware and hand-engineered controllers. Recent learning-based approaches have achieved impressive dexterous piano playing in simulation with general-purpose robot hardware, but sim-to-real transfer to the physical world remained unexplored. HandelBot fills this gap, with a particular focus on bimanual piano playing. The system combines strong pretraining in simulation with residual reinforcement learning in the real world to make complex two-handed playing possible.

II. HandelBot Core Methodology

HandelBot first trains a policy in simulation (A) and then adapts it to real-world piano playing in two stages: policy refinement (B) and residual reinforcement learning (C).

A. RL in Simulation

The first step is to learn the core piano-playing behavior in a simulated environment.

  • Reward Design: Following the design of RoboPianist[1], the reward consists of a key press reward for playing the target notes, a dense fingering reward for staying close to the correct keys, and an energy penalty. In the Appendix, the key press reward is modified to $0.7 \cdot \left(\frac{1}{K}\sum_{i} g(\|k^s_i - 1\|^2)\right) + 0.3 \cdot (1 - \mathbf{1}_{\{\text{false positive}\}})$, reflecting the fact that pressing a wrong key is almost unavoidable in the real environment.
  • Observations and Actions: The observation space includes robot proprioception, the current piano activation, the target piano activation, and the active fingers. The action space is delta joint positions, that is, low-level control commands for the robot hand. For the Tesollo DG-5F hand in particular, the last joint angle is fixed at 1 radian, which shrinks the action space and encourages the hand to press keys with its fingertips. The wrist trajectory is scripted from the score; when several notes occur at once, they are aggregated using the mean Y position and the minimum X position.
  • Policy Learning: The policy $\pi_{sim}$ is trained with PPO [68], taking advantage of fast parallel rollouts and dense reward signals. $\pi_{sim}$ performs strongly in simulation, but in the real world its performance degrades because the controller and the key-press dynamics do not match the simulator.
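The modified key press reward quoted in the Reward Design bullet can be sketched as follows. This is a minimal illustration, not the authors' code: the exponential form of the tolerance function `g`, its `scale`, and the `press_threshold` used to decide that a non-target key counts as a false positive are all assumptions made for the example.

```python
import numpy as np

def g(x, scale=0.05):
    # Assumed tolerance function: 1.0 at zero error, decaying toward 0 as error grows.
    return np.exp(-np.asarray(x, dtype=float) / scale)

def key_press_reward(key_state, target_keys, press_threshold=0.5):
    """key_state: per-key activation in [0, 1] (1 = fully pressed).
    target_keys: indices of the keys that should be pressed at this step."""
    key_state = np.asarray(key_state, dtype=float)
    target = np.zeros(key_state.shape[0], dtype=bool)
    target[np.asarray(list(target_keys), dtype=int)] = True
    # First term: average closeness of the target keys to the fully pressed state.
    # Assumption: with no target keys, this term counts as satisfied.
    press_term = g((key_state[target] - 1.0) ** 2).mean() if target.any() else 1.0
    # Second term: lost entirely if any non-target key is pressed.
    false_positive = bool((key_state[~target] > press_threshold).any())
    return float(0.7 * press_term + 0.3 * (1.0 - false_positive))

clean = key_press_reward([0, 1, 0, 0], [1])    # correct key only
sloppy = key_press_reward([1, 1, 0, 0], [1])   # correct key plus a wrong neighbor
```

With this weighting, a wrong neighbor costs only the 0.3 term, so the policy still earns most of the reward for hitting the right key.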

B. Policy Refinement

Before running residual RL, a lightweight policy refinement procedure is applied in the real world. Its goal is to modify the initial open-loop trajectory $\tau^0 = (s^0_0, ..., s^0_T)$ obtained from the simulation-trained $\pi_{sim}$ and produce $\tau^* = (s^*_0, ..., s^*_T)$.

  • Lateral Joint Correction: Domain knowledge (key geometry and hand kinematics) is used to correct consistent lateral biases and contact misalignments.
    • $\pi_{sim}$ is executed open-loop on the real robot, and at each time step $t$ the procedure records (i) the target notes and their assigned fingers and (ii) the set of keys actually pressed, $K_{press_t}$.
    • For each finger, the pressed key closest to the target, $k_{press_t}$, is identified. If $k_{press_t}$ differs from the target $k_{target_t}$, a signed directional error is computed: $$\Delta_t = \begin{cases} +\delta & \text{if } k_{press_t} < k_{target_t} \\ -\delta & \text{if } k_{press_t} > k_{target_t} \\ 0 & \text{otherwise} \end{cases}$$ where $\delta$ is the step size that controls how far the lateral finger joint is adjusted.
  • Iterative Updates: The correction is applied iteratively, alternating between executing the trajectory and updating it. $\delta$ starts large and is annealed at every iteration to avoid oscillation and help smooth convergence. A small correction term such as $0.3\Delta_t$ is added to adjacent fingers to encourage spatial separation. When several keys are pressed, the active finger further to the left is assumed to press the lower-pitched key and the active finger further to the right the higher-pitched key.
  • Chunked Updates: Updates are performed over temporal chunks of length $K$ rather than at every time step. For smoother motion, fingertip errors up to $t+K+L$ are taken into account, which promotes anticipatory spatial adjustments. $\Delta_{chunk_t}$ is computed as $$\Delta_{chunk_t} = \frac{1}{K+L} \sum_{j=t}^{t+K+L} \Delta_j$$ At the end of the iterations, the trajectory with the best F1 score is saved as the refined trajectory ($\tau^*$).
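The signed error and the chunked average above can be sketched in a few lines. This is an illustrative reading of the two formulas, assuming keys are identified by integer indices that increase from left to right and that a missed press (`None`) falls under the "otherwise" case; the function names are hypothetical.

```python
import numpy as np

def signed_error(k_press, k_target, delta):
    """+delta if the pressed key lies below the target, -delta if above, else 0."""
    if k_press is None or k_press == k_target:
        return 0.0
    return delta if k_press < k_target else -delta

def chunk_correction(errors, t, K, L):
    """Sum Delta_j over j = t .. t+K+L (truncated at the end of the song), scaled by 1/(K+L)."""
    window = np.asarray(errors[t : t + K + L + 1], dtype=float)
    return float(window.sum() / (K + L))

pressed = [60, 60, 62, None, 64]   # closest pressed key per step (None = nothing pressed)
target = [62, 62, 62, 62, 62]      # target key per step
errors = [signed_error(p, k, delta=0.02) for p, k in zip(pressed, target)]
correction = chunk_correction(errors, t=0, K=2, L=1)
```

Averaging over the look-ahead window means a finger that is about to miss to the left already starts drifting right a few steps earlier, which is the anticipatory behavior the text describes.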

C. Real-World Residual Reinforcement Learning

A residual reinforcement learning framework is adopted to fine-tune the open-loop trajectory $s^*_0, ..., s^*_T$ obtained from policy refinement.

  • Residual Policy Formulation: The residual policy $\pi_{res}$ outputs an additive correction to the base action: $\hat{s}_{t+1} = \pi_{res}(o_t) + s^*_{t+1}$, where $o_t$ is the real-world observation at time $t$ and $s^*_{t+1}$ is the next state of the open-loop trajectory. The output of $\pi_{res}$ is restricted to small perturbations, which allows safer exploration and faster learning.
  • Residual RL Objective: In the real world, only the key press reward derived from the piano's MIDI output is used (the same one used in simulation). $\pi_{res}$ is trained with reinforcement learning to maximize the expected reward under real-world dynamics.
  • Guided Noise: The TD3 [65] algorithm is used, and a noise term is added to the sampled action. Motivated by the lateral adjustment used in policy refinement, the direction of the noise $\epsilon \sim \mathcal{N}(0,1)$ is steered toward the direction of the correct lateral movement. With probability $\Pr(\text{guided noise}) = 0.5$, the sign of the noise on the relevant lateral joint is changed to match the sign of $\Delta_t$, giving $\hat{\epsilon}$. The final action is $a = \mu_\theta(o) + \text{clip}(\hat{\epsilon}, -0.5, 0.5)$. This is a lightweight heuristic that steers exploration toward pressing the correct key.
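The guided-noise rule can be sketched as below. This is a hedged illustration rather than the authors' implementation: the helper names are hypothetical, the flip is drawn independently per lateral joint (the text does not say whether the 0.5 probability applies per joint or per action), and a direction of 0 is taken to mean "no guidance available".

```python
import numpy as np

def guided_noise(eps, direction, lateral_idx, rng, p_guided=0.5):
    """Flip the sign of eps on lateral joints to match `direction`, each with probability p_guided."""
    eps_hat = np.array(eps, dtype=float)
    for i in lateral_idx:
        if direction[i] != 0 and rng.random() < p_guided:
            eps_hat[i] = np.sign(direction[i]) * abs(eps_hat[i])
    return eps_hat

def exploration_action(mu, eps_hat, limit=0.5):
    """a = mu + clip(eps_hat, -limit, limit)."""
    return np.asarray(mu, dtype=float) + np.clip(eps_hat, -limit, limit)

rng = np.random.default_rng(0)
eps = rng.standard_normal(4)
direction = np.array([+1.0, 0.0, -1.0, 0.0])  # sign of Delta_t per joint; 0 = no guidance
eps_hat = guided_noise(eps, direction, lateral_idx=[0, 2], rng=rng)
action = exploration_action(np.zeros(4), eps_hat)
```

Only the sign of the noise changes, so the exploration magnitude stays the same as in plain TD3; the clip then bounds every component to ±0.5.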

III. Experimental Results

HandelBot was evaluated with a bimanual robot system on five different songs (Twinkle Twinkle, Ode to Joy, Hot Cross Buns, Fur Elise, Prelude in C).

  • Hardware setup: Tesollo DG-5F hands are mounted on a Franka Emika Panda arm and an FR3 arm. A MIDI keyboard detects which notes were pressed, and this is used to compute the reward.
  • Safety and deployment: A safety layer is added with PyRoki [67]. Policy actions are generated at 10 Hz and then linearly interpolated to 80 Hz. The arms are controlled at 100 Hz with the Polymetis controller.
  • Main results (Fig. 4): HandelBot consistently achieved the highest F1 score on every evaluated song. Methods that use only simulation data (e.g., $\pi_{sim}$(CL) and $\pi_{sim}$) performed markedly worse because of the sim-to-real gap. Policy refinement is effective at aligning finger presses directly with the correct target keys, and residual RL resolves the remaining errors and adapts to the physical dynamics, improving performance substantially.
  • Importance of residual RL (Table I, II): Learning residual RL on top of a better-initialized trajectory led to higher F1 scores (refined trajectory > $\pi_{sim}$ > no initialization). This supports the hypothesis that the refined policy shrinks the exploration space and so makes training more stable and efficient. A low RL discount factor $\gamma$ lowers the F1 score and makes the motion erratic. For guided noise, the default setting ($\Pr(\text{guided noise}) = 0.5$) performed similarly to $\Pr(\text{guided noise}) = 0$, but always sampling guided noise degraded performance, presumably because the biased finger exploration hinders learning from suboptimal data.
  • Closed-loop sim-to-real (Table I): Hybrid execution runs the simulated environment in parallel with the real one and feeds simulated observations to the policy to mitigate the sim-to-real gap. Hybrid execution improved on direct transfer, but a large performance gap remained relative to HandelBot and the other methods that use real-world data.

IV. Conclusion and Limitations

HandelBot addresses the extreme precision requirements of robot piano playing through reinforcement learning in simulation, policy refinement, and residual reinforcement learning. The method shows that a brittle, imperfect simulation policy can be turned into a much more robust piano-playing robot with only about 30 minutes of real-world data.

Limitations:

  • HandelBot relies on scripted end-effector motion and a fixed orientation, which requires manual tuning each time. Residual RL on the end-effector motion could reduce this problem.
  • These constraints make it hard to use the thumb and little finger, so the evaluation was limited to relatively simple songs. Future work could explore rotation or learned motion for more complex pieces.
  • The policy refinement stage depends on a human-guided heuristic. It suits piano playing but may be hard to apply directly to other tasks. Large models such as vision-language models might make policy refinement possible for other tasks as well.

🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

One-Line Summary

If a policy that was trained well in simulation is put on the real robot as is, the fingers hit neighboring keys. HandelBot solves this by using 30 minutes of real-world data in two stages (heuristic lateral joint correction + residual TD3). The F1 score improves by about 1.8× over direct sim-to-real, and a setup of two Tesollo DG-5F hands on two Franka arms plays five songs (Twinkle Twinkle, Ode to Joy, Hot Cross Buns, Prelude in C, Fur Elise). The core idea is: "leave the expensive RL to simulation, and spend the expensive real-world data only where it is actually needed."

Why Is Piano Playing So Hard?

A single piano key is about 23 mm wide. White and black keys are interleaved, and five fingers on each of two hands have to move separately. For a robot, this amounts to three constraints that must hold at the same time.

First, spatial precision. If a fingertip drifts just 1 cm sideways, the neighboring key is pressed. If the correct note is C and the robot presses D, the score for that note is simply 0. Second, temporal precision. Music is rhythm; arrive 100 ms late and it sounds like different music. Third, bimanual coordination. The left hand playing the bass line and the right hand playing the melody must move independently yet in sync. On top of this, the finger has to press the key to exactly the right depth before MIDI registers that the note was pressed.

Tasks that demand this much precision on several axes at once are rare in robotics. In-hand manipulation such as cube rotation is relatively tolerant of position error. Pulling out an embedded nail is insensitive to timing. The piano is strict on both. It also needs independent control of five fingers, so its dimensionality is far higher than low-dimensional cube rotation.

The Tesollo DG-5F hand used by the authors is clearly larger than a human hand (a difficulty the paper mentions directly). When a large hand spreads five fingers over narrow keys, a finger often ends up on the verge of touching the next key. People with large hands also hit adjacent notes by mistake, but on the robot the error shows up as a consistent, systematic bias every time. This observation is the design motivation for HandelBot's Stage 1.

Core Insight: 30 Minutes in the Real World Beats 100 Hours in Simulation

The paper's message can be compressed into one sentence: in precise dexterous tasks, a residual sim-to-real gap remains no matter how good the simulator is, and that gap can only be closed with a small amount of real-world data. The real question is how to spend that small amount efficiently.

There are three options. (a) Run RL in the real world from scratch (terrible sample efficiency, high cost in time and hardware wear). (b) Learn a perfect policy in simulation and make it robust with domain randomization (clear limits on high-dimensional precision tasks). (c) Use the simulation policy as a seed and learn only the residual from real-world data (good sample efficiency, but residual RL is tricky in its own right).

HandelBot takes (c), but inserts one step before residual learning: "any gap that can be closed with a heuristic should be closed with a heuristic." If a finger is consistently off by 1 cm to the left, a person can simply tell it to "move 1 cm sideways" before RL has to discover that. Where human prior knowledge applies (keyboard geometry + finger kinematics), no learning is used. Learning is used only where it is truly needed.

System Overview

flowchart TB
    subgraph SIM["Stage 0: Simulation (ManiSkill)"]
        A["PPO Training<br/>MIDI-based reward"]
        A --> B["Best π_sim selection<br/>via validation F1"]
        B --> C["Open-loop trajectory τ_sim"]
    end

    subgraph REAL["Stage 1: Structured Refinement (real, heuristic)"]
        C --> D["Execute on hardware"]
        D --> E["Compare pressed vs target keys"]
        E --> F["Adjust lateral joints<br/>iteratively"]
        F --> G["Refined trajectory τ*_sim"]
    end

    subgraph RES["Stage 2: Residual RL (real, learned)"]
        G --> H["Residual policy π_res<br/>on top of τ*_sim"]
        H --> I["TD3 with guided noise"]
        I --> J["HandelBot policy"]
    end

    J --> K["10 Hz policy<br/>→ PyRoki IK safety layer<br/>→ 80 Hz hand commands"]

The full pipeline separates cleanly into three stages. Stage 0 builds the best possible base policy in simulation. Stage 1 runs that policy's deterministic roll-out in the real world and uses a heuristic to correct the lateral joints of fingers that miss. Stage 2 learns residual RL on top of the refined trajectory, automatically picking up the fine corrections that Stage 1 could not capture.

Stage 0: Policy Learning in Simulation

Stage 0 produces two things: the policy $\pi_{sim}$ and the open-loop trajectory $\tau_{sim}$ extracted from it. The simulator is ManiSkill (parallel acceleration, GPU-friendly). The learning algorithm is PPO (the released code file is named piano_ppo_fast.py).

Two design choices here are interesting. First, what is carried into the real world is not the learned stochastic policy itself but the open-loop trajectory it generates in simulation. In the real-world stage the robot does not "observe and decide"; it follows precomputed joint values indexed by time $t$. Second, among the learned policies, the single trajectory with the highest simulation validation F1 is selected as the sim-to-real starting point. In other words, the system takes out not "a statistically good policy" but "one performance that went well."

This choice is pragmatic. Playing a piece means following a fixed sequence, so closed-loop observation may not be needed at all. And since RL results vary considerably from seed to seed, picking the single best performance and polishing it is the more stable option. The drawback is weakness to environmental disturbances (for example, the keyboard shifting slightly), but this is not a major issue because the system assumes the piano is rigidly mounted.

The per-song horizon is set as follows (according to the released code).

| Song | Horizon (steps) |
| --- | --- |
| Twinkle Twinkle | 160 |
| Ode to Joy | 330 |
| Hot Cross Buns | 160 |
| Prelude in C | 330 |
| Fur Elise | 320 |

The training reward is based on the keyboard's MIDI output. At every step, the set of pressed notes is compared with the set of notes the score requires, and an F1-style score is computed. Intuitively: pressing a correct key is a plus, pressing a wrong key is a minus, and failing to press a required key is a minus. Because this signal arrives densely, it is easy for RL to learn from.

Stage 1: Structured Policy Refinement (Heuristic Lateral Joint Correction)

Stage 1 is the interesting part. It is not learning. It is a hand-written rule. Yet its effect is large.

The basic idea is as follows. Each Tesollo finger has joints that flex up and down as well as a joint that swings sideways (the lateral joint). When a finger misses its target key, the miss is mostly a matter of how far it has drifted sideways. The amount of vertical lift matches the simulation fairly well; only the sideways direction is systematically off.

So, knowing which key each finger should strike at each timestep, the algorithm looks at the real-world roll-out and repeatedly shifts only the lateral joints. In pseudocode:

Algorithm: Iterative Lateral Joint Refinement
Input:  open-loop trajectory tau (joint targets over time)
        target MIDI sequence M (target key per finger per timestep)
        finger-to-lateral-joint mapping L
        step size delta, iterations N
Output: refined trajectory tau_star

for iter = 1 to N do
    pressed <- execute_real(tau)        # roll out on hardware
    for each timestep t in tau do
        for each finger f do
            k_target <- M[t][f]
            k_actual <- pressed[t][f]
            if k_actual is not k_target then
                dir <- sign(key_x(k_target) - key_x(k_actual))
                tau[t][L[f]] <- tau[t][L[f]] + dir * delta
            end
        end
    end
end
return tau as tau_star
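
The pseudocode above can be turned into a small runnable sketch. Everything hardware-related here is hypothetical: `toy_hardware` stands in for the real robot and simply maps a lateral joint angle to a key index with a constant bias, key indices are assumed to increase from left to right (so their difference replaces `key_x`), and the step-size decay and best-trajectory bookkeeping follow the annealing and best-F1 selection described in the Ping Review, with a hit count standing in for F1.

```python
import numpy as np

def refine(tau, targets, lateral_joint, execute, n_iters=10, delta=0.08, decay=0.7):
    """tau: (T, J) joint targets; targets: (T, F) target key per finger, -1 when idle;
    lateral_joint[f]: column of tau holding finger f's lateral joint;
    execute(tau) -> (T, F) array of keys actually pressed."""
    tau = np.array(tau, dtype=float)
    best_tau, best_hits = tau.copy(), -1
    for _ in range(n_iters):
        pressed = execute(tau)                      # roll out on (toy) hardware
        active = targets >= 0
        hits = int(np.sum((pressed == targets) & active))
        if hits > best_hits:                        # keep the best trajectory seen so far
            best_tau, best_hits = tau.copy(), hits
        for t, f in zip(*np.nonzero(active & (pressed != targets))):
            direction = np.sign(targets[t, f] - pressed[t, f])
            tau[t, lateral_joint[f]] += direction * delta
        delta *= decay                              # anneal the step size
    return best_tau

def toy_hardware(tau, bias=-0.12, rad_per_key=0.1):
    # Hypothetical plant: the lateral angle selects a key, shifted by a constant bias.
    return np.round((tau + bias) / rad_per_key).astype(int)

targets = np.array([[3, 5], [3, 5], [4, 6]])       # target key per (timestep, finger)
tau_sim = targets * 0.1                             # exact under a bias-free "simulator"
tau_star = refine(tau_sim, targets, lateral_joint=[0, 1], execute=toy_hardware)
```

In this toy setting the unrefined trajectory lands one key to the left everywhere, and the refined one presses the intended keys; on real hardware the number of iterations needed would depend on the step size and on how repeatable the roll-outs are.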

The virtues of this procedure are clear. First, it removes a large spatial bias with a simple deterministic correction, without running RL in the real world at all. Second, the correction direction has a clear physical meaning (the keyboard x-axis), so no unstable learning signal gets in the way. Third, it finishes within a fixed number of iterations.

There are limitations, of course, and the paper acknowledges them. (1) Because only the lateral joints are touched, a case where the finger never reaches the key at all cannot be fixed. (2) It depends on the assumption that the finger-to-key assignment ("which finger plays which key") is correct. In practice, it may be more natural for the simulation policy to have used the neighboring finger. (3) Other degrees of freedom, such as the Z-axis (press depth) or the flexion angle, are not touched.

These limitations are exactly why Stage 2 exists.

Stage 2: Residual RL with TD3 (Guided Noise)

The Stage 1 result $\tau^*_{sim}$ becomes the new baseline. A residual policy $\pi_{res}$ is learned on top of it, so the actual robot command is:

$$a_t = a^*_{sim}(t) + \pi_{res}(s_t)$$

Here $a^*_{sim}(t)$ is the nominal action read from the refined trajectory at time index $t$, and $\pi_{res}$ takes the state $s_t$ as input and outputs a correction. Residual learning has two advantages. First, exploration is safe: the base motion is already nearly right, so the residual only needs to perturb a small range. Second, the reward signal has low variance: the baseline already plays a fair share of the notes, so small changes in the residual map consistently to small changes in F1.

The authors use TD3 (Twin Delayed DDPG) for residual learning. It adds noise to a deterministic policy, so it learns stably while collecting real-world samples off-policy. Up to this point everything is standard.

The neat detail is this. TD3 adds Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ to the action for exploration. In ordinary TD3 the noise is isotropic, that is, it perturbs every direction equally. The authors instead take the view that even the sign of the noise can be guided. The lateral correction direction used in Stage 1 serves as that guide.

Concretely, with probability $\Pr(\text{guided noise}) = 0.5$ the sign of the noise is flipped to agree with the correct direction determined by the lateral correction.

$$\hat{\epsilon}_i = \begin{cases} \mathrm{sign}(d_i) \cdot |\epsilon_i| & \text{with prob. } 0.5 \\ \epsilon_i & \text{otherwise} \end{cases}$$

Note that $\|\hat{\epsilon}\|_2 = \|\epsilon\|_2$: the magnitude is kept and only the sign changes (and only at the relevant lateral joint indices). This is exploration with a directional bias. Rather than shaking blindly, it makes the finger move in the correct direction more often when it has drifted too far toward a neighboring key.
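The norm-preservation claim can be checked in one line. The derivation below is a sketch that assumes $d_i \neq 0$ wherever a flip is applied; the two-dimensional example values are made up for illustration.

```latex
\hat{\epsilon}_i \in \{\epsilon_i,\ \mathrm{sign}(d_i)\,|\epsilon_i|\}
\;\Rightarrow\; \hat{\epsilon}_i^{\,2} = \epsilon_i^{\,2}
\;\Rightarrow\; \|\hat{\epsilon}\|_2^2 = \sum_i \hat{\epsilon}_i^{\,2} = \sum_i \epsilon_i^{\,2} = \|\epsilon\|_2^2 .
% Example: \epsilon = (-0.4,\ 0.3),\ d = (+1,\ 0) with a flip at index 1 gives
% \hat{\epsilon} = (0.4,\ 0.3), and both vectors have norm \sqrt{0.16 + 0.09} = 0.5 .
```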

Why does this matter? Learning a meaningful residual within 30 minutes is hard for standard TD3, because the precise finger corrections have to be found within roughly 500 to 1000 roll-outs. Guided noise aligns half of the exploration with a direction supplied by a human, with the aim of raising sample efficiency substantially. It is a clear application of the pattern the residual RL literature (e.g., Johannink et al., Davchev et al.) has shown: injecting a prior into exploration, by whatever means, tends to win.

flowchart LR
    A["Nominal action<br/>a*_sim(t)"] --> S["+"]
    B["Residual policy<br/>π_res(s_t)"] --> S
    C["Gaussian noise<br/>ε ~ N(0,I)"] --> N["Sign flip<br/>(prob 0.5)"]
    D["Lateral direction<br/>d_i from refinement"] --> N
    N --> S
    S --> E["Final action a_t<br/>→ robot"]

ํ•˜๋“œ์›จ์–ด ์„ค์ •๊ณผ ์•ˆ์ „ ๊ณ„์ธต

๋…ผ๋ฌธ Figure 2๊ฐ€ ๋ณด์—ฌ์ฃผ๋Š” ํ•˜๋“œ์›จ์–ด ๊ตฌ์„ฑ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • ๋ฒ ์ด์Šค: ๋‘ ๋Œ€์˜ Franka ์•”(Panda + FR3)
  • ์—”๋“œ ์ดํŽ™ํ„ฐ: ๋‘ ๊ฐœ์˜ Tesollo DG-5F dexterous hand(์†๊ฐ€๋ฝ 5๊ฐœ์”ฉ, ์‚ฌ๋žŒ ์†๋ณด๋‹ค ํผ)
  • ํ™˜๊ฒฝ: MIDI ์ถœ๋ ฅ์ด ๊ฐ€๋Šฅํ•œ ๋””์ง€ํ„ธ ํ‚ค๋ณด๋“œ
  • ๊ฐ์ง€: ํ‚ค๋ณด๋“œ์˜ MIDI๋ฅผ ๋ณด์ƒ ์‹ ํ˜ธ๋กœ ์‚ฌ์šฉ

์—ฌ๊ธฐ์„œ ๋น„์ „ ์„ผ์„œ๋‚˜ ์ด‰๊ฐ ์„ผ์„œ๊ฐ€ ๋ช…์‹œ์ ์œผ๋กœ ๋“ค์–ด๊ฐ€์ง€ ์•Š๋Š”๋‹ค๋Š” ์ ์ด (๋‹ค๋ฅธ ์ •๋ฐ€ manipulation ์—ฐ๊ตฌ๋“ค๊ณผ ๋น„๊ตํ•˜๋ฉด) ๋‹ค๋ฅด๋‹ค. ๋ณด์ƒ์ด ํ™˜๊ฒฝ์—์„œ ์ง์ ‘ ์ธก์ • ๊ฐ€๋Šฅํ•œ ํ˜•ํƒœ(MIDI)๋กœ ๋–จ์–ด์ง€๋ฏ€๋กœ, ground truth๊ฐ€ ๊นจ๋—ํ•˜๋‹ค. RoboPianist์˜ ์‹œ๋ฎฌ ๋ณด์ƒ ์‹ ํ˜ธ์™€ ๊ฑฐ์˜ ๋™ํ˜•(ๅŒๅฝข)์ธ ์…ˆ์ด๋‹ค.

์•ˆ์ „ ๊ณ„์ธต๋„ ์ •์„ฑ์Šค๋Ÿฝ๋‹ค. ์‹œ๋ฎฌ์—์„œ ํ•™์Šต๋œ ๊ด€์ ˆ ๋ชฉํ‘œ๋ฅผ ๊ทธ๋Œ€๋กœ ์‹ค๋กœ๋ด‡์— ๋‚ด๋ฆฌ๋ฉด ์ž๊ธฐ ์ถฉ๋Œ์ด๋‚˜ ๊ฑด๋ฐ˜ ํ‘œ๋ฉด์„ ๋šซ๋Š” ๋™์ž‘์ด ์ƒ๊ธด๋‹ค. ๊ทธ๋ž˜์„œ ์ €์ž๋“ค์€ PyRoki๋ฅผ ์จ์„œ IK๋ฅผ ์ œ์•ฝ ์ตœ์ ํ™”๋กœ ํ‘ผ๋‹ค. ์ž๊ธฐ ์ถฉ๋Œ ํŽ˜๋„ํ‹ฐ + piano surface๋ฅผ ํ‰๋ฉด ์ œ์•ฝ์œผ๋กœ ๊ทผ์‚ฌํ•œ ๋น„์นจํˆฌ ํŽ˜๋„ํ‹ฐ๋ฅผ ํ•จ๊ป˜ ๋‘”๋‹ค. ์ •์ฑ… ์ถœ๋ ฅ์€ 10Hz๋กœ ๋‚˜์˜ค๊ณ , ์†์— ๋‚ด๋ ค๊ฐ€๋Š” ๋ช…๋ น์€ 80Hz๋กœ ์„ ํ˜• ๋ณด๊ฐ„๋˜์–ด ๋ถ€๋“œ๋Ÿฝ๊ฒŒ ํ๋ฅธ๋‹ค. ์‹œ๋ฎฌ๊ณผ ์‹ค์„ธ๊ณ„์˜ control rate๋ฅผ ๋‹ค๋ฅด๊ฒŒ ๊ฐ€์ ธ๊ฐ€๋Š” ๊ฑด sim-to-real์—์„œ ํ”ํ•œ ํŠธ๋ฆญ์ด๋‹ค.
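The 10 Hz to 80 Hz linear interpolation mentioned above can be sketched with `numpy.interp`. This is only an illustration of the resampling step under assumed array shapes; the function name is hypothetical, and the actual deployment code (including how the PyRoki safety layer is applied) is not shown in the source.

```python
import numpy as np

def upsample_linear(policy_targets, policy_hz=10, command_hz=80):
    """Linearly interpolate (N, J) joint targets from policy_hz to command_hz."""
    policy_targets = np.asarray(policy_targets, dtype=float)
    ratio = command_hz // policy_hz                 # commands per policy step
    n = policy_targets.shape[0]
    t_policy = np.arange(n, dtype=float)            # policy step indices
    t_command = np.arange((n - 1) * ratio + 1) / ratio
    columns = [np.interp(t_command, t_policy, policy_targets[:, j])
               for j in range(policy_targets.shape[1])]
    return np.stack(columns, axis=1)

# Three consecutive 10 Hz targets for two joints.
commands = upsample_linear([[0.0, 1.0], [0.8, 1.0], [0.8, 0.2]])
```

Each pair of consecutive policy targets is bridged by eight evenly spaced commands, so a 0.8 rad jump becomes steps of 0.1 rad at the hand.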

Seen alongside JungYeon's ongoing IsaacGym→IsaacLab migration work, the split design of ManiSkill-based training plus a separate IK/safety layer is a familiar structure. The difference is that HandelBot has strong static environment constraints from the piano (a planar keyboard + a fixed mount), so the safety layer could be simplified to "self-collision + planar contact." When solving general manipulation with an Allegro Hand, the contact model becomes much more complex.

Experimental Results

Five songs are evaluated: Twinkle Twinkle Little Star, Ode to Joy, Hot Cross Buns, Bach's Prelude in C, and Beethoven's Fur Elise. Difficulty increases progressively. The last two, and Fur Elise in particular, involve frequent large left-hand jumps (moves to a different group of keys), so the static finger-assignment assumption breaks easily.

The evaluation metric is F1 ×100. F1 combines recall (the fraction of required notes pressed at the right time) and precision (the fraction of pressed notes that were correct). It is a suitable metric for evaluating musical accuracy.
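As a concrete illustration of the metric, the sketch below computes F1 ×100 from per-timestep sets of pressed and target MIDI notes. How the paper aggregates over timesteps is not stated in this review, so pooling true positives, false positives, and misses over the whole song is an assumption, and `song_f1` is a hypothetical helper name.

```python
def song_f1(pressed_per_step, target_per_step):
    """Pool true/false positives and misses over all timesteps, then return F1 x 100."""
    tp = fp = fn = 0
    for pressed, target in zip(pressed_per_step, target_per_step):
        pressed, target = set(pressed), set(target)
        tp += len(pressed & target)    # correct notes
        fp += len(pressed - target)    # wrong notes
        fn += len(target - pressed)    # missed notes
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)

pressed = [{60}, {62, 63}, set(), {65}]   # one extra note at step 1, one miss at step 2
target = [{60}, {62}, {64}, {65}]
score = song_f1(pressed, target)
```

A wrong neighbor key lowers precision and a missed press lowers recall, so both failure modes discussed in this review pull the score down.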

Six methods are compared, HandelBot and five baselines or variants (per Figure 3 and Table I of the paper).

  1. HandelBot (Ours): Stage 0 + Stage 1 + Stage 2, all three
  2. HandelBot w/o ResRL: Stage 0 + Stage 1 only (up to the heuristic correction)
  3. π_sim (closed-loop): the learned stochastic policy deployed on the real robot as is
  4. π_sim (open-loop): the simulation trajectory executed as is
  5. RL from Scratch: learning from scratch in the real world
  6. Hybrid execution: a variant that takes proprioception from a parallel simulation during real-world execution

The big picture of the results is as follows.

  • HandelBot records the highest F1 on every song.
  • Direct sim-to-real (π_sim) trails by a wide margin on every song. On average, HandelBot reaches about 1.8× its F1.
  • The Stage 1-only version (HandelBot w/o ResRL) is also clearly better than π_sim. The heuristic lateral correction alone closes a large part of the spatial gap.
  • RL from Scratch produces almost no meaningful result within the 30-minute budget. Without a base, the exploration space is too large.

Table 1 (summary). F1 ×100 (higher is better; a qualitative comparison based on the five-song average)

| Method | Uses real data | Real-world budget | Avg F1 (qualitative) |
| --- | --- | --- | --- |
| π_sim (open/closed-loop) | No | 0 | Lowest |
| RL from Scratch | Yes | 30 min | Lowest among learned |
| HandelBot w/o ResRL | Yes | 30 min | Mid (clear gain over π_sim) |
| HandelBot (Ours) | Yes | 30 min | Highest, ~1.8× over π_sim |

The ablations emphasize two messages. One: Stage 1 alone already brings a large gain, but that gain comes only from the narrow degree of freedom of the lateral joints, so it cannot fix missed presses (keys not pressed at all) or finger-assignment errors. Two: Stage 2 covers those cases. Adding Stage 2 raises F1 further on most songs. In other words, the two stages are cumulative.

flowchart LR
    A["Direct sim-to-real<br/>(π_sim)"] -->|"+Stage 1<br/>lateral fix"| B["+Spatial alignment<br/>gain"]
    B -->|"+Stage 2<br/>residual TD3"| C["+Missed presses<br/>+timing<br/>+assignment errors"]
    C --> D["HandelBot<br/>~1.8x F1"]

Detailed Analysis: Note Press Visualization

One figure in the paper (the "note press" plot, also available on the project web page) goes beyond the quantitative metric and shows visually where failures occur. The horizontal axis is the song timestep and the vertical axis is each note (upper half: right hand, lower half: left hand). Each point is colored by outcome at the time of the press: correct, incorrect, or missed.

The interesting patterns are as follows.

  • On the easy songs (Twinkle Twinkle, Ode to Joy), almost every point is filled with the "correct" color. A wrong note slips in occasionally, but there are no large omissions.
  • On Fur Elise, missed and incorrect presses stand out in the left hand (lower half). The fingers cannot keep up in passages where the left hand jumps widely between bass notes and chords.
  • Prelude in C shows a somewhat different picture. The notes themselves flow relatively slowly, but there are many passages where both hands must play several notes at once, so "simultaneous accuracy" is the weak point.

Another visualization, a learning-curve view (five evaluation trajectories for Twinkle Twinkle), shows residual RL at work. Early on, the left hand misses several keys; as real-world interaction accumulates, the residual policy gradually fills in those omissions. Debugging-oriented visualizations like this, which show directly where things break, are genuinely useful in dexterous manipulation papers.

๊ด€๋ จ ์—ฐ๊ตฌ์™€์˜ ๋น„๊ต

์ด ๋ถ„์•ผ์˜ ์ขŒํ‘œ๊ณ„๋ฅผ ์žก๊ธฐ ์œ„ํ•ด ์ฃผ์š” ๊ด€๋ จ ์—ฐ๊ตฌ๋ฅผ ์ •๋ฆฌํ•ด ๋ณด์ž.

RoboPianist (Zakka et al., 2023). ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ์ „์šฉ ํ”ผ์•„๋…ธ ์—ฐ์ฃผ ๋ฒค์น˜๋งˆํฌ. ๋‘ ๊ฐœ์˜ Shadow Hand๋กœ 150๊ณก์„ ์นœ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์‹ค์„ธ๊ณ„ ์‹คํ–‰ ๊ฒฐ๊ณผ๋Š” ์—†๋‹ค. HandelBot์€ ์ •ํ™•ํžˆ ๊ทธ missing piece(real-world execution)๋ฅผ ์ฑ„์šฐ๋Š” ์ž‘์—…์ด๋‹ค.

Towards Learning to Play Piano with Dexterous Hands and Touch (Xu et al., 2022). TACTO ๊ธฐ๋ฐ˜ ์‹œ๋ฎฌ์—์„œ Allegro Hand + DIGIT ์ด‰๊ฐ ์„ผ์„œ๋กœ ํ”ผ์•„๋…ธ๋ฅผ ์นœ ์ดˆ๊ธฐ ์ž‘์—…. ๋‹จ์Œ ์œ„์ฃผ์ด๊ณ  ์—ญ์‹œ ์‹œ๋ฎฌ์ด๋‹ค. JungYeon์ด ๊ด€์‹ฌ์„ ๊ฐ–๊ณ  ์žˆ๋Š” TACTO์™€ ์ง์ ‘ ์—ฐ๊ฒฐ๋˜๋Š” ์„ ํ–‰ ์—ฐ๊ตฌ๋‹ค. HandelBot์€ ์ด‰๊ฐ ์„ผ์„œ ์—†์ด MIDI ๋ณด์ƒ๋งŒ์œผ๋กœ ์ง„ํ–‰ํ•œ๋‹ค๋Š” ์ ์—์„œ ๋” ๋‹จ์ˆœํ•œ sensing์ด์ง€๋งŒ, ๊ทธ๋งŒํผ ๋ณด์ƒ ์‹ ํ˜ธ๊ฐ€ ๊นจ๋—ํ•˜๋‹ค.

FurElise (2024, motion capture-based). Captures human piano performances with motion capture and synthesizes hand motion with diffusion plus reinforcement learning. It is physics-based character animation in simulation, not a real robot. Its flavor differs from HandelBot, but it illustrates an alternative paradigm: “human data → learning.”

The residual RL lineage. Johannink et al. 2019, Davchev et al. 2022, and others are the prototypes of residual learning. A simulation prior plus a real-world residual is close to a standard recipe in precision manipulation. HandelBot adds one step to it: before learning the residual, align the trajectory once more with a deterministic heuristic. That extra step is what makes the short 30-minute budget feasible.

The HORA family (in-hand rotation). A line of work that achieved in-hand rotation via sim-to-real transfer. The policy adapts in closed loop by observing proprioception. HandelBot is the opposite: the base policy is open-loop and only the residual is closed-loop. The difference comes from the nature of the task. HORA operates in an environment with large disturbances (a palm from which the object can roll off), so closed-loop control is essential. Piano playing is a static environment with almost no disturbances, so open-loop execution works.
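The split between an open-loop base and a closed-loop residual can be sketched as below. This is a minimal illustration under stated assumptions: `rollout_commands`, the `residual_policy` and `read_obs` callables, and the `max_residual` bound of 0.05 are hypothetical names and values, not taken from the paper.

```python
from typing import Callable, List, Sequence


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x into [lo, hi]."""
    return max(lo, min(hi, x))


def rollout_commands(
    base_traj: Sequence[Sequence[float]],
    residual_policy: Callable[[int, Sequence[float]], Sequence[float]],
    read_obs: Callable[[int], Sequence[float]],
    max_residual: float = 0.05,  # assumed bound on the correction per joint
) -> List[List[float]]:
    """Replay a fixed base trajectory and add a bounded, observation-dependent residual."""
    commands = []
    for t, base in enumerate(base_traj):
        obs = read_obs(t)                # feedback enters only through the residual
        delta = residual_policy(t, obs)
        commands.append(
            [b + clip(d, -max_residual, max_residual) for b, d in zip(base, delta)]
        )
    return commands


# Toy usage: two timesteps, two joints, a constant residual (hypothetical values).
base_traj = [[0.0, 0.1], [0.2, 0.3]]
commands = rollout_commands(
    base_traj,
    residual_policy=lambda t, obs: [0.01, -0.2],  # second entry gets clipped to -0.05
    read_obs=lambda t: [0.0, 0.0],
)
```

Bounding the residual keeps the executed motion close to the base trajectory, which is one common way to make this kind of on-robot learning safer.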

Against this backdrop, what sets HandelBot apart is clear: (1) it is the first learning-based system to achieve bimanual dexterous piano playing on a real robot; (2) it needs a very short real-world budget of 30 minutes; (3) it divides the work cleanly between heuristics and learning (spatial bias goes to the heuristic, everything else to RL).

Critical Discussion

Strengths

Clear problem decomposition. The paper breaks down the question “where do the failures of a simulation-trained policy come from?” into lateral bias, missed presses, and finger-assignment errors, and assigns a different tool (heuristic or residual RL) to each. Rather than treating learning as a cure-all, it applies learning only where learning is needed. This modularity makes debugging easier in practice.

Sample efficiency. Thirty minutes is very short. Real-world RL for dexterous manipulation typically requires several hours to tens of hours. The result comes from the simulation prior, heuristic alignment, and guided exploration working together.

A clean reward signal. MIDI output is noise-free ground truth: whether a key was pressed is unambiguous. This stabilizes residual RL. In general manipulation, defining the reward is itself difficult, so having the environment supply the reward is a major asset of the piano setting.
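To illustrate how unambiguous key states turn into a scalar signal, here is a minimal sketch that scores an episode with an F1-style measure over pressed keys. The F1 form, the function name `press_f1`, and the toy data are assumptions for illustration; the exact reward used in the paper is not reproduced here.

```python
from typing import List, Set


def press_f1(target: List[Set[int]], played: List[Set[int]]) -> float:
    """F1 score over (timestep, key) presses, computed from MIDI-style key sets."""
    tp = sum(len(want & got) for want, got in zip(target, played))  # correct presses
    fp = sum(len(got - want) for want, got in zip(target, played))  # wrong presses
    fn = sum(len(want - got) for want, got in zip(target, played))  # missed presses
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy episode: 2 correct, 1 wrong, 2 missed presses (hypothetical data).
target = [{60}, {60, 64}, {67}]
played = [{60}, {60, 65}, set()]
score = press_f1(target, played)  # precision 2/3, recall 1/2, F1 = 4/7
```

Because every term is a count of discrete key events, no reward shaping or learned reward model is needed to compute it.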

The guided-noise idea. Injecting the correction direction obtained in Stage 1 into the Stage 2 exploration noise, as a sign, is simple but clever. Using the prior not only to initialize the policy but also to shape exploration itself is an idea worth carrying over to other precision manipulation tasks.
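One way to read “injecting the direction as a sign” is sketched below: Gaussian exploration noise is drawn as usual, and on dimensions where a prior direction is known, its sign is forced to match that direction. The function name `guided_noise`, the `{-1, 0, +1}` direction encoding, and `sigma = 0.02` are assumptions for illustration, and the paper's exact formulation may differ.

```python
import random
from typing import List, Sequence


def guided_noise(direction: Sequence[int], sigma: float, rng: random.Random) -> List[float]:
    """Sample exploration noise whose sign follows a prior direction where one exists.

    direction[i] is +1 or -1 when a correction direction is known for dimension i,
    and 0 when there is no prior (plain zero-mean Gaussian noise is used).
    """
    noise = []
    for d in direction:
        eps = rng.gauss(0.0, sigma)
        if d != 0:
            # Keep the sampled magnitude, force the sign toward the prior direction.
            eps = abs(eps) * (1 if d > 0 else -1)
        noise.append(eps)
    return noise


rng = random.Random(0)
samples = [guided_noise([1, -1, 0], sigma=0.02, rng=rng) for _ in range(200)]
```

In a TD3-style loop this noise would be added to the actor's output when collecting data, so exploration spends its samples on the side of the action space that the Stage 1 correction already points to.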

Weaknesses and Limitations

Reliance on an open-loop trajectory. The output of Stage 0 is not a stochastic policy but a single well-played open-loop trajectory. This suits tasks like piano that follow a fixed sequence, but it is hard to apply to general manipulation where perturbations occur. If the keyboard position shifts even slightly, the system can break down.

Prior knowledge of the finger-to-key assignment. The Stage 1 heuristic assumes that the mapping “at this timestep, this finger plays this key” is correct. The simulation policy, however, may strike a key with a different finger than the one a human would expect. When the assumption breaks, Stage 1 risks correcting in the wrong direction (the paper acknowledges this).

Only lateral joints are corrected. The heuristic does not touch other degrees of freedom such as the Z axis (press depth), flexion angle, or wrist rotation. If the sim-to-real gap shows up mainly in a degree of freedom other than the lateral one, the benefit of Stage 1 shrinks. When generalizing to other tasks, the first step is to identify which degree of freedom is the main source of systematic bias.
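The kind of lateral correction discussed here can be sketched as a simple signed update: when a finger presses a neighboring key instead of its assigned one, its lateral offset at that timestep is nudged toward the target. Everything below is a hypothetical illustration (the `Event` layout, `update_lateral_offsets`, the step size, and the convention that a positive offset moves toward higher key indices); it is not the paper's Stage 1 procedure.

```python
from typing import Dict, List, Optional, Tuple

# (timestep, finger, target_key, pressed_key); pressed_key is None for a missed press.
Event = Tuple[int, str, int, Optional[int]]


def update_lateral_offsets(
    offsets: Dict[Tuple[int, str], float],
    events: List[Event],
    step: float = 0.01,  # assumed correction step per iteration
) -> Dict[Tuple[int, str], float]:
    """Nudge per-timestep lateral offsets toward the assigned key after one rollout."""
    new_offsets = dict(offsets)
    for t, finger, target_key, pressed_key in events:
        if pressed_key is None or pressed_key == target_key:
            continue  # a miss carries no lateral direction; a correct press needs no change
        direction = 1 if target_key > pressed_key else -1
        key = (t, finger)
        new_offsets[key] = new_offsets.get(key, 0.0) + step * direction
    return new_offsets


events = [
    (0, "index", 60, 60),   # correct
    (1, "index", 64, 62),   # landed too low, so shift up
    (2, "thumb", 55, 57),   # landed too high, so shift down
    (3, "ring", 70, None),  # missed, left unchanged in this sketch
]
offsets = update_lateral_offsets({}, events)
```

The sketch also makes the limitation concrete: the update only knows about one lateral axis and trusts the finger-to-key assignment in `events`.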

No tactile information. The system uses no tactile sensors such as DIGIT or GelSight. MIDI suffices as a reward, so learning is possible, but the system cannot exploit information such as the state just before a finger makes contact with a key, or fine force control. In other words, it lacks expressivity. Expression of the kind a human pianist produces by controlling dynamics is outside the scope of this system.

Per-song training. Each song requires separate training. This limitation has been well known since RoboPianist, and HandelBot shares it. Generalizing a single policy to play a new song zero-shot is outside the scope of the paper.

Difficulty of the evaluation songs. Apart from Fur Elise, the five songs are relatively easy. Whether the pipeline holds up on truly demanding material (for example, fast trills or Liszt-style pieces with many octave jumps) has not been tested.

Hardware size mismatch. The fact that the Tesollo DG-5F is larger than a human hand is an honestly reported difficulty, but it also means that the system is tuned to this particular hand. Transferring to a smaller dexterous hand (for example, LEAP Hand or Allegro) would be separate work.

Not all baselines are equally strong. Expecting RL from scratch to learn within 30 minutes is a somewhat ungenerous comparison. More meaningful comparisons would be against (a) a simulation policy trained with more elaborate domain randomization and (b) other sim-to-real adaptation methods such as DAgger.

Implications: Transferability to Other Precision Dexterous Tasks

Connected to JungYeon's research areas (in-hand manipulation, sim-to-real transfer, Allegro Hand), the HandelBot recipe suggests the following takeaways.

Separate out systematic bias first. The first failures to appear when moving from simulation to the real world are usually deterministic and consistent: the system misses in the same direction every time. RL is overkill for this kind of error, and a heuristic correction on one or two degrees of freedom is faster and more stable. For HORA-style in-hand rotation as well, it is often observed that pre-correcting systematic offsets such as finger flexion gains or palm angle before learning a residual reduces the required budget.
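A quick way to check whether an error is “deterministic and consistent” is to compare its mean against its spread over repeated trials. The sketch below does that with a simple threshold; the function name `split_bias`, the `ratio` threshold of 2.0, and the millimeter values are hypothetical choices for illustration, not something the paper prescribes.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def split_bias(errors: Sequence[float], ratio: float = 2.0) -> Tuple[float, float, bool]:
    """Return (mean, stdev, is_systematic) for signed errors from repeated trials.

    The error is flagged as systematic when its mean dominates its spread,
    i.e. the system misses in roughly the same direction every time.
    """
    m = mean(errors)
    s = stdev(errors)
    return m, s, abs(m) > ratio * s


# Signed lateral errors in millimeters over five repeated presses (made-up numbers).
consistent = [2.1, 1.9, 2.0, 2.2, 1.8]    # always off to the same side
scattered = [0.5, -0.4, 0.1, -0.3, 0.2]   # no preferred direction

bias_m, bias_s, bias_flag = split_bias(consistent)
noise_m, noise_s, noise_flag = split_bias(scattered)
```

When the flag is set, subtracting the mean as a fixed offset is the cheap fix; what remains after that is the part worth handing to a learned residual.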

Do not be afraid of open-loop execution (for static tasks). Not every task requires a closed-loop policy. If the environment is static and the sequence is fixed, an open-loop trajectory plus residual feedback is precise enough. The pattern may be useful when solving a fixed tool-use sequence with an Allegro Hand (for example, a deterministic skill such as flipping a card).

Guided noise is worth transferring elsewhere. Guiding the sign of the exploration noise in residual RL with a prior is not specific to piano. If the sim-to-real gap has a known, consistent direction, biasing the noise toward that direction can by itself improve sample efficiency substantially. One can imagine taking the prior knowledge of “which direction to correct in” from JungYeon's friction modeling or PD gain analysis work and injecting it into the exploration distribution of residual RL.

The practicality of ManiSkill. HandelBot's use of ManiSkill should not be overlooked. SAPIEN/ManiSkill has a lower barrier to entry than IsaacGym/Lab and is well suited to quickly running short-horizon RL of roughly 100-300 steps per song. This also matches the direction JungYeon took in designing the Physical AI lecture around ManiSkill3.

Start from domains with clean rewards. One of the hardest parts of dexterous manipulation is defining the reward. Piano has a deterministic reward channel in MIDI. Tasks where the reward comes from the environment itself are a rare piece of luck, but tasks with a similar structure do exist (for example, keypad entry, switch operation, precise pedal control). Rather than attempting a difficult reward definition (combining touch, vision, and pose) from the start, it is reasonable to use such tasks, where the environment tells you the reward, as a stepping stone.

Summary and Conclusion

For piano playing, a precise bimanual dexterous task, HandelBot offers a clean prescription: simulation training plus 30 minutes of structured real-world adaptation. Direct sim-to-real deployment loses by a factor of 1.8. The clever part is how the 30 minutes that close this gap are spent: half goes to a hand-designed heuristic (lateral joint correction) and half to residual TD3 (with guided noise).

The biggest contribution of this work is not the result itself but the way the problem is decomposed. The message is clear: do not close the sim-to-real gap with RL all at once; separate the part that can be closed with deterministic correction from the part that must be closed with learning. This message carries beyond piano to other precision manipulation tasks.

The limitations are also stated honestly: open-loop dependence, the finger-to-key assignment assumption, the absence of tactile information, and per-song training. These are natural starting points for follow-up work, such as generalization across songs, tactile integration, and transfer to a variety of dexterous hands.

From a practical standpoint, the three details most worth taking away are (1) the simplicity of extracting a single best trajectory from the simulation policy and using it as an open-loop base, (2) the narrow but effective intervention of aligning only the lateral joints with a heuristic, and (3) the guided-noise technique of injecting a prior into the sign of the TD3 exploration noise. All three are lightweight improvements worth trying in other dexterous manipulation pipelines.

Finally, the larger question this paper raises is this: “Where else are there precision tasks in which the environment directly provides the RL reward signal?” Keypads, switch panels, pedals, button sequences, keyboard typing. There are in fact quite a few around us. The recipe HandelBot demonstrates has the potential to apply to all of them. Perhaps the next generation of dexterous task benchmarks will be a collection of everyday precision manipulations in which the environment itself is the reward channel.

References and Resources

  • Paper: Xie, A., Qi, H., Sadigh, D. (2026). HandelBot: Real-World Piano Playing via Fast Adaptation of Dexterous Robot Policies. arXiv:2603.12243.
  • Project page: https://amberxie88.github.io/handelbot/
  • Code: https://github.com/amberxie88/handelbot
  • Related: RoboPianist (Zakka et al., 2023), Towards Learning to Play Piano with Dexterous Hands and Touch (Xu et al., 2022), FurElise (2024)

Copyright 2026, JungYeon Lee