📃 MoDE-VLA Review

vla
teleop
dexterity
multi-modal
Towards Human-Like Manipulation through RL-Augmented Teleoperation and Mixture-of-Dexterous-Experts VLA
Published

March 17, 2026

  • Paper Link
  • Project Link
  1. 🤖 This paper proposes a unified framework that addresses the difficulties Vision-Language-Action (VLA) models face in data collection, multi-skill learning, and multimodal sensor fusion, enabling human-like bimanual dexterous manipulation.
  2. 🤝 The framework consists of the RL-trained In-hand Manipulation Copilot (IMCopilot), which assists teleoperation and also serves as a callable low-level primitive for the VLA, and the Mixture-of-Dexterous-Experts VLA (MoDE-VLA), which integrates force and tactile feedback into the VLA backbone through dedicated pathways and residual injection.
  3. 🍎 Experiments on four contact-rich tasks (gear assembly, charger plugging, tube rearrangement, and apple peeling) show that the proposed approach substantially improves success rates over the existing baseline.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

This paper proposes a unified framework for extending Vision-Language-Action (VLA) models to human-like high-DoF, bimanual, dexterous, contact-rich in-hand manipulation. Existing VLA models have mostly been limited to low-DoF end-effectors and simple, vision-driven pick-and-place tasks, and have struggled with high-dimensional data acquisition, multi-skill learning, and the fusion of heterogeneous sensor modalities.

To resolve these bottlenecks, the work introduces two core components.

  1. IMCopilot (In-hand Manipulation Copilot):

IMCopilot is a suite of atomic in-hand manipulation skills trained with reinforcement learning (RL). It plays two roles. First, during data collection it acts as a shared-autonomy assistant for the human operator: complex in-hand manipulation phases are delegated to IMCopilot, so high-quality demonstrations can be collected efficiently. Second, during autonomous execution it acts as a low-level execution primitive that the VLA model can invoke, which yields a hierarchical manipulation architecture. The skills are trained with Proximal Policy Optimization (PPO) in the IsaacLab simulation environment, using an asymmetric actor-critic architecture and teacher-student distillation. The observation $o_t$ consists of a 3-step history of proprioception, fingertip contact forces, and the target rotation axis. The policy outputs relative joint position offsets $\Delta\theta_t$, which are tracked by a low-level PD controller. Domain randomization is applied to enable zero-shot transfer to the real world. The reward function $r = \lambda_{rot}r_{rot} + \lambda_{vel}r_{vel} + \lambda_{work}r_{work} + \lambda_{torq}r_{torq} + \lambda_{diff}r_{diff}$ encourages angular velocity about the target axis ($r_{rot}$) while penalizing unnecessary linear velocity ($r_{vel}$), excessive joint work ($r_{work}$), torque ($r_{torq}$), and joint deviation ($r_{diff}$), which keeps task progress stable.

  2. MoDE-VLA (Mixture-of-Dexterous-Experts VLA):

This architecture integrates heterogeneous force and tactile modalities into a pretrained VLA backbone. MoDE-VLA addresses modality heterogeneity by giving force and tactile information dedicated processing pathways. The force signal $f \in \mathbb{R}^{d_f}$ comes from the joint torques of the robot arms and reflects arm-level contact forces. The tactile signal $g \in \mathbb{R}^{d_g}$ aggregates 6-DoF force/torque (wrench) measurements from the tactile sensors on the 10 fingertips and captures fingertip-level contact patterns. Each modality is projected into the PaliGemma embedding space by a learned linear layer ($z_f = W_f f + b_f$, $z_g = W_g g + b_g$). Each embedding is then replicated $H$ times, where $H$ is the length of the action prediction sequence, and a sinusoidal positional encoding is added, producing temporally indexed token sequences $\tilde{Z}_f, \tilde{Z}_g \in \mathbb{R}^{H \times d_{pali}}$.

The MoDE module takes three information streams: the backbone's contextual output, the current denoising state, and the force/tactile tokens. They are concatenated into a single sequence $Z_{in} = [Z_{prefix} \| Z_{suffix} \| \tilde{Z}_f \| \tilde{Z}_g]$ and passed through a self-attention layer. The processed force and tactile tokens then pass through a sparse Mixture-of-Experts (MoE) layer made of $E$ expert MLPs, using a top-k scatter routing mechanism. This lets different experts specialize in the qualitatively different regimes of contact-rich manipulation. The MoE layer outputs refined force tokens $Z'_f$ and tactile tokens $Z'_g$, which are injected into the backbone's action prediction as residual corrections through modality-specific projection heads. By design, the force correction mainly affects the arm actions and the tactile correction mainly affects the hand actions. This residual structure ensures that MoDE only refines the base VLA prediction, so the backbone's robust pretrained behavior is preserved when the modality signals are weak.

The study uses the SharpaNorth robot platform (including two 7-DoF robot arms and 22-DoF SharpaWave dexterous hands, 63 DoF in total) together with a data acquisition system made up of an upper-body exoskeleton, exoskeleton gloves, and a VR headset. In particular, the shared-autonomy mechanism that triggers IMCopilot through foot pedals made it possible to collect high-quality demonstrations for complex tasks such as apple peeling, which were nearly impossible with conventional teleoperation.

Experiments were run on four complex contact-rich tasks: apple peeling, tube rearranging, gear assembling, and charger plugging. The results show that MoDE-VLA outperforms the $\pi_0$ baseline. On the insertion tasks the success rate more than doubled, and IMCopilot enabled the in-hand rotation that is critical for apple peeling, playing a key role in reaching a Peel Completion Ratio (PCR) of 73%. The ablation study makes the importance of the force and tactile sensors and the contribution of IMCopilot clear: removing force sensing reduced the average success rate (SR) by 11%, removing tactile sensing reduced it by 8%, and removing IMCopilot dropped the apple-peeling PCR from 73% to 25%.

In conclusion, the paper combines IMCopilot and MoDE-VLA into a comprehensive hierarchical framework for high-DoF bimanual dexterous manipulation. The approach addresses the data acquisition bottleneck, handles complex multi-skill tasks, and fuses heterogeneous sensor modalities effectively, demonstrating that a robot can perform dexterous manipulation at a level approaching that of humans.

🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Why this paper: problem definition and research background

Consider the task of peeling an apple. The motion feels entirely natural to a human, yet it is a remarkable concert of multiple senses. The eyes check the position of the blade, the hand grips the apple just firmly enough not to crush it, and the skin senses slip so that finger angles can be adjusted in real time. All of this happens on a scale of tens of milliseconds.

Vision-Language-Action (VLA) models have achieved striking results in robot manipulation over the past few years. Models such as $\pi_0$, OpenVLA, and RoboFlamingo have shown that they can take language commands and pick up, sort, and place a variety of objects. However, the "hand" of these models has mostly been a 2-DoF parallel gripper. They have stayed at the level of simple pick-and-place, where binary control (open/close) is enough.

In this paper, Tutian Tang's team at Sharpa Robotics asks a fundamental question:

"Can VLA models be extended to human-level bimanual dexterous manipulation?"

The platform they work with is the SharpaNorth robot: 7 DoF per arm and 22 DoF per hand (SharpaWave), a high-dimensional system with 63 DoF in total. Three core bottlenecks appear here.

Three bottlenecks

Bottleneck 1: the difficulty of collecting high-quality data

Controlling a 63-DoF system by teleoperation imposes an extremely high cognitive load even on expert operators. A simple gripper system can be operated continuously for 30 minutes, but high-dimensional control of a multi-finger hand exhausts the operator within a few minutes. Data quality is also hard to guarantee. Tasks that need precise in-hand rotation, such as apple peeling, cannot be carried out by direct teleoperation at all.

Bottleneck 2: the difficulty of multi-skill learning

A complex task like apple peeling is hard to master with a single uniform policy. Vision-based approach, force-based cutting, and tactile-based rotation each demand different observation modalities and control strategies. In a 63-DoF action space, expecting one policy to master all of these phases is unrealistic because the search space becomes astronomically large.

Bottleneck 3: modality heterogeneity

Prior work has already reported that simply concatenating force and tactile data onto an existing VLA backbone degrades performance. Force and tactile signals differ in their temporal dynamics and in their physical semantics. Processing them all as one undifferentiated token stream contaminates the representations of the pretrained VLM backbone.

The paper's answer to these three bottlenecks is a pair of core components: IMCopilot and MoDE-VLA.


Methodology: a structure built on two pillars

System overview

flowchart TB
    subgraph DataCollection ["Data Collection Phase"]
        HO["Human Operator\n(Exoskeleton)"] -->|"Gross Arm Motions"| RB["SharpaNorth Robot\n(63 DoF)"]
        HO -->|"Foot Pedal Trigger"| IMP["IMCopilot\n(RL Primitives)"]
        IMP -->|"In-hand Rotation\nGrasp Maintenance"| RB
    end

    subgraph Autonomy ["Autonomous Execution Phase"]
        VLA["OpenPI-0 Backbone\n(Vision + Language + Proprioception)"] --> MoDE["MoDE Module\n(Force-Tactile Fusion)"]
        MoDE -->|"Residual Correction"| ArmAct["Arm Actions"]
        MoDE -->|"Option 1: Tactile Refined\nHand Actions"| HandAct["Hand Actions"]
        MoDE -->|"Option 2: Dispatch"| IMP2["IMCopilot\n(RL Low-level Primitive)"]
        IMP2 --> HandAct
    end

    DataCollection -->|"Demonstrations"| Train["VLA Fine-tuning"]
    Train --> Autonomy
Figure 1: The dual-role structure of IMCopilot and MoDE-VLA

The key insight of this structure is the dual role. IMCopilot acts as a shared-autonomy assistant during data collection and as a callable low-level primitive during autonomous execution. In other words, it is a single RL policy that plays a consistent role in both training and inference.


IMCopilot: RL-based atomic in-hand skills

Concept and structure

IMCopilot consists of a small number of atomic in-hand manipulation primitives. The paper mentions two core primitives:

  1. Stable grasp maintenance: holding the object securely even under external disturbances.
  2. In-hand rotation: rotating the object inside the hand while maintaining the grasp.

These skills are trained with deep reinforcement learning (deep RL). The core idea is to learn them in simulation with an explicitly designed reward and then transfer them sim-to-real onto the physical robot.
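To make the "explicitly designed reward" concrete, the weighted-sum form quoted in the Ping review, $r = \lambda_{rot}r_{rot} + \lambda_{vel}r_{vel} + \lambda_{work}r_{work} + \lambda_{torq}r_{torq} + \lambda_{diff}r_{diff}$, can be written as a short sketch. The weight values and per-step term values below are invented for illustration, and the sign convention (positive weight on the rotation term, negative weights on the penalty terms) is an assumption; the review does not report the actual coefficients.

```python
def rotation_reward(terms, weights):
    """Weighted sum r = sum_k lambda_k * r_k over the named reward terms."""
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical weights: reward rotation about the target axis, penalize the rest.
weights = {"rot": 1.0, "vel": -0.3, "work": -0.1, "torq": -0.1, "diff": -0.05}

# Hypothetical per-step term magnitudes.
terms = {"rot": 0.8, "vel": 0.2, "work": 1.0, "torq": 0.5, "diff": 0.4}

r = rotation_reward(terms, weights)
```

With these made-up numbers the rotation term dominates, and each penalty only trims the reward, which is the qualitative behavior the description implies: progress about the target axis is rewarded as long as it is not bought with drift, effort, or large joint deviations.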

Shared-autonomy teleoperation

The way the paper solves the data collection problem for in-hand manipulation is original. The conventional full-teleoperation paradigm requires the operator to control every DoF at the same time.

The paper's approach is as follows:

Operator (via exoskeleton): Gross arm motions (7 DoF x 2)
IMCopilot (via RL skill):   In-hand finger motions (22 DoF x 2)
Interface: Foot pedal trigger to delegate to IMCopilot

In the apple-peeling scenario, the operator only has to control the approach path of the arm and the direction in which the blade advances. The hard in-hand manipulation that rotates the apple to its next position is taken over by IMCopilot the moment the foot pedal is pressed. It is a cognitive division of labor: "delegate the hard part to the AI, and let the human hold the overall plan."

Why does this matter? When a human attempted the apple rotation by direct teleoperation, the rotation success rate in the experiments was low. This means the same structural difficulty shows up at the teleoperation level and, later, at the VLA policy level. IMCopilot is a single mechanism that resolves this shared difficulty in one place.


MoDE-VLA: an architecture that bridges sensory heterogeneity

MoDE-VLA works through three cooperating submodules: (1) the OpenPI-0 backbone, (2) the Mixture-of-Dexterous-Experts (MoDE) module, and (3) a hierarchical decision mechanism.

Base backbone: OpenPI-0

The paper uses OpenPI-0, the public release of $\pi_0$, as its VLA backbone. The model takes vision tokens, language tokens, proprioception tokens, and noisy action tokens as input and outputs an action chunk through flow matching. The core structure of $\pi_0$ is a language-model backbone plus a separate action expert, and MoDE is inserted into the action-expert part.

How the MoDE module works

The key idea of MoDE is to build a dedicated pathway for force and tactile data and inject its output as a residual correction. As an equation:

$$a_t^{\text{final}} = a_t^{\text{VLA}} + \Delta a_t^{\text{MoDE}}$$

Here $a_t^{\text{VLA}}$ is the base action produced by the VLA backbone, and $\Delta a_t^{\text{MoDE}}$ is the residual correction that the MoDE module computes from the force and tactile signals.
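The residual form can be illustrated with a toy action vector. This is a hedged sketch rather than the paper's code: the function name `inject_residuals`, the 2-arm-plus-3-hand layout, and the numbers are hypothetical, and the hard split (force residual on arm dimensions only, tactile residual on hand dimensions only) simplifies the "mainly affects" wording used in the Ping review.

```python
def inject_residuals(a_vla, delta_arm, delta_hand, n_arm):
    """Return a_final = a_vla + delta.

    The force residual is added to the first n_arm (arm) dimensions and the
    tactile residual to the remaining (hand) dimensions.
    """
    assert len(delta_arm) == n_arm
    assert len(delta_hand) == len(a_vla) - n_arm
    delta = list(delta_arm) + list(delta_hand)
    return [a + d for a, d in zip(a_vla, delta)]

# Toy action: 2 arm dimensions followed by 3 hand dimensions (hypothetical).
a_vla = [0.10, -0.20, 0.30, 0.00, 0.50]

# Small corrections computed from force (arm) and tactile (hand) signals.
a_final = inject_residuals(a_vla, [0.01, 0.02], [0.0, -0.05, 0.0], n_arm=2)

# With zero residuals the backbone's action passes through unchanged.
a_unchanged = inject_residuals(a_vla, [0.0, 0.0], [0.0, 0.0, 0.0], n_arm=2)
```

The second call shows why the additive form is conservative: when the contact pathway has nothing to say, the output is exactly the pretrained backbone's prediction.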

MoDE ๋ชจ๋“ˆ์˜ ๋‚ด๋ถ€ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

flowchart LR
    F["Force Sensor\n(6-axis F/T)"] --> FT["Force Tokens"]
    T["Tactile Sensor\n(SharpaWave\ninternal camera)"] --> TT["Tactile Tokens"]

    FT --> SA["Self-Attention\nwith Backbone\nRepresentations"]
    TT --> SA

    SA --> MoE["Sparse MoE Router\n(per-timestep\nexpert specialization)"]

    MoE -->|"Force Experts"| FC["Force Residual\n(Arm Action Correction)"]
    MoE -->|"Tactile Experts"| TC["Tactile Residual\n(Hand Action Correction)"]

    FC -->|"Add"| ARM["Final Arm Actions"]
    TC -->|"Option 1: Add"| HAND["Final Hand Actions"]
    TC -->|"Option 2: Trigger"| IMP["IMCopilot\nDispatch"]
    IMP --> HAND
Figure 2: Internal processing flow of the MoDE module

Concretely, the pipeline breaks down into three steps:

Step 1: Self-attention with backbone representations

The force and tactile tokens interact with the VLA backbone's representations through self-attention over the joint sequence. This lets the model work out what a given tactile signal means in the current vision-language context. A slip signal sensed while holding an apple and peeling it with a knife and a slip signal sensed while assembling a gear carry different meanings and call for different responses.

Step 2: Sparse MoE routing

After attention, the tokens pass through a sparse expert router. At each timestep the router dynamically selects the appropriate expert networks. For example, at the "contact onset" moment when the blade first touches the apple surface, a contact expert is activated, while a different expert handles the steady cutting phase.

MoE routing equation:

$$\mathbf{y}(\mathbf{x}) = \sum_{i \in \text{TopK}(\mathbf{G}(\mathbf{x}))} g_i(\mathbf{x}) \cdot \mathbf{E}_i(\mathbf{x})$$

Here $\mathbf{G}(\mathbf{x})$ is the gating network, $g_i(\mathbf{x})$ is the gating weight, and $\mathbf{E}_i(\mathbf{x})$ is the output of the $i$-th expert network.
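A minimal pure-Python sketch of this top-k routing rule follows. It assumes a linear gating network and renormalizes the gate weights with a softmax over only the selected experts; both are common choices, not something the review states about MoDE. The three toy "experts" are simple scaling functions standing in for expert MLPs.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def topk_moe(x, gate_W, experts, k):
    """y(x) = sum over the top-k experts of g_i(x) * E_i(x)."""
    # Gating logits G(x): one score per expert (linear gate, an assumption).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_W]
    # Indices of the k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Gate weights g_i, renormalized over the selected experts only.
    gates = softmax([logits[i] for i in top])
    y = [0.0] * len(x)
    for g, i in zip(gates, top):
        out = experts[i](x)
        y = [yj + g * oj for yj, oj in zip(y, out)]
    return y, top

# Three toy experts standing in for expert MLPs (hypothetical).
experts = [
    lambda x: [1.0 * v for v in x],
    lambda x: [2.0 * v for v in x],
    lambda x: [0.0 * v for v in x],
]
gate_W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [1.0, 2.0]

y_top1, chosen_top1 = topk_moe(x, gate_W, experts, k=1)
y_top2, chosen_top2 = topk_moe(x, gate_W, experts, k=2)
```

With `k=1` the token is handled entirely by the single best-scoring expert; with `k=2` the output is a gate-weighted blend of the two best experts, and the third expert is never evaluated, which is where the sparsity comes from.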

Step 3: Residual injection

The MoE output is injected into the base VLA action by addition. This is the paper's most important engineering choice. Adding a correction instead of replacing the output directly prevents the destruction of pretrained knowledge (catastrophic forgetting).

The approach has a good analogy in human motor control. When a skilled sculptor works with a new material, they add fine adjustments suited to the material's properties (feel, resistance) on top of the hand movements they have already mastered. They do not throw away the basic technique; they layer a context-appropriate correction over it.

Why simple concatenation (concat) fails

Why does performance degrade when force and tactile signals are simply concatenated onto the VLA input tokens?

The most intuitive explanation is distribution shock. VLA models are extensively pretrained on vision-language data. If physical signals are pushed straight into the model's input embedding space, strange tokens the model has never seen suddenly appear. It is like attaching an electric-shock detector to the hand of someone who has long worked with visual information alone: at first it gets in the way.

MoDE's residual injection resolves this neatly. The VLA backbone keeps generating vision-language-based actions as before, and MoDE computes and adds only "the part I need to adjust."

Hierarchical decision mechanism

At every timestep, the system chooses one of two options:

  • Option 1: hand actions are generated by the VLA plus the MoDE tactile residual (flow matching)
  • Option 2: IMCopilot generates the hand actions directly (RL policy)

In both options, arm actions are generated by the VLA plus the MoDE force residual. The switch to Option 2 is triggered when the VLA itself decides that IMCopilot should be invoked, for example when one round of peeling is finished and the apple must be rotated for the next round.
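The two-option logic can be sketched as a small dispatch function. Everything named here is hypothetical: the review does not say how the VLA signals that IMCopilot should be invoked, so a boolean flag `invoke_copilot` stands in for that decision, and `dummy_copilot` is a stand-in for the RL policy (which, per the Ping review, outputs relative joint offsets tracked by a PD controller).

```python
def hand_action(invoke_copilot, vla_hand, tactile_residual, copilot_policy, obs):
    """Option 2 if the copilot is invoked, otherwise Option 1."""
    if invoke_copilot:
        # Option 2: the RL primitive generates the hand action directly.
        return copilot_policy(obs)
    # Option 1: VLA hand action plus the MoDE tactile residual.
    return [a + d for a, d in zip(vla_hand, tactile_residual)]

def control_step(invoke_copilot, vla_arm, force_residual,
                 vla_hand, tactile_residual, copilot_policy, obs):
    """Arm actions always come from VLA + force residual; hand actions are dispatched."""
    arm = [a + d for a, d in zip(vla_arm, force_residual)]
    hand = hand_action(invoke_copilot, vla_hand, tactile_residual, copilot_policy, obs)
    return arm, hand

def dummy_copilot(obs):
    """Stand-in for the RL policy: a fixed offset per hand joint."""
    return [0.5 for _ in obs["hand_joints"]]

obs = {"hand_joints": [0.0, 0.0, 0.0]}
args = ([1.0, 2.0], [0.5, -0.5], [0.0, 0.25, 0.5], [0.25, 0.0, 0.0], dummy_copilot, obs)

arm_1, hand_1 = control_step(False, *args)   # Option 1
arm_2, hand_2 = control_step(True, *args)    # Option 2
```

In this sketch the arm branch is identical in both calls, so switching options changes only who controls the fingers.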

This hierarchical structure closely resembles theories of human motor control, in particular the hierarchical motor control model: the cerebral cortex decides goals and strategy, while low-level circuits in the cerebellum and spinal cord handle reflexes and fine muscular adjustment.


Hardware platform: SharpaNorth + SharpaWave

The hardware the paper chose says a lot about the character of this research.

| Component | Specification |
| Platform | SharpaNorth dual-arm robot |
| Arms | 7-DoF x 2 = 14 DoF |
| Hands | SharpaWave 22-DoF x 2 = 44 DoF |
| Total DoF | 63 DoF (the arms and hands account for 58 of these) |
| Tactile sensor | Internal camera that senses fingertip deformation (visuotactile) |
| Force sensor | 6-axis F/T sensor |
| Teleoperation | Exoskeleton + VR feedback |

The tactile sensor of the SharpaWave hand is interesting. Instead of attaching a separate pressure-sensor array on the outside, it embeds a small camera inside the fingertip and measures the elastic deformation of the finger pad optically. This is conceptually similar to visuotactile sensors in the DIGIT and GelSight family.


Experiments: a staircase of four tasks

Task setup

The paper validates the system on four tasks whose contact complexity increases step by step:

graph LR
    T1["Task 1\nGear Assembling\nsingle arm, vision+force"]
    T2["Task 2\nCharger Plugging\nsingle arm, precision insert"]
    T3["Task 3\nTest Tube Rearranging\nbimanual coordination"]
    T4["Task 4\nApple Peeling\nbimanual + in-hand rotation\n+ tactile feedback"]

    T1 -->|"complexity up"| T2 --> T3 --> T4

    style T4 fill:#e74c3c,color:#fff
Figure 3: The difficulty staircase of the four experimental tasks
  • Gear Assembling: gear assembly that requires precise positional alignment and force control. Single arm; force-sensor feedback is key.
  • Charger Plugging: connector insertion. Combines the VLA's visual perception with force feedback under tight tolerances.
  • Test Tube Rearranging: bimanual coordination. Both arms share the vision-language context while rearranging test tubes.
  • Apple Peeling: the hardest task. A repeated cycle of vision-based coarse approach → force-based cutting → tactile-based in-hand rotation. The paper claims this is the world's first autonomous bimanual apple peeling.

Results summary

The paper's main quantitative results are summarized below:

| Task | Baseline SR | MoDE-VLA SR | Relative improvement |
| Gear Assembling | ~17% | ~40% | +135% |
| Charger Plugging | ~20% | ~45% | +125% |
| Test Tube Rearranging | ~15% | ~30% | +100% |
| Apple Peeling (SR) | - | 30% | - |
| Apple Peeling (PCR) | ~25% | 73% | +192% |

Overall average success rate: 34% (more than twice the baseline)

SR: Success Rate (full task success)
PCR: Peel Completion Ratio (how much of a full round of peel is completed), a task-specific metric used only for apple peeling
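The improvement column of the table above is a relative gain over the baseline, not a difference in percentage points. A quick check, using the approximate values from the table and a hypothetical helper `relative_gain`, reproduces the reported figures:

```python
def relative_gain(baseline, ours):
    """Relative improvement over the baseline, in percent, rounded to an integer."""
    return round(100 * (ours - baseline) / baseline)

# (baseline, MoDE-VLA) pairs taken from the results table, in percent.
rows = {
    "Gear Assembling": (17, 40),
    "Charger Plugging": (20, 45),
    "Test Tube Rearranging": (15, 30),
    "Apple Peeling (PCR)": (25, 73),
}

gains = {task: relative_gain(base, ours) for task, (base, ours) in rows.items()}
point_gain_pcr = 73 - 25   # the same PCR change expressed in percentage points
```

For the PCR row, for example, the relative gain is 192% while the absolute change is 48 percentage points, so it matters which of the two a given sentence is quoting.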

What the PCR metric means

An SR of 30% on apple peeling may look low, but the PCR of 73% is a better indicator of what the system can actually do. The baseline model often starts the first peel stroke but fails to complete the loop because the apple slips or the rotation fails. MoDE-VLA succeeds in closing the peel loop by invoking IMCopilot's RL rotation expert at the right moment.

The most striking number is the drop in PCR to 25% when IMCopilot is removed. When the VLA has to generate all hand actions itself, without IMCopilot, PCR falls from 73% to 25%. This matches the observation that the rotation success rate is low under direct teleoperation. The same structural difficulty appears at both the teleoperation stage and the VLA policy stage, and IMCopilot is the mechanism that resolves this shared failure mode.

Ablation study summary

The paper's ablations answer the following questions:

  • Q1. Can the apple be rotated by pure teleoperation, without IMCopilot? Direct rotation is hard even for expert operators, and the success rate is very low.
  • Q2. What happens if force and tactile signals are simply concatenated, without the MoDE module? Performance degrades, most noticeably on high-contact tasks.
  • Q3. What happens with MoDE-VLA alone, without IMCopilot? PCR on apple peeling falls from 73% → 25%. Fusing vision, force, and touch is not enough on its own to handle in-hand rotation reliably.

Comparison with related work

From the perspective of data collection strategy

| Approach | Method | Limitations |
| Pure vision-based teleoperation | Track hand pose with cameras | Occlusion, depth uncertainty, no contact awareness |
| Glove-based (MANUS, etc.) | Direct mapping of hand kinematics | No contact-force information, finger correspondence mismatch |
| Exoskeleton-based | High-fidelity motion transfer | High cost, high cognitive load |
| IMCopilot (this paper) | Shared autonomy + foot-pedal delegation | Early stage, limited skill set |

The comparison with GR-Dexter (ByteDance Seed) is interesting. GR-Dexter teleoperates a 56-DoF bimanual system with MANUS gloves and a Meta Quest headset to train a VLA, and reports an impressive success rate of up to 0.97 on pick-and-place. However, those tasks are relatively simple compared with the apple peeling in this paper. What sets this paper apart is that the VLA does not merely perform high-dimensional imitation; it has a hierarchical execution structure that calls RL primitives.

From the perspective of VLA architecture

| Model | How contact sensing is integrated | Notes |
| pi0 / OpenPI-0 | Not included | Base backbone |
| ForceVLA | FVLMoE (4 experts, k=1) | Single arm, handles 6D F/T only |
| TA-VLA | Torque-signal integration | Improves contact manipulation |
| MoDE-VLA (this paper) | Dual force + tactile residual pathways, MoE | Bimanual, hierarchical integration with IMCopilot |

ForceVLA and MoDE-VLA share the concept of integrating force through an MoE, but MoDE-VLA tackles a more complex problem in that it has (1) dual pathways that also cover tactile sensing, (2) a hierarchical link to IMCopilot, and (3) a bimanual system.

The comparison with HACTS (Human-As-Copilot Teleoperation System) is also interesting. HACTS proposed a shared-autonomy structure in which a VLA copilot autonomously handles the fine motions of the hand while the human controls only the gross motions of the arm, which is conceptually similar to IMCopilot's data collection philosophy. The difference is that HACTS uses a VLA copilot, whereas IMCopilot uses an RL policy.


Critical discussion: strengths, limitations, and the future

Strengths

1. The elegance of a unified dual-role design

IMCopilot uses the same RL policy in training (data collection) and inference (autonomous execution), which keeps the system design consistent. This helps minimize the mismatch between the training distribution and the execution distribution.

2. The conservative safety of residual injection

Residual injection, which does not destroy pretrained knowledge, is sound both practically and theoretically. In an ecosystem where high-quality VLA foundation models keep multiplying, an architecture that extends an existing model with new sensory modalities in a "pluggable" way is highly valuable from a reusability standpoint.

3. The world's first autonomous bimanual apple peeling

Apple peeling is a benchmark-grade task that combines a repeated loop of continuous peel strokes and in-hand rotation, the non-uniform curvature of the apple surface, and complex contact dynamics between the blade and the apple.

4. The introduction of PCR as a fine-grained evaluation metric

A plain SR (success/failure) cannot easily measure partial success on a complex manipulation task. Proposing a fine-grained metric tailored to the task structure, as PCR is, is also a methodological contribution to the community.

Limitations and weaknesses

1. A limited IMCopilot skill set

IMCopilot currently has only a few atomic primitives: stable grasp maintenance and in-hand rotation. Real industrial settings need a far wider range of in-hand motions. Because each new skill requires its own RL training cycle, expanding the skill set could become a scalability bottleneck.

2. The 30% success rate on apple peeling

70% of attempts still fail. As the PCR of 73% shows, the system is capable at the level of individual stages, but combinatorial failures across the full sequence pull the SR down. This is the problem of compounding error propagation over long sequences, an inherent weakness of hierarchical policy structures.

3. No evaluation of the sim-to-real gap

The paper does not cover in detail how the RL skills (IMCopilot) get from simulation training to the physical robot. The sim-to-real gap is a major challenge for in-hand manipulation, which demands accurate contact dynamics, and the lack of transparent reporting here is a pity.

4. Dependence on proprietary hardware

The built-in visuotactile sensor of the SharpaWave hand is one of the core modalities of this work. Porting directly to other platforms (Allegro Hand, Shadow Hand, LEAP Hand, etc.) would require substantial redesign. There is a gap between the generality of the architecture and its dependence on the platform.

5. No analysis of real-time control latency

It is unclear how the inference latency of the flow-matching VLA meshes with the latency of IMCopilot's reactive low-level control. Contact events happen on a scale of tens of milliseconds, so the effect of delay in the decision to switch to Option 2 needs to be analyzed.

6. No evaluation of generalization

The experiments appear to be limited to a specific apple variety, knife shape, and gear specification. Generalization across diverse objects, shapes, and materials is not adequately evaluated.


Summary and conclusion

"Towards Human-Like Manipulation through RL-Augmented Teleoperation and Mixture-of-Dexterous-Experts VLA" takes on the extremely hard problem of high-dimensional bimanual dexterous manipulation head-on.

mindmap
  root((IMCopilot + MoDE-VLA))
    IMCopilot
      Dual Role
        Data Collection Copilot
        Autonomous Execution Primitive
      RL-trained Atomic Skills
        Stable Grasp
        In-hand Rotation
      Foot Pedal Interface
        Human arm control
        AI hand control
    MoDE-VLA
      OpenPI-0 Backbone
      Modality Pathway
        Force Tokens
        Tactile Tokens
        Self-Attention
        Sparse MoE Router
      Residual Injection
        Arm Force Correction
        Hand Tactile Correction
      Hierarchical Decision
        Option 1 VLA + MoDE
        Option 2 IMCopilot
    Results
      4 Tasks escalating
      34pct avg SR
      2x Baseline
      World First Apple Peeling
Figure 4: Summary of the paper's core contributions

What makes this paper especially valuable is how precisely it decomposes the problem into components. It defines three bottlenecks, designs a component for each one, and verifies each component's contribution through ablation.

At the same time, the paper opens up quite a few questions for the future. How can the IMCopilot skill set be expanded systematically? How can the long-horizon failures that hold SR at 30% despite a PCR of 73% be overcome? How should differences in tactile sensor modality be handled when porting to other platforms? How can we analyze, in terms of interpretability, which physical phase each of MoDE's MoE experts specializes in?

The paper's main message is clear: the future of VLAs lies in a hierarchical division of labor between high-level planning and reactive low-level control, and the key is to keep that division consistent from the training stage through the execution stage. This principle does not apply only to the 63-DoF SharpaNorth. It is a design philosophy applicable to contact-rich dexterous manipulation in general, from the 16-DoF Allegro Hand to more complex hand systems of the future.


References

  • Tutian Tang et al., Towards Human-Like Manipulation through RL-Augmented Teleoperation and Mixture-of-Dexterous-Experts VLA, arXiv:2603.08122, 2026.
  • Black et al., pi0: A Vision-Language-Action Flow Model for General Robot Control, 2024.
  • Shi et al., HACTS: a Human-As-Copilot Teleoperation System for Robot Learning, 2025.
  • Yin et al., ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation, NeurIPS 2025.
  • Wen et al., GR-Dexter Technical Report (ByteDance), 2025.
  • Qi et al., In-Hand Object Rotation via Rapid Motor Adaptation (HORA), CoRL 2022.

Copyright 2026, JungYeon Lee