Curieux.JY
  • JungYeon Lee


📃 Sharpa Fruit

teleop
tactile
vla
rl
Towards Human-Like Manipulation through RL-Augmented Teleoperation and Mixture-of-Dexterous-Experts VLA
Published

May 8, 2026

  • Paper Link
  1. 🤖 This paper proposes a unified framework for human-like bimanual dexterous manipulation. It targets the difficulties that Vision-Language-Action (VLA) models face in data collection, multi-skill learning, and multimodal sensor fusion.
  2. 🤝 The framework has two parts. The In-hand Manipulation Copilot (IMCopilot) is a set of RL-trained skills that assists teleoperation and also serves as a low-level primitive the VLA can call. The Mixture-of-Dexterous-Experts VLA (MoDE-VLA) integrates force and tactile feedback into the VLA backbone through a dedicated pathway and residual injection.
  3. 🍎 Experiments on four contact-rich tasks (gear assembly, charger plugging, tube rearrangement, and apple peeling) show that the approach substantially improves success rates over the baseline.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

This paper proposes a unified framework for extending Vision-Language-Action (VLA) models to high-DoF, bimanual, dexterous, contact-rich in-hand manipulation at a human-like level. Existing VLA models have been largely confined to low-DoF end-effectors and simple vision-based pick-and-place tasks, and they struggle with high-dimensional data acquisition, multi-skill learning, and fusion of heterogeneous sensor modalities.

To address these bottlenecks, the work introduces two core components.

  1. IMCopilot (In-hand Manipulation Copilot): a suite of atomic in-hand manipulation skills trained with reinforcement learning (RL). It plays two roles. First, during data collection it acts as a shared-autonomy assistant for the human operator: difficult in-hand manipulation phases are delegated to IMCopilot, so high-quality demonstrations can be collected efficiently. Second, during autonomous execution it serves as a low-level primitive that the VLA can call, which yields a hierarchical manipulation architecture. The skills are trained in the IsaacLab simulator with Proximal Policy Optimization (PPO), using an asymmetric actor-critic architecture and teacher-student distillation. The observation o_t consists of a 3-step history of proprioception, fingertip contact forces, and the target rotation axis. The policy outputs relative joint position offsets \Delta\theta_t, which a low-level PD controller tracks. Domain randomization is applied for zero-shot transfer to the real robot. The reward r = \lambda_{rot}r_{rot} + \lambda_{vel}r_{vel} + \lambda_{work}r_{work} + \lambda_{torq}r_{torq} + \lambda_{diff}r_{diff} encourages angular velocity about the target axis (r_{rot}) while penalizing unwanted linear velocity (r_{vel}), excessive joint work (r_{work}), torque (r_{torq}), and joint deviation (r_{diff}), which keeps the motion stable as the task progresses.

  2. MoDE-VLA (Mixture-of-Dexterous-Experts VLA): an architecture that integrates heterogeneous force and tactile modalities into a pretrained VLA backbone. It handles modality heterogeneity by giving force and tactile information a dedicated processing pathway. The force signal f \in \mathbb{R}^{d_f} comes from the joint torques of the robot arms and reflects arm-level contact forces. The tactile signal g \in \mathbb{R}^{d_g} aggregates 6-DoF wrench (force and torque) measurements from the tactile sensors on the 10 fingertips and captures fingertip-level contact patterns. Each modality is projected into the PaliGemma embedding space by a learned linear layer (z_f = W_f f + b_f, z_g = W_g g + b_g). Each embedding is replicated H times, where H is the length of the predicted action sequence, and a sinusoidal positional encoding is added, giving temporally indexed token sequences \tilde{Z}_f, \tilde{Z}_g \in \mathbb{R}^{H \times d_{pali}}. The MoDE module takes three streams: the contextual output of the backbone, the current denoising state, and the force/tactile tokens. They are concatenated into one sequence Z_{in} = [Z_{prefix} \| Z_{suffix} \| \tilde{Z}_f \| \tilde{Z}_g] and passed through a self-attention layer. The processed force and tactile tokens then go through a sparse Mixture-of-Experts (MoE) layer made of E expert MLPs with top-k scatter routing, so that different experts can specialize in the qualitatively different regimes of contact-rich manipulation.
The MoE layer outputs refined force tokens Z'_f and tactile tokens Z'_g, which are injected into the backbone's action prediction as residual corrections through modality-specific projection heads. By design, the force correction mainly affects the arm actions and the tactile correction mainly affects the hand actions. The residual structure restricts MoDE to refining the base VLA prediction, so the backbone's robust pretrained behavior is preserved when the force and tactile signals carry little information.

The work uses the SharpaNorth robot platform (two 7-DoF arms and two 22-DoF SharpaWave dexterous hands, 63 DoF in total) together with a data acquisition system consisting of an upper-body exoskeleton, exoskeleton gloves, and a VR headset. In particular, the shared-autonomy mechanism, in which foot pedals trigger IMCopilot, made it possible to collect high-quality demonstrations for complex tasks such as apple peeling that were close to impossible with conventional teleoperation.

Experiments cover four complex contact-rich tasks: apple peeling, tube rearranging, gear assembling, and charger plugging. MoDE-VLA outperforms the \pi_0 baseline, more than doubling the average success rate (15% to 34%), with gains on both insertion tasks. IMCopilot enables the in-hand rotation that apple peeling depends on and is central to reaching a PCR (Peel Completion Ratio) of 73%. The ablation study shows the importance of the force and tactile inputs and the contribution of IMCopilot: removing force lowers average SR by 11 percentage points, removing tactile lowers it by 8 points, and removing IMCopilot drops the apple-peeling PCR from 73% to 25%.

In short, the paper combines IMCopilot and MoDE-VLA into a hierarchical framework for high-DoF bimanual dexterous manipulation. The approach eases the data acquisition bottleneck, handles complex multi-skill tasks, and fuses heterogeneous sensor modalities, and the experiments show it moving robots toward human-like dexterous manipulation.

🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Introduction: Why we still cannot peel an apple with a VLA

Since VLA (Vision-Language-Action) models arrived in robot manipulation, moving a robot with a natural-language command such as "Pick the red block and place it on the blue plate" has become a familiar sight. Look closely at what these VLAs are actually good at, though, and almost all of it is picking something up with a 2-finger parallel gripper and putting it somewhere else. In other words, 'pick-and-place'.

There is a sizable trap here. Think about what human hands do. When we peel an apple we use both hands at once. One hand guides the blade by sight while regulating how hard it presses, and the other holds the apple and rotates it within the hand. When the fingertips feel the slightest slip, the grip tightens immediately. Vision, force, touch, and in-hand manipulation skill all cooperate at the same time, and they do so hierarchically.

The moment you try to lift a VLA into this territory, three bottlenecks appear together. This paper (arXiv:2603.08122v1) faces all three squarely and resolves them with two core modules.

| Bottleneck | What the problem is | This paper's remedy |
| --- | --- | --- |
| Data acquisition | A 63-DoF bimanual system is hard for a person to teleoperate directly | IMCopilot takes over the difficult in-hand manipulation during teleoperation |
| Multi-skill learning | A single policy struggles to master grasping, precise insertion, and in-hand rotation together | A hierarchy in which the VLA calls IMCopilot |
| Modality heterogeneity | Simply concatenating force/tactile into a pretrained VLA actually lowers performance | MoDE module + residual injection |

This post follows how the two modules, IMCopilot and MoDE-VLA, are designed, why the design is reasonable, and what the experiments tell us. For readers already familiar with IsaacLab, PPO, and pretrained VLAs, this is a good paper to appreciate for its insight into combining known pieces rather than for a new trick.

The whole system in one picture

To take in Figure 2 and Figure 3 of the paper at once, let us first draw the data flow and the inference-time decision flow separately.

flowchart TB
    subgraph DataCollection["Data Collection (training data path)"]
        Op[Human operator]
        Exo[Exoskeleton + VR + foot pedals]
        Robot[SharpaNorth: 2x 7-DoF arms + 2x 22-DoF hands]
        IM1[IMCopilot RL skills]
        Op --> Exo
        Exo -- arm + hand kinematics --> Robot
        Op -- pedal trigger --> IM1
        IM1 -- in-hand rotation only --> Robot
        Robot -- vision + force + tactile + actions --> Dataset[(Demonstration dataset)]
    end

    subgraph Inference["Autonomous inference (deployment path)"]
        Cam[Cameras: head x2, wrists x2]
        Lang[Language instruction]
        Prop[Proprioception]
        FT[Force + Tactile]
        Backbone[VLA backbone: SigLIP + PaliGemma + Action Expert]
        MoDE[MoDE module: self-attn + sparse MoE + residual]
        Decision{c > 0.5 ?}
        IM2[IMCopilot rotation skill]
        Action[Final action]

        Cam --> Backbone
        Lang --> Backbone
        Prop --> Backbone
        Backbone --> MoDE
        FT --> MoDE
        MoDE --> Decision
        Decision -- No --> Action
        Decision -- Yes (hand only) --> IM2
        IM2 --> Action
    end

    Dataset -. supervises .-> Backbone
    Dataset -. supervises .-> MoDE

The key point is that IMCopilot is reused in the same role on both sides: as a teleoperation copilot when the training data is collected, and as a low-level callable primitive for the VLA at inference. The data distribution and the inference-time behavior therefore line up naturally. This is more than an engineering detail. It is also a realistic admission: "do not expect a policy trained on human-made data to generalize to motions the humans could not perform."

Method 1: IMCopilot, a finger copilot shared by the human and the VLA

Why split off in-hand manipulation

Coordinating 22 DoF at once while rotating an apple exactly one turn inside the palm is, frankly, something even a skilled operator can barely do through teleoperation. Table I of the paper shows this quantitatively. For a small, slippery object like a ping-pong ball, human teleoperated in-hand rotation succeeds about 10% of the time. For an apple it is 27%. Data collection is breaking down at its very starting point.

The authors' diagnosis is clear. Humans are not good at this motion, so the assumption that it can be learned from human demonstrations does not hold. The alternative is to train it separately with RL in simulation and let the human "call it like a button". That is IMCopilot.

Skill composition and RL training

IMCopilot consists of two atomic skills.

  1. Stable grasp maintenance: holding the object stably under external disturbances
  2. In-hand object rotation: rotating the object within the hand about a specified axis

The training setup follows the standard recipe for in-hand manipulation RL: PPO on IsaacLab, an asymmetric actor-critic with teacher-student distillation, and domain randomization for sim-to-real. The teacher-student structure is the same pattern as the OpenAI/IRobot line of in-hand cube reorientation work.

  • Teacher (\mathbf{e}_t + \mathbf{o}_t): receives privileged information such as object pose, velocity, mass, and friction coefficient.
  • Student (\mathbf{o}_t only): receives only a 3-step proprioception history, fingertip contact forces, and the target rotation axis.

The action is a joint position offset \mathbf{a}_t = \Delta\theta_t, which is integrated into a joint target and tracked by a PD controller.

\mathbf{q}_t = \mathbf{q}_{t-1} + \lambda_{\text{scale}} \Delta\theta_t
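
The integration step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' controller: the scale value, the PD gains, and the helper names `integrate_target` and `pd_torque` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PDGains:
    kp: float = 3.0  # assumed proportional gain, not from the paper
    kd: float = 0.1  # assumed derivative gain, not from the paper


def integrate_target(q_prev, delta_theta, scale=0.1):
    """q_t = q_{t-1} + lambda_scale * delta_theta_t, applied per joint."""
    return [q + scale * d for q, d in zip(q_prev, delta_theta)]


def pd_torque(q_target, q_meas, qd_meas, gains):
    """PD law that tracks the position target with zero target velocity."""
    return [
        gains.kp * (qt - q) - gains.kd * qd
        for qt, q, qd in zip(q_target, q_meas, qd_meas)
    ]


# Two joints, one policy step: offsets are scaled and accumulated.
q_target = integrate_target([0.0, 0.5], [1.0, -1.0], scale=0.1)
tau = pd_torque(q_target, [0.0, 0.5], [0.0, 0.0], PDGains())
```

Because the policy emits offsets rather than absolute targets, a small scale keeps each step's change in the joint target bounded, which is one common reason for this parameterization.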

The reward is a weighted sum of five terms.

r = \lambda_{\text{rot}} r_{\text{rot}} + \lambda_{\text{vel}} r_{\text{vel}} + \lambda_{\text{work}} r_{\text{work}} + \lambda_{\text{torq}} r_{\text{torq}} + \lambda_{\text{diff}} r_{\text{diff}}

| Term | Meaning | Intuition |
| --- | --- | --- |
| r_{\text{rot}} | Angular velocity about the target axis | "Is it turning in the right direction?" |
| r_{\text{vel}} | Penalty on unwanted linear velocity | "Is the object escaping from the hand?" |
| r_{\text{work}} | Penalty on joint work | "Is effort being wasted?" |
| r_{\text{torq}} | Penalty on joint torque | "Are the joints being strained?" |
| r_{\text{diff}} | Penalty on deviation from the default pose | "Has the hand drifted into a strange posture?" |

The interesting point is that the reward balances goal achievement (rotation) against stability (energy and posture). Instead of the sparse "did it reach the goal pose" reward common in earlier in-hand reorientation work, it uses a dense angular-velocity reward that pays for the act of turning steadily. That choice fits a task like apple peeling, where the rotation repeats periodically.
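
A small Python sketch makes the weighted-sum structure concrete. The five term names follow the text, but the exact form of each term (dot product, norms, squared sums) and all weight values are assumptions for illustration; the paper's definitions may differ.

```python
import math

# Hypothetical weights; the source does not give numeric values here.
WEIGHTS = {"rot": 1.0, "vel": 0.3, "work": 0.1, "torq": 0.1, "diff": 0.05}


def reward_terms(omega, axis, lin_vel, torques, joint_vel, q, q_default):
    """Compute the five terms; penalties are returned as non-positive numbers."""
    r_rot = sum(w * a for w, a in zip(omega, axis))  # spin about target axis
    r_vel = -math.sqrt(sum(v * v for v in lin_vel))  # object drifting away
    r_work = -sum(abs(t * qd) for t, qd in zip(torques, joint_vel))
    r_torq = -sum(t * t for t in torques)
    r_diff = -sum((a - b) ** 2 for a, b in zip(q, q_default))
    return {"rot": r_rot, "vel": r_vel, "work": r_work,
            "torq": r_torq, "diff": r_diff}


def total_reward(terms, weights=WEIGHTS):
    """Weighted sum r = sum_k lambda_k * r_k."""
    return sum(weights[name] * value for name, value in terms.items())


steady = reward_terms([0, 0, 2.0], [0, 0, 1.0], [0, 0, 0],
                      [0, 0], [0, 0], [0.1, 0.2], [0.1, 0.2])
drifting = reward_terms([0, 0, 2.0], [0, 0, 1.0], [0.3, 0.4, 0],
                        [0, 0], [0, 0], [0.1, 0.2], [0.1, 0.2])
```

With the same spin, the drifting case scores lower than the steady one, which is the trade-off the table describes.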

Dual role: same policy, two callers

Once trained, IMCopilot works identically in both phases.

flowchart LR
    subgraph DataPhase["Data collection"]
        H[Human]
        Pedal[Foot pedal]
        H -- press --> Pedal --> IM[IMCopilot policy]
    end
    subgraph InferPhase["Autonomous inference"]
        VLA[VLA action head]
        Trigger[Scalar c in 0..1]
        VLA -- predicts --> Trigger
        Trigger -- c > 0.5 --> IM
    end
    IM --> Hand[Hand joint commands]

In the training data the hand motion comes from two sources (human demonstration and IMCopilot output), so the VLA cannot simply imitate finger trajectories. It has to learn when to call IMCopilot. To that end a trigger scalar c \in [0, 1] is added to the action vector, and when c > 0.5 the hand action is overwritten with the IMCopilot output. It is a kind of soft mode-switch.

The design is clever in two ways. First, it allows consistent mode switching within an action chunk. Second, during IMCopilot phases the hand action is exactly what IMCopilot produced in the demonstrations, so the VLA does not have to learn the fine hand trajectory. The hardest part, regression of high-dimensional finger coordinates, is effectively outsourced to an RL specialist.
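
The switch itself is simple enough to sketch in Python. The threshold of 0.5 comes from the text; the function names and the stub policy are hypothetical stand-ins, and the real system's interfaces are not described in the source.

```python
TRIGGER_THRESHOLD = 0.5  # from the text: c > 0.5 hands control to IMCopilot


def compose_action(vla_arm, vla_hand, trigger_c, imcopilot_policy, hand_obs):
    """Return (arm, hand, source); the arm command always comes from the VLA."""
    if trigger_c > TRIGGER_THRESHOLD:
        return vla_arm, imcopilot_policy(hand_obs), "imcopilot"
    return vla_arm, vla_hand, "vla"


def stub_rotation_policy(hand_obs):
    """Stand-in for the RL rotation skill: a fixed small offset per joint."""
    return [0.01 for _ in hand_obs]


handoff = compose_action([1.0], [0.2, 0.2], 0.9, stub_rotation_policy, [0, 0])
keep = compose_action([1.0], [0.2, 0.2], 0.5, stub_rotation_policy, [0, 0])
```

Note that c = 0.5 exactly does not trigger the handoff, since the condition is a strict inequality.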

Method 2: MoDE-VLA, contact awareness without breaking pretrained knowledge

Why not just concatenate

This is the paper's real technical contribution. The simplest way to feed force and tactile signals into a pretrained VLA is to append them to the proprioception vector. Prior work such as ForceVLA and RDP has already reported that naive concatenation can hurt performance. The reasons fall into two groups.

  1. The physical meaning differs. Arm joint torque (7-DoF × 2) is a macroscopic wrench, while the fingertip 6-DoF wrench (5 × 6 × 2) is a fine contact pattern. Treating them uniformly in the same token space dilutes the learning signal.
  2. The time scale differs. Vision and language tokens change relatively slowly, but contact signals change abruptly on a millisecond scale. Mixed in a single attention pool, they get buried under the gradient of the dominant modality.

The authors' prescription comes down to three design principles.

  1. dedicated pathway: force/tactile are processed on a path separate from the backbone
  2. modality-aware routing: a sparse MoE lets experts specialize per token
  3. residual injection: the result is added on top of the backbone output as a residual, preserving existing knowledge

Backbone: \pi_0 flow-matching VLA

The base model is \pi_0 from Physical Intelligence. Its components are as follows.

| Module | Role | Size |
| --- | --- | --- |
| SigLIP (So400m/14) | vision tokenizer | - |
| PaliGemma (Gemma-3B) | vision-language transformer | 3B |
| Action Expert (Gemma-300M) | flow-matching action head | 300M |

The training objective is the flow matching loss. Readers who know \pi_0 will find it familiar, but the intuition is this. Linearly interpolate between the clean action \mathbf{x}_0 and Gaussian noise \boldsymbol{\epsilon} with time t to get \mathbf{x}_t = t \cdot \boldsymbol{\epsilon} + (1-t) \cdot \mathbf{x}_0, then regress the velocity field \mathbf{v}_\theta(\mathbf{x}_t, t) at that point onto the constant velocity of this path, \boldsymbol{\epsilon} - \mathbf{x}_0. With this parameterization t = 0 is the clean action and t = 1 is pure noise, so the regression target points from clean toward noise, and sampling follows the field in reverse.

\mathcal{L}_{\text{FM}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \| \mathbf{v}_\theta(\mathbf{x}_t, t) - (\boldsymbol{\epsilon} - \mathbf{x}_0) \|^2 \right]

At inference, start from noise at t = 1 and integrate with the Euler method for N=10 steps down to t = 0 to obtain an action chunk. The key takeaway is that the model's job is to predict \mathbf{v}_\theta. MoDE's residual injection is a correction added to exactly this velocity-field prediction, a point we return to later.
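
The loss and the sampler can be checked on a toy problem with NumPy. This is a sketch under stated assumptions: `oracle_velocity` is a hypothetical stand-in for a perfectly trained \mathbf{v}_\theta on a single known target, and the integration direction (t from 1 down to 0) follows from the interpolation convention in the text.

```python
import numpy as np


def interpolate(x0, eps, t):
    """x_t = t * eps + (1 - t) * x0."""
    return t * eps + (1.0 - t) * x0


def fm_target(x0, eps):
    """Velocity of the straight path, d x_t / d t = eps - x0."""
    return eps - x0


def fm_loss(v_pred, x0, eps):
    """Squared error between predicted and target velocity, averaged over rows."""
    return float(np.mean(np.sum((v_pred - fm_target(x0, eps)) ** 2, axis=-1)))


def euler_sample(velocity_fn, eps, num_steps=10):
    """Integrate from t=1 (noise) down to t=0 (clean) with fixed Euler steps."""
    x, dt = eps.copy(), 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_fn(x, t)
    return x


rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 3))   # toy "action chunk": 4 steps, 3 dims
eps = rng.normal(size=(4, 3))  # Gaussian noise sample


def oracle_velocity(x, t):
    """Exact velocity for this one (x0, eps) pair: (x_t - x0) / t = eps - x0."""
    return (x - x0) / t


sample = euler_sample(oracle_velocity, eps, num_steps=10)
```

With the exact velocity the path is a straight line, so ten Euler steps land on the clean action up to floating-point error. A learned field is only approximately straight, which is why the step count matters in practice.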

The action vector itself is split into three parts.

\mathbf{a} = [\mathbf{a}_{\text{arm}};\; \mathbf{a}_{\text{hand}};\; \mathbf{a}_{\text{other}}]

Here \mathbf{a}_{\text{other}} contains the waist motion and the IMCopilot trigger c.

Force/Tactile tokens: unrolling along the time axis

The raw signal dimensions are as follows.

  • Force \mathbf{f} \in \mathbb{R}^{14}: joint torques of both arms (7 × 2)
  • Tactile \mathbf{g} \in \mathbb{R}^{60}: fingertip 6-DoF wrenches of both hands (5 × 6 × 2)

Each is linearly projected to the PaliGemma embedding dimension d_{\text{pali}}, replicated H times to match the action horizon, and given a sinusoidal positional encoding.

\tilde{\mathbf{z}}_f^{(h)} = \mathbf{z}_f + \text{PE}_{\text{sin}}(h), \quad \tilde{\mathbf{z}}_g^{(h)} = \mathbf{z}_g + \text{PE}_{\text{sin}}(h), \quad h = 1, \ldots, H

There is a small but important insight here. A single force/tactile frame from the current time step is copied across the H future steps to form the token sequence. Why? Because the downstream MoE router can then route each position of the horizon to a different expert. The structure leaves room for, say, a contact-onset expert to fire at h=1 and a steady-state force-tracking expert at h=H.
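
The replicate-then-encode step looks like this in NumPy. The signal sizes (14 and 60) come from the text; the toy values of H and d_pali, the random weights, and the use of the standard Transformer sine/cosine formula for the positional encoding are assumptions, since the source only says "sinusoidal".

```python
import numpy as np

H, D_PALI = 8, 16  # toy sizes; the real horizon and d_pali are not assumed here


def sinusoidal_pe(horizon, dim):
    """Standard sine/cosine positional encoding for steps h = 1..horizon."""
    pos = np.arange(1, horizon + 1)[:, None]
    freq = 1.0 / (10000.0 ** (2 * np.arange(dim // 2)[None, :] / dim))
    pe = np.zeros((horizon, dim))
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe


def sensor_tokens(x, W, b, horizon):
    """Project one sensor frame, copy it to every step, add PE: returns [H, d]."""
    z = W @ x + b
    return z[None, :] + sinusoidal_pe(horizon, z.shape[0])


rng = np.random.default_rng(0)
f = rng.normal(size=14)  # arm joint torques, 7 x 2
g = rng.normal(size=60)  # fingertip wrenches, 5 x 6 x 2
W_f, b_f = rng.normal(size=(D_PALI, 14)), np.zeros(D_PALI)
W_g, b_g = rng.normal(size=(D_PALI, 60)), np.zeros(D_PALI)
force_tokens = sensor_tokens(f, W_f, b_f, H)
tactile_tokens = sensor_tokens(g, W_g, b_g, H)
```

The H tokens of one modality share the same content and differ only in their positional encoding, so any per-step difference in routing has to come from the position signal and the attention context.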

MoDE module: self-attention followed by sparse MoE routing

The full token sequence is a concatenation of four parts.

\mathbf{Z}_{\text{in}} = [\mathbf{Z}_{\text{prefix}} \;\|\; \mathbf{Z}_{\text{suffix}} \;\|\; \tilde{\mathbf{Z}}_f \;\|\; \tilde{\mathbf{Z}}_g] \in \mathbb{R}^{(S_p + 3H) \times d_{\text{pali}}}

| Token group | What it is | Length |
| --- | --- | --- |
| \mathbf{Z}_{\text{prefix}} | PaliGemma output (vision + language + state) | S_p |
| \mathbf{Z}_{\text{suffix}} | noisy action tokens of the action expert | H |
| \tilde{\mathbf{Z}}_f | force tokens | H |
| \tilde{\mathbf{Z}}_g | tactile tokens | H |
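
A quick shape check of the concatenation in NumPy, with toy sizes and constant placeholder values chosen only so the slices are easy to verify. The slice names are hypothetical helpers for the example.

```python
import numpy as np

S_P, H, D = 5, 4, 8  # toy sizes, chosen only for the shape check

z_prefix = np.zeros((S_P, D))      # PaliGemma output tokens
z_suffix = np.ones((H, D))         # noisy action tokens from the action expert
z_force = np.full((H, D), 2.0)     # force tokens
z_tactile = np.full((H, D), 3.0)   # tactile tokens

# Z_in has S_p + 3H rows, matching the length column of the table.
z_in = np.concatenate([z_prefix, z_suffix, z_force, z_tactile], axis=0)

# The sensor tokens occupy the last 2H rows and can be sliced back out.
force_slice = slice(S_P + H, S_P + 2 * H)
tactile_slice = slice(S_P + 2 * H, S_P + 3 * H)
```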

After one self-attention layer, the force/tactile tokens have attended to the visual, language, and denoising context. They are then routed through a token-level top-k sparse MoE (E=8, k=1).

for each token z in [Zf_tokens, Zg_tokens]:
    gate_logits = router(z)           # shape: [E]
    expert_id   = argmax(gate_logits)  # top-1
    z_out       = expert[expert_id](z)

Why a sparse MoE instead of a single shared MLP? Contact-rich manipulation mixes qualitatively different regimes.

  • free-space reaching
  • initial contact
  • stable grip maintenance
  • dynamic in-hand rotation

Each regime has a different force-to-action mapping. Sparse routing lets experts specialize by regime and by joint group without increasing the compute per token. In a word, it is dynamic capacity allocation that is both modality-aware and phase-aware.
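
A runnable NumPy version of top-1 routing with E=8 experts is sketched below. The class name `Top1MoE`, the two-layer ReLU expert MLPs, the hidden size, and the random initialization are assumptions for illustration. Like the pseudocode above, it omits gate-weight scaling and any load-balancing loss.

```python
import numpy as np


class Top1MoE:
    """Token-level sparse MoE: each token is processed by exactly one expert."""

    def __init__(self, dim, hidden, num_experts=8, seed=0):
        rng = np.random.default_rng(seed)
        self.router = rng.normal(0.0, 0.02, (dim, num_experts))
        self.w1 = rng.normal(0.0, 0.02, (num_experts, dim, hidden))
        self.w2 = rng.normal(0.0, 0.02, (num_experts, hidden, dim))

    def __call__(self, tokens):
        logits = tokens @ self.router        # [T, E] gate logits
        expert_ids = logits.argmax(axis=-1)  # [T] top-1 expert per token
        out = np.zeros_like(tokens)
        for e in np.unique(expert_ids):      # scatter: group tokens by expert
            mask = expert_ids == e
            hidden = np.maximum(tokens[mask] @ self.w1[e], 0.0)  # ReLU MLP
            out[mask] = hidden @ self.w2[e]
        return out, expert_ids


rng = np.random.default_rng(1)
tokens = rng.normal(size=(6, 8))  # e.g. a few force/tactile tokens, d = 8
moe = Top1MoE(dim=8, hidden=16)
out, expert_ids = moe(tokens)
```

Each token passes through one expert MLP regardless of E, so adding experts grows the parameter count without growing per-token compute.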

Residual injection: the trick that keeps pretrained knowledge intact

After the MoE, the force tokens \mathbf{Z}_f and tactile tokens \mathbf{Z}_g \in \mathbb{R}^{H \times d_{\text{pali}}} are added to the backbone's suffix output \mathbf{Z}_{\text{suffix}} and passed through modality-specific projection heads to form the velocity-field prediction.

\mathbf{v}_\theta(\mathbf{x}_t, t) = [W_1(\mathbf{Z}_f + \mathbf{Z}_{\text{suffix}}) \;\|\; W_2(\mathbf{Z}_g + \mathbf{Z}_{\text{suffix}})]

There are two key design choices here.

Key 1: the residual form

\mathbf{Z}_f and \mathbf{Z}_g are added to the backbone output. When the force/tactile signals carry almost no information, as in free-space motion, the MoDE output can stay close to zero. The correction therefore kicks in only when the signal is meaningful, and otherwise the pretrained behavior is left as it is. This is the same spirit as the way LoRA or adapter tuning protects a base model.

Key 2: modality-separated routing

W_1 handles the arm action and W_2 handles the hand action. Arm-level torque (force) affects the arm action and fingertip wrench (tactile) affects the hand action, so the paths are separated along physical meaning. The separation is not just an inductive bias. It is an explicit design decision: do not let arm torque information contaminate finger control.
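
Both properties can be checked numerically with NumPy. The sketch implements only the arm and hand blocks of the velocity equation above; the toy sizes, the action dimensions (14 and 44), and the random weights are assumptions for illustration.

```python
import numpy as np

H, D, ARM_DIM, HAND_DIM = 4, 8, 14, 44  # toy sizes; action dims are assumptions


def velocity_with_residuals(z_suffix, z_force, z_tactile, W_arm, W_hand):
    """v = [W_1(Z_f + Z_suffix) || W_2(Z_g + Z_suffix)], one row per step."""
    v_arm = (z_force + z_suffix) @ W_arm       # force residual feeds the arm head
    v_hand = (z_tactile + z_suffix) @ W_hand   # tactile residual feeds the hand head
    return np.concatenate([v_arm, v_hand], axis=-1)


rng = np.random.default_rng(0)
z_suffix = rng.normal(size=(H, D))
W_arm = rng.normal(size=(D, ARM_DIM))
W_hand = rng.normal(size=(D, HAND_DIM))
zeros = np.zeros((H, D))

# With zero residuals the output is the backbone-only prediction.
backbone_only = velocity_with_residuals(z_suffix, zeros, zeros, W_arm, W_hand)
# A force residual changes only the arm columns.
force_bump = velocity_with_residuals(
    z_suffix, 0.1 * np.ones((H, D)), zeros, W_arm, W_hand
)
```

The hand columns are identical in the two calls, which is the "no cross-contamination" property in numerical form.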

Hierarchical decision between the two options

Finally, let us restate the two branches of the inference-time decision.

| Option | Condition | Source of hand action | Source of arm action |
| --- | --- | --- | --- |
| Option 1 | c \le 0.5 | VLA + tactile residual | VLA + force residual |
| Option 2 | c > 0.5 | IMCopilot (direct control) | VLA + force residual |

So the arm always stays with the VLA, and the hand is handed over to the RL specialist when the situation calls for it. It is intuitively similar to the division of labor in human motor control, where the cortex makes the coarse reaching plan and the spinal cord and cerebellum handle fine finger reflexes.

Experiments: what was demonstrated

Evaluation tasks

The paper uses four tasks with progressively higher contact complexity.

  1. Gear Assembling (one arm): fit three gears onto shafts in sequence. Regulating insertion force is the key.
  2. Charger Plugging (one arm): plug a charger into a power strip. Precise control over the last few millimeters.
  3. Tube Rearranging (two arms): pick up a test tube with one hand, pass it to the other hand, and insert it again. Bimanual coordination.
  4. Apple Peeling (two arms): peel one strip of apple skin. Needs vision, force, tactile, and in-hand rotation together.

There are two evaluation metrics.

  • SR (Success Rate): success rate over the whole task
  • PCR (Peel Completion Ratio): Apple Peeling only. The percentage of the target surface that was peeled, discretized in 25% steps.

Each task is evaluated over 20 trials.
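
Both metrics are easy to compute in Python. The source does not say how the 25% discretization rounds, so rounding down is an assumption here, as are the function names and the toy trial data.

```python
import math


def success_rate(outcomes):
    """Fraction of trials that succeeded; outcomes is a list of booleans."""
    return sum(outcomes) / len(outcomes)


def discretize_pcr(fraction, step=0.25):
    """Round a peeled fraction in [0, 1] down to a multiple of step (assumed rule)."""
    return min(1.0, math.floor(fraction / step + 1e-9) * step)


def mean_pcr(fractions, step=0.25):
    """Average the discretized peel completion over trials."""
    return sum(discretize_pcr(f, step) for f in fractions) / len(fractions)


# Toy data, not from the paper: 6 successes in 20 trials, 4 peel fractions.
toy_sr = success_rate([True] * 6 + [False] * 14)
toy_pcr = mean_pcr([1.0, 0.8, 0.3, 0.0])
```

A trial that peels 80% of the strip counts as a failure for SR but still contributes 0.75 to PCR, which is how PCR exposes partial progress.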

Q1: Practical improvement in data collection

Effect of force/tactile VR feedback: on Gear Assembling, 100 trials took 75 minutes with 85 successes without feedback, and 65 minutes with 93 successes with feedback. The difference looks small, but the more fundamental point is that lowering the operator's cognitive load also reduces the variance in demonstration quality.
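
One way to read these numbers is as successful demonstrations per minute. The figures below come from the text, but framing them as throughput is an interpretation made for this sketch, not a metric the paper is stated to report.

```python
def per_minute(successes, minutes):
    """Successful demonstrations collected per minute of operator time."""
    return successes / minutes


without_fb = per_minute(85, 75)     # no force/tactile VR feedback
with_fb = per_minute(93, 65)        # with feedback
speedup = with_fb / without_fb      # relative throughput gain
```

That works out to about 1.13 versus 1.43 successful demonstrations per minute, a gain of roughly 26%.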

The real value of IMCopilot shows in Table I.

| Object | Teleoperation SR | IMCopilot SR |
| --- | --- | --- |
| Ping-pong ball | 10% | 83% |
| Tennis ball | 67% | 93% |
| Apple | 27% | 90% |
| Overall | 34% | 89% |

The gap is largest for the ping-pong ball, the small and slippery object (10% → 83%), and almost as large for the apple. The apple's 27% → 90% is not just an improvement. It crosses the line between "collectable" and "not collectable". For a task like Apple Peeling, meaningful demonstrations cannot even be gathered without IMCopilot.

Q2: Policy performance of MoDE-VLA

| Method | Apple SR | Apple PCR | Tube | Gear | Charger | Avg SR |
| --- | --- | --- | --- | --- | --- | --- |
| \pi_0 baseline | 0% | 8% | 15% | 40% | 5% | 15% |
| MoDE-VLA (Ours) | 30% | 73% | 30% | 60% | 15% | 34% |

Average SR more than doubles, from 15% to 34%. Some interesting observations:

  • Gear Assembling +20 points, Charger Plugging +10 points: force is decisive in single-arm insertion. Detecting contact onset over the last few millimeters is hard with vision alone.
  • On Apple Peeling the baseline scores 0% SR / 8% PCR: it starts peeling but cannot complete one turn, because the apple slips or the rotation falls short. MoDE-VLA calls the IMCopilot rotation expert at the right moments, which makes closed-loop ring completion possible.

The PCR metric deserves a note. By plain SR the change is 0% → 30%, but by PCR it is 8% → 73%. Without quantifying partial progress, the baseline is stuck with the binary conclusion "cannot do it at all", and you cannot see how far it gets or where it breaks down. The lesson is that this kind of sub-metric matters more the more cyclic the task is.
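
The averages and per-task gains can be reproduced from the table in Python. Only the numbers come from the source; the dictionary layout and variable names are illustrative.

```python
from statistics import mean

# Per-task success rates in percent, copied from the Q2 table.
baseline_sr = {"apple": 0, "tube": 15, "gear": 40, "charger": 5}
mode_vla_sr = {"apple": 30, "tube": 30, "gear": 60, "charger": 15}

avg_baseline = mean(list(baseline_sr.values()))   # 15
avg_mode_vla = mean(list(mode_vla_sr.values()))   # 33.75, reported as 34%
gain = {task: mode_vla_sr[task] - baseline_sr[task] for task in baseline_sr}
ratio = avg_mode_vla / avg_baseline               # 2.25
```

The unrounded average is 33.75%, so "more than doubles" holds with a ratio of 2.25. The per-task gains are in percentage points.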

Q3: Ablation, or which component is responsible for what

| Variant | Avg SR | Change |
| --- | --- | --- |
| Full MoDE-VLA | 34% | - |
| w/o Force | 23% | −11 points |
| w/o Tactile | 26% | −8 points |
| w/o IMCopilot (Apple Peeling only) | PCR 25% | PCR −48 points |

The interpretation is as follows.

Removing force (−11 points): the largest single-component loss in average SR, because the primary signal for contact onset detection in the insertion tasks disappears. Interestingly, Apple Peeling also showed more of a failure mode in which the knife never touches the apple and the robot only mimes peeling in mid-air. This is direct evidence that vision alone is not enough to estimate whether contact has occurred.

Removing tactile (−8 points): slip increases, mainly in grasp-intensive phases. Cues about fingertip deformation and contact state are not captured by a wrist F/T sensor or by RGB. Compared with ForceVLA, which uses only wrist F/T, this supports the paper's choice of treating fingertip tactile sensing as a separate modality.

It is also interesting that removing tactile has little effect on Apple Peeling SR/PCR. The authors' explanation: the knife is fixed in a power grasp, and the apple hand is handled directly by IMCopilot, which already takes tactile input itself. IMCopilot absorbs the tactile feedback internally, so tactile becomes less decisive for the VLA above it. A welcome side effect of the hierarchical division of labor.

Removing IMCopilot (Apple Peeling PCR 73% → 25%): this is what happens when the VLA is asked to imitate the hand trajectories from the IMCopilot demonstrations directly. PCR collapses to about one third. The reason is clear. A 22-DoF finger trajectory cannot be reproduced reliably by imitation learning alone. After one cut, at the moment the policy tries to rotate, the apple drops or no rotation happens. The result quantitatively supports the conclusion that in-hand rotation belongs to a task-specific RL specialist.

๋น„ํŒ์  ๊ณ ์ฐฐ

๊ฐ•์ 

1. ์œ„๊ณ„์  ๋ถ„์—…์˜ ๊น”๋”ํ•œ ๊ตฌํ˜„. โ€œVLA๊ฐ€ plan์„ ์งœ๊ณ , RL specialist๊ฐ€ reactive skill์„ ๋‹ด๋‹นํ•œ๋‹คโ€๋Š” ๋ถ„์—…์€ ์ข…์ข… ์ถ”์ƒ์ ์œผ๋กœ๋งŒ ํšŒ์ž๋˜์—ˆ์ง€๋งŒ, ์ด ๋…ผ๋ฌธ์€ ๊ทธ๊ฒƒ์„ (a) ๋™์ผ specialist๋ฅผ ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘๊ณผ ์ถ”๋ก  ์–‘์ชฝ์—์„œ ์žฌ์‚ฌ์šฉ, (b) action ๋ฒกํ„ฐ์— trigger ์Šค์นผ๋ผ๋ฅผ ๋„ฃ์–ด soft mode-switch๋ผ๋Š” ๋‘ ๊ฐ€์ง€ ๊ตฌ์ฒด์  ๋ฉ”์ปค๋‹ˆ์ฆ˜์œผ๋กœ ์„ฑ๊ณต์ ์œผ๋กœ ํ’€์–ด๋ƒˆ๋‹ค.

2. ์ž”์ฐจ ์ฃผ์ž…์˜ ๋ณด์ˆ˜์„ฑ. \mathbf{v}_\theta = W(\mathbf{Z}_{\text{modality}} + \mathbf{Z}_{\text{suffix}}) ํ˜•ํƒœ๋Š” ์ƒˆ๋กœ์šด modality๋ฅผ ์ถ”๊ฐ€ํ•˜๋Š” ๊ฐ€์žฅ ์•ˆ์ „ํ•œ ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜๋‹ค. ์‹ ํ˜ธ๊ฐ€ ๋ฌด์˜๋ฏธํ•  ๋•Œ ์ž๋™์œผ๋กœ 0์œผ๋กœ ์ˆ˜๋ ดํ•˜๋„๋ก ํ•˜๋Š” inductive bias๋Š”, ์‚ฌ์ „ํ•™์Šต backbone์ด ๋น„์‹ธ๊ฒŒ ํ•™์Šตํ•œ prior๋ฅผ ๋ง๊ฐ€๋œจ๋ฆฌ์ง€ ์•Š๋Š”๋‹ค. ์ด๋Š” ์–ด๋–ค ์‚ฌ์ „ํ•™์Šต VLA์—๋“  force/tactile์„ ์ถ”๊ฐ€๋กœ ๋ถ™์ด๊ณ  ์‹ถ์„ ๋•Œ ์ผ๋ฐ˜ํ™” ๊ฐ€๋Šฅํ•œ ํŒจํ„ด์ด๋‹ค.

3. modality-specific output head. W_1 for arm, W_2 for hand์˜ ๋ถ„๋ฆฌ๋Š” ์‚ฌ์†Œํ•ด ๋ณด์ด์ง€๋งŒ ๋งค์šฐ ์ค‘์š”ํ•˜๋‹ค. arm-level torque์™€ fingertip wrench๊ฐ€ cross-contaminate๋˜์ง€ ์•Š๋„๋ก ํ•˜๋Š” ๊ฒƒ์€ ๋‹จ์ˆœํ•œ inductive bias ์ด์ƒ์ด๋‹ค. ๋ฌผ๋ฆฌ์  ์˜๋ฏธ๋ฅผ ๋”ฐ๋ผ๊ฐ„ architectural separation์ด๋‹ค.

4. Sub-metric์˜ ์ค‘์š”์„ฑ์„ ๋ณด์—ฌ์ค€ PCR. ์ฃผ๊ธฐ์  task์—์„œ binary SR๋งŒ ์“ฐ๋ฉด baseline์€ โ€œ0%โ€๋กœ ๋ฌถ์—ฌ ๋””ํ…Œ์ผ์„ ์žƒ๋Š”๋‹ค. 25% ๋‹จ์œ„ ์ด์‚ฐํ™”๋Š” ๊ฑฐ์น ์ง€๋งŒ ์ถฉ๋ถ„ํžˆ informativeํ•˜๋‹ค. ๋‹ค๋ฅธ cyclic dexterous task(์˜ˆ: ์œ ์‚ฌํ•œ reorientation, ํšŒ์ „ ๊ฐ€๊ณต)์—๋„ ์ฐจ์šฉํ•  ๋งŒํ•œ ํ‰๊ฐ€ ํŒจํ„ด์ด๋‹ค.

5. An autonomous demonstration of a demanding task, Apple Peeling. There are almost no earlier examples of a task that needs "bimanual coordination + force-guided cutting + tactile-guided in-hand rotation" all at once being completed autonomously, even partially.
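Strengths 2 and 3 can be sketched together in a few lines. The following is a minimal illustration, assuming the residual form \mathbf{v}_\theta = W(\mathbf{Z}_{\text{modality}} + \mathbf{Z}_{\text{suffix}}) with separate heads W_1 (arm) and W_2 (hand). The toy dimensions, the helper names `matvec`, `add`, and `velocity`, and the choice to feed force features only to the arm head and tactile features only to the hand head are assumptions made for illustration, not the paper's implementation.

```python
def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [ai + bi for ai, bi in zip(a, b)]


def velocity(z_suffix, z_force, z_tactile, W1_arm, W2_hand):
    """Assumed split: v = [W1 (Z_force + Z_suffix) ; W2 (Z_tactile + Z_suffix)]."""
    v_arm = matvec(W1_arm, add(z_force, z_suffix))
    v_hand = matvec(W2_hand, add(z_tactile, z_suffix))
    return v_arm + v_hand  # list concatenation: arm outputs, then hand outputs


# Hypothetical toy sizes: feature dim 3, two arm outputs, two hand outputs.
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W2 = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
z_suffix = [0.5, -1.0, 2.0]
zero = [0.0, 0.0, 0.0]

# With zero modality features the output is exactly the backbone-only output.
baseline = velocity(z_suffix, zero, zero, W1, W2)
# A tactile residual changes the hand outputs and leaves the arm outputs alone.
with_contact = velocity(z_suffix, zero, [0.0, 0.0, 0.25], W1, W2)
```

In this sketch the "safe" behavior is simply that a zero residual reproduces the backbone output. Whether a trained modality branch really stays near zero in free space is something the model has to learn (or be initialized toward), which the sketch does not show.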

Weaknesses and limitations

1. Absolute performance is still low. SR sits in the 30-60% range, which is far from what industrial deployment needs. The numbers honestly reveal that VLA-based dexterous manipulation is still in its early days. The baseline (\pi_0) is also not a perfect hardware match, so conclusions should not be drawn from the direct comparison alone.

2. IMCopilot's skill repertoire is very limited. It covers stable grasp maintenance and single-axis rotation, and nothing else. Generalizing requires more diverse in-hand skills (axis-conditioned, object-conditioned, and so on), which could blow up both the RL training cost and the sim-to-real gap. Integrating follow-up lines of work such as RotateIt and AnyRotate looks like the natural next step.

3. The learning signal for the trigger scalar c. The trigger is currently supervised by the human operator's pedal input in the demonstrations. At inference time, however, it may be more effective to invoke IMCopilot at moments the operator would not have thought of. There is room to fine-tune the trigger with RL or to augment it with self-supervision.

4. Force and tactile are single-frame in time. Replicating one frame H times during tokenization is clever, but it cannot capture fast-changing contact transients. A natural extension is to feed a multi-frame history as input and apply a temporal MoE.

5. No evaluation of the sim-to-real gap. IMCopilot is trained only in IsaacLab and deployed zero-shot. The paper would be more convincing if it reported the exact domain randomization ranges and robustness statistics over real apples of varying size, center of mass, and friction (a familiar craving for anyone who has done RL research on the Allegro Hand).

6. Alignment with the general VLA pretraining distribution. \pi_0 was pretrained mostly on parallel-gripper data. An analysis of how fine-tuning on a 22-DoF hand changes the backbone representation is missing. The claim that residual injection protects the backbone is qualitative, and a quantitative representation-drift analysis would make it stronger.
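Weakness 4 is easier to see with a toy version of the tokenization. The sketch below replicates one sensor frame H times and routes each token to a single expert with top-1 gating. The gate weights, the 2-D frame, the use of only two experts, and in particular the per-slot offset (a stand-in for whatever context the self-attention step mixes into each slot) are hypothetical choices for illustration, not the paper's design.

```python
def replicate(frame, horizon):
    """Copy a single sensor frame into `horizon` identical tokens."""
    return [list(frame) for _ in range(horizon)]


def top1_route(token, gate):
    """Return the index of the expert with the largest gate logit."""
    logits = [sum(g * t for g, t in zip(row, token)) for row in gate]
    return max(range(len(logits)), key=lambda e: logits[e])


H = 4
frame = [1.0, 0.0]  # one force/tactile reading (toy values)
tokens = replicate(frame, H)

# Assumption: a per-slot offset stands in for context mixed in by attention.
slot_offset = [[0.0, 0.0], [0.0, 0.5], [0.0, 1.5], [0.0, 3.0]]
mixed = [[t + o for t, o in zip(tok, off)]
         for tok, off in zip(tokens, slot_offset)]

gate = [[1.0, 0.0], [0.0, 1.0]]  # two experts here, purely for brevity

routes_raw = [top1_route(tok, gate) for tok in tokens]
routes_mixed = [top1_route(tok, gate) for tok in mixed]
```

The identical copies all land on the same expert, while the context-mixed slots split between early and late experts. Any per-phase specialization therefore comes from the surrounding context, not from the sensor signal itself, which is why a transient that occurs inside the horizon stays invisible to this scheme.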

Comparison with related work

| Study | Force | Tactile | Fusion method | Hand type | In-hand skill |
|---|---|---|---|---|---|
| RDP | wrist F/T | - | fast-slow conditional | parallel gripper | none |
| TA-VLA | joint torque | - | architectural exploration | parallel gripper | none |
| ForceVLA | wrist F/T | - | force-aware MoE | parallel gripper | none |
| Tactile-VLA | - | tactile | hybrid pos/force ctrl | parallel gripper | none |
| MoDE-VLA (this paper) | joint torque | fingertip 6-DoF | dedicated path + sparse MoE + residual | 22-DoF dexterous hand | RL specialist (IMCopilot) |

The paper's positioning is clear. Earlier work mostly focused on integrating a single modality into a single hand type (mainly parallel grippers). This paper is close to a first attempt at integrating (i) both force and tactile modalities at once, (ii) through paths separated by physical meaning, (iii) on a 22-DoF dexterous hand, (iv) inside a hierarchy that can invoke RL skills. The lineage reads naturally if the method is viewed as extending ForceVLA's force-aware MoE idea to both force and tactile while adding residual injection.

In particular, learning in-hand rotation with IsaacLab + PPO + asymmetric actor-critic + teacher-student distillation sits squarely in the lineage of OpenAI cube reorientation, AnyRotate, RotateIt, and DexNDM. The difference is that the RL skill is used both as a teleoperation copilot and as a callable primitive for the VLA. This dual use may be the paper's most original insight.
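The dual use can be written down as one policy with two callers. This is a minimal sketch: the 0.5 threshold comes from the quick reference card at the end of this review and the pedal from the trigger discussion above, while `in_hand_policy`, `teleop_step`, `vla_step`, and the toy observation and command values are hypothetical names and numbers introduced only for illustration.

```python
def in_hand_policy(obs):
    """Stand-in for the RL specialist; returns a toy hand command."""
    return [0.1 * x for x in obs]


def teleop_step(operator_hand, pedal_pressed, obs):
    """Data collection: the operator's pedal hands the fingers to the specialist."""
    return in_hand_policy(obs) if pedal_pressed else operator_hand


def vla_step(vla_hand, c, obs, threshold=0.5):
    """Inference: the VLA's trigger scalar c plays the role of the pedal."""
    return in_hand_policy(obs) if c > threshold else vla_hand


obs = [1.0, 2.0]
collected = teleop_step([9.0, 9.0], True, obs)   # specialist output gets recorded
executed = vla_step([7.0, 7.0], 0.9, obs)        # same specialist output at test time
```

Because both callers end up executing the same function, the hand commands recorded in the demonstrations and the ones executed at inference come from the same source whenever the specialist is active.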

Implications: a message for practicing roboticists

Reading this paper the way one would read a code review leaves five practical lessons.

  1. Do not try to teach by imitation a motion that even a human cannot perform. Outsource that part to an RL specialist, and mix the specialist's output into the demonstrations themselves. The training distribution and the inference distribution then line up naturally.

  2. Add a modality as a residual, not on the main path. It is the safest way to attach a new sensor without breaking the pretrained backbone's representation. Design it so that the residual goes to 0 on its own in free space.

  3. Separate the paths according to physical semantics. Do not put arm torque and fingertip wrench into the same token pool. The small decision to split the arm-action and hand-action output heads makes a big difference.

  4. Replicating modality tokens across the action horizon gives the sparse MoE room to specialize per phase. The trick of "unrolling a single-frame signal H times to create time slots" can be borrowed as-is in other vision-language-action fusion architectures.

  5. Do not evaluate cyclic tasks with binary SR alone. A sub-metric that discretizes partial progress (here, PCR) separates the baseline's failure modes from the method's real contribution.
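Lesson 5 can be checked with a small worked example. The sketch below assumes that a partial-progress score is obtained by flooring each trial's progress to 25% steps and averaging. The flooring rule, the function names, and the per-trial progress values are hypothetical, chosen only to show how binary SR hides differences that a PCR-style score exposes.

```python
import math


def discretize(progress, step=0.25):
    """Floor a progress fraction in [0, 1] to a multiple of `step` (assumed rule)."""
    return math.floor(progress / step + 1e-9) * step


def success_rate(trials):
    """Binary SR: a trial counts only if it is fully completed."""
    return sum(1 for p in trials if p >= 1.0) / len(trials)


def partial_completion(trials, step=0.25):
    """Mean of the discretized per-trial progress (a PCR-style score)."""
    return sum(discretize(p, step) for p in trials) / len(trials)


# Hypothetical per-trial progress fractions for a policy that never finishes.
trials = [0.3, 0.6, 0.0, 0.8]

sr = success_rate(trials)          # 0.0: every trial counts as a failure
pcr = partial_completion(trials)   # 0.375: partial progress becomes visible
```

Two policies that both score 0% SR can differ widely on the partial-progress score, which is the distinction the ablations in this review rely on.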

Conclusion: dexterity in the VLA era comes from hierarchy and division of labor

Rather than proposing a single new trick, this paper takes components already known to work well (\pi_0, flow matching, sparse MoE, asymmetric PPO, residual adaptation) and places them precisely at the real bottlenecks of dexterous manipulation. As a result of that placement, a task such as bimanual apple peeling, for which even data could not be collected before, now partially succeeds autonomously.

Boiled down to one sentence, the core is this.

"Leave the plan to the VLA, leave the finger reflexes to the RL specialist, and push force and tactile in as a residual correction that does not break the pretrained knowledge."

This message is not specific to the 22-DoF SharpaWave. It is an architectural pattern that can be carried over directly to other dexterous platforms such as the 16-DoF Allegro Hand, to other pretrained VLA backbones (OpenVLA, \pi_{0.5}, RT-2, and so on), and to other modalities (temperature, audio, proximity). The paper is therefore more rewarding to read as a design document for follow-up research on VLA + dexterous manipulation than as a results report for one system.


Quick reference card

Paper: Towards Human-Like Manipulation through RL-Augmented
       Teleoperation and Mixture-of-Dexterous-Experts VLA
ArXiv: 2603.08122v1

Hardware: SharpaNorth (2x 7-DoF arm + 2x 22-DoF hand = 63 DoF)
          fingertip 6-DoF tactile + arm joint torque

Key modules:
  IMCopilot   = PPO + IsaacLab + teacher-student RL
                 dual role: teleop copilot + VLA primitive
  MoDE-VLA    = pi0 backbone
                 + force/tactile token path (replicate H times)
                 + self-attn over [prefix | suffix | Zf | Zg]
                 + sparse MoE (E=8, top-1)
                 + modality-split residual head (W1 arm, W2 hand)

Action: a = [a_arm ; a_hand ; a_other(c, waist)]
        if c > 0.5: a_hand <- IMCopilot output

Best results:
  Avg SR  : 15% (pi0)  ->  34% (ours)
  Apple   : 0% / 8%    ->  30% / 73% (SR/PCR)
  In-hand rotation: 34% (teleop) -> 89% (IMCopilot)

Copyright 2026, JungYeon Lee