Curieux.JY
  • JungYeon Lee

📃 VTWM Review

visuo-tactile
world model
digit360
Visuo-Tactile World Models
Published

March 16, 2026

  • Paper Link
  1. 👀 The Visuo-Tactile World Model (VT-WM) combines visual information with tactile sensing to overcome the limitations of Vision-only World Models (V-WM) on contact-rich robot manipulation tasks.
  2. 💭 In imagination, VT-WM improves object permanence by 33% and compliance with the laws of motion by 29%, reducing the hallucinations that commonly occur in V-WM.
  3. 🤖 These improvements raise zero-shot planning success rates in real-robot control by up to 35%, and the model adapts effectively to new tasks from a limited number of demonstrations, showing strong versatility.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

This work proposes the Visuo-Tactile World Model (VT-WM), a world model for robot manipulation that integrates vision and tactile information to model contact physics. Existing vision-only world models (V-WM) often fail on contact-rich manipulation tasks because of occlusion and visual aliasing. They hallucinate: objects disappear, teleport, or move in ways that violate basic physics. By complementing vision with tactile images, VT-WM understands robot-object interaction better and overcomes these limitations.

The core method is as follows. The model integrates two observation modalities, vision and touch. Vision is an RGB video stream from an exocentric camera that captures the scene and the robot's global context. Touch is image data from Digit 360 sensors mounted on the robot's fingertips, showing how the soft elastomer surface deforms on contact.

The architecture has three main components:

  1. Vision Encoder: extracts the latent state s_k of the robot and the environment from the exocentric video, using the pre-trained Cosmos Tokenizer (Agarwal et al., 2025) as the visual encoder.
  2. Tactile Encoder: converts the high-frequency contact feedback from the Digit 360 sensors into a compressed latent state t_k that emphasizes the important physical interactions. Sparsh-X (Higuera et al., 2025) plays this role.
  3. Predictor (Transition Model): the latent states s_k and t_k from the encoders are passed, together with the control action a_k, to an autoregressive predictor. The predictor is a 12-layer transformer that estimates the next-step state (\hat{s}_{k+1}, \hat{t}_{k+1}) \sim P_\phi(s_k, t_k | a_k).
    • The input latent states are augmented with sinusoidal positional embeddings and then projected into a unified representation. Vision and tactile tokens are concatenated along the spatial dimension to form a single input sequence.
    • Inside the transformer, two kinds of attention alternate:
      • Spatio-Temporal Self-Attention: attention is factorized into spatial and temporal attention to capture spatial interactions between tokens and their evolution over time efficiently. This avoids the O((THW)^2) complexity of full spatiotemporal attention.
      • Action Conditioning via Cross-Attention: after each self-attention block, the vision-touch tokens cross-attend to the action tokens, which brings the robot's control input into the prediction.
    • All attention layers use RoPE (Rotary Position Embeddings) for relative position encoding. After the transformer, modality-specific output heads project the representation back to the original dimensions to produce the predicted \hat{s}_{k+1} and \hat{t}_{k+1}.
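To make the complexity argument for factorized attention concrete, the sketch below counts query-key pairs per layer for full spatiotemporal attention versus separate spatial and temporal passes. The function name and the grid sizes are illustrative assumptions, not values from the paper.

```python
def attention_pair_counts(T: int, H: int, W: int) -> dict:
    """Count query-key pairs scored per layer for two attention layouts.

    T is the number of time steps and H x W the token grid per step.
    All sizes here are hypothetical; the review does not give them.
    """
    tokens_per_frame = H * W
    # Full spatiotemporal attention: every token attends to every token.
    full = (T * tokens_per_frame) ** 2
    # Spatial pass: tokens attend only within their own time step.
    spatial = T * tokens_per_frame ** 2
    # Temporal pass: each spatial location attends across time steps.
    temporal = tokens_per_frame * T ** 2
    return {"full": full, "factorized": spatial + temporal}


counts = attention_pair_counts(T=8, H=16, W=16)
print(counts, round(counts["full"] / counts["factorized"], 2))
```

For this toy grid the factorized layout scores roughly an eighth of the pairs, and the gap widens as T grows.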

The model is trained on a dataset of contact-rich manipulation tasks collected through teleoperation. The training data contains the robot's proprioceptive state, the exocentric video, and the video from each Digit 360 sensor. For stability and long-horizon coherence, the training loss combines a teacher-forcing term and a sampling term:

L_{teacher} = \sum_{k=1}^{T-1} (\|\hat{s}_{k+1} - s_{k+1}\|_1 + \|\hat{t}_{k+1} - t_{k+1}\|_1)

Here \hat{s}_{k+1} and \hat{t}_{k+1} are predicted from the ground-truth states up to time k, and s_{k+1} and t_{k+1} are the latents encoded from the ground-truth observation at time k+1.

L_{sampling} = \sum_{k=1}^{H} (\|\hat{s}^{sampled}_{k+1} - s_{k+1}\|_1 + \|\hat{t}^{sampled}_{k+1} - t_{k+1}\|_1)

The sampled states are generated without gradients, which prevents training instability. The final loss is written L = L_{teacher} + L_{sampling} and is described as a weighted average of the two terms. Training uses the AdamW optimizer; the Cosmos Tokenizer is kept frozen, while the Sparsh-X encoder is fine-tuned to account for sensor-specific variation.
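The difference between the two loss terms is easiest to see on a toy example. The sketch below is a plain-Python illustration under stated assumptions: `predict` is a hypothetical stand-in for the transformer predictor, latents are short lists of floats, and indices are 0-based. In an autodiff framework the state fed back in `sampling_loss` would be detached so that no gradient flows through it.

```python
from typing import Callable, List, Sequence, Tuple

Vec = List[float]
Predictor = Callable[[Vec, Vec, Vec], Tuple[Vec, Vec]]


def l1(x: Sequence[float], y: Sequence[float]) -> float:
    """L1 distance between two equally sized vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def teacher_forcing_loss(predict: Predictor, s, t, a) -> float:
    """Each step predicts from the ground-truth latents of the previous step."""
    total = 0.0
    for k in range(len(s) - 1):
        s_hat, t_hat = predict(s[k], t[k], a[k])
        total += l1(s_hat, s[k + 1]) + l1(t_hat, t[k + 1])
    return total


def sampling_loss(predict: Predictor, s, t, a, horizon: int) -> float:
    """Each step predicts from the model's own previous prediction."""
    s_hat, t_hat = s[0], t[0]
    total = 0.0
    for k in range(horizon):
        s_hat, t_hat = predict(s_hat, t_hat, a[k])
        total += l1(s_hat, s[k + 1]) + l1(t_hat, t[k + 1])
    return total


def toy_predict(s_k: Vec, t_k: Vec, a_k: Vec) -> Tuple[Vec, Vec]:
    """Slightly biased toy dynamics: moves 90% of the commanded step."""
    return [x + 0.9 * u for x, u in zip(s_k, a_k)], list(t_k)


# Ground truth follows s_{k+1} = s_k + a_k; the tactile latent stays at zero.
s = [[0.0], [1.0], [2.0], [3.0]]
t = [[0.0], [0.0], [0.0], [0.0]]
a = [[1.0], [1.0], [1.0]]

l_teacher = teacher_forcing_loss(toy_predict, s, t, a)
l_sampling = sampling_loss(toy_predict, s, t, a, horizon=3)
print(l_teacher, l_sampling, l_teacher + l_sampling)
```

With teacher forcing the per-step error stays constant, while in the sampled rollout it compounds. That compounding is the long-horizon drift the sampling term is meant to penalize.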

For planning, VT-WM is combined with the Cross-Entropy Method (CEM). Given a goal image and the current visual and tactile context, CEM searches for the best action sequence. The cost function is the ℓ2 distance between the final predicted visual latent state \hat{s}_{k+H} and the latent state of the goal image, s_{goal}. Planning takes place in an action space in \mathbb{R}^7: the 3D translation and 3D orientation of the robot's wrist pose, plus a binary open/close variable for the hand. The planned action sequence is executed open-loop on the real robot.

The experiments support VT-WM as follows:

  1. Contact Perception: VT-WM produces higher-quality imagination than V-WM. Measured in terms of object permanence and causal compliance, VT-WM reduces the normalized Fréchet distance (measured with CoTracker) by an average of 33% relative to V-WM for moving objects, and by an average of 29% for static objects. VT-WM therefore reduces hallucinations such as disappearing objects and non-physical motion, and produces more physically consistent rollouts.
  2. Zero-shot Planning: VT-WM outperforms V-WM in zero-shot planning on a real robot. On contact-rich, multi-step tasks (e.g. push fruits, reach & push, wipe cloth, stack cubes), VT-WM achieves up to 35% higher success rates than V-WM. This suggests that tactile grounding resolves visual aliasing and enables more stable contact interaction.
  3. Downstream Versatility (adaptation to new tasks): fine-tuned on a new task ("place plate in dish rack") with only 20 demonstration sequences, VT-WM reaches a 77% success rate. This shows that it can reuse previously learned contact dynamics to adapt to a new task quickly and data-efficiently.

In conclusion, by combining visual and tactile information, VT-WM lets a robot understand physical interaction more accurately, generate more realistic imagined rollouts, and plan more reliably for contact-rich manipulation on a real robot.

The limitations mentioned are that the tactile modality is restricted to vision-based tactile sensing (Digit 360), that the contact perception evaluation covers only tasks within the training distribution, and that planning with CEM is computationally expensive, which leads to open-loop execution.

🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Introduction: What Happens When a Robot With Only Eyes Dreams?

Imagine closing your eyes and reaching for a cup in a dark room. The moment your hand touches the cup's surface, the sensation in your fingertips tells you: "Ah, here it is." From then on you can lift the cup without your eyes. The sense of weight, friction, and shape transmitted through your hand updates the "internal model" in your brain.

Until now, world models for robot manipulation have imagined the world from camera images alone, without this fingertip sensation. The result? Hallucinations: an object vanishes like a magic trick the moment it is grasped, slips although no force was applied, or moves as if passing through a wall.

Visuo-Tactile World Models (VT-WM) tackles this problem head-on. The paper, submitted to ICLR 2026 by Carolina Higuera (UW/Meta), Sergio Arnaud, Byron Boots, Mustafa Mukadam, Francois Hogan, and Franziska Meier, puts touch into the world model's imagination so that contact physics is represented more faithfully, and shows that this physical fidelity carries over to real planning.

The headline numbers first:

  • Object Permanence: +33% improvement
  • Laws of Motion (compliance): +29% improvement
  • Zero-shot Real-Robot Planning: up to +35% success rate
  • Few-shot Fine-tuning: 3.5× the performance of Behavioral Cloning

Background: The Weakness of World Models and the Role of Touch

What Is a World Model?

A world model (WM) is a model with which a robot simulates the real world internally. The core idea is simple: before actually taking an action, first "imagine" its outcome in your head. Models such as DreamerV3 (Hafner et al., 2023), UniSim, and Genie 2 represent this line of work, and recent robot manipulation research is actively trying to use such models for planning.

Formally, a world model learns the following transition distribution:

p(s_{t+1} \mid s_t, a_t)

Here s_t is the latent state and a_t is the action. Applied autoregressively:

\hat{s}_{t+1}, \hat{o}_{t+1} = f_\theta(s_t, a_t)

The model virtually executes an action sequence \{a_0, a_1, \ldots, a_T\}, predicting the future observations \hat{o} along the way, and selects the action sequence with the highest predicted reward.
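As a minimal sketch of this imagine-then-select loop, the code below unrolls a transition function over candidate action sequences and keeps the one whose final state scores best. The names `rollout` and `best_sequence`, the additive 1-D dynamics, and the reward are hypothetical placeholders for a learned f_\theta and a task reward.

```python
from typing import Callable, List, Sequence


def rollout(f: Callable[[float, float], float], s0: float,
            actions: Sequence[float]) -> List[float]:
    """Apply the transition function step by step, feeding each output back in."""
    states = [s0]
    for a in actions:
        states.append(f(states[-1], a))
    return states


def best_sequence(f, s0, candidates, reward):
    """Pick the candidate whose imagined final state has the highest reward."""
    return max(candidates, key=lambda acts: reward(rollout(f, s0, acts)[-1]))


def toy_dynamics(s: float, a: float) -> float:
    """Stand-in for the learned model: the state moves by the action."""
    return s + a


goal = 3.0
candidates = [[1, 1, 1], [1, 0, 0], [2, 2, 2]]
chosen = best_sequence(toy_dynamics, 0.0, candidates, lambda s: -abs(s - goal))
print(rollout(toy_dynamics, 0.0, chosen), chosen)
```

Nothing is executed on the robot until a sequence has been chosen, so the quality of the choice depends entirely on how faithful the imagined states are.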

Vision Alone Is Not Enough: Three Fundamental Failures

The problem is that current Vision-only World Models (V-WM) persistently produce three kinds of physically impossible situations.

1. Object Disappearance

The moment the robot hand occludes an object, the object vanishes from the camera image. When visual occlusion occurs, V-WM wrongly infers that the object no longer exists. Example: a cube disappears from the scene while the hand is carrying it.

2. Teleportation

An object suddenly appears at a different location. The model fails to represent continuous motion and produces a discontinuous jump.

3. Acausal Motion

An object moves although the robot never touched it or, conversely, does not move at all although the robot did touch it. This violates Newton's first and third laws.

Tactile sensors offer a direct remedy for all three problems. While the fingers hold an object, the tactile signal is active, and that signal states explicitly that "the object is here".

Tactile Sensing Today: Digit 360 and Sparsh-X

The tactile sensor used in this work is Digit 360 (Lambeta et al., 2024). The Digit family is a line of vision-based tactile sensors that grew out of GelSight (Yuan et al., 2017): light is cast onto a soft elastomer surface, and an internal camera captures the deformation caused by contact. Contact shape, pressure distribution, slip, and similar quantities can be extracted from the resulting tactile images.

Raw tactile images are high-dimensional, which makes them a heavy burden for WM training. The authors therefore use Sparsh-X (Higuera et al., 2025), a pre-trained tactile representation model, to extract low-dimensional tactile embeddings. Sparsh-X is a tactile foundation model trained with self-supervised learning, and it compresses rich information about contact dynamics without labels.

On the vision side, the Cosmos Tokenizer (Agarwal et al., 2025) converts RGB images into latent codes.


Method: Structure and Operation of VT-WM

Architecture Overview

The overall structure of VT-WM can be drawn as the following diagram.

graph LR
    subgraph Sensing ["Sensing Layer"]
        RGB["RGB Camera\n(Global Context)"]
        TAC["Digit 360\nTactile Sensors\n(Local Contact)"]
    end

    subgraph Encoding ["Encoding Layer"]
        CE["Cosmos Encoder\n(RGB Tokenizer)"]
        SE["Sparsh-X\n(Tactile Foundation Model)"]
    end

    subgraph WM ["World Model (Latent Space)"]
        LS["Multimodal\nLatent State s_t"]
        TM["Transition Model\nf_theta(s_t, a_t)"]
        PR["Predictor\nhat_o_{t+1}"]
    end

    subgraph Planning ["Planning Layer"]
        RL["Autoregressive\nRollout"]
        OPT["Action Optimization\n(MPC / CEM)"]
        PLAN["Zero-shot Plan\n{a_0,...,a_T}"]
    end

    RGB --> CE
    TAC --> SE
    CE --> LS
    SE --> LS
    LS --> TM
    TM --> LS
    TM --> PR
    LS --> RL
    RL --> OPT
    OPT --> PLAN

The core design philosophy is modality specialization. Vision handles the "global picture" of the world, and touch handles the "local contact physics" at the point of contact. The two modalities complement each other and form a single unified latent state.

Multimodal Integration of the Latent State

s_t = \text{Encode}(o_t^{rgb}, o_t^{tac}, a_{t-1})

Here o_t^{rgb} \in \mathbb{R}^{d_{rgb}} is the vision feature encoded by the Cosmos Tokenizer and o_t^{tac} \in \mathbb{R}^{d_{tac}} is the tactile feature encoded by Sparsh-X. The two modalities are combined to form the unified latent state s_t.

Next latent state prediction:

\hat{s}_{t+1} = f_\theta(s_t, a_t)

Observation reconstruction (prediction/decoding):

\hat{o}_{t+1}^{rgb}, \hat{o}_{t+1}^{tac} = g_\phi(\hat{s}_{t+1})

The world model is trained to predict both modalities. This matters because the tactile prediction objective makes the model learn explicitly "how will contact change if I take this action?"
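One simple way to picture the fusion is as token concatenation: each modality is projected to a shared width, the token lists are joined into one sequence, and the sequence is split back per modality for the output heads. The sketch below assumes exactly that; the projection matrices, dimensions, and function names are hypothetical, and the predictor between `fuse` and `split` is omitted.

```python
from typing import List, Tuple

Token = List[float]
Matrix = List[List[float]]  # d_in rows, d_model columns


def project(tokens: List[Token], weight: Matrix) -> List[Token]:
    """Map each token from d_in to d_model with a plain matrix product."""
    return [[sum(x * w for x, w in zip(tok, col)) for col in zip(*weight)]
            for tok in tokens]


def fuse(vision: List[Token], tactile: List[Token],
         w_vision: Matrix, w_tactile: Matrix) -> List[Token]:
    """Project both modalities to a shared width and join them along the token axis."""
    return project(vision, w_vision) + project(tactile, w_tactile)


def split(sequence: List[Token], n_vision: int) -> Tuple[List[Token], List[Token]]:
    """Recover the per-modality token groups for modality-specific output heads."""
    return sequence[:n_vision], sequence[n_vision:]


vision_tokens = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # 2 tokens, d_rgb = 3
tactile_tokens = [[7.0, 8.0]]                        # 1 token, d_tac = 2
w_v = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]           # 3 x 2, keeps first two dims
w_t = [[1.0, 0.0], [0.0, 1.0]]                       # 2 x 2 identity

fused = fuse(vision_tokens, tactile_tokens, w_v, w_t)
v_part, t_part = split(fused, n_vision=len(vision_tokens))
print(fused, v_part, t_part)
```

Because both modalities live in one sequence, self-attention can relate a tactile token to any vision token without a separate fusion module.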

Multi-task Learning: One Model, Many Tasks

VT-WM is trained in a multi-task setting: demonstration data for several contact-rich manipulation tasks (pushing, wiping, placing, stacking, and so on) is learned jointly by a single model. This lets the model absorb the physics of diverse contact scenarios into a shared representation.

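The training loss is a prediction error on the vision and tactile latents; the Ping Review above gives it as L1 teacher-forcing plus sampling terms. The approach resembles the Dreamer family in that it imagines rollouts in latent space, but the predictor is the transformer described in the Ping Review rather than an RSSM (Recurrent State Space Model), and the prediction objective covers both modalities.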

Planning Algorithm: Finding the Best Action in Imagination

Planning with the trained WM works as follows:

Algorithm: VT-WM Zero-shot Planning
---------------------------------------------------------
Input:
  - Trained VT-WM (f_theta, g_phi)
  - Initial observation (o_0^rgb, o_0^tac)
  - Goal image o_goal^rgb
  - Planning horizon T
  - Action candidates K

1. Encode initial state: s_0 = Encode(o_0^rgb, o_0^tac)

2. For iteration 1..N_iter:
   a. Sample K action sequences {A^k}_{k=1}^{K}
      where A^k = {a_0^k, ..., a_{T-1}^k}

   b. For each A^k:
      - Unroll WM: s_1^k, ..., s_T^k = Rollout(s_0, A^k)
      - Decode: o_T^{rgb,k} = g_phi(s_T^k)
      - Compute reward: r^k = Sim(o_T^{rgb,k}, o_goal^rgb)

   c. Select best: A* = argmax_k r^k

3. Execute A* on real robot (open-loop)
---------------------------------------------------------
Output: Executed action sequence A*

The key point is that the quality of the WM's "imagination" directly determines the quality of the plan. If V-WM imagines losing the object, the plan generated from that imagination will select actions that lose the object. VT-WM, by contrast, knows through touch that the object is in the hand, so it simulates action sequences that maintain contact more accurately.
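The planning loop can be sketched as a small Cross-Entropy Method optimizer, the planner named in the Ping Review. This is an illustration under stated assumptions, not the authors' implementation: `cem_plan`, `step`, and `cost` are hypothetical names, the dynamics are a toy additive model standing in for the learned VT-WM, and every action dimension is sampled from a Gaussian (a binary gripper variable would need separate handling).

```python
import math
import random


def cem_plan(step, cost, s0, horizon, act_dim,
             iters=5, pop=64, n_elite=8, seed=0):
    """Search for an action sequence whose imagined final state minimizes cost."""
    rng = random.Random(seed)
    n = horizon * act_dim
    mean, std = [0.0] * n, [1.0] * n
    best_seq, best_cost = None, math.inf
    for _ in range(iters):
        scored = []
        for _ in range(pop):
            # Sample one flat action sequence from the current Gaussian.
            flat = [rng.gauss(m, sd) for m, sd in zip(mean, std)]
            seq = [flat[i * act_dim:(i + 1) * act_dim] for i in range(horizon)]
            s = s0
            for a in seq:  # imagined rollout, no robot involved
                s = step(s, a)
            scored.append((cost(s), flat, seq))
        scored.sort(key=lambda item: item[0])
        if scored[0][0] < best_cost:
            best_cost, best_seq = scored[0][0], scored[0][2]
        # Refit the sampling distribution to the lowest-cost samples.
        elites = [flat for _, flat, _ in scored[:n_elite]]
        mean = [sum(col) / n_elite for col in zip(*elites)]
        std = [max(1e-3, math.sqrt(sum((x - m) ** 2 for x in col) / n_elite))
               for col, m in zip(zip(*elites), mean)]
    return best_seq, best_cost


def toy_step(s, a):
    """Toy latent dynamics: the state moves by the action."""
    return [x + u for x, u in zip(s, a)]


goal = [1.0, -2.0]
plan, final_cost = cem_plan(toy_step, lambda s: math.dist(s, goal),
                            s0=[0.0, 0.0], horizon=3, act_dim=2)
print(len(plan), final_cost)
```

The cost here is an l2 distance between the imagined final state and the goal, mirroring the latent-space cost described in the Ping Review. The returned sequence would then be executed open-loop.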

Hardware Setup

System Configuration
------------------------------------
Arm      : Franka Panda
Hand     : Allegro Hand V4
Tactile  : Digit 360 (fingertip, 3x)
Vision   : RGB camera (wrist/workspace)
Encoders : Cosmos Tokenizer (RGB)
           Sparsh-X (Tactile)
------------------------------------

The Allegro Hand + Franka Panda combination is becoming a de facto standard platform in tactile manipulation research (NeuralFeels, DexWM, and others), and this paper uses the same platform.


Experiments: What Was Measured and What Came Out

Three Questions Behind the Experiments

The experiments are designed to answer three key questions:

  1. Contact Perception: does VT-WM capture object permanence and the laws of physics better than V-WM?
  2. Zero-shot Planning: does improved contact perception translate into real-robot planning performance?
  3. Downstream Versatility: can the model adapt to a new task from only a few demonstrations?

Metric 1: Object Permanence

Object permanence is measured with the normalized Fréchet distance. The metric measures the distributional distance between the predicted and the actual object trajectories. Lower values mean the prediction is closer to reality.

As a formula:

\text{FD}(P, Q) = \min_{\gamma \in \Pi(P,Q)} \int_{\mathcal{X} \times \mathcal{X}} \|x - y\| \, d\gamma(x,y)

V-WM's prediction of the object position collapses the moment occlusion occurs, whereas VT-WM keeps the object trajectory far more accurate because the tactile signal continuously reports that "the object is here".

Result: VT-WM achieves a normalized Fréchet distance about 33% lower than V-WM (with 95% CI).
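The formula above is written in an optimal-transport form. For tracked point trajectories, a common alternative reading of "Fréchet distance" is the discrete Fréchet distance between two polylines. The sketch below implements that variant with the standard dynamic program. Which definition the paper computes, and how it normalizes the value, is not stated in this review, so the `scale` argument is a hypothetical normalizer (for example an image diagonal).

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def discrete_frechet(p: Sequence[Point], q: Sequence[Point]) -> float:
    """Discrete Frechet distance between two polylines via dynamic programming."""
    n, m = len(p), len(q)
    ca: List[List[float]] = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Advance along p, along q, or along both; keep the cheapest.
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[-1][-1]


def normalized_frechet(pred, truth, scale: float) -> float:
    """Divide by a caller-chosen scale; the normalizer is an assumption."""
    return discrete_frechet(pred, truth) / scale


truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
shifted = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]     # constant offset of 1
teleport = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]    # last point jumps away

print(discrete_frechet(truth, shifted))
print(discrete_frechet(truth, teleport))
print(normalized_frechet(teleport, truth, scale=2.0))
```

A trajectory that teleports at the end is penalized by the size of the jump, which is the kind of failure the object-permanence metric is meant to expose.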

Metric 2: Laws of Motion

The second metric measures how well the predicted object motion agrees with Newtonian mechanics. Concretely, an object must not move while the robot is not touching it (law of inertia), and on contact it must move in the direction of the applied force (law of motion).

Because of visual aliasing, V-WM often produces acausal predictions such as "an object the robot did not touch moves" or "the object does not move even though the robot touched it".

Result: VT-WM achieves 29% higher compliance with the laws of motion than V-WM.
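As a rough illustration of what a compliance check of this kind could look like, the sketch below marks a step as consistent when the object moves if and only if the robot is in contact. This is a hypothetical simplification for intuition only: the function name, the per-step contact flags, the displacement threshold `eps`, and the scoring rule are all assumptions, not the paper's metric, and the sketch ignores the direction of the applied force.

```python
from typing import Sequence


def motion_compliance(contact: Sequence[bool],
                      displacement: Sequence[float],
                      eps: float = 0.01) -> float:
    """Fraction of steps whose object motion is consistent with the contact flag."""
    if len(contact) != len(displacement):
        raise ValueError("contact and displacement must have the same length")
    if len(contact) == 0:
        raise ValueError("need at least one step")
    consistent = 0
    for touching, moved in zip(contact, displacement):
        is_moving = moved > eps
        # Consistent: contact with motion, or no contact with no motion.
        if touching == is_moving:
            consistent += 1
    return consistent / len(contact)


# Step 1 moves without contact and step 3 stays still despite contact.
rate = motion_compliance(contact=[False, False, True, True],
                         displacement=[0.0, 0.05, 0.04, 0.0])
print(rate)
```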

Metric 3: Zero-shot Real-Robot Planning Success Rate

This is the most practical metric: the success rate when plans generated with the trained WM are executed open-loop on the real robot.

The results split in an interesting way by task type:

Task Type                     V-WM   VT-WM   Delta
-------------------------------------------------
Reaching (kinematic)          ~      ~       ~0%
Pushing (contact)             -      -       +~30%
Wiping (contact+cloth)        -      -       +~35%
Placing (contact+place)       -      -       +~25%
Cube Stacking (multi-step)    -      -       +35%

The exact values are estimates read from the data in the paper's figures.

Key observation: simple reaching tasks require only kinematic accuracy, so V-WM and VT-WM perform similarly. On tasks that require maintaining contact (pushing, wiping, placing, stacking), VT-WM achieves up to 35% higher success rates. That the largest gains appear on the tasks where touch matters most is what theory would predict.

Metric 4: Data Efficiency (Few-shot Fine-tuning)

This is the performance after fine-tuning on a new task with a small number of demonstrations.

Result: VT-WM achieves a 3.5× higher success rate than Behavioral Cloning (BC).

This means the contact-physics representation learned by the multi-task WM transfers effectively to new tasks. BC essentially memorizes input-output pairs, which makes it fragile with little data, whereas VT-WM already carries a physical representation and adapts quickly from a few examples.

Figure Notes

Figure 1 (the paper's key figure): a comparison of V-WM and VT-WM imagination on the cube stacking task. In V-WM the cube disappears from the image while it is being carried (object disappearance hallucination). VT-WM knows from the tactile signal that the cube is in the hand, so it represents the cube consistently through the carry, place, and release stages.

Figure 4 (Object Permanence, quantitative): averaged over several tasks, the normalized Fréchet distance for moving objects is about 33% lower for VT-WM than for V-WM.

Figure 8 (multi-task dataset): a visualization of the various contact-rich tasks used for training. It shows that multiple tasks are trained together in a single model.


Critical Discussion: What the Paper Does Well and Where It Falls Short

Strengths

1. Clear problem definition

"Vision-only WMs imagine physically impossible things" is not an abstract claim. The paper classifies it into three concrete failure modes (disappearance, teleportation, acausal motion) and designs a quantitative metric for each. Decomposing a problem this way is at the heart of good research.

2. A natural split between modalities

The division of roles, vision for the global scene and touch for local contact, is intuitive and also biologically plausible. Human perception works in much the same way: vision handles the big picture, while skin receptors in the somatosensory system handle the details of contact.

3. Sensible use of pre-trained models

By using foundation models, Cosmos Tokenizer (vision) and Sparsh-X (touch), as encoders, WM training itself can focus on predicting dynamics in latent space. This design greatly improves training efficiency.

4. Multi-task setting

Multi-task rather than single-task training pushes the WM to learn general contact physics instead of task-specific patterns. This is the key to data-efficient fine-tuning.

5. Real-robot experiments

The work does not stop at simulation: it validates zero-shot planning on a real Allegro Hand + Franka Panda platform. The "zero-shot" part is especially impressive, because the WM is used for planning directly, without fine-tuning.

Weaknesses and Limitations

1. Sensor dependence: Digit 360 is an expensive precision sensor. There are no generalization experiments with cheaper or different kinds of tactile sensors (force/torque sensors, barcode-based sensors, and so on). Usability outside a lab setting may be limited.

2. Sim-to-real gap not addressed: the WM training data appears to be real-robot data, yet the difficulty of tactile simulation (GelSight-type sensors are known to be especially hard to simulate) is barely discussed. Questions about data collection cost and scalability remain.

3. Limits of open-loop planning: planning is currently open-loop, meaning a plan is generated once and executed as is. Closed-loop execution that revises the plan from real-time tactile feedback is not implemented. Unexpected contact changes are frequent in real manipulation, so closed-loop execution may matter more.

4. Limited task diversity: the experimental tasks (pushing, wiping, placing, stacking) are relatively simple. Performance in more complex contact scenarios such as precision insertion (peg-in-hole), screw tightening, or cloth manipulation is unknown.

5. The question of causality: it is hard to tell whether a WM that follows the laws of physics better has "really modeled physical causality" or whether the tactile data simply supplied better statistical patterns. A deeper analysis of physical interpretability is missing.

6. Limited planning horizon: experiments on long-horizon planning are limited. Contact errors tend to accumulate over time, so it is unclear how well VT-WM's advantage holds over longer horizons.


Comparison With Related Work

World Model Lineage

graph TD
    A["DreamerV1/V2/V3\n(Hafner et al., 2019-2023)\nRSSM + Latent Imagination\nRL Setting"] --> B["DayDreamer\n(Wu et al., 2023)\nReal Robot + Dreamer\nVision-only"]
    B --> C["V-WM\n(Vision-only WM)\nThis paper's baseline"]
    C --> D["VT-WM\n(This Paper)\n+ Tactile Sensing\nContact-Rich Tasks"]
    E["UniSim\n(Yang et al., 2024)\nDiffusion-based\nVideo Prediction"] --> D
    F["Genie 2\n(Google)\nInteractive World Sim"] --> D
    G["Sparsh/Sparsh-X\n(Higuera et al., 2025)\nTactile Foundation Model"] --> D
    H["Cosmos Tokenizer\n(NVIDIA, 2025)\nVideo Tokenizer"] --> D

Differences From Similar Work

Paper                          Main modality      Usage                    Planning
-------------------------------------------------------------------------------------
DayDreamer (Wu et al.)         Vision             RL (Dreamer)             No direct planning
NeuralFeels (Suresh et al.)    Vision + Tactile   Pose/Shape Estimation    No planning
DexWM (2025)                   Vision             Zero-shot planning       Yes (vision only)
ViTaS (2026)                   Vision + Tactile   Policy learning          No WM
VT-WM (This)                   Vision + Tactile   World Model + Planning   Yes

The originality of VT-WM is that it is "the first work to integrate touch into a world model's imagination and to demonstrate that this integration improves planning performance". Earlier work used touch for policy learning or state estimation; focusing on the prediction and imagination quality of a world model is new.

Compared With the Dreamer Family

DreamerV3 trains an RSSM with a pixel-reconstruction objective. VT-WM works in a broadly similar latent-imagination setup and adds the objective of predicting vision and touch jointly. The key difference is that the tactile prediction objective forces the model to encode contact dynamics in its latent representation. This is why the change is not just one more modality but a qualitative change in the representation.

Summary and Conclusion

The message of VT-WM is simple and strong.

"Imagination must follow the physics of reality. And physics, especially the physics of contact, cannot be fully represented without touch."

The paper's contributions are:

  1. It proposes the first multi-task visuo-tactile world model.
  2. It demonstrates quantitatively that integrating touch improves the physical fidelity of the WM's imagination (Object Permanence +33%, Laws of Motion +29%).
  3. It confirms with zero-shot experiments that better imagination translates into real planning performance (+35%).
  4. It confirms that multi-task pre-training gives a 3.5× advantage over BC when adapting from a few demonstrations.

Open problems remain: the limits of open-loop planning, sensor dependence, and scalability to long-horizon planning. Even so, this work is an important milestone showing that, in the world-model paradigm for robot manipulation, touch is a necessity rather than an option.

Just as there is no manipulation without contact, a world model without touch is incomplete. VT-WM is the first systematic attempt to close this gap, and research in this direction can be expected to accelerate.


References

  • Higuera, C., Arnaud, S., Boots, B., Mukadam, M., Hogan, F., Meier, F. (2026). Visuo-Tactile World Models. arXiv:2602.06001. Submitted to ICLR 2026.
  • Hafner, D. et al. (2023). Mastering Diverse Domains through World Models. Nature (2025).
  • Higuera, C., et al. (2025). Sparsh-X: Tactile Foundation Model.
  • Agarwal et al. (2025). Cosmos: World Foundation Models.
  • Lambeta, M. et al. (2024). Digitizing Touch with an Artificial Multimodal Fingertip (Digit 360).
  • Yuan, W. et al. (2017). GelSight: High-resolution Robot Tactile Sensors. Sensors.
  • Suresh, S. et al. (2024). Neural Feels with Neural Fields: Visuo-Tactile Perception for In-Hand Manipulation. Science Robotics.

Copyright 2026, JungYeon Lee