Curieux.JY
  • JungYeon Lee

📃 SWM Review

vla
world-model
Semantic World Models
Published

December 23, 2025

🔍 Ping. 🔔 Ring. ⛏️ Dig. A tiered review series: quick look, key ideas, deep dive.

  • Paper Link
  • Project
  1. ✨ This paper proposes Semantic World Models (SWM), which recast world modeling as a visual question answering (VQA) problem about future outcomes instead of reconstructing the pixels of future frames.
  2. 🤖 SWM fine-tunes pretrained Vision-Language Models (VLMs) on image-action-text data to predict the semantic effects of actions, and it enables policy optimization by combining a defined QA set with sampling-based or gradient-based planning.
  3. 🚀 In experiments on LangTable and OGBench, SWM substantially outperforms pixel-based world models and offline RL baselines, and it generalizes well to novel and out-of-distribution scenes.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

This paper proposes Semantic World Models (SWM), a new paradigm for world modeling in robot control. Existing world models focus on predicting future frames pixel by pixel, but that objective often conflicts with the actual planning goal, and pixel reconstruction can miss the semantic details that planning decisions depend on.

Core idea and methodology:

The paper hypothesizes that a world model does not need to reconstruct future frames in pixel space; it only needs to predict the semantic information relevant to the task. To that end, it recasts world modeling as a Visual Question Answering (VQA) problem about future frames. In other words, the model predicts future outcomes by answering "yes" or "no" to questions such as "Did the arm get closer to the object?" or "Did the red cube tip over?"

This framing makes it possible to exploit the strong pretraining knowledge and generalization ability of Vision-Language Models (VLMs). SWM is built on an existing VLM (for example, PaliGemma), adds action conditioning, and is fine-tuned to answer questions about future events.

SWM architecture and training:

SWM has the following structure:

  1. VLM backbone: A pretrained VLM such as PaliGemma provides an image encoder (v_\phi) and a language model (LLM). Image-encoder features are mapped into the LLM embedding space by a projection matrix W \in \mathbb{R}^{d_{tok} \times d_{img}}.
  2. Action conditioning: To feed the action sequence a_{i:j} into the model, a new linear projection matrix P \in \mathbb{R}^{d_{tok} \times d_{act}} is introduced, which projects each action a \in \mathbb{R}^{d_{act}} into the LLM embedding space.
  3. Input construction: The model receives the current observation S_i (an RGB frame), the proposed action sequence a_{i:j}, and a natural-language query about the future Q_{S_j}. These are assembled into one concatenated embedding sequence: \text{concat}(W^\top v_\phi(S_i), P^\top a_i, P^\top a_{i+1}, \dots, P^\top a_j, Q_{S_j})
  4. Training objective: The model is fine-tuned end-to-end to predict the target answer A_{S_j} for the given input. The objective is the standard cross-entropy loss: \mathcal{L} = -\log p(A_{S_j} | S_i, a_{i:j}, Q_{S_j}) Through this procedure, SWM learns the dynamics of the environment in language space and can answer questions about future states without explicitly generating a pixel-level representation.

Dataset generation (SAQA):

To train SWM, a state-action-question-answer (SAQA) dataset is generated: \mathcal{D}_{SAQA} = \{(S_i, a_{i:j}, Q_{S_j}, A_{S_j}), \dots \} with j = i + h. Here S_i is the current state (an RGB frame), h is the prediction horizon, a_{i:j} is the action sequence taken from S_i, and Q_{S_j} and A_{S_j} are a question-answer pair about the future state S_j. The data is derived from trajectory data, and the questions are generated programmatically using privileged information such as object positions.

Planning with SWM:

SWM can be used with the following two planning methods:

  1. Sampling-Based Planning: A method such as Model Predictive Path Integral (MPPI) maintains a distribution over action sequences and refines it iteratively. The value of each sampled trajectory (a^{(k)}) is computed from how likely SWM is to produce the desired answers. A task T is defined as a set of questions, desired answers, and weights: T := \{(Q_i, A^*_i, W_i)\}_{i=1}^k The value function for an observation S and an action sequence a_{1:n} is: V_T(S, a_{1:n}) = \sum_{i=1}^k W_i \cdot p_{\text{wm}}(A^*_i | S, a_{1:n}, Q_i) In addition, the action sequence can be split into sub-chunks of length c to provide an early reward: V_{T,c}(S, a_{1:n}) = \sum_{i=1}^k \sum_{j \in \{c, 2c, \ldots, n\}} W_i \cdot p_{\text{wm}}(A^*_i | S, a_{1:j}, Q_i)
  2. Gradient-Based Planning: Proposed to reduce the computational cost of sampling-based planning with large models. A candidate trajectory drawn from a base policy (a \sim \pi_b(S)) is refined using SWM and gradient-based optimization. The goal is to find the action sequence a that maximizes the value function V_{T,c}(S, a): J_T(a) = V_{T,c}(S, a)

Experimental results:

Evaluated in the LangTable and OGBench simulation environments, SWM showed substantial gains over an existing pixel-based world model (AVD) and an offline RL baseline (IDQL).

  • SWM answered future QA questions accurately and generalized to new scenes.
  • Relative to the base policy, average success rose from 14.4% to 81.6% on LangTable and from 45.33% to 76% on OGBench.
  • Mixing suboptimal data into training improved model performance, and SWM reached a reasonable level of performance even with suboptimal data alone.
  • SWM retained the generalization ability of the pretrained VLM (for example, compositional generalization and robustness to background changes) and performed strongly in out-of-distribution (OOD) settings.
  • Visualizing the model's attention maps confirmed that SWM attends to the task-relevant regions of the image according to the language prompt.

Conclusion and limitations:

SWM presents a new approach that models future outcomes explicitly in question-answer form and removes the need to reconstruct pixel-level information. It outperformed existing pixel-based world modeling and offline RL methods, but because of the large parameter count of the VLM, sample-based planning is computationally expensive on a single GPU. Gradient-based planning is more efficient, but it requires a base policy to propose the initial trajectory. In addition, building the SAQA dataset requires ground-truth information from the simulator, which is a challenge for deployment on real robots. As future work, the paper suggests using smaller VLMs and generating QA pairs with a VLM itself instead of relying on oracle-generated QA.

🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Introduction: "Predict the meaning of the future, not its pixels"

What do we actually want in robotics? When a robot reaches for a cup, do we want it to predict the camera image one second from now perfectly? No. What we really want to know is a single thing: "If I execute this motion, will I end up holding the cup?"

This is the core insight of the Semantic World Models (SWM) paper. Just as Richard Feynman cut to the essence when explaining quantum mechanics, this paper cuts to the essence of what a world model is for. Rather than solving the hard problem of pixel reconstruction, it predicts only the semantic information that is actually needed.

The heart of the problem: the limits of pixel prediction

Existing world models are like a student trying to memorize the entire exam syllabus. They try to predict "current frame + action → future frame". The problems with this approach are clear:

  1. The computational cost is enormous: generating high-resolution video requires a huge amount of compute.
  2. It misses what matters: however realistic the generated image, it may fail to capture key information such as whether two objects made contact.
  3. The objective is misaligned with planning: better pixel reconstruction quality does not directly translate into a better ability to make decisions.
────────────────────────────────────────────────────────────────────
  The dilemma of existing world models
────────────────────────────────────────────────────────────────────
  Input:   current image + action sequence
  Output:  future image (millions of pixels)

  Problem: busy predicting every single pixel, the model fails to
           answer the key question, e.g. "Did the block tip over?"
────────────────────────────────────────────────────────────────────

The paper's core idea: world modeling as VQA

The core idea of SWM can be summarized in one sentence:

"Recast world modeling as a Visual Question Answering (VQA) problem about the future."

This is like understanding the important concepts for an exam instead of memorizing the whole textbook. What the robot needs is not "every future pixel" but "the answers to the key questions about the future".

graph LR
    subgraph G1["Conventional video world model"]
        A1[Current image] --> B1[Video Prediction Model]
        C1[Action sequence] --> B1
        B1 --> D1["Future image (millions of pixels)"]
    end

    subgraph G2["Semantic World Model"]
        A2[Current image] --> B2["SWM (VLM-based)"]
        C2[Action sequence] --> B2
        E2["Question: did the blocks touch?"] --> B2
        B2 --> D2["Yes or No"]
    end


Methodology: turning a VLM into a world model

Dataset construction: SAQA (State-Action-Question-Answer)

To train SWM, the authors propose a distinctive dataset format. Instead of the traditional (state, action, next state) format, they use the SAQA format:

\mathcal{D}_{SAQA} = \{(S_i, a_{i:j}, Q_{S_j}, A_{S_j}), \ldots\} \quad \text{where } j = i + h

Each element means the following:

Symbol | Meaning | Example
S_i | Current state (RGB image) | Image of the blocks on the table
a_{i:j} | Action sequence | xy motion commands for the robot arm
h | Prediction horizon | 0 to 20 steps
Q_{S_j} | Question about the future state | "Is the red star touching the blue cube?"
A_{S_j} | Correct answer to that question | "Yes" or "No"

The appeal of this dataset is that it exploits the simulator's privileged information. Because the simulator knows the exact position of every object, question-answer pairs can be generated automatically by a program.

Variety of question types

The question types used in the paper include:

LangTable environment:

  • Block contact: "Is the red star touching the blue cube?"
  • Robot-block distance: "Is the green cube next to the peg?"
  • Position: "Is the red star in the center of the board?"
  • Relative direction: "Is the peg above the red cube?"
  • Direction of motion: "Did the red cube move left?"
  • Change in proximity: "Are the red star and blue cube closer together?"

OGBench environment:

  • Grasp status: "Is the red cube grasped by the robot?"
  • Contact check: "Is the blue cube touching the robot gripper?"
  • Stacking status: "Is the red cube on top of the blue cube?"
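The programmatic QA generation described above can be illustrated with a minimal sketch. It assumes the simulator exposes 2-D object positions as a dictionary; the function name `generate_qa_pairs`, the `touch_dist` threshold, and the coordinates are hypothetical and are not taken from the paper.

```python
import math

def generate_qa_pairs(positions, touch_dist=0.05):
    """Build yes/no contact questions from privileged object positions.

    positions: dict mapping object name -> (x, y) in simulator coordinates.
    touch_dist: hypothetical contact threshold (an assumption of this sketch).
    """
    qa = []
    names = sorted(positions)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dist = math.dist(positions[a], positions[b])
            answer = "Yes" if dist <= touch_dist else "No"
            qa.append((f"Is the {a} touching the {b}?", answer))
    return qa

# Toy future state: the red star sits right next to the blue cube
state = {
    "red star": (0.10, 0.20),
    "blue cube": (0.12, 0.21),
    "green cube": (0.50, 0.40),
}
pairs = generate_qa_pairs(state)
for question, answer in pairs:
    print(question, answer)
```

Other question types (direction of motion, change in proximity) would compare positions at two time steps instead of one, following the same pattern.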

Model architecture: adding action conditioning to PaliGemma

SWM is built on PaliGemma (3B parameters), an existing VLM. PaliGemma consists of:

  1. Gemma LLM: a transformer-based language model (token embedding dimension: d_{tok})
  2. SigLIP vision encoder (v_\phi): converts an image into feature vectors (feature dimension: d_{img})
  3. Projection matrix (W \in \mathbb{R}^{d_{tok} \times d_{img}}): projects image features into the language model's space

The key addition is an action projection matrix:

P \in \mathbb{R}^{d_{tok} \times d_{act}}

This matrix projects each action a \in \mathbb{R}^{d_{act}} into the token embedding space of the language model. Actions are injected in the same way that image tokens enter the language model.

graph TB
    subgraph Input["Input processing"]
        IMG["Current image S_i"] --> VE["SigLIP Vision Encoder"]
        VE --> IMGF["Image features v_φ(S_i)"]
        IMGF --> WPROJ["W projection"]

        ACT["Action sequence a_i...a_j"] --> APROJ["P projection"]

        Q["Question Q_Sj"] --> TOK[Tokenization]
    end

    subgraph TokenSeq["Token sequence construction"]
        WPROJ --> CONCAT[Concatenate]
        APROJ --> CONCAT
        TOK --> CONCAT
        CONCAT --> SEQ["Image/action/question tokens"]
    end

    subgraph LM["Language model"]
        SEQ --> GEMMA[Gemma LLM]
        GEMMA --> ANS["Answer A_Sj"]
    end

Building the final input sequence

For a data tuple (S_i, a_{i:j}, Q_{S_j}, A_{S_j}), the model's input sequence is constructed as:

\text{concat}\left(W^\top v_\phi(S_i), P^\top a_i, P^\top a_{i+1}, \ldots, P^\top a_j, Q_{S_j}\right)

Training uses the standard cross-entropy loss:

\mathcal{L} = -\log p(A_{S_j} | S_i, a_{i:j}, Q_{S_j})

The elegance of this design is that it largely preserves the pretrained knowledge of the underlying VLM. By modeling dynamics in language space instead of reconstructing pixels, the world knowledge that the VLM learned from internet-scale data can be transferred to robot control.
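As a rough illustration of the sequence construction and loss above, the following NumPy sketch assembles image, action, and question tokens and evaluates the cross-entropy for a yes/no answer. All sizes are toy values rather than PaliGemma's real dimensions, the random arrays stand in for real encoder outputs, the answer logits are hypothetical, and the projection is applied row-wise (one row per token), which is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_tok, d_act = 8, 16, 2            # toy sizes (assumed)
n_patches, n_actions, n_q_tokens = 4, 3, 5

W = rng.normal(size=(d_tok, d_img))       # image projection
P = rng.normal(size=(d_tok, d_act))       # action projection (newly added)

img_feats = rng.normal(size=(n_patches, d_img))   # stand-in for v_phi(S_i)
actions = rng.normal(size=(n_actions, d_act))     # a_i ... a_j
q_embeds = rng.normal(size=(n_q_tokens, d_tok))   # embedded question tokens

# Each feature row and each action row becomes one d_tok-dimensional token
img_tokens = img_feats @ W.T
act_tokens = actions @ P.T
sequence = np.concatenate([img_tokens, act_tokens, q_embeds], axis=0)

def cross_entropy(logits, target_index):
    """Return -log softmax(logits)[target_index], computed stably."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target_index]

answer_logits = np.array([2.0, 0.5])      # hypothetical logits for ["Yes", "No"]
loss = cross_entropy(answer_logits, target_index=0)
print(sequence.shape, float(loss))
```

In the real model the language model would consume `sequence` and produce the answer logits; here they are fixed by hand only to show the loss computation.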


Planning: choosing actions with a Semantic World Model

If SWM can answer questions about the future, how do we use it for robot control?

Defining the value function

Each task is defined as a set of question-answer-weight triples:

T := \{(Q_i, A_i^*, W_i)\}_{i=1}^k

For example, the task "push the red block to the blue block" could be:

Question | Desired answer | Weight
"Is the red block touching the blue block?" | Yes | 0.8
"Did the red block get closer to the blue block?" | Yes | 0.2

For a state S and an action sequence a_{1:n}, the value function is:

V^T(S, a_{1:n}) = \sum_{i=1}^{k} W_i \cdot p_{wm}(A_i^* | S, a_{1:n}, Q_i)
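A small worked example may make the weighted sum concrete. The sketch below evaluates V^T for the two-question task above; `prob_fn` stands in for the SWM, and the probabilities in `mock_probs` are made-up numbers for a single candidate action sequence.

```python
def task_value(task, prob_fn):
    """V^T = sum_i W_i * p_wm(A_i* | S, a, Q_i).

    task: list of (question, desired_answer, weight) triples.
    prob_fn: callable (question, answer) -> probability; a stand-in for the SWM.
    """
    return sum(weight * prob_fn(question, answer)
               for question, answer, weight in task)

task = [
    ("Is the red block touching the blue block?", "Yes", 0.8),
    ("Did the red block get closer to the blue block?", "Yes", 0.2),
]

# Hypothetical model outputs: contact is unlikely, but the blocks do get closer
mock_probs = {task[0][0]: 0.25, task[1][0]: 0.9}
value = task_value(task, lambda question, answer: mock_probs[question])
print(round(value, 2))
```

The lower-weight "closer" question acts as a shaping term: it gives the planner a signal even when contact is still unlikely.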

Early Reward: encouraging faster goal completion

The authors make an interesting observation: rewarding the planner for reaching the goal earlier improves performance. To do this, the action sequence is split into chunks and evaluated incrementally:

V^{T,c}(S, a_{1:n}) = \sum_{i=1}^{k} \sum_{j \in \{c, 2c, \ldots, n\}} W_i \cdot p_{wm}(A_i^* | S, a_{1:j}, Q_i)

Here c is the chunk size. With c=1 the sequence is evaluated after every action, and with c=n the full sequence is evaluated only once.
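To see why the chunked sum favors reaching the goal early, here is a minimal sketch under toy assumptions: `prob_fn` is a stand-in for the SWM that returns 1.0 once the action prefix is long enough, and `make_prob_fn` is a hypothetical helper introduced only for this illustration.

```python
def chunked_value(task, prob_fn, actions, c):
    """V^{T,c}: sum the task value over prefixes a_{1:j}, j = c, 2c, ..., up to n."""
    n = len(actions)
    total = 0.0
    for j in range(c, n + 1, c):
        prefix = actions[:j]
        total += sum(weight * prob_fn(prefix, question, answer)
                     for question, answer, weight in task)
    return total

task = [("Is the red block touching the blue block?", "Yes", 1.0)]

def make_prob_fn(reach_step):
    # Toy stand-in for the SWM: the goal counts as reached once the
    # prefix contains at least `reach_step` actions.
    return lambda prefix, question, answer: 1.0 if len(prefix) >= reach_step else 0.0

actions = list(range(8))  # placeholder actions; only the sequence length matters here
early = chunked_value(task, make_prob_fn(2), actions, c=2)  # goal reached after 2 actions
late = chunked_value(task, make_prob_fn(8), actions, c=2)   # goal reached only at the end
print(early, late)
```

Both sequences reach the goal by step 8, but the one that reaches it after two actions collects the reward at every later checkpoint and therefore scores higher.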

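Method 1: sample-based planning (MPPI)

The Model Predictive Path Integral (MPPI) algorithm is used:

# MPPI planning sketch (NumPy). compute_value is an assumed helper that returns
# the SWM value V^T for one action sequence; it is not defined in this post.
import numpy as np

def mppi_planning(swm_model, current_state, task_spec, horizon, act_dim,
                  a_min, a_max, num_samples=64, num_iterations=10,
                  temperature=1.0, seed=0):
    rng = np.random.default_rng(seed)
    shape = (num_samples, horizon, act_dim)
    mean = std = None

    for iteration in range(num_iterations):
        # 1-2. Sample K action sequences: uniform on the first pass,
        #      then from the refined Normal(mean, std)
        if mean is None:
            action_sequences = rng.uniform(a_min, a_max, size=shape)
        else:
            action_sequences = np.clip(rng.normal(mean, std, size=shape),
                                       a_min, a_max)

        # 3. Score every sequence with the SWM
        values = np.array([compute_value(swm_model, current_state,
                                         actions, task_spec)
                           for actions in action_sequences])

        # 4. Update the distribution with the softmax-weighted mean and variance
        weights = np.exp((values - values.max()) / temperature)
        weights /= weights.sum()
        mean = np.einsum("k,kta->ta", weights, action_sequences)
        var = np.einsum("k,kta->ta", weights, (action_sequences - mean) ** 2)
        std = np.sqrt(var) + 1e-6  # Normal takes a standard deviation, not a variance

    return mean  # final action sequence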

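Weight computation, where V_k is the value of the k-th sampled sequence and \lambda is the temperature:

\omega_k = \frac{\exp(V_k/\lambda)}{\sum_{j=1}^{K}\exp(V_j/\lambda)}

\mu_t = \sum_{k=1}^{K} \omega_k \, a_t^{(k)}

\sigma_t^2 = \sum_{k=1}^{K} \omega_k (a_t^{(k)} - \mu_t)^2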

Method 2: gradient-based planning

MPPI is computationally expensive with large models. As a more efficient alternative, the paper proposes gradient-based optimization:

  1. Sample a candidate trajectory from a base policy \pi_b: a \sim \pi_b(S)
  2. Perform gradient ascent on the objective:

J^T(a) = V^{T,c}(S, a)

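# Gradient-based planning sketch (PyTorch). compute_value is an assumed
# differentiable helper that returns the scalar SWM value V^{T,c}.
import torch

def gradient_planning(swm_model, base_policy, current_state,
                      task_spec, num_iterations=10, lr=0.02, max_norm=1.0):
    # 1. Sample the initial trajectory from the base policy
    actions = base_policy(current_state).detach().clone().requires_grad_(True)

    for iteration in range(num_iterations):
        # 2. Compute the value function
        value = compute_value(swm_model, current_state,
                              actions, task_spec)

        # 3. Compute the gradient (autograd.grad returns a tuple)
        (grad,) = torch.autograd.grad(value, actions)
        # Clip the gradient norm to max_norm
        scale = torch.clamp(max_norm / (grad.norm() + 1e-8), max=1.0)
        # Gradient ascent step; detach so the next iteration starts from a leaf
        actions = (actions + lr * scale * grad).detach().requires_grad_(True)

    return actions.detach()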

Advantages of this method:

  • Directed optimization: moves directly along the gradient instead of sampling at random
  • Fast convergence: converges with far fewer model evaluations than sample-based methods
  • Efficiency: only a single trajectory is optimized, so it is memory-efficient

graph LR
    subgraph MPPI["Sample-based (MPPI)"]
        A1[Sample K trajectories] --> B1[Evaluate all trajectories]
        B1 --> C1[Weighted average]
        C1 --> D1[Update distribution]
        D1 --> A1
    end

    subgraph Gradient["Gradient-based"]
        A2[Initialize a single trajectory] --> B2[Compute value function]
        B2 --> C2[Compute gradient]
        C2 --> D2[Update actions]
        D2 --> B2
    end

Planning speed comparison

Method | Time per action chunk
AVD (Action-conditioned Video Diffusion) | 676.41 s
MPPI | 4.48 s
Gradient-based | 1.56 s

The gradient-based method is roughly 430× faster than AVD.
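The speedup figure follows directly from the timings in the table. A quick check, using only the three numbers reported above:

```python
# Seconds per action chunk, as listed in the planning speed table
times = {"AVD": 676.41, "MPPI": 4.48, "Gradient-based": 1.56}

speedup_vs_avd = {name: times["AVD"] / t for name, t in times.items()}
mppi_vs_gradient = times["MPPI"] / times["Gradient-based"]

for name, factor in speedup_vs_avd.items():
    print(f"{name}: {factor:.1f}x faster than AVD")
print(f"Gradient-based vs MPPI: {mppi_vs_gradient:.2f}x")
```

The exact ratio is a little above 433, which the post rounds to 430×; gradient-based planning is also close to 3× faster than MPPI.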

Multi-step tasks: subgoal chaining

For long-horizon tasks, subgoal chaining is used:

  1. Define a sequence of subgoals: g_1, g_2, \ldots, g_T
  2. Assign question-answer pairs to each subgoal
  3. Execute the subgoals in order, using SWM to check whether each one is complete
  4. Once a subgoal is complete, switch to the next one

For example, a "stack the cubes" task:

  • Subgoal 1: "Has the robot grasped the first cube?" → Yes
  • Subgoal 2: "Is the first cube on top of the second cube?" → Yes
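The chaining procedure can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: `plan_and_act` and `is_done` are hypothetical callables (in practice `is_done` might threshold the SWM's answer probability), and the toy environment below just counts progress.

```python
def run_subgoal_chain(subgoals, plan_and_act, is_done, max_steps_per_goal=10):
    """Pursue subgoals in order and return how many were completed.

    subgoals: list of (question, desired_answer) pairs.
    plan_and_act: callable(subgoal) that plans toward the subgoal and acts.
    is_done: callable(subgoal) -> bool, the completion check.
    """
    completed = 0
    for goal in subgoals:
        steps = 0
        while not is_done(goal):
            if steps == max_steps_per_goal:
                return completed  # give up on this subgoal and stop the chain
            plan_and_act(goal)
            steps += 1
        completed += 1
    return completed

subgoals = [
    ("Has the robot grasped the first cube?", "Yes"),
    ("Is the first cube on top of the second cube?", "Yes"),
]

# Toy environment: each subgoal needs three planning rounds to complete
progress = {goal: 0 for goal in subgoals}

def plan_and_act(goal):
    progress[goal] += 1

def is_done(goal):
    return progress[goal] >= 3

completed = run_subgoal_chain(subgoals, plan_and_act, is_done)
print(completed, progress)
```

The step budget per subgoal is a design choice of this sketch; it prevents the loop from stalling forever on a subgoal the planner cannot reach.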

Experiments and results

Experimental setup

1. LangTable

  • A robot arm manipulates blocks on a tabletop
  • 180×320 RGB image observations
  • xy delta-pose actions (range: -0.03 to 0.03)
  • Control frequency: 10 Hz

2. OGBench

  • A robot gripper manipulates cubes
  • 224×224 RGB image observations
  • 5-dimensional actions (xyz delta, orientation, gripper)
  • Control frequency: 10 Hz

Baselines

  1. IDQL: offline RL based on Implicit Q-Learning
  2. AVD (Action-conditioned Video Diffusion): a pixel-based world model. It predicts future frames and then runs VQA on them with SWM

Key result 1: planning performance

Sample-based planning (MPPI):

Task | Success rate
LT Reaching | 100%
LT Block Separation | 100%
OG Reaching | 97%

Policy improvement with gradient-based planning:

Task | Base Policy | IDQL | AVD | SWM
Green Cube → Blue Moon | 6% | 8% | 48% | 78%
Red Moon → Green Star | 18% | 8% | 44% | 80%
Red Pentagon → Blue Moon | 14% | 12% | 38% | 80%
Yellow Pentagon → Red Moon | 18% | 8% | 34% | 86%
Yellow Star → Blue Cube | 16% | 10% | 62% | 84%
Blue Cube on Yellow Cube | 52% | 8% | 50% | 82%
Blue Cube on Green Cube | 44% | 16% | 46% | 84%
Yellow Cube on Red Cube | 40% | 24% | 44% | 62%

Average improvement over the base policy:

  • LangTable: 14.4% → 81.6% (5.7× improvement)
  • OGBench: 45.3% → 76.0% (1.7× improvement)
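These averages can be reproduced from the per-task rows of the gradient-based results table. The sketch below assumes the first five rows are the LangTable tasks and the three "on" stacking rows are the OGBench tasks, which matches the reported means.

```python
# (base policy %, SWM %) per task, copied from the gradient-based results table
langtable = [(6, 78), (18, 80), (14, 80), (18, 86), (16, 84)]
ogbench = [(52, 82), (44, 84), (40, 62)]

def mean_pair(rows):
    """Return (mean base success, mean SWM success) in percent."""
    n = len(rows)
    return sum(base for base, _ in rows) / n, sum(swm for _, swm in rows) / n

lt_base, lt_swm = mean_pair(langtable)
og_base, og_swm = mean_pair(ogbench)
lt_ratio = lt_swm / lt_base
og_ratio = og_swm / og_base

print(f"LangTable: {lt_base:.1f}% -> {lt_swm:.1f}% ({lt_ratio:.1f}x)")
print(f"OGBench:   {og_base:.1f}% -> {og_swm:.1f}% ({og_ratio:.1f}x)")
```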

Key result 2: multi-step tasks

Task | Base Policy | AVD | SWM
MS1: Red pentagon → Blue moon, Yellow pentagon → Red moon | 6% | 8% | 50%
MS2: Yellow star → Blue cube, Yellow pentagon → Red moon | 4% | 2% | 66%
MS3: Yellow star → Blue cube, Red pentagon → Blue moon | 4% | 2% | 54%
MS4: Green cube → Blue moon, Yellow pentagon → Red moon | 2% | 4% | 54%

On average this is a 52-percentage-point improvement over the base policy (4% → 56%).

Key result 3: the value of suboptimal data

One of the key advantages of a world model is that it can also learn from suboptimal (non-expert) data:

Dataset type | LangTable (ID) | LangTable (OOD) | OGBench (ID) | OGBench (OOD)
Suboptimal only | 85.98% | 81.99% | 90.83% | 85.56%
Expert only | 91.27% | 86.49% | 96.53% | 87.33%
Mixed | 92.92% | 88.32% | 96.86% | 88.16%

Notably, mixing in suboptimal data improves performance over using expert data alone.

A likely explanation is that suboptimal data covers a wider range of situations (including failures), so the model learns more robust predictions.

Key result 4: generalization

Compositional Generalization:

  • Color-shape combinations not seen during training (for example, a purple pentagon)
  • Average improvement of 20% over the Base Policy

Background Robustness:

  • OGBench background colors changed to new combinations
  • Average improvement of 15-20% over the Base Policy

This shows that SWM effectively preserves and exploits the pretrained knowledge of the VLM.

Attention map visualization

──────────────────────────────────────────────────────────────
  Question: "Is the red moon touching the blue cube?"
──────────────────────────────────────────────────────────────
  Layers 4, 6: focus on the red moon and the blue cube
  Layers 8+:   also attend to the robot arm (peg)

  → The model understands the question and attends to the relevant objects
  → It attends correctly even for three-object questions never seen in training
──────────────────────────────────────────────────────────────

Critical discussion: strengths and limitations

Strengths

1. Conceptual elegance

  • Sidesteps the hard problem of pixel prediction
  • The principle of "predict only what is needed" is clear
  • Makes natural use of the VLM's pretrained knowledge

2. Computational efficiency

  • Gradient-based planning is about 430× faster than the video-based baseline
  • Prediction in language space is far lighter than pixel generation

3. Data efficiency

  • Suboptimal data can be used effectively
  • Data is generated automatically from the simulator's privileged information

4. Generalization

  • Compositional generalization (new color-shape combinations)
  • Robust to background changes
  • Transfers the VLM's world knowledge

5. Flexible task definition

  • Tasks are defined with natural-language questions
  • No complex reward engineering required

Limitations

1. Simulator dependence

  • Generating the SAQA dataset requires the simulator's privileged information
  • QA pairs are hard to obtain in real robot environments

2. Base policy required

  • Gradient-based planning needs a base policy
  • The quality of the base policy determines the starting point of the optimization

3. Model size

  • With 3B parameters, real-time control frequencies are hard to reach
  • Sample-based planning is impractical on a single GPU

4. Limits of yes/no questions

  • Only binary questions are currently supported
  • Predicting continuous values (distance, angle, and so on) is limited

5. Complexity of long-horizon tasks

  • Subgoals must be defined by hand
  • No mechanism for automatic subgoal discovery

Open questions

  1. Scaling laws: does a larger VLM make a better SWM?
  2. Real-world transfer: will an SWM trained in simulation work on a real robot?
  3. Continuous outputs: can the model predict continuous values instead of yes/no?
  4. Multimodal inputs: can other modalities such as force/torque sensors be integrated?

Comparison with related work

Comparison with Vision-Language-Action (VLA) models

Property | VLA (e.g., OpenVLA) | SWM
Input | Image + language instruction | Image + actions + question
Output | Actions | Language (Yes/No)
Purpose | Direct action generation | Predicting the outcome of actions
Pretraining preservation | May be lost when converting to action tokens | Better preserved thanks to language output

SWM can be viewed as an "inverted" VLA: actions become an input rather than the output, and the output is language rather than actions.

Comparison with existing world models

Property | DreamerV3 | TD-MPC2 | UniPi | SWM
Prediction target | Latent state | Latent state | Video | Semantic information
Requires reward | Yes | Yes | No | No
Uses a VLM | No | No | No | Yes
Generalization | Limited | Limited | Limited | High

Comparison with UniPi

UniPi uses a video-prediction world model as a high-level planner. The difference from SWM:

  • UniPi: predicts in pixel space → conditions a low-level policy
  • SWM: predicts in semantic space → provides a direct planning signal

Applications and directions for extension

A roadmap toward real-robot deployment

graph TD
    subgraph Current["Current status"]
        A[Validated in simulation]
    end

    subgraph ShortTerm["Short-term goals"]
        B["Use smaller VLMs (FastVLM, SmolVLM)"]
        C[Reach real-time control frequency]
        D[Validate sim-to-real transfer]
    end

    subgraph MidTerm["Mid-term goals"]
        E[Generate QA pairs automatically with a VLM]
        F[Integrate real-world data]
        G[Extend to multimodal inputs]
    end

    subgraph LongTerm["Long-term vision"]
        H[General-purpose robot world model]
        I[Automatic subgoal discovery]
        J[Language-conditioned manipulation]
    end

    A --> B
    A --> D
    B --> C
    C --> F
    D --> F
    E --> F
    F --> H
    G --> H
    H --> I
    H --> J

Application to multi-fingered manipulation such as the Allegro Hand

The SWM approach looks especially promising for dexterous-hand manipulation:

  1. Predicting contact state: "Did the thumb touch the object?", "Is the object grasped stably?"
  2. Reasoning about force distribution: "Was an appropriate grasp force applied?"
  3. Evaluating manipulation strategies: "Will this motion rotate the object?"

Considerations when applying it:

  • Scaling to high-dimensional action spaces (20+ DoF)
  • Expressing tactile information in language
  • Fast control-loop requirements (>100 Hz)

Integration with reinforcement learning

SWM could open a new paradigm for model-based RL:

  1. Reward Shaping: use SWM predictions as a reward signal
  2. Curiosity-driven Exploration: use "outcomes that differ from the prediction" as an exploration signal
  3. Hindsight Experience Replay: automatically label failed experience with "what was actually achieved?"

Implementation details (for practitioners)

Model training

# Key hyperparameters
config = {
    "base_model": "PaliGemma-3B",
    "learning_rate": 1e-5,  # linear decay
    "batch_size": 96,  # effective batch size
    "training_steps": 24000,  # LangTable
    # "training_steps": 64000,  # OGBench
    "action_projection_dim": "act_dim × 2048",
    "optimizer": "AdamW",
    "full_weight_finetuning": True,
}

Dataset construction

# SAQA dataset generation logic (sketch). sample_horizons, generate_qa_pairs,
# and balance_dataset are assumed helpers; each trajectory is assumed to be a
# list of steps with .state (with an .image attribute) and .action.
def generate_saqa_dataset(trajectories):
    dataset = []
    for trajectory in trajectories:
        for i, step in enumerate(trajectory):
            # Sample several horizons
            for h in sample_horizons(0, 20, num_samples=4):
                j = i + h
                if j >= len(trajectory):
                    continue  # the future state would fall past the end
                future_state = trajectory[j].state

                # Generate question-answer pairs (using simulator information)
                qa_pairs = generate_qa_pairs(future_state)

                for question, answer in qa_pairs:
                    dataset.append({
                        "current_state": step.state.image,
                        "actions": [t.action for t in trajectory[i:j]],
                        "question": question,
                        "answer": answer,
                    })

    # Balance question types and the answer distribution
    return balance_dataset(dataset)

Planning settings

# LangTable planning settings
langtable_config = {
    "action_chunk_size": 8,
    "gradient_lr": 0.02,
    "planning_iterations": 10,
    "execute_actions": 4,  # execute 4 of the 16 planned actions, then replan
    "gradient_clip": 1.0,
}

# OGBench planning settings
ogbench_config = {
    "action_chunk_size": 8,
    "gradient_lr": 0.2,
    "planning_iterations": 20,
    "execute_actions": 4,
    "gradient_clip": 10.0,
}
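The `execute_actions` entry implies a receding-horizon loop: plan a chunk of actions, execute only the first few, then replan from the new observation. The sketch below shows that control flow only. `plan_chunk` and `step_env` are hypothetical callables standing in for the SWM-based planner and the simulator; neither name comes from the paper.

```python
from typing import Callable, List, Sequence

Action = Sequence[float]

def receding_horizon_control(
    plan_chunk: Callable[[object], List[Action]],  # hypothetical planner: observation -> action chunk
    step_env: Callable[[Action], object],          # hypothetical env step: action -> next observation
    initial_obs: object,
    execute_actions: int = 4,
    max_env_steps: int = 12,
) -> int:
    """Plan a chunk, execute its first `execute_actions` actions, then replan.

    Returns the number of planning calls that were made.
    """
    if execute_actions <= 0:
        raise ValueError("execute_actions must be positive")
    obs, steps, replans = initial_obs, 0, 0
    while steps < max_env_steps:
        chunk = plan_chunk(obs)
        if not chunk:
            raise ValueError("planner returned an empty chunk")
        replans += 1
        # Only the head of the chunk is executed; the rest is discarded.
        for action in chunk[:execute_actions]:
            if steps >= max_env_steps:
                break
            obs = step_env(action)
            steps += 1
    return replans
```

Executing only part of each plan trades extra planning calls for the ability to correct prediction errors early.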

Summary and conclusion

Key insights

  1. Paradigm shift: world modeling is redefined from “pixel prediction” to “semantic question answering”
  2. A new use for VLMs: a methodology for adapting a pretrained VLM into a world model
  3. Efficient planning: gradient-based optimization gives planning that is 430x faster than the video-based alternative
  4. Robust generalization: compositional generalization and background robustness that draw on the VLM's world knowledge

Implications for robotics researchers

  1. Think first about “what to predict”: not all information is needed
  2. Make active use of foundation models: internet-scale pretrained knowledge is a powerful asset
  3. Language is a powerful interface: use natural language for task definition, state representation, and goal specification
  4. Suboptimal data is valuable too: diverse experience makes for a robust model

SWM points to a new direction for robot world models: a model that understands meaning instead of reconstructing pixels, and that answers questions instead of generating video. This paradigm could lead to robot control systems that are more efficient, generalize better, and are easier to interpret.

As a saying often attributed to Feynman goes, “If you cannot explain something complex in simple terms, you do not understand it well enough.” By simplifying the essence of world modeling, SWM makes us reconsider what we really need.

References

  • Berg, J., Zhu, C., Bao, Y., Durugkar, I., & Gupta, A. (2025). Semantic World Models. arXiv:2510.19818.
  • Beyer, L., et al. (2024). PaliGemma: A versatile 3B VLM for transfer. arXiv:2407.07726.
  • Hafner, D., et al. (2019). Learning Latent Dynamics for Planning from Pixels. ICML.
  • Williams, G., et al. (2016). Aggressive Driving with Model Predictive Path Integral Control. ICRA.
  • Chi, C., et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS.
  • Zhu, C., et al. (2025). Unified World Models: Coupling Video and Action Diffusion. RSS.

⛏️ Dig Review

⛏️ Dig: Go deep, uncover the layers. Dive into technical detail.

In robot control, a world model is a powerful tool that predicts the future so the prediction can be used for planning. Traditionally, world models have been trained to predict video at the pixel level. However, the ability to reconstruct pixels does not guarantee planning performance. For example, in a task where a robot picks up an object, semantic information such as “was the object grasped (yes/no)?” can matter more than the exact change in the object's appearance. Starting from this observation, the authors argue that instead of predicting future pixels, it is enough to predict only the semantic information the task needs. In other words, the job of the world model is to answer questions (Q&A) about future outcomes, such as “Did the arm get closer to the object?”, “Did the red block fall over?”, or “Was the blue cube picked up?”.

Figure 1: Conceptual comparison of conventional video models, VLMs, and semantic world models. A conventional VLM answers questions about a static observation and a video world model generates future frames, whereas a Semantic World Model (SWM) takes the current observation and an action sequence as input and directly predicts the answer to a question about the future outcome.

The paper introduces this Semantic World Model (SWM) paradigm. An SWM takes the robot's current image observation, an action sequence, and a natural-language question about a future outcome as input, and outputs an answer about that outcome (for example, a probability distribution over yes/no). Put differently, SWM models the future state induced by the actions in the form of language question answering. This aligns the world model's training objective (predicting the correct answer to a question) with the actual planning objective (whether the task succeeds). By building SWM on top of a pretrained vision-language model (VLM), the authors bring the VLM's large-scale pretrained knowledge and generalization ability to robot control.

Method

The core of SWM is an architecture that combines action information and a question inside a vision-language model.

Concretely, the authors use PaliGemma (3B), Google's open-source VLM. PaliGemma consists of a SigLIP vision encoder (vision) and a Gemma language model (natural language), and can process images and text together. The model is extended by adding action embeddings: image observations are processed by the SigLIP encoder, while the action sequence is mapped into the language model's token space through a new projection matrix. The question is tokenized as in the original VLM and fed to the language model. As a result, the model takes (image, actions, question) jointly as input and outputs an answer about a semantic property of the future.

graph LR
    S(["Current image (State)"]) --> E[Vision encoder]
    A(["Action sequence"]) --> P[Action embedding]
    E --> LM["Language model (Gemma)"]
    P --> LM
    Q(["Question"]) --> LM
    LM --> Ans(["Answer (Yes/No probability)"])

The diagram above shows SWM's input-output flow. Given the image of the current state (S), the action sequence (A), and the question (Q), the language model generates an answer (Ans) about the future state. By combining actions and observations inside the language model in this way, SWM comes to understand environment dynamics in language space.
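The action-conditioning step can be sketched in a few lines: each action vector is multiplied by a projection matrix of shape act_dim × embed_dim to become one token embedding, and the resulting tokens are placed alongside the image and question tokens. This is a toy illustration in plain Python, not the authors' implementation. The function names and the ordering of the three token groups are assumptions.

```python
from typing import List

Vector = List[float]

def project_actions(actions: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Map each action (length act_dim) to one token embedding (length embed_dim).

    `weight` is a list of act_dim rows, each of length embed_dim.
    """
    act_dim, embed_dim = len(weight), len(weight[0])
    tokens = []
    for action in actions:
        if len(action) != act_dim:
            raise ValueError("action has the wrong dimensionality")
        # Plain matrix-vector product: token[j] = sum_i action[i] * weight[i][j]
        tokens.append([sum(action[i] * weight[i][j] for i in range(act_dim))
                       for j in range(embed_dim)])
    return tokens

def build_input_sequence(image_tokens: List[Vector],
                         action_tokens: List[Vector],
                         question_tokens: List[Vector]) -> List[Vector]:
    # Assumed ordering: image tokens, then action tokens, then question tokens.
    return image_tokens + action_tokens + question_tokens
```

Because the projection is a single linear map, gradients of the answer probability with respect to the raw actions remain available, which is what the gradient-based planner relies on.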

Training data (the SAQA dataset)

SWM is trained with supervised learning on data made of state-action-question-answer (SAQA) tuples. From trajectory data collected in a simulator, a random time interval (horizon) is chosen, the actions over that interval are extracted, and questions and ground-truth answers are generated from the future state that those actions reach. For example, for a question asking whether an object has been grasped after some time, a “yes” or “no” answer is obtained from oracle information and used as training data. The phrasing of each question is varied through paraphrasing to enrich the question-answer pairs. SWM is then trained on the resulting (image, actions, question, answer) tuples to predict future outcomes.
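The oracle labelling and paraphrasing described above can be sketched as follows. The templates, the `oracle_grasp_qa` name, and the dictionary of per-object grasp flags are hypothetical; they only illustrate how simulator ground truth can be turned into varied question-answer pairs.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical paraphrase templates for a single question type.
GRASP_TEMPLATES = [
    "Is the {obj} grasped?",
    "Has the robot picked up the {obj}?",
    "Is the gripper holding the {obj}?",
]

def oracle_grasp_qa(future_state: Dict[str, bool],
                    rng: random.Random) -> List[Tuple[str, str]]:
    """Turn oracle flags such as {"blue cube": True} into (question, answer) pairs."""
    pairs = []
    # Sort the object names so the output order does not depend on dict order.
    for obj, grasped in sorted(future_state.items()):
        question = rng.choice(GRASP_TEMPLATES).format(obj=obj)
        pairs.append((question, "yes" if grasped else "no"))
    return pairs
```

Each call picks one paraphrase per object at random, so repeated passes over the same trajectory would produce differently worded questions with identical labels.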

Planning

A trained SWM can be used to plan the robot's motions. We start with sampling-based planning. From the current state S, N action sequences a are sampled either at random or from a base policy. For each a, the desired question is fed to the SWM and the answer probability is computed. By selecting the action sequences for which the probability of the target answer (for example, “yes”) is high, actions are chosen so as to maximize the probability of reaching the goal. An optimization algorithm such as Model Predictive Path Integral (MPPI) control can be used for this. In the experiments, applying MPPI to SWM reached close to a 100% success rate on LangTable's “reaching” and “separate blocks” tasks.
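A minimal sketch of this sampling loop is shown below, in an MPPI-style form: candidates are sampled around a running mean, scored, and averaged with softmax weights. It is simplified relative to full MPPI (scalar actions, no control-cost term), and `yes_prob` is a hypothetical stand-in for querying the SWM. The default values are illustrative, not the paper's settings.

```python
import math
import random
from typing import Callable, List

def mppi_plan(
    yes_prob: Callable[[List[float]], float],  # stand-in for the SWM: action sequence -> P("yes")
    horizon: int,
    num_samples: int = 64,
    iterations: int = 5,
    noise_std: float = 0.5,
    temperature: float = 0.1,
    seed: int = 0,
) -> List[float]:
    """MPPI-style planner over a sequence of scalar actions."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    for _ in range(iterations):
        # Sample candidate sequences around the current mean.
        candidates = [[m + rng.gauss(0.0, noise_std) for m in mean]
                      for _ in range(num_samples)]
        scores = [yes_prob(c) for c in candidates]
        # Softmax weights; subtracting the best score keeps exp() stable.
        best = max(scores)
        weights = [math.exp((s - best) / temperature) for s in scores]
        total = sum(weights)
        # The new mean is the score-weighted average of the candidates.
        mean = [sum(w * c[t] for w, c in zip(weights, candidates)) / total
                for t in range(horizon)]
    return mean
```

Every candidate costs one forward pass through the model, which is why this style of planning becomes expensive with a 3B-parameter VLM.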

For complex tasks, however, sampling alone can become inefficient. The authors therefore propose a gradient-based optimization method. An initial action sequence is first drawn from a base policy $\pi_b$ and fed to the SWM to obtain a prediction. The actions are then updated directly by differentiating through the model so as to raise the probability of “yes” for the target question. Expressed as a formula:

$$J_T(a) = V_{T,c}(S, a)$$

Here $S$ is the current state, $a$ is the action sequence being optimized, and $T = \{(Q_i, A_i^*, W_i)\}$ is the set of questions $Q_i$, desired answers $A_i^*$, and weights $W_i$. $V_{T,c}(S, a)$ is an objective based on the answer probabilities predicted by the SWM, and it is maximized by gradient ascent. To compute gradients with respect to the actions, the SWM is used as a differentiable function, and techniques such as gradient-norm clipping are applied to keep the optimization stable.
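The update can be illustrated with a toy differentiable stand-in for the SWM. In the sketch below the objective is assumed to be a weighted sum of the probabilities of the desired answers, and each probability is modelled as a sigmoid of a linear function of the actions so that the gradient can be written by hand. Both choices are assumptions for illustration; the real model is a VLM differentiated by automatic differentiation.

```python
import math
from typing import List, Sequence, Tuple

# One toy "question": (w, b, weight), with P(desired answer) = sigmoid(w . a + b).
Question = Tuple[Sequence[float], float, float]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def value_and_grad(actions: Sequence[float], questions: List[Question]):
    """Weighted sum of desired-answer probabilities and its gradient w.r.t. the actions."""
    value, grad = 0.0, [0.0] * len(actions)
    for w, b, weight in questions:
        p = sigmoid(sum(wi * ai for wi, ai in zip(w, actions)) + b)
        value += weight * p
        for k, wk in enumerate(w):
            grad[k] += weight * p * (1.0 - p) * wk  # derivative of the sigmoid term
    return value, grad

def gradient_plan(actions: Sequence[float], questions: List[Question],
                  lr: float = 0.2, iterations: int = 20, clip: float = 10.0) -> List[float]:
    actions = list(actions)
    for _ in range(iterations):
        _, grad = value_and_grad(actions, questions)
        norm = math.sqrt(sum(g * g for g in grad))
        if norm > clip:  # gradient-norm clipping
            grad = [g * clip / norm for g in grad]
        actions = [a + lr * g for a, g in zip(actions, grad)]  # ascent step
    return actions
```

In practice the initial `actions` would come from the base policy, so the optimizer only has to refine a reasonable proposal rather than search from scratch.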

Multi-Step Tasks

To handle long-horizon tasks, SWM-based planning can be extended with a chain of subgoals. For a block-stacking task, for example, the following step-by-step subgoals are defined:

  1. Step 1: “Has the robot grasped the block?” (desired answer: “yes”)
  2. Step 2: “Has the block been stacked on top of the other block?” (desired answer: “yes”)

Whether each step is complete is verified by asking the SWM the same question. For example, once the first step's question is confirmed as “yes”, planning moves on to the next step's question. Because the SWM itself judges whether each subgoal is complete, multi-step planning works automatically without a separate termination classifier.
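The chaining logic amounts to a small loop over (question, desired answer) pairs. In the sketch below, `plan_and_act` and `answer_prob` are hypothetical callables standing in for the SWM-based planner and the SWM completion query. The 0.5 threshold and the round budget are assumptions as well.

```python
from typing import Callable, List, Tuple

Subgoal = Tuple[str, str]  # (question, desired answer)

def run_subgoal_chain(
    subgoals: List[Subgoal],
    plan_and_act: Callable[[str, str], None],   # hypothetical: plan toward one subgoal and execute
    answer_prob: Callable[[str, str], float],   # hypothetical SWM query: P(answer | observation, question)
    threshold: float = 0.5,
    max_rounds: int = 50,
) -> int:
    """Return how many subgoals were completed within the round budget."""
    index = 0
    for _ in range(max_rounds):
        if index == len(subgoals):
            break
        question, desired = subgoals[index]
        if answer_prob(question, desired) >= threshold:
            index += 1  # subgoal satisfied: advance to the next question
        else:
            plan_and_act(question, desired)
    return index
```

The same question string drives both planning and the completion check, which is why no extra termination model is needed.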

Figure 2: Example of multi-step planning. (Left) initial state, (middle) intermediate state, (right) completed state. At each step a question is posed to the SWM to check the condition. Example: in a task that moves the red moon and the yellow pentagon, the question “Has the red moon touched the yellow pentagon?” is asked, and planning proceeds to the next step when the answer is “yes”.

Experiments and results

Two simulated environments were used to validate SWM. In LangTable (Lynch et al., 2022), tasks such as moving to a target block, separating blocks, and pushing blocks are performed on a tabletop holding blocks of various colors and shapes. OGBench (Park et al., 2025) is an offline goal-conditioned reinforcement learning benchmark that includes complex manipulation tasks such as picking up and stacking cubes. In each environment, SWM was trained on a mixture of expert (scripted) demonstrations and random play data. At evaluation time, generalization to new block color combinations and background conditions was also measured.

SWM was evaluated from the following angles:

  • QA prediction performance: this measures how well SWM predicts the correct answers to questions about the future. Comparing training on expert data only with training on a mixture that includes suboptimal data, the mixture gave the highest accuracy. In LangTable's OOD setting in particular, the model trained with random data mixed in achieved better QA performance than the expert-only model. This indicates that suboptimal data helps generalization.
  • Sampling-based planning results (MPPI): MPPI was applied to SWM to solve LangTable and OGBench tasks. As the results table shows [2], SWM reached a 100% success rate on LangTable's “Reaching” and “Separate Blocks” tasks and a 97% success rate on OGBench's “Reach Cube” task. This shows that planning directly in semantic space can carry out tasks effectively without complex pixel prediction.
  • Base policy improvement (Policy Improvement): for harder tasks, action trajectories were first generated with a base policy (for example, a diffusion policy) and then refined with SWM and gradient-based optimization. The results are summarized in the corresponding figure [15]. Trajectories refined with SWM showed a marked improvement over the base policy and outperformed both comparison methods, IDQL (offline RL) and AVD (action-conditioned video diffusion). Average success rates rose substantially on both LangTable and OGBench, and AVD and IDQL scored lower than SWM on every task.

Figure 3: Improvement over the base policy on LangTable and OGBench tasks. Blue shows the base policy and orange shows the SWM (Gradient) results. SWM-based optimization yields a large improvement and outperforms the AVD and IDQL baselines.

  • Multi-step task performance: SWM's advantage also held on tasks with multiple subgoals. On composite tasks of 2-3 steps, such as “red pentagon → blue moon, yellow pentagon → blue cube”, SWM recorded success rates of 50–66%, far above the base policy (2–4%) and AVD (3–8%). This shows that SWM can solve a task stage by stage by asking the appropriate question at each step.
  • Generalization (Out-of-Distribution): SWM also improved performance on color combinations and backgrounds not seen in training. In a compositional generalization experiment on LangTable that introduced a purple pentagon absent from training, SWM raised the success rate by about 28 percentage points over the base policy, and it improved by 15 percentage points on OGBench with a new background color. This suggests that SWM inherits the generalization ability of the pretrained VLM.
  • Comparison with baselines: SWM outperformed two main baselines. One is IDQL (an offline RL method based on Implicit Q-Learning) and the other is AVD (action-conditioned video diffusion). AVD first generates future video from the actions and then queries SWM on the generated frames to obtain a reward. In the experiments, SWM achieved higher success rates than both IDQL and AVD on every task.

Critical discussion

The strength of the Semantic World Model is that it directly predicts information aligned with the goal. Rather than matching every pixel, it predicts the semantic properties that the task actually needs (for example, whether two objects are in contact), so planning focuses on the information that matters. In addition, thanks to the VLM's large-scale internet pretraining, SWM can generalize well to complex scenes and new combinations even with limited data. SWM can also be applied to multi-task settings with fewer prerequisites than earlier world models.

The drawbacks are just as clear. SWM is currently tuned to binary (yes/no) questions, so it may be limited for numerical reasoning or for predicting fine-grained states in continuous spaces. Because it uses a large VLM such as PaliGemma 3B, its computational cost is high. Planning in particular requires many samples or gradient updates, which can be a burden for real-time control; sampling-based methods such as MPPI can be inefficient with a large model. Finally, the current study focuses on simulation results, so the sim-to-real (domain) gap has to be considered when applying it to real robots.

Related work includes vision-language-action (VLA) models. For example, Google's PaLM-E and SayCan use language models to process robot commands. VLA models mainly map natural-language instructions (language) → actions (tokens), whereas SWM operates in the direction actions → language. SWM can therefore be seen as an “inverted” form of the usual VLA approach. Because of this difference, SWM keeps a language output and can thereby retain its pretrained knowledge. Unlike existing world models based on latent or video prediction (for example, Dreamer and PlaNet), SWM predicts the future in a language reasoning space, which sets it apart by letting it draw on pretrained vision-language knowledge.

Applications and extensions

SWM may be especially useful for general-purpose robots and human-robot interaction. When task goals are expressed as a set of natural-language questions, users can specify their intent easily, and high-level language commands can be decomposed into step-by-step questions. For example, a goal can be defined and planned for with a question such as “Is the red block on top of the blue block?”. In addition, because SWM can be trained from offline data alone, it can be extended to a variety of environments using real-robot experience data or simulators.

One direction for future work is deployment on real robots, which requires adapting SWM to handle real camera images and continuous actions. Extending the question format beyond binary answers to numerical estimation or multiple-choice problems, and automating question generation, are also interesting directions. Parallelized interactive planning, sim-to-real domain adaptation, and the use of larger vision-language models could further widen SWM's range of application.

Summary and conclusion

The paper proposes the Semantic World Model (SWM), a new world-model paradigm. Instead of conventional video prediction, SWM models the future outcome of an action sequence in the form of language question answering. To do this, the authors build SWM by adding action embeddings to a large VLM architecture (PaliGemma) and train it on simulation data. In the experiments on LangTable and OGBench tasks, SWM generalized considerably better than existing pixel-based models and offline RL. The key contributions are (1) the SWM concept and a VLM-based architecture, (2) semantic prediction through coupling actions with language, (3) the design of sampling-based and gradient-based planning methods, and (4) strong performance demonstrated across many tasks. In short, SWM is a world model that predicts the future by reasoning in language, and it builds a bridge between vision-language learning and robot control. SWM is expected to offer useful insights for applications such as physical robots, multimodal learning, and interactive planning.

Copyright 2026, JungYeon Lee