

๐Ÿ“ƒVLA for Embodied AI ๋ฆฌ๋ทฐ

vla
embodied
A Survey on Vision-Language-Action Models for Embodied AI
Published

December 22, 2025

๐Ÿ” Ping. ๐Ÿ”” Ring. โ›๏ธ Dig. A tiered review series: quick look, key ideas, deep dive.

  • Paper Link
  • List
  1. ๐Ÿค– VLA(Vision-Language-Action) ๋ชจ๋ธ์€ Vision, Language, Action ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ๋ฅผ ํ†ตํ•ฉํ•˜์—ฌ Embodied AI ํ™˜๊ฒฝ์—์„œ ๋กœ๋ด‡ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐ ์ค‘์ ์„ ๋‘” ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ(multimodal) ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
  2. ๐Ÿ“š ์ด ์„œ๋ฒ ์ด๋Š” VLAs๋ฅผ ๊ฐœ๋ณ„ ๊ตฌ์„ฑ ์š”์†Œ(components), ์ €์ˆ˜์ค€ ์ œ์–ด ์ •์ฑ…(low-level control policies), ๊ณ ์ˆ˜์ค€ ํƒœ์Šคํฌ ํ”Œ๋ž˜๋„ˆ(high-level task planners) ์„ธ ๊ฐ€์ง€ ์ฃผ์š” ์—ฐ๊ตฌ ๋ถ„์•ผ๋กœ ๋ถ„๋ฅ˜ํ•˜๊ณ  ๋‹ค์–‘ํ•œ ์•„ํ‚คํ…์ฒ˜์™€ ํ•™์Šต ๋ฐฉ๋ฒ•์„ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.
  3. ๐Ÿ› ๏ธ ๋˜ํ•œ, VLAs์˜ ๊ฐœ๋ฐœ์„ ์œ„ํ•œ ๋ฐ์ดํ„ฐ์…‹(datasets)๊ณผ ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ(simulators)์™€ ๊ฐ™์€ ํ•„์ˆ˜ ์ž์›์„ ์ œ์‹œํ•˜๊ณ , ๋ฐ์ดํ„ฐ ํฌ์†Œ์„ฑ(data scarcity), ์•ˆ์ „(safety) ๋“ฑ์˜ ๊ณผ์ œ์™€ ์ธ๊ณต ์ผ๋ฐ˜ ์ง€๋Šฅ(AGI)์„ ํ–ฅํ•œ ๋ฏธ๋ž˜ ์—ฐ๊ตฌ ๋ฐฉํ–ฅ์„ ๋…ผ์˜ํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ” Ping Review

๐Ÿ” Ping โ€” A light tap on the surface. Get the gist in seconds.

์ด ๋…ผ๋ฌธ์€ Embodied AI ๋ถ„์•ผ์—์„œ Vision-Language-Action (VLA) ๋ชจ๋ธ์— ๋Œ€ํ•œ ์ตœ์ดˆ์˜ ํฌ๊ด„์ ์ธ ์กฐ์‚ฌ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. VLA ๋ชจ๋ธ์€ Large Language Models (LLMs) ๋ฐ Vision-Language Models (VLMs)์˜ ์„ฑ๊ณต์„ ๋ฐ”ํƒ•์œผ๋กœ ๋“ฑ์žฅํ–ˆ์œผ๋ฉฐ, ์‹œ๊ฐ, ์–ธ์–ด, ํ–‰๋™ ์–‘์‹์„ ํ†ตํ•ฉํ•˜์—ฌ ์–ธ์–ด ์กฐ๊ฑด๋ถ€ ๋กœ๋ด‡ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐ ํŠนํ™”๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

I. ์„œ๋ก 

VLA ๋ชจ๋ธ์€ ํ™˜๊ฒฝ๊ณผ ์ƒํ˜ธ์ž‘์šฉํ•˜๋Š” ๋ฌผ๋ฆฌ์  embodiments๋ฅผ ์ œ์–ดํ•˜๋ฉฐ, ํŠนํžˆ ๋กœ๋ด‡ ๋ถ„์•ผ์—์„œ ์–ธ์–ด ์ง€์‹œ์— ๋”ฐ๋ผ ํ™˜๊ฒฝ์„ ์‹œ๊ฐ์ ์œผ๋กœ ์ธ์‹ํ•˜๊ณ  ์ ์ ˆํ•œ ํ–‰๋™์„ ์ƒ์„ฑํ•˜๋Š” ๋ฐ ํ•„์š”ํ•œ multimodal ๋Šฅ๋ ฅ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. RT-2 [2]์—์„œ ์ด ์šฉ์–ด๊ฐ€ ์ฒ˜์Œ ์ œ์•ˆ๋˜์—ˆ์œผ๋ฉฐ, ์ดˆ๊ธฐ Deep Reinforcement Learning (RL) ์ ‘๊ทผ ๋ฐฉ์‹์— ๋น„ํ•ด ํ–ฅ์ƒ๋œ ๋‹ค์šฉ์„ฑ, dexterity ๋ฐ ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ด ๋…ผ๋ฌธ์€ VLA๋ฅผ ์‹œ๊ฐ ๋ฐ ์–ธ์–ด๋กœ๋ถ€ํ„ฐ multimodal ์ž…๋ ฅ์„ ์ฒ˜๋ฆฌํ•˜์—ฌ embodied ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋กœ๋ด‡ ํ–‰๋™์„ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋“  ๋ชจ๋ธ๋กœ ์ •์˜ํ•˜๋ฉฐ, LLM ๋˜๋Š” Large VLM์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” VLA๋ฅผ โ€œLarge VLA (LVLA)โ€๋กœ ๊ตฌ๋ถ„ํ•ฉ๋‹ˆ๋‹ค.

II. ๋ฐฐ๊ฒฝ

Embodied AI는 물리적 환경과 능동적으로 상호작용하는 인공지능의 한 형태로, 로봇 학습은 종종 Markov Decision Process (MDP) 또는 Partially-Observable Markov Decision Processes (POMDPs) 문제로 정식화됩니다. 주요 목표는 현재 상태 s에서 최적의 행동 a를 생성하는 정책 $ \pi(a_t \mid s_{\leq t}, a_{<t}) $를 훈련하는 것입니다. Reward function 정의가 어려운 경우 Imitation Learning이 사용되며, 언어 지시 p를 사용한 언어 조건부 로봇 정책 $ \pi(a_t \mid p, s_{\leq t}, a_{<t}) $가 개발됩니다.

III. Vision-Language-Action ๋ชจ๋ธ

VLA ๋ชจ๋ธ์€ ์„ธ ๊ฐ€์ง€ ์ฃผ์š” ์—ฐ๊ตฌ ๋ถ„์•ผ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค: VLA์˜ ๊ฐœ๋ณ„ ๊ตฌ์„ฑ ์š”์†Œ, low-level ์ œ์–ด ์ •์ฑ…, high-level task planner.

A. VLA์˜ ๊ตฌ์„ฑ ์š”์†Œ

VLA ๋ชจ๋ธ์€ Computer Vision (CV), Natural Language Processing (NLP), RL์˜ ์„ฑ๊ณต์„ ๋ฐ”ํƒ•์œผ๋กœ ๊ฐœ๋ณ„ ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ ํ†ตํ•ฉํ•ฉ๋‹ˆ๋‹ค.

  1. Reinforcement Learning: RL์€ Embodied AI์˜ ๊ธฐ์ดˆ๋ฅผ ๋งˆ๋ จํ–ˆ์œผ๋ฉฐ, Deep Q-Network (DQN)์™€ ๊ฐ™์€ ๋ชจ๋ธ์„ ํ†ตํ•ด ๊ณ ์ฐจ์› ํ”ฝ์…€ ์ž…๋ ฅ์—์„œ ์ •์ฑ… ํ•™์Šต ๊ฐ€๋Šฅ์„ฑ์„ ์ž…์ฆํ–ˆ์Šต๋‹ˆ๋‹ค. Decision Transformer (DT) ๋ฐ Trajectory Transformer (TT)๋Š” Transformer ์•„ํ‚คํ…์ฒ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ƒํƒœ, ํ–‰๋™, ๋ณด์ƒ ์‹œํ€€์Šค๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐ ์˜๊ฐ์„ ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค. RL๊ณผ LLM ๊ฐ„์˜ ์‹œ๋„ˆ์ง€ ํšจ๊ณผ๋Š” Human Feedback์œผ๋กœ๋ถ€ํ„ฐ์˜ RL (RLHF)์„ ํ†ตํ•ด LLM์„ ์ธ๊ฐ„ ์„ ํ˜ธ๋„์— ๋งž์ถ”๊ฑฐ๋‚˜, Reflexion๊ณผ ๊ฐ™์€ ์–ธ์–ด์  ํ”ผ๋“œ๋ฐฑ์„ ํ™œ์šฉํ•œ ์ƒˆ๋กœ์šด RL ๋ฐฉ๋ฒ•์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. Eureka [24]๋Š” LLM์ด ๋กœ๋ด‡์„ ์œ„ํ•œ ๋ณด์ƒ ํ•จ์ˆ˜๋ฅผ ์„ค๊ณ„ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.
  2. Pretrained Visual Representations (PVRs): Vision encoder์˜ ํšจ๊ณผ๋Š” VLA์˜ ์„ฑ๋Šฅ์— ์ง์ ‘์ ์ธ ์˜ํ–ฅ์„ ๋ฏธ์นฉ๋‹ˆ๋‹ค.
    • CLIP [25]: 4์–ต ๊ฐœ์˜ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์Œ์œผ๋กœ ๊ตฌ์„ฑ๋œ WIT ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ›ˆ๋ จ๋˜๋ฉฐ, ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์Œ์„ ์‹๋ณ„ํ•˜๋Š” ๋Œ€์กฐ ํ•™์Šต objective๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • R3M [26]: ์‹œ๊ฐ„์  ๋Œ€์กฐ ํ•™์Šต(์ธ์ ‘ ํ”„๋ ˆ์ž„ ๊ฐ„ ๊ฑฐ๋ฆฌ ์ตœ์†Œํ™”, ๋น„์ธ์ ‘ ํ”„๋ ˆ์ž„ ๊ฐ„ ๊ฑฐ๋ฆฌ ์ตœ๋Œ€ํ™”) ๋ฐ ๋น„๋””์˜ค-์–ธ์–ด ์ •๋ ฌ objective๋ฅผ ํ†ตํ•ด PVR์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    • MVP [28]: ์ปดํ“จํ„ฐ ๋น„์ „์˜ Masked Autoencoder (MAE)๋ฅผ ๋กœ๋ด‡ ๋ฐ์ดํ„ฐ์…‹์— ์ ์šฉํ•˜์—ฌ ์†์ƒ๋œ ํŒจ์น˜๋ฅผ ์žฌ๊ตฌ์„ฑํ•˜๋Š” self-supervised ๋ฐฉ์‹์œผ๋กœ ํ›ˆ๋ จ๋ฉ๋‹ˆ๋‹ค.
    • Voltron [35]: ์–ธ์–ด ์กฐ๊ฑด๋ถ€ MAE objective์™€ ์–ธ์–ด ์ƒ์„ฑ objective๋ฅผ ํ†ตํ•ฉํ•˜์—ฌ ์–ธ์–ด-์‹œ๊ฐ ์–‘์‹์˜ ์ •๋ ฌ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.
    • VC-1 [34]: ์ด์ „ PVR์— ๋Œ€ํ•œ ์‹ฌ์ธต ๋ถ„์„์„ ํ†ตํ•ด ์ตœ์ ์˜ ViT ๊ตฌ์„ฑ์„ ํƒ์ƒ‰ํ•˜๊ณ , PVR ๊ฐœ์„ ์— ๊ธฐ์—ฌํ•˜๋Š” ํ•ต์‹ฌ ์š”์†Œ๋ฅผ ๋ฐํž™๋‹ˆ๋‹ค.
    • DINOv2 [36]: self-distillation ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ํ†ตํ•ด MAE๋ฅผ ๋Šฅ๊ฐ€ํ•˜๋Š” ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค. ๊ต์‚ฌ ๋„คํŠธ์›Œํฌ๋Š” ํ•™์ƒ ๋„คํŠธ์›Œํฌ์˜ EMA๋กœ ์œ ์ง€๋ฉ๋‹ˆ๋‹ค.
    • I-JEPA [39]: joint-embedding predictive architectures์—์„œ ์˜๊ฐ์„ ๋ฐ›์•„ ํŒจ์น˜ ์ž„๋ฒ ๋”ฉ์„ ๋น„๊ตํ•˜์—ฌ ๋‚ด๋ถ€ ์„ธ๊ณ„ ๋ชจ๋ธ์„ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค. DINO์™€ ๋‹ฌ๋ฆฌ masked patches๋ฅผ ์‚ฌ์šฉํ•˜๋ฉฐ, MAE์™€ ๋‹ฌ๋ฆฌ ๋น„์ƒ์„ฑ์  ์ ‘๊ทผ ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.
    • Theia [40]: ๋‹ค์–‘ํ•œ vision foundation models (segmentation, depth, semantics ๋“ฑ)์„ ๋‹จ์ผ ๋ชจ๋ธ๋กœ ์ฆ๋ฅ˜ํ•˜์—ฌ ์ด์ „ PVR์„ ๋Šฅ๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
    • ๊ฐ•์  ๋ฐ ํ•œ๊ณ„: MAE ๊ธฐ๋ฐ˜ self-supervised ํ•™์Šต์€ pixel-level ์ •๋ณด๋ฅผ ์ œ๊ณตํ•˜์—ฌ ์ •๋ฐ€ํ•œ ๋กœ๋ด‡ ์กฐ์ž‘์— ์œ ์šฉํ•˜๋ฉฐ, DINOv2 ๋ฐ I-JEPA๋Š” ๊ฐ๊ฐ pixel- ๋ฐ patch-level ํŠน์ง• ํ•™์Šต์— ๊ฐ•์ ์„ ๊ฐ€์ง‘๋‹ˆ๋‹ค. Theia๋Š” ์—ฌ๋Ÿฌ VFM์˜ ์ •๋ณด๋ฅผ ์œตํ•ฉํ•˜์—ฌ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.
  3. Video Representations: ๋น„๋””์˜ค๋Š” ์ด๋ฏธ์ง€ ์‹œํ€€์Šค๋กœ์„œ, ์‹œ๊ฐ„์  ๋Œ€์กฐ ํ•™์Šต ๋ฐ MAE์™€ ๊ฐ™์€ ๊ณ ์œ ํ•œ ํ‘œํ˜„ ๊ธฐ์ˆ ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. NeRF [43, 44] ๋ฐ 3D Gaussian Splatting (3D-GS) [45, 46]๋Š” ํ’๋ถ€ํ•œ 3D ์ •๋ณด๋ฅผ ์ œ๊ณตํ•˜๋ฉฐ, ์˜ค๋””์˜ค [47]๋„ ๋กœ๋ด‡ ์ •์ฑ…์— ์ค‘์š”ํ•œ cues๋ฅผ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  4. Dynamics Learning: 모델 $ f(\cdot) $에 forward 또는 inverse dynamics 이해를 부여하는 objective를 포함합니다.
    • Forward dynamics: $ \hat{s}_{t+1} \leftarrow f_{\text{fwd}}(s_t, a_t) $ (주어진 행동에 따른 다음 상태 예측).
    • Inverse dynamics: $ \hat{a}_t \leftarrow f_{\text{inv}}(s_t, s_{t+1}) $ (이전 상태에서 다음 상태로 전환하는 데 필요한 행동 결정).
    • Vi-PRoM [48]: ๋น„๋””์˜ค ๊ฐ„ ๊ตฌ๋ณ„์„ ์œ„ํ•œ ๋Œ€์กฐ์  self-supervised ํ•™์Šต, ๋’ค์„ž์ธ ๋น„๋””์˜ค ํ”„๋ ˆ์ž„ ๋ณต๊ตฌ, pseudo labels๋ฅผ ์‚ฌ์šฉํ•œ ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜ objective๋ฅผ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.
    • MIDAS [50]: pretraining์˜ ์ผ๋ถ€๋กœ inverse dynamics ์˜ˆ์ธก ์ž‘์—…์„ ๋„์ž…ํ•˜์—ฌ ํ™˜๊ฒฝ์˜ ์ „ํ™˜ dynamics ์ดํ•ด๋ฅผ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.
    • SMART [51]: forward dynamics ์˜ˆ์ธก, inverse dynamics ์˜ˆ์ธก, ๋ฌด์ž‘์œ„๋กœ ๋งˆ์Šคํ‚น๋œ hindsight control์„ ํฌํ•จํ•œ pretraining scheme์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.
    • MaskDP [49]: ์ƒํƒœ ๋ฐ ํ–‰๋™ ํ† ํฐ์„ ๋งˆ์Šคํ‚นํ•˜์—ฌ ์žฌ๊ตฌ์„ฑํ•˜๋Š” masked decision prediction ์ž‘์—…์„ ํ†ตํ•ด forward ๋ฐ inverse dynamics ์ดํ•ด๋ฅผ ์–ป์Šต๋‹ˆ๋‹ค.
    • VPT [53]: ๋ ˆ์ด๋ธ” ์—†๋Š” ์ธํ„ฐ๋„ท ๋น„๋””์˜ค๋ฅผ ํ™œ์šฉํ•˜์—ฌ Minecraft์šฉ foundation model์„ pretrainํ•ฉ๋‹ˆ๋‹ค.
    • ๊ฐ•์  ๋ฐ ํ•œ๊ณ„: ์ผ๋ฐ˜์ ์œผ๋กœ forward dynamics ํ•™์Šต์ด inverse dynamics ํ•™์Šต๋ณด๋‹ค ์–ด๋ ต์ง€๋งŒ, ๋” ํฐ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค. Inverse dynamics ๋ชจ๋ธ์€ ์ƒํƒœ๋งŒ ํฌํ•จ๋œ ๋ฐ์ดํ„ฐ์…‹์— ํ–‰๋™ ๋ ˆ์ด๋ธ”์„ ์ƒ์„ฑํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  5. World Models: 세계 모델 $ P(\cdot) $은 세상에 대한 상식적 지식을 인코딩하고 주어진 행동에 대한 미래 상태 $ \hat{s}_{t+1} \sim P(\hat{s}_{t+1} \mid s_t, a_t) $를 예측합니다. 이는 model-based 제어 및 계획을 가능하게 합니다.
    • Dreamer [55]: ์ž ์žฌ dynamics ๋ชจ๋ธ์„ ๊ตฌ์ถ•ํ•˜๊ธฐ ์œ„ํ•ด ํ‘œํ˜„ ๋ชจ๋ธ, ์ „ํ™˜ ๋ชจ๋ธ, ๋ณด์ƒ ๋ชจ๋ธ์˜ ์„ธ ๊ฐ€์ง€ ์ฃผ์š” ๋ชจ๋“ˆ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. DreamerV2 [56]๋Š” discrete ์ž ์žฌ ์ƒํƒœ ๊ณต๊ฐ„์„ ๋„์ž…ํ–ˆ๊ณ , DreamerV3 [57]๋Š” ๋” ๋„“์€ ๋„๋ฉ”์ธ์œผ๋กœ ํ™•์žฅํ–ˆ์Šต๋‹ˆ๋‹ค.
    • IRIS [59]: GPT์™€ ๊ฐ™์€ autoregressive Transformer๋ฅผ ์„ธ๊ณ„ ๋ชจ๋ธ์˜ ๊ธฐ๋ฐ˜์œผ๋กœ ์‚ฌ์šฉํ•˜๋ฉฐ, VQ-VAE๋ฅผ vision encoder๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  6. LLM-induced World Models: LLM์˜ ์ƒ์‹์  ์ง€์‹์„ ํ™œ์šฉํ•˜์—ฌ VLA๋ฅผ ๊ฐœ์„ ํ•ฉ๋‹ˆ๋‹ค.
    • DECKARD [61]: LLM์ด Minecraft์˜ ์•„์ดํ…œ ์ œ์ž‘์„ ์œ„ํ•œ directed acyclic graphs ํ˜•ํƒœ์˜ abstract world models (AWMs)๋ฅผ ์ƒ์„ฑํ•˜๋„๋ก ํ”„๋กฌํ”„ํŒ…ํ•ฉ๋‹ˆ๋‹ค.
    • LLM-DM [62]: LLM์„ ์‚ฌ์šฉํ•˜์—ฌ Planning Domain Definition Language (PDDL)๋กœ ์„ธ๊ณ„ ๋ชจ๋ธ์„ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค.
    • RAP [64]: LLM์„ ํ–‰๋™์„ ์˜ˆ์ธกํ•˜๋Š” ์ •์ฑ…๊ณผ ์ƒํƒœ ์ „ํ™˜ ๋ถ„ํฌ๋ฅผ ์ œ๊ณตํ•˜๋Š” ์„ธ๊ณ„ ๋ชจ๋ธ๋กœ ์žฌ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. Monte Carlo Tree Search (MCTS)๋ฅผ ํ†ตํ•ฉํ•˜์—ฌ ๊ตฌ์กฐํ™”๋œ ๊ณ„ํš์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.
    • LLM-MCTS [66]: RAP์„ ๊ธฐ๋ฐ˜์œผ๋กœ POMDPs๋กœ ํ™•์žฅํ•˜๋ฉฐ, LLM์ด MCTS์˜ ๊ฒ€์ƒ‰ ๊ณต๊ฐ„์„ ์ค„์—ฌ ํšจ์œจ์„ฑ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.
  7. Visual World Models: ํ…์ŠคํŠธ ํ˜•ํƒœ์˜ LLM-induced ์„ธ๊ณ„ ๋ชจ๋ธ๊ณผ ๋‹ฌ๋ฆฌ, ์‹œ๊ฐ ์„ธ๊ณ„ ๋ชจ๋ธ์€ ๋ฏธ๋ž˜ ์ƒํƒœ์˜ ์ด๋ฏธ์ง€, ๋น„๋””์˜ค, 3D ์žฅ๋ฉด์„ ์ƒ์„ฑํ•˜์—ฌ ๋ฌผ๋ฆฌ์  ์„ธ๊ณ„์™€ ๋” ๋ฐ€์ ‘ํ•˜๊ฒŒ ์ •๋ ฌ๋ฉ๋‹ˆ๋‹ค.
    • Genie [69]: Generative Interactive Environments๋ผ๋Š” ์ƒˆ๋กœ์šด ํด๋ž˜์Šค์˜ ์ƒ์„ฑ ๋ชจ๋ธ์„ ์†Œ๊ฐœํ•˜๋ฉฐ, ๋น„์ง€๋„ ๋ฐฉ์‹์œผ๋กœ ํ›ˆ๋ จ๋˜์–ด ์‚ฌ์šฉ์ž๊ฐ€ ์ƒ์„ฑ ํ™˜๊ฒฝ๊ณผ ํ”„๋ ˆ์ž„๋ณ„๋กœ ์ƒํ˜ธ์ž‘์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.
    • 3D-VLA [70]: diffusion models๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ด๋ฏธ์ง€, ๊นŠ์ด ๋งต, ํฌ์ธํŠธ ํด๋ผ์šฐ๋“œ์™€ ๊ฐ™์€ ์‹œ๊ฐ์  ์ž…๋ ฅ์„ ์ฒ˜๋ฆฌํ•˜๊ณ , ์‚ฌ์šฉ์ž์˜ ์ฟผ๋ฆฌ์— ์‘๋‹ตํ•˜์—ฌ ๋ชฉํ‘œ ์ƒํƒœ(์ด๋ฏธ์ง€ ๋˜๋Š” ํฌ์ธํŠธ ํด๋ผ์šฐ๋“œ)๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    • UniSim [71]: ์‹ค์ œ ์ƒํ˜ธ์ž‘์šฉ ๋น„๋””์˜ค๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ƒ์„ฑ ๋ชจ๋ธ์„ ๊ตฌ์ถ•ํ•˜์—ฌ high-level ๋ฐ low-level ํ–‰๋™ ๋ชจ๋‘์— ๋Œ€ํ•œ ์‹œ๊ฐ์  ๊ฒฐ๊ณผ๋ฅผ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ํ•ฉ๋‹ˆ๋‹ค.
    • E2WM [72]: ๊ธฐ์กด ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ๋ฅผ ์„ธ๊ณ„ ๋ชจ๋ธ๋กœ ์‚ฌ์šฉํ•˜์—ฌ MCTS๋ฅผ ํ†ตํ•ด embodied ๊ฒฝํ—˜์„ ์ˆ˜์ง‘ํ•ฉ๋‹ˆ๋‹ค.
  8. Reasoning: LLM์˜ ํ•ต์‹ฌ ๋Šฅ๋ ฅ์ธ CoT (Chain-of-Thought) ์ถ”๋ก ์„ ์˜์‚ฌ๊ฒฐ์ • ๊ณผ์ •์— ์ ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • ThinkBot [75]: CoT๋ฅผ ์ ์šฉํ•˜์—ฌ sparseํ•œ ์ธ๊ฐ„ ์ง€์‹œ์—์„œ ๋ˆ„๋ฝ๋œ ํ–‰๋™ ์„ค๋ช…์„ ๋ณต๊ตฌํ•ฉ๋‹ˆ๋‹ค.
    • ReAct [76]: ์ถ”๋ก  ํ”์ ๊ณผ ํ–‰๋™์„ interleaveํ•˜์—ฌ ํ–‰๋™ ๊ณ„ํš์„ ์ƒ์„ฑํ•˜๊ณ  ์ƒ์‹์  ์ง€์‹์„ ์ฃผ์ž…ํ•˜๋ฉฐ ์˜ˆ์™ธ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐ ๋„์›€์„ ์ค๋‹ˆ๋‹ค.
    • ECoT [78]: low-level ์ œ์–ด ์ •์ฑ…์— embodied CoT ์ถ”๋ก ์„ ํ›ˆ๋ จ์‹œ์ผœ ๊ณ„ํš, sub-tasks, ๋™์ž‘, ์‹œ๊ฐ์  ํŠน์ง•์— ๋Œ€ํ•ด ์ถ”๋ก ํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.

B. Low-level Control Policies

VLA ๋ชจ๋ธ $ {} $๋Š” vision encoder์™€ language encoder์™€ ๊ฐ™์€ ์ง€๊ฐ ๋ชจ๋“ˆ์„ action decoder์™€ ํ†ตํ•ฉํ•˜์—ฌ ์–ธ์–ด ์ง€์‹œ p๋ฅผ ์‹คํ–‰ํ•˜๋Š” ์ œ์–ด ์ •์ฑ…์œผ๋กœ ํ˜•์„ฑ๋ฉ๋‹ˆ๋‹ค: $ t {}( t | p, s{t}, a{<t}) $.

  1. Non-Transformer Control Policies:
    • CLIPort [31]: CLIP๊ณผ Transporter ๋„คํŠธ์›Œํฌ๋ฅผ ํ†ตํ•ฉํ•˜์—ฌ โ€œsemanticโ€ ์ •๋ณด์™€ โ€œspatialโ€ ์ •๋ณด๋ฅผ ์ถ”์ถœํ•˜๊ณ , CLIP ๋ฌธ์žฅ encoder๊ฐ€ SE(2) ํ–‰๋™์„ ์œ ๋„ํ•ฉ๋‹ˆ๋‹ค.
    • BC-Z [79]: ์–ธ์–ด ์ง€์‹œ ๋˜๋Š” ์ธ๊ฐ„ ์‹œ์—ฐ ๋น„๋””์˜ค๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ , FiLM layer๋ฅผ ํ†ตํ•ด ์ง€์‹œ ์ž„๋ฒ ๋”ฉ๊ณผ ์ด๋ฏธ์ง€ ์ž„๋ฒ ๋”ฉ์„ ๊ฒฐํ•ฉํ•˜์—ฌ ํ–‰๋™์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    • UniPi [83]: ์˜์‚ฌ๊ฒฐ์ • ๋ฌธ์ œ๋ฅผ ํ…์ŠคํŠธ ์กฐ๊ฑด๋ถ€ ๋น„๋””์˜ค ์ƒ์„ฑ ๋ฌธ์ œ๋กœ ์ฒ˜๋ฆฌํ•˜์—ฌ, ์ฃผ์–ด์ง„ ํ…์ŠคํŠธ ์ง€์‹œ์— ๋”ฐ๋ผ ๋น„๋””์˜ค๋ฅผ ์ƒ์„ฑํ•˜๊ณ  inverse dynamics๋ฅผ ํ†ตํ•ด ํ–‰๋™์„ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค.
  2. Transformer-based Control Policies:
    • Gato [19]: ๋‹จ์ผ ๋ชจ๋ธ ๋งค๊ฐœ๋ณ€์ˆ˜ ์„ธํŠธ๋กœ Atari ๊ฒŒ์ž„, ์ด๋ฏธ์ง€ ์บก์…˜, ๋ธ”๋ก ์Œ“๊ธฐ ๋“ฑ ๋‹ค์–‘ํ•œ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” โ€œmulti-modal, multi-task, multi-embodiment generalist agentโ€์ž…๋‹ˆ๋‹ค.
    • RoboCat [92]: Gato ๋ชจ๋ธ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” self-improvement ํ”„๋กœ์„ธ์Šค๋ฅผ ์ œ์•ˆํ•˜์—ฌ 100๊ฐœ ๋ฏธ๋งŒ์˜ ์‹œ์—ฐ์œผ๋กœ ์ƒˆ๋กœ์šด ์ž‘์—…์— ๋น ๋ฅด๊ฒŒ ์ ์‘ํ•ฉ๋‹ˆ๋‹ค.
    • RT-1 [94]: BC-Z์™€ ์œ ์‚ฌํ•˜์ง€๋งŒ, EfficientNet ๊ธฐ๋ฐ˜์˜ vision encoder์™€ Transformer decoder๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ด์‚ฐํ™”๋œ ํ–‰๋™์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    • Q-Transformer [95]: RT-1์„ ํ™•์žฅํ•˜์—ฌ autoregressive Q-functions๋ฅผ ๋„์ž…ํ•˜๊ณ , Q-learning ๋ฐฉ๋ฒ•์„ ์ฑ„ํƒํ•˜์—ฌ ์„ฑ๊ณต์ ์ธ ์‹œ์—ฐ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์‹คํŒจํ•œ trajectories๋„ ํ•™์Šต์— ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • ACT [97]: action chunking์ด ์žˆ๋Š” conditional VAE ์ •์ฑ…์„ ๊ตฌ์ถ•ํ•˜์—ฌ, ์ •์ฑ…์ด ๋‹จ์ผ ํ–‰๋™์ด ์•„๋‹Œ ํ–‰๋™ ์‹œํ€€์Šค๋ฅผ ์˜ˆ์ธกํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.
  3. Control Policies for Multimodal Instructions:
    • VIMA [126]: multimodal prompts에 중점을 두며, 객체 조작, 시각적 목표 도달, 새로운 개념 grounding, one-shot 비디오 모방 등 언어 프롬프트만으로는 표현하기 어려운 복잡한 태스크를 지원합니다.
    • MOO [93]: RT-1์„ ํ™•์žฅํ•˜์—ฌ multimodal prompts๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ , OWL-ViT๋ฅผ ํ†ตํ•ฉํ•˜์—ฌ ํ”„๋กฌํ”„ํŠธ ๋‚ด ์ด๋ฏธ์ง€๋ฅผ ์ธ์ฝ”๋”ฉํ•ฉ๋‹ˆ๋‹ค.
  4. Control Policies with 3D Vision: 3D ๋น„์ „์€ 2D ์ด๋ฏธ์ง€๋ณด๋‹ค ํ’๋ถ€ํ•œ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
    • PerAct [87]: RGB-D ์ž…๋ ฅ์—์„œ ์žฌ๊ตฌ์„ฑ๋œ voxel map์„ ์ž…๋ ฅ์œผ๋กœ ์‚ฌ์šฉํ•˜๊ณ , gripper ์›€์ง์ž„์„ ์•ˆ๋‚ดํ•˜๋Š” ์ตœ์ƒ์˜ voxel์„ ์ถœ๋ ฅ์œผ๋กœ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    • Act3D [88]: ์—ฐ์† ํ•ด์ƒ๋„ 3D feature field๋ฅผ ๋„์ž…ํ•˜์—ฌ voxelization์˜ ๊ณ„์‚ฐ ๋น„์šฉ์„ ํ•ด๊ฒฐํ•ฉ๋‹ˆ๋‹ค.
    • RVT, RVT-2 [89, 90]: ์žฅ๋ฉด ํฌ์ธํŠธ ํด๋ผ์šฐ๋“œ์˜ ๊ฐ€์ƒ ๋ทฐ์—์„œ ์ด๋ฏธ์ง€๋ฅผ ์žฌ-๋ Œ๋”๋งํ•˜๊ณ  ์ด๋ฅผ ์ž…๋ ฅ์œผ๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  5. Diffusion-based Control Policies:
    • Diffusion Policy [104]: ๋กœ๋ด‡ ์ •์ฑ…์„ DDPM [128]์œผ๋กœ ์ •์‹ํ™”ํ•˜๋ฉฐ, ์‹œ๊ฐ ์กฐ๊ฑด๋ถ€ ๋ฐ ์‹œ๊ณ„์—ด diffusion Transformer์™€ ๊ฐ™์€ ๊ธฐ์ˆ ์„ ํ†ตํ•ฉํ•ฉ๋‹ˆ๋‹ค.
    • SUDD [106]: LLM์ด ๋ฐ์ดํ„ฐ ์ƒ์„ฑ์„ ์•ˆ๋‚ดํ•˜๊ณ , ํ•„ํ„ฐ๋ง๋œ ๋ฐ์ดํ„ฐ์…‹์ด visuo-linguo-motor ์ •์ฑ…์œผ๋กœ ์ฆ๋ฅ˜๋˜๋Š” ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.
    • Octo [107]: OXE ๋ฐ์ดํ„ฐ์…‹ [112]์„ ํ™œ์šฉํ•œ Transformer ๊ธฐ๋ฐ˜ diffusion ์ •์ฑ…์œผ๋กœ, ๋‹ค์–‘ํ•œ ๋กœ๋ด‡ ๋ฐ ์ž‘์—…์— ๊ฑธ์ณ ๊ธ์ •์ ์ธ transfer ๋ฐ ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.
    • MDT [109]: DiT ๋ชจ๋ธ [129]์„ action prediction head์— ์ ์šฉํ•˜๋ฉฐ, masked generative foresight ๋ฐ contrastive latent alignment ๋ณด์กฐ objective๋ฅผ ํ†ตํ•ด U-Net ๊ธฐ๋ฐ˜ diffusion ๋ชจ๋ธ๋ณด๋‹ค ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.
    • RDT-1B [110]: DiT๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” bimanual manipulation์„ ์œ„ํ•œ diffusion foundation model์ž…๋‹ˆ๋‹ค.
  6. Diffusion-based Control Policies with 3D Vision:
    • DP3 [105]: diffusion ์ •์ฑ…์— 3D ์ž…๋ ฅ์„ ๋„์ž…ํ•˜์—ฌ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.
    • 3D Diffuser Actor [108]: Act3D์™€ Diffusion Policy๋ฅผ ๊ฒฐํ•ฉํ•œ ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  7. Control Policies for Motion Planning:
    • Language costs [84]: ์ธ๊ฐ„ ์ง€์‹œ๋กœ๋ถ€ํ„ฐ ์ƒ์„ฑ๋œ ์˜ˆ์ธก ๋น„์šฉ ๋งต์„ ์‚ฌ์šฉํ•˜์—ฌ motion planner๊ฐ€ ์ตœ์ ์˜ ํ–‰๋™์„ ๊ณ„์‚ฐํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.
    • VoxPoser [103]: LLM ๋ฐ VLM์„ ์‚ฌ์šฉํ•˜์—ฌ affordance ๋ฐ constraint๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๋‘ ๊ฐœ์˜ 3D voxel map์„ ์ƒ์„ฑํ•˜๊ณ , ๋ชจ๋ธ ์˜ˆ์ธก ์ œ์–ด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์‹คํ–‰ ๊ฐ€๋Šฅํ•œ trajectory๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
  8. Control Policies with Point-based Action:
    • PIVOT [132]: ๋กœ๋ด‡ ์ž‘์—…์„ ์‹œ๊ฐ์  ์งˆ์˜์‘๋‹ต์œผ๋กœ ๊ฐ„์ฃผํ•˜์—ฌ, VLM์ด ์‹œ๊ฐ์  proposals ์ง‘ํ•ฉ์—์„œ ์ตœ์ ์˜ ๋กœ๋ด‡ ํ–‰๋™์„ ์„ ํƒํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.
    • RoboPoint [91]: spatial affordance prediction ์ž‘์—…์„ ์‚ฌ์šฉํ•˜์—ฌ VLM์„ finetuneํ•˜๋ฉฐ, 2D ์ด๋ฏธ์ง€์˜ affordance points๋ฅผ ๊นŠ์ด ๋งต์„ ์‚ฌ์šฉํ•˜์—ฌ 3D ๊ณต๊ฐ„์œผ๋กœ ํˆฌ์˜ํ•ฉ๋‹ˆ๋‹ค.
  9. Large VLA: RT-2 [2]์—์„œ ์ œ์•ˆ๋œ ์›๋ž˜ VLA ์ •์˜์— ํ•ด๋‹นํ•˜๋ฉฐ, LLM ๋ฐ VLM๊ณผ ์œ ์‚ฌํ•˜๊ฒŒ ํฐ ๋ชจ๋ธ ๊ทœ๋ชจ๋ฅผ ํŠน์ง•์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค.
    • RT-2 [2]: PaLI-X ๋ฐ PaLM-E์™€ ๊ฐ™์€ large multimodal models์˜ ๊ธฐ๋Šฅ์„ ๋กœ๋ด‡ ์ž‘์—…์— ํ™œ์šฉํ•˜๋ฉฐ, ์ธํ„ฐ๋„ท ๊ทœ๋ชจ VQA ๋ฐ์ดํ„ฐ์™€ ๋กœ๋ด‡ ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ๋‘ ํ•™์Šตํ•˜๋Š” co-fine-tuning์„ ๋„์ž…ํ•ฉ๋‹ˆ๋‹ค.
    • RT-H [111]: ์–ธ์–ด ์ง€์‹œ์™€ low-level ํ–‰๋™ ์‚ฌ์ด์— ์–ธ์–ด ๋™์ž‘์˜ ์ค‘๊ฐ„ ์˜ˆ์ธก ๋ ˆ์ด์–ด๋ฅผ ํฌํ•จํ•˜๋Š” action hierarchy๋ฅผ ๋„์ž…ํ•˜์—ฌ ๋ฐ์ดํ„ฐ ๊ณต์œ ๋ฅผ ์šฉ์ดํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.
    • RT-X [112]: RT-1 ๋ฐ RT-2 ๋ชจ๋ธ์„ Open X-Embodiment (OXE)๋ผ๋Š” ๋” ํฐ ์˜คํ”ˆ ์†Œ์Šค ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ์žฌํ›ˆ๋ จํ•ฉ๋‹ˆ๋‹ค.
    • OpenVLA [37]: RT-2-X์˜ ์˜คํ”ˆ ์†Œ์Šค ๋ฒ„์ „์œผ๋กœ, ํšจ์œจ์ ์ธ fine-tuning ๋ฐฉ๋ฒ•์„ ํƒ์ƒ‰ํ–ˆ์Šต๋‹ˆ๋‹ค.
    • $ \pi_0 $ [115]: VLM을 VLA로 변환하기 위한 flow-matching 아키텍처를 제안하며, mixture-of-experts 프레임워크를 기반으로 하는 추가 action expert를 통합합니다.
    • RoboMamba [116]: Transformer๋ฅผ Mamba state space model๋กœ ๋Œ€์ฒดํ•˜์—ฌ ํšจ์œจ์ ์ธ ๋กœ๋ด‡ ์ถ”๋ก  ๋ฐ ํ–‰๋™ ๊ธฐ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    • WorldVLA [122] ๋ฐ UniVLA [123]: VLAs๋ฅผ ์„ธ๊ณ„ ๋ชจ๋ธ๊ณผ ํ†ตํ•ฉํ•˜์—ฌ multimodal ๋ฐ์ดํ„ฐ๋ฅผ discrete tokens๋กœ ์–‘์žํ™”ํ•˜์—ฌ, ํ–‰๋™ ๋ฐ ํ…์ŠคํŠธ ์ƒ์„ฑ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์ด๋ฏธ์ง€ ์ƒ์„ฑ๋„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.
    • Instruct2Act [102]: LLM์— vision ๋ฐ action tools๋ฅผ ํ†ตํ•ฉํ•˜์—ฌ ๋กœ๋ด‡ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.
    • ๊ฐ•์  ๋ฐ ํ•œ๊ณ„:
      • ์•„ํ‚คํ…์ฒ˜: FiLM, cross-attention, concatenation, quantization, tool-use ๋ฐฉ์‹์ด ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.
      • ํ–‰๋™ ์œ ํ˜• ๋ฐ ํ›ˆ๋ จ ๋ชฉํ‘œ: low-level ์ œ์–ด ์ •์ฑ…์€ ์ฃผ๋กœ end-effector pose์— ๋Œ€ํ•œ ํ–‰๋™์„ ์˜ˆ์ธกํ•˜๋ฉฐ, ํ–‰๋™ ์œ ํ˜•์— ๋”ฐ๋ผ ๋‹ค์–‘ํ•œ Behavior Cloning (BC) objective (์˜ˆ: ์—ฐ์† ํ–‰๋™ $ L_{Cont} = _t MSE(a_t, t) $, ์ด์‚ฐ ํ–‰๋™ $ L{Disc} = _t CE(a_t, t) $) ๋ฐ Diffusion Policy์˜ DDPM objective (์˜ˆ: $ L{DDPM} = MSE(k, {}(a_t + _k, k)) $)๊ฐ€ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.
      • RT ์‹œ๋ฆฌ์ฆˆ: RT-1์€ โ€œRobotic Transformerโ€ ๋ชจ๋ธ ์‹œ๋ฆฌ์ฆˆ์— ์˜๊ฐ์„ ์ฃผ์—ˆ์œผ๋ฉฐ, Transformer ๋ฐฑ๋ณธ์€ ๋” ํฐ ๋กœ๋ด‡ ๋ฐ์ดํ„ฐ์…‹์„ ํก์ˆ˜ํ•˜๋Š” ๋ฐ ํšจ๊ณผ์ ์ž…๋‹ˆ๋‹ค.
      • LVLA vs. Generalized VLA: LVLA๋Š” ์ง€์‹œ ๋”ฐ๋ฅด๊ธฐ ๋Šฅ๋ ฅ์„ ํ–ฅ์ƒ์‹œํ‚ค์ง€๋งŒ, ํ›ˆ๋ จ ๋น„์šฉ๊ณผ ๋ฐฐํฌ ์†๋„(๋А๋ฆฐ ์ถ”๋ก  ์†๋„)๊ฐ€ ์šฐ๋ ค๋ฉ๋‹ˆ๋‹ค.
      • Scaling Law: LLM๊ณผ ์œ ์‚ฌํ•˜๊ฒŒ, ๋กœ๋ด‡ ๊ณตํ•™์—์„œ๋„ model size, data quality, ํ™˜๊ฒฝ ๋ฐ ๊ฐ์ฒด ๋‹ค์–‘์„ฑ์˜ ์ค‘์š”์„ฑ์„ ๋ณด์—ฌ์ฃผ๋Š” scaling laws๊ฐ€ ๊ด€์ฐฐ๋ฉ๋‹ˆ๋‹ค.

IV. Task Planners

High-level task planner $ \pi_\phi $는 복잡한 작업 $ \ell $을 subtasks 시퀀스 $ [p_1, p_2, \ldots, p_N] \sim \pi_\phi(\ell, s_t) $로 분해하여 low-level 제어 정책 $ \pi_\theta $에 지시로 사용합니다. 이 과정은 task 또는 subgoal decomposition으로 알려져 있으며, TAMP (Task and Motion Planning) 및 Embodied Decision Making과 밀접하게 관련되어 있습니다.

A. Monolithic Task Planners

๋‹จ์ผ LLM ๋˜๋Š” Multimodal LLM (MLLM)์ด ๋งž์ถคํ˜• ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์‚ฌ์šฉํ•˜๊ฑฐ๋‚˜ embodied ๋ฐ์ดํ„ฐ์…‹์— finetuneํ•˜์—ฌ ์ž‘์—… ๊ณ„ํš์„ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  1. End-to-end Task Planners:
    • PaLM-E [11]: ViT์™€ PaLM์„ ํ†ตํ•ฉํ•˜์—ฌ high-level embodied ์ถ”๋ก  ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” large embodied multimodal language model์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ธ์ง€๋œ ์ด๋ฏธ์ง€์™€ high-level ์–ธ์–ด ์ง€์‹œ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ low-level ๋กœ๋ด‡ ์ •์ฑ…์„ ์œ„ํ•œ ํ…์ŠคํŠธ ๊ณ„ํš์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    • EmbodiedGPT [136]: vision encoder ์ž„๋ฒ ๋”ฉ ๋ฐ LLM์ด ์ œ๊ณตํ•˜๋Š” embodied planning ์ •๋ณด๋กœ๋ถ€ํ„ฐ task-relevant instance-level ํŠน์ง•์„ ์ถœ๋ ฅํ•˜๋Š” embodied-former๋ฅผ ๋„์ž…ํ•ฉ๋‹ˆ๋‹ค.
  2. End-to-end Task Planners with 3D Vision:
    • LEO [137]: point cloud encoder๋ฅผ LLM๊ณผ ํ†ตํ•ฉํ•˜๊ธฐ ์œ„ํ•œ 2๋‹จ๊ณ„ ํ›ˆ๋ จ ์ „๋žต์„ ์‚ฌ์šฉํ•˜๋ฉฐ, 3D ์งˆ์˜์‘๋‹ต๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์กฐ์ž‘, ๋‚ด๋น„๊ฒŒ์ด์…˜, ์ž‘์—… ๊ณ„ํš์—์„œ๋„ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.
    • 3D-LLM [44]: LLM์— 3D ์ •๋ณด๋ฅผ ์ฃผ์ž…ํ•˜์—ฌ 3D-assisted dialog ๋ฐ ๋‚ด๋น„๊ฒŒ์ด์…˜๊ณผ ๊ฐ™์€ 3D ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.
    • ShapeLLM [138]: ์ƒˆ๋กœ์šด 3D vision encoder์ธ ReCon++๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ตฌ์ถ•๋˜๋ฉฐ, ReCon++๋ฅผ LLaMA์™€ ํ†ตํ•ฉํ•˜์—ฌ 3D MM-Vet ๋ฒค์น˜๋งˆํฌ์—์„œ embodied ์ƒํ˜ธ์ž‘์šฉ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.
  3. Grounded Task Planners: low-level ์ œ์–ด ์ •์ฑ…์— ์˜ํ•ด ์‹คํ–‰๋  ์ˆ˜ ์žˆ๋Š”์ง€ ์—ฌ๋ถ€๋ฅผ ๊ณ ๋ คํ•˜์—ฌ high-level ํ–‰๋™์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    • SayCan [10]: high-level LLM planner์™€ low-level ์ œ์–ด ์ •์ฑ…์„ ํ†ตํ•ฉํ•˜๋Š” ํ”„๋ ˆ์ž„์›Œํฌ๋กœ, LLM์ด ๋‹ค์Œ low-level skill์„ โ€œsaysโ€ (task-grounding)ํ•˜๊ณ  low-level ์ •์ฑ…์ด skill ์™„๋ฃŒ ๊ฐ€๋Šฅ์„ฑ์„ โ€œcanโ€ (world-grounding)์œผ๋กœ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
    • Translated LM [139]: pretrained causal LLM을 사용하여 high-level 지시를 free-form 언어 구문으로 된 다음 행동으로 분해하고, pretrained masked LLM이 행동 번역을 수행합니다.

B. Modular Task Planners

end-to-end ๋ชจ๋ธ์„ finetuneํ•˜๋Š” ๋น„์šฉ์ด ๋งŽ์ด ๋“ค ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ, off-the-shelf LLM ๋ฐ VLM์„ task planner๋กœ ์กฐ๋ฆฝํ•˜๋Š” ๋ชจ๋“ˆ์‹ ์„ค๊ณ„๋ฅผ ์ฑ„ํƒํ•ฉ๋‹ˆ๋‹ค.

  1. Language-based Task Planners:
    • Inner Monologue [9]: high-level ๋ช…๋ น๊ณผ low-level ์ •์ฑ… ์‚ฌ์ด์— ์œ„์น˜ํ•˜์—ฌ, LLM์ด low-level ์ œ์–ด ์ •์ฑ…์„ ์œ„ํ•œ ์–ธ์–ด ์ง€์‹œ๋ฅผ ์ƒ์„ฑํ•˜๊ณ  ์ œ์–ด ์ •์ฑ…์˜ ํ”ผ๋“œ๋ฐฑ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ด๋Ÿฌํ•œ ์ง€์‹œ๋ฅผ ๋™์ ์œผ๋กœ ์—…๋ฐ์ดํŠธํ•ฉ๋‹ˆ๋‹ค.
    • LLM-Planner [141]: high-level planner์™€ low-level planner๋กœ ๊ตฌ์„ฑ๋œ ๊ณ„์ธต์  ์ •์ฑ…์„ ๋„์ž…ํ•˜๋ฉฐ, ์žฌ๊ณ„ํš ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ํ†ตํ•ฉํ•˜์—ฌ ๋กœ๋ด‡์ด โ€œget unstuckโ€๋˜๋Š” ๊ฒƒ์„ ๋•์Šต๋‹ˆ๋‹ค.
    • Socratic Models (SMs) [143]: ๋‹ค์–‘ํ•œ pretrained ๋ชจ๋ธ์„ finetune ์—†์ด ํšจ๊ณผ์ ์œผ๋กœ ๊ตฌ์„ฑํ•  ์ˆ˜ ์žˆ๋Š” ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ์‹œํ•˜๋ฉฐ, multimodal-informed prompting์„ ํ†ตํ•ด ๋‹ค์–‘ํ•œ multimodal ๊ธฐ๋Šฅ์„ ๊ฐ€์ง„ ๋ชจ๋ธ ๊ฐ„ ์ •๋ณด ๊ตํ™˜์„ ์šฉ์ดํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.
  2. Code-based Task Planners: LLM์˜ ์ฝ”๋”ฉ ๋Šฅ๋ ฅ์„ ํ™œ์šฉํ•˜์—ฌ ํ”„๋กœ๊ทธ๋žจ ํ˜•ํƒœ์˜ ์ž‘์—… ๊ณ„ํš์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    • ProgPrompt [144]: LLM์— ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ํ–‰๋™ ๋ฐ ๊ฐ์ฒด๋ฅผ ์ž์„ธํžˆ ์„ค๋ช…ํ•˜๋Š” ํ”„๋กœ๊ทธ๋žจ๊ณผ ์œ ์‚ฌํ•œ ์‚ฌ์–‘์„ ํ”„๋กฌํ”„ํŒ…ํ•˜์—ฌ high-level ๊ณ„ํš์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    • ChatGPT for Robotics [145]: ChatGPT์˜ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋Šฅ๋ ฅ์„ ํ™œ์šฉํ•˜์—ฌ API๋ฅผ ํ†ตํ•ด low-level ํ–‰๋™์„ ์ƒ์„ฑํ•˜๋Š” โ€œuser on the loopโ€ ์ œ์–ด๋ฅผ ์šฉ์ดํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.
    • Code as policies (CaP) [146]: GPT-3 ๋˜๋Š” Codex๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ perception ๋ชจ๋“ˆ ๋ฐ ์ œ์–ด API๋ฅผ ํ˜ธ์ถœํ•˜๋Š” ์ •์ฑ… ์ฝ”๋“œ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    • DEPS [147]: LLM์ด ํ™˜๊ฒฝ์—์„œ ์ˆ˜์ง‘๋œ ํ”ผ๋“œ๋ฐฑ ์„ค๋ช…์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ณ„ํš์„ ์ƒ์„ฑํ•˜๊ณ  ์‹คํŒจ๋ฅผ ์„ค๋ช…(โ€œself-explanationโ€)ํ•˜์—ฌ ์žฌ๊ณ„ํš์— ๋„์›€์„ ์ค๋‹ˆ๋‹ค.
    • ConceptGraphs [148]: ๊ด€์ฐฐ ์‹œํ€€์Šค๋ฅผ open-vocabulary 3D scene graphs๋กœ ๋ณ€ํ™˜ํ•˜๋ฉฐ, VLM์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ์ฒด๋ฅผ ์บก์…˜ํ•˜๊ณ  ๊ฐ์ฒด ๊ฐ„ ๊ด€๊ณ„๋ฅผ ์„ค์ •ํ•˜์—ฌ LLM์— ํ’๋ถ€ํ•œ semantic ๋ฐ spatial ๊ด€๊ณ„๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
    • ๊ฐ•์  ๋ฐ ํ•œ๊ณ„: End-to-end ๋ชจ๋ธ์€ finetune์„ ํ†ตํ•ด ์„ฑ๋Šฅ์„ ๋†’์ผ ์ˆ˜ ์žˆ์ง€๋งŒ ํ›ˆ๋ จ ๋น„์šฉ์ด ๋†’์Šต๋‹ˆ๋‹ค. ๋ชจ๋“ˆ์‹ ์ ‘๊ทผ ๋ฐฉ์‹์€ ์ฆ‰์‹œ ๋ฐฐํฌ ๊ฐ€๋Šฅํ•˜๋ฉฐ, ์–ธ์–ด ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์€ LLM๊ณผ VLM ํ†ตํ•ฉ์ด ์šฉ์ดํ•˜์ง€๋งŒ, ์ฝ”๋“œ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์€ ๋” ํฐ ์ œ์–ด๋ ฅ์„ ์ œ๊ณตํ•˜๋ฉฐ ๋””๋ฒ„๊น…์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

V. Datasets and Benchmarks

Embodied AI๋Š” ์‹ค์„ธ๊ณ„ ๋กœ๋ด‡ ๋ฐ์ดํ„ฐ์˜ scarcity ๋ฌธ์ œ์— ์ง๋ฉดํ•ด ์žˆ์Šต๋‹ˆ๋‹ค.

A. Real-world Robot Datasets & Benchmarks: ๋กœ๋ด‡ ์žฅ๋น„ ์กฐ๋‹ฌ, ํ™˜๊ฒฝ ์„ค์ •, ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ ๋น„์šฉ ๋ฐ ์‹œ๊ฐ„, ๋‹ค์–‘ํ•œ ๋กœ๋ด‡ ์œ ํ˜• ๋ฐ ๊ตฌ์„ฑ์œผ๋กœ ์ธํ•œ ๋ฐ์ดํ„ฐ ๋ถˆ์ผ์น˜, ๊ฐ์ฒด 6D poses์˜ ์ •ํ™•ํ•œ ์บก์ฒ˜์˜ ์–ด๋ ค์›€ ๋“ฑ์˜ ๋ฌธ์ œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค (ํ‘œ V).

B. Simulators, Simulated Robot Datasets & Benchmarks: ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ํ™˜๊ฒฝ์€ ์‹ค์ œ ์„ธ๊ณ„์˜ ์žฅ์• ๋ฌผ์„ ์šฐํšŒํ•˜๊ณ  ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ ํ”„๋กœ์„ธ์Šค๋ฅผ ํ™•์žฅํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋˜์ง€๋งŒ, sim-to-real gap ๋ฌธ์ œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค (ํ‘œ VI). ์ด๋Š” ๋น„ํ˜„์‹ค์ ์ธ ๋ Œ๋”๋ง ํ’ˆ์งˆ, ๋ฌผ๋ฆฌ ์‹œ๋ฎฌ๋ ˆ์ด์…˜์˜ ๋ถ€์ •ํ™•์„ฑ, ๊ฐ์ฒด ํŠน์„ฑ ๋ฐ ๋กœ๋ด‡ ๋™์ž‘ ๊ณ„ํš์˜ ๋„๋ฉ”์ธ shift์—์„œ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.

C. Automated Dataset Collection: RoboGen [187], AutoRT [188], DIAL [189] ๋ฐ RoboPoint [91]์™€ ๊ฐ™์€ ์ ‘๊ทผ ๋ฐฉ์‹์€ ์ž๋™ํ™”๋œ ๋ฐ์ดํ„ฐ์…‹ ์ˆ˜์ง‘์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.

D. Human Datasets: ์ธ๊ฐ„ ํ–‰๋™ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•˜๋Š” ๊ฒƒ์€ ๋ฐ์ดํ„ฐ ๋ถ€์กฑ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๋Œ€์•ˆ์ ์ธ ์ „๋žต์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ธ๊ฐ„์˜ ์†/๋ชธ ๋™์ž‘์„ ๋กœ๋ด‡ embodiment๋กœ ์บก์ฒ˜ํ•˜๊ณ  ์ „์†กํ•˜๋Š” ์–ด๋ ค์›€, ์ธ๊ฐ„ ๋ฐ์ดํ„ฐ์˜ ๋ถˆ์ผ์น˜์„ฑ, ์œ ์šฉํ•œ ์ •๋ณด ์ถ”์ถœ์˜ ๋…ธ๋™ ์ง‘์•ฝ์„ฑ ๋“ฑ์˜ ๋‹จ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

E. Task Planning Benchmarks: EgoPlan-Bench [192], PlanBench [193, 194], LoTa-Bench [195]๋Š” ์ž‘์—… ๊ณ„ํš ๋Šฅ๋ ฅ์„ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. Embodied Agent Interface (EAI) [196]๋Š” LLM ๊ธฐ๋ฐ˜ ๋ชจ๋“ˆ์˜ ์ž…๋ ฅ-์ถœ๋ ฅ์„ ์ •์‹ํ™”ํ•˜์—ฌ ๋” ์„ธ๋ถ„ํ™”๋œ metrics๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

F. Embodied Question Answering Benchmarks: EQA ๋ฒค์น˜๋งˆํฌ (ํ‘œ VII)๋Š” ์ง์ ‘์ ์œผ๋กœ ๋กœ๋ด‡ ์ž‘์—…์„ ํ‰๊ฐ€ํ•˜์ง€ ์•Š์ง€๋งŒ, ๊ณต๊ฐ„ ์ถ”๋ก , ๋ฌผ๋ฆฌ ์ดํ•ด, ์„ธ๊ณ„ ์ง€์‹๊ณผ ๊ฐ™์€ embodied AI์— ๊ด€๋ จ๋œ ๋Šฅ๋ ฅ์„ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ์—์ด์ „ํŠธ๊ฐ€ ๋‹ต๋ณ€์„ ์ œ๊ณตํ•˜๊ธฐ ์ „์— ํ™˜๊ฒฝ์„ ๋Šฅ๋™์ ์œผ๋กœ ํƒ์ƒ‰ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์—์„œ ์‹œ๊ฐ์  ์งˆ์˜์‘๋‹ต ๋ฒค์น˜๋งˆํฌ์™€ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

VI. Challenges and Future Directions

  • Safety first: ๋กœ๋ด‡์€ ๋ฌผ๋ฆฌ์  ์„ธ๊ณ„์™€ ์ƒํ˜ธ์ž‘์šฉํ•˜๋ฏ€๋กœ ์•ˆ์ „์ด ๊ฐ€์žฅ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.
  • Datasets & Benchmarks: ๊ด‘๋ฒ”์œ„ํ•œ ๊ธฐ์ˆ , ๊ฐ์ฒด, embodiment ๋ฐ ํ™˜๊ฒฝ์„ ํฌ๊ด„ํ•˜๋Š” ํฌ๊ด„์ ์ธ ๋ฒค์น˜๋งˆํฌ๊ฐ€ ํ•„์š”ํ•˜๋ฉฐ, ์„ฑ๊ณต๋ฅ  ์ด์ƒ์˜ ์„ธ๋ถ„ํ™”๋œ metrics๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
  • Foundation Models & Generalization: VLA foundation models 또는 robotic foundation models (RFM)은 embodiments, 환경 및 작업의 다양성으로 인해 여전히 개방된 연구 주제입니다.
  • Multimodality: ์œ ์šฉํ•œ ์ž„๋ฒ ๋”ฉ ํš๋“ ๋ฐ ๋‹ค๋ฅธ ์–‘์‹์˜ ์ •๋ ฌ๊ณผ ๊ฐ™์€ multimodal ๋ชจ๋ธ๊ณผ ๊ด€๋ จ๋œ ๋งŽ์€ ๊ณผ์ œ๋ฅผ ์ƒ์†๋ฐ›์Šต๋‹ˆ๋‹ค.
  • Framework for Long-Horizon Tasks: ๊ณ„์ธต์  ํ”„๋ ˆ์ž„์›Œํฌ๊ฐ€ ๊ฐ€์žฅ ์‹ค์šฉ์ ์ด์ง€๋งŒ, ์‹œ์Šคํ…œ ๋ณต์žก์„ฑ๊ณผ ์ž ์žฌ์  ์‹คํŒจ ์ง€์ ์„ ์ฆ๊ฐ€์‹œํ‚ต๋‹ˆ๋‹ค. end-to-end ๋ฐฉ์‹์œผ๋กœ long-horizon ์ž‘์—…์„ low-level ์ œ์–ด ์‹ ํ˜ธ๋กœ ์ง์ ‘ ๋ณ€ํ™˜ํ•˜๋Š” ํ†ตํ•ฉ ํ”„๋ ˆ์ž„์›Œํฌ ๊ฐœ๋ฐœ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
  • Real-Time Responsiveness: ๋งŽ์€ ๋กœ๋ด‡ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์€ ๋™์  ํ™˜๊ฒฝ์— ๋Œ€์‘ํ•˜๊ธฐ ์œ„ํ•ด ์‹ค์‹œ๊ฐ„ ์˜์‚ฌ๊ฒฐ์ •์ด ํ•„์š”ํ•˜๋ฉฐ, ์ถ”๋ก  ์‹œ๊ฐ„์ด ํ™˜๊ฒฝ ๋ณ€ํ™”๋ฅผ ๋”ฐ๋ผ๊ฐ€์ง€ ๋ชปํ•˜๋ฉด obsolete ํ–‰๋™์„ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • Multi-agent Systems: ๋ถ„์‚ฐ๋œ ์ธ์‹ ๋ฐ ํ˜‘์—…์  ๊ณ ์žฅ ๋ณต๊ตฌ์™€ ๊ฐ™์€ ์ด์ ์„ ์ œ๊ณตํ•˜์ง€๋งŒ, ํšจ๊ณผ์ ์ธ ํ†ต์‹ , ์กฐ์ •๋œ dispatching, fleet heterogeneity ๋“ฑ์˜ ๋ฌธ์ œ์— ์ง๋ฉดํ•ฉ๋‹ˆ๋‹ค.
  • Ethical and Societal Implications: ํ”„๋ผ์ด๋ฒ„์‹œ, ์ผ์ž๋ฆฌ ๋Œ€์ฒด, ์˜์‚ฌ๊ฒฐ์ • ํŽธํ–ฅ, ์‚ฌํšŒ ๊ทœ๋ฒ” ๋ฐ ์ธ๊ฐ„ ๊ด€๊ณ„์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ๊ณผ ๊ด€๋ จ๋œ ์œค๋ฆฌ์ , ์‚ฌํšŒ์ , ๋ฒ•์  ๋ฌธ์ œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.
  • Applications: ๋Œ€๋ถ€๋ถ„์˜ ํ˜„์žฌ VLA๋Š” ๊ฐ€์ • ๋˜๋Š” ์‚ฐ์—… ํ™˜๊ฒฝ์— ์ค‘์ ์„ ๋‘์ง€๋งŒ, ๊ฐ€์ƒ ๋น„์„œ, ์ž์œจ ์ฃผํ–‰์ฐจ, ๋†์—… ๋กœ๋ด‡ ๋“ฑ ๋” ๋„“์€ ๋ฒ”์œ„์˜ ์‘์šฉ ๋ถ„์•ผ๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

VII. ๊ฒฐ๋ก 

VLA ๋ชจ๋ธ์€ embodied agents๊ฐ€ ๋ฌผ๋ฆฌ์  ์„ธ๊ณ„์™€ ์ƒํ˜ธ์ž‘์šฉํ•˜๊ณ  ์‚ฌ์šฉ์ž ์ง€์‹œ๋ฅผ ์ดํ–‰ํ•˜๋Š” ๋ฐ ์—„์ฒญ๋‚œ ๊ฐ€๋Šฅ์„ฑ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ์ด ๋…ผ๋ฌธ์€ Large VLAs์™€ generalized VLAs๋ฅผ ๊ฒ€ํ† ํ•œ ์ตœ์ดˆ์˜ ์กฐ์‚ฌ ๋…ผ๋ฌธ์œผ๋กœ, ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜, ํ›ˆ๋ จ ์ „๋žต ๋ฐ ๊ฐœ๋ณ„ ๋ชจ๋“ˆ์„ ํฌํ•จํ•œ ๊ธฐ์ˆ ์  ์„ธ๋ถ€ ์‚ฌํ•ญ์„ ๋ถ„์„ ๋ฐ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ๋ฐ์ดํ„ฐ์…‹, ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ ๋ฐ ๋ฒค์น˜๋งˆํฌ์™€ ๊ฐ™์€ VLA ํ›ˆ๋ จ ๋ฐ ํ‰๊ฐ€๋ฅผ ์œ„ํ•œ ํ•„์ˆ˜ ์ž์›์„ ๊ฐ•์กฐํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ”” Ring Review

๐Ÿ”” Ring โ€” An idea that echoes. Grasp the core and its value.

โ€œ๋ฌผ๋ฆฌ ์„ธ๊ณ„์—์„œ ํ–‰๋™ํ•˜๋Š” AI๋ฅผ ๋งŒ๋“ค๋ ค๋ฉด, ๋จผ์ € ๋ณด๊ณ , ์ดํ•ดํ•˜๊ณ , ํ–‰๋™ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.โ€
โ€” ์ด ์„œ๋ฒ ์ด ๋…ผ๋ฌธ์˜ ํ•ต์‹ฌ ๋ฉ”์‹œ์ง€

๐ŸŽฏ ์„œ๋ก : ์™œ VLA์ธ๊ฐ€?

์—ฌ๋Ÿฌ๋ถ„์ด ๋กœ๋ด‡์—๊ฒŒ โ€œ์ € ๋นจ๊ฐ„ ์‚ฌ๊ณผ๋ฅผ ์ง‘์–ด์„œ ์ ‘์‹œ์— ์˜ฌ๋ ค์ค˜โ€๋ผ๊ณ  ๋งํ•œ๋‹ค๊ณ  ์ƒ์ƒํ•ด ๋ด…์‹œ๋‹ค. ์ด ๊ฐ„๋‹จํ•œ ์ง€์‹œ๋ฅผ ์ˆ˜ํ–‰ํ•˜๋ ค๋ฉด ๋กœ๋ด‡์€:

  1. Vision (์‹œ๊ฐ): โ€œ์ € ๋นจ๊ฐ„ ์‚ฌ๊ณผโ€๊ฐ€ ์–ด๋””์— ์žˆ๋Š”์ง€ ๋ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  2. Language (์–ธ์–ด): โ€œ์ง‘์–ด์„œ ์ ‘์‹œ์— ์˜ฌ๋ คโ€๋ผ๋Š” ๋ช…๋ น์„ ์ดํ•ดํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  3. Action (ํ–‰๋™): ์‹ค์ œ๋กœ ๊ทธ๋ฆฌํผ๋ฅผ ์›€์ง์—ฌ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค

์ด ์„ธ ๊ฐ€์ง€๊ฐ€ ์™„๋ฒฝํ•˜๊ฒŒ ํ†ตํ•ฉ๋˜์–ด์•ผ๋งŒ ๋กœ๋ด‡์€ ์ผ์ƒ์ ์ธ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฐ”๋กœ ์ด๊ฒƒ์ด Vision-Language-Action Model (VLA)์˜ ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค.

ChatGPT์™€ VLA์˜ ๊ฒฐ์ •์  ์ฐจ์ด

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                     Conversational AI (ChatGPT)                  โ”‚
โ”‚  Input: Text โ”€โ”€โ”€โ”€โ”€โ”€โ–บ LLM โ”€โ”€โ”€โ”€โ”€โ”€โ–บ Output: Text                   โ”‚
โ”‚                    (์–ธ์–ด์˜ ์„ธ๊ณ„์—์„œ๋งŒ ๋™์ž‘)                        โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                        Embodied AI (VLA)                         โ”‚
โ”‚  Input: Vision + Language โ”€โ”€โ–บ VLA โ”€โ”€โ–บ Output: Physical Actions  โ”‚
โ”‚                    (๋ฌผ๋ฆฌ ์„ธ๊ณ„์™€ ์ƒํ˜ธ์ž‘์šฉ)                         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

ChatGPT๊ฐ€ โ€œํ…์ŠคํŠธ๋ฅผ ํ…์ŠคํŠธ๋กœโ€ ๋ณ€ํ™˜ํ•œ๋‹ค๋ฉด, VLA๋Š” โ€œ์‹œ๊ฐ๊ณผ ์–ธ์–ด๋ฅผ ๋ฌผ๋ฆฌ์  ํ–‰๋™์œผ๋กœโ€ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ๋ฐ”๋กœ AGI(๋ฒ”์šฉ ์ธ๊ณต์ง€๋Šฅ)๋กœ ๊ฐ€๋Š” ํ•ต์‹ฌ ๋นŒ๋”ฉ ๋ธ”๋ก์ธ ์ด์œ ์ž…๋‹ˆ๋‹ค.


๐Ÿ—บ๏ธ VLA์˜ ๋ถ„๋ฅ˜ ์ฒด๊ณ„ (Taxonomy)

์ด ์„œ๋ฒ ์ด์˜ ๊ฐ€์žฅ ํฐ ๊ณตํ—Œ ์ค‘ ํ•˜๋‚˜๋Š” VLA ์—ฐ๊ตฌ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ๋ถ„๋ฅ˜ํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์•„๋ž˜ ๋‹ค์ด์–ด๊ทธ๋žจ์œผ๋กœ ์ „์ฒด ๊ตฌ์กฐ๋ฅผ ํŒŒ์•…ํ•ด ๋ด…์‹œ๋‹ค:

graph TB
    VLA[VLA Models] --> COMP["Components of VLA (VLA ๊ตฌ์„ฑ ์š”์†Œ)"]
    VLA --> CP["Control Policies (์ €์ˆ˜์ค€ ์ œ์–ด ์ •์ฑ…)"]
    VLA --> TP["Task Planners (๊ณ ์ˆ˜์ค€ ํƒœ์Šคํฌ ํ”Œ๋ž˜๋„ˆ)"]
    VLA --> DB["Datasets & Benchmarks (๋ฐ์ดํ„ฐ์…‹ & ๋ฒค์น˜๋งˆํฌ)"]

    COMP --> RL[Reinforcement Learning]
    COMP --> PVR[Pretrained Visual Repr.]
    COMP --> DL[Dynamics Learning]
    COMP --> WM[World Models]
    COMP --> RS[Reasoning]

    CP --> NONTF[Non-Transformer]
    CP --> TF[Transformer-based]
    CP --> DIFF[Diffusion-based]
    CP --> LVLA[Large VLA]

    TP --> MONO[Monolithic Planners]
    TP --> MOD[Modular Planners]

ํ•ต์‹ฌ ํ†ต์ฐฐ: โ€œ๊ณ„์ธต์  ํ”„๋ ˆ์ž„์›Œํฌโ€

ํ˜„๋Œ€ ๋กœ๋ด‡ ์‹œ์Šคํ…œ์˜ ๋Œ€๋ถ€๋ถ„์€ ๊ณ„์ธต์  ๊ตฌ์กฐ๋ฅผ ์ฑ„ํƒํ•ฉ๋‹ˆ๋‹ค:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ ๐Ÿง  High-Level Task Planner (๊ณ ์ˆ˜์ค€ ํƒœ์Šคํฌ ํ”Œ๋ž˜๋„ˆ)                โ”‚
โ”‚    "์‚ฌ๊ณผ๋ฅผ ์ง‘์–ด์„œ ์ ‘์‹œ์— ์˜ฌ๋ ค" โ†’ ์„œ๋ธŒํƒœ์Šคํฌ๋กœ ๋ถ„ํ•ด                 โ”‚
โ”‚    [1. ์‚ฌ๊ณผ ์œ„์น˜ ์ฐพ๊ธฐ] [2. ๊ทธ๋ฆฌํผ ์ด๋™] [3. ์ง‘๊ธฐ] [4. ์˜ฎ๊ธฐ๊ธฐ]     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                              โ”‚
                              โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ ๐Ÿฆพ Low-Level Control Policy (์ €์ˆ˜์ค€ ์ œ์–ด ์ •์ฑ…)                  โ”‚
โ”‚    ๊ฐ ์„œ๋ธŒํƒœ์Šคํฌ๋ฅผ ์‹ค์ œ ๋กœ๋ด‡ ๋™์ž‘์œผ๋กœ ๋ณ€ํ™˜                        โ”‚
โ”‚    a_t = ฯ€_ฮธ(a_t | p, s_โ‰คt, a_<t)                               โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

์ด ๊ตฌ์กฐ๊ฐ€ ํšจ๊ณผ์ ์ธ ์ด์œ : - ๊ณ ์ˆ˜์ค€ ํ”Œ๋ž˜๋„ˆ: ๋Œ€์šฉ๋Ÿ‰ ๋ชจ๋ธ์˜ ์ถ”๋ก  ๋Šฅ๋ ฅ ํ™œ์šฉ - ์ €์ˆ˜์ค€ ์ •์ฑ…: ์†๋„์™€ ์ •๋ฐ€๋„์— ์ง‘์ค‘


๐Ÿงฉ Part 1: VLA์˜ ๊ตฌ์„ฑ ์š”์†Œ (Components)

1.1 ๊ฐ•ํ™”ํ•™์Šต (Reinforcement Learning)

VLA์˜ ๋ฟŒ๋ฆฌ๋Š” ๊ฐ•ํ™”ํ•™์Šต์— ์žˆ์Šต๋‹ˆ๋‹ค. MDP(Markov Decision Process)๋กœ ํ‘œํ˜„ํ•˜๋ฉด:

\tau = (s_1, a_1, r_1, \ldots, s_T, a_T, r_T)

์—ฌ๊ธฐ์„œ ํ•ต์‹ฌ์ ์ธ ๋ฐœ์ „๋“ค:

๋ชจ๋ธ ํ•ต์‹ฌ ์•„์ด๋””์–ด VLA์— ๋ฏธ์นœ ์˜ํ–ฅ
Decision Transformer RL ๊ถค์ ์„ ์‹œํ€€์Šค ๋ชจ๋ธ๋ง ๋ฌธ์ œ๋กœ ์žฌ์ •์˜ Transformer๊ฐ€ RL์— ์ ์šฉ๋  ์ˆ˜ ์žˆ์Œ์„ ์ฆ๋ช…
Trajectory Transformer ์ „์ฒด ๊ถค์ ์„ ํ•˜๋‚˜์˜ ์‹œํ€€์Šค๋กœ ์ฒ˜๋ฆฌ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์‹œํ€€์Šค ์ฒ˜๋ฆฌ์˜ ๊ธฐ์ดˆ
Gato 멀티모달, 멀티태스크, 멀티-엠바디먼트 현대 VLA의 직접적 선조

ํŒŒ์ธ๋งŒ์‹ ์ง๊ด€

โ€œRL ๊ถค์ ์ด ๋ฌธ์žฅ๊ณผ ๊ฐ™๋‹ค๋ฉด, Decision Transformer๋Š” ๊ทธ ๋ฌธ์žฅ์„ โ€˜์ฝ๋Š”โ€™ ๋ฒ•์„ ๋ฐฐ์šด ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋งˆ์น˜ ์šฐ๋ฆฌ๊ฐ€ ์†Œ์„ค์„ ์ฝ์œผ๋ฉฐ ๋‹ค์Œ์— ๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚ ์ง€ ์˜ˆ์ธกํ•˜๋“ฏ์ด, ๋กœ๋ด‡๋„ ์ด์ „ ์ƒํƒœ์™€ ํ–‰๋™์˜ โ€™์ด์•ผ๊ธฐโ€™๋ฅผ ์ฝ๊ณ  ๋‹ค์Œ ํ–‰๋™์„ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.โ€

1.2 ์‚ฌ์ „ํ•™์Šต๋œ ์‹œ๊ฐ ํ‘œํ˜„ (Pretrained Visual Representations)

VLA์˜ ๋ˆˆ ์—ญํ• ์„ ํ•˜๋Š” Vision Encoder๋Š” ๋งค์šฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ์ฃผ์š” ์ ‘๊ทผ๋ฒ•๋“ค:

graph LR
    subgraph Methods["์‹œ๊ฐ ํ‘œํ˜„ ํ•™์Šต ๋ฐฉ๋ฒ•"]
        CLIP["CLIP (ํ…์ŠคํŠธ-์ด๋ฏธ์ง€ ๋Œ€์กฐํ•™์Šต)"]
        TCL["Time Contrastive (์‹œ๊ฐ„ ๋Œ€์กฐํ•™์Šต)"]
        MAE["MAE (๋งˆ์Šคํฌ ์˜คํ† ์ธ์ฝ”๋”)"]
        DINO["DINOv2 (์ž๊ธฐ์ฆ๋ฅ˜)"]
    end

    CLIP --> |Image-level| VE[Vision Encoder]
    TCL --> |Temporal| VE
    MAE --> |Pixel-level| VE
    DINO --> |Both levels| VE

์ฃผ์š” PVR ๋ชจ๋ธ ๋น„๊ต

๋ชจ๋ธ ๋„คํŠธ์›Œํฌ ํ•™์Šต ๋ฐฉ์‹ ํŠน์ง•
CLIP ViT-B VL ๋Œ€์กฐํ•™์Šต ๊ฐ€์žฅ ๋„๋ฆฌ ์‚ฌ์šฉ๋จ
R3M ResNet-50 ์‹œ๊ฐ„ ๋Œ€์กฐํ•™์Šต ์‹œ๊ฐ„์  ๊ด€๊ณ„ ํ•™์Šต
MVP ViT-B/L MAE ํ”ฝ์…€ ์ˆ˜์ค€ ์„ธ๋ถ€์ •๋ณด
VIP ResNet-50 ์‹œ๊ฐ„ ๋Œ€์กฐํ•™์Šต ๋ณด์ƒ ํ•จ์ˆ˜๋กœ๋„ ํ™œ์šฉ
VC-1 ViT-L MAE + CL ์ข…ํ•ฉ์  ๋น„๊ต ์—ฐ๊ตฌ
DINOv2 ViT ์ž๊ธฐ์ฆ๋ฅ˜ ํ”ฝ์…€+์ด๋ฏธ์ง€ ์ˆ˜์ค€ ๋ชจ๋‘
Theia ViT ์ฆ๋ฅ˜(Distillation) ์—ฌ๋Ÿฌ VFM ํ†ตํ•ฉ

ํ•ต์‹ฌ ์ˆ˜์‹: ๋Œ€์กฐ ํ•™์Šต

CLIP์˜ ํ•™์Šต ๋ชฉํ‘œ: \mathcal{L} = -\sum_{i=1}^{N} \log \frac{\exp(\mathcal{S}(x_i, y_i))}{\sum_{j=1}^{N} \exp(\mathcal{S}(x_i, y_j))}

์—ฌ๊ธฐ์„œ (x_i, y_i)๋Š” ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์Œ, \mathcal{S}(\cdot)๋Š” ์œ ์‚ฌ๋„ ์ธก์ •

1.3 ๋™์—ญํ•™ ํ•™์Šต (Dynamics Learning)

๋กœ๋ด‡์ด โ€œ๋ฌผ๋ฆฌ ๋ฒ•์น™โ€์„ ์ดํ•ดํ•˜๊ฒŒ ํ•˜๋Š” ๋ฐฉ๋ฒ•:

Forward Dynamics (์ˆœ๋ฐฉํ–ฅ ๋™์—ญํ•™):
ล_{t+1} โ† f_fwd(s_t, a_t)
"์ด ํ–‰๋™์„ ํ•˜๋ฉด ๋‹ค์Œ์— ๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚ ๊นŒ?"

Inverse Dynamics (์—ญ๋ฐฉํ–ฅ ๋™์—ญํ•™):
รข_t โ† f_inv(s_t, s_{t+1})
"์ด ์ƒํƒœ์—์„œ ์ € ์ƒํƒœ๋กœ ๊ฐ€๋ ค๋ฉด ์–ด๋–ค ํ–‰๋™์„ ํ•ด์•ผ ํ• ๊นŒ?"

ํŒŒ์ธ๋งŒ์‹ ์ง๊ด€

โ€œ์ˆœ๋ฐฉํ–ฅ ๋™์—ญํ•™์€ ๋‹น๊ตฌ๊ณต์„ ์น  ๋•Œ ๊ณต์ด ์–ด๋””๋กœ ๊ฐˆ์ง€ ์˜ˆ์ธกํ•˜๋Š” ๊ฒƒ๊ณผ ๊ฐ™๊ณ , ์—ญ๋ฐฉํ–ฅ ๋™์—ญํ•™์€ ๊ณต์ด ํŠน์ • ์œ„์น˜์— ๊ฐ€๊ฒŒ ํ•˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ์ณ์•ผ ํ•˜๋Š”์ง€ ์•Œ์•„๋‚ด๋Š” ๊ฒƒ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.โ€

1.4 ์›”๋“œ ๋ชจ๋ธ (World Models)

์›”๋“œ ๋ชจ๋ธ์€ ๋กœ๋ด‡์ด โ€œ์ƒ์ƒ ์†์—์„œโ€ ๋ฏธ๋ž˜๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋Šฅ๋ ฅ์ž…๋‹ˆ๋‹ค:

\hat{s}_{t+1} \sim P(\hat{s}_{t+1} | s_t, a_t)

graph TB
    subgraph WMTypes["์›”๋“œ ๋ชจ๋ธ์˜ ์ข…๋ฅ˜"]
        LWM["LLM-induced World Models (ํ…์ŠคํŠธ ๊ธฐ๋ฐ˜)"]
        VWM["Visual World Models (์ด๋ฏธ์ง€/์˜์ƒ ๊ธฐ๋ฐ˜)"]
    end

    LWM --> DECKARD["DECKARD (DAG ํ˜•ํƒœ ์ถ”์ƒ ์›”๋“œ ๋ชจ๋ธ)"]
    LWM --> RAP["RAP (MCTS + LLM)"]
    LWM --> LLMDM["LLM-DM (PDDL ์ƒ์„ฑ)"]

    VWM --> Genie["Genie (์ƒํ˜ธ์ž‘์šฉ ํ™˜๊ฒฝ ์ƒ์„ฑ)"]
    VWM --> 3DVLA["3D-VLA (3D ๋ชฉํ‘œ ์ƒํƒœ ์ƒ์„ฑ)"]
    VWM --> UniSim["UniSim (์‹ค์„ธ๊ณ„ ์‹œ๋ฎฌ๋ ˆ์ด์…˜)"]

Dreamer ์‹œ๋ฆฌ์ฆˆ์˜ ํ•ต์‹ฌ ์•„์ด๋””์–ด

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚              Dreamer์˜ ์„ธ ๊ฐ€์ง€ ํ•ต์‹ฌ ๋ชจ๋“ˆ                    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1. Representation Model: ์ด๋ฏธ์ง€ โ†’ ์ž ์žฌ ์ƒํƒœ ์ธ์ฝ”๋”ฉ         โ”‚
โ”‚ 2. Transition Model: ์ž ์žฌ ์ƒํƒœ ๊ฐ„ ์ „์ด ํ•™์Šต                 โ”‚
โ”‚ 3. Reward Model: ์ƒํƒœ์— ๋Œ€ํ•œ ๋ณด์ƒ ์˜ˆ์ธก                      โ”‚
โ”‚                                                            โ”‚
โ”‚ โ†’ "๊ฟˆ์†์—์„œ(imagination) ํ–‰๋™์„ ํ•™์Šตํ•˜๊ณ  ํ˜„์‹ค์— ์ ์šฉ"      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

1.5 ์ถ”๋ก  (Reasoning)

Chain-of-Thought (CoT) ๊ธฐ๋ฒ•์ด VLA์—๋„ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค:

๋ชจ๋ธ ์ ‘๊ทผ ๋ฐฉ์‹ ์ ์šฉ ๋ ˆ๋ฒจ
ThinkBot CoT๋กœ ๋ˆ„๋ฝ๋œ ํ–‰๋™ ์„ค๋ช… ๋ณต์› ํƒœ์Šคํฌ ํ”Œ๋ž˜๋‹
ReAct ์ถ”๋ก ๊ณผ ํ–‰๋™์„ ๋ฒˆ๊ฐˆ์•„ ์ˆ˜ํ–‰ ์˜์‚ฌ๊ฒฐ์ •
ECoT VLA์— CoT ์ถ”๋ก  ๋Šฅ๋ ฅ ๋ถ€์—ฌ ์ €์ˆ˜์ค€ ์ œ์–ด

ECoT์˜ ํ˜์‹ ์  ์ ‘๊ทผ

๊ธฐ์กด VLA: 
  ๊ด€์ฐฐ + ์ง€์‹œ โ†’ ๋ฐ”๋กœ ํ–‰๋™ ์ถœ๋ ฅ ("๊ทผ์œก ๊ธฐ์–ต" ๋ฐฉ์‹)

ECoT:
  ๊ด€์ฐฐ + ์ง€์‹œ โ†’ [๊ณ„ํš ์ถ”๋ก ] โ†’ [์„œ๋ธŒํƒœ์Šคํฌ ์ถ”๋ก ] โ†’ [๋™์ž‘ ์ถ”๋ก ] โ†’ [์‹œ๊ฐ ํŠน์ง• ์ถ”๋ก ] โ†’ ํ–‰๋™ ์ถœ๋ ฅ

๐ŸŽฎ Part 2: ์ €์ˆ˜์ค€ ์ œ์–ด ์ •์ฑ… (Low-Level Control Policies)

VLA ์ œ์–ด ์ •์ฑ…์˜ ์ผ๋ฐ˜ ๊ณต์‹

\hat{a}_t \sim \pi_\theta(\hat{a}_t | p, s_{\leq t}, a_{<t})

  • p: ์–ธ์–ด ์ง€์‹œ
  • s_{\leq t}: ํ˜„์žฌ๊นŒ์ง€์˜ ์ƒํƒœ (์ฃผ๋กœ ์ด๋ฏธ์ง€)
  • a_{<t}: ์ด์ „ ํ–‰๋™๋“ค
  • \pi_\theta: ํŒŒ๋ผ๋ฏธํ„ฐ \theta๋ฅผ ๊ฐ€์ง„ ์ •์ฑ…

2.1 ์•„ํ‚คํ…์ฒ˜๋ณ„ ๋ถ„๋ฅ˜

flowchart TB
    subgraph Arch["์ œ์–ด ์ •์ฑ… ์•„ํ‚คํ…์ฒ˜"]
        NT[Non-Transformer]
        TF[Transformer-based]
        DF[Diffusion-based]
        LV[Large VLA]
    end

    NT --> CLIPort[CLIPort]
    NT --> BCZ[BC-Z]
    NT --> HULC[HULC]

    TF --> RT1[RT-1]
    TF --> Gato[Gato]
    TF --> VIMA[VIMA]
    TF --> PerAct[PerAct]

    DF --> DiffPolicy[Diffusion Policy]
    DF --> Octo[Octo]
    DF --> DP3[DP3]

    LV --> RT2[RT-2]
    LV --> OpenVLA[OpenVLA]
    LV --> Pi0[ฯ€0]

2.2 ํ•ต์‹ฌ ๋ชจ๋ธ ์‹ฌ์ธต ๋ถ„์„

CLIPort: VLA์˜ ์„ ๊ตฌ์ž

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                      CLIPort Architecture                    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                             โ”‚
โ”‚  RGB Image โ”€โ”€โ–บ CLIP Vision Encoder โ”€โ”€โ–บ "Semantic" Stream   โ”‚
โ”‚                        โ”‚                      โ”‚             โ”‚
โ”‚                        โ–ผ                      โ–ผ             โ”‚
โ”‚  RGB-D Image โ”€โ”€โ–บ Transporter Network โ”€โ”€โ–บ "Spatial" Stream  โ”‚
โ”‚                                               โ”‚             โ”‚
โ”‚  Language โ”€โ”€โ”€โ”€โ”€โ”€โ–บ CLIP Sentence Encoder โ”€โ”€โ”€โ”€โ”€โ”€โ”ค             โ”‚
โ”‚                                               โ–ผ             โ”‚
โ”‚                                          SE(2) Action       โ”‚
โ”‚                                    (Pick & Place Pose)      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

ํ•ต์‹ฌ ์ธ์‚ฌ์ดํŠธ: โ€œ๋ฌด์—‡์„(Semantic)โ€ + โ€œ์–ด๋””์„œ(Spatial)โ€ = ์™„์ „ํ•œ ์กฐ์ž‘

RT-1: ๋Œ€๊ทœ๋ชจ ์‹ค์„ธ๊ณ„ ์ œ์–ด์˜ ์‹œ์ž‘

๊ตฌ์„ฑ ์š”์†Œ ์ƒ์„ธ
Vision Encoder EfficientNet
Language Encoder Universal Sentence Encoder
Action Decoder Transformer with FiLM conditioning
ํ•™์Šต ๋ฐ์ดํ„ฐ Fractal (130k ์—ํ”ผ์†Œ๋“œ)
ํ–‰๋™ ํƒ€์ž… ์ด์‚ฐํ™”๋œ ํ–‰๋™ (Discretized)
# RT-1 스타일의 행동 토큰화 (원문의 의사 코드를 실행 가능하게 정리한 스케치)
import numpy as np

def tokenize_action(action, low=-1.0, high=1.0, num_bins=256):
    # 7-DoF arm + gripper → 8차원 연속 행동
    # 각 차원을 [low, high] 범위로 클리핑한 뒤 256개 빈으로 이산화
    tokens = []
    for dim in action:
        clipped = np.clip(dim, low, high)
        bin_idx = int((clipped - low) / (high - low) * (num_bins - 1))
        tokens.append(bin_idx)
    return tokens  # 총 8개의 토큰
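
가상의 행동 벡터로 위 함수를 호출해 보면 길이 8의 정수 토큰 리스트가 반환됩니다 (행동 값은 설명을 위한 가상의 수치입니다):

example_action = [0.12, -0.30, 0.05, 0.00, 0.40, -0.90, 0.70, 1.00]  # 7-DoF + gripper (가상 값)
print(tokenize_action(example_action))  # [142, 89, 133, 127, 178, 12, 216, 255]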

VIMA: ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํ”„๋กฌํ”„ํŠธ์˜ ํž˜

VIMA์˜ ํ˜์‹ ์ ์ธ ์ ์€ ์–ธ์–ด ์™ธ์—๋„ ๋‹ค์–‘ํ•œ ํ”„๋กฌํ”„ํŠธ๋ฅผ ๋ฐ›์„ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ:

์ง€์›ํ•˜๋Š” ํ”„๋กฌํ”„ํŠธ ํƒ€์ž…:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ 1. ํ…์ŠคํŠธ๋งŒ: "Stack the red block on the blue block"        โ”‚
โ”‚ 2. ํ…์ŠคํŠธ + ์ด๋ฏธ์ง€: "Pick up the [๐Ÿ–ผ๏ธ] and place it here"    โ”‚
โ”‚ 3. ๋น„๋””์˜ค ๋ฐ๋ชจ: "Do what you see in this video"             โ”‚
โ”‚ 4. ๋ชฉํ‘œ ์ด๋ฏธ์ง€: "Make the scene look like [๐Ÿ–ผ๏ธ]"             โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

2.3 Diffusion-based ์ •์ฑ…

Diffusion Policy๋Š” ๋กœ๋ด‡ ์กฐ์ž‘์— ์ƒˆ๋กœ์šด ํŒจ๋Ÿฌ๋‹ค์ž„์„ ์ œ์‹œํ–ˆ์Šต๋‹ˆ๋‹ค:

a_t^{(k-1)} = \frac{1}{\sqrt{\alpha_k}}\left(a_t^{(k)} - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}}\epsilon_\theta(a_t^{(k)}, s_t, k)\right) + \sigma_k z

์™œ Diffusion์ธ๊ฐ€?

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ ๊ธฐ์กด ๋ฐฉ์‹์˜ ํ•œ๊ณ„:                                            โ”‚
โ”‚ - ๋‹จ์ผ ์ตœ์  ํ–‰๋™๋งŒ ์˜ˆ์ธก (unimodal)                          โ”‚
โ”‚ - ๋‹ค์ค‘ ๋ชจ๋“œ ๋ถ„ํฌ ํ‘œํ˜„ ์–ด๋ ค์›€                                 โ”‚
โ”‚                                                             โ”‚
โ”‚ Diffusion์˜ ์žฅ์ :                                           โ”‚
โ”‚ - ๋ณต์žกํ•œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํ–‰๋™ ๋ถ„ํฌ ํ•™์Šต ๊ฐ€๋Šฅ                        โ”‚
โ”‚ - "Action Chunking": ํ•œ ๋ฒˆ์— ์—ฌ๋Ÿฌ ์‹œ๊ฐ„ ์Šคํ…์˜ ํ–‰๋™ ์˜ˆ์ธก     โ”‚
โ”‚ - ๋” ๋ถ€๋“œ๋Ÿฝ๊ณ  ์ผ๊ด€๋œ ๊ถค์  ์ƒ์„ฑ                               โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

DP3 (3D Diffusion Policy)

graph LR
    PC[Point Cloud] --> ENC[3D Encoder]
    ENC --> DP[Diffusion Policy]
    DP --> ACT[Action Sequence]

    subgraph Adv["3D ํ‘œํ˜„์˜ ์žฅ์ "]
        ADV1[์‹œ์  ๋ถˆ๋ณ€์„ฑ]
        ADV2[๊นŠ์ด ์ •๋ณด ํ™œ์šฉ]
        ADV3[๊ณต๊ฐ„ ์ถ”๋ก  ํ–ฅ์ƒ]
    end

2.4 Large VLA (LVLA)

RT-2: VLM์„ VLA๋กœ ๋ณ€ํ™˜ํ•˜๊ธฐ

RT-2์˜ ํ•ต์‹ฌ ์•„์ด๋””์–ด: "Symbol Tuning"
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ PaLI-X/PaLM-E (Vision-Language Model)                       โ”‚
โ”‚           โ”‚                                                 โ”‚
โ”‚           โ–ผ                                                 โ”‚
โ”‚ ๋กœ๋ด‡ ๋ฐ์ดํ„ฐ๋กœ Co-fine-tuning                                โ”‚
โ”‚           โ”‚                                                 โ”‚
โ”‚           โ–ผ                                                 โ”‚
โ”‚ ํ–‰๋™์„ "์–ธ์–ด ํ† ํฐ"์ฒ˜๋Ÿผ ์ถœ๋ ฅ                                 โ”‚
โ”‚ ์˜ˆ: "1 128 91 241 5 101 127"                               โ”‚
โ”‚     (๊ฐ ์ˆซ์ž๊ฐ€ ํ–‰๋™ ์ฐจ์›์˜ ์ด์‚ฐํ™”๋œ ๊ฐ’)                      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Web ์ง€์‹์˜ ์ „์ด: RT-2๋Š” ์ธํ„ฐ๋„ท์—์„œ ํ•™์Šตํ•œ ์ง€์‹์„ ๋กœ๋ด‡ ์ œ์–ด๋กœ ์ „์ดํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค.

OpenVLA: ์˜คํ”ˆ์†Œ์Šค LVLA

ํŠน์ง• ์ƒ์„ธ
๊ธฐ๋ฐ˜ ๋ชจ๋ธ Prismatic-7B VLM
Vision Encoders SigLIP + DINOv2 (์œตํ•ฉ)
ํ•™์Šต ๋ฐ์ดํ„ฐ Open X-Embodiment
์˜คํ”ˆ์†Œ์Šค โœ… (์ฝ”๋“œ, ๊ฐ€์ค‘์น˜ ๋ชจ๋‘ ๊ณต๊ฐœ)

ฯ€0 (Pi-Zero): Flow Matching ๊ธฐ๋ฐ˜ VLA

ฯ€0์˜ ํ˜์‹ :
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ ๊ธฐ์กด LVLA: ํ–‰๋™์„ ์ด์‚ฐ ํ† ํฐ์œผ๋กœ ์ถœ๋ ฅ                         โ”‚
โ”‚                                                             โ”‚
โ”‚ ฯ€0: Flow Matching์œผ๋กœ ์—ฐ์† ํ–‰๋™ ์ง์ ‘ ์ƒ์„ฑ                   โ”‚
โ”‚     - ๋” ์ •๋ฐ€ํ•œ ์ œ์–ด ๊ฐ€๋Šฅ                                   โ”‚
โ”‚     - ์ด์‚ฐํ™” ์†์‹ค ์—†์Œ                                      โ”‚
โ”‚     - ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํ–‰๋™ ๋ถ„ํฌ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋ชจ๋ธ๋ง                   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

2.5 3D Vision ๊ธฐ๋ฐ˜ ์ •์ฑ…

PerAct: 3D ์–ดํฌ๋˜์Šค ๋งต

# PerAct ์Šคํƒ€์ผ ์˜์‚ฌ์ฝ”๋“œ
def peract_forward(rgb_d_images, language_instruction):
    # 1. ์—ฌ๋Ÿฌ ์‹œ์ ์˜ RGB-D๋ฅผ 3D ๋ณต์…€ ๊ทธ๋ฆฌ๋“œ๋กœ ๋ณ€ํ™˜
    voxel_grid = images_to_voxels(rgb_d_images)
    
    # 2. ์–ธ์–ด ์ž„๋ฒ ๋”ฉ
    lang_embed = clip_encode(language_instruction)
    
    # 3. PerceiverIO๋กœ 3D ์–ดํฌ๋˜์Šค ๋งต ์˜ˆ์ธก
    affordance_map = perceiver_io(voxel_grid, lang_embed)
    
    # 4. ๊ฐ€์žฅ ๋†’์€ ์–ดํฌ๋˜์Šค ์œ„์น˜ = ๋กœ๋ด‡์ด ํ–‰๋™ํ•  ์œ„์น˜
    action_pose = argmax(affordance_map)
    return action_pose

RVT (Robotic View Transformer)

RVT์˜ ํ•ต์‹ฌ: 2D โ†’ 3D ํ”„๋กœ์ ์…˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ ์ž…๋ ฅ: ์—ฌ๋Ÿฌ ์‹œ์ ์˜ RGB-D ์ด๋ฏธ์ง€                               โ”‚
โ”‚                    โ”‚                                        โ”‚
โ”‚                    โ–ผ                                        โ”‚
โ”‚ ๊ฐ€์ƒ ์ง๊ต ์‹œ์  ์ด๋ฏธ์ง€ ์ƒ์„ฑ (Top, Front, Side ๋“ฑ)            โ”‚
โ”‚                    โ”‚                                        โ”‚
โ”‚                    โ–ผ                                        โ”‚
โ”‚ ๊ฐ ์‹œ์ ์—์„œ 2D ์–ดํฌ๋˜์Šค ์˜ˆ์ธก                                โ”‚
โ”‚                    โ”‚                                        โ”‚
โ”‚                    โ–ผ                                        โ”‚
โ”‚ 2D ์–ดํฌ๋˜์Šค๋ฅผ 3D ๊ณต๊ฐ„์œผ๋กœ ์—ญํˆฌ์˜                            โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
์žฅ์ : PerAct๋ณด๋‹ค 10๋ฐฐ ๋น ๋ฅด๊ณ  ๋™๋“ฑํ•˜๊ฑฐ๋‚˜ ๋” ์ข‹์€ ์„ฑ๋Šฅ

๐Ÿ—“๏ธ Part 3: ๊ณ ์ˆ˜์ค€ ํƒœ์Šคํฌ ํ”Œ๋ž˜๋„ˆ (Task Planners)

ํƒœ์Šคํฌ ํ”Œ๋ž˜๋„ˆ์˜ ์—ญํ• 

\ell \xrightarrow{\pi_\phi} (p_1, p_2, \ldots, p_n)

๋ณต์žกํ•œ ์ง€์‹œ \ell์„ ์„œ๋ธŒํƒœ์Šคํฌ ์‹œํ€€์Šค (p_1, p_2, \ldots)๋กœ ๋ถ„ํ•ด

3.1 ๋‹จ์ผ์ฒด ํƒœ์Šคํฌ ํ”Œ๋ž˜๋„ˆ (Monolithic)

PaLM-E: ๊ฑฐ๋Œ€ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ LLM

PaLM-E์˜ ๊ตฌ์กฐ:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ ์ž…๋ ฅ:                                                       โ”‚
โ”‚   - ํ…์ŠคํŠธ ํ† ํฐ: "Put the rice chips bag..."               โ”‚
โ”‚   - ์ด๋ฏธ์ง€ ํ† ํฐ: [๐Ÿ–ผ๏ธ] (ViT๋กœ ์ธ์ฝ”๋”ฉ)                        โ”‚
โ”‚   - ์ƒํƒœ ํ† ํฐ: ๋กœ๋ด‡ ์ƒํƒœ ๋ฒกํ„ฐ                               โ”‚
โ”‚                                                             โ”‚
โ”‚ PaLM-E (562B ํŒŒ๋ผ๋ฏธํ„ฐ):                                     โ”‚
โ”‚   ๋ชจ๋“  ํ† ํฐ์„ ํ•˜๋‚˜์˜ ์‹œํ€€์Šค๋กœ ์ฒ˜๋ฆฌ                          โ”‚
โ”‚                                                             โ”‚
โ”‚ ์ถœ๋ ฅ:                                                       โ”‚
โ”‚   - ๊ณ ์ˆ˜์ค€ ๊ณ„ํš: "1. Find bag 2. Pick up bag 3. Move to..."โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

3D Vision ๊ธฐ๋ฐ˜ ํ”Œ๋ž˜๋„ˆ๋“ค

๋ชจ๋ธ 3D ํ‘œํ˜„ ํŠน์ง•
3D-LLM NeRF ํŠน์ง• 3D ์žฅ๋ฉด ์ดํ•ด
LEO Scene Graph 3D ์„ธ๊ณ„์˜ ์—์ด์ „ํŠธ
MultiPLY Object-Centric ๋‹ค์ค‘ ๊ฐ๊ฐ ํ†ตํ•ฉ

3.2 ๋ชจ๋“ˆํ˜• ํƒœ์Šคํฌ ํ”Œ๋ž˜๋„ˆ (Modular)

graph TB
    subgraph ModPlanner["Modular Task Planner"]
        LLM["LLM Planner (SayCan, Inner Monologue)"]
        VLM["VLM for Grounding (ํ™˜๊ฒฝ ์ดํ•ด)"]
        SKILL["Skill Library (์‹คํ–‰ ๊ฐ€๋Šฅ ๊ธฐ์ˆ ๋“ค)"]
    end

    USER[์‚ฌ์šฉ์ž ์ง€์‹œ] --> LLM
    LLM --> |ํƒœ์Šคํฌ ๋ถ„ํ•ด| PLAN[๊ณ ์ˆ˜์ค€ ๊ณ„ํš]
    VLM --> |ํ™˜๊ฒฝ ์ •๋ณด| LLM
    PLAN --> |์Šคํ‚ฌ ํ˜ธ์ถœ| SKILL
    SKILL --> |ํ”ผ๋“œ๋ฐฑ| LLM

SayCan: ์–ดํฌ๋˜์Šค ๊ธฐ๋ฐ˜ ๊ทธ๋ผ์šด๋”ฉ

SayCan์˜ ํ•ต์‹ฌ ๊ณต์‹:
P(action | instruction) = P(useful | instruction) ร— P(feasible | state)

์—ฌ๊ธฐ์„œ:
- P(useful | instruction): LLM์ด ๊ณ„์‚ฐ (์ด ํ–‰๋™์ด ์œ ์šฉํ•œ๊ฐ€?)
- P(feasible | state): Value Function์ด ๊ณ„์‚ฐ (์ง€๊ธˆ ์‹คํ–‰ ๊ฐ€๋Šฅํ•œ๊ฐ€?)
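
이 점수 결합을 코드로 옮기면, 각 스킬에 대해 LLM이 준 유용성 확률과 가치 함수가 준 실행 가능성 확률을 곱해 최댓값을 고르는 형태가 됩니다 (llm_usefulness, value_feasibility는 가상의 함수입니다):

def saycan_select_skill(instruction, state, skills, llm_usefulness, value_feasibility):
    # skills: 스킬 이름 리스트 (예: ["find apple", "pick apple", "go to table", ...])
    scores = {
        skill: llm_usefulness(instruction, skill) * value_feasibility(state, skill)
        for skill in skills
    }
    return max(scores, key=scores.get)   # P(useful) × P(feasible)가 가장 큰 스킬 선택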

Code as Policies (CaP): ์ฝ”๋“œ๋กœ ์ •์ฑ… ํ‘œํ˜„

# CaP ์Šคํƒ€์ผ ์˜ˆ์‹œ: LLM์ด ์ƒ์„ฑํ•˜๋Š” ์ฝ”๋“œ
def execute_task(instruction: str):
    """์‚ฌ์šฉ์ž ์ง€์‹œ: "Stack all the blocks on the green area" """
    
    # LLM์ด ์ƒ์„ฑํ•œ ์ฝ”๋“œ
    blocks = detect_objects("block")
    green_area = detect_objects("green area")[0]
    
    for i, block in enumerate(blocks):
        pick(block)
        place(green_area.position + [0, 0, 0.05 * i])
    
    return "Task completed"

๐Ÿ“Š Part 4: ๋ฐ์ดํ„ฐ์…‹๊ณผ ๋ฒค์น˜๋งˆํฌ

4.1 ์‹ค์„ธ๊ณ„ ๋ฐ์ดํ„ฐ์…‹

๋ฐ์ดํ„ฐ์…‹ ์—ํ”ผ์†Œ๋“œ ์ˆ˜ ๋กœ๋ด‡ ํƒœ์Šคํฌ
RT-1 (Fractal) 130,000+ EDR Pick, Place, Move
BridgeData V2 60,000+ WidowX ๋‹ค์–‘ํ•œ ์กฐ์ž‘
Open X-Embodiment 1,000,000+ 22์ข… ๋กœ๋ด‡ 527๊ฐœ ์Šคํ‚ฌ
DROID 76,000+ Franka ์ผ์ƒ ์กฐ์ž‘

4.2 ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ & ๋ฒค์น˜๋งˆํฌ

graph LR
    subgraph Sims["์ฃผ์š” ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ"]
        MS["Meta-World (50๊ฐœ ์กฐ์ž‘ ํƒœ์Šคํฌ)"]
        RLB["RLBench (100+ ํƒœ์Šคํฌ)"]
        CAL["CALVIN (์žฅ๊ธฐ ์กฐ์ž‘)"]
        HAB["Habitat (๋‚ด๋น„๊ฒŒ์ด์…˜)"]
    end

    subgraph Bench["์ตœ์‹  ๋ฒค์น˜๋งˆํฌ"]
        LIBERO["LIBERO (์ง€์‹ ์ „์ด ํ‰๊ฐ€)"]
        VIMAB["VIMA-Bench (๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํ”„๋กฌํ”„ํŠธ)"]
        BEH["BEHAVIOR-1K (1000๊ฐœ ์ผ์ƒ ํƒœ์Šคํฌ)"]
    end

4.3 ์ž๋™ ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ ๋ฐ์ดํ„ฐ ๋ถ€์กฑ ๋ฌธ์ œ ํ•ด๊ฒฐ ๋ฐฉ๋ฒ•๋“ค                                 โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1. RoboCat ์ž๊ธฐ๊ฐœ์„ : ๋กœ๋ด‡์ด ์Šค์Šค๋กœ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ              โ”‚
โ”‚ 2. ์ธํ„ฐ๋„ท ๋น„๋””์˜ค: ์ธ๊ฐ„ ์กฐ์ž‘ ์˜์ƒ์—์„œ ํ•™์Šต                   โ”‚
โ”‚ 3. ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ์ฆ๊ฐ•: ๋„๋ฉ”์ธ ๋žœ๋คํ™”                           โ”‚
โ”‚ 4. LLM ์ƒ์„ฑ: ์–ธ์–ด ์ง€์‹œ ์ž๋™ ์ƒ์„ฑ                            โ”‚
โ”‚ 5. Diffusion ์ฆ๊ฐ•: ์ด๋ฏธ์ง€/์˜์ƒ ํ•ฉ์„ฑ                          โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐Ÿ”ฎ Part 5: ๋„์ „ ๊ณผ์ œ์™€ ๋ฏธ๋ž˜ ๋ฐฉํ–ฅ

5.1 ์•ˆ์ „์„ฑ (Safety First)

๋ฌผ๋ฆฌ์  ์•ˆ์ „ ์œ„ํ—˜:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ 1. ์ถฉ๋Œ ์œ„ํ—˜: ๋กœ๋ด‡์ด ์‚ฌ๋žŒ์ด๋‚˜ ๋ฌผ์ฒด์™€ ์ถฉ๋Œ                    โ”‚
โ”‚ 2. ์˜ˆ์ธก ๋ถˆ๊ฐ€๋Šฅ์„ฑ: LLM ๊ธฐ๋ฐ˜ ์‹œ์Šคํ…œ์˜ ํ™˜๊ฐ(hallucination)     โ”‚
โ”‚ 3. ์‹คํŒจ ๋ณต๊ตฌ: ์‹คํ–‰ ์ค‘ ์˜ค๋ฅ˜ ์‹œ ์•ˆ์ „ํ•œ ๋ณต๊ตฌ                   โ”‚
โ”‚                                                             โ”‚
โ”‚ ์—ฐ๊ตฌ ๋ฐฉํ–ฅ:                                                  โ”‚
โ”‚ - ์•ˆ์ „ ์ œ์•ฝ ํ•™์Šต (Safe RL)                                  โ”‚
โ”‚ - ๋ถˆํ™•์‹ค์„ฑ ์ •๋Ÿ‰ํ™”                                           โ”‚
โ”‚ - Human-in-the-loop ์‹œ์Šคํ…œ                                  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

5.2 ๋ฐ์ดํ„ฐ & ๋ฒค์น˜๋งˆํฌ

๋„์ „ ๊ณผ์ œ ํ˜„์žฌ ์ƒํƒœ ๋ฏธ๋ž˜ ๋ฐฉํ–ฅ
๋ฐ์ดํ„ฐ ๋ถ€์กฑ Open X-Embodiment๋กœ ๊ฐœ์„  ์ค‘ ํฌ๋ผ์šฐ๋“œ์†Œ์‹ฑ, ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ
๋„๋ฉ”์ธ ์ฐจ์ด Sim-to-Real gap ์กด์žฌ ๋„๋ฉ”์ธ ์ ์‘, ์ฆ๋ฅ˜ ๊ธฐ๋ฒ•
์ผ๊ด€์„ฑ ์—†๋Š” ํฌ๋งท ๊ฐ ๋ฐ์ดํ„ฐ์…‹๋งˆ๋‹ค ๋‹ค๋ฆ„ ํ‘œ์ค€ํ™”๋œ ํฌ๋งท ํ•„์š”

5.3 ์ผ๋ฐ˜ํ™” (Generalization)

graph TB
    GEN[์ผ๋ฐ˜ํ™”์˜ ์„ธ ์ถ•]
    GEN --> TASK["ํƒœ์Šคํฌ ์ผ๋ฐ˜ํ™” (์ƒˆ๋กœ์šด ํƒœ์Šคํฌ ์ˆ˜ํ–‰)"]
    GEN --> ENV["ํ™˜๊ฒฝ ์ผ๋ฐ˜ํ™” (์ƒˆ๋กœ์šด ์žฅ์†Œ์—์„œ ๋™์ž‘)"]
    GEN --> EMB["엠바디먼트 일반화 (다른 로봇에서 동작)"]

    TASK --> |๋ฐฉ๋ฒ•| FT["Foundation Models (๋Œ€๊ทœ๋ชจ ์‚ฌ์ „ํ•™์Šต)"]
    ENV --> |๋ฐฉ๋ฒ•| DA["Domain Adaptation (๋„๋ฉ”์ธ ์ ์‘)"]
    EMB --> |방법| CROSS["Cross-Embodiment (교차 엠바디먼트 학습)"]

5.4 ์‹ค์‹œ๊ฐ„ ์‘๋‹ต์„ฑ

ํ˜„์žฌ ๋ฌธ์ œ:
- ๋Œ€ํ˜• VLA: ์ถ”๋ก ์— ์ˆ˜ ์ดˆ ์†Œ์š”
- ์‹ค์‹œ๊ฐ„ ์ œ์–ด: ์ˆ˜์‹ญ~์ˆ˜๋ฐฑ Hz ํ•„์š”

ํ•ด๊ฒฐ ๋ฐฉํ–ฅ:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ 1. ๋ชจ๋ธ ์••์ถ•: ์–‘์žํ™”, ํ”„๋ฃจ๋‹, ์ฆ๋ฅ˜                          โ”‚
โ”‚ 2. ํšจ์œจ์  ์•„ํ‚คํ…์ฒ˜: Mamba, State Space Models               โ”‚
โ”‚ 3. ๊ณ„์ธต์  ๋ถ„๋ฆฌ: ๊ณ ์ˆ˜์ค€(๋А๋ฆผ) + ์ €์ˆ˜์ค€(๋น ๋ฆ„)                 โ”‚
โ”‚ 4. ์—์ง€ ์ปดํ“จํŒ…: ์˜จ๋””๋ฐ”์ด์Šค ์ถ”๋ก  ์ตœ์ ํ™”                       โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

5.5 ์žฅ๊ธฐ ํƒœ์Šคํฌ (Long-Horizon Tasks)

"์•„์นจ ์‹์‚ฌ ์ค€๋น„ํ•˜๊ธฐ"๋ฅผ ์˜ˆ๋กœ ๋“ค๋ฉด:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ ์„œ๋ธŒํƒœ์Šคํฌ ์ฒด์ธ:                                            โ”‚
โ”‚                                                             โ”‚
โ”‚ ๋ƒ‰์žฅ๊ณ  ์—ด๊ธฐ โ†’ ๊ณ„๋ž€ ๊บผ๋‚ด๊ธฐ โ†’ ๋ƒ‰์žฅ๊ณ  ๋‹ซ๊ธฐ โ†’ ํ”„๋ผ์ดํŒฌ ๊ฐ€์ ธ์˜ค๊ธฐ โ”‚
โ”‚      โ†’ ๊ฐ€์Šค๋ ˆ์ธ์ง€ ์ผœ๊ธฐ โ†’ ๊ณ„๋ž€ ๊นจ๊ธฐ โ†’ ์กฐ๋ฆฌํ•˜๊ธฐ โ†’ ์ ‘์‹œ์— ๋‹ด๊ธฐ โ”‚
โ”‚                                                             โ”‚
โ”‚ ๋„์ „ ๊ณผ์ œ:                                                  โ”‚
โ”‚ - ์˜ค๋ฅ˜ ๋ˆ„์ : ๊ฐ ๋‹จ๊ณ„์˜ ์ž‘์€ ์˜ค๋ฅ˜๊ฐ€ ๋ˆ„์                      โ”‚
โ”‚ - ์ƒํƒœ ์ถ”์ : ๊ธด ์‹œํ€€์Šค์—์„œ ๋งฅ๋ฝ ์œ ์ง€                        โ”‚
โ”‚ - ์˜ˆ์™ธ ์ฒ˜๋ฆฌ: ์˜ˆ์ƒ์น˜ ๋ชปํ•œ ์ƒํ™ฉ ๋Œ€์‘                          โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
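ํŠนํžˆ "์˜ค๋ฅ˜ ๋ˆ„์ "์ด ์–ผ๋งˆ๋‚˜ ๋น ๋ฅด๊ฒŒ ์ปค์ง€๋Š”์ง€๋Š” ๊ฐ„๋‹จํ•œ ๊ณ„์‚ฐ์œผ๋กœ ๊ฐ€๋Š ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฐ ์„œ๋ธŒํƒœ์Šคํฌ๊ฐ€ ์„œ๋กœ ๋…๋ฆฝ์ ์œผ๋กœ ๊ฐ™์€ ํ™•๋ฅ ๋กœ ์„ฑ๊ณตํ•œ๋‹ค๋Š” ๋‹จ์ˆœํ™”๋œ ๊ฐ€์ • ์•„๋ž˜์—์„œ, ์œ„ 8๋‹จ๊ณ„ ์ฒด์ธ์˜ ์ „์ฒด ์„ฑ๊ณต๋ฅ ์€ ๋‹ค์Œ์ฒ˜๋Ÿผ ๋–จ์–ด์ง‘๋‹ˆ๋‹ค.

# ๋‹จ๊ณ„๋ณ„ ์„ฑ๊ณต๋ฅ ์ด p๋กœ ๋™์ผํ•˜๊ณ  ๋…๋ฆฝ์ด๋ผ๋Š” ๋‹จ์ˆœํ™”๋œ ๊ฐ€์ • ํ•˜์˜ ๊ณ„์‚ฐ์ž…๋‹ˆ๋‹ค.
for p in (0.99, 0.95, 0.90):
    print(f"๋‹จ๊ณ„๋ณ„ ์„ฑ๊ณต๋ฅ  {p:.2f} -> 8๋‹จ๊ณ„ ์ „์ฒด ์„ฑ๊ณต๋ฅ  {p ** 8:.2f}")
# ์ถœ๋ ฅ(๊ทผ์‚ฌ): 0.92, 0.66, 0.43 -> ๋‹จ๊ณ„๋ณ„ 95%๋ผ๋„ ์ „์ฒด๋Š” ์•ฝ 2/3๋กœ ๋–จ์–ด์ง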

๐Ÿ“ˆ ์š”์•ฝ ๋ฐ ๊ฒฐ๋ก 

VLA ๋ฐœ์ „์˜ ํ•ต์‹ฌ ํŠธ๋ Œ๋“œ

timeline
    title VLA ๋ฐœ์ „ ํƒ€์ž„๋ผ์ธ

    section ์ดˆ๊ธฐ 2020-2021
        Decision Transformer : RL์˜ ์‹œํ€€์Šค ๋ชจ๋ธ๋งํ™”
        CLIPort : ์–ธ์–ด ์กฐ๊ฑด๋ถ€ ์กฐ์ž‘์˜ ์‹œ์ž‘
        CLIP : ์‹œ๊ฐ-์–ธ์–ด ์ •๋ ฌ์˜ ํ˜๋ช…

    section ์„ฑ์žฅ๊ธฐ 2022-2023
        RT-1 : ๋Œ€๊ทœ๋ชจ ์‹ค์„ธ๊ณ„ ๋ฐ์ดํ„ฐ ํ•™์Šต
        RT-2 : VLM์„ VLA๋กœ ์ „ํ™˜
        Diffusion Policy : ์ƒˆ๋กœ์šด ํ–‰๋™ ์ƒ์„ฑ ํŒจ๋Ÿฌ๋‹ค์ž„
        PerAct : 3D ์–ดํฌ๋˜์Šค ๋งต ๋„์ž…

    section ์„ฑ์ˆ™๊ธฐ 2024-ํ˜„์žฌ
        OpenVLA : ์˜คํ”ˆ์†Œ์Šค LVLA
        π0 : Flow Matching ๊ธฐ๋ฐ˜ ์ •๋ฐ€ ์ œ์–ด
        Open X-Embodiment : ๋Œ€๊ทœ๋ชจ ๊ต์ฐจ ๋กœ๋ด‡ ๋ฐ์ดํ„ฐ์…‹
        RDT-1B : 10์–ต ํŒŒ๋ผ๋ฏธํ„ฐ ํ™•์‚ฐ ์ •์ฑ…

๋กœ๋ด‡๊ณตํ•™์ž๋ฅผ ์œ„ํ•œ ํ•ต์‹ฌ ํ…Œ์ดํฌ์–ด์›จ์ด

| ๋ถ„๋ฅ˜ | ํ•ต์‹ฌ ๋ฉ”์‹œ์ง€ |
|---|---|
| ์•„ํ‚คํ…์ฒ˜ | Transformer ๊ธฐ๋ฐ˜์ด ๋Œ€์„ธ, Diffusion์ด ์ƒˆ๋กœ์šด ํŠธ๋ Œ๋“œ |
| ๋ฐ์ดํ„ฐ | ์–‘๋ณด๋‹ค ์งˆ, ๊ต์ฐจ ๋„๋ฉ”์ธ ๋ฐ์ดํ„ฐ์˜ ์ค‘์š”์„ฑ |
| ํ•™์Šต | Imitation Learning์ด ์ฃผ๋ฅ˜, RL์€ ๋ฏธ์„ธ์กฐ์ •์šฉ |
| ํ‘œํ˜„ | 3D ๋น„์ „์˜ ์ค‘์š”์„ฑ ์ฆ๊ฐ€, DINOv2/SigLIP ์กฐํ•ฉ ์ถ”์ฒœ |
| ์Šค์ผ€์ผ | ๋” ํฐ ๋ชจ๋ธ = ๋” ๋‚˜์€ ์ผ๋ฐ˜ํ™” (์Šค์ผ€์ผ๋ง ๋ฒ•์น™) |
| ์‹ค์šฉ์„ฑ | ๊ณ„์ธต์  ๊ตฌ์กฐ(ํ”Œ๋ž˜๋„ˆ + ์ปจํŠธ๋กค๋Ÿฌ)๊ฐ€ ํ˜„์‹ค์  |

๋งˆ๋ฌด๋ฆฌ: ํŒŒ์ธ๋งŒ์˜ ๊ด€์ ์—์„œ

โ€œ๋งŒ์•ฝ ์šฐ๋ฆฌ๊ฐ€ ๋ฌด์–ธ๊ฐ€๋ฅผ ์ •๋ง๋กœ ์ดํ•ดํ–ˆ๋‹ค๋ฉด, ๊ทธ๊ฒƒ์„ ๊ฐ„๋‹จํ•˜๊ฒŒ ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.โ€

VLA์˜ ํ•ต์‹ฌ์€ ๊ฒฐ๊ตญ ์ด๊ฒƒ์ž…๋‹ˆ๋‹ค:

โ€œ๋กœ๋ด‡์ด ์‚ฌ๋žŒ์ฒ˜๋Ÿผ ๋ณด๊ณ , ๋“ฃ๊ณ , ์ดํ•ดํ•˜๊ณ , ํ–‰๋™ํ•˜๊ฒŒ ๋งŒ๋“ค๊ธฐโ€

์ด๊ฒƒ์€ ๋‹จ์ˆœํžˆ ์„ธ ๊ฐ€์ง€ ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ๋ฅผ ํ•ฉ์น˜๋Š” ๊ฒƒ์ด ์•„๋‹™๋‹ˆ๋‹ค. ๋ฌผ๋ฆฌ ์„ธ๊ณ„์—์„œ ์˜๋ฏธ ์žˆ๋Š” ๋ณ€ํ™”๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๋Š” AI๋ฅผ ๋งŒ๋“œ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ChatGPT๊ฐ€ ํ…์ŠคํŠธ๋กœ ์„ธ์ƒ์„ ๋ฐ”๊ฟจ๋‹ค๋ฉด, VLA๋Š” ๋ฌผ๋ฆฌ์  ํ–‰๋™์œผ๋กœ ์„ธ์ƒ์„ ๋ฐ”๊ฟ€ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๐Ÿ“š ์ถ”๊ฐ€ ๋ฆฌ์†Œ์Šค

์ฝ”๋“œ & ๊ตฌํ˜„

  • Awesome-VLA GitHub
  • OpenVLA
  • Diffusion Policy

๊ด€๋ จ ์„œ๋ฒ ์ด

  • Foundation Models in Robotics (2023)
  • Real-World Robot Applications of Foundation Models (2024)
  • Toward General-Purpose Robots via Foundation Models (2023)

์ฃผ์š” ๋ฐ์ดํ„ฐ์…‹

  • Open X-Embodiment
  • BridgeData V2
  • DROID

โ›๏ธ Dig Review

โ›๏ธ Dig โ€” Go deep, uncover the layers. Dive into technical detail.

๋น„์ „-์–ธ์–ด-์•ก์…˜ ๋ชจ๋ธ (VLA)์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€

๋น„์ „-์–ธ์–ด-์•ก์…˜ ๋ชจ๋ธ(Vision-Language-Action Models, VLAs)์€ ์‹œ๊ฐ(Visual) ์ •๋ณด, ์–ธ์–ด(Language) ์ •๋ณด, ๊ทธ๋ฆฌ๊ณ  ํ–‰๋™(Action) ์ถœ๋ ฅ์„ ๋™์‹œ์— ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๋กœ๋ด‡ ํ•™์Šต ๋ชจ๋ธ์„ ๋งํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์‚ฌ๋žŒ์—๊ฒŒ โ€œ๋นจ๊ฐ„ ์‚ฌ๊ณผ๋ฅผ ์ง‘์–ด ์‹ํƒ ์œ„์— ์˜ฌ๋ ค๋†”โ€๋ผ๋Š” ๋ช…๋ น์„ ๋‚ด๋ฆด ๋•Œ, ์šฐ๋ฆฌ๋Š” ๋ˆˆ์œผ๋กœ ์‚ฌ๊ณผ์™€ ์‹ํƒ์„ ์‹๋ณ„ํ•˜๊ณ  ๋‡Œ์—์„œ ์ ์ ˆํ•œ ํŒ” ๋™์ž‘์„ ๊ณ„ํšํ•ฉ๋‹ˆ๋‹ค. ์ด์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ VLA๋Š” ๋กœ๋ด‡์ด ์–ธ์–ด ์ง€์‹œ๋ฅผ ์ดํ•ดํ•˜๊ณ , ์นด๋ฉ”๋ผ๋กœ ๊ด€์ฐฐํ•œ ์žฅ๋ฉด์„ ์ธ์‹ํ•˜์—ฌ, ์‹ค์ œ ํ–‰๋™(ํŒ”์˜ ์ด๋™, ๊ทธ๋ฆฌํผ ์ž‘๋™ ๋“ฑ)์„ ์ƒ์„ฑํ•˜๋„๋ก ์„ค๊ณ„๋œ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

์ž„๋ฐ”๋””๋“œ AI(Embodied AI), ํŠนํžˆ ๋กœ๋ด‡ ๊ณตํ•™์—์„œ๋Š” ์ด๋ ‡๊ฒŒ ๋ง๊ณผ ํ–‰๋™์„ ์ž‡๋Š” ๋Šฅ๋ ฅ์ด ํ•„์ˆ˜์ ์ž…๋‹ˆ๋‹ค. ์ผ๋ฐ˜์ ์ธ ๋Œ€ํ™”ํ˜• AI(ChatGPT ๋“ฑ)๋Š” ์–ธ์–ด ์ดํ•ด์— ์ง‘์ค‘ํ•˜์ง€๋งŒ, VLAs๋Š” ๋ฌผ๋ฆฌ์ ์ธ ๋ชธ์ฒด(๋กœ๋ด‡)๋ฅผ ์ œ์–ดํ•ด์•ผ ํ•˜๋ฏ€๋กœ ์‹œ๊ฐ๊ณผ ํ–‰๋™๊นŒ์ง€ ์—ฐ๊ด€ ์ง“์Šต๋‹ˆ๋‹ค. Ma ๋“ฑ ์ €์ž๋“ค์€ "VLA ๊ธฐ๋ฐ˜ ์ •์ฑ…์€ ๋ณต์žกํ•œ ํ™˜๊ฒฝ์—์„œ ์ด์ „์˜ ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฒ•๋ณด๋‹ค ๋›ฐ์–ด๋‚œ ๋‹ค์–‘์„ฑ๊ณผ ์œ ์—ฐ์„ฑ, ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์„ ๋ณด์—ฌ์ค€๋‹ค"๊ณ  ์ง€์ ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ๊ณต์žฅ์ฒ˜๋Ÿผ ํ†ต์ œ๋œ ํ™˜๊ฒฝ๋ฟ ์•„๋‹ˆ๋ผ ์ฃผ๋ฐฉ์—์„œ ์š”๋ฆฌํ•˜๊ธฐ, ๋ฐฉ ์ฒญ์†Œํ•˜๊ธฐ ๋“ฑ์˜ ์ผ์ƒ์  ์ž‘์—…์—๋„ ์ ์šฉ ๊ฐ€๋Šฅ์„ฑ์„ ๋ณด์ž…๋‹ˆ๋‹ค.

๋กœ๋ด‡ ๋ถ„์•ผ์—์„œ ์ „ํ†ต์ ์ธ ๊ฐ•ํ™”ํ•™์Šต ์ •์ฑ…์€ ์ฃผ๋กœ ํ•œ ๊ฐ€์ง€ ์ž‘์—…(์˜ˆ: ๋ฌผ๊ฑด ์žก๊ธฐ)์— ๊ตญํ•œ๋˜๊ณ , ์ดฌ์˜ ํ™˜๊ฒฝ๋„ ์‹คํ—˜์‹ค์ฒ˜๋Ÿผ ํ•œ์ •์ ์ด์—ˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ํ˜„๋Œ€์—๋Š” ChatGPT์™€ ๊ฐ™์€ ๋Œ€ํ˜• ์–ธ์–ด๋ชจ๋ธ(LLM)๊ณผ CLIP ๊ฐ™์€ ๋น„์ „-์–ธ์–ด ๋ชจ๋ธ(VLM)์˜ ์„ฑ๊ณต์— ์ž๊ทน๋ฐ›์•„, โ€œํ•˜๋‚˜์˜ ๋กœ๋ด‡ ์ •์ฑ…์œผ๋กœ ๋‹ค์–‘ํ•œ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฒ”์šฉ์„ฑโ€์ด ์š”๊ตฌ๋˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ์–ธ์–ด ๊ธฐ๋ฐ˜ ์ž‘์—… ์ง€์‹œ๊ฐ€ ์œ ๋ ฅํ•œ ๋ฐฉ์•ˆ์œผ๋กœ ๋– ์˜ฌ๋ž์œผ๋ฉฐ, VLAs๋Š” ๋ฐ”๋กœ ์ด ๊ณผ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋“ฑ์žฅํ–ˆ์Šต๋‹ˆ๋‹ค. VLA๋Š” ์‚ฌ์ „ ํ•™์Šต๋œ ๋น„์ „ ์ธ์ฝ”๋”์™€ LLM์„ ๊ฒฐํ•ฉํ•ด, ๋ณต์žกํ•œ ํ™˜๊ฒฝ์„ ์ •ํ™•ํžˆ ์ธ์‹ํ•˜๊ณ  โ€œ๋นจ๊ฐ„ ์‚ฌ๊ณผโ€ ๊ฐ™์€ ๊ฐ์ฒด ์ •๋ณด๋ถ€ํ„ฐ โ€œ๊ทธ๊ฒƒ์„ ์˜ฎ๊ฒจ๋ผโ€๋ผ๋Š” ์–ธ์–ด ์ง€์‹œ๋ฅผ ํ•˜๋‚˜์˜ ์ •์ฑ…์œผ๋กœ ์—ฐ๊ฒฐํ•ฉ๋‹ˆ๋‹ค.

VLA ๋ชจ๋ธ์˜ ๋ถ„๋ฅ˜ ์ฒด๊ณ„๋Š” ์„ธ ๊ฐ€์ง€ ์ถ•์œผ๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค:

  • ์‚ฌ์ „ ํ›ˆ๋ จ(Pretraining): ๋น„์ „ ์ธ์ฝ”๋”, ๋™์  ๋ชจ๋ธ(dynamics), ์„ธ๊ณ„ ๋ชจ๋ธ(world model) ๋“ฑ์„ ๊ฐœ์„ ํ•˜์—ฌ ๊ธฐ๋ฐ˜ ๋Šฅ๋ ฅ์„ ํ‚ค์›๋‹ˆ๋‹ค.
  • ์ œ์–ด ์ •์ฑ…(Control Policy): ์ฃผ์–ด์ง„ ์–ธ์–ด ๋ช…๋ น๊ณผ ์‹œ๊ฐ ์ •๋ณด๋ฅผ ๋ฐ›์•„ ๋กœ๋ด‡์˜ ์‹ค์ œ ์ €์ˆ˜์ค€ ํ–‰๋™(ํŒ” ๊ด€์ ˆ ์ด๋™, ๊ทธ๋ฆฌํผ ๋™์ž‘ ๋“ฑ)์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
  • ์ž‘์—… ๊ณ„ํš(Task Planner): ๊ณ ์ˆ˜์ค€ ์–ธ์–ด ๋ช…๋ น์„ ์—ฌ๋Ÿฌ ๋‹จ๊ณ„์˜ ํ•˜์œ„ ํƒœ์Šคํฌ๋กœ ๋ถ„ํ•ดํ•˜์—ฌ ์ €์ˆ˜์ค€ ์ œ์–ด ์ •์ฑ…์— ์ˆœ์ฐจ์ ์œผ๋กœ ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค.

์ด ์„ธ ๊ฐ€์ง€ ์š”์†Œ๊ฐ€ ๊ณ„์ธต์ ์œผ๋กœ ๊ฒฐํ•ฉ๋˜์–ด, โ€œ์žฅ๊ธฐ ๊ณผ์ œ๋Š” ๊ณ„ํš์ž๊ฐ€ ์ „์ฒด๋ฅผ ๋‚˜๋ˆ„๊ณ , ์ œ์–ด ์ •์ฑ…์ด ๊ฐ ๋ถ€๋ถ„์„ ์ˆ˜ํ–‰โ€ํ•˜๋Š” ๊ตฌ์กฐ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค. ์ „์ฒด ๊ตฌ์กฐ๋ฅผ ๊ทธ๋ฆผ์œผ๋กœ ๋‚˜ํƒ€๋‚ด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

flowchart LR
    L[์–ธ์–ด ๋ช…๋ น] --> VLA[VLA ๋ชจ๋ธ]
    V[์‹œ๊ฐ ๊ด€์ธก] --> VLA
    VLA --> A[ํ–‰๋™ ์‹คํ–‰]
    A --> E[ํ™˜๊ฒฝ ๋ณ€ํ™”]
    E --> V
    style VLA fill:#e0f7fa,stroke:#333,stroke-width:1px

์œ„ ๊ณผ์ •์—์„œ VLA ๋ชจ๋ธ์€ ์–ธ์–ด์™€ ์‹œ๊ฐ ์ž…๋ ฅ์„ ๋ฐ›์•„ ๋กœ๋ด‡์˜ ๋™์ž‘์„ ์˜ˆ์ธกํ•˜๋ฉฐ, ๊ทธ ๋™์ž‘์„ ์‹ค์ œ ๋กœ๋ด‡ ๋ชจ์…˜ ํ”Œ๋ž˜๋„ˆ(๊ฐ ๊ด€์ ˆ์„ ์ œ์–ดํ•˜๋Š” ํ•˜๋ถ€ ๋ชจ๋“ˆ)๊ฐ€ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ํ•œํŽธ, ์žฅ๊ธฐ ๊ณผ์ œ ์ˆ˜ํ–‰ ์‹œ์—๋Š” โ€œ์ž‘์—… ๊ณ„ํš์ž(TP)โ€๊ฐ€ ์ด ๊ณผ์ •์„ ๊ฐ๋…ํ•˜์—ฌ ์—ฌ๋Ÿฌ ํ•˜์œ„ ๋ชฉํ‘œ๋ฅผ ์ƒ์„ฑํ•˜๊ณ , ๊ฐ๊ฐ ์ €์ˆ˜์ค€ ์ •์ฑ…์— ๋งก๊ธฐ๋Š” ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค.

1. ์‚ฌ์ „ ํ›ˆ๋ จ(Pretraining)

์‚ฌ์ „ ํ›ˆ๋ จ ๋‹จ๊ณ„์—์„œ๋Š” ๋กœ๋ด‡์ด ์‹œ๊ฐ๊ณผ ๋™์  ํ™˜๊ฒฝ์— ๊ด€ํ•œ ์ผ๋ฐ˜์  ์ง€์‹์„ ๋ฏธ๋ฆฌ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์‚ฌ๋žŒ์ด ์‚ฌ๋ฌผ ์ธ์ง€์™€ ๊ธฐ๋ณธ ๋ฌผ๋ฆฌ ๋ฒ•์น™์„ ์–ด๋ฆฐ ์‹œ์ ˆ๋ถ€ํ„ฐ ๋ฐฐ์šฐ๋Š” ๊ฒƒ๊ณผ ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค. ์ฃผ์š” ์„ธ๋ถ€ ๋ถ„์•ผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  • ์‚ฌ์ „ ํ•™์Šต๋œ ๋น„์ „ ํ‘œํ˜„(Pretrained Vision Representation): ๋Œ€๊ทœ๋ชจ ์ด๋ฏธ์ง€-์–ธ์–ด ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต๋œ ๋น„์ „ ๋ชจ๋ธ(CLIP ๋“ฑ)์„ ๋กœ๋ด‡์— ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. CLIP์ฒ˜๋Ÿผ ์ด๋ฏธ์ง€๋ฅผ ์–ธ์–ด์™€ ํ•จ๊ป˜ ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„ํ•˜๋Š” ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜๋ฉด, ๋กœ๋ด‡์ด ํ™˜๊ฒฝ์„ ๋ณด์•˜์„ ๋•Œ โ€œ์ด๊ฒƒ์€ ์ปต, ์ €๊ฒƒ์€ ์‚ฌ๊ณผโ€ ๊ฐ™์€ ์ •๋ณด๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. R3M, MVP, VIP, VC-1 ๋“ฑ์€ ๋กœ๋ด‡ ์กฐ์ž‘์šฉ ๋ฐ์ดํ„ฐ๋กœ ์‹œ๊ฐ ๋ชจ๋ธ์„ ํŠนํ™”ํ•˜์—ฌ ์‚ฌ์ „ ํ•™์Šตํ•œ ์‚ฌ๋ก€๋“ค์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, R3M์€ ๋กœ๋ด‡ ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•ด ์ด๋ฏธ์ง€๋ฅผ ์ž„๋ฒ ๋”ฉํ•จ์œผ๋กœ์จ ์กฐ์ž‘ ์ž‘์—…์—์„œ ๊ฐ•๊ฑดํ•œ ์‹œ๊ฐ ํ”ผ์ฒ˜๋ฅผ ์–ป์—ˆ์Šต๋‹ˆ๋‹ค.

    ๋น„์œ : ์ด๋Š” ๋กœ๋ด‡์ด โ€™์‹œ๊ฐ์  ์–ดํœ˜โ€™๋ฅผ ๋ฐฐ์šฐ๋Š” ๊ณผ์ •์— ํ•ด๋‹นํ•ฉ๋‹ˆ๋‹ค. ์–ด๋ฆฐ์•„์ด๊ฐ€ ๋‹ค์–‘ํ•œ ์‚ฌ๋ฌผ๊ณผ ํ–‰๋™์„ ๊ด€์ฐฐํ•˜๋ฉฐ ์„ธ์ƒ์„ ์ดํ•ดํ•˜๋“ฏ, ๋กœ๋ด‡๋„ ์‚ฌ์ „ ํ•™์Šต๋œ ๋น„์ „ ๋ชจ๋ธ๋กœ ํ™˜๊ฒฝ์˜ ๊ธฐ๋ณธ ์–ดํœ˜(์‚ฌ๋ฌผ ์ข…๋ฅ˜, ์œ„์น˜, ์ž์„ธ ๋“ฑ)๋ฅผ ์Šต๋“ํ•ฉ๋‹ˆ๋‹ค.

  • ๋™์—ญํ•™ ํ•™์Šต(Dynamics Learning): ๋กœ๋ด‡์˜ ํ–‰๋™ ๊ฒฐ๊ณผ๋ฅผ ์˜ˆ์ธกํ•˜๋„๋ก ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ์ปจ๋Œ€, ๋กœ๋ด‡ํŒ”์ด ์ƒ์ž๋ฅผ ๋ฐ€ ๋•Œ ์ƒ์ž๊ฐ€ ์–ด๋””๋กœ ์›€์ง์ผ์ง€๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. Vi-PRoM, MIDAS, SMART ๋“ฑ์˜ ์—ฐ๊ตฌ๋Š” ๋ฌผ์ฒด์˜ ์›€์ง์ž„๊ณผ ์ƒํ˜ธ์ž‘์šฉ(์˜ˆ: ์ƒ์ž ์žก๊ธฐ, ์Œ“๊ธฐ)์„ ์˜ˆ์ธกํ•˜๋„๋ก ๋„คํŠธ์›Œํฌ๋ฅผ ์‚ฌ์ „ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ, ํ˜„์žฌ ์ƒํƒœ์™€ ํ–‰๋™์„ ์ž…๋ ฅ์œผ๋กœ ๋‹ค์Œ ์ƒํƒœ๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ชจ๋ธ์„ ํ•™์Šตํ•˜์—ฌ, ๋กœ๋ด‡์ด ํ–‰๋™ ์ „ํ›„์˜ ๋ณ€ํ™”๋ฅผ โ€™๋งˆ์น˜ ๋‘๋‡Œ ์† ์‹œ๋ฎฌ๋ ˆ์ด์…˜โ€™์ฒ˜๋Ÿผ ๊ฐ€๋Š ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋•์Šต๋‹ˆ๋‹ค.

    ์œ ์ถ”: ์ด๋Š” ๋งˆ์น˜ ๋‹น์‹ ์ด ํƒ๊ตฌ๊ณต์„ ์น  ๋•Œ ๊ณต์ด ์–ด๋””๋กœ ํŠ€์–ด๋‚˜๊ฐˆ์ง€ ์˜ˆ์ƒํ•ด๋ณด๋Š” ๊ฒƒ๊ณผ ์œ ์‚ฌํ•ฉ๋‹ˆ๋‹ค. ๋กœ๋ด‡์€ ๋™์—ญํ•™ ๋ชจ๋ธ์„ ํ†ตํ•ด โ€œ์ด๋ ‡๊ฒŒ ์†์„ ์›€์ง์ด๋ฉด ๋ฌผ์ฒด๋Š” ์ €๋ ‡๊ฒŒ ์›€์ง์ผ ๊ฒƒ์ด๋‹คโ€๋ฅผ ๋ฏธ๋ฆฌ ๋‚ด๋‹ค๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • ์›”๋“œ ๋ชจ๋ธ(World Model): ๊ด€์ฐฐ๊ณผ ํ–‰๋™ ์˜ˆ์ธก์„ ํ†ตํ•ฉํ•˜๋Š” ๋” ๊ณ ์ฐจ์›์  ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด Dreamer, ISO-Dream, IRIS ๊ฐ™์€ ์—ฐ๊ตฌ๋Š” ์นด๋ฉ”๋ผ ์˜์ƒ๊ณผ ๋กœ๋ด‡ ํ–‰๋™์„ ํ•˜๋‚˜์˜ ์ž ์žฌ๊ณต๊ฐ„์— ์ธ์ฝ”๋”ฉํ•˜์—ฌ, โ€˜๋กœ๋ด‡์˜ ๋‡Œโ€™ ์•ˆ์—์„œ ํ™˜๊ฒฝ์„ ๋‚ด์žฌ์ ์œผ๋กœ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค. ์›”๋“œ ๋ชจ๋ธ์€ ์žฅ๊ธฐ ๊ณ„ํš๊ณผ ์ƒ์ƒ(imagination)์— ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ๋งˆ์น˜ ์šฐ๋ฆฌ๊ฐ€ ์˜ํ™”๋ฅผ ๋จธ๋ฆฟ์†์œผ๋กœ ์žฌ์ƒํ•ด๋ณด๋“ฏ, ๋กœ๋ด‡๋„ ๋‚ด๋ถ€ ๋ชจ๋ธ์„ ํ†ตํ•ด ๋ณต์žกํ•œ ์ƒํ™ฉ์„ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ํ•˜๋ฉฐ ํ–‰๋™์„ ๊ณ„ํšํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์‚ฌ์ „ํ›ˆ๋ จ ๋‹จ๊ณ„์˜ ๋ชฉํ‘œ๋Š” ๋กœ๋ด‡์˜ ๊ธฐ๋ณธ ๋Šฅ๋ ฅ(์‹œ๊ฐ์ธ์‹, ๋ฌผ๋ฆฌ๋ชจ๋ธ๋ง ๋“ฑ)์„ ๊ฒฌ๊ณ ํžˆ ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํŠผํŠผํ•œ ๊ธฐ๋ฐ˜ ์œ„์—์„œ, ์ดํ›„ ์ œ์–ด ์ •์ฑ…์ด ๋” ๋น ๋ฅด๊ณ  ์ผ๋ฐ˜ํ™”๋œ ํ•™์Šต์„ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์‚ฌ์ „ํ›ˆ๋ จ๋œ ๋น„์ „ ์ธ์ฝ”๋” ๋•๋ถ„์— ๋กœ๋ด‡์€ ๋ณต์žกํ•œ ์žฅ๋ฉด์—์„œ๋„ ๊ฐ์ฒด์˜ ์ข…๋ฅ˜์™€ ์ž์„ธ๋ฅผ ์ •ํ™•ํžˆ ์ธ์‹ํ•  ์ˆ˜ ์žˆ์–ด, ์ดํ›„ ์ •์ฑ… ํ•™์Šต ์‹œ ๋ฐ์ดํ„ฐ ํšจ์œจ๊ณผ ์„ฑ๋Šฅ ์•ˆ์ •์„ฑ์ด ํฌ๊ฒŒ ํ–ฅ์ƒ๋ฉ๋‹ˆ๋‹ค.

2. ์ œ์–ด ์ •์ฑ…(Control Policies)

์ œ์–ด ์ •์ฑ…์€ ๋กœ๋ด‡์˜ โ€™ํ–‰๋™ ์‹ ๊ฒฝ๋งโ€™์— ํ•ด๋‹นํ•ฉ๋‹ˆ๋‹ค. ์–ธ์–ด ์ง€์‹œ์™€ ์‹œ๊ฐ ๊ด€์ฐฐ์„ ๋ฐ›์•„ ์‹ค์ œ ์ €์ˆ˜์ค€ ํ–‰๋™(ํŒ”์˜ ๊ด€์ ˆ ๊ฐ๋„, ๊ทธ๋ฆฌํผ์˜ ์˜คํ”ˆ/ํด๋กœ์ฆˆ ๋“ฑ)์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋กœ๋ด‡์€ ๊ตฌ์ฒด์ ์ธ ์ง€์‹œ(โ€œ๋นจ๊ฐ„ ์ปต์„ ์žก์•„โ€)๋ฅผ ์‹ค์ œ ์›€์ง์ž„(โ€œํŒ”์„ ํŽด๊ณ , ์†์„ ๋‚ด๋ฆฌ๊ณ , ๊ทธ๋ฆฌํผ๋ฅผ ๋‹ซ์•„โ€)์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ œ์–ด ์ •์ฑ… ์—ฐ๊ตฌ๋Š” ํฌ๊ฒŒ ๋‹ค์Œ ๋‹ค์„ฏ ๊ฐ€์ง€ ์œ ํ˜•์œผ๋กœ ๋ถ„๋ฅ˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • CNN/RNN ๊ธฐ๋ฐ˜ ์ •์ฑ… (๋น„ํŠธ๋žœ์Šคํฌ๋จธ): ์ „ํ†ต์ ์ธ ์‹ ๊ฒฝ๋ง ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ–‰๋™์„ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด CLIPort๋Š” ์‚ฌ์ „ ํ•™์Šต๋œ CLIP ๋น„์ „-์–ธ์–ด ์ธ์ฝ”๋”๋กœ ์ด๋ฏธ์ง€์™€ ๋ช…๋ น์„ ํ‘œํ˜„ํ•œ ๋’ค, ํ•ฉ์„ฑ๊ณฑ ์‹ ๊ฒฝ๋ง(CNN)์„ ํ†ตํ•ด ํ”ฝ์…€ ๊ณต๊ฐ„์—์„œ ์•ก์…˜์„ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค. BC-Z, MCIL, HULC, UniPi ๋“ฑ๋„ ์œ ์‚ฌํ•˜๊ฒŒ CNN์ด๋‚˜ ๊ฐ„๋‹จํ•œ ์‹ ๊ฒฝ๋ง์„ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ์ ‘๊ทผ๋ฒ•์€ ๊ตฌํ˜„์ด ๊ฐ„๋‹จํ•˜๊ณ  ์ž‘์€ ๋ชจ๋ธ๋กœ๋„ ๋™์ž‘ํ•˜์ง€๋งŒ, ๊ธด ์‹œํ€€์Šค๋‚˜ ๋ณต์žกํ•œ ๋ฌธ๋งฅ ์ •๋ณด๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐ๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.
  • Transformer ๊ธฐ๋ฐ˜ ์ •์ฑ…: Transformer ์•„ํ‚คํ…์ฒ˜๋กœ ์‹œํ€€์Šค ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ์ž…๋ ฅ ์‹œ์ ์˜ ์ด๋ฏธ์ง€ ํ”ผ์ฒ˜์™€ ์ด์ „ ํ–‰๋™ ์ •๋ณด๋ฅผ ํ† ํฐ ์‹œํ€€์Šค๋กœ ๋งŒ๋“ค์–ด, ๋‹ค์Œ ํ–‰๋™์„ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด PerAct๋Š” ์ด๋ฏธ์ง€ ํ”ฝ์…€ ๊ณต๊ฐ„์„ ์ง์ ‘ ํŠธ๋žœ์Šคํฌ๋จธ๋กœ ๋งคํ•‘ํ•˜์—ฌ ํ–‰๋™์„ ์ƒ์„ฑํ•˜๊ณ , Gato๋Š” ๋‹ค์–‘ํ•œ ๋กœ๋ด‡ ์ž‘์—…์„ ํ•˜๋‚˜์˜ ๊ฑฐ๋Œ€ ํŠธ๋žœ์Šคํฌ๋จธ ๋ชจ๋ธ๋กœ ํ•™์Šตํ•ด ๋ฉ€ํ‹ฐํƒœ์Šคํ‚น์„ ๋‹ฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. Transformer ๊ธฐ๋ฐ˜์€ ๋ฌธ๋งฅ์„ ๊ธธ๊ฒŒ ๋ณด์กดํ•  ์ˆ˜ ์žˆ์–ด, ์—ฐ์†๋œ ํ”„๋ ˆ์ž„์—์„œ ์ผ๊ด€๋œ ํ–‰๋™์„ ๊ณ„ํšํ•˜๋Š” ๋ฐ ์œ ๋ฆฌํ•ฉ๋‹ˆ๋‹ค.
  • ๋Œ€ํ˜•์–ธ์–ด๋ชจ๋ธ(LLM) ๊ธฐ๋ฐ˜ ์ •์ฑ…: GPT๋‚˜ PaLM ๊ฐ™์€ ์–ธ์–ด๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ–‰๋™์„ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค. RT-2(google)๋‚˜ RoboFlamingo๊ฐ€ ๋Œ€ํ‘œ์  ์˜ˆ๋กœ, LLM ๋‚ด๋ถ€์˜ ์ถ”๋ก  ๋Šฅ๋ ฅ์„ ํ™œ์šฉํ•ด ๋กœ๋ด‡ ํ–‰๋™์„ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฐฉ์‹์—์„œ๋Š” ์‹ค์ œ ํ–‰๋™๋„ ์–ธ์–ด ํ˜•ํƒœ๋กœ ๋ชจ๋ธ์— ์ž…๋ ฅํ•˜๊ณ , ์ถœ๋ ฅ์œผ๋กœ ํ–‰๋™ ๋ช…๋ น(๋˜๋Š” ํ–‰๋™ ๊ธฐ์ˆ  ํ…์ŠคํŠธ)์„ ์–ป์€ ๋’ค ์ด๋ฅผ ๋กœ๋ด‡ ์ œ์–ด ์‹ ํ˜ธ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์žฅ์ ์€ ํ’๋ถ€ํ•œ ๊ณตํ†ต ์ƒ์‹๊ณผ ์–ธ์–ด ์ดํ•ด๋ฅผ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์ด๋‚˜, ๋‹จ์ ์€ ์—ฐ์‚ฐ ๋น„์šฉ๊ณผ ์ง€์—ฐ์‹œ๊ฐ„์ด ํฌ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค.
  • ๋‹ค์ค‘ ๋ชจ๋‹ฌ ๋ช…๋ น(Multi-modal Instruction): ์–ธ์–ด๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์‹œ์—ฐ ๋ฐ์ดํ„ฐ(trajectory, ๋น„๋””์˜ค ๋“ฑ)๋ฅผ ํ•จ๊ป˜ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด VIMA๋‚˜ MOO, Octo๋Š” ์–ธ์–ด์™€ ํ•จ๊ป˜ ๋ช‡ ๊ฐ€์ง€ ์˜ˆ์‹œ ์ด๋ฏธ์ง€๋ฅผ ๋ณด์—ฌ์ฃผ๋ฉด ์ƒˆ๋กœ์šด ์กฐ์ž‘ ๊ณผ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋„๋ก ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. ์‚ฌ๋žŒ์—๊ฒŒ "ํ”ผ์•„๋…ธ ์กฐ๋ฆฝ ๋ฐฉ๋ฒ•์„ ์„ค๋ช…ํ•˜๋ผ"๋Š” ๋ง๋ฟ ์•„๋‹ˆ๋ผ ์กฐ๋ฆฝ ์žฅ๋ฉด ๋น„๋””์˜ค๋ฅผ ํ•จ๊ป˜ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ๊ณผ ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฉ๋ฒ•์€ ํŠนํžˆ ํ“จ์ƒท(few-shot) ์ผ๋ฐ˜ํ™”์— ํšจ๊ณผ์ ์ž…๋‹ˆ๋‹ค.
  • ๋ชฉํ‘œ-์ƒํƒœ ์ง€์‹œ(Goal-state Instruction): ์–ธ์–ด ๋Œ€์‹  ๋ชฉํ‘œ ์ƒํƒœ(์˜ˆ: ๋ชฉํ‘œ ์ด๋ฏธ์ง€๋‚˜ ๊ฒฝ๋กœ ์Šค์ผ€์น˜)๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด RoboCat์€ ์ฃผ์–ด์ง„ ๋ชฉํ‘œ ์ด๋ฏธ์ง€๋ฅผ ๋ณด๊ณ  ์ตœ๋‹จ ๊ฒฝ๋กœ๋กœ ๋™์ž‘ํ•˜๋„๋ก ํ•™์Šตํ•˜๊ณ , RT-Trajectory๋Š” ์‚ฌ๋žŒ์ด ๊ทธ๋ฆฐ ๊ถค์  ์Šค์ผ€์น˜๋ฅผ ๋กœ๋ด‡ํŒ”์˜ ๊ฒฝ๋กœ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์ •์ฑ…์€ ์–ธ์–ด๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š์•„ ๋ช…ํ™•ํ•œ โ€œ๋งโ€ ๋Œ€์‹  ๊ตฌ์ฒด์  ์‹œ๊ฐ ์ •๋ณด๋ฅผ ํ™œ์šฉํ•œ๋‹ค๋Š” ์ ์—์„œ VLAs์™€๋Š” ๊ตฌ๋ถ„๋˜์ง€๋งŒ, ๋ณต์žกํ•œ ๋ช…๋ น์„ ์ œ๊ณตํ•˜๊ธฐ ์–ด๋ ค์šด ์ƒํ™ฉ์—์„œ ๋Œ€์•ˆ์ด ๋ฉ๋‹ˆ๋‹ค.

์ด๋“ค ์ œ์–ด ์ •์ฑ… ์•„ํ‚คํ…์ฒ˜์—์„œ๋Š” ์‹œ๊ฐยท์–ธ์–ด ์ •๋ณด๋ฅผ ๊ฒฐํ•ฉํ•˜๋Š” ๋ฐฉ์‹์ด ๋‹ค์–‘ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด FiLM(Feature-wise Linear Modulation), Cross-Attention, ๋‹จ์ˆœ Concatenation ๋“ฑ์ด ์‚ฌ์šฉ๋˜์—ˆ๋Š”๋ฐ, ์ž‘์€ ๋ชจ๋ธ์—์„œ๋Š” FiLM์ด๋‚˜ cross-attention์ด ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ด๊ณ , ๊ฐ„๋‹จํ•œ ์กฐํ•ฉ(concatenation)๋„ ๋ชจ๋ธ์„ ํฌ๊ฒŒ ํ•˜๋ฉด ์œ ์‚ฌํ•œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
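์ด ์ค‘ FiLM์€ ์–ธ์–ด ์ž„๋ฒ ๋”ฉ์œผ๋กœ๋ถ€ํ„ฐ ์ฑ„๋„๋ณ„ scale(gamma)๊ณผ shift(beta)๋ฅผ ๋งŒ๋“ค์–ด ์‹œ๊ฐ ํŠน์ง•๋งต์„ ๋ณ€์กฐํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์ด ๊ฒฐํ•ฉ ๋ฐฉ์‹๋งŒ ๋–ผ์–ด ๋ณธ ์ตœ์†Œํ•œ์˜ PyTorch ์Šค์ผ€์น˜์ด๋ฉฐ, ์ฐจ์› ๊ฐ’๊ณผ ํ…์„œ ํ˜•ํƒœ๋Š” ์„ค๋ช…์„ ์œ„ํ•œ ๊ฐ€์ •์ž…๋‹ˆ๋‹ค.

import torch
import torch.nn as nn

class FiLMFusion(nn.Module):
    """์–ธ์–ด ์ž„๋ฒ ๋”ฉ์œผ๋กœ ์‹œ๊ฐ ํŠน์ง•์˜ ์ฑ„๋„๋ณ„ scale/shift๋ฅผ ๋งŒ๋“œ๋Š” FiLM ๊ฒฐํ•ฉ ์Šค์ผ€์น˜."""
    def __init__(self, lang_dim: int = 512, vis_channels: int = 256):
        super().__init__()
        self.to_gamma_beta = nn.Linear(lang_dim, 2 * vis_channels)

    def forward(self, vis_feat, lang_emb):
        # vis_feat: (B, C, H, W) CNN ํŠน์ง•๋งต, lang_emb: (B, lang_dim) ๋ฌธ์žฅ ์ž„๋ฒ ๋”ฉ (๊ฐ€์ •)
        gamma, beta = self.to_gamma_beta(lang_emb).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]        # ์ฑ„๋„ ์ถ•์— ๋งž๊ฒŒ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ
        beta = beta[:, :, None, None]
        return (1 + gamma) * vis_feat + beta   # ์–ธ์–ด ์กฐ๊ฑด๋ถ€๋กœ ๋ณ€์กฐ๋œ ์‹œ๊ฐ ํŠน์ง•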

ํ›ˆ๋ จ ๋ฐฉ๋ฒ•์œผ๋กœ๋Š” ์ฃผ๋กœ ์‹œ์—ฐ(๋ฐ๋ชจ) ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ํ–‰๋™ ๋ณต์ œ(Behavior Cloning, BC)๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ์ „๋ฌธ๊ฐ€(๋˜๋Š” ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ)๋กœ๋ถ€ํ„ฐ ์–ป์€ (์ƒํƒœ, ํ–‰๋™) ์Œ์œผ๋กœ ์ •์ฑ…์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. ์—ฐ์†์ ์ธ ํ–‰๋™(a)์„ ์˜ˆ์ธกํ•  ๋•Œ ์†์‹ค ํ•จ์ˆ˜๋Š” ํ‰๊ท ์ œ๊ณฑ์˜ค์ฐจ(MSE) ํ˜•ํƒœ๋กœ ์„ค์ •๋ฉ๋‹ˆ๋‹ค:

L_{BC} = \mathbb{E}\left[ \tfrac{1}{2} \lVert a - a^{*} \rVert^{2} \right],

์—ฌ๊ธฐ์„œ a๋Š” ์ •์ฑ…์ด ์˜ˆ์ธกํ•œ ํ–‰๋™, a^{*}๋Š” ์ „๋ฌธ๊ฐ€๊ฐ€ ์‹ค์ œ ์ˆ˜ํ–‰ํ•œ ํ–‰๋™์ž…๋‹ˆ๋‹ค. ๋งŒ์•ฝ ํ–‰๋™์„ ์ด์‚ฐ์ ์œผ๋กœ ๋‚˜๋ˆ„์–ด ํ‘œํ˜„ํ•  ๊ฒฝ์šฐ, ๋Œ€์‹  ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ์†์‹ค์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด ํ”ฝ ์•ค ํ”Œ๋ ˆ์ด์Šค(pick-and-place)์—์„œ๋Š” ๋กœ๋ด‡ ๋ง๋‹จ์˜ ํ”ฝ(pick) ์œ„์น˜์™€ ํ”Œ๋ ˆ์ด์Šค(place) ์œ„์น˜ ๋‘ ์ง€์ ์„ ์˜ˆ์ธกํ•˜๋Š”๋ฐ, ์ด๋•Œ๋„ BC ๊ธฐ๋ฐ˜ ์†์‹ค์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด์ฒ˜๋Ÿผ ๋ชจ๋ฐฉํ•™์Šต์„ ํ†ตํ•ด ์ œ์–ด ์ •์ฑ…์„ ํ•™์Šตํ•˜๋ฉด, ์ฃผ์–ด์ง„ ์–ธ์–ด ์ง€์‹œ์— ๋งž๊ฒŒ ํ–‰๋™ ๊ฒฝ๋กœ๋ฅผ ๋น ๋ฅด๊ฒŒ ์ตํž ์ˆ˜ ์žˆ์ง€๋งŒ, ์‚ฌ๋žŒ์˜ ์‹œ์—ฐ ๋ฐ์ดํ„ฐ๊ฐ€ ์ถฉ๋ถ„ํ•ด์•ผ ํ•˜๊ณ  ์ƒˆ๋กœ์šด ์ƒํ™ฉ์— ๋Œ€ํ•œ ์ผ๋ฐ˜ํ™”๊ฐ€ ๋ถ€์กฑํ•  ์ˆ˜ ์žˆ๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.
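์œ„ ์†์‹ค์„ ํ•™์Šต ์ฝ”๋“œ ํ˜•ํƒœ๋กœ ์˜ฎ๊ธฐ๋ฉด ๋Œ€๋žต ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. policy์˜ ํ˜ธ์ถœ ๋ฐฉ์‹๊ณผ ์ž…์ถœ๋ ฅ ํ˜•ํƒœ๋Š” ์„ค๋ช…์„ ์œ„ํ•œ ๊ฐ€์ •์ž…๋‹ˆ๋‹ค.

import torch.nn.functional as F

def bc_loss(policy, obs, lang, expert_action, discrete: bool = False):
    """ํ–‰๋™ ๋ณต์ œ(BC) ์†์‹ค ์Šค์ผ€์น˜: ์—ฐ์† ํ–‰๋™์€ MSE, ์ด์‚ฐํ™”๋œ ํ–‰๋™์€ ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ."""
    pred = policy(obs, lang)                 # ๊ฐ€์ƒ์˜ ์ •์ฑ…: (B, action_dim) ๋˜๋Š” (B, num_bins)
    if discrete:
        # expert_action: ๊ฐ ์ƒ˜ํ”Œ์˜ ์ •๋‹ต bin ์ธ๋ฑ์Šค, shape (B,)
        return F.cross_entropy(pred, expert_action)
    # expert_action: ์ „๋ฌธ๊ฐ€๊ฐ€ ์‹ค์ œ ์ˆ˜ํ–‰ํ•œ ์—ฐ์† ํ–‰๋™, shape (B, action_dim)
    return 0.5 * F.mse_loss(pred, expert_action)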

# ๊ณ ์ˆ˜์ค€ ๊ณผ์ œ-ํ•˜์œ„ ๊ณผ์ œ ๊ณ„์ธต ๋ชจ์˜ (์˜์‚ฌ์ฝ”๋“œ)
# TaskPlanner, ControlPolicy, get_robot_observation ๋“ฑ์€ ์„ค๋ช…์„ ์œ„ํ•œ ๊ฐ€์ƒ์˜ ์ธํ„ฐํŽ˜์ด์Šค์ž…๋‹ˆ๋‹ค.

goal_instruction = "์ฑ…์ƒ ์œ„์˜ ๋ฌผ๊ฑด๋“ค์„ ์ •๋ฆฌํ•ด ์ค˜"

# (1) ๊ณ ์ˆ˜์ค€ ๊ณ„ํš: LLM ๋“ฑ์œผ๋กœ ์„œ๋ธŒํƒœ์Šคํฌ(๋ฌธ์ž์—ด) ๋ชฉ๋ก ์ƒ์„ฑ
plan = TaskPlanner.generate_plan(goal_instruction)
# ์˜ˆ: plan = ["์˜์ž ๋’ค๋กœ ๋ฐ€๊ธฐ", "์ฑ…์ƒ ์œ„ ์ฑ…๋“ค ์ค„ ์ •๋ฆฌํ•˜๊ธฐ", "์ปต๋“ค์„ ์˜ฎ๊ธฐ๊ธฐ"]

# (2) ๊ฐ ์„œ๋ธŒํƒœ์Šคํฌ๋ณ„๋กœ ์ €์ˆ˜์ค€ ์ œ์–ด ์ •์ฑ… ์‹คํ–‰
for subtask in plan:
    current_state = get_robot_observation()                     # ์นด๋ฉ”๋ผ, ๊ด€์ ˆ ์ƒํƒœ ๋“ฑ ๊ด€์ธก
    # ์„œ๋ธŒํƒœ์Šคํฌ ์™„๋ฃŒ ์—ฌ๋ถ€๋Š” ๊ณ„ํš์ž(๋˜๋Š” ๋ณ„๋„ ์„ฑ๊ณต ํŒ์ •๊ธฐ)๊ฐ€ ํŒ๋‹จํ•œ๋‹ค๊ณ  ๊ฐ€์ •
    while not TaskPlanner.is_subtask_done(subtask, current_state):
        action = ControlPolicy.predict(current_state, subtask)  # ์–ธ์–ด+๊ด€์ธก -> ์ €์ˆ˜์ค€ ํ–‰๋™
        execute_robot_action(action)
        current_state = get_robot_observation()

์œ„ ์˜์‚ฌ์ฝ”๋“œ์—์„œ ๋ณด๋“ฏ, VLA๋Š” ๋จผ์ € ์žฅ๊ธฐ ๋ช…๋ น์„ ์ดํ•ดํ•˜์—ฌ ๊ณ„ํš์„ ์„ธ์šด ๋’ค, ์ €์ˆ˜์ค€ ์ •์ฑ…์ด ์‹ค์ œ ํ–‰๋™์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ TaskPlanner๋Š” ๋‹ค์Œ ์ ˆ์—์„œ ๋‹ค๋ฃจ๋Š” ๊ณ ์ˆ˜์ค€ ๊ณ„ํš์ž์ด๋ฉฐ, ControlPolicy๋Š” ์ง€๊ธˆ ์‚ดํŽด๋ณด๋Š” ์ €์ˆ˜์ค€ ์ œ์–ด ์ •์ฑ…์ž…๋‹ˆ๋‹ค.

3. ์ž‘์—… ๊ณ„ํš(Task Planners)

๊ณ ์ˆ˜์ค€ ์ž‘์—… ๊ณ„ํš์ž๋Š” ์žฅ๊ธฐ ๊ณผ์ œ๋ฅผ ์—ฌ๋Ÿฌ ๋‹จ๊ณ„๋กœ ๋ถ„ํ•ดํ•˜์—ฌ ๋กœ๋ด‡์ด ์ฐจ๋ก€์ฐจ๋ก€ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋•์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด โ€œ๋ฐฉ์„ ์ฒญ์†Œํ•ดโ€๋ผ๋Š” ๋ชฉํ‘œ๋Š” ์—ฌ๋Ÿฌ ํ•˜์œ„ ์ž‘์—…์œผ๋กœ ๋‚˜๋‰˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค(๋ฐ”๋‹ฅ ์ฒญ์†Œํ•˜๊ธฐ, ๋ฌผ๊ฑด ์ œ์ž๋ฆฌ์— ๋†“๊ธฐ, ์“ฐ๋ ˆ๊ธฐํ†ต ๋น„์šฐ๊ธฐ ๋“ฑ). ์ด๋•Œ VLA์˜ Task Planner๋Š” ์ธ๊ฐ„์˜ ํผ์ฆ ๋งž์ถ”๊ธฐ์™€๋„ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๋กœ๋ด‡์˜ ์‹œ์•ผ์™€ ์–ธ์–ด ์ง€์‹œ๋ฅผ ๋ณด๊ณ , โ€œ์ด ์ž‘์—…์„ ๋จผ์ € ํ•˜๊ณ , ๋‹ค์Œ์—” ์ € ์ž‘์—…โ€ฆโ€๊ณผ ๊ฐ™์€ ์ˆœ์„œ๋ฅผ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค. ํŠนํžˆ ์ตœ๊ทผ์—๋Š” ๋Œ€ํ˜•์–ธ์–ด๋ชจ๋ธ(LLM)์„ ํ™œ์šฉํ•œ ๊ณ„ํš ๊ธฐ๋ฒ•์ด ์ฃผ๋ชฉ๋ฐ›๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ฃผ์š” ์ ‘๊ทผ ๋ฐฉ์‹์€ ํฌ๊ฒŒ ์„ธ ๊ฐ€์ง€์ž…๋‹ˆ๋‹ค: End-to-End, ์–ธ์–ด ๊ธฐ๋ฐ˜, ์ฝ”๋“œ ๊ธฐ๋ฐ˜ ๊ณ„ํš์ž…๋‹ˆ๋‹ค.

  • End-to-End ๊ณ„ํš: ์‹œ๊ฐ-์–ธ์–ด ์ž…๋ ฅ์„ ํฌํ•จํ•œ ๋ชจ๋“  ์ •๋ณด๋ฅผ LLM์— ํ†ตํ•ฉํ•˜์—ฌ ์ง์ ‘ ๊ณ„ํš์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด SayCan ํ”„๋ ˆ์ž„์›Œํฌ์—์„œ๋Š” PaLM๊ณผ ๊ฐ™์€ LLM์ด โ€œ{ํ˜„์žฌ ํ™˜๊ฒฝ, ์ง€์‹œ}โ€๋ฅผ ์ž…๋ ฅ๋ฐ›์•„ ์šฐ์„ ์ˆœ์œ„ ๊ธฐ๋ฐ˜์˜ ์ž‘์—… ๋ฆฌ์ŠคํŠธ๋ฅผ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค. ์ดํ›„ ๋‚ฎ์€ ์ˆ˜์ค€ ์ •์ฑ…์ด ์ˆœ์ฐจ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•˜๋ฉฐ, ํ™˜๊ฒฝ ๋ณ€ํ™”์— ๋”ฐ๋ผ LLM์ด ๋‹ค์‹œ ์žฌ๊ณ„ํšํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. PaLM-E ์—ฐ๊ตฌ์—์„œ๋„ ViT์™€ PaLM์„ ๊ฒฐํ•ฉํ•ด ์ด๋ฏธ์ง€๋ฅผ ๋ณด๊ณ  ํ…์ŠคํŠธ ๊ณ„ํš์„ ์ƒ์„ฑํ•œ ๋’ค, SayCan์„ ํ™œ์šฉํ•˜์—ฌ ๋กœ๋ด‡ ํ–‰๋™์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๋ฐฉ์‹์„ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ฐฉ๋ฒ•์˜ ์žฅ์ ์€ ์ธ๊ฐ„์˜ ํ”Œ๋ž˜๋„ˆ์ฒ˜๋Ÿผ ํ†ตํ•ฉ์ ์œผ๋กœ ์‚ฌ๊ณ ํ•œ๋‹ค๋Š” ์ ์ด์ง€๋งŒ, ๋‹ค๋Ÿ‰์˜ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํ•™์Šต์ด ํ•„์š”ํ•˜๊ณ  ์—ฐ์‚ฐ ๋น„์šฉ์ด ํฌ๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์–ธ์–ด ๊ธฐ๋ฐ˜ ๊ณ„ํš: LLM์„ ์‚ฌ์šฉํ•˜๋˜, ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ ๋ชจ๋‘ ์–ธ์–ด ํ˜•์‹์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด Inner Monologue ๊ธฐ๋ฒ•์€ "๋Œ€ํ™” ๋‚ด๋ ˆ์ด์…˜"์ฒ˜๋Ÿผ LLM์ด ์ˆœ์ฐจ์ ์œผ๋กœ ๊ณ„ํš์„ ์„ธ์šฐ๊ณ , ์ €์ˆ˜์ค€ ์ •์ฑ…์€ ๊ทธ ํ…์ŠคํŠธ ์ง€์‹œ๋ฅผ ๋”ฐ๋ผ ์›€์ง์ž…๋‹ˆ๋‹ค. ์ด ๊ณผ์ •์—์„œ ๋กœ๋ด‡์ด๋‚˜ ์„ผ์„œ๋กœ๋ถ€ํ„ฐ ๋ฐ›์€ ํ”ผ๋“œ๋ฐฑ(์„ฑ๊ณต/์‹คํŒจ, ์˜ค๋ธŒ์ ํŠธ ๋ณ€ํ™” ๋“ฑ)์„ ํ…์ŠคํŠธ๋กœ LLM์— ์ „๋‹ฌํ•˜์—ฌ ๊ณ„์† ๋ณด์ •ํ•ด ๋‚˜๊ฐ‘๋‹ˆ๋‹ค. ๋งˆ์น˜ ์‚ฌ๋žŒ์ด ์†์ƒ๊ฐ์„ ๊ธ€๋กœ ์ ์–ด๊ฐ€๋ฉฐ ๋‹ค์Œ ํ–‰๋™์„ ์ •ํ•˜๋Š” ์…ˆ์ž…๋‹ˆ๋‹ค. LLM-Planner๋Š” LLM์ด ์ƒ์„ฑํ•œ ์–ธ์–ด ๊ณ„ํš์„ ๋‹จ๊ณ„๋ณ„๋กœ ์ œ์–ด ์ •์ฑ…์— ๋„˜๊ธฐ๊ณ , ํ•„์š”ํ•œ ๊ฒฝ์šฐ ๋‹ค์‹œ ๊ณ„ํš์„ ์ˆ˜๋ฆฝํ•˜๋„๋ก ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค. Socratic Models๋Š” ์—ฌ๋Ÿฌ ๋ชจ๋ธ(๋น„์ „, ์–ธ์–ด)์„ ํ”„๋กฌํ”„ํŠธ(prompt)๋ฅผ ๋งค๊ฐœ๋กœ ์‚ผ์•„ ๊ฒฐํ•ฉํ•˜์—ฌ, ๋น„์–ธ์–ด ๋ฐ์ดํ„ฐ๋ฅผ ์–ธ์–ด ์„ค๋ช…์œผ๋กœ ๋ฐ”๊พธ๊ณ  ๋‹ค์‹œ ๋กœ๋ด‡์— ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค. ๊ณตํ†ต์ ์€ ๋ชจ๋‘ ์–ธ์–ด๋ผ๋Š” ์ค‘๊ฐœ ๋งค๊ฐœ์ฒด๋กœ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ •๋ณด๋ฅผ ์ฒ˜๋ฆฌํ•œ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค.
  • ์ฝ”๋“œ ๊ธฐ๋ฐ˜ ๊ณ„ํš: LLM์—๊ฒŒ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์–ธ์–ด๋‚˜ API ํ˜ธ์ถœ ์ฝ”๋“œ ํ˜•ํƒœ๋กœ ๊ณ„ํš์„ ์ƒ์„ฑํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค. ProgPrompt๋Š” LLM์— ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ํ•จ์ˆ˜์™€ ๊ฐ์ฒด ๋ชฉ๋ก์„ ์•Œ๋ ค์ฃผ๊ณ , "ํ”„๋กฌํ”„ํŠธ" ํ˜•์‹์œผ๋กœ ํƒœ์Šคํฌ ๊ณ„ํš์„ ์š”์ฒญํ•ฉ๋‹ˆ๋‹ค. ChatGPT for Robotics๋Š” ์ฃผ์–ด์ง„ ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ ํ•จ์ˆ˜(API)๋ฅผ ์„ค๋ช…ํ•œ ๋’ค ChatGPT๊ฐ€ ๋‹จ๊ณ„๋ณ„๋กœ ํŒŒ์ด์ฌ ์ฝ”๋“œ๋ฅผ ์ƒ์„ฑํ•˜๋„๋ก ํ•˜์—ฌ ๋กœ๋ด‡ ์ œ์–ด์— ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด pick_object(), move_robot()์™€ ๊ฐ™์€ API ๋ชฉ๋ก์„ ์ •์˜ํ•˜๊ณ  ChatGPT์—๊ฒŒ "์ปต์„ ์˜ฎ๊ธฐ๋Š” ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•ด ์ค˜"๋ผ๊ณ  ํ•˜๋ฉด, ChatGPT๊ฐ€ ํ•ด๋‹น ํ•จ์ˆ˜๋ฅผ ํ˜ธ์ถœํ•˜๋Š” ์ฝ”๋“œ๋ฅผ ๋งŒ๋“ค์–ด์ค๋‹ˆ๋‹ค. Code-as-Policies๋Š” LLM์ด ์ •์ฑ… ์ž์ฒด๋ฅผ ์ฝ”๋“œ๋กœ ์ž‘์„ฑํ•˜์—ฌ ์‹คํ–‰ํ•˜๊ฒŒ ํ•˜๊ณ , DEPS๋Š” LLM์œผ๋กœ ๊ณ„ํš์„ ์„ธ์šฐ๊ณ  ์‹คํŒจ ์›์ธ์„ ์„ค๋ช…ํ•˜๊ฒŒ ํ•˜์—ฌ ๋‹ค์‹œ ๊ณ„ํšํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฐฉ์‹๋“ค์€ LLM์˜ ์ฝ”๋”ฉ ๋Šฅ๋ ฅ๊ณผ ์„ธ๊ณ„์ง€์‹์„ ํ™œ์šฉํ•˜๋ฏ€๋กœ ๋ณต์žกํ•œ ๋…ผ๋ฆฌ ๊ณ„ํš์— ๊ฐ•์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์•ˆ์ •์„ฑ ๊ฒ€์‚ฌ, ๋ฒ„๊ทธ ๊ฐ€๋Šฅ์„ฑ ๋“ฑ ์‹ค์ œ ๋กœ๋ด‡ ์ ์šฉ์˜ ์œ„ํ—˜์„ฑ์„ ์„ธ์‹ฌํžˆ ๊ด€๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฐฉ์‹์˜ ๊ฐ„๋‹จํ•œ ์Šค์ผ€์น˜๋ฅผ ๋ชฉ๋ก ๋ฐ”๋กœ ์•„๋ž˜์— ๋ง๋ถ™์˜€์Šต๋‹ˆ๋‹ค.
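์•„๋ž˜๋Š” ์ด๋Ÿฐ ์ฝ”๋“œ ๊ธฐ๋ฐ˜ ๊ณ„ํš์˜ ํ๋ฆ„์„ ๋ณด์—ฌ์ฃผ๋Š” ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. llm_generate(), passes_safety_checks()์™€ ๋กœ๋ด‡ API ์ด๋ฆ„๋“ค์€ ๋ชจ๋‘ ์„ค๋ช…์„ ์œ„ํ•œ ๊ฐ€์ •์ด๋ฉฐ, ์ƒ์„ฑ๋œ ์ฝ”๋“œ๋Š” ๋ณธ๋ฌธ์—์„œ ์ง€์ ํ•œ ๋Œ€๋กœ ๊ฒ€์ฆ์„ ๊ฑฐ์นœ ๋’ค์—๋งŒ ์‹คํ–‰ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

# ๊ฐ€์ƒ์˜ ๋กœ๋ด‡ API ๋ชฉ๋ก์„ ํ”„๋กฌํ”„ํŠธ์— ๋„ฃ๊ณ , LLM์ด ์ด๋ฅผ ํ˜ธ์ถœํ•˜๋Š” ์ฝ”๋“œ๋ฅผ ์ƒ์„ฑํ•˜๊ฒŒ ํ•˜๋Š” ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค.
PROMPT = """๋‹น์‹ ์€ ๋กœ๋ด‡ ์ œ์–ด ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•ฉ๋‹ˆ๋‹ค. ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ํ•จ์ˆ˜:
  move_robot(x, y, z)          # ๋ง๋‹จ์„ ์ง€์ • ์ขŒํ‘œ๋กœ ์ด๋™
  pick_object(name)            # ์ด๋ฆ„์œผ๋กœ ์ง€์ •ํ•œ ๋ฌผ์ฒด ์žก๊ธฐ
  place_object(name, x, y, z)  # ๋ฌผ์ฒด๋ฅผ ์ง€์ • ์œ„์น˜์— ๋‚ด๋ ค๋†“๊ธฐ
์ง€์‹œ: "์ปต์„ ์‹ฑํฌ๋Œ€ ์˜†์œผ๋กœ ์˜ฎ๊ฒจ ์ค˜"
์œ„ ํ•จ์ˆ˜๋งŒ ์‚ฌ์šฉํ•œ ํŒŒ์ด์ฌ ์ฝ”๋“œ๋ฅผ ์ถœ๋ ฅํ•˜์„ธ์š”."""

plan_code = llm_generate(PROMPT)      # ๊ฐ€์ƒ์˜ LLM ํ˜ธ์ถœ, ์˜ˆ: "pick_object('cup')\nmove_robot(...)"
if passes_safety_checks(plan_code):   # ํ—ˆ์šฉ๋œ ํ•จ์ˆ˜๋งŒ ํ˜ธ์ถœํ•˜๋Š”์ง€ ๋“ฑ์„ ๊ฒ€์‚ฌํ•˜๋Š” ๊ฐ€์ƒ์˜ ๊ฒ€์ฆ ๋‹จ๊ณ„
    exec(plan_code, {"move_robot": move_robot,
                     "pick_object": pick_object,
                     "place_object": place_object})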

flowchart LR
    SubtaskPlanner(๊ณ ์ˆ˜์ค€ ๊ณ„ํš์ž) -->|์ฑ…์ƒ ์ •๋ฆฌ| ActionPolicy(์ €์ˆ˜์ค€ ์ •์ฑ…)
    ActionPolicy --> Robot(๋กœ๋ด‡ ๋™์ž‘)
    Robot --> Environment(ํ™˜๊ฒฝ)
    Environment --> SubtaskPlanner
    style SubtaskPlanner fill:#ffe0b2,stroke:#333,stroke-width:1px

์œ„ ๋‹ค์ด์–ด๊ทธ๋žจ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ, ๊ณ ์ˆ˜์ค€ ๊ณ„ํš์ž(SubtaskPlanner)๋Š” ์–ธ์–ด ์ง€์‹œ๋กœ๋ถ€ํ„ฐ ๊ตฌ์ฒด์ ์ธ ์„œ๋ธŒํƒœ์Šคํฌ๋ฅผ ์ƒ์„ฑํ•ด ์ €์ˆ˜์ค€ ์ •์ฑ…์— ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค. ์ €์ˆ˜์ค€ ์ •์ฑ…์€ ๊ทธ์— ๋”ฐ๋ผ ๋กœ๋ด‡์„ ์›€์ง์—ฌ ํ–‰๋™ํ•˜๊ณ , ํ™˜๊ฒฝ ๋ณ€ํ™”๋ฅผ ๋‹ค์‹œ ๊ณ„ํš์ž์—๊ฒŒ ์•Œ๋ ค์ฃผ๋Š” ์ˆœํ™˜ ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค.

4. ๋ฐ์ดํ„ฐ์…‹ยท์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ์™€ ํ‰๊ฐ€

VLA ์—ฐ๊ตฌ์— ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ์…‹๊ณผ ํ™˜๊ฒฝ์€ ํฌ๊ฒŒ ๋‘ ์ถ•์œผ๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค. ํ˜„์‹ค ๋กœ๋ด‡ ๋ฐ์ดํ„ฐ์™€ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ํ™˜๊ฒฝ์ž…๋‹ˆ๋‹ค. ํ˜„์‹ค ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์ง‘ํ•˜๋Š” ๊ฒƒ์€ ๋น„์šฉ๊ณผ ์‹œ๊ฐ„์ด ๋งค์šฐ ๋งŽ์ด ๋“ญ๋‹ˆ๋‹ค. ๋กœ๋ด‡ ์žฅ๋น„ ํ™•๋ณด, ํ™˜๊ฒฝ ๊ตฌ์ถ•, ์ „๋ฌธ ์กฐ์ž‘์ž ํˆฌ์ž… ๋“ฑ ์ œ์•ฝ์ด ๋งŽ๊ณ , ๋‹ค์–‘ํ•œ ๋กœ๋ด‡ ์œ ํ˜•๊ณผ ์„ค์ • ๊ฐ„์˜ ๋ฐ์ดํ„ฐ ๋ถˆ์ผ์น˜ ๋ฌธ์ œ๋„ ํฝ๋‹ˆ๋‹ค. ๊ทธ ๊ฒฐ๊ณผ ํ˜„์‹ค์—์„œ ์–ป์€ ๋Œ€๊ทœ๋ชจ ๊ณต๊ฐœ ๋ฐ์ดํ„ฐ์…‹์€ ๋“œ๋ญ…๋‹ˆ๋‹ค.

๋ฐ˜๋ฉด, ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ๋ฅผ ์ด์šฉํ•˜๋ฉด ๋น„์šฉ์„ ํฌ๊ฒŒ ์ค„์ด๊ณ  ๋Œ€๋Ÿ‰์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋น ๋ฅด๊ฒŒ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋Œ€ํ‘œ์  ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ๋กœ๋Š” Unity ๊ธฐ๋ฐ˜์˜ AI2-THOR, TDW, SAPIEN; Gazebo/Bullet ๊ธฐ๋ฐ˜์˜ iGibson, Habitat; MuJoCo ๊ธฐ๋ฐ˜์˜ Meta-World, RoboSuite ๋“ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด AI2-THOR๋Š” ๊ฐ€์ƒ ์ฃผ๋ฐฉ/๊ฑฐ์‹ค์—์„œ ๋ฌผ์ฒด ์กฐ์ž‘ ํƒœ์Šคํฌ๋ฅผ, Habitat/Gibson์€ ์‹ค๋‚ด ๋„ค๋น„๊ฒŒ์ด์…˜์„, Meta-World๋Š” ๋กœ๋ด‡ํŒ” ์กฐ์ž‘ ๊ณผ์ œ๋ฅผ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์‹œ๋ฎฌ ํ™˜๊ฒฝ์—์„œ๋Š” ๋กœ๋ด‡ ์นด๋ฉ”๋ผ(RGB, ๊นŠ์ด, ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜ ๋“ฑ)๋ฅผ ์ž์œ ๋กญ๊ฒŒ ์„ค์ •ํ•  ์ˆ˜ ์žˆ๊ณ , ๋‹ค์–‘ํ•œ ์ž‘์—… ์‹œ๋‚˜๋ฆฌ์˜ค(์–ผ๊ตด ๋‹ฆ๊ธฐ, ๊ทธ๋ฆ‡ ์ •๋ฆฌ ๋“ฑ)๋ฅผ ์ž๋™ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•˜์ง€๋งŒ ์‹œ๋ฎฌ๋ ˆ์ด์…˜์—๋„ ํ•œ๊ณ„๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ํ˜„์‹ค๊ณผ ์‹œ๋ฎฌ ๊ฐ„ ๋ถˆ์ผ์น˜(sim-to-real gap)๊ฐ€ ๊ฐ€์žฅ ํฐ ๋ฌธ์ œ์ธ๋ฐ, ๊ทธ๋ž˜ํ”ฝ ํ’ˆ์งˆ ์ฐจ์ด, ๋ฌผ๋ฆฌ ์‹œ๋ฎฌ๋ ˆ์ด์…˜์˜ ๋ถ€์ •ํ™•์„ฑ, ์ƒˆ๋กœ์šด ๋ฌผ์ฒด ๋ชจ๋ธ๋ง ์–ด๋ ค์›€ ๋“ฑ์ด ๊ทธ ์›์ธ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด ์•ก์ฒด๋‚˜ ์ฒœ ๊ฐ™์€ ๋น„๊ฐ•์ฒด ๊ฐ์ฒด๋ฅผ ํ˜„์‹ค์ฒ˜๋Ÿผ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ํ•˜๋Š” ๊ฒƒ์€ ๋งค์šฐ ๊นŒ๋‹ค๋กญ์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ ์œ„์—์„œ ์ž˜ ํ•™์Šต๋œ VLA๋„ ํ˜„์‹ค ๋กœ๋ด‡์— ์˜ฎ๊ธฐ๋ฉด ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ๋–จ์–ด์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๋ ค๋ฉด ๋„๋ฉ”์ธ ๋žœ๋คํ™”, ์ •๊ตํ•œ ๋ฌผ๋ฆฌ ๋ชจ๋ธ๋ง, ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ ๋ณด์ • ์—ฐ๊ตฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๋กœ๋ด‡ ์—ฐ๊ตฌ์ž๋“ค์€ ๋˜ํ•œ ๋ฒค์น˜๋งˆํฌ๋ฅผ ํ†ตํ•ด ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ์ œ์–ด ์ •์ฑ…์€ ๋ณดํ†ต ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ํ™˜๊ฒฝ ์œ„์—์„œ ๋‹จ์ˆœ ์กฐ์ž‘ ์ •ํ™•๋„๋‚˜ ์„ฑ๊ณต๋ฅ ๋กœ ํ‰๊ฐ€๋˜๋ฉฐ, ์ž‘์—… ๊ณ„ํš์ž๋Š” ์žฅ๊ธฐ ๊ณผ์ œ ์„ฑ๊ณต๋ฅ (์˜ˆ: ALFRED, RoboTHOR์—์„œ โ€œ์˜ค๋ธ์„ ์ผœ๋ผโ€)๋กœ ํŒ๋‹จํ•ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ด๋“ค ๋ฒค์น˜๋งˆํฌ๊ฐ€ ์‹ค์ œ ๋ฌผ๋ฆฌ ํ™˜๊ฒฝ์„ ์™„๋ฒฝํžˆ ๋ฐ˜์˜ํ•˜์ง€ ๋ชปํ•˜๊ณ , ๊ณ ์ˆ˜์ค€-์ €์ˆ˜์ค€ ํ†ตํ•ฉ ๋Šฅ๋ ฅ์„ ์ธก์ •ํ•˜๊ธฐ ์–ด๋ ต๋‹ค๋Š” ์ง€์ ๋„ ์žˆ์Šต๋‹ˆ๋‹ค. ํ–ฅํ›„์—๋Š” ์‹œ๋ฎฌ ์‹คํ—˜๊ณผ ํ•จ๊ป˜ ์‹ค์ œ ๋กœ๋ด‡ ์‹คํ—˜๋„ ํ‘œ์ค€ํ™”ํ•˜์—ฌ, ๋” ํ˜„์‹ค์ ์ธ ํ‰๊ฐ€ ์ฒด๊ณ„๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

5. ๋„์ „ ๊ณผ์ œ์™€ ํ–ฅํ›„ ๋ฐฉํ–ฅ

๋น„์ „-์–ธ์–ด-์•ก์…˜ ๋ชจ๋ธ์€ ๊ฐ•๋ ฅํ•œ ์ž ์žฌ๋ ฅ์„ ์ง€๋…”์ง€๋งŒ, ํ•ด๊ฒฐํ•ด์•ผ ํ•  ๋ฌธ์ œ๋„ ๋งŽ์Šต๋‹ˆ๋‹ค. ์ฃผ์š” ๋„์ „ ๊ณผ์ œ๋ฅผ ์ •๋ฆฌํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  • ๋ฐ์ดํ„ฐ ๋ถ€์กฑ(Scarcity): ์‹ค์ œ ๋กœ๋ด‡ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ถ€์กฑํ•ฉ๋‹ˆ๋‹ค. ํ˜„์‹ค ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘์€ ๋น„์šฉยท์‹œ๊ฐ„์ ์œผ๋กœ ์–ด๋ ค์›Œ ๋ฉ€ํ‹ฐํƒœ์Šคํฌ ํ•™์Šต์ด ํž˜๋“ญ๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด ์‹œ๋ฎฌ ๋ฐ์ดํ„ฐ๋Š” ํ’๋ถ€ํ•˜์ง€๋งŒ ์•ž์„œ ์–ธ๊ธ‰ํ•œ ๊ฐญ ๋ฌธ์ œ๋กœ ํ˜„์‹ค ์ ์šฉ์„ฑ์ด ๋–จ์–ด์ง‘๋‹ˆ๋‹ค. ํ•ด๊ฒฐ์ฑ…์œผ๋กœ๋Š” ๊ธฐ๊ด€ ๊ฐ„ ํ˜‘์—…์œผ๋กœ ๋กœ๋ด‡ ๋ฐ์ดํ„ฐ ๊ณต์œ , ํ˜น์€ ์‚ฌ๋žŒ ๋™์ž‘ ๋ฐ์ดํ„ฐ ํ™œ์šฉ(๋ฐ๋ชจ, AR/VR ํ™œ์šฉ)์ด ๋ชจ์ƒ‰๋˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์šด๋™ ๊ณ„ํš(Motion Planning): ํ˜„์žฌ์˜ ์ •์ฑ…์€ ๋Œ€๋ถ€๋ถ„ ๋‹จ์ผ ๊ด€์ ˆ ํ˜น์€ ๋ง๋‹จ ๋กœ๋ด‡ํŒ”์˜ ์œ„์น˜๋ฅผ ์ œ์–ดํ•˜์ง€๋งŒ, ๋ณต์žกํ•œ ์žฅ๊ธฐ ์ž‘์—…์—์„œ ํ•„์š”ํ•œ ์ •๋ฐ€ ์šด๋™ ๊ณ„ํš ๋Šฅ๋ ฅ์ด ๋ถ€์กฑํ•ฉ๋‹ˆ๋‹ค. ๊ณต๊ตฌ ์‚ฌ์šฉ, ๋ณต์žกํ•œ ๊ฒฝ๋กœ ํšŒํ”ผ, ์ •๋ฐ€ ์กฐ์ž‘ ๋“ฑ์—์„œ ๋” ์ •๊ตํ•œ ์šด๋™ ๊ณ„ํš ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋กœ๋ด‡์˜ ๋ฏผ์ฒฉ์„ฑ๊ณผ ํผํฌ๋จผ์Šค ํ–ฅ์ƒ์œผ๋กœ ์ด์–ด์ง‘๋‹ˆ๋‹ค.
  • ์‹ค์‹œ๊ฐ„ ์‘๋‹ต์„ฑ(Real-Time): ๋งŽ์€ ๋กœ๋ด‡ ์‘์šฉ์€ ์งง์€ ์ง€์—ฐ์œผ๋กœ ๋น ๋ฅธ ์˜์‚ฌ๊ฒฐ์ •์„ ์š”๊ตฌํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋Œ€ํ˜• ์–ธ์–ด๋ชจ๋ธ์„ ์“ฐ๋ฉด ๊ณ„์‚ฐ์ด ๋А๋ฆฌ๊ณ , ์‹ค์ œ ํ™˜๊ฒฝ ๋ณ€ํ™”์— ์ฆ‰์‹œ ๋Œ€์‘ํ•˜๊ธฐ ์–ด๋ ต์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ํšจ์œจ์ ์ธ ๊ฒฝ๋Ÿ‰ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ํ•˜๋“œ์›จ์–ด ๊ฐ€์†, ์ „์ฒด ์‹œ์Šคํ…œ ์ตœ์ ํ™”๊ฐ€ ์š”๊ตฌ๋ฉ๋‹ˆ๋‹ค.
  • ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํ†ตํ•ฉ(Multi-modal Fusion): ์‹œ๊ฐยท์–ธ์–ด ์™ธ์—๋„ ์Œ์„ฑ, ์ด‰๊ฐ ๋“ฑ ๋‹ค์–‘ํ•œ ์„ผ์„œ ์ •๋ณด๋ฅผ ํ†ตํ•ฉํ•˜๋Š” ์ผ์ด ์ˆ™์ œ์ž…๋‹ˆ๋‹ค. ํŠนํžˆ ์ฒญ๊ฐ ์ •๋ณด๋ฅผ ์ด์šฉํ•˜๋ฉด ๊ฐ€์ „์ œํ’ˆ์˜ ์ž‘๋™์Œ์œผ๋กœ ์ƒํ™ฉ ํŒŒ์•…์ด ๊ฐ€๋Šฅํ•˜๊ณ , ์Œ์„ฑ ๋ช…๋ น ์ฒ˜๋ฆฌ๋กœ ์‚ฌ์šฉ์ž์™€ ์ž์—ฐ์Šค๋ ˆ ๋Œ€ํ™”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•ž์œผ๋กœ๋Š” ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํ•™์Šต๊ณผ ํ“จ์ „ ๊ธฐ์ˆ ์˜ ๋ฐœ์ „์ด VLA์˜ ํ˜„์‹ค ๋ฐ˜์˜์„ฑ์„ ๋†’์ผ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • ์ผ๋ฐ˜ํ™”(Generalization): VLA๊ฐ€ ๋‹ค์–‘ํ•œ ๋ฏธ์ง€์˜ ์ƒํ™ฉ์—์„œ๋„ ์–ธ์–ด ์ง€์‹œ๋ฅผ ์ดํ•ดํ•˜๊ณ  ์‹คํ–‰ํ•˜๋ ค๋ฉด ์‚ฌ๋žŒ ์ˆ˜์ค€์˜ ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ChatGPT๊ฐ€ ๋‹ค์–‘ํ•œ ๋Œ€ํ™”์—์„œ ์œ ์—ฐํ•˜๋“ฏ์ด, VLA๋„ ๋‹ค์–‘ํ•œ ์ž‘์—…, ํ™˜๊ฒฝ, ๋กœ๋ด‡ ํƒ€์ž…์—์„œ ๊ฒฌ๊ณ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋” ํฐ ๊ทœ๋ชจ์˜ ๋ฉ€ํ‹ฐํƒœ์Šคํฌ ํ•™์Šต, ๋„๋ฉ”์ธ ์–ด๋Œ‘ํ…Œ์ด์…˜, meta-learning ๊ธฐ๋ฒ• ์—ฐ๊ตฌ๊ฐ€ ํ™œ๋ฐœํ•ฉ๋‹ˆ๋‹ค.
  • ์žฅ๊ธฐ ์ž‘์—…(Long-Horizon Task): โ€œํ™”๋ถ„์— ๋ฌผ์„ ์ค˜โ€ ๊ฐ™์€ ์งง์€ ๋ช…๋ น๋„ ์‹ค์ œ๋กœ๋Š” ์—ฌ๋Ÿฌ ๋‹จ๊ณ„ ๊ณผ์ œ๋กœ ์ด์–ด์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค(๋กœ๋ด‡ํŒ” ์ด๋™ โ†’ ๋ฌผํ†ต ์ง‘๊ธฐ โ†’ ํ™”๋ถ„์œผ๋กœ ์ด๋™ โ†’ ๋ฌผ ๋ถ“๊ธฐ). ํ˜„์žฌ ๊ณ ์ˆ˜์ค€ ๊ณ„ํš์ž ๋ชจ๋ธ์€ ์ดˆ๊ธฐ ์„ฑ๊ณผ๋ฅผ ๋ณด์˜€์ง€๋งŒ, ๋Œ€๋ถ€๋ถ„์˜ LLM์€ ์ธ๊ฐ„์˜ ๋ฌผ๋ฆฌ์  ์ง€์‹์ด ๋ถ€์กฑํ•˜์—ฌ ๊ธด ๊ณ„ํš์„ ์™„๋ฒฝํžˆ ์ˆ˜ํ–‰ํ•˜์ง€ ๋ชปํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ๊ณ„ํš ๋Šฅ๋ ฅ๊ณผ ์ธ์ง€ ๋Šฅ๋ ฅ์„ ๋™์‹œ์— ํ‚ค์šฐ๋Š” ์—ฐ๊ตฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
  • ๊ธฐ์ดˆ ๋ชจ๋ธ(Foundation Model)์˜ ๋ถ€์žฌ: ์ด๋ฏธ์ง€๋Š” CLIP, ํ…์ŠคํŠธ๋Š” GPT์ฒ˜๋Ÿผ ๋‹จ์ผ ๋ฒ”์šฉ ๋ชจ๋ธ์ด ์กด์žฌํ•˜์ง€๋งŒ, ๋กœ๋ด‡ ์ œ์–ด ์ „์šฉ์˜ ๊ฑฐ๋Œ€ ๋ชจ๋ธ์€ ์•„์ง ์—†์Šต๋‹ˆ๋‹ค. ๋‹ค์–‘ํ•œ ๋กœ๋ด‡๊ณผ ํ™˜๊ฒฝ์„ ์•„์šฐ๋ฅด๋Š” ๊ณต์šฉ ๋ชจ๋ธ์„ ๋งŒ๋“ค๋ ค๋ฉด ์›น ๊ทœ๋ชจ์˜ ๋กœ๋ด‡ ํ–‰๋™ ๋ฐ์ดํ„ฐ์™€ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํ•™์Šต์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
  • ์•ˆ์ „์„ฑ(Safety)๊ณผ ์œค๋ฆฌ: ๋กœ๋ด‡์€ ๋ฌผ๋ฆฌ์  ์„ธ๊ณ„์™€ ์ƒํ˜ธ์ž‘์šฉํ•˜๋ฏ€๋กœ ์ž˜๋ชป๋œ ๋™์ž‘์€ ์ธ๋ช…ยท์žฌ์‚ฐ ํ”ผํ•ด๋กœ ์ด์–ด์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ VLA ์˜์‚ฌ๊ฒฐ์ •์„ ํˆฌ๋ช…ํ•˜๊ฒŒ ํ•˜๊ณ , ์˜ˆ์ธก ๋ถˆ๊ฐ€๋Šฅํ•œ ํ–‰๋™์„ ์ œ์–ดํ•˜๋Š” ์•ˆ์ „ ๋ฉ”์ปค๋‹ˆ์ฆ˜ ์—ฐ๊ตฌ๊ฐ€ ํ•„์ˆ˜์ ์ž…๋‹ˆ๋‹ค. ๋˜ํ•œ ๊ฐœ์ธ์ •๋ณด๋‚˜ ํŽธํ–ฅ ์—†๋Š” ํŒ๋‹จ ๋“ฑ ์œค๋ฆฌยท์‚ฌํšŒ์  ๊ณ ๋ ค๋„ ํ•จ๊ป˜ ๋…ผ์˜๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์ด๋Ÿฌํ•œ ๋„์ „ ๊ณผ์ œ๋“ค์„ ํ•ด๊ฒฐํ•˜๋ฉด, VLA ๊ธฐ๋ฐ˜ ๋กœ๋ด‡์€ ์‚ฐ์—… ํ˜„์žฅ๋ฟ ์•„๋‹ˆ๋ผ ๊ฐ€์ •, ์˜๋ฃŒ, ์„œ๋น„์Šค ๋“ฑ ๋‹ค์–‘ํ•œ ๋ถ„์•ผ์— ํญ๋„“๊ฒŒ ํ™œ์šฉ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ฒฐ๋ก 

๋น„์ „-์–ธ์–ด-์•ก์…˜ ๋ชจ๋ธ์€ ๋กœ๋ด‡๊ณตํ•™์—์„œ ์–ธ์–ด์™€ ์‹œ๊ฐ ์ •๋ณด๋ฅผ ํ™œ์šฉํ•ด ๋กœ๋ด‡ ํ–‰๋™์„ ์ƒ์„ฑํ•œ๋‹ค๋Š” ์ ์—์„œ ํ˜์‹ ์ ์ธ ์ ‘๊ทผ์ž…๋‹ˆ๋‹ค. ๋ณธ ์„œ๋ฒ ์ด์—์„œ๋Š” VLAs๋ฅผ ์‚ฌ์ „ํ›ˆ๋ จ(๋น„์ „ยท๋™์—ญํ•™ยท์›”๋“œ ๋ชจ๋ธ), ์ œ์–ด ์ •์ฑ…(์–ธ์–ด+์‹œ๊ฐโ†’ํ–‰๋™), ์ž‘์—… ๊ณ„ํš(์žฅ๊ธฐ๊ณผ์ œ ๋ถ„ํ•ด) ์„ธ ์ถ•์œผ๋กœ ์ฒด๊ณ„ํ™”ํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ฐ ๋ถ€๋ฌธ์—์„œ CLIP, R3M, Dreamer ๊ฐ™์€ ๋ชจ๋ธ๋ถ€ํ„ฐ RT-2, PaLM-E ๊ฐ™์€ ์ตœ์‹  LLM ๊ธฐ๋ฐ˜ ๋ชจ๋ธ, SayCan๊ณผ ProgPrompt ๊ฐ™์€ ๊ณ ์ˆ˜์ค€ ๊ณ„ํš์ž๊นŒ์ง€ ๋‹ค์–‘ํ•œ ์—ฐ๊ตฌ๊ฐ€ ์ง„ํ–‰๋˜์–ด ์™”์Šต๋‹ˆ๋‹ค.

VLAs๋Š” ์ด๋ฏธ ๋ณต์žกํ•œ ํ™˜๊ฒฝ์—์„œ ๋‹ค์–‘ํ•œ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋Šฅ๋ ฅ์„ ๋ณด์—ฌ์ฃผ๋ฉฐ ํฐ ๊ฐ€๋Šฅ์„ฑ์„ ๋“œ๋Ÿฌ๋ƒˆ์ง€๋งŒ, ์—ฌ์ „ํžˆ ์ผ๋ฐ˜ํ™”, ํšจ์œจ์„ฑ, ์•ˆ์ „ ๋“ฑ ํ•ด๊ฒฐ ๊ณผ์ œ๊ฐ€ ๋‚จ์•„ ์žˆ์Šต๋‹ˆ๋‹ค. ์•ž์œผ๋กœ ์‹ค์ œ ๋กœ๋ด‡ ๋ฐ์ดํ„ฐ ๊ตฌ์ถ•, ์‹œ๋ฎฌ-์‹ค ๊ฐ„ ์—ฐ๊ตฌ, ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํ•™์Šต ๊ธฐ๋ฒ• ๊ฐœ๋ฐœ์ด ํ™œ๋ฐœํ•ด์งˆ ๊ฒƒ์œผ๋กœ ๊ธฐ๋Œ€๋ฉ๋‹ˆ๋‹ค. ๋ณธ ๋ฆฌ๋ทฐ๊ฐ€ ๋กœ๋ด‡๊ณตํ•™์ž๋“ค์—๊ฒŒ VLA ๊ฐœ๋…๊ณผ ์ตœ๊ทผ ์—ฐ๊ตฌ ๋™ํ–ฅ์— ๋Œ€ํ•œ ์ง๊ด€์  ์ดํ•ด์™€ ๊ธฐ์ˆ ์  ํ†ต์ฐฐ์„ ์ œ๊ณตํ•˜์—ฌ, ๋ฏธ๋ž˜์˜ ๋กœ๋ด‡ ์‹œ์Šคํ…œ ๊ฐœ๋ฐœ์— ์‹ค์งˆ์ ์ธ ๋„์›€์ด ๋˜๊ธธ ๋ฐ”๋ž๋‹ˆ๋‹ค.

์ฐธ๊ณ ์‚ฌํ•ญ

Copyright 2026, JungYeon Lee