Curieux.JY
  • JungYeon Lee

📃 MolmoAct2

VLA
embodied-reasoning
manipulation
open-source
diffusion
MolmoAct2: Action Reasoning Models for Real-World Deployment
Published

June 7, 2026

  • Paper Link (arXiv:2605.02881)
  • Project Page (Ai2 Blog)
  1. 🤔 MolmoAct2 is a fully open VLA (Vision-Language-Action) model for real-world robot deployment. It addresses the limitations of existing systems and provides strong action-reasoning capabilities.
  2. 🚀 Its core components are the Molmo2-ER VLM backbone specialized for spatial and embodied reasoning, the large-scale MolmoAct2-BimanualYAM, DROID, and SO100/101 datasets, the MolmoAct2-FAST Tokenizer, a new VLA architecture with per-layer KV conditioning, and MolmoAct2-Think, which adds adaptive depth reasoning.
  3. 🏆 MolmoAct2 outperforms strong baselines such as π0.5 on 7 simulation and real-world benchmarks, and Molmo2-ER surpasses GPT-5 and Gemini Robotics-ER 1.5 on 13 embodied-reasoning benchmarks. The model weights, training code, and full datasets are released.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

This paper proposes "MolmoAct2: Action Reasoning Models for Real-World Deployment", a Vision-Language-Action (VLA) model for robots. Current VLA systems face several obstacles to real-world deployment: models are closed-source, tied to expensive hardware, slowed by the high latency of reasoning-based policies, or unable to reach a dependable success rate even after fine-tuning. To address these problems, MolmoAct2 is designed to be fully open-source and advances its predecessor, MolmoAct, along five axes.

1. Key Advances

MolmoAct2 is built around five core axes:

  1. Molmo2-ER, a new VLM backbone: specialized for spatial and embodied reasoning, Molmo2-ER is trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe.
  2. Three new robot datasets: spanning low-cost to mid-cost platforms, the release includes MolmoAct2-BimanualYAM (720 hours, the largest bimanual manipulation dataset), MolmoAct2-DROID (a quality-filtered Franka DROID subset), and MolmoAct2-SO100/101 (a quality-filtered SO-100/101 subset).
  3. MolmoAct2-FAST Tokenizer: an open-weight, open-data action tokenizer trained on millions of trajectories across five robot embodiments.
  4. A new VLA architecture design: a discrete-token VLM is connected to a flow-matching continuous-action expert through per-layer key-value (KV) conditioning.
  5. MolmoAct2-Think: an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that changed between timesteps, which reduces latency while preserving geometric grounding.

2. Molmo2-ER: A Strong VLM Backbone for Embodied Reasoning

Existing VLM backbones are optimized for semantic image understanding rather than the metric, geometric, and temporally grounded reasoning that robot control requires. To close this gap, MolmoAct2 develops Molmo2-ER on top of Molmo2 (Clark et al., 2026). It is fine-tuned on specialized embodied-perception skills: scene understanding, pixel-accurate pointing, multi-image and egocentric reasoning, exocentric correspondence, and video temporal reasoning.

The training data for Molmo2-ER is a new embodied reasoning corpus of about 3.3M samples covering six complementary capability axes: single-image embodied QA, image pointing, image detection, video embodied QA, multi-image and ego–exo reasoning, and abstract embodied reasoning. Each axis consists of two or three datasets with varied supervision sources, including simulator ground truth, real scans with 3D annotations, template-generated QA, and a small amount of LLM-generated chain-of-thought.

Training follows the specialize-then-rehearse recipe:

  1. Stage 1: Embodied specialization. Starting from the Molmo2-4B mid-training checkpoint, the model is fine-tuned for 20K steps on the Molmo2-ER corpus with 8% Tulu-3 text-only data mixed in. This stage moves the model quickly onto the embodied data manifold.
  2. Stage 2: Joint refinement. The Stage 1 checkpoint is trained for a further 1.5K steps on a mixture that interleaves the Molmo2-ER corpus with Molmo2's original multimodal mid-training data (general VQA, captioning, academic benchmarks, tracking, and Molmo2 pointing).

3. Datasets: Large-Scale, High-Quality Robot Data

MolmoAct2 combines training data from three complementary sources:

  1. MolmoAct2-BimanualYAM Dataset: 34.5k robot demonstrations containing more than 720 hours of teleoperated YAM trajectories. It covers tabletop and household tasks and is the largest open-source bimanual manipulation dataset. It was collected under Cortex AI's strict protocol to ensure high data quality.
  2. MolmoAct2-SO100/101 Dataset: built by curating and filtering community data (LeRobot data) for SO-100/101, Hugging Face's low-cost robot platform. 38,059 robot demonstration episodes were extracted from 1,222 LeRobot datasets using a four-stage filtering pipeline: structural validity checks, removal of eval-style datasets, license/codebase checks, and a TOPReward quality gate (Chen et al., 2026).
  3. MolmoAct2-DROID Dataset: a quality-filtered Franka subset of DROID (Khazatsky et al., 2024), a large-scale in-the-wild robot manipulation dataset. After applying extended language annotations and an idle-frame filter, it contains 74,604 valid episodes.

For all three datasets, the language instructions were re-annotated with a VLM (Qwen3.5-27B) to improve diversity and accuracy. In addition, academic robot datasets were added to broaden embodiment coverage: a targeted subset of the Open X-Embodiment (OXE) mixture (BC-Z, BridgeData V2, RT-1) and the MolmoAct Dataset.

4. MolmoAct2 Model Architecture and Training Pipeline

MolmoAct2 follows a three-stage training pipeline.

4.1. Pre-training (MolmoAct2-Pretrain)

MolmoAct2-Pretrain turns the Molmo2-ER VLM backbone into a discrete autoregressive robot policy while keeping Molmo2's token interface. Images and video frames are encoded by a ViT and passed to the language model through a vision-language connector. Robot examples add state tokens, which describe the current robot configuration, and action tokens, which describe the next 1 second of motion.

4.1.1. MolmoAct2-FAST Tokenizer

Robot actions are continuous, embodiment-specific, and recorded at varying control rates, so they cannot be inserted directly into a language model's pre-training stream. The MolmoAct2-FAST Tokenizer is therefore an open-weight, open-data action tokenizer trained following FAST (Pertsch et al., 2025). It represents a 1-second action trajectory with a frequency-domain transform, quantizes the result, and then applies byte-pair encoding …


🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Introduction

Physical intelligence is organized around perception and action, not abstract internal computation. People think by building spatial representations, simulating actions, and interacting with the world through their bodies. Seen from this cognitive-science perspective, today's robot foundation models are incomplete: they lack structured spatial representations, heavy internal reasoning gets in the way of real-time interaction, and their closed nature makes them hard to extend to new tasks and embodiments.

The tensions the authors point to are clear.

  • Reasoning improves performance but adds latency. Grounded spatial reasoning, predicted goal images, point trajectories, and world-model rollouts improve action quality and interpretability, but current implementations must generate hundreds of tokens or entire frames before emitting a single action, which makes them too slow for closed-loop control.
  • Reasoning is only as good as the underlying foundation model. Most frontier policies are embodiment-specific and hard to adapt to new tasks and robots, and frontier VLAs keep their training data, recipes, and weights private. Even the few open-weight VLAs are tied to expensive or specialized robot platforms, which limits both who can use them and where they can be evaluated and improved.

MolmoAct2 in one line: take a strong open embodied-reasoning VLM (Molmo2-ER) as the backbone, collect high-quality open data from low-cost to mid-cost platforms, connect a discrete-token VLM to a continuous action expert through per-layer KV, add fast and interpretable reasoning through adaptive depth reasoning (MolmoAct2-Think), and release all of it (weights, code, and data) fully open to build an action reasoning model that can be deployed in the real world.

flowchart LR
    subgraph BK["1 Molmo2-ER backbone"]
        ER["VLM (3.3M embodied reasoning<br/>specialize-then-rehearse)"]
    end
    subgraph DATA["2 Open robot data"]
        D1["BimanualYAM 720h"]
        D2["DROID (filtered)"]
        D3["SO-100/101 (filtered)"]
    end
    subgraph TRAIN["3-stage training"]
        PRE["Pre-train<br/>(discrete AR + FAST tokenizer)"]
        POST["Post-train<br/>(flow-matching expert<br/>+ per-layer KV conditioning)"]
        FT["Embodiment fine-tuning"]
    end
    BK --> PRE
    DATA --> PRE
    PRE --> POST --> FT
    POST -.->|adaptive depth| THINK["MolmoAct2-Think"]
    FT --> DEP["Out-of-the-box ๋ฐฐํฌ<br/>YAM / SO-100/101 / DROID"]

Method

MolmoAct2 rests on three-stage training (pre-training → post-training → embodiment fine-tuning). Its core design philosophy is to "preserve the scaling behavior and language ability of the pre-trained VLM while producing precise continuous actions for diverse embodiments."

Molmo2-ER: Embodied Reasoning Backbone

General-purpose VLMs are optimized for semantic image understanding, so they are weak at skills robot control needs, such as distance, free space, cross-view correspondence, and scene geometry. The authors fine-tune Molmo2 on an embodied reasoning corpus of about 3.3M samples to obtain Molmo2-ER. The corpus covers six capability axes (single-image embodied QA, pointing, object detection, video embodied QA, multi-image/ego-exo, and abstract reasoning) using varied sources (simulator ground truth, real scans with 3D annotations, template QA, and a small amount of LLM chain-of-thought).

The specialize-then-rehearse recipe has two stages.

  • Stage 1 (embodied specialization): start from the Molmo2-4B mid-training checkpoint and fine-tune for 20K steps on the embodied corpus plus 8% Tulu-3 text (to preserve language ability).
  • Stage 2 (joint refinement): train for a further 1.5K steps on the embodied corpus mixed with Molmo2's original multimodal data. Text-only (NLP) data takes 8%, and the remaining 92% is split between embodied and general data (best at p=0.5).
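
The Stage 2 mixture can be illustrated with a little arithmetic. This is only a sketch: it assumes that p denotes the share of the non-text budget given to the embodied corpus (the post does not define p further), and the function name `stage2_mixture` is hypothetical.

```python
def stage2_mixture(nlp_fraction=0.08, p=0.5):
    """Split the sampling budget: text-only first, then embodied vs. general."""
    if not (0.0 <= nlp_fraction <= 1.0 and 0.0 <= p <= 1.0):
        raise ValueError("fractions must lie in [0, 1]")
    remaining = 1.0 - nlp_fraction  # 92% in the post
    return {
        "text_only": nlp_fraction,
        "embodied": remaining * p,        # assumption: p is the embodied share
        "general": remaining * (1.0 - p),
    }


weights = stage2_mixture()
print({name: round(w, 2) for name, w in weights.items()})
```

Under this reading, p=0.5 gives the embodied and general data 46% of the samples each.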

As a result, Molmo2-ER ranks first on 9 of 13 standard embodied-reasoning benchmarks. Its 63.8% average beats Gemini Robotics-ER 1.5 Thinking and GPT-5, and it improves on its Molmo2 starting point by +17 points.

MolmoAct2-FAST Tokenizer

Robot actions are continuous, differ across embodiments, and come at different control frequencies, so they cannot be fed to a language model as tokens directly. MolmoAct2-FAST is an open-weight, open-data tokenizer that follows FAST: it compresses a 1-second action trajectory into a discrete sequence over a 2048-token vocabulary via frequency-domain transform → coefficient quantization → byte-pair encoding. It is trained on 1 million action sequences from five embodiments (data from YAM, SO-100/101, DROID, BC-Z, Bridge, RT-1, among others). All actions are padded to 32 dimensions and normalized by the 1st to 99th percentile range (the gripper is handled separately), so a single tokenizer covers both joint-space and end-effector control.
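
The front half of that pipeline (normalize, pad, transform, quantize) can be sketched in plain Python. This is a minimal illustration, not the released tokenizer: it assumes a DCT as the frequency-domain transform (as in the original FAST), the quantization `scale` of 10 and all function names are hypothetical, normalization statistics are passed in rather than computed from a dataset, and the byte-pair-encoding step is left out.

```python
import math

ACTION_DIM = 32  # padded action width stated in the post


def normalize(value, p1, p99):
    """Map the [p1, p99] percentile range of one dimension to [-1, 1], clipped."""
    span = (p99 - p1) or 1.0  # guard against constant dimensions
    return max(-1.0, min(1.0, 2.0 * (value - p1) / span - 1.0))


def dct2(signal):
    """Orthonormal DCT-II of a 1-D signal (assumed transform, see lead-in)."""
    n = len(signal)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                for i, x in enumerate(signal))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out


def encode_chunk(chunk, scale=10.0):
    """chunk: T already-normalized action vectors (each <= 32 dims).

    Returns, per padded dimension, the rounded DCT coefficients over time.
    A real tokenizer would flatten these integers and compress them with BPE.
    """
    steps = len(chunk)
    padded = [list(a) + [0.0] * (ACTION_DIM - len(a)) for a in chunk]
    columns = [[padded[i][d] for i in range(steps)] for d in range(ACTION_DIM)]
    return [[round(scale * c) for c in dct2(col)] for col in columns]


# A constant 2-DoF motion over 4 steps: only the DC coefficient is non-zero.
coeffs = encode_chunk([[0.5, -0.5]] * 4)
print(coeffs[0], coeffs[1], coeffs[2])
```

Smooth trajectories concentrate their energy in a few low-frequency coefficients, so most quantized values become zero, which is what makes the subsequent byte-pair encoding effective.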

Pre-training: Discrete Autoregressive Policy

Molmo2-ER is kept as-is: images and video are encoded by a ViT (SigLIP2), pooled by the connector, and passed to the LLM together with text. Robot examples add two token streams: state tokens that carry the current robot configuration (uniformly quantized into 256 bins) and action tokens that carry the next 1 second of actions (FAST tokenizer). Text, vision-language, state, and action are thus unified under a single next-token prediction objective, which keeps large-scale pre-training simple and stable without a separate continuous head (200K steps, sequence length 4200, 64×H100, about 5,760 GPU-hours).
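
A minimal sketch of uniform 256-bin state quantization follows. The [-1, 1] value range and both function names are assumptions made for illustration; the post only states that states are uniformly quantized into 256 bins.

```python
NUM_BINS = 256  # bin count stated in the post


def state_to_token(value, low=-1.0, high=1.0):
    """Uniformly quantize a normalized state value into a bin index 0..255."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * NUM_BINS)
    return min(index, NUM_BINS - 1)  # value == high falls into the last bin


def token_to_state(index, low=-1.0, high=1.0):
    """Decode a bin index back to the centre of its bin."""
    width = (high - low) / NUM_BINS
    return low + (index + 0.5) * width


token = state_to_token(0.3)
print(token, token_to_state(token))
```

With 256 bins over a range of width 2, the round-trip error is at most half a bin width, about 0.004.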

Post-training: Flow-Matching Action Expert + Per-Layer KV Conditioning

A discrete-token VLM grounds reasoning well, but its output space is a poor fit for high-frequency continuous trajectories. The authors therefore attach a DiT-style flow-matching expert. For a normalized action chunk a, Gaussian noise \epsilon, and time t\in[0,1],

x_t = (1-t)\epsilon + ta, \qquad u^\star = a - \epsilon

the expert f_\theta predicts the target velocity field from the noised chunk, the time, and the VLM context c.

\mathcal{L}_{\text{flow}} = \mathbb{E}_{a,\epsilon,t}\Big[\big\lVert m \odot (f_\theta(x_t,t,c) - u^\star)\big\rVert_2^2\Big]

Here m masks out padded dimensions and steps. At inference, the model starts from Gaussian noise and integrates the velocity field to produce a continuous trajectory.
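
The interpolation, target velocity, masked loss, and inference-time integration above can be written out in a few lines. This is a toy sketch: the Euler integrator and its step count are assumptions (the post only says the velocity field is integrated), the function names are hypothetical, and an oracle that already knows u^\star stands in for the learned expert f_\theta.

```python
import random


def interpolate(a, eps, t):
    """x_t = (1 - t) * eps + t * a, element-wise."""
    return [(1.0 - t) * e + t * ai for ai, e in zip(a, eps)]


def target_velocity(a, eps):
    """u* = a - eps."""
    return [ai - e for ai, e in zip(a, eps)]


def masked_flow_loss(pred, target, mask):
    """Squared L2 norm of m * (pred - target); mask entries are 0 or 1."""
    return sum((m * (p - u)) ** 2 for p, u, m in zip(pred, target, mask))


def euler_sample(velocity_fn, eps, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 (action)."""
    x, dt = list(eps), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
action = [0.2, -0.4, 0.9, 0.0]  # last entry stands in for a padded dimension
mask = [1, 1, 1, 0]
noise = [rng.gauss(0.0, 1.0) for _ in action]
u_star = target_velocity(action, noise)

# Oracle velocity field: a perfect expert would output u* everywhere.
sample = euler_sample(lambda x, t: u_star, noise, steps=10)
print([round(s, 6) for s in sample])
```

Because the path from noise to action is a straight line, a perfect velocity prediction recovers the action chunk regardless of the number of Euler steps; errors in the padded dimension do not contribute to the loss.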

Key design: per-layer KV conditioning. The crux is how the expert receives the VLM context. Instead of compressing everything into a single final hidden state (hidden-state conditioning), MolmoAct2 builds the expert with the same depth as the VLM (L=36 layers) and feeds each VLM layer's keys and values directly into the expert's cross-attention.

\tilde{K}_\ell = \text{reshape}(P_K K_\ell^{\text{vlm}}), \qquad \tilde{V}_\ell = \text{reshape}(P_V V_\ell^{\text{vlm}})

\text{CA}(Q_\ell,\tilde{K}_\ell,\tilde{V}_\ell) = \text{softmax}\!\left(\frac{Q_\ell \tilde{K}_\ell^\top}{\sqrt{d_h}}\right)\tilde{V}_\ell

This gives the continuous controller access to the same attention states the VLM itself uses while keeping it modularly separate from the backbone (in the ablation, hidden-state 94.0% < per-layer KV 95.9%). There is also knowledge insulation: the expert conditions on the VLM keys/values, but these tensors are detached, so gradients of the flow loss do not backpropagate into the VLM (the VLM is updated only by the LM loss). The training objective is the sum of the two losses.

\mathcal{L}_{\text{post}} = \mathcal{L}_{\text{LM}} + \mathcal{L}_{\text{flow}}

Each chunk uses K=4 flow samples that reuse the same vision-language context, which improves efficiency (100K steps, about 2,304 GPU-hours).
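
The per-layer conditioning pattern can be sketched with a single attention head in plain Python. This is a structural illustration under simplifying assumptions, not the model's implementation: the projections P_K and P_V, the expert's self-attention and MLP blocks, multi-head reshaping, and the detach used for knowledge insulation are all omitted, and `expert_forward` is a hypothetical name.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_h)) V for one head.

    q: [Nq][d], k: [Nk][d], v: [Nk][dv] as nested lists.
    """
    d_h = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_h) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def expert_forward(action_tokens, vlm_kv_per_layer):
    """Expert layer l attends to the keys/values taken from VLM layer l."""
    h = action_tokens
    for k_l, v_l in vlm_kv_per_layer:
        attended = cross_attention(h, k_l, v_l)
        h = [[a + b for a, b in zip(hi, ai)] for hi, ai in zip(h, attended)]  # residual
    return h


# Two layers, one context token per layer: attention weight is exactly 1.
kv = [([[0.3, 0.7]], [[2.0, 4.0]]), ([[0.1, 0.2]], [[2.0, 4.0]])]
print(expert_forward([[1.0, 0.0]], kv))
```

The point of the structure is that the list of (K, V) pairs has one entry per VLM layer, in contrast to hidden-state conditioning, where every expert layer would see the same final-layer representation.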

MolmoAct2-Think: Adaptive Depth Reasoning

Robot manipulation depends on spatial information such as distance, free space, occlusion, and surface layout, but a behavior-cloning objective never asks for it explicitly. MolmoAct added depth-token prediction as an intermediate reasoning step; MolmoAct2-Think makes that step adaptive.

Each observed depth map is quantized into a 10×10 grid (100 positions), and each position takes one of 128 learned codes (VQ-VAE, based on Depth Anything V2). The key insight is that robot trajectories have high temporal redundancy: from one control step to the next, many cells of the scene depth grid stay the same. Instead of re-predicting all 100 codes at every step, the model autoregressively re-predicts only the cells whose RGB patch cosine similarity has dropped below 0.996 and reuses the cache for the rest.

m_{t,i} = \mathbf{1}\big[\cos(x_{t,i}, x_{t-1,i}) < 0.996\big], \qquad b_{t,i} = \begin{cases} d_{t,i}, & m_{t,i}=1 \\ b_{t-1,i}, & m_{t,i}=0 \end{cases}

As a result, the cost of geometric reasoning scales with the fraction of the scene that changed (it shrinks as more of the scene stays static): the model reasons over only the changed cells rather than all 100 tokens. During fine-tuning, 10% noise is injected into the depth input, and a learned per-layer gate (initial bias −4) on the depth-token KV lets each expert layer learn how much of the depth prefix to use.
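
The masking rule and cache update translate almost directly into code. In this sketch the patch features are short toy vectors, `predict_code` is a hypothetical stand-in for the model's autoregressive depth-token prediction, and the function names are illustrative.

```python
import math

THRESHOLD = 0.996  # cosine-similarity threshold quoted in the post


def cosine(u, v):
    """Cosine similarity; zero vectors are treated as dissimilar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def update_depth_buffer(patches_t, patches_prev, buffer_prev, predict_code):
    """Return (new_buffer, mask); a cell is re-predicted only if its patch changed."""
    new_buffer, mask = [], []
    for i, (x_t, x_prev) in enumerate(zip(patches_t, patches_prev)):
        changed = cosine(x_t, x_prev) < THRESHOLD
        mask.append(1 if changed else 0)
        new_buffer.append(predict_code(i) if changed else buffer_prev[i])
    return new_buffer, mask


# Three grid cells; only the last one changes between steps.
prev = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
curr = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
buffer, mask = update_depth_buffer(curr, prev, [5, 9, 17], lambda i: 42)
print(buffer, mask)
```

The number of calls to the predictor equals the number of ones in the mask, which is why a mostly static scene costs far fewer than 100 depth tokens per step.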

Deployment Optimization

During continuous-expert inference, the VLM context within a chunk does not change across flow steps. The reusable cross-attention states are therefore cached, and the fixed-shape flow loop is captured as a CUDA Graph to cut Python and kernel-launch overhead.

Experiments

An extensive empirical study across 7 benchmark environments answers three categories of questions.

Molmo2-ER (Embodied Reasoning Backbone)

On 13 VLM embodied-reasoning benchmarks (Point-Bench, RefSpatial, BLINK, CV-Bench, ERQA, EmbSpatial, SAT, VSI-Bench, and others), the open-weight Molmo2-ER ranks first on 9. Its 63.8% average is 2.5 points ahead of the runner-up, Gemini Robotics-ER 1.5 Thinking, and it also beats closed models such as GPT-5 and Gemini 2.5 Pro. Simply swapping the backbone from Molmo2 to Molmo2-ER raises LIBERO-Long discrete action prediction from 77.6% to 83.6% (+6.0), which shows that embodied specialization transfers directly to action prediction, not just to VLM benchmarks.

Out-of-the-box Deployment

The pre-trained checkpoints are deployed as-is, without fine-tuning.

Evaluation | Runner-up | MolmoAct2
MolmoSpace (4-skill average) | \pi_{0.5}-DROID, 34.5 | 37.7 (+3.2)
DROID real-world (5 tasks, OOD cameras/objects) | MolmoBot | 87.1% (+38.7 pp)
SO-100/101 real-world (5 tasks) | \pi_0-SO100/101, 45.3 | 56.7% (+11.4 pp)

It is notable that the DROID and SO-100/101 fine-tuned checkpoints are deployed on their respective embodiments without any additional training and still clearly outperform \pi_{0.5}.

Efficient Fine-tuning

This measures the ability to adapt to new tasks and embodiments from a small number of demonstrations.

Benchmark | Result
LIBERO (4-suite average) | 97.2% (MolmoAct2), 98.1% (Think); first among all baselines
RoboEval (bimanual Franka) | 44.3%, +3.8 over \pi_{0.5}
Real-world Bimanual YAM (8 tasks) | first on 7/8, 50.1% average; +15% over the runner-up, OpenVLA-OFT

MolmoAct2-Think & Robustness

  • Effect of Think: on LIBERO, 3 of 4 suites improve, with the largest gain (+2.2%) on the hardest Long suite. This suggests adaptive depth brings a real benefit rather than noise on a saturated benchmark. The average goes from 97.2% to 98.1%.
  • OOD robustness (spatial, lighting, language, and distractor shifts): MolmoAct2-Think averages 50.69%, +10.8 pp over the runner-up, OpenVLA-OFT. It ranks first in every category, although the spatial-shift score is the lowest at 26.25% and leaves room for improvement.
  • Trajectory quality: on RoboEval, completion time drops (Stack Two Blocks 5.87s → 4.70s) and joint path length is roughly halved, giving shorter, more stable, and more efficient trajectories.

Ablation & Inference Speed

  • Conditioning scheme: per-layer KV (95.9%) > per-head per-layer KV (94.8%) > hidden-state (94.0%).
  • Number of flow samples: K=8 is best with a 95.90% average (K=1 gives 94.15%).
  • Fine-tuning design: joint discrete + continuous training with full fine-tuning is best with a 97.20% average (training only the action expert drops sharply to 93.05%).
  • Inference speed (LIBERO, H100, horizon 10): with caching + CUDA Graph, MolmoAct2 runs at 55.79 Hz (2.42× the original 23.02 Hz), and Think runs at 12.71 Hz. The continuous path is faster than the discrete path (14.17 Hz), so it is adopted as the default deployment option.
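
To make the speed figures easier to compare, the rates can be converted into per-call latencies. This is plain arithmetic on the numbers quoted above; it assumes each Hz figure counts policy calls per second, which the post does not state explicitly.

```python
rates_hz = {
    "MolmoAct2 (caching + CUDA Graph)": 55.79,
    "MolmoAct2 (original)": 23.02,
    "discrete path": 14.17,
    "MolmoAct2-Think": 12.71,
}


def latency_ms(hz):
    """Latency per call implied by a rate in Hz."""
    return 1000.0 / hz


speedup = rates_hz["MolmoAct2 (caching + CUDA Graph)"] / rates_hz["MolmoAct2 (original)"]
for name, hz in rates_hz.items():
    print(f"{name}: {latency_ms(hz):.1f} ms")
print(f"speedup from caching + CUDA Graph: {speedup:.2f}x")
```

Under that assumption, the optimized continuous path spends roughly 18 ms per call, against about 43 ms before optimization and about 79 ms for the Think variant.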

Critical Discussion

Strengths

  • Genuinely full openness. Weights, training code, the full datasets (including 720 hours of bimanual data), and the tokenizer are all released, which substantially lowers the barrier to reproduction, extension, and adaptation. This is the biggest contribution that sets it apart from earlier VLAs that stopped at "open weights".
  • Tackling reasoning and speed together. With adaptive depth (exploiting temporal redundancy) and caching/CUDA Graph, it directly challenges the conventional wisdom that "adding reasoning makes the policy slow". The ablation also demonstrates the advantage of per-layer KV conditioning over hidden-state conditioning.
  • Aimed at low-cost to mid-cost hardware. With a bimanual YAM setup under $6,000 and support for the low-cost SO-100/101, it targets a range that academic and independent researchers can actually use.
  • Extensive, systematic evidence. With 7 benchmarks, 13 reasoning benchmarks, and OOD, trajectory-quality, speed, and ablation studies, the claims are backed by the broadest evaluation of any open VLA.

Weaknesses and Limitations

  • Fine-grained spatial generalization is still weak. In the OOD evaluation, the success rate under spatial variance is the lowest at 26.25%, so robustness to object placements outside the training distribution is lacking (the authors acknowledge this).
  • Limits on absolute real-world success rates. The 50.1% real-world bimanual YAM average is well ahead of the baselines, but in absolute terms many tasks fall short of what reliable deployment requires. The label "deployment-ready" is closer to a statement of relative advantage (speculation).
  • Weakness on articulated objects. MolmoAct2 trails the runner-up on the Open skill in MolmoSpace, so articulated-object interaction remains a direction for improvement.
  • Inference cost and scale. The 4B backbone plus same-depth (36-layer) expert was evaluated on H100-class hardware, and measurements under on-robot real-time constraints (low-power embedded devices) are limited.
  • Think depends on its assumption. The gains from adaptive depth rely on the temporal-redundancy assumption that "the scene is mostly static". In fast-changing dynamic scenes, the re-prediction ratio rises and the speed gains may shrink (speculation).

Summary and Conclusion

MolmoAct2 is a fully open action reasoning model that takes on, all at once, the obstacles that have blocked real-world VLA deployment: closedness, hardware lock-in, reasoning latency, and low success rates. Its core is (1) the Molmo2-ER backbone specialized for embodied reasoning, (2) open datasets, including 720 hours of bimanual data, and an open tokenizer, (3) per-layer KV conditioning that links a discrete-token VLM to a continuous flow-matching expert, and (4) adaptive depth (MolmoAct2-Think), which re-reasons only over changed regions.

In numbers, Molmo2-ER averages 63.8% on 13 reasoning benchmarks, ahead of GPT-5 and Gemini Robotics-ER, while MolmoAct2 reaches 97.2% on LIBERO (98.1% with Think), 44.3% on RoboEval, +15% over the runner-up on 8 real-world bimanual tasks, and +10.8 pp in OOD robustness. Caching plus CUDA Graph also secures a control rate of 55.79 Hz.

From a practical standpoint, the value of this work is that it "opens up a frontier-level VLA, including data, code, and weights, so that anyone can reproduce, extend, and deploy it on low-cost hardware." The limitations in fine-grained spatial generalization and absolute real-world success rates are clear, but the combination of an embodied-reasoning backbone, a hybrid discrete/continuous design, and adaptive reasoning is likely to become a strong reference point for future open VLA research.

Copyright 2026, JungYeon Lee