
📃 StressDream Review

world-model
diffusion
vla
vlm
autonomous-driving
manipulation
safety
NVIDIA
StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement
Published

June 20, 2026

  • Paper Link

  • Code Link

  • Project

  • Junwon Seo, Sushant Veer, Ran Tian, Wenhao Ding, Apoorva Sharma, Karen Leung, Edward Schmerling, Marco Pavone, Andrea Bajcsy (CMU IntentLab, NVIDIA Research, University of Washington, Stanford University)

  • Preprint (arXiv:2606.00267v1), 2026

  1. 💡 Diffusion-based video world models can imagine many plausible futures, yet standard (nominal) sampling misses rare but critical high-impact outcomes such as collisions and spills. StressDream tackles this by optimizing the world model's initial noise at inference time, steering the imagination toward outcomes that are "high-impact yet still plausible."
  2. ⚙️ The optimization criterion has two terms. A semantic objective has a VLM (Qwen-VL) score whether the target event occurred in the generated video, using the difference of yes/no log-probabilities. A plausibility objective (norm concentration, isotropy, spectral whiteness) keeps the optimized noise from drifting out of the typical set of the high-dimensional Gaussian into OOD territory. Score distillation approximates the gradient and avoids backpropagating through the full denoising process.
  3. 🎯 In a controlled experiment with known dynamics (Naughty Dubins Car), the method detects failures only when they are actually possible. On state-of-the-art driving (Vista) and manipulation (Ctrl-World) WMs it steers toward failures that nominal sampling misses, raising failure-detection recall on manipulation from 54% to 94%. Fine-tuning a VLA policy (π0.5) with this robust evaluation improves the success rate from 39% to 71%.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

Video world models (WMs) are attracting attention in autonomous driving and manipulation as a way to evaluate and improve policies "without expensive real-world interaction." The key point is that they are generative models (diffusion, flow matching) that learn a distribution over future observations conditioned on the ego-action. The problem is that policy evaluation and improvement usually rely on nominal imagination, meaning one or two typical samples drawn from that distribution. For example, if a manipulator drops an open bag from high above a table, the contents may or may not spill, but nominal samples show only the common "no spill" outcome and miss the rare but critical failure. Catching it by sampling alone would take an impractically large number of samples. StressDream's idea is: "then let us target that rare-but-plausible failure directly and make the model imagine it."


Overview (Fig. 1). (Top) The initial noise ε of the diffusion WM is optimized to steer the imagination toward the target event specified by an inference-time prompt. Unconstrained optimization leaves the typical set and produces implausible videos, whereas StressDream steers with VLM gradients while a plausibility term holds the noise in the high-probability region. (Bottom) The result is an imagined "worst plausible" outcome of the same action, which is then used for robust policy evaluation and improvement.

Core method:

The crux is the observation that in a diffusion WM the initial noise is itself the control variable. Once the conditioning (observation history \mathbf{o}^{\text{hist}}, action \mathbf{a}) is fixed, generation along the probability-flow ODE becomes a deterministic function of the initial noise \boldsymbol{\epsilon}: \mathbf{o} = f_\theta(\boldsymbol{\epsilon}, \mathbf{o}^{\text{hist}}, \mathbf{a}). Which future gets generated is therefore determined entirely by \boldsymbol{\epsilon}. StressDream pushes this noise by gradient ascent so as to maximize a test-time criterion \mathcal{C}^{\text{test}}:

\boldsymbol{\epsilon}_{i+1} = \boldsymbol{\epsilon}_i + \eta\,\nabla_{\boldsymbol{\epsilon}_i}\!\left[\mathcal{C}^{\text{test}}(\mathbf{o}_i)\right],\qquad \mathbf{o}_i = f_\theta(\boldsymbol{\epsilon}_i, \mathbf{o}^{\text{hist}}, \mathbf{a}).

The criterion is the sum of two terms, \mathcal{C}^{\text{test}} = \mathcal{C}^{\text{sem}} + \mathcal{C}^{\text{pla}}. The semantic term asks a VLM (Qwen-VL) "did the target event occur?" and builds a differentiable score from the log-probability difference of the single tokens yes and no:

\mathcal{C}^{\text{sem}}(\mathbf{o};\,l) = \log p^{\text{VLM}}(\texttt{yes}\mid \mathbf{o}, l) - \log p^{\text{VLM}}(\texttt{no}\mid \mathbf{o}, l).

The plausibility term \mathcal{C}^{\text{pla}} = \lambda_1\mathcal{C}^{\text{norm}} + \lambda_2\mathcal{C}^{\text{iso}} + \lambda_3\mathcal{C}^{\text{spec}} forces the optimized noise to stay inside the typical set of the Gaussian prior (norm concentration, block isotropy, spectral whiteness). On top of this, instead of backpropagating through the entire denoising process, the method uses the score-distillation approximation \nabla_{\boldsymbol{\epsilon}}\mathcal{C}^{\text{test}}(\mathbf{o}) \approx \beta\,\nabla_{\mathbf{o}}\mathcal{C}^{\text{test}}(\mathbf{o}) to keep the computation tractable.

Key results:

  • Controlled experiment (Naughty Dubins Car): in a setting where the true dynamics are known, StressDream imagines a failure only when one is actually possible, achieving high TPR and TNR at the same time. Removing the plausibility term makes TNR collapse (the method invents implausible failures), and classifier guidance produces many false positives (Fig. 2).
  • Driving (Vista) / manipulation (Ctrl-World): on manipulation, task-failure detection recall goes from Nominal 54% to Best-of-N 71% to StressDream 94% (Fig. 5). On driving it also steers toward safety-critical events that nominal sampling misses while keeping target alignment high.
  • Policy improvement: fine-tuning the VLA policy π0.5 with steered imagination (down-weighting risky actions) raises the average success rate over 6 manipulation tasks from 39% to 71% (Fig. 8, 20 rollouts per task).

Conclusion:

StressDream uncovers rare-but-plausible failures efficiently, not by "sampling a lot" but by "differentiably optimizing the noise space toward the target." The core design is a division of labor: the VLM decides what to look for (semantic), and the typical-set constraint sets the boundary of realism (plausibility). The fundamental constraints are that the failure definition depends on text and that the method can only imagine outcomes the base WM supports. A failure the WM never saw in its training distribution will not appear even under steering, and that is exactly what "plausibility" means here.


🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

In one line

"Which future gets generated is decided by the diffusion model's initial noise, so instead of blindly drawing many samples, optimize that noise directly toward an outcome that is 'high-impact yet still plausible.'" StressDream is a method that gradient-optimizes the initial Gaussian noise of a video world model at inference time, efficiently imagining the worst-plausible futures needed for policy evaluation and improvement.

Background: why nominal imagination is not enough

A video WM is a learned simulator of the physical environment. Building on large generative models such as Cosmos and Wan, robotics WMs use diffusion or flow matching to learn a distribution over future observations conditioned on the ego-action. Learning a distribution means the model can capture the uncertainty of physical interaction and the diversity of surrounding agents' behavior.

In practice, however, policy evaluation and improvement mostly rely on nominal imagination drawn from this distribution, which under-explores the range of outcomes the WM represents. What policy evaluation really needs are outcomes of an action that are both plausible and high-impact (plausibility here means outcomes "supported by the learned WM distribution"), and naive sampling easily misses them without an enormous sample budget. The authors' example: if a manipulator drops an open bag from high above the table, the WM distribution contains both spill and no-spill outcomes, but if it sets the bag down low, spills are rare or absent. Only with this "ability to imagine plausible failures" can risky actions be filtered out (evaluation) and suppressed (improvement).

The key technical observation is that the initial noise is the control variable. A diffusion WM learns a transformation between the data distribution \mathbf{o}\sim p^{\text{data}} and a standard Gaussian \mathbf{x}^T = \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_D), and obtains \mathbf{x}^0 = \mathbf{o} by iterating reverse denoising. Under the deterministic sampling that corresponds to the probability-flow ODE, once the conditioning is fixed the generation is a function of the initial noise alone, \mathbf{o} = f_\theta(\boldsymbol{\epsilon}, \mathbf{o}^{\text{hist}}, \mathbf{a}). Choosing the noise is choosing which video comes out.

Method: formulate the goal as min–max, then optimize the noise

Formalizing policy evaluation and improvement (Eq. 4)

Given an action-conditioned WM f_\theta, the goal is to evaluate candidate action sequences by their future outcomes. A future \mathbf{o} is scored by a test-time criterion \mathcal{C}^{\text{test}}(\mathbf{o})\in\mathbb{R} (did a high-impact event such as a failure or collision occur?). Since a single action can have several plausible futures, the authors define the robust policy through the following min–max:

\mathbf{a}^\ast = \arg\min_{\mathbf{a}\in\mathcal{A}}\ \max_{\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_D)}\ \mathcal{C}^{\text{test}}\!\left(f_\theta(\boldsymbol{\epsilon}, \mathbf{o}^{\text{hist}}, \mathbf{a})\right).

  • Inner max (choice of high-dimensional Gaussian noise): finds the worst plausible future of this action. This is the part StressDream handles.
  • Outer min (choice of action): picks a robust action that keeps the criterion low across plausible futures, the worst one included. This part is handled by a sampling-based solver or by policy optimization.
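To make the two levels concrete, here is a minimal sketch of the min–max structure on a one-dimensional toy problem. The criterion `toy_criterion` and the candidate actions are hypothetical stand-ins, and the inner max is done by plain enumeration over sampled noises (that is, Best-of-N), not by StressDream's gradient steering.

```python
import random

def worst_case(action, noises, criterion):
    """Inner max: highest criterion value this action reaches over the noises."""
    return max(criterion(action, eps) for eps in noises)

def robust_action(actions, noises, criterion):
    """Outer min: the action with the smallest worst-case criterion."""
    return min(actions, key=lambda a: worst_case(a, noises, criterion))

def toy_criterion(action, eps):
    # Hypothetical stand-in for C_test(f_theta(eps, o_hist, a)):
    # a larger |action| amplifies whatever the noise does.
    return abs(action) * eps

rng = random.Random(0)
noises = [rng.gauss(0.0, 1.0) for _ in range(64)]
actions = [-1.0, -0.2, 0.5]

best = robust_action(actions, noises, toy_criterion)
print("robust action:", best)
print("its worst case:", worst_case(best, noises, toy_criterion))
```

In this toy the gentlest action wins because its worst case is the mildest; the review's point is that in a real WM the inner max cannot be enumerated this cheaply.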

The inner problem is hard because the noise space is extremely high-dimensional (D\approx921{,}600 for Vista driving, D=57{,}600 for Ctrl-World manipulation) and every noise evaluation requires an expensive denoising pass, so repeated random sampling misses rare events. Instead of sampling at random, the method therefore ascends the noise directly along the gradient of a differentiable criterion (Eq. 5). Two problems arise: ① naively optimizing high-dimensional noise pushes it OOD and yields implausible videos, and ② a differentiable criterion is needed that can score subtle target events that change from scene to scene. StressDream's two objectives address these, respectively.
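A quick back-of-the-envelope calculation shows why random sampling is a poor fit for rare events. The 1% event probability below is a hypothetical number chosen for illustration, not a figure from the paper; each sample would cost one full denoising pass.

```python
import math

def hit_probability(p_event, n_samples):
    """Chance that at least one of n independent samples shows the event."""
    return 1.0 - (1.0 - p_event) ** n_samples

def samples_needed(p_event, confidence):
    """Smallest n such that hit_probability(p_event, n) >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_event))

p = 0.01  # hypothetical probability of the rare failure under nominal sampling
print(f"Best-of-10 catches it with probability {hit_probability(p, 10):.3f}")
print(f"samples for 95% chance of catching it: {samples_needed(p, 0.95)}")
```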

Semantic objective: scoring the target event with a VLM (Eq. 6)

Because a WM operates across diverse scenes and tasks, the high-impact target event to steer toward changes with the policy context every time. A semantic term is therefore needed that scores, differentiably, "did the scene-dependent target event occur in the generated video?" The authors exploit the general video understanding of a VLM (Qwen-VL). They supply an inference-time text prompt l (for example "the coffee beans spill" for manipulation, "a collision occurs" for driving), have the VLM output a single yes/no token, and define the score as the log-probability difference:

\mathcal{C}^{\text{sem}}(\mathbf{o};\,l) = \log p^{\text{VLM}}(\texttt{yes}\mid \mathbf{o}, l) - \log p^{\text{VLM}}(\texttt{no}\mid \mathbf{o}, l).

Because it uses single-token probabilities, the score is differentiable and provides a rich gradient signal for high-dimensional noise optimization. The practical strength is that different failure modes can be specified at inference time just by changing the text.
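As a minimal sketch of Eq. 6, the score can be computed from the next-token logits of any model. The logits and the token ids `yes_id` and `no_id` below are hypothetical; a real VLM would supply them from its tokenizer and forward pass.

```python
import numpy as np

def semantic_score(logits, yes_id, no_id):
    """log p(yes) - log p(no) computed from next-token logits."""
    logits = np.asarray(logits, dtype=np.float64)
    log_z = np.logaddexp.reduce(logits)   # log of the softmax normalizer
    log_p = logits - log_z                # log-softmax over the vocabulary
    return float(log_p[yes_id] - log_p[no_id])

# Hypothetical 3-token vocabulary: index 0 = "yes", index 1 = "no".
toy_logits = [2.0, 0.5, -1.0]
score = semantic_score(toy_logits, yes_id=0, no_id=1)
print(score)
```

The normalizer cancels in the difference, so the score reduces to the gap between the two logits; a positive value means the model leans toward "yes."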

Plausibility objective: holding the noise inside the typical set

A diffusion model is trained on noise drawn from a Gaussian prior, so when the noise leaves the typical set (the region where most training noise lies) the resulting video falls outside the WM distribution (implausible) or degrades in quality. An important subtlety: in high dimensions the typical set is not the highest-density region. The zero vector has high density, yet the chance of sampling anything like it from the Gaussian is vanishingly small. Since gradient optimization can push the noise out of this typical set, the authors regularize it with three statistics:

\mathcal{C}^{\text{pla}}(\boldsymbol{\epsilon}) = \lambda_1\mathcal{C}^{\text{norm}}(\boldsymbol{\epsilon}) + \lambda_2\mathcal{C}^{\text{iso}}(\boldsymbol{\epsilon}) + \lambda_3\mathcal{C}^{\text{spec}}(\boldsymbol{\epsilon}).

  • Norm concentration. The squared norm of Gaussian noise follows \lVert\boldsymbol{\epsilon}\rVert_2^2 \sim \chi_D^2 and concentrates on a thin shell near radius \sqrt{D}. Deviation from this typical radius is therefore penalized: \mathcal{C}^{\text{norm}}(\boldsymbol{\epsilon}) = -\big(\lVert\boldsymbol{\epsilon}\rVert_2 - \sqrt{D}\big)^2.
  • Isotropy. Even when the global norm is right, local correlations or structure unlike i.i.d. Gaussian noise can remain. The noise is randomly permuted and partitioned into sub-vectors \{\boldsymbol{\epsilon}_i\}_{i=1}^m (\boldsymbol{\epsilon}_i\in\mathbb{R}^k, D=mk), and the deviation of the empirical second moment \widehat{\boldsymbol{\Sigma}} = \frac{1}{m}\sum_i \boldsymbol{\epsilon}_i\boldsymbol{\epsilon}_i^\top from \mathbf{I}_k is penalized: \mathcal{C}^{\text{iso}}(\boldsymbol{\epsilon}) = -\frac{1}{k}\lVert\widehat{\boldsymbol{\Sigma}} - \mathbf{I}_k\rVert_F^2 (averaged over several random permutations).
  • Spectral whiteness. Noise that looks typical in coordinate space can still carry frequency-domain artifacts. Gaussian noise has a flat expected power spectrum, so the 2D DFT power \mathbf{P} = \lvert\mathcal{F}(\boldsymbol{\epsilon})\rvert^2 is pooled into B spatial-frequency bins and the variance of the per-bin mean power \{\hat p_b\} is minimized: \mathcal{C}^{\text{spec}}(\boldsymbol{\epsilon}) = -\frac{1}{B}\sum_b (\hat p_b - \bar p)^2.
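The three statistics above can be sketched in NumPy as follows. This is an illustrative reading of the formulas, not the authors' implementation: the block size `k`, the number of permutations, the radial binning, and the division of the power by the number of pixels are all assumptions made here.

```python
import numpy as np

def c_norm(eps):
    """-(||eps|| - sqrt(D))^2: zero on the typical shell, negative off it."""
    return float(-(np.linalg.norm(eps.ravel()) - np.sqrt(eps.size)) ** 2)

def c_iso(eps, k=16, n_perm=4, seed=0):
    """Block second-moment penalty, averaged over random permutations."""
    rng = np.random.default_rng(seed)
    flat = eps.ravel()
    m = flat.size // k
    vals = []
    for _ in range(n_perm):
        blocks = rng.permutation(flat)[: m * k].reshape(m, k)
        sigma = blocks.T @ blocks / m              # empirical k x k second moment
        vals.append(-np.sum((sigma - np.eye(k)) ** 2) / k)
    return float(np.mean(vals))

def c_spec(eps, n_bins=8):
    """Negative variance of mean power across radial frequency bins (2-D eps)."""
    power = np.abs(np.fft.fft2(eps)) ** 2 / eps.size   # about 1 for white noise
    fy = np.fft.fftfreq(eps.shape[0])[:, None]
    fx = np.fft.fftfreq(eps.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    edges = np.linspace(0.0, radius.max() + 1e-12, n_bins + 1)
    idx = np.digitize(radius, edges) - 1               # bin index 0..n_bins-1
    p_hat = np.array([power[idx == b].mean() for b in range(n_bins)])
    return float(-np.mean((p_hat - p_hat.mean()) ** 2))

rng = np.random.default_rng(1)
white = rng.standard_normal((32, 32))
x = np.arange(32)
# Unit-variance vertical stripes: right overall scale, badly non-white spectrum.
stripes = np.sqrt(2) * np.sin(2 * np.pi * 2 * x / 32)[None, :] * np.ones((32, 1))

print("norm :", c_norm(white), "vs scaled", c_norm(2 * white))
print("iso  :", c_iso(white), "vs scaled", c_iso(2 * white))
print("spec :", c_spec(white), "vs stripes", c_spec(stripes))
```

Each term is close to zero for a genuine Gaussian draw and strongly negative for the atypical variants, which is the behavior the plausibility objective relies on.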

Gradient approximation: avoiding backpropagation through the whole denoising chain (Eq. 7–8)

Computing the noise gradient \nabla_{\boldsymbol{\epsilon}}\mathcal{C}^{\text{test}}(f_\theta(\cdots)) exactly would require backpropagating through the entire iterative denoising process (for example 50 steps), which runs into serious memory and vanishing-gradient problems. The authors adopt score distillation and approximate the gradient with respect to the initial noise by the gradient at the generated sample:

\nabla_{\boldsymbol{\epsilon}}\mathcal{C}^{\text{test}}(\mathbf{o}) \approx \beta\,\nabla_{\mathbf{o}}\mathcal{C}^{\text{test}}(\mathbf{o}),\qquad \mathbf{o} = f_\theta(\boldsymbol{\epsilon}, \mathbf{o}^{\text{hist}}, \mathbf{a}).

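With this, only the differentiable criterion has to be backpropagated and the denoising chain is skipped. The approximation is needed only for the semantic term, because the plausibility term depends on \boldsymbol{\epsilon} directly. The gradients of the two objectives are then combined to update the noise: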

\nabla_{\boldsymbol{\epsilon}}\mathcal{C}^{\text{test}}(\mathbf{o}) = \beta\,\nabla_{\mathbf{o}}\mathcal{C}^{\text{sem}}(\mathbf{o};\,l) + \nabla_{\boldsymbol{\epsilon}}\mathcal{C}^{\text{pla}}(\boldsymbol{\epsilon}),

where the coefficients \beta, \lambda_1, \lambda_2, \lambda_3 are tuned to the WM, the noise dimension, and the VLM.
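The update can be sketched end to end on a toy problem. Everything here is a stand-in chosen so the example runs without a world model: `f_theta` is an elementwise `tanh` rather than a denoising chain, `c_sem` is a linear score rather than a VLM, and only the norm term of the plausibility objective is included. The point is the shape of the update, in which the sample-space gradient is used in noise space without computing the generator's Jacobian.

```python
import numpy as np

D = 256
target = np.zeros(D)
target[: D // 4] = 1.0  # hypothetical mask of coordinates that signal the "event"

def f_theta(eps):
    """Toy deterministic generator standing in for the denoising chain."""
    return np.tanh(eps)

def c_sem(o):
    """Toy differentiable criterion evaluated on the generated sample."""
    return float(target @ o)

def grad_o_c_sem(o):
    """Gradient of c_sem with respect to the sample o (constant for this toy)."""
    return target

def grad_eps_c_norm(eps):
    """Exact gradient of -(||eps|| - sqrt(D))^2 with respect to eps."""
    r = np.linalg.norm(eps)
    return -2.0 * (r - np.sqrt(eps.size)) * eps / r

def steer(eps, steps=10, eta=0.5, beta=1.0, lam1=1.0):
    for _ in range(steps):
        o = f_theta(eps)
        # Surrogate step: beta * grad_o is applied directly in noise space.
        grad = beta * grad_o_c_sem(o) + lam1 * grad_eps_c_norm(eps)
        eps = eps + eta * grad
    return eps

rng = np.random.default_rng(0)
eps0 = rng.standard_normal(D)
eps_steered = steer(eps0)            # semantic + norm term
eps_free = steer(eps0, lam1=0.0)     # semantic term only

for name, e in [("initial", eps0), ("steered", eps_steered), ("no C_pla", eps_free)]:
    print(f"{name:8s} score={c_sem(f_theta(e)):6.2f}  norm={np.linalg.norm(e):6.2f}")
print("typical radius:", np.sqrt(D))
```

With the norm term the noise ends up near the typical radius while the score rises; without it the score also rises but the norm drifts far off the shell.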

Intuition: "walking in the desired direction on the probability shell"

It helps to picture a high-dimensional Gaussian as a single thin spherical shell. A nominal sample is one random point on that shell, and Best-of-N picks the best-scoring point among several random ones, but rare events occupy a very narrow region of the shell and are seldom hit at random. StressDream stays on the shell (plausibility: norm, isotropy, and spectrum act as shell and whiteness constraints) while walking in the direction where the score increases (semantic: VLM gradient). Without the plausibility term the noise leaves the shell (toward the zero vector or along structured directions) and slides into implausible "invented failures," which is exactly what Fig. 2 shows.
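The shell picture is easy to check numerically. The sketch below draws Gaussian vectors at the Ctrl-World noise dimension quoted in this review and compares their norms with \sqrt{D}; the sample count of 50 is an arbitrary choice to keep memory modest.

```python
import numpy as np

D = 57_600  # Ctrl-World noise dimension quoted in the review
rng = np.random.default_rng(0)
samples = rng.standard_normal((50, D))
norms = np.linalg.norm(samples, axis=1)

shell_radius = np.sqrt(D)
# For a standard Gaussian, log p(0) - log p(eps) = ||eps||^2 / 2.
log_density_gap = float(np.mean(norms ** 2) / 2)

print(f"sqrt(D) = {shell_radius:.1f}")
print(f"sample norms: mean {norms.mean():.2f}, std {norms.std():.2f}")
print(f"zero vector is denser by about {log_density_gap:.0f} nats, yet its norm is 0")
```

Every draw lands within a few units of the shell radius, while the zero vector, despite its far higher density, sits hundreds of standard deviations away in norm.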

Experiments: controlled experiment → state-of-the-art WMs → policy improvement

Controlled experiment: Naughty 3D Dubins Car (a setting with known dynamics)

To verify that steering finds failures only when they are actually possible, the authors build an image-based 3D Dubins car with known dynamics. The state is s = [p_x, p_y, \theta], the continuous angular-velocity action is a_t\in[-1.25, 1.25] rad/s, the speed is fixed at v = 1 m/s, and \Delta t = 0.05 s. "Naughty" means that with probability p = 0.2 the sign of the control input is flipped, which injects uncertainty. The safety score \mathcal{C}(s) = p_x^2 + p_y^2 - 0.25^2 defines a circular failure set of radius 0.25 m centered at the origin. The WM is a one-step (H=1) diffusion model (noise dimension 1,024) trained on 4,000 random observation-action trajectories, and here the safety score takes the place of the VLM as \mathcal{C}^{\text{sem}}; steering toward failure then means pushing this score below zero.
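For readers who want to reproduce the setup, here is a minimal sketch of the ground-truth dynamics. It assumes the standard Dubins kinematics with a forward-Euler step, which the review does not spell out; the constants are the ones listed above, and the start state in the usage example is hypothetical.

```python
import math
import random

V, DT, A_MAX, P_FLIP, R_FAIL = 1.0, 0.05, 1.25, 0.2, 0.25

def step(state, action, rng):
    """One Euler step; the control sign is flipped with probability P_FLIP."""
    px, py, theta = state
    a = max(-A_MAX, min(A_MAX, action))
    if rng.random() < P_FLIP:
        a = -a  # the "naughty" disturbance
    return (px + V * math.cos(theta) * DT,
            py + V * math.sin(theta) * DT,
            theta + a * DT)

def safety(state):
    """Negative inside the failure disc of radius R_FAIL around the origin."""
    px, py, _ = state
    return px * px + py * py - R_FAIL ** 2

def min_safety(state, actions, rng):
    """Lowest safety score reached along one stochastic rollout."""
    best = safety(state)
    for a in actions:
        state = step(state, a, rng)
        best = min(best, safety(state))
    return best

rng = random.Random(0)
start = (-1.0, 0.0, 0.0)                 # 1 m left of the origin, heading toward it
straight = min_safety(start, [0.0] * 20, rng)
print("driving straight, min safety:", straight)
```

Driving straight with zero turn rate is unaffected by the sign flip, so the car reaches the origin after 20 steps and the minimum safety score is about -0.0625, a failure.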


Controlled experiment (Fig. 2). (a) TPR–TNR of failure detection. StressDream (orange) keeps both TPR and TNR high; the variant without the plausibility term (teal) sees TNR collapse (it invents implausible failures), and classifier guidance (red) is low on both TPR and TNR. (b) Nominal (black) versus steered (orange) imagined trajectories: StressDream steers to find the plausible failures (entering the gray failure set) that nominal sampling misses.

For 5,000 initial-state and action sequences, a case is labeled positive (failure possible) if the ground-truth minimum safety score achievable under the stochastic dynamics is below 0, and it is classified using the predicted minimum safety score of the WM rollout. Optimization runs for 10 steps. The baselines are Nominal (N=1), Best-of-N (N=10), classifier guidance (CG, gradients applied during denoising), and a variant with \mathcal{C}^{\text{pla}} removed. Result: StressDream reliably detects failures only when they are plausible (high TPR and TNR). Without \mathcal{C}^{\text{pla}}, TNR drops (safe trajectories are misclassified as failures) because implausible failures are invented, and CG, which perturbs the denoising trajectory directly, also yields many false positives. Random sampling (Nominal, Best-of-N) stays plausible but often misses rare failures.
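The labeling and scoring protocol described above can be sketched as a small helper. The four score pairs are made-up numbers used only to exercise the function.

```python
def tpr_tnr(true_min_scores, pred_min_scores, threshold=0.0):
    """A case is positive (failure possible) when its minimum safety score < threshold."""
    tp = fn = tn = fp = 0
    for true_score, pred_score in zip(true_min_scores, pred_min_scores):
        actual = true_score < threshold       # ground truth: failure is achievable
        predicted = pred_score < threshold    # WM rollout predicts a failure
        if actual and predicted:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    tnr = tn / (tn + fp) if tn + fp else float("nan")
    return tpr, tnr

# Hypothetical scores: ground truth vs. what the imagined rollouts predict.
truth = [-0.05, -0.01, 0.20, 0.30]
preds = [-0.02, 0.10, 0.15, -0.03]
tpr, tnr = tpr_tnr(truth, preds)
print(f"TPR={tpr:.2f}  TNR={tnr:.2f}")
```

A missed rare failure lowers TPR (the nominal and Best-of-N failure mode), while an invented failure lowers TNR (the failure mode of steering without the plausibility term).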

Qualitative results: steering toward failures that nominal sampling misses, only when they are plausible


Qualitative comparison (Fig. 3). The text at the top is the inference-time target prompt. StressDream steers the imagination toward high-impact outcomes that nominal sampling misses, such as pedestrian near-misses, collisions, red-light violations, and spills. Crucially, when the target is not supported by the WM distribution (the two rightmost columns: a closed bag and sticky candy), it does not force the event into the imagination ("no spill").

State-of-the-art WMs: driving (Vista) and manipulation (Ctrl-World)

  • Driving: Vista is used (predicts 25 frames of a 576\times1024 front camera, conditioned on waypoints as the action, D\approx921{,}600). It is fine-tuned on PhysicalAI-Autonomous-Vehicles (PAI-AV) and Nexar Collision Prediction data. Optimization runs for 20 steps with Qwen2.5-VL-7B-Instruct fine-tuned with Wolf and X-CLIP. Evaluation uses 100 image–action–text pairs curated from 8 PAI-AV safety-risk categories plus 200 imminent-collision examples. The metrics, measured by the held-out evaluator WorldModelBench, are target alignment (does the video match the target text?) and video quality (a proxy for plausibility).
  • Manipulation: Ctrl-World is used (DROID setting, predicts 5 frames of 3 camera views at 192\times320, joint-position actions, D = 57{,}600). It is fine-tuned on 6 contact-rich tasks with about 150 teleoperation trajectories per task (successes and failures included). Optimization runs for 10 steps with Qwen3-VL-4B-Instruct.
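The two noise dimensions can be reproduced with simple arithmetic if one assumes a Stable-Video-Diffusion-style latent with 8× spatial downsampling and 4 channels. That latent layout is an assumption made here, not something stated in the review; the resolutions, frame counts, and view count are the ones listed above.

```python
def latent_dim(height, width, frames, views=1, downsample=8, channels=4):
    """Number of noise entries for a video latent (assumed SVD-style layout)."""
    return (height // downsample) * (width // downsample) * channels * frames * views

vista_dim = latent_dim(576, 1024, frames=25)           # driving, one front camera
ctrl_world_dim = latent_dim(192, 320, frames=5, views=3)  # manipulation, three views

print("Vista      D =", vista_dim)
print("Ctrl-World D =", ctrl_world_dim)
```

Under that assumption the products come out to 921,600 and 57,600, matching the values quoted for the two WMs.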

Manipulation failure-detection recall (Fig. 5): task-failure detection in Ctrl-World imagination. Nominal (N=1) 54% → Best-of-N (N=10) 71% → StressDream 94%. Random generation is overly optimistic and often misses plausible failures.

In driving, StressDream steers the imagination toward safety-risk and failure events that nominal sampling misses (the rise in target alignment in Fig. 4) while preserving video quality thanks to \mathcal{C}^{\text{pla}}; removing the plausibility term lowers both target alignment and video quality. A further experiment verifies that the steering is grounded in the WM distribution (Fig. 6): on a Vista fine-tuned with collisions, steering induces collisions, but on the base Vista that never learned collisions, steering fails to imagine one (low target alignment). In other words, StressDream steers only toward events the WM distribution supports and does not synthesize implausible ones.

Policy improvement: fine-tuning to prefer robust actions


Policy improvement (Fig. 7). π0.5 fine-tuned with steered WM imagination prefers robust actions that succeed even under worst-case plausible outcomes (for example placing an object in the center rather than at the edge, or pouring slowly). Nominal fine-tuning, by contrast, keeps proposing risky actions for which failure is plausible.

The target is the behavior-cloning policy π0.5 (a VLA). π0.5-DROID is fine-tuned by weighted regression on 40 successful demonstrations per task, and two settings are compared: Nominal \pi^{\text{FT}} (uniform weight 1.0 on all trajectories) versus Robust \pi^{\text{FT}} (weight 1.0 on trajectories that still succeed under steered imagination, 0.1 on trajectories that fail under steered imagination). This promotes robust actions whose plausible outcome distribution contains no failure and suppresses risky ones. Result (Fig. 8, averaged over 20 rollouts per task): Nominal \pi^{\text{FT}} 39% → Robust \pi^{\text{FT}} 71%. The gain comes from filtering out risky actions that happened to succeed during data collection but whose plausible outcome distribution includes failure.
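The weighting scheme is simple enough to sketch directly. The weights 1.0 and 0.1 come from the description above; the normalized weighted-average loss and the per-trajectory error values are assumptions made for illustration, since the review does not give the exact regression loss.

```python
def trajectory_weights(fails_under_steering, robust=True, w_safe=1.0, w_risky=0.1):
    """Per-demonstration weights: down-weight demos that fail in steered imagination."""
    if not robust:
        return [w_safe] * len(fails_under_steering)
    return [w_risky if failed else w_safe for failed in fails_under_steering]

def weighted_bc_loss(per_trajectory_errors, weights):
    """Weighted average of per-trajectory imitation errors (assumed loss form)."""
    total = sum(w * e for w, e in zip(weights, per_trajectory_errors))
    return total / sum(weights)

# Hypothetical batch of three successful demos; the second one fails when steered.
fails = [False, True, False]
errors = [0.2, 1.0, 0.4]

nominal_loss = weighted_bc_loss(errors, trajectory_weights(fails, robust=False))
robust_loss = weighted_bc_loss(errors, trajectory_weights(fails, robust=True))
print(f"nominal loss {nominal_loss:.4f}, robust loss {robust_loss:.4f}")
```

With the robust weights the risky demonstration contributes ten times less to the loss, so the policy is pulled less strongly toward imitating it.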

A critical look

Strengths

  • The problem is reframed cleanly. The conventional wisdom that "catching rare failures takes many samples" is turned into "generation is a deterministic function of the initial noise, so optimize the noise toward the target." The min–max formulation (Eq. 4), which puts evaluation (inner) and improvement (outer) in one frame, is also clear.
  • The operational definition of plausibility is honest. Rather than leaving "realistic" vague, the paper pins it down as "supported by the WM distribution" and makes it concrete through statistics of the high-dimensional Gaussian's typical set (norm shell, isotropy, whiteness). Fig. 6 (the base Vista that never learned collisions cannot be made to invent one) tests this claim in a falsifiable way, which goes beyond promotion by stating what the method cannot do.
  • There is a controlled experiment. On a Dubins car with known ground-truth dynamics, TPR and TNR are reported together, and the \mathcal{C}^{\text{pla}} ablation and classifier-guidance comparison isolate the role of each component. Using the safety score as the criterion instead of a VLM is also a careful design choice, since it validates the steering mechanism itself separately from VLM noise.
  • Inference-time flexibility. Failure modes are specified just by changing the text prompt, and the method attaches to several WMs (Vista, Ctrl-World) without retraining.

Weaknesses and limitations

  • Vulnerable to reward hacking (acknowledged by the authors). Because the semantic term depends on a VLM score, the score can rise without any meaningful change in the generation. The authors themselves state that "a generalizable and robust robot reward model is needed." The VLM's (Qwen's) limits in video understanding and the quality of the prompt become direct bottlenecks.
  • "Plausibility" is not physical realism. It is strictly limited to what the base WM distribution supports. If the WM produces flawed (unrealistic) videos, steering can operate within those flaws, and real risks absent from the WM's training distribution will not be discovered. The completeness of the safety verification therefore depends on WM fidelity and rests on the unresolved premise of "a physically consistent, high-fidelity WM trained on diverse robot data."
  • Limits in metrics and scale. Manipulation recall (Fig. 5) and policy success rate (Fig. 8) are the headline numbers, but with only 20 rollouts per task the confidence intervals may be wide, and an absolute success rate of 71% is still short of practical use. The policy-improvement experiment also relies on weighted regression with labels from evaluation inside WM imagination rather than on a real robot, so the sim-to-real gap remains a separate issue. The bias introduced by the gradient approximation (score distillation) is not examined quantitatively either.
  • Runtime cost. Current WMs take several minutes per imagination, and adding 10–20 steps of noise optimization (each a forward and backward pass) makes the method heavy for real-time closed-loop evaluation. The authors also name efficient WMs (consistency models and the like) as future work.
  • The outer optimization is effectively unfinished. The paper's center of gravity is the inner max (steering), and the outer min that actually selects robust actions is covered only by the simple weighted retraining in the policy-improvement experiment. Full robust policy optimization over a continuous action space remains open.
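To put the "20 rollouts per task" concern in numbers, a Wilson score interval can be computed for a single task. The 14-out-of-20 count is a hypothetical example close to the reported average, not a per-task result from the paper.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return center - half, center + half

low, high = wilson_interval(14, 20)  # hypothetical: 14 successes in 20 rollouts
print(f"70% observed on 20 rollouts -> 95% interval ({low:.2f}, {high:.2f})")
```

An interval running from roughly one half to the mid-80s shows how much a single-task success rate can move on 20 trials.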

Positioning against related work

  • World-model-based policy evaluation and improvement. The paper builds on the line of work that uses "WMs as policy evaluation environments or training grounds," such as WorldGym, Gemini-in-Veo, World-Gymnast, and VLAW. Those rely on nominal imagination, whereas StressDream adds steering toward the worst plausible outcome. The SWM (Semantic World Models) review on this blog shares the motivation of the semantic objective in that it scores WM imagination with a VLM to evaluate policies; StressDream can be seen as adding active steering on top, "using that score as a gradient to optimize the noise."
  • Video world models themselves. The underlying WMs, Vista (driving) and Ctrl-World (manipulation), both belong to the Stable Video Diffusion family and sit alongside the robotics WM lineage (the multi-task WM for continuous control in the NewtWM review, the visuo-tactile WM in the VTWM review, the simulation platform in the RoboVerse review). StressDream is complementary to these because it steers an existing WM at inference time rather than training a new one.
  • Diffusion noise optimization / inference-time alignment. The contribution is to extend the text-to-image line of work that optimizes initial noise against a reward (DNO, RENO, "noise as diffusion guidance") to extremely high-dimensional video WMs and to add the typical-set plausibility constraint. In contrast to classifier guidance (which modifies the denoising trajectory directly), the experiments show that "optimizing only the initial noise" is better for plausibility.
  • Robust / risk-aware policies. The work connects to min–max robust RL and tail-risk policies, and StressDream is a case that makes the "environment uncertainty" concrete as the distribution of plausible futures of a learned WM.

Summary

StressDream in one sentence: "gradient-optimize the initial noise of a video world model toward the target event scored by a VLM while holding it inside the typical set of the high-dimensional Gaussian, and thereby imagine plausible yet critical futures without heavy sampling." Its practicality comes from the division of labor between the semantic term (what to look for) and the plausibility term (the boundary of realism), together with the score-distillation gradient approximation. The controlled Dubins experiment shows "detection only when failure is possible," and the paper reports failure-detection recall rising from 54% to 94% on Ctrl-World manipulation and a VLA policy improvement from 39% to 71%. Its power is nevertheless limited to outcomes the base WM supports, and three issues remain as gates to practical use: VLM reward hacking, WM fidelity, and runtime. The idea of "steering the latent instead of adding samples" is a direction likely to keep expanding in world-model-based safety evaluation.

Copyright 2026, JungYeon Lee