  • Jung Yeon Lee


๐Ÿ“ƒCosmos predict/transfer 2.5 ๋ฆฌ๋ทฐ

cosmos
nvidia
physical-ai
World Simulation with Video Foundation Models for Physical AI
Published

October 2, 2025

  • Research Blog
  • Paper Link
  • Cosmos Github
  1. NVIDIA introduces [Cosmos-Predict2.5] and [Cosmos-Transfer2.5], next-generation world foundation models for Physical AI that enable high-quality world simulation and data generation for robots and autonomous systems.
  2. [Cosmos-Predict2.5] uses a flow-based architecture to unify Text2World, Image2World, and Video2World generation. It is trained on 200 million video clips and then refined with reinforcement learning-based post-training, which substantially improves video quality and instruction alignment.
  3. [Cosmos-Transfer2.5] is a ControlNet-style framework for Sim2Real and Real2Real translation. It is 3.5 times smaller than its predecessor yet delivers higher fidelity and stable long-horizon video generation, and it is used across a range of Physical AI applications.

We introduce [Cosmos-Predict2.5], the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, [Cosmos-Predict2.5] unifies Text2World, Image2World, and Video2World generation in a single model and leverages [Cosmos-Reason1], a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, [Cosmos-Predict2.5] achieves substantial improvements over [Cosmos-Predict1] in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with [Cosmos-Transfer2.5], a control-net style framework for Sim2Real and Real2Real world translation. Despite being 3.5× smaller than [Cosmos-Transfer1], it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish [Cosmos-Predict2.5] and [Cosmos-Transfer2.5] as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at cosmos-predict2.5 and cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.


Brief Review

NVIDIA introduces [Cosmos-Predict2.5], a video foundation model focused on world simulation for Physical AI systems. Its flow matching-based architecture unifies Text2World, Image2World, and Video2World generation in a single model, and it uses [Cosmos-Reason1], a VLM specialized for Physical AI, to strengthen text grounding and control over world simulation. Pre-trained on 200 million video clips and refined with RL-based post-training, it improves substantially on [Cosmos-Predict1] in video quality and instruction alignment. The model is released at 2B and 14B scales and enables reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.

NVIDIA also released [Cosmos-Transfer2.5], a ControlNet-style framework for Sim2Real and Real2Real world translation. Although it is 3.5 times smaller than [Cosmos-Transfer1], it delivers higher quality and robust long-horizon video generation. Together, these advances position [Cosmos-Predict2.5] and [Cosmos-Transfer2.5] as versatile tools for scaling Physical AI. To accelerate Physical AI research and deployment, NVIDIA released the source code, pretrained checkpoints, and benchmarks under the NVIDIA Open Model License.

2. Data

The data pipeline was improved in two main respects.

First, the filtering components of the general-purpose data processing pipeline were upgraded.

Second, high-quality Physical AI data was curated to strengthen Physical AI capabilities.

2.1. Video curation pipeline: The pipeline has seven stages: 1) shot-aware video splitting, 2) GPU-based transcoding, 3) video cropping, 4) filtering, 5) captioning, 6) semantic deduplication, 7) sharding. More than 200 million raw videos were processed to curate 200 million high-quality clips. The filtering stage removes clips with motion artifacts, distortion, visual noise, overlay text, inappropriate content, and similar defects, and a final VLM-based filtering pass raises precision further. The captioning stage uses the Qwen2.5-VL-7B VLM to generate factual, context-aware captions, while semantic deduplication and sharding support structured use of the dataset. Compared with [Cosmos-Predict1], this pipeline handles a larger data volume and, through stricter filtering, yields markedly higher data quality.
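To make the semantic deduplication stage more concrete, here is a minimal sketch. It assumes each clip already has an embedding vector and drops any clip whose cosine similarity to an already-kept clip exceeds a threshold. The greedy strategy, the `semantic_dedup` name, and the 0.95 threshold are illustrative assumptions, not the method used in the actual pipeline.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_dedup(embeddings: List[Sequence[float]], threshold: float = 0.95) -> List[int]:
    """Greedily keep a clip only if it is not too similar to any clip kept so far.

    Returns the indices of the kept clips. The threshold is a hypothetical value.
    """
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings: the second clip is a near-duplicate of the first.
clips = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(semantic_dedup(clips))  # [0, 2]
```

The greedy pass is quadratic in the number of kept clips, so a pipeline at the scale described above would presumably need clustering or approximate nearest-neighbour search instead.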

2.2. Domain-specific data: High-quality data was curated across five core domains: robotics, autonomous driving, smart spaces, human dynamics, and physics. Each domain follows a curation process similar to the pre-training one, but uses domain-specific filtering rules and a large VLM driven by tailored prompts. For example, the robotics dataset covers a variety of robot platforms and viewpoints, and the autonomous driving dataset consists of about 3.1 million clips captured from seven camera views on NVIDIA's in-house driving platform, reflecting diverse driving conditions and environmental attributes.

3. Method

3.1. Flow Matching: [Cosmos-Predict2.5] adopts flow matching (FM). FM and the Elucidated Diffusion Model (EDM) used in [Cosmos-Predict1] are mathematically equivalent but parameterize the denoising network differently. FM chooses its coefficients so that the denoising network predicts the velocity of the diffusion trajectory, which gives a more direct training target and, in practice, smoother optimization and better sample quality. Given a data sample $x$, a noise vector $\epsilon \sim \mathcal{N}(0, I)$, and a timestep $t \in [0, 1]$ drawn from a logit-normal distribution, the interpolated latent $x_t$ is defined as:

$$x_t = (1 - t)\,x + t\,\epsilon$$

The corresponding ground-truth velocity is:

$$v_t = \epsilon - x$$

The model is trained to predict $v_t$ by minimizing the mean squared error (MSE) between its prediction and the ground truth:

$$\mathcal{L}(\theta) = \mathbb{E}_{x, \epsilon, c, t} \left\| u(x_t, t, c; \theta) - v_t \right\|^2$$

Here $c$ denotes the conditioning information (text embeddings, reference frames, and so on), $\theta$ the model parameters, and $u(\cdot; \theta)$ the predicted velocity function. To counter the excessive correlation present in high-resolution content, a shifted logit-normal distribution (Esser et al., 2024) is used to deliberately bias training toward higher noise levels. A shift hyperparameter $\beta$ skews the sampled $t$ toward the high-noise end:

$$t_s = \frac{\beta t}{1 + (\beta - 1)t}$$
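The flow matching objective above can be sketched in a few lines of plain Python. This is a toy illustration on short lists of floats, not the training code: the function names are hypothetical, the logit-normal draw is assumed to be the sigmoid of a standard normal, and the loss is averaged over elements rather than summed.

```python
import math
import random
from typing import List


def shift_timestep(t: float, beta: float) -> float:
    """t_s = beta * t / (1 + (beta - 1) * t); beta > 1 pushes t toward 1 (more noise)."""
    return beta * t / (1.0 + (beta - 1.0) * t)


def sample_timestep(rng: random.Random, beta: float) -> float:
    """Draw t from a logit-normal distribution (assumed: sigmoid of N(0, 1)), then shift it."""
    z = rng.gauss(0.0, 1.0)
    t = 1.0 / (1.0 + math.exp(-z))
    return shift_timestep(t, beta)


def interpolate(x: List[float], eps: List[float], t: float) -> List[float]:
    """x_t = (1 - t) * x + t * eps, element-wise."""
    return [(1.0 - t) * xi + t * ei for xi, ei in zip(x, eps)]


def target_velocity(x: List[float], eps: List[float]) -> List[float]:
    """v_t = eps - x (independent of t for this linear path)."""
    return [ei - xi for xi, ei in zip(x, eps)]


def fm_loss(predicted: List[float], x: List[float], eps: List[float]) -> float:
    """Squared error between predicted velocity and v_t, averaged over elements."""
    v = target_velocity(x, eps)
    return sum((p - vi) ** 2 for p, vi in zip(predicted, v)) / len(v)


rng = random.Random(0)
x = [0.5, -1.0, 2.0]                      # stand-in for a clean latent
eps = [rng.gauss(0.0, 1.0) for _ in x]    # Gaussian noise
t = sample_timestep(rng, beta=5.0)
x_t = interpolate(x, eps, t)              # what the network would see as input
print(fm_loss(target_velocity(x, eps), x, eps))  # a perfect prediction gives 0.0
```

In a real model, `predicted` would come from the network $u(x_t, t, c; \theta)$ rather than from the target itself.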

3.2. Network architecture: [Cosmos-Predict2.5] reuses the DiT-based denoising network of [Cosmos-Predict1], but removes the absolute positional embeddings and keeps only the relative positional embeddings, which improves generalization across resolutions and sequence lengths. The visual tokenizer is the WAN2.1 VAE, a causal VAE that compresses video sequences by 4x8x8 (time x height x width); the model generates 93 frames, which correspond to 24 latent frames. For the text encoder, [Cosmos-Reason1] replaces the T5 encoder used in [Cosmos-Predict1]; activations from multiple blocks are concatenated to form the text embedding, capturing both local and global linguistic context more faithfully. The model operates in three modes, Text2World, Image2World, and Video2World. In Image2World and Video2World, a frame-replacement strategy substitutes the conditioning frames for the initial frames to reinforce temporal consistency.
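The relation between 93 video frames and 24 latent frames follows from the 4x8x8 compression if the causal VAE encodes the first frame on its own and the remaining frames in groups of four. That grouping convention and the 1280x720 resolution in the example are assumptions made for this sketch, not details taken from the paper.

```python
from typing import Tuple


def latent_shape(frames: int, height: int, width: int,
                 t_factor: int = 4, s_factor: int = 8) -> Tuple[int, int, int]:
    """Latent (frames, height, width) for a causal VAE with 4x8x8 compression.

    Assumes the first frame is encoded alone and the rest in groups of t_factor.
    """
    if (frames - 1) % t_factor != 0:
        raise ValueError("frames must be of the form 1 + k * t_factor")
    latent_frames = 1 + (frames - 1) // t_factor
    return latent_frames, height // s_factor, width // s_factor


print(latent_shape(93, 720, 1280))  # (24, 90, 160)
```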

4. ํ•™์Šต

4.1. Pre-training: Training is progressive. It starts with a Text2Image task at 256p resolution and then introduces the Image2World and Video2World tasks, for which 1 or 5 conditioning frames are sampled and the model generates the remaining 92 or 88 frames. A masking scheme distinguishes conditioning input frames from noised input frames. The resolution is then raised step by step from 256p to 480p and 720p, and finally the Text2World task, which has no conditioning frames, is added. Training timesteps are sampled from a logit-normal distribution, and a shifted logit-normal distribution is applied in which $\beta$ is gradually increased from 1 to 5 as the training resolution grows. In addition, to reduce transition artifacts caused by too few training samples in the high-noise region, a targeted sampling strategy explicitly draws 5% of the training samples from the top 2% of the noise distribution. Optimization uses AdamW with a linear learning-rate scheduler and a warmup phase.
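The conditioning mask and the targeted high-noise sampling described above might look like the following sketch. It interprets "the top 2% of the noise distribution" as timesteps in [0.98, 1] and works on frame indices rather than latent frames. Both are assumptions, and all names are hypothetical.

```python
import math
import random
from typing import List


def shifted_logit_normal(rng: random.Random, beta: float) -> float:
    """Logit-normal draw (assumed: sigmoid of N(0, 1)) followed by the beta shift."""
    t = 1.0 / (1.0 + math.exp(-rng.gauss(0.0, 1.0)))
    return beta * t / (1.0 + (beta - 1.0) * t)


def sample_training_timestep(rng: random.Random, beta: float,
                             tail_prob: float = 0.05,
                             tail_start: float = 0.98) -> float:
    """With probability tail_prob, draw t uniformly from the high-noise tail.

    tail_start = 0.98 is this sketch's reading of "top 2% of the noise distribution".
    """
    if rng.random() < tail_prob:
        return rng.uniform(tail_start, 1.0)
    return shifted_logit_normal(rng, beta)


def conditioning_mask(total_frames: int, num_cond: int) -> List[int]:
    """1 marks a clean conditioning frame, 0 marks a frame the model must generate."""
    return [1] * num_cond + [0] * (total_frames - num_cond)


rng = random.Random(0)
num_cond = rng.choice([1, 5])              # Image2World or Video2World
mask = conditioning_mask(93, num_cond)
timesteps = [sample_training_timestep(rng, beta=5.0) for _ in range(1000)]
print(num_cond, mask.count(0), max(timesteps) <= 1.0)
```

With `num_cond` equal to 1 or 5, the mask leaves 92 or 88 frames to generate, matching the counts in the text.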

4.2. ํ›„์ฒ˜๋ฆฌ ํ•™์Šต(Post-training):

  • Supervised Fine-tuning (SFT): SFT is performed on high-quality Physical AI datasets grouped into five domains: object permanence, high motion, complex scenes, driving, and robot manipulation. A separate model is trained for each domain to improve specialized performance, and a cooldown stage on 4K video sharpens fine visual detail and smooth motion. To combine the strengths of the individual SFT models, model merging (Yang et al., 2024) is applied, and the model soup approach (Wortsman et al., 2022) was found to be effective.
  • Reinforcement Learning (RL): VideoAlign (Liu et al., 2025), a VLM-based reward model, scores text alignment, motion quality, and visual quality, and is used to post-train [Cosmos-Predict2.5-2B] (both the pre-trained and the merged model). Following GRPO (Guo et al., 2025), the rewards are normalized within each group of rollouts to compute the advantage of every output. RL is shown to improve model quality on both reward scores and human evaluation.
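Two of the post-training ingredients above are simple enough to sketch directly: averaging the weights of several fine-tuned checkpoints, and normalizing rewards within a rollout group. The sketch assumes uniform averaging, which is the simplest model soup variant and may not be the exact merging recipe used, and it represents each checkpoint as a dictionary of float lists. All names and values are illustrative.

```python
import statistics
from typing import Dict, List


def model_soup(checkpoints: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Uniformly average same-named parameters across checkpoints."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Two hypothetical domain-specific SFT checkpoints with one parameter tensor each.
driving = {"w": [1.0, 2.0]}
robotics = {"w": [3.0, 4.0]}
print(model_soup([driving, robotics]))  # {'w': [2.0, 3.0]}

# Hypothetical reward-model scores for four rollouts of the same prompt.
advantages = group_advantages([0.2, 0.4, 0.6, 0.8])
print([round(a, 3) for a in advantages])
```

Outputs that score above the group mean get a positive advantage and the rest a negative one, so the update depends only on relative quality within the group.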

4.3. Infrastructure: FSDP2 is the default distributed training framework and shards model weights, gradients, and optimizer states efficiently. To handle the long input sequences that arise when training on high-resolution or long videos, a flexible Ulysses-style context parallelism is used. torch Selective Activation Checkpointing (SAC) balances memory usage against compute efficiency. For RL post-training, an efficient and flexible Elastic Reward Service handles the large input volume and the variety of reward models.

5. Results

Benchmarking: The performance of [Cosmos-Predict2.5-2B] is reported on PAI-Bench (Zhou et al., 2025), which evaluates Physical AI generation and understanding. On the PAI-Bench predict task, a Domain Score and a Quality Score are measured. The post-trained [Cosmos-Predict2.5-2B] model performs on par with the larger Wan2.2-5B model on T2W and achieves the best performance on I2W.

Human evaluation: Beyond automated metrics, a human evaluation was run to assess the aspects of video quality that reflect human preference, such as realism, visual quality, temporal consistency, and alignment with the conditioning input. [Cosmos-Predict2.5-2B] reaches comparable human preference in the PAI-Bench I2W and T2W settings despite being 60% and 85.7% smaller than Wan 2.2 5B and Wan 2.1 14B, respectively.
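The two size reductions follow directly from the nominal parameter counts (2B versus 5B and 14B), taking those round figures at face value:

$$1 - \frac{2}{5} = 0.60, \qquad 1 - \frac{2}{14} \approx 0.857$$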

Qualitative examples: The post-trained [Cosmos-Predict2.5-2B] model simulates accurate behavior in driving scenes, generates realistic industrial and robotic scenes, and produces physically consistent motion.

6. Applications

6.1. Cosmos-Transfer2.5: A conditional world generation model built on [Cosmos-Predict2.5-2B] that generates high-quality world simulations from multiple spatial control inputs (edges, blurred video, segmentation maps, depth maps, and so on). Unlike [Cosmos-Transfer1-7B], it distributes four control blocks evenly across the main branch, so that conditioning information is integrated into the network more gradually. [Cosmos-Transfer2.5-2B] outperforms [Cosmos-Transfer1-7B] despite being 3.5 times smaller (2B versus 7B parameters), which is attributed to a stronger base model and to curated training data focused on Physical AI.

Long video generation: A new metric, the Averaged Relative Normalized Dover Score (RNDS), is introduced to assess error accumulation in long video generation. RNDS[i] is DOVER[i] / DOVER_GT[i] normalized by DOVER[1] / DOVER_GT[1]. [Cosmos-Transfer2.5-2B] shows a much smaller drop in RNDS than [Cosmos-Transfer1-7B], indicating less error accumulation and hallucination and higher fidelity over long video sequences.
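The RNDS definition above can be computed as follows. The sketch uses 0-based lists for the 1-based chunk index in the text, and it assumes the "averaged" part means averaging the per-chunk curve over several test videos. The DOVER scores in the example are made-up numbers, not results from the paper.

```python
from typing import List


def rnds(dover: List[float], dover_gt: List[float]) -> List[float]:
    """RNDS[i] = (DOVER[i] / DOVER_GT[i]) / (DOVER[1] / DOVER_GT[1]).

    dover holds quality scores of generated chunks, dover_gt those of the
    ground-truth chunks; index 0 here corresponds to chunk 1 in the text.
    """
    base = dover[0] / dover_gt[0]
    return [(d / g) / base for d, g in zip(dover, dover_gt)]


def averaged_rnds(curves: List[List[float]]) -> List[float]:
    """Average per-chunk RNDS over videos (assumed meaning of 'Averaged')."""
    return [sum(chunk) / len(chunk) for chunk in zip(*curves)]


# Made-up scores for two videos of three chunks each.
video_a = rnds([0.5, 0.375, 0.25], [0.5, 0.5, 0.5])
video_b = rnds([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
print(video_a)                            # [1.0, 0.75, 0.5]
print(averaged_rnds([video_a, video_b]))  # [1.0, 0.875, 0.75]
```

By construction the first value is always 1, so a curve that stays close to 1 means quality relative to ground truth is not degrading as more chunks are generated.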

6.2. Cosmos-Transfer2.5 for robot policy learning: [Cosmos-Transfer2.5-2B] can serve as a synthetic visual data generator for robot policy learning, augmenting policy training and helping the policy generalize to previously unseen visual scenarios. Human teleoperation demonstrations of a tabletop manipulation task are collected with a bimanual robot equipped with an egocentric camera, and a vision-based policy is trained on them. By specifying the desired visual conditions through text prompts, [Cosmos-Transfer2.5-2B] generates a variety of structured visual variations and makes it possible to test policy robustness systematically. In real-robot experiments, the policy augmented with [Cosmos-Transfer2.5-2B] succeeded in 24 of 30 trials (80%), demonstrating markedly higher robustness and generalization to novel test-time objects and environment changes.

6.3. Cosmos-Transfer2.5 for driving simulation: [Cosmos-Predict2.5-2B] is extended from single-view to multi-view world generation, yielding [Cosmos-Predict2.5-2B/auto/multiview]. It is further extended in ControlNet style into [Cosmos-Transfer2.5-2B/auto/multiview], which generates consistent multi-view scenes conditioned on a World Scenario Map. For 720p multi-view generation, the latent temporal dimension is reused to concatenate the views, and a compact learned per-view embedding is concatenated along the latent channel dimension before the input is passed to the DiT network. 3D-factorized RoPE and cross-attention with the text embeddings are applied.

Training datasets: [Cosmos-Predict2.5-2B/auto/multiview] is trained on a captioned multi-view dataset of 1.5 million clips. [Cosmos-Transfer2.5-2B/auto/multiview] uses as its control input a "World Scenario Map" that contains HD map and dynamic object information. The map includes map elements such as lanes, road markings, and traffic lights together with dynamic 3D bounding boxes, and the model is trained on the RDS-HQ dataset.

Experiments and results: FVD/FID scores improve substantially, by up to 2.3 times, while temporal and cross-camera Sampson error remain competitive. To test fidelity to the control signals, the performance of 3D cuboid and lane detection models was measured on the generated videos, showing up to a 60% improvement in detection metrics over Transfer1-7B-Sample-AV.
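The multi-view packing described above (views concatenated along the latent time axis, a per-view embedding concatenated along the channel axis) can be illustrated with shape bookkeeping alone. The seven views come from the driving dataset described earlier. The latent frame count, channel count, and embedding width are hypothetical values chosen only for the example.

```python
from typing import List, Tuple


def pack_multiview(num_views: int, latent_frames: int,
                   latent_channels: int, view_embed_dim: int) -> Tuple[int, int, List[int]]:
    """Return (packed_time, packed_channels, view_id_per_time_slot).

    Views are laid out one after another on the time axis, and every slot
    gets view_embed_dim extra channels that identify its camera.
    """
    view_ids = [view for view in range(num_views) for _ in range(latent_frames)]
    return len(view_ids), latent_channels + view_embed_dim, view_ids


# 7 cameras; 8 latent frames, 16 channels and a 4-dim embedding are hypothetical.
packed_time, packed_channels, view_ids = pack_multiview(7, 8, 16, 4)
print(packed_time, packed_channels)  # 56 20
print(view_ids[:10])                 # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
```

Because all views share one sequence, the DiT's attention can relate tokens across cameras without any architectural change, which is presumably why the temporal axis is reused.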

6.4. Multi-view generation with camera control: [Cosmos-Predict2.5-2B/robot/multiview] is a camera-controllable multi-view world generation model that takes a video from a reference view and synthesizes additional videos from several target viewpoints along given camera trajectories. This is especially useful in robotics, for example in robot manipulation simulation, where the robot must reason about objects outside its direct field of view. Cameras are represented with Plücker raymaps (Sitzmann et al., 2021), which are integrated into the video latent space. The model is trained on the Agibot, MultiCamVideo, and SynCamVideo datasets. It either takes a head-view robot manipulation video and synthesizes synchronized videos from the left and right gripper viewpoints (multiview-agibot), or takes a third-person-view video and generates two synchronized videos under basic camera transformations. [Cosmos-Predict2.5-2B/robot/multiview] achieves markedly better cross-view consistency than its single-view counterpart while maintaining comparable camera-trajectory accuracy.
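A Plücker raymap assigns each pixel the six-dimensional Plücker coordinates of its viewing ray. The sketch below uses the (direction, origin × direction) convention and a simple pinhole camera with identity rotation looking along +z. The convention, the camera model, and all names are assumptions for illustration and say nothing about how the model consumes the map.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v: Vec3) -> Vec3:
    n = math.sqrt(sum(c * c for c in v))
    return (v[0] / n, v[1] / n, v[2] / n)


def plucker_ray(origin: Vec3, direction: Vec3) -> Tuple[float, ...]:
    """Six Plücker coordinates (d, o x d) of the ray through origin along direction."""
    d = normalize(direction)
    return d + cross(origin, d)


def plucker_raymap(width: int, height: int, focal: float,
                   origin: Vec3) -> List[List[Tuple[float, ...]]]:
    """Per-pixel rays for a pinhole camera at origin with identity rotation (assumed)."""
    cx, cy = width / 2.0, height / 2.0
    return [
        [plucker_ray(origin, ((u + 0.5 - cx) / focal, (v + 0.5 - cy) / focal, 1.0))
         for u in range(width)]
        for v in range(height)
    ]


raymap = plucker_raymap(width=4, height=2, focal=2.0, origin=(0.0, 0.0, 1.0))
print(len(raymap), len(raymap[0]), len(raymap[0][0]))  # 2 4 6
```

The moment term `o x d` is what distinguishes two cameras that look in the same direction from different positions, which makes the representation suitable for conditioning on camera pose.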

6.5. Synthetic data generation for VLA training: [Cosmos-Predict2.5] has strong potential as a planner and simulator for robot manipulation. After post-training on a large video dataset of real robot demonstrations that follow natural-language instructions, [Cosmos-Predict2.5] can generate realistic videos of a robot carrying out unseen instructions. Pseudo-action sequences can then be extracted from these videos with a latent action model or an inverse-dynamics model (IDM). This yields samples annotated with vision (the generated video), language (the instruction), and action (the generated pseudo-actions) for VLA (Vision-Language-Action) training. [Cosmos-Predict2.5-14B/robot/gr00tdream-gr1] achieves the highest instruction-following score on the GR1 humanoid robot dataset in the DreamGen benchmark (Jang et al., 2025).

6.6. Action-conditioned world generation: [Cosmos-Predict2.5] is extended from pure video generation to action-conditioned video generation, yielding [Cosmos-Predict2.5-2B/robot/action-cond]. The model takes a single conditioning image and a robot action sequence as input and generates a chunk of future frames that follows the given actions. To generate a full trajectory, generation proceeds autoregressively: each chunk is predicted conditioned on the last frame generated so far.
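The autoregressive chunking described above amounts to a short loop. In the sketch below, `generate_chunk` stands in for the action-conditioned video model, and the toy implementation treats a "frame" as a position vector that each action shifts. Both are placeholders for illustration.

```python
from typing import Callable, List

Frame = List[float]
Action = List[float]


def rollout(first_frame: Frame, actions: List[Action], chunk_size: int,
            generate_chunk: Callable[[Frame, List[Action]], List[Frame]]) -> List[Frame]:
    """Generate a full trajectory chunk by chunk.

    Each chunk is conditioned on the last frame generated so far and on the
    next chunk_size actions.
    """
    frames = [first_frame]
    for start in range(0, len(actions), chunk_size):
        chunk_actions = actions[start:start + chunk_size]
        frames.extend(generate_chunk(frames[-1], chunk_actions))
    return frames


def toy_generate_chunk(last_frame: Frame, actions: List[Action]) -> List[Frame]:
    """Placeholder model: every action shifts the 'frame' by its delta."""
    frames, current = [], last_frame
    for action in actions:
        current = [c + a for c, a in zip(current, action)]
        frames.append(current)
    return frames


trajectory = rollout([0.0], [[1.0]] * 5, chunk_size=2, generate_chunk=toy_generate_chunk)
print(len(trajectory), trajectory[-1])  # 6 [5.0]
```

Because each chunk only sees the previously generated frame, errors can compound over long rollouts, which is the kind of drift that long-horizon metrics are meant to expose.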


Detail Review

World Simulation Based on the Cosmos Video Foundation Models: An In-Depth Review

World simulation is a key element of Physical AI research: it supplies, as training data, diverse environments and situations that real robots or vehicles cannot practically experience. To this end, NVIDIA has been developing Cosmos, a platform of World Foundation Models (WFMs). Cosmos consists of three main model families (Cosmos-Predict, Cosmos-Transfer, Cosmos-Reason). Cosmos-Predict generates future simulation videos from text, image, and video inputs; Cosmos-Transfer turns simulated scenes into realistic images and videos; and Cosmos-Reason supports physical reasoning. This in-depth review centers on the paper "World Simulation with Video Foundation Models for Physical AI", introduced at CoRL 2025, and analyzes the contributions and performance of the latest version (2.5) of Cosmos-Predict and Cosmos-Transfer, with a focus on video generation.

Overview

In Physical AI, world simulation plays a central role by enabling synthetic data generation for large-scale training. At CoRL 2025, NVIDIA announced Cosmos-Predict2.5, which generates high-quality simulation videos of up to 30 seconds from text, image, and video inputs, and Cosmos-Transfer2.5, which generates realistic data from spatial information such as depth, segmentation, and edge maps. The Cosmos series is designed to model the motion and interaction of objects in the physical world and thereby accelerate learning in fields such as robotics and autonomous driving. These video foundation models (VFMs) take past frames of the world and a perturbation (such as a control command) as input and predict the future state. Cosmos WFMs build a strong world model in two steps: pre-training learns physical regularities from large-scale general data, and post-training learns specific tasks from a small amount of specialized data.

Model Architecture Analysis

Cosmos-Predict2.5 merges three previously separate WFMs (Text2World, Image2World, Video2World) into one model, which reduces complexity, and it generates longer and more varied video simulations than before. Its video generation adopts flow matching. Compared with a conventional diffusion formulation, flow matching learns a continuous transformation (a velocity field) that removes noise from the video, and this can allow faster sampling. The model accepts free-form text, single or multiple images, and consecutive video frames as input and generates the requested future world. For example, it can create a realistic scene from a text prompt, or predict a robot manipulation sequence from an instruction combined with a video input.

Cosmos-Predict2.5 uses the WAN2.1 VAE (a variational autoencoder) as its video tokenizer, compressing video into continuous latent tokens. Large video generation models need a low-dimensional token representation to keep computation manageable, and an attention-based encoder-decoder is used to compress the original frames for this purpose. The WAN2.1 VAE plays the role of this video codec, reducing computation while preserving as much of the important physical information in the video as possible.

Cosmos-Predict2.5 also relies on Cosmos-Reason1 to strengthen its reasoning ability. Cosmos-Reason1 is a vision-language model (VLM) with built-in physical common sense, designed to understand the spatial and temporal relations of objects in a video as well as physical laws. It can answer questions such as "What is the next action?" in a chain-of-thought manner, and in Cosmos-Predict2.5 it serves as the text encoder, which contributes to generating simulations that respect physical constraints. For example, in a scene where a robot arm picks up an object, it helps produce a natural motion sequence grounded in physical common sense such as gravity and inertia. In short, the Cosmos-Predict2.5 architecture combines a flow matching-based video generator, the WAN2.1 VAE tokenizer, and Cosmos-Reason1, and is built to turn text, image, and video inputs into a single coherent world simulation.

ํ›ˆ๋ จ ์ „๋žต

Cosmos models are trained with a multi-stage strategy (pre-training → post-training). For pre-training the original Cosmos world foundation models, about 100 million video clips (2 to 60 seconds each) were built by selecting segments rich in static and dynamic content from roughly 20 million hours of video; Cosmos-Predict2.5 raises this to 200 million curated clips. A vision-language model was applied to each clip to generate video captions, and GPU-accelerated H.264 decoding, among other techniques, was used to implement the large-scale video data pipeline. Through this the model acquires generalized world knowledge that covers physical scene changes and object motion across many domains. The original Cosmos WFMs were pre-trained as two model families in parallel, a Transformer-based diffusion model and an autoregressive model. The former uses continuous latent tokens and generates video sequences through denoising, while the latter uses discrete tokens and generates them through sequential next-token prediction. Cosmos-Predict2.5 keeps the continuous-latent path and trains it with flow matching.

After pre-training, supervised post-training (fine-tuning) is carried out for specific Physical AI tasks. For example, the pre-trained model is fine-tuned on domain-specific data (prompt-video pairs) for camera viewpoint control, robot manipulation, autonomous driving, and so on. Cosmos-Reason1 is trained the same way: pre-training on large-scale general-purpose data comes first, followed by supervised fine-tuning (SFT) and reinforcement learning (RL) stages on physical common sense and embodied reasoning data. The Cosmos-Reason1 paper shows that model performance improves considerably after SFT and RL on physical common sense data, and Cosmos-Predict2.5 is reported to apply a similar approach with online reinforcement learning techniques such as negative-aware diffusion fine-tuning. In the RL stage, reward functions (for example realism and motion consistency) are defined to improve the quality of the generated simulations, and the model is tuned to optimize them. With this multi-stage strategy, Cosmos models gain an understanding of general-domain physics together with the fine-grained adaptability needed for specific robots and simulation environments.

Robot Simulation Applications

Cosmos models are well suited to Sim2Real (simulation to real world) and Real2Real (real world to real world) translation. Cosmos-Transfer2.5 is a conditional generation model that runs on top of Cosmos-Predict2.5 and takes several spatial control inputs, such as depth, segmentation, and edges, to generate high-quality video. For example, given the depth and segmentation maps of a simulated environment, it converts them into realistic camera footage that can be used for robot vision training. Such structure-preserving translation varies the lighting, materials, weather, and other attributes of a physical scene, which broadens data diversity and in turn improves policy generalization. The Cosmos-Transfer1 paper had already introduced Adaptive Multi-ControlNets, which fuse multiple spatial representations through a weighted sum, to convert simulator data into realistic footage in domains such as autonomous driving and to improve generalization. Cosmos-Transfer2.5 is 3.5 times lighter than the previous model while producing high-quality results, and it supports multiple views and faster generation.

For robot policy learning, generated videos can serve as the basis for training. Recent research suggests that large video generation models can act as useful simulators for learning visuomotor robot policies. In other words, a generated video of robot behavior can be interpreted as a policy, which makes it possible to learn a robust controller from little data. For example, if the model receives the instruction "pick up the cup" and generates the robot arm's task sequence, an action decoder can derive actual robot control commands from it. This requires far less data than collecting simulation data indiscriminately, and the resulting policies tend to generalize well to changes in color, background, and object shape. Cosmos-Transfer2.5 also enables Real2Real translation that narrows the domain gap between real robot experiment videos. For example, footage recorded in a daytime environment can be converted into structurally similar nighttime footage and used for data augmentation. Increasing diversity through such structured visual transforms helps produce policies that stay robust to environmental changes even when the actions and goals are the same.

Comparative Evaluation and Experimental Results

NVIDIA evaluated Cosmos-Predict2.5 and Cosmos-Transfer2.5 against existing models. According to test results that have not been published in detail, the new models are said to generate longer and more complex scenes naturally than their predecessors, with improved computational efficiency. Positive results are also reported in human evaluation. For instance, in earlier experiments with Cosmos-Predict1-based models, expert raters preferred them over publicly available VideoLDM-based models on video prediction tasks. This suggests that Cosmos models succeed in generating simulations consistent with human physical common sense. Cosmos-based models are also reported to score highly in evaluations of video generation quality and action consistency.

For quantitative measurement, a benchmark dedicated to Physical AI, PAI-Bench, is used. Cosmos-Transfer2.5-2B is reported to outperform Cosmos-Transfer1-7B in quality while being 3.5 times smaller, and press coverage also describes it as faster. This suggests that, on practical simulation domain-transfer tasks, training cost and inference time can be reduced substantially without sacrificing quality.

Conclusion

The Cosmos-Predict2.5 and Cosmos-Transfer2.5 models for world simulation mark an important advance for Physical AI research. By handling multimodal inputs such as text, images, and video and by generating long-horizon simulations that were previously hard to produce, they provide synthetic data usable across a wide range of robotics and autonomous driving scenarios. The architecture that combines the WAN2.1 VAE with Cosmos-Reason1 offers both physical consistency and reasoning ability and shows potential for complex control tasks. Cosmos-Transfer2.5, which enables Sim2Real and Real2Real translation, helps close the data gap in robot learning and demonstrates that generative models can also be applied to policy learning.

As for limitations, the training cost and stability of very large models remain open issues. Preprocessing large-scale video data and training the models demand substantial compute, and it is still difficult to verify exhaustively how closely the generated simulations match real physics. Future work calls for more efficient training methods, combining reinforcement learning with model chaining, and continuous improvement through human-robot interaction data. For example, new online RL techniques such as Diffusion Negative-aware Fine-Tuning, together with reward learning based on human evaluation, could make generated output more realistic. Extensive experiments on diverse real robot platforms and the release of open datasets are also needed to verify how far the models generalize. Overall, Cosmos-Predict2.5 and Cosmos-Transfer2.5 have moved world-model research in Physical AI forward considerably and are expected to become key tools for developing robot simulation and control systems.

Copyright 2024, Jung Yeon Lee