Curieux.JY
  • JungYeon Lee
  • Post
  • Projects
  • Note

On this page

  • ๋“ค์–ด๊ฐ€๋ฉฐ: Physical AI๋ฅผ ์œ„ํ•œ โ€œ์ƒ๊ฐํ•˜๋Š” ๋ˆˆโ€
  • 1. Qwen2.5-VL: Cosmos Reason 1์˜ ๊ธฐ๋ฐ˜
    • 1.1 ํ•ต์‹ฌ ์•„ํ‚คํ…์ฒ˜ ๊ฐœ์š”
    • 1.2 ํ•ต์‹ฌ ํ˜์‹  #1: Native Dynamic Resolution
    • 1.3 ํ•ต์‹ฌ ํ˜์‹  #2: M-RoPE (Multimodal Rotary Position Embedding)
    • 1.4 Dynamic FPS Sampling๊ณผ Absolute Time Encoding
    • 1.5 ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฐ ๋ชจ๋ธ ๋ณ€ํ˜•
    • 1.6 ๋ฒค์น˜๋งˆํฌ ์„ฑ๋Šฅ
  • 2. Qwen3-VL: Cosmos Reason 2์˜ ๊ธฐ๋ฐ˜
    • 2.1 ์„ธ ๊ฐ€์ง€ ํ•ต์‹ฌ ๊ธฐ๋‘ฅ (Three Core Pillars)
    • 2.2 ์•„ํ‚คํ…์ฒ˜ ํ˜์‹  #1: Interleaved-MRoPE
    • 2.3 ์•„ํ‚คํ…์ฒ˜ ํ˜์‹  #2: DeepStack Integration
    • 2.4 ์•„ํ‚คํ…์ฒ˜ ํ˜์‹  #3: Text-Timestamp Alignment
    • 2.5 ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฐ ๋ชจ๋ธ ๋ณ€ํ˜•
    • 2.6 ์•„ํ‚คํ…์ฒ˜ ๊ตฌ์„ฑ ์š”์†Œ
    • 2.7 ๋ฒค์น˜๋งˆํฌ ์„ฑ๋Šฅ
  • 3. NVIDIA Cosmos Reason: Physical AI๋กœ์˜ ํŠนํ™”
    • 3.1 Cosmos Reason 1 (Qwen2.5-VL ๊ธฐ๋ฐ˜)
    • 3.2 Cosmos Reason 2 (Qwen3-VL ๊ธฐ๋ฐ˜)
    • 3.3 Physical AI ํŠนํ™” ๊ธฐ๋Šฅ
    • 3.4 ์ฃผ์š” ํ™œ์šฉ ์‚ฌ๋ก€
  • 4. ๋น„๊ต ๋ถ„์„: ํ•œ๋ˆˆ์— ๋ณด๊ธฐ
    • 4.1 ๊ธฐ๋ฐ˜ ๋ชจ๋ธ ๋น„๊ต (Qwen2.5-VL vs Qwen3-VL)
    • 4.2 Cosmos Reason ๋น„๊ต (Reason 1 vs Reason 2)
  • 5. ๋กœ๋ณดํ‹ฑ์Šค ์—ฐ๊ตฌ์ž๋ฅผ ์œ„ํ•œ ์‹œ์‚ฌ์ 
    • 5.1 ์™œ Qwen VL ๊ธฐ๋ฐ˜์ธ๊ฐ€?
    • 5.2 Allegro Hand ์—ฐ๊ตฌ์—์˜ ์ ์šฉ ๊ฐ€๋Šฅ์„ฑ
    • 5.3 ์‹ค์šฉ์  ๋ฐฐํฌ ๊ณ ๋ ค์‚ฌํ•ญ
  • 6. ๊ฒฐ๋ก  ๋ฐ ์ „๋ง
  • ์ฐธ๊ณ  ์ž๋ฃŒ

๐ŸงฉQwen2.5-VL๊ณผ Qwen3-VL ์•„ํ‚คํ…์ฒ˜ ์‹ฌ์ธต ๋ถ„์„

cosmos
qwen
vlm
NVIDIA Cosmos Reason์˜ ๋‘๋‡Œ
Published

January 16, 2026

๋“ค์–ด๊ฐ€๋ฉฐ: Physical AI๋ฅผ ์œ„ํ•œ โ€œ์ƒ๊ฐํ•˜๋Š” ๋ˆˆโ€

๋กœ๋ด‡์ด ์„ธ์ƒ์„ ์ดํ•ดํ•˜๋ ค๋ฉด ๋ฌด์—‡์ด ํ•„์š”ํ• ๊นŒ์š”? ๋‹จ์ˆœํžˆ ์นด๋ฉ”๋ผ๋กœ ์ด๋ฏธ์ง€๋ฅผ โ€œ๋ณด๋Š” ๊ฒƒโ€๋งŒ์œผ๋กœ๋Š” ๋ถ€์กฑํ•ฉ๋‹ˆ๋‹ค. ๋ฌผ์ฒด๊ฐ€ ์–ด๋””์— ์žˆ๋Š”์ง€, ์–ด๋–ป๊ฒŒ ์›€์ง์ด๊ณ  ์žˆ๋Š”์ง€, ๊ทธ๋ฆฌ๊ณ  ๋‚ด๊ฐ€ ์–ด๋–ค ํ–‰๋™์„ ์ทจํ•ด์•ผ ํ•˜๋Š”์ง€๊นŒ์ง€ โ€œ์ถ”๋ก โ€ํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

NVIDIA์˜ Cosmos Reason์€ ๋ฐ”๋กœ ์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์„ค๊ณ„๋œ Vision-Language Model(VLM)์ž…๋‹ˆ๋‹ค. GR00T ํœด๋จธ๋…ธ์ด๋“œ ๋กœ๋ด‡ ํ”Œ๋žซํผ์˜ โ€œ์ง€๋Šฅ ๋‘๋‡Œโ€ ์—ญํ• ์„ ๋‹ด๋‹นํ•˜๋ฉฐ, ์‹ค์„ธ๊ณ„์˜ ๋ฌผ๋ฆฌ ๋ฒ•์น™์„ ์ดํ•ดํ•˜๊ณ  ํ–‰๋™์„ ๊ณ„ํšํ•˜๋Š” ๋Šฅ๋ ฅ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

ํฅ๋ฏธ๋กœ์šด ์ ์€ ์ด ๊ฐ•๋ ฅํ•œ ์ถ”๋ก  ์—”์ง„์˜ ๊ธฐ๋ฐ˜์ด Alibaba์˜ ์˜คํ”ˆ์†Œ์Šค ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ LLM์ธ Qwen VL ์‹œ๋ฆฌ์ฆˆ๋ผ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

  • Cosmos Reason 1 โ†’ Qwen2.5-VL ๊ธฐ๋ฐ˜
  • Cosmos Reason 2 โ†’ Qwen3-VL ๊ธฐ๋ฐ˜

์ด ํฌ์ŠคํŒ…์—์„œ๋Š” ๋‘ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์˜ ์•„ํ‚คํ…์ฒ˜๋ฅผ ๊นŠ์ด ์žˆ๊ฒŒ ๋ถ„์„ํ•˜๊ณ , NVIDIA๊ฐ€ ์ด๋ฅผ ์–ด๋–ป๊ฒŒ Physical AI์šฉ์œผ๋กœ ํŠนํ™”์‹œ์ผฐ๋Š”์ง€ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.


1. Qwen2.5-VL: Cosmos Reason 1์˜ ๊ธฐ๋ฐ˜

1.1 ํ•ต์‹ฌ ์•„ํ‚คํ…์ฒ˜ ๊ฐœ์š”

Qwen2.5-VL์€ Vision Transformer(ViT) + ๋Œ€ํ˜• ์–ธ์–ด ๋ชจ๋ธ(LLM) ๋””์ฝ”๋”๋ฅผ ํ†ตํ•ฉํ•œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์•„ํ‚คํ…์ฒ˜์ž…๋‹ˆ๋‹ค. 2025๋…„ 1์›” ๊ณต๊ฐœ๋˜์—ˆ์œผ๋ฉฐ, Qwen ํŒ€์ด โ€œ์ƒŒ๋“œ์œ„์น˜ ์ฟ ํ‚ค์˜ ์ค‘๊ฐ„์ธตโ€์ด๋ผ๊ณ  ํ‘œํ˜„ํ•œ ๊ธฐ์กด LVLM์˜ ํ•œ๊ณ„๋ฅผ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

[์ด๋ฏธ์ง€/๋น„๋””์˜ค ์ž…๋ ฅ] โ†’ [Vision Transformer] โ†’ [Projector] โ†’ [LLM Decoder] โ†’ [ํ…์ŠคํŠธ ์ถœ๋ ฅ]

1.2 ํ•ต์‹ฌ ํ˜์‹  #1: Native Dynamic Resolution

๊ธฐ์กด ๋น„์ „ ๋ชจ๋ธ๋“ค์€ ๋ชจ๋“  ์ด๋ฏธ์ง€๋ฅผ ๊ณ ์ •๋œ ํ•ด์ƒ๋„(์˜ˆ: 224ร—224)๋กœ ๋ฆฌ์‚ฌ์ด์ฆˆํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ์ •๋ณด ์†์‹ค์„ ์•ผ๊ธฐํ•˜๊ณ  ์ธ๊ฐ„์˜ ์‹œ๊ฐ ์ธ์ง€์™€ ๋™๋–จ์–ด์ง„ ๋ฐฉ์‹์ด์—ˆ์Šต๋‹ˆ๋‹ค.

Qwen2.5-VL์€ โ€œNaive Dynamic Resolutionโ€ ์ ‘๊ทผ๋ฒ•์„ ๋„์ž…ํ•ฉ๋‹ˆ๋‹ค:

  • ์ž…๋ ฅ ์ด๋ฏธ์ง€์˜ ์›๋ณธ ํ•ด์ƒ๋„๋ฅผ ๊ทธ๋Œ€๋กœ ์œ ์ง€
  • ์ด๋ฏธ์ง€ ํฌ๊ธฐ์— ๋น„๋ก€ํ•˜๋Š” ๊ฐ€๋ณ€ ๊ฐœ์ˆ˜์˜ ์‹œ๊ฐ ํ† ํฐ ์ƒ์„ฑ
  • Window Attention์„ ํ†ตํ•ด ๊ณ„์‚ฐ ํšจ์œจ์„ฑ ํ™•๋ณด

์ด๋ฅผ ํ†ตํ•ด ์ž‘์€ ๋ฌธ์„œ์˜ ์„ธ๋ถ€ ๊ธ€์ž๋ถ€ํ„ฐ ๊ณ ํ•ด์ƒ๋„ ์‚ฐ์—… ์ด๋ฏธ์ง€๊นŒ์ง€ ์›๋ณธ ํ’ˆ์งˆ ๊ทธ๋Œ€๋กœ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Window Attention ๋ฉ”์ปค๋‹ˆ์ฆ˜

Vision Encoder๋Š” 3D Convolution์„ ์‚ฌ์šฉํ•˜์—ฌ ์ž…๋ ฅ ์‹œ๊ฐ ๋ฐ์ดํ„ฐ๋ฅผ 14ร—14 ํŒจ์น˜ ์‹œ๋ฆฌ์ฆˆ๋กœ ๋ถ„ํ• ํ•ฉ๋‹ˆ๋‹ค. Window Attention์€ ์ด๋ฏธ์ง€๋ฅผ ์œˆ๋„์šฐ ๋‹จ์œ„๋กœ ๋‚˜๋ˆ„์–ด ๊ฐ ์œˆ๋„์šฐ ๋‚ด ํŒจ์น˜๋“ค ์‚ฌ์ด์—์„œ๋งŒ ์–ดํ…์…˜์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  ์ด๋ฏธ์ง€ ์ž…๋ ฅ                              โ”‚
โ”‚       โ†“                                 โ”‚
โ”‚  3D Conv โ†’ 14ร—14 ํŒจ์น˜ ๋ถ„ํ•                โ”‚
โ”‚       โ†“                                 โ”‚
โ”‚  Window Attention (์œˆ๋„์šฐ ๋‚ด ์–ดํ…์…˜)       โ”‚
โ”‚       โ†“                                 โ”‚
โ”‚  MLP Layer โ†’ 2ร—2 ํŒจ์น˜ ๋ณ‘ํ•ฉ               โ”‚
โ”‚       โ†“                                 โ”‚
โ”‚  LLM ์ž…๋ ฅ ํ† ํฐ                           โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

์ด ์„ค๊ณ„์˜ ํ•ต์‹ฌ ์žฅ์ :

ํŠน์ง• ์„ค๋ช…
์„ ํ˜• ์Šค์ผ€์ผ๋ง ์ด๋ฏธ์ง€ ํŒจ์น˜ ์ˆ˜์— ๋Œ€ํ•ด ๊ณ„์‚ฐ๋Ÿ‰์ด ์„ ํ˜•์œผ๋กœ ์ฆ๊ฐ€
ํ† ํฐ ์••์ถ• ์ถœ๋ ฅ ๋‹จ๊ณ„์—์„œ 2ร—2 ํŒจ์น˜๋ฅผ ํ•˜๋‚˜๋กœ ๋ณ‘ํ•ฉํ•˜์—ฌ ํ† ํฐ ์ˆ˜ ๊ฐ์†Œ
ํšจ์œจ์„ฑ ์ „์ฒด ์ด๋ฏธ์ง€ ์–ดํ…์…˜ ๋Œ€๋น„ ๋ฉ”๋ชจ๋ฆฌ/๊ณ„์‚ฐ ์ž์› ์ ˆ์•ฝ

1.3 ํ•ต์‹ฌ ํ˜์‹  #2: M-RoPE (Multimodal Rotary Position Embedding)

LLM์—์„œ ๋„๋ฆฌ ์“ฐ์ด๋Š” RoPE(Rotary Position Embedding)๋Š” 1์ฐจ์› ์‹œํ€€์Šค์šฉ์œผ๋กœ ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ด๋ฏธ์ง€์™€ ๋น„๋””์˜ค๋Š” ๊ณต๊ฐ„(2D) + ์‹œ๊ฐ„(temporal) ์ •๋ณด๋ฅผ ๋ชจ๋‘ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.

M-RoPE๋Š” ์œ„์น˜ ์ž„๋ฒ ๋”ฉ์„ ์„ธ ๊ฐ€์ง€ ๋…๋ฆฝ์ ์ธ ๊ตฌ์„ฑ์š”์†Œ๋กœ ๋ถ„ํ•ดํ•ฉ๋‹ˆ๋‹ค:

๊ตฌ์„ฑ์š”์†Œ ์—ญํ•  ์ ์šฉ ๋Œ€์ƒ
Temporal ์‹œ๊ฐ„์  ์ˆœ์„œ ๋น„๋””์˜ค ํ”„๋ ˆ์ž„ ์ˆœ์„œ
Height ์ˆ˜์ง ์œ„์น˜ ์ด๋ฏธ์ง€ ๋‚ด Y์ขŒํ‘œ
Width ์ˆ˜ํ‰ ์œ„์น˜ ์ด๋ฏธ์ง€ ๋‚ด X์ขŒํ‘œ

ํ…์ŠคํŠธ ์ž…๋ ฅ์˜ ๊ฒฝ์šฐ ์„ธ ๊ตฌ์„ฑ์š”์†Œ๊ฐ€ ๋™์ผํ•œ Position ID๋ฅผ ์‚ฌ์šฉํ•ด ๊ธฐ์กด 1D-RoPE์™€ ๋™์ผํ•˜๊ฒŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ๋น„๋””์˜ค์˜ ๊ฒฝ์šฐ ํ”„๋ ˆ์ž„๋งˆ๋‹ค Temporal ID๊ฐ€ ์ฆ๊ฐ€ํ•˜์—ฌ ์‹œ๊ฐ„ ํ๋ฆ„์„ ์ธ์ฝ”๋”ฉํ•ฉ๋‹ˆ๋‹ค.

ViT์™€ LLM์—์„œ์˜ RoPE ์ฐจ์ด

๊ตฌ์„ฑ ์š”์†Œ ์‚ฌ์šฉ RoPE ์ด์œ 
ViT (Vision Encoder) 2D RoPE ๋‹จ์ผ ์ด๋ฏธ์ง€/ํ”„๋ ˆ์ž„์˜ ํŠน์ง• ์ถ”์ถœ์— ์ง‘์ค‘, ์‹œ๊ฐ„(T) ์ฐจ์› ๋ถˆํ•„์š”
LLM Decoder 3D M-RoPE ํ…์ŠคํŠธ์™€ ์‹œ๊ฐ ๋ฐ์ดํ„ฐ์˜ ํ†ตํ•ฉ ์ฒ˜๋ฆฌ, ์‹œ๊ณต๊ฐ„ ์ •๋ณด ๋ชจ๋‘ ํ•„์š”

RoPE ๊ตฌํ˜„์—์„œ head_dim์˜ ์ ˆ๋ฐ˜์€ ๋†’์ด(h) ์ถ• ๊ธฐ๋ฐ˜, ๋‚˜๋จธ์ง€ ์ ˆ๋ฐ˜์€ ๋„ˆ๋น„(w) ์ถ• ๊ธฐ๋ฐ˜์œผ๋กœ ์ ์šฉ๋˜๋ฉฐ, ๋‘ ๋ถ€๋ถ„์ด ๋™์ผํ•œ ฮธ(๊ฐ๋„ ํŒŒ๋ผ๋ฏธํ„ฐ) ์„ธํŠธ๋ฅผ ๊ณต์œ ํ•ฉ๋‹ˆ๋‹ค.

1.4 Dynamic FPS Sampling๊ณผ Absolute Time Encoding

๋น„๋””์˜ค ์ดํ•ด์—์„œ ๋˜ ๋‹ค๋ฅธ ํ˜์‹ ์€ ๋™์  FPS ์ƒ˜ํ”Œ๋ง์ž…๋‹ˆ๋‹ค. ๊ณ ์ •๋œ ํ”„๋ ˆ์ž„ ๋ ˆ์ดํŠธ ๋Œ€์‹  ๋น„๋””์˜ค์˜ ํŠน์„ฑ์— ๋งž๊ฒŒ ์ƒ˜ํ”Œ๋ง ๋ ˆ์ดํŠธ๋ฅผ ์กฐ์ ˆํ•˜๊ณ , ์ ˆ๋Œ€ ์‹œ๊ฐ„(Absolute Time)์„ ์ธ์ฝ”๋”ฉํ•ฉ๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด, โ€œ์˜์ƒ 1๋ถ„ 23์ดˆ์—์„œ ๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚ฌ๋‚˜์š”?โ€๋ผ๋Š” ์งˆ๋ฌธ์— ์ •ํ™•ํžˆ ํ•ด๋‹น ์‹œ์ ์˜ ์ด๋ฒคํŠธ๋ฅผ ์ฐพ์•„ ๋‹ต๋ณ€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

1.5 ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฐ ๋ชจ๋ธ ๋ณ€ํ˜•

๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠน์ง•
Qwen2.5-VL-3B 30์–ต ๊ฒฝ๋Ÿ‰ ๋ฐฐํฌ์šฉ
Qwen2.5-VL-7B 70์–ต ๊ท ํ˜•์žกํžŒ ์„ฑ๋Šฅ/ํšจ์œจ
Qwen2.5-VL-32B 320์–ต ๊ณ ์„ฑ๋Šฅ ์ถ”๋ก 
Qwen2.5-VL-72B 720์–ต SOTA๊ธ‰ ๋ฒค์น˜๋งˆํฌ ์„ฑ๋Šฅ

ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋Š” ์•ฝ 4.1์กฐ ํ† ํฐ ๊ทœ๋ชจ์˜ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ฝ”ํผ์Šค๋ฅผ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.

1.6 ๋ฒค์น˜๋งˆํฌ ์„ฑ๋Šฅ

Qwen2.5-VL์€ ๋‹ค์–‘ํ•œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๋ฒค์น˜๋งˆํฌ์—์„œ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ž…๋‹ˆ๋‹ค:

๋ฒค์น˜๋งˆํฌ Qwen2.5-VL-72B Qwen2.5-VL-32B ๋น„๊ณ 
MathVista 70.5~74.8 74.7 ์ˆ˜ํ•™์  ์‹œ๊ฐ ์ถ”๋ก 
MMMU 64.5 70.0 ๋Œ€ํ•™ ์ˆ˜์ค€ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ดํ•ด
MMBench-EN 88.6 - ์ข…ํ•ฉ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๋ฒค์น˜๋งˆํฌ
Note๋น„๋””์˜ค ์ดํ•ด ๋Šฅ๋ ฅ

Qwen2.5-VL์€ 1์‹œ๊ฐ„ ์ด์ƒ์˜ ๋น„๋””์˜ค๋ฅผ ์ดํ•ดํ•˜๊ณ , ๋น„๋””์˜ค ๋‚ด ํŠน์ • ์ด๋ฒคํŠธ๊ฐ€ ๋ฐœ์ƒํ•œ ์‹œ๊ฐ„ ๊ตฌ๊ฐ„์„ ์ •ํ™•ํžˆ ์ฐพ์•„๋‚ด๋Š” ๋Šฅ๋ ฅ์„ ๊ฐ–์ถ”๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. Dynamic Resolution์„ ์‹œ๊ฐ„ ์ฐจ์›์œผ๋กœ ํ™•์žฅํ•œ Dynamic FPS ์ƒ˜ํ”Œ๋ง ๋•๋ถ„์ž…๋‹ˆ๋‹ค.


2. Qwen3-VL: Cosmos Reason 2์˜ ๊ธฐ๋ฐ˜

2025๋…„ 9์›” ๊ณต๊ฐœ๋œ Qwen3-VL์€ Qwen ์‹œ๋ฆฌ์ฆˆ VLM์˜ ๊ฐ€์žฅ ๊ฐ•๋ ฅํ•œ ๋ฒ„์ „์ž…๋‹ˆ๋‹ค. ๋‹จ์ˆœํ•œ ์ ์ง„์  ๊ฐœ์„ ์ด ์•„๋‹ˆ๋ผ, ์•„ํ‚คํ…์ฒ˜ ์ˆ˜์ค€์—์„œ ๊ทผ๋ณธ์ ์ธ ์—…๊ทธ๋ ˆ์ด๋“œ๊ฐ€ ์ด๋ฃจ์–ด์กŒ์Šต๋‹ˆ๋‹ค.

2.1 ์„ธ ๊ฐ€์ง€ ํ•ต์‹ฌ ๊ธฐ๋‘ฅ (Three Core Pillars)

  1. ๊ฐ•ํ™”๋œ ์ˆœ์ˆ˜ ํ…์ŠคํŠธ ์ดํ•ด: ๋น„์ „ ๋ชจ๋ธ์ž„์—๋„ ํ…์ŠคํŠธ ์ „์šฉ ๋ฐฑ๋ณธ์„ ๋Šฅ๊ฐ€ํ•˜๋Š” ์–ธ์–ด ๋Šฅ๋ ฅ
  2. ์žฅ๋ฌธ๋งฅ ์ดํ•ด (256K ํ† ํฐ): ๊ธด ๋ฌธ์„œ์™€ ์žฅ์‹œ๊ฐ„ ๋น„๋””์˜ค์˜ ์ •๋ณด๋ฅผ ์—ฐ๊ฒฐํ•˜์—ฌ ์ถ”๋ก 
  3. ๊ณ ๊ธ‰ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ถ”๋ก : MMMU, MathVista ๋“ฑ ๋ณต์žกํ•œ ๋ฒค์น˜๋งˆํฌ์—์„œ ์„ ๋„์  ์„ฑ๋Šฅ

2.2 ์•„ํ‚คํ…์ฒ˜ ํ˜์‹  #1: Interleaved-MRoPE

Qwen2.5-VL์˜ M-RoPE๊ฐ€ ๊ฐ€์ง„ ๋ฌธ์ œ์ ์ด ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค: ์ฃผํŒŒ์ˆ˜ ์ŠคํŽ™ํŠธ๋Ÿผ ๋ถˆ๊ท ํ˜•(Spectral Imbalance). ์‹œ๊ฐ„, ๋†’์ด, ๋„ˆ๋น„์— ํ• ๋‹น๋œ ์ฃผํŒŒ์ˆ˜ ๋Œ€์—ญ์ด ๋ถˆ๊ท ๋“ฑํ•˜์—ฌ ์žฅ์‹œ๊ฐ„ ๋น„๋””์˜ค์—์„œ ์œ„์น˜ ์ •๋ณด๊ฐ€ ์†์‹ค๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Interleaved-MRoPE๋Š” ์ฃผํŒŒ์ˆ˜๋ฅผ ์„ธ ์ฐจ์›์— ๊ท ๋“ฑํ•˜๊ฒŒ ์ธํ„ฐ๋ฆฌ๋น™ํ•˜์—ฌ ๋ฐฐ๋ถ„ํ•ฉ๋‹ˆ๋‹ค:

  • ์‹œ๊ฐ„, ๋„ˆ๋น„, ๋†’์ด ๋ชจ๋‘ ์ „์ฒด ์ฃผํŒŒ์ˆ˜ ์ŠคํŽ™ํŠธ๋Ÿผ ํ™œ์šฉ
  • ๊ธด ๋น„๋””์˜ค์—์„œ๋„ ์œ„์น˜ ์ •๋ณด ๋ณด์กด
  • Position ID ์ฆ๊ฐ€ ์†๋„๊ฐ€ ๊ธฐ์กด RoPE๋ณด๋‹ค ๋А๋ ค ๋” ๊ธด ๋ฌธ๋งฅ ์ฒ˜๋ฆฌ ๊ฐ€๋Šฅ
TipM-RoPE vs Interleaved-MRoPE
ํ•ญ๋ชฉ M-RoPE (Qwen2.5-VL) Interleaved-MRoPE (Qwen3-VL)
์ฃผํŒŒ์ˆ˜ ํ• ๋‹น T, H, W์— ๋ถˆ๊ท ๋“ฑ ๋ฐฐ๋ถ„ ์„ธ ์ฐจ์›์— ๊ท ๋“ฑ ์ธํ„ฐ๋ฆฌ๋น™
๋ฌธ์ œ์  ์žฅ์‹œ๊ฐ„ ๋น„๋””์˜ค์—์„œ ์œ„์น˜ ์ •๋ณด ์†์‹ค ํ•ด๊ฒฐ๋จ
์ŠคํŽ™ํŠธ๋Ÿผ ์ผ๋ถ€ ๋Œ€์—ญ๋งŒ ํ™œ์šฉ ์ „์ฒด ๋Œ€์—ญ ํ™œ์šฉ
ํšจ๊ณผ ~32K ํ† ํฐ 256K ํ† ํฐ ๋„ค์ดํ‹ฐ๋ธŒ ์ง€์›

2.3 ์•„ํ‚คํ…์ฒ˜ ํ˜์‹  #2: DeepStack Integration

๊ธฐ์กด VLM๋“ค์€ ViT์˜ ์ตœ์ข… ๋ ˆ์ด์–ด ์ถœ๋ ฅ๋งŒ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๊ณ ์ˆ˜์ค€ ์˜๋ฏธ ์ •๋ณด๋งŒ ์ „๋‹ฌํ•˜๊ณ  ์ €์ˆ˜์ค€ ์‹œ๊ฐ์  ์„ธ๋ถ€์‚ฌํ•ญ์„ ์žƒ์–ด๋ฒ„๋ฆฌ๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

DeepStack์€ ViT์˜ ์—ฌ๋Ÿฌ ๋ ˆ์ด์–ด์—์„œ ํŠน์ง•์„ ์ถ”์ถœํ•˜์—ฌ LLM์— ์ฃผ์ž…ํ•ฉ๋‹ˆ๋‹ค:

ViT Layer 1  โ†’ ์ €์ˆ˜์ค€ ํŠน์ง• (์—ฃ์ง€, ํ…์Šค์ฒ˜)     โ”€โ”
ViT Layer 6  โ†’ ์ค‘๊ฐ„์ˆ˜์ค€ ํŠน์ง• (ํŒจํ„ด, ํ˜•ํƒœ)      โ”œโ†’ LLM Hidden States์— ์ฃผ์ž…
ViT Layer 12 โ†’ ๊ณ ์ˆ˜์ค€ ํŠน์ง• (๊ฐ์ฒด, ์žฅ๋ฉด)       โ”€โ”˜

๋…ผ๋ฌธ DeepStack (arXiv:2406.04334)์—์„œ ์ œ์•ˆ๋œ ์ด ๋ฐฉ์‹์€ ์„ธ๋ฐ€ํ•œ ์‹œ๊ฐ์  ๋””ํ…Œ์ผ๊ณผ ์ถ”์ƒ์  ์˜๋ฏธ ์ •๋ณด๋ฅผ ๋™์‹œ์— ๋ณด์กดํ•ฉ๋‹ˆ๋‹ค.

ImportantDeepStack์˜ ํ•ต์‹ฌ

Qwen3-VL์˜ ํ…์ŠคํŠธ ์ฒ˜๋ฆฌ ๋ถ€๋ถ„์€ ์ˆœ์ˆ˜ ํ…์ŠคํŠธ ์ „์šฉ ๋ชจ๋ธ์ด ์•„๋‹™๋‹ˆ๋‹ค. DeepStack์ด ์‹œ๊ฐ์  ํŠน์ง•์„ LLM์˜ ์ดˆ๊ธฐ hidden states์— ์ฃผ์ž…ํ•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ํŠน์ง• ํ˜•ํƒœ๋Š” (num_layers, visual_seqlen, embed_dim)์ด๋ฉฐ, ๋น„์ „ ์ธ์ฝ”๋”์˜ ์—ฌ๋Ÿฌ ๋ ˆ์ด์–ด์—์„œ ์ถ”์ถœ๋˜์–ด ๋””์ฝ”๋” hidden states์— ๊ณต๊ธ‰๋ฉ๋‹ˆ๋‹ค.

2.4 ์•„ํ‚คํ…์ฒ˜ ํ˜์‹  #3: Text-Timestamp Alignment

Qwen2.5-VL์€ T-RoPE๋ฅผ ํ†ตํ•ด ๋น„๋””์˜ค์˜ ์‹œ๊ฐ„ ์ •๋ณด๋ฅผ ์•”๋ฌต์ ์œผ๋กœ ์ธ์ฝ”๋”ฉํ–ˆ์Šต๋‹ˆ๋‹ค. Qwen3-VL์€ ์ด๋ฅผ ๋ช…์‹œ์ ์ธ ํ…์ŠคํŠธ ํƒ€์ž„์Šคํƒฌํ”„ ์ •๋ ฌ๋กœ ๋ฐœ์ „์‹œ์ผฐ์Šต๋‹ˆ๋‹ค.

์ƒ์„ฑ๋œ ํ…์ŠคํŠธ๊ฐ€ ๋น„๋””์˜ค์˜ ํŠน์ • ํƒ€์ž„์Šคํƒฌํ”„์™€ ์ง์ ‘ ์—ฐ๊ฒฐ๋˜์–ด, โ€œ00:01:23์— ๋นจ๊ฐ„ ์ž๋™์ฐจ๊ฐ€ ์ขŒํšŒ์ „ํ•ฉ๋‹ˆ๋‹คโ€์™€ ๊ฐ™์€ ์ •๋ฐ€ํ•œ ์‹œ๊ฐ„์  ๊ทธ๋ผ์šด๋”ฉ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

2.5 ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฐ ๋ชจ๋ธ ๋ณ€ํ˜•

Qwen3-VL์€ Dense์™€ MoE(Mixture of Experts) ๋‘ ๊ฐ€์ง€ ์•„ํ‚คํ…์ฒ˜๋กœ ์ œ๊ณต๋ฉ๋‹ˆ๋‹ค:

๋ชจ๋ธ ์ด ํŒŒ๋ผ๋ฏธํ„ฐ ํ™œ์„ฑ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠน์ง•
Qwen3-VL-2B 20์–ต 20์–ต ์—ฃ์ง€ ๋ฐฐํฌ์šฉ
Qwen3-VL-4B 40์–ต 40์–ต ๋ชจ๋ฐ”์ผ/์ž„๋ฒ ๋””๋“œ
Qwen3-VL-8B 87.7์–ต 87.7์–ต ๊ท ํ˜•์žกํžŒ ์„ฑ๋Šฅ
Qwen3-VL-32B 320์–ต 320์–ต ๊ณ ์„ฑ๋Šฅ ์›Œํฌ๋กœ๋“œ
Qwen3-VL-30B-A3B 300์–ต 30์–ต MoE ํšจ์œจ์„ฑ
Qwen3-VL-235B-A22B 2350์–ต 220์–ต ์ตœ๊ณ  ์„ฑ๋Šฅ MoE

MoE ๋ชจ๋ธ์€ ์ด ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” ํฌ์ง€๋งŒ ์ถ”๋ก  ์‹œ ์ผ๋ถ€ ์ „๋ฌธ๊ฐ€๋งŒ ํ™œ์„ฑํ™”๋˜์–ด ํšจ์œจ์„ฑ์„ ํ™•๋ณดํ•ฉ๋‹ˆ๋‹ค. ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋Š” 36์กฐ+ ํ† ํฐ, 119๊ฐœ ์–ธ์–ด๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.

2.6 ์•„ํ‚คํ…์ฒ˜ ๊ตฌ์„ฑ ์š”์†Œ

Qwen3-VL์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํ˜„๋Œ€์  ์•„ํ‚คํ…์ฒ˜ ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ ํ†ตํ•ฉํ•ฉ๋‹ˆ๋‹ค:

๊ตฌ์„ฑ ์š”์†Œ ์„ค๋ช…
Grouped Query Attention (GQA) ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์ ์ธ ์–ดํ…์…˜
SwiGLU ํ™œ์„ฑํ™” ํ–ฅ์ƒ๋œ ๋น„์„ ํ˜• ํ‘œํ˜„๋ ฅ
RoPE + ๊ณ ๊ธ‰ ์ฃผํŒŒ์ˆ˜ ์Šค์ผ€์ผ๋ง Interleaved-MRoPE
RMSNorm ์‚ฌ์ „ ์ •๊ทœํ™” ํ•™์Šต ์•ˆ์ •์„ฑ
NoteMoE ๊ฐœ์„ ์ 

Qwen3-MoE๋Š” ์ด์ „ Qwen2.5-MoE์™€ ๋‹ฌ๋ฆฌ ๊ณต์œ  ์ „๋ฌธ๊ฐ€(shared experts)๋ฅผ ์ œ๊ฑฐํ•˜๊ณ  ๊ธ€๋กœ๋ฒŒ ๋ฐฐ์น˜ ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ „๋ฌธ๊ฐ€๋“ค์˜ ํŠนํ™”๋œ ํ–‰๋™์„ ์žฅ๋ คํ•ฉ๋‹ˆ๋‹ค.

2.7 ๋ฒค์น˜๋งˆํฌ ์„ฑ๋Šฅ

Qwen3-VL-235B-A22B๋Š” ์—ฌ๋Ÿฌ ๋ฒค์น˜๋งˆํฌ์—์„œ SOTA ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค:

๋ฒค์น˜๋งˆํฌ Qwen3-VL-235B GPT-5 Gemini 2.5 Pro
MathVista 85.8% 81.3% -
MathVision 74.6% 65.8% 73.3%
AIMEโ€™24 85.7 - -
LiveCodeBench v5 70.7 - -
CodeForces ELO 2,056 - -
Tip2์‹œ๊ฐ„ ๋น„๋””์˜ค ๋ถ„์„

Qwen3-VL์€ 2์‹œ๊ฐ„ ๋ถ„๋Ÿ‰์˜ ๋น„๋””์˜ค๋ฅผ ์Šค์บ”ํ•˜์—ฌ ๊ฑฐ์˜ ๋ชจ๋“  ์„ธ๋ถ€์‚ฌํ•ญ์„ ์ •ํ™•ํžˆ ์ฐพ์•„๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. 256K ํ† ํฐ ์ปจํ…์ŠคํŠธ ์œˆ๋„์šฐ์™€ Interleaved-MRoPE ๋•๋ถ„์ž…๋‹ˆ๋‹ค.


3. NVIDIA Cosmos Reason: Physical AI๋กœ์˜ ํŠนํ™”

3.1 Cosmos Reason 1 (Qwen2.5-VL ๊ธฐ๋ฐ˜)

NVIDIA๋Š” Qwen2.5-VL-7B๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ Cosmos-Reason1-7B๋ฅผ ๊ฐœ๋ฐœํ–ˆ์Šต๋‹ˆ๋‹ค. ์ถ”๊ฐ€๋กœ 56B ๋ฒ„์ „๋„ ์ œ๊ณต๋ฉ๋‹ˆ๋‹ค.

ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ (2๋‹จ๊ณ„):

  1. Physical AI SFT: ๋ฌผ๋ฆฌ์  ์ƒ์‹(Physical Common Sense) ๋ฐ์ดํ„ฐ๋กœ Supervised Fine-tuning
  2. Physical AI RL: ๋ฌผ๋ฆฌ ๋ฒ•์น™์— ๋งž๋Š” ์ถ”๋ก ์„ ๊ฐ•ํ™”ํ•™์Šต์œผ๋กœ ์ตœ์ ํ™”

์ฃผ์š” ํŠน์ง•:

  • Chain-of-Thought: <think> ํƒœ๊ทธ๋ฅผ ํ†ตํ•œ ๋‹จ๊ณ„๋ณ„ ์ถ”๋ก  ๊ณผ์ • ๋ช…์‹œ
  • Embodied Reasoning: ๋‹จ์ˆœํ•œ ์ดํ•ด๋ฅผ ๋„˜์–ด ํ–‰๋™ ๊ณ„ํš๊นŒ์ง€ ์ถ”๋ก 
# ์ถ”๋ก  ํ”„๋กฌํ”„ํŠธ ์˜ˆ์‹œ
system: "You are a helpful assistant. Answer the question in the following format:
<think>
your reasoning
</think>

<answer>
your answer
</answer>"

3.2 Cosmos Reason 2 (Qwen3-VL ๊ธฐ๋ฐ˜)

2025๋…„ 12์›” ๊ณต๊ฐœ๋œ Cosmos Reason 2๋Š” Qwen3-VL ๊ธฐ๋ฐ˜์œผ๋กœ ๋Œ€ํญ ๊ฐ•ํ™”๋˜์—ˆ์Šต๋‹ˆ๋‹ค. CES 2026์—์„œ NVIDIA๋Š” ์ด๋ฅผ Physical AI์˜ ํ•ต์‹ฌ ๊ตฌ์„ฑ ์š”์†Œ๋กœ ๋ฐœํ‘œํ–ˆ์Šต๋‹ˆ๋‹ค.

๊ฐœ์„  ์‚ฌํ•ญ ์ƒ์„ธ
๊ธด ๋ฌธ๋งฅ 16K โ†’ 256K ํ† ํฐ
๊ณต๊ฐ„ ์ธ์ง€ 2D/3D ์ขŒํ‘œ, ๋ฐ”์šด๋”ฉ ๋ฐ•์Šค, ๊ถค์ (trajectory) ์ถœ๋ ฅ
์‹œ๊ฐ„ ์ •๋ฐ€๋„ ํƒ€์ž„์Šคํƒฌํ”„ ๊ธฐ๋ฐ˜ ์ด๋ฒคํŠธ ๋กœ์ปฌ๋ผ์ด์ œ์ด์…˜
OCR ์ง€์› ํ…์ŠคํŠธ ์ธ์‹ ๋ฐ ์ถ”์ถœ
๋ชจ๋ธ ํฌ๊ธฐ 2B, 8B ์˜ต์…˜์œผ๋กœ ์—ฃ์ง€๋ถ€ํ„ฐ ํด๋ผ์šฐ๋“œ๊นŒ์ง€ ๋ฐฐํฌ
Noteํ•จ๊ป˜ ์ถœ์‹œ๋œ ๋ชจ๋ธ๋“ค (CES 2026)

Cosmos Reason 2์™€ ํ•จ๊ป˜ Cosmos Predict 2.5, Cosmos Transfer 2.5, Isaac GR00T N1.6 ๋กœ๋ด‡ ํŒŒ์šด๋ฐ์ด์…˜ ๋ชจ๋ธ์ด ๊ณต๊ฐœ๋˜์–ด ๋กœ๋ณดํ‹ฑ์Šค ๊ฐœ๋ฐœ ๋ฐ ๋ฐฐํฌ๋ฅผ ๊ฐ€์†ํ™”ํ•ฉ๋‹ˆ๋‹ค.

3.3 Physical AI ํŠนํ™” ๊ธฐ๋Šฅ

์ผ๋ฐ˜ VLM๊ณผ Cosmos Reason์˜ ์ฐจ๋ณ„์ :

  1. ๋ฌผ๋ฆฌ์  ์ƒ์‹ ์ถ”๋ก : ๋‰ดํ„ด ์—ญํ•™, ์ค‘๋ ฅ, ์ถฉ๋Œ ์˜ˆ์ธก ๋“ฑ ๋ฌผ๋ฆฌ ๋ฒ•์น™ ๊ธฐ๋ฐ˜ ์ถ”๋ก 
  2. Embodied Reasoning: โ€œ๋กœ๋ด‡ ๊ทธ๋ฆฌํผ๊ฐ€ ํ…Œ์ดํ”„๋ฅผ ์ง‘์–ด ๋ฐ”๊ตฌ๋‹ˆ์— ๋„ฃ์œผ๋ ค๋ฉด?โ€ ๊ฐ™์€ ํ–‰๋™ ๊ณ„ํš
  3. ๊ถค์  ์ขŒํ‘œ ์ถœ๋ ฅ: ๋‹จ์ˆœ ํ…์ŠคํŠธ๊ฐ€ ์•„๋‹Œ JSON ํ˜•์‹์˜ trajectory ๋ฐ์ดํ„ฐ ์ƒ์„ฑ
ImportantPhysical AI vs ์ผ๋ฐ˜ AI ์ถ”๋ก 

Physical AI๋Š” ๋™์ ์ด๊ณ  ๋ถˆํ™•์‹คํ•œ ์‹ค์„ธ๊ณ„ ํ™˜๊ฒฝ์—์„œ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ์ˆ˜ํ•™์ด๋‚˜ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ์ถ”์ƒ์  ์ถ”๋ก ๊ณผ ๋‹ฌ๋ฆฌ, embodied reasoning์€ AI ์‹œ์Šคํ…œ์ด ๋ฌผ๋ฆฌ ์„ธ๊ณ„์™€ ์ƒํ˜ธ์ž‘์šฉํ•˜๊ณ  ํ•™์Šตํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ํ˜„์žฌ ๊ด€์ธก๋ฟ ์•„๋‹ˆ๋ผ ๋ฏธ๋ž˜์˜ ๋ถˆํ™•์‹คํ•œ ํ™˜๊ฒฝ์—์„œ ์ง€๋Šฅ์ ์ธ ํ–‰๋™์„ ๊ณ„ํšํ•˜๋Š” ๋Šฅ๋ ฅ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

// Cosmos Reason 2 ์ถœ๋ ฅ ์˜ˆ์‹œ
{
  "steps": ["์ ‘๊ทผ", "์ง‘๊ธฐ", "์ด๋™", "๋†“๊ธฐ"],
  "trajectory": [
    {"x": 0.2, "y": 0.5, "z": 0.1},
    {"x": 0.3, "y": 0.4, "z": 0.15},
    ...
  ]
}

3.4 ์ฃผ์š” ํ™œ์šฉ ์‚ฌ๋ก€

๋กœ๋ด‡ ๊ณ„ํš ๋ฐ ์ถ”๋ก :

Cosmos Reason์€ ๋กœ๋ด‡ VLA(Vision-Language-Action) ๋ชจ๋ธ์—์„œ ์‹ ์ค‘ํ•˜๊ณ  ์ฒด๊ณ„์ ์ธ ์˜์‚ฌ๊ฒฐ์ •์„ ์œ„ํ•œ ๋‘๋‡Œ ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค. GR00T๋Š” Cosmos Reason์„ ๋‘๋‡Œ๋กœ ์‚ฌ์šฉํ•˜์—ฌ ํœด๋จธ๋…ธ์ด๋“œ์˜ ์ „์‹  ์ œ์–ด๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

๋ฐ์ดํ„ฐ ์–ด๋…ธํ…Œ์ด์…˜:

๋Œ€๊ทœ๋ชจ์˜ ๋‹ค์–‘ํ•œ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด ์ž๋™์œผ๋กœ ๊ณ ํ’ˆ์งˆ ์–ด๋…ธํ…Œ์ด์…˜๊ณผ ๋น„ํ‰์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์‹ค์ œ ๋˜๋Š” ํ•ฉ์„ฑ ์ƒ์„ฑ๋œ ํ›ˆ๋ จ ๋น„๋””์˜ค์— ๋Œ€ํ•ด ํƒ€์ž„์Šคํƒฌํ”„์™€ ์ƒ์„ธํ•œ ์„ค๋ช…์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

์‚ฐ์—… ์ฑ„ํƒ ํ˜„ํ™ฉ:

๊ธฐ์—… ํ™œ์šฉ ์‚ฌ๋ก€
Uber ์ž์œจ์ฃผํ–‰ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์˜ ์ •ํ™•ํ•˜๊ณ  ๊ฒ€์ƒ‰ ๊ฐ€๋Šฅํ•œ ๋น„๋””์˜ค ์บก์…˜ ์ƒ์„ฑ
Salesforce Cobalt ๋กœ๋ด‡ ์˜์ƒ ๋ถ„์„์„ ํ†ตํ•œ ์ž‘์—…์žฅ ์•ˆ์ „ ๋ฐ ๊ทœ์ • ์ค€์ˆ˜
Milestone ๊ตํ†ต AI ์—์ด์ „ํŠธ
Hitachi ์ž‘์—…์žฅ ์ƒ์‚ฐ์„ฑ AI ์—์ด์ „ํŠธ

4. ๋น„๊ต ๋ถ„์„: ํ•œ๋ˆˆ์— ๋ณด๊ธฐ

4.1 ๊ธฐ๋ฐ˜ ๋ชจ๋ธ ๋น„๊ต (Qwen2.5-VL vs Qwen3-VL)

ํ•ญ๋ชฉ Qwen2.5-VL Qwen3-VL
์ถœ์‹œ 2025๋…„ 1์›” 2025๋…„ 9์›”
๋ฌธ๋งฅ ๊ธธ์ด ~32K ํ† ํฐ 256K ํ† ํฐ (YaRN์œผ๋กœ 1M๊นŒ์ง€)
์œ„์น˜ ์ธ์ฝ”๋”ฉ M-RoPE Interleaved-MRoPE
๋น„์ „-์–ธ์–ด ์œตํ•ฉ ๋‹จ์ผ ๋ ˆ์ด์–ด DeepStack (๋‹ค์ธต ์œตํ•ฉ)
์‹œ๊ฐ„ ์ธ์ฝ”๋”ฉ T-RoPE Text-Timestamp Alignment
์•„ํ‚คํ…์ฒ˜ Dense๋งŒ Dense + MoE
์ตœ๋Œ€ ๋ชจ๋ธ 72B 235B (A22B ํ™œ์„ฑ)
ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ 4.1T ํ† ํฐ 36T+ ํ† ํฐ
์–ธ์–ด ์ง€์› ๋‹ค๊ตญ์–ด 119๊ฐœ ์–ธ์–ด
๋น„๋””์˜ค ์ฒ˜๋ฆฌ 1์‹œ๊ฐ„+ 2์‹œ๊ฐ„+

4.2 Cosmos Reason ๋น„๊ต (Reason 1 vs Reason 2)

ํ•ญ๋ชฉ Cosmos Reason 1 Cosmos Reason 2
๊ธฐ๋ฐ˜ Qwen2.5-VL-7B Qwen3-VL (2B/8B)
์ถœ์‹œ 2025๋…„ 3์›” 2025๋…„ 12์›”
๋ฌธ๋งฅ 16K ํ† ํฐ 256K ํ† ํฐ
์ถœ๋ ฅ ํ…์ŠคํŠธ ์ถ”๋ก  ํ…์ŠคํŠธ + ๊ถค์  ์ขŒํ‘œ
๊ณต๊ฐ„ ์ธ์ง€ ์ œํ•œ์  2D/3D ์ขŒํ‘œ, ๋ฐ”์šด๋”ฉ ๋ฐ•์Šค, ๊ถค์ 
๋ฐฐํฌ ํฌ๊ธฐ 7B, 56B 2B, 8B
GPU ์š”๊ตฌ ~16GB (7B) 24GB (2B), 32GB (8B)
๋ผ์ด์„ ์Šค NVIDIA Open Model License NVIDIA Open Model License

5. ๋กœ๋ณดํ‹ฑ์Šค ์—ฐ๊ตฌ์ž๋ฅผ ์œ„ํ•œ ์‹œ์‚ฌ์ 

5.1 ์™œ Qwen VL ๊ธฐ๋ฐ˜์ธ๊ฐ€?

NVIDIA๊ฐ€ ์ž์ฒด ๊ฐœ๋ฐœ ๋Œ€์‹  Qwen์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์„ ํƒํ•œ ์ด์œ :

  1. ์˜คํ”ˆ์†Œ์Šค ์ƒํƒœ๊ณ„: Apache 2.0 / ์—ฐ๊ตฌ ๋ผ์ด์„ ์Šค๋กœ ์ปค์Šคํ„ฐ๋งˆ์ด์ง• ์šฉ์ด
  2. ๊ฒ€์ฆ๋œ ์„ฑ๋Šฅ: ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๋ฒค์น˜๋งˆํฌ์—์„œ ์ƒ์œ„๊ถŒ ์œ ์ง€
  3. ํšจ์œจ์ ์ธ ์•„ํ‚คํ…์ฒ˜: Dynamic Resolution, M-RoPE ๋“ฑ ํ˜์‹ ์  ์„ค๊ณ„
  4. ํ™œ๋ฐœํ•œ ์—…๋ฐ์ดํŠธ: ๋น ๋ฅธ ๊ฐœ๋ฐœ ์‚ฌ์ดํด (6๊ฐœ์›” ๋‚ด 2.5โ†’3 ์—…๊ทธ๋ ˆ์ด๋“œ)

5.2 Allegro Hand ์—ฐ๊ตฌ์—์˜ ์ ์šฉ ๊ฐ€๋Šฅ์„ฑ

Cosmos Reason์„ ์† ๋กœ๋ด‡ ์—ฐ๊ตฌ์— ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์‹œ๋‚˜๋ฆฌ์˜ค:

  1. ๋น„๋””์˜ค ๊ธฐ๋ฐ˜ ์‹œ์—ฐ ํ•™์Šต: ์‚ฌ๋žŒ์˜ ์กฐ์ž‘ ์˜์ƒ์„ ๋ถ„์„ํ•˜์—ฌ grasp ๊ณ„ํš ์ƒ์„ฑ
  2. ๋ฌผ๋ฆฌ์  ์ถ”๋ก : โ€œ์ด ๋ฌผ์ฒด๋ฅผ ์ง‘์œผ๋ ค๋ฉด ์–ด๋–ค ์†๊ฐ€๋ฝ ๋ฐฐ์น˜๊ฐ€ ํ•„์š”ํ•œ๊ฐ€?โ€
  3. ์‹คํŒจ ๋ถ„์„: ์กฐ์ž‘ ์‹คํŒจ ์˜์ƒ์—์„œ ์›์ธ ์ถ”๋ก 
  4. ๋ฐ์ดํ„ฐ ์–ด๋…ธํ…Œ์ด์…˜: ๋Œ€๊ทœ๋ชจ ์กฐ์ž‘ ๋ฐ์ดํ„ฐ์…‹ ์ž๋™ ๋ผ๋ฒจ๋ง

5.3 ์‹ค์šฉ์  ๋ฐฐํฌ ๊ณ ๋ ค์‚ฌํ•ญ

๋ชจ๋ธ GPU ๋ฉ”๋ชจ๋ฆฌ ์ถ”์ฒœ ์šฉ๋„
Cosmos-Reason2-2B 24GB ์—ฃ์ง€ ๋””๋ฐ”์ด์Šค, ์‹ค์‹œ๊ฐ„ ์ถ”๋ก 
Cosmos-Reason2-8B 32GB ์—ฐ๊ตฌ ์›Œํฌ์Šคํ…Œ์ด์…˜, ๊ณ ํ’ˆ์งˆ ์ถ”๋ก 

vLLM์„ ํ†ตํ•œ ๋ฐฐํฌ ์˜ˆ์‹œ:

vllm serve nvidia/Cosmos-Reason2-8B \
  --max-model-len 16384 \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --reasoning-parser qwen3 \
  --port 8000

6. ๊ฒฐ๋ก  ๋ฐ ์ „๋ง

NVIDIA Cosmos Reason์€ ์ˆœ์ˆ˜ ์–ธ์–ด ๋ชจ๋ธ์˜ ์ถ”๋ก  ๋Šฅ๋ ฅ์„ ๋ฌผ๋ฆฌ ์„ธ๊ณ„๋กœ ํ™•์žฅํ•˜๋Š” ์ค‘์š”ํ•œ ์ด์ •ํ‘œ์ž…๋‹ˆ๋‹ค. Qwen VL ์‹œ๋ฆฌ์ฆˆ๋ผ๋Š” ๊ฐ•๋ ฅํ•œ ์˜คํ”ˆ์†Œ์Šค ๊ธฐ๋ฐ˜ ์œ„์— Physical AI ํŠนํ™” ํ›ˆ๋ จ์„ ๋”ํ•ด, ๋กœ๋ด‡์ด โ€œ๋ณด๊ณ  ์ƒ๊ฐํ•˜๊ณ  ํ–‰๋™ํ•˜๋Š”โ€ ๋Šฅ๋ ฅ์„ ํš๋“ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ…Œ์ดํฌ์–ด์›จ์ด:

  • Qwen2.5-VL์˜ M-RoPE์™€ Dynamic Resolution์€ ์ด๋ฏธ์ง€/๋น„๋””์˜ค ์ดํ•ด์˜ ๊ธฐ๋ฐ˜
  • Qwen3-VL์˜ Interleaved-MRoPE, DeepStack, Text-Timestamp๋Š” ์žฅ์‹œ๊ฐ„ ๋น„๋””์˜ค์™€ ์ •๋ฐ€ํ•œ ์‹œ๊ณต๊ฐ„ ์ถ”๋ก ์„ ์œ„ํ•œ ์ง„ํ™”
  • Cosmos Reason์€ ์ด ๊ธฐ๋ฐ˜ ์œ„์— ๋ฌผ๋ฆฌ์  ์ƒ์‹, ํ–‰๋™ ๊ณ„ํš, ๊ถค์  ์ถœ๋ ฅ ๋Šฅ๋ ฅ์„ ์ถ”๊ฐ€

๋กœ๋ณดํ‹ฑ์Šค ๋ถ„์•ผ์—์„œ VLA(Vision-Language-Action) ๋ชจ๋ธ์ด ์ฃผ๋ชฉ๋ฐ›๋Š” ์ง€๊ธˆ, Cosmos Reason๊ณผ ๊ฐ™์€ ์ถ”๋ก  VLM์€ perception๊ณผ action ์‚ฌ์ด์˜ โ€œthinkingโ€ ๋ ˆ์ด์–ด๋กœ์„œ ํ•ต์‹ฌ์ ์ธ ์—ญํ• ์„ ํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.


์ฐธ๊ณ  ์ž๋ฃŒ

๋…ผ๋ฌธ ๋ฐ ๊ธฐ์ˆ  ๋ณด๊ณ ์„œ

  • Qwen2.5-VL Technical Report (arXiv:2502.13923)
  • Qwen3-VL Technical Report (arXiv:2511.21631)
  • Qwen2-VL: Enhancing Vision-Language Modelโ€™s Perception (arXiv:2409.12191)
  • DeepStack: Deeply Stacking Visual Tokens (arXiv:2406.04334)
  • Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning (arXiv:2503.15558)

๋ชจ๋ธ ์ €์žฅ์†Œ

  • Qwen2.5-VL Collection (Hugging Face)
  • Qwen3-VL GitHub
  • Qwen3-VL-32B-Instruct (Hugging Face)
  • Cosmos-Reason1 GitHub
  • Cosmos-Reason2 GitHub
  • Cosmos-Reason2 Collection (Hugging Face)

NVIDIA ๊ณต์‹ ์ž๋ฃŒ

  • NVIDIA Cosmos Documentation
  • Cosmos Cookbook (๋ฐฐํฌ ๊ฐ€์ด๋“œ)
  • NVIDIA Developer Blog - Cosmos Reason
  • NVIDIA Newsroom: Physical AI Models (CES 2026)
  • Cosmos Reason 2 Hugging Face Blog

๊ธฐ์ˆ  ๋ถ„์„ ์ž๋ฃŒ

  • Qwen2.5-VL Transformers Documentation
  • Qwen3-VL: DeepStack Fusion, Interleaved-MRoPE (The Salt)
  • DeepWiki: Qwen2.5-VL Model Architecture

Copyright 2026, JungYeon Lee