Curieux.JY
  • JungYeon Lee

📃SmolVLA

vla
A Vision-Language-Action Model for Affordable and Efficient Robotics
Published

May 10, 2026

  • Paper Link
  • Code Link
  1. ✨ SmolVLA is a small Vision-Language-Action (VLA) model aimed at affordable, efficient robotics, designed to overcome the high cost and limited deployability of existing VLA models.
  2. 💡 The model maximizes efficiency through VLM layer skipping, a minimal number of visual tokens, pretraining on community datasets, and an asynchronous inference stack that reduces latency.
  3. 🚀 SmolVLA performs on par with much larger VLA models, achieving competitive results on both simulation and real-robot benchmarks while substantially cutting training and inference costs.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

SmolVLA proposes a small, efficient Vision-Language-Action (VLA) model for low-cost robotics. It targets a problem with existing VLA models: with billions of parameters, they are expensive to train and hard to deploy in the real world. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs, improving accessibility and cutting cost substantially while keeping performance competitive.

The main contributions are:

  1. Lightweight architecture: a compact, efficient model built through layer skipping in the VLM, a minimal number of visual tokens, a small pretrained VLM, and interleaved cross-attention and self-attention layers.
  2. Pretraining on community-driven datasets: the model is trained end-to-end on fewer than 30,000 episodes from publicly available, community-contributed datasets, and shows strong performance with far less data than prior work.
  3. Asynchronous inference stack: an optimized inference stack that decouples observation processing and action prediction from action execution to reduce latency, and that enables higher control rates through chunked action generation.

Core Methodology

1. Model Architecture

SmolVLA consists of two main components: a compact pretrained VLM responsible for perception, and an action expert. The VLM processes multiple RGB images, a language instruction, and the sensorimotor state, and produces the features that condition the action expert. The action expert then outputs a chunk of low-level continuous actions based on these features.

  • Vision-Language Model (VLM): SmolVLA uses SmolVLM-2 (Marafioti et al., 2025), an efficient model optimized for multi-image and video input, as its VLM backbone. SmolVLM-2 encodes visual features with SigLIP (Zhai et al., 2023) and uses the SmolLM2 language decoder (Allal et al., 2025).
    • Visual Token Reduction: For efficiency, SmolVLA does not use image tiling. It uses only the global image and applies a pixel shuffle operation to limit the number of visual tokens to 64 per frame.
    • Layer Skipping for Faster Inference: Building on prior work showing that layers of a pretrained model can be skipped without hurting performance, SmolVLA uses VLM features only up to the N-th layer instead of the final layer (L is the total number of layers). The paper sets N = L/2 to balance speed and performance, which halves the computational cost of both the LLM and the action expert.
  • State, Action, Feature Projectors: Linear projection layers map the sensorimotor state to the VLM dimension, map actions to the action expert dimension, and map VLM features to the action expert dimension.
  • Flow Matching Action Expert (v_\theta): A Transformer-based model trained to predict the action chunk A_t = (a_t, \dots, a_{t+n}) from VLM features.
    • Training Objective: The action expert is trained with a Flow Matching objective: \mathcal{L}^{\tau}(\theta) = \mathbb{E}_{p(A_t|o_t),\, q(A^\tau_t|A_t)} [\|v_\theta(A^\tau_t, o_t) - u(A^\tau_t|A_t)\|^2], where o_t denotes the observation features extracted from the N-th VLM layer and A^\tau_t = \tau A_t + (1-\tau)\epsilon. Here \epsilon \sim \mathcal{N}(0, I) is a noise vector and u(A^\tau_t|A_t) = \epsilon - A_t is the target vector field, which the action expert v_\theta is trained to output. \tau is sampled from a Beta distribution.
    • Interleaved Cross and Causal Self-Attention Layers: Unlike earlier VLA architectures, the action expert interleaves cross-attention (CA) and self-attention (SA) layers. Each block contains either a CA layer or an SA layer.
      • CA layers cross-attend to the VLM's keys and values, which is how the action tokens interact with the VLM features.
      • SA layers let the action tokens attend to one another, with a causal attention mask so that each action token attends only to earlier tokens in the chunk. This contributes to smoother action chunks on real robots.
    • The action expert's hidden size is reduced to 0.75 \times d, where d is the VLM's hidden dimension, for additional efficiency.

2. Pretraining Data Collected by the Community

Robotics data is far smaller in scale than vision and language data, and data heterogeneity (different robot embodiments, sensors, actuation modes, control frequencies, and so on) has been a major obstacle. SmolVLA builds on the observation that low-end robot platforms and standardized robotics libraries are easing this heterogeneity, and it makes use of community datasets.

  • Dataset Source: 481 community datasets obtained from Hugging Face, containing about 22.9K episodes and 10.6M frames.
  • Task Annotation: To deal with the noisy task annotations in community datasets, an off-the-shelf VLM (Qwen2.5-VL-3B-Instruct) was used to automatically generate concise task descriptions.
  • Camera Viewpoint Normalization: To deal with highly variable camera naming conventions, each camera was manually mapped to a standardized view type (top, wrist, side) and renamed to OBS_IMAGE_1, OBS_IMAGE_2, or OBS_IMAGE_3.

3. Asynchronous Inference

A conventional visuomotor policy outputs an action chunk A_t = (a_t, \dots, a_{t+n}), and the robot passes a new observation o_{t+n} to the policy to predict the next chunk only after it has executed the whole chunk (synchronous inference). This leaves the robot running open loop between observations, and it causes latency and robot idle time. SmolVLA introduces an asynchronous inference stack (Algorithm 1) to address this.

  • Principle: The RobotClient sends an observation o_t to the PolicyServer and receives the action chunk A_t once inference completes. The key is to trigger chunk prediction while the robot is still consuming the queue it already has, which avoids execution lag.
  • Implementation: When the fraction of actions remaining in the queue drops below a threshold (|A_t|/n < g), the RobotClient captures a new observation and sends it to the PolicyServer. When the new chunk arrives, the portion that overlaps with the existing queue is aggregated.
  • Efficiency: This scheme processes observations more often, which tightens the control loop and removes idle gaps. Decoupling action prediction from action execution also makes it possible to use more powerful compute on a remote policy server. A similarity filter prevents redundant processing of near-duplicate observations.

Experimental Results

SmolVLA was evaluated in simulation (LIBERO and Meta-World) and in the real world (SO100 and SO101 robot arms). In simulation, SmolVLA outperformed the Octo, OpenVLA, and Diffusion Policy baselines, and matched or exceeded the pretrained π₀ (3.3B parameters). In the real world it achieved higher success rates than ACT and π₀, and it demonstrated in-distribution and out-of-distribution generalization on the SO101 robot in particular.

Ablation studies confirmed the importance of the key design choices: the benefit of interleaving cross-attention and self-attention, the importance of the causal attention mask, the efficiency of using the VLM's early layers, the advantage of the Flow Matching objective, and the benefit of feeding the sensorimotor state to the VLM. Asynchronous inference completed tasks about 30% faster than synchronous inference while keeping a similar success rate.

SmolVLA is a compact, efficient VLA model that runs on inexpensive hardware, controls low-cost robots, and delivers performance competitive with much larger VLA models. The work makes robotics research more accessible and offers useful guidance for future VLA model design and inference strategies.

🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Introduction: A Small Giant Arrives

Imagine a robot carrying out a task from nothing more than the sentence "pick up the red cube and put it in the box." The kind of model that makes this possible is the VLA (Vision-Language-Action) model. There is one uncomfortable truth, though. The VLAs that work well so far are mostly giants with billions of parameters: π₀ has 3.3B, OpenVLA has 7B. Training these models takes a datacenter-scale GPU cluster, and running them on a real robot takes an expensive edge GPU on top of that. For a researcher who got started with a low-cost robot such as the roughly $100 SO-ARM100, they are effectively out of reach.

SmolVLA, published by researchers at Hugging Face and Sorbonne University, goes directly against this paradigm. It has 450M parameters, about 1/7 the size of π₀. It can be trained on a single GPU, inference runs even on a CPU, and its performance is still comparable to models 10 times larger. This post digs into how that is possible: what the authors cut, what they kept, and why it works.


1. Problem Definition: The Three Shadows of Giant VLAs

The appeal of the VLA line of research is clear. Take a VLM (Vision-Language Model) that has already been pretrained at internet scale, teach it robot actions, and you get a generalist policy that inherits its common sense and reasoning ability. RT-2 first showed that this was possible, and OpenVLA, π₀, GR00T N1, and others built on it.

That promise comes with three shadows.

First, the cost wall. Training a 7B-parameter model takes at least hundreds of GPU-hours. Inference is demanding as well: real-time control (30 Hz or higher) requires an expensive GPU. University labs and individual researchers have a hard time getting in.

Second, closed data. Existing VLAs rely on carefully curated academic and industrial datasets such as Open X-Embodiment and DROID. Meanwhile, LeRobot and the Hugging Face hub host hundreds of community datasets collected by ordinary users with low-cost robots such as the SO-100. Unlike academic datasets, this data is noisy and unstandardized, but it also captures real-world diversity. Nobody had properly tapped this vein.

Third, an inefficient inference structure. Most VLAs run a synchronous loop: observe, predict an action chunk (n steps), execute the entire chunk, observe again. In this structure the robot sits idle (the idle gap) while the model computes the next chunk. The smaller the model, the shorter the idle gap; the larger the model, the less responsive the robot becomes to changes in the environment.

SmolVLA tries to lift all three shadows at once.

graph LR
    A[Problems with existing VLAs] --> B[Training cost]
    A --> C[Closed data]
    A --> D[Inefficient inference]
    B --> E[SmolVLA: compact 450M design]
    C --> F[SmolVLA: 481 community datasets]
    D --> G[SmolVLA: asynchronous inference stack]
    E --> H[Low-cost, accessible VLA]
    F --> H
    G --> H


2. SmolVLA at a Glance

Let's sketch the overall structure first. With the big picture in mind, the details are much easier to absorb.

+------------------------------------------------------------+
|                       SmolVLA Overview                     |
+------------------------------------------------------------+
|                                                            |
|  [Lang Instruction]   [RGB Images]     [Robot State]       |
|        |                  |                 |              |
|        v                  v                 v              |
|   tokenizer         SigLIP encoder    Linear projector     |
|        |                  |                 |              |
|        +--------+---------+-----------------+              |
|                 v                                          |
|          +--------------+                                  |
|          |  SmolLM-2    |                                  |
|          |  (first N    |  <-- skip last (L-N) layers      |
|          |  layers only)|                                  |
|          +------+-------+                                  |
|                 | VLM features (o_t)                       |
|                 v                                          |
|   +-----------------------------------+                    |
|   |        Action Expert v_theta      |                    |
|   |  (CA <-> SA interleaved blocks)   |                    |
|   |  trained with Flow Matching       |                    |
|   +------------+----------------------+                    |
|                |                                           |
|                v                                           |
|       Action chunk: a_t, a_{t+1}, ..., a_{t+n}             |
|                                                            |
+------------------------------------------------------------+

The key design points come down to five.

| Design decision | Rationale |
|---|---|
| Compact VLM (SmolVLM-2, ~500M class) | Use a small pretrained multimodal model instead of a giant LLM |
| VLM layer skipping (N = L/2) | Drop the last half of the layers, halving compute |
| 64 visual tokens per frame | Global image only, no tiling, plus pixel shuffle |
| Flow Matching Action Expert | Smoothly represents the multimodal distribution of continuous actions |
| Interleaved CA + causal SA | Balance between expressiveness and speed |

3. Model Architecture: What Was Cut, and Why It Works

3.1 VLM Backbone: SmolVLM-2

SmolVLM-2 is an efficient VLM of about 500M parameters, made up of a SigLIP vision encoder and a SmolLM-2 language decoder. The core idea is simple: instead of taking a giant LLM and extending it to multiple modalities, train a small model well from the start. SmolLM-2 itself is a compact LLM trained with a focus on data quality, so it is expressive for its parameter count.

In SmolVLA, the VLM takes three inputs:

  1. Language instruction → text tokens
  2. RGB image(s) → visual tokens encoded by SigLIP
  3. Robot state (joint positions, etc.) → compressed into a single token by a linear projection

These tokens are concatenated and passed through the LLM. The LLM's output features become the condition for the Action Expert.

3.2 The Secret of 64 Visual Tokens

This part is interesting. To handle high-resolution images well, a typical VLM uses image tiling: it splits the original image into several crops and encodes each one. More tokens means finer perception. But attention cost grows with the square of the number of tokens.

SmolVLA gives up tiling. It uses a single global image and then compresses the tokens down to 64 with a pixel shuffle operation. Pixel shuffle is a trick that moves spatial resolution into the channel dimension, cutting the token count sharply while preserving as much information as possible.

Intuitively: whether you look at a photo at 16 megapixels or as 64 "regional summary patches," the information needed to pick up a cube (where it is, what color it is, where the hand should go) is mostly still there. That is the bet, and empirically it pays off.
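To make the pixel shuffle step concrete, here is a minimal pure-Python sketch of the space-to-depth idea. The 32x32 input grid, the 2-dimensional toy tokens, the shuffle ratio r = 4, and the function name `pixel_shuffle_tokens` are assumptions chosen so that the output has 64 tokens; the post itself only states the 64-token result.

```python
def pixel_shuffle_tokens(grid, r):
    """Space-to-depth: merge each r x r block of tokens into one wider token.

    grid is a list of rows, each row a list of tokens, each token a list of floats.
    """
    h, w = len(grid), len(grid[0])
    assert h % r == 0 and w % r == 0, "grid size must be divisible by r"
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            merged = []
            for di in range(r):
                for dj in range(r):
                    # Concatenate the channels of every token in the block.
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out


# Toy input (assumed sizes): a 32x32 grid of 2-dimensional tokens = 1024 tokens.
grid = [[[float(i), float(j)] for j in range(32)] for i in range(32)]
shuffled = pixel_shuffle_tokens(grid, r=4)

num_tokens = len(shuffled) * len(shuffled[0])
token_dim = len(shuffled[0][0])
print(num_tokens, token_dim)  # prints: 64 32
```

The token count drops by r² = 16 while each token becomes 16 times wider, so nothing is discarded at this step; the savings come from attention running over 64 positions instead of 1024.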

3.3 Layer Skipping: Drop the Last Half

This is the most striking design decision: the last half of the pretrained LLM's layers is cut off.

The traditional assumption is that the last layer of an LLM carries the richest semantic representation. Recent work (El-Nouby et al., 2024; Bolya et al., 2025), however, has shown that the best features for a downstream task do not necessarily come from the last layer. For tasks such as classification or visual grounding in particular, intermediate layers can be better.

SmolVLA adopts this insight and sets N = L/2. For a 16-layer LLM, that means only the first 8 layers are used. As a result:

  • LLM compute is halved
  • The KV memory that the Action Expert cross-attends to is halved
  • Performance is almost unchanged (verified experimentally)
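The truncation itself is simple to picture. The sketch below is a toy illustration, not the SmolVLA code: `run_truncated` and the 16 stand-in "layers" (each just records its index) are hypothetical, and only the N = L/2 rule comes from the text above.

```python
def run_truncated(layers, x, keep_ratio=0.5):
    """Run only the first N = int(L * keep_ratio) layers; return (features, N)."""
    n_keep = int(len(layers) * keep_ratio)
    for layer in layers[:n_keep]:
        x = layer(x)
    return x, n_keep


# Stand-in layers: layer k appends k to a trace so we can see which ones ran.
layers = [(lambda h, k=k: h + [k]) for k in range(16)]

features, n_used = run_truncated(layers, [])
print(n_used, features)  # prints: 8 [0, 1, 2, 3, 4, 5, 6, 7]
```

Layers 8 through 15 are never called, which is where the halved compute comes from.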

3.4 Flow Matching Action Expert: How to Generate Continuous Actions

Now for the really interesting part. The Action Expert \mathbf{v}_\theta takes the VLM features \mathbf{o}_t as input and outputs an n-step action chunk \mathbf{A}_t = (a_t, ..., a_{t+n}). SmolVLA uses n=50.

The question is how to train it. Plain regression (L1/L2 loss) is possible, but real robot behavior follows a multi-modal distribution. Even from the same state there can be several paths to picking up the cube, and a single regressor learns their average, which produces awkward motion.

Intuition for Flow Matching

The easiest way to understand Flow Matching is this: we want to learn a "flow" that goes from the noise distribution (Gaussian) to the data distribution (real actions).

  ε ~ N(0, I)           A_t (real action)
        \                   /
         \                 /
          \---tau=0.0----/   <- pure noise
            \-tau=0.5-/      <- halfway
              \tau=1./        <- real data

  At any tau in [0,1]:
    A_t^tau = tau * A_t + (1 - tau) * ε

  The model learns the velocity field:
    v_theta(A_t^tau, o_t)  ~  u(A_t^tau | A_t) = ε - A_t

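The training loss is:

\mathcal{L}^{\tau}(\theta) = \mathbb{E}_{p(\mathbf{A}_t | \mathbf{o}_t),\, q(\mathbf{A}_t^\tau | \mathbf{A}_t)} \left[\|\mathbf{v}_\theta(\mathbf{A}_t^\tau, \mathbf{o}_t) - \mathbf{u}(\mathbf{A}_t^\tau | \mathbf{A}_t)\|^2\right]

Here \mathbf{u}(\mathbf{A}_t^\tau | \mathbf{A}_t) = \epsilon - \mathbf{A}_t is the target velocity vector (vector field) along the straight path between noise and data. Note the sign: with \mathbf{A}_t^\tau = \tau \mathbf{A}_t + (1-\tau)\epsilon, the derivative is d\mathbf{A}_t^\tau / d\tau = \mathbf{A}_t - \epsilon, so the target as written points from the data toward the noise, and sampling has to move against it to reach a real action. Either way, the model learns, for the current noise-corrupted action, in which direction the real action lies and how far away it is.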
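As a rough illustration of how one training example for this loss could be built, here is a stdlib-only sketch. The helper names, the toy 4-dimensional "chunk", and the Beta shape parameters (1.5, 1.0) are assumptions for illustration; the source only says that τ is drawn from a Beta distribution.

```python
import random


def flow_matching_pair(action_chunk, tau, rng):
    """Build (noisy input, regression target) for one flattened action chunk."""
    eps = [rng.gauss(0.0, 1.0) for _ in action_chunk]           # eps ~ N(0, I)
    noisy = [tau * a + (1.0 - tau) * e for a, e in zip(action_chunk, eps)]
    target = [e - a for a, e in zip(action_chunk, eps)]         # u = eps - A_t
    return noisy, target


def squared_error(pred, target):
    """The ||v_theta - u||^2 term for a single sample."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


rng = random.Random(0)
chunk = [0.1, -0.2, 0.3, 0.0]        # toy stand-in for a flattened A_t
tau = rng.betavariate(1.5, 1.0)      # hypothetical shape parameters
noisy, target = flow_matching_pair(chunk, tau, rng)

# A model that always predicts zero would incur this loss on the sample.
loss_zero_model = squared_error([0.0] * len(chunk), target)
```

A quick sanity check on the construction: `noisy[i] - (1 - tau) * target[i]` recovers `chunk[i]`, which is exactly the relation between the interpolated point, the target field, and the clean action.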

A physical analogy: if you know which way the river flows, you end up at the sea no matter where you start. Flow Matching learns the velocity field of the river that runs from the "sea of noise" to the "sea of data." At inference time, you draw a noise sample, let it drift along this flow, and a real action comes out.

Inference typically integrates the flow with an ODE solver in about 10 steps. Diffusion-based policies (e.g., Diffusion Policy) rest on a similar idea, but Flow Matching produces straighter (rectified) flows, so it gives good results with few integration steps. π₀ also uses a Flow Matching action expert.
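The integration step can be sketched with a plain Euler solver. This is a minimal sketch under stated assumptions, not the released implementation: it uses the convention written in this post (τ = 0 is noise, τ = 1 is data, target ε − A_t), so each step subtracts the predicted field, and real implementations may instead flip the time direction. `make_oracle` is a hypothetical stand-in for the trained network that returns the exact field for a known action.

```python
def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate from tau=0 (noise) to tau=1 (action) with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        tau = k * dt
        v = velocity_fn(x, tau)
        # The field predicts eps - A, which points toward the noise,
        # so we step against it to move toward the data.
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle(target_action):
    """Exact field for one known action; stands in for v_theta in this demo."""
    def velocity(x, tau):
        # On the path x = tau*A + (1-tau)*eps, eps - A equals (x - A) / (1 - tau).
        return [(xi - ai) / (1.0 - tau) for xi, ai in zip(x, target_action)]
    return velocity


target_action = [0.5, -1.0]
sampled = euler_sample(make_oracle(target_action), noise=[2.0, 3.0], num_steps=10)
```

With the exact field the path is a straight line, so Euler integration lands on the target action up to floating-point error regardless of the step count; a learned field is only approximately straight, which is why the number of steps still matters in practice.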

τ is sampled from a Beta distribution (the same as in π₀). This makes the distribution of noise levels seen during training asymmetric, so that more samples are allocated to the harder steps.

One more detail: the Action Expert's hidden size is reduced to 0.75 times that of the VLM, which saves additional memory and compute while keeping its expressiveness.

3.5 Interleaving Cross-Attention and Causal Self-Attention

This is the design choice that is distinctive to SmolVLA. Other VLAs fall into two camps:

  • π₀: connects the VLM and the Action Expert through one joint self-attention
  • GR00T N1: uses cross-attention only

SmolVLA alternates the two approaches block by block (interleaving):

Action Expert v_theta:
+---------------------+
| Block 1: CA (cross-attend to VLM keys/values)
+---------------------+
| Block 2: SA (causal self-attention among action tokens)
+---------------------+
| Block 3: CA
+---------------------+
| Block 4: SA
+---------------------+
| ... (alternating)

Intuitively, each component plays the following role:

  • CA block: the action tokens "observe" the VLM features. "What is the situation in the environment right now?"
  • SA block (causal): the action tokens "coordinate" with one another. "Given the actions decided so far, what should come next to stay consistent?"

The causal mask matters. An action token must not see future action tokens; only then can the model produce a coherent sequence at inference time (a model trained on information leaked from the future falls apart at inference).
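The two ingredients described above, the alternating block order and the causal mask, can be sketched in a few lines. All function names here are hypothetical, and the sketch only shows the masking logic and block schedule, not a full attention layer.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when action token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask_row):
    """Softmax over the allowed positions only; masked positions get weight 0."""
    exps = [math.exp(s) if allowed else 0.0 for s, allowed in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


def block_schedule(num_blocks):
    """Alternate cross-attention (CA) and causal self-attention (SA) blocks."""
    return ["CA" if b % 2 == 0 else "SA" for b in range(num_blocks)]


mask = causal_mask(3)
weights = masked_softmax([0.0, 0.0, 0.0], mask[1])  # token 1 sees tokens 0 and 1
schedule = block_schedule(4)
print(weights, schedule)  # prints: [0.5, 0.5, 0.0] ['CA', 'SA', 'CA', 'SA']
```

In the SA blocks the mask is applied among the action tokens themselves; the CA blocks attend to the VLM keys and values, so no causal mask over the action chunk is involved there.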

According to the paper's ablations (Table 6, 7):

| Attention | LIBERO Avg SR (%) |
|---|---|
| CA only | 79.0 |
| SA only | 74.5 |
| CA + SA interleaved | 85.5 |

| Mask | LIBERO Avg SR (%) |
|---|---|
| Bidirectional | 67.5 |
| Causal | 74.5 |

The authors report that on real robots in particular, the SA blocks are decisive for producing smooth action chunks. This appears to come from the temporal consistency they enforce between actions.

graph TD
    VLM[VLM Features o_t] -->|keys, values| CA1[CA Block 1]
    Noise[Noisy Actions A_t^tau] --> CA1
    CA1 --> SA1[SA Block 1<br/>causal mask]
    SA1 --> CA2[CA Block 2]
    VLM -->|keys, values| CA2
    CA2 --> SA2[SA Block 2<br/>causal mask]
    SA2 --> Out[Velocity field<br/>v_theta]
    style CA1 fill:#FFD700,color:#000
    style CA2 fill:#FFD700,color:#000
    style SA1 fill:#FFFFE0,color:#000
    style SA2 fill:#FFFFE0,color:#000


4. Training Data: The Power of the Community

SmolVLA's second key contribution is a change in where the data comes from. OpenVLA was trained on about one million trajectories and π₀ on about 10,000 hours of cross-embodiment data. SmolVLA, by contrast, uses the following:

| Item | Value |
|---|---|
| Datasets | 481 |
| Episodes | 22.9K |
| Frames | 10.6M |

That is an order of magnitude less data than existing VLAs, and all of it comes from public, community-contributed datasets on Hugging Face: data collected by university labs, households, and individual researchers with low-cost robots such as the SO-100. The characteristics of this data:

  • Diverse robot embodiments
  • Inconsistent camera view naming (e.g., images.laptop is a top-down view in one dataset and a wrist view in another)
  • Noisy task annotations (vague labels such as "task desc", "Hold", "Up")
  • But it captures the genuine diversity of the real world

So how was this noise handled?

4.1 Automatic Task Label Generation with a VLM

The authors gave Qwen2.5-VL-3B-Instruct representative frames from each dataset together with the original label, and had it generate a single sentence of at most 30 characters that starts with an action verb.

Prompt gist:
"Here is the current task description: {current_task}.
Rewrite the action performed by the robot arm as one concise
sentence of at most 30 characters, starting with a verb.
Examples: 'Pick up the cube and place it in the box',
    'Open the drawer'."

As a result, the 481 datasets were normalized to a consistent instruction style.
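The relabeling step can be pictured as a small prompt-building helper. This is only a sketch: the template wording is a paraphrase of the gist above rather than the paper's exact prompt, the function names are hypothetical, and the actual call to the VLM (with the sampled frames) is deliberately left out.

```python
# Paraphrased template (assumption); not the exact prompt used by the authors.
PROMPT_TEMPLATE = (
    "Here is a current task description: {current_task}. "
    "Rewrite it as one short sentence (max 30 characters) that starts with "
    "an action verb and describes what the robot arm does. "
    "Example: 'Open the drawer'."
)


def build_relabel_prompt(current_task):
    """Fill the noisy original label into the relabeling prompt."""
    return PROMPT_TEMPLATE.format(current_task=current_task)


def fits_length_limit(label, max_chars=30):
    """Check a generated label against the character budget."""
    return len(label.strip()) <= max_chars


prompt = build_relabel_prompt("Hold")
ok = fits_length_limit("Open the drawer")
```

A length check like `fits_length_limit` is one way to catch generations that ignore the budget before they are written back to the dataset; whether the authors applied such a filter is not stated in the post.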

4.2 Camera Viewpoint Standardization

Viewpoint consistency was also cleaned up by hand. All cameras were unified as follows:

  • OBS_IMAGE_1 = top view (from above)
  • OBS_IMAGE_2 = wrist view (wrist camera)
  • OBS_IMAGE_3 = side view (from the side)

Any additional views keep their order but are dropped during training. It is worth noting that this mapping was done manually; automating it is left for future work.
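In code, the renaming described above amounts to a lookup through a hand-written per-dataset map. The sketch below is illustrative: `normalize_cameras`, the camera name `images.phone`, and the string placeholders for images are hypothetical; only the three OBS_IMAGE_* keys and the `images.laptop` example come from the text.

```python
VIEW_TO_KEY = {
    "top": "OBS_IMAGE_1",
    "wrist": "OBS_IMAGE_2",
    "side": "OBS_IMAGE_3",
}


def normalize_cameras(frames, manual_view_map):
    """Rename a dataset's camera keys to the standardized view keys.

    frames: {original_camera_name: image}
    manual_view_map: hand-written {original_camera_name: "top" | "wrist" | "side"}
    Cameras without a standardized view are dropped.
    """
    out = {}
    for cam_name, image in frames.items():
        view = manual_view_map.get(cam_name)
        if view in VIEW_TO_KEY:
            out[VIEW_TO_KEY[view]] = image
    return out


# In this (made-up) dataset, images.laptop happens to be the top-down camera.
frames = {"images.laptop": "img_a", "images.phone": "img_b", "images.extra": "img_c"}
view_map = {"images.laptop": "top", "images.phone": "wrist"}
normalized = normalize_cameras(frames, view_map)
```

Because the same original name can mean different views in different datasets, the map has to be written per dataset, which is exactly the manual effort the post points out.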

According to the authors, without these two steps (VLM relabeling and viewpoint standardization), training on community data would not have worked in any meaningful way. In practice, the care that went into cleaning the data explains a large share of the result.


5. Asynchronous Inference: So the Robot Doesn't Sit Idle

This is the most practically important contribution of SmolVLA. The model itself is a variation on familiar designs, but asynchronous inference is model-agnostic and can be applied directly to other policies.

5.1 The Problem with Synchronous Inference

A typical VLA control loop:

  Time -->
  
  observe o_t
     |
     v
  [============= predict A_t (chunk of n actions) =============]   <- model busy
     |                                                          
     v
  execute a_t, a_{t+1}, ..., a_{t+n}
                                                                
  observe o_{t+n}
     |
     v
  [============= predict A_{t+n} =============]

Problem 1: Idle gap during inference. While the policy computes the next chunk, the robot stands still (or keeps executing the last action).

Problem 2: Lack of responsiveness. A new observation is taken only after the whole chunk has been executed, so the robot cannot react immediately when the environment changes suddenly.

Problem 3: Onboard compute burden. The model has to run on the robot's computer, which calls for an expensive GPU.

5.2 Asynchronous Inference Architecture

SmolVLA's solution is a client-server split:

graph LR
    subgraph Robot
    RC[RobotClient<br/>action queue]
    end
    subgraph Server
    PS[PolicyServer<br/>SmolVLA inference]
    end
    RC -->|observation o_t| PS
    PS -->|action chunk A_t| RC
    RC -->|execute a_t| Motor[Robot motors]

  • RobotClient: pops actions from the action queue one at a time (PopFront) and sends them to the motors. When the queue length drops below the threshold, it captures a new observation and sends it to the PolicyServer.
  • PolicyServer: receives the observation, runs inference, and sends the new chunk back to the RobotClient. It can live on a different machine, such as a GPU server.

The core idea is to make inference and execution overlap in time.

5.3 Algorithm Pseudocode

Algorithm: Asynchronous Inference Loop

Input:  T (horizon), n (chunk size), g in [0,1] (queue threshold)

Init:   capture o_0; send to PolicyServer; receive A_0 = pi(o_0)

for t = 0 to T:
    a_t <- PopFront(A_t)
    Execute(a_t)
    
    if |A_t| / n < g:                          # queue is running low
        capture new observation o_{t+1}
        if NeedsProcessing(o_{t+1}):           # joint-space similarity filter
            async_handle = AsyncInfer(o_{t+1}) # non-blocking
            A_tilde_{t+1} = pi(o_{t+1})        # new chunk arrives
            A_{t+1} = aggregate(A_t, A_tilde_{t+1})  # merge overlap
    
    if NotCompleted(async_handle):
        A_{t+1} = A_t                          # keep using old queue
end for
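To see the effect of the loop above without a robot or a network, here is a small tick-based simulation. It is a sketch under stated assumptions rather than the actual client: integers stand in for actions, inference latency is a fixed number of ticks, the similarity filter is omitted, an inference is also triggered whenever the queue is empty, and `aggregate` is assumed to let the newer chunk override the old queue after skipping the steps that were executed while it was being computed.

```python
from collections import deque


def simulate(total_ticks, n, g, latency_ticks):
    """Tick-based toy model of the client loop. Returns (idle_ticks, requests)."""
    queue = deque(range(n))      # A_0: integers stand in for actions
    done_at = None               # tick at which the in-flight chunk arrives
    executed_since_sent = 0      # actions executed while waiting for it
    idle = requests = 0
    for t in range(total_ticks):
        # 1) A finished inference delivers its chunk.
        if done_at is not None and t >= done_at:
            # Assumed aggregate(): the newer chunk overrides the old queue,
            # minus the steps already executed during inference.
            queue = deque(range(n)[executed_since_sent:])
            done_at = None
        # 2) Execute one action, or idle if the queue ran dry.
        if queue:
            queue.popleft()
            if done_at is not None:
                executed_since_sent += 1
        else:
            idle += 1
        # 3) Trigger a new inference when the queue is low (or empty).
        if done_at is None and (len(queue) / n < g or not queue):
            done_at = t + latency_ticks
            executed_since_sent = 0
            requests += 1
    return idle, requests


idle_sync, req_sync = simulate(total_ticks=100, n=10, g=0.0, latency_ticks=3)
idle_async, req_async = simulate(total_ticks=100, n=10, g=0.7, latency_ticks=3)
print(idle_sync, idle_async)
```

With g = 0.0 the client waits until the queue is empty, so every refill costs idle ticks; with g = 0.7 the refill arrives while actions are still queued and the idle count stays at zero, at the price of more inference requests.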

5.4 The Role of the Threshold g

This is the crux. g determines how empty the queue must be before a new inference is triggered.

Queue size over time, varying g:

g = 0.0  (sequential limit)
  Queue: |####### ........ ####### ........|
         full   empty(idle) full   empty(idle)
  -> Long idle gaps!

g = 0.7  (sweet spot)
  Queue: |##### ## ##### ## ##### ## ####|
         steady refill, no full drain
  -> Balanced reactivity vs compute

g = 1.0  (compute-intensive limit)
  Queue: |###############################|
         always near-full (one inference per tick)
  -> Maximum reactivity, but costly

In theory, keeping the queue from running dry (no idle gap) requires:

g \geq \frac{\mathbb{E}[\ell_S] / \Delta t}{n}

Here \ell_S is the server inference latency, \Delta t is the control period (33 ms at 30 fps), and n is the chunk size. For example, if inference takes 100 ms and the chunk size is 50, then (100 / 33) / 50 \approx 0.06, so any g \geq 0.06 is enough to keep the queue from emptying in expectation. In practice a safety margin is added and values of 0.5 to 0.7 are used.
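The bound is easy to evaluate for a given setup. The helper name below is hypothetical; the numbers are the ones used in the example above (100 ms latency, 30 fps control, chunk size 50).

```python
def min_queue_threshold(latency_s, control_period_s, chunk_size):
    """Lower bound on g: expected latency measured in control steps, divided by n."""
    steps_during_inference = latency_s / control_period_s
    return steps_during_inference / chunk_size


g_min = min_queue_threshold(latency_s=0.100, control_period_s=1.0 / 30.0, chunk_size=50)
print(round(g_min, 3))  # prints: 0.06
```

Read the other way around, the bound says that about 3 actions are consumed during one inference call, so a refill must be requested while at least 3 of the 50 actions are still queued; the much larger values used in practice buy reactivity and headroom for latency spikes.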

5.5 Observation Similarity Filter

One more detail: when the robot is nearly stationary and the observations are almost identical, triggering a new inference every time is a waste of resources. SmolVLA treats an observation as a near-duplicate and skips inference when its joint-space distance is at or below a threshold \epsilon. The exception is an empty queue, in which case inference always runs.
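A filter of this kind fits in one small function. In this sketch the Euclidean distance, the comparison against the last observation that was actually sent, and the name `needs_processing` (borrowed from the pseudocode's NeedsProcessing) are assumptions; the post only specifies a joint-space distance threshold and the empty-queue override.

```python
import math


def needs_processing(joints, last_sent_joints, eps, queue_empty):
    """Decide whether a new observation should be sent for inference."""
    if queue_empty or last_sent_joints is None:
        return True  # never starve the robot, and always send the first one
    # Near-duplicate if the joint configuration barely moved.
    return math.dist(joints, last_sent_joints) > eps


skip = needs_processing([0.0, 0.0], [0.0, 0.001], eps=0.01, queue_empty=False)
forced = needs_processing([0.0, 0.0], [0.0, 0.001], eps=0.01, queue_empty=True)
moved = needs_processing([0.0, 0.0], [0.0, 0.5], eps=0.01, queue_empty=False)
print(skip, forced, moved)  # prints: False True True
```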

5.6 Results: 30% Faster, and Twice as Many Tasks in the Same Time

| Metric | Sync | Async | Difference |
|---|---|---|---|
| Average success rate (%) | 78.3 | 73.3 | Similar |
| Pick-Place average time (s) | 13.75 | 9.70 | About 30% faster |
| Cubes completed in 60 s | 9 | 19 | More than 2x |

Success rates are similar (async is 5 points lower), but the authors report that asynchronous inference adapts much better to disturbances in the environment, such as someone moving the cube.


6. ์‹คํ—˜ ๊ฒฐ๊ณผ ์ž์„ธํžˆ ๋ณด๊ธฐ

6.1 ์‹œ๋ฎฌ๋ ˆ์ด์…˜: LIBERO์™€ Meta-World

LIBERO๋Š” 4๊ฐ€์ง€ ์นดํ…Œ๊ณ ๋ฆฌ(Spatial, Object, Goal, Long-horizon) ร— 10 ํƒœ์Šคํฌ = 40๊ฐœ ํƒœ์Šคํฌ ๋ฒค์น˜๋งˆํฌ. Meta-World๋Š” 50๊ฐœ ํƒœ์Šคํฌ ร— 4๋‹จ๊ณ„ ๋‚œ์ด๋„(Easy/Medium/Hard/Very Hard).

LIBERO ๊ฒฐ๊ณผ ์š”์•ฝ:

๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ VLA ์‚ฌ์ „ํ•™์Šต ํ‰๊ท  SR (%)
Diffusion Policy - No 72.4
Octo 0.09B Yes 75.1
OpenVLA 7B Yes 76.5
ฯ€โ‚€ (Paligemma init) 3B No 71.8
ฯ€โ‚€ (full pretrain) 3.3B Yes 86.0
SmolVLA 0.45B No 87.3
SmolVLA 2.25B No 88.75

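This table is genuinely impressive. SmolVLA 0.45B beats OpenVLA (7B), a model roughly 15 times larger, by more than 10 percentage points (87.3 vs 76.5), and narrowly edges out the fully pretrained π₀ (3.3B), which is about 7 times larger (87.3 vs 86.0). On top of that, the SmolVLA in this comparison was not even pretrained on robot data; it was only initialized from the VLM.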

Meta-World results:

| Model | Easy | Medium | Hard | Very Hard | Avg |
| --- | --- | --- | --- | --- | --- |
| TinyVLA | 77.6 | 21.5 | 11.4 | 15.8 | 31.6 |
| π₀ (Paligemma) | 80.4 | 40.9 | 36.7 | 44.0 | 50.5 |
| π₀ (pretrained) | 71.8 | 48.2 | 41.7 | 30.0 | 47.9 |
| SmolVLA 0.45B | 82.5 | 41.8 | 45.0 | 60.0 | 57.3 |
| SmolVLA 2.25B | 87.1 | 51.8 | 70.0 | 64.0 | 68.2 |

SmolVLA's advantage stands out most on Hard and Very Hard. One interpretation is that the Flow Matching action expert captures multimodal action distributions well.

6.2 Real Robots: SO-100 / SO-101

SO-100, multi-task training results (success rate, %):

| Model | Pick-Place | Stacking | Sorting | Average |
| --- | --- | --- | --- | --- |
| ACT (single-task, from scratch) | 70 | 50 | 25 | 48.3 |
| π₀ (3.5B, multi-task) | 100 | 40 | 45 | 61.7 |
| SmolVLA (0.45B, multi-task) | 75 | 90 | 70 | 78.3 |

π₀ reaches 100% on Pick-Place, but SmolVLA dominates on the harder Stacking (90 vs 40) and Sorting (70 vs 45) tasks. Sorting is a long-horizon task scored per sub-task, so a small model beating a large one here is a sign that the architecture and its use of data are efficient.

SO-101 OOD generalization (Pick-Place-Lego: placing a small Lego brick into a transparent box; success rate, %):

| Model | In-Distribution | Out-of-Distribution |
| --- | --- | --- |
| ACT | 70 | 40 |
| SmolVLA | 90 | 50 |

SmolVLA was never pretrained on SO-101 data, yet it beats ACT both in distribution (90 vs 70) and out of distribution (50 vs 40). This supports the view that the diversity of the community data helped generalization, although the OOD margin is a modest 10 points.

6.3 Separating the Effects of Pretraining and Multi-Task Training

| Training setup | Pretraining | Avg. SR (%) |
| --- | --- | --- |
| Single-task | No | 40.0 |
| Multi-task | No | 51.7 |
| Multi-task | Yes | 78.3 |

Multi-task training on its own adds about 12 points (40.0 → 51.7), and pretraining on its own adds about 27 points on top of that (51.7 → 78.3). The two effects stack. This is strong evidence that pretraining on community datasets is a substantive contribution rather than a mere trick.
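The two gains can be read off the table with a couple of subtractions. Note that the table has no "single-task with pretraining" row, so the sketch below measures the effects sequentially (multi-task first, then pretraining) rather than as a full factorial comparison.

```python
# Average success rates (%) copied from the table above.
single_task_no_pretrain = 40.0
multi_task_no_pretrain = 51.7
multi_task_pretrained = 78.3

# Effect of switching from single-task to multi-task, without pretraining.
multitask_gain = multi_task_no_pretrain - single_task_no_pretrain
# Effect of adding pretraining, with multi-task training held fixed.
pretraining_gain = multi_task_pretrained - multi_task_no_pretrain
# Combined improvement over the single-task, no-pretraining baseline.
total_gain = multi_task_pretrained - single_task_no_pretrain

print(f"multi-task gain:  +{multitask_gain:.1f} pp")
print(f"pretraining gain: +{pretraining_gain:.1f} pp")
print(f"total gain:       +{total_gain:.1f} pp")
```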


7. Ablations: Which Design Decisions Really Matter

The paper provides a rich set of ablations. The key ones:

7.1 Which Layers to Use

| N (number of layers used) | LIBERO Avg. SR (%) |
| --- | --- |
| 8 | 75.0 |
| 16 | 78.5 |
| 24 | 79.5 |
| 32 (all) | 80.3 |
| Skip every 2nd | 75.5 |
| Small VLM (256M), full | 75.8 |

Key insight: "half the layers of a large VLM" (78.5) beats "all of a small VLM" (75.8). In other words, the amount of pretrained representational capacity matters, while depth is not decisive. In addition, using the first N layers contiguously works better than skipping every second layer (78.5 vs 75.5).
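To make the two selection schemes concrete, the sketch below lists which layer indices each one keeps for a 32-layer stack. The helper names are illustrative, and the assumption that "skip every 2nd" keeps the even-indexed layers is mine; the text does not say which parity is kept.

```python
from typing import List


def first_n_layers(total_layers: int, n: int) -> List[int]:
    """Keep the first n layers contiguously (the scheme the ablation favors)."""
    if not 0 < n <= total_layers:
        raise ValueError("n must be in 1..total_layers")
    return list(range(n))


def every_second_layer(total_layers: int) -> List[int]:
    """Keep every other layer (assumed here to be the even-indexed ones)."""
    return list(range(0, total_layers, 2))


contiguous = first_n_layers(32, 16)
strided = every_second_layer(32)

print(contiguous)  # indices 0 through 15
print(strided)     # indices 0, 2, ..., 30
```

Both schemes keep 16 layers and therefore cost about the same compute, so the 78.5 vs 75.5 gap in the table reflects which layers are kept, not how many.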

7.2 Training Objective: Flow Matching vs Regression

| Objective | LIBERO Avg. SR (%) |
| --- | --- |
| L1 Regression | 75.25 |
| Flow Matching | 80.25 |

The jump is largest on the long-horizon suite (LIBERO-10), where Flow Matching lifts success from 38 to 53. The value of modeling a multimodal distribution shows up most clearly on long, complex behaviors.
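A toy example helps show why a point-estimate regressor struggles with multimodal demonstrations. The data below is invented purely for illustration (a 1-D action where demonstrators go either left or right); it is not from the paper and does not implement Flow Matching itself.

```python
import statistics

# Hypothetical demonstrations: half steer left (-1.0), half steer right (+1.0).
demo_actions = [-1.0, -1.0, 1.0, 1.0]
modes = sorted(set(demo_actions))

# The constant that minimizes squared error is the mean of the targets.
mse_optimal = statistics.fmean(demo_actions)
# statistics.median returns the midpoint of the two middle values, which is one
# minimizer of the L1 loss (here every value between the two modes ties).
l1_median = statistics.median(demo_actions)


def distance_to_nearest_mode(action: float) -> float:
    """How far an action is from anything a demonstrator actually did."""
    return min(abs(action - m) for m in modes)


print(mse_optimal, distance_to_nearest_mode(mse_optimal))
print(l1_median, distance_to_nearest_mode(l1_median))
```

Both point estimates land at 0.0, an action no demonstrator ever took. A generative objective that learns the distribution can instead produce samples near -1.0 or +1.0.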

7.3 Where to Feed the State

| State position | Attention | LIBERO Avg. SR (%) |
| --- | --- | --- |
| Prefix (into the VLM) | CA | 80.3 |
| Suffix (into the Action Expert) | CA | 73.3 |
| Prefix | SA | 53.3 |
| Suffix | SA | 74.8 |

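With cross-attention, feeding the state into the VLM so it is processed jointly with vision and language is clearly better (80.3 vs 73.3): the Action Expert receives a refined multimodal representation from the start. The self-attention rows in the table show the opposite ordering (53.3 vs 74.8), so the benefit of the prefix placement is tied to the CA design.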

7.4 Chunk Size n

| n | LIBERO Avg. SR (%) |
| --- | --- |
| 1 | 50.0 |
| 10 | 84.0 |
| 30 | 78.5 |
| 50 | 80.3 |
| 100 | 74.5 |

n=1 (inference at every step) is vulnerable to noise, while n=100 is so long that the policy becomes insensitive to changes in the environment. The sweet spot is 10 to 50. In practice, 50 gives a good balance between inference efficiency and performance.

7.5 Observation Refresh Interval

| Steps executed from a chunk before taking a new observation | LIBERO Avg. SR (%) |
| --- | --- |
| 1 | 80.3 |
| 10 | 82.8 |
| 30 | 70.8 |
| 50 (execute the full chunk) | 51.8 |

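Executing all 50 actions of a chunk before refreshing the observation makes performance collapse (51.8, versus 80.3 to 82.8 with frequent refreshes). Since frequent refreshes under synchronous inference would stall the robot at every refresh, this is a strong justification for asynchronous inference.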
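The trade-off in this ablation is easier to see as a loop. The sketch below is a simplified synchronous rollout under assumptions: `run_episode`, `dummy_policy`, and the use of the time step as a stand-in for the observation are all hypothetical and exist only to count how many inference calls each refresh interval costs.

```python
from typing import Callable, List, Tuple


def run_episode(
    policy: Callable[[int, int], List[int]],
    total_steps: int,
    chunk_size: int,
    steps_before_refresh: int,
) -> Tuple[List[int], int]:
    """Predict a chunk, execute only its first k actions, then re-observe."""
    if not 0 < steps_before_refresh <= chunk_size:
        raise ValueError("steps_before_refresh must be in 1..chunk_size")
    executed: List[int] = []
    inference_calls = 0
    t = 0
    while t < total_steps:
        # The current time step stands in for a fresh observation.
        chunk = policy(t, chunk_size)
        inference_calls += 1
        # Execute only the first k actions; the rest of the chunk is discarded.
        for action in chunk[:steps_before_refresh]:
            if t >= total_steps:
                break
            executed.append(action)
            t += 1
    return executed, inference_calls


def dummy_policy(observation: int, n: int) -> List[int]:
    """Placeholder policy: returns n consecutive integers as 'actions'."""
    return [observation + i for i in range(n)]


for k in (1, 10, 30, 50):
    _, calls = run_episode(dummy_policy, total_steps=100, chunk_size=50, steps_before_refresh=k)
    print(f"refresh every {k:>2} steps -> {calls} inference calls")
```

Refreshing every 10 steps needs five times as many inference calls as executing full chunks, which is the cost that running inference asynchronously is meant to hide.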


8. Critical Discussion: Strengths and Weaknesses

8.1 Strengths

(1) A real advance in accessibility. Training on a single GPU, inference on a CPU, validation on a robot in the 100-dollar range. This is not just marketing copy; it creates an actual entry point for labs and individuals. The integration with the LeRobot library is real as well.

(2) A shift in the data paradigm. This is the first systematic demonstration of the value of community data. It opens the possibility of a virtuous cycle in which the model improves as more users contribute data.

(3) Model-agnostic asynchronous inference. Algorithm 1 applies not only to SmolVLA but to any chunked policy. This may in fact be the largest practical contribution: an engineering asset that can be used immediately in real robot research.

(4) The combination of Flow Matching and interleaved CA/SA. It borrows the Flow Matching idea from π₀ but differentiates itself in the attention structure. The ablations are thorough and clearly isolate the effect of each design decision.

(5) Fully open source. Code, weights, training data, and even hardware designs are all released. A model of reproducibility.

8.2 Weaknesses and Limitations

(1) Dependence on a single embodiment. Pretraining is centered on the SO-100, so generalization to more diverse robots (Franka, UR5, humanoids, and so on) is unverified. Transfer to the SO-101, which was absent from pretraining, did work well, but the strong similarity within the SO series has to be taken into account.

(2) Small dataset. 22.9K episodes is far smaller than OpenVLA's (1M). The authors state this as a limitation themselves. How the approach behaves when scaled to more data is unknown.

(3) Limited to short horizons. All evaluated tasks are simple manipulation (pick-place, stacking, sorting). Performance on truly long-horizon tasks (for example cooking or assembly) is unknown. The authors also mention the need for hierarchical policies.

(4) Suitability of the VLM backbone. SmolVLM-2 was pretrained as a model whose strengths are OCR and document understanding. It may not be optimal for robotic settings (3D spatial reasoning, physical interaction). A VLM pretraining recipe for robotics remains future work.

(5) Imitation learning only. With no reinforcement learning fine-tuning, it is hard to improve behavior beyond the demonstration data. In dexterous manipulation in particular (an area readers of this post will likely know well), RL is close to mandatory.

(6) No tactile or force sensing. Only vision, language, and proprioception are used. Precise manipulation (for example cable assembly or cloth handling) needs tactile feedback, which is a modality the SmolVLM-2 backbone does not handle.

(7) Tuning burden of asynchronous inference. The threshold g, the similarity threshold \epsilon, and similar parameters need per-task tuning. The paper itself reuses values optimized on Pick-Place for the other tasks, and the fact that sync (70%) outperforms async (50%) on Sorting exposes how hard this generalization is.


9. Position Relative to Related Work

graph TB
    subgraph "Large VLAs"
    A[OpenVLA<br/>7B, autoregressive tokens]
    B[π₀<br/>3.3B, flow matching]
    C[GR00T N1<br/>humanoid, cross-attn]
    D[RT-2<br/>VLM + robot data]
    end
    subgraph "Efficient VLAs"
    E[TinyVLA<br/>sub-1B, scratch]
    F[Octo<br/>0.09B, transformer]
    G[SmolVLA<br/>0.45B, community data]
    end
    A -.scaled down.-> G
    B -.made efficient.-> G
    E -.stronger pretraining.-> G
    F -.diverse data.-> G
    style G fill:#90EE90,color:#000

| Dimension | OpenVLA | π₀ | GR00T N1 | TinyVLA | SmolVLA |
| --- | --- | --- | --- | --- | --- |
| Parameters | 7B | 3.3B | 2B+ | <1B | 0.45B |
| Action representation | discrete tokens | flow matching | flow matching | regression | flow matching |
| Data | OXE | 10K hr cross-emb | humanoid | general multimodal | community |
| Attention | SA only | SA only | CA only | SA only | CA+SA interleaved |
| Asynchronous inference | No | No | No | No | Yes |
| Open-source completeness | Partial | Partial | Partial | Yes | Full |

Comparisons of particular interest to dexterous manipulation researchers (like the readers of this post):

  • SmolVLA's Flow Matching action expert follows a philosophy similar to DexVLA's (a plug-in diffusion expert).
  • However, SmolVLA is built around a simple gripper, not a dexterous hand. What happens when it is extended to high-dimensional finger control (for example the 16 DoF of the Allegro Hand) is a big open question. Chunk size and action expert capacity would likely need to grow, and sim-to-real transfer becomes a new challenge.
  • Combining it with the dexterous in-hand manipulation line of work such as HORA, AnyRotate, and DexNDM, that is, learning in-hand reorientation with a vision-language-conditioned Flow Matching policy, could make for interesting follow-up research.

10. Intuitive Summary: Why Did a Small Model Work?

Compressed into one sentence, the most important message of this paper is:

“What determines VLA performance is not model size but the quality of the pretrained representation, the way the action distribution is modeled, and the diversity of the data.”

Detailed insights:

  1. Half of the pretrained representation is enough: the last layers of a VLM may be over-processed for robot control. The middle layers may be more "action-friendly".
  2. Cutting visual tokens down to 64 is fine: manipulation tasks do not need a pixel-level understanding of the whole image. Capturing object positions and the state of the hand is enough.
  3. Flow Matching beats regression: actions are inherently multimodal. A generative model that learns the distribution itself produces more natural behavior than regression, which learns a single point estimate.
  4. CA and SA are complements: CA looks at the environment, and SA keeps the action sequence consistent with itself. Interleaving them gets the benefits of both.
  5. Synchronous inference is a luxury: a robot sitting idle while the model runs inference is not inherent to the problem; it is engineering laziness. Decoupling the two asynchronously saves about 30% of task time.
  6. Real diversity in the data matters: in some respects, 23K noisy but diverse trajectories are more powerful than one million well-curated ones (especially for generalization).

11. Closing: What Is Interesting, and What to Try Next

SmolVLA shakes the conventional wisdom that "huge model = good performance". More importantly, in doing so it lowers the barrier to entry for research itself. It makes VLA research possible without expensive GPUs and without expensive robots.

From the perspective of a dexterous manipulation researcher reading this post, I close by proposing a few interesting follow-up directions.

(1) Apply it to the Allegro Hand. What happens if SmolVLA is applied as-is to a 16-DoF hand? It is worth increasing the output dimension of the action expert, fine-tuning in isaaclab/IsaacGym simulation, and attempting tasks such as in-hand reorientation. SmolVLA's compactness is a big advantage in the sim-to-real loop here: fast fine-tuning and fast inference.

(2) Add a tactile backbone. An experiment that uses TacSL or DIGIT simulation to add tactile tokens to SmolVLA's input. Placing a separate tactile encoder alongside SmolVLM-2 and integrating it through cross-attention could make the model stronger at precise manipulation.

(3) Combine asynchronous inference with quasi-dynamic control. It is worth testing whether applying asynchronous inference to systems that rely on a quasi-dynamic assumption, such as CTR-MPC, can spread the computational load of MPC more effectively.

(4) Contribute community data. Uploading your own robot data to Hugging Face is now a meaningful research activity in itself. Community datasets for dexterous platforms beyond SO-ARM (Allegro, Tesollo, and others) could be built and used to train SmolVLA-style models.

(5) RL fine-tuning. RL methods for VLAs such as ConRFT could be applied to SmolVLA to build dexterous policies that go beyond the limits of imitation. A big advantage is that RL training is faster with a small model.

The appeal of a small model is that experiments run fast. With a good hypothesis, it can be validated within days. This paper throws that possibility wide open.

Copyright 2026, JungYeon Lee