Curieux.JY
  • JungYeon Lee


📃 RL Token

vla
rl-token
physical-intelligence
Precise Manipulation with Efficient Online RL
Published

May 9, 2026

  • Paper Link
  • Homepage
  1. 🤖 The paper proposes a method for online reinforcement-learning (RL) fine-tuning of Vision-Language-Action (VLA) models that compresses the VLA's internal features into an efficient representation called the "RL token."
  2. 🚀 A lightweight actor-critic network trained on top of the RL token refines and regularizes the VLA's initial behavior, enabling sample-efficient learning from only hours or even minutes of robot experience.
  3. ⚡️ On real robot tasks, RLT (RL with the RL token) improves the success rate and execution speed of precise tasks by up to 3×, and on some tasks exceeds the speed of expert teleoperation.

๐Ÿ” Ping Review

๐Ÿ” Ping โ€” A light tap on the surface. Get the gist in seconds.

This paper proposes RLT (RL Token), an efficient online reinforcement-learning (RL) fine-tuning method for applying Vision-Language-Action (VLA) models to real robot tasks precisely and quickly. Existing VLA models can learn diverse manipulation skills, but struggle to reach the millimeter-level precision and speed demanded in real environments. RL is an effective way to improve such precise tasks, but directly fine-tuning a large VLA model with RL is impractical in terms of compute and sample efficiency. RLT resolves this by exploiting the VLA's generalization ability while achieving the sample efficiency of lightweight online RL.

1. Core Methodology

The core idea of RLT is to maximize RL training efficiency by exploiting the VLA model's pretrained knowledge as much as possible. It consists of the following three stages.

A. Adapting the VLA to expose an RL interface

Sample-efficient online RL depends heavily on an effective state representation. A VLA's internal features are high-dimensional, and online updates of the full model are expensive. RLT introduces the "RL token" so that the VLA provides a small, efficient representation suitable for RL while preserving its pretrained knowledge.

  1. VLA fine-tuning and RL token training:
    • First, the VLA model is fine-tuned on a small amount of task-specific demonstration data. This improves the VLA's initial task policy and at the same time lays the groundwork for RL token training.
    • The VLA's final-layer token embeddings z = f(s, \ell; \theta_{\text{vla}}) (the VLA's output for state s and language instruction \ell) are used.
    • A learned embedding e_{\text{rl}} = e_\phi(\text{<rl>}) is appended to the token sequence, and a lightweight encoder transformer g_\phi processes the extended sequence.
    • The encoder output at the special-token position, z_{\text{rl}} = g_\phi([z_{1:M}, e_{\text{rl}}])_{M+1}, is the RL token. This z_{\text{rl}} acts as a compressed vector summarizing the VLA's knowledge.
    • A decoder transformer d_\phi and a linear output projection h_\phi are trained autoregressively to reconstruct the original embeddings from z_{\text{rl}}. The reconstruction objective over the demo data D is: L_{\text{ro}} = E_D \left[ \sum_{i=1}^M \left\| h_\phi d_\phi([z_{\text{rl}}, \bar{z}_{1:i-1}])_i - \bar{z}_i \right\|_2^2 \right] where \bar{z}_i = \text{sg}(z_i) denotes the stop-gradient applied to the VLA embeddings.
    • After this training, \theta_{\text{vla}} (the VLA model) and \phi (the RL-token parameters) are frozen, and online RL operates on the z_{\text{rl}} representation.

B. Online RL to refine VLA action chunks

With the RL-token representation frozen, a lightweight actor (\pi_\theta) and critic (Q_\psi) are trained online. These networks take as input x, the combination of the RL token and the robot's proprioceptive state.

  1. Training the critic:
    • The critic Q_\psi(x, a_{1:C}) takes the state and an action chunk a_{1:C} as input and estimates the value function. Here C is the RL chunk length and H is the chunk horizon the VLA predicts (C < H).
    • The critic is trained with standard off-policy temporal-difference learning on action-chunk transitions sampled from the replay buffer B. The objective is: L_Q = E_{(x,a_{1:C},x') \sim B} \left[ \left( \hat{Q} - Q_\psi(x, a_{1:C}) \right)^2 \right] where \hat{Q} is the target Q-value, computed as: \hat{Q} = \sum_{t'=1}^C \gamma^{t'-1} r_{t'} + \gamma^C E_{a' \sim \pi_\theta} [Q_{\psi'}(x', a')] with x = (z_{\text{rl}}, s_p), where s_p is the proprioceptive state. Following TD3 [19], \psi' denotes the target-network parameters.
  2. Training the RL policy:
    • The actor network \pi_\theta(\cdot|x, \tilde{a}_{1:C}) produces a Gaussian distribution over action chunks. It takes as input the state x and the reference action chunk \tilde{a}_{1:C} proposed by the VLA.
    • The action distribution is: \pi_\theta(a_{1:C} | x, \tilde{a}_{1:C}) = \mathcal{N}(\mu_\theta(x, \tilde{a}_{1:C}), \sigma^2 I)
    • The actor is optimized to maximize the critic's value while staying close to the VLA reference chunk \tilde{a}. This resembles KL-regularized RL and turns online RL into local refinement of the VLA's strong initial proposals. The objective is: L_\pi(\theta) = E_{s \sim B, a_{1:C} \sim \pi_\theta} \left[ - Q_\psi(x, a_{1:C}) + \beta \|a_{1:C} - \tilde{a}_{1:C}\|_2^2 \right] where \tilde{a}_{1:C} \sim \pi_{\text{vla}}(\cdot | s, \ell) is the reference action chunk sampled from the VLA and \beta controls the regularization strength.
    • Reference action dropout: to prevent the actor from simply imitating \tilde{a}, the reference chunk is replaced with zeros for a random subset of transitions in each training batch. This forces the actor to maintain an independent action-generation path.

C. Complete System

RLT's full training loop is as follows:

  1. Warmup: after training the RL-token representation, the replay buffer B is filled with N_{\text{warm}} steps from the base VLA policy. This gives the critic an initial learning signal and ensures that RL starts from competent VLA behavior.
  2. Rollout: during online data collection, at each action-chunk boundary the frozen VLA produces a reference chunk \tilde{a}_{1:H} and the RL-token module extracts z_{\text{rl}}. The actor then outputs an action chunk a_{1:C} \sim \pi_\theta(\cdot | x, \tilde{a}_{1:C}).
    • A human operator can optionally intervene and override the actor's output, in which case the intervened actions are stored in the replay buffer.
    • For data efficiency, observations at all intermediate steps are used, and intermediate-step transitions are stored in the replay buffer independently of the chunk length C the RL policy uses (e.g., <x_0, a_{0:C}>, <x_2, a_{2:C+2}>, and so on).
  3. Update: policy updates are performed off-policy from the replay buffer. Rollouts and learning run asynchronously. A high update-to-data ratio (e.g., 5) raises sample efficiency.
  4. Targeted improvement of critical phases: RLT focuses on improving the hardest, highest-precision "critical phase" of each task. Episodes start under the base VLA model, and a human operator chooses when to hand control from the VLA to the RL policy. This concentrates data and credit assignment on the part of the behavior that matters most.

2. Experiments and Results

RLT was evaluated on four real-robot manipulation tasks: screw installation, zip-tie fastening, Ethernet insertion, and charger insertion. All of them demand millimeter- or sub-millimeter-level precision.

  • Q1: Improvement over the base VLA model:
    • RLT consistently improved success rate and execution speed in the critical phase of every task. Even on the comparatively easy charger and Ethernet tasks, critical-phase speed improved by about 3×.
    • On the hard zip-tie and screw tasks, success rates rose substantially. In full-task evaluation as well, success rates improved by 40% on the screw task and 60% on the zip-tie task.
  • Q2: Comparison with other RL methods:
    • Single-step online RL methods such as HIL-SERL and PLD failed to learn effectively on these hundreds-of-steps tasks with sparse rewards: without action chunks, the task horizon is too long for value-function updates to be efficient.
    • DAgger and DSRL achieved success rates similar to RLT but delivered far smaller speed gains. DSRL constrains the policy tightly to the base VLA, which yields stable training but limits the potential for improvement.
    • RLT maintained the base policy's high success rate while halving the average number of steps to completion, achieving high throughput.
  • Q3: Contribution of each component:
    • All four design choices (the RL token, action chunks, the BC (behavioral cloning) regularizer, and reference-action pass-through) contributed meaningfully.
    • Replacing the RL token with an ImageNet-pretrained ResNet-10 encoder reduced throughput by 50%.
    • Using single-step actions instead of chunks (C=10) drastically increased the effective horizon, and the policy could not reliably match base-policy performance.
    • Removing the BC regularizer (\beta=0) caused the largest performance degradation.
    • Removing reference-action pass-through slowed learning, caused early exploration drift, and sometimes led to regressive behavior. It can eventually match RLT's performance, but suffers more failures during training.
  • Q4: Discovery of new effective strategies:
    • On the Ethernet task, RLT was much faster than both the teleoperation demos and the base VLA model.
    • Where the base VLA frequently showed 'probing' behavior near contact, RLT approached the port and inserted the connector in one fluid motion. Even when the first attempt failed, it applied pressure and wiggled the connector slightly, exploiting compliance for faster insertion. This behavior appears nowhere in the demo data and emerged purely through online exploration, showing that RLT can go beyond imitating human strategies.

3. Conclusion

RLT is a fast, efficient online RL method built on representations extracted from a large pretrained VLA. By training the VLA to expose a compressed representation, it lets a lightweight actor and critic improve highly precise, delicate tasks with only a few hours of real-robot practice. RLT consistently improved success rate and execution speed on all tasks, achieved up to a 3× speedup on the hardest phases, and in some cases surpassed expert human teleoperation speed via strategies that emerged from online RL.

RLT enables fast learning, but still requires human involvement during training: reward signals, intervention corrections, and switching between the RL and base policies. Automating these components with reward models and progress prediction is proposed as future work. This work is an important step toward robot systems that not only learn from demonstration data but can be improved directly in the field, and it suggests that pretraining may serve as the initialization while real performance is discovered through RL.

🔔 Ring Review

🔔 Ring — An idea that echoes. Grasp the core and its value.

One-line summary (TL;DR)

Keep a giant VLA like π₀.6 frozen, open up just one small learnable window called the "RL token," and put a lightweight actor-critic on top of it to polish precise manipulation in a few hours (sometimes minutes). On Ethernet insertion, the resulting policy came out faster than a human.

Note

Four key contributions

  1. An encoder–decoder bottleneck (the RL token) that compresses the VLA's internal representation into a single token (1 × 2048)
  2. An off-policy actor-critic that operates at the action-chunk level, easing the credit-assignment problem under sparse rewards
  3. Conditioning on the VLA's reference action plus BC regularization, shrinking exploration to "local editing"
  4. Reference action dropout, which blocks the failure mode of merely copying the reference action

Introduction: the problem of the last 1 mm

VLAs (π₀, π₀.6, OpenVLA, Gemini Robotics, etc.) are, in a phrase, "generalists that have watched tens of thousands of hours of human demos." They muddle through long-horizon chores like folding laundry, clearing dishes, and assembling boxes. But the wall we roboticists hit every day lies elsewhere: the 0.5 mm region where a screw head must mesh exactly with the driver bit, or the final instant of pushing an Ethernet connector into its port at exactly the right angle. In this "last 1 mm," a VLA typically behaves like this:

  • Approach slowly → back off after slipping off slightly → approach again → miss again → pull back → …

The paper calls this probing behavior. Human demos are inconsistent in that region, so when the VLA averages over them, a half-hearted motion comes out. It is not a problem that collecting more demos will solve: that region is, by nature, poorly captured by demos.

At this point RL comes up naturally: just refine the policy with reinforcement learning while actually attempting the task. But here two lines of work collide:

  1. Fine-tune the entire VLA with RL (e.g., RECAP, SimpleVLA-RL): the full expressive power survives, but the data and compute costs are enormous, unrealistic on a real-time robot-learning budget.
  2. Train a small policy from scratch with RL (e.g., HIL-SERL, RL100): training finishes within hours, but the VLA's large-scale prior is thrown away wholesale.

RLT aims for a clean compromise between these two.

flowchart LR
    A["Giant VLA<br>RL on all parameters<br/>(RECAP etc.)"] -->|"slow, massive data"| C
    B["Small policy<br>RL from scratch<br/>(HIL-SERL)"] -->|"VLA prior lost"| C
    C{"RLT's niche"}
    C --> D["VLA frozen<br/>+ RL token<br/>+ small actor-critic"]
    D --> E["Hours to minutes<br/>VLA knowledge preserved"]

Background: why the VLA alone is not enough

First, a quick look at the VLA's structure. π₀.6 consists of two parts:

Component | Role | Parameters
VLM backbone (SigLIP + Gemma) | encodes 4 images + natural language + proprioceptive state into a token sequence | ~4.4 B
Action expert backbone | attends to those tokens and generates the action chunk via diffusion | ~860 M

Control runs at 50 Hz. The VLA predicts an action chunk of H = 50 steps (about 1 s) at once, and typically only the first 20 steps are executed open-loop before re-observing and re-planning. One unit of output is a 14-dimensional action × 50 steps = a 700-dimensional chunk.
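As a rough sketch of this receding-horizon pattern (a toy stand-in for the VLA, not the paper's code; H = 50, the 20-step executed prefix, and the 14 action dimensions come from the description above):

```python
# Sketch of the chunked receding-horizon loop: predict an H-step chunk,
# execute only the first EXECUTE_STEPS actions open-loop, then re-observe
# and re-plan. `fake_vla_chunk` is a hypothetical stand-in for the VLA.
import numpy as np

H, ACTION_DIM = 50, 14       # chunk horizon x per-step action dim (= 700 total)
EXECUTE_STEPS = 20           # open-loop prefix executed before re-planning

def fake_vla_chunk(rng):
    """Stand-in for the VLA: return an (H, ACTION_DIM) action chunk."""
    return rng.normal(size=(H, ACTION_DIM))

def run_episode(total_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    executed = []
    t = 0
    while t < total_steps:
        chunk = fake_vla_chunk(rng)           # re-plan from a fresh observation
        for action in chunk[:EXECUTE_STEPS]:  # execute only the prefix
            executed.append(action)
            t += 1
            if t >= total_steps:
                break
    return np.stack(executed)

print(run_episode().shape)   # (100, 14): five re-plans of 20 executed steps
```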

Running RL on top of this giant model hits two fundamental difficulties:

Warning

Problem 1: representation blow-up
The last transformer layer pours out N tokens × 2048-dimensional embeddings. Feed that raw into a critic and it will not learn in the small-data regime.

Problem 2: long horizon × sparse reward
50 Hz × a 5–20 s critical phase = 250–1000 steps, with a single binary success/failure signal at the end. Propagating that signal back to the first step with TD learning takes far too many samples.

RLT's two core design choices target exactly these two problems.

ํ•ต์‹ฌ ์•„์ด๋””์–ด: RL Token์ด๋ผ๋Š” ์ž‘์€ ์ฐฝ๋ฌธ

์ง๊ด€: bottleneck์œผ๋กœ์„œ์˜ readout token

VLA ๋‚ด๋ถ€์—๋Š” task์— ํ•„์š”ํ•œ ์ •๋ณด๊ฐ€ ์ด๋ฏธ ์ถฉ๋ถ„ํžˆ ๋“ค์–ด ์žˆ๋‹ค. ๋ฌธ์ œ๋Š” ์–ด๋””์— ์žˆ๋Š”์ง€ ๋ชจ๋ฅธ๋‹ค๋Š” ์ ์ด๋‹ค. ์–ด๋–ค layer์˜ ์–ด๋–ค ํ† ํฐ์ด โ€œ์ง€๊ธˆ ๋‚˜์‚ฌ๊ฐ€ ๋น„๋šค์–ด์ ธ ์žˆ๋‹คโ€๋Š” ์‚ฌ์‹ค์„ ์ธ์ฝ”๋”ฉํ•˜๊ณ  ์žˆ๋Š”์ง€ ์•Œ ๊ธธ์ด ์—†๋‹ค.

์ €์ž๋“ค์˜ ๋‹ต์€ ๋‹จ์ˆœํ•˜์ง€๋งŒ ํšจ๊ณผ์ ์ด๋‹ค โ€” VLA์—๊ฒŒ โ€œํ•œ ํ† ํฐ์œผ๋กœ ์š”์•ฝํ•ด ๋ดโ€๋ผ๊ณ  ์‹œํ‚จ๋‹ค. ๋งˆ์น˜ BERT์˜ [CLS] ํ† ํฐ์ฒ˜๋Ÿผ, ํ•™์Šต ๊ฐ€๋Šฅํ•œ special embedding <rl>์„ ์ž…๋ ฅ ์‹œํ€€์Šค ๋์— ๋ถ™์ด๊ณ , ์ž‘์€ transformer๋กœ ์••์ถ•ํ•˜๊ฒŒ ๋งŒ๋“ ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ๊ทธ๋ƒฅ ์••์ถ•ํ•˜๋ฉด ์–ด๋””๋กœ ์ˆ˜๋ ดํ• ์ง€ ๋ชจ๋ฅด๋‹ˆ๊นŒ, decoder๊ฐ€ ์›๋ž˜ token sequence๋ฅผ reconstructํ•  ์ˆ˜ ์žˆ๋„๋ก ๊ฐ•์ œํ•œ๋‹ค.

Input tokens:    [z_1, z_2, ..., z_M, e_rl]
                              |
                         encoder g_phi
                              |
                              v
Output at last position:  z_rl  (1 x 2048)  <-- this is the RL token
                              |
                         decoder d_phi
                              |
                              v
Reconstruct:     [z_1, z_2, ..., z_M]   (autoregressive)

The key trick: a stop-gradient is applied to the VLA's original embeddings z_i, so the VLA itself is not perturbed while the decoder reconstructs. Only the encoder and decoder (\phi) are trained.

In equations

Let z_{1:M} be the token embeddings produced by the VLA and e_{rl} the learnable special embedding. The RL token is:

z_{rl} = g_\phi\big([z_{1:M}, e_{rl}]\big)_{M+1}

and the reconstruction loss is:

\mathcal{L}_{ro} = \mathbb{E}_\mathcal{D}\Bigg[\sum_{i=1}^M \big\| h_\phi\big(d_\phi([z_{rl}, \bar{z}_{1:i-1}])\big) - \bar{z}_i \big\|^2 \Bigg]

where \bar{z}_i = \text{sg}(z_i) is the stop-gradient. This loss trains \phi (and optionally \theta_{vla}); afterwards everything is frozen.
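A toy numpy sketch of this bottleneck (single linear maps standing in for the g_\phi / d_\phi transformers, autoregressive decoding elided; purely illustrative, not the paper's architecture):

```python
# Toy RL-token bottleneck: pool the VLA embeddings plus a learnable <rl>
# embedding into one vector, then score how well a decoder stand-in can
# reconstruct the (stop-gradient) originals from that single vector.
import numpy as np

rng = np.random.default_rng(0)
M, D = 8, 2048                                 # M VLA tokens of width 2048

z = rng.normal(size=(M, D))                    # frozen VLA embeddings z_{1:M}
e_rl = rng.normal(size=(D,))                   # learnable <rl> embedding
W_enc = rng.normal(size=(D, D)) / np.sqrt(D)   # stand-in for encoder g_phi
W_dec = rng.normal(size=(D, D)) / np.sqrt(D)   # stand-in for decoder d_phi/h_phi

# Encoder: read out one vector at the <rl> position -> the RL token.
z_rl = np.vstack([z, e_rl]).mean(axis=0) @ W_enc     # shape (D,)

# Decoder: reconstruct every z_i from z_rl alone (autoregression elided).
z_bar = z.copy()                               # stop-gradient: targets are constants
recon = np.tile(z_rl @ W_dec, (M, 1))
L_ro = np.mean(np.sum((recon - z_bar) ** 2, axis=1))
print(z_rl.shape)                              # (2048,) -- the 1 x 2048 RL token
```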

์™œ ์ด ๊ฒŒ ์ž˜ ์ž‘๋™ํ•˜๋Š”๊ฐ€ (์ง๊ด€)

์ด๊ฑธ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ƒ๊ฐํ•˜๋ฉด ํŽธํ•˜๋‹ค. VLA์˜ layer ์ถœ๋ ฅ์€ ์ฑ… ํ•œ ๊ถŒ ๋ถ„๋Ÿ‰์˜ ๋„์„œ๊ด€์ด๋‹ค. ๊ทธ ์•ˆ ์–ด๋”˜๊ฐ€์— โ€œ์ง€๊ธˆ ์ƒํ™ฉ์€ ์ด๋ ‡๊ณ , ์–ด๋–ป๊ฒŒ ์›€์ง์ด๋ฉด ๋œ๋‹คโ€๋Š” ๋‹ต์ด ์ ํ˜€ ์žˆ๊ธด ํ•œ๋ฐ, ์–ด๋А ์ฑ… ์–ด๋А ํŽ˜์ด์ง€์ธ์ง€ ๋ชจ๋ฅธ๋‹ค. RL token์€ โ€œ์ด ๋„์„œ๊ด€ ์ „์ฒด๋ฅผ ๋‹ค์‹œ ๋ณต์›ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฐ€์žฅ ์ž‘์€ ์š”์•ฝ๋ณธโ€์„ ๋งŒ๋“ค๋„๋ก ํ•™์Šต๋œ๋‹ค. ๊ทธ ์š”์•ฝ์€ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ task์— ๊ด€๋ จ๋œ ์ •๋ณด๋ฅผ ์šฐ์„ ์ˆœ์œ„๋กœ ๋‹ด๊ฒŒ ๋œ๋‹ค โ€” reconstruction์ด ์•ˆ ๋˜๋Š” ์ •๋ณด๋Š” ๋“ค์–ด ์žˆ์ง€ ์•Š์€ ์…ˆ์ด๊ณ , reconstruction์— ๋ณธ์งˆ์ ์ธ ์ •๋ณด๋Š” ์‚ด์•„๋‚จ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

ablation์—์„œ ์ด RL token์„ ๋‹จ์ˆœํ•œ ImageNet-pretrained ResNet-10์œผ๋กœ ๊ต์ฒดํ•˜๋ฉด throughput์ด ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์–ด๋“ ๋‹ค. ํ‘œ์ค€ vision encoder๋กœ๋Š” manipulation์— ํ•„์š”ํ•œ manipulation-specific structure๋ฅผ ๋ชป ์žก๋Š”๋‹ค๋Š” ๋œป์ด๋‹ค.

Algorithm: a small actor-critic trained on top of the RL token

Now suppose the RL token is ready. What do we learn on top of it?

Overall architecture

flowchart TB
    subgraph FROZEN["FROZEN VLA (π0.6)"]
        VLM["VLM backbone<br/>SigLIP + Gemma"]
        AE["Action expert<br/>diffusion"]
        ENC["RL token encoder"]
    end

    OBS["Observation<br/>images + language + s_p"] --> VLM
    VLM --> ENC
    VLM --> AE
    AE --> AREF["Reference action chunk<br/>ã_1:C"]
    ENC --> ZRL["RL token z_rl"]

    ZRL --> ACTOR
    AREF --> ACTOR["Actor π_θ<br/>(small MLP)"]
    SP["proprio s_p"] --> ACTOR
    ZRL --> CRITIC["Critic Q_ψ<br/>(small MLP)"]
    SP --> CRITIC

    ACTOR --> A["Executed action<br/>a_1:C"]
    A --> CRITIC

    style FROZEN fill:#e0e0e0
    style ACTOR fill:#ffe0b3
    style CRITIC fill:#b3d9ff

The only trained parts are the orange (actor) and blue (critic); everything in gray is frozen.

MDP definition: bundling steps into chunks

It is a standard MDP (S, A, p, r, \gamma), except that the action space is chunk-level:

a_{t:t+C-1} = (a_t, \dots, a_{t+C-1}) \in \mathbb{R}^{C \times d}

The paper uses C = 10 and d = 14, so one chunk is 140-dimensional. It is deliberately shorter than the VLA's chunk length H = 50 (C < H), which raises the re-planning frequency and makes the policy more reactive.

The chunk-level Q-function is:

Q^\pi(s_t, a_{t:t+C-1}) = \sum_{t'=t}^{t+C-1} \gamma^{t'-t} r_{t'} + \gamma^C \mathbb{E}_{a' \sim \pi}\big[Q^\pi(s_{t+C}, a')\big]

Tip

Why does chunking matter so much for RL?

50 Hz × 1000 steps means 1000 TD backups. Propagating a single sparse reward back to the first step requires on that order of updates, a grim amount even as plain arithmetic. With C = 10 the effective horizon shrinks 10-fold, from 1000 to 100. This is not a mere optimization trick; it is the key device for attacking the fundamental credit-assignment problem of sparse-reward RL.
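The effective-horizon arithmetic can be checked on a toy chain MDP (an illustration of the claim, not the paper's learner): with one terminal reward and synchronous TD sweeps that back up C steps at a time, the number of sweeps needed to reach the start shrinks by a factor of C.

```python
# Count synchronous TD sweeps until a single terminal reward propagates
# back to state 0 on a T-step chain, backing up C steps per sweep.
import numpy as np

def sweeps_until_signal(T, C, gamma=0.99):
    V = np.zeros(T + 1)
    V[T] = 1.0                                 # sparse terminal reward
    sweeps = 0
    while V[0] == 0.0:
        nxt = np.minimum(np.arange(T) + C, T)  # chunk-level backup targets
        V[:T] = gamma**C * V[nxt]              # fancy indexing copies -> synchronous
        sweeps += 1
    return sweeps

print(sweeps_until_signal(1000, 1))    # 1000 sweeps without chunking
print(sweeps_until_signal(1000, 10))   # 100 sweeps with C = 10
```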

In the ablation, the single-step variant (w/o Chunk) essentially fails to learn (see the ablation figures in the paper).

Critic training: standard TD3 style

\mathcal{L}_Q = \mathbb{E}_{(x, a_{1:C}, x') \sim \mathcal{B}}\Big[\big(\hat{Q} - Q_\psi(x, a_{1:C})\big)^2\Big]

\hat{Q} = \sum_{t'=1}^C \gamma^{t'-1} r_{t'} + \gamma^C \mathbb{E}_{a' \sim \pi_\theta}\big[Q_{\psi'}(x', a')\big]

Here x = (z_{rl}, s^p), i.e., the RL token plus the proprioceptive state. As in TD3, an ensemble of two Q-networks is used and the minimum is taken when computing the target value (to prevent overestimation).
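A minimal numeric sketch of this target (scalar stand-ins for the twin target critics; illustrative only):

```python
# Chunk-level TD target: discounted in-chunk rewards plus gamma^C times the
# minimum of two target-critic values (TD3-style clipped double Q).
import numpy as np

def chunk_td_target(rewards, q1_next, q2_next, gamma=0.99):
    """rewards: (C,) rewards earned inside the executed chunk;
    q1_next/q2_next: target-network values Q_{psi'}(x', a'), a' ~ pi_theta."""
    C = len(rewards)
    discounts = gamma ** np.arange(C)
    bootstrap = min(q1_next, q2_next)      # take the min to curb overestimation
    return float(discounts @ rewards + gamma**C * bootstrap)

# Typical sparse-reward chunk: no reward inside, bootstrap from x'.
print(chunk_td_target(np.zeros(10), q1_next=2.0, q2_next=1.5))  # 0.99**10 * 1.5
```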

Actor training: reference-action conditioning + a BC regularizer

This is RLT's other core piece. Instead of learning a free-standing RL policy, the actor receives the reference action chunk \tilde{a}_{1:C} proposed by the VLA as input and explores near it.

\pi_\theta(a_{1:C} \mid x, \tilde{a}_{1:C}) = \mathcal{N}\big(\mu_\theta(x, \tilde{a}_{1:C}), \sigma^2 I\big)

The training objective is:

\mathcal{L}_\pi(\theta) = \mathbb{E}_{\substack{s \sim \mathcal{B} \\ a_{1:C} \sim \pi_\theta}}\Big[ -Q_\psi(x, a_{1:C}) + \beta \, \|a_{1:C} - \tilde{a}_{1:C}\|_2^2 \Big], \quad \tilde{a}_{1:C} \sim \pi_{vla}(\cdot \mid s, \ell)

The meaning of the two terms:

  • First term -Q_\psi: move toward actions the critic rates highly.
  • Second term \beta \|a - \tilde{a}\|^2: while doing so, do not stray too far from the VLA's proposal.

Put together: "lightly edit the action in the direction the critic points, within the neighborhood of what the VLA recommended." The paper calls this local action editing. In spirit it belongs to the same family as KL-regularized RL (MPO, Peng et al., etc.).

Important

Why output absolute actions instead of a residual?

Prior methods such as PLD and Policy Decorator learn a residual: a small correction added to the VLA's output. RLT instead outputs absolute actions directly, but ties them down with conditioning + regularization.

The difference is subtle but important:

  • A residual requires a hand-tuned scaling factor (how strongly to correct).
  • Absolute + regularization needs only \beta, and above all it forms a natural spectrum: \beta = 0 gives unconstrained RL, \beta \to \infty gives imitation.
  • One more point: a single mode is sampled from the VLA's multimodal action distribution and then refined in its neighborhood. The mode-averaging problem of a unimodal Gaussian actor trying to imitate multimodal demos directly disappears.

Reference action dropout: an anti-copying device

There is a trap here. If the actor receives the reference \tilde{a} as input and is regularized not to stray from it, the easiest answer is to output \tilde{a} verbatim. This collapse happens especially easily early in training, when the critic is not yet informative.

The fix is simple. In each batch, \tilde{a} is masked to zero for a random half of the transitions. The actor must then maintain an independent path that can produce actions without the reference. At inference time the reference is always provided.
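In code, the masking step might look like this (shapes illustrative):

```python
# Reference action dropout: zero out the VLA reference chunk for a random
# half of the transitions in a batch, forcing the actor to keep a path
# that produces actions without the reference.
import numpy as np

def dropout_references(a_ref_batch, p=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    keep = rng.random(len(a_ref_batch)) >= p        # True -> reference kept
    return a_ref_batch * keep[:, None], keep

batch = np.ones((1000, 140))                        # 1000 flattened reference chunks
masked, keep = dropout_references(batch)
print((masked[~keep] == 0).all(), (masked[keep] == 1).all())
```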

This seemingly minor device makes a surprisingly large difference. The ablation's w/o Pass-Through variant (removing the reference from the actor entirely) does eventually reach similar final performance, but suffers far more failures along the way.

System: from data collection to policy updates

The full flow in pseudocode

# Stage 1: VLA & RL token adaptation (offline, small demo dataset)
Train phi (and optionally theta_vla) with reconstruction loss L_ro

# Stage 2: Online RL
Initialize critic Q_psi, actor pi_theta from scratch
Pre-fill replay buffer B with N_warm steps of VLA rollouts

for environment_step t = 0, C, 2C, ...:
    sample reference chunk a_tilde from VLA
    form RL state x = (z_rl(s), s_p)
    
    if human_intervenes:
        a = a_human
    elif t < N_warm:
        a = a_tilde
    else:
        a ~ pi_theta(. | x, a_tilde)
    
    execute a; observe r, s', s_p'
    if intervention: a_tilde <- a_human   # log corrected reference
    push <x, a, a_tilde, r, x'> into B
    
    # G updates per environment step (UTD ratio = 5)
    for g = 1..G:
        sample batch from B
        update Q_psi via TD backup           (Eq. 3)
        update pi_theta via Q + BC loss      (Eq. 5)

์ž‘๋™ ์›๋ฆฌ์—์„œ ๋ˆˆ์—ฌ๊ฒจ๋ณผ ๋””ํ…Œ์ผ๋“ค

1. Update-to-data ratio = 5
ํ™˜๊ฒฝ step ํ•œ ๋ฒˆ๋งˆ๋‹ค critic update๋ฅผ 5๋ฒˆ ํ•œ๋‹ค. small-data regime์—์„œ sample efficiency๋ฅผ ์งœ๋‚ด๊ธฐ ์œ„ํ•œ ํ‘œ์ค€ ํŠธ๋ฆญ์ด์ง€๋งŒ, value divergence ์œ„ํ—˜์ด ์žˆ์–ด ensemble๊ณผ BC reg๊ฐ€ ์•ˆ์ „์žฅ์น˜ ์—ญํ• ์„ ํ•œ๋‹ค.

2. Action chunk subsampling (stride = 2)
Although a chunk spans C steps, samples are carved out with stride 2: <x_0, a_{0:C}>, <x_2, a_{2:C+2}>, <x_4, a_{4:C+4}>, and so on. Off-policy learning makes this legal, and it boosts data efficiency once more.
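Index-wise, the subsampling looks like this (indices only; observations and actions elided):

```python
# Stride-2 chunk subsampling: carve overlapping <x_t, a_{t:t+C}> training
# transitions out of one logged trajectory of length T.
def subsample_chunks(T, C=10, stride=2):
    """Return (start, end) index pairs for chunks a_{t:t+C} along the trajectory."""
    return [(t, t + C) for t in range(0, T - C + 1, stride)]

print(subsample_chunks(16))   # [(0, 10), (2, 12), (4, 14), (6, 16)]
```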

3. Focused training on the critical phase
RL only really makes a difference in the hard part. So each episode starts under the base VLA, and a human hands control to the RL policy at the moment the critical phase begins (a concept similar to interactive imitation learning). After training, the VLA is briefly fine-tuned to predict the handoff point itself, so no human involvement is needed at test time.

4. Human-in-the-loop
A human can intervene via teleoperation whenever needed, and the corrected actions also accumulate in the buffer. This design is borrowed directly from HIL-SERL.

flowchart LR
    Start["Episode start"] --> BaseVLA["Base VLA runs<br/>the early portion"]
    BaseVLA --> Trigger{"Critical phase<br/>reached?"}
    Trigger -->|"training: human signal"| RL["Hand off to<br/>RL policy"]
    Trigger -->|"test: VLA's own prediction"| RL
    RL --> Inter{"Intervention<br/>needed?"}
    Inter -->|"yes"| Tele["Teleop correction<br/>a_human"]
    Inter -->|"no"| Auto["Execute actor output"]
    Tele --> End["Success/failure<br/>sparse reward"]
    Auto --> End

Experiments: four precision manipulation tasks

Task setup

Task | Core difficulty | Critical-phase duration
Screw installation | align an M3 screw with sub-mm precision; the 10 cm grip-to-tip distance amplifies rotational error | 5–20 s
Zip tie fastening | thread a deformable tie through a narrow slot (bimanual) | 5–20 s
Ethernet insertion | exact angle plus a decisive insertion motion | 5–20 s
Charger insertion | outlet alignment; small errors trigger repeated probing | 5–20 s

Full tasks run 30–120 s, i.e., 1500–6000 steps at 50 Hz control. The critical phase alone is on the order of 250–1000 steps.

Q1: Does RLT really improve on the VLA baseline?

The answer is a clear "yes." In both the critical-phase and full-task settings, success rate and throughput (successes per 10 minutes) rise sharply.

Throughput improvement (ASCII chart, critical phase):

                Base VLA   RLT (Ours)
Screwdriver:    ~5         ~15        (3x)
Zip tie:        ~3         ~14        (~5x)
Ethernet:      ~150       ~400       (~3x)
Charger:       ~200       ~600       (~3x)

On full tasks, absolute success rates are lower because of accumulated error in earlier stages such as grasping, but the screwdriver task still gains +40 and the zip tie +60 percentage points. On the hard screwdriver task in particular, critical-phase success jumps from 20% to 65%.

Q2: How does it compare with other RL methods?

The most challenging baselines:

Method | Core idea | Ethernet result
HIL-SERL | ResNet + actor-critic, no VLA | effectively fails to learn (50 Hz, no action box)
PLD (Probe-Learn-Distill) | single-step residual policy | fails to learn (long horizon × sparse reward)
DSRL | RL in diffusion noise space | similar success rate, far lower throughput
DAgger | supervised fine-tuning on intervention data | similar success rate, capped at demo speed
RLT (ours) | RL token + chunked actor-critic + BC reg | keeps the success rate, 2× fewer average steps

The most meaningful finding: the dismal failure of the single-step methods (HIL-SERL, PLD) is no accident. In the 50 Hz × hundreds-of-steps × sparse-reward regime, TD simply does not work without chunking.

Q3: Is each component really necessary? (Ablation)

Summarizing the results of the paper's Figs. 7 and 8:

Removed component | Effect
RL token → ResNet-10 | throughput drops by 50%
Action chunk → single-step | learning itself struggles; even matching the base policy is hard
BC regularizer (\beta = 0) | largest performance drop; with Q-gradients alone the action-space search is too broad
Reference pass-through | reaches similar final performance, but with far more failures during training
Note

The most surprising result: the w/o BC Regularizer ablation causes the largest performance loss. This means that keeping RL close to the VLA's behavior is not merely a safety measure but the core of sample-efficient learning. Unconstrained RL gets lost in the 140-dimensional chunk space.
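To make the ablation concrete, here is a minimal numpy sketch of a BC-regularized actor objective in the spirit the note describes; the linear stand-in networks, dimensions, and `BETA` are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

# Stand-ins for the small actor/critic that operate on the RL token.
rng = np.random.default_rng(0)
TOKEN_DIM, CHUNK_DIM, BETA = 32, 140, 0.1  # illustrative dimensions

W_actor = rng.normal(size=(TOKEN_DIM + CHUNK_DIM, CHUNK_DIM)) * 0.01
W_critic = rng.normal(size=(TOKEN_DIM + CHUNK_DIM, 1)) * 0.01

def actor_loss(rl_token, ref_chunk, beta=BETA):
    """Maximize Q while staying near the frozen VLA's reference chunk."""
    # Actor refines the reference chunk, conditioned on the RL token.
    a = np.concatenate([rl_token, ref_chunk]) @ W_actor + ref_chunk
    # Critic scores the (token, action-chunk) pair.
    q = np.concatenate([rl_token, a]) @ W_critic
    # BC regularizer: with beta = 0 (the ablation), only -Q remains and
    # exploration must cover the full 140-dim chunk space unguided.
    bc = np.mean((a - ref_chunk) ** 2)
    return float(-q.mean() + beta * bc)

loss = actor_loss(rng.normal(size=TOKEN_DIM), rng.normal(size=CHUNK_DIM))
```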

Q4: Qualitative finding: the emergence of new strategies

This part was personally the most interesting. Comparing the episode-length distributions of the base VLA, teleop demos, and the RLT policy on the Ethernet task:

Episode length (timesteps) - Ethernet critical phase
  0     50    100   150   200   250   300   350   400
  +-----+-----+-----+-----+-----+-----+-----+-----+
                          *  Teleop median = 146
                                  *  Base policy median = 228
            *  RLT median = 66
  +-----+-----+-----+-----+-----+-----+-----+-----+

Half of the RLT episodes are faster than even the fastest human demos. The new strategies the policy discovered:

  • Base VLA: approach → back off slightly → realign → retry (probing)
  • Teleop: a single smooth insertion
  • RLT: fluid approach; if the first attempt fails, a slight wiggle that exploits compliance

This wiggle strategy appears nowhere in the demo data. It emerged purely from online exploration. That is clear evidence of RL breaking the imitation ceiling, consistent with the pattern shown by RECAP and RL-100, only at a far lighter training budget.

Critical Discussion

Strengths

1. Conceptual simplicity and clarity. Every design choice is justified by a clear reason: the RL token for representation compression, chunking for credit assignment, the BC regularizer for constraining exploration, and dropout for preventing collapse. There is nothing superfluous.

2. Sample efficiency. "A few hours" is genuinely short by robotics standards. The ablation result of surpassing the baseline in just 5 minutes is particularly striking.

3. A policy faster than humans. The Ethernet result is not merely "as good as a human"; it is in the territory of "faster than a human while maintaining reliability". The industrial implications are significant.

4. Modularity. Freezing the VLA means only the RL token + actor-critic need to be trained per task, so task-specific improvements can accumulate without touching the base model.

Limitations and open questions

1. Humans are still heavily involved. As the paper acknowledges, the system needs human input for (a) sparse-reward labeling, (b) providing interventions, and (c) deciding when to hand off between RL and base. An "automated reward model + progress prediction" is mentioned as future work, but real field deployment is still some distance away.

2. Demo dependence of RL-token training. The reconstruction objective trains on the demo distribution the VLA has seen. If RL discovers behavior far outside that distribution (e.g., the wiggle), there is no guarantee the RL token remains informative in those new states. Since the w/o RL Token ablation still learns, the failure mode is not catastrophic, but OOD robustness was never explicitly measured.

3. Evaluation limited to tasks with short critical phases. A 5–20 s critical phase is short by manipulation standards. Whether chunked TD also works for minute-scale critical phases (e.g., precision disassembly, extended assembly) is unknown. Increasing C enlarges the chunk dimension accordingly, making actor learning hard again.

4. Design specific to \pi_{0.6}. The RL token encoder takes the transformer's final-layer embedding. Whether it applies directly to flow-based VLAs (e.g., \pi_0, \pi_{0.5}), or works on other backbones such as GR00T, remains to be verified.

5. The \beta hyperparameter. The BC-regularizer strength \beta appears to need per-task tuning (the paper gives no clear auto-tuning scheme). Too large, and the policy collapses to copying the VLA; too small, and it falls into the unconstrained-RL trap. Combining an MPO-style method that automatically adjusts a KL budget looks promising.
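Not from the paper: one way such an auto-tuning scheme could look is a dual gradient step on \beta against a BC-distance budget, in the spirit of MPO's KL constraint or SAC's temperature tuning. `bc_target` and the learning rate below are assumptions.

```python
import math

def update_beta(beta, bc_distance, bc_target=0.05, lr=0.01):
    """Raise beta when the policy drifts past the budget, lower it otherwise.
    A dual gradient step on log(beta) keeps beta positive."""
    log_beta = math.log(beta) + lr * (bc_distance - bc_target)
    return math.exp(log_beta)

beta = 0.1
for bc in (0.2, 0.2, 0.01):  # drift, drift, then back inside the budget
    beta = update_beta(beta, bc)
```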

6. No deeper analysis of why the RL token works. The paper shows the RL token beats ResNet, but ablations are missing on which VLA layer is optimal, what happens when the token dimension is shrunk or enlarged, and whether a multi-token bottleneck would help. The design space is likely richer than explored.

Positioning relative to related work

flowchart TB
    subgraph Full["Full VLA RL fine-tuning"]
        RECAP["RECAP π0.6*<br/>offline RL, advantage policy extraction"]
        Simple["SimpleVLA-RL, πRL<br/>PPO family"]
    end
    subgraph Light["Lightweight VLA-augmented RL"]
        ConRFT["ConRFT<br/>action head + consistency"]
        PLD["PLD<br/>single-step residual"]
        PolicyDec["Policy Decorator<br/>scaled residual"]
        DSRL["DSRL<br/>noise-space steering"]
        GRRL["GR-RL<br/>noise predictor for diffusion"]
        RLT["RLT (this work)<br/>chunked actor-critic on RL token"]
    end
    subgraph NoVLA["VLA-free real-world RL"]
        HILSERL["HIL-SERL"]
        RL100["RL-100"]
        SERL["SERL"]
    end
    
    Full -->|"compute heavy"| Cost["high cost"]
    NoVLA -->|"no VLA prior"| NoGen["no generalization"]
    Light --> Sweet["sample-efficient"]
    
    style RLT fill:#ffe0b3

The closest relative is DSRL. Both freeze the VLA and run RL around it. The difference is where the RL hooks in:

  • DSRL: RL on the VLA's noise space (indirect modulation).
  • RLT: RL on the actual action space, tethered to the VLA reference (direct refinement).

DSRL is stable but its improvement margin is limited. RLT allows more aggressive refinement, so throughput rises sharply. Meanwhile, GR-RL and ConRFT depend on more model-specific mechanisms (latents, consistency), whereas RLT is standard off-policy actor-critic, which gives it good portability.

The most interesting comparison is RECAP. RECAP, another project from the same company, RL fine-tunes the entire VLA and more than doubled throughput on long-horizon tasks like making espresso and folding laundry. RLT is far lighter by comparison. The two approaches actually look complementary: strengthen the VLA itself with RECAP-style large-scale offline RL, then layer RLT-style fast online refinement on top. That picture feels natural.

Takeaways for Roboticists

This goes beyond a simple paper review; it is a summary of what is actually useful for those of us doing dexterous manipulation.

1. The "bottleneck token" is a general tool.
The RL-token idea is not actually limited to VLAs. When stacking a small downstream task on top of a giant multi-modal model, a single readout obtained via encoder–decoder reconstruction can be a good starting point. It is worth trying for tactile-conditioned policies, sim-to-real residual learning, and more.
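A minimal sketch of that recipe, with illustrative linear encoder/decoder stand-ins and assumed dimensions: compress frozen-model features into one readout vector, shaped by a reconstruction loss.

```python
import numpy as np

# General "bottleneck readout" recipe: train a small encoder/decoder pair
# on the frozen model's features so that a single compressed vector must
# carry enough information to rebuild them. Dimensions are assumptions.
rng = np.random.default_rng(1)
FEAT_DIM, TOKEN_DIM = 512, 32

W_enc = rng.normal(size=(FEAT_DIM, TOKEN_DIM)) * 0.01
W_dec = rng.normal(size=(TOKEN_DIM, FEAT_DIM)) * 0.01

def readout_and_recon_loss(features):
    """Single readout token, plus the reconstruction loss that shapes it."""
    token = features @ W_enc   # the compressed readout
    recon = token @ W_dec      # decoder tries to rebuild the features
    loss = np.mean((recon - features) ** 2)
    return token, loss

feats = rng.normal(size=FEAT_DIM)  # stand-in for frozen-model features
token, loss = readout_and_recon_loss(feats)
```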

2. RL and chunking must not be designed in isolation.
Action chunking is no longer just a "trick for reducing replanning frequency in BC". It is an essential structure for RL to work under sparse reward. Anyone who has tried per-step RL at 50 Hz knows how large this difference is.

3. The core question is how to preserve the VLA's prior while learning on top of it.
This paper's answer: (a) freeze, (b) reference conditioning, (c) BC regularization. That is a useful template for porting an existing RL pipeline to a VLA-augmented one on platforms like the Allegro Hand.

4. The "last millimeter" is genuinely where RL shines.
The "VLA for the first part + RL for the last part" structure is very attractive in practice. For contact-rich precision manipulation, there is no need to learn the whole task with RL; concentrating on the hardest phase wins decisively on sample efficiency.
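A hypothetical sketch of that handoff structure; the trigger, policies, and toy 1-D environment below are all illustrative stand-ins (in the paper, the handoff point is decided by a human).

```python
def run_episode(obs, base_policy, rl_policy, near_contact, env_step, max_steps=200):
    """Run the base VLA for the approach, then hand off to the RL policy."""
    phase = "base"
    for _ in range(max_steps):
        if phase == "base" and near_contact(obs):
            phase = "rl"  # switch once the critical phase begins
        act = (rl_policy if phase == "rl" else base_policy)(obs)
        obs, done = env_step(obs, act)
        if done:
            break
    return obs, phase

# Toy 1-D "insertion": move toward 0, switching controllers near contact.
result, final_phase = run_episode(
    obs=1.0,
    base_policy=lambda o: -0.1,    # coarse approach
    rl_policy=lambda o: -0.01,     # fine refinement
    near_contact=lambda o: o < 0.2,
    env_step=lambda o, a: (o + a, o + a <= 0.0),
)
```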

5. Emergent strategies like the wiggle can never be obtained from demo data.
This is clear evidence that there are regions imitation learning alone cannot reach. Policies that actively exploit compliance are especially meaningful for sim-to-real, since they tie directly to the accuracy of Isaac Lab's contact models.

Wrapping Up

RLT solves the familiar picture of "giant VLA + small RL module" in the cleanest way yet. It is not a flashy new algorithm: the RL-token bottleneck, chunked TD, and BC-regularized actor-critic are all pre-existing ideas. The paper's value lies in the clarity of the design that ties them into one system, and in the strength of the results.

In particular, I believe two messages will stay with our field for a long time:

  1. Freezing the VLA is enough; you just need to put a small RL learner on top of it well. Full fine-tuning is not always the answer.
  2. A policy faster than a human is no longer a simulation-only fantasy. It has been demonstrated on a real robot, with a few hours of data, on top of a general-purpose VLA.

For tasks like precision in-hand reorientation, peg-in-hole, and tool use on a dexterous platform such as the Allegro Hand, RLT's design is a template we can adopt almost as-is. Extending the RL token's inputs with tactile sensing (DIGIT, GelSight, etc.) is also a natural step. Beyond that, (a) automating the reward model, (b) fully automating the RL/base handoff, and (c) verifying portability across diverse VLA backbones look like the most interesting follow-up directions.

Tip

To sum up in one line:
Compress the VLA's vast prior knowledge into one small token, then polish only the precision on top with a lightweight actor-critic. The result is a policy faster than humans, with behaviors human demos could never supply emerging on their own. A model case of "learn small, leverage big" in robotics.


Reference
Xu et al., RL Token: Bootstrapping Online RL with Vision-Language-Action Models, Physical Intelligence, 2025. pi.website/research/rlt

Copyright 2026, JungYeon Lee