📃 XL-VLA Review

cross-embodiment
vla
dexterity
latent
Cross-Hand Latent Representation for Vision-Language-Action Models
Published

March 13, 2026

  • Paper Link
  • Project Link
  1. 💡 XL-VLA is a Vision-Language-Action (VLA) framework that enables scalable cross-embodiment dexterous manipulation by using a unified latent action space shared across different dexterous hands.
  2. 🛠️ This embodiment-invariant latent space is pretrained with an unsupervised autoencoder; reconstruction, retargeting, and latent regularization losses bridge the kinematic differences between hands.
  3. 📈 In real-robot experiments, XL-VLA outperforms existing VLA models and generalizes zero-shot to new hand-task combinations, which makes data reuse efficient.

🔍 Ping Review

🔍 Ping: a light tap on the surface. Get the gist in seconds.

The XL-VLA paper proposes a Cross-Hand Latent Representation for Vision-Language-Action (VLA) models, enabling scalable robot manipulation across dexterous hands of different designs. Existing VLA models have an action space that depends on the robot's morphology, so every new robot means collecting large-scale data and retraining. The problem is especially sharp for dexterous hands, where the joint-position parameterization differs widely from one embodiment to the next. To address this, the work introduces a unified latent action space shared across dexterous hands.

XL-VLA overview: the architecture that decodes a shared latent action across four dexterous hands (Ability, Paxini DexH13, X-Hand1, Inspire), along with the experimental setup and the collected objects.

Core Methodology

XL-VLA has two main components: (1) a VLA backbone that encodes the multimodal inputs (vision $V$, language $T$), and (2) a set of pretrained latent encoders and decoders for cross-embodiment transfer.

  1. Problem Formulation: Each dexterous hand $h \in H$ has $d_h$ actuated joints and is controlled through absolute joint rotations $q^{(h)} \in \mathbb{R}^{d_h}$. The policy operates on action chunks: each action $q^{(h)}_t \in \mathbb{R}^{64 \times d_h}$ is a sequence of 64 joint-position commands sampled at 20 Hz (3.2 seconds of motion). At step $t$, the policy takes the previous joint state (the previously executed action chunk $q^{(h)}_t$), the current image $V$, and the language instruction $T$, and predicts the next chunk $q^{(h)}_{t+1}$:

     $$q^{(h)}_{t+1} = F(q^{(h)}_t, V, T)$$

     Here $F$ is a hand-agnostic model, and the hand ID $h$ is used only to select the appropriate encoder/decoder.

  2. XL-VLA Pipeline: XL-VLA builds on the $\pi_0$ [6] architecture. Where the original $\pi_0$ supplies proprioceptive history as a stack of state tokens, XL-VLA uses latent action tokens instead. For each hand $h$, a hand-specific encoder $E_h$ maps the previous absolute joint-position action chunk $q^{(h)}_t$ to a compact latent vector $z_t = E_h(q^{(h)}_t)$. Conditioned on a short history of these latent tokens plus the vision and language tokens, the VLA model predicts the next latent chunk $\hat{z}_{t+1}$. An embodiment-specific decoder $D_h$ then decodes that latent into the next joint-command chunk $\hat{q}^{(h)}_{t+1} = D_h(\hat{z}_{t+1})$. All latent encoders and decoders stay frozen during VLA fine-tuning.


XL-VLA model pipeline: built on top of $\pi_0$, it pairs the vision/language encoders with an action expert that operates in the shared latent action space; the encoders and decoders stay frozen during VLA training.

  3. Latent Space Learning: The latent space is pretrained independently of the VLA model with a multi-head, VAE-style (Variational Autoencoder) autoencoder. A hand-specific encoder $E_h$ and decoder $D_h$ are defined for each hand type $h \in H$. The input $q^{(h)}$ is projected into the common latent space by an encoder MLP, and a decoder MLP projects the latent embedding back to the hand's original joint configuration.

    Three training constraints are imposed to shape a meaningful cross-embodiment latent space:

    • Reconstruction loss ($L_1$): ensures that each encoder-decoder pair works as an autoencoder for its own hand.

      $$L_1 = L_{rec} = \frac{1}{|H|} \sum_{h \in H} \text{MSE}(\hat{q}^{(h)}, q^{(h)})$$

      This makes the latent space preserve hand-specific kinematics.
    • Retargeting loss ($L_2$): aligns fingertip geometry across different dexterous hands. For each hand $h$, differentiable forward kinematics (FK) maps the joints to fingertip positions $p^{(h)}_i$, and the fingertip displacement is defined as $\delta^{(h)}_{ij} = p^{(h)}_i - p^{(h)}_j$.

      $$L_2 = \frac{1}{|H|(|H|-1)|P|} \sum_{s \neq t} \sum_{(i,j) \in P} w^{(s)}_{ij} \left[ \lambda_{dis} \| \delta^{(s)}_{ij} \|^2 - \| \hat{\delta}^{(t)}_{ij} \|^2 \right]^2 + \lambda_{dir}(1 - c^{(s,t)}_{ij})$$

      In this formula $s$ and $t$ index the source and target hands, not time steps. $\hat{\delta}^{(t)}_{ij}$ is computed from the decoded configuration of hand $t$, and $c^{(s,t)}_{ij}$ is the cosine of the angle between the pinch directions $\delta^{(s)}_{ij}$ and $\hat{\delta}^{(t)}_{ij}$. The weight $w^{(s)}_{ij} = \exp(-\lambda_{exp} \| \delta^{(s)}_{ij} \|^2)$ emphasizes strong pinches, i.e. fingertip pairs that are close together. This loss makes the same latent code produce geometrically consistent pinch behavior across hands.
    • Latent loss ($L_3$): imposes a standard Gaussian prior on the latent variable to keep the dexterous-hand latent space smooth and well-behaved.

      $$L_3 = L_{KL} = \mathbb{E}_q[ \text{KL}(q(z | q) \| \mathcal{N}(0, I)) ]$$

      This encourages the shared latent space to follow $\mathcal{N}(0, I)$, which makes sampling and interpolation easier.

    The total latent objective is $L_{latent} = L_1 + L_2 + \beta L_3$, with the weights fixed at $\beta = 10^{-5}$, $\lambda_{dis} = 2000.0$, $\lambda_{dir} = 5.0$, and $\lambda_{exp} = 12.0$.
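To make the weighting concrete, here is a minimal Python sketch of how the reconstruction and KL terms combine into the total objective. It assumes a diagonal-Gaussian encoder output (`mu`, `log_var`), which is the usual VAE setup but is not confirmed by the review; the function names and the toy inputs are illustrative, and the retargeting term is passed in as a precomputed number.

```python
import math

BETA = 1e-5  # KL weight quoted in the text


def mse(pred, target):
    """Mean squared error between two equal-length joint vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)) for one sample.

    Assumption: the encoder outputs a diagonal Gaussian, as in a standard VAE.
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def latent_objective(recon_per_hand, retarget_loss, kl_loss, beta=BETA):
    """L_latent = L_1 + L_2 + beta * L_3, with L_1 averaged over hands."""
    l_rec = sum(recon_per_hand) / len(recon_per_hand)
    return l_rec + retarget_loss + beta * kl_loss


# Toy numbers for two hypothetical hands, not results from the paper.
recon = [mse([0.1, 0.5], [0.0, 0.5]), mse([0.3, 0.3, 0.3], [0.3, 0.3, 0.0])]
kl = kl_to_standard_normal(mu=[0.5, -0.5], log_var=[0.0, 0.0])
print(latent_objective(recon, retarget_loss=0.2, kl_loss=kl))
```

With $\beta = 10^{-5}$ the KL term acts as a very light regularizer: it only nudges the latent toward $\mathcal{N}(0, I)$ while reconstruction and retargeting dominate the objective.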


Latent space pretraining pipeline: the encoder-decoder structure that maps each hand's joint positions into the shared latent space, and where the reconstruction, retargeting, and KL regularization losses are applied.

The latent autoencoder is trained without any demonstrations or trajectories generated by inverse kinematics (IK). Instead, for each hand $s \in H$, joint configurations $q^{(s)}$ are sampled at random within the hardware joint limits. Alignment of the latent space is therefore fully self-supervised and needs no paired cross-hand trajectories.
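The data-generation step described above is simple enough to sketch in a few lines. The hand names, joint counts, and limits below are hypothetical placeholders, and uniform sampling is an assumption about what "random" means here.

```python
import random

# Hypothetical joint limits in radians: one (lower, upper) pair per actuated joint.
JOINT_LIMITS = {
    "hand_a": [(0.0, 1.6)] * 6,
    "hand_b": [(-0.3, 1.4)] * 12,
}


def sample_configuration(hand_id, rng):
    """Draw one joint configuration q^(h) uniformly inside the joint limits."""
    return [rng.uniform(lo, hi) for lo, hi in JOINT_LIMITS[hand_id]]


def sample_batch(hand_id, batch_size, seed=0):
    """Build a training batch for one hand; no demonstrations or IK involved."""
    rng = random.Random(seed)
    return [sample_configuration(hand_id, rng) for _ in range(batch_size)]


batch = sample_batch("hand_b", batch_size=4)
print(len(batch), len(batch[0]))
```

Each hand gets its own independent batch, so no correspondence between hands is ever specified in the data; any cross-hand alignment has to come from the losses.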

Experiments and Results

The authors built a large-scale teleoperation dataset covering 10 diverse manipulation tasks and four dexterous hands: Ability, Paxini DexH13, X-Hand1, and Inspire (2M state-action pairs in total). Experiments were run on an xArm and on a Unitree G1 humanoid robot.


Visualization of a continuous grasping latent trajectory rendered across the four robot-hand embodiments (X-Hand highlighted for clarity). The same latent code produces consistent motion on different hands.

  1. Effectiveness of VLA + Latent Integration:
    • Cross-hand data scaling: XL-VLA improves consistently and substantially over the $\pi_0$ baseline on every hand and task (Table 2). The average success rate of $\pi_0$ is only 0.32, while XL-VLA reaches 0.72, an absolute gain of roughly 40 percentage points. The improvement is most visible on fine manipulation tasks.
    • Cross-robot data scaling: when data from the tabletop xArm and the G1 humanoid are trained together, XL-VLA achieves a 57% higher success rate than $\pi_0$ on G1 in relative terms (XL-VLA: 0.825, $\pi_0$: 0.525) (Figure 5, Table 6). The unified latent space is therefore useful across heterogeneous robot systems as well.

G1 cross-robot performance: co-training with the aligned latent action space compared against the raw action space, across different state/action lengths.

    • Zero-shot task generalization: XL-VLA generalizes zero-shot to held-out tasks (Figure 4). Compared with the $\pi_0$+RT baseline, which uses standard kinematic retargeting, XL-VLA is consistently better on every embodiment and task, and the advantage is clearest on fine dexterous tasks.

Zero-shot unseen-task generalization results: success rate (SR) and partial success rate (PSR) of several embodiments on the held-out task evaluation.

  2. Effectiveness of the Latent Action Space:
    • Latent replay comparison: against a supervised latent-space retargeting method, Latent Action Diffusion (LAD) [2], XL-VLA's latent space achieves a much higher replay success rate (Table 4). LAD reaches only 0.60 and 0.61, while XL-VLA reaches 0.82 and 0.81. This suggests that XL-VLA's latent space captures embodiment-invariant structure effectively even though it is learned without supervision.
    • Design choice comparison: an ablation study analyzes how the latent-space architecture and the loss design affect performance (Table 5). The final configuration (hidden sizes 128->64, latent dimension 32) gives a strong balance across metrics such as reconstruction accuracy, cross-embodiment retargeting, latent continuity, and interpolation smoothness. In particular, both the reconstruction loss ($L_1$) and the retargeting loss ($L_2$) turn out to be essential for cross-embodiment performance. A latent dimension that is too large (e.g. 128) can disrupt the embodiment-invariant structure.

Conclusion

XL-VLA presents a scalable framework for applying Vision-Language-Action models to dexterous manipulation through a unified latent action space. The approach allows seamless training across different robot hands and supports zero-shot generalization to new hand-task combinations. In extensive real-world experiments, XL-VLA consistently outperforms both standard VLA models and retargeting-based baselines. This suggests that a latent action space can be a strong foundation for dexterous manipulation systems that generalize and use data efficiently.

🔔 Ring Review

🔔 Ring: an idea that echoes. Grasp the core and its value.

One-line summary

Several dexterous hands that differ in finger count, joint layout, and control parameters are tied together in one shared latent action space, so that the VLA learns "what motion is intended" rather than "which hand this is". As a result, data collected on one hand flows to the others, and the model generalizes zero-shot to (hand × task) combinations it has never seen.

Introduction: must every hand learn its policy from scratch?

VLA (Vision-Language-Action) models put robot actions on top of internet-scale vision-language priors and succeeded in handling "see → understand → move" with a single model. That success wobbles the moment the hand changes.

The heart of the problem is the action space. A single gripper only needs "open/close", but a dexterous hand has somewhere between 12 and 20 degrees of freedom, and even the meaning of each joint differs from hand to hand. Ability, Paxini DexH13, X-Hand1, Inspire: these four hands all differ in finger count, joint count, and range of motion. The same "pinch with thumb and index finger" motion, written as a vector of joint angles, becomes a completely different set of numbers on each hand. So a VLA trained on one hand breaks down when deployed as-is on another, and each new hand ends up requiring a fresh round of large-scale data collection and retraining.

Existing workarounds came in two flavors. (1) Abstract to a common end-effector pose, which throws away the delicate per-finger contact. (2) Convert motions between hands with kinematic retargeting, which becomes less accurate as the kinematic gap between hands grows and breaks on fine manipulation. XL-VLA asks: "Can we build a common action language that is independent of the hand type yet still carries finger-level intent?"

Core idea: write actions as "intent", not as "hand"

XL-VLA's answer is to hire an interpreter. Whether someone speaks Korean or English, the meaning of "pick up the apple" is the same. The interpreter writes that meaning down in a neutral concept space and then renders it in the listener's language.

That "neutral concept space" is the shared latent action space. Each hand gets a dedicated encoder (compressing its own joint vector into the common latent) and a dedicated decoder (restoring the common latent into its own joint commands). The VLA itself does not know about the hand. It only watches the stream of latent tokens and predicts the next latent, and the hand ID is nothing more than a switch that picks which encoder/decoder to plug in. What the VLA ends up learning is not a hand-dependent command such as "move joint 13 on the Inspire to 0.3 rad" but an embodiment-invariant intent such as "we are now in the grasping phase".

A closer look at the method

1) The multi-head autoencoder that builds the shared latent

The latent space is learned first, separately from the VLA. Each hand $h$ has an encoder $E_h$ and a decoder $D_h$; an input joint configuration $q^{(h)}$ is sent to the common latent $z$ and then brought back as $\hat q^{(h)}$.

The cleverest part is how the training data is produced. No demonstrations and no IK-generated trajectories are needed. Joint configurations are simply sampled at random within each hand's hardware joint limits. Latent alignment is therefore fully self-supervised, and it does not even require paired cross-hand data of the form "this motion on hand A = that motion on hand B".

flowchart LR
    subgraph PRE["latent pretraining (self-supervised)"]
        Q["random joint samples<br/>q^(h) (per hand)"] --> E["per-hand encoder E_h"]
        E --> Z["shared latent z<br/>~ N(0, I)"]
        Z --> D["per-hand decoder D_h"]
        D --> R["reconstruction q̂^(h)"]
    end
    Z -. "differentiable FK" .-> FK["fingertip displacement alignment<br/>(retargeting loss)"]

2) The role of the three losses

For the shared space to be a space where meaning carries over, and not "just compression", three constraints have to hold at the same time.

  • Reconstruction loss $L_1$: makes each $E_h$–$D_h$ pair a proper autoencoder for its hand, so the latent preserves the hand's kinematics.
  • Retargeting loss $L_2$ (the key piece): differentiable forward kinematics maps joints to fingertip positions, and the displacements between fingertips $\delta_{ij}$ are aligned across hands. Strong pinches receive a larger weight. As a result, the same latent code yields a geometrically consistent pinch even when the hand changes, and this is what a space "where meaning carries over" amounts to.
  • KL loss $L_3$: regularizes the latent toward $\mathcal N(0,I)$, smoothing the space and making interpolation and sampling easy.

The total objective is $L_{latent}=L_1+L_2+\beta L_3$ with $\beta=10^{-5}$. In the ablation, removing either $L_1$ or $L_2$ collapses cross-embodiment performance, so both are required.
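The retargeting idea can be sketched for a single source/target hand pair. The sketch assumes fingertip positions (in metres) have already been produced by forward kinematics, and it groups the terms as a pinch-weighted sum of a distance-mismatch term and a direction term. That grouping is one plausible reading of the loss, not a verified transcription of the paper's formula; the fingertip names and coordinates are made up.

```python
import math

# Weights quoted in the review; everything else here is illustrative.
LAMBDA_DIS, LAMBDA_DIR, LAMBDA_EXP = 2000.0, 5.0, 12.0
EPS = 1e-12  # guards the cosine against division by zero


def displacement(tips, i, j):
    """delta_ij = p_i - p_j for two named fingertips."""
    return [a - b for a, b in zip(tips[i], tips[j])]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def pinch_weight(d_src):
    """w = exp(-lambda_exp * ||delta||^2): closer fingertips weigh more."""
    return math.exp(-LAMBDA_EXP * norm(d_src) ** 2)


def pair_retarget_loss(src_tips, tgt_tips, pairs):
    """Assumed form: mean over pairs of w * (distance term + direction term)."""
    total = 0.0
    for i, j in pairs:
        d_src = displacement(src_tips, i, j)
        d_tgt = displacement(tgt_tips, i, j)
        n_src, n_tgt = norm(d_src), norm(d_tgt)
        cos = sum(a * b for a, b in zip(d_src, d_tgt)) / (n_src * n_tgt + EPS)
        dist_term = LAMBDA_DIS * (n_src - n_tgt) ** 2
        dir_term = LAMBDA_DIR * (1.0 - cos)
        total += pinch_weight(d_src) * (dist_term + dir_term)
    return total / len(pairs)


src = {"thumb": [0.0, 0.0, 0.0], "index": [0.02, 0.0, 0.0]}
tgt_wide = {"thumb": [0.0, 0.0, 0.0], "index": [0.05, 0.0, 0.0]}
print(pair_retarget_loss(src, src, [("thumb", "index")]))
print(pair_retarget_loss(src, tgt_wide, [("thumb", "index")]))
```

The first call scores a target hand that reproduces the source pinch exactly and the second a target whose fingertips end up 3 cm further apart, so the second loss is the larger one. In actual training the target fingertips would come from decoding the shared latent with $D_t$ and running differentiable FK, so gradients flow back into the encoders and decoders.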

3) How the latent is plugged into the VLA

The backbone follows the $\pi_0$ architecture. The difference is that where $\pi_0$ feeds proprioceptive history as state tokens, XL-VLA feeds latent action tokens. The previous joint chunk $q^{(h)}_t$ is compressed by $E_h$ into a latent $z_t$ and fed in together with the vision and language tokens; the model predicts the next latent chunk $\hat z_{t+1}$, which $D_h$ restores into joint commands $\hat q^{(h)}_{t+1}$. An action chunk is 64 steps sampled at 20 Hz (about 3.2 seconds), and all encoders and decoders are frozen during VLA fine-tuning. In other words, a well-aligned common language is fixed once, and only the policy is learned on top of it.
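The control flow of one prediction step can be sketched as follows. Everything here is a toy stand-in: the codecs are hand-written functions rather than trained MLPs, `toy_policy` replaces the VLA backbone, the hand names and joint counts are hypothetical, and a single joint vector stands in for a full 64-step chunk. The point is only that the hand ID picks the codec while the policy sees the same latent shape for every hand.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Codec = Tuple[Callable[[Vector], Vector], Callable[[Vector], Vector]]


def make_toy_codec(num_joints: int) -> Codec:
    """Toy stand-in for a frozen (E_h, D_h) pair with a 2-D shared latent."""

    def encode(q: Vector) -> Vector:
        assert len(q) == num_joints
        return [sum(q) / num_joints, max(q) - min(q)]  # (mean flexion, spread)

    def decode(z: Vector) -> Vector:
        mean, spread = z
        if num_joints == 1:
            return [mean]
        return [mean + spread * (k / (num_joints - 1) - 0.5) for k in range(num_joints)]

    return encode, decode


# Hypothetical hands with different joint counts sharing one latent space.
CODECS: Dict[str, Codec] = {"hand_a": make_toy_codec(6), "hand_b": make_toy_codec(12)}


def toy_policy(z_prev: Vector, image, instruction: str) -> Vector:
    """Placeholder for the hand-agnostic model: it never sees the hand ID."""
    return [0.5 * x for x in z_prev]


def predict_next_joints(hand_id: str, q_prev: Vector, image, instruction: str) -> Vector:
    encode, decode = CODECS[hand_id]                   # hand ID only selects the codec
    z_prev = encode(q_prev)                            # z_t = E_h(q_t)
    z_next = toy_policy(z_prev, image, instruction)    # predict next latent
    return decode(z_next)                              # q_hat_{t+1} = D_h(z_hat_{t+1})


print(predict_next_joints("hand_a", [0.2] * 6, None, "pick up the apple"))
print(len(predict_next_joints("hand_b", [0.2] * 12, None, "pick up the apple")))
```

Adding a new hand in this layout means registering one more entry in `CODECS`; `toy_policy` is untouched, which mirrors the frozen-codec design described above.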

What the experiments say

Reading the numbers for what they mean:

  • Cross-hand scaling: training on the pooled data of all four hands takes the average success rate from 0.32 for $\pi_0$ to 0.72 for XL-VLA. This is direct evidence that data once discarded because "the hand is different" now helps across hands.
  • Cross-robot scaling: training the tabletop xArm and the G1 humanoid together takes G1 from 0.525 with $\pi_0$ to 0.825. The common latent pays off even when the arm platform differs.
  • Zero-shot unseen tasks: on held-out (hand × task) combinations, XL-VLA is consistently better than the kinematic-retargeting baseline $\pi_0$+RT, and the gap is larger on fine manipulation.
  • Latent quality: the supervised latent retargeting method (LAD) reaches only 0.60/0.61 on replay, while XL-VLA's self-supervised latent reaches 0.82/0.81. A better space is obtained without paired data.
  • Design choices: a latent dimension of 32 with hidden sizes 128→64 is the balance point across reconstruction, retargeting, continuity, and interpolation. Making the latent too large (e.g. 128) actually degrades the embodiment-invariant structure.
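The two scaling results above are easier to compare once both are expressed as absolute and relative gains. The short sketch below only redoes arithmetic on the success rates already quoted; the helper name `gains` is illustrative.

```python
def gains(baseline, ours):
    """Return (absolute gain, relative gain) between two success rates."""
    absolute = ours - baseline
    relative = ours / baseline - 1.0
    return absolute, relative


results = {
    "cross-hand (avg)": (0.32, 0.72),
    "cross-robot (G1)": (0.525, 0.825),
}
for name, (baseline, ours) in results.items():
    absolute, relative = gains(baseline, ours)
    print(f"{name}: +{absolute:.3f} absolute, +{relative:.1%} relative")
# cross-hand (avg): +0.400 absolute, +125.0% relative
# cross-robot (G1): +0.300 absolute, +57.1% relative
```

On this footing the cross-hand gain is larger than the cross-robot gain in both absolute and relative terms.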

Critical discussion

Strengths. The abstraction of "writing actions as intent" is clean, and aligning it in a self-supervised way from nothing but random joint sampling, with no demonstrations or IK, is practical. When a hand is added, only one new encoder/decoder pair has to be attached and aligned, and the existing VLA and data are reused as they are. Hooking the retargeting loss to differentiable FK is also a smart choice, because it enforces geometric consistency directly.

Limitations and open questions.

  • Distribution gap of random joint sampling: configurations sampled uniformly within the hardware limits may differ from the distribution of poses that actually occur during manipulation. Whether the latent is dense enough in the "regions actually used" deserves a closer look (a conjecture).
  • Ceiling set by the frozen decoder: fixing the alignment is a stable design, but any fine motion the decoder cannot express gets clipped at the decoding stage no matter how good the VLA is.
  • Dependence on four hands and teleop data: the 2M state-action pairs were collected by teleoperation, and there are only four hands. Scaling to more dissimilar hands (e.g. a 3-finger gripper versus a 5-finger humanoid hand) or to many more hand types needs further validation.
  • Real-world evaluation focus: the empirical evidence is strong, but an analysis that breaks failure cases down by cause (alignment error vs. policy error vs. decoder limits) would be welcome.

The core, once more in one line

XL-VLA's contribution is "reducing the cross-embodiment problem of dexterous manipulation to a shared latent action space". By absorbing hand diversity into encoder/decoder adapters and showing the VLA only hand-independent intent, data gets reused and new combinations are solved zero-shot. Self-supervised alignment from random sampling and differentiable-FK retargeting are the key devices that make this picture cheap to realize. The remaining challenges are extension to more dissimilar hands and the fidelity of the latent distribution, but the direction of "writing actions as intent, not as hand" is a convincing step toward a general-purpose dexterous VLA.

Copyright 2026, JungYeon Lee