📃 ZeroDex Review

dexterity
manipulation
vlm
zero-shot
multi-view
grasp
long-horizon
ZeroDex: Zero-Shot Long-Horizon Dexterous Manipulation via Multi-View 3D-Grounded VLM Reasoning
Published

June 29, 2026

  • Paper Link

  • Project

  • Code: Coming Soon (not yet released)

  • Jisoo Kim, Sangwon Baik, Taeksoo Kim, SungJoo Kim, Junyoung Lee, Mingi Choi, Hanbyul Joo

  • Seoul National University · RLWRLD

  • Preprint, 2026

  1. 💡 A modular dexterous manipulation framework that trains no new policy on robot data. Instead it connects a VLM's zero-shot reasoning (what, where, how) to physical execution through multi-view 3D grounding.
  2. ⚙️ The VLM decomposes a language instruction into a sequence of atomic primitives (grasp, apply_action, waypoint, release, hold). Each 2D keypoint is lifted to 3D with RANSAC triangulation plus reference-view ray voting, after which an affordance-guided hand grasp and a tool trajectory (Bag of Atomic Actions) are aligned to the scene and executed.
  3. 🎯 On a real-robot tabletop it outperforms single-view RGB-D grounding (grasp position error 16.43 → 4.58 cm), succeeds zero-shot on tasks where VLA baselines fine-tuned on 30 demonstrations per task (GR00T, Being-H0) fail entirely at 0/5, and reaches long-horizon tasks through failure detection and replanning.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

ZeroDex answers "no" to the question "does dexterous manipulation require training a policy on robot data every time?" The key observation is that modern VLMs already answer most of the sub-questions of manipulation zero-shot: what to grasp, which functional part of a tool to hold and how, and in what order to move. So instead of training a new policy, the method separates semantic reasoning (the VLM) from physical execution (primitive controllers) in a modular design. That design, however, carries geometric requirements that 2D alone cannot meet: where to grasp, where to move the endpoint, and how to swing the tool are inherently 3D trajectory quantities. The central idea of ZeroDex is to fuse VLM grounding across multiple viewpoints, lifting view-dependent 2D predictions into a consistent 3D estimate.


ZeroDex overview (Fig. 1): given a language instruction and calibrated multi-view observations, the system infers task-relevant 3D grounding with robust triangulation plus reference-view ray voting, generates an affordance-guided hand grasp, and executes pick-and-place and tool-use plans as reusable action primitives.

Core method:

The VLM $\Phi$ takes the multi-view images $\mathcal{I}=\{I_v\}_{v=1}^{M}$ and the instruction $l$, and selects a reference view $r$, a mode $z\in\{\mathrm{pick},\mathrm{tool}\}$, and a mode-specific grounding $g_z$:

$$(r,z,g_z)=\Phi(\mathcal{I},l).$$

It then generates a primitive sequence $\mathcal{Q}_r=\{(m_t,\mathcal{P}_r^t)\}_{t=1}^T$ on $I_r$, with $m_t\in\{\mathrm{grasp},\mathrm{apply\_action},\mathrm{waypoint},\mathrm{release},\mathrm{hold}\}$. Each 2D keypoint is lifted to 3D along two branches. ① RANSAC triangulation: the candidate $X_{a,b}^{t,j}$ from each view pair $(a,b)$ is scored by reprojection consensus ($S_{\mathrm{tri}}$), and the candidate with the highest consensus is kept. ② Reference-view ray voting: $N_\delta$ depth candidates are sampled along the reference camera ray and projected into each view as numbered markers, and the VLM votes for the index that best matches the description $d_{t,j}$. The final 3D keypoint is selected dynamically: the triangulated value when the triangulation consensus reaches the threshold $\tau_{\mathrm{tri}}$, and the voted value otherwise:

$$X_\star^{t,j}=\begin{cases}X_{\mathrm{tri}}^{t,j} & \text{if } \max_{a,b}S_{\mathrm{tri}}(a,b)\geq\tau_{\mathrm{tri}},\\ X_{\mathrm{vote}}^{t,j} & \text{otherwise}.\end{cases}$$

Tool use draws on the Bag of Atomic Actions $\mathcal{A}=(c,\mathcal{T},X_s,X_e)$, a reusable library whose entries hold a skill category $c$, a 6D tool trajectory $\mathcal{T}$, and start and end anchors. A trajectory with the same $c$ is retrieved and aligned by the rigid transform $T_{\mathrm{align}}$ that sends the stored anchors $(X_s,X_e)$ to the lifted keypoints $(X_{\mathrm{app}},X_{\mathrm{term}})$ of the current scene ($\hat{T}_i=T_{\mathrm{align}}\cdot T_i$).

Key results (verified numbers only):

  • Matches or outperforms the single-view RGB-D baseline on a real-robot tabletop: 2/5 → 4/5 on "Cluttered Precise Pick-and-Place".
  • Multi-view fusion sharply reduces grasp position error: Stereo (RGB-D) $L_{\mathrm{grasp}}$ 16.43 cm → Ours (2 views) 4.58 cm, and $L_{\mathrm{apply}}$ 2.72 → 1.35 cm (3 views).
  • Two VLAs fine-tuned on 30 demonstrations per task (GR00T, Being-H0) fail entirely at 0/5 on the evaluated tasks, whereas ZeroDex achieves 10/10 on "Throw Away Trash" and 8/10 on "Broom Clean" zero-shot.
  • On long-horizon tasks the VLM detects failure states and replans (closed-loop retry): end-to-end 4/6 on "Organize Objects" and 1/3 on "Cooking".

Conclusion: ZeroDex demonstrates that the modular combination "VLM reasoning + multi-view 3D grounding + reusable primitives" is enough to perform dexterous tool use and long-horizon manipulation zero-shot, with no per-task data collection or fine-tuning.

🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

In one line

VLMs already know how to answer the "what, where, how, and in what order" of manipulation zero-shot. If so, there is no need to train a new policy: bind the VLM's semantic reasoning to 3D through multiple viewpoints and execute it directly. The catch is that this binding (grounding) must be precise enough for dexterous manipulation, and that is the real problem ZeroDex takes on.

Why it is hard: the geometric requirements of a modular design

The dominant approach to dexterous manipulation is end-to-end. VLA models have made large strides by predicting actions directly from images and instructions using large-scale robot data, but staying reliable across diverse tasks requires extensive data collection, per-task adaptation, and per-environment fine-tuning, which makes it hard to scale to the variety of objects, tools, and spaces found in open-ended environments. Retargeting human demonstrations is another route, but the embodiment gap between human and robot hands produces physically infeasible contacts or unstable grasps, which in turn demand additional refinement or RL.

The authors argue that a modular design is more efficient. Instead of a single end-to-end policy, they separate semantic reasoning (the VLM) from physical execution (motion primitives and controllers). The key observation is that modern VLMs answer most of the sub-questions of manipulation zero-shot: what to grasp, where to grasp it as a functional affordance, how to move, and in what order. There is then no need to train a new policy. Once the VLM produces a plan, a complex task is decomposed into a sequence of simple atomic operations such as pick and move, and executed by reliable primitive controllers.

This design, however, runs into geometric requirements that 2D reasoning alone cannot satisfy. The grasp location must be specified in 3D rather than 2D and, more importantly, how to move the end effector and where to move the object are 3D trajectory quantities. A single viewpoint carries almost none of the geometric information needed to infer such 3D trajectories reliably. Hence the central idea of ZeroDex: fuse VLM grounding across multiple viewpoints.

Method details

ZeroDex takes calibrated multi-view RGB images and a high-level language instruction, and outputs a physically executable arm-hand plan. The pipeline has four stages: (1) reference-view semantic grounding, (2) multi-view fusion-based 3D lifting, (3) object-centric atomic action alignment for tool use, and (4) affordance-guided hand grasp and motion generation.

3.1 Reference-Frame Grounding

Through $(r,z,g_z)=\Phi(\mathcal{I},l)$, the VLM $\Phi$ selects a reference view $r$, a mode $z\in\{\mathrm{pick},\mathrm{tool}\}$, and a mode-specific grounding $g_z$. For pick-and-place, $g_{\mathrm{pick}}=(O_{\mathrm{tar}},\mathbf{p}_{\mathrm{dst}})$ (the target object plus a 2D destination pixel on $I_r$). For tool use, $g_{\mathrm{tool}}=(O_{\mathrm{tool}},c,O_{\mathrm{tar}},\mathbf{p}_{\mathrm{dst}})$ additionally identifies the tool $O_{\mathrm{tool}}$ and the skill category $c$ (pouring, sweeping, and so on). A planning prompt $l'$ then generates the primitive sequence:

$$\mathcal{Q}_r=\Phi(I_r,l')=\{(m_t,\mathcal{P}_r^t)\}_{t=1}^{T},\quad \mathcal{P}_r^t=\{(\mathbf{p}_r^{t,j},d_{t,j})\}_{j=1}^{N_t},$$

where $m_t\in\{\mathrm{grasp},\mathrm{apply\_action},\mathrm{waypoint},\mathrm{release},\mathrm{hold}\}$ and each 2D keypoint $\mathbf{p}_r^{t,j}$ is paired with a semantic description $d_{t,j}$ used for 3D lifting. A pick plan has the structure (grasp, waypoint, release), and a tool-use plan has the structure (grasp, apply_action, release/hold). The keypoint count is $N_t=2$ only in the grasp step of tool use (the point to grip plus the functional tip, e.g. a broom head or a kettle spout); everywhere else $N_t=1$.
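To make the shape of $\mathcal{Q}_r$ concrete, here is a minimal Python sketch of how such a plan could be represented and sanity-checked. The class and function names (`Keypoint2D`, `PrimitiveStep`, `validate_plan`) are hypothetical and not from the paper, and the check assumes exactly three steps per plan, which reads the structure described above quite literally.

```python
from dataclasses import dataclass
from typing import List, Tuple

PRIMITIVES = {"grasp", "apply_action", "waypoint", "release", "hold"}


@dataclass
class Keypoint2D:
    pixel: Tuple[float, float]  # p_r^{t,j}: pixel on the reference image I_r
    description: str            # d_{t,j}: semantic description used for lifting


@dataclass
class PrimitiveStep:
    kind: str                   # m_t: one of PRIMITIVES
    keypoints: List[Keypoint2D]  # P_r^t


def validate_plan(mode: str, plan: List[PrimitiveStep]) -> bool:
    """Check primitive names, step order, and keypoint counts (assumed rules)."""
    kinds = [step.kind for step in plan]
    if any(kind not in PRIMITIVES for kind in kinds):
        return False
    if mode == "pick":
        structure_ok = kinds == ["grasp", "waypoint", "release"]
    elif mode == "tool":
        structure_ok = (
            len(kinds) == 3
            and kinds[:2] == ["grasp", "apply_action"]
            and kinds[2] in ("release", "hold")
        )
    else:
        return False
    for step in plan:
        # N_t = 2 only for the grasp step of tool use, otherwise N_t = 1.
        expected = 2 if (mode == "tool" and step.kind == "grasp") else 1
        if len(step.keypoints) != expected:
            return False
    return structure_ok
```

A validator of this kind would be a cheap guard between the VLM's free-form output and the lifting stage, since a malformed plan is easier to reject here than to debug after execution.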

3.2 Multi-View Fusion-Based 3D Lifting

To overcome both the depth ambiguity of a single view and occlusion across views, the method combines triangulation with reference-view ray voting. For each keypoint, a view-wise 2D grounding $\mathbf{p}_v^{t,j}=\Phi(I_v,l'')$ is obtained in every view.

First comes RANSAC-style triangulation: the candidate $X_{a,b}^{t,j}=\operatorname{Triangulate}(\mathbf{p}_a^{t,j},\mathbf{p}_b^{t,j})$ from each view pair $(a,b)$ is scored by the number of views whose reprojection error is at most the pixel threshold $\epsilon_{\mathrm{tri}}$:

$$S_{\mathrm{tri}}(a,b)=\sum_{v=1}^{M}\mathbf{1}\!\left[\left\|\pi_v(X_{a,b}^{t,j})-\mathbf{p}_v^{t,j}\right\|_2\leq\epsilon_{\mathrm{tri}}\right],$$

and the candidate with the highest consensus is taken as $X_{\mathrm{tri}}^{t,j}$. The complementary estimate is reference-view ray voting: $N_\delta$ depth candidates $X_n^{t,j}$ are sampled along the reference ray and projected into each non-reference view as numbered markers, producing $\tilde{I}_v^{t,j}$, and the VLM picks the indices $\mathcal{C}_v^{t,j}$ that best match $d_{t,j}$. Summing the votes gives $X_{\mathrm{vote}}^{t,j}$:

$$S_{\mathrm{vote}}^{t,j}(n)=\sum_{v\neq r}\mathbf{1}[n\in\mathcal{C}_v^{t,j}],\qquad X_{\mathrm{vote}}^{t,j}=X_{\arg\max_n S_{\mathrm{vote}}^{t,j}(n)}^{t,j}.$$
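The voting step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: a pinhole camera with intrinsics `K_r`, a camera-to-world pose `T_world_from_cam`, depth measured along the camera z axis, and the VLM's per-view choices supplied as plain index lists (`choices_per_view`) instead of an actual VLM call. All function names are hypothetical.

```python
import numpy as np


def sample_ray_candidates(K_r, T_world_from_cam, pixel, depths):
    """Back-project a reference pixel at each candidate depth into world points."""
    u, v = pixel
    ray_cam = np.linalg.inv(K_r) @ np.array([u, v, 1.0])  # ray with z = 1
    pts_cam = depths[:, None] * ray_cam[None, :]          # one point per depth
    R, t = T_world_from_cam[:3, :3], T_world_from_cam[:3, 3]
    return pts_cam @ R.T + t


def tally_votes(choices_per_view, n_candidates):
    """S_vote(n): number of non-reference views whose chosen set contains n."""
    votes = np.zeros(n_candidates, dtype=int)
    for chosen in choices_per_view:
        for n in set(chosen):
            votes[n] += 1
    return votes


def vote_keypoint(candidates, choices_per_view):
    """Return the candidate with the most votes, plus the full tally."""
    votes = tally_votes(choices_per_view, len(candidates))
    return candidates[int(np.argmax(votes))], votes
```

Because every candidate lies on the reference ray by construction, the voted point always reprojects exactly onto the reference keypoint; only its depth is decided by the other views.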

The final keypoint $X_\star^{t,j}$ is selected dynamically: $X_{\mathrm{tri}}$ when the triangulation consensus is at least $\tau_{\mathrm{tri}}$, and the more robust $X_{\mathrm{vote}}$ otherwise (the selection rule given in the Ping Review). The two branches are complementary parts of the same multi-view fusion and together yield dependable 3D grounding under occlusion and viewpoint ambiguity. All keypoints are transformed into the world frame using the calibrated extrinsics.
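The consensus scoring and the dynamic switch can be sketched as follows. The paper does not say which triangulation routine it uses, so a standard linear (DLT) two-view triangulation stands in here as an assumption; `Ps` is a list of 3x4 projection matrices and every function name is illustrative.

```python
from itertools import combinations

import numpy as np


def project(P, X):
    """Project a 3D point with a 3x4 projection matrix to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def triangulate_pair(P_a, P_b, p_a, p_b):
    """Linear (DLT) two-view triangulation; a stand-in for Triangulate(.)."""
    A = np.stack([
        p_a[0] * P_a[2] - P_a[0],
        p_a[1] * P_a[2] - P_a[1],
        p_b[0] * P_b[2] - P_b[0],
        p_b[1] * P_b[2] - P_b[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]
    return X_h[:3] / X_h[3]


def best_triangulation(Ps, pixels, eps_tri):
    """Score each pair's candidate by S_tri (inlier view count); keep the best."""
    best_X, best_score = None, -1
    for a, b in combinations(range(len(Ps)), 2):
        X = triangulate_pair(Ps[a], Ps[b], pixels[a], pixels[b])
        score = sum(
            np.linalg.norm(project(P, X) - p) <= eps_tri
            for P, p in zip(Ps, pixels)
        )
        if score > best_score:
            best_X, best_score = X, int(score)
    return best_X, best_score


def select_keypoint(X_tri, score_tri, X_vote, tau_tri):
    """Use triangulation when consensus reaches tau_tri, else the voted point."""
    return X_tri if score_tri >= tau_tri else X_vote
```

With three views and one bad 2D prediction, the best pair still reaches a consensus of two, so whether triangulation or voting wins depends entirely on where $\tau_{\mathrm{tri}}$ is set.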


Grounding comparison (Fig. S1): the single-view RGB-D baseline versus multi-view grounding in a cluttered scene. The red, blue, and green spheres mark the predicted grasp, waypoint, and destination, respectively.

3.3 Object-Centric Atomic Action Alignment

For pick-and-place, it suffices to build a transport trajectory from the target object's current pose to the lifted release keypoint with an off-the-shelf motion generator. Tool use, by contrast, needs an additional motion prior: how should the tool move relative to the target? To supply it, the authors introduce the Bag of Atomic Actions, a library of reusable object-centric primitives that encode how a tool moves relative to its target:

$$\mathcal{A}=(c,\mathcal{T},X_s,X_e),\qquad \mathcal{T}=\{T_i\}_{i=0}^{N_a},\ T_i\in SE(3),$$

where $c$ is a predefined skill category, $\mathcal{T}$ is the tool's 6D trajectory, and $X_s,X_e\in\mathbb{R}^3$ are the stored start and end anchors. The library is built offline from recorded demonstrations and generated trajectories; in the implementation, the generated trajectories come from VLMPose. At test time, an atomic action with the same $c$ is retrieved, and a rigid transform $T_{\mathrm{align}}\in SE(3)$ that sends the stored anchors $(X_s,X_e)$ to the lifted apply_action and terminal keypoints $(X_{\mathrm{app}},X_{\mathrm{term}})$ of the current scene is computed and applied to the stored trajectory:

$$\hat{\mathcal{T}}=\{\hat{T}_i\}_{i=0}^{N_a},\qquad \hat{T}_i=T_{\mathrm{align}}\cdot T_i.$$

The aligned $\hat{\mathcal{T}}$ is passed on to the next stage, tool grasp and motion generation.
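Two anchor points fix a rigid transform only up to a rotation about the line through them, and the review does not state how the paper resolves that freedom. The sketch below makes one possible choice, clearly an assumption: take the smallest rotation that turns the stored anchor direction into the scene direction, then translate so that $X_s$ lands on $X_{\mathrm{app}}$. If the two anchor distances differ, $X_e$ will not land exactly on $X_{\mathrm{term}}$.

```python
import numpy as np


def rotation_between(u, v):
    """Smallest rotation taking direction u to direction v (Rodrigues formula)."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    axis = np.cross(u, v)
    s, c = np.linalg.norm(axis), float(np.dot(u, v))
    if s < 1e-9:
        if c > 0:
            return np.eye(3)  # already aligned
        # Antiparallel: rotate by pi about any axis orthogonal to u.
        helper = np.array([1.0, 0.0, 0.0]) if abs(u[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
        k = np.cross(u, helper)
        k = k / np.linalg.norm(k)
        return 2.0 * np.outer(k, k) - np.eye(3)
    k = axis / s
    K = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    return np.eye(3) + s * K + (1.0 - c) * (K @ K)


def align_transform(X_s, X_e, X_app, X_term):
    """Build a 4x4 T_align sending X_s to X_app and the anchor direction to the scene's."""
    R = rotation_between(X_e - X_s, X_term - X_app)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = X_app - R @ X_s
    return T


def align_trajectory(T_align, trajectory):
    """Apply T_hat_i = T_align @ T_i to every stored 4x4 pose."""
    return [T_align @ T_i for T_i in trajectory]
```

A real system would likely pin down the remaining twist with extra information such as the gravity direction or the tool's current pose; that part is left out here.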


Bag of Atomic Actions (Fig. S2): (A) an object trajectory generated by VLMPose from the prompt "Pour water from the kettle", and (B) object-centric atomic actions executed on the real robot.

3.4 Dexterous Affordance-Guided Grasp and Motion Generation

The lifted grasp keypoint $X_{\mathrm{grasp}}$ is only a semantic anchor; a dexterous grasp additionally needs a task-conditioned contact region. For the manipulated object $O_m$ ($O_{\mathrm{tar}}$ for pick, $O_{\mathrm{tool}}$ for tool use), the grasp keypoint is projected into each view and an affordance prompt predicts a 2D graspable bounding box $B_v=\Phi(I_v,l''')$. To lift this to 3D, every vertex $q_i$ of the $O_m$ mesh is projected into all views, and the affordance region is defined by a multi-view inclusion score:

$$s(q_i)=\frac{1}{M}\sum_{v=1}^{M}\mathbf{1}\!\left[\pi_v(q_i)\in B_v\right],\qquad \mathcal{R}_{\mathrm{aff}}=\{q_i\mid s(q_i)\geq\tau\}.$$
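The inclusion score is simple to vectorize. The sketch below assumes 3x4 projection matrices `Ps`, axis-aligned boxes given as `(u_min, v_min, u_max, v_max)` in pixels, and mesh vertices already expressed in the world frame; it ignores visibility and self-occlusion, which the review does not discuss. The function names are illustrative.

```python
import numpy as np


def project_points(P, pts):
    """Project N world points (N, 3) with a 3x4 matrix to pixels (N, 2)."""
    hom = np.hstack([pts, np.ones((len(pts), 1))])
    x = hom @ P.T
    return x[:, :2] / x[:, 2:3]


def affordance_region(vertices, Ps, boxes, tau):
    """Return s(q_i) for every vertex and the indices with s(q_i) >= tau."""
    score = np.zeros(len(vertices))
    for P, (u0, v0, u1, v1) in zip(Ps, boxes):
        uv = project_points(P, vertices)
        inside = (
            (uv[:, 0] >= u0) & (uv[:, 0] <= u1)
            & (uv[:, 1] >= v0) & (uv[:, 1] <= v1)
        )
        score += inside  # adds 1 for each view whose box contains the vertex
    score /= len(Ps)
    return score, np.flatnonzero(score >= tau)
```

Raising $\tau$ toward 1 keeps only vertices that every view agrees on, which trades region size for robustness to a single bad box.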

From $\mathcal{R}_{\mathrm{aff}}$, hand grasp candidates $G$ are produced by a cylindrical template sampler for handle-like shapes and by an optimization-based generator for general shapes. For physical plausibility, collision-aware position refinement is applied to the grounded points that determine object placement and the tool's terminal pose: the nearest collision-free position, where the environment penetration depth $\phi_m(\cdot)$ becomes zero, is searched on a local vertical grid:

$$X_{\mathrm{loc}}^*=\arg\min_{X'\in\mathcal{G}(X_{\mathrm{loc}})}\|X'-X_{\mathrm{loc}}\|_2\quad\text{s.t.}\quad\phi_m(X')=0.$$
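A minimal sketch of this refinement, assuming the "local vertical grid" is a 1D set of offsets along the world z axis and that `penetration` is a user-supplied callable returning a non-negative depth. The grid extent, the step size, and the small numerical tolerance are illustrative values, not the paper's.

```python
import numpy as np


def refine_location(X_loc, penetration, half_range=0.10, step=0.005, tol=1e-9):
    """Return the grid point nearest to X_loc with zero penetration, or None."""
    offsets = np.arange(-half_range, half_range + step / 2, step)
    candidates = X_loc + np.outer(offsets, np.array([0.0, 0.0, 1.0]))
    # Keep only candidates that are collision-free up to a tiny tolerance.
    free = [X for X in candidates if penetration(X) <= tol]
    if not free:
        return None
    return min(free, key=lambda X: np.linalg.norm(X - X_loc))
```

For example, with a sphere of radius 3 cm resting above a table plane at z = 0, `penetration = lambda X: max(0.0, 0.03 - X[2])` pushes a point grounded at z = 1 cm up to the first grid level where the sphere clears the table.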

The refined keypoints define the 6D trajectory $\mathcal{T}_{\mathrm{obj}}$ of $O_m$. For each grasp candidate $g$, an off-the-shelf arm-hand motion generator tracks $\mathcal{T}_{\mathrm{obj}}$ while solving the kinematic and collision constraints: $(\mathcal{T}_{\mathrm{robot}},\eta)=f_{\mathrm{motion}}(\mathcal{T}_{\mathrm{obj}},g)$. A pair with feasibility $\eta=1$ is selected for execution on the real robot.


Affordance grounding and grasp generation (Fig. S3): affordance bounding boxes from multiple views are combined into a 3D affordance region, from which hand grasps are generated.

Intuition: why multi-view plus voting

A VLM's 2D grounding is semantic but also viewpoint-dependent. Each view shows a different part of the scene, occlusion shifts the predicted location, and an ambiguous task context produces different predictions per camera. Triangulation is accurate because it exploits the strong geometric constraints between views (especially wide-baseline pairs), but consensus breaks down when the 2D predictions in some views drift. In that case, reference-view voting picks, along the reference ray, the depth candidate most consistent with the other views, which keeps the estimate firmly anchored to the reference. Switching dynamically between the two based on the consensus score is the core of the method's robustness: triangulation when the geometry is sufficient, robust voting when it is not.

Experiments

The evaluation covers zero-shot manipulation on a real-world tabletop and probes four capabilities: (1) target grounding among distractors plus collision robustness (putting the inferred piece of trash into a basket), (2) spatial-relation reasoning (placing a utensil on the stove), (3) affordance-guided tool use (sweeping with a broom), and (4) long-horizon sequencing (cooking and organizing with 3-4 objects).

Hardware. An xArm with an Inspire dexterous hand, and multiple calibrated RGB cameras (including a stereo pair). Depth comes from FoundationStereo, and multi-object 6D pose from FoundationPose.

Baselines. ① RGB-D grounding, which predicts 2D keypoints from a single view and lifts them to 3D with an aligned depth map; ② two VLAs (GR00T, Being-H0) fine-tuned on 30 teleoperation demonstrations per task. ZeroDex is entirely zero-shot, with no weight updates and no task demonstrations.


Qualitative results (Fig. 2): for each high-level instruction $l$, the system infers 3D grounding, and for tool use it aligns an object-centric atomic action to the current scene. Grounding succeeds across varied environments for both direct and indirect styles of instruction.

Real-robot success rate. Compared with single-view RGB-D, ZeroDex matches or improves: "Throw Away Trash" 4/5 → 5/5 and "Place Pot on Stove" 4/5 → 4/5. The advantage is most visible in cluttered, precise placement, where "Cluttered Precise Pick-and-Place" goes from 2/5 to 4/5 (Table 1). The two VLAs fine-tuned on 30 demonstrations fail on every evaluated task (0/5), whereas ZeroDex reaches 10/10 on "Throw Away Trash" and 8/10 on "Broom Clean" zero-shot (Table 2).

3D grounding quality (Table 3). Multi-view fusion sharply reduces the grasp position error.

| Method | $L_{\mathrm{grasp}}$ (cm) ↓ | $L_{\mathrm{apply}}$ (cm) ↓ | $\phi_m(X_{\mathrm{wp}})$ ↓ |
|---|---|---|---|
| Stereo (RGB-D) | 16.43 | 2.72 | 9.91 |
| Ours (2 views) | 4.58 | 1.70 | 9.81 |
| Ours (3 views) | 4.60 | 1.35 | 10.95 |
| Ours (5 views) | 4.77 | 1.94 | 9.78 |
| Ours (w/ refinement) | 4.77 | 1.63 | 9.60 |

In these scenes, adding more views gives diminishing returns, because a wide-baseline pair already provides strong geometric constraints. Collision-aware refinement brings the penetration error down to its lowest value in the table (9.60).
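As a quick arithmetic check on the table, the relative improvements and the best row per column can be computed directly. The numbers below are copied from Table 3 as reported in this review; the dictionary keys and helper name are just labels chosen for this snippet.

```python
table3 = {
    "Stereo (RGB-D)":       {"L_grasp": 16.43, "L_apply": 2.72, "phi_wp": 9.91},
    "Ours (2 views)":       {"L_grasp": 4.58,  "L_apply": 1.70, "phi_wp": 9.81},
    "Ours (3 views)":       {"L_grasp": 4.60,  "L_apply": 1.35, "phi_wp": 10.95},
    "Ours (5 views)":       {"L_grasp": 4.77,  "L_apply": 1.94, "phi_wp": 9.78},
    "Ours (w/ refinement)": {"L_grasp": 4.77,  "L_apply": 1.63, "phi_wp": 9.60},
}


def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


base = table3["Stereo (RGB-D)"]
grasp_gain = relative_reduction(base["L_grasp"], table3["Ours (2 views)"]["L_grasp"])
apply_gain = relative_reduction(base["L_apply"], table3["Ours (3 views)"]["L_apply"])

# Row with the lowest value in each column.
best_rows = {
    metric: min(table3, key=lambda name: table3[name][metric])
    for metric in ("L_grasp", "L_apply", "phi_wp")
}
```

The grasp error drops by roughly 72% and the apply error by roughly 50% relative to the stereo baseline, while no single configuration wins every column.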

Long-horizon (Table 4). The primitive-level formulation naturally enables closed-loop operation: when a failure occurs, the affected subtask is recovered by re-grounding and replanning within a retry budget. "Organize Objects" scores 6/6, 5/6, 3/5, and 3/3 per stage and 4/6 end-to-end; "Cooking" scores 3/3, 3/3, and 1/3 per stage and 1/3 end-to-end. Failures come mainly from arm joint limits, collisions with the environment, and unstable grasps.
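The closed-loop behavior described here amounts to a retry loop around each subtask. The sketch below is a schematic under assumptions: `attempt` is a hypothetical callable that would re-ground, replan, execute, and ask the VLM to verify the outcome, and the size of the retry budget is illustrative since the review gives no value.

```python
from typing import Callable, List, Tuple


def run_long_horizon(
    subtasks: List[str],
    attempt: Callable[[str, int], bool],
    retry_budget: int = 2,
) -> Tuple[bool, List[Tuple[str, int, bool]]]:
    """Run subtasks in order; retry a failed subtask up to retry_budget times."""
    log = []
    for name in subtasks:
        for k in range(retry_budget + 1):
            ok = attempt(name, k)  # hypothetical: re-ground, replan, execute, verify
            log.append((name, k, ok))
            if ok:
                break
        else:
            # Budget exhausted for this subtask: the whole task fails.
            return False, log
    return True, log
```

Only the failed subtask is retried, so earlier successful steps are never repeated; that is the practical benefit of planning at the primitive level.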


Long-horizon qualitative results (Fig. 3): a scenario composed of several subtasks. In the example shown, when the grasp fails, the VLM detects the failure state and replans the next action.

A critical look

Strengths

  • Dexterous tool use without data. Functional tool use such as pouring and sweeping is performed zero-shot, with no per-task demonstrations or fine-tuning. Solving zero-shot the very tasks on which 30-demonstration VLAs score 0/5 is strong evidence for the modular-design argument.
  • Multi-view fusion really does raise grounding precision. The $L_{\mathrm{grasp}}$ drop from 16.43 to 4.58 cm bears directly on the stability of dexterous grasps and is a gain isolated by measurement. The dynamic switch between triangulation and voting is also a clean design.
  • Closed-loop operation falls out of the formulation. Because the plan is a primitive sequence, the VLM can verify progress step by step and replan only the failed subtask, which gives practical robustness on long-horizon tasks.
  • A reusable abstraction (BoAA). Packaging tool motion as an object-centric 6D trajectory plus anchors, and rigidly aligning it to the scene, makes generalization to new scenes and objects cheap.

Weaknesses and limitations

  • The ceiling is tied to the reliability of the 2D VLM (acknowledged by the authors). Even though multi-view lifting and local refinement improve geometric consistency, errors in task decomposition, affordance selection, and 2D semantic grounding propagate into downstream execution failures. The system's ceiling is the VLM's ceiling.
  • Dependence on an off-the-shelf motion planner (acknowledged by the authors). Kinematic singularities, collision-check timeouts, and unstable grasps still cause failures and increase the latency between reasoning and execution.
  • No in-hand manipulation (acknowledged by the authors). Truly dexterous in-hand skills, such as rotating an object within the hand, operating scissors, or pressing a button on a held tool, are out of scope. The current formulation is limited to object-centric manipulation and tool use.
  • The evaluation is small. The real-robot tables use 5-10 trials per task and the long-horizon table 3-6 trials per task, so the samples are too small to say much about statistical confidence intervals. A figure such as the 1/3 end-to-end result on "Cooking" leaves a lot of room for sampling variation.
  • Heavy infrastructure assumptions. The method presupposes calibrated multi-view (plus stereo) cameras, FoundationStereo and FoundationPose, and object meshes (for the affordance vertex voting). "No data" is accurate, but the cost moves to scene setup, calibration, and object models.
  • The cost of many VLM queries. The VLM is called per view, per keypoint, and per voting candidate, so the number of calls, the latency, and the monetary cost exceed those of a single policy inference (the paper does not emphasize a quantitative cost in the main text).
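The point about small samples can be made concrete with a binomial confidence interval. The sketch below uses the Wilson score interval at a nominal 95% level (z = 1.96); choosing this particular interval is this review's illustration, not an analysis from the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial success rate."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Success counts as reported in this review.
reported = {"Cooking (end-to-end)": (1, 3), "Organize Objects (end-to-end)": (4, 6),
            "Broom Clean": (8, 10), "Throw Away Trash": (10, 10)}
intervals = {name: wilson_interval(s, n) for name, (s, n) in reported.items()}
```

With three trials the interval for "Cooking" covers most of the range between 0 and 1, so the 1/3 figure says little about the true success rate; even the 10/10 result is compatible with a noticeably lower underlying rate.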

Positioning relative to related work

ZeroDex sits at the intersection of three lines of work. First, VLAs for manipulation: methods that predict actions directly from large-scale robot data are strong but require additional data or adaptation for new objects, tools, and embodiments, whereas ZeroDex goes the opposite way and operates zero-shot with no weight updates. It forms an interesting contrast with the UniDex review, a data-driven dexterous foundation model (VLA pretraining on robot data derived from human video), at the opposite pole of "learn from data vs. reason with a VLM". It is also worth comparing with the MolmoAct2 review, a VLA that internalizes embodied reasoning, along the axis "train reasoning into the policy vs. use a pretrained VLM as is". Second, zero-shot manipulation with foundation models: ZeroDex shares its motivation with work that generates code and plans with an LLM (code-gen) or marks keypoints, visual marks, and affordances on images (visual prompting), but many existing methods stay in image-space or sparse intermediate representations that are fragile with respect to the task-relevant 3D geometry dexterous manipulation needs (contact points, placement targets, tool trajectories). ZeroDex grounds these directly in 3D. Third, 3D grounding and dexterous execution: it combines multi-view stereo lifting of 2D observations with VLM semantic grounding. It connects with the GenHand review, which covers dexterous grasp generation, on the question of how the input to grasp synthesis (the affordance region) is obtained.

Summary

The contribution of ZeroDex is to demonstrate the modular view that "dexterous manipulation does not require training a new policy; it is enough to bind a VLM's zero-shot reasoning precisely enough through multi-view 3D grounding". The VLM decomposes an instruction into atomic primitives, each 2D keypoint is lifted to 3D with triangulation plus ray voting, and an affordance-guided grasp and a Bag-of-Atomic-Actions tool trajectory are aligned to the scene and executed. As a result, it beats single-view RGB-D on grounding precision (16.43 → 4.58 cm), solves zero-shot the tasks on which 30-demonstration VLAs fail entirely, and reaches long-horizon tasks through failure detection and replanning. The limitations are equally clear: the ceiling is tied to the 2D VLM and the off-the-shelf planner, in-hand manipulation is not supported, the evaluation samples are small, and calibrated multi-view cameras and object meshes impose an infrastructure cost. Even so, the division of labor "leave the reasoning to a pretrained VLM and make up the precision with multi-view geometry" is a persuasive alternative blueprint for dexterous manipulation, where data is expensive. (The code is to be released later; a reproduction study will be possible after release.)

Copyright 2026, JungYeon Lee