📃 GRIT

dexterous-grasping
RL
taxonomy
manipulation
humanoid
Learning Dexterous Grasping from Sparse Taxonomy Guidance
Published

June 17, 2026

  • Paper Link (arXiv:2604.04138)
  • Project Page
  1. 🤖 GRIT is a two-stage dexterous manipulation framework that uses a predefined grasp taxonomy to connect high-level grasp intent with low-level, precise finger control.
  2. 💡 A Vision-Language Model selects the taxonomy best suited to the scene and task context, and a multiplicative reward structure keeps the policy faithful to the intended grasp shape while it manipulates the object stably.
  3. 🚀 In experiments, GRIT reaches an 87.9% success rate on novel objects, and real-world experiments demonstrate that it can adjust its grasp strategy to the task goal and the object shape.

🔍 Ping Review

🔍 Ping: a light tap on the surface. Get the gist in seconds.

This paper proposes GRIT (Grasp Reinforcement with Intended Taxonomies), a new framework for complex dexterous manipulation that combines a high-level grasp taxonomy with a low-level control policy.

1. Background and problem statement

Prior work on dexterous manipulation had to learn precise contact points or continuous motion trajectories directly. End-to-end reinforcement learning, in turn, is hard to control and leaves no room for the user to intervene. To address this, the paper builds on the human grasp taxonomy of Feix et al. [4] and designs a two-stage framework: the high level supplies a "grasp intent" and the low level generates the finger motion.

2. Core methodology

GRIT consists of two stages: grasp planning and taxonomy-conditioned control.

  • Grasp taxonomy and planning: 30 grasp templates (\tau) are built from Feix's taxonomy. Each template is defined by a reference joint configuration (\tilde{q}), an active-link mask (\tilde{b}), and contact positions and normals (\tilde{p}, \tilde{n}). A vision-language model (VLM) infers the best grasp specification (g = (\tau, \bar{w}_w)) from the scene image (I) and the task description (T).
  • Taxonomy-conditioned control: the learning objective is J(\pi) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^{T} \gamma^t r_t(s_t, g) \right]. The control policy takes the current state and the selected grasp specification (g) as input and outputs continuous finger control actions.
  • Multiplicative composite reward: to balance grasp adherence against task success, the reward is designed as r = \alpha_h \cdot r_h + \alpha_o \cdot r_o - r_{pen}. Here \alpha_h and \alpha_o are multiplicative constraint coefficients that enforce stable behavior in the approach phase and the grasp phase, respectively. In particular, the \alpha_{mimic} term, which measures grasp adherence, penalizes deviation from the reference grasp configuration so that the user's intent is followed strictly: L_{mimic} = \frac{1}{N_{act}} \sum_{i=1}^{L} (\max(|q_i - q_{ref,i}| - \tau_{act}, 0))^2 + \dots
  • Distillation: a teacher policy is trained in simulation with privileged information and then distilled into a student policy that uses only partial observations (a point cloud), so that it can be deployed on a real robot.
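As a rough illustration of the first term of L_{mimic} above, the sketch below computes a dead-zone squared joint error. It is a minimal reading of the formula, not the authors' implementation: the function name `mimic_loss`, the `active` mask used to count N_{act}, and the choice to skip inactive joints are all assumptions, and the terms elided by "\dots" in the formula are left out.

```python
from typing import Sequence


def mimic_loss(q: Sequence[float], q_ref: Sequence[float],
               active: Sequence[bool], tol: float) -> float:
    """First term of L_mimic: mean squared joint error beyond a dead zone.

    q, q_ref : current and reference joint angles (same length)
    active   : hypothetical mask of joints that count towards N_act
    tol      : dead-zone width (tau_act); errors below it cost nothing
    """
    if not (len(q) == len(q_ref) == len(active)):
        raise ValueError("q, q_ref and active must have the same length")
    n_act = sum(1 for a in active if a)
    if n_act == 0:
        return 0.0
    total = 0.0
    for qi, ri, is_active in zip(q, q_ref, active):
        if is_active:
            excess = max(abs(qi - ri) - tol, 0.0)  # hinge: only error above tol
            total += excess ** 2
    return total / n_act


# Joint 0 matches the reference, joint 1 is 0.3 rad off, joint 2 is ignored.
loss = mimic_loss([0.0, 0.5, 1.0], [0.0, 0.2, 0.0], [True, True, False], tol=0.1)
```

The dead zone means small deviations from the template are free, which leaves the policy room to adapt the fingers to the object while still being pulled back when it drifts far from the intended grasp type.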

3. Key results and experiments

  • Generalization: in experiments on the Objaverse dataset, GRIT records a higher success rate (87.9%) than the existing RDG and GraspXL baselines.
  • Controllability: the paper shows that, even for the same object, the policy can be steered to a different grasp type (precision grasp vs. power grasp) depending on the task intent ("hold" vs. "squeeze").
  • Efficiency: thanks to the multiplicative reward structure, the method keeps a stable balance between grasp accuracy and task success without hyperparameter tuning.
  • Real-robot experiments: by selectively applying grasp templates to objects of varied geometry, the authors confirm that dexterous manipulation remains feasible in complex environments.

In short, GRIT separates high-level "intent" from low-level "execution" and thereby obtains both controllability and generalization for dexterous manipulation tasks.


🔔 Ring Review

🔔 Ring: an idea that echoes. Grasp the core and its value.

Introduction

The oldest tension in dexterous multi-finger grasping is "what do we specify, and what do we leave to learning?"

  • Dense specification: if a person specifies finger joint trajectories, contact points, and contact forces for every object and task, control becomes precise, but every new object or task incurs an unrealistic annotation cost. Generalization is also effectively confined to the range of the human-made data.
  • Pure RL without specification: giving only a reward and letting the policy learn on its own yields high autonomy, but exploration is inefficient and the policy easily becomes biased toward a particular hand pose. It converges to a monotonous policy that clutches every object in a similar way, and the user has no channel for injecting an intent such as "pick this one up precisely".

The authors' question is clear: "What is the minimal (sparse) interface through which a person can inject an intent about how to grasp, without specifying each finger?"

The answer is the human grasp taxonomy. People have long categorized countless grasps into a small number of types (power/precision, wrapping/pinching, and so on). GRIT's insight is that these abstract types correlate strongly with object geometry: a precision pinch suits rod-like objects, while a wrap suits round ones. Choosing a single type is therefore enough guidance, and the policy fills in the concrete control to fit the object.

One-line summary of the paper: using the 30 grasp types of the Feix taxonomy as sparse guidance, stage 1 picks a type from the scene and task (VLM zero-shot), and stage 2 uses a taxonomy-conditioned RL policy that preserves that structure to generate continuous multi-finger motion adapted to the object geometry, achieving controllability and generalization together without dense annotation.

flowchart LR
    subgraph S1["Stage 1 · Taxonomy selection"]
        IMG["Scene image<br/>+ task context"]
        AXIS["3D coordinate-axis overlay<br/>(target direction)"]
        VLM["VLM (Gemini 3)<br/>zero-shot selection"]
        IMG --> VLM
        AXIS --> VLM
        VLM --> TAU["Grasp template τ<br/>{q̃, b̃, p̃, ñ}<br/>(1 of 30 types)"]
    end
    subgraph S2["Stage 2 · Taxonomy-conditioned RL"]
        OBS["Observation<br/>proprio + partial point cloud<br/>+ BPS geometry + wrist-object"]
        POL["RL policy<br/>(teacher→student distill)"]
        ACT["Action Δq, Δw<br/>continuous multi-finger motion"]
        OBS --> POL --> ACT
    end
    TAU --> POL
    ACT --> ROB["Humanoid hand Allex<br/>geometry- and task-specific grasps"]

Method

GRIT rests on two stages: sparse guidance, then continuous control. The core philosophy is "the user picks only an abstract grasp type, and the policy fills in the details for the object while preserving that structural intent."

Grasp taxonomy representation

From the human grasp taxonomy of Feix et al., GRIT drops 3 types that are overly object-specific and uses the remaining 30. Each type is represented as a template \tau_i = \{\tilde{q}, \tilde{b}, \tilde{p}, \tilde{n}\}.

  • \tilde{q} : reference joint configuration
  • \tilde{b} : binary mask indicating which hand/palm links take part in contact
  • \tilde{p} : reference contact positions in the local hand frame
  • \tilde{n} : surface normals at those contacts

This representation captures only the structure, that is, "which fingers and palm regions should touch, in what posture, and where." It does not fix absolute joint angles or exact contact points on the object. The stage-2 policy resolves those details by looking at the object geometry.
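To make the template \tau_i = \{\tilde{q}, \tilde{b}, \tilde{p}, \tilde{n}\} more tangible, here is one way it might be held in code. This is only a sketch: the class name `GraspTemplate`, its field names, and every numeric value in the example are hypothetical placeholders, not data from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class GraspTemplate:
    """Hypothetical container for one taxonomy template."""
    name: str
    q_ref: List[float]           # q~: reference joint configuration
    link_mask: List[bool]        # b~: which hand/palm links take part in contact
    contact_pos: List[Vec3]      # p~: reference contact positions, local hand frame
    contact_normal: List[Vec3]   # n~: surface normal at each reference contact

    def __post_init__(self) -> None:
        # One mask entry, one position and one normal per link.
        if not (len(self.link_mask) == len(self.contact_pos)
                == len(self.contact_normal)):
            raise ValueError("mask, positions and normals must align per link")

    def active_links(self) -> List[int]:
        """Indices of links that the template expects to be in contact."""
        return [i for i, on in enumerate(self.link_mask) if on]


# Placeholder numbers only: a three-link toy hand where links 0 and 2 touch.
pinch = GraspTemplate(
    name="toy_pinch",
    q_ref=[0.1, 0.4, 0.0, 0.3],
    link_mask=[True, False, True],
    contact_pos=[(0.02, 0.0, 0.05), (0.0, 0.0, 0.0), (-0.02, 0.0, 0.05)],
    contact_normal=[(-1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
)
```

Note that nothing in the structure refers to a specific object, which matches the point that the template encodes grasp structure rather than object-specific contact points.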

Stage 1: Taxonomy selection

The goal is to decide "which grasp type should be used for this scene and this task."

  • Training: the taxonomy and the wrist orientation are sampled uniformly so that the policy learns robustly across the full range of types (this prevents bias toward particular types).
  • Inference: a VLM (Gemini 3) picks the type zero-shot. The key trick is to overlay 3D coordinate axes directly on the scene image so that candidate target approach directions are expressed visually. The overlay compensates for the weak spatial reasoning a VLM shows when it sees only a 2D image.
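The training-time sampling described above could look like the following sketch. It assumes that "uniform" wrist orientation means uniform over all 3D rotations and uses Shoemake's standard method for random unit quaternions; the review does not say how the paper parameterizes or restricts the orientation, so treat `sample_training_goal` and `NUM_TAXONOMIES` as illustrative names only.

```python
import math
import random
from typing import Tuple

NUM_TAXONOMIES = 30  # number of grasp types used in the review

Quat = Tuple[float, float, float, float]


def sample_uniform_quaternion(rng: random.Random) -> Quat:
    """Uniform random rotation as a unit quaternion (Shoemake's method)."""
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    a, b = math.sqrt(1.0 - u1), math.sqrt(u1)
    return (
        a * math.sin(2.0 * math.pi * u2),
        a * math.cos(2.0 * math.pi * u2),
        b * math.sin(2.0 * math.pi * u3),
        b * math.cos(2.0 * math.pi * u3),
    )


def sample_training_goal(rng: random.Random) -> Tuple[int, Quat]:
    """Draw a taxonomy index and a wrist orientation, each uniformly."""
    taxonomy_index = rng.randrange(NUM_TAXONOMIES)
    return taxonomy_index, sample_uniform_quaternion(rng)


rng = random.Random(0)
goal_index, goal_wrist = sample_training_goal(rng)
```

Sampling every type with equal probability is what keeps rarely useful grasp types from being starved of training episodes.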

Stage 2: Taxonomy-conditioned RL policy

Conditioned on the selected template \tau, the policy generates continuous multi-finger motion adapted to the object geometry.

  • Observation: hand proprioceptive state (joint angles, contact indicators, contact forces) + object state (partial point cloud from a single RGB-D view) + wrist-object relative displacement + distance features + local geometry encoded with BPS (Basis Point Set).
  • Action: joint displacement \Delta q_t \in \mathbb{R}^D and delta wrist pose \Delta w_t \in \mathbb{R}^7 (Cartesian position + quaternion).
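For readers unfamiliar with BPS, the sketch below shows the basic idea of the encoding in plain Python: a fixed set of basis points is chosen once, and a point cloud is described by the distance from each basis point to its nearest cloud point. The basis size, the sampling radius, and the helper names are assumptions for illustration; the review does not state the configuration used in the paper.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def make_basis(n: int, radius: float, seed: int) -> List[Point]:
    """Sample n fixed basis points inside a ball (rejection sampling)."""
    rng = random.Random(seed)
    basis: List[Point] = []
    while len(basis) < n:
        p = (rng.uniform(-radius, radius),
             rng.uniform(-radius, radius),
             rng.uniform(-radius, radius))
        if math.dist(p, (0.0, 0.0, 0.0)) <= radius:
            basis.append(p)
    return basis


def bps_encode(cloud: Sequence[Point], basis: Sequence[Point]) -> List[float]:
    """Fixed-length feature: nearest-neighbour distance per basis point."""
    if not cloud:
        raise ValueError("point cloud must not be empty")
    return [min(math.dist(b, p) for p in cloud) for b in basis]


# Two hand-picked basis points and a two-point cloud, for a checkable result.
features = bps_encode(
    cloud=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
    basis=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
)
```

Because the basis is fixed, the feature vector has the same length for any object, which is what lets a policy network consume partial point clouds of varying size.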

Multiplicative composite reward (the key design choice). The reward gates its two components through multiplication.

r_t = \alpha_h \cdot r^{\text{hand}}_t + \alpha_o \cdot r^{\text{obj}}_t - r^{\text{penalty}}_t

  • r^{\text{hand}} : hand-centric reward for the approach phase (alignment with the template posture and contact structure).
  • r^{\text{obj}} : object-centric reward for the stable-grasp phase (lifting and stability).
  • \alpha_h, \alpha_o : multiplicative constraint coefficients that switch the rewards on and off according to stability and taxonomy adherence.
  • r^{\text{penalty}} : suppresses unintended contacts.

The multiplicative form matters because the object reward is activated only when the object is grasped with the correct structure. This blocks the shortcut of simply clutching the object and enforces adherence to the grasp type. In the ablation, contact precision improves by 28.57% over the additive form, and joint error also drops substantially.
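A toy comparison may help show why gating by multiplication closes the "just clutch it" shortcut. The sketch assumes the coefficients lie in [0, 1] and that the additive baseline simply adds an adherence score as a separate term; both are assumptions made for illustration, since the review does not give the exact form of either coefficient or of the additive variant.

```python
def multiplicative_reward(r_hand: float, r_obj: float,
                          alpha_h: float, alpha_o: float,
                          r_penalty: float) -> float:
    """r_t = alpha_h * r_hand + alpha_o * r_obj - r_penalty."""
    return alpha_h * r_hand + alpha_o * r_obj - r_penalty


def additive_reward(r_hand: float, r_obj: float,
                    adherence: float, r_penalty: float) -> float:
    """Assumed additive baseline: adherence is one more term in the sum."""
    return r_hand + r_obj + adherence - r_penalty


# Scenario: the object was lifted (r_obj = 1.0) but with the wrong grasp
# structure, so adherence is zero and the object gate alpha_o closes.
wrong_grasp_mult = multiplicative_reward(
    r_hand=0.2, r_obj=1.0, alpha_h=1.0, alpha_o=0.0, r_penalty=0.0)
wrong_grasp_add = additive_reward(
    r_hand=0.2, r_obj=1.0, adherence=0.0, r_penalty=0.0)
```

Under the additive form the off-taxonomy lift still collects the full object reward, whereas the multiplicative form withholds it until the grasp structure is right. This is consistent with the reported pattern of similar success rates but better contact precision and joint error.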

Teacher–Student Distillation

  • Teacher: trained with privileged information such as the full point cloud and ground-truth contacts.
  • Student: receives only single-view partial observations and uses an LSTM to recover contact signals, which makes it deployable in the real world.

The training environment is MuJoCo-Warp with 30 YCB objects.

Experiments

Generalization to novel objects

Success rate is measured on the Objaverse RoboCasa subset (373 objects), none of which were seen during training.

Method  | Success rate | Guidance
RDG     | 81.9%        | No explicit grasp conditioning (geometry/contact signals)
GraspXL | 85.9%        | Graspable/non-graspable surface annotations (indirect)
GRIT    | 87.9%        | Sparse taxonomy guidance

GRIT leads RDG by 6.0 percentage points and GraspXL by 2.0 percentage points. RDG, which has no explicit grasp conditioning, is vulnerable to hand-pose bias, and the comparison suggests that taxonomy conditioning is more effective than GraspXL's indirect surface annotations.
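The gaps can be checked directly from the three success rates listed above. The snippet below is just that arithmetic; the dictionary and the helper name `gap_pp` are illustrative.

```python
# Success rates (%) on the 373 novel objects, as listed in the table.
success = {"RDG": 81.9, "GraspXL": 85.9, "GRIT": 87.9}


def gap_pp(method: str, baseline: str) -> float:
    """Difference in success rate, in percentage points (1 decimal)."""
    return round(success[method] - success[baseline], 1)


gap_vs_rdg = gap_pp("GRIT", "RDG")          # 6.0
gap_vs_graspxl = gap_pp("GRIT", "GraspXL")  # 2.0
```

These are absolute differences in percentage points, not relative improvements.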

Object–taxonomy alignment (validating the key insight)

The hypothesis that "particular taxonomies are more effective for particular object geometries" is tested quantitatively.

  • Fruits/vegetables: performance spread across taxonomies of 25.07%. The choice of type largely decides success or failure.
  • Packed goods: spread of 14.85%. Comparatively less sensitive.

In other words, the effect of the grasp type depends strongly on object geometry, which confirms that "picking the right type" is the key lever for generalization.

Ablation

  • Removing BPS features: performance drops on every metric, so local geometry encoding is essential for adaptive grasping.
  • Reward form: the additive form reaches a similar success rate, but the multiplicative form improves contact precision by 28.57% and also lowers joint error considerably (57.83% lower than the naive additive form), so adherence to the grasp type is far better.

Real-world (Allex) deployment

The method is deployed on a bimanual humanoid robot to demonstrate two things.

  • Geometry-dependent adaptation: a precision four-finger grasp succeeds on rod-like objects, and a medium-diameter grasp succeeds on spool-like objects.
  • Task-specific selection: a power grasp is chosen for squeezing a sponge and a precision grasp for transporting it. Even for the same object, a different type is picked depending on the task intent.

Critical discussion

Strengths

  • A clean interface built on sparse guidance. The "pick just one type" abstraction avoids both the cost of dense annotation and the hand-pose bias of pure RL. The core contribution is a channel through which the user injects intent without specifying individual fingers.
  • A validated core hypothesis. The claim that "the effect of a taxonomy depends on object geometry" is backed by quantitative spreads such as 25.07% for fruits/vegetables vs. 14.85% for packed goods, so the design motivation is supported by data.
  • A clear effect of the multiplicative reward. The ablation isolates gains in contact precision and joint error over the additive form, showing that adherence to the grasp type is a quality that goes beyond raw success rate.
  • VLM + coordinate-axis overlay. Selecting the type zero-shot while reinforcing spatial reasoning with a 3D axis overlay is a practical design that reflects human intent and context at inference time without any training.


Weaknesses and limitations

  • Reliability of the VLM selection is unverified. Stage 1 depends on a zero-shot VLM, but quantitative analysis of how far overall performance degrades when the wrong type is chosen (robustness to mis-selection) appears limited (speculative).
  • Limited to single-hand, single-object grasping. Bimanual coordination, in-hand manipulation, and dynamic manipulation are not addressed; the focus is on static grasps.
  • Quantitative depth of the real-world evaluation. The Allex results center on qualitative demonstrations of varied grasp poses, and a systematic quantitative comparison of real-robot success rates and failure modes still seems necessary (speculative).
  • Expressiveness of 30 types. Whether the 30 types left after removing 3 from the Feix taxonomy are sufficient, and how the method would handle atypical grasps outside the taxonomy (for example, a special grip that wedges a tool between the fingers), remains an open question.

Summary and conclusion

GRIT tackles the long-standing dilemma of dexterous multi-finger grasping, the impracticality of dense specification versus the hand-pose bias of pure RL, with sparse grasp-taxonomy guidance. Using the 30 types of the Feix taxonomy as an abstract interface, stage 1 picks a type from the scene and task (VLM zero-shot + 3D axis overlay), and stage 2 uses a taxonomy-conditioned RL policy with a multiplicative reward to generate continuous multi-finger motion adapted to the object geometry.

In numbers: on 373 novel objects, GRIT reaches an 87.9% success rate and outperforms RDG (81.9%) and GraspXL (85.9%); the multiplicative reward raises contact precision by 28.57%; and the spreads of 25.07% for fruits/vegetables vs. 14.85% for packed goods support the hypothesis that "the effect of the grasp type depends on object geometry." On the humanoid Allex, the authors also demonstrate adaptive grasping that changes grasp type with geometry and task.

From a practical standpoint, the value of this work lies in presenting "a minimal interface that obtains both controllability and generalization to new objects by having the user pick a single abstract grasp type instead of specifying every finger." The limitations, namely the robustness of the VLM selection and the lack of quantitative real-world evaluation, are clear, but the combination of sparse taxonomy guidance and multiplicative conditioned RL is a promising starting point for future research on dexterous manipulation driven by human intent.

Copyright 2026, JungYeon Lee