Curieux.JY
  • JungYeon Lee

📃 DextER Review

llm
grasp
reasoning
Language-driven Dexterous Grasp Generation with Embodied Reasoning
Published

March 9, 2026

  • Paper Link
  • Project Link
  • Code Link

How to move fingertips with language

  1. 🤖 DextER proposes contact-based embodied reasoning for language-driven dexterous grasp generation: it predicts where each hand link touches the object.
  2. 📈 The model reaches a 67.14% grasp success rate on the DexGYS benchmark, surpassing the previous state of the art, and improves intention alignment by 96.4%.
  3. 🎯 DextER's autoregressive framework gives fine-grained control: users can steer grasp generation by specifying partial contact constraints.

🔍 Ping Review

🔍 Ping: a light tap on the surface. Get the gist in seconds.

This paper proposes DextER for language-driven dexterous grasp generation. Existing Vision-Language Models (VLMs) map observations directly to grasp parameters and therefore lack intermediate reasoning about physical interaction. DextER introduces contact-based embodied reasoning for multi-finger hand manipulation. Its key insight is to predict which hand link contacts the object, and where. This contact prediction provides an embodiment-aware intermediate representation that connects high-level task semantics to the physical constraints imposed by the robot's embodiment and the object's geometry.

Core Methodology

Given a 3D point cloud $P \in \mathbb{R}^{N \times 3}$ and a language instruction $T$, DextER aims to predict the grasp pose $\mathbf{a} \in \mathbb{R}^D$ of a dexterous hand, where $D$ is the number of degrees of freedom of the hand. The model factorizes this prediction through an intermediate stage, the contact pattern $C$:

$$p(\mathbf{a}, C \mid P, T) = p(C \mid P, T) \cdot p(\mathbf{a} \mid C, P, T)$$

Here the contact prediction $p(C \mid P, T)$ acts as the embodied reasoning step that links language and geometric understanding to grasp generation.

1. Model Architecture

DextER consists of a 3D vision encoder, a multimodal projector, and a Large Language Model (LLM) backbone.

  • Point Cloud Encoding: Geometric features $F \in \mathbb{R}^{M \times d}$ are extracted from the input point cloud $P$ with PartField [22]. PartField is pretrained for part segmentation through contrastive learning with 2D SAM masks, so it produces part geometry-aware features that help localize contact positions on the object surface. The extracted features are projected into the LLM embedding space by a lightweight MLP.
  • Action Tokenization: The continuous grasp parameters $\mathbf{a}$ (28 dimensions, covering palm pose and joint angles) are tokenized into a discrete token space. Each dimension is quantized into $N_a$ uniform bins, and each quantized value is represented by a unique token ⟨action_bin_i⟩. The full action sequence is wrapped in the special tokens ⟨|action_start|⟩ and ⟨|action_end|⟩.
  • LLM Backbone: Qwen2.5-0.5B [30, 42] serves as the LLM backbone. It fuses the point cloud embeddings with the text prompt and autoregressively generates discrete contact and action tokens.

2. Embodied Reasoning via Contact Prediction

  • Meta-prompts: To make the model reason about contacts before generating an action, the prompt includes an explicit instruction such as "Think step by step: first predict which links contact where on the object, then predict the grasp pose".
  • Contact Representation: Contacts are represented as $C = \{(l_i, p_i)\}$, a set of pairs of a hand link $l_i$ (e.g., the index finger middle link) and a 3D contact position $p_i \in \mathbb{R}^3$ on the object surface. The coordinates of $p_i$ are normalized into a fixed 3D bounding box computed from the dataset, and each spatial dimension is then uniformly discretized into $N_{pos}$ bins and mapped to a position token. Each contact is written as the sequence ⟨l_i⟩⟨p_ix⟩⟨p_iy⟩⟨p_iz⟩, and the full contact prediction is wrapped in ⟨|contact_start|⟩ and ⟨|contact_end|⟩. All required special tokens (action bin, position bin, link, and delimiter tokens) are registered in the pretrained tokenizer.

3. Training Strategy

  • End-to-end training: The model is trained end to end with standard next-token prediction over the full sequence of point cloud tokens, task description, contact tokens, and action tokens. It learns to predict the contact pattern first and then autoregressively generate the corresponding grasp pose.
  • Hybrid Attention Mechanism: Point cloud tokens use bidirectional attention to capture global geometric context, while language and action tokens use causal attention.
  • Contact Position Dropout: As a regularizer, position tokens are removed from the contact sequence with probability $p_{drop}$ during training, keeping only the link tokens. This helps the model handle contact information at varying levels of detail.

4. Dataset Curation

The DexGYS [36] and Dexonomy [5] datasets are used.

  • Physics-based Contact Annotation: Contact information for each grasp is extracted automatically with the MuJoCo physics engine. The hand and object models are loaded into MuJoCo, forward kinematics is run for each grasp pose, and the 3D surface positions where hand links touch the object are read from the physics buffers.
  • Grasp Instruction Annotation (for Dexonomy): The Gemma-3 [29] VLM generates grasp descriptions for Dexonomy. Several viewpoints are rendered for each grasp, and the VLM is prompted with the rendered images and the contact information. The VLM identifies the object category, infers which functional part is contacted, and produces a textual grasp description.

Experiments and Results

DextER is evaluated on language-conditioned dexterous grasp generation using the DexGYS validation set.

  • DexGYS benchmark: DextER achieves a 67.14% grasp success rate, 3.83 percentage points above the previous SOTA. Its P-FID (Fréchet distance) score of 0.20 is a 96.4% improvement in intention alignment over the 5.60 of the previous SOTA, DexGYSNet [36]. The generated grasps therefore match the task intent expressed in language far more closely.
    • Role of Embodied Reasoning (ER): Without ER (w/o ER), P-FID rises from 0.20 to 0.30 (50% worse) and the success rate falls from 67.14% to 62.37%. Explicit contact prediction matters for both intention alignment and physical quality.
  • Ablation Study (Table 2):
    • ECoT: Removing ECoT substantially degrades both P-FID and success rate.
    • Token discretization granularity: $N_a = N_{pos} = 256$ bins gives the best performance for both action and position tokens.
    • Contact position dropout ($p_{drop}$): $p_{drop} = 0.5$ performs best, confirming that a moderate amount of dropout has a regularizing effect.
    • Point Cloud Encoder: PartField [22] outperforms Uni3D [49], because PartField's part-aware feature extraction fits contact-based reasoning naturally.
  • Zero-Shot Generalization (Table 3, top): Trained and evaluated on the Dexonomy dataset, DextER outperforms the baseline methods in every zero-shot scenario, including "Unseen Objects", "Unseen Grasp Taxonomy", and "Unseen Both".
  • Steerable Grasp Generation (Table 3, bottom): Because DextER is autoregressive, users can control grasp generation by supplying a partial ECoT sequence. When 1 to 5 links are specified, intention alignment (P-FID, CD) improves steadily as more links are given. Success rate also rises from two links onward, after a dip when only a single link is specified.
  • Contact Reasoning Quality (Table 4): Contact link prediction is evaluated with IoU, Precision, Recall, and F1, and spatial accuracy with Position Accuracy (1 cm threshold). The results are reasonable and support the accuracy of contact-based embodied reasoning.

Conclusion

DextER presents a new approach to language-conditioned dexterous grasp generation that uses embodied reasoning through contact prediction. The method achieves a 67.14% grasp success rate on DexGYS, 3.83 percentage points above the previous SOTA, and improves intention alignment by 96.4%. This shows that contact reasoning is important for understanding task semantics and for generating diverse, stable grasp configurations. The autoregressive generation framework also enables steerable grasp generation, in which users guide the model by specifying partial contact constraints, giving fine-grained control over the generated grasp.

Limitations

The autoregressive framework is vulnerable to compounding errors, and the current evaluation focuses on a single static object, which limits applicability to complex real-world scenes. Sequential token prediction may also constrain real-time performance.

🔔 Ring Review

🔔 Ring: an idea that echoes. Grasp the core and its value.

1. Introduction: Why Are Robot Hands Still So Clumsy?

The human hand is remarkably flexible. On hearing "grab the mug by the handle and pour", we wrap the thumb and index finger around the curve of the handle, add stability with the remaining fingers, and even adjust the wrist angle naturally. All of this happens in an instant, without conscious thought.

What happens when we ask a multi-jointed robot hand (dexterous hand) to do the same? It must control more than 20 degrees of freedom (DOF) simultaneously, understand the 3D shape of the object, and translate a language instruction such as "grab the handle" into a physical contact pattern. This is the problem of language-driven dexterous grasp generation.

1.1 Limitations of Existing Approaches

Recent work has tackled this problem with Vision-Language Models (VLMs). Methods such as DexGYSNet, SemGrasp, and DexVLG have made progress by fusing 3D visual representations with language understanding. They share one problem, however.

A direct mapping from observation to grasp parameters

These methods take a language instruction and a 3D shape and immediately output the joint angles and position of the hand. There is no explicit intermediate reasoning about where the hand will touch the object. It is like writing only the final answer on a math exam without showing the work. The answer is sometimes right, but the model cannot explain why, and it is fragile on new types of problems.

1.2 The Core Idea of DextER

DextER (Dexterous Grasp Generation with Embodied Reasoning), proposed by a research team at POSTECH, starts from this question.

"What should the intermediate reasoning representation be for the physical interaction of a multi-jointed hand?"

The answer is contact. The model first predicts which finger link touches which location on the object, and then uses that prediction as a stepping stone to generate the final grasp pose.

Think about it intuitively. People do the same. When we pick up scissors, we unconsciously form a contact plan first: the index and middle fingers must go through the loops. DextER builds this natural reasoning process explicitly into the model.

This is Embodied Chain-of-Thought (ECoT): a chain of thought that reflects the physical structure of the robot's body.


2. Method: Dissecting the DextER Architecture

2.1 Problem Formulation

Mathematically, the problem DextER solves is as follows.

Input: a 3D point cloud of the object $\mathbf{P} \in \mathbb{R}^{N \times 3}$ and a language instruction $\mathbf{T}$
Output: the grasp pose of the hand $\mathbf{a} \in \mathbb{R}^D$ ($D=28$, palm pose + joint angles)

Where existing methods model $p(\mathbf{a} \mid \mathbf{P}, \mathbf{T})$ directly, DextER factorizes it into two stages:

$$p(\mathbf{a}, \mathcal{C} \mid \mathbf{P}, \mathbf{T}) = \underbrace{p(\mathcal{C} \mid \mathbf{P}, \mathbf{T})}_{\text{contact reasoning}} \cdot \underbrace{p(\mathbf{a} \mid \mathcal{C}, \mathbf{P}, \mathbf{T})}_{\text{grasp generation}}$$

Here $\mathcal{C} = \{(l_i, \mathbf{p}_i)\}$ is the contact set, where $l_i$ is a hand link name and $\mathbf{p}_i \in \mathbb{R}^3$ is a contact position on the object surface.

Why is this factorization powerful? Because the contact pattern $\mathcal{C}$ bridges "language meaning" and "physical constraints". The phrase "grab the handle" leads naturally to "thumb_base and ff_distal contact the handle region", which in turn leads to concrete joint angles.
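
Combined with the standard next-token objective described in this review, the factorization can be expanded token by token. The following is only a sketch: the token symbols $c_j$ and $a_k$ and the sequence lengths $J$ and $K$ are notation introduced here for illustration and are not taken from the paper.

```latex
\begin{aligned}
-\log p(\mathbf{a}, \mathcal{C} \mid \mathbf{P}, \mathbf{T})
  &= -\log p(\mathcal{C} \mid \mathbf{P}, \mathbf{T})
     - \log p(\mathbf{a} \mid \mathcal{C}, \mathbf{P}, \mathbf{T}) \\
  &= -\sum_{j=1}^{J} \log p(c_j \mid c_{<j}, \mathbf{P}, \mathbf{T})
     - \sum_{k=1}^{K} \log p(a_k \mid a_{<k}, \mathcal{C}, \mathbf{P}, \mathbf{T})
\end{aligned}
```

Here $c_1, \dots, c_J$ would be the serialized contact tokens and $a_1, \dots, a_K$ the action tokens ($K = 28$ in this setup). Under this reading, a single cross-entropy loss over the concatenated sequence trains both stages at once.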

2.2 Overall Architecture

DextER consists of three modules.

flowchart LR
    subgraph Input["Input"]
        PC["Point cloud\nP ∈ ℝ^(N×3)"]
        TXT["Language instruction\n'grasp handle to pour'"]
    end

    subgraph Encoder["Encoder"]
        PF["PartField\n3D encoder\n(part-aware features)"]
        TE["Qwen2.5 Tokenizer\ntext tokenizer"]
        MLP["MLP Projector\n(2-layer)"]
    end

    subgraph Backbone["LLM backbone (Qwen2.5-0.5B)"]
        direction TB
        VT["Visual tokens\n(768, bidirectional attention)"]
        LT["Language tokens\n(causal attention)"]
        CT["Contact token generation\n⟨contact_start⟩\n⟨link⟩⟨px⟩⟨py⟩⟨pz⟩...\n⟨contact_end⟩"]
        AT["Action token generation\n⟨action_start⟩\n{28×256-bin tokens}\n⟨action_end⟩"]
    end

    subgraph Output["Output"]
        CP["Contact positions\n(3D coordinates on the object surface)"]
        GP["Grasp pose\n(palm pose + joint angles)"]
    end

    PC --> PF --> MLP --> VT
    TXT --> TE --> LT
    VT & LT --> CT --> AT
    CT --> CP
    AT --> GP
Figure 1: Overview of the DextER architecture

① 3D vision encoder: PartField

The reason for choosing PartField to process the point cloud is interesting. PartField is a part-segmentation-aware 3D encoder pretrained by contrastive learning against 2D SAM masks. It extracts local part-geometry features rather than global object features.

Why does this matter? DextER's contact reasoning has to predict which part is touched. Features that recognize fine-grained parts such as handles, lids, and buttons help contact prediction directly. The ablation confirms this with a large gap over Uni3D: P-FID 0.52→0.20 and success rate 59.07%→67.14%.

The encoder output is 768 visual tokens downsampled from the triplane feature map.

② Action tokenization

This step converts continuous grasp parameters into discrete tokens.

  1. For each of the 28 dimensions, normalize the 1st to 99th percentile range to [-1, 1]
  2. Split each dimension into $N_\mathbf{a} = 256$ uniform bins
  3. Assign a unique token ⟨action_bin_i⟩ to each bin

A single grasp pose is therefore represented as a sequence of 28 discrete tokens.
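
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the helper names `fit_bounds`, `tokenize`, and `detokenize` are hypothetical, and clipping values outside the percentile range into the edge bins is an assumption.

```python
import numpy as np

N_BINS = 256  # N_a in the text


def fit_bounds(grasps: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-dimension 1st and 99th percentiles of a (num_grasps, 28) array."""
    return np.percentile(grasps, 1, axis=0), np.percentile(grasps, 99, axis=0)


def tokenize(a, lo, hi, n_bins=N_BINS):
    """Map a continuous pose to one bin index per dimension."""
    x = 2.0 * (a - lo) / (hi - lo) - 1.0   # percentile range -> [-1, 1]
    x = np.clip(x, -1.0, 1.0)              # assumption: outliers go to edge bins
    idx = np.floor((x + 1.0) / 2.0 * n_bins).astype(int)
    return np.minimum(idx, n_bins - 1)     # x == 1.0 belongs to the last bin


def detokenize(idx, lo, hi, n_bins=N_BINS):
    """Recover the bin-center value for each index."""
    x = (idx + 0.5) / n_bins * 2.0 - 1.0
    return (x + 1.0) / 2.0 * (hi - lo) + lo


# Usage with synthetic data standing in for a grasp dataset.
rng = np.random.default_rng(0)
grasps = rng.normal(size=(1000, 28))
lo, hi = fit_bounds(grasps)
tokens = tokenize(grasps[0], lo, hi)
names = [f"<action_bin_{i}>" for i in tokens]  # 28 token strings
```

With 256 bins over [-1, 1], the round-trip error for in-range values is at most half a bin width, i.e. 1/256 of the normalized range.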

Why tokens instead of continuous values? Because the LLM's next-token prediction objective can then be used as is. The existing VLM training pipeline is reused without a separate regression head.

③ Contact representation (Contact Tokens)

Contact information is tokenized in the following format:

⟨|contact_start|⟩
⟨thbase⟩⟨px⟩⟨py⟩⟨pz⟩    ← thumb base touches (px,py,pz)
⟨ffdistal⟩⟨px⟩⟨py⟩⟨pz⟩   ← index finger distal link touches (px,py,pz)
⟨mfmiddle⟩⟨px⟩⟨py⟩⟨pz⟩   ← middle finger middle link touches (px,py,pz)
⟨|contact_end|⟩

Position coordinates are discretized into $N_{\text{pos}} = 256$ bins. Both the link name tokens and the position tokens are added to the pretrained tokenizer's vocabulary as special tokens.
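
A small sketch of how such a contact sequence might be built from (link, position) pairs. The function names and the `<pos_bin_i>` token spelling are hypothetical, and sharing one bin vocabulary across the x, y, and z axes is an assumption. ASCII angle brackets stand in for the ⟨ ⟩ used above.

```python
import numpy as np

N_POS = 256  # N_pos in the text


def quantize_position(p, box_min, box_max, n_bins=N_POS):
    """Normalize a 3D point into the fixed bounding box and bin each axis."""
    u = (np.asarray(p, dtype=float) - box_min) / (box_max - box_min)
    u = np.clip(u, 0.0, 1.0)
    return np.minimum((u * n_bins).astype(int), n_bins - 1)


def serialize_contacts(contacts, box_min, box_max):
    """contacts: list of (link_name, xyz). Returns a list of token strings."""
    tokens = ["<|contact_start|>"]
    for link, p in contacts:
        ix, iy, iz = quantize_position(p, box_min, box_max)
        # "<pos_bin_i>" is an illustrative name, not the paper's token spelling.
        tokens += [f"<{link}>", f"<pos_bin_{ix}>", f"<pos_bin_{iy}>", f"<pos_bin_{iz}>"]
    tokens.append("<|contact_end|>")
    return tokens


# Usage: two contacts inside a unit-style bounding box.
box_min = np.array([-1.0, -1.0, -1.0])
box_max = np.array([1.0, 1.0, 1.0])
sequence = serialize_contacts(
    [("thbase", [0.0, 0.0, 0.0]), ("ffdistal", [1.0, -1.0, 0.5])],
    box_min,
    box_max,
)
```

Each contact costs four tokens, so a grasp with k contacts adds 4k + 2 tokens before the 28 action tokens are generated.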

2.3 Hybrid Attention Mechanism

The transformer attention design contains a clever choice.

  • Point cloud tokens: bidirectional attention → capture global context over the whole 3D shape
  • Language and action tokens: causal attention → keep standard autoregressive generation

The design is intuitive. The shape of an object can only be understood by looking at the whole at once, whereas order matters for text and actions. DextER mixes the two behaviors.
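
One way to realize this is a single boolean attention mask. The sketch below assumes the visual tokens form a prefix of the sequence, followed by the language, contact, and action tokens. That layout, and the function name, are assumptions for illustration.

```python
import numpy as np


def hybrid_attention_mask(n_visual: int, n_text: int) -> np.ndarray:
    """Boolean mask M where M[i, j] is True if query i may attend to key j.

    Assumed layout: visual tokens first, then language/contact/action tokens.
    """
    n = n_visual + n_text
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal baseline
    mask[:n_visual, :n_visual] = True            # visual block is bidirectional
    return mask


# Usage: 3 visual tokens followed by 2 text tokens.
mask = hybrid_attention_mask(3, 2)
print(mask.astype(int))
```

Visual tokens see each other in both directions but never look ahead into the text, while every later token attends to all visual tokens and to earlier text only.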

2.4 Meta-Prompts and Contact Position Dropout

Meta-prompt: a prompt that steers the model to perform contact reasoning first. Example:
> "Think step by step: first predict which links contact where on the object, then predict the grasp pose"

During training, meta-prompts with varied wording are used to prevent overfitting to a specific phrase.

Contact position dropout: during training, the position tokens ⟨p_ix⟩⟨p_iy⟩⟨p_iz⟩ are removed with probability $p_{\text{drop}} = 0.5$, while the link token ⟨l_i⟩ is kept.

Why is this needed? It has two effects. First, it prevents overfitting. Second, it enables steerable generation: at inference time the user can specify links only, or partially specify links and positions, and let the model complete the rest.
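
A sketch of the dropout step on an already tokenized contact list. It assumes the drop decision is made once per training sample rather than per contact, which matches the later remark that half of the samples are trained with links only. The paper may implement it differently, and the function name is hypothetical.

```python
import random


def drop_contact_positions(contacts, p_drop=0.5, rng=random):
    """With probability p_drop, keep only the link tokens for this sample.

    contacts: list of (link_token, (px_token, py_token, pz_token)).
    Assumption: one drop decision per sample, not per contact.
    """
    drop = rng.random() < p_drop
    tokens = ["<|contact_start|>"]
    for link, pos in contacts:
        tokens.append(link)
        if not drop:
            tokens.extend(pos)
    tokens.append("<|contact_end|>")
    return tokens


# Usage: the same sample with dropout forced on and off.
sample = [("<thbase>", ("<px>", "<py>", "<pz>")), ("<ffdistal>", ("<px>", "<py>", "<pz>"))]
links_only = drop_contact_positions(sample, p_drop=1.0)
full = drop_contact_positions(sample, p_drop=0.0)
```

Seeing both forms during training is what lets the model accept either a links-only prefix or a prefix with positions at inference time.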

2.5 Dataset Curation: MuJoCo + VLM Automatic Annotation

DextER is trained on two datasets:

| Dataset | Characteristics | Role in DextER |
|---|---|---|
| DexGYS | 1,800 objects, 50,000 grasp-language pairs | Provides scale and language diversity |
| Dexonomy | 31 grasp taxonomy types (power grasp, precision pinch, etc.) | Provides structured grasp variation |

Neither dataset had contact annotations, so they were generated automatically with the MuJoCo physics engine:

  1. Load the hand and object models into MuJoCo
  2. Run forward kinematics
  3. Extract contact data from the physics buffers → obtain which link touches where

Dexonomy has no language descriptions, so they were generated automatically with the Gemma-3 VLM:

  1. Render 5 multi-view images for each grasp
  2. Prompt the VLM with the renderings plus anatomical contact information
  3. Generate the object category, the contacted functional part (handle, rim, etc.), and the grasp description text

What matters in practice is that this pipeline builds large-scale training data fully automatically.


3. Experiments: DextER's Performance in Numbers

3.1 Implementation Details

  • Vision encoder: PartField (pretrained weights frozen)
  • LLM backbone: Qwen2.5-0.5B (the smallest model in the Qwen2.5 family)
  • Vision projector: 2-layer MLP
  • Training: AdamW, lr=1e-4, cosine decay, batch=64, 100K iterations
  • Hardware: NVIDIA A6000 GPU × 8
  • Simulation: Isaac Gym for DexGYS, MuJoCo (DexGraspBench) for Dexonomy

Worth noting: SOTA was achieved with a small 0.5B-parameter LLM. The message is that the design of the reasoning structure matters more than model size.

3.2 DexGYS Benchmark Results

Evaluation metrics:

  • P-FID ↓: Fréchet distance between the point cloud feature distributions of generated and reference grasps. Lower means better intention alignment
  • CD ↓: Chamfer Distance, the difference in hand mesh shape
  • Con. ↓: L2 distance between contact maps
  • Success ↑: success rate in Isaac Gym simulation
  • Q₁ ↑: force-closure quality (grasp stability)
  • Pen. ↓: hand-object penetration depth
  • δt, δr, δq ↑: generation diversity (translation, rotation, joints)

Table 1: Quantitative comparison on the DexGYS benchmark.

| Method | P-FID↓ | CD↓ | Con.↓ | Success↑ (%) | Q₁↑ | Pen.↓ | δt↑ | δr↑ | δq↑ |
|---|---|---|---|---|---|---|---|---|---|
| GraspCVAE | 29.02 | 3.14 | 0.96 | 29.12 | 0.54 | 0.55 | 0.18 | 1.76 | 0.18 |
| GraspTTA | 33.15 | 12.19 | 1.11 | 43.46 | 0.71 | 0.19 | 2.11 | 6.15 | 3.87 |
| SceneDiffusers | 7.93 | 1.68 | 0.45 | 62.24 | 0.83 | 0.25 | 0.35 | 3.46 | 0.39 |
| DGTR | 15.77 | 2.90 | 0.78 | 51.91 | 0.78 | 0.16 | 2.05 | 14.01 | 4.30 |
| DexGYSNet | 5.60 | 1.20 | 0.36 | 63.31 | 0.83 | 0.22 | 6.12 | 55.68 | 6.12 |
| DextER (w/o ER) | 0.30 | 1.95 | 0.40 | 62.37 | 0.66 | 0.44 | 8.78 | 77.13 | 13.77 |
| DextER | 0.20 | 1.46 | 0.34 | 67.14 | 0.89 | 0.37 | 8.84 | 77.98 | 13.63 |

Interpreting the results:

The most striking number is the P-FID of 0.20, a 96.4% improvement over the 5.60 of the previous SOTA, DexGYSNet. It means the grasps DextER generates match the grasp distribution intended by the language instruction far more closely.

The success rate also improves, from 63.31% to 67.14% (+3.83 percentage points). Not only "how the object is grasped" but also "whether it is grasped well" improves at the same time.

The ECoT removal experiment (w/o ER) is even more interesting. Without ECoT, P-FID is already 0.30, far below DexGYSNet's 5.60, although the success rate of 62.37% is slightly below DexGYSNet's 63.31%. The intention alignment gain is therefore largely the contribution of the VLA architecture itself (PartField + Qwen2.5). Adding ECoT then lowers P-FID from 0.30 to 0.20 (a further 33% improvement) and raises the success rate from 62.37% to 67.14%, which is what moves DextER past DexGYSNet on success rate. Contact reasoning makes a meaningful additional contribution on top of the architectural improvement.

The diversity metrics are also notable. δr reaches 77.98, far higher than previous methods (55.68 for DexGYSNet). Being able to generate varied grasp strategies for the same instruction matters in real deployment.
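
The percentages quoted in this section follow directly from Table 1. The short check below also shows why the same pair of P-FID values (0.20 and 0.30) is described as both a 33% improvement and a 50% degradation: the baseline of the ratio differs. The function name is just an illustrative helper.

```python
def relative_reduction(old: float, new: float) -> float:
    """Percentage by which a lower-is-better metric fell from old to new."""
    return (old - new) / old * 100.0


pfid_vs_sota = round(relative_reduction(5.60, 0.20), 1)   # DexGYSNet -> DextER
pfid_adding_er = round(relative_reduction(0.30, 0.20), 1)  # w/o ER -> full model
pfid_removing_er = round((0.30 - 0.20) / 0.20 * 100.0, 1)  # full model -> w/o ER
success_gain = round(67.14 - 63.31, 2)                     # percentage points

print(pfid_vs_sota, pfid_adding_er, pfid_removing_er, success_gain)
```

This reproduces the 96.4%, 33%, 50%, and 3.83-point figures used in the text.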

3.3 Ablation Study

xychart-beta
    title "P-FID (lower is better) - Ablation comparison"
    x-axis ["w/o ECoT", "ECoT (default)", "Na=128", "Na=256", "Na=512", "Npos=128", "Npos=256", "pdrop=0.0", "pdrop=0.5", "pdrop=1.0", "Uni3D", "PartField"]
    y-axis "P-FID" 0 --> 0.6
    bar [0.30, 0.20, 0.21, 0.20, 0.26, 0.21, 0.20, 0.22, 0.20, 0.30, 0.52, 0.20]
Figure 2: Summary of the ablation study results

| Design choice | Default | Key finding |
|---|---|---|
| ECoT | Enabled | Without it, P-FID +50% and success rate -4.77 percentage points |
| Action bins ($N_\mathbf{a}$) | 256 | 128 loses precision; 512 degrades performance through a larger vocabulary |
| Position bins ($N_{\text{pos}}$) | 256 | Same pattern. The "Goldilocks" value of 256 is optimal |
| Contact position dropout ($p_{\text{drop}}$) | 0.5 | Too little (0.0) weakens generalization; too much (1.0) cancels the ECoT effect |
| Point cloud encoder | PartField | P-FID 0.52→0.20 and success rate +8.07 percentage points versus Uni3D |

A practically important point is that the choice of encoder has a larger effect than ECoT (P-FID 0.52 with Uni3D versus 0.30 without ECoT). The ablation shows clearly that part-aware geometric representations mesh well with contact-based reasoning.

3.4 Zero-Shot Generalization (Dexonomy Dataset)

To see how DextER copes with objects and grasp types not seen during training, it was tested on four splits.

| Split | P-FID↓ | Success↑ (%) |
|---|---|---|
| Seen Obj. & Grasp | 0.44 | 12.24 |
| Unseen Obj. | 1.44 | 10.86 |
| Unseen Grasp Taxonomy | 1.04 | 9.10 |
| Unseen Both | 1.23 | 8.41 |

The DexGYS baseline method is far worse even in the "Seen" condition, with a P-FID of 1.89 and a success rate of 0.97%. DextER is better by a wide margin in every condition.

An interesting pattern: measured by success rate, generalizing to a new grasp type (taxonomy) is harder than generalizing to a new object (9.10% versus 10.86%), although P-FID is actually worse on unseen objects (1.44 versus 1.04). The success rate gap makes intuitive sense. A new object often has a shape similar to something seen before, whereas an entirely different grasp strategy (for example, a new kind of precision manipulation) demands a fundamentally different contact pattern.

3.5 Steerable Generation: "Grasping the Way You Want"

This is one of DextER's most original features. Exploiting autoregressive generation, the user supplies a partial contact specification as a prefix and the model completes the rest.

Example: if the user specifies "the thumb and index finger must touch here", the model generates the contacts of the remaining fingers and the full grasp pose.

| Links specified | P-FID↓ | CD↓ | Success↑ (%) |
|---|---|---|---|
| 0 (free generation) | 0.44 | 18.32 | 12.24 |
| 1 link | 0.43 | 5.51 | 10.40 |
| 2 links | 0.28 | 2.33 | 14.67 |
| 3 links | 0.18 | 1.50 | 17.84 |
| 4 links | 0.14 | 0.91 | 20.14 |
| 5 links | 0.12 | 0.73 | 21.35 |

More constraints → better intention alignment and, from two links onward, a higher success rate (with a single link the success rate dips to 10.40%, below free generation). The constraints act as real guidance. This is directly usable in industrial applications where an object must be grasped in one particular way, such as precision assembly or the use of special tools.
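
Conceptually, steering amounts to prefilling part of the contact block and letting the decoder continue. The sketch below only builds that prefix as a token list. The function name, the ordering of instruction and meta-prompt, and the position token names are all hypothetical, and the decoding call itself is omitted because it depends on the model interface.

```python
def build_steering_prefix(instruction, meta_prompt, fixed_contacts):
    """Return the token list the decoder would be asked to continue.

    fixed_contacts: list of (link_token, position_tokens_or_None).
    The contact block is left open (no end delimiter) so the model can add
    further contacts before closing it and emitting the action tokens.
    """
    prefix = [instruction, meta_prompt, "<|contact_start|>"]
    for link, pos in fixed_contacts:
        prefix.append(link)
        if pos is not None:      # links-only steering leaves positions to the model
            prefix.extend(pos)
    return prefix


# Usage: fix the thumb contact fully and name the index finger link only.
prefix = build_steering_prefix(
    "grasp the handle to pour",
    "Think step by step: first predict which links contact where on the object, "
    "then predict the grasp pose",
    [("<thbase>", ("<pos_bin_10>", "<pos_bin_20>", "<pos_bin_30>")), ("<ffdistal>", None)],
)
```

Because the last fixed link has no position, the next tokens the model produces would be that link's position, which is the "links only" mode that contact position dropout prepares the model for.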

3.6 Evaluating Contact Reasoning Quality

| Metric | Value |
|---|---|
| IoU (link prediction) | 0.42 |
| Precision | 0.59 |
| Recall | 0.63 |
| F1 | 0.57 |
| Position Accuracy (within 1 cm) | 0.79 |

An F1 of 0.57 is not perfect. The position accuracy of 79% is impressive, however: 79% of the predicted contact positions lie within 1 cm of the link positions computed by forward kinematics of the actual hand. With this level of spatial precision, contact reasoning provides real geometric guidance for grasp generation.
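
For reference, these metrics can be computed per grasp roughly as follows. This is a sketch under assumptions: links are compared as sets, positions are compared only for links that are already matched, and the function names are illustrative. The paper's exact protocol may differ.

```python
import numpy as np


def link_set_metrics(pred: set, true: set) -> dict:
    """IoU, precision, recall, and F1 between predicted and true contact links."""
    tp = len(pred & true)
    union = len(pred | true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {
        "iou": tp / union if union else 1.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def position_accuracy(pred_xyz, true_xyz, threshold_m=0.01):
    """Fraction of matched contacts whose predicted point is within threshold."""
    d = np.linalg.norm(np.asarray(pred_xyz) - np.asarray(true_xyz), axis=1)
    return float(np.mean(d <= threshold_m))


# Usage on one grasp: 2 of 3 predicted links are correct, 2 of 4 true links found.
metrics = link_set_metrics(
    {"thbase", "ffdistal", "mfmiddle"},
    {"thbase", "ffdistal", "rfdistal", "lfdistal"},
)
acc = position_accuracy([[0, 0, 0], [0, 0, 0.02]], [[0, 0, 0.005], [0, 0, 0]])
```

If the scores are averaged per grasp like this, the mean F1 need not equal the harmonic mean of the mean precision and mean recall. That is one possible reason the reported F1 (0.57) is lower than the roughly 0.61 that the reported precision and recall would give directly.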


4. Critical Discussion: Strengths and Limitations of DextER

4.1 Strengths

① The design philosophy of the reasoning structure is sound

The authors recognize the limits of direct "input → output" mapping and design an intermediate representation (contacts) that is meaningful from a robotics standpoint. This is not a mere engineering trick. It encodes directly into the model structure the basic principle that, in the physical world, a grasp succeeds only if the contacts are right.

② Automated construction of large-scale training data

The pipeline of MuJoCo-based automatic contact annotation and VLM-based automatic language annotation scales well. The same pipeline can be applied to new datasets.

③ Practicality of steerable generation

Industrial settings often impose constraints of the form "this part must be grasped in this way". Steerable generation accommodates such requirements naturally. It turns the closed, black-box character of VLM-based methods into an open interface.

④ SOTA with a small model

Qwen2.5-0.5B is far smaller than large VLMs, so it can reach latencies better suited to real-time robot control.

4.2 Limitations and Open Questions

① Sim-to-real gap: still confined to simulation

All of DextER's experiments run in Isaac Gym and MuJoCo simulation. There is no validation on a real robot. Real sensor noise, non-uniform friction on object surfaces, and hand calibration errors can invalidate contact patterns learned in simulation. Sim-to-real transfer for multi-jointed hands is much harder than for parallel grippers.

② Contact reasoning F1 = 0.57: an imperfect intermediate step

The intermediate representation of ECoT (the contact prediction) is not perfect. F1 is not an accuracy, so 0.57 does not translate directly into an error rate. With a precision of 0.59 and a recall of 0.63, however, roughly four in ten predicted contact links are wrong and more than a third of the true contact links are missed. If final grasp performance is good despite this, the model may have learned a shortcut of the form "wrong contact reasoning → correct grasp". In other words, it is hard to tell whether ECoT performs genuine reasoning or is simply a trick that improves performance.

③ Vulnerability to new grasp types

In the Dexonomy experiments, generalization to unseen grasp taxonomies is limited. The authors themselves acknowledge that "the model grasps the object but is unstable (shaking)". A different grasp taxonomy implies a fundamentally different contact pattern, and the current model does not reason about this properly.

④ Single grasp pose generation

Real manipulation requires a sequence: grasp → move → manipulate. DextER generates only a grasp pose at a single point in time. It does not evaluate whether that grasp is optimal for the subsequent manipulation task, for example whether "grasping in order to pour" actually suits the pouring motion.

⑤ Contact dropout = incomplete ECoT

$p_{\text{drop}} = 0.5$ means that half of the training samples are learned with links only, without contact positions. This may weaken the spatial reasoning ability of ECoT. It is a trade-off between the convenience of steerable generation and the completeness of the reasoning.

⑥ Single hand model (ShadowHand)

The link tokens are designed specifically for the ShadowHand. Transferring to other hand platforms such as the Allegro Hand or LEAP Hand requires new link tokens and retraining. The design is not embodiment-agnostic.


5. Comparison with Related Work

graph TD
    A["Physics-based grasp optimization\n(GraspIt!, Force Closure)"] --> B["Data-driven grasp generation\n(DexGraspNet, UniDexGrasp)"]
    B --> C["Language-conditioned grasping\n(DexGYSNet, SemGrasp)"]
    C --> D1["Two-stage pipeline\n(language → affordance → grasp)\nAffordDexGrasp, DexGraspVLA"]
    C --> D2["End-to-End VLM\n(DexVLG, SemGrasp)"]
    D1 --> E["DextER\n(Contact-based ECoT)"]
    D2 --> E
    F["LLM Chain-of-Thought\n(Wei et al. 2022)"] --> G["Embodied CoT\n(ECoT, ThinkAct, EMMA)"]
    G --> E
    style E fill:#ff9900,color:#fff
Figure 3: Lineage of research on language-driven dexterous grasping

| Method | Paradigm | Intermediate representation | Intention alignment | Physical quality | Controllability |
|---|---|---|---|---|---|
| DexGYSNet | End-to-End | None | Fair | Good | None |
| DexVLG | End-to-End VLM | None | Good | Good | None |
| AffordDexGrasp | Two-stage | Affordance map | Good | Good | Limited |
| DexGraspVLA | Two-stage + VLA | Plan text | Good | Very good | Limited |
| DextER | End-to-End ECoT | Contacts (physical) | Very good | Very good | High |

The comparison with DexGraspVLA is interesting. DexGraspVLA reports a high success rate of 89.6%, but that figure is for simple grasping (including non-prehensile cases), and it does not explicitly measure language-intention alignment. DextER stands out particularly in intention alignment.


6. Summary and Conclusion

DextER addresses a fundamental question in language-driven multi-jointed robot grasp generation: what should be reasoned about in the middle? Its answer is contact, that is, which link of the hand touches which part of the object.

The idea is simple but powerful. The model explicitly learns a bridge between language ("grab the handle") and physics ("ff_distal contacts the 3D coordinates of the handle region").

Key contributions:

  • Contact-based Embodied Reasoning (ECoT): uses contacts as an intermediate thinking step
  • An automated pipeline for building large-scale training data (MuJoCo + VLM)
  • Steerable Generation: guides the grasp with a partial contact specification
  • DexGYS SOTA: 67.14% success rate and a 96.4% improvement in P-FID

Future research directions:

  • Validation on real robots (sim-to-real)
  • Transfer to other hand platforms (Allegro, LEAP, etc.)
  • Integrating contact reasoning with manipulation planning
  • Stronger intermediate reasoning (from contact points alone → contact forces, contact order)

The day robots understand language and move their hands freely will be built on reasoning-structure research that accumulates step by step like this. DextER is a good milestone on that road and illustrates the design philosophy well.

Copyright 2026, JungYeon Lee