Curieux.JY
  • JungYeon Lee

📃 DextER Review

llm
grasp
reasoning
Language-driven Dexterous Grasp Generation with Embodied Reasoning
Published

March 9, 2026

  • Paper Link
  • Project Link
  • Code Link
  1. 🤖 DextER proposes contact-based embodied reasoning for language-driven dexterous grasp generation: the model predicts where each hand link contacts the object.
  2. 📈 The model reaches a 67.14% grasp success rate on the DexGYS benchmark, surpassing the previous state of the art, and improves intention alignment by 96.4%.
  3. 🎯 DextER's autoregressive framework offers fine-grained control: users can steer grasp generation by specifying partial contact constraints.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

This paper proposes DextER for language-driven dexterous grasp generation. Existing Vision-Language Models (VLMs) map observations directly to grasp parameters, so they lack intermediate reasoning about physical interaction. DextER introduces contact-based embodied reasoning for multi-finger hand manipulation. Its key insight is to predict which hand links make contact and where on the object they do so. This contact prediction provides an embodiment-aware intermediate representation that connects high-level task semantics to the physical constraints imposed by the robot's embodiment and the object's geometry.

Core Methodology

Given a 3D point cloud P \in \mathbb{R}^{N \times 3} and a language instruction T, DextER aims to predict the grasp pose \mathbf{a} \in \mathbb{R}^D of a dexterous hand, where D is the number of degrees of freedom of the hand. The model factorizes this prediction through an intermediate contact pattern C:

p(\mathbf{a}, C \mid P, T) = p(C \mid P, T) \cdot p(\mathbf{a} \mid C, P, T)

Here the contact prediction p(C \mid P, T) acts as the embodied reasoning step that links language and geometric understanding to grasp generation.
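Because both the contact pattern and the grasp pose are emitted as discrete tokens by an autoregressive model, each factor can be expanded further with the chain rule. The block below is a sketch of that expansion, not a formula taken from the paper: the token symbols c_k (the k-th of K contact tokens) and a_j (the token for the j-th action dimension) are notation introduced here for illustration.

```latex
\begin{aligned}
p(\mathbf{a}, C \mid P, T) &= p(C \mid P, T)\, p(\mathbf{a} \mid C, P, T) \\
p(C \mid P, T) &= \prod_{k=1}^{K} p(c_k \mid c_{<k}, P, T) \\
p(\mathbf{a} \mid C, P, T) &= \prod_{j=1}^{D} p(a_j \mid a_{<j}, C, P, T)
\end{aligned}
```

Under this reading, every action token is conditioned on the full contact sequence that precedes it, which is why the contact prediction can act as a reasoning step before the grasp pose is produced.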

1. Model Architecture

DextER consists of a 3D vision encoder, a multimodal projector, and a Large Language Model (LLM) backbone.

  • Point Cloud Encoding: PartField [22] extracts geometric features F \in \mathbb{R}^{M \times d} from the input point cloud P. PartField is pretrained for part segmentation through contrastive learning on 2D SAM masks, so it produces part geometry-aware features that help localize contact positions on the object surface accurately. A lightweight MLP projects the extracted features into the LLM embedding space.
  • Action Tokenization: The continuous grasp parameter \mathbf{a} (28 dimensions, covering the palm pose and joint angles) is tokenized into a discrete token space. Each dimension is quantized into N_a uniform bins, and each quantized value is represented by a unique token \langle \text{action\_bin\_i} \rangle. The full action sequence is wrapped in the special tokens \langle |\text{action\_start}| \rangle and \langle |\text{action\_end}| \rangle.
  • LLM Backbone: Qwen2.5-0.5B [30, 42] serves as the LLM backbone. It fuses the point cloud embeddings with the text prompt and generates the discrete contact and action tokens autoregressively.
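The action tokenization step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-dimension ranges `lows` and `highs`, the function names, and the bin-center decoding are all hypothetical, while the 28 dimensions and the token spellings follow the description above.

```python
from typing import List, Sequence

ACTION_START = "<|action_start|>"
ACTION_END = "<|action_end|>"


def quantize(value: float, low: float, high: float, num_bins: int) -> int:
    """Map a value in [low, high] (low < high) to a bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * num_bins)
    return min(index, num_bins - 1)  # value == high lands in the last bin


def dequantize(index: int, low: float, high: float, num_bins: int) -> float:
    """Decode a bin index back to the center of its bin (an assumed convention)."""
    return low + (index + 0.5) * (high - low) / num_bins


def tokenize_action(action: Sequence[float], lows: Sequence[float],
                    highs: Sequence[float], num_bins: int = 256) -> List[str]:
    """Turn a continuous grasp vector into delimiter-wrapped action tokens."""
    tokens = [ACTION_START]
    for value, low, high in zip(action, lows, highs):
        tokens.append(f"<action_bin_{quantize(value, low, high, num_bins)}>")
    tokens.append(ACTION_END)
    return tokens


if __name__ == "__main__":
    dims = 28  # dimensionality of the grasp parameter stated in the text
    tokens = tokenize_action([0.0] * dims, [-1.0] * dims, [1.0] * dims)
    print(len(tokens), tokens[:3])
```

With uniform bins, the reconstruction error per dimension is bounded by half a bin width, which is the trade-off the bin count N_a controls.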

2. Embodied Reasoning via Contact Prediction

  • Meta-prompts: To make the model reason about contacts before it generates an action, the prompt includes an explicit instruction such as “Think step by step: first predict which links contact where on the object, then predict the grasp pose”.
  • Contact Representation: Contacts are represented as C = \{(l_i, p_i)\}, a set of pairs of a hand link l_i (e.g., index finger middle link) and a 3D contact position p_i \in \mathbb{R}^3 on the object surface. The coordinates of p_i are first normalized to a fixed 3D bounding box computed from the dataset. Each spatial dimension is then uniformly discretized into N_{pos} bins and mapped to a position token. Each contact is written as the sequence \langle l_i \rangle \langle p_{ix} \rangle \langle p_{iy} \rangle \langle p_{iz} \rangle, and the full contact prediction is wrapped in \langle |\text{contact\_start}| \rangle and \langle |\text{contact\_end}| \rangle. All required special tokens (action bin, position bin, link, and delimiter tokens) are registered in the pretrained tokenizer.
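The contact representation described above can be sketched as follows. The token spellings `<pos_bin_i>` and `<index_middle>`, the bounding-box values, and the function names are hypothetical choices made for this example. Only the overall layout (link token, three position tokens, start and end delimiters) comes from the text.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Contact = Tuple[str, Point]  # (link name, 3D contact position)


def position_tokens(point: Point, box_min: Point, box_max: Point,
                    num_bins: int = 256) -> List[str]:
    """Normalize each coordinate to the bounding box, then bin it uniformly."""
    tokens = []
    for coord, low, high in zip(point, box_min, box_max):
        unit = (min(max(coord, low), high) - low) / (high - low)  # in [0, 1]
        index = min(int(unit * num_bins), num_bins - 1)
        tokens.append(f"<pos_bin_{index}>")  # hypothetical token spelling
    return tokens


def tokenize_contacts(contacts: Sequence[Contact], box_min: Point,
                      box_max: Point, num_bins: int = 256) -> List[str]:
    """Serialize contacts as <link><px><py><pz> groups between delimiters."""
    tokens = ["<|contact_start|>"]
    for link, point in contacts:
        tokens.append(f"<{link}>")
        tokens.extend(position_tokens(point, box_min, box_max, num_bins))
    tokens.append("<|contact_end|>")
    return tokens


if __name__ == "__main__":
    box_min, box_max = (-1.0, -1.0, -1.0), (1.0, 1.0, 1.0)  # assumed box
    print(tokenize_contacts([("index_middle", (0.0, 0.5, -0.5))],
                            box_min, box_max))
```

Sharing one set of position bins across the x, y, and z axes keeps the added vocabulary small. Whether the paper shares bins across axes is not stated in this review, so treat that as a design assumption.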

3. Training Strategy

  • End-to-end training: The model is trained end-to-end with standard next-token prediction over the full sequence, which contains the point cloud tokens, the task description, the contact tokens, and the action tokens. It learns to predict the contact pattern first and then to generate the corresponding grasp pose autoregressively.
  • Hybrid Attention Mechanism: Point cloud tokens use bidirectional attention to capture global geometric context, while language and action tokens use causal attention.
  • Contact Position Dropout: As a regularizer, position tokens are removed from the contact sequence with probability p_{drop} during training, leaving only the link tokens. This helps the model handle varying levels of contact information.
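The hybrid attention pattern and the contact position dropout can both be sketched in plain Python. The sketch makes several assumptions that the review does not confirm: point cloud tokens sit at the front of the sequence, point tokens do not attend to later tokens, and dropout is decided once per training sequence rather than per contact. Function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def hybrid_attention_mask(num_point: int, num_other: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Assumed layout: [point cloud tokens][language / contact / action tokens].
    """
    total = num_point + num_other
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_point:
                mask[q][k] = k < num_point  # bidirectional inside the point block
            else:
                mask[q][k] = k <= q  # causal; the point prefix is fully visible
    return mask


def drop_contact_positions(contacts: Sequence[Tuple[str, Sequence[str]]],
                           p_drop: float, rng: random.Random) -> List[str]:
    """With probability p_drop keep only link tokens, else keep everything."""
    drop = rng.random() < p_drop  # one decision per sequence (assumption)
    tokens: List[str] = []
    for link_token, pos_tokens in contacts:
        tokens.append(link_token)
        if not drop:
            tokens.extend(pos_tokens)
    return tokens


if __name__ == "__main__":
    for row in hybrid_attention_mask(2, 2):
        print(["x" if allowed else "." for allowed in row])
    sample = [("<index_middle>", ["<pos_bin_128>", "<pos_bin_192>", "<pos_bin_64>"])]
    print(drop_contact_positions(sample, 0.5, random.Random(0)))
```

In a real training setup the boolean mask would be converted to the attention-mask format of the chosen framework. The list-of-lists form here only shows which positions are visible to which.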

4. Dataset Curation

The DexGYS [36] and Dexonomy [5] datasets are used.

  • Physics-based Contact Annotation: Contact information for each grasp is extracted automatically with the MuJoCo physics engine. The hand and object models are loaded into MuJoCo, forward kinematics is run for each grasp pose, and the 3D surface positions where hand links touch the object are read from the physics buffer.
  • Grasp Instruction Annotation (for Dexonomy): The Gemma-3 [29] VLM generates grasp descriptions for Dexonomy. Multi-view images are rendered for each grasp, and the VLM is prompted with the rendered images together with the contact information. The VLM identifies the object category, infers which functional part is being contacted, and writes a textual grasp description.

Experiments and Results

DextER is evaluated on language-conditioned dexterous grasp generation using the DexGYS validation set.

  • DexGYS benchmark: DextER achieves a 67.14% grasp success rate, 3.83 percentage points above the previous SOTA. Its P-FID (Fréchet Distance) score of 0.20, compared with 5.60 for the previous SOTA DexGYSNet [36], corresponds to a 96.4% improvement in intention alignment. The generated grasps therefore match the task intention expressed in language far more closely.
    • Role of Embodied Reasoning (ER): Without ER (w/o ER), P-FID rises from 0.20 to 0.30 (a 50% degradation) and the success rate falls from 67.14% to 62.37%. Explicit contact prediction therefore matters for both intention alignment and physical quality.
  • Ablation Study (Table 2):
    • ECoT: Removing ECoT substantially degrades both P-FID and success rate.
    • Token discretization granularity: N_a = N_{pos} = 256 bins gives the best performance for both action and position tokens.
    • Contact position dropout (p_{drop}): p_{drop} = 0.5 performs best, confirming that a moderate amount of dropout has a regularizing effect.
    • Point Cloud Encoder: PartField [22] outperforms Uni3D [49], because PartField's part-aware feature extraction fits contact-based reasoning naturally.
  • Zero-Shot Generalization (Table 3, top): With training and evaluation on the Dexonomy dataset, DextER outperforms the baseline methods in every zero-shot scenario: “Unseen Objects”, “Unseen Grasp Taxonomy”, and “Unseen Both”.
  • Steerable Grasp Generation (Table 3, bottom): Because DextER is autoregressive, users can control grasp generation by supplying a partial ECoT sequence. When between 1 and 5 links are specified, both intention alignment (P-FID, CD) and success rate improve as the number of specified links grows.
  • Contact Reasoning Quality (Table 4): Contact link prediction is evaluated with IoU, Precision, Recall, and F1, and spatial accuracy with Position Accuracy (1 cm threshold). The review describes the results as satisfactory, which supports the accuracy of the contact-based embodied reasoning.
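The percentages quoted above follow from simple relative-change arithmetic, and the link-level metrics are standard set overlaps. The sketch below shows both. The exact metric definitions used in the paper are not given in this review, so the set-based IoU and F1 and the position accuracy over shared links are assumptions. Function names are hypothetical.

```python
import math
from typing import Dict, Set, Tuple

Point = Tuple[float, float, float]


def relative_change(baseline: float, value: float) -> float:
    """Signed change of value relative to baseline (negative means lower)."""
    return (value - baseline) / baseline


def link_metrics(pred: Set[str], gt: Set[str]) -> Dict[str, float]:
    """Set-overlap metrics between predicted and ground-truth contact links."""
    inter = len(pred & gt)
    union = len(pred | gt)
    precision = inter / len(pred) if pred else 0.0
    recall = inter / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if inter else 0.0
    iou = inter / union if union else 1.0
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}


def position_accuracy(pred: Dict[str, Point], gt: Dict[str, Point],
                      threshold_m: float = 0.01) -> float:
    """Share of links present in both dicts whose positions differ by <= 1 cm."""
    shared = pred.keys() & gt.keys()
    if not shared:
        return 0.0
    hits = sum(math.dist(pred[link], gt[link]) <= threshold_m for link in shared)
    return hits / len(shared)


if __name__ == "__main__":
    print(round(100 * relative_change(5.60, 0.20), 1))  # -96.4 (P-FID vs DexGYSNet)
    print(round(100 * relative_change(0.20, 0.30), 1))  # 50.0 (P-FID without ER)
    print(round(67.14 - 62.37, 2))                      # 4.77 points of success rate
    print(link_metrics({"thumb", "index", "middle"}, {"index", "middle", "ring"}))
```

Note that P-FID is a distance, so a negative relative change is an improvement, while the success-rate difference is reported in percentage points rather than as a relative change.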

Conclusion

DextER presents a new approach to language-conditioned dexterous grasp generation that uses embodied reasoning through contact prediction. The method achieves a 67.14% grasp success rate on DexGYS, 3.83 percentage points above the previous SOTA, and improves intention alignment by 96.4%. These results indicate that contact reasoning is important both for understanding task semantics and for generating diverse, stable grasp configurations. In addition, the autoregressive generation framework enables steerable grasp generation: users can guide the model by specifying partial contact constraints, which gives fine-grained control over the generated grasp.
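Steering by partial contact constraints amounts to fixing the beginning of the contact sequence and letting the model continue from there. The sketch below only builds such a prefix as a string. The link and position token spellings, the space-joined layout, and the function name are hypothetical, and no model call is shown because the review does not describe the inference API.

```python
from typing import List, Sequence

# Meta-prompt wording as quoted in the methodology section of this review.
META_PROMPT = ("Think step by step: first predict which links contact where "
               "on the object, then predict the grasp pose")


def build_steering_prefix(instruction: str,
                          fixed_contacts: Sequence[Sequence[str]]) -> str:
    """Build a generation prefix that ends inside an open contact block.

    Each entry of fixed_contacts is an already tokenized contact: either a
    link token alone or a link token followed by three position tokens.
    The block is left open (no <|contact_end|>) so that an autoregressive
    model could add more contacts before closing it and emitting the action.
    """
    parts: List[str] = [META_PROMPT, instruction, "<|contact_start|>"]
    for contact in fixed_contacts:
        parts.extend(contact)
    return " ".join(parts)


if __name__ == "__main__":
    prefix = build_steering_prefix(
        "Grasp the mug by its handle.",
        [["<thumb_distal>"],
         ["<index_middle>", "<pos_bin_128>", "<pos_bin_192>", "<pos_bin_64>"]],
    )
    print(prefix)
```

Specifying more contacts lengthens the fixed prefix and narrows what the model can generate afterwards, which fits the reported trend that results improve as more links are specified.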

Limitations

The autoregressive framework is vulnerable to compounding errors, and the current evaluation focuses on single, static objects, which limits applicability to complex real-world scenes. Sequential token prediction may also constrain real-time performance.

🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Copyright 2026, JungYeon Lee