

📃 Do as I Do

dexterous-manipulation
human-video
retargeting
data-generation
sim2real
Do as I Do: Dexterous Manipulation Data from Everyday Human Videos
Published

June 18, 2026

  • Paper Link (arXiv:2606.19333)
  • Project Page
  1. 🤖 DO AS I DO proposes an algorithm that reconstructs hand-object interaction in 3D from ordinary single-view RGB video and then converts it into motions that a multi-fingered robot hand can execute.
  2. 🚀 The approach uses SAM 3D and guided diffusion to track object state precisely even in video of cluttered scenes, and uses dynamics-based optimization to generate stable manipulation trajectories that satisfy the robot's physical constraints.
  3. 📈 In experiments, the framework outperforms prior state-of-the-art (SOTA) methods on reconstruction and completes a variety of dexterous manipulation tasks on a real robot, establishing a practical pipeline for acquiring data at scale.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

This paper proposes DO AS I DO, an algorithm that generates manipulation data for complex multi-fingered robots from everyday monocular RGB videos of humans. By converting observations of humans into experience data that a robot can execute, the framework aims to relieve the data-acquisition bottleneck in robot learning.

Core Methodology

1. Reconstruction:

To recover hand-object interaction in 3D, the method uses HaWoR (hand tracking) and SAM 3D (object meshing). To keep object tracking robust on internet videos, where occlusion and low resolution are common, it introduces a 'Guided Diffusion' technique.

  • Guided Diffusion object tracking: The object's shape is held fixed, and the current-frame pose x^p_k is predicted with reference to the previous-frame pose x^p_{k-1}. During the model's ODE update, a physical guidance term is blended in as follows, with superscripts s and p marking the shape and pose components:
    x^s_t = (1 - \alpha_s)\,(x^s_{t-\Delta} + \Delta\, v^s_{\theta}) + \alpha_s\, z^s_{\text{ref}}(t)
    x^p_t = (1 - \alpha_p)\,(x^p_{t-\Delta} + \Delta\, v^p_{\theta}) + \alpha_p\, z^p_{\text{ref}}(t)
  • Adaptive Guidance: A point tracker (BootsTAPIR) is used to compute the object's rotational velocity, and the pose guidance strength \alpha_p is set dynamically from it to stabilize tracking.
  • Alignment: To resolve the scale mismatch between hand and object, depth estimated by MoGe is used to produce a 4D trajectory in metric units, anchored on the hand and object centroids.
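As a rough illustration of the guided update above, the sketch below applies the same blend to a plain list of floats. The names `guided_step` and `integrate`, the zero velocity field, and the constant reference are placeholders invented for this example; in the paper the velocity comes from SAM 3D's flow model, which is not reproduced here.

```python
from typing import Callable, List, Sequence


def guided_step(
    x_prev: Sequence[float],
    velocity: Sequence[float],
    z_ref: Sequence[float],
    delta: float,
    alpha: float,
) -> List[float]:
    """x_t = (1 - alpha) * (x_{t-delta} + delta * v) + alpha * z_ref(t)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        (1.0 - alpha) * (x + delta * v) + alpha * z
        for x, v, z in zip(x_prev, velocity, z_ref)
    ]


def integrate(
    x0: Sequence[float],
    velocity_fn: Callable[[Sequence[float], float], Sequence[float]],
    z_ref_fn: Callable[[float], Sequence[float]],
    steps: int,
    alpha: float,
) -> List[float]:
    """Run `steps` guided Euler updates from t = 0 to t = 1."""
    delta = 1.0 / steps
    x = list(x0)
    for i in range(1, steps + 1):
        t = i * delta
        x = guided_step(x, velocity_fn(x, t - delta), z_ref_fn(t), delta, alpha)
    return x


# Toy usage: a zero velocity field (hypothetical) and a constant reference.
pulled = integrate([1.0, -1.0], lambda x, t: [0.0, 0.0], lambda t: [0.5, 0.5], 10, 0.3)
```

Setting `alpha = 0` reduces the loop to an unguided Euler integration, while `alpha = 1` returns the reference itself; intermediate values pull the sample toward the reference at every step.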

2. Retargeting:

To transfer the noisy reconstructed human trajectory into the robot's physical setting, sampling-based optimization (MPPI-style) is run inside a physics simulation.

  • Warmup Steps: To handle noise in the first frames of the optimization, a 'warmup' phase is added in simulation so that the robot can adjust its alignment with the object before the actual manipulation begins.
  • Random Force Perturbation: Rollouts are sampled under a variety of external forces, which steers the optimizer away from narrow local optima and toward physically robust control.
  • Transition Reward: To raise the success rate at critical transitions such as grasping or releasing the object, a penalty term that checks for hand-object contact is added, reinforcing stage-wise interaction.
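To make the sampling-based optimization above more concrete, here is a minimal MPPI-style loop on a one-dimensional toy system. Everything in it is an assumption made for illustration: the first-order dynamics, the squared tracking cost, the hyperparameters, and the Gaussian disturbance that stands in for the random external force. It is not the SPIDER implementation or the authors' simulator.

```python
import math
import random


def rollout_cost(x0, controls, reference, dt, force_std, rng):
    """Simulate the toy system and return the summed squared tracking error."""
    x, cost = x0, 0.0
    for u, ref in zip(controls, reference):
        push = rng.gauss(0.0, force_std)  # random disturbance for this step
        x += (u + push) * dt
        cost += (x - ref) ** 2
    return cost


def mppi_update(x0, nominal, reference, rng, dt=0.1, samples=256,
                sigma=0.5, temperature=0.05, force_std=0.2):
    """One MPPI-style update of the nominal control sequence."""
    noises, costs = [], []
    for _ in range(samples):
        eps = [rng.gauss(0.0, sigma) for _ in nominal]
        candidate = [u + e for u, e in zip(nominal, eps)]
        noises.append(eps)
        costs.append(rollout_cost(x0, candidate, reference, dt, force_std, rng))
    best = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(weights)
    # New nominal = old nominal + cost-weighted average of the sampled noise.
    return [
        u + sum(w * eps[k] for w, eps in zip(weights, noises)) / total
        for k, u in enumerate(nominal)
    ]


def optimise(iterations=20, seed=0):
    """Track a ramp reference starting from an all-zero control sequence."""
    rng = random.Random(seed)
    reference = [0.1 * (k + 1) for k in range(10)]
    controls = [0.0] * len(reference)
    for _ in range(iterations):
        controls = mppi_update(0.0, controls, reference, rng)
    return controls, reference
```

Because every candidate is scored under a different random disturbance, control sequences that only succeed when nothing pushes on the system receive a lower weight on average, which is the intuition behind the perturbation term.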

Key Results and Conclusion

DO AS I DO demonstrates better hand-object reconstruction than prior SOTA models on the DexYCB and HOI4D benchmarks. It also extracts more than 500 manipulation trajectories from diverse sources, including internet, egocentric, and generative-model videos, and carries out more than 10 complex manipulation tasks on a real bimanual robot. Finally, the authors present an 'Efficacy Playbook' for screening the quality of internet data, showing that careful filtering does more for data efficiency than simply collecting data at scale.


🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Introduction

What holds back dexterous manipulation is not algorithms but data.

  • Multi-fingered platforms that resemble the human hand have many degrees of freedom, so collecting data through teleoperation or motion capture is enormously expensive. It requires special gloves, markers, and studios, and data has to be collected anew for every task and object.
  • Internet and egocentric videos, by contrast, contain a practically unlimited supply of people lifting cups, twisting lids, and using tools. The problem is that this ordinary RGB footage is not in a form a robot can use directly: it has no 3D, no physically plausible trajectory, and no mapping to the robot embodiment.

The authors' question is clear: "Using only everyday monocular RGB video, with no special equipment, can we extract 'robot-complete' manipulation data that a multi-fingered dexterous hand can actually execute?"

Two large gaps stand in the way. The first is the perception gap: recovering the 3D poses of the hand and the object from monocular video consistently over time. The second is the embodiment gap: transferring the kinematic trajectory of a human hand so that a robot hand with a different morphology can execute it in a physically plausible way. Do as I Do tackles the two separately, as reconstruction and retargeting.

One-line summary of the paper: repurpose SAM 3D as a guided diffusion video tracker to reconstruct hand-object 4D from monocular RGB, then build robot-complete trajectories with dynamics-aware retargeting that adds warmup, random forces, and a transition reward on top of SPIDER, yielding dexterous manipulation data at scale from everyday video with no special equipment.

flowchart LR
    subgraph REC["1 Reconstruction"]
        VID["Monocular RGB video<br/>(internet/ego/generated)"]
        HAND["HaWoR<br/>hand pose tracking"]
        OBJ["SAM 3D<br/>single-frame mesh"]
        TRACK["guided diffusion tracker<br/>flow matching pose tracking<br/>+ SE(3) cluster selection"]
        ALIGN["Alignment<br/>centroid + least squares<br/>+ GeoCalib gravity"]
        VID --> HAND
        VID --> OBJ --> TRACK
        HAND --> ALIGN
        TRACK --> ALIGN
    end
    subgraph RET["2 Retargeting"]
        SPIDER["SPIDER-based<br/>MPPI sampling optimization"]
        W["+ warmup steps"]
        F["+ random force perturbation"]
        T["+ transition reward"]
        SPIDER --- W --- F --- T
    end
    ALIGN --> SPIDER
    T --> ROB["Robot-complete data<br/>UR3e + Sharpa Wave 22-DoF"]

Method

The pipeline has two stages: reconstruction → retargeting. Reconstruction is responsible for "temporally consistent hand-object 4D from monocular RGB"; retargeting is responsible for "making that trajectory physically executable by the robot."

Stage 1: Hand-Object Reconstruction

Hand tracking. HaWoR is used as is. It produces reasonable hand poses on in-the-wild video without modification.

Object: single-frame generation. SAM 3D (an image-conditioned 3D generative foundation model) produces an object mesh from an individual frame.

Object: temporal tracking (the key innovation). Generating every frame independently breaks temporal consistency. The authors instead repurpose SAM 3D as a video tracker.

  • Fix the object shape at an anchor frame.
  • Predict the pose change between frames with flow matching inference.
  • Blend the model's denoising update u_\theta with the target interpolant x^{\text{target}} (Eq. 1):

x_{k+1} = (1-\lambda)\, \big(x_k + u_\theta(x_k, k)\big) + \lambda\, x^{\text{target}}_k

Here the guidance strength \lambda is not a fixed value; it is derived from the rotational velocity measured by 2D point tracking. When the object rotates slowly, like a rigid body, guidance is strong; when it changes quickly, guidance is relaxed.
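The text gives only the direction of the adaptation (slow rotation, stronger guidance), not the exact mapping. The sketch below is therefore one hypothetical realization: it estimates an in-plane rotation angle from two sets of tracked 2D points with a least-squares fit, converts it to an angular speed, and maps that speed to \lambda with an exponential decay. The function names, `lam_max`, and `omega_ref` are all assumptions, and a real tracker would also have to handle out-of-plane rotation.

```python
import math


def in_plane_rotation(prev_pts, curr_pts):
    """Least-squares 2D rotation angle (radians) between two tracked point sets."""
    n = len(prev_pts)
    cpx = sum(p[0] for p in prev_pts) / n
    cpy = sum(p[1] for p in prev_pts) / n
    cqx = sum(q[0] for q in curr_pts) / n
    cqy = sum(q[1] for q in curr_pts) / n
    cos_sum = sin_sum = 0.0
    for (px, py), (qx, qy) in zip(prev_pts, curr_pts):
        ax, ay = px - cpx, py - cpy  # centred previous point
        bx, by = qx - cqx, qy - cqy  # centred current point
        cos_sum += ax * bx + ay * by
        sin_sum += ax * by - ay * bx
    return math.atan2(sin_sum, cos_sum)


def rotational_speed(prev_pts, curr_pts, dt):
    """Absolute angular speed in rad/s between two frames dt seconds apart."""
    return abs(in_plane_rotation(prev_pts, curr_pts)) / dt


def guidance_strength(omega, lam_max=0.9, omega_ref=1.0):
    """Assumed mapping: guidance decays exponentially as rotation speeds up."""
    return lam_max * math.exp(-abs(omega) / omega_ref)
```

Any monotonically decreasing map from speed to \lambda would fit the description equally well; the exponential is used here only because it keeps \lambda inside (0, lam_max] without clipping.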

Pose candidate selection. 25 pose candidates are sampled per frame and clustered by a weighted SE(3) distance to pick a representative pose. Quality is comparable to likelihood-based ranking, at up to 30× the speed.
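A small sketch of what such a selection could look like follows. The distance (translation norm plus a weighted geodesic rotation angle), the neighbourhood radius, and the rule of choosing the medoid of the densest neighbourhood are assumptions made for this example; the paper's exact weighting and clustering procedure are not given in this review.

```python
import math


def rot_z(theta):
    """Rotation matrix about the z-axis, used only to build toy poses."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_angle(ra, rb):
    """Geodesic angle between two rotation matrices: acos((tr(Ra^T Rb) - 1) / 2)."""
    trace = sum(ra[i][j] * rb[i][j] for i in range(3) for j in range(3))
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))


def se3_distance(pose_a, pose_b, rot_weight=0.1):
    """Weighted SE(3) distance: translation norm + rot_weight * rotation angle."""
    (ra, ta), (rb, tb) = pose_a, pose_b
    return math.dist(ta, tb) + rot_weight * rotation_angle(ra, rb)


def select_representative(poses, radius=0.05, rot_weight=0.1):
    """Return the index of the candidate with the densest neighbourhood."""
    n = len(poses)
    d = [[se3_distance(poses[i], poses[j], rot_weight) for j in range(n)]
         for i in range(n)]

    def score(i):
        members = [d[i][j] for j in range(n) if d[i][j] <= radius]
        # Prefer more neighbours; break ties by the tighter neighbourhood.
        return (-len(members), sum(members))

    return min(range(n), key=score)


# Toy usage: three consistent candidates and one outlier.
candidates = [
    (rot_z(0.0), [0.0, 0.0, 0.0]),
    (rot_z(0.02), [0.01, 0.0, 0.0]),
    (rot_z(-0.02), [0.0, 0.01, 0.0]),
    (rot_z(1.5), [0.5, 0.5, 0.0]),
]
chosen = select_representative(candidates)
```

The cost is one pairwise distance matrix over the 25 candidates, with no extra model evaluations, which is consistent with the speed advantage over likelihood-based ranking reported above.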

Alignment. Because the hand and the object are reconstructed independently, their coordinate frames must be reconciled. The centroids of both are computed, a per-frame translation scale is solved by least squares, and GeoCalib aligns the trajectory with the gravity direction.
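The least-squares step has a simple closed form when the unknown is a single scalar. The sketch below shows that form under an assumed setup: an up-to-scale vector (for example, an object centroid in the tracker's units) is rescaled to match a metric target. Which quantities the paper actually pairs up is not stated in this review, so `source` and `target` are hypothetical inputs.

```python
def least_squares_scale(source, target):
    """Scalar s minimising ||s * source - target||^2, i.e. <source, target> / <source, source>."""
    num = sum(a * b for a, b in zip(source, target))
    den = sum(a * a for a in source)
    if den == 0.0:
        raise ValueError("source must be non-zero")
    return num / den


def per_frame_scales(sources, targets):
    """Solve the same one-parameter problem independently for every frame."""
    return [least_squares_scale(s, t) for s, t in zip(sources, targets)]
```

For instance, `least_squares_scale([0.1, 0.2, 0.5], [0.2, 0.4, 1.0])` recovers a scale of about 2, since the target is exactly twice the source.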

Stage 2: Dynamics-Aware Retargeting

Replayed directly on the robot, the reconstructed (kinematic) hand-object trajectory fails because of unstable contact, penetration, and similar problems. Building on the SOTA baseline SPIDER, MPPI-style sampling optimization (with kernel annealing across both the iterations and the prediction horizon) searches for controls that account for dynamics. Three common failure modes are each addressed by one component.

  1. Warmup steps. If a noisy initialization starts the hand in an awkward pose, tracking fails from the very first frame. H steps are therefore prepended, during which the object is held fixed (for example, in mid-air) while only the hand moves freely to align its pose; the main tracking starts afterward.
  2. Random force perturbation. Unstable grasps tend to get trapped in local optima of the reward landscape. Random forces are applied to the rollout samples so that the optimizer prefers controls that withstand such perturbations, that is, controls that hold the object stably (an idea borrowed from sim-to-real robustness).
  3. Transition reward. A penalty that encourages phase changes: it penalizes missing object-floor contact in the "rest" phase and missing hand-object contact in the "in-hand" phase, which makes contact-state transitions such as lifting and setting down explicit.
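The transition term in item 3 can be sketched as a small function over a phase label and two contact flags. The phase names "rest" and "in-hand" come from the description above; the unit weight, the boolean contact inputs, and the summation over a rollout are assumptions for illustration, and in practice the flags would be read from the simulator's contact list.

```python
def transition_penalty(phase, object_floor_contact, hand_object_contact, weight=1.0):
    """Penalty for a single step, given the reference phase and contact flags."""
    if phase == "rest":
        # The object should be resting on the floor or table.
        return 0.0 if object_floor_contact else weight
    if phase == "in-hand":
        # The hand should be touching the object.
        return 0.0 if hand_object_contact else weight
    raise ValueError(f"unknown phase: {phase!r}")


def rollout_penalty(steps, weight=1.0):
    """Sum the penalty over (phase, floor_contact, hand_contact) tuples."""
    return sum(transition_penalty(p, f, h, weight) for p, f, h in steps)


# Toy rollout: the object is lifted one step late, so one step is penalised.
demo_steps = [
    ("rest", True, False),
    ("in-hand", True, False),   # still on the floor, hand not yet in contact
    ("in-hand", False, True),
]
demo_cost = rollout_penalty(demo_steps)
```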

Simulation Setup

  • Physics engine: MuJoCo Warp, 0.005 s timestep (200 Hz).
  • Mesh processing: CoACD convex decomposition, plus 2 mm dilation to stabilize multi-contact.
  • Optimization: 1024 samples per planning step, 32 iterations, 3 s horizon, 0.5 s planning interval.
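Collected into a configuration object, these settings also make the implied step counts easy to check. The class and field names below are invented for this sketch, and `rollouts_per_plan` assumes that each of the 32 iterations draws a fresh batch of 1024 samples, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetargetConfig:
    timestep_s: float = 0.005        # physics timestep
    mesh_dilation_m: float = 0.002   # 2 mm dilation of collision meshes
    samples_per_plan: int = 1024
    iterations: int = 32
    horizon_s: float = 3.0
    replan_interval_s: float = 0.5

    @property
    def sim_rate_hz(self) -> float:
        return 1.0 / self.timestep_s

    @property
    def horizon_steps(self) -> int:
        """Physics steps simulated in one rollout."""
        return round(self.horizon_s / self.timestep_s)

    @property
    def replan_steps(self) -> int:
        """Physics steps executed between two planning calls."""
        return round(self.replan_interval_s / self.timestep_s)

    @property
    def rollouts_per_plan(self) -> int:
        """Assumes every iteration samples a full batch."""
        return self.samples_per_plan * self.iterations


cfg = RetargetConfig()
```

Under these numbers a rollout spans 600 physics steps and the plan is refreshed every 100 steps; with the batch assumption above, one planning call evaluates 32,768 rollouts.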

Experiments

Reconstruction

| Benchmark | Metric | Result |
|---|---|---|
| DexYCB (160 videos) | F-5 | 0.71 (SOTA) |
| HOI4D (12 videos) | F-5 | 0.72 (SOTA) |
| In-the-wild (150 videos) | Human preference | Preferred 67% of the time over FoundationPose |

The baselines are HO, IHOI, HORSE, MCC-HO, and G-HOP (joint reconstruction), plus FoundationPose and Any6D (object trackers). Do as I Do outperforms them in both hand-object interaction estimation and trajectory extraction.

Retargeting

The success criterion is E_{pos} < 0.1\text{m} and E_{rot} < 0.5\text{rad}. The three components are added to SPIDER incrementally for comparison.

| Data | Baseline (SPIDER) | Do as I Do |
|---|---|---|
| Reconstructed data (655 references) | 25% | 71% |
| OakInk2 (1,352 clean mocap) | - | 81% |

The large jump from 25% to 71% on reconstructed data shows that warmup, random forces, and the transition reward are especially effective on noisy, real reconstructed trajectories. On clean mocap (OakInk2) the rate is higher still at 81%, which suggests that reconstruction noise is the main cause of the remaining gap.
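For reference, the success test and the resulting rate can be written in a few lines. The thresholds are the ones stated above; how the errors are aggregated over a trajectory is not specified in this review, so taking the maximum over frames is an assumption (the strictest reading), and the function names are hypothetical.

```python
def trajectory_succeeds(pos_errors_m, rot_errors_rad, pos_tol_m=0.1, rot_tol_rad=0.5):
    """True if both object-tracking errors stay below their thresholds."""
    return max(pos_errors_m) < pos_tol_m and max(rot_errors_rad) < rot_tol_rad


def success_rate(trajectories):
    """Fraction of (pos_errors, rot_errors) pairs that pass the criterion."""
    outcomes = [trajectory_succeeds(p, r) for p, r in trajectories]
    return sum(outcomes) / len(outcomes)


# Toy batch: three trajectories pass, one exceeds the rotation threshold.
batch = [
    ([0.02, 0.05], [0.1, 0.2]),
    ([0.01, 0.09], [0.3, 0.4]),
    ([0.03, 0.04], [0.2, 0.6]),
    ([0.05, 0.05], [0.1, 0.1]),
]
rate = success_rate(batch)
```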

Real-World Deployment

On a bimanual UR3e setup with Sharpa Wave 22-DoF hands, 500 validated trajectories across 10 tasks were executed on hardware. This demonstrates that data extracted from everyday RGB video works on a real robot.

Data Filtering Analysis

Internet video varies widely in quality. The analysis finds that only 4% of the sampled 100DOH clips pass the quality checks, which quantifies the substantial preprocessing cost of using internet sources and gives practitioners a guideline for data collection.
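The 4% pass rate translates directly into a collection budget. The helper below is a back-of-the-envelope sketch: the function name is invented, and it assumes the 4% figure carries over unchanged to whatever new pool of clips is being screened, which may not hold for other sources.

```python
import math
from fractions import Fraction


def raw_clips_needed(target_usable, pass_rate=Fraction(4, 100)):
    """Smallest number of raw clips expected to yield `target_usable` passing clips."""
    return math.ceil(target_usable / pass_rate)


budget = raw_clips_needed(1000)
```

At a 4% pass rate every usable clip costs about 25 raw clips, so 1,000 usable clips would require screening roughly 25,000.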

Critical Discussion

Strengths

  • Attacks the data bottleneck head-on. Extracting robot-complete trajectories from everyday RGB video with no special equipment substantially lowers the largest cost in dexterous manipulation, which is data collection. Covering sources as varied as internet, egocentric, and generated video is impressive from a scalability standpoint.
  • Clever reuse of SAM 3D. Turning a single-frame generative model into a guided diffusion tracker to obtain temporal consistency, together with the engineering that yields an up to 30× speedup through SE(3) clustering, is the core contribution. Adapting the guidance strength to rotational velocity is also a practical detail.
  • Separated effects of the three components. Adding warmup, random forces, and the transition reward incrementally isolates the 25% → 71% improvement and makes clear which failure mode each component addresses.
  • Real-world validation plus honest cost reporting. Beyond executing 500 trajectories across 10 tasks, the paper discloses the 4% survival rate of internet video, exposing the method's limits and preprocessing burden transparently.

Weaknesses and Limitations

  • Reconstruction noise sets the ceiling. The gap between 71% on reconstructed data and 81% on clean mocap shows that final quality is still tied to monocular reconstruction accuracy. Robustness on videos with hard occlusion or fast motion needs further validation (speculation).
  • Severe data filtering. That only 4% of 100DOH clips pass means the premise of "unlimited internet video" shrinks in practice to a small number of high-quality clips. The precision and recall of the automatic filtering deserve further treatment.
  • Close to a single rigid-object assumption. Tracking with the shape fixed at an anchor frame suits rigid, non-deforming objects; generalization to deformable, multi-object, or articulated cases may be limited (speculation).
  • Compute cost. 25 pose candidates per frame plus MPPI with 1024 samples and 32 iterations demand substantial compute for large-scale data generation. Quantitative analysis of the cost-quality trade-off at scale is limited.

Summary and Conclusion

Do as I Do attacks the data bottleneck in dexterous manipulation by extracting robot-complete manipulation data from everyday monocular RGB video with no special equipment. The reconstruction stage builds temporally consistent hand-object 4D using HaWoR hand tracking and a guided diffusion video tracker repurposed from SAM 3D (flow matching + adaptive guidance + SE(3) cluster selection that is up to 30× faster). The retargeting stage produces physically plausible robot trajectories through dynamics-aware optimization that adds warmup, random force perturbation, and a transition reward on top of SPIDER.

In numbers: reconstruction reaches SOTA F-5 scores of 0.71 and 0.72 on DexYCB and HOI4D and a 67% human preference rate in the wild; retargeting raises the success rate from 25% to 71% on reconstructed data and reaches 81% on clean OakInk2. On hardware, a bimanual UR3e setup with Sharpa Wave 22-DoF hands executed 500 trajectories across 10 tasks.

From a practical standpoint, the value of this work lies in presenting "a scalable pipeline that converts ordinary human video into manipulation data a multi-fingered robot can actually execute." The limitations are clear, namely the dependence on reconstruction noise and the severe data filtering (4% survival rate), but the combination of generative-model-based 4D reconstruction and dynamics-aware retargeting is likely to become a strong reference point for generating dexterous manipulation data from human video.

Copyright 2026, JungYeon Lee