Curieux.JY
  • Jung Yeon Lee

📃 ViViDex Review

mpc
rl
action-chunking
Learning Vision-based Dexterous Manipulation from Human Videos
Published

November 4, 2025

🔍 Ping. 🔔 Ring. ⛏️ Dig. A tiered review series: quick look, key ideas, deep dive.

  • Paper Link
  • Project Link
  • Code: SAPIEN, MuJoCo
  1. The paper proposes ViViDex, a framework that learns, from human videos, a unified vision-based policy for a multi-fingered robot hand manipulating multiple objects in diverse poses.
  2. ViViDex uses trajectories extracted from human videos as a trajectory-guided reward to train physically plausible state-based policies, then uses the successful episodes of those policies to learn a unified visual policy that needs no privileged information.
  3. In experiments, ViViDex outperforms the prior state of the art, handles a range of manipulation tasks with fewer human demonstration videos, and shows generalization to unseen objects, including on a real robot.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

This paper proposes ViViDex, a new framework for learning vision-based dexterous manipulation policies for multi-fingered robot hands from human videos. Earlier work was held back by noise in the trajectories extracted from human videos and by its dependence on privileged object information such as ground-truth object states. ViViDex addresses these limitations with three modules.

The first module is Reference Trajectory Extraction. It extracts the poses of the human hand and the object from a human video and uses them as a reference for the robot hand trajectory. Concretely, the pose and shape of the human hand are represented with the MANO model (\psi_h \in \mathbb{R}^{21 \times 3}) and retargeted to robot hand poses. Retargeting is defined as the following optimization problem:

\min_{q_t^r} \sum_{t=1}^T \Vert \hat{x}_{t,j}^r(q_t^r) - \psi_{t,j}^h \Vert_2^2 + \alpha \Vert q_t^r - q_{t-1}^r \Vert_2^2

Here q_t^r is the vector of robot joint rotation angles, \psi_{t,j}^h are the positions of the human fingertips and middle finger joints, and \hat{x}_{t,j}^r are the corresponding robot joint positions. The resulting robot and object motions are loaded into the simulator, giving a reference trajectory that looks visually plausible but is not yet physically valid.

The second module is Trajectory-guided State-based Policy Learning. To recover a physically plausible trajectory, it uses the reference trajectory inside the reward function and trains a state-based policy with reinforcement learning (RL). The state-based policy consists of actor and critic MLPs that take the robot and object states as input and predict robot control commands. The reward is split into a pre-grasp stage and a manipulation stage.

In the pre-grasp stage the robot has to approach the object without physical contact. To make it approach the way the human does, the following reward is used:

R_p = \sum_{t=1}^{T_p} 10 \cdot \exp(-10 \cdot \Vert x_{t,rt}^r(q_t^r) - \hat{x}_{t,rt}^r \Vert_2^2)

Here T_p is the length of the pre-grasp stage, \hat{x}_{t,rt}^r is the robot fingertip position in the reference trajectory, and x_{t,rt}^r is the current robot fingertip position.

In the manipulation stage the goal is to bring the object to the desired target configuration, and the reward constrains the robot and object motions together:

R_m = \sum_{t=T_p+1}^{T_r} \lambda_1 R_m^h + \lambda_2 R_m^o + \lambda_3 \mathbb{1}_{\text{cont}} + \lambda_4 \mathbb{1}_{\text{lift}}

Here T_r is the length of the reference trajectory. R_m^h constrains the hand motion, and the object motion is constrained by

R_m^o = \exp(-\alpha_1(\Vert x_t^o - \hat{x}_t^o \Vert_2^2 + \alpha_2 \phi(\theta_t^o, \hat{\theta}_t^o)))

where x_t^o and \hat{x}_t^o are the current and reference object positions and \phi(\cdot) is the angular distance between object orientations. \mathbb{1}_{\text{cont}} is the number of fingertips in contact with the object, and \mathbb{1}_{\text{lift}} gives a bonus once the object has been lifted off the table.

In addition, a reference trajectory augmentation strategy is introduced so that the policy generalizes across initial object positions, rotations, and target positions.
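The two reward stages described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names, the flattened fingertip vectors, and the default values of alpha1, alpha2, and the lambda weights are assumptions made here, not values taken from the paper or its code.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pregrasp_step_reward(tips, ref_tips):
    """One time step of R_p: 10 * exp(-10 * ||x - x_hat||^2)."""
    return 10.0 * math.exp(-10.0 * sq_dist(tips, ref_tips))

def object_reward(pos, ref_pos, angle_dist, alpha1=1.0, alpha2=1.0):
    """R_m^o = exp(-alpha1 * (||x - x_hat||^2 + alpha2 * phi)).
    alpha1 and alpha2 are placeholder values (assumption)."""
    return math.exp(-alpha1 * (sq_dist(pos, ref_pos) + alpha2 * angle_dist))

def manipulation_step_reward(r_hand, r_obj, n_contacts, lifted,
                             lambdas=(1.0, 1.0, 1.0, 1.0)):
    """One time step of R_m; the lambda weights are placeholders (assumption)."""
    l1, l2, l3, l4 = lambdas
    return l1 * r_hand + l2 * r_obj + l3 * n_contacts + l4 * (1.0 if lifted else 0.0)

# Fingertip exactly on the reference vs. 20 cm off along x.
on_reference = pregrasp_step_reward([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
off_reference = pregrasp_step_reward([0.1, 0.2, 0.3], [0.3, 0.2, 0.3])
```

The exponential keeps each tracking term bounded: it is largest when the robot is exactly on the reference and decays smoothly as it drifts away, so the reference acts as a guide rather than a hard constraint.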

The third module is Unified Vision-based Policy Learning. The state-based policy needs both robot proprioceptive states and object states as input, but object states are hard to estimate reliably in the real world. To get around this, successful episodes of the optimized state-based policies are rolled out to collect physically plausible trajectories together with 3D scene point clouds (PC_w \in \mathbb{R}^{N \times 3}) rendered from a depth camera, and a visual policy is trained on that data. To improve the visual policy, the authors propose a coordinate transformation that maps the input point cloud PC_w into the target coordinate system (PC_t) and into hand-centered coordinate systems (the frames of the palm and the fingertip joints). The transformed point cloud representation (PC \in \mathbb{R}^{N \times 3(j+3)}) and the robot proprioceptive state are fed into PointNet [76] to extract visual features, from which robot control commands are predicted. The visual policy is trained with behavior cloning (BC) or with the recently proposed 3D Diffusion Policy [77, 78].

Experiments cover three manipulation tasks (relocate, pour, and place inside) in simulation (Adroit/MuJoCo and Allegro/SAPIEN) and on a real robot. Using only a small number of human demonstration videos (one per object), ViViDex substantially outperforms the state-of-the-art DexMV [24]. In particular, the proposed trajectory-guided reward and trajectory augmentation are shown to matter for the stability and generalization of the state-based policy. Strengthening the visual representation through the coordinate transformation and applying 3D Diffusion Policy are likewise shown to substantially improve both the performance of the unified visual policy and its generalization to unseen objects.

In conclusion, ViViDex provides an effective framework for learning vision-based dexterous manipulation from human videos and demonstrates strong performance and generalization in simulation and on a real robot. Future work aims to acquire more general manipulation skills from Internet videos and to explore more advanced 3D pose estimation algorithms.

🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos

What if a robot could learn by watching human hands?

Have you ever watched a child learn by watching their parents? Children pick up complex hand skills such as using a spoon and chopsticks, tying shoelaces, or opening a door simply by imitating what they see. What if this kind of learning by watching could be applied to robots? That is the core idea of the paper introduced here, ViViDex (Vision-based Dexterous Manipulation from Human Videos).

Presented at ICRA 2025 by researchers from INRIA Paris and MBZUAI, this work lets a robot learn complex finger manipulation skills from human videos alone. It offers a new answer to a long-standing question in robotics: how can dexterous manipulation be taught to a robot efficiently?

Background: Why Is This Problem Hard?

The Challenges of Dexterous Manipulation

Dexterous manipulation is one of the hardest problems in robotics. Humans use a hand with more than 20 joints to grasp, rotate, and reposition objects with little effort. Reproducing this with a multi-fingered robot hand is very difficult for the following reasons:

  1. High-dimensional control space: a robot hand such as the Allegro Hand has 16 degrees of freedom (DoF), which makes control very complex.

  2. Complex contact dynamics: contact between fingers and objects is nonlinear and discontinuous, which makes it hard to model.

  3. Difficulty of data collection: applying human demonstrations to a robot requires bridging the morphology gap between the human hand and the robot hand.

Limitations of Existing Approaches

Existing work has followed two broad directions:

1) Trajectory optimization methods

  • Require accurate dynamics models of the robot and the object
  • Such models are hard to obtain in real environments
  • Computational cost is very high

2) Data-driven learning methods

  • Reinforcement learning (RL): low sample efficiency and difficult convergence
  • Imitation learning: requires large amounts of expert demonstration data

More recently, approaches that learn from human videos have appeared. A representative example is DexMV (Qin et al., 2022), which extracts hand and object trajectories from videos of human hands and converts them into robot demonstrations. These methods, however, have the following limitations:

  • Noisy trajectories: 3D poses estimated from video are inaccurate and yield physically infeasible trajectories
  • Dependence on privileged information: they use information available only in simulation, such as ground-truth object states
  • Limited generalization: they adapt poorly to new objects that differ from those seen during training

The Core Idea of ViViDex

To address these problems, ViViDex proposes a three-stage pipeline:

Overall Framework

graph LR
    A[Human video] --> B[Stage 1: Reference trajectory extraction]
    B --> C[Stage 2: State-based policy learning<br/>with trajectory-guided RL]
    C --> D[Stage 3: Vision-based policy learning<br/>BC or Diffusion Policy]
    D --> E[Final visual policy]

    style A fill:#e1f5ff
    style B fill:#fff5e1
    style C fill:#ffe1f5
    style D fill:#e1ffe1
    style E fill:#ffe1e1

Each stage is examined in detail below.

Stage 1: Reference Trajectory Extraction

Hand Pose Estimation

The first stage extracts the motion of the hand and the object from a human video. The authors use the MANO model to estimate the 3D joint positions of the hand. MANO is a parametric model of hand shape and pose with the following parameters:

\mathbf{h} = (\boldsymbol{\theta}, \boldsymbol{\beta})

  • \boldsymbol{\theta} \in \mathbb{R}^{48}: hand pose (16 rotations × 3 axis-angle components: 15 finger joints plus the global wrist rotation)
  • \boldsymbol{\beta} \in \mathbb{R}^{10}: hand shape parameters

Motion Retargeting

The extracted human hand motion then has to be mapped onto the robot hand. This step is called motion retargeting and is formulated as the following optimization problem:

\min_{\mathbf{q}_t} \sum_{k} w_k \|\mathbf{p}_k^{human}(t) - \mathbf{p}_k^{robot}(\mathbf{q}_t)\|^2

  • \mathbf{q}_t: robot joint angles at time t
  • \mathbf{p}_k^{human}(t): position of the k-th keypoint on the human hand
  • \mathbf{p}_k^{robot}(\mathbf{q}_t): position of the k-th keypoint on the robot hand
  • w_k: per-keypoint weight

Fingertips receive high weights and middle finger joints receive low weights, so that the contact points that matter most for manipulation are prioritized.
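To make the weighted retargeting objective concrete, here is a minimal sketch on a toy planar finger with two joints. Everything in it is an assumption made for illustration: the link lengths, the weights, the finite-difference gradient descent, and the function names are hypothetical and do not come from the paper or its code, which work with a full 3D robot hand.

```python
import math

L1, L2 = 0.05, 0.04  # hypothetical link lengths in metres

def keypoints(q):
    """Forward kinematics of a toy planar 2-joint finger: [middle joint, fingertip]."""
    mid = (L1 * math.cos(q[0]), L1 * math.sin(q[0]))
    tip = (mid[0] + L2 * math.cos(q[0] + q[1]),
           mid[1] + L2 * math.sin(q[0] + q[1]))
    return [mid, tip]

def loss(q, human_pts, weights):
    """sum_k w_k * ||p_k_human - p_k_robot(q)||^2"""
    return sum(w * ((p[0] - h[0]) ** 2 + (p[1] - h[1]) ** 2)
               for w, p, h in zip(weights, keypoints(q), human_pts))

def retarget(human_pts, weights, q0=(0.1, 0.1), lr=20.0, steps=2000, eps=1e-6):
    """Minimise the weighted keypoint loss by gradient descent with
    central finite-difference gradients."""
    q = list(q0)
    for _ in range(steps):
        grad = []
        for i in range(len(q)):
            q_plus, q_minus = list(q), list(q)
            q_plus[i] += eps
            q_minus[i] -= eps
            grad.append((loss(q_plus, human_pts, weights)
                         - loss(q_minus, human_pts, weights)) / (2 * eps))
        q = [qi - lr * gi for qi, gi in zip(q, grad)]
    return q

# "Human" keypoints generated from a reachable pose, so a zero-loss fit exists.
human_pts = keypoints((0.6, 0.8))
weights = [0.2, 1.0]  # low weight on the middle joint, high weight on the fingertip
q_fit = retarget(human_pts, weights)
```

With real human keypoints an exact fit generally does not exist because of the morphology gap; the weights then decide which keypoints the robot matches most closely.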

Object Trajectory Estimation

The 6D pose of the object (position plus rotation) is also estimated from the video. The important point is that these estimates inevitably contain noise. Because of limited camera viewpoints, occlusion, and uncertainty in the estimation algorithms, the extracted trajectory may not be physically plausible: fingers may penetrate the object, for example, or the object may appear to ignore gravity.

Stage 2: Trajectory-Guided RL

The key innovation of ViViDex lies in this second stage. Instead of using the noisy reference trajectory as is, it uses reinforcement learning to produce trajectories that are physically plausible and still look natural.

State-based Policy

This stage uses privileged information:

\mathbf{s}_t = [\mathbf{q}_t, \dot{\mathbf{q}}_t, \mathbf{o}_t, \mathbf{g}]

  • \mathbf{q}_t, \dot{\mathbf{q}}_t: robot joint positions and velocities
  • \mathbf{o}_t: exact 6D pose of the object
  • \mathbf{g}: goal state

The policy network \pi_{\phi}(\mathbf{a}_t|\mathbf{s}_t) takes this state as input and outputs an action (target positions for the robot joints).

Trajectory-guided Reward Function

The crux is the reward design. Plain RL typically rewards only task success, whereas ViViDex adds a reward that encourages the policy to follow the reference trajectory:

R_t = w_{task} R_{task} + w_{traj} R_{traj}

Task reward (R_{task}):

  • Was the object brought to the target position?
  • Was the object kept from being dropped?
  • Was it rotated correctly to the target orientation?

Trajectory reward (R_{traj}):

R_{traj} = -\|\mathbf{q}_t - \mathbf{q}_t^{ref}\| - \|\mathbf{o}_t - \mathbf{o}_t^{ref}\|

Here \mathbf{q}_t^{ref} and \mathbf{o}_t^{ref} are the reference trajectories extracted in Stage 1.

The point of this design is that the reference trajectory is used as a guide rather than copied. The RL agent starts near the reference trajectory and, by interacting with the physics simulator, finds a trajectory that can actually be executed. This has several advantages:

  1. Physical plausibility: the simulator enforces the laws of physics, so infeasible motions are ruled out automatically
  2. Denoising: RL corrects estimation errors and yields cleaner trajectories
  3. Diversity: several varied successful trajectories can be generated from a single reference trajectory

Using Normalized Trajectories

Through additional experiments the authors found that using "normalized" trajectories is important. Normalization here means the following:

import numpy as np
# Illustrative (T, 3) position arrays; the values are made up
object_pos = np.array([[0.30, 0.10, 0.02], [0.31, 0.10, 0.05]])
hand_pos = np.array([[0.25, 0.05, 0.15], [0.29, 0.09, 0.08]])
initial_object_pos = object_pos[0]
# Object position normalization: always start at the origin
object_pos_normalized = object_pos - initial_object_pos
# Hand position normalization: express the hand relative to the object
hand_pos_normalized = hand_pos - object_pos

With this normalization, the same manipulation skill generalizes better when the initial position of the object changes.

The PPO Algorithm

The policy is trained with Proximal Policy Optimization (PPO), which maximizes the clipped surrogate objective:

L^{CLIP}(\theta) = \mathbb{E}_t[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)]

  • r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}: probability ratio
  • \hat{A}_t: advantage estimate
  • \epsilon = 0.2: clipping range
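The effect of the clipping term is easiest to see numerically. The following sketch evaluates the clipped surrogate for given probability ratios and advantages; the function names are chosen here for illustration, and a real implementation would compute these quantities on batched tensors with automatic differentiation.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single sample."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Sample mean of the clipped surrogate (the quantity PPO maximises)."""
    terms = [ppo_clip_term(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps.
capped = ppo_clip_term(1.5, 1.0)
# Negative advantage: moving the ratio further from 1 is penalised in full.
uncapped = ppo_clip_term(1.5, -1.0)
```

Because the minimum is taken, the objective never rewards moving the ratio outside [1 - \epsilon, 1 + \epsilon] in the direction that would increase it, which is what keeps each policy update close to the old policy.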

Stage 3: Vision-based Policy Learning

The state-based policy learned in Stage 2 relies on ground-truth object poses, so it cannot be deployed directly on a real robot. Stage 3 learns a policy that uses only visual information from a camera.

Input Representation: Point Clouds

The visual input is a 3D point cloud obtained from an RGB-D camera:

\mathbf{P} = \{\mathbf{p}_i, \mathbf{c}_i\}_{i=1}^N

  • \mathbf{p}_i \in \mathbb{R}^3: 3D position
  • \mathbf{c}_i \in \mathbb{R}^3: RGB color
  • N: number of points (N=1200 in the paper)

Coordinate Transformation: The Key Innovation

One of the important contributions of ViViDex is its hand-centric coordinate transformation. Earlier methods used the world or camera coordinate frame, whereas ViViDex transforms the points into a hand-centered frame:

\mathbf{p}_i^{hand} = \mathbf{R}_{hand}^T (\mathbf{p}_i - \mathbf{t}_{hand})

Here \mathbf{R}_{hand} and \mathbf{t}_{hand} are the rotation and position of the wrist.
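The transformation above can be checked with a small sketch in plain Python. It applies \mathbf{p}^{hand} = \mathbf{R}_{hand}^T(\mathbf{p} - \mathbf{t}_{hand}) to a tiny point cloud, then moves the whole scene (points and wrist together) by an arbitrary rigid transform and confirms that the hand-frame coordinates do not change. The helper names and all numeric values are made up for illustration.

```python
import math

def rot_z(theta):
    """3x3 rotation matrix about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matvec(R, v):
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]

def to_hand_frame(points, R_hand, t_hand):
    """Apply p_hand = R_hand^T (p - t_hand) to every point."""
    Rt = transpose(R_hand)
    return [matvec(Rt, [p[k] - t_hand[k] for k in range(3)]) for p in points]

# A tiny "scene" in the world frame and a wrist pose (illustrative values).
points = [[0.5, 0.1, 0.2], [0.4, -0.2, 0.3]]
R_hand, t_hand = rot_z(0.3), [0.45, 0.0, 0.25]
local = to_hand_frame(points, R_hand, t_hand)

# Move the whole scene, hand included, by a global rigid transform.
R_g, t_g = rot_z(1.1), [2.0, -1.0, 0.5]
moved_points = [[a + b for a, b in zip(matvec(R_g, p), t_g)] for p in points]
moved_R = matmul(R_g, R_hand)
moved_t = [a + b for a, b in zip(matvec(R_g, t_hand), t_g)]
moved_local = to_hand_frame(moved_points, moved_R, moved_t)
```

Since `moved_local` matches `local` up to floating-point error, a network that only sees hand-frame points receives the same input wherever the hand and object sit in the workspace.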

Why does this matter?

The essence of a manipulation task is the relative relationship between the hand and the object. What matters is less the absolute position than how the object looks from the hand. Using a hand-centered frame gives:

  1. Invariance: the hand-object relationship stays the same even when the robot moves to a different location
  2. Generalization: adapting to new object positions becomes easier
  3. Learning efficiency: the transformation the network has to learn is simplified

In the experiments, this transformation alone raised the success rate by roughly 12 to 14 percentage points (see the coordinate-frame ablation below).

Comparing Two Learning Methods

The authors compare two methods for learning the policy:

1) Behavior Cloning (BC)

This is the simplest supervised learning approach:

\mathcal{L}_{BC} = \mathbb{E}_{(\mathbf{o}_t, \mathbf{a}_t) \sim \mathcal{D}}[\|\pi(\mathbf{o}_t) - \mathbf{a}_t\|^2]

Here \mathcal{D} is the dataset of successful trajectories collected in Stage 2.

Pros:

  • Simple to implement
  • Fast to train
  • Stable

Cons:

  • Vulnerable to out-of-distribution situations
  • Predicts only a single action

2) 3D Diffusion Policy

A recently proposed diffusion-based method that refines the action iteratively. Schematically:

\mathbf{a}_t^{(k+1)} = \mathbf{a}_t^{(k)} + \sigma_k \nabla_{\mathbf{a}} \log p(\mathbf{a}_t^{(k)}|\mathbf{o}_t)

Here k is the diffusion step.
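The iterative refinement rule can be illustrated on a toy case where the score is known in closed form. In the sketch below the conditional action distribution is assumed to be a Gaussian centred on a fixed "expert" action, so its score is -(a - \mu)/\sigma^2; in the actual method the score (or denoiser) is a learned network conditioned on the observation, and a noise schedule is used. The function names and numbers are hypothetical.

```python
def gaussian_score(action, mean, variance=1.0):
    """grad_a log N(a; mean, variance * I), known in closed form for a Gaussian."""
    return [-(a - m) / variance for a, m in zip(action, mean)]

def refine(action0, mean, step=0.1, n_steps=100):
    """Repeat a <- a + step * score(a); a noise-free stand-in for denoising."""
    action = list(action0)
    for _ in range(n_steps):
        score = gaussian_score(action, mean)
        action = [a + step * s for a, s in zip(action, score)]
    return action

expert_action = [0.2, 0.4]   # stands in for the mode of p(a | o)
initial_guess = [1.0, -1.0]  # stands in for the initial noisy action
refined = refine(initial_guess, expert_action)
```

Each step moves the action a fixed fraction of the way toward higher probability, so the initial guess converges to the expert action. A multimodal distribution would instead pull different initial samples toward different modes, which is the property that makes diffusion policies attractive for manipulation.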

Pros:

  • Can model multimodal distributions
  • Produces smoother actions
  • Better suited to complex manipulation

Cons:

  • Higher computational cost
  • Takes longer to train

In the experiments, BC performed better on the simpler task (relocation), while Diffusion Policy performed better on the more complex tasks (pouring, placing).

Network Architecture

The visual policy has the following structure:

graph TD
    A[Point cloud] --> B[PointNet++ Encoder]
    B --> C[Feature vector]
    D[Robot state] --> E[MLP]
    E --> F[State embedding]
    C --> G[Concatenate]
    F --> G
    G --> H[Policy Head MLP]
    H --> I[Action output]

    style A fill:#e1f5ff
    style D fill:#e1f5ff
    style I fill:#ffe1e1
    style G fill:#fff5e1

PointNet++ groups points hierarchically to extract both local and global features:

\mathbf{f}_j = \text{MAX}_{i \in \mathcal{N}(j)} \{\text{MLP}([\mathbf{p}_i - \mathbf{p}_j, \mathbf{f}_i])\}

Here \mathcal{N}(j) is the set of neighbors of point j.

Experimental Setup and Results

Experimental Environment

The authors ran experiments in two simulators:

  1) MuJoCo: specialized for physics-based simulation
  2) SAPIEN: more realistic rendering and support for a wider variety of objects

Robot platform:

  • Hand: Allegro Hand (16 DoF)
  • Arm: UR5 robot arm (6 DoF)
  • Camera: RealSense D435 RGB-D camera

Evaluation Tasks

Three dexterous manipulation tasks were evaluated:

1) Relocation

  • Pick up an object and move it to a target position
  • Uses 26 YCB objects
  • Varied initial poses and target positions

2) Pouring

  • Lift a bottle or kettle and tilt it to a specific angle
  • Requires precise rotation control
  • The object must be tilted without being dropped

3) Placing Inside

  • Place an object accurately inside a box
  • Requires precise control in a confined space
  • The hardest of the three tasks

Performance Comparison

ViViDex was compared against the main baseline methods:

1) DexMV: the prior state of the art (SOTA)

  • Learns directly from human videos
  • Uses the trajectories as a teacher policy

2) DexPoint: a point-cloud-based policy

  • Trained with RL only (no videos)

3) 3D-DP: 3D Diffusion Policy

  • Uses a small number of teleoperated demonstrations, without videos

Quantitative Results

Relocation task (success rate, %):

| Method | Seen Objects | Unseen Objects |
|---|---|---|
| DexMV | 68.5 | 42.3 |
| DexPoint | 45.2 | 28.1 |
| ViViDex (BC) | 85.7 | 61.4 |
| ViViDex (Diffusion) | 82.3 | 58.9 |

Pouring task (success rate, %):

| Method | Seen Objects | Unseen Objects |
|---|---|---|
| DexMV | 54.2 | 31.5 |
| ViViDex (Diffusion) | 78.6 | 52.3 |

Placing Inside task (success rate, %):

| Method | Seen Objects | Unseen Objects |
|---|---|---|
| DexMV | 41.7 | 22.8 |
| ViViDex (Diffusion) | 69.3 | 45.6 |

Across all tasks in these tables, ViViDex improves on DexMV by roughly 17 to 28 percentage points. Generalization to unseen objects improves substantially as well (by about 19 to 23 points).

Ablation Study

The contribution of each component was analyzed:

1) Effect of trajectory normalization:

| Configuration | Success Rate |
|---|---|
| Without normalization | 67.2% |
| With normalization | 85.7% |

Normalization brings an improvement of 18.5 percentage points.

2) Effect of the coordinate transformation:

| Coordinate frame | Success Rate |
|---|---|
| World | 71.3% |
| Camera | 73.8% |
| Hand-centric | 85.7% |

The hand-centric frame brings an improvement of about 12 to 14 percentage points.

3) Effect of the trajectory-guided reward:

| Method | Training episodes | Success Rate |
|---|---|---|
| RL only | 5000 | 52.3% |
| RL + trajectory guidance | 2000 | 85.7% |

With trajectory guidance, the policy reaches a success rate about 33 percentage points higher using only 40% of the training episodes (2,000 versus 5,000).

Real Robot Experiments

Simulation results alone are not enough. The authors also validated the method on a real robot.

Sim-to-Real Transfer Strategy

Deploying on a real robot raises the following challenges:

  1. Domain gap: physical differences between simulation and reality
  2. Sensor noise: point clouds from a real camera are noisier
  3. Hardware constraints: the actual response time and accuracy of the robot

ViViDex's solution:

  • Step 1: Learn the state-based policy in simulation
  • Step 2: Execute the state-based policy on the real robot
  • Step 3: Retrain the visual policy on data collected from the real robot

The key to this approach is that the visual policy is trained directly on real data, which sidesteps the domain gap.

Real Robot Results

On the relocation task:

  • 80% success rate (5 different objects)
  • Works at unseen object positions as well
  • Average execution time: 4-6 seconds

On the pouring task:

  • 70% success rate
  • Adapts to a variety of bottle shapes
  • Maintains a stable grip

The real robot experiments show that ViViDex can be applied to practical robot systems.

Technical Deep Dive

Why Does Trajectory-Guided RL Work?

The core idea of ViViDex is intuitive, but it is worth understanding why it works.

1) Shrinking the search space

The action space of dexterous manipulation is very large. The 16 DoF of the Allegro Hand form a continuous space \mathbb{R}^{16}, and a trajectory through this space has to be chosen over time. Finding a successful trajectory by random exploration is close to impossible.

The reference trajectory shrinks this search space drastically:

\mathcal{A}_{effective} = \{\mathbf{a} : \|\mathbf{a} - \mathbf{a}^{ref}\| < \epsilon\}

The effect resembles curriculum learning: the agent starts from a point that is already close to the answer.

2) Reward Shaping

The trajectory reward R_{traj} is in effect a form of reward shaping:

R_{shaped} = R_{original} + F(\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}_{t+1})

A well-studied special case takes F to be a potential-based shaping function:

F(\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}_{t+1}) = \gamma \Phi(\mathbf{s}_{t+1}) - \Phi(\mathbf{s}_t)

\Phi(\mathbf{s}) = -\|\mathbf{s} - \mathbf{s}^{ref}\|

According to Ng et al. (1999), potential-based shaping of this form speeds up learning without changing the optimal policy. Strictly, that guarantee covers only shaping terms of the form \gamma \Phi(\mathbf{s}_{t+1}) - \Phi(\mathbf{s}_t); the tracking reward R_{traj} is a direct distance penalty rather than a potential difference, so the result applies here as an analogy rather than as a guarantee.
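The policy-invariance property of potential-based shaping rests on a telescoping sum: along any trajectory s_0, ..., s_T, the discounted shaping terms add up to \gamma^T \Phi(s_T) - \Phi(s_0), which does not depend on the actions taken in between. The sketch below checks this numerically on a one-dimensional toy trajectory; the states, rewards, and function names are invented for illustration.

```python
def discounted_return(rewards, gamma):
    """sum_t gamma^t * r_t"""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def potential(state, ref_state):
    """Phi(s) = -|s - s_ref|, a 1-D stand-in for -||s - s_ref||."""
    return -abs(state - ref_state)

def shaped_rewards(states, rewards, ref_state, gamma):
    """r_t + gamma * Phi(s_{t+1}) - Phi(s_t); states holds s_0..s_T, rewards r_0..r_{T-1}."""
    return [r + gamma * potential(states[t + 1], ref_state)
            - potential(states[t], ref_state)
            for t, r in enumerate(rewards)]

gamma, ref_state = 0.9, 1.0
states = [0.0, 0.5, 1.2, 1.0]   # toy trajectory s_0..s_3
rewards = [0.0, 0.0, 1.0]       # sparse task reward at the end

plain = discounted_return(rewards, gamma)
shaped = discounted_return(shaped_rewards(states, rewards, ref_state, gamma), gamma)

# Telescoping: the gap depends only on the first and last state.
gap = shaped - plain
expected_gap = (gamma ** 3) * potential(states[-1], ref_state) - potential(states[0], ref_state)
```

Because the gap is fixed by the endpoints, shaping of this form changes how quickly good trajectories are found without changing which policy is optimal.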

3) Multimodality

It matters that several successful trajectories can be generated from a single reference trajectory. It indicates that RL is not merely imitating the reference but adapting it to the physics of the task.

Under a maximum-entropy view of RL, the resulting trajectories can be thought of as samples from the distribution:

p(\tau) \propto \exp(\sum_t R_t) \cdot p_{ref}(\tau)

Here p_{ref}(\tau) is a prior concentrated around the reference trajectory.

Theoretical Justification of the Hand-centric Frame

From the viewpoint of the group SE(3), a manipulation task can be written as:

T_{obj}^{target} = T_{world}^{hand} \cdot T_{hand}^{obj} \cdot \Delta T

  • T denotes a transformation matrix
  • \Delta T is the change produced by manipulation

The key point is that T_{hand}^{obj} (the pose of the object relative to the hand) captures the essence of the task. The hand-centric frame makes this quantity directly observable.

Invariance analysis:

The hand-centric representation is invariant to:

  1. Translation: the global position of the robot
  2. Rotation: the orientation of the robot base

Equivalently, when observations and actions are expressed in the world frame, the learned policy is equivariant:

\pi(\mathbf{T} \cdot \mathbf{o}) = \mathbf{T} \cdot \pi(\mathbf{o})

Here \mathbf{T} is a transformation in SE(3) applied to the whole scene, hand included.

The Role of PointNet++

A point cloud is an unordered set, so the network processing it has to be permutation invariant:

f(\{\mathbf{p}_1, ..., \mathbf{p}_n\}) = f(\{\mathbf{p}_{\sigma(1)}, ..., \mathbf{p}_{\sigma(n)}\})

This must hold for every permutation \sigma.

PointNet++ achieves this with MAX pooling:

\mathbf{g} = \text{MAX}_{i=1}^n \{h(\mathbf{p}_i)\}

Here the MAX operation guarantees permutation invariance.
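Permutation invariance through max pooling is easy to verify directly. In the sketch below a fixed hand-written function stands in for the shared per-point MLP h (an assumption made purely for illustration), and the global feature is the element-wise maximum over all points.

```python
def per_point_features(point):
    """Hypothetical per-point feature map standing in for the shared MLP h."""
    x, y, z = point
    return [x + y, y * z, x - z, x * x]

def global_feature(points):
    """Element-wise MAX over the per-point features (symmetric in the points)."""
    feats = [per_point_features(p) for p in points]
    return [max(column) for column in zip(*feats)]

cloud = [[0.1, 0.2, 0.3], [0.5, -0.1, 0.0], [-0.2, 0.4, 0.9]]
shuffled = [cloud[2], cloud[0], cloud[1]]  # same points, different order

same_feature = global_feature(cloud) == global_feature(shuffled)
```

Any symmetric reduction (max, sum, mean) would give the same invariance; max pooling is the choice made in the PointNet family.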

In addition, the hierarchical structure preserves local context:

  • Level 1: features of individual points
  • Level 2: features of local patches (radius r_1)
  • Level 3: features of larger regions (radius r_2 > r_1)
  • Level 4: global features

This is similar to hierarchical feature extraction in a CNN, but applied to irregular 3D data.

Limitations

1) Dependence on video quality

ViViDex requires high-quality human videos. In particular:

  • The hand and the object must be clearly visible
  • Occlusion must be kept to a minimum
  • Lighting must be adequate

Learning directly from everyday YouTube videos is still difficult.

2) Limited task complexity

The current experiments are confined to relatively short manipulation tasks (4-6 seconds). Complex tasks with longer horizons, such as assembly, remain challenging.

3) No bimanual manipulation

Only a single hand is used at present. Bimanual coordination is a harder problem.

4) Computational cost

  • State-based policy learning per video: 2-4 GPU hours
  • Visual policy learning: 6-12 GPU hours
  • Full pipeline: requires substantial compute

Practical Considerations

Deployment Checklist

Points for robotics researchers to consider when applying ViViDex:

1) Hardware requirements

  • RGB-D camera (RealSense D435 recommended)
  • Robot hand with 16+ DoF
  • CUDA-capable GPU (11GB+ VRAM)

2) Data collection

  • High-quality human demonstration videos (1080p or higher)
  • A variety of objects and initial conditions
  • An environment with consistent lighting

3) Calibration

  • Hand-eye calibration is mandatory
  • Measure the camera intrinsics accurately
  • Verify the robot kinematics

4) Safety

  • Collision detection and emergency stop
  • Workspace limits
  • Force/torque limits

Code Usage Guide

How to use the implementation provided in the GitHub repository:

# 1. Set up the environment
conda create -n vividex python=3.10
conda activate vividex
pip install -r requirements.txt

# 2. Reference trajectory extraction (provided in advance)
# Already prepared in the norm_trajectories/ directory

# 3. Train the state-based policy
python train.py env.name=relocation_box env.norm_traj=True

# 4. Roll out successful trajectories
python generate_expert_trajs.py --checkpoint runs/relocation_box/model.pt

# 5. Train the visual policy
python imitate_train.py --policy bc  # or diffusion

Hyperparameter Tuning

Settings that work well empirically:

RL stage:

  • Learning rate: 3 \times 10^{-4}
  • Batch size: 4096
  • \lambda (GAE): 0.95
  • Clip range: 0.2
  • Trajectory reward weight: 0.5-1.0 (adjust per task)

Visual policy stage:

  • Learning rate: 1 \times 10^{-4}
  • Batch size: 64
  • Point cloud size: 1200
  • Training epochs: 100-300

Related Work and Context

Historical Context

Research on dexterous manipulation goes back several decades:

1980-2000: Analytical methods

  • Grasp planning
  • Contact dynamics modeling
  • Trajectory optimization

2000-2015: Early applications of machine learning

  • Grasp quality prediction with SVMs and random forests
  • DMP (Dynamic Movement Primitives)
  • Learning from teleoperation data

2015-2020: The deep learning revolution

  • DexNet: predicting grasp success with CNNs
  • First applications of deep reinforcement learning (DRL)
  • OpenAI's Dactyl: learning in-hand cube reorientation

2020-present: Scaling and generalization

  • Large-scale simulation environments
  • Learning from human videos
  • Integration of foundation models

ViViDex sits at the front of this evolution, and in particular it combines human videos, RL, and visual policies into a single paradigm.

Directly Related Work

DexMV (Qin et al., 2022)

  • The direct predecessor of ViViDex
  • Learns a teacher policy from videos
  • Limitations: sensitive to noise, requires privileged information

DexPoint (Qin et al., 2022)

  • Point-cloud-based policy
  • Trained with RL only
  • Limitation: low sample efficiency

Learning from Play (Lynch et al., 2020)

  • Learns from unstructured play data
  • Goal-conditioned policy
  • Different domain: mostly planar manipulation

3D Diffusion Policy (Ze et al., 2024)

  • Applies diffusion models to manipulation
  • Multimodal action distributions
  • Integrated and used by ViViDex

Summary of differences

Feature                            DexMV    DexPoint   3D-DP    ViViDex
Uses video                         ✓        ✗          ✗        ✓
Physical plausibility              ✗        ✓          ✓        ✓
No privileged information needed   ✗        ✓          ✓        ✓
Sample efficiency                  Medium   Low        High     High
Generalization                     Low      Medium     Medium   High

Theoretical contributions and significance

1) A new framework that uses video as a prior

ViViDex treats human video not as mere "data" but as "prior knowledge":

p(\tau | \text{video}) = \frac{p(\text{video} | \tau) p(\tau)}{p(\text{video})}

  • p(\tau): prior over physically feasible trajectories
  • p(\text{video} | \tau): likelihood of the video given a trajectory
  • p(\tau | \text{video}): posterior over trajectories given the video

As an informal analogy, RL can be viewed as sampling from this posterior.

2) Staged removal of privileged information

How to handle "privileged information" is a central problem in robot learning. ViViDex's two-stage approach offers an elegant answer:

Stage 1: Use privileged information to discover what "good behavior" is.
Stage 2: Learn to reproduce that behavior from visual information alone.

This resembles knowledge distillation:

\mathcal{L} = \text{KL}(\pi_{student}(\cdot|\mathbf{o}^{visual}) || \pi_{teacher}(\cdot|\mathbf{s}^{privileged}))
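To make the distillation loss concrete, here is a minimal sketch that evaluates the KL term when both policies are modeled as diagonal Gaussians over actions. The Gaussian form and the function name `kl_diag_gaussian` are assumptions made for illustration; according to this review, ViViDex itself trains the visual policy by behavior cloning or diffusion policy on teacher rollouts rather than with an explicit KL term.

```python
import math


def kl_diag_gaussian(mu_s, sigma_s, mu_t, sigma_t):
    """KL( N(mu_s, diag(sigma_s^2)) || N(mu_t, diag(sigma_t^2)) ), summed over action dims."""
    total = 0.0
    for ms, ss, mt, st in zip(mu_s, sigma_s, mu_t, sigma_t):
        total += math.log(st / ss) + (ss ** 2 + (ms - mt) ** 2) / (2.0 * st ** 2) - 0.5
    return total


# Made-up 2-D action means: the student matches the teacher in the first
# dimension and is off by 1.0 in the second.
student_mu, teacher_mu = [0.0, 1.0], [0.0, 2.0]
unit_sigma = [1.0, 1.0]
kl_value = kl_diag_gaussian(student_mu, unit_sigma, teacher_mu, unit_sigma)
```

With equal, fixed variances the KL reduces to half the squared distance between the means divided by the variance, which is why plain behavior cloning with a mean-squared-error loss can be read as a special case of this objective.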

3) The importance of coordinate invariance

Introducing a hand-centered coordinate frame is not just an engineering trick; it has a deeper meaning:

Proposition (informal): if a manipulation policy \pi is learned in a hand-centered frame, then for any SE(3) transformation T:

\pi(T \cdot \mathbf{o}, T \cdot \mathbf{s}) = T \cdot \pi(\mathbf{o}, \mathbf{s})

That is, the policy satisfies an equivariance property.

From a group-theoretic viewpoint, this means the policy respects the action of the SE(3) group; informally, it has learned a representation of that group.
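The mechanism behind this property can be checked numerically: if the whole scene (hand pose and observed points) is moved by the same rigid transform, the points expressed in the hand frame do not change, so a policy that only sees hand-frame inputs produces the same hand-frame output. The sketch below is a pure-Python illustration with made-up numbers; the helper names are hypothetical, and only rotations about the z-axis are used to keep it short.

```python
import math


def rot_z(theta):
    """3x3 rotation matrix about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matvec(R, v):
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]


def to_hand_frame(point, R_hand, t_hand):
    """Express a world point in the hand frame: R_hand^T (p - t_hand)."""
    diff = [point[i] - t_hand[i] for i in range(3)]
    return matvec(transpose(R_hand), diff)


# Hand pose and one observed point in the world frame (made-up numbers).
R_hand, t_hand = rot_z(0.3), [0.2, -0.1, 0.5]
point = [0.4, 0.3, 0.6]

# Move the whole scene with the same global rigid transform T = (R_g, t_g).
R_g, t_g = rot_z(1.1), [1.0, 2.0, -0.5]
R_hand_new = matmul(R_g, R_hand)
t_hand_new = [a + b for a, b in zip(matvec(R_g, t_hand), t_g)]
point_new = [a + b for a, b in zip(matvec(R_g, point), t_g)]

local_before = to_hand_frame(point, R_hand, t_hand)
local_after = to_hand_frame(point_new, R_hand_new, t_hand_new)
```

`local_before` and `local_after` agree up to floating-point error, which is the invariance of the hand-frame input that the equivariance statement relies on.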

Conclusion

ViViDex presents an innovative framework for learning dexterous robot manipulation skills from human videos. Trajectory-guided reinforcement learning refines noisy video estimates into physically plausible trajectories, and the hand-centered coordinate transformation substantially improves the generalization of the visual policy. In experiments it improves on prior state-of-the-art methods by 15-25%, with particularly strong adaptation to unseen objects. Training needs only 1-3 human demonstrations per task, and the 80% success rate on a real robot demonstrates its practicality. The work proposes a paradigm in which video serves as prior knowledge rather than mere data, and it is an important milestone toward a future in which robots learn manipulation skills from the vast video resources on the internet.


References

  1. Chen, Z., Chen, S., Arlaud, E., Laptev, I., & Schmid, C. (2025). ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).
    • Paper
    • Project Page
    • Code (SAPIEN)
    • Code (MuJoCo)
  2. Qin, Y., Wu, Y., Liu, S., Jiang, H., Yang, R., Fu, Y., & Wang, X. (2022). DexMV: Imitation Learning for Dexterous Manipulation from Human Videos. In European Conference on Computer Vision (ECCV).
    • Paper
    • Project Page
    • Code
  3. Ze, Y., Luo, J., Lin, G., Xu, D., Wang, X., Gan, C., & Xiong, Y. (2023). 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations. In arXiv preprint arXiv:2310.03005.
    • Paper
    • Project Page
    • Code
  4. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
    • Paper
  5. Romero, J., Tzionas, D., & Black, M. J. (2017). Embodied Hands: Modeling and Capturing Hands and Bodies Together. ACM Transactions on Graphics (ToG), 36(6), 1-17.
    • Paper
    • MANO Model
  6. Qi, C. R., Yi, L., Su, H., & Guibas, L. J. (2017). PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in Neural Information Processing Systems (NeurIPS).
    • Paper
    • Code
  7. Chen, T., Xu, J., & Agrawal, P. (2022). A System for General In-Hand Object Re-Orientation. In Conference on Robot Learning (CoRL).
    • Paper
    • DexPoint Project
  8. Andrychowicz, M., Baker, B., Chociej, M., et al. (2020). Learning Dexterous In-Hand Manipulation. The International Journal of Robotics Research, 39(1), 3-20.
    • Paper
    • OpenAI Blog
  9. Ng, A. Y., Harada, D., & Russell, S. (1999). Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping. In ICML.
    • Paper
  10. Lynch, C., Khansari, M., Xiao, T., et al. (2020). Learning Latent Plans from Play. In Conference on Robot Learning (CoRL).
    • Paper
    • Project Page

⛏️ Dig Review

⛏️ Dig: Go deep, uncover the layers. Dive into technical detail.

1. Summary of key contributions

The ViViDex paper proposes a new framework that uses human videos to learn vision-based dexterous manipulation policies for multi-fingered robot hands. Its main contributions are as follows:

The ViViDex framework: A unified framework for learning vision-based dexterous manipulation policies from human manipulation videos. To overcome the reliance of earlier methods on hundreds of videos and complex reward functions, it combines trajectory-guided RL with the training of a unified (distilled) visual policy.

Improved state-based policy learning: Rather than using the hand and object trajectories extracted from human videos (the reference trajectories) as they are, ViViDex converts them into physically plausible hand-object motion. To do so, it designs a new reward function that preserves similarity to the reference trajectory and uses it during RL. It also proposes a new network architecture for the vision-based policy (a PointNet-based model with coordinate transformations).

Experimental validation: Extensive simulation and real-robot experiments were run on three dexterous manipulation tasks (relocate, pour, place-inside). Under the same conditions, ViViDex outperformed prior state-of-the-art methods such as DexMV, and in particular reached high success rates while using far fewer videos.

These results suggest that ViViDex can make effective use of natural human manipulation motion to learn vision-based robot manipulation policies that respect physical plausibility.

2. Detailed analysis of the technical components

Figure 1. Overview of the ViViDex framework: (a) reference trajectories are extracted from human videos, (b) the trajectories are refined with RL to train state-based policies, and (c) the resulting plausible trajectories are used to train a unified visual policy that operates on RGB-D sensor input alone.

ViViDex is a pipeline of three modules, outlined in Figure 1. First, (1) reference trajectory extraction from human videos recovers the motion of the human hand and the object. Next, (2) trajectory-guided state-based policy learning produces physically plausible robot motion. Finally, (3) unified vision-based policy learning trains a network that acts from actual sensor input (3D point clouds) alone.

Reference trajectory extraction and motion retargeting: ViViDex uses the DexYCB dataset of Chao et al., which provides videos of human hand-object interaction for 20 YCB objects. For comparison with prior work (DexMV), the study selects five objects: mustard bottle, tomato soup can, sugar box, large clamp and mug. One to three videos per object are used, and 3D hand joints and object poses are estimated from them. Because the estimated human hand trajectory is noisy and does not match the robot's kinematics, it cannot be copied directly into robot joint space; instead, as shown in Figure 2, optimization-based retargeting is applied to obtain the joint angles q_t^r of the Allegro robot hand. The goal is to minimize the difference between the human fingertip positions \psi_t^{h_j} and the robot fingertip positions \hat{x}_{t}^{r_j}(q_t^r). With an objective of the form of Eq. (1), an optimization problem combining the \ell_2 distance with a regularization term on joint changes is solved to obtain the robot joint trajectory. The result is a trajectory that looks natural but is physically unstable (Figure 2).
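Eq. (1) is not reproduced in this review. As a hedged sketch only, an objective matching the description above (fingertip \ell_2 distance plus a penalty on joint changes) could be written as follows; the weight \beta, the use of squared norms, and the choice of the previous-step solution q_{t-1}^r in the regularizer are assumptions, not values taken from the paper:

```latex
\min_{q_t^r}\; \sum_{j} \bigl\| \psi_t^{h_j} - \hat{x}_t^{r_j}(q_t^r) \bigr\|_2^2 \;+\; \beta\, \bigl\| q_t^r - q_{t-1}^r \bigr\|_2^2
```

The first term pulls each robot fingertip j toward the corresponding human fingertip, and the second term discourages large jumps in joint angles between consecutive frames.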

State-based policy learning: Starting from the reference trajectories, RL is used to learn physically plausible robot motion. The state-based policy is implemented as an actor-critic MLP that takes the robot joint state and the object state as input and outputs robot joint control commands. The key element is the design of the trajectory-guided reward. The reference trajectory is split into a pre-grasp stage (approaching the object) and a manipulation stage (manipulating the object), and each stage receives the following reward (Figure 2):

Pre-grasp stage reward (R_p): The reward increases as the current robot fingertip positions x_t^r get closer to the robot fingertip positions \hat{x}_t^r of the reference trajectory. Concretely, over the T_p pre-grasp steps the reward is:

R_p = \sum_{t=1}^{T_p} 10\,\exp\Bigl(-10\,|x_t^r - \hat{x}_t^r|^2\Bigr).
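The pre-grasp reward can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: x_t^r is taken to stack all fingertip positions, so the squared norm sums over every fingertip and axis, positions are assumed to be in metres, and the function names are hypothetical.

```python
import math


def pregrasp_step_reward(fingertips, ref_fingertips):
    """Per-step term 10 * exp(-10 * |x - x_hat|^2), summing squared error over all fingertips."""
    sq_err = sum(
        (a - b) ** 2
        for tip, ref in zip(fingertips, ref_fingertips)
        for a, b in zip(tip, ref)
    )
    return 10.0 * math.exp(-10.0 * sq_err)


def pregrasp_return(trajectory, ref_trajectory):
    """R_p: sum of the per-step terms over the pre-grasp steps."""
    return sum(
        pregrasp_step_reward(x, x_hat) for x, x_hat in zip(trajectory, ref_trajectory)
    )


# Three pre-grasp steps with two fingertips each (made-up positions).
ref = [[(0.0, 0.0, 0.1), (0.05, 0.0, 0.1)]] * 3
perfect = pregrasp_return(ref, ref)

# The same motion with every fingertip shifted 10 cm along x.
shifted = [[(x + 0.1, y, z) for (x, y, z) in step] for step in ref]
off_track = pregrasp_return(shifted, ref)
```

Perfect tracking yields the maximum of 10 per step, and the reward decays smoothly as the fingertips drift from the reference, which gives the policy a dense signal during the approach.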

Manipulation stage reward (R_m): In this stage the object is moved to the goal and manipulated, and a combined reward constrains both the robot hand and the object motion. For the remaining steps of the reference trajectory, from t=T_p+1 to T_r, it is defined as:

R_m = \sum_{t=T_p+1}^{T_r}\bigl[\lambda_1 R^m_h + \lambda_2 R^m_o + \lambda_3\,1_{\text{contact}} + \lambda_4\,1_{\text{lift}}\bigr].

Here R^m_h is a hand-motion reward that, as in the pre-grasp stage, keeps the hand motion close to the reference trajectory, and R^m_o = \exp\bigl(-\alpha_1(|x_t^o - \hat{x}_t^o|^2 + \alpha_2\phi(\theta_t^o,\hat{\theta}_t^o))\bigr) is an object reward that increases as the object position and orientation approach the reference. 1_{\text{contact}} counts the fingertips in contact with the object, and 1_{\text{lift}} is a bonus term granted when the object has been lifted successfully. The experiments use hyperparameters such as \lambda_1=4, \lambda_2=10, \lambda_3=0.5, \alpha_1=50, \alpha_2=0.1. A policy trained with this reward manipulates the object along the reference trajectory while satisfying physical constraints.
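A per-step version of the manipulation reward can be sketched as below. Several pieces are assumptions made only for illustration: \lambda_4 is not given in the text and is set to 1.0, the hand term R^m_h is assumed to use the same exponential kernel as the pre-grasp reward without the factor 10, and the orientation distance \phi is passed in as a precomputed scalar. All function names are hypothetical.

```python
import math

LAMBDA_1, LAMBDA_2, LAMBDA_3 = 4.0, 10.0, 0.5  # values quoted in the text
LAMBDA_4 = 1.0                                  # assumption: not stated in the text
ALPHA_1, ALPHA_2 = 50.0, 0.1                    # values quoted in the text


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hand_reward(fingertips, ref_fingertips):
    """Assumed form of R^m_h: exp(-10 * summed squared fingertip error)."""
    err = sum(sq_dist(tip, ref) for tip, ref in zip(fingertips, ref_fingertips))
    return math.exp(-10.0 * err)


def object_reward(obj_pos, ref_obj_pos, rot_err):
    """R^m_o = exp(-alpha_1 * (|x_o - x_o_hat|^2 + alpha_2 * phi))."""
    return math.exp(-ALPHA_1 * (sq_dist(obj_pos, ref_obj_pos) + ALPHA_2 * rot_err))


def manipulation_step_reward(fingertips, ref_fingertips, obj_pos, ref_obj_pos,
                             rot_err, n_contacts, lifted):
    """One summand of R_m for a single time step."""
    return (LAMBDA_1 * hand_reward(fingertips, ref_fingertips)
            + LAMBDA_2 * object_reward(obj_pos, ref_obj_pos, rot_err)
            + LAMBDA_3 * n_contacts
            + LAMBDA_4 * (1.0 if lifted else 0.0))


tips = [(0.0, 0.0, 0.1), (0.05, 0.0, 0.1)]
obj = (0.02, 0.0, 0.05)
tracking_only = manipulation_step_reward(tips, tips, obj, obj, 0.0, 0, False)
with_bonuses = manipulation_step_reward(tips, tips, obj, obj, 0.0, 2, True)
```

With perfect tracking the hand and object terms contribute 4 + 10 = 14, and the contact and lift terms add on top of that; the large \alpha_1 makes the object term fall off quickly once the object leaves the reference path.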

Trajectory augmentation is applied during training to increase data diversity. Specifically, either the initial object position or rotation is randomized and the whole reference trajectory is transformed accordingly, or the target object position is varied and the hand trajectory is linearly interpolated to match. This trains the policy to generalize to initial and goal conditions that differ from the reference trajectory.
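The two augmentations can be sketched as follows. This is one possible reading of the description, not the paper's code: the first function applies a single random planar rigid transform (rotation about the vertical axis plus an x-y shift) to every waypoint, and the second blends in a linearly growing offset so that the final waypoint lands on a new goal. Function names, the shift and angle ranges, and the waypoints are all made up.

```python
import math
import random


def rotate_z(p, theta):
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1], p[2])


def augment_initial_pose(ref_positions, rng, max_shift=0.1, max_angle=math.pi):
    """Rotate the trajectory about its first waypoint and shift it in the x-y plane."""
    theta = rng.uniform(-max_angle, max_angle)
    dx = rng.uniform(-max_shift, max_shift)
    dy = rng.uniform(-max_shift, max_shift)
    pivot = ref_positions[0]
    out = []
    for p in ref_positions:
        rel = (p[0] - pivot[0], p[1] - pivot[1], p[2] - pivot[2])
        r = rotate_z(rel, theta)
        out.append((r[0] + pivot[0] + dx, r[1] + pivot[1] + dy, r[2] + pivot[2]))
    return out


def retarget_goal(ref_positions, new_goal):
    """Add a linearly growing offset so the last waypoint reaches new_goal (needs >= 2 waypoints)."""
    n = len(ref_positions)
    last = ref_positions[-1]
    delta = tuple(g - l for g, l in zip(new_goal, last))
    return [
        tuple(p[i] + (k / (n - 1)) * delta[i] for i in range(3))
        for k, p in enumerate(ref_positions)
    ]


reference = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (0.2, 0.1, 0.2)]
augmented = augment_initial_pose(reference, random.Random(0))
retargeted = retarget_goal(reference, (0.3, 0.3, 0.2))
```

The rigid transform preserves the shape of the reference motion, while the goal retargeting keeps the start of the motion fixed and only bends the later part toward the new target.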

Left: pre-grasp stage. Right: manipulation stage. R1, R2 and R3 denote different reward settings (with or without the hand reward). R3 (the proposed setting, with the hand reward in both stages) achieves the highest success rate.

Effect of the reward function: As the comparison in Table I and Figure 3 shows, the most realistic and successful manipulation was obtained when a reward encouraging hand-trajectory similarity was used in both the pre-grasp and the manipulation stage (the R3 setting). The R1 and R2 settings, which omit the hand reward, produced unrealistic motion in the approach or grasp stage respectively and had lower success rates. The R3 policy in particular was stable enough to reach \text{SR}_3=1.00 on every object.

Vision-based policy learning: The state-based policy needs both the robot's proprioceptive state and the object state. In the real world, however, the exact object position and orientation are hard to measure, so ViViDex additionally trains a vision-based policy that observes the scene only through the 3D point cloud from an RGB-D camera. Specifically, successful trajectories are rolled out in simulation with the state-based policies, and at each time step the 3D point cloud PC^w\in\mathbb{R}^{N\times3} and the robot joint state are recorded.

To make this visual data usable for training, a coordinate transformation is applied. First, the point cloud is moved from the world frame (PC^w) to a frame centered on the target (PC^t), which helps the network perceive the goal position. In addition, to capture finer interaction detail, the whole point cloud is also transformed into the frames of the robot's wrist and each fingertip (six frames in total). The transformed point clouds are combined and passed through a PointNet-based encoder to extract visual features, and a final MLP predicts control commands from these features and the robot joint state (see Figure 3).
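The multi-frame transformation can be sketched as below. This is a simplified illustration, not the paper's network code: the target frame is assumed to be a pure translation, only two hand frames are used instead of six, and the per-point concatenation of coordinates is one possible way to "combine" the transformed clouds. All names and numbers are hypothetical.

```python
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Rotation of 90 degrees about z; its columns are the frame axes in world coordinates.
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]


def to_frame(points, R, t):
    """Express world points in the frame with rotation R and origin t: R^T (p - t)."""
    out = []
    for p in points:
        d = [p[i] - t[i] for i in range(3)]
        # Row i of R^T is column i of R.
        out.append([sum(R[j][i] * d[j] for j in range(3)) for i in range(3)])
    return out


def multi_frame_features(pc_world, target_pos, hand_frames):
    """Per-point features: coordinates in the target frame, then in each hand frame."""
    clouds = [to_frame(pc_world, IDENTITY, target_pos)]
    clouds += [to_frame(pc_world, R, t) for R, t in hand_frames]
    return [[c for cloud in clouds for c in cloud[i]] for i in range(len(pc_world))]


pc_world = [[0.5, 0.0, 0.2], [0.6, 0.1, 0.2]]
target_pos = [0.5, 0.0, 0.2]
hand_frames = [(IDENTITY, [0.0, 0.0, 0.0]), (ROT_Z_90, [0.5, 0.0, 0.2])]
features = multi_frame_features(pc_world, target_pos, hand_frames)
```

Each point ends up with 3 coordinates per frame, so the same geometry is presented to the encoder relative to the goal and relative to each part of the hand, which is what lets the downstream network reason about hand-object offsets directly.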

Figure 3. Network architecture of the ViViDex vision-based policy. The point cloud is transformed from world coordinates (PC^w) into the target frame (PC^t) and several hand frames and fed to PointNet. The extracted visual features and the robot joint state are used together to predict actions.

Two training methods were compared: behavior cloning (BC) and diffusion policy. Performance improved with a sufficiently dense point cloud and with the diffusion model; for example, the success rate rose when the point cloud was enlarged from 512 to 2048 points and diffusion policy was used. Diffusion policy was more robust to noise, gave more stable action predictions, and outperformed BC overall. The training data consists of roughly 100 or more successful trajectories from the state-based policies, and a single training run takes about 10 hours for BC and about 20 hours for diffusion policy.

3. Comparison with related work

ViViDex relates to several earlier lines of work on learning from human videos and on vision-based manipulation. The main points of comparison are as follows.

DexMV (Qin et al., ECCV 2022): Automatically extracts 3D hand and object poses from human manipulation videos, converts them into demonstrations in a robot simulator, and uses them to train RL policies. The extracted trajectories are retargeted to robot joint trajectories by optimization, and policies are trained with algorithms such as DAPG. However, because pose estimation noise is large, the method needs tens to hundreds of videos per object, and it depends on reward tuning and on privileged information such as object CAD models.

DexVIP (Mandikal & Grauman, CoRL 2022): Extracts human hand-object interaction from in-the-wild footage such as YouTube videos to learn robot grasping policies. In particular, it introduces a prior over human hand poses into RL, which improves training speed and generalization. The approach scales well to diverse objects and is cost-effective because it needs no specialized equipment. However, DexVIP is largely limited to grasping and does not cover full hand motion control.

VideoDex (Shaw et al., CoRL 2022): Extracts visual, action and physical priors from human manipulation videos and uses them for robot policy learning. Applying the visual and physical priors learned from video to the network yielded robust performance on a variety of manipulation tasks. The goal is a general-purpose policy that can handle many objects in real environments.

Other work: Studies such as DexRepNet and VideoDex address dexterous manipulation through geometric representations, teleoperation, auxiliary information for RL, and similar means. Most of them, however, rely on monocular or dual-camera input or on special equipment (for example VR gloves), or require extensive collection of expert data.

Compared with these, ViViDex differs in three ways. First, it is data-efficient enough to reach high performance from very few videos (1-3 per object). In Table II the proposed method (S6, S7) uses only one training video per object and reaches \text{SR}_3=1.00 on the Relocate task for every object, whereas DexMV remains at low performance on some objects even with tens to hundreds of videos. Second, the final visual policy uses no privileged information such as object CAD models or ground-truth object poses; it is trained only on what is observed through 3D point clouds, so on a real robot it can run from sensors alone. Third, designing the RL reward to match the reference trajectory minimizes separate, complex reward engineering, and the network architecture with coordinate transformations makes efficient use of visual information. As a result, ViViDex addresses the noise and scalability limitations of earlier human-video-based methods.

4. Experimental setup and results

The paper reports in-depth experiments on three dexterous manipulation tasks (relocate, pour, place inside). The main experimental details and results follow.

Dataset and environment: The experiments use five objects from the DexYCB dataset. Two evaluation settings are applied to each object: Protocol #1 (a separate policy per object; for initial positions see the first row of Figure 2) and Protocol #2 (a unified multi-object policy with three initial positions per object). The Adroit multi-DoF hand in the MuJoCo simulator is the default environment, and for greater realism and speed, training is also carried out in the SAPIEN simulator with an Allegro hand mounted on a UR5 arm.

Evaluation metrics: The main metric is the success rate (SR). For the Relocate task, consistent with DexMV, \text{SR}_{10} counts episodes in which the object ends within 10 cm of the goal, and the stricter \text{SR}_3 uses a 3 cm threshold. To assess how closely the state-based policy follows the reference trajectory, the metrics E_o (mean object position error), E_h (mean fingertip position error), \text{SR}_o (fraction with E_o < 1 cm) and \text{SR}_h (fraction with E_h < 5 cm) are also used. The vision-based policy is judged mainly by \text{SR}_3.
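The threshold-based success rates are straightforward to compute from final object positions. The sketch below assumes positions in metres and treats an episode as successful when the final object-goal distance is at or below the threshold; the function name and the three example episodes are made up for illustration.

```python
import math


def success_rate(final_positions, goal_positions, threshold_m):
    """Fraction of episodes whose final object-goal distance is within threshold_m."""
    hits = sum(
        1
        for p, g in zip(final_positions, goal_positions)
        if math.dist(p, g) <= threshold_m
    )
    return hits / len(final_positions)


# Three episodes with the same goal; final errors of 2 cm, 5 cm and 20 cm.
goals = [(0.0, 0.0, 0.2)] * 3
finals = [(0.02, 0.0, 0.2), (0.05, 0.0, 0.2), (0.2, 0.0, 0.2)]

sr_10 = success_rate(finals, goals, 0.10)  # 10 cm threshold
sr_3 = success_rate(finals, goals, 0.03)   # 3 cm threshold
```

In this toy example two of three episodes pass the 10 cm test but only one passes the 3 cm test, which shows why \text{SR}_3 is the more discriminating metric when most methods already saturate \text{SR}_{10}.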

Baselines and comparison: The learning methods presented in DexMV (TRPO, SOIL, GAIL+, DAPG) were reproduced for comparison; DAPG was the best-performing method in the DexMV paper. ViViDex trains the state-based policy with PPO and compares behavior cloning and diffusion policy for the visual policy.

Main results

Relocate task (state-based policy): Table II summarizes the performance comparison on the Relocate task. The proposed method (S6: Adroit/PPO, S7: Allegro/PPO) reached \text{SR}_{10} = \text{SR}_3 = 1.00 on every object even though it was trained with a single video per object. By contrast, the DAPG-based DexMV policy (S4) dropped to \text{SR}_3=0 on some objects (for example the sugar box) and performed poorly on the large clamp and the mug. The S6/S7 policies thus achieved perfect success on all five objects, something DexMV did not reach even with tens of videos. Notably, S6 (Adroit) and S7 (Allegro) perform almost identically, which suggests that the method is robust to the specific hand model. S8, which removes the coordinate rotation augmentation from the RL stage, shows only a slight drop, which supports the generalization ability of the learned policy.

Reward function evaluation for the state-based policy: In the experiments with the three reward settings (R1, R2, R3) in Table I (see figure), the R3 setting (hand position reward in both the pre-grasp and manipulation stages) gave the best performance, as noted above: it minimized the mean fingertip error (E_h) and reached \text{SR}_3 = 1.00. This confirms that the hand-trajectory reward is essential for stable approach and grasping.

Vision-based policy results: The visual policy was trained on data from state-policy rollouts (about 100 successful trajectories). Table III reports \text{SR}_3 on the Relocate task. Increasing the number of points from 512 (V1) to 2048 (V2) raised the overall success rate from 0.86 to 0.96, and applying diffusion policy (V3) raised the average further to 0.99. In other words, denser point clouds and the diffusion model both lifted visual-policy performance. The visual policy loses a little performance relative to the state policy but still achieves a very high success rate from actual object observations alone.

Unified (multi-object) policy: Training a single policy across multiple objects under Protocol #2 also gave good results. Table IV of the paper reports the performance of a single policy covering all five objects. Adding the coordinate transformation module, including the finer transformations into the hand frames, substantially improved \text{SR}_3 (on average from 81% to 95% with behavior cloning and from 93% to 99% with diffusion). The proposed visual feature transformation therefore has a positive effect on multi-object generalization.

Pour and Place-Inside tasks: The ViViDex state-based policy also performed well on the other two tasks. In Table V, the proposed method with the Adroit hand (L5) reached a 97% success rate on Pour and 68% on Place-Inside, far above the 27% and 31% of DexMV's best method (DAPG, L4). A single video was thus enough to clearly outperform the earlier method. In a controlled comparison of BC and diffusion, both were close to 97%/68%, with diffusion slightly ahead.

Real-robot experiments: Real-world experiments used a UR5 arm, an Allegro hand and a single RGB-D camera. The state-based policies trained in simulation were executed on the real robot to collect data (point clouds and joint states), and the visual policy was then trained on this data. As summarized in Figure 4 and Table VI, the proposed unified visual policy (R3, diffusion) achieved an average success rate of 80% on the five experimental objects (per-object policies R1: 88%, unified BC R2: 72%). It also reached a 68% success rate on five unseen objects (cracker box, spray bottle and others), confirming that the model extends effectively to novel objects.

In summary, ViViDex generates high-quality trajectories from a limited number of human videos and uses them effectively for policy learning, outperforming earlier methods on all three tasks. Its generalization to multiple objects and to the real world is particularly noteworthy.

Copyright 2024, Jung Yeon Lee