📃 TF-HOT Review

pose-tracking
hand-pose
training-free
Training-Free Hand-Object Pose Tracking and Optimization for Dexterous Manipulation
Published

February 19, 2026

🔍 Ping. 🔔 Ring. ⛏️ Dig. A tiered review series: quick look, key ideas, deep dive.

  • Paper Link

Uses the Inspire Hand.

  1. ✨ TF-HOT (Training-Free Hand-Object Pose Tracking) is a new training-free pipeline that efficiently tracks and optimizes human hand and object poses from video.
  2. 💡 The method combines differentiable rendering with the rich priors of pretrained 2D foundation models (SAM2, MMPose), and optimizes poses by minimizing a multi-term loss that covers 2D constraints, 3D constraints, and physical constraints.
  3. 🚀 TF-HOT achieves state-of-the-art performance on in-the-wild videos. PTF (Pose Trajectory Following), an imitation learning method built on the extracted pose trajectories, is shown to outperform existing reinforcement learning and imitation learning methods at learning dexterous manipulation policies.


🔍 Ping Review

🔍 Ping: a light tap on the surface. Get the gist in seconds.

This paper proposes TF-HOT (Training-Free Hand-Object Pose Tracking and Optimization), a pipeline that tracks and optimizes human hand and object poses without any training, and then uses it to build an imitation learning method for robotic dexterous manipulation.

1. Introduction and Background

Dexterous manipulation is inherently difficult because of its high-dimensional action space and the scarcity of high-quality demonstration data. Many videos of human hand-object interaction exist, but frequent and dynamic occlusion makes it hard to track hand and object poses accurately and robustly, which in turn prevents extracting high-quality manipulation demonstrations from video. Existing hand-object pose estimation methods fall into two groups: learning-based methods that require large annotated datasets, and optimization-based methods that rely on multi-camera setups. Both are limited in real-world use. To address this, the paper proposes a training-free approach that leverages the rich priors of pretrained 2D foundation perception models together with differentiable rendering.

2. Core Method (TF-HOT)

Given an RGB-D video as input, the goal of TF-HOT is to estimate the hand and object poses in every frame by optimizing them jointly.

2.1. Modeling

  • Hand model: The 3D hand shape is represented with the MANO model (Romero et al., 2022). The parameters $\gamma = \{\theta, \beta, r, t\}$, consisting of hand pose ($\theta$), shape ($\beta$), global rotation ($r$), and global translation ($t$), produce the hand mesh $M(\gamma)$ and the 3D hand joints $J(\gamma)$.
  • Object model: The object model $M_{obj}$ is assumed to be known in advance; it can be obtained from single-view or multi-view images with a learning-based 3D reconstruction method. The object pose $P$ is parameterized by a quaternion and a translation vector.
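To make the object pose parameterization concrete, here is a minimal pure-Python sketch of applying a quaternion-plus-translation pose to model vertices. The function names and the (w, x, y, z) component order are assumptions made for illustration; the source does not specify a convention.

```python
import math

def quat_normalize(q):
    """Return the unit quaternion (w, x, y, z) with the same direction as q."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def quat_rotate(q, v):
    """Rotate the 3D point v by the unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    vx, vy, vz = v
    # t = 2 * (q_vec x v)
    tx = 2.0 * (y * vz - z * vy)
    ty = 2.0 * (z * vx - x * vz)
    tz = 2.0 * (x * vy - y * vx)
    # v' = v + w * t + (q_vec x t)
    return (
        vx + w * tx + (y * tz - z * ty),
        vy + w * ty + (z * tx - x * tz),
        vz + w * tz + (x * ty - y * tx),
    )

def apply_pose(q, t, vertices):
    """Apply the pose (rotation q, then translation t) to every vertex."""
    q = quat_normalize(q)
    return [
        tuple(r + ti for r, ti in zip(quat_rotate(q, v), t))
        for v in vertices
    ]
```

Normalizing the quaternion inside `apply_pose` is a convenience for optimizers that update the four components freely; whether the paper does this is not stated.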

2.2. Optimization Objective

To jointly optimize the hand and object poses $\{\gamma, P\}$ in each frame, the following loss is minimized:

$$L_{total}(\gamma, P) = \lambda_{2d}L_{2d}(\gamma) + \lambda_{render}L_{render}(\gamma, P) + \lambda_{surf}L_{surf}(\gamma, P) + \lambda_{sdf}L_{sdf}(P) + \lambda_{penetr}L_{penetr}(\gamma, P) + \lambda_{attr}L_{attr}(\gamma, P) + \lambda_{reg}L_{reg}(\gamma, P)$$

Here the $\lambda$ values are the weighting coefficients of the individual loss terms. The terms fall into three categories: constraints in 2D image space, terms that use 3D information, and regularization terms for optimization stability and physical plausibility.
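As a small illustration of how the objective is assembled, the sketch below computes the weighted sum from already-evaluated loss terms. The dictionary keys and all numeric values are hypothetical; the source does not report the $\lambda$ values.

```python
# Names of the seven loss terms in L_total, in the order they appear in the objective.
TERM_NAMES = ("2d", "render", "surf", "sdf", "penetr", "attr", "reg")

def total_loss(terms, weights):
    """Weighted sum of per-frame loss terms: sum over k of lambda_k * L_k."""
    missing = [n for n in TERM_NAMES if n not in terms or n not in weights]
    if missing:
        raise KeyError(f"missing loss terms or weights: {missing}")
    return sum(weights[n] * terms[n] for n in TERM_NAMES)

# Hypothetical numbers for illustration only.
example_weights = {name: 1.0 for name in TERM_NAMES}
example_weights["penetr"] = 10.0
example_terms = {name: 0.5 for name in TERM_NAMES}
example_total = total_loss(example_terms, example_weights)
```

In an actual pipeline each term would be a differentiable tensor expression so that gradients flow back to $\gamma$ and $P$; plain floats are used here only to show the structure.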

2.2.1. Constraints from 2D Priors

  • 2D joint projection loss ($L_{2d}$): Minimizes the Euclidean distance between the projected 3D hand joints and the reference 2D joint locations $\tilde{j}_{2d}$ predicted by MMPose.
    $$L_{2d}(\gamma) = \tilde{w}\,\|\Pi J(\gamma) - \tilde{j}_{2d}\|^2$$
    Here $\Pi$ is the projection operator and $\tilde{w}$ is the confidence of the 2D joint prediction.
  • Rendering loss ($L_{render}$): A pixel-wise mask loss that provides denser supervision. The hand and the object are rendered together so that occlusion is taken into account.
    $$M_{hand}, M_{obj} = \pi[M(\gamma), P_t M_{obj}]$$
    $$L_{render} = w_1\|M_{hand} - \tilde{M}_{hand}\|^2 + w_2\|M_{obj} - \tilde{M}_{obj}\|^2$$
    Here $\pi$ is a differentiable mask renderer, and the reference masks $\tilde{M}_{hand}$ and $\tilde{M}_{obj}$ are obtained with SAM2 (Ravi et al., 2024). On the left-hand side of the first equation, $M_{hand}$ and $M_{obj}$ denote the rendered masks; inside the renderer, $M_{obj}$ is the object model placed at pose $P_t$.
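The two 2D terms above can be sketched as follows. This is a minimal illustration under stated assumptions: $\Pi$ is taken to be a pinhole projection with intrinsics `(fx, fy, cx, cy)`, the confidence is applied per joint, and masks are flattened lists of values in [0, 1]. None of these details are confirmed by the source, and the differentiable renderer itself is not shown.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-space 3D point to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def joint_2d_loss(joints_3d, ref_2d, conf, intrinsics):
    """Confidence-weighted squared reprojection error, summed over joints."""
    fx, fy, cx, cy = intrinsics
    loss = 0.0
    for p, (u_ref, v_ref), w in zip(joints_3d, ref_2d, conf):
        u, v = project(p, fx, fy, cx, cy)
        loss += w * ((u - u_ref) ** 2 + (v - v_ref) ** 2)
    return loss

def mask_l2(rendered, reference):
    """Sum of squared per-pixel differences between two flattened masks."""
    return sum((a - b) ** 2 for a, b in zip(rendered, reference))

def render_loss(hand, hand_ref, obj, obj_ref, w1=1.0, w2=1.0):
    """L_render = w1 * ||M_hand - ref||^2 + w2 * ||M_obj - ref||^2."""
    return w1 * mask_l2(hand, hand_ref) + w2 * mask_l2(obj, obj_ref)
```

Weighting each joint by its detector confidence means that joints the 2D model is unsure about, for example occluded fingertips, pull less on the optimization.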

2.2.2. Using 3D Information

  • Visible-region surface loss ($L_{surf}$): A conventional surface loss breaks down because a single view yields only a partial point cloud. To address this, only the visible part of the mesh ($S$) is aligned with the point cloud ($P$; in this term $P$ denotes the point cloud, not the object pose).
    $$f(P, S) = w_3 \sum_{\triangle_i \in S} \min_{p_j \in P} \|p_j - \triangle_i\|^2 + w_4 \frac{|S|}{|P|} \sum_{p_i \in P} \min_{\triangle_j \in S} \|p_i - \triangle_j\|^2$$
    Here $p_i$ is the $i$-th point of the point cloud $P$ and $\triangle_j$ is the $j$-th triangle of the visible region $S$.

  • SDF loss ($L_{sdf}$): Applied to the object. It minimizes the distance between the point cloud and the surface defined by the zero-level set of the SDF (Signed Distance Function) field, which helps reach accurate results even when the object pose initialization is poor.
    $$L_{sdf}(P) = \sum_{v \in P}\|\phi(P^{-1}v)\|^2$$
    The sum runs over the points $v$ of the observed point cloud, and $P^{-1}$ is the inverse of the object pose.

$\phi(x)$ is the SDF value at position $x$ after it has been transformed into the object's canonical space.
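A minimal sketch of the SDF loss follows. It swaps in an analytic sphere SDF for the object's real SDF field and represents the pose as a rotation matrix `R` plus translation `t`; both are simplifying assumptions for illustration rather than the paper's setup.

```python
import math

def sphere_sdf(x, radius=1.0):
    """Signed distance to a sphere at the origin: negative inside, positive outside."""
    return math.sqrt(sum(c * c for c in x)) - radius

def to_canonical(R, t, v):
    """Map a camera-space point into the object's canonical space: R^T (v - t)."""
    d = [v[i] - t[i] for i in range(3)]
    return tuple(sum(R[i][j] * d[i] for i in range(3)) for j in range(3))

def sdf_loss(R, t, points, sdf=sphere_sdf):
    """Sum of squared SDF values of the observed points in canonical space."""
    return sum(sdf(to_canonical(R, t, v)) ** 2 for v in points)

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

With the correct pose, observed surface points land on the zero-level set and the loss vanishes; a wrong translation moves them off the surface and the loss grows, which is what gives the optimizer a usable signal even from a rough starting pose.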

2.2.3. Regularization and Initialization

  • Penetration loss ($L_{penetr}$): A physical constraint that prevents the hand and the object from interpenetrating. It penalizes hand vertices that lie inside the object.
    $$L_{penetr}(\gamma, P) = \sum_{v \in M(\gamma)} \left(-\mathbf{1}_{\phi(P^{-1}v)<0}\, \phi(P^{-1}v)\right)$$
  • Attraction loss ($L_{attr}$): A physical constraint that encourages contact between the fingertips and the object. For each of the five fingertips that lie outside the object, it penalizes the minimum SDF value.
    $$L_{attr}(\gamma, P) = \sum_{i=1}^{5} \min_{v \in M(\gamma)_{C_i}} \left(\mathbf{1}_{\phi(P^{-1}v)>0}\, \phi(P^{-1}v)\right)$$
    The term is activated only when the hand is considered to be in contact with the object, that is, when the maximum penetration exceeds a certain threshold.
  • Regularization loss ($L_{reg}$): Stabilizes the result by limiting frame-to-frame changes in the 3D hand joints ($j_{3d_t}$) and in the object pose transform ($T_t$).
    $$L_{reg} = w_5 \max(0, \|j_{3d_t} - j_{3d_{t-1}}\|^2 - \epsilon_1) + w_6 \max(0, \|T_t - T_{t-1}\|^2 - \epsilon_2)$$
  • Initialization: In the first frame, the object pose comes from an external object pose estimation network (FoundationPose). For the hand, several global rotations are sampled, each candidate is aligned to the center of the hand point cloud, and the one with the lowest 2D joint error is selected. Every later frame starts from the optimized pose of the previous frame.
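The three regularization terms above can be sketched directly from their formulas. The functions below take precomputed SDF values (one per hand vertex, already transformed by $P^{-1}$) rather than a mesh, and the contact gating of the attraction term is left out; both are simplifications made for illustration.

```python
def penetration_loss(hand_sdf):
    """Sum of penetration depths: -phi for every hand vertex with phi < 0."""
    return sum(-phi for phi in hand_sdf if phi < 0.0)

def attraction_loss(fingertip_sdf):
    """Sum over fingertip regions C_i of the smallest clamped SDF value.

    fingertip_sdf holds one list of SDF values per fingertip region.
    Values inside the object (phi <= 0) are clamped to 0, so a fingertip
    that already touches or penetrates the object contributes nothing.
    """
    return sum(min(max(phi, 0.0) for phi in region) for region in fingertip_sdf)

def hinge_smoothness(curr, prev, eps):
    """max(0, ||curr - prev||^2 - eps): only changes larger than eps are penalized."""
    sq = sum((a - b) ** 2 for a, b in zip(curr, prev))
    return max(0.0, sq - eps)

def reg_loss(joints_t, joints_prev, T_t, T_prev, w5, w6, eps1, eps2):
    """L_reg over flattened joint coordinates and flattened object transforms."""
    return (w5 * hinge_smoothness(joints_t, joints_prev, eps1)
            + w6 * hinge_smoothness(T_t, T_prev, eps2))
```

The hinge in `hinge_smoothness` leaves small frame-to-frame motion unpenalized, so genuine hand movement is not damped while large jumps, such as those caused by noisy depth, are discouraged.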

3. Application: Pose Trajectory Following (PTF)

The hand and object poses extracted by TF-HOT are put to use in robotic dexterous manipulation. PTF (Pose Trajectory Following) is an imitation learning method that optimizes a dexterous manipulation policy from a single pose-only demonstration. Starting from the object and hand pose trajectories produced by TF-HOT, inverse kinematics and a retargeting algorithm are applied so that the robot hand's initial pose and finger positions match the first frame of the demonstration. A dedicated trajectory-following reward is then designed to measure how far the robot hand's current state has progressed along the target pose trajectory. Finally, the policy is optimized with PPO (Proximal Policy Optimization) to maximize the sum of this trajectory-following reward and the original environment reward.
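The source does not give the exact form of the trajectory-following reward, so the sketch below is only one plausible reading: progress is measured as the normalized index of the demonstration waypoint nearest to the current state. The function names, the nearest-waypoint rule, and the `weight` parameter are all hypothetical.

```python
def progress_reward(state, trajectory):
    """Fraction of the demonstration covered by the waypoint nearest to state.

    Hypothetical reward shape: 0.0 at the first waypoint, 1.0 at the last.
    The trajectory must contain at least two waypoints.
    """
    if len(trajectory) < 2:
        raise ValueError("trajectory needs at least two waypoints")
    dists = [sum((s - w) ** 2 for s, w in zip(state, wp)) for wp in trajectory]
    nearest = dists.index(min(dists))
    return nearest / (len(trajectory) - 1)

def shaped_reward(env_reward, state, trajectory, weight=1.0):
    """Environment reward plus the (hypothetically weighted) progress reward."""
    return env_reward + weight * progress_reward(state, trajectory)
```

With `weight=1.0` this reduces to the plain sum of the two rewards that PPO maximizes according to the description above.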

4. Experiments and Results

The paper evaluates TF-HOT on the DexYCB dataset and on a self-collected in-the-wild dataset.

  • DexYCB dataset: The metrics are MPJPE (hand joint position error), J2E (2D joint pixel error), $t_{err}$ (object translation error), and $r_{err}$ (object rotation error). TF-HOT outperformed HOTrack (Chen et al., 2023) and achieved the lowest translation error in object pose estimation.
  • In-the-wild dataset: The metrics are J2E* (2D joint pixel error measured against MMPose), IoUobj (object mask IoU measured against SAM2), and SDobj (visible-region 3D surface distance). TF-HOT produced better quantitative and qualitative results than HOTrack and HOISDF (Qi et al., 2024). Of these two learning-based models, HOTrack was sensitive to point cloud quality, while HOISDF struggled to generalize to camera poses and objects absent from its training data.
  • Ablation study: Analyzing the effect of each loss term showed that removing any single term degrades performance. In particular, without the visible-region 3D surface loss, substantial misalignment occurred. Without the penetration loss the hand and object interpenetrate, without the attraction loss unrealistic grasp poses appear, and without the regularization loss the result becomes vulnerable to noise in the depth data.
  • PTF application experiments: Pickup tasks with a banana, an easy-open can, and an elephant were run in the ManiSkill 3 environment. Compared with pure PPO (reinforcement learning) and SOIL (state-only imitation learning), PTF made effective use of the demonstrations extracted by TF-HOT and solved the tasks with higher success rates and fewer samples. In particular, PTF helped keep the hand's pose relative to the object accurate, which enabled more effective grasping.
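For readers unfamiliar with two of the metrics named above, here is a minimal sketch using their standard definitions: MPJPE as the mean Euclidean distance over joints, and IoU over flattened binary masks. The paper's exact evaluation protocol (units, alignment, thresholds) is not described in this review, so treat these as generic definitions.

```python
import math

def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error: average Euclidean distance over joints."""
    if len(pred_joints) != len(gt_joints) or not pred_joints:
        raise ValueError("joint lists must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(pred_joints, gt_joints))
    return total / len(pred_joints)

def mask_iou(mask_a, mask_b):
    """Intersection over union of two flattened binary masks."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # Two empty masks are treated as a perfect match by convention here.
    return inter / union if union else 1.0
```

Note that J2E* and IoUobj compare against MMPose and SAM2 outputs rather than ground truth, so on the in-the-wild data they measure agreement with the 2D priors that the optimization itself uses.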

5. Conclusion and Limitations

TF-HOT is a framework that efficiently optimizes human hand and object pose trajectories without training, using differentiable rendering and the priors of pretrained 2D perception models. It achieves state-of-the-art performance on real-world videos, and the paper demonstrates that the extracted pose trajectories can be used effectively by imitation learning methods such as PTF to learn dexterous robot manipulation policies. As a limitation, when the hand is fully occluded or does not appear in the point cloud, the lack of 3D priors makes accurate pose estimation difficult. This could be addressed in the future with a multi-camera setup. TF-HOT also has potential as an automatic data annotation tool.

Copyright 2026, JungYeon Lee