Curieux.JY
  • JungYeon Lee

📃 Contact-Grounded Policy Review

digit360
diffusion
contact
tactile
Dexterous Visuotactile Policy with Generative Contact Grounding
Published

May 5, 2026

  • Paper Link
  • Project Page
  1. 🤖 Contact-Grounded Policy (CGP) is a visuotactile policy that grounds multi-point contact: it predicts coupled trajectories of the actual robot state and tactile feedback, then converts them into executable target robot states for a compliance controller.
  2. 💡 The policy uses a conditional diffusion model to predict future robot states and tactile data efficiently in a compressed latent space, and a learned contact-consistency mapping ensures that the intended contact is actually realized on the robot.
  3. ✅ CGP outperforms visuomotor and visuotactile diffusion-policy baselines on a range of contact-centric tasks, including in-hand manipulation, delicate grasping, and tool use, and its ablations show that both the KL-regularized latent space and the residual mapping matter.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

This paper proposes Contact-Grounded Policy (CGP) to tackle contact-rich dexterous manipulation with multi-finger robot hands. Existing imitation learning methods mostly predict kinematic trajectories and do not model contact state explicitly, so they struggle with complex contact interactions. CGP addresses this limitation by focusing on the physical realizability of contact.

1. Core Methodology

CGP recasts contact-rich manipulation as a "contact grounding" problem. Rather than treating tactile signals as just one more observation, it predicts coupled trajectories of the actual robot state (x_t) and tactile feedback (u_t), and uses a learned contact-consistency mapping to convert those predictions into executable target robot states (a_t) for a compliance controller.

CGP has two main components:

  1. Conditional Diffusion Model (\pi_\theta): Conditioned on the observation history (O_t), it predicts future trajectories of the actual robot state and tactile feedback, i.e. it samples (\hat{X}_t, \hat{U}_t) \sim \pi_\theta(\cdot \mid O_t). Here \hat{X}_t = \{\hat{x}_{t+1}, \dots, \hat{x}_{t+T}\} is the future actual-robot-state trajectory and \hat{U}_t = \{\hat{u}_{t+1}, \dots, \hat{u}_{t+T}\} is the future tactile-feedback trajectory. For efficient real-time generation, the tactile observation (u_t) is handled in a compressed latent space (h_t) produced by a KL-regularized variational autoencoder (VAE). The diffusion model is trained on Y_t = [x_{t+1:t+T}, h_{t+1:t+T}] with the loss L_{\text{diff}}(\theta) = \mathbb{E}_{(O_t, Y^0_t), \epsilon, j}\left[\| \epsilon - \pi_\theta(O_t, Y^j_t, j) \|^2\right], where Y^j_t = \alpha_j Y^0_t + \sigma_j \epsilon is the noised trajectory at diffusion step j.

  2. Learned Contact-Consistency Mapping (M_\phi): Converts each predicted pair of actual robot state (\hat{x}_{t+k}) and tactile feedback (\hat{u}_{t+k}) into a target robot state (\hat{a}_{t+k}) that the controller can execute. The mapping is built in residual form: it predicts an offset from the current actual state (x_t), which stabilizes training and yields more robust targets under the compliance controller. The mapping is written a_t = M_\phi(x_t, u_t). At inference time, the predicted future trajectory gives \hat{a}_{t+k} = M_\phi(\hat{x}_{t+k}, \hat{u}_{t+k}), the compliance controller tracks these targets, and the policy replans in a receding-horizon manner.
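The forward-noising rule and noise-prediction loss in item 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the array sizes, the values of `alpha_j` and `sigma_j`, and the helper names `noised_trajectory` and `diffusion_loss` are all assumptions made for the example.

```python
import numpy as np

def noised_trajectory(y0, eps, alpha_j, sigma_j):
    """Forward noising: Y^j = alpha_j * Y^0 + sigma_j * eps."""
    return alpha_j * y0 + sigma_j * eps

def diffusion_loss(eps_pred, eps):
    """Squared noise-prediction error, summed per trajectory, averaged over the batch."""
    return float(np.mean(np.sum((eps - eps_pred) ** 2, axis=(-2, -1))))

rng = np.random.default_rng(0)
T, dim_x, dim_h = 16, 4, 3                        # illustrative sizes only
y0 = rng.standard_normal((8, T, dim_x + dim_h))   # batch of clean [x, h] trajectories
eps = rng.standard_normal(y0.shape)               # Gaussian noise
alpha_j, sigma_j = 0.8, 0.6                       # assumed noise level (0.8^2 + 0.6^2 = 1)

yj = noised_trajectory(y0, eps, alpha_j, sigma_j)

# An oracle that inverts the noising recovers eps, so its loss is ~0.
eps_oracle = (yj - alpha_j * y0) / sigma_j
loss_perfect = diffusion_loss(eps_oracle, eps)

# Predicting "no noise" leaves the full noise energy as loss.
loss_zero_guess = diffusion_loss(np.zeros_like(eps), eps)
```

In training, `eps_pred` would come from the network \pi_\theta(O_t, Y^j_t, j); the oracle here only checks that the two formulas are mutually consistent.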

2. Technical Details

  • The notion of contact grounding: CGP represents contact as the triplet (actual robot state x_t, tactile feedback u_t, target robot state a_t). Instead of modeling contact locations or modes explicitly, this approach represents contact indirectly through signals that are measurable and controllable under a specific tactile sensor and compliance-controller setup.
  • Latent Tactile Generation: To process high-dimensional tactile data efficiently, a VAE compresses u_t into a latent representation h_t \in \mathbb{R}^M. KL regularization helps keep the compressed latent space well structured and therefore suitable for the diffusion model.
  • Implementation Choices:
    • Tactile encoder and decoder: In simulation, a 1D ResNet-based model handles dense tactile arrays; on the real robot, a 2D ResNet-based model handles the Digit360 sensors (vision-based tactile images). Each is designed for its modality, but both follow the same training objective.
    • Visual encoder and diffusion: Follows the U-Net-based conditional diffusion model and DDIM sampling of Diffusion Policy [4]. On the real robot, each tactile image is encoded individually and then aggregated through cross-sensor self-attention.
    • Contact-consistency mapping: Implemented as a lightweight network. In simulation, the tactile latent code is decoded and re-encoded, then concatenated with the actual robot state and fed to an MLP. On the real robot, for real-time deployment, the tactile latent is concatenated directly with the actual robot state and fed to the MLP.

3. Experiments and Results

CGP was evaluated in simulation (Tesollo DG-5F hand, dense tactile arrays) and on a real robot (Allegro V5 hand, Digit360 sensors) across several contact-rich manipulation tasks: In-Hand Box Flipping, Fragile Egg Grasping, Dish Wiping, and Jar Opening.

  • Performance comparison: CGP consistently outperformed the visuomotor and visuotactile diffusion policy baselines. The gains were most pronounced on tasks that demand sustained or delicate contact, such as Dish Wiping, In-Hand Box Flipping, and Jar Opening.
  • Evidence of contact grounding: Time-aligned rollout snapshots of predicted versus actually observed tactile signals show that the contact CGP predicts is in fact realized during execution. In other words, CGP does not merely predict plausible tactile outcomes; it generates controllable interaction targets that reproduce the predicted contact evolution.
  • Hand Configuration Prediction: A controlled experiment validating the contact-consistency mapping showed that both the actual robot state and the tactile feedback are needed for accurate prediction. Residual prediction reduced error relative to absolute prediction, which suggests that contact grounding works best when modeled as a contact-conditioned correction around the actual state.
  • Tactile Reconstruction and Compression: KL regularization can slightly increase reconstruction error, but it is important for producing a well-structured latent space that stabilizes diffusion-based prediction. This carries over to better downstream policy performance.
  • Time Efficiency: Even though it also models future tactile feedback and contact-consistent targets, CGP achieves inference latency comparable to the visuomotor and visuotactile diffusion policy baselines.

4. Limitations and Future Work

  • Sensor and controller specificity: CGP's central limitation is its dependence on a specific sensor type and compliance-controller setup. Changing the sensor type or the controller configuration requires retraining. Future work aims to improve generalization through cross-sensor and cross-controller co-training, and through conditioning on controller parameters and physical robot parameters (e.g. impedance gains).
  • Single-task training: CGP has so far been validated only under a single-task training and evaluation protocol. Scaling to a broader task distribution will likely require cross-task co-training on more diverse demonstrations and interactions.

🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Introduction: Why this paper matters to multi-finger hand researchers

If you think carefully about what it means to handle an object with a multi-finger hand, you run into a curious fact. Lifting a cup is not a question of "where should the finger joint angles go?" but of "which finger should press where, and how hard?" Yet most of our policy learning models predict only the former. They emit "target joint angles" and trust the PD controller and the physical world downstream to sort out the rest.

The problem is that they don't. Target angles produce reasonable contact inside the data distribution the model was trained on, but when the model meets a new object it either grips too hard and breaks it, or grips too weakly and lets it slip. That is because the model has never reasoned about how the contact itself should evolve.

Contact-Grounded Policy (CGP), from Meta Reality Labs Research and Purdue at RSS 2026, targets exactly this point. Its question is: "If we rethink the policy's output action from the controller's point of view, isn't it ultimately a command that creates contact?" A rather elegant system falls out of that simple shift in perspective. Because it uses the Allegro V5 hand and Digit360, it should resonate especially with people working on the same platform.

Problem definition: Why is multi-finger manipulation so hard?

The limits of kinematic targets

Diffusion Policy (DP) style policies have delivered impressive results in imitation learning over the last few years. But they almost always predict a "target robot state". That is, they emit only the reference the controller should track and pay no attention to what contact that reference will actually produce.

For free-space motion such as pick-and-place this is not a big problem. Multi-finger manipulation, however, almost always involves multi-point contact, friction, and micro-slip at the same time. For the same target angle, a slight change in object shape or friction coefficient produces a completely different contact patch. So even a small departure from the training distribution drops the policy into one of two failure modes.

  • Overly stiff motions: The target sits deeper than the pose the hand can actually reach, so the PD controller pushes out large torques and crushes the object. Fatal for tasks like grasping a fragile egg.
  • Insufficient force → slip: The target does not press in far enough, friction falls short, and the object slides out between the fingers. Common in box flipping and jar opening.

Both patterns show up clearly in the baseline videos the paper presents. Visuotactile DP does take touch as an observation, but its output is still a kinematic target, so it falls into the same trap.

Key insight: Contact is a "triangle"

Paraphrased, the single most important sentence of the paper is this:

Under a fixed tactile sensor and compliance-controller setup, the contact state is implicitly defined by the triplet (actual robot state, tactile feedback, controller reference).

Why this is natural becomes obvious from the PD controller's point of view. The PD controller on each joint is essentially a virtual spring-damper.

\tau_j = K_p (q^{\text{target}}_j - q^{\text{actual}}_j) - K_d \dot{q}_j

If K_p and K_d are fixed, this equation says the following.

  • The gap between target and actual is the torque.
  • At steady state, what sustains that gap is the external contact force.
  • So (target - actual) is itself a proportional measurement of the external contact force.
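A small numeric sketch of the spring argument above, under the simplifying assumption of a single joint at rest (zero velocity, no gravity or friction). The gains and the contact torque are hypothetical values chosen for illustration, and `steady_state_gap` is a helper introduced here, not something from the paper.

```python
def pd_torque(q_target, q_actual, qdot, kp, kd):
    """Joint-space PD law: tau = Kp * (q_target - q_actual) - Kd * qdot."""
    return kp * (q_target - q_actual) - kd * qdot

def steady_state_gap(tau_ext, kp):
    """At rest (qdot = 0) the PD torque balances the contact torque: Kp * gap = tau_ext."""
    return tau_ext / kp

kp, kd = 2.0, 0.1      # hypothetical gains
tau_ext = 0.3          # hypothetical contact torque resisting the joint (N*m)

# The gap the controller settles at is proportional to the contact torque.
gap = steady_state_gap(tau_ext, kp)          # about 0.15 rad

# Plugging that gap back into the PD law returns the contact torque.
tau = pd_torque(q_target=0.5, q_actual=0.5 - gap, qdot=0.0, kp=kp, kd=kd)

# With a very stiff controller the gap collapses, and with it the information
# that (target - actual) carries about contact.
stiff_gap = steady_state_gap(tau_ext, kp=1e6)
```

Doubling `tau_ext` doubles `gap`, which is the sense in which (target - actual) acts as a force measurement when the gains are fixed.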

Add the tactile sensor (the contact distribution over the skin) and we have the "where / how much / how" of contact. Viewed as one bundle, the three pieces look like this:

+------------+        spring force        +------------+
|  TARGET    | <------------------------> |  ACTUAL    |
|  STATE     |   (PD controller spring)   |  STATE     |
+------------+                            +------------+
       \                                       /
        \                                     /
         \                                   /
          \         creates contact         /
           v                               v
              +----------------------+
              |  TACTILE FEEDBACK    |
              |  (where & how hard)  |
              +----------------------+

This triangle is the core of the whole CGP design. The intuition is that if you know any two sides, the third can be recovered by a learnable mapping.

Method: Taking the CGP pipeline apart

The big picture: Division of labor between two components

CGP splits, surprisingly simply, into two modules.

  • \pi_\theta (conditional diffusion trajectory generator): Given the current observation O_t, generates a sequence of (actual robot state, tactile feedback) pairs over a future horizon T.
  • M_\phi (contact-consistency mapping): Takes the (actual, tactile) pair at each time step and infers the target robot state that would produce it.

Why does this division matter? If you map directly from observations to targets (the typical DP), the policy has to learn implicitly "how will the controller react when I send this target, and what contact will result?" That relationship is highly distribution dependent and breaks easily on new objects.

CGP says instead: "First draw the contact evolution you want (the state-tactile trajectory). Then solve separately for the reference the real controller must receive to produce that contact." It is a bit like how a person grasping a cup does not think "move this finger joint to angle X" but rather "the thumb presses gently on the side, the index finger supports the back."

flowchart LR
    subgraph Obs["Observation O_t"]
        V[Vision: RGB camera]
        S[Proprioception: q_actual]
        T[Tactile: latent z_tac]
    end

    Obs --> Pi["π_θ<br/>Conditional Diffusion<br/>(Latent Space)"]

    Pi --> Pred["Predicted trajectory<br/>(s_t+1..t+T, z_tac_t+1..t+T)"]

    Pred --> Mphi["M_φ<br/>Contact-Consistency<br/>Mapping"]

    Mphi --> Tgt["target robot state<br/>q_target_t+1..t+T"]

    Tgt --> Ctrl["Compliance<br/>Controller<br/>(PD + impedance)"]

    Ctrl --> Robot["Robot<br/>(Allegro V5 / Tesollo DG-5F)"]

    Robot -.observation.-> Obs

    style Pi fill:#cfe8ff,stroke:#1a73e8
    style Mphi fill:#ffd9b3,stroke:#e8710a
    style Ctrl fill:#d4edda,stroke:#28a745

์ปดํฌ๋„ŒํŠธ 1: ์กฐ๊ฑด๋ถ€ ํ™•์‚ฐ ๊ถค์  ์ƒ์„ฑ๊ธฐ \pi_\theta

๋…ผ๋ฌธ์—์„œ๋Š” \pi_\theta๋ฅผ diffusion-policy ์Šคํƒ€์ผ๋กœ ํŒŒ๋ผ๋ฏธํ„ฐํ™”ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ๋…ธ์ด์ฆˆ์—์„œ ์ถœ๋ฐœํ•ด ์ ์ง„์  ๋””๋…ธ์ด์ง•์„ ํ†ตํ•ด ๊ถค์ ์„ ์ƒ˜ํ”Œ๋งํ•˜์ง€์š”. ๋‹ค๋งŒ ์ž…๋ ฅ/์ถœ๋ ฅ ๊ตฌ์„ฑ์ด ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

์ž…๋ ฅ (์กฐ๊ฑด):

  • ๋น„์ „ ์ธ์ฝ”๋”๋กœ ์••์ถ•ํ•œ RGB ํŠน์ง•
  • ํ˜„์žฌ ๊ด€์ ˆ ์ƒํƒœ q_t (proprioception)
  • VAE๋กœ ์ธ์ฝ”๋”ฉ๋œ ์ž ์žฌ ์ด‰๊ฐ z^\text{tac}_t

์ถœ๋ ฅ (์ƒ์„ฑ):

  • ๋ฏธ๋ž˜ 16 step์˜ (s_{t+1:t+T}, z^\text{tac}_{t+1:t+T}) ๊ถค์ 

์˜ˆ์ธก horizon์€ 16 step, ๊ทธ์ค‘ 8 step๋งŒ ์‹คํ–‰ํ•˜๊ณ  ๋‹ค์‹œ replanningํ•ฉ๋‹ˆ๋‹ค. ์ „ํ˜•์ ์ธ receding-horizon imitation ํŒจํ„ด์ด์ง€์š”.
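The predict-16, execute-8, replan pattern can be sketched independently of any model. In this minimal sketch `plan_fn` and `execute_fn` are hypothetical stand-ins for the policy and the controller interface, and the toy "plant" below just counts steps.

```python
def receding_horizon_rollout(plan_fn, execute_fn, n_replans, horizon=16, n_exec=8):
    """Repeatedly plan `horizon` targets, execute the first `n_exec`, then replan."""
    executed = []
    for _ in range(n_replans):
        plan = plan_fn(horizon)            # full predicted chunk
        for target in plan[:n_exec]:       # only the first part is trusted
            execute_fn(target)
            executed.append(target)
    return executed

# Toy stand-ins: the "state" is a step counter, and each plan predicts the
# next `horizon` counter values from wherever the system currently is.
state = {"t": 0}

def plan_fn(horizon):
    return [state["t"] + i + 1 for i in range(horizon)]

def execute_fn(target):
    state["t"] = target

executed = receding_horizon_rollout(plan_fn, execute_fn, n_replans=3)
```

Each replan starts from the state reached after the previous 8 executed steps, so the discarded second half of every plan is re-predicted with fresher observations.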

The training objective is the standard diffusion training loss:

\mathcal{L}_\text{diff} = \mathbb{E}_{\tau, \epsilon, k} \left[ \big\| \epsilon - \epsilon_\theta(\tau_k, k, O_t) \big\|^2 \right]

Here \tau is the ground-truth (state, latent-tactile) trajectory (not the joint torque \tau_j of the PD law above), \tau_k is its noised version at denoising step k, and \epsilon_\theta is the noise-prediction network. At inference time, sampling is done quickly with 8-step DDIM denoising.
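For readers who have not implemented DDIM, here is a minimal sketch of the deterministic update (the eta = 0 case), written in the alpha/sigma parameterization Y^j = \alpha_j Y^0 + \sigma_j \epsilon used in the Ping summary. The cosine-style schedule, the array sizes, and the oracle noise model are assumptions made so the example runs; the paper's actual schedule is not given in this review.

```python
import numpy as np

def ddim_step(y_k, eps_hat, alpha_k, sigma_k, alpha_prev, sigma_prev):
    """One deterministic DDIM update from noise level k to the next, cleaner level."""
    y0_hat = (y_k - sigma_k * eps_hat) / alpha_k     # current estimate of the clean sample
    return alpha_prev * y0_hat + sigma_prev * eps_hat

def ddim_sample(eps_model, shape, alphas, sigmas, rng):
    """alphas/sigmas run from the noisiest level (index 0) to clean (alpha=1, sigma=0)."""
    y = rng.standard_normal(shape)
    for i in range(len(alphas) - 1):
        eps_hat = eps_model(y, i)
        y = ddim_step(y, eps_hat, alphas[i], sigmas[i], alphas[i + 1], sigmas[i + 1])
    return y

# 8 denoising steps need 9 noise levels; this schedule is an assumption.
t = np.linspace(0.95, 0.0, 9)
alphas, sigmas = np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

rng = np.random.default_rng(0)
y0_true = rng.standard_normal((16, 7))               # stand-in "clean" trajectory

def oracle(y, i):
    # Oracle noise predictor for a dataset containing only y0_true.
    return (y - alphas[i] * y0_true) / sigmas[i]

sample = ddim_sample(oracle, y0_true.shape, alphas, sigmas, rng)
```

With the oracle in place of a trained \epsilon_\theta, the sampler lands on `y0_true`, which is a quick way to check that the update rule and the schedule indexing agree.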

์ปดํฌ๋„ŒํŠธ 2: ์ ‘์ด‰-์ผ๊ด€์„ฑ ๋งคํ•‘ M_\phi

์ด ๋ชจ๋“ˆ์ด CGP์˜ ์ง„์งœ ๋ณธ์งˆ์ž…๋‹ˆ๋‹ค. ์ˆ˜์‹์ ์œผ๋กœ๋Š”

q^\text{target}_t = M_\phi(s_t, \text{tac}_t)

a simple function, but the meaning runs deep. It is a model that has learned to answer: "If I am in this actual state and receiving this tactile signal, what reference must the controller be operating under?"

Why is this learnable? If the compliance controller (fixed K_p, K_d) and the sensor setup are fixed, the mapping is in theory close to a well-defined inverse function. In the real world, friction, impacts, and non-rigid effects mean it is not a clean inverse, but the paper's empirical claim is that a neural network can approximate the relationship well from data.

The training data comes naturally out of teleoperation demonstrations: (target, actual, tactile) are all logged at every step, so supervised regression is enough:

\mathcal{L}_M = \mathbb{E}_{(s, \text{tac}, q^\text{target}) \sim \mathcal{D}} \left[ \big\| q^\text{target} - M_\phi(s, \text{tac}) \big\|^2 \right]

Why does this factorization matter? When the policy draws a future (state, tactile) trajectory, it is describing "what should physically happen." M_\phi then translates that description into a reference the controller can actually realize. With this split, the policy only has to model contact evolution without knowing the controller dynamics, and the mapping only has to solve a simple regression problem under the assumption that the controller is fixed. That is the appeal of the division of labor.
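The Ping summary notes that the mapping is implemented in residual form (predicting an offset from the actual state) rather than predicting the target outright. The sketch below shows what that regression setup looks like, with a plain least-squares model standing in for the paper's lightweight network. The synthetic data, the sizes, and the helpers `fit_linear` and `predict_linear` are all hypothetical, and because the toy data are exactly linear the example shows only the form of the regression, not the reported accuracy benefit.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim_q, dim_z = 512, 4, 3                       # illustrative sizes only

s = rng.uniform(-1.0, 1.0, (n, dim_q))            # actual joint states
z = rng.standard_normal((n, dim_z))               # tactile latents
W_true = 0.1 * rng.standard_normal((dim_z, dim_q))
q_target = s + z @ W_true                         # synthetic: target = actual + contact offset

def fit_linear(features, labels):
    """Least-squares fit with a bias column (stand-in for training an MLP)."""
    X = np.hstack([features, np.ones((len(features), 1))])
    coef, *_ = np.linalg.lstsq(X, labels, rcond=None)
    return coef

def predict_linear(coef, features):
    X = np.hstack([features, np.ones((len(features), 1))])
    return X @ coef

feats = np.hstack([s, z])

# Residual form: regress the offset (q_target - s), then add the actual state back.
coef_res = fit_linear(feats, q_target - s)
q_hat = s + predict_linear(coef_res, feats)
mse = float(np.mean((q_hat - q_target) ** 2))
```

The design point is that the regression label is the small, contact-dependent gap rather than the full joint configuration, so the model's capacity goes into the part of the target that touch actually explains.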

์ปดํฌ๋„ŒํŠธ 3: ์ž ์žฌ ์ด‰๊ฐ ์ƒ์„ฑ (Latent Tactile Generation)

๋‹ค์ง€ ์ด‰๊ฐ ์„ผ์„œ์˜ raw ์ถœ๋ ฅ์€ ๋ฌด์ง€ํ•˜๊ฒŒ ํฐ ์ฐจ์›์ž…๋‹ˆ๋‹ค. Allegro V5์— ๋ถ€์ฐฉ๋œ Digit360 ๊ฐ™์€ vision-based tactile sensor๋Š” fingertip๋งˆ๋‹ค ์ˆ˜๋งŒ ํ”ฝ์…€์˜ ์ด๋ฏธ์ง€๋ฅผ, dense tactile array(Tesollo DG-5F์˜ ๊ฒฝ์šฐ)๋Š” ์ˆ˜๋ฐฑ ์ฑ„๋„์˜ ์••๋ ฅ๊ฐ’์„ ๋งค ์‹œ์  ๋ฑ‰์–ด๋ƒ…๋‹ˆ๋‹ค. ์ด๊ฑธ ๊ทธ๋Œ€๋กœ 16-step horizon์œผ๋กœ ์ƒ์„ฑํ•˜๋ ค๋ฉด ์‹œ๊ฐ„๋„ ๋ฉ”๋ชจ๋ฆฌ๋„ ํญ๋ฐœํ•˜์ง€์š”.

ํ•ด๊ฒฐ์ฑ…์€ latent diffusion์—์„œ ์ต์ˆ™ํ•œ ๊ทธ ํŒจํ„ด์ž…๋‹ˆ๋‹ค โ€” VAE๋กœ ์••์ถ•ํ•œ ํ›„ ์ž ์žฌ ๊ณต๊ฐ„์—์„œ ๋””๋…ธ์ด์ง•.

z^\text{tac}_t = E_\psi(\text{tac}_t), \qquad \widehat{\text{tac}}_t = G_\psi(z^\text{tac}_t)

The key here is KL regularization. With a plain autoencoder the latent space ends up patchy, which makes denoising unstable. A KL penalty that keeps the latent distribution close to a unit Gaussian yields a smooth manifold that the diffusion model can handle well. The paper's ablation reports that this KL regularization contributes to both stability and downstream performance.
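As a reminder of what the KL penalty computes, here is a minimal sketch of the closed-form KL divergence between a diagonal Gaussian posterior and a unit Gaussian, plus a beta-weighted VAE loss. The weight `beta`, the squared-error reconstruction term, and the toy arrays are assumptions; the review does not state the paper's exact loss weighting.

```python
import numpy as np

def kl_to_unit_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def vae_loss(x, x_recon, mu, logvar, beta):
    """Reconstruction error plus a beta-weighted KL penalty, averaged over the batch."""
    recon = np.sum((x - x_recon) ** 2, axis=-1)
    return float(np.mean(recon + beta * kl_to_unit_gaussian(mu, logvar)))

# A posterior that already equals N(0, I) pays no penalty.
kl_zero = float(kl_to_unit_gaussian(np.zeros(4), np.zeros(4)))

# Shifting one latent mean by 1 costs 0.5 * 1^2 = 0.5 nats.
kl_shift = float(kl_to_unit_gaussian(np.array([1.0, 0.0]), np.zeros(2)))

# Toy batch: perfect reconstruction, unit-Gaussian posterior.
x = np.ones((2, 3))
loss = vae_loss(x, x, np.zeros((2, 4)), np.zeros((2, 4)), beta=1e-3)
```

Raising `beta` pulls latents toward the unit Gaussian at some cost in reconstruction error, which matches the trade-off described in the tactile reconstruction results above.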

Full inference algorithm (pseudocode)

# CGP inference step (receding horizon: predict 16 steps, execute 8, replan).
# Sketch only: the encoders, the denoiser and M_phi are passed in as callables,
# and obs_len (observation-history length) is an assumed parameter.
import numpy as np

def cgp_step(images, q_actual_history, tactile_history, *,
             visual_encoder, vae_encoder, denoise, m_phi,
             dim_state, dim_z_tac, obs_len,
             horizon=16, n_exec=8, ddim_steps=8, rng=None):
    rng = np.random.default_rng() if rng is None else rng

    # 1. Encode recent tactile observations into the VAE latent space
    z_tac_t = vae_encoder(tactile_history[-obs_len:])

    # 2. Form the conditioning context O_t
    O_t = {
        "vision": visual_encoder(images[-obs_len:]),
        "state": q_actual_history[-obs_len:],
        "tactile_latent": z_tac_t,
    }

    # 3. Sample a future trajectory with DDIM (8 denoising steps)
    tau = rng.standard_normal((horizon, dim_state + dim_z_tac))
    for k_step in reversed(range(ddim_steps)):
        tau = denoise(tau, k_step, O_t)

    # Split into actual-state and latent-tactile parts: (T, dim_state), (T, dim_z_tac)
    s_future, z_tac_future = tau[:, :dim_state], tau[:, dim_state:]

    # 4. Map each (state, latent-tactile) pair to a target robot state.
    #    On the real robot the latent is fed to M_phi directly; the simulation
    #    variant decodes and re-encodes the tactile latent first.
    q_targets = [m_phi(s_future[h], z_tac_future[h]) for h in range(horizon)]

    # 5. Execute the first n_exec of the predicted targets, then replan
    return q_targets[:n_exec]

The compliance controller: Division of labor between hand and arm

The foundation CGP sits on also deserves attention. The hand runs joint-space PD and the arm runs operational-space impedance, which together form a whole-body compliance structure. This setup matters in two ways.

  1. It does not break when the target is slightly off: Unlike stiff control, compliance control reacts softly to collisions with the environment and to prediction error. It is a safety net that physically absorbs small errors in the learned policy.
  2. It is a precondition for the triangle: The (target, actual, tactile) triangle above only means something when there is compliance. Under an infinitely stiff controller, actual always equals target and the information disappears. PD-based compliance creates the "gap" actual ≠ target, and that gap is the contact information.

The details JungYeon dealt with while migrating to IsaacLab (PD vs PID, gain handling, the changed angular_damping default) are exactly the knobs that govern these compliance dynamics. For CGP to work in sim2real, getting this part right is bound to be decisive.

Experiments: Does it really work?

Hardware and tasks

| Environment | Hand | Tactile sensor | Tasks |
|---|---|---|---|
| Sim | Tesollo DG-5F (5-finger) | Dense whole-hand tactile array | Fragile Egg Grasping, Dish Wiping, In-Hand Box Flipping |
| Real | Allegro V5 (4-finger) | Digit360 fingertip (vision-based) | Jar Opening, In-Hand Box Flipping |

What is interesting is that the same framework works across two kinds of tactile sensing (dense array vs vision-based). Only the VAE backbone needs to be swapped, which supports the claim that the latent tactile diffusion design is general.

Data is collected by teleoperation: mocap-based hand tracking for the real robot, VR teleoperation in simulation. This is the same kind of data collection infrastructure as the MANUS Core 3 + ROS2 glove teleoperation and the GeoRT/dex-retargeting line of work that JungYeon is familiar with.

Three evaluation axes

The paper splits its evaluation cleanly into three strands.

  1. End-to-end policy success rate: Closed-loop rollout success rate on 3 simulated and 2 real tasks.
  2. Contact-consistency mapping in isolation: M_\phi evaluated on its own, measuring (state, tactile) → target regression accuracy and generalization.
  3. Latent tactile representation analysis: The effect of design choices such as KL regularization on/off, latent dimension, and VAE backbone on downstream performance.

This separation is a very good evaluation design. End-to-end success rates alone cannot tell you why something worked, and component-level evaluations alone cannot tell you whether the whole system really works once integrated. The paper shows both.

Results summary: Qualitative differences from the baselines

The paper compares against visuomotor DP and visuotactile DP as baselines. For exact numbers it is best to consult the paper, but the qualitative pattern is as follows.

  • In-Hand Box Flipping: Visuomotor DP fails by slipping. Visuotactile DP fails with an incomplete flip. CGP completes the task by shifting its multi-point contacts step by step.
  • Fragile Egg Grasping: The baselines break the egg with overly stiff motions. CGP maintains a gentle contact.
  • Dish Wiping: The task requires holding a steady pressure while following a curved surface. The baselines apply too little or too much pressure. CGP evolves the contact to follow the changing curvature.

Prediction check: Does the "prophecy" come true?

One of the most interesting qualitative results is the time-aligned comparison of predicted and observed touch. The future tactile signal \widehat{\text{tac}}_{t+h} that CGP predicts at time t and the signal \text{tac}_{t+h} actually observed later are aligned on the time axis and overlaid visually, and they nearly coincide.

This is significant because it is evidence that the policy is not merely predicting a plausible action. It is issuing a physical prophecy ("the contact I create will evolve like this") and then actually realizing it. For a diffusion world model the essential question is how well rollouts match the environment, and CGP's latent tactile prediction passes that check quite naturally.

Visual robustness

An interesting result the authors highlight separately: CGP is robust to visual disturbance. The task continues even when the camera view is partially occluded during box flipping. Intuitively, this is because the policy does not rely on vision alone but also uses touch and proprioception for grounding. When vision drops out, the other two sources temporarily carry more of the weight. For the same reason, its robustness to visual corruption comes out better than that of Visuotactile DP.

Inference time

Having introduced latent-space diffusion, the obvious question is whether it runs at a usable speed. In the inference-time comparison of Figure 7, CGP with 8-step DDIM reaches inference times on par with visuomotor and visuotactile DP. In other words, thanks to latent compression it is much faster than generating raw tactile signals directly, and it makes richer predictions without much overhead relative to the baselines.

Critical discussion

Strength: An elegant division of labor

The greatest strength of this work is the cleanliness of its abstraction.

  1. Implicit learning of the contact representation: Rather than modeling contact location, mode, and friction one by one, it learns from data under the assumption that the triplet captures them. This is the opposite philosophy to explicit MPC lines such as CTR (Contact Trust Region), but it is carried through consistently.
  2. Controller-aware learning: Explicitly mapping the policy output to a controller reference is CGP's biggest differentiator. Most imitation learning policies are blind to how the environment will interpret their actions, whereas CGP pulls that interface into the learning problem.
  3. Modality-agnostic latent design: Dense arrays and vision-based tactile sensors are handled by the same framework. This structure lends itself to future extension to other sensors such as GelSight, DIGIT, ReSkin, and BioTac.

Limitation 1: The fixed-controller assumption

CGP explicitly assumes a fixed compliance controller and sensor setup. In practical terms this means:

  • Changing K_p or K_d requires retraining M_\phi. It may fit poorly with stiffness scheduling or variable impedance control, both widely used in contact-rich work these days.
  • Swapping the sensor means retraining both the VAE and M_\phi. That can be a burden in industrial deployment.

In essence this is a trade-off that sidesteps the cost of system identification by paying in data collection and supervised learning. It runs parallel to the friction modeling and system identification work JungYeon carried out on the Allegro while distributing the cost differently, which makes the comparison interesting.

Limitation 2: Dependence on teleoperation data

CGP is imitation learning, so it needs demonstration data, and multi-finger teleoperation is still an expensive resource. The paper does not directly answer the following questions.

  • How does performance scale with the number of demonstrations? (50 vs 200 vs 1000)
  • Does an M_\phi learned on one task transfer to another? (In theory it should, as long as the controller and sensor are the same.)
  • Can it be combined with RL lines such as HORA, RotateIt, and AnyRotate? (That is, self-collect data with RL and smooth it out with CGP's contact grounding.)

Limitation 3: Unknown scope of generalization

All tasks shown in the paper involve rigid or nearly rigid objects. Whether latent tactile prediction stays stable in the truly hard contact-rich domains, such as deformable objects (in the wiping task the tool itself is rigid), viscous fluids, and granular media, is a separate question. Whether the KL-regularized latent space can represent out-of-distribution contact patterns (e.g. vibration, impact, partial slip) appears to need further experiments.

Limitation 4: Potential use as a world model

I think one aspect of this paper that I find very appealing is left latent and unexploited: CGP's latent tactile predictor is effectively a small world model, since it predicts future (state, tactile). Could it then serve as the dynamics model in model-based RL, or as a prediction backbone for planning? The paper does not address this directly, but it is an interesting follow-up direction for dexterous MBRL or VLA + RL hybrid lines of work.

Comparison with related work: where does CGP sit?

flowchart TB
    A[Contact-Rich<br/>Dexterous Manipulation]

    A --> RL[Model-Free RL<br/>Approaches]
    A --> MPC[Model-Based<br/>MPC Approaches]
    A --> IL[Imitation Learning<br/>Approaches]

    RL --> HORA["HORA<br/>(in-hand rotation, RL)"]
    RL --> AnyR["AnyRotate<br/>RotateIt"]
    RL --> DEX["DeXtreme<br/>(massive sim)"]

    MPC --> CTR["CTR<br/>Contact Trust Region"]

    IL --> DP["Diffusion Policy<br/>(visuomotor)"]
    IL --> VtacDP["Visuotactile DP<br/>(tactile as obs)"]
    IL --> RDP["Reactive Diffusion<br/>Policy (slow-fast)"]
    IL --> HDP["Hierarchical DP<br/>(contact guidance)"]
    IL --> CGP["**Contact-Grounded<br/>Policy (this paper)**"]

    style CGP fill:#cfe8ff,stroke:#1a73e8,stroke-width:3px

CGP vs Visuotactile Diffusion Policy

This is the most direct point of comparison. Both take visual and tactile input and predict with diffusion. The only difference is what they predict.

  • Visuotactile DP: output = future target robot state (kinematic).
  • CGP: output = future (actual robot state, tactile latent), which is then converted to a target through M_\phi.

This one-line difference creates a large gap in contact realization. CGP's output is essentially "the contact I want to produce, itself", whereas the visuotactile DP's output is "a reference that is presumed to produce that contact". In the latter case the policy has to learn the controller dynamics implicitly.
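The two output interfaces can be put side by side in a short sketch. It is only an interface illustration under stated assumptions: `sample_trajectory` is a random stand-in for a conditional diffusion sampler, `m_phi` is a placeholder rather than a trained model, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
HORIZON, D_STATE, D_TAC = 8, 16, 8  # hypothetical sizes


def sample_trajectory(dim):
    """Random stand-in for a conditional diffusion sampler (shapes only)."""
    return rng.normal(size=(HORIZON, dim))


def visuotactile_dp_step():
    # The sampled trajectory is used directly as the controller target.
    return sample_trajectory(D_STATE)


def m_phi(actual, tactile_latent):
    """Placeholder mapping; a trained M_phi would go here."""
    offset = 0.1 * tactile_latent.mean(axis=-1, keepdims=True)
    return actual + offset


def cgp_step():
    # The sampled trajectory is the predicted (actual state, tactile latent).
    traj = sample_trajectory(D_STATE + D_TAC)
    actual, tactile_latent = traj[:, :D_STATE], traj[:, D_STATE:]
    # It is then translated into controller targets by the mapping.
    return m_phi(actual, tactile_latent)


dp_targets = visuotactile_dp_step()
cgp_targets = cgp_step()
```

Both functions hand the controller an array of the same shape. What differs is where the controller's behavior is accounted for: inside the sampled output itself, or in an explicit mapping applied afterwards.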

CGP vs Reactive Diffusion Policy (RDP)

RDP uses a slow-fast hierarchical structure: latent diffusion plus fast fine-tuning from tactile feedback. The similarities are that both work in a latent space and both use tactile signals. The differences:

  • RDP's fast network is a closed-loop tuner that fine-adjusts the latent action chunk according to tactile feedback.
  • CGP's M_\phi is a static mapping from (state, tactile) to target.

In other words, RDP emphasizes reaction speed, while CGP emphasizes the accuracy of contact grounding. Combining the two lines looks interesting: apply CGP's contact-consistency mapping at a fast rate and do the fine correction in the RDP style.

CGP vs Hierarchical Diffusion Policy (HDP)

HDP explicitly predicts the contact "location" and generates a trajectory conditioned on it. CGP handles contact implicitly through the (state, tactile) latent.

| Aspect | HDP | CGP |
| Contact representation | Explicit (3D contact position) | Implicit (state-tactile triplet) |
| Multi-point contact | Centered on a single contact | Natively supports distributed multi-point contact |
| Number of fingers | Mostly grippers | Targets multi-finger hands |
| Controller integration | Loose | Tight (made explicit through M_\phi) |

This is where it becomes clear why CGP is the more natural fit for multi-fingered hands. Tracking every contact patch that five fingers create simultaneously as explicit coordinates is hard, but bundling them into a latent representation is natural.

Implications: what this means for practitioners

CGP is not a cure-all. Still, the paper makes the following propositions about multi-fingered manipulation policy design hard to ignore.

  1. "Bring the controller explicitly into policy design." The gap between reference and actual is itself contact information. The common practice of stacking a policy on top of stiff position control increases the policy's learning burden in contact-rich regimes.
  2. "Do not model contact directly; model its consequences." Rather than estimating contact location, mode, and friction one by one, learning the (state, tactile) pairs they produce and handling contact implicitly scales better to multi-finger, multi-point contact.
  3. "Latent space is the natural language for multi-fingered touch." Raw tactile data is too heavy and too noisy. Compressing it into a KL-regularized latent is what makes stable generative modeling possible.
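The first proposition, that the reference-actual gap is contact information, can be checked with a toy model. The sketch below assumes a 1-DoF finger under pure stiffness control pressing on a linear elastic surface at static equilibrium. The spring model and all numbers are my own illustrative assumptions, not the controller or parameters from the paper.

```python
def equilibrium(q_target, q_wall, k_ctrl, k_env):
    """1-DoF finger under stiffness control pressing on an elastic surface.

    Returns (q_actual, contact_force) at static equilibrium.
    """
    if q_target <= q_wall:
        # Free space: the controller reaches its target, no contact force.
        return q_target, 0.0
    # Balance the controller spring against the environment spring.
    q_actual = (k_ctrl * q_target + k_env * q_wall) / (k_ctrl + k_env)
    force = k_ctrl * (q_target - q_actual)
    return q_actual, force


# Same surface, three situations (all values hypothetical).
q_free, f_free = equilibrium(q_target=0.5, q_wall=1.0, k_ctrl=10.0, k_env=1000.0)
q_soft, f_soft = equilibrium(q_target=1.2, q_wall=1.0, k_ctrl=10.0, k_env=1000.0)
q_stiff, f_stiff = equilibrium(q_target=1.2, q_wall=1.0, k_ctrl=1000.0, k_env=1000.0)
```

In free space the gap and the force are both zero. In contact, the same target yields very different forces depending on controller stiffness, so a target alone does not determine the contact; the triplet of target, actual state, and sensed contact does.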

Summary and conclusion

Contact-Grounded Policy can be summarized in one sentence:

"Define the output of a multi-fingered manipulation policy not as a kinematic target but as the evolution of the contact the controller will realize, generate it with diffusion in a latent space, and then translate it into a controller reference through a learned mapping."

That one sentence contains three decisions.

  1. Representation: define contact implicitly as the (target, actual, tactile) triplet.
  2. Generation: sample future (actual, tactile) trajectories with conditional diffusion in latent space.
  3. Realization: translate the latent prediction into a controller reference with the learned M_\phi.

Each decision on its own is an already known tool, but their combination creates a new operating point for imitation learning in multi-fingered manipulation. Robustness to visual disturbances, avoidance of the baselines' failure modes (slip and over-pressing), and generality across modalities are the payoff of that combination.

The interesting questions that remain open up rich ground for follow-up research. Can CGP be combined with RL? Can it be extended to variable impedance? Can the latent tactile predictor be used directly as a world model? Does it also work with non-rigid objects and fluids?

The goal that multi-fingered manipulation research is ultimately reaching for is "a robot that thinks through contact, like a human hand". CGP has taken one step down that road. I think the step is elegant not because it invents a new algorithm, but because it places existing tools in exactly the right positions with respect to the physical reality of contact.

A good system is built not from new parts but by redrawing the interfaces between familiar parts. CGP demonstrates that lesson once more in the domain of multi-fingered manipulation.


References

  • Paper (arXiv): https://arxiv.org/abs/2603.05687
  • Project page: https://contact-grounded-policy.github.io/
  • Venue: Robotics: Science and Systems (RSS), 2026
  • Authors: Zhengtong Xu, Yeping Wang, Ben Abbatematteo, Jom Preechayasomboon, Sonny Chan, Nick Colonnese, Amirhossein H. Memar (Purdue / Meta Reality Labs Research / UW-Madison)

Copyright 2026, JungYeon Lee