Curieux.JY
  • JungYeon Lee
📃 ViTac-Tracing

tactile
imitation
deformable
ViTac-Tracing: Visual-Tactile Imitation Learning of Deformable Object Tracing
Published

May 30, 2026

  • Paper Link
  • Poster Link

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.


🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Introduction

What it means to handle deformable objects

When a robot handles a rigid body, the problem is relatively simple. A cup's pose is fully described by six numbers (3 for position + 3 for rotation), and once the cup is grasped, the relative relationship between hand and object no longer changes. The moment we move to deformable objects such as shoelaces, cables, towels, and cloth, the story changes completely. These objects have effectively infinite degrees of freedom (DoF), and when they are left on a desk they end up folded, twisted, and bunched up, so even telling at a glance "what state is this object in right now?" is hard.

That is why deformable-object manipulation often needs a preprocessing step. If you spread a crumpled T-shirt out flat, landmarks such as the collar or the corners become easier to find, and downstream tasks such as folding and tidying get easier. Tracing (following along an object to straighten it out), the topic of this paper, is exactly that preprocessing. A gripper holds one end of the object, and the fingers of another gripper slide along the object's edge all the way to the opposite end, bringing the tangled object into an extended state.

As an analogy, picture holding a tangled necklace chain in one hand and running the fingers of the other hand down the chain to straighten it without knots. People do this naturally, feeling with their fingertips whether "the chain is still held between my fingers." For a robot to do the same, it has to imitate that sense of touch somehow.

Two kinds of tracing, and the generalization problem

The paper splits tracing into two categories.

  • 1D tracing (object following): following a linear object such as a string, cable, or rope
  • 2D tracing (object sliding): following one edge of a planar object such as a towel or cloth

The structures differ, but the two are physically alike in that both "slide from one end to the other while maintaining contact with the object." So the authors ask: could a single unified model handle both 1D and 2D?

This is where existing approaches got stuck.

  • Model-based control: the object's state and dynamics must be modeled accurately, which is very hard for deformable objects with infinite DoF, and a new controller has to be written for each object type.
  • Reinforcement learning (RL): it needs reward design and accurate deformable-object modeling in simulation, and policies trained in simulation often fail on the real robot because of the sim-to-real gap.

The path this paper takes is imitation learning (IL). The policy is learned from data that a human demonstrates directly, so no explicit object dynamics model is needed and there is no sim-to-real gap. That makes it an attractive option for learning a unified policy that traces several kinds of objects. The authors claim this is the first attempt to handle 1D and 2D deformable object tracing with a unified policy.

Why tactile sensing is needed

Whether tracing succeeds ultimately comes down to "not losing contact." During tracing, however, the gripper fingers themselves occlude the object, so a camera alone cannot tell whether the object is properly held between the fingers. In other words, vision alone can hardly judge whether the object is about to slip off the edge of the fingertip or is held stably in the middle.

The tactile sensor is what fills this gap in local information. The paper integrates a vision-based tactile sensor from the GelSight Wedge family into the gripper finger. A vision-based tactile sensor uses an internal camera to capture how a transparent gel surface deforms when pressed against an object, producing a high-resolution tactile image. In effect it turns touch into an "image" so that it can be processed the same way as vision.

In summary, the big picture of the paper looks like this.

flowchart TB
    A[Crumpled deformable object<br/>1D rope/cable or 2D towel/cloth] --> B[ViTac-Tracing policy]
    V[Visual image: global context] --> B
    T[Tactile image: local contact] --> B
    K[Robot kinematics: proprioception] --> B
    B --> C[Slide gripper along the object<br/>while keeping contact]
    C --> D[Extended, untangled state]
    D --> E[Easier downstream manipulation:<br/>folding, cable insertion, dressing]

The core contributions can be summarized in three points.

  1. A visual-tactile imitation learning framework that lets a real robot trace a variety of 1D/2D deformable objects with a single unified policy.
  2. A low-cost visual-tactile teleoperation system that provides multimodal feedback to both the operator and the robot.
  3. Extensive ablation and comparison experiments that verify the effect of each component and show performance on both seen and unseen objects.

Method

Problem definition: "good tracing" in equations

First, tracing is defined mathematically. A 1D deformable object (or one edge of a 2D object) is modeled as a time-varying spatial curve \mathcal{C}_t \subset \mathbb{R}^3 in Cartesian space. Let L be the total length of the curve; the task runs from t=0 to t=T.

Notation:

  • p_0 = (x_0, y_0, z_0): the fixed point held by the other gripper
  • p_t = (x_t, y_t, z_t): the contact point between the moving gripper and the object
  • o^T: the gripper's tactile sensing region

Intuitively, successful tracing must satisfy two constraints.

  1. Do not lose the object: the contact point must always lie on the curve, i.e. p_t \in \mathcal{C}_t.
  2. Keep the object held in the middle of the finger: the contact point must stay inside the tactile sensing region, i.e. p_t \in o^T. Leaving the sensing region means the object has slipped off the fingertip.

Two objectives on the "progress" of the task are also defined.

  1. The distance between the contact point and the fixed point should converge to the total length over time: \|p_t - p_0\|_2 \to L.
  2. That distance should be monotonically increasing: \frac{d}{dt}\|p_t - p_0\|_2 \geq 0. (A part that has been straightened must not shrink again: do not go backward, keep moving steadily toward the end.)

This definition matters because the two loss functions introduced later (local center loss, global task loss) are precisely the mechanisms that inject these constraints and objectives into the policy.
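To make the two progress objectives concrete, here is a minimal NumPy sketch that checks them on a recorded contact trajectory. The function name, the tolerance `tol`, and the toy trajectory are illustrative assumptions, not something taken from the paper; the derivative condition is simply replaced by finite differences between consecutive samples.

```python
import numpy as np

def check_tracing_progress(contact_points, fixed_point, length, tol=1e-6):
    """Return (distances, is_monotonic, final_ratio) for a contact trajectory."""
    contact_points = np.asarray(contact_points, dtype=float)   # shape (T+1, 3)
    fixed_point = np.asarray(fixed_point, dtype=float)         # shape (3,)
    distances = np.linalg.norm(contact_points - fixed_point, axis=1)
    # Discrete stand-in for d/dt ||p_t - p_0||_2 >= 0
    is_monotonic = bool(np.all(np.diff(distances) >= -tol))
    # How close the final distance gets to the total length L
    final_ratio = float(distances[-1] / length)
    return distances, is_monotonic, final_ratio

# Toy trajectory (made up): the contact point slides 0.4 m along the x axis.
traj = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.25, 0.0, 0.0], [0.4, 0.0, 0.0]]
dist, mono, ratio = check_tracing_progress(traj, [0.0, 0.0, 0.0], length=0.4)
print(mono, ratio)
```

A trajectory that backs up at any point (for example 0.25 m followed by 0.2 m) would make `is_monotonic` false, which is the situation the second objective rules out.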

Data collection: a teleoperation system with multimodal feedback

In imitation learning, the quality of the demonstration data is the quality of the policy. So the authors put real effort into the data collection rig itself. The hardware setup is as follows.

  • Robot: a dual-arm ABB YuMi. One arm is the leader (driven by the operator) and the other is the follower (performing the actual task).
  • Visual sensor: a ZED 2 stereo camera providing a top-down view.
  • Tactile sensor: a GelSight Wedge based vision-based tactile sensor mounted on the follower gripper finger.
  • Control: the robot is controlled in joint position mode, and the end-effector (EE) speed is limited to 400 mm/s. An Nvidia Jetson Orin runs the camera and ROS drivers (ROS Noetic, in a Docker container).

The clever part is that the operator (the human) also gets feedback. Most existing low-cost teleoperation systems have no tactile feedback, so the operator controls the robot without knowing whether "the object is held well right now." That lowers the quality of the demonstrations. This system:

  • streams the follower's visual and tactile images to a screen in real time, so the operator can see the contact state.
  • attaches a vibration motor (DAOKAI DC 5V Mini) to the leader gripper, which warns by vibrating when the robot gets close to a singularity.

The singularity warning is needed because the policy is trained on EE pose rather than joint angles. The mapping between EE pose and joint angles is not one-to-one (non-unique), so at certain EE poses the robot can end up in an awkward joint configuration from which it cannot move. Proximity to a singularity is computed with the Yoshikawa manipulability index.

w(q) = \sqrt{\det\!\big(J(q)\,J(q)^T\big)}

Here q is the robot's joint state and J(\cdot) is the Jacobian matrix. Intuitively, a larger w(q) means the robot can move freely in many directions at that pose (good dexterity), and a smaller w(q) means it is closer to a singularity. The vibration turns on when w(q) falls below the threshold \lambda_w \cdot \max(w), where \lambda_w = 0.2 was chosen by grid search.
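As a rough illustration of the threshold rule, the sketch below computes the manipulability index with NumPy and applies the \lambda_w \cdot \max(w) test. The two-link planar arm and its Jacobian are a stand-in chosen because the answer is known in closed form (w = |l_1 l_2 \sin q_2|); the YuMi Jacobian itself is not reproduced here, and the helper names are assumptions.

```python
import numpy as np

def manipulability(J):
    """Yoshikawa index w = sqrt(det(J J^T)) for a Jacobian J of shape (m, n)."""
    J = np.asarray(J, dtype=float)
    det = np.linalg.det(J @ J.T)
    return float(np.sqrt(max(det, 0.0)))   # guard against tiny negative round-off

def near_singularity(J, w_max, lambda_w=0.2):
    """True when w(q) drops below lambda_w * max(w), i.e. when the warning should fire."""
    return manipulability(J) < lambda_w * w_max

def planar_2link_jacobian(q1, q2, l1=1.0, l2=1.0):
    """Position Jacobian of a planar two-link arm (illustrative stand-in robot)."""
    s1, c1 = np.sin(q1), np.cos(q1)
    s12, c12 = np.sin(q1 + q2), np.cos(q1 + q2)
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

w_max = 1.0  # maximum of |sin(q2)| for unit link lengths
print(near_singularity(planar_2link_jacobian(0.3, np.pi / 2), w_max))  # elbow bent
print(near_singularity(planar_2link_jacobian(0.3, 0.1), w_max))        # arm nearly straight
```

For the toy arm, the nearly straight configuration has w ≈ 0.1, which is below 0.2 times the maximum, so the warning would fire there and not at the bent-elbow pose.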

An episode \mathcal{D} collected this way consists, at each time step t, of a pair made of the observation o_t = \{o_t^K, o_t^V, o_t^T\} (kinematics, visual, and tactile, respectively) and the ground-truth action a_t obtained from the operator side:

\mathcal{D} = \{(o_t, a_t)\}_{t=0}^{T}

Policy backbone: Action Chunking Transformer (ACT)

The backbone of the policy is ACT (Action Chunking Transformer). ACT is a method well known from low-cost bimanual imitation learning; instead of predicting a single step at a time, it predicts the next k actions as a chunk. This reduces compounding error and jitter.
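The idea of executing a predicted chunk can be sketched without any learning machinery. In the toy loop below the policy is queried once, all k actions of the chunk are executed open loop, and only then is the policy queried again. The scalar observation, `toy_policy`, and `toy_step` are hypothetical placeholders and not ACT's actual interface.

```python
from typing import Callable, List, Sequence

def rollout_with_chunks(policy: Callable[[float], Sequence[float]],
                        step: Callable[[float, float], float],
                        obs: float, horizon: int) -> List[float]:
    """Query the policy once per chunk and execute every action in the chunk."""
    executed: List[float] = []
    while len(executed) < horizon:
        chunk = policy(obs)                 # k future actions from one observation
        for action in chunk:
            if len(executed) == horizon:
                break
            obs = step(obs, action)         # apply the action, get the next observation
            executed.append(action)
    return executed

# Toy 1D example (hypothetical): the observation is a position, an action is a displacement.
calls: List[float] = []

def toy_policy(obs: float, k: int = 3) -> List[float]:
    calls.append(obs)                       # record when the policy was queried
    return [1.0] * k

def toy_step(obs: float, action: float) -> float:
    return obs + action

actions = rollout_with_chunks(toy_policy, toy_step, obs=0.0, horizon=7)
print(len(actions), calls)
```

With k = 3 and a horizon of 7 steps, the policy is queried only three times, which is where the reduction in per-step decision points comes from.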

The input processing flow is as follows.

  • o_t^K (robot kinematics) → features extracted by an MLP
  • o_t^V (visual image) → CNN (ResNet18)
  • o_t^T (tactile image) → a separate CNN (ResNet18)
  • The three features are concatenated and fed into a Transformer-based policy network

flowchart LR
    K[Robot kinematics o_K] --> M[MLP]
    V[Visual image o_V] --> C1[CNN ResNet18]
    T[Tactile image o_T] --> C2[CNN ResNet18]
    M --> F[Concatenated features]
    C1 --> F
    C2 --> F
    F --> TR[Transformer policy network]
    TR --> A[Predicted action chunk a_t:t+k]
    TR --> I[Predicted completion sequence I_t:t+k]
    A --> L1[Local Center Loss]
    I --> L2[Global Task Loss]
    TR --> L3[Regularization KL Loss]
    L1 --> LO[Overall Loss]
    L2 --> LO
    L3 --> LO

By default, ACT is trained with two losses.

Reconstruction loss, which makes the predicted actions match the demonstrated actions:

\mathcal{L}_{reconst} = \mathrm{MAE}(\hat{a}_{t:t+k},\, a_{t:t+k})

Here MAE is the mean absolute error (L1 loss).

Regularization loss: ACT has a CVAE (conditional variational autoencoder) structure, so the encoder distribution of the style variable z is kept close to a standard normal distribution:

\mathcal{L}_{reg} = D_{KL}\big(q_\phi(z \mid a_{t:t+k}, \bar{o}_t)\,\|\,\mathcal{N}(0, I)\big)

q_\phi is the Transformer encoder, z is the style variable, and \bar{o}_t is the observation with the image observations removed.
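If the encoder outputs a diagonal Gaussian (a mean and a log-variance per latent dimension, which is the usual CVAE parameterization and is assumed here rather than stated in the review), the KL term has the closed form \tfrac{1}{2}\sum_i (\sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2). A small NumPy sketch:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    logvar = np.asarray(logvar, dtype=float)
    return float(-0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar)))

print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))   # posterior equals the prior
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))   # mean shifted by 1 in one dimension
```

The first call gives zero because the posterior already matches the prior; the second gives 0.5, the penalty for shifting one mean by one unit.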

Up to here this is standard ACT. The paper's real contribution is the addition of two losses on top: one from a local perspective and one from a global perspective.

Contribution 1: Local Center Loss

Recall constraint 2 from the problem definition ("the contact point must stay inside the tactile sensing region"). Because deformable objects have high DoF, a contact that drifts toward the edge of the tactile sensor easily slips out of the finger and falls. The ideal contact location is the center of the tactile image.

The key insight is that demonstration data is not always perfect. The operator does watch the real-time tactile stream while controlling the robot, but cannot keep the contact point exactly at the middle of the finger at every moment. So instead of imitating every demonstrated action with the same weight, the method gives a larger weight to actions that bring the contact point toward the center.

The implementation is as follows.

  1. Extract the contact point location from the tactile image: the high-resolution tactile texture is very sharp, so classical image processing is enough. Grayscale conversion → thresholding → Gaussian filtering → contour extraction produce a contact mask, and the largest contour is fitted with an ellipse (or analyzed with PCA) to obtain the contact point in pixel coordinates, p_t^{tac} = (u_t^{tac}, v_t^{tac}).

  2. A weight that shrinks with distance from the center: with c = (u_c, v_c) the center of the sensing region,

w_t = \exp\!\left(-\frac{\|p_t^{tac} - c\|}{c}\right)

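When the contact point is near the center, w_t \approx 1, and the weight shrinks toward 0 as the contact moves toward the edge. (Since c is a 2D point, the denominator has to be read as a scalar length scale, for example \|c\|.) In other words, the policy is told to trust "actions from good moments when the object is already held well" more.

  3. Weighted reconstruction loss: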

\mathcal{L}_{center} = w_{t:t+k} \cdot \mathrm{MAE}(\hat{a}_{t:t+k},\, a_{t:t+k})

As an analogy, it is like a coach telling a student who is learning to jump rope, "remember especially well the form you had when the rope went around cleanly." Motions from clumsy moments are imitated less, and motions from stable moments are imitated more intensively.
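The three implementation steps of the center loss can be sketched with NumPy alone. Note the simplifications: this sketch thresholds the image and applies PCA to the bright pixels directly, skipping the Gaussian filtering and contour or ellipse fitting that the review describes, and it uses `scale = ||c||` as the scalar denominator of the weight, which is an assumption. The function names and the synthetic image are illustrative.

```python
import numpy as np

def contact_point_pca(gray, threshold):
    """Return (centroid, principal_axis) of pixels above `threshold`, in (u, v) pixels."""
    vs, us = np.nonzero(np.asarray(gray) > threshold)   # rows index v, columns index u
    if us.size < 2:
        return None, None                               # too few pixels to call it a contact
    pts = np.stack([us, vs], axis=1).astype(float)
    centroid = pts.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov((pts - centroid).T))
    return centroid, eigvecs[:, np.argmax(eigvals)]     # direction of largest variance

def center_weight(p_tac, center, scale):
    """w_t = exp(-||p_tac - c|| / scale); `scale` is an assumed scalar length in pixels."""
    offset = np.asarray(p_tac, dtype=float) - np.asarray(center, dtype=float)
    return float(np.exp(-np.linalg.norm(offset) / scale))

def center_loss(a_hat, a, weights):
    """Mean over the chunk of w_t times the per-step mean absolute error."""
    diff = np.asarray(a_hat, dtype=float) - np.asarray(a, dtype=float)
    per_step_mae = np.abs(diff).mean(axis=-1)            # shape (k,)
    return float(np.mean(np.asarray(weights, dtype=float) * per_step_mae))

# Synthetic 480x480 tactile image: a bright horizontal strip near the middle.
image = np.zeros((480, 480), dtype=np.uint8)
image[238:242, 100:380] = 255
center = np.array([240.0, 240.0])
scale = float(np.linalg.norm(center))
p_tac, axis = contact_point_pca(image, threshold=128)
w_centered = center_weight(p_tac, center, scale)
w_edge = center_weight([470.0, 240.0], center, scale)
print(p_tac, axis, w_centered, w_edge)
```

On the synthetic strip the centroid lands half a pixel from the image center, so its weight is close to 1, while a contact near the right border gets a visibly smaller weight. With this particular choice of scale the weight never reaches 0 inside the image, so how sharply edge contacts are discounted depends on the denominator actually used.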

Contribution 2: Global Task Loss

If the center loss handles the local constraint "do not lose contact," the global loss handles the global objective "how far has the task progressed, and when should it stop." Many deformable-object manipulation tasks depend on precise termination: to fold a towel you must stop at the corner, and to insert a cable into a clip you must stop just before it slips out of the gripper.

The authors define a scalar called the completion index I. Assuming that during the task the object is pulled taut (between the fixed point and the contact point) and is close to a straight line, the length already straightened can be estimated by the distance between p_0 and p_t, and dividing it by the total length gives the progress ratio.

First, the contact point p_t^{tac} obtained from the tactile image is converted into the gripper frame (which coincides with the sensor frame):

p_t^{gripper} = \left(\frac{u_t^{tac} - u_c}{p2m},\; \frac{v_t^{tac} - v_c}{p2m},\; 0\right)

Here p2m is the pixel-to-meter scale. A coordinate transform then gives the contact point p_t in the world frame:

p_t = T_{gripper}^{world}\, p_t^{gripper}

T_{gripper}^{world} is the transformation matrix from the gripper frame to the world frame. The ground-truth completion index can now be labeled on the demonstration data:

I = \min\!\Big(\max\big(\tfrac{\|p_t - p_0\|_2}{\|p_T - p_0\|_2},\, 0\big),\, 1\Big)

The numerator is the length straightened so far, and the denominator is the final length (when the end has been reached). So I is a progress ratio that rises smoothly from 0 (start) to 1 (done). The \min/\max clip the value to [0, 1].
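The labeling pipeline (pixel → gripper frame → world frame → clipped ratio) can be written in a few lines. In this sketch `p2m` is treated as pixels per metre because the formula divides by it, and the gripper-to-world transform is taken to be a 4×4 homogeneous matrix; both are assumptions about conventions the review does not spell out, and the numbers in the example are made up.

```python
import numpy as np

def contact_in_world(p_tac, center_px, p2m, T_world_gripper):
    """Pixel contact point -> gripper frame (z = 0) -> world frame."""
    u, v = p_tac
    u_c, v_c = center_px
    # Homogeneous point in the gripper (sensor) frame
    p_gripper = np.array([(u - u_c) / p2m, (v - v_c) / p2m, 0.0, 1.0])
    return (np.asarray(T_world_gripper, dtype=float) @ p_gripper)[:3]

def completion_index(p_t, p_0, p_T):
    """I = clip(||p_t - p_0|| / ||p_T - p_0||, 0, 1)."""
    num = np.linalg.norm(np.asarray(p_t, dtype=float) - np.asarray(p_0, dtype=float))
    den = np.linalg.norm(np.asarray(p_T, dtype=float) - np.asarray(p_0, dtype=float))
    return float(np.clip(num / den, 0.0, 1.0))

# Made-up example: the gripper sits 0.2 m along x from the fixed point, contact at image center.
T = np.eye(4)
T[:3, 3] = [0.2, 0.0, 0.1]
p_t = contact_in_world((240, 240), (240, 240), p2m=10000.0, T_world_gripper=T)
index = completion_index(p_t, p_0=[0.0, 0.0, 0.1], p_T=[0.4, 0.0, 0.1])
print(p_t, index)
```

Here the contact is halfway between the fixed point and the final contact point, so the index comes out at about 0.5; any point farther than the final one is clipped to 1.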

A completion-index prediction branch is then added to the policy network. Alongside the action sequence, the network also predicts a completion-index sequence \hat{I}_{t:t+k}, which is trained with MSE:

\mathcal{L}_{task} = \mathrm{MSE}(\hat{I}_{t:t+k},\, I_{t:t+k})

Intuitively, this is an auxiliary task that teaches the policy to "always be aware of what percentage of the task is done." Being aware of progress helps the policy recognize the end point and reduces two mistakes: stopping too early, and continuing past the end (over-tracing).

Overall loss

The final objective is a weighted sum of the losses above. Of the four losses discussed, the reconstruction loss is absorbed into the center loss, which leaves three terms:

\mathcal{L} = \mathcal{L}_{center} + \lambda_{reg}\,\mathcal{L}_{reg} + \lambda_{task}\,\mathcal{L}_{task}

\lambda_{reg} = 100 and \lambda_{task} = 100, chosen by grid search.

In pseudocode, the training loop looks like this.

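for each batch (o_K, o_V, o_T, a, p0, pT) in dataset:
    # feature extraction: one encoder per modality, then concatenation
    f = concat(MLP(o_K), CNN_v(o_V), CNN_t(o_T))

    # policy forward: action chunk, completion-index chunk, latent distribution
    a_hat, I_hat, z_dist = transformer_policy(f)

    # local center loss: per-step MAE weighted by how centered the contact is
    p_tac = extract_contact_point(o_T)        # contour + ellipse/PCA, pixel coords
    # center is a 2D point, so a scalar length scale is assumed: norm(center)
    w = exp(-norm(p_tac - center) / norm(center))
    L_center = mean(w * MAE(a_hat, a))

    # global task loss: completion index from the straight-line distance
    p_world = transform_to_world(p_tac)       # pixel -> gripper -> world
    I_gt = clip(norm(p_world - p0) / norm(pT - p0), 0, 1)
    L_task = MSE(I_hat, I_gt)

    # regularization: KL between the encoder distribution and N(0, I)
    L_reg = KL(z_dist, N(0, I))

    loss = L_center + 100 * L_reg + 100 * L_task
    optimizer.zero_grad(); loss.backward(); optimizer.step()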

Experiments

Setup

  • Objects (seen): two 1D objects (flat shoelace, braided cable) and two 2D objects (face towel, microfiber cloth). 25 demonstrations per object, 100 episodes in total. Visual and tactile images, robot state, and actions are recorded at 30 Hz.
  • Objects (unseen, for generalization): a 1D rope (synthetic rope) and a 2D napkin (cotton napkin).
  • Input: images are cropped/resized to 480×480. Actions and states are 16-dimensional: the joint-space model uses 14 joints + 2 grippers, and the Cartesian model uses 2 EE poses + 2 grippers (so each EE pose accounts for 7 of the 16 dimensions).
  • Training: chunk size k=60, separate ResNet18 backbones for vision and touch, brightness/contrast/gamma augmentation, 15,000 epochs, and the checkpoint with the lowest validation loss is selected. RTX 4090, AMD Threadripper Pro 5965WX, 128 GB RAM.
  • Inference: temporal aggregation is disabled. The Cartesian model converts its output EE pose to joint angles before execution.
  • Evaluation metrics:
    • Success: the gripper follows the object to the end point (within the last 5% of its length) and keeps the grasp there.
    • Robot collision: the robot collides with the object or with itself and cannot recover.
    • Early stopping: the end is not reached, but the object is not released.
    • Over-tracing: the last 5% is reached, but the grasp is not maintained.
    • Object dropping: the object is dropped before the end is reached.
    • Auxiliary metrics: success time (time taken on successful runs) and completion ratio \|p_T - p_0\|_2 / L (distance reached / total length).
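The outcome categories above amount to a small decision rule, sketched below. The inputs (`reached_fraction`, `holding_at_end`, `collided`) are hypothetical per-episode measurements, and giving collisions priority over the other categories is an assumption, since the review does not say how overlapping cases were resolved.

```python
from enum import Enum

class Outcome(Enum):
    SUCCESS = "success"
    COLLISION = "robot collision"
    EARLY_STOPPING = "early stopping"
    OVER_TRACING = "over-tracing"
    OBJECT_DROPPING = "object dropping"

def classify_episode(reached_fraction, holding_at_end, collided, end_zone=0.05):
    """Map one episode to an outcome; the end zone is the last 5% of the object's length."""
    if collided:                                   # assumed to take priority
        return Outcome.COLLISION
    reached_end = reached_fraction >= 1.0 - end_zone
    if reached_end:
        return Outcome.SUCCESS if holding_at_end else Outcome.OVER_TRACING
    return Outcome.EARLY_STOPPING if holding_at_end else Outcome.OBJECT_DROPPING

print(classify_episode(0.97, True, False))    # reached the end zone and still holding
print(classify_episode(0.97, False, False))   # reached the end zone but lost the grasp
print(classify_episode(0.60, True, False))    # stopped short while still holding
print(classify_episode(0.60, False, False))   # lost the object before the end
```

Written this way, the four non-collision outcomes are just the two-by-two combinations of "reached the end zone" and "still holding the object."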

Result 1: Pose representation, joint angles vs EE pose

The first comparison is whether proprioception (the robot's sense of its own pose) should be given as joint angles or as EE pose. The experiment covers 4 objects with 10 trials per object, 40 trials in total.

Representation | Success rate
Joint space (joint angles) | 70.0% [54.6, 81.9]
EE pose (Cartesian, Ours) | 80.0% [65.2, 89.5]

The EE pose model showed a higher success rate and a higher completion ratio, and in particular far fewer object drops. The interpretation is that tracing is inherently defined in task space. EE pose aligns the input with the task objective, which reduces ambiguity (the redundancy of joint angles) and makes it easier to fine-tune the orientation of the object relative to the finger. All subsequent experiments therefore use the EE pose representation.
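The bracketed ranges next to the success rates are 95% confidence intervals; the limitations section of this review identifies them as Wilson intervals. A short standard-library sketch of the Wilson score interval, assuming z = 1.96, reproduces the two rows above from their raw counts (28/40 and 32/40).

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion; returns (lower, upper)."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

for label, k in [("Joint space", 28), ("EE pose", 32)]:
    lo, hi = wilson_interval(k, 40)
    print(f"{label}: {100 * k / 40:.1f}% [{100 * lo:.1f}, {100 * hi:.1f}]")
```

With only 40 trials the two intervals overlap substantially, which is worth keeping in mind when reading a 10-point difference in success rate.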

Result 2: Component ablation

This is the key table, which verifies the contribution of each sensor modality and each loss by removing them one at a time (10 trials per object, 40 in total).

Method | Success rate | Collision | Early stop | Over-trace | Drop
Joint Space | 70.0% [54.6, 81.9] | 1/40 | 4/40 | 2/40 | 5/40
w/o Vision | 65.0% [49.5, 77.9] | 4/40 | 2/40 | 8/40 | 1/40
w/o Tactile | 60.0% [44.6, 73.7] | 2/40 | 5/40 | 1/40 | 8/40
w/o Center Loss | 65.0% [49.5, 77.9] | 4/40 | 1/40 | 0/40 | 9/40
w/o Task Loss | 67.5% [52.0, 79.9] | 3/40 | 3/40 | 7/40 | 0/40
Ours (full) | 80.0% [65.2, 89.5] | 2/40 | 2/40 | 3/40 | 1/40

The table is very clean in that each component prevents a different failure mode.

  • Without vision: over-tracing jumps to 8/40. Without vision the policy cannot see the end point and keeps going past it. So vision = awareness of task progress and termination.
  • Without tactile: dropping jumps to 8/40, and the completion ratio is also the lowest. Without touch the policy cannot hold a stable contact and drops the object. So touch = maintaining stable contact.
  • Without center loss: dropping is the highest at 9/40. This is evidence that the center loss teaches "corrective actions that bring the contact toward the center" and thereby prevents drops.
  • Without task loss: over-tracing is high at 7/40, and early stopping also goes up (3/40 vs 2/40). This shows that the completion-index output contributes to recognizing the end point.

Interestingly, success time did not differ much across the variants. These components affect not "speed" but "the quality of success and failure."

Result 3: Unified model vs individual models

Models were trained on 1D data only, 2D data only, and all data combined, and then compared (Table II). The conclusion is that unified training does not hurt performance relative to individual training. For the same object, there was no clear difference between the single-dataset model and the unified model. Success rates on 1D objects (80 to 90%) were generally higher than on 2D objects, because with 2D objects part of the cloth dangles outside the finger and gravity tends to pull it out of the front of the gripper. The towel is larger and heavier than the cloth and fared worse, and tracking error accumulated more on longer objects.

Result 4: Generalization to unseen objects

The policy was evaluated on a rope (1D) and a napkin (2D) that were not used in training (20 trials per object).

Unseen object | Success rate | Collision | Early stop | Over-trace | Drop
Rope (1D) | 70.0% [48.1, 85.5] | 0/20 | 4/20 | 4/20 | 2/20
Napkin (2D) | 60.0% [38.7, 78.1] | 2/20 | 0/20 | 4/20 | 2/20

The overall average is 65%, lower than the 80% on seen objects but far from a collapse. Termination-related failures (early stopping, over-tracing) appeared more often, which is interpreted as the change in visual appearance having the larger effect. The tactile textures of the unseen objects resemble those of the seen objects (Fig. 6 in the paper), so generalization in terms of maintaining contact went comparatively well.

Critical discussion

Strengths

  • Clear alignment between problem and solution: the problem definition sets up two constraints (p_t \in o^T, p_t \in \mathcal{C}_t) and two objectives (monotonic increase, convergence to the length), and the center loss and task loss each target them directly. The distribution of failure modes in the ablation table cleanly supports this design intent.
  • Feedback for the operator as well: the vibration and real-time streaming in the data collection rig look minor, but they are a practical contribution that tackles "demonstration quality," the bottleneck of imitation learning, head on.
  • Modularity: the authors state that the center loss, task loss, and teleoperation system are not tied to ACT and can be ported to other IL algorithms such as Diffusion Policy. Reusability is high.
  • Appropriate use of classical image processing: using contour + ellipse fitting for contact point extraction instead of a heavy learned model is a sensible choice that exploits the sharpness of the high-resolution tactile texture.

Weaknesses and limitations

  • Small sample size: the key comparisons use 10 trials per object (40 in total for the ablation), and the unseen evaluation uses 20 trials per object. The Wilson 95% confidence intervals are quite wide (for example, [65.2, 89.5] for the 80% of Ours), so how statistically robust the gap "80% vs 65% (60% for w/o Tactile)" really is should be judged conservatively.
  • Assumption behind the completion index: the task loss relies on the assumption that "the object is pulled taut in a straight line between p_0 and p_t." On slack or highly curved segments, or with very stretchy objects, this straight-line distance estimate can deviate from the true progress. (Speculation: the error is likely to grow for highly elastic objects such as rubber bands.)
  • Gravity vulnerability of 2D objects: dropping and over-tracing on 2D objects look like a limitation of the gripper geometry rather than of the policy. The authors also list combining the method with mechanical intelligence, such as V-shaped or hole-shaped grippers, as future work.
  • Vision-dependent generalization: the fact that the drop in unseen performance comes mainly from changes in visual appearance suggests that the visual encoder may be somewhat overfit to appearance. The controlled environment, with fixed lighting and a black sponge pad, may also be a burden for generalization in real deployments.
  • Single robot, single sensor evaluation: the method was validated only on the combination of ABB YuMi + GelSight Wedge, so portability to other grippers and sensors is unverified.

Comparison with related work

  • Model-based control (She et al., Zhang et al., and others): requires accurate dynamics models and per-object controllers, so generalization is weak. This paper removes the modeling burden with IL and pursues a unified policy.
  • RL (Pecyna et al., Sun et al., and others): suffers from reward design and the sim-to-real gap. This paper avoids the gap by learning from real demonstrations only.
  • Visual-tactile IL (the AR-based system of Xue et al., Tactile-ALOHA, and others): the same lineage as prior work that combines vision and touch for contact-rich manipulation. What sets this paper apart is (1) unified tracing across 1D and 2D, (2) a tracing-specific objective made of the center loss and task loss, and (3) a low-cost system that also provides operator feedback.

Summary and conclusion

ViTac-Tracing solves the preprocessing task of "tracing to straighten tangled deformable objects" with a single unified imitation learning policy that uses vision and touch together. The core is two tracing-specific losses.

  • Local Center Loss: weights actions that bring the contact point toward the center of the tactile image, securing contact stability (not losing the object).
  • Global Task Loss: has the policy predict the completion index I as an auxiliary output, so that it learns awareness of task progress and termination (when to stop).

On top of this, good-quality data was collected with a low-cost ABB YuMi teleoperation system that gives the operator visual, tactile, and vibration feedback. The result is a success rate of 80% on seen objects and 65% on unseen objects, and the ablation clearly shows the division of roles: "vision = termination awareness, touch = contact maintenance, center loss = drop prevention, task loss = termination accuracy."

From a robotics practitioner's point of view, the message of the paper is this. In deformable-object manipulation, touch is not a "nice to have" but an essential element that fills in local contact information in situations where occlusion is inherent, and training task progress as an explicit auxiliary task markedly improves the judgment of when to stop. Both losses are modules that can be attached as-is to IL backbones other than ACT (for example, Diffusion Policy), which makes them a practical tool worth trying right away for anyone studying contact-based manipulation. The future work listed includes combining the method with specialized grippers, deeper sensor fusion, and larger-scale evaluation.

Copyright 2026, JungYeon Lee