
📃 ViTacGen

tactile
generation
manipulation
ViTacGen: Robotic Pushing with Vision-to-Touch Generation
Published

May 31, 2026

  • Paper Link
  • Poster Link

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.


🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Introduction

Having a robot push an object looks simple, but it is actually a very delicate task. Imagine pushing a heavy book across a desk with your fingertip. Almost unconsciously, you sense through touch the moment the book starts to slide, the first hint that it is about to rotate, and the changes in friction at the contact surface, and you adjust your force accordingly. A robot is in the same position: it can push an object accurately to a target location only if it knows the contact force and the interaction dynamics between its end-effector and the object.

The question is how to give a robot this sense of touch. High-resolution optical tactile sensors such as GelSight and TacTip provide rich contact information, but they face many practical barriers.

  • Cost: high-resolution tactile sensors are expensive and hard to fit on every robot hand.
  • Durability: the elastomer surface wears out or tears easily.
  • Calibration: output characteristics differ between sensors, even between units of the same model, so keeping them consistent is tricky.
  • Manufacturing variance: variation across mass-produced sensors undermines the reproducibility of a policy.

๋ฐ˜๋Œ€๋กœ ์นด๋ฉ”๋ผ(๋น„์ „)๋งŒ ์“ฐ๋Š” ์ •์ฑ…์€ ์ด๋Ÿฐ ํ•˜๋“œ์›จ์–ด ๋ฌธ์ œ์—์„œ ์ž์œ ๋กญ์ง€๋งŒ, ์ ‘์ด‰ ์ˆœ๊ฐ„์˜ ์„ฌ์„ธํ•œ ํž˜ ๋ณ€ํ™”๋ฅผ ์ง์ ‘ ๋ณด์ง€ ๋ชปํ•ด ์ •๋ฐ€๋„๊ฐ€ ๋–จ์–ด์ง‘๋‹ˆ๋‹ค.

ViTacGen์˜ ์ถœ๋ฐœ์ ์€ ์‚ฌ๋žŒ์˜ ๋Šฅ๋ ฅ์—์„œ ์˜๊ฐ์„ ๋ฐ›์•˜์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ์†์„ ๋Œ€ ๋ณด์ง€ ์•Š์•„๋„, ๋ฌผ์ฒด๊ฐ€ ๋ˆŒ๋ฆฌ๊ณ  ๋ฏธ๋„๋Ÿฌ์ง€๋Š” ๋ชจ์Šต๋งŒ ๋ณด๊ณ ๋„ ๋Œ€๋žต์ ์ธ ์ ‘์ด‰ ์ƒํƒœ๋ฅผ ์ถ”๋ก ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋งˆ์น˜ ์˜์ƒ๋งŒ ๋ด๋„ โ€œ์ €๊ฑด ๋ฏธ๋„๋Ÿฌ์ง€๊ฒ ๋‹คโ€๋ผ๊ณ  ์ง๊ฐํ•˜๋Š” ๊ฒƒ์ฒ˜๋Ÿผ์š”. ์ €์ž๋“ค์˜ ํ•ต์‹ฌ ์งˆ๋ฌธ์€ ์ด๊ฒƒ์ž…๋‹ˆ๋‹ค.

โ€œ๋น„์ „ ์ž…๋ ฅ๋งŒ์œผ๋กœ ๊ฐ€์งœ ์ด‰๊ฐ ์‹ ํ˜ธ๋ฅผ ์ƒ์„ฑ(generate) ํ•ด์„œ, ์ง„์งœ ์ด‰๊ฐ ์„ผ์„œ ์—†์ด๋„ ์ด‰๊ฐ ๊ธฐ๋ฐ˜ ์ •์ฑ…์˜ ์ด์ ์„ ๋ˆ„๋ฆด ์ˆ˜ ์žˆ์„๊นŒ?โ€

ViTacGen์€ ์ด ์งˆ๋ฌธ์— โ€œ๊ทธ๋ ‡๋‹คโ€๋ผ๊ณ  ๋‹ตํ•ฉ๋‹ˆ๋‹ค. ๋น„์ „ ์˜์ƒ ์‹œํ€€์Šค๋กœ๋ถ€ํ„ฐ ์ ‘์ด‰ ๊นŠ์ด ์ด๋ฏธ์ง€(contact depth image) ๋ผ๋Š” ํ‘œ์ค€ํ™”๋œ ์ด‰๊ฐ ํ‘œํ˜„์„ ํ•ฉ์„ฑํ•˜๋Š” ์ƒ์„ฑ ๋„คํŠธ์›Œํฌ์™€, ์ด ์ƒ์„ฑ๋œ ์ด‰๊ฐ์„ ๋น„์ „๊ณผ ์œตํ•ฉํ•ด ํ•™์Šตํ•˜๋Š” ๊ฐ•ํ™”ํ•™์Šต(RL) ์ •์ฑ…์„ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ ๋ฌผ๋ฆฌ ์ด‰๊ฐ ์„ผ์„œ๊ฐ€ ์ „ํ˜€ ์—†๋Š” ๋น„์ „ ์ „์šฉ ๋กœ๋ด‡์— ์ œ๋กœ์ƒท(zero-shot)์œผ๋กœ ๋ฐฐํฌ ๊ฐ€๋Šฅํ•˜๋ฉด์„œ๋„ ์ตœ๋Œ€ 86%์˜ ์‹ค์„ธ๊ณ„ ์„ฑ๊ณต๋ฅ ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

The paper was written by researchers at King's College London and the University of Bristol (Zhiyuan Wu, Yijiong Lin, Yongqiang Zhao, Xuyang Zhang, Zhuo Chen, Nathan Lepora, Shan Luo) and was published in IEEE Robotics and Automation Letters (RA-L).

Method

ViTacGen consists of two main modules.

  1. VT-Gen (Vision-to-Touch Generation Network): visual sequence → generated contact depth image
  2. VT-Con (RL Policy with Contrastive Learning): a reinforcement learning policy that fuses vision and generated touch through contrastive learning

The overall pipeline can be pictured as follows.

flowchart LR
    subgraph Input
        V["Visual sequence v1..vn<br/>(RGB camera frames)"]
    end

    subgraph VTGen["VT-Gen (frozen at policy stage)"]
        CE["Coarse encoder"]
        CMA["Cross-modal attention"]
        RE["Refine encoder + residual blocks"]
        DEC["Hierarchical decoder"]
        CE --> CMA --> RE --> DEC
    end

    subgraph VTCon["VT-Con (RL policy)"]
        EV["Visual CNN encoder Ev"]
        EC["Tactile CNN encoder Ec"]
        MOCO["Momentum Contrast<br/>InfoNCE alignment"]
        FUSE["Attention fusion"]
        SAC["SAC policy"]
    end

    V --> CE
    DEC -->|"generated contact depth c_gen"| EC
    V --> EV
    EV --> MOCO
    EC --> MOCO
    EV --> FUSE
    EC --> FUSE
    FUSE -->|"+ TCP coords"| SAC
    SAC --> ACT["Action: push command"]

Standardized tactile representation: the contact depth image

One key design decision needs to be pointed out first. The authors chose the contact depth image, not the raw optical tactile image, as the generation target.

Intuitively, a contact depth image is a "terrain map" that records, as a per-pixel depth, how deeply the object has pressed into the sensor surface. Illumination, marker patterns, and color tone vary from one optical tactile sensor to another, but contact depth is a geometric quantity that is relatively invariant to the sensor type. In other words, it is a standardized representation that is not tied to the appearance of any particular sensor brand, which makes it more stable to learn and generate from vision. This choice is also directly tied to a limitation discussed later: it cannot capture fine-grained physical quantities such as the local force distribution.
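To make the "terrain map" picture concrete, here is a minimal sketch of how a contact depth image could be derived from a per-pixel gap map. The function name `contact_depth_image` and the signed-gap input are assumptions made for illustration; the review does not describe how the simulator actually renders its depth images.

```python
def contact_depth_image(signed_gap_mm):
    """Convert a per-pixel signed gap map into a contact depth image.

    signed_gap_mm[r][c] > 0: the object is still above the undeformed surface.
    signed_gap_mm[r][c] < 0: the object has pressed in by that many millimetres.
    Pixels without contact get depth 0.
    """
    return [[max(0.0, -gap) for gap in row] for row in signed_gap_mm]


# Hypothetical 3x3 sensing area with a rounded object pressing into the centre.
gap = [
    [1.0, 0.2, 1.0],
    [0.2, -0.5, 0.2],
    [1.0, 0.2, 1.0],
]
depth = contact_depth_image(gap)
print(depth)
```

Only the geometry of the indentation survives in this representation; lighting, markers, and color never enter it, which is the sensor-invariance argument made above.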

VT-Gen: drawing touch from vision

VT-Gen is an encoder-decoder that takes a sequence of visual frames $\{\mathcal{V}\} = \{v_1, \dots, v_n\}$ and generates a contact depth image $\boldsymbol{c}^{gen}$.

Coarse-to-refine encoding. After several frames are concatenated along the channel dimension,

  • the coarse encoder $\mathcal{E}_{coarse}$ extracts a coarse feature map $\boldsymbol{f}^v_{coarse}$,
  • cross-modal attention refines the features using a learnable positional embedding $\boldsymbol{p}$,
  • the refine encoder $\mathcal{E}_{refine}$ and residual blocks build a deeper representation while preserving spatial information, and
  • a hierarchical decoder (transposed convolutions) outputs the final contact depth prediction.

This resembles the way a painter works: rough first, then refined. The large outlines are laid down first (coarse), and the details are filled in on top (refine).

The cross-modal attention is multi-head ($h = 8$),

$$\boldsymbol{f}^v_{cm} = \boldsymbol{C}\big[\mathcal{A}^1_{cm}(\boldsymbol{f}^v_{coarse}, \boldsymbol{p}),\ \dots,\ \mathcal{A}^h_{cm}(\boldsymbol{f}^v_{coarse}, \boldsymbol{p})\big]\boldsymbol{w}_0$$

where each head is the familiar scaled dot-product attention. Here $\boldsymbol{q}$, $\boldsymbol{k}$, and $\boldsymbol{v}$ are the query, key, and value computed from the two inputs, and $d$ is their dimension.

$$\mathcal{A}_{cm}(\boldsymbol{x}, \boldsymbol{y}) = \mathrm{softmax}\!\left(\frac{\boldsymbol{q}\boldsymbol{k}^\top}{\sqrt{d}}\right)\boldsymbol{v}$$
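The scaled dot-product formula above can be sketched in a few lines of plain Python. This is a single-head illustration under stated assumptions: the learned query, key, and value projections are omitted, the inputs are passed in directly as `queries`, `keys`, and `values`, and all function names are hypothetical rather than taken from the authors' code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """softmax(q k^T / sqrt(d)) v for lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors, one output channel at a time.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


# Two identical keys receive equal weight, so the output is the mean of the values.
out = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
print(out)
```

A multi-head version would run $h$ such heads on separately projected inputs, concatenate their outputs, and apply the output projection $\boldsymbol{w}_0$.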

Training loss: VGG perceptual loss. Instead of a pixel-wise L2, the model is trained with a perceptual loss that measures distance in the feature space of a pretrained VGG.

$$\mathcal{L}_{vgg} = \big\| \phi(\boldsymbol{c}^{gen}) - \phi(\boldsymbol{c}^{gt}) \big\|^2$$

Here $\phi$ denotes the features extracted by the VGG encoder. Intuitively, rather than forcing every pixel to match exactly, the loss steers the model toward matching "structure that looks similar to the human eye." As a result, the structural shape of the contact pattern is generated more naturally.
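The difference between a pixel loss and a feature-space loss can be illustrated with a toy example. The sketch below does not use VGG: `toy_features` is a hypothetical stand-in for $\phi$ that extracts only horizontal and vertical differences, chosen so the example runs without any deep learning library.

```python
def toy_features(img):
    """Hypothetical stand-in for VGG features: neighbouring-pixel differences."""
    h, w = len(img), len(img[0])
    dx = [img[r][c + 1] - img[r][c] for r in range(h) for c in range(w - 1)]
    dy = [img[r + 1][c] - img[r][c] for r in range(h - 1) for c in range(w)]
    return dx + dy


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def perceptual_loss(c_gen, c_gt, phi=toy_features):
    """|| phi(c_gen) - phi(c_gt) ||^2, with phi supplied by the caller."""
    return squared_l2(phi(c_gen), phi(c_gt))


def pixel_loss(c_gen, c_gt):
    """Plain pixel-wise squared L2 for comparison."""
    flatten = lambda img: [p for row in img for p in row]
    return squared_l2(flatten(c_gen), flatten(c_gt))


# Ground truth: one contact bump. Prediction: same bump with a uniform offset.
c_gt = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
c_gen = [[p + 0.1 for p in row] for row in c_gt]

print("pixel loss:     ", pixel_loss(c_gen, c_gt))
print("perceptual loss:", perceptual_loss(c_gen, c_gt))
```

The pixel loss penalizes the uniform offset, while the structural features of the two images are identical, so the feature-space loss stays at essentially zero. A real VGG encoder captures far richer structure, but the principle is the same.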

The reported model size is about 146.74 MB and the inference speed is 305.90 FPS (roughly 3.3 ms per generated image), which is fast enough for real-time use.
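As a quick sanity check on the timing, the reported frame rate can be converted to a per-image latency and compared with the 500 Hz control frequency listed in the experimental setup. This is plain arithmetic on the two reported numbers; how often the policy actually requests a new tactile image is not stated in this review, so the comparison is only indicative.

```python
GENERATOR_FPS = 305.90  # reported VT-Gen inference speed
CONTROL_HZ = 500        # control frequency from the experimental setup

ms_per_generation = 1000.0 / GENERATOR_FPS
ms_per_control_tick = 1000.0 / CONTROL_HZ
ticks_per_generation = ms_per_generation / ms_per_control_tick

print(f"{ms_per_generation:.2f} ms per generated image")
print(f"{ms_per_control_tick:.2f} ms per control tick")
print(f"{ticks_per_generation:.2f} control ticks per generated image")
```

One generation takes longer than a single control tick, so under these numbers a generated image would span more than one tick of the low-level loop.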

VT-Con: blending generated touch into the policy

How is the touch produced by VT-Gen combined with vision to learn a policy? VT-Con uses four ingredients.

(1) Two-stream feature extraction. Two CNNs with identical architecture, $\mathcal{E}^v$ and $\mathcal{E}^c$, process the visual sequence and the generated tactile sequence to produce $\boldsymbol{f}^v$ and $\boldsymbol{f}^c$, respectively. The temporal window is three frames ($t-2 \to t$), so it carries dynamic cues such as velocity and acceleration in addition to position. A single photo cannot tell you whether an object is sliding, but three consecutive frames reveal the flow of motion.
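A three-frame temporal window is commonly kept with a fixed-length queue. The sketch below is one way to do it; the class name `FrameStack` and the choice to pad the window with copies of the first frame are assumptions, not details reported for ViTacGen.

```python
from collections import deque


class FrameStack:
    """Keeps the most recent `size` frames (t-2, t-1, t for size=3)."""

    def __init__(self, size=3):
        self.size = size
        self.frames = deque(maxlen=size)

    def push(self, frame):
        if not self.frames:
            # Assumption: at episode start, fill the window with the first frame.
            self.frames.extend([frame] * self.size)
        else:
            # deque(maxlen=...) drops the oldest frame automatically.
            self.frames.append(frame)
        return list(self.frames)


window = FrameStack(size=3)
for name in ["frame0", "frame1", "frame2", "frame3"]:
    current = window.push(name)
print(current)
```

The same window can feed both streams: the stacked camera frames go to the visual encoder, and the tactile images generated from them go to the tactile encoder.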

(2) Aligning the two modalities with contrastive learning. The model is taught that vision and generated touch are two different views of the same contact event. Momentum encoders $\mathcal{M}^v$ and $\mathcal{M}^c$, in the style of Momentum Contrast (MoCo), are updated slowly with momentum coefficient $\eta$,

$$\mathcal{M}^v \leftarrow \eta\,\mathcal{M}^v + (1-\eta)\,\mathcal{E}^v, \qquad \mathcal{M}^c \leftarrow \eta\,\mathcal{M}^c + (1-\eta)\,\mathcal{E}^c$$

and bidirectional alignment is performed with an InfoNCE loss, where $\boldsymbol{m}^v$ and $\boldsymbol{m}^c$ are the momentum-encoder outputs. The vision→touch direction is

$$\mathcal{L}_{vt} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(\boldsymbol{f}^v_i \cdot \boldsymbol{m}^c_i / \tau)} {\sum_{j \ne i} \exp(\boldsymbol{f}^v_i \cdot \boldsymbol{m}^c_j / \tau)}$$

the touch→vision direction is

$$\mathcal{L}_{tv} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(\boldsymbol{f}^c_i \cdot \boldsymbol{m}^v_i / \tau)} {\sum_{j \ne i} \exp(\boldsymbol{f}^c_i \cdot \boldsymbol{m}^v_j / \tau)}$$

and the total contrastive loss is $\mathcal{L}_{con} = \mathcal{L}_{vt} + \mathcal{L}_{tv}$ (where $B$ is the batch size and the temperature is $\tau = 0.1$).

Intuitively, (vision, touch) pairs from the same time step are pulled together ($\boldsymbol{f}_i \leftrightarrow \boldsymbol{m}_i$), while pairs from different time steps are pushed apart. This yields an aligned representation space that captures the "essence of contact" shared by the two modalities, so the policy is less disturbed by noise in the generated touch.
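The momentum update and the bidirectional loss can be sketched in plain Python as follows. The code follows the formulas as written in this review, with negatives taken over $j \ne i$ only. The value `eta=0.99`, the flat parameter lists, and all function names are assumptions for illustration, since the review does not report the momentum coefficient or implementation details.

```python
import math


def momentum_update(momentum_params, online_params, eta=0.99):
    """M <- eta * M + (1 - eta) * E, applied parameter by parameter."""
    return [eta * m + (1.0 - eta) * e for m, e in zip(momentum_params, online_params)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(features, momentum_keys, tau=0.1):
    """One direction of the loss: positive is index i, negatives are j != i."""
    batch = len(features)
    if batch < 2:
        raise ValueError("need at least two samples to form negatives")
    total = 0.0
    for i in range(batch):
        positive = dot(features[i], momentum_keys[i]) / tau
        negatives = [dot(features[i], momentum_keys[j]) / tau
                     for j in range(batch) if j != i]
        # log(exp(pos) / sum(exp(neg))) = pos - logsumexp(neg), computed stably.
        peak = max(negatives)
        log_sum = peak + math.log(sum(math.exp(n - peak) for n in negatives))
        total += positive - log_sum
    return -total / batch


def contrastive_loss(f_v, f_c, m_v, m_c, tau=0.1):
    """L_con = L_vt + L_tv."""
    return info_nce(f_v, m_c, tau) + info_nce(f_c, m_v, tau)


# Toy batch of two samples with 2-D features.
f_v = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(f_v, f_v, m_v=f_v, m_c=f_v)
swapped = contrastive_loss(f_v, f_v, m_v=f_v[::-1], m_c=f_v[::-1])
print(aligned, swapped)
```

Matching pairs give a much lower loss than swapped pairs, which is the pull-together, push-apart behaviour described above. Because the positive is excluded from the denominator in this form, the loss can become negative.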

(3) Attention-based fusion. The two aligned features are merged with multi-head attention.

$$\boldsymbol{f}^{fuse} = \boldsymbol{C}\big[\mathcal{A}^1_{cm}(\boldsymbol{f}^v, \boldsymbol{f}^c),\ \dots,\ \mathcal{A}^h_{cm}(\boldsymbol{f}^v, \boldsymbol{f}^c)\big]\boldsymbol{w}_0$$

The flattened TCP (tool center point) coordinates are appended to this fused feature to form the final observation vector.

(4) Policy learning with SAC. The reinforcement learning algorithm is Soft Actor-Critic (SAC). Its objective is

$$J(\pi) = \mathbb{E}_{(s_t, a_t)\sim \rho_\pi}\left[\sum_t \gamma^t \big(R(s_t, a_t) + \alpha\,\mathcal{H}(\pi(\cdot|s_t))\big)\right]$$

and the reward is a dense reward that jointly reduces the object-to-goal distance and the TCP-to-object distance, $R(s_t, a_t) = -d_{goal} - d_{TCP}$. The entropy term $\alpha\,\mathcal{H}$ encourages exploration, keeping the pushing motion from collapsing monotonically into a single direction.
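The dense reward and the entropy-regularized return inside the SAC objective can be written out directly. In this sketch the positions are 2-D tuples, and `gamma` and `alpha` are placeholder values chosen for illustration; the review does not report the actual discount factor or entropy coefficient.

```python
import math


def push_reward(object_xy, goal_xy, tcp_xy):
    """R = -d_goal - d_TCP: object-to-goal plus TCP-to-object distance, negated."""
    d_goal = math.dist(object_xy, goal_xy)
    d_tcp = math.dist(tcp_xy, object_xy)
    return -d_goal - d_tcp


def soft_return(rewards, entropies, gamma=0.99, alpha=0.2):
    """sum_t gamma^t * (R_t + alpha * H_t) for a single trajectory."""
    return sum(
        (gamma ** t) * (r + alpha * h)
        for t, (r, h) in enumerate(zip(rewards, entropies))
    )


# The TCP touches the object, which is 5 units from the goal.
r = push_reward(object_xy=(0.0, 0.0), goal_xy=(3.0, 4.0), tcp_xy=(0.0, 0.0))
g = soft_return(rewards=[1.0, 1.0], entropies=[0.0, 0.0], gamma=0.5, alpha=0.2)
print(r, g)
```

The reward is zero only when the object sits on the goal with the TCP in contact, so every step away from that state is penalized, which is what makes the signal dense.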

Training pipeline (pseudocode)

Training is split into two stages.

Stage 1: Train VT-Gen
  collect 1000 paired (vision, tactile) trajectories from a pretrained RL expert
  split 7:2:1 (train:val:test)
  for epoch in 1..200:
      minimize L_vgg with Adam (lr=1e-4, eps=1e-8), batch=64

Stage 2: Train VT-Con (freeze VT-Gen)
  initialize SAC policy, replay buffer size = 20000
  for step in 1..1_000_000:
      v_seq      <- last 3 camera frames
      c_gen      <- VT_Gen(v_seq)            # generated contact depth
      f_v, f_c   <- E_v(v_seq), E_c(c_gen)
      L_con      <- InfoNCE(f_v, f_c) via MoCo momentum encoders
      f_fuse     <- AttentionFusion(f_v, f_c)
      obs        <- concat(f_fuse, TCP_coords)
      action     <- SAC_policy(obs)
      update SAC with reward R = -d_goal - d_TCP
      update encoders with L_con

The key point is that VT-Gen is frozen in Stage 2. The generator is trained well once, and the policy then uses its output like a fixed "virtual tactile sensor."
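The 7:2:1 split of the 1000 paired trajectories in Stage 1 works out to 700, 200, and 100 trajectories. A minimal sketch of such a split is shown below; shuffling at the trajectory level with a fixed seed is an assumption, as the review does not say how the split was drawn.

```python
import random


def split_trajectories(n_trajectories=1000, ratios=(7, 2, 1), seed=0):
    """Shuffle trajectory ids and split them into train/val/test by ratio."""
    ids = list(range(n_trajectories))
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = n_trajectories * ratios[0] // total
    n_val = n_trajectories * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


train_ids, val_ids, test_ids = split_trajectories()
print(len(train_ids), len(val_ids), len(test_ids))
```

Splitting by trajectory rather than by frame keeps frames from one push out of both the training and test sets, which would otherwise inflate the generation metrics.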

Experiments

Setup

  • Simulator: Tactile Gym 2, UR5e robot + pushing end-effector
  • Control frequency: 500 Hz, up to 350 steps per episode
  • Workspace: 800 × 600 mm xy-plane, 2 cm height
  • Camera: Intel RealSense D435 RGB, FOV 42°, 1 m distance, tilted 30° downward
  • Image size: both vision and touch resized to 128 × 128
  • Domain randomization: variations in camera viewpoint, lighting, background, and color + random pushing trajectories generated with OpenSimplex noise

Objects. Training uses three objects from the YCB dataset: tea box, meat can, and mug. Zero-shot testing uses five objects (olive jar, apple, coffee can, soup can, ceramic cup) whose shapes were never seen during training.

Metrics. In simulation: cumulative reward, episode length, distance error (mm), and success rate (threshold 2.5 cm). In the real world: distance error (cm) and success rate (measured manually). Generation quality is measured with PSNR (↑), SSIM (↑), and LPIPS (↓).
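Two of these metrics are simple enough to sketch directly. The PSNR below is the standard definition for images scaled to `[0, max_value]`; treating an error exactly at the 2.5 cm threshold as a success is an assumption, since the review does not say whether the comparison is inclusive.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_error = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            squared_error += (p - t) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def is_success(final_error_cm, threshold_cm=2.5):
    """Assumption: an episode succeeds if the final error is within the threshold."""
    return final_error_cm <= threshold_cm


score = psnr([[0.0, 0.0]], [[0.1, 0.1]])
print(round(score, 2), is_success(1.8), is_success(6.5))
```

Higher PSNR means a smaller mean squared error relative to the image range. SSIM and LPIPS need more machinery (local statistics and a learned network, respectively) and are left out of this sketch.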

Generation quality (VT-Gen)

| Object | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| Tea Box | 30.75 | 0.9482 | 0.0101 |
| Meat Can | 20.50 | 0.8657 | 0.0327 |
| Mug | 20.25 | 0.8222 | 0.0417 |

Generation quality is highest on the flat, simply shaped tea box (PSNR 30.75, SSIM 0.95) and somewhat lower on the mug, which has many curved surfaces. This suggests that the more curved the object, the more complex its contact depth pattern and the harder it is to infer from vision alone.

Simulated pushing performance

| Object | Method | Success rate | Cumulative reward |
|---|---|---|---|
| Tea Box | Baseline (visual-only) | 12% | -155.73 |
| Tea Box | Baseline (visual & tactile) | 13% | -147.01 |
| Tea Box | ViTacGen (visual-only) | 84% | -84.35 |
| Tea Box | ViTacGen (visual & tactile) | 92% | -44.83 |
| Meat Can | Baseline (visual-only) | 16% | -169.27 |
| Meat Can | ViTacGen (visual-only) | 81% | -88.19 |
| Meat Can | ViTacGen (visual & tactile) | 86% | -44.49 |
| Mug | Baseline (visual-only) | 20% | -112.65 |
| Mug | ViTacGen (visual-only) | 86% | -41.53 |
| Mug | ViTacGen (visual & tactile) | 83% | -34.92 |

The most striking part is that even the visual-only version of ViTacGen overwhelms the baseline. On the tea box, for example, the vision-only baseline reaches 12%, while ViTacGen, also using vision alone, reaches 84%. In other words, merely adding "fake touch generated without a real tactile sensor" raised the success rate roughly sevenfold. Overall, cumulative reward improved by about 64.6% to 69.5%, episode length shortened by 12.9% to 23.3%, and distance error decreased by 13.36 to 26.94 mm.
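The sevenfold figure and the reward improvements can be checked against the tea box rows of the table above. The helper names below are hypothetical, and the choice of which rows to pair is an assumption; the review does not state which pairs the authors used for their percentage ranges.

```python
def fold_change(baseline_rate, new_rate):
    """How many times larger the new success rate is."""
    return new_rate / baseline_rate


def relative_reward_gain(baseline_reward, new_reward):
    """Fraction by which a negative cumulative reward moves toward zero."""
    return (new_reward - baseline_reward) / abs(baseline_reward)


# Tea box rows from the simulation table: (success rate in %, cumulative reward).
tea_box = {
    "Baseline (visual-only)": (12, -155.73),
    "Baseline (visual & tactile)": (13, -147.01),
    "ViTacGen (visual-only)": (84, -84.35),
    "ViTacGen (visual & tactile)": (92, -44.83),
}

visual_fold = fold_change(tea_box["Baseline (visual-only)"][0],
                          tea_box["ViTacGen (visual-only)"][0])
visual_gain = relative_reward_gain(tea_box["Baseline (visual-only)"][1],
                                   tea_box["ViTacGen (visual-only)"][1])
tactile_gain = relative_reward_gain(tea_box["Baseline (visual & tactile)"][1],
                                    tea_box["ViTacGen (visual & tactile)"][1])

print(f"visual-only success: {visual_fold:.1f}x")
print(f"visual-only reward gain: {visual_gain:.1%}")
print(f"visual & tactile reward gain: {tactile_gain:.1%}")
```

Pairing the two visual & tactile rows gives a gain of about 69.5%, which coincides with the upper end of the range quoted above, while the visual-only pair gives a smaller gain.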

An interesting contrast: the baseline barely improved even when given real touch (visual & tactile; tea box 12% → 13%). This suggests that simply merging tactile data is not enough, and that ViTacGen's representation learning, contrastive alignment plus attention fusion, is the real source of the gain.

Real-world results

| Object | Baseline success rate | ViTacGen success rate | Baseline error | ViTacGen error |
|---|---|---|---|---|
| Tea Box | 14% | 76% | 6.5 cm | 2.6 cm |
| Meat Can | 8% | 82% | 8.2 cm | 1.9 cm |
| Mug | 10% | 86% | 7.2 cm | 1.8 cm |

These are the results of zero-shot deployment on a real robot equipped with vision only and no physical tactile sensor. Success rates jumped well above the baseline (for example, mug 10% → 86%), and distance error fell from the 6.5–8.2 cm range to 1.8–2.6 cm. This is strong evidence that the policy trained in simulation transferred well to the real world.
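Averaging the three real-world rows gives a compact view of the gap. The numbers below are copied from the table above; the averages themselves are derived here and are not figures reported in the review.

```python
# Real-world results per object: success rates in %, distance errors in cm.
real_world = {
    "Tea Box": {"baseline_sr": 14, "vitacgen_sr": 76, "baseline_err": 6.5, "vitacgen_err": 2.6},
    "Meat Can": {"baseline_sr": 8, "vitacgen_sr": 82, "baseline_err": 8.2, "vitacgen_err": 1.9},
    "Mug": {"baseline_sr": 10, "vitacgen_sr": 86, "baseline_err": 7.2, "vitacgen_err": 1.8},
}


def mean_of(key):
    """Average one column over the three objects."""
    values = [row[key] for row in real_world.values()]
    return sum(values) / len(values)


summary = {key: mean_of(key) for key in
           ("baseline_sr", "vitacgen_sr", "baseline_err", "vitacgen_err")}
for key, value in summary.items():
    print(f"{key}: {value:.1f}")
```

On average the success rate rises from roughly 11% to roughly 81%, and the distance error drops from about 7.3 cm to about 2.1 cm.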

Zero-shot generalization (unseen objects)

Averaged over the five objects (olive jar, apple, coffee can, soup can, ceramic cup), the method achieved a success rate of about 75.2% and a distance error of about 3.8 cm. The apple was best at 82% and the soup can lowest at 70%. That success stays at 70% or above on shapes absent from training supports the interpretation that the contact depth representation does not overfit to object shape and instead captures general contact geometry cues.

Ablation study

Fusion method (threshold 4.0 cm):

| Fusion | Success rate | Reward | Selected |
|---|---|---|---|
| Addition | 77% | -70.00 | |
| Concatenation | 90% | -67.84 | |
| Attention | 95% | -66.92 | ✓ |

Contrastive learning method (threshold 4.0 cm):

| Method | Success rate | Reward | Selected |
|---|---|---|---|
| SimCLR | 89% | -70.03 | |
| MoCo | 95% | -66.92 | ✓ |

Attention fusion outperforms simple addition and concatenation (95% vs. 77% and 90%), and MoCo outperforms SimCLR (95% vs. 89%). This fits the intuition that attention can dynamically reweight vision and touch depending on the situation, for example by focusing more on touch at moments when contact matters.

Critical discussion

Strengths

  • Removing the hardware dependency: this is the biggest practical value. The benefits of a tactile policy are obtained without expensive, fragile tactile sensors, and the policy deploys zero-shot on vision-only robots.
  • A clever choice of representation: by generating sensor-invariant contact depth instead of sensor-dependent raw optical images, the authors stabilized vision→touch learning.
  • Pinpointing what makes fusion work: giving the baseline real touch had little effect, whereas ViTacGen improved greatly. This contrast shows convincingly that the performance comes from learning an aligned, fused representation and not merely from the presence of tactile data.
  • Real-time capability: a generation speed of 305.90 FPS (about 3.3 ms per image) is fast enough for real-time use, though not for regenerating touch on every tick of a 500 Hz (2 ms) control loop.

Weaknesses and limitations

  • Limited information in contact depth: as the authors acknowledge, contact depth captures only high-level contact geometry and cannot represent fine-grained physical quantities such as the local force distribution or shear force. This may become a bottleneck when extending to tasks where slip detection or friction estimation is decisive.
  • A single task, pushing: validation is limited to planar pushing. Generalization to tasks with more complex contact dynamics, such as grasping or in-hand manipulation, remains an open question. (Speculation: given that generation quality dropped on the curved mug, generator quality may become a bigger constraint on tasks with more complex contact.)
  • Sim-to-real noise: noise artifacts are reported in the real-world tactile predictions. The gap between simulated and real visuals degrades the quality of the generated touch.
  • Expert-dependent training: the VT-Gen training data is collected from 1,000 trajectories of a "pretrained RL expert." The generator may be biased toward this expert's behavior distribution, and its generation quality in states the expert never reached is uncertain.
  • Object diversity: only three objects are used for training. The zero-shot results are encouraging, but robustness needs to be verified on a broader range of shapes and materials.

Summary and conclusion

ViTacGen is a framework that systematically implements the idea that "touch generated from vision alone can stand in for a real tactile sensor." Its core is VT-Gen, which generates sensor-invariant contact depth images from a visual sequence, and VT-Con, which aligns the generated touch with vision through contrastive learning (MoCo/InfoNCE), fuses them with attention, and trains an SAC policy.

Experimentally, it raised simulated success rates several-fold over the baseline (for example, 12% → 84–92% on the tea box), and when deployed zero-shot on a real robot with no tactile sensor it reached a success rate of up to 86% with a distance error of around 2 cm. It also held at an average of about 75% on unseen objects, showing its potential to generalize.

For robotics practitioners, the message of this work is clear: if you cannot put an expensive tactile sensor on every robot, having the robot "imagine" touch from vision is a practical way to bridge the gap. What remains to be solved is the force and shear information that contact depth cannot capture, and the extension beyond pushing to more complex manipulation. As future directions, the authors themselves propose explicit prediction of physical quantities, inference of shear force from object motion, and further narrowing of the vision-touch modality gap.

Copyright 2026, JungYeon Lee