Curieux.JY
  • JungYeon Lee

📃 Gating-Based Vision-Proprioception Fusion

tactile
fusion
classification
Low-Cost Gating-Based Vision and Proprioception Fusion for Object Property Classification
Published

May 4, 2026

  • Paper Link
  • Poster Link

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.


🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Introduction

When picking up an object, a person almost unconsciously combines what they see (appearance, color, texture) with what they feel in the hand (weight, hardness, material) and reaches a judgment such as "this is a heavy, hard metal cup." The catch is that appearance does not always report physical properties honestly. Two objects that look identical can differ completely in weight, stiffness, and material. An empty PET bottle and one filled with water are almost indistinguishable in a photo.

This is exactly the question that this paper (University of Manchester, ICRA 2026 ViTac workshop, Paper ID 8) tackles head-on.

"When appearance no longer predicts physical properties, can a robot still classify an object's mass, stiffness, and material?"

To answer this question, the authors propose two things.

  1. Low-cost visuo-proprioceptive fusion: instead of expensive and fragile dedicated tactile hardware such as GelSight-style optical tactile sensors or force-torque sensors, tactile evidence is replaced by the internal signals (position, load, current, velocity) of the servo motors already built into the robot arm. To this the method adds a single top-down RGB photo taken just before the grasp.
  2. Gating-based adaptive fusion: the model learns a gate that dynamically suppresses visual evidence when the two modalities conflict, that is, when vision is lying.

Core motivation: "appearance is an unreliable proxy for physics"

Vision-based methods are strong at semantic reasoning, so they can estimate properties from common sense (for example, "it looks like a brick, so it must be heavy"). But this relies on a spurious correlation between appearance and physical properties. The moment that correlation breaks (a lightweight fake brick, a squishy metal-colored object), the vision model falls apart.

Tactile sensors give more direct and trustworthy physical evidence, but they are expensive. The authors' insight is that "the robot arm's servo signals are themselves an essentially free tactile channel." While the arm grips and lifts an object, the time series of load, current, position, and velocity experienced by the motors retains clear traces of the object's weight and deformability.

Why "gating" is needed: intuition

The most common way to combine two senses is simply to concatenate them (concatenation, vanilla fusion). But that amounts to "always mixing two advisors' opinions in the same proportion," so when one side lies, the contamination seeps straight into the result. The paper's key experimental result supports this: on deceptive objects, vanilla fusion ends up worse on average than proprioception alone.

Gating, by analogy, is "a presiding judge who reduces a visual witness's say when that witness is suspected of lying." When the two modalities contradict each other, it pushes the visual tokens toward a learned null token and puts the weight on the trustworthy proprioception.

Method

Overall pipeline

The model has three stages: (1) modality-specific encoding → (2) token-level gated fusion → (3) multi-task prediction through a shared Transformer.

flowchart LR
    IMG[Pre-grasp RGB image] --> VENC[Frozen ResNet<br/>Visual Encoder]
    SRV[Servo signals:<br/>position, load,<br/>current, velocity] --> PENC[Temporal<br/>Proprio Encoder]
    VENC --> V["Visual tokens V"]
    PENC --> P["Proprio tokens P_tok"]
    V --> VPOOL[Avg Pool -> v_g]
    P --> PPOOL[Masked Pool -> p_g]
    VPOOL --> GATE[Conflict Estimator:<br/>MLP + Sigmoid]
    PPOOL --> GATE
    GATE --> G["gate g in 0..1"]
    G --> GV["Visual Gating:<br/>g*V + (1-g)*v_null"]
    V --> GV
    GV --> VT["Gated visual tokens"]
    VT --> TR[Shared Transformer<br/>Encoder + CLS]
    P --> TR
    TR --> M[mass]
    TR --> S[stiffness]
    TR --> U[material]
    PPOOL --> AUX[Proprio Auxiliary Heads]
    AUX --> AM[mass]
    AUX --> AS[stiffness]
    AUX --> AU[material]

Modality-specific encoders

  • Visual branch: a frozen ResNet encoder extracts visual tokens V \in \mathbb{R}^{N_v \times d} from the pre-grasp RGB image. Freezing is another facet of "low-cost": the pretrained backbone is used as-is, which reduces the training burden.
  • Proprioceptive branch: a temporal encoder maps the raw servo-signal time series to proprioceptive tokens P_{tok} \in \mathbb{R}^{N_p \times d}. (The baselines use a 1D-CNN in this position.)

Here N_v is the number of visual tokens, N_p is the number of proprioceptive tokens, and d is the shared embedding dimension.

The core of the gating: token-level visual suppression

The gating in this paper differs subtly from the common "weighted average" approach. The key is to suppress the visual tokens by interpolating them toward a learned null token.

First, each modality is compressed into a summary vector. Global average pooling over the visual tokens gives v_g, and masked pooling over the proprioceptive tokens (excluding padding) gives p_g. The two summaries are then concatenated and passed through an MLP + Sigmoid to obtain a scalar gate g \in (0,1). This module is the Conflict Estimator, which estimates how strongly the two modalities contradict each other.

g = \sigma\big(\mathrm{MLP}([\,v_g \,;\, p_g\,])\big), \qquad g \in (0,1)
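
The pooling and gate computation can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the single hidden ReLU layer, the layer sizes, the random weights, and all function names are assumptions, since the review does not state the MLP architecture.

```python
import math
import random

def avg_pool(tokens):
    """Mean over the token axis: N vectors of length d -> one vector of length d."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]

def masked_pool(tokens, mask):
    """Mean over tokens whose mask entry is 1; padded tokens (mask 0) are ignored."""
    kept = [token for token, keep in zip(tokens, mask) if keep]
    return avg_pool(kept)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def conflict_gate(v_g, p_g, w1, b1, w2, b2):
    """g = sigmoid(MLP([v_g ; p_g])); the one-hidden-layer MLP is an assumed design."""
    x = v_g + p_g  # list concatenation plays the role of [v_g ; p_g]
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    (logit,) = linear(hidden, w2, b2)
    return sigmoid(logit)

# Toy inputs with random values, purely for shape checking.
rng = random.Random(0)
d, hidden_dim = 4, 8
V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(3)]      # N_v = 3 visual tokens
P_tok = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(5)]  # N_p = 5, last two are padding
pad_mask = [1, 1, 1, 0, 0]
w1 = [[rng.gauss(0, 0.5) for _ in range(2 * d)] for _ in range(hidden_dim)]
b1 = [0.0] * hidden_dim
w2 = [[rng.gauss(0, 0.5) for _ in range(hidden_dim)]]
b2 = [0.0]

g = conflict_gate(avg_pool(V), masked_pool(P_tok, pad_mask), w1, b1, w2, b2)
print(f"gate g = {g:.3f}")
```

Because the gate is computed from the two pooled summaries, it is a single scalar per sample, which is what makes it easy to read as a diagnostic.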

The visual tokens are then gated as follows.

\tilde{V} = g\,V + (1 - g)\,v_{\text{null}} \tag{1}

Here v_{\text{null}} is a learnable null token that is broadcast across the entire visual sequence. Intuitively:

  • g \to 1: "trust vision" → \tilde{V} \approx V (the original visual tokens are kept)
  • g \to 0: "vision is lying" → \tilde{V} \approx v_{\text{null}} (the visual tokens are suppressed by replacing them with an uninformative null)

The appeal of this design is that vision is not merely "mixed in weakly": under conflict it is swapped out for a learned neutral token, which actively blocks the deceptive visual information.
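
As a small worked example of Eq. (1), the sketch below applies the interpolation to two toy visual tokens. The function name and the numeric values are illustrative assumptions; in the actual model v_null is a learned parameter.

```python
def gate_visual_tokens(V, v_null, g):
    """Eq. (1): V_tilde = g * V + (1 - g) * v_null, with v_null broadcast over every token."""
    return [[g * v + (1.0 - g) * n for v, n in zip(token, v_null)] for token in V]

V = [[1.0, 2.0], [3.0, 4.0]]   # two visual tokens, d = 2
v_null = [0.0, -1.0]           # stand-in for the learned null token

print(gate_visual_tokens(V, v_null, 1.0))  # g = 1: tokens are unchanged
print(gate_visual_tokens(V, v_null, 0.0))  # g = 0: every token becomes v_null
print(gate_visual_tokens(V, v_null, 0.5))  # g = 0.5: halfway between the two
```

Note that at g = 0 all visual tokens collapse to the same vector, so the downstream Transformer still receives a sequence of the same length but with no image-specific content.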

The gated visual tokens \tilde{V}, the proprioceptive tokens P_{tok}, and a [CLS] token for classification are concatenated and fed into a shared Transformer encoder, which predicts mass, stiffness, and material simultaneously.

Auxiliary supervision and gate regularization: the loss function

Gating carries one risk. If vision is helpful often enough, the proprioceptive branch can become lazy and fail to learn an independent physical representation. To prevent this, the authors attach three auxiliary heads to p_g that predict the same targets (mass, stiffness, material) from proprioception alone. This forces the proprioceptive branch to keep a discriminative representation of its own, without leaning on vision.

The total loss consists of three terms.

\mathcal{L} = \sum_{k \in \{m, s, u\}} \Big[ \mathcal{L}_{CE}(\hat{y}_k, y_k) + \lambda_{aux}\,\mathcal{L}_{CE}(\hat{y}_k^{aux}, y_k) + \lambda_{reg}\,R_{ent}(g) \Big] \tag{2}

Here m, s, and u denote the mass, stiffness, and material tasks, respectively (u is the index used for material).

  • First term: the main classification loss, a cross-entropy on the three properties predicted from the fused representation.
  • Second term: the auxiliary proprioceptive loss, a cross-entropy that makes proprioception alone hit the same targets (weighted by \lambda_{aux}).
  • Third term: the gate entropy regularizer R_{ent}(g), which minimizes the negative Bernoulli entropy of the gate. This keeps the gate from saturating prematurely at 0 or 1 and encourages cross-modal exploration of both modalities early in training.
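
The three terms can be written out as a short sketch. It follows Eq. (2) as given in this review, with the regularizer inside the sum over tasks. The closed form R_ent(g) = g log g + (1 - g) log(1 - g) with natural logarithms is an assumption based on the phrase "negative Bernoulli entropy", and the values of lam_aux and lam_reg are placeholders because the paper's values are not reported here.

```python
import math

TASKS = ("mass", "stiffness", "material")

def cross_entropy(probs, target):
    """Cross-entropy for one sample, given class probabilities and the true class index."""
    return -math.log(probs[target])

def neg_bernoulli_entropy(g, eps=1e-8):
    """Assumed form of R_ent(g): lowest at g = 0.5, rising toward 0 as g nears 0 or 1."""
    g = min(max(g, eps), 1.0 - eps)  # clamp to keep log finite
    return g * math.log(g) + (1.0 - g) * math.log(1.0 - g)

def total_loss(main_probs, aux_probs, targets, g, lam_aux, lam_reg):
    """Sum over tasks of main CE + lam_aux * auxiliary CE + lam_reg * R_ent(g)."""
    loss = 0.0
    for task in TASKS:
        loss += cross_entropy(main_probs[task], targets[task])
        loss += lam_aux * cross_entropy(aux_probs[task], targets[task])
        loss += lam_reg * neg_bernoulli_entropy(g)
    return loss

# Toy predictions for one sample (3, 4, and 5 classes); all numbers are made up.
targets = {"mass": 0, "stiffness": 2, "material": 4}
main_probs = {
    "mass": [0.8, 0.1, 0.1],
    "stiffness": [0.1, 0.1, 0.7, 0.1],
    "material": [0.05, 0.05, 0.1, 0.1, 0.7],
}
aux_probs = {
    "mass": [0.9, 0.05, 0.05],
    "stiffness": [0.2, 0.2, 0.4, 0.2],
    "material": [0.1, 0.1, 0.2, 0.2, 0.4],
}
loss = total_loss(main_probs, aux_probs, targets, g=0.6, lam_aux=0.5, lam_reg=0.1)
print(f"total loss = {loss:.4f}")
```

Minimizing this regularizer pulls g toward 0.5, which is the opposite of saturation. That matches the stated purpose of keeping both modalities in play early in training.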

Pseudocode

Input: pre-grasp image x_v, servo signal sequence x_p
V      = FrozenResNet(x_v)               # visual tokens, R^{Nv x d}
P_tok  = TemporalEncoder(x_p)            # proprio tokens, R^{Np x d}

v_g    = avg_pool(V)                     # visual summary
p_g    = masked_pool(P_tok)              # proprio summary (ignore padding)
g      = sigmoid(MLP(concat(v_g, p_g)))  # conflict gate, scalar in (0,1)

V_tilde = g * V + (1 - g) * v_null       # suppress visual tokens on conflict
tokens  = concat(CLS, V_tilde, P_tok)
feat    = Transformer(tokens)
y_main  = heads(feat)                    # mass, stiffness, material
y_aux   = aux_heads(p_g)                 # proprio-only predictions
return y_main

Adversarial dataset: the "visual trap"

A contribution as important as the method is the adversarial dataset. The authors built a dataset from 16 custom-made objects.

  • Training set: designed so that visual features (color, texture) correlate strongly with physical properties (mass, stiffness, material). In other words, the model is deliberately led to learn a "visual shortcut."
  • Test (unseen) set: the correlation is deliberately broken, for example with an object that looks identical to a heavy training object but is actually light.

The data were collected with a standardized, automated grasp-and-lift procedure, yielding more than 800 samples. Each sample consists of one global photo taken just before the grasp and a segment of multi-channel servo signals (position, load, current, velocity) recorded during the interaction.
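
One way to picture a single sample is as a small record with an image reference, four servo channels, and three labels. The class and field names below are hypothetical; the review does not describe the dataset's actual file layout or sampling rate.

```python
from dataclasses import dataclass

SERVO_CHANNELS = ("position", "load", "current", "velocity")

@dataclass
class GraspSample:
    """Hypothetical container for one grasp-and-lift sample."""
    image_path: str   # pre-grasp top-down RGB photo
    servo: dict       # channel name -> list of readings over time
    mass: int         # class index, 3 classes
    stiffness: int    # class index, 4 classes
    material: int     # class index, 5 classes

    def num_steps(self):
        """Length of the servo segment; all four channels must be aligned."""
        lengths = {len(self.servo[channel]) for channel in SERVO_CHANNELS}
        if len(lengths) != 1:
            raise ValueError("servo channels have different lengths")
        return lengths.pop()

# Made-up readings, only to show the structure.
sample = GraspSample(
    image_path="example_pre_grasp.png",
    servo={
        "position": [0.10, 0.12, 0.15],
        "load": [0.0, 0.3, 0.5],
        "current": [0.02, 0.20, 0.31],
        "velocity": [0.0, 0.4, 0.1],
    },
    mass=2, stiffness=1, material=3,
)
print(sample.num_steps())
```

Segments of different lengths would need padding before batching, which is why the proprioceptive summary uses masked pooling.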

Because of this adversarial split, high unseen accuracy indicates not just ordinary generalization but resistance to a deliberately induced visual bias.
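
A toy example shows why such a split is so punishing for a model that relies on appearance. The object names and labels are invented for illustration, and the lookup-table "classifier" is only a stand-in for a vision-only shortcut, not a model from the paper.

```python
from collections import Counter

# Hypothetical objects as (appearance, mass class) pairs.
train_objects = [
    ("red_block", "heavy"), ("red_block", "heavy"),
    ("blue_block", "light"), ("blue_block", "light"),
]
# Unseen objects look the same but the appearance-to-mass correlation is flipped.
unseen_objects = [("red_block", "light"), ("blue_block", "heavy")]

def fit_appearance_shortcut(objects):
    """Memorize the most frequent label for each appearance."""
    counts = {}
    for appearance, label in objects:
        counts.setdefault(appearance, Counter())[label] += 1
    return {appearance: c.most_common(1)[0][0] for appearance, c in counts.items()}

def accuracy(shortcut, objects):
    hits = sum(shortcut[appearance] == label for appearance, label in objects)
    return hits / len(objects)

shortcut = fit_appearance_shortcut(train_objects)
print("seen accuracy:", accuracy(shortcut, train_objects))
print("unseen accuracy:", accuracy(shortcut, unseen_objects))
```

A shortcut learner is not merely uninformed on the unseen set; it is systematically wrong, so its accuracy can fall below what uniform guessing would give.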

Experiments

Protocol

Evaluation is a unified task that predicts all three properties at once.

  • Mass: 3 classes
  • Stiffness: 4 classes
  • Material: 5 classes

The average accuracy over the three tasks is reported across 5 random seeds. Seen objects are objects that appeared in training; unseen objects come from the visually deceptive OOD test set.
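
The reporting scheme can be sketched as follows: average the three task accuracies within each seed, then take the mean and standard deviation across seeds. Whether the paper uses the sample or the population standard deviation is not stated, so the use of the sample version here is an assumption, and the input numbers are placeholders rather than results from the paper.

```python
from statistics import mean, stdev

def summarize(per_seed_task_acc):
    """per_seed_task_acc: one dict per seed, mapping task name -> accuracy in percent."""
    per_seed_mean = [mean(acc.values()) for acc in per_seed_task_acc]
    return mean(per_seed_mean), stdev(per_seed_mean)

# Placeholder accuracies for two seeds (not from the paper).
runs = [
    {"mass": 100.0, "stiffness": 90.0, "material": 80.0},
    {"mass": 100.0, "stiffness": 100.0, "material": 100.0},
]
avg, spread = summarize(runs)
print(f"{avg:.2f} ± {spread:.2f}")
```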

Baselines

| Baseline | Visual processing | Proprioceptive processing | Fusion |
|---|---|---|---|
| Vision-only | ResNet-18 + Transformer | None | - |
| Proprio-only | None | 1D-CNN + Transformer | - |
| Vanilla Fusion | ResNet | 1D-CNN | Token concat (no gating; representative of early fusion) |
| Ours (Gated Fusion) | Frozen ResNet | Temporal encoder | Gating + auxiliary supervision |

Main results: unseen objects

Key numbers from Table I of the paper (accuracy in %, mean ± std, 5 seeds):

| Method | Seen-object | Unseen-object | Gate |
|---|---|---|---|
| Vision-only | 95.39 ± 0.73 | 18.00 ± 6.16 | - |
| Proprio-only | 95.29 ± 1.93 | 87.89 ± 1.62 | - |
| Vanilla Fusion | 99.31 ± 0.73 | 85.56 ± 8.39 | - |
| Ours (Gated Fusion) | 99.71 ± 0.59 | 97.61 ± 3.68 | 0.589 |

Points to take away:

  1. Collapse of Vision-only: accuracy is 95.39% on seen objects but plunges to 18.00% on unseen ones. That is at or below chance for the 3-, 4-, and 5-class tasks (about 33%, 25%, and 20%), clear evidence that the model relied entirely on the visual shortcut.
  2. Vanilla Fusion is worse than Proprio-only: 85.56% < 87.89%. Plain concatenation absorbs the deceptive visual information as-is and ends up below proprioception alone. Its std of 8.39 also shows that it is very unstable. This is a strong counterexample to the common belief that "just combining modalities is enough."
  3. Gated Fusion wins: 97.61% on unseen objects, well ahead of both Proprio-only (+9.72 percentage points) and Vanilla Fusion (+12.05 percentage points). On seen objects it is also near the ceiling at 99.71%.
  4. Gate value 0.589: the mean gate sits slightly above 0.5, which suggests that the model trusts vision selectively rather than discarding it entirely.

Per-task unseen accuracy

| Method | Mass | Stiffness | Material |
|---|---|---|---|
| Vision-only | 17.17 ± 6.55 | 17.83 ± 6.84 | 19.00 ± 5.15 |
| Proprio-only | 100.00 ± 0.00 | 81.00 ± 2.76 | 82.67 ± 2.20 |
| Vanilla Fusion | 87.67 ± 7.91 | 84.50 ± 9.61 | 84.50 ± 8.04 |
| Ours (Gated Fusion) | 100.00 ± 0.00 | 95.17 ± 7.61 | 97.67 ± 3.43 |

Interpretation:

  • Mass: proprioception alone already predicts mass perfectly (100%) on unseen objects, because weight shows up directly in the motor load and current when the object is lifted. Gated Fusion keeps this at 100%.
  • Stiffness and material: proprioception alone is limited, at 81.00% and 82.67%. For these two properties visual information (texture, gloss) is a useful complement, and gating draws on it selectively to raise accuracy to 95.17% and 97.67%.
  • In short, the results are consistent with the intuition that "mass is accurate once you touch the object, but stiffness and material need help from vision," with the gated model balancing the two sources automatically.
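
As a sanity check, the headline unseen accuracies can be recomputed from the per-task table, together with the uniform-guess baseline for 3, 4, and 5 classes. The numbers are copied from the tables in this review; the variable names are illustrative.

```python
# Per-task unseen accuracy (%) as (mass, stiffness, material), copied from the table above.
per_task_unseen = {
    "Vision-only": (17.17, 17.83, 19.00),
    "Proprio-only": (100.00, 81.00, 82.67),
    "Vanilla Fusion": (87.67, 84.50, 84.50),
    "Ours (Gated Fusion)": (100.00, 95.17, 97.67),
}
num_classes = {"mass": 3, "stiffness": 4, "material": 5}

# Average over the three tasks, rounded to two decimals as in the reported table.
overall = {name: round(sum(acc) / len(acc), 2) for name, acc in per_task_unseen.items()}

# Expected accuracy of uniform random guessing, averaged over the three tasks.
chance = sum(100.0 / k for k in num_classes.values()) / len(num_classes)

gain_over_proprio = overall["Ours (Gated Fusion)"] - overall["Proprio-only"]
gain_over_vanilla = overall["Ours (Gated Fusion)"] - overall["Vanilla Fusion"]

for name, value in overall.items():
    print(f"{name}: {value:.2f}")
print(f"uniform-guess average: {chance:.2f}")
print(f"gain over Proprio-only: {gain_over_proprio:.2f} points")
print(f"gain over Vanilla Fusion: {gain_over_vanilla:.2f} points")
```

The recomputed averages match the unseen-object column of the main results, and the Vision-only average falls below the uniform-guess level, which is what one would expect from a model that follows a deliberately inverted shortcut.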

Critical discussion

Strengths

  • Clear problem setup: the biggest contribution is turning the "when appearance lies" scenario into something measurable through the adversarial dataset. The contrast of Vision-only collapsing to 18% exposes the visual-shortcut problem convincingly.
  • Low-cost practicality: no extra tactile sensor, just servo signals plus a single camera image. The approach can be ported to existing manipulators directly, and the barrier to industrial use is low.
  • An honest counterexample to vanilla fusion: the paper's own experiments refute the belief that "simple fusion is a cure-all" (85.56% < 87.89%), which backs the need for gating with data.
  • Interpretability: the gate value g is itself a diagnostic signal showing "how much vision was trusted this time." The reported mean of 0.589 puts a number on that selective trust.
  • Null token + entropy regularization: replacing vision with a learned null instead of a plain weighted sum, and regularizing against early gate saturation, are refined details in the design of gated fusion.

Weaknesses and limitations

  • Dataset size: 16 objects and 800-odd samples are reasonable for a workshop paper, but generalization to unknown object categories and environments is not verified. Because the adversarial split is made within the 16 objects, diversity is limited.
  • High variance: the std of Gated Fusion's unseen accuracy is 3.68 overall (3.43 for material, 7.61 for stiffness), which is not small. The seed-to-seed variation suggests that gate learning may be unstable with little data.
  • Inherent constraint of proprioception: servo signals only become meaningful after contact. A grasp-and-lift must be performed, so contact-free prediction in advance ("estimate just by looking") is impossible. The results also depend on the sensing resolution of the robot and gripper used.
  • The gate is a scalar: a sample-wise scalar gate is simple and easy to interpret, but there is room to suppress vision at a finer grain, per channel or per token. (This looks like a trade-off in which the paper deliberately chose simplicity.)
  • Hyperparameter details not disclosed: concrete values for \lambda_{aux}, \lambda_{reg}, the token counts N_v and N_p, the embedding dimension d, and so on are not stated in the paper. (Speculation) This is probably due to the length limit of a workshop short paper.

Comparison with related work

  • Tactile / force-torque based property estimation [5,6,7]: precise, but requires expensive and fragile hardware. This paper takes a low-cost route that replaces or approximates it with built-in servo signals.
  • Vision-based physical reasoning, [3] GaussianProperty and [4] Tactile-Vision-Language models: strong semantic reasoning, but vulnerable to spurious appearance-physics correlations. The adversarial dataset in this paper targets exactly this weakness.
  • VisuoTactile interactive perception [1]: the line of active vision-tactile fusion. This paper differs in using proprioception instead of dedicated tactile sensing.
  • Standard early fusion (multimodal robot learning) [11,12]: vanilla fusion (concat) serves as its representative proxy. This paper shows experimentally that it is vulnerable under modality conflict and improves on it with gating.

The novelty of this paper lies not in the technique of "gating" itself but in the combination and the problem setup: (1) using low-cost proprioception as a substitute for tactile sensing, (2) measuring robustness with an adversarial visual-trap dataset, and (3) suppressing visual bias with null-token interpolation plus proprioceptive auxiliary supervision.

Summary and conclusion

Without expensive tactile sensors, this paper classifies an object's mass, stiffness, and material by fusing the servo signals already present in the robot (position, load, current, velocity) with vision through gating. There are three key points.

  1. Reuse of a low-cost modality: instead of expensive tactile hardware, the internal motor signals recorded during grasp-and-lift serve as an "essentially free" tactile channel.
  2. Measuring robustness with an adversarial dataset: 16 objects that deliberately break the spurious appearance-physics correlation reveal whether a model reasons about real physics or relies on a visual shortcut.
  3. Conflict-aware gating: when vision lies, the visual tokens are suppressed toward a learned null token, with proprioceptive auxiliary supervision and gate entropy regularization for stability.

The experimental results are convincing. On deceptive unseen objects, Vision-only collapses to 18.00% and Vanilla Fusion reaches 85.56%, which is worse than Proprio-only (87.89%), whereas the proposed Gated Fusion achieves 97.61%. Per task, mass reaches 100% with proprioception alone, while for stiffness and material gating draws on vision selectively and raises accuracy to 95.17% and 97.67%, respectively.

The core message is clear.

"Low-cost proprioception provides trustworthy physical grounding. Vision should not be trusted uniformly; it should be used selectively."

One-line summary: "Instead of buying more sensors, let the robot listen wisely and selectively to what it already feels," and know how to silence vision when it lies.

Copyright 2026, JungYeon Lee