Curieux.JY
  • JungYeon Lee

📃 Tactile-VLA

tactile
vla
cot
Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization
Published

May 23, 2026

  • Paper
  1. 🦾 Tactile-VLA is a new framework that deeply fuses tactile sensing into Vision-Language-Action (VLA) models, substantially improving fine force control and physical interaction in contact-rich tasks.
  2. 🧠 The model draws on the physical knowledge latent in the VLM to generalize force-related language such as 'softly' and 'hard', to apply a force suited to an object's properties, and to diagnose failures from tactile feedback and adapt its force strategy.
  3. 🚀 In experiments, Tactile-VLA activates the VLM's prior knowledge with only a few demonstrations, generalizes well in zero-shot, cross-object, and force-sensitive settings, and outperforms existing VLA models.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

Tactile-VLA proposes a new framework that aims to activate the latent physical knowledge of Vision-Language-Action (VLA) models so that they can carry out precise physical interaction in contact-rich manipulation tasks. Existing VLA models excel at high-level reasoning and planning, but they struggle to ground themselves in physical reality in scenarios that require precise force control. Tactile-VLA closes this gap by deeply fusing tactile sensing with the vision, language, and action modalities. The core idea is that the prior knowledge of a VLM (Vision-Language Model) already contains a semantic understanding of physical interaction, and that connecting it to the robot's tactile sensors through a small number of tactile demonstrations is enough to achieve zero-shot generalization.

The model enables tactile generalization through three main capabilities.

  1. Tactile-Aware Instruction Following: it learns the meaning of force-related language such as "gently" or "hard" and applies it to new tasks for which only the motion was learned, generalizing language-based force control.
  2. Utilizing Tactile-Relevant Common Sense: it adjusts contact behavior to an object's properties based on visual and contextual cues, for example a strong grip for a heavy steel ball and a gentle grip for a fragile dragon fruit (pitaya).
  3. Adaptive Tactile-Involved Reasoning: when a task fails, it diagnoses the cause from tactile feedback and adapts autonomously by taking corrective action (for example, increasing the downward force after an initial failure to erase a stubborn mark on a board).

Core Methodology:

Tactile-VLA's architecture processes multimodal sensor inputs (vision, language, tactile, and proprioceptive state) and produces force-aware action outputs.

1. Policy Architecture and Learning:

  • Goal: translate an abstract understanding of interaction into precise real-world force control, so that instructions sharing the same motion but differing in force (e.g. "insert the USB firmly" vs. "insert the USB gently") can be told apart.
  • Architecture: token-level fusion.
    • Input encoding:
      • Vision: a pretrained Vision Transformer (ViT) encoder E'_{vis} encodes each of the last H frames into its own token sequence.
      • Tactile: a simple MLP E'_\psi processes the history of H tactile measurements and produces a single fused token.
      • Language: a standard language tokenizer E_{lang}.
    • Unified input sequence (S_t): the tokens of all modalities are concatenated as S_t = [E'_{vis}(I_{t-H+1}), \dots, E'_{vis}(I_t), E_{lang}(L_t), E'_{\psi}([T_{t-H+1}, \dots, T_t])], where I denotes images, L the language instruction, and T the tactile signal.
    • Transformer backbone: S_t is processed by a Transformer backbone with non-causal attention. Vision, language, and tactile tokens can therefore interact freely and form a deeply integrated contextual representation.
    • Action output: this representation is passed to a tactile-aware action expert, which outputs an augmented action vector a_t. a_t contains a target position P_{target} and a target contact force F_{target}.
  • Learning: the model is fine-tuned end-to-end with imitation learning.
    • Shared components are initialized from the parameters of the pretrained \pi_0 (Black et al., 2024); new modules such as the tactile encoder and the modified action expert are randomly initialized.
    • Training uses a Conditional Flow Matching (CFM) objective, and the loss penalizes deviations of the predicted action sequence in both the kinematic and the force dimensions. Through this mechanism the model learns a direct mapping between a linguistic nuance (e.g. "gently") and the corresponding physical force magnitude (e.g. 0.5N).

2. Hybrid Position-Force Controller:

  • Once the tactile-aware action expert has determined P_{target} and F_{target}, a low-level controller balances the two objectives.
  • Approach: position-dominant, with an indirect force control scheme inspired by the principle of impedance control.
  • Operation: the controller computes the force error \Delta F = F_{target} - F_{measured} and applies a corrective positional adjustment only when the error magnitude \left \| \Delta F \right \| exceeds a predefined threshold \tau. P_{hybrid} = \begin{cases} P_{target} + K \cdot \Delta F & \text{if } \left \| \Delta F \right \| > \tau \\ P_{target} & \text{if } \left \| \Delta F \right \| \leq \tau \end{cases} where K is a gain matrix.
  • Implementation: a PID (Proportional-Integral-Derivative) controller drives the robot's joints toward the dynamically updated P_{hybrid}.
  • Force decoupling: control of the net external force is separated from control of the internal grasping force. The gripper's Cartesian position regulates the net external force, and the gripper width regulates the internal grasping force.

3. Tactile-VLA-CoT: Reasoning-Based Adaptation:

  • Goal: activate the VLM's latent reasoning skills to obtain robust adaptation.
  • Chain-of-Thought (CoT) integration: force and tactile feedback are used as more than policy inputs; they serve as key cues for adaptive reasoning and re-planning.
  • How it works:
    • The VLM's pretrained decoder generates an explicit internal monologue. This lets the model reason about the cause of a failure (e.g. an unexpected slip) and formulate a corrective action.
    • Training: the model is fine-tuned on a small, specialized dataset that captures failure events (e.g. slipping while wiping a board) and pairs the multimodal sensory streams with language annotations analyzing the cause of the failure. This preserves the VLM's general reasoning ability and extends it to the tactile modality, so that the model infers physical phenomena (e.g. insufficient downward pressure, tool slippage) from sensor signals.
    • At execution time: CoT reasoning is triggered at a fixed interval. The model first judges whether the task has succeeded and, if it judges a failure, analyzes the root cause using tactile feedback.
    • Generating a corrective instruction: the reasoning output explicitly analyzes the individual force components (e.g. "grasping force is sufficient, but normal force is too low") and generates a new corrective instruction to guide the next attempt (e.g. "wipe the board again, but apply more downward force").

4. Data Collection:

  • To obtain accurate, semantically aligned tactile data, the authors built a dedicated data collection setup based on the Universal Manipulation Interface (UMI).
  • High-resolution tactile sensors (able to sense normal and shear forces) were added to the UMI gripper so that the operator directly feels the contact dynamics and can provide demonstrations explicitly guided by force.
  • For temporal synchronization, the timestamps of all data streams were aligned. Tactile feedback was captured at 100Hz and visual data at 20Hz, and the high-frequency tactile signal was downsampled to match the visual frames.

Experiments were conducted on three contact-rich manipulation tasks: USB/charger insertion and extraction, tabletop object grasping, and board wiping. Compared with existing VLA models, Tactile-VLA generalized better in understanding tactile-related language, applying force with common sense, and adaptive reasoning based on tactile feedback. In particular, it activated the VLM's latent physical knowledge with only a small number of tactile demonstrations and succeeded in zero-shot scenarios.

🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

A Robot That Modulates "Force" with Language

To start with the conclusion, the paper's central claim compresses into one sentence. A VLM (Vision-Language Model) already knows the physical difference between "gently" and "hard". What was missing was a bridge connecting that knowledge to a tactile sensor.

Tactile-VLA builds that bridge. It deeply fuses touch (Tactile) as a fourth modality alongside vision, language, and action (VLA), so that commands with the same trajectory but different force, such as "insert the USB gently" and "insert the USB hard", are distinguished and executed. And it does so from only a few dozen demonstrations, generalizing zero-shot to tasks and objects it was never trained on.

The reason this paper is interesting to roboticists is simple. Until now we have treated force control and semantic reasoning as separate worlds. Impedance control belonged to control theory, and common-sense reasoning belonged to LLMs. Tactile-VLA ties the two together inside a single end-to-end pipeline, and the way it ties them is clever.

Paper: Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization (Huang et al., Tsinghua/SJTU, arXiv:2507.09160, 2025)


Introduction: Why VLA Needs "Touch"

What VLAs Do Well and What They Do Not

What VLA models such as RT-1, RT-2, Octo, OpenVLA, and π0 have shown over the past few years is clear. Borrow a large vision-language backbone, and a robot can interpret abstract commands ("pick up the apple") and generalize reasonably well to scenes it has never seen.

These models share a weakness, however. They know what to do, but not how to interact physically. This is especially true in contact-rich tasks.

Consider an analogy. A vision-only robot is like a person who has taken off the blindfold but is still wearing thick gloves. They can see where the cup is, but without feeling in the fingertips they cannot know how hard to squeeze it without breaking it. People do not pick up a heavy steel ball and a ripe dragon fruit (pitaya) with the same hand motion. A signal rises instantly from the fingertips ("ah, this one is squishy") and we unconsciously ease off. That immediate, local, time-varying feedback is touch.

Limits of Prior Approaches: Touch Treated as an Afterthought

There have been earlier attempts to bring touch into robot policies (FuSe, ForceVLA, and others). The paper points out, however, that most prior work treated touch only as an auxiliary perceptual modality. Tactile information was slotted in somewhere among the inputs, but it did not directly take part in the process by which the policy actually generates actions.

Why is this a problem? The policy reasons richly about "what to do" from vision and language, yet at the very stage where that decision lands as "with how much force", touch drops out of the decision-making. As a result the policy's output remains position-centric, and force becomes a byproduct that merely follows along.

The Paper's Key Insight

This is where the authors' most provocative hypothesis appears.

The latent space of a VLM already holds a rich semantic understanding of physical interaction. All we need to do is "connect" it to a tactile sensor and wake it up.

In other words, there is no need to teach the model from scratch that the word "gently" corresponds to a small force (0.5N) and "hard" to a large one. That association already lies dormant inside a language model pretrained on internet text. A small number of demonstrations is enough to draw that dormant knowledge out into the action output. This is the paper's explanation of why zero-shot generalization is possible.

The Contribution Seen as Three Capabilities

The paper summarizes its framework as unlocking three capabilities. These three map one-to-one onto the three experimental questions (RQ1 to RQ3).

| Capability | What it means | Example |
| --- | --- | --- |
| (a) Tactile-Aware Instruction Following | Learn the meaning of force-related adverbs ("gently", "hard") in one task and transfer it to another | "softly" learned on USB is applied zero-shot to the charger task |
| (b) Tactile-Relevant Common Sense | Use common sense about object properties to choose a suitable grip force without an explicit force command | A never-seen dragon fruit is automatically gripped gently, a steel ball tightly |
| (c) Adaptive Tactile-Involved Reasoning | Diagnose failures from tactile feedback and revise the strategy on its own | After failing to wipe the board, reason, increase the force, and retry |

The core contributions come down to three. First, an architecture that deeply fuses touch as a native modality of the VLA (Tactile-VLA). Second, a variant that interprets real-time force feedback with CoT (Chain-of-Thought) and re-plans adaptively after failure (Tactile-VLA-CoT). Third, empirical evidence of generalization beyond standard VLA baselines in zero-shot, cross-object, and force-sensitive settings.


Method: How to Fit Four Modalities into One Bowl

Let us first look at the overall structure as a diagram. Inputs arrive in four streams (vision, language, touch, proprioception), are fused in the pretrained VLM, the action expert then emits position and force together, and the hybrid controller translates them into actual joint motion.

flowchart TD
    A1["Image<br/>(last H frames)"] -->|ViT Encoder| F[Token Prefix S_t]
    A2["Language<br/>(instruction)"] -->|Tokenizer| F
    A3["Tactile Signal<br/>(normal + shear)"] -->|MLP Encoder| F
    A4["Proprioceptive<br/>State"] -->|Encoder| F
    F --> VLM["Pretrained VLM<br/>Gemma 2.6B<br/>(non-causal attention)"]
    VLM --> AE["Tactile-Aware<br/>Action Expert (300M)"]
    AE --> PT["Target Position<br/>P_target"]
    AE --> FT["Target Force<br/>F_target"]
    PT --> HC["Hybrid Position-Force<br/>Controller"]
    FT --> HC
    HC --> R["Robot Joints<br/>(PID actuation)"]
    VLM -.->|periodic trigger| COT["Tactile-VLA-CoT<br/>(reasoning + replan)"]
    COT -.->|new instruction| A2

1) Policy Architecture: Mix at the Token Level

The most important design decision is token-level fusion. Each modality is encoded separately and then mixed together inside the input prefix of the transformer backbone.

Why does this matter? If tactile information were attached "shallowly", for example by adding it to the position output at the last stage, the model could not reason about touch and language jointly. Only mixing at the token level makes cross-modal reasoning possible, such as "this board has chalk on it, so friction is high, so I should press harder". This deep fusion is essential in particular for the CoT variant described later.

The per-modality encoders are as follows.

  • Vision: a pretrained ViT encoder (same approach as π0). Each of the most recent H frames is encoded into a set of tokens.
  • Tactile: a simple MLP encoder. The history of H tactile measurements is concatenated and compressed into a single fused token. This token carries the temporal dynamics of contact.
  • Language: a standard language tokenizer.

The resulting tokens are concatenated into the unified input prefix sequence S_t.

S_t = \left[ E'_{vis}(I_{t-H+1}), \dots, E'_{vis}(I_t),\; E_{lang}(L_t),\; E'_{\psi}([T_{t-H+1}, \dots, T_t]) \right]

The key point is that non-causal (bidirectional) attention is applied over the prefix. With the causal mask removed, vision, language, and tactile tokens freely cross-attend to one another. Intuitively, the language token "chalk" and the tactile token meaning "friction is high" look at each other and form a single integrated representation.
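To make the prefix construction concrete, here is a minimal sketch in plain Python. The encoders are stand-ins that return string labels instead of embeddings, and the names `build_prefix` and `attention_mask` are illustrative assumptions, not the paper's code. It only illustrates two points from the text: the tactile history collapses into a single token, and the prefix mask is fully bidirectional rather than causal.

```python
from typing import List


def build_prefix(frames: List[str], instruction: str,
                 tactile_history: List[float],
                 tokens_per_frame: int = 2) -> List[str]:
    """Concatenate per-modality tokens in the order of S_t: vision, language, tactile."""
    # Stand-in for the ViT encoder: each frame yields its own small token set.
    vision = [f"vis[{f}][{k}]" for f in frames for k in range(tokens_per_frame)]
    # Stand-in for the language tokenizer: one token per whitespace-separated word.
    language = [f"lang[{w}]" for w in instruction.split()]
    # Stand-in for the MLP encoder: the whole history becomes ONE fused token.
    tactile = [f"tac[{len(tactile_history)} readings]"]
    return vision + language + tactile


def attention_mask(n: int, causal: bool) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]


prefix = build_prefix(["I_t-1", "I_t"], "insert the USB softly", [0.1, 0.2])
mask = attention_mask(len(prefix), causal=False)
tactile_idx = len(prefix) - 1
# With a non-causal mask the first vision token can see the last (tactile) token.
print(len(prefix), mask[0][tactile_idx])
```

Under a causal mask the same entry would be `False`, which is exactly the restriction the prefix design removes.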

This rich representation is handed to the tactile-aware action expert (300M parameters). This is the decisive difference from an ordinary VLA. The action vector a_t produced by the action expert explicitly contains not only the target position P_{target} but also the target contact force F_{target}.

Standard VLA action:   a_t = [ P_target ]
Tactile-VLA action:    a_t = [ P_target , F_target ]

Force has been placed directly into the action space, and this is the structural key that lets "language modulate the intensity of an action".

2) Learning: Fitting Position and Force Together with Flow Matching

Training is done with imitation learning. Shared components are initialized from π0's pretrained parameters, the newly added modules (the tactile encoder and the modified action expert) are randomly initialized, and the whole model is then fine-tuned end-to-end.

The objective is Conditional Flow Matching (CFM). For readers meeting flow matching for the first time, here is the intuition.

Where a diffusion model learns "the path to the data by peeling away noise a little at a time", flow matching learns "the velocity field of a nearly straight flow from the noise distribution to the data distribution". As an analogy, imagine a leaf dropped at some point on a river: the model is trained to predict, at every point, the arrow saying in which direction and how fast the leaf must drift to reach its destination (the correct action). At inference time, actions are generated by integrating along these arrows. Because it can produce smooth continuous actions in fewer steps than diffusion, it suits robot control well.

One detail matters here. The loss penalizes deviations of the predicted action sequence in both the kinematic dimensions (position) and the force dimensions. The model is therefore trained to match the ground truth in force as well as in position. It is this mechanism that pushes the model to draw out the VLM's latent physical knowledge and form a direct mapping between the linguistic nuance "gently" and a physical force magnitude of 0.5N.
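The idea of a loss that covers both position and force can be sketched as follows. This is a toy illustration in plain Python, assuming the common straight-line (rectified-flow) convention x_t = (1 - t) * noise + t * action with target velocity action - noise; the paper and π0 may use a different sign or time convention. The function names, the 4-dimensional action layout (x, y, z, force), and the hand-written "oracle" velocity fields are all assumptions for illustration, not the authors' implementation.

```python
from typing import Callable, List

Vec = List[float]


def cfm_loss(predict_velocity: Callable[[Vec, float], Vec],
             noise: Vec, action: Vec, t: float) -> float:
    """Mean squared error between predicted and target velocity over ALL dims."""
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target = [a - n for n, a in zip(noise, action)]
    pred = predict_velocity(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)


def euler_sample(predict_velocity: Callable[[Vec, float], Vec],
                 noise: Vec, steps: int = 10) -> Vec:
    """Generate an action by integrating the velocity field from t=0 to t=1."""
    x, dt = list(noise), 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy demonstration: position (x, y, z in metres) plus a contact force in newtons.
action = [0.30, 0.10, 0.05, 0.5]
noise = [0.5, -0.5, 0.25, 1.0]


def oracle(x: Vec, t: float) -> Vec:
    """Ideal velocity field for the straight path that ends at `action`."""
    return [(a - xi) / (1.0 - t) for a, xi in zip(action, x)]


def position_only(x: Vec, t: float) -> Vec:
    """Same as the oracle, but it ignores the force dimension entirely."""
    return oracle(x, t)[:-1] + [0.0]


loss_full = cfm_loss(oracle, noise, action, t=0.5)
loss_pos_only = cfm_loss(position_only, noise, action, t=0.5)
sample = euler_sample(oracle, noise)
print(loss_full, loss_pos_only, sample)
```

A predictor that gets position right but ignores force is still penalized, which is the property the text attributes to the training objective.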

3) Hybrid Position-Force Controller: Reconciling Two Objectives

The job is not finished once the action expert has supplied a target position and a target force. The low-level controller has to reconcile the two. The problem is that position control and force control are fundamentally in conflict. If you try to hold your hand exactly "at this position" against a rigid wall while also pressing with exactly "this force", a position error of just 1mm makes the force blow up.

The authors' strategy is position-dominant. It follows the classic insight (Raibert & Craig, 1981) that most manipulation is governed by precise kinematic motion and that force control is needed only at moments of contact. Everything is therefore ultimately reduced to a position command.

The force objective is folded in through indirect force control inspired by the principle of impedance control (Hogan, 1985). The force error is translated into a correction to the position command.

\Delta F = F_{target} - F_{measured}

P_{hybrid} = \begin{cases} P_{target} + K \cdot \Delta F & \text{if } \lVert \Delta F \rVert > \tau \\[4pt] P_{target} & \text{if } \lVert \Delta F \rVert \le \tau \end{cases}

Here K is the gain matrix and \tau is the threshold. A PID controller then drives the joints toward this dynamically updated P_{hybrid}.

The intuition behind the formula is this: "if you are pressing less than the desired force (ΔF is large and positive), push the target position further toward the contact surface." It is like pushing a wall with your hand and, when the pressure is not enough, making as if to send your hand further into the wall. Physically the hand cannot go in, so the force rises by that much. When the error is at or below the threshold \tau, however, the correction is switched off (a dead-band), which suppresses small jitter and keeps the motion smooth.

Whereas classical impedance control aims at passive compliance (yielding softly when pushed), the aim here is active tracking of a target force.
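The dead-band rule above is small enough to write out directly. The following is a minimal sketch in plain Python, assuming a diagonal gain matrix given as a per-axis list and an axis convention in which a positive force error moves the target toward the contact surface; the function name, gains, and threshold values are hypothetical and chosen only for illustration.

```python
import math
from typing import List


def hybrid_position(p_target: List[float], f_target: List[float],
                    f_measured: List[float], gains: List[float],
                    tau: float) -> List[float]:
    """Return P_hybrid: P_target plus K * dF when ||dF|| > tau, else P_target."""
    delta_f = [ft - fm for ft, fm in zip(f_target, f_measured)]
    error_norm = math.sqrt(sum(d * d for d in delta_f))
    if error_norm <= tau:
        # Inside the dead-band: leave the position command untouched.
        return list(p_target)
    # Outside the dead-band: shift the position command along the force error.
    return [p + k * d for p, k, d in zip(p_target, gains, delta_f)]


p_target = [0.30, 0.00, 0.10]   # metres
f_target = [0.0, 0.0, 2.0]      # newtons
gains = [0.001, 0.001, 0.001]   # metres per newton (diagonal of K, assumed)
tau = 0.2                       # newtons (assumed)

pressing_too_lightly = hybrid_position(p_target, f_target, [0.0, 0.0, 0.5], gains, tau)
close_enough = hybrid_position(p_target, f_target, [0.0, 0.0, 1.9], gains, tau)
print(pressing_too_lightly, close_enough)
```

In a real system the returned position would be passed to the joint-level PID loop; that part is omitted here.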

One more clever separation: the controller splits the two force components into independent channels.

  • Gripper Cartesian position → regulates the net external force applied to the object
  • Gripper width → regulates the internal grasping force, that is, how tightly it squeezes

Thanks to this separation, a motion such as "gripping a dragon fruit gently (keeping the width wide) while lifting it (regulating the external force through position)" becomes possible without conflict.

4) Tactile-VLA-CoT: Thinking with the Fingertips

This is the most appealing part of the paper. If the core architecture controls force precisely, the CoT variant adds reasoning on top of it.

The idea is as follows. A VLM already knows how to generate text with its decoder. That ability is carried over as is, so that the robot generates an internal monologue. It diagnoses the cause of a failure by itself ("it slipped") and formulates a corrective action ("I need more shear force").

Let us look at the process as a diagram.

flowchart TD
    S["Execute action<br/>(default force)"] --> Q1{"Q: Task done?<br/>(periodic check)"}
    Q1 -->|Yes| DONE["Task complete"]
    Q1 -->|"No, still marks remain"| ANALYZE["Analyze tactile feedback<br/>(normal force / shear force)"]
    ANALYZE --> REASON["CoT reasoning:<br/>'grasping force OK,<br/>but normal force too low'"]
    REASON --> NEWCMD["Generate corrective instruction:<br/>'wipe again, more downward force'"]
    NEWCMD --> S

The training recipe is smart. It uses a small, targeted demonstration dataset. Each sample pairs the multimodal sensor stream of a specific failure event (e.g. slipping while wiping the board) with a language annotation that analyzes the cause of that failure. Example: "The force was too weak. A stronger force is needed. Now trying with 5N."

This training aims at two things at once.

  1. Preventing catastrophic forgetting: the VLM's native general reasoning ability is preserved.
  2. Extending reasoning to the tactile modality: the model is taught to infer physical phenomena from sensor signals, for instance that downward pressure is insufficient while wiping, or that the tool is slipping as indicated by the shear force signal.

Reasoning is triggered at a fixed interval, a simple but effective scheme. The prompt structure first has the model judge "did the task succeed?", then, on failure, analyze the cause from sensor feedback ("grasping force is sufficient, but normal force is too low"), and finally generate a new corrective instruction ("wipe the board again, but apply more downward force").

In pseudocode:

PROCEDURE TactileVLA_CoT_Step:
    # N_check is the fixed reasoning interval (distinct from the controller gain K)
    every N_check timesteps:
        success <- VLM_decode("Has the task been done?", sensory_context)
        IF success == False:
            cause <- VLM_decode("Analyze failure using force feedback", sensory_context)
            # e.g. "grasping force is sufficient, but normal force is too low"
            new_instruction <- VLM_decode("Formulate corrective command", cause)
            # e.g. "wipe the board again, but apply more downward force"
            current_instruction <- new_instruction
    action <- Policy(prefix(image, current_instruction, tactile, state))
    # The controller compares F_target with the measured force to correct P_target
    execute(HybridController(action.P_target, action.F_target, tactile.F_measured))
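For readers who prefer something executable, the same loop can be sketched in Python with toy stand-ins. Everything here is hypothetical: the `policy`, `execute`, and `reason` callables replace the VLM and the robot, the 4N erase threshold is invented, and the function name `cot_control_loop` is not from the paper. The sketch only shows the control flow of periodic checking, diagnosis, and instruction replacement.

```python
from typing import Callable, Tuple


def cot_control_loop(policy: Callable[[str], float],
                     execute: Callable[[float], bool],
                     reason: Callable[[str, float], str],
                     instruction: str,
                     check_every: int = 5,
                     max_steps: int = 20) -> Tuple[bool, str, int]:
    """Run the policy, and every `check_every` steps judge success and replan."""
    for step in range(1, max_steps + 1):
        force = policy(instruction)      # policy output: target downward force (N)
        erased = execute(force)          # act and observe the outcome
        if step % check_every == 0:      # periodic CoT trigger
            if erased:
                return True, instruction, step
            # Failure: diagnose and replace the current instruction.
            instruction = reason(instruction, force)
    return False, instruction, max_steps


CORRECTION = "wipe the board again, but apply more downward force"


def toy_policy(instruction: str) -> float:
    # Stand-in for the VLA: the corrective wording raises the target force.
    return 5.0 if "more downward force" in instruction else 2.0


def toy_execute(force: float) -> bool:
    # Stand-in for the world: the marks come off only at 4 N or more (assumed).
    return force >= 4.0


def toy_reason(instruction: str, force: float) -> str:
    # Stand-in for the CoT output after diagnosing "normal force is too low".
    return CORRECTION


result = cot_control_loop(toy_policy, toy_execute, toy_reason, "wipe the board")
print(result)
```

With these stubs the first check fails at 2N, the instruction is replaced, and the second check succeeds at 5N.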

5) Data Collection: A Device with Fingertip Sensing in Place of a Hand

Good models come from good data. There is a subtle trap here, though. When demonstrations are collected through ordinary teleoperation, the human operator cannot directly feel force feedback. A policy trained on data gathered that way will, by its nature, not come to rely on touch, which runs against the training goal itself.

The authors solve this through the design of the data collection device. Building on UMI (Universal Manipulation Interface), a portable handheld device, they added to the gripper two high-resolution tactile sensors that capture both normal force and shear force. The operator can then feel the contact dynamics directly and provide demonstrations that are explicitly guided by force.

Data collection rig:
  - GoPro camera (visual)
  - Dual high-res tactile sensors (normal + shear)
  - 3D-printed gripper (UMI-based)

Sampling:
  - Tactile: 100 Hz  -->  downsampled to match
  - Visual:  20 Hz
  - Timestamps aligned per session

Temporal synchronization was also handled with care. For each session the timestamps of all data streams are aligned, and the 100Hz tactile signal is downsampled to match the 20Hz visual frames. The result is the VLA-T dataset, in which vision, language, tactile, and action are precisely synchronized.
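One plausible way to perform this alignment is nearest-timestamp matching, sketched below in plain Python. The review does not say which downsampling rule the authors used, so nearest-neighbour selection, the function names, and the synthetic millisecond timestamps are all assumptions for illustration.

```python
from bisect import bisect_left
from typing import List, Sequence


def nearest_index(sorted_ts: Sequence[int], t: int) -> int:
    """Index of the timestamp in `sorted_ts` closest to `t` (ties go to the earlier one)."""
    pos = bisect_left(sorted_ts, t)
    if pos == 0:
        return 0
    if pos == len(sorted_ts):
        return len(sorted_ts) - 1
    before, after = sorted_ts[pos - 1], sorted_ts[pos]
    return pos if (after - t) < (t - before) else pos - 1


def downsample_to_frames(tactile_ts: Sequence[int], tactile_vals: Sequence[int],
                         frame_ts: Sequence[int]) -> List[int]:
    """Pick, for every visual frame, the tactile reading nearest in time."""
    return [tactile_vals[nearest_index(tactile_ts, t)] for t in frame_ts]


# One second of synthetic data, timestamps in milliseconds.
tactile_ts = list(range(0, 1000, 10))        # 100 Hz tactile stream
tactile_vals = list(range(len(tactile_ts)))  # reading ids stand in for force values
frame_ts = list(range(3, 1000, 50))          # 20 Hz camera, offset by 3 ms

aligned = downsample_to_frames(tactile_ts, tactile_vals, frame_ts)
print(len(aligned), aligned[:4])
```

Integer millisecond timestamps keep the comparison exact; with floating-point seconds the same logic applies but equality ties become less predictable.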


Experiments: Answering Three Questions

The experiments are organized around three research questions (RQs). The tasks are three contact-rich scenarios.

  • Charger/USB insertion and extraction: pull out a USB plug or charger and insert it into the correct socket
  • Tabletop Grasping: judge in advance whether an object is heavy or fragile and grasp it with appropriate force
  • Wiping the Board: wipe the board, evaluate the result, and adjust the force if needed

The baselines are π0-base (a general-purpose VLA flow model) and its variant π0-fast. Neither has a tactile fusion architecture.

RQ1: Can It Generalize Force-Related Language?

The design is clever. The model is trained on the USB task (Task A) to associate "softly" and "hard" with specific force profiles. It is then transferred to the charger task (Task B), for which only the motion is taught and no force-related language is given at all. This tests genuine semantic grounding: whether language directly modulates physical interaction in a zero-shot context.

First, the success rates.

Table 1. USB/Charger insertion and extraction success rate (%)

| Model | USB (%) | Charger (%) |
| --- | --- | --- |
| π0-base | 5 | 40 |
| π0-fast | 0 | 25 |
| Tactile-VLA | 35 | 90 |

The success rates alone show a large gap. Precise insertion fails easily from misalignment or excessive force, and the deep fusion of tactile feedback enables more precise and adaptive control during the contact phase.

The real evidence, however, lies in the applied force.

Table 2. Applied force by instruction word (N)

| Model | 'softly' (USB, trained) | 'hard' (USB, trained) | 'gently' (generalized) | 'firmly' (generalized) | 'harder' (extrapolated) | 'softly' (Charger, zero-shot) | 'hard' (Charger, zero-shot) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| π0 | 2.41 | 2.68 | 2.35 | 2.72 | 2.29 | 6.61 | 5.69 |
| π0-fast | 2.61 | 2.33 | 2.79 | 2.45 | 2.58 | 7.37 | 6.42 |
| Tactile-VLA | 0.51 | 2.57 | 0.75 | 1.98 | 2.94 | 4.68 | 9.13 |

Let us read this table slowly. It packs the trained reference point and three levels of generalization into a single table.

  1. Trained words: "softly"=0.51N, "hard"=2.57N. The model separates them clearly, as trained.
  2. Generalized words: the untrained "gently"=0.75N and "firmly"=1.98N. The model infers forces that sit semantically in between. It has understood the intensity spectrum of the adverbs.
  3. Extrapolation beyond the training range: "harder"=2.94N. The model applies more force than the trained "hard" (2.57N). It extrapolates even the meaning of the comparative.
  4. Zero-shot task transfer: on the charger task, "hard"=9.13N and "softly"=4.68N. The force-language mapping learned on USB transferred to the never-seen charger task.

The baselines π0 and π0-fast, by contrast, apply almost the same force whatever the adverb (roughly 2.3 to 2.8N). With no mechanism to ground language in physical force, there is no correlation at all between the instruction and the applied force. On the charger task their values jump unevenly to roughly 6 to 7N, with "softly" even exceeding "hard", which looks less like semantic understanding and more like plain control instability.

The implication is clear. That a single adverb maps smoothly onto a continuous force value, and that the mapping transfers across tasks, is evidence that the model did not memorize words but learned a generalizable cross-modal understanding of language and force.
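The reading of Table 2 can be checked with a few lines of Python. The force values below are copied from the USB columns of the table; the ordering of the adverbs from weakest to strongest is an assumption made for this check, and the helper names are illustrative.

```python
from typing import Dict, List

# USB-task forces in newtons, copied from Table 2.
forces_usb: Dict[str, Dict[str, float]] = {
    "pi0": {"softly": 2.41, "gently": 2.35, "firmly": 2.72, "hard": 2.68, "harder": 2.29},
    "pi0-fast": {"softly": 2.61, "gently": 2.79, "firmly": 2.45, "hard": 2.33, "harder": 2.58},
    "Tactile-VLA": {"softly": 0.51, "gently": 0.75, "firmly": 1.98, "hard": 2.57, "harder": 2.94},
}

# Assumed semantic ordering of the adverbs, weakest first.
INTENSITY_ORDER: List[str] = ["softly", "gently", "firmly", "hard", "harder"]


def is_monotonic(forces: Dict[str, float]) -> bool:
    """True when the applied force strictly increases along INTENSITY_ORDER."""
    values = [forces[word] for word in INTENSITY_ORDER]
    return all(a < b for a, b in zip(values, values[1:]))


def spread(forces: Dict[str, float]) -> float:
    """Difference between the largest and smallest applied force, in newtons."""
    return round(max(forces.values()) - min(forces.values()), 2)


for model, forces in forces_usb.items():
    print(model, is_monotonic(forces), spread(forces))
```

Only Tactile-VLA orders its forces consistently with the adverbs, and its force range is several times wider than either baseline's.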

RQ2: Does it infer an appropriate force for unseen objects?

This time there is no explicit force command. The test is whether the model has the common sense to choose a force from the object's properties alone. Objects fall into three categories.

  • Solid & Heavy: needs a firm grip
  • Solid & Light: needs a medium force
  • Fragile & Light: needs a gentle grip (without deformation)

Evaluation uses ID (in-domain, seen during training) and OOD (out-of-domain, never seen) objects. Success means lifting the object in one attempt without deforming it.

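Table 3. Grasp success rate per object (%, over 10 trials)

| Category | Object | ID/OOD | π0-base | π0-fast | Tactile-VLA |
|---|---|---|---|---|---|
| Solid & Heavy | Iron cube | ID | 100 | 70 | 100 |
| Solid & Heavy | Battery | OOD | 80 | 60 | 90 |
| Solid & Heavy | Nail | ID | 30 | 10 | 100 |
| Solid & Heavy | Steel Ball | OOD | 60 | 70 | 90 |
| Solid & Light | Wood block | ID | 60 | 70 | 90 |
| Solid & Light | Charger | OOD | 70 | 50 | 100 |
| Solid & Light | Plastic | ID | 40 | 30 | 80 |
| Solid & Light | Toy | OOD | 30 | 40 | 90 |
| Fragile & Light | Pitaya | ID | 50 | 40 | 90 |
| Fragile & Light | Melon | OOD | 0 | 10 | 80 |
| Fragile & Light | BlueBerry | OOD | 0 | 0 | 100 |
| Fragile & Light | PaperBox | OOD | 0 | 0 | 90 |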

The most dramatic rows are at the bottom. On the fragile objects (blueberry, paper box) both baselines score 0%: they fail every attempt to lift the object without deforming it. Tactile-VLA scores 90 to 100% on the same objects.
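To see how much of this holds across the whole table rather than in two rows, the per-object rates can be averaged by split. The sketch below is an illustrative aggregation, not a figure reported in the paper: the rows are copied from Table 3, the means are unweighted over objects (5 ID and 7 OOD), and `mean_success` is a hypothetical helper.

```python
# (category, object, split, pi0-base, pi0-fast, Tactile-VLA); success in %, from Table 3.
TABLE3 = [
    ("Solid & Heavy",   "Iron cube",  "ID",  100, 70, 100),
    ("Solid & Heavy",   "Battery",    "OOD",  80, 60,  90),
    ("Solid & Heavy",   "Nail",       "ID",   30, 10, 100),
    ("Solid & Heavy",   "Steel Ball", "OOD",  60, 70,  90),
    ("Solid & Light",   "Wood block", "ID",   60, 70,  90),
    ("Solid & Light",   "Charger",    "OOD",  70, 50, 100),
    ("Solid & Light",   "Plastic",    "ID",   40, 30,  80),
    ("Solid & Light",   "Toy",        "OOD",  30, 40,  90),
    ("Fragile & Light", "Pitaya",     "ID",   50, 40,  90),
    ("Fragile & Light", "Melon",      "OOD",   0, 10,  80),
    ("Fragile & Light", "BlueBerry",  "OOD",   0,  0, 100),
    ("Fragile & Light", "PaperBox",   "OOD",   0,  0,  90),
]
MODELS = ["pi0-base", "pi0-fast", "Tactile-VLA"]


def mean_success(rows, split):
    """Unweighted mean success rate (%) per model over rows in the given split."""
    picked = [r for r in rows if r[2] == split]
    return {
        model: sum(r[3 + k] for r in picked) / len(picked)
        for k, model in enumerate(MODELS)
    }


id_mean = mean_success(TABLE3, "ID")
ood_mean = mean_success(TABLE3, "OOD")

for model in MODELS:
    gap = id_mean[model] - ood_mean[model]
    print(f"{model:12s} ID={id_mean[model]:.1f}%  OOD={ood_mean[model]:.1f}%  gap={gap:+.1f}")
```

The ID and OOD splits contain different objects, so the gap is not a controlled comparison. It still gives a rough sense of whether a model's performance falls off on unseen objects.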

Figure 6 in the paper (a bar chart) visualizes this story. Described in text, the figure looks like this.

Figure 6 description: three bar-chart panels, one each for the Solid & Heavy, Solid & Light, and Fragile & Light categories. The x-axis lists individual objects and the y-axis shows the applied grasp force (N, range 0 to 7). ID and OOD objects are drawn in different colors, with error bars. The key pattern is that Tactile-VLA shows tall bars (strong force) for heavy objects and short bars (weak force) for fragile ones, and the trend holds unchanged on unseen OOD objects.

In other words, the model has not overfit to the training data. It transfers the VLM's prior knowledge ("dragon fruit (pitaya) is soft", "a steel ball is hard") into the tactile modality and infers an appropriate force even for objects it has never seen.

RQ3: Does it overcome failure through reasoning?

The last experiment tests what CoT is really worth. A model that learned to reason about wiping marker off a whiteboard is evaluated on zero-shot generalization to a physically very different blackboard with chalk. Chalk needs far more force than marker.

The training data is a mix of successful and failed demonstrations collected on the whiteboard. Failure cases (force too weak to erase) come with supervision text that spells out the corrective thought process ("The force was too weak. A stronger force is needed. Now try 5 N."). At evaluation time the only instruction is "wipe the board".

Table 4. Success rate in ID/OOD scenarios (%)

| Model | In-Domain (Whiteboard) | Out-of-Domain (Blackboard) |
|---|---|---|
| π0-base | 40 | 0 |
| π0-fast | 45 | 0 |
| Tactile-VLA | 80 | 15 |
| Tactile-VLA-CoT | 75 | 80 |

There are two things to read here.

First, Tactile-VLA (no reasoning) nearly fails on OOD at 15%. It does well on the whiteboard (ID) at 80%, but it does not know the right force on a blackboard it has never seen. Precise force control alone does not generalize to a new scenario.

Second, Tactile-VLA-CoT jumps to 80% on OOD. On the whiteboard (ID) it is slightly lower at 75% (a small cost of reasoning overhead), but on the blackboard it is far ahead.

The mechanism becomes convincing with concrete numbers. On the blackboard the model first wipes with its default force of 3.5 N, and fails. The CoT module recognizes the lack of progress from tactile feedback, generates a reasoning chain, and concludes that "a larger force is needed". It then raises the force to 6.7 N on its own, which is 34% above the 5 N in the whiteboard training data. With this adaptation it erases the chalk successfully.
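The retry pattern described above can be mimicked with a toy loop. This is a minimal sketch under loud assumptions and not the paper's implementation: the success test is reduced to a force threshold, `CHALK_REQUIRED_N = 6.0` is a made-up value (the review does not state the force chalk actually needs), and `propose_force` is a hypothetical stand-in for the CoT step. Only 3.5 N, 5 N, and 6.7 N come from the text.

```python
def relative_increase_pct(new_value, reference):
    """Percentage by which new_value exceeds reference."""
    return (new_value - reference) / reference * 100.0


def wipe_with_replanning(required_force, initial_force, propose_force, max_checks=5):
    """Toy periodic-check loop: wipe, check progress, re-plan the force on failure.

    required_force: hypothetical threshold standing in for "tactile feedback shows progress".
    propose_force:  hypothetical stand-in for the reasoning step; maps old force to new force.
    Returns (success, list of forces tried).
    """
    force = initial_force
    attempts = [force]
    for _ in range(max_checks):
        if force >= required_force:
            return True, attempts
        force = propose_force(force)
        attempts.append(force)
    return force >= required_force, attempts


CHALK_REQUIRED_N = 6.0  # ASSUMPTION for illustration only

# With a reasoning step that proposes the 6.7 N reported in the text.
ok, attempts = wipe_with_replanning(CHALK_REQUIRED_N, 3.5, lambda f: 6.7)

# Without reasoning: the same low force is repeated at every check.
no_reasoning_ok, _ = wipe_with_replanning(CHALK_REQUIRED_N, 3.5, lambda f: f)

print(ok, attempts, no_reasoning_ok)
print(round(relative_increase_pct(6.7, 5.0)))  # checks the "34% above 5 N" arithmetic
```

The second call illustrates why a policy with no way to revise its force target keeps failing in this toy setting: nothing in the loop ever changes the force it tries.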

Figure 7 description: three panels (a), (b), (c). (a) The policy has learned to wipe marker off the whiteboard. (b) In zero-shot transfer to the blackboard, the initial policy fails (too little force for chalk). (c) After reasoning over the physical feedback from the failure, the model increases the force and wipes successfully.

The baselines imitate the wiping motion, but they have no mechanism to interpret a tactile failure and raise the force, so they keep repeating the same low-force motion. This contrast shows clearly what tactile-centered reasoning contributes in contact-rich tasks.


Critical discussion: what is strong and what is thin

Strengths

1. The hypothesis is clear and the experiments aim squarely at it. "VLMs already know physics" is a hard claim to test, but the authors decompose it into three clean generalization axes (language transfer, object transfer, reasoning transfer) and turn each into a measurable experiment. The adverb spectrum in Table 2 (softly→gently→firmly→hard→harder) in particular is close to textbook evidence of semantic grounding.

2. Force is placed directly in the action space. Unlike many tactile-integration works that keep touch on the input side only, the key differentiator is that $F_{target}$ is produced as an output. This is the structural source of the ability to modulate force through language.

3. Awareness of the data-collection problem. The observation that "teleoperation produces tactile-agnostic policies because the operator cannot feel force" is sharp. Solving it with a handheld device fitted with tactile sensors is practical and reproducible. Eliciting generalization from few demonstrations (50 to 200 per task) is also impressive.

4. Self-correction in the CoT variant. The autonomous increase from 3.5 N to 6.7 N is concrete evidence that the reasoning is not decoration but actually changes control behavior.

Weaknesses and limitations

1. Absolute success rates are still low. USB insertion at 35% and the OOD blackboard at 80% are large improvements over the baselines, but they are far from deployment grade. The 35% on USB in particular shows how hard precision insertion remains. The paper focuses on relative advantage, but readers should look at the absolute numbers too.

2. The scale is small. Three tasks, a limited object set, and around 10 trials each. Statistical confidence intervals may be wide (Table 3 uses 10 trials per object, Figure 6 uses 5). The evaluation diversity is thin for a claim as strong as "zero-shot generalization". Transfer to other robot platforms or other tactile sensors has not been verified.

3. The causal evidence for "the VLM knows physics" is weak. Tactile-VLA does generalize well, but the ablations do not separate whether this comes from latent knowledge in the pretrained VLM or simply from an architecture that puts force in the action space. Comparing a randomly initialized backbone against the pretrained one, for example, would have made the claim much more solid.

4. Gain tuning in the hybrid controller. $K$ and $\tau$ appear to be set by hand and will likely need retuning when the task, object, or sensor changes. The position-dominant assumption may hit limits on tasks that are truly force-dominated (for example, rubbing a soft object at constant pressure, or bimanual coordination).

5. The CoT trigger is a fixed interval. It is described as "simple and effective", but on tasks where fast failure detection matters, fixed-interval checks can hurt responsiveness. An event-based trigger (for example, detecting a force anomaly) may be a more natural fit.

6. Reasoning cost. On the ID whiteboard, Tactile-VLA-CoT (75%) scoring slightly below Tactile-VLA (80%) suggests that reasoning overhead can actually hurt on simple tasks. The paper does not address a policy for when to switch reasoning on and off.
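The point in item 2 about wide confidence intervals can be made concrete. The sketch below computes a Wilson score interval for a success rate measured over 10 trials, the trial count stated for Table 3. It is an illustration added here, not a calculation from the paper, and `wilson_interval` is a hypothetical helper with z = 1.96 assumed for an approximately 95% interval.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


for k in (8, 9, 10):
    low, high = wilson_interval(k, 10)
    print(f"{k}/10 successes -> [{low:.2f}, {high:.2f}]")
```

Under these assumptions a 90% success rate from 10 trials is compatible with a true rate anywhere from roughly 60% to 98%, so per-object differences of 10 points in Table 3 should not be over-read.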


Comparison with related work: what is new

Several concurrent efforts have tried to bring touch into VLAs. The table summarizes the key differences.

| Work | Tactile integration | Differentiator / limitation |
|---|---|---|
| FuSe (Jones et al., 2025) | Fine-tuning on heterogeneous sensors with an auxiliary loss, language grounding | Touch supports representation learning rather than intervening directly in action generation |
| ForceVLA (Yu et al., 2025) | Force-aware MoE (mixture of experts), per-modality routing | Force-aware, but centered on the routing structure |
| TLA (Hao et al., 2025) | Direct tactile-language-action mapping | Specialized for contact-rich tasks; weak claim about using VLM prior knowledge |
| 3D-ViTac, MimicTouch | End-to-end visuo-tactile policy | No language modality → limited abstract-goal reasoning and common-sense generalization |
| Reactive Diffusion Policy (Xue et al., 2025) | Slow-fast visuo-tactile policy | Separates planning from control; focuses on reactivity rather than semantic grounding |
| Tactile-VLA (this paper) | Deep token-level fusion + force included directly in the action space + CoT reasoning | Awakens the VLM's latent knowledge with few demonstrations for zero-shot force generalization |

The position the authors claim for themselves is clear. Instead of attaching touch through an auxiliary loss as in FuSe or handling it through routing as in ForceVLA, they set out to show that the VLM's latent space already holds a rich semantic understanding of physical interaction, and they connect it directly to tactile sensing so that a few demonstrations can awaken it.

They also clarify the relationship to traditional tactile policy work (Calandra for grasping, Dong for insertion, Qi for in-hand manipulation, among others). These specialized policies perform well on their own tasks, but without a language modality they are limited in generalizing to new commands, reasoning about abstract goals, and using common sense. Tactile-VLA positions itself as an attempt to combine the physical precision of such tactile policies with the semantic flexibility and world knowledge of modern VLAs.

The research lineage at a glance:

flowchart LR
    A["Classical tactile control<br/>(impedance, force control)"] --> C["Specialized tactile policies<br/>(grasping, insertion)"]
    B["VLA models<br/>(RT-2, OpenVLA, pi0)"] --> D["Tactile-integrated VLAs<br/>(FuSe, ForceVLA, TLA)"]
    C --> E["Tactile-VLA<br/>physical precision + semantic flexibility"]
    D --> E
    E --> F["Tactile-VLA-CoT<br/>+ autonomous reasoning/replanning"]


Summary and conclusion

The message this paper leaves for roboticists comes down to three threads.

First, treat force as a first-class citizen. Slotting touch into the input is qualitatively different from producing the target force $F_{target}$ as an action output. The latter is what opens up the ability to "modulate force through language". The adverb spectrum in Table 2 is the evidence.

Second, pretrained VLMs know more physics than you might expect. Common sense such as "dragon fruit is soft" or "chalk has high friction" is already embedded in internet text. Connect it to the tactile channel with a few demonstrations and zero-shot generalization follows. This view has a practical implication for practitioners struggling with data-collection cost: do not try to cover every object and task with demonstrations, but design a bridge that awakens what the VLM already knows.

Third, let the policy think with its fingertips and it recovers from failure. The autonomous adjustment from 3.5 N to 6.7 N in Tactile-VLA-CoT shows that once tactile feedback is lifted into explicit reasoning, the policy adapts to a new situation by itself.

The limitations are just as clear. Absolute success rates are still low, the evaluation scale is small, and the ablations do not isolate the causal claim that the gains are "thanks to VLM prior knowledge". The gain tuning of the hybrid controller and the fixed-interval CoT trigger both need work before real deployment.

Even so, the direction of the paper is right. To get to general-purpose robot agents, knowing "what to do" is not enough; the agent also has to know "how much force to use", tied to meaning. Tactile-VLA offers one working blueprint for that connection. Its design philosophy, treating touch as a native modality rather than an add-on and force as part of the action rather than a by-product, is a worthwhile starting point for future contact-rich manipulation research.

From the standpoint of dexterous manipulation research, the point most worth noting is the separation of force into two channels (external net force through position, internal grasp force through gripper width). Natural follow-up questions are how this separation generalizes to multi-finger hands, and whether semantic grounding gets stronger when high-resolution touch such as DIGIT/GelSight is encoded more richly than as simple MLP tokens.


References

  • Original paper: Huang, J., Wang, S., Lin, F., Hu, Y., Wen, C., Gao, Y. (2025). Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization. arXiv:2507.09160.
  • Base models: π0 (Black et al., 2024), Gemma 2.6B, ViT (Dosovitskiy et al., 2020)
  • Basis for the data-collection device: UMI (Chi et al., 2024)
  • Direct comparison targets: FuSe (Jones et al., 2025), ForceVLA (Yu et al., 2025), TLA (Hao et al., 2025)

Copyright 2026, JungYeon Lee