Curieux.JY
  • JungYeon Lee


📃 MT3 Review

rl
gpu-parallel
grasping
Learning a Thousand Tasks in a Day
Published

November 18, 2025

🔍 Ping. 🔔 Ring. ⛏️ Dig. A tiered review series: quick look, key ideas, deep dive.

  • Paper Link
  • Homepage
  • Code
  1. To improve the low data efficiency of current robot imitation learning, this work explores a method that decomposes manipulation trajectories into alignment and interaction phases and uses retrieval-based generalization.
  2. 🚀 This decomposition- and retrieval-based approach, Multi-Task Trajectory Transfer (MT3), shows that 1,000 everyday tasks can be learned within 24 hours from a single demonstration per task, a large gain in data efficiency over monolithic policy approaches.
  3. 🛠️ Extensive real-world evaluation shows that MT3 performs strongly in low-data settings, but also exposes a fundamental limitation of open-loop interaction on high-precision tasks and deformable-object manipulation.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

This paper explores two fundamental priors for improving the data efficiency of imitation learning for robot manipulation: decomposing manipulation trajectories into sequential alignment and interaction phases, and retrieval-based generalization. The authors study this decomposition systematically across 3,450 real-world rollouts, compare design choices for the alignment and interaction phases, and analyze generalization and scaling trends against the currently dominant paradigm, Behavioral Cloning (BC) with a single-phase monolithic policy.

Core Methodology

The work centers on decomposing robot manipulation trajectories into two phases, alignment and interaction:

  1. Alignment Phase: positions the robot's end-effector, or the object it is holding, relative to the target object. The final pose matters here; the particular path matters less.
  2. Interaction Phase: performs the actual object manipulation, which requires precise trajectory execution.

For each phase, the authors consider two approaches, Behavioral Cloning (BC) and retrieval-based methods. This yields four decomposition-based combinations:

  • BC-BC: BC for both alignment and interaction.
  • BC-Ret: BC alignment combined with retrieval-based interaction.
  • Ret-BC: retrieval-based alignment with BC interaction.
  • Ret-Ret (MT3, Multi-Task Trajectory Transfer): retrieval-based policies for both alignment and interaction. This method is regarded as an extension of the earlier Trajectory Transfer [21, 35] to the multi-task learning setting.

These decomposition-based methods are compared against MT-ACT+, a monolithic BC method that learns the entire trajectory.

Policy Design and Implementation

  • Behavioral Cloning (BC) implementation:
    • Architecture: a Transformer-based backbone adapted from the MT-ACT [15] architecture. It processes point cloud inputs and language descriptions, and uses variational inference [33, 34] to model the multi-modal nature of manipulation demonstrations. The inputs are a segmented point cloud and a task description.
    • Loss function: a VAE (Variational Autoencoder) objective that maximizes the log-likelihood of demonstrated action chunks. This comprises a reconstruction loss and an encoder regularization term toward a Gaussian prior

🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Learning a thousand tasks in a single day: MT3's approach

For robotics researchers, the data efficiency of imitation learning has long been an unsolved problem. Current state-of-the-art systems need hundreds to thousands of demonstrations to learn a single task, which is a major barrier to building practical general-purpose robot systems.

The paper "Learning a Thousand Tasks in a Day" (Science Robotics, 2025), from the Robot Learning Lab at Imperial College London, presents an approach that turns this paradigm around. The team developed the Multi-Task Trajectory Transfer (MT3) system, which can learn a task from a single demonstration, and used it to learn 1,000 distinct manipulation tasks in under 24 hours.

This review looks in depth at how the work overcomes the limits of the existing paradigm, its core ideas and experimental results, and its potential for real-world use.


1. Research background: why does data efficiency matter?

1.1 Limits of current imitation learning

In recent years, behavioral cloning (BC) on large-scale datasets has become the mainstream approach in robotics. The data requirements of these methods, however, severely limit their practicality:

  • BC-Z: about 26,000 demonstrations for 100 tasks (~250 per task)
  • RT-1: about 130,000 demonstrations for 744 tasks, collected over 17 months (~175 per task)
  • MT-ACT: 7,500 demonstrations for 38 tasks, collected over 2 months (~200 per task)
  • ALOHA Unleashed: up to 8,000 demonstrations per task for complex tasks

At these rates, building a general-purpose robot system that handles thousands of tasks would take an enormous amount of time and money.

1.2 Learning efficiency in humans and animals

Humans and animals, by contrast, learn far more efficiently:

  • Infants acquire manipulation skills much faster with expert demonstrations than through unguided exploration
  • Primates can learn manipulation tasks from fewer than 5 demonstrations
  • Rodents also acquire behavioral and navigation skills from fewer than 10 demonstrations

The gap between this biological learning efficiency and current robot systems suggests that the underlying approach needs rethinking.


2. Core idea: Trajectory Decomposition

2.1 Separating alignment and interaction

MT3's central insight is to decompose a manipulation trajectory into two sequential phases:

Alignment Phase

  • Goal: move the end-effector to a suitable pose relative to the target object
  • Characteristic: the exact path does not matter; only the final pose does
  • Example: in a plug insertion task, positioning the plug in front of the socket
  • Flexibility: many different trajectories can achieve a successful alignment

Interaction Phase

  • Goal: perform the actual manipulation
  • Characteristic: precise execution is essential; the exact trajectory is the key to success
  • Example: actually inserting the plug into the socket
  • Precision: even small deviations can cause the task to fail

2.2 Why is decomposition effective?

The decomposition reflects the fundamentally different nature of the two phases:

  1. Different precision requirements: alignment is relatively forgiving, whereas interaction must be very precise
  2. Different learning difficulty: a policy specialized for one phase is easier to learn than a monolithic policy that tries to learn the whole trajectory at once
  3. Generalization: alignment strategies transfer more easily within similar object categories

In the experiments, decomposition-based approaches achieved an order of magnitude higher data efficiency than monolithic BC.


3. MT3 system architecture

3.1 System overview

MT3 takes the following inputs:

  • Segmented point cloud: 3D shape information for the target object
  • Language description: a natural-language description of the task (e.g., "open the water bottle")

and produces the following output:

  • Robot actions: end-effector motion and gripper state

3.2 Retrieval-based approach

MT3's most distinctive feature is that it is fully retrieval-based. Instead of encoding demonstrations into network weights as behavioral cloning does, it stores demonstrations in memory and retrieves the most suitable one at inference time.

Hierarchical Retrieval Pipeline

MT3 uses a two-stage retrieval process:

Stage 1: Language-Based Retrieval

  • Extract the micro skill name from the task description (e.g., "open the water bottle")
  • Filter for all demonstrations that perform the same micro skill

Stage 2: Geometry-Based Retrieval

  • Select the best demonstration based on object shape and pose similarity
  • Generate an object embedding with a PointNet++-based encoder
  • Retrieve the most similar demonstration by cosine similarity
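The two stages above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `StoredDemo` record, the `retrieve` function, and the 3-dimensional embeddings are hypothetical stand-ins for the real demonstration store and the PointNet++ encoder output.

```python
import math
from dataclasses import dataclass


@dataclass
class StoredDemo:
    # Hypothetical record; field names are illustrative only.
    skill: str              # micro skill name, e.g. "open bottle"
    embedding: list[float]  # stand-in for the encoder's object embedding
    demo_id: str


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(demos, skill, query_embedding):
    # Stage 1: keep only demonstrations of the requested micro skill.
    candidates = [d for d in demos if d.skill == skill]
    if not candidates:
        return None
    # Stage 2: pick the candidate with the most similar object embedding.
    return max(
        candidates,
        key=lambda d: cosine_similarity(d.embedding, query_embedding),
    )


memory = [
    StoredDemo("open bottle", [0.9, 0.1, 0.0], "demo_a"),
    StoredDemo("open bottle", [0.1, 0.9, 0.2], "demo_b"),
    StoredDemo("open box", [0.9, 0.1, 0.0], "demo_c"),
]
best = retrieve(memory, "open bottle", [0.8, 0.2, 0.1])
print(best.demo_id)
```

Filtering by language first keeps the geometric comparison within one skill, so an object with a similar shape but a different task (here `demo_c`) is never considered.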

Properties of the point cloud encoder

The point cloud encoder used by the team shows some interesting properties:

  • Hierarchical clustering: in t-SNE visualizations, embeddings cluster by object category, and within each category they sub-cluster by instance
  • Pose sensitivity: objects in similar poses lie closer together in embedding space
  • Generalization: for novel object instances, it still retrieves demonstrations with similar shapes effectively

3.3 Executing alignment and interaction

Retrieval-Based Alignment

  1. Pose Estimation: compute the end-effector pose required in the test scene using the Trajectory Transfer technique
    • Estimate the relative object pose (T_δ) between the demonstration and the test scene
    • Refine the estimate with Generalized ICP
  2. Motion Planning: generate a collision-free trajectory to the computed target pose

Mathematically, the end-effector pose in the test scene is computed as:

T^Test_WE = T_δ · T^Demo_WE

where:

  • T^Test_WE: end-effector pose in the test scene
  • T^Demo_WE: end-effector pose in the demonstration
  • T_δ: relative transformation between the demonstration and the test scene
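As a worked example of the pose transfer equation, the sketch below composes 4x4 homogeneous transforms in plain Python. The helper names (`matmul4`, `pose_z`) and the numeric poses are illustrative assumptions; in the real system T_δ would come from the pose estimator rather than being written by hand.

```python
import math


def matmul4(a, b):
    """Multiply two 4x4 homogeneous transforms stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def pose_z(yaw_deg, x, y, z):
    """Transform with a rotation about the z axis followed by a translation."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [
        [c, -s, 0.0, x],
        [s, c, 0.0, y],
        [0.0, 0.0, 1.0, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


# Hypothetical end-effector pose recorded at the start of the demonstration.
T_demo_WE = pose_z(0.0, 0.5, 0.0, 0.2)

# Hypothetical object displacement between demo and test scene:
# rotated 90 degrees about z and shifted 0.1 m along x.
T_delta = pose_z(90.0, 0.1, 0.0, 0.0)

# T^Test_WE = T_delta . T^Demo_WE
T_test_WE = matmul4(T_delta, T_demo_WE)
test_position = [T_test_WE[i][3] for i in range(3)]
print([round(v, 3) for v in test_position])
```

Because T_δ is applied on the left, the end-effector target moves rigidly with the object: the demonstrated offset between gripper and object is preserved in the new scene.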

Retrieval-Based Interaction

The interaction trajectory of the retrieved demonstration is replayed as-is in the end-effector frame. The approach is strikingly simple yet effective:

  • The end-effector velocities recorded in the demonstration are executed relative to the end-effector frame
  • The exact motion pattern is preserved
  • High success rates are achieved even on novel object instances
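One way to picture replay in the end-effector frame is sketched below. It is an assumption-laden simplification: the review describes replaying recorded velocities, whereas this sketch replays relative poses (each demo pose expressed in the frame of the first demo pose), which preserves the same motion pattern. All helper names and poses are hypothetical.

```python
import math


def matmul4(a, b):
    """Multiply two 4x4 homogeneous transforms stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def pose_z(yaw_deg, x, y, z):
    """Transform with a rotation about the z axis followed by a translation."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [
        [c, -s, 0.0, x],
        [s, c, 0.0, y],
        [0.0, 0.0, 1.0, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def invert_rigid(T):
    """Invert a rigid transform: R^T for rotation, -R^T t for translation."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def replay_in_ee_frame(demo_poses, T_start):
    """Re-express the demo motion relative to its first pose, then apply it
    from a new starting end-effector pose."""
    T0_inv = invert_rigid(demo_poses[0])
    return [matmul4(T_start, matmul4(T0_inv, T)) for T in demo_poses]


# Hypothetical demo: the gripper advances 5 cm along its own x axis.
demo = [pose_z(0.0, 0.50, 0.0, 0.2), pose_z(0.0, 0.55, 0.0, 0.2)]

# Hypothetical aligned start pose in the test scene, rotated 90 degrees.
T_start = pose_z(90.0, 0.1, 0.5, 0.2)

replayed = replay_in_ee_frame(demo, T_start)
end_position = [replayed[-1][i][3] for i in range(3)]
print([round(v, 3) for v in end_position])
```

In the rotated test scene the same 5 cm advance now points along the world y axis, which is exactly what replaying in the end-effector frame is meant to achieve.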

3.4 Why does retrieval work?

The success of retrieval-based interaction rests on two key insights:

  1. Structural similarity of trajectories: within an object category, the optimal interaction trajectory keeps a similar structure even when object shapes differ considerably

  2. Task tolerance: many manipulation tasks naturally tolerate variation in object shape

    • Example: when grasping different mugs, size and handle shape may vary, but the basic approach and grasping motion stay consistent

4. Comparative experiments: validating the effect of decomposition

To validate the effect of decomposition systematically, the team compared five methods:

4.1 Methods compared

  1. MT-ACT+ (Monolithic BC): baseline that learns the whole trajectory with a single policy
  2. BC-BC: both alignment and interaction learned with BC
  3. BC-Ret: BC for alignment, retrieval for interaction
  4. Ret-BC: retrieval for alignment, BC for interaction
  5. Ret-Ret (MT3): retrieval for both alignment and interaction

4.2 Experimental design

The team designed two complementary experiments:

Experiment 1: scaling demonstrations per task

  • Fixed: 4 micro skills, 12 seen tasks, 8 unseen tasks
  • Variable: number of demonstrations (1 → 50)
  • Goal: analyze how additional demonstrations affect performance

Selected micro skills:

  • Insert book in backpack (articulated object manipulation)
  • Insert bread in toaster (insertion)
  • Open box (articulated object manipulation)
  • Scoop pancake from pan (scooping)

Experiment 2: scaling the number of tasks

  • Fixed: 150 demonstrations in total
  • Variable: number of tasks and how demonstrations are distributed
    • 10 tasks × 15 demos
    • 30 tasks × 5 demos
    • 50 tasks × 3 demos
  • Goal: analyze how task diversity affects performance

4.3 Analysis of results

Overall performance comparison

The results show a clear performance ordering:

  1. MT3 (Ret-Ret) leads by a wide margin:
    • Consistently the best performance in every data regime
    • With 3 demos/task it outperforms the other methods at 50 demos/task
    • Strong performance on both seen and unseen tasks
  2. Decomposition helps consistently:
    • Every decomposition-based method outperforms the monolithic baseline (MT-ACT+)
    • The gap is largest in the limited demonstration regime (<10 demos/task)

Decomposition vs. Monolithic: a closer look

Learning dynamics as a function of dataset size:

  • Decomposition methods:
    • Steep performance gains in the 1-10 demos/task range
    • With just 1 demo/task they exceed MT-ACT+ at 10 demos/task
    • Performance tends to saturate at 50 demos/task
  • Monolithic (MT-ACT+):
    • Slow initial progress
    • Large gains from 10 → 50 demos/task
    • Narrows the gap to decomposition, but absolute performance remains lower

This suggests that decomposition exploits task structure by construction, whereas the monolithic approach has to learn that structure from large amounts of data.

Effect of task diversity:

  • Decomposition methods:
    • Seen tasks: performance decreases as task diversity increases (demonstrations are spread more thinly)
    • Unseen tasks: performance improves as task diversity increases (more varied object instances are seen)
  • Monolithic (MT-ACT+):
    • Performance improves with task diversity on both seen and unseen tasks
    • It gets better at finding patterns across the manipulation of different object instances
    • Absolute performance still remains below the decomposition methods

Statistical significance

Across all experimental conditions, the performance differences between the decomposition methods and MT-ACT+ were confirmed to be statistically significant with a two-proportion Z-test.
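For readers unfamiliar with the test, here is a minimal sketch of a pooled two-proportion Z-test using only the standard library. The success counts in the example are made up for illustration and are not figures from the paper.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Pooled two-proportion Z-test; returns (z, two-sided p-value)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled success rate under the null hypothesis p_a == p_b.
    p_pool = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1.0 - p_pool) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: method A succeeds in 40 of 50 rollouts,
# method B in 20 of 50. These numbers are illustrative only.
z, p_value = two_proportion_z_test(40, 50, 20, 50)
print(round(z, 3), p_value < 0.05)
```

The test compares success rates from independent sets of rollouts, which matches the binary success/failure outcome recorded for each real-world trial.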

4.4 Retrieval vs. Behavioral Cloning

๊ฐ phase(alignment, interaction)์— ๋Œ€ํ•ด retrieval๊ณผ BC๋ฅผ ๋น„๊ตํ•œ ๊ฒฐ๊ณผ:

Alignment Phase

  • Retrieval-based (Ret-Ret + Ret-BC) > BC-based (BC-Ret + BC-BC)
  • ๋ชจ๋“  data regime์—์„œ ์ผ๊ด€๋œ ์šฐ์œ„
  • ํŠนํžˆ limited demonstration ํ™˜๊ฒฝ์—์„œ ํฐ ๊ฒฉ์ฐจ

Interaction Phase

  • Retrieval-based (Ret-Ret + BC-Ret) > BC-based (Ret-BC + BC-BC)
  • Unseen tasks์—์„œ๋„ ๋†’์€ ์„ฑ๋Šฅ ์œ ์ง€
  • ๋‹จ์ˆœํžˆ demonstration์„ replayํ•˜๋Š” ๋ฐฉ์‹์ž„์—๋„ ํšจ๊ณผ์ 

๋†€๋ผ์šด ๋ฐœ๊ฒฌ: Retrieval-Based Interaction์˜ ํšจ๊ณผ

Retrieval-based interaction์ด ์ƒˆ๋กœ์šด ๋ฌผ์ฒด instance์— ๋Œ€ํ•ด์„œ๋„ ์ž˜ ์ž‘๋™ํ•˜๋Š” ์ด์œ :

  1. ๊ถค์  ๊ตฌ์กฐ์˜ ์•ˆ์ •์„ฑ: ๋™์ผ ์นดํ…Œ๊ณ ๋ฆฌ ๋‚ด ๋ฌผ์ฒด๋“ค์€ ํ˜•์ƒ์ด ๋‹ค์–‘ํ•ด๋„ ์œ ์‚ฌํ•œ ์ตœ์  interaction trajectory๋ฅผ ๊ฐ€์ง
  2. ์ž์—ฐ์Šค๋Ÿฌ์šด ํ—ˆ์šฉ ๋ฒ”์œ„: ๋งŽ์€ manipulation task๊ฐ€ ํ˜•์ƒ ๋ณ€ํ™”์— ๋Œ€ํ•œ tolerance๋ฅผ ๊ฐ€์ง
  3. BC๋„ ๋™์ผํ•œ ์ด์ : BC ์ ‘๊ทผ๋ฒ•๋„ ์ด๋Ÿฌํ•œ ํŠน์„ฑ์œผ๋กœ๋ถ€ํ„ฐ ์ด๋“์„ ์–ป์ง€๋งŒ, retrieval์ด ๋” ์ง์ ‘์ ์œผ๋กœ ํ™œ์šฉ

5. Learning a thousand tasks: an experiment at unprecedented scale

5.1 Scale and challenges

To test MT3's practicality, the team ran an experiment of unprecedented scale: learning 1,000 manipulation tasks from a single demonstration per task.

Task diversity

  • 31 macro skills: pour, insert, fold, grasp, swipe, twist, zip, dust, and others
  • 534 micro skills, for example:
    • "pour wine from wine bottle into wine glass"
    • "pour milk from carton into bowl"
    • "insert plate into plate rack"
    • "insert plug into socket"
    • "fold towel", "fold t-shirt"
  • 402 distinct objects: everyday household items

Comparison with prior work

Study     Tasks    Objects    Demos/Task    Collection time
BC-Z      100      ~12-70     ~250          125 hours
RT-1      744      ~12-70     ~175          17 months
MT-ACT    38       ~12-70     ~200          2 months
MT3       1,000    402        1             <24 hours

MT3 achieves roughly a 10x improvement in task diversity, roughly 6x in object diversity, and more than 175x in data efficiency.

5.2 Difficulty of the experimental conditions

To test MT3 thoroughly, the team deliberately chose difficult conditions:

Object diversity

  • Transparent/translucent objects: plastic containers, glasses (difficult for depth sensors)
  • Deformable objects: clothing
  • Reflective objects: a metal toaster
  • Articulated objects: drawers, boxes

Environmental variation

  • Distractor objects: 5-20 distractors placed in each evaluation
  • Lighting changes: the color and intensity of LED lighting are actively varied
  • Randomized object placement: anywhere in the workspace, rotated by up to 45 degrees
  • Surface color changes: deliberately different colors between demonstration and testing

5.3 Performance results

Overall success rate

  • Seen tasks: 78.25% (1,000 tasks, 2 trials each)
  • Unseen tasks: 65.66% (100 tasks, 2 trials each)

These are impressive results given that each task was learned from a single demonstration, under unprecedented task diversity and challenging real-world conditions.

Performance by macro skill

Performance correlates strongly with how much precision a task requires:

High success rate (80-90%+):

  • Stacking: stacking objects
  • Dusting: wiping away dust
  • Grasping: grasping objects
  • Characteristic: high tolerance for imperfect execution

Medium success rate (60-80%):

  • Pouring: pouring contents
  • Scooping: scooping out
  • Opening/Closing: opening and closing

Low success rate (40-60%):

  • Insertion: insertion tasks
  • Hanging: hanging objects
  • Characteristic: very precise execution required, low error tolerance

5.4 Failure case analysis

The team carried out a systematic failure mode analysis on the seen tasks:

Distribution of failure causes

  1. Pose estimation failures (23.9%):
    • Drastic pose changes relative to the demonstration
    • Large differences between partial point clouds caused by asymmetric shapes
    • Problems caused by perspective changes
    • Remedy: a multi-camera system that provides more complete geometric information
  2. Retrieval failures (22.3%):
    • Partially occluded objects
    • Difficulty identifying relevant variation in small object parts
    • Remedy: more complete object observation with multiple cameras, and better techniques for isolating the relevant object parts
  3. Segmentation failures (19.5%):
    • Transparent objects
    • Cluttered scenes containing similar-looking objects
    • Outlook: expected to improve as segmentation models continue to advance
  4. Grasped-object issues (20.2%):
    • Inconsistent placement of the object in the gripper between demonstration and deployment
    • Occurs mainly in tasks such as insertion and scooping
    • Remedy: generalizing across different grasps with the method of Papagiannis et al. (2024)
  5. Motion planning/kinematics (5.3%):
    • Relatively rare
    • Pure planning problems are not a major bottleneck
  6. Other (9.0%):
    • Calibration drift
    • Subtle misalignment errors

Key insights

  • Perception is the main bottleneck: segmentation, retrieval, and pose estimation account for about 66% of all failures
  • Motion execution is robust: pure motion problems make up only 5.3%
  • The path to improvement is clear: a multi-camera system and better perception algorithms could resolve most of the problems
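The "about 66%" figure can be checked directly from the breakdown above. The short sketch below sums the listed shares; the dictionary keys are just labels chosen for this example. Note that the six listed percentages add up to slightly more than 100, presumably because of rounding in the reported values.

```python
# Failure shares (percent of failures) as listed in the breakdown above.
failure_share = {
    "pose_estimation": 23.9,
    "retrieval": 22.3,
    "segmentation": 19.5,
    "grasped_object": 20.2,
    "motion_planning": 5.3,
    "other": 9.0,
}

# Perception-related causes, following the grouping used in the key insights.
perception_keys = ("pose_estimation", "retrieval", "segmentation")
perception_share = sum(failure_share[k] for k in perception_keys)
total_share = sum(failure_share.values())

print(round(perception_share, 1), round(total_share, 1))
```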

6. Technical details

6.1 Hardware setup

  • Robot platform: Sawyer robot arm
  • End-effector: 2F-85 Robotiq gripper
  • Sensing: RealSense D415 RGB-D camera (head-mounted)
  • Workspace: 80 × 45 cm

This is a minimal hardware setup, chosen with cost efficiency in mind.

6.2 Demonstration collection and processing

Demonstration representation

A demonstration τ is represented as:

τ = {(o_i, e_i)}_{i=1}^{N}

where:

  • o_i: RGB-D image observation
  • e_i: end-effector state (6D pose + gripper state)
  • N: sequence length
  • Sampling rate: 30 Hz

Each demonstration is stored together with a language description l.

Collection strategy

Only the interaction phase is recorded:

  • Only the final pose matters for the alignment phase, so the actual trajectory does not need to be recorded
  • Synthetic alignment trajectories can be generated instead

Advantages:

  1. Shorter collection time
  2. The demonstration decomposes naturally
  3. Synthetic data augmentation is straightforward

Point cloud generation pipeline

  1. Segmentation:
    • First frame: segment the target object with Grounding DINO
    • Subsequent frames: propagate the segmentation with XMem (can handle occlusion)
  2. Point cloud conversion:
    • RGB-D + segmentation mask → target object point cloud
    • Retrieval: expressed in the robot frame
    • BC training: expressed in the end-effector frame (improves spatial generalization)

6.3 Behavioral Cloning implementation

Details of the BC system the team implemented to compare decomposition and monolithic approaches fairly:

Network Architecture: MT-ACT+

Input processing:

  • Point Cloud Encoder: PointNet++ (clustering + per-cluster embedding)
  • Task Conditioning: FiLM (Feature-wise Linear Modulation)
    • CLIP embedding of the task description
    • Modulates point cloud features in a task-specific way
  • Multi-modal Modeling: Variational Inference
    • Models the multi-modal distribution of valid actions
    • Computationally cheaper than a diffusion model

Additional inputs and outputs:

  • Action history (for inferring task progress)
  • Terminal action output (an explicit completion signal)

Differences from MT-ACT:

  • Supports point cloud input
  • Proprioception removed (improves spatial generalization)
  • Includes action history
  • Parameter count tuned for each data regime

Loss Function

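A VAE objective is used, which maximizes the log-likelihood of the demonstrated action chunks, written here as minimizing the negative log-likelihood:

min_θ −Σ_{(o_i, a_i, l) ~ D} log π_θ(a_{i:i+k} | o_i, l)

Components:

  • Reconstruction loss
  • KL divergence term (regularization toward a Gaussian prior)
  • Learned weighting with homoscedastic uncertainty (Kendall & Cipolla, 2017)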
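To make the three components concrete, the sketch below computes each one for toy values. Several choices here are assumptions rather than details confirmed by the review: an L1 reconstruction loss, a diagonal-Gaussian encoder with a standard normal prior, and one common form of uncertainty weighting in which each loss term L is replaced by exp(−s)·L + s for a learned scalar s.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar)
    )


def l1_reconstruction(predicted, target):
    """Mean absolute error between predicted and demonstrated actions.
    The choice of L1 is an assumption made for this sketch."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


def uncertainty_weighted_loss(rec, kl, s_rec, s_kl):
    """One common homoscedastic-uncertainty weighting: each term is scaled
    by exp(-s) and penalized by +s, where s is a learned log-variance."""
    return math.exp(-s_rec) * rec + s_rec + math.exp(-s_kl) * kl + s_kl


# Toy values, for illustration only.
rec = l1_reconstruction([0.10, 0.20, 0.30], [0.00, 0.20, 0.60])
kl = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
loss = uncertainty_weighted_loss(rec, kl, s_rec=0.0, s_kl=0.0)
print(round(rec, 4), kl, round(loss, 4))
```

With both log-variances at zero the weighting reduces to a plain sum of the two terms; during training the learned values would rebalance them automatically.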

Action Representation

  • Action chunking: predicts k-step future actions
  • Relative poses: poses expressed relative to the current end-effector pose
  • Orientation: angle-axis representation
  • Spatial resolution: waypoints sampled at a constant 1 cm spacing

Data Augmentation

Common augmentation:

  1. Point cloud masking:
    • Furthest point sampling → 10 clusters
    • 4 random clusters are masked (robustness to partial occlusion)
  2. Noise injection:
    • Gaussian noise on the point cloud
    • Gaussian noise on the action history labels

Interaction-specific augmentation:

  • End-effector pose perturbation:
    • Position: ±0.9 cm
    • Orientation: ±5 degrees
  • State, action labels, and history labels are updated accordingly
  • Improves robustness to covariate shift
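The cluster-masking augmentation can be sketched as follows: pick 10 cluster centers by furthest point sampling, assign every point to its nearest center, and drop the points of 4 randomly chosen clusters. The function names and the random test cloud are assumptions for illustration; the authors' exact clustering procedure may differ.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_sampling(points, k, rng):
    """Return indices of k centers, each chosen as the point farthest
    from all previously selected centers."""
    centers = [rng.randrange(len(points))]
    min_d = [sq_dist(p, points[centers[0]]) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=lambda i: min_d[i])
        centers.append(nxt)
        for i, p in enumerate(points):
            min_d[i] = min(min_d[i], sq_dist(p, points[nxt]))
    return centers


def mask_clusters(points, n_clusters=10, n_masked=4, seed=0):
    """Drop all points belonging to n_masked randomly chosen clusters."""
    rng = random.Random(seed)
    centers = farthest_point_sampling(points, n_clusters, rng)
    labels = [
        min(range(n_clusters), key=lambda c: sq_dist(p, points[centers[c]]))
        for p in points
    ]
    dropped = set(rng.sample(range(n_clusters), n_masked))
    return [p for p, lab in zip(points, labels) if lab not in dropped]


# Hypothetical object point cloud: 200 random points in a 10 cm cube.
cloud_rng = random.Random(42)
cloud = [tuple(cloud_rng.uniform(0.0, 0.1) for _ in range(3)) for _ in range(200)]
masked_cloud = mask_clusters(cloud)
print(len(cloud), len(masked_cloud))
```

Masking whole spatial clusters, rather than individual random points, removes contiguous regions of the object, which resembles the partial occlusion the policy will meet at test time.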

Synthetic Alignment Trajectories

Synthetic trajectories are generated for the BC alignment policy and the MT-ACT+ baseline:

  1. Start pose sampling: within a 30×80×80 cm cuboid
  2. Linear trajectory: from the start pose to the first pose of the demonstration
  3. 1,000 trajectories generated per demonstration
  4. Additional perturbation: near the final alignment pose (1 mm-1 cm, 0.5-5 degrees)
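The first three steps can be sketched for positions only, as below. Orientation interpolation and the final-pose perturbation of step 4 are left out, and the cuboid center, the axis ordering of the 30×80×80 cm box, and the first demonstration position are all assumptions made for this example. The 1 cm waypoint spacing follows the action representation described earlier.

```python
import math
import random


def sample_start(center, size, rng):
    """Uniformly sample a start position inside an axis-aligned cuboid."""
    return tuple(c + rng.uniform(-s / 2.0, s / 2.0) for c, s in zip(center, size))


def linear_waypoints(start, goal, spacing=0.01):
    """Straight-line waypoints from start to goal, at most `spacing` apart."""
    distance = math.dist(start, goal)
    n_segments = max(1, math.ceil(distance / spacing))
    return [
        tuple(s + (g - s) * i / n_segments for s, g in zip(start, goal))
        for i in range(n_segments + 1)
    ]


rng = random.Random(0)
first_demo_position = (0.5, 0.0, 0.2)  # hypothetical first demo pose (meters)
cuboid_size = (0.30, 0.80, 0.80)       # 30 x 80 x 80 cm; axis order assumed
cuboid_center = (0.5, 0.0, 0.5)        # hypothetical placement of the cuboid

# 1,000 synthetic alignment trajectories for one demonstration.
trajectories = [
    linear_waypoints(
        sample_start(cuboid_center, cuboid_size, rng), first_demo_position
    )
    for _ in range(1000)
]
print(len(trajectories), len(trajectories[0]))
```

Because only the end pose of alignment matters, straight lines from random starts are enough to give a learned policy broad coverage of the workspace without any extra human demonstrations.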

6.4 Retrieval system details

Object Embedding Network

Architecture:

  • Encoder: PointNet++-based
  • Training: auto-encoder framework
    • Point cloud → embedding → occupancy grid
    • Loss: binary cross-entropy
  • Dataset: object-centric dataset (Vitiello et al., 2023)

Embedding space properties:

  • Category-level clustering: objects of the same category cluster together
  • Instance-level sub-clustering: instances are separated within each category
  • Pose sensitivity: similar poses map to nearby embeddings

Retrieval process

  1. Language matching: task description → extract the micro skill name
  2. Geometry matching:
    • Test object point cloud → embedding (PointNet++)
    • Compute cosine similarity
    • Select the demonstration with the highest similarity

Pose Estimation: Trajectory Transfer

Core idea:

T^Test_WE = T_δ · T^Demo_WE

Estimating T_δ:

  1. Initial estimate: regression method (Vitiello et al., 2023)
  2. Refinement: Generalized ICP (Open3D implementation)
    • Point cloud alignment
    • Iterative closest point with generalization


7. Comparison with related work

7.1 Trajectory Decomposition

Similar approaches:

  • Perceiver-Actor (Shridhar et al., 2022): waypoint decomposition
  • ChainedDiffuser (Xian et al., 2023): keypose prediction
  • Coarse-to-Fine Imitation Learning (Johns, 2021): single demonstration learning
  • DOME (Valassakis et al., 2022): one-shot visual servoing

What sets MT3 apart:

  • Explores a broader range of learning strategies
  • Systematic evaluation of design choices
  • Unprecedented scale (1,000 tasks)

7.2 Retrieval for Imitation Learning

Prior work:

  • VINN (Pari et al., 2022): nearest-neighbor retrieval
    • Frame-by-frame k-NN
    • Action averaging
    • Limitations: single phase, limited generalization
  • DINOBot (Di Palo & Johns, 2024): DINO-ViT features
    • Image-level retrieval + pixel-level alignment
    • Makes use of a foundation model
    • Limitations: RGB only, does not use task descriptions

MT3's improvements:

  • Hierarchical retrieval: language + geometry
  • Use of task descriptions: micro skill filtering
  • Object geometry: 3D point cloud embedding
  • Systematic evaluation: analysis of scaling and diversity effects

7.3 Large-Scale Robot Learning

Foundation-model-based approaches:

  • RT-1, RT-2 (Brohan et al., 2023; Zitkovich et al., 2023)
  • RoboCat (Bousmalis et al., 2024)
  • Octo (Octo Model Team et al., 2024)
  • π0 (Black et al., 2024)

Differences:

Property            Foundation Models              MT3
Approach            End-to-end learning            Structural decomposition
Data requirement    Hundreds of demos/task         1 demo/task
Generalization      Internet-scale pre-training    Retrieval + geometric reasoning
Interpretability    Black-box                      Explicit phases
Scalability         Requires massive compute       Efficient

Complementary relationship:

  • Foundation models: broad world knowledge, semantic understanding
  • MT3: data efficiency, explicit reasoning, interpretability


8. Limitations and future research directions

8.1 Current limitations

1. Constraints of the task definition

  • Single interaction tasks: only a single interaction with one object is handled
  • Multi-step behaviors: composite tasks such as pick-and-place require chaining
  • Going forward: integration with a high-level planner (examples are on the paper's website)

2. Grasped Object Assumption

  • Assumption: the object's pose in the gripper is the same in demonstration and testing
  • Problem: accounts for 20.2% of failures, in tasks such as insertion and scooping
  • Remedy: the method of Papagiannis et al. (2024) could be applied

3. Dependence on perception

  • Main failure causes: segmentation (19.5%), retrieval (22.3%), pose estimation (23.9%)
  • Especially difficult cases:
    • Transparent/translucent objects
    • Cluttered scenes
    • Occluded objects
    • Drastic pose changes

4. Single-camera limitations

  • Problem: incomplete object observation, perspective-dependent challenges
  • Impact: reduced retrieval and pose estimation accuracy
  • Remedy: largely addressable with a multi-camera setup

8.2 Future research directions

1. Improving perception

Multi-Camera System:

  • More complete object observation
  • Reduced perspective variation
  • Better occlusion handling
  • Expected effect: could help with the roughly 66% of failures attributed to perception

Advanced Segmentation:

  • Transparent object handling
  • Cluttered scene robustness
  • Foundation-model-based segmentation (SAM and similar)

Robust Pose Estimation:

  • Learning-based registration
  • Multi-view consistency
  • Symmetry handling

2. A more advanced retrieval system

More sophisticated matching:

  • Part-level similarity
  • Task-relevant feature emphasis
  • Context-aware retrieval

Active learning:

  • Uncertainty-aware demonstration selection
  • Optimal demonstration set curation

3. Extending generalization

Cross-Category Transfer:

  • Current: novel instances of the same category
  • Goal: different categories that call for similar manipulation

Few-Shot Adaptation:

  • Learning a new task category from a few additional demonstrations
  • Combination with meta-learning

4. Extending to more complex tasks

Multi-Object Manipulation:

  • Simultaneous interaction with multiple objects
  • Object rearrangement

Long-Horizon Tasks:

  • Integration with hierarchical planning
  • Task decomposition at multiple levels

Bimanual Manipulation:

  • Two-handed coordination
  • Enables more complex manipulation

5. Safety and robustness

Failure Detection and Recovery:

  • Online monitoring
  • Automatic retry with alternative demonstrations

Safe Exploration:

  • Constraint-aware execution
  • Collision avoidance in cluttered environments

6. Integration with foundation models

Vision-Language Models:

  • Better task understanding
  • Natural language interaction
  • Scene understanding

Hybrid Approach:

  • Semantic knowledge from foundation models
  • Data efficiency and geometric reasoning from MT3
  • Potential to combine the strengths of both

Summary of Key Contributions

  1. Systematic validation of the decomposition prior:
    • The alignment-interaction decomposition improves data efficiency by an order of magnitude
    • Especially effective in the limited-demonstration regime (<10 demos/task)
    • Thoroughly validated with 3,450 real-world rollouts
  2. Rediscovery of retrieval-based learning:
    • Superior performance to BC in the low-data regime
    • A simple but effective approach
    • Strong generalization on both seen and unseen tasks
  3. Demonstration at unprecedented scale:
    • 1,000 tasks, 402 objects, 31 macro skills
    • A single demonstration per task
    • Collected in under 24 hours
    • 2-3 orders of magnitude improvement over prior work
  4. Systematic analysis of failure modes:
    • Perception is the main bottleneck (about 66% of failures)
    • Motion execution is robust (5.3% of failures)
    • Clear directions for improvement

Experimental Design:

  • Controlled experiments isolate the effect of each component
  • Evaluation across multiple data regimes
  • Statistical significance testing
  • Careful implementation for fair comparison

Evaluation Rigor:

  • Challenging real-world conditions
  • Diverse object types and environments
  • Systematic failure analysis
  • Transparent reporting

Theoretical Insights

Why Decomposition Works:

  • It exploits the fundamentally different characteristics of alignment and interaction
  • A policy specialized to each phase is easier to learn
  • Tolerance and precision requirements are matched to the right phase

Why Retrieval Works:

  • Structural similarity of optimal trajectories
  • Effective use of task tolerance
  • Directness and efficiency of geometric reasoning

Current Limitations:

  • Dependence on perception (especially transparent objects and occlusion)
  • Restricted to single-interaction tasks
  • Assumes a consistent pose of the grasped object

Paths to Improvement:

  • Multi-camera perception
  • Integration with hierarchical planning
  • Advanced grasp adaptation
  • Foundation model integration

MT3 raises fundamental questions about "scaling laws."

  • Are more data and bigger models always the answer?
  • How should structural priors and domain knowledge be used?
  • What is the right balance between data efficiency and generalization?

The study strongly supports the claim that "better architecture beats bigger data," while also suggesting that future robot learning research should pursue both directions:

  1. Efficiency-Focused Approach (MT3 style):
    • Use of structural priors
    • Explicit reasoning
    • Data efficiency first
  2. Scale-Focused Approach (Foundation Models):
    • Large-scale pre-training
    • Emergent capabilities
    • Broad generalization

Strengths of the Study

1. Clear motivation and problem definition:

  • Accurately points out the unrealistic data requirements of current systems
  • Uses the comparison with biological learning to show how much room for improvement exists

2. Systematic experimental design:

  • Controlled experiments isolate the effect of each design choice
  • Evaluation from multiple perspectives (dataset size, diversity)
  • Statistical rigor

3. Demonstration at unprecedented scale:

  • Talk is cheap, show me the code/results
  • 1,000 tasks means more than just a number
  • Challenging real-world conditions

4. Transparent analysis:

  • Honest reporting of failure cases
  • Explicit limitations of each component
  • Directions for future work

Areas for Improvement

1. Fairness of the BC baseline:

  • It is unclear whether the MT-ACT+ implementation matches the original MT-ACT
  • Comparison with stronger BC baselines (e.g., diffusion models) is needed
  • The switch to point cloud input may have put the baseline at a disadvantage

2. Limited generalization evaluation:

  • Unseen tasks are confined to the same macro skills
  • Cross-category transfer is not evaluated
  • Zero-shot ability on novel manipulation types is not verified

3. Absence of multi-step tasks:

  • Limited to a single interaction
  • Performance on long-horizon tasks is unclear
  • Integration with high-level planning is not demonstrated

4. Scope of comparison:

  • No direct comparison with recent foundation models (Octo, π0)
  • Comparison with SOTA methods trained on the same data is needed

Suggestions for Future Research

1. Adaptive Decomposition:

  • Dynamically choose the decomposition strategy per task
  • Learning when to decompose

2. Hierarchical Retrieval:

  • Multi-level similarity (category → instance → pose)
  • Context-aware demonstration selection

3. Active Demonstration Collection:

  • Uncertainty-guided demonstration request
  • Minimal demonstration set for maximum coverage

4. Foundation Model Integration:

  • VLM for better task understanding
  • Semantic guidance for retrieval
  • Hybrid reasoning (explicit + implicit)

Questions for the Robotics Community

  1. Data vs. structure: Where do scaling laws reach their limits?
  2. Explicit vs. implicit: How much inductive bias is appropriate?
  3. The nature of generalization: Is it interpolation or retrieval?
  4. Practical deployment: How do we close the lab-to-field gap?

๋งˆ์น˜๋ฉฐ

โ€œLearning a Thousand Tasks in a Dayโ€๋Š” ๋กœ๋ด‡ ํ•™์Šต์˜ ํŒจ๋Ÿฌ๋‹ค์ž„์„ ์žฌ์ •๋ฆฝํ•˜๋Š” ์ค‘์š”ํ•œ ์—ฐ๊ตฌ์ž…๋‹ˆ๋‹ค. ๋‹จ์ˆœํžˆ โ€œ๋” ๋งŽ์€ ๋ฐ์ดํ„ฐโ€๋ฅผ ์ˆ˜์ง‘ํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, task์˜ ๊ตฌ์กฐ๋ฅผ ์ดํ•ดํ•˜๊ณ  ํ™œ์šฉํ•จ์œผ๋กœ์จ ํš๊ธฐ์ ์ธ ๋ฐ์ดํ„ฐ ํšจ์œจ์„ฑ์„ ๋‹ฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๋ฉ”์‹œ์ง€

  1. Decomposition matters: Alignment์™€ interaction์˜ ๋ถ„๋ฆฌ๋Š” ํ•œ ์ž๋ฆฟ์ˆ˜์˜ ํšจ์œจ์„ฑ ๊ฐœ์„ ์„ ๊ฐ€์ ธ์˜ด
  2. Retrieval works: ๋‹จ์ˆœํ•˜์ง€๋งŒ ํšจ๊ณผ์ . ํŠนํžˆ limited data regime์—์„œ
  3. Scale is achievable: ์ ์ ˆํ•œ ์ ‘๊ทผ๋ฒ•์œผ๋กœ 1,000 tasks๋ฅผ ํ•˜๋ฃจ ๋งŒ์— ํ•™์Šต ๊ฐ€๋Šฅ
  4. Perception is key: ๊ฐœ์„ ์˜ ์—ฌ์ง€๊ฐ€ ๊ฐ€์žฅ ํฐ ๋ถ€๋ถ„

์ด ์—ฐ๊ตฌ๋Š” ๋‹ค์Œ์„ ์ƒ๊ธฐ์‹œํ‚ต๋‹ˆ๋‹ค:

  • First principles thinking์˜ ์ค‘์š”์„ฑ: ๋ฌธ์ œ์˜ ๋ณธ์งˆ์  ๊ตฌ์กฐ ์ดํ•ด
  • Simplicity์˜ ํž˜: ๋ณต์žกํ•œ end-to-end๋ณด๋‹ค ๋‹จ์ˆœํ•˜๊ณ  interpretableํ•œ ์ ‘๊ทผ์ด ๋•Œ๋กœ๋Š” ๋” ํšจ๊ณผ์ 
  • Systematic evaluation์˜ ๊ฐ€์น˜: Rigorous experiments๊ฐ€ ์ง„์ •ํ•œ ํ†ต์ฐฐ์„ ์ œ๊ณต

MT3๋Š” ์‹œ์ž‘์ ์ž…๋‹ˆ๋‹ค. Foundation models, multi-modal learning, hierarchical planning ๋“ฑ๊ณผ์˜ ํ†ตํ•ฉ์„ ํ†ตํ•ด ๋”์šฑ ๊ฐ•๋ ฅํ•˜๊ณ  ๋ฒ”์šฉ์ ์ธ ๋กœ๋ด‡ ์‹œ์Šคํ…œ์œผ๋กœ ๋ฐœ์ „ํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ฐ€์žฅ ์ค‘์š”ํ•œ ๊ฒƒ์€, ์ด ์—ฐ๊ตฌ๊ฐ€ ๋ณด์—ฌ์ค€ ๋ฐ์ดํ„ฐ ํšจ์œจ์„ฑ์— ๋Œ€ํ•œ ์ƒˆ๋กœ์šด ๊ฐ€๋Šฅ์„ฑ์ž…๋‹ˆ๋‹ค. ์ด์ œ ์šฐ๋ฆฌ๋Š” ์ˆ˜์ฒœ ๊ฐœ์˜ ์ž‘์—…์„ ๋‹ค๋ฃจ๋Š” ๋ฒ”์šฉ ๋กœ๋ด‡์ด ๊ทธ๋ฆฌ ๋จผ ๋ฏธ๋ž˜๊ฐ€ ์•„๋‹ ์ˆ˜ ์žˆ๋‹ค๋Š” ํฌ๋ง์„ ๊ฐ€์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

⛏️ Dig Review

⛏️ Dig: Go deep, uncover the layers. Dive into technical detail.

Overview: Learning 1,000 Robot Tasks in a Day

What if a robot could be taught as many as 1,000 manipulation tasks in less than a day? Conventional robot imitation learning has often required hundreds to thousands of demonstrations per task, but "Learning a Thousand Tasks in a Day," recently published in Science Robotics, presents a new paradigm in which a wide range of object manipulation tasks is learned from a single demonstration each. This post analyzes the paper's methodology, mathematical details, experimental results, and contributions in depth from a roboticist's point of view. It focuses on two core ideas, the two-phase "alignment-interaction" split of a manipulation motion and generalization through demonstration retrieval, and examines how they raise data efficiency enough to learn a thousand tasks in a day. Using the experimental setup and results, it also discusses the mathematical definition of policy learning, the training infrastructure, the roles of imitation learning and reinforcement learning, and the implications for future robot learning.

Methodology: Analysis of the Multi-Task Learning Framework

1. A Two-Phase Manipulation Policy: Alignment Phase and Interaction Phase

The first core idea is a structural prior that decomposes a manipulation motion into two phases. Executing a task is divided into an "alignment phase" and an "interaction phase," and a policy specialized to each phase is obtained separately.

In the alignment phase, the robot's end-effector is brought to a suitable initial pose relative to the target object, preparing for the manipulation that follows. Only the final target pose matters here; the path taken to reach it is largely irrelevant. For example, when bringing a power plug in front of a socket, any path is acceptable as long as the plug ends up in front of the socket.

In the interaction phase, the actual object manipulation is carried out. This requires fine, accurate trajectory control, and the executed path itself determines task success. Inserting the aligned plug into the socket, for example, demands an insertion motion precise enough that even small errors are not tolerated.

Separating the two phases lets each policy be optimized for its own role. The paper shows that, compared with the conventional approach of learning the whole motion at once with a single monolithic policy, the two-phase policy learns efficiently from few demonstrations. The authors report that using two specialized policies in sequence, one for alignment and one for interaction, improves data efficiency by more than 10x over conventional BC (Behavioral Cloning) with a single monolithic policy. This specialization makes learning feasible with a small number of demonstrations (for example, fewer than 10 per task).

2. Behavioral Cloning (BC) vs. Retrieval-Based Policies

The second core idea concerns how demonstration data is used. Behavioral Cloning (BC), the common approach, trains a neural network policy on the demonstrations and then predicts actions at inference time using only the trained network. This work instead introduces a "retrieval-based" policy, which differs fundamentally from BC in that it does not use the demonstrations for training and consults them directly at inference time. In other words, BC internalizes the demonstrations in neural network weights, whereas the retrieval method loads a stored demonstration from memory at execution time and uses it as is. The paper's proposed method, Multi-Task Trajectory Transfer (MT3), is the combination of retrieval-based alignment and retrieval-based interaction. The MT3 execution flow can be summarized as follows:

**MT3 Policy Execution Process**
Input: language description of the task $T$, point cloud of the target object in the current scene $O$

1. **Demonstration retrieval**: For all stored demonstrations $D$, compute the task-description similarity and the geometric similarity, and select the most similar demonstration $d^*$.
 - The *language similarity* evaluates the semantic similarity between $T$ and the demonstration's description $T_i$, and the *geometric similarity* measures the shape and pose similarity between the current observation $O$ and the demonstration's object point cloud $O_i$ in a **learned latent space**.
2. **Alignment phase**: Load the **end-effector pose at the end of alignment** from the selected demonstration $d^*$ and transform it into the frame of the target object in the current scene using **object pose estimation**. Then run **motion planning** to move the robot to that target pose, which completes the alignment.
3. **Interaction phase**: Replay the **end-effector velocity sequence recorded during the interaction phase** of $d^*$ as is (open-loop). The replay is expressed **in the end-effector frame**, so the demonstrated trajectory is followed relative to the object pose established during alignment.
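
The alignment and interaction steps above can be sketched with plain rigid-body transforms. This is a minimal illustration, not the authors' implementation: it assumes pose estimation has already produced the object pose in the demonstration and in the live scene, represents each pose as a hypothetical `(R, t)` pair, and replays per-step end-effector displacements (velocity times time step) rather than raw velocities. All function names and numbers are made up for illustration.

```python
import math

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def rot_z(theta):
    """3x3 rotation about the vertical (z) axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def compose(a, b):
    """Rigid transform 'a then b in a's frame': (Ra*Rb, Ra*tb + ta)."""
    (Ra, ta), (Rb, tb) = a, b
    R = [[sum(Ra[i][k] * Rb[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    t = [sum(Ra[i][k] * tb[k] for k in range(3)) + ta[i] for i in range(3)]
    return R, t

def inverse(a):
    """Inverse of a rigid transform: (R^T, -R^T t)."""
    R, t = a
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    return Rt, [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]

def alignment_target(obj_demo, obj_live, ee_demo):
    """Map the demo's end-of-alignment gripper pose onto the live object."""
    delta = compose(obj_live, inverse(obj_demo))  # demo object -> live object
    return compose(delta, ee_demo)

def replay_open_loop(start_pose, ee_frame_steps):
    """Apply recorded motions, each expressed in the current gripper frame."""
    pose, path = start_pose, [start_pose]
    for step in ee_frame_steps:
        pose = compose(pose, step)
        path.append(pose)
    return path

# Toy scene: the object has moved and rotated 90 degrees about z.
obj_demo = (IDENTITY, [0.5, 0.0, 0.0])
ee_demo = (IDENTITY, [0.5, 0.0, 0.1])   # gripper 10 cm above the object
obj_live = (rot_z(math.pi / 2), [0.6, 0.2, 0.0])

target = alignment_target(obj_demo, obj_live, ee_demo)
steps = [(IDENTITY, [0.0, 0.0, -0.05])] * 2   # two 5 cm moves along gripper z
path = replay_open_loop(target, steps)
print([round(v, 3) for v in target[1]])
print([round(v, 3) for v in path[-1][1]])
```

Because the replayed steps are expressed in the gripper frame, the same recorded motion stays valid wherever the object is placed, provided the alignment target was computed correctly. That is also why pose estimation errors propagate directly into the interaction phase.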

The notable point in this process is that a retrieval-based policy works without any prior training on the demonstration data. The authors use a natural language model to compute similarity between task descriptions and compare latent vectors of 3D object shape obtained from an RGB-D camera to select the best-matching demonstration. When the selected demonstration is applied, the alignment phase uses 6-DoF object pose estimation to map the demonstration's final pose onto the current object's pose, and the interaction phase reproduces the demonstrated motion "open-loop." This retrieval-plus-open-loop approach contrasts with earlier work, which typically fine-tuned the interaction policy with reinforcement learning or used a learned feedback controller. The paper shows that demonstration replay alone can reach sufficient performance without reinforcement learning, which makes it a creative combination of imitation learning and classical control algorithms.

Note: Because a retrieval-based policy always refers to the demonstration data during execution, it can be viewed as a non-parametric (training-free) policy. Scarce training data is therefore not a problem, but the open-loop replay has a drawback: errors cannot be corrected through feedback during execution. The experimental results discussed later also show this limitation of the open-loop approach.

3. Learning Architecture and Training Infrastructure

It is also worth noting that all compared methods are implemented within the same input-output framework. The robot takes camera-based visual information and a text description of the task as input. Specifically, the target object's point cloud is segmented from a scene captured by an Intel RealSense D415 RGB-D camera and is given to the policy together with a sentence describing the task. From this multimodal input, the multi-task policy must output robot control commands (for example, joint velocities or changes in end-effector position). For BC, the policy is a neural network model; for retrieval, the retrieve-and-replay algorithm described above plays the role of the policy. The authors use the same hardware platform for all methods (a Sawyer 7-DoF robot arm with a Robotiq 2F-85 gripper) and design every method to process the same form of input, which supports a fair comparison. For example, a monolithic BC policy and a decomposed BC+Retrieval policy both receive point cloud plus language input and produce motion from it.

Training is done offline. The paper does not directly describe any distributed training infrastructure, so the following is this review's inference rather than a reported fact: although the demonstration data for 1,000 tasks is tiny (one demonstration per task), the complexity of the policy network (for example, a Transformer-based model) and multi-task training make it plausible that multi-GPU training or parallelized motion planning was used. The actual implementation can be checked in the released code repository, which includes training scripts and model architectures and so provides a technical basis for reproduction.

Finally, the work is interesting from the standpoint of combining reinforcement learning and imitation learning. The authors do not use a reinforcement learning (RL) algorithm directly, but they adopt ideas established in earlier RL research, such as policy decomposition (for example, align first, then refine) and open-loop replay. The method can thus be seen as combining the data efficiency of imitation learning (IL) with the stable execution strategies of classical control and RL. Where needed, it could be extended into a hybrid that fine-tunes with RL or applies online correction, and it suggests a new design direction for large-scale multi-task learning.

Mathematical Details: Policy Learning, Loss Functions, and Stability

This section looks at the mathematical definition of policy learning in the paper, the multi-task learning setup, the loss function design, and the techniques used to stabilize training. In particular, it explains the training objective of the BC policy and the generalization principle of the retrieval policy in terms of equations and algorithms.

1. Behavioral Cloning (BC) Policy Learning: A Multi-Task Probabilistic Model

A BC policy is expressed as a stochastic policy $\pi_\theta(a \mid o)$ that outputs a robot action $a$ given an observation $o$ (for example, a point cloud from the camera plus the task description). The policy is trained on a demonstration dataset $D = \{(o_i, a_i)\}$ to maximize the probability of the demonstrated actions. The training objective can be written as

$$\max_{\theta} \; \frac{1}{|D|}\sum_{(o,a)\in D} \log \pi_{\theta}(a \mid o),$$

that is, maximizing the average log-probability that the policy assigns to the demonstrated actions. This is equivalent to minimizing the negative log-likelihood (NLL) loss. The paper trains the policy as a probabilistic generative model, adopting a variational autoencoder (VAE) structure in which a latent variable $z$ captures the diversity of the policy. Concretely, the Transformer-based policy network known as MT-ACT is modified so that an encoder $q_\phi(z \mid o,a)$ and a decoder $p_\theta(a \mid o,z)$ are trained jointly. The training loss is then the sum of a reconstruction loss (the difference between the demonstrated action $a$ and the decoder output) and a regularization loss (the KL divergence between the encoder's latent distribution $q_\phi(z \mid o,a)$ and the prior $p(z)$). Formally, the VAE objective is

$$L_{\text{VAE}}(\theta,\phi) \;=\; \mathbb{E}_{q_\phi(z \mid o,a)}\big[-\log p_\theta(a \mid o,z)\big] \;+\; \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid o,a) \,\|\, p(z)\big),$$

where $\beta$ is the weight of the KL term. According to the paper, this VAE-based training lets the MT-ACT model learn representations that reflect the spatial configuration and geometric similarity of the demonstration data. The reconstruction loss consists of several components of the robot action (for example, position, orientation, and gripper state), and uncertainty-based weighting is used to sum the component errors on a common scale. Following the method of Kendall et al. (2017), the weight of each loss component is adjusted automatically during training, so training remains stable without manual weight tuning. This is an important stabilization technique: when several kinds of error terms coexist, it mitigates the instability caused by differences in scale or units between them.
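
To make the loss structure above concrete, here is a minimal sketch of a beta-weighted VAE loss with Kendall-style uncertainty weighting of the reconstruction terms. It is an illustration under stated assumptions, not the paper's code: the encoder is assumed to output a diagonal Gaussian with a standard normal prior, the per-component errors stand in for the negative log-likelihood, and the learnable log-variances `s_i` are passed in as plain numbers. All function names are hypothetical.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def uncertainty_weighted(losses, log_vars):
    """One common form of uncertainty weighting:
    sum_i 0.5 * (exp(-s_i) * L_i + s_i), with s_i = log(sigma_i^2).
    A large s_i down-weights a noisy term; the +s_i penalty stops s_i
    from growing without bound."""
    return sum(0.5 * (math.exp(-s) * loss + s)
               for loss, s in zip(losses, log_vars))

def vae_bc_loss(recon_terms, term_log_vars, mu, log_var, beta):
    """Weighted reconstruction loss plus beta times the KL regularizer."""
    recon = uncertainty_weighted(recon_terms, term_log_vars)
    return recon + beta * kl_to_standard_normal(mu, log_var)

# Toy numbers: position, orientation and gripper errors for one sample.
recon_terms = [0.02, 0.30, 0.10]
term_log_vars = [0.0, 0.0, 0.0]   # in practice these would be learned
loss = vae_bc_loss(recon_terms, term_log_vars,
                   mu=[0.1, -0.2], log_var=[0.0, 0.0], beta=10.0)
print(round(loss, 4))
```

With all `s_i = 0` the weighting reduces to a plain (halved) sum, so the learned log-variances only matter once the components differ in scale, which is the situation the text describes for position, orientation, and gripper terms.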

2. Multi-Tasking and Generalization: Embeddings and the Retrieval Algorithm

In the multi-task learning setting, a BC policy must learn from the demonstrations of many tasks at once. To make this possible, the policy is given an additional input that identifies the task. The language description mentioned earlier specifies the goal of each task, and its text embedding lets the policy network know which task it should perform. For example, given the task description "open the zipper of the pink handbag," the network processes this embedding together with the visual information (the segmented 3D data of the handbag) and is trained to output actions suited to that task. In this way a single network can distinguish and learn hundreds of different tasks, and as a result 534 distinct micro skills were learned together.

The retrieval-based policy has no separate training process, but it is mathematically important that it generalizes through similarity measured in an embedding space. The authors combine an embedding of the text description (from a natural language model) with an embedding of the object shape (a latent vector encoding the point cloud extracted from RGB-D) to define an overall similarity function $S(i, *)$. For a new test task, this function scores the similarity to each stored demonstration $i$, and the demonstration $d^*$ with the highest score is selected:

$$d^* \;=\; \arg\max_{i \in D}\Big[ \text{Sim}_{\text{lang}}(T, T_i)\;+\;\text{Sim}_{\text{geo}}(O, O_i)\Big].$$

Here $\text{Sim}_{\text{lang}}$ is the similarity between task description sentences (for example, cosine similarity of sentence embeddings) and $\text{Sim}_{\text{geo}}$ is the similarity between object point clouds. The paper defines the latter through distance in a learned latent space, which appears to mean the distance between vectors extracted by a pre-trained object encoder network (its detailed structure is described in the appendix). To compute these similarities efficiently, the embeddings of all demonstrations can be stored in advance and a nearest-neighbor search performed at test time (a library such as FAISS could be used if needed).

The generalization principle of the retrieval policy is, put simply, "similar objects and similar tasks can share the same trajectory." Given the task of grasping a new mug, for example, the method finds a stored demonstration of grasping a different mug and imitates it directly. Mathematically, this reflects the assumption that the structure of the optimal trajectory is shared within the same macro skill category. Shape and size differences within a certain range can be absorbed by small adjustments to the demonstrated trajectory, and these adjustments are handled mainly by the coordinate transform in the alignment phase. The authors illustrate this with mug grasping: although handle shape and size vary from mug to mug, the structure of the core grasp motion is similar, so a single demonstration can successfully grasp other mugs.

Retrieval does, however, have limits when it comes to continuous generalization (interpolation). When the required solution lies somewhere between two demonstrations, the method simply picks the nearer one and cannot generate an intermediate solution, as the authors point out. This is a fundamental limitation of a non-parametric approach that does not interpolate between stored examples; follow-up work would need to address it, for example by combining several demonstrations or by generating new trajectories with a generative model.
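
The retrieval rule above amounts to a nearest-neighbor lookup over precomputed embeddings. The sketch below is a hedged illustration: it uses cosine similarity for both the language and geometry terms (the paper is described as using a distance in a learned latent space for geometry), and the two-dimensional embeddings and demonstration ids are invented toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_text_emb, query_geo_emb, memory):
    """Return the id of the stored demo with the highest combined score."""
    def score(entry):
        return (cosine(query_text_emb, entry["text"])
                + cosine(query_geo_emb, entry["geo"]))
    return max(memory, key=score)["id"]

# Hypothetical memory of demonstrations with precomputed embeddings.
memory = [
    {"id": "grasp_mug_A", "text": [1.0, 0.0], "geo": [0.9, 0.1]},
    {"id": "open_drawer", "text": [0.0, 1.0], "geo": [0.2, 0.8]},
]

# Query: "grasp mug B", an unseen mug whose shape resembles mug A.
best = retrieve([0.9, 0.1], [0.8, 0.3], memory)
print(best)
```

Note that the lookup always returns exactly one stored demonstration. Even if a query sat halfway between two entries, the result would be one of them rather than a blend, which is the interpolation limit discussed above.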

3. Training Stabilization and Data Augmentation

To make learning work with few demonstrations, the authors also use data augmentation and simulation. In the alignment phase in particular, a single demonstration provides only one path to the target object, so the network sees no variety in approach paths. To address this, additional motions that reach the same final alignment pose along varied paths are generated in simulation and added to the training data of the BC alignment policy (Methods, Section 4.3.4 of the paper). For the motion of bringing a plug in front of a socket, for example, curved paths and paths at various angles are generated algorithmically in addition to the straight-line path, so that the alignment policy learns to reach the target pose without being sensitive to the shape of the path. This added diversity makes the policy more robust and helps the alignment policy succeed from new positions not seen during training.

On the demonstration augmentation side, the demonstration data is also randomly transformed to reflect changes in initial object placement and camera viewpoint. At evaluation time, objects are displaced randomly by up to 20 cm and rotated by a random angle about the vertical axis, and the data augmentation is designed so that the policy can tolerate the same kind of variation during training. For example, the demonstration point cloud is slightly rotated or translated, and noise or partial occlusion is added (see Appendix 4.3.3).

In summary, careful loss design (VAE plus uncertainty weighting), generalization through well-trained embeddings, and augmentation of demonstration paths together secure training stability and generalization performance. With this mathematical and algorithmic foundation in place, the next section looks at how the performance of MT3 differs from that of the other policies.
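
As an illustration of the point cloud augmentation described above, the sketch below rotates a cloud about the vertical axis, shifts it by at most 20 cm in the table plane, and adds Gaussian jitter. It is an assumption-laden sketch, not the paper's pipeline: the function name, the noise level, and the uniform-over-a-disc shift are illustrative choices, and occlusion augmentation is omitted.

```python
import math
import random

def augment_point_cloud(points, max_shift=0.20, noise_std=0.002, rng=None):
    """Rotate about z, shift by at most max_shift metres in the xy-plane,
    and add per-point Gaussian jitter. Returns (new_points, transform)."""
    rng = rng or random.Random()
    theta = rng.uniform(-math.pi, math.pi)
    c, s = math.cos(theta), math.sin(theta)
    # Sample the shift uniformly over a disc so its length never exceeds max_shift.
    radius = max_shift * math.sqrt(rng.random())
    phi = rng.uniform(0.0, 2.0 * math.pi)
    dx, dy = radius * math.cos(phi), radius * math.sin(phi)
    new_points = [
        (c * x - s * y + dx + rng.gauss(0.0, noise_std),
         s * x + c * y + dy + rng.gauss(0.0, noise_std),
         z + rng.gauss(0.0, noise_std))
        for x, y, z in points
    ]
    return new_points, (theta, dx, dy)

# Two toy points 'on an object'; a fixed seed keeps the example repeatable.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.05)]
augmented, transform = augment_point_cloud(cloud, noise_std=0.0,
                                           rng=random.Random(0))
print(len(augmented), [round(v, 3) for v in transform])
```

The function returns the sampled transform because, for BC training, the same rotation and shift would also have to be applied to the action labels (end-effector poses) so that observations and targets stay consistent.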

Experiments: Environment, Task Diversity, and Analysis of Results

The study proceeds in two stages: small-scale controlled experiments and the large-scale 1,000-task experiment. This section covers the experimental setup, task composition and data collection speed, the baselines and evaluation method, and a quantitative and qualitative analysis of the key results.

1. Experimental Environment and Task Composition

Robot platform: All experiments are carried out on a real robot. The hardware is a Sawyer 7-DoF robot arm (Rethink Robotics) with a Robotiq 2F-85 electric gripper, a research robot with a workspace similar to that of a human arm. An Intel RealSense D415 RGB-D camera mounted on the robot's head captures color images and depth from a viewpoint looking down on the workspace, so the robot perceives and manipulates objects using its own onboard vision sensor.

Definition of a task: The authors organize robot manipulation into a hierarchy of macro skills, micro skills, and tasks.

  • A macro skill is a high-level skill distinguished by the type of interaction, such as "open," "insert," or "fold." The experiments include 31 macro skill categories in total.
  • A micro skill is a macro skill applied to a particular kind of object, with the detailed motion adapted to the object's properties. For example, "open an oven door (side-opening)" and "open an oven door (downward-opening)" belong to the same macro skill (open) but have different motion profiles because of the object structure, so they count as separate micro skills.
  • A task is a concrete individual job: one micro skill applied to one specific object instance. For example, "open the zipper of the pink round handbag" is a task that performs the micro skill "open a zipper" on a specific handbag (object instance).

The paper's 1,000 tasks are therefore 1,000 concrete object-action combinations belonging to 534 distinct micro skills. These tasks are in turn grouped into 31 macro skill categories, chosen to cover a wide variety of everyday manipulation. They include a very broad range of actions such as opening doors, closing drawers, inserting a toothbrush, plugging in a USB connector, stacking plates, hanging clothes on a hanger, and wringing a towel. The objects alone number 402 and span household items, tools, kitchenware, toys, and everyday objects in general.

Demonstration collection: All demonstrations are performed sequentially by a human operator using the same single robot. One demonstration is collected for each of the 1,000 tasks, and the total time is reported as about 17 hours (less than a day of continuous operation). That corresponds to roughly one minute per task on average (17 h × 3,600 s / 1,000 ≈ 61 s). Collection can be this fast because most tasks are relatively short jobs completed in a single stage (one action such as a pick or a place). Multi-step jobs (for example, pick-and-place) are split into separate consecutive tasks, each with its own demonstration, and chained later by a high-level planner. The demonstration dataset thus consists of 1,000 single-stage demonstrations, which puts it in a different regime from existing large-scale robot datasets (tens of thousands of demonstrations) and provides the basis for learning from very little data.

2. ๋น„๊ต ๋ฐฉ๋ฒ• ๋ฐ ํ‰๊ฐ€ ๋ฐฉ์‹

๋น„๊ต ์ •์ฑ…: ์•ž์„œ ์„ค๋ช…ํ•œ ๋„ค ๊ฐ€์ง€ ์ •์ฑ… ์กฐํ•ฉ๊ณผ ๋ชจ๋†€๋ฆฌ์‹ BC๊ฐ€ ์„ฑ๋Šฅ ๋น„๊ต๋ฅผ ์œ„ํ•ด ๋ชจ๋‘ ๊ตฌํ˜„๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ์ •๋ฆฌํ•˜๋ฉด: - BC-BC: ์ •๋ ฌ ๋‹จ๊ณ„์™€ ์ƒํ˜ธ์ž‘์šฉ ๋‹จ๊ณ„ ๋ชจ๋‘ BC ์ •์ฑ… ์‚ฌ์šฉ. (๋‘ ๊ฐœ์˜ ๋ณ„๋„ ์‹ ๊ฒฝ๋ง ์ •์ฑ…)

  • BC-Ret: BC ์ •๋ ฌ ์ •์ฑ…์œผ๋กœ ๋ชฉํ‘œ ์ž์„ธ์— ๋กœ๋ด‡์„ ๋‘๊ณ , Retrieval ์ƒํ˜ธ์ž‘์šฉ(์˜คํ”ˆ๋ฃจํ”„ ๋ฐ๋ชจ ์žฌ์ƒ)์œผ๋กœ ์กฐ์ž‘ ์ˆ˜ํ–‰.
  • Ret-BC: Retrieval ์ •๋ ฌ(ํฌ์ฆˆ ์ถ”์ • + ๋ชจ์…˜ํ”Œ๋žœ)๋กœ ๋กœ๋ด‡์„ ์œ„์น˜์‹œํ‚จ ๋’ค, BC ์ƒํ˜ธ์ž‘์šฉ ์ •์ฑ…์œผ๋กœ ์กฐ์ž‘ ๋งˆ๋ฌด๋ฆฌ.
  • Ret-Ret (MT3): Both alignment and interaction are retrieval-based. MT3 is therefore a fully open-loop policy: pose estimation followed by trajectory replay.
  • Monolithic BC (MT-ACT+): A single BC policy performs the entire motion from start to finish. It represents the conventional approach; because the paper uses a modified version of the MT-ACT model, it is named MT-ACT+.

All BC-based policies use the same Transformer-based network architecture and differ only in the portion of the data they were trained on. For example, the BC alignment policy and the BC interaction policy were trained separately on alignment-phase and interaction-phase demonstrations respectively, while Monolithic BC was trained on full-trajectory demonstrations in one go. The retrieval-based policies need no training and so have no learned parameters, but for a fair comparison they take the same input (point cloud + language); that is, the language description is used to find the closest demonstration under the same conditions.

Evaluation protocol: Performance was measured as binary success or failure. Each task was run several times (mostly 2 to 3) to compute an average success rate, and 95% Wilson confidence intervals are shown as error bars. "Success" was defined per task and judged by an on-site evaluator: for example, an insertion task succeeds when the part is fully inserted in place, and a grasping task succeeds when the target object is lifted. In the 1,000-task evaluation each task was attempted twice; together with the 100 unseen-object tasks described next, this gives 2,200 rollouts in total.

To measure generalization, 100 additional tasks on unseen objects were evaluated. Here the robot manipulated new objects that were not used in training, with the same macro/micro skills, to test category-level generalization. For example, if "grasp mug A" was learned during training, the test attempts "grasp mug B". Comparing success rates on seen and unseen tasks shows how well the policy handles new instances.

To bring the environment closer to real life, the evaluation also added difficulty factors:

  • Between 5 and 20 distractor objects were placed at random in the workspace, raising the chance that the robot picks the wrong object or that its path is obstructed.
  • Lighting conditions were varied so that the brightness and color of the camera images changed.
  • As mentioned earlier, the initial position and orientation of the objects were randomized to create situations different from training.

All of these settings are meant to probe the limits of MT3's robustness; the authors state that they imposed difficult conditions on purpose.
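The 95% Wilson interval used for the error bars in the evaluation protocol can be computed in a few lines. The sketch below is only an illustration: the function name `wilson_interval` and the example counts are assumptions made here, not code or data from the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical counts chosen for illustration only (not results from the paper).
low, high = wilson_interval(successes=2, trials=2)
print(f"2/2 successes -> 95% interval [{low:.2f}, {high:.2f}]")
```

With only two attempts per task, even a perfect 2/2 leaves a lower bound of about 0.34, which is why per-task results are noisy and the aggregated success rates carry most of the information.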

3. Main experimental results: small-scale analysis

First, consider the small-scale analysis of how performance changes with the amount of data and with task diversity. Using about 70 objects, the authors designed two experiments: (a) fixing the number of tasks and increasing the number of demonstrations per task, and (b) fixing the total number of demonstrations and increasing task diversity.

  1. Dataset size experiment: Four representative micro skills of different types were selected (for example opening a door, handling a soft object, scooping with a ladle, and inserting). These gave 12 tasks (3 objects each) plus 8 tasks on new objects, for 20 tasks in total. The number of demonstrations per task was then increased step by step from 1 to 50 while the performance of the five policies was measured; 50 was treated as a sufficient upper bound for learning complex trajectories. All methods improve as demonstrations are added, but the retrieval-based MT3 stands out: with only 3 demonstrations it outperformed the other methods at 50 demonstrations. MT3 (Ret-Ret) consistently achieved the best performance from few demonstrations and kept its lead over the other methods across the whole range tested. The same pattern held for unseen objects, where MT3's strong generalization produced good success rates on new objects as well.
  2. Task diversity experiment: With the total number of demonstrations fixed at 150, the budget was split across 10 tasks (15 each), 30 tasks (5 each), and 50 tasks (3 each). The number of micro skills was expanded to 10, so that a wider mix of skills was included than in experiment (a). As the number of tasks grew (and demonstrations per task shrank), overall success rates fell for every method, but MT3 degraded most gently and remained the top performer. Monolithic BC (MT-ACT+), by contrast, dropped more sharply as tasks were added, showing that spreading a limited demonstration budget over too many tasks reduces learning efficiency. Interestingly, the paper notes that "when there are enough demonstrations per task, or when task diversity is very large, monolithic BC actually shows a favorable scaling trend." In other words, large neural network policies do well in the data-rich regime, whereas in the low-data regime this study focuses on, the specialization + retrieval strategy dominates.
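The fixed-budget design of the task diversity experiment is simple arithmetic: the same 150 demonstrations are divided evenly over more and more tasks. A minimal sketch, where the helper name `demos_per_task` is an assumption made here for illustration:

```python
def demos_per_task(total_demos: int, num_tasks: int) -> int:
    """Split a fixed demonstration budget evenly across tasks."""
    if num_tasks <= 0 or total_demos % num_tasks != 0:
        raise ValueError("budget must divide evenly across tasks")
    return total_demos // num_tasks


TOTAL_DEMOS = 150  # fixed budget quoted for the task diversity experiment
splits = {n: demos_per_task(TOTAL_DEMOS, n) for n in (10, 30, 50)}
print(splits)  # {10: 15, 30: 5, 50: 3}
```

Because the budget is constant, any drop in success rate as the task count rises reflects fewer demonstrations per task rather than less data overall.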

Overall, the key insights from the small-scale experiments are as follows:

  • The retrieval-based method (MT3) always performs best, and the gap is largest when demonstrations are few. This supports the approach of generalizing by reusing the demonstrations themselves.
  • Trajectory decomposition helps: every two-phase method (BC-BC, BC-Ret, Ret-BC, Ret-Ret) outperforms the single-phase monolithic method. Even the comparison of BC-BC against Monolithic alone confirms that, for the same BC algorithm, decomposition is a gain.
  • Comparing retrieval against BC separately in the alignment phase and in the interaction phase, retrieval tended to be better in both. Pose-estimation-based alignment and open-loop replay for interaction each achieved higher success rates than BC under the same conditions. The result varies somewhat by task type, which the paper discusses further in its Discussion.
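To make the Ret-Ret idea concrete, the sketch below walks through retrieval, alignment, and open-loop replay in a deliberately simplified planar setting. It is not the authors' implementation: the `Demo` fields, the exact-match text filter, the toy shape descriptor, and the (x, y, yaw) pose are all assumptions made here for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Demo:
    task: str                                # language description (hypothetical field)
    shape: list[float]                       # toy geometry descriptor (hypothetical)
    object_pose: tuple[float, float, float]  # (x, y, yaw) of the object during the demo
    waypoints: list[tuple[float, float]]     # end-effector path recorded in the world frame


def retrieve(demos, task, shape):
    """Pick the single closest demo: filter by task text, then nearest shape descriptor."""
    candidates = [d for d in demos if d.task == task]
    if not candidates:
        raise LookupError(f"no demonstration for task {task!r}")
    return min(candidates, key=lambda d: math.dist(d.shape, shape))


def transfer(demo, live_pose):
    """Map demo waypoints onto the object's live pose with a planar rigid transform."""
    dx, dy, dyaw = demo.object_pose
    lx, ly, lyaw = live_pose
    rot = lyaw - dyaw
    c, s = math.cos(rot), math.sin(rot)
    out = []
    for wx, wy in demo.waypoints:
        rx, ry = wx - dx, wy - dy  # waypoint relative to the demo object position
        out.append((lx + c * rx - s * ry, ly + s * rx + c * ry))
    return out


demos = [
    Demo("pick up mug", [0.08, 0.10], (0.0, 0.0, 0.0), [(0.10, 0.0), (0.0, 0.0)]),
    Demo("pick up mug", [0.12, 0.15], (0.0, 0.0, 0.0), [(0.15, 0.0), (0.0, 0.0)]),
]
best = retrieve(demos, "pick up mug", shape=[0.09, 0.10])
path = transfer(best, live_pose=(1.0, 2.0, math.pi / 2))
print(path)  # would be replayed open-loop: no correction once replay starts
```

Note that `retrieve` returns exactly one stored demonstration and `transfer` only moves it rigidly, so nothing in this pipeline can blend two demonstrations or react to errors during execution.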

4. Large-scale evaluation on 1,000 tasks

Second, consider the highlight of the paper: the evaluation of learning 1,000 tasks. Using the MT3 (Ret-Ret) policy selected above, performance was measured on 1,000 training tasks (tasks for which a demonstration was seen) and 100 new tasks (tasks for which no demonstration was seen). The overall results are summarized as follows:

Overall success rate: The reported average success rate is about 78.3% on the 1,000 learned tasks and about 68.0% on the 100 unseen tasks. Starting from near-random initial configurations and given a single human demonstration per task, the robot succeeds in more than 3 out of 4 attempts on seen tasks, and some task categories exceed 80 to 90%. These figures indicate unusually high data efficiency and generality for current robot learning research.

Variation across macro skills: When performance is aggregated over the 31 macro skill categories, success rates differ clearly by task type. For example, motions such as "wiping" or "placing into a container" reached success rates above 80%, whereas motions such as "precise insertion" or "handling deformable objects" had low success rates. This is interpreted as MT3's limits showing up in proportion to the precision and feedback a task requires.

Tasks with large spatial error tolerance: Tasks with some slack in contact position or angle (for example wiping, stirring, placing, and ordinary grasping) showed high success rates (in the 80 to 90% range) with few problems. Because a small misalignment during interaction barely affects the outcome in these tasks, open-loop replay was sufficient.

Tasks requiring precise alignment: In contrast, tasks that require aligning very small geometric features (for example inserting a plug into a socket, dropping a coin into a piggy-bank slot, or hanging a key) had relatively high failure rates. When even millimeter-level error is not tolerated, a slight pose estimation error is fatal, and open-loop execution cannot correct it in real time. In the plug-and-socket task, for instance, a small angular error in the alignment phase makes the insertion fail, and since a failed attempt ends without a retry, the success rate is directly affected.
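A quick back-of-the-envelope calculation shows why small angular errors matter for such tasks. The sketch below assumes a feature at a fixed distance from the rotation centre of the estimated pose; the function name and the 100 mm lever arm are hypothetical values chosen for illustration, not measurements from the paper.

```python
import math


def tip_offset_mm(lever_arm_mm: float, yaw_error_deg: float) -> float:
    """Displacement of a point `lever_arm_mm` from the rotation centre caused by
    a yaw error (chord length between the intended and actual positions)."""
    return 2.0 * lever_arm_mm * math.sin(math.radians(abs(yaw_error_deg)) / 2.0)


# Hypothetical setup: a plug tip 100 mm from the estimated object centre.
for err in (0.5, 1.0, 2.0):
    print(f"{err:.1f} deg -> {tip_offset_mm(100.0, err):.2f} mm")
```

Under these assumptions a 1 degree orientation error already shifts the tip by roughly 1.7 mm, which is more than a tight insertion can absorb without feedback.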

Handling asymmetric objects: MT3 is designed to align to the overall object shape, so problems arose when an object had a small but important asymmetric part. In tasks where a part that is small relative to the whole volume plays a decisive role, such as the handle or spout of a kettle, global pose matching sometimes missed that detail and aligned to the wrong pose. In these experiments, failures were reported in a task that pours from a kettle in a specific direction because the spout direction was matched incorrectly.

Category-level generalization: Performance was generally good on new objects within the same macro/micro skill category as the learned tasks. For example, wiping on several different tables or floors succeeded because a similar wiping trajectory can be reused even when the surface material or color differs. Mug grasping also worked on most new mugs because handle positions are broadly similar. However, failures increased when differences in object shape required the interaction trajectory itself to change. In the kettle-pouring task, a differently shaped receiving container calls for a different pouring angle or motion, but MT3 poured only at the angle of the existing demonstration and failed. Similarly, swiping a card through a credit card reader reportedly failed because the policy could not adapt to differences in slot position between readers.

Retrieval's limitation, no interpolation: As noted in the earlier mathematical discussion, MT3 has the fundamental limitation that it cannot combine two or more demonstrations to produce a new motion. In these experiments too, when the required motion lay somewhere between stored demonstrations, MT3 simply picked the single nearest demonstration and could not generate a finely adjusted motion. This all-or-nothing choice reduced its adaptability to slightly novel situations.

Deformable objects: Tasks involving deformable objects such as cloth or string remained especially difficult for MT3. Their physical properties cannot be inferred from appearance alone, and objects with the same shape can differ in internal properties such as stiffness and friction. In the "put a book into various bags" task, the flap or pocket shape of each bag differed slightly, so the lifting force had to change, but MT3 reportedly could not predict this from demonstrations and failed. Deformable-object tasks thus show that appearance alone does not carry enough information, and that there are limits without additional sensing or online learning.

Fundamental limitation of open-loop interaction: The biggest weakness of MT3 revealed across the experiments is that once replay starts, there is no way to correct it midway. Many failures could have been resolved by a small real-time correction, but because MT3 executes a trajectory to the end once it starts, it could not avoid them. In towel or cloth folding, for example, the force must be adjusted at every moment according to how the cloth moves, which open-loop replay cannot do. Tasks such as pushing an object while changing its direction inherently require closed-loop control, which MT3 structurally cannot provide. Retrying after failure is a possible remedy, but the environment has already been disturbed by then, so correction is difficult and inefficient. In short, MT3 is a policy that stakes everything on a single attempt: it cannot respond immediately to changes in the environment or correct accumulated error.

Failure cause analysis: The authors inspected each failed case from the 1,000-task evaluation and classified the most frequent failure factors:

  • Pose estimation failures (~23.9%, the largest single share among the quoted figures): the pose is matched incorrectly because of object symmetry or a change in viewpoint. When an object is placed at an angle completely different from the demonstration, the partial point clouds look different and the estimate becomes unstable.
  • Demonstration retrieval errors (~22%): the wrong demonstration is selected, for instance when the object is partially occluded or when global shape alone cannot distinguish small differences.
  • Point cloud segmentation and recognition problems (~19.5%): segmentation fails, especially for transparent objects or objects in cluttered backgrounds, so the wrong target is selected from the start.
  • Interaction execution problems (the remainder, roughly a third of failures): mainly cases where the initial grasp position of a held object differed from the demonstration, so the motion drifted off by the end (20.2%), plus failures caused by the lack of open-loop correction described above. In tasks performed while holding an object (for example, another motion with an object held in the gripper), a slightly different initial grasp throws off the subsequent trajectory, and MT3 could not correct this.

This analysis shows that MT3's performance bottleneck lies mostly in the perception (vision) stage. Improvements to visual processing (segmentation, pose estimation) and to the demonstration selection algorithm are likely to translate directly into higher success rates. At the same time, failures caused by the policy's own limitation (open-loop execution) are not negligible, which suggests that structural improvements are needed.
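The claim that most failures originate before the interaction phase can be checked by adding up the shares quoted in this review. The tally below uses only those quoted percentages; the grouping into "pre-interaction" and "interaction" is an interpretation made here, not a table from the paper.

```python
# Failure shares (percent of all failures) as quoted in this review.
pre_interaction_shares = {
    "demonstration retrieval": 22.0,
    "segmentation / recognition": 19.5,
    "pose estimation": 23.9,
}
pre_interaction_total = sum(pre_interaction_shares.values())
interaction_total = 100.0 - pre_interaction_total  # whatever is left for execution

print(f"pre-interaction: {pre_interaction_total:.1f}%")
print(f"interaction:     {interaction_total:.1f}%")
```

Summing the quoted shares gives about 65.4% for retrieval and perception, leaving about 34.6% for the interaction phase, so roughly two out of three failures occur before the robot starts replaying a trajectory.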

Summary of contributions and differences from prior work

The contributions of this study fall into three main points:

Systematic evaluation of low-data multi-task learning: Existing robot learning studies often assume hundreds of demonstrations per task. This paper systematically evaluates multi-task learning in the extreme setting of 1 to 10 demonstrations per task, providing empirical insight that fills a gap in the current literature. By quantitatively comparing several design combinations (decomposed vs. non-decomposed, BC vs. retrieval), it gives evidence for which approach is favorable in terms of data efficiency.

A new learning paradigm, MT3: The authors devise a retrieval-based decomposed policy called Multi-Task Trajectory Transfer (MT3) and show empirically that it is a more promising alternative than monolithic BC when demonstration data is limited. Beyond a performance gain, this reframes the role of demonstrations: they are used at execution time rather than for training. By showing that a wide range of tasks can be learned without a complex neural network policy, the work challenges the assumption that large-scale robot learning requires huge models. Learning 1,000 tasks in a single day is a result that would have been hard to achieve with a monolithic large model.

Realizing 1,000 real-world tasks and analyzing the limits: This study is presented as the first case of teaching a real robot 1,000 complex and diverse tasks, demonstrating that real-world robot learning can scale up. It opens new ground for large-scale learning while also carefully analyzing the limitations and failure modes of the MT3 approach to point out future research directions. By concretely exposing problems such as perception error, the fragility of open-loop control, and the absence of general-purpose feedback, it clarifies what follow-up research needs to solve.

The differences from prior work are also clear. Earlier studies such as BC-Z and Robotics Transformer (RT-1) rely on large multi-task models and use more than 200 demonstrations per task on average, whereas this study focuses on a design that works with very little data and raises data efficiency to another level. Some earlier work succeeded on specific single tasks using pose estimation + RL or visual servoing, but this paper compares a range of design combinations on the same platform, which allows more general conclusions. Recently proposed retrieval methods such as FlowRetrieval and SAILOR are used mainly to select data before policy training, whereas MT3 retrieves demonstrations at execution time. There are some precedents for using demonstrations at run time, such as VINN, but retrieving whole trajectories using natural language + geometric information is a new contribution of this study. In summary, the "thousand tasks in a day" study validates a theoretically simple but powerful idea (phase decomposition + demonstration retrieval) through a large-scale experiment, and can be regarded as a milestone for robot learning in terms of practicality and scalability.

Significance for robotics and potential applications

This part summarizes what the study means for real-world robot learning and what challenges remain.

Robots that learn from very little data: The conventional view has been that people can learn a new task from a few demonstrations while robots cannot. With the striking figure of a single demonstration per task, this paper brings robot learning efficiency closer to the human level. This opens the possibility of direct application to home service robots and industrial collaborative robots. For example, a user teaching a robot a new task would not need to provide many examples; showing it once could be enough. This points toward a drastic reduction in data preparation cost, which has been a major obstacle to robot adoption.

Fusing imitation learning with classical control: MT3's success can be seen as the result of combining the strengths of learning-based methods and traditional robotics. By integrating long-studied techniques such as pose estimation, motion planning, and trajectory replay into a learning framework, it sidesteps problems that are hard to solve with neural networks alone. This underlines the importance of hybrid approaches in the design of future robot learning systems. Instead of fully end-to-end learning, giving the problem structure and using validated subroutines can deliver results with far less data.

Scalability: Demonstrating and learning 1,000 tasks is only the beginning. The approach can be extended toward more complex tasks or longer multi-step sequences. For example, this study had to split pick-and-place into two tasks, but in the future the MT3 approach could be developed to retrieve and execute the two phases pick→place in sequence. Demonstration collection could be parallelized across multiple robots to shorten the time further, or the number of demonstrations could be expanded through automatic demonstration generation (for example using a simulator). Connecting the 1,000 learned skills to a high-level planner that composes them into compound tasks is also a practical next step.

Challenges: The study also exposes several limitations, which lead directly to future research topics.

Integrating real-time feedback: Overcoming the limits of an open-loop policy requires closed-loop control or fine-tuning through reinforcement learning. For example, vision feedback could be added to MT3's interaction phase, combining it with an algorithm that adjusts the trajectory during execution in the manner of visual servoing. The paper itself names this lack of feedback as a major limitation, and addressing it should yield higher success rates and stability.

Improved perception: Since a substantial share of failures stems from errors in object recognition and pose estimation, more robust perception algorithms should improve performance considerably. For example, deep-learning-based segmentation models or vision-transformer-based 6-DoF pose estimation could handle transparent objects and partial occlusion better. Multiple cameras or higher-resolution sensors could also be introduced to recognize small parts.

More advanced retrieval: Demonstrations are currently selected by a simple sum of language + geometry similarity. In the future this could be upgraded to a learning-to-rank method that predicts task success, or to methods that use several demonstrations at once. As one example, a model could be trained to synthesize or interpolate two or more similar demonstration trajectories, generating a similar new behavior even when no exactly matching demonstration exists. This is an interesting direction that could develop into a Retrieval+Generation hybrid using generative models (for example Diffusion Policy).

Deformable objects and complex interactions: Deformable-object tasks such as folding clothes or tying rope, and motions that exploit sliding contact, remain hard problems. Such tasks may require additional training on physics-based simulation data or approaches that update the model online (for example meta-learning). For deformable objects, integrating senses other than vision, such as force/torque sensor feedback, is another direction.

Outlook for practical use: A robot that has learned 1,000 tasks is no longer pure fantasy. If a home service robot were equipped with MT3, it could perform hundreds of household motions taught in advance by the manufacturer while also learning a few new chores from the user with little effort. In industry, applying robots to high-mix, low-volume production requires frequent task changeovers, and a robot that learns right after being shown once would be transformative there. Above all, this study highlights the importance of "last-meter learning": even without grand general intelligence, a robot that can adapt quickly on site from a person's simple demonstration will deliver practical value.

In summary, Learning a Thousand Tasks in a Day is a landmark study that achieves large-scale task learning with extreme data efficiency. For robotics, it means improved performance through combining learning and control, and a step closer to realistic general-purpose robots. Problems remain to be solved, but this paradigm-shifting approach is opening a path toward general-purpose robots that, like people, learn from seeing something once.

Copyright 2026, JungYeon Lee