Curieux.JY
  • JungYeon Lee

📃 AINA Review

dexterity
smart-lense
teleop
Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations
Published

December 5, 2025

๐Ÿ” Ping. ๐Ÿ”” Ring. โ›๏ธ Dig. A tiered review series: quick look, key ideas, deep dive.

  • Paper Link
  • Code
  • Project
  1. 🤖 AINA is a new framework that learns multi-fingered robot manipulation policies using only in-the-wild human demonstration data collected with Aria Gen 2 smart glasses.
  2. 💡 The framework extracts 3D object tracks and fingertip points and aligns them to the robot environment, which enables point-based policy learning that is robust to background changes.
  3. 🚀 AINA performs a range of everyday manipulation tasks without robot data or simulation, enabling effective skill transfer from in-the-wild human data to a robot.

๐Ÿ” Ping Review

๐Ÿ” Ping โ€” A light tap on the surface. Get the gist in seconds.

์ด ๋…ผ๋ฌธ์€ Aria Gen 2 ์Šค๋งˆํŠธ ๊ธ€๋ผ์Šค๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์•ผ์ƒ(in-the-wild) ํ™˜๊ฒฝ์—์„œ ์ˆ˜์ง‘๋œ ์ธ๊ฐ„ ์‹œ์—ฐ ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ ๋‹ค์ง€ ๋กœ๋ด‡ ์กฐ์ž‘(multi-fingered robot manipulation) ์ •์ฑ…์„ ํ•™์Šตํ•˜๋Š” ํ”„๋ ˆ์ž„์›Œํฌ์ธ AINA๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ์ด ํ”„๋ ˆ์ž„์›Œํฌ๋Š” ์–ด๋– ํ•œ ๋กœ๋ด‡ ๋ฐ์ดํ„ฐ(์˜จ๋ผ์ธ ์ˆ˜์ •, ๊ฐ•ํ™” ํ•™์Šต ๋˜๋Š” ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ํฌํ•จ)๋„ ์š”๊ตฌํ•˜์ง€ ์•Š๋Š”๋‹ค๋Š” ์ ์—์„œ ๊ธฐ์กด ๋ฐฉ๋ฒ•๋“ค๊ณผ ์ฐจ๋ณ„ํ™”๋ฉ๋‹ˆ๋‹ค.

์ฃผ์š” ๋ชฉํ‘œ ๋ฐ ๋ฐฐ๊ฒฝ:

๋กœ๋ด‡์ด ์ผ์ƒ ํ™˜๊ฒฝ์—์„œ ์ผ๋ฐ˜ํ™” ๊ฐ€๋Šฅํ•œ ๋‹ค์ง€ ์กฐ์ž‘์„ ์ˆ˜ํ–‰ํ•˜๋„๋ก ํ•™์Šต์‹œํ‚ค๋Š” ๊ฒƒ์€ ์˜ค๋žœ ๋ชฉํ‘œ์˜€์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ธ๊ฐ„๊ณผ ๋กœ๋ด‡ ๊ฐ„์˜ embodiment gap๊ณผ ์ธ๊ฐ„ ๋น„๋””์˜ค์—์„œ ๋กœ๋ด‡ ํ•™์Šต์— ํ•„์š”ํ•œ ๊ด€๋ จ contextual ๋ฐ motion cue๋ฅผ ์ถ”์ถœํ•˜๋Š” ์–ด๋ ค์›€์ด ๋ณ‘๋ชฉ ํ˜„์ƒ์œผ๋กœ ์ž‘์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. AINA๋Š” Aria Gen 2 ๊ธ€๋ผ์Šค์˜ ๋ฐœ์ „๋œ ์„ผ์‹ฑ ๋Šฅ๋ ฅ์„ ํ™œ์šฉํ•˜์—ฌ ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ธ€๋ผ์Šค๋Š” ๊ฒฝ๋Ÿ‰์ด๋ฉฐ ํœด๋Œ€ ๊ฐ€๋Šฅํ•˜๊ณ , ๊ณ ํ•ด์ƒ๋„ RGB ์นด๋ฉ”๋ผ, ์ •ํ™•ํ•œ ์˜จ๋ณด๋“œ 3D head ๋ฐ hand poses, ๊ทธ๋ฆฌ๊ณ  depth estimation์„ ์œ„ํ•œ wide stereo view๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ํŠน์ง•๋“ค์€ ๋ฐฐ๊ฒฝ ๋ณ€ํ™”์— ๊ฐ•์ธํ•œ 3D point-based policy ํ•™์Šต์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋ฉฐ, ๋กœ๋ด‡ ๋ฐ์ดํ„ฐ ์—†์ด ์ง์ ‘ ๋ฐฐํฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๋ฐฉ๋ฒ•๋ก  (Methodology):

AINA๋Š” ์„ธ ๊ฐ€์ง€ ์ฃผ์š” ๋‹จ๊ณ„๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค:

  1. ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ (Data Collection): ์ธ๊ฐ„์ด Aria Gen 2 ๊ธ€๋ผ์Šค๋ฅผ ์ฐฉ์šฉํ•˜๊ณ  ์ž„์˜์˜ ๋ฐฐ๊ฒฝ ๋ฐ ์‹œ์ ์—์„œ ์•ผ์ƒ(in-the-wild) ํ™˜๊ฒฝ์—์„œ ๋‹ค์ˆ˜์˜ ๋น„๋””์˜ค ์‹œ์—ฐ์„ ์ˆ˜์ง‘ํ•ฉ๋‹ˆ๋‹ค. ์ถ”๊ฐ€๋กœ, ๋กœ๋ด‡ ๋ฐฐํฌ ๊ณต๊ฐ„์—์„œ ๋‹จ ํ•˜๋‚˜์˜ in-scene ๋น„๋””์˜ค ์‹œ์—ฐ์„ ์ˆ˜์ง‘ํ•ฉ๋‹ˆ๋‹ค.
    • Aria Gen 2 ๊ธ€๋ผ์Šค: ์ „๋ฉด RGB ์นด๋ฉ”๋ผ, 4๊ฐœ์˜ SLAM ์นด๋ฉ”๋ผ, IMU๊ฐ€ ์žฅ์ฐฉ๋˜์–ด ์‚ฌ์šฉ์ž head pose ๋ฐ hand pose๋ฅผ ์‹ค์‹œ๊ฐ„์œผ๋กœ ์ถ”์ •ํ•ฉ๋‹ˆ๋‹ค. Head pose๋Š” IMU์— ์˜ํ•ด ์ธก์ •๋œ gravity vector๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์›”๋“œ ํ”„๋ ˆ์ž„์ด ์ดˆ๊ธฐํ™”๋ฉ๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ๋Š” 10 Hz๋กœ ๊ธฐ๋ก๋ฉ๋‹ˆ๋‹ค.
    • In-scene ์‹œ์—ฐ: ๋กœ๋ด‡ ํ™˜๊ฒฝ์˜ RGB-D ์นด๋ฉ”๋ผ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ˆ˜์ง‘๋˜๋ฉฐ, Hamer๋ฅผ ํ†ตํ•ด 2D hand pose๋ฅผ ์ถ”์ •ํ•˜๊ณ  ์‚ผ๊ฐ์ธก๋Ÿ‰(triangulation)์„ ํ†ตํ•ด 3D pose๋ฅผ ์–ป์Šต๋‹ˆ๋‹ค.
  2. ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ๋ฐ ์ •๋ ฌ (Processing and Domain Alignment):
    • Object Point Clouds ์ถ”์ถœ: ์ •์ฑ… ํ•™์Šต ์‹œ ๊ด€์ธก๊ฐ’์œผ๋กœ object point clouds๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋ฐฐ๊ฒฝ ๋ณ€ํ™”์™€ ์ธ๊ฐ„-๋กœ๋ด‡ ๊ฐ„์˜ ์‹œ๊ฐ์  ์ฐจ์ด์— ๋ถˆ๋ณ€์„ฑ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
      • ์ดˆ๊ธฐ ํ”„๋ ˆ์ž„์—์„œ Grounded-SAM์„ ์‚ฌ์šฉํ•˜์—ฌ ์ƒํ˜ธ์ž‘์šฉ ๊ฐ์ฒด๋ฅผ ๋ถ„ํ• ํ•ฉ๋‹ˆ๋‹ค.
      • CoTracker๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ถ„ํ• ๋œ ๊ฐ์ฒด๋ฅผ ํ”„๋ ˆ์ž„ ๊ฐ„ 2D object points๋กœ ์ถ”์ ํ•ฉ๋‹ˆ๋‹ค.
      • ์ด 2D points๋ฅผ 3D๋กœ unprojectํ•ฉ๋‹ˆ๋‹ค. In-scene ์‹œ์—ฐ์˜ ๊ฒฝ์šฐ RGB-D ์นด๋ฉ”๋ผ์˜ depth๋ฅผ ์ง์ ‘ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. In-the-wild ์‹œ์—ฐ์˜ ๊ฒฝ์šฐ, Aria ๊ธ€๋ผ์Šค๋Š” depth๋ฅผ ์ œ๊ณตํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ, rectified stereo images์™€ ์นด๋ฉ”๋ผ ๊ฐ„์˜ baseline์„ ์‚ฌ์šฉํ•˜์—ฌ Foundation-Stereo๋ฅผ ํ†ตํ•ด disparity map์„ ์ถ”์ •ํ•ฉ๋‹ˆ๋‹ค.
        • Depth ๊ณ„์‚ฐ: Z = \frac{f \cdot B}{d} (์—ฌ๊ธฐ์„œ Z๋Š” depth, f๋Š” focal length, B๋Š” baseline, d๋Š” disparity map)
    • Domain Alignment: the 3D object tracks collected with the Aria glasses vary from demonstration to demonstration with the height and posture of the wearer. To resolve this, all 3D points are transformed into the robot base frame.
      • The in-scene demonstration serves as the anchor for transforming the object points (O_t^w) and fingertip points (F_t^w) of each in-the-wild demonstration.
      • First, the translation between the first-frame object point cloud centroids (\Delta O = O_0^s - O_0^w) is computed and applied to the in-the-wild trajectory.
      • Next, the Kabsch algorithm is applied to the initial hand poses of the in-scene (F_0^s) and in-the-wild (F_0^w) demonstrations to compute a rigid transform. The rotation about the z-axis (R_z) is extracted from it and applied to the in-the-wild demonstration.
      • The final transformed trajectory is:
        \hat{O}_t^w = R_z \cdot O_t^w + \Delta O
        \hat{F}_t^w = R_z \cdot F_t^w + \Delta O
        \hat{T}^w = \{ \hat{O}_t^w, \hat{F}_t^w \}
  3. Policy Learning and Deployment:
    • Policy architecture: a Transformer-based point-cloud policy built on Point-Policy [7].
      • Input: the fingertip trajectory F_{t-T_o:t} and the object points O_{t-T_o:t} from t-T_o to t (where T_o=10 is the observation history).
      • Output: the future fingertip trajectory \hat{F}_{t:t+T_p} from t to t+T_p (where T_p=30 is the prediction horizon).
      • Vector Neuron MLPs [52] encode the observation history of each point, which captures 3D geometric information well.
      • The encoded vectors are fed to a Transformer encoder as tokens.
      • A positional encoding is learned only for the fingertip tokens.
      • Loss: mean squared error between predicted and ground-truth fingertips (L_{MSE} = E[\|F_{t:t+T_p} - \hat{F}_{t:t+T_p}\|^2]).
      • Generalization: during training, random 3D translation, scaling, and rotation about the z-axis are applied as augmentation, and Gaussian noise is added to the input fingertips, to keep the model from overfitting.
    • Robot setup: a Kinova Gen3 robot arm with a Psyonic Ability Hand (five fingers). Two RealSense RGB-D cameras are placed around the workspace.
    • Inverse Kinematics (IK): because the arm and hand kinematics of humans and robots differ, a custom full arm-hand IK module I is implemented. It takes the desired fingertips F_{t+1} and the current robot joints J^t as input and outputs the next joint angles J^{t+1} = I(F_{t+1}, J^t).
    • Practical implementation detail: for grasping tasks, a grasping threshold mimics human grasping force. When the distance between the predicted thumb and another finger falls below 5 cm, the fingers are driven closer together.
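The augmentation described in stage 3 can be sketched as follows. This is a minimal NumPy sketch, not the authors' code: the function name `augment`, the parameter ranges, and the noise level are all assumptions chosen for illustration. One random similarity transform (yaw about z, uniform scale, translation) is applied to the fingertip history, the object points, and the future fingertip targets alike, and Gaussian noise is added only to the fingertip history used as input.

```python
import numpy as np

def augment(f_hist, o_hist, f_future, rng,
            max_shift=0.05, scale_range=(0.9, 1.1),
            max_yaw=np.pi / 6, noise_std=0.005):
    """Apply one shared random transform to a training sample.

    f_hist:   (T_o, 5, 3) fingertip history (policy input)
    o_hist:   (T_o, N, 3) object point history (policy input)
    f_future: (T_p, 5, 3) future fingertips (policy target)
    All ranges are hypothetical defaults, not values from the paper.
    """
    yaw = rng.uniform(-max_yaw, max_yaw)
    c, s = np.cos(yaw), np.sin(yaw)
    Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    scale = rng.uniform(*scale_range)
    shift = rng.uniform(-max_shift, max_shift, size=3)

    def transform(points):
        # Points are row vectors, so rotating by Rz means multiplying by Rz.T.
        return scale * (points @ Rz.T) + shift

    f_hist_aug = transform(f_hist)
    o_hist_aug = transform(o_hist)
    f_future_aug = transform(f_future)
    # Noise goes on the input fingertips only; targets stay clean.
    f_hist_noisy = f_hist_aug + rng.normal(0.0, noise_std, size=f_hist_aug.shape)
    return f_hist_noisy, o_hist_aug, f_future_aug
```

Using the same transform for inputs and targets keeps the hand-object geometry consistent, which is what lets the policy see shifted, scaled, and rotated versions of the same interaction.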

Experimental Evaluation:

AINA was evaluated on nine everyday manipulation tasks.

  • Importance of the data types: training jointly on in-scene and in-the-wild data performed best. In-scene data alone generalized poorly, and in-the-wild data alone failed often because of the mismatch with the robot deployment environment. The in-scene demonstration is decisive for transforming the in-the-wild demonstrations into the robot's space.
  • Comparison with image-based approaches: AINA outperformed image-based baselines such as Masked BAKU and Masked BAKU with History. Viewpoint differences caused by human head motion degrade the image-based methods, whereas AINA's point cloud input and alignment proved robust to this mismatch.
  • Robustness to workspace height: AINA keeps working when the workspace height changes (three height settings), and re-collecting the in-scene demonstration lets it adapt to a new height.
  • Generalization to other objects: AINA generalizes well to objects of similar shape (for example a toaster or an eraser) but struggles with objects whose shape and weight differ substantially (for example a popcorn bag or a board eraser).

Limitations and Conclusion:

  • Difficulty of integrating force feedback: hand pose estimation alone does not capture force information, which limits precise dexterous manipulation. Integrating EMG sensors or force-estimating gloves could address this.
  • Synchronization between sensors: small differences in shutter timing between the RGB and SLAM cameras of the Aria Gen 2 glasses can cause misalignment during fast head motion. More robust 3D object tracking algorithms, or tracking a mesh representation, could improve this.
  • Observation mismatch at deployment: deployment currently uses RealSense cameras, whose keypoints differ slightly from those collected with the Aria glasses. The reason is that real-time depth estimation with FoundationStereo is difficult; the authors expect optimization to resolve it.

In conclusion, AINA is a promising framework that uses the capabilities of the Aria Gen 2 glasses to learn multi-fingered robot policies from in-the-wild human demonstrations. Using a 3D point-based policy and no robot data, it narrows the human-robot embodiment gap and demonstrates manipulation that is robust to background changes.

🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

1. Introduction: Why does this work matter?

The robotics community has long pursued one goal: letting robots learn multi-fingered manipulation by observing how people carry out tasks in everyday environments. Achieving it would greatly improve the generalizability of robot manipulation and, above all, sharply reduce the reliance on labor-intensive robot data collection.

Progress toward this goal has been held back by two main bottlenecks:

  1. Embodiment gap: the morphological and dynamic differences between the human hand and a robot hand
  2. Difficulty of extracting context and motion cues: the technical limits of extracting, from human video shot in natural environments, the information needed to learn an autonomous policy

The AINA (Autonomous Imitation of Natural Actions) framework proposed in this paper combines simple yet powerful hardware, Meta's Aria Gen 2 smart glasses, with careful algorithm design, and shows a step toward that goal.

"The most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it." - Mark Weiser

The authors open the paper with this quote, hinting at the direction AINA ultimately aims for. Just as technology fades into everyday life, they picture a future in which robots naturally observe and learn from people's everyday behavior.


2. What sets it apart from prior work: Why AINA?

2.1 Limitations of existing approaches

Policy learning for multi-fingered robot hands falls broadly into three approaches:

(1) Teleoperation-based learning

  • Yields high-quality demonstration data
  • Drawback: the high number of degrees of freedom (DoF) of a multi-fingered hand makes teleoperation itself very hard
  • Even two-finger grippers need thousands of demonstrations, and multi-fingered hands need more
  • Building a low-latency continuous feedback system is technically difficult

(2) Reinforcement learning (RL)-based learning

  • Learn the policy in simulation, then transfer it to the real robot
  • Drawbacks: the sim-to-real gap and the difficulty of reward design
  • Work such as HuDOR extracts rewards from human video and improves the policy with RL

(3) Learning from human video

  • The most scalable approach
  • Drawback: most prior work needs in-domain data (collected in the robot environment)
  • Using in-the-wild data makes the policy vulnerable to background, viewpoint, and lighting changes

2.2 AINA's key contributions

AINA differs from prior work in the following ways:

| Feature | Prior work | AINA |
|---|---|---|
| Robot data required | Yes (online correction, RL, simulation) | No |
| Data collection environment | In-domain (robot workspace) | In-the-wild (anywhere) |
| Policy representation | 2D image-based | 3D point-based |
| Robustness to background changes | Weak | Robust |
| Hand tracking | External sensors / estimation | On-board tracking (Aria Gen 2) |
| Depth acquisition | RGB-D camera required | Stereo depth estimation |

3. Technical deep dive: the AINA framework in detail

3.1 Pipeline overview

The AINA workflow has three stages:

[Stage 1] Data collection
    ↓ Aria Gen 2 smart glasses
    ↓ In-the-wild + a single in-scene demo
    
[Stage 2] Data processing
    ↓ 3D hand pose extraction (on-board)
    ↓ Stereo depth estimation (FoundationStereo)
    ↓ 2D object tracking → 3D unprojection
    ↓ Domain alignment (Translation + Rotation)
    
[Stage 3] Policy learning and deployment
    ↓ Vector Neuron MLP (SO(3)-equivariant)
    ↓ Transformer Encoder
    ↓ Fingertip Trajectory Prediction
    ↓ Inverse Kinematics → Robot Deployment

3.2 Aria Gen 2 smart glasses: the game changer

Meta's Project Aria Gen 2 glasses play a central role in AINA's success. Here is why the device matters.

Hardware specifications

| Component | Specification |
|---|---|
| RGB camera | High-resolution front camera |
| SLAM cameras | 4 (for 6DOF position tracking) |
| Eye tracking | Built in (2 cameras) |
| On-board processing | SLAM, hand tracking, eye tracking |
| Weight | About 75 g |
| Battery | 6-8 hours of continuous use |
| Special sensors | PPG (heart rate), contact microphone |

How AINA uses it

  1. On-board 3D hand pose estimation: Gen 2 runs real-time hand tracking on its own chipset. This provides accurate 3D hand joint positions without external sensors or complex post-processing.

  2. Head pose estimation: the SLAM cameras track the wearer's head position and orientation in 6DOF. This makes it possible to compute hand positions in the world frame.

  3. Stereo depth estimation: the left and right SLAM camera images are used to estimate a depth map of the scene. The paper uses FoundationStereo.

  4. Portable data collection: the light weight and long battery life allow natural data collection anywhere.

3.3 ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ ๋ฐ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ

3.3.1 In-the-Wild ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘

์‚ฌ์šฉ์ž๋Š” Aria Gen 2 ์•ˆ๊ฒฝ์„ ์ฐฉ์šฉํ•˜๊ณ  ์ž„์˜์˜ ํ™˜๊ฒฝ(๋ถ€์—Œ, ์‚ฌ๋ฌด์‹ค, ์‹คํ—˜์‹ค ๋“ฑ)์—์„œ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ์ด๋•Œ ํŠน๋ณ„ํ•œ ๋งˆ์ปค๋‚˜ ํ†ต์ œ๋œ ์กฐ๋ช…์ด ํ•„์š” ์—†์Šต๋‹ˆ๋‹ค.

์ˆ˜์ง‘๋˜๋Š” ๋ฐ์ดํ„ฐ: - RGB ๋น„๋””์˜ค ์ŠคํŠธ๋ฆผ - ์˜จ๋ณด๋“œ ์ถ”์ •๋œ 3D ์† ํฌ์ฆˆ (fingertip ์œ„์น˜ ํฌํ•จ) - ํ—ค๋“œ ํฌ์ฆˆ (์›”๋“œ ํ”„๋ ˆ์ž„ ๊ธฐ์ค€) - ์Šคํ…Œ๋ ˆ์˜ค SLAM ์นด๋ฉ”๋ผ ์ด๋ฏธ์ง€

3.3.2 3D ๊ฐ์ฒด ์ถ”์ 

๊ฐ์ฒด์˜ 3D ์œ„์น˜๋ฅผ ์ถ”์ ํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์Œ ๊ณผ์ •์„ ๊ฑฐ์นฉ๋‹ˆ๋‹ค:

  1. 2D ๊ฐ์ฒด ๋ถ„ํ• : ์–ธ์–ด ํ”„๋กฌํ”„ํŠธ ๊ธฐ๋ฐ˜ off-the-shelf ์ปดํ“จํ„ฐ ๋น„์ „ ๋ชจ๋ธ ์‚ฌ์šฉ
  2. ์Šคํ…Œ๋ ˆ์˜ค ๊นŠ์ด ์ถ”์ •: FoundationStereo๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ SLAM ์นด๋ฉ”๋ผ ์ด๋ฏธ์ง€๋กœ๋ถ€ํ„ฐ ๊นŠ์ด ๋งต ์ƒ์„ฑ
  3. 3D ์–ธํ”„๋กœ์ ์…˜: 2D ๊ฐ์ฒด ๋งˆ์Šคํฌ๋ฅผ ๊นŠ์ด ๋งต๊ณผ ๊ฒฐํ•ฉํ•˜์—ฌ 3D ํฌ์ธํŠธ ํด๋ผ์šฐ๋“œ ํš๋“
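Steps 2 and 3 can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the paper's pipeline: it assumes a disparity map is already available (for example from a stereo model), a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, and the relation Z = f·B/d; the function names are illustrative.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, min_disp=1e-6):
    """Convert a disparity map (pixels) to metric depth with Z = f * B / d."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.nan)
    valid = disparity > min_disp          # zero disparity means unknown depth
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

def unproject(pixels_uv, depth_map, fx, fy, cx, cy):
    """Lift tracked 2D points to 3D camera coordinates (pinhole model).

    pixels_uv: (K, 2) integer pixel coordinates, u = column, v = row.
    Returns a (K', 3) array; points with invalid depth are dropped.
    """
    u = pixels_uv[:, 0]
    v = pixels_uv[:, 1]
    z = depth_map[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1)
    return points[np.isfinite(z)]
```

The resulting points live in the camera frame; expressing them in a common frame (via the head pose, then the alignment to the robot base) is a separate step.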

Why FoundationStereo:

  • A zero-shot stereo matching foundation model developed by NVIDIA
  • Trained on 1M stereo pairs, with strong generalization
  • CVPR 2025 Best Paper Nomination
  • Ranked first on the Middlebury and ETH3D benchmarks

3.3.3 ๋„๋ฉ”์ธ ์ •๋ ฌ (Critical Step!)

In-the-wild ๋ฐ์ดํ„ฐ์™€ ๋กœ๋ด‡ ํ™˜๊ฒฝ ์‚ฌ์ด์˜ ๊ณต๊ฐ„์  ์ •๋ ฌ์€ AINA์˜ ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋‹จ์ผ in-scene ๋ฐ๋ชจ๋ฅผ ์•ต์ปค๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

์ •๋ ฌ ๊ณผ์ •:

  1. Translation ์ •๋ ฌ: ๋ชจ๋“  ๋ฐ๋ชจ์˜ ์งˆ๋Ÿ‰ ์ค‘์‹ฌ(Center of Mass)์„ ์ผ์น˜์‹œํ‚ด

    O_aligned = O - CoM(O) + CoM(O_inscene)
    F_aligned = F - CoM(F) + CoM(F_inscene)
  2. Rotation ์ •๋ ฌ: ์ค‘๋ ฅ์ถ•(gravity axis)์„ ๊ธฐ์ค€์œผ๋กœ ํšŒ์ „ ์ •๋ ฌ

    • ์›”๋“œ ํ”„๋ ˆ์ž„์ด ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ ์‹œ ์ž„์˜๋กœ ์ดˆ๊ธฐํ™”๋˜๋ฏ€๋กœ ํšŒ์ „ ๋ณด์ • ํ•„์ˆ˜
    • ์†์˜ ๋ฐฉํ–ฅ ๋ฒกํ„ฐ๋ฅผ in-scene ๋ฐ๋ชจ์— ๋งž์ถค

์ •๋ ฌํ•˜์ง€ ์•Š์œผ๋ฉด ๋ฐœ์ƒํ•˜๋Š” ๋ฌธ์ œ: - ๊ฐ์ฒด ์œ„์น˜๊ฐ€ ์‹ฌ๊ฐํ•˜๊ฒŒ ์–ด๊ธ‹๋‚จ - ์†์˜ ๋ฐฉํ–ฅ์ด ์™„์ „ํžˆ ๋ฐ˜๋Œ€๊ฐ€ ๋  ์ˆ˜ ์žˆ์Œ - ์ •์ฑ…์ด ์ž˜๋ชป๋œ ๊ณต๊ฐ„ ๊ด€๊ณ„๋ฅผ ํ•™์Šต
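The alignment procedure can be sketched with NumPy as below. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the yaw is read off the Kabsch rotation with `arctan2`, and the rotation is applied about the first-frame object centroid so that the centroids coincide afterwards. That last choice is an assumption; it matches applying R_z and then ΔO only when the in-the-wild points are already expressed relative to that centroid.

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R minimizing sum ||R p_i - q_i||^2 over centered points.

    P, Q: (K, 3) arrays of corresponding points (rows are points).
    """
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    H = Pc.T @ Qc                          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

def yaw_only(R):
    """Keep only the rotation about the z (gravity) axis of R."""
    yaw = np.arctan2(R[1, 0], R[0, 0])
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def align_demo(O_w, F_w, O0_s, F0_s):
    """Align one in-the-wild demo to the in-scene anchor.

    O_w: (T, N, 3) object points, F_w: (T, 5, 3) fingertips (in-the-wild).
    O0_s: (M, 3), F0_s: (5, 3) first-frame in-scene object points / fingertips.
    """
    Rz = yaw_only(kabsch(F_w[0], F0_s))    # rotation from initial hand poses
    c_w = O_w[0].mean(axis=0)              # in-the-wild object centroid, frame 0
    c_s = O0_s.mean(axis=0)                # in-scene object centroid, frame 0
    O_hat = (O_w - c_w) @ Rz.T + c_s
    F_hat = (F_w - c_w) @ Rz.T + c_s
    return O_hat, F_hat
```

Restricting the rotation to yaw relies on both frames sharing the gravity direction, which is why a gravity-aligned world frame matters for this step.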

3.4 Policy network architecture

AINA's policy network is based on the Point-Policy algorithm and takes full advantage of the 3D point-based representation.

3.4.1 Input representation

  • Fingertip Points: F_{t-T_o:t} \in \mathbb{R}^{T_o \times 5 \times 3}
    • T_o = 10: observation history length
    • 5 fingertips (thumb + 4 fingers)
  • Object Points: O_{t-T_o:t} \in \mathbb{R}^{T_o \times N \times 3}
    • N: number of object points

3.4.2 Vector Neuron MLP

One of the most interesting architectural choices in AINA is the Vector Neuron MLP. It is an SO(3)-equivariant network, which guarantees equivariance under 3D rotations.

The core idea of Vector Neurons:

  • Conventional MLP: scalar neuron z \in \mathbb{R}
  • Vector Neuron: vector neuron \mathbf{v} \in \mathbb{R}^3

Mathematical definition:

\text{VN-Linear}: f_{\text{lin}}(\mathbf{V}) = \mathbf{W}\mathbf{V}, \quad \mathbf{W} \in \mathbb{R}^{C_{l+1} \times C_l}

where \mathbf{V} \in \mathbb{R}^{C \times 3} is the matrix of vector neurons.

Proof of SO(3)-equivariance:

f_{\text{lin}}(\mathbf{V}\mathbf{R}) = \mathbf{W}(\mathbf{V}\mathbf{R}) = (\mathbf{W}\mathbf{V})\mathbf{R} = f_{\text{lin}}(\mathbf{V})\mathbf{R}

That is, applying a rotation \mathbf{R} to the input applies the same rotation to the output.

VN-ReLU (nonlinear activation): an ordinary ReLU breaks equivariance, so a specially designed VN-ReLU is used:

\text{VN-ReLU}(\mathbf{v}) = 
\begin{cases}
\mathbf{v} & \text{if } \langle \mathbf{v}, \mathbf{k} \rangle \geq 0 \\
\mathbf{v} - \langle \mathbf{v}, \frac{\mathbf{k}}{\|\mathbf{k}\|} \rangle \frac{\mathbf{k}}{\|\mathbf{k}\|} & \text{otherwise}
\end{cases}

where \mathbf{k} = \mathbf{U}\mathbf{V} is a learned direction vector.

Effects in AINA:

  • Improved robustness to background clutter
  • Data collected from diverse viewpoints becomes usable
  • Adapts to changes in the robot deployment environment

3.4.3 Transformer Encoder

The point features encoded by the Vector Neuron MLP are fed to a Transformer encoder.

Architecture details:

  • The history of each point is compressed into a single vector (VN-MLP)
  • Fingertip and object points are handled as separate tokens
  • Key point: a learned positional encoding is applied only to the fingertip tokens
    • Reason: only the fingertips have a correspondence across demos
    • Object points can differ from demo to demo

Output:

  • The Transformer output is passed through an MLP to predict the future fingertip trajectory
  • Prediction horizon: T_p = 30 steps
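The token layout can be illustrated with a small NumPy sketch. It only shows how the token sequence might be assembled, under the assumption that each point's encoded history has already been flattened to a D-dimensional feature; `build_tokens` and the array shapes are hypothetical, and the Transformer itself is omitted.

```python
import numpy as np

def build_tokens(fingertip_feats, object_feats, fingertip_pos_emb):
    """Assemble the encoder input sequence.

    fingertip_feats:   (5, D) one feature per fingertip
    object_feats:      (N, D) one feature per object point
    fingertip_pos_emb: (5, D) learned embedding, one per finger index
    """
    # Fingertips keep a fixed identity (thumb, index, ...) across demos.
    fingertip_tokens = fingertip_feats + fingertip_pos_emb
    # Object points get no positional encoding: they form an unordered set.
    return np.concatenate([fingertip_tokens, object_feats], axis=0)
```

Because object tokens carry no positional encoding, reordering the object points only reorders the corresponding tokens, which fits the observation that object points have no fixed correspondence between demos.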

3.4.4 Loss function

A simple but effective MSE loss is used:

\mathcal{L} = \frac{1}{T_p \cdot 5 \cdot 3} \sum_{t'=t}^{t+T_p} \sum_{i=1}^{5} \|\hat{F}_{t'}^{(i)} - F_{t'}^{(i)}\|^2
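As a concrete reading of this loss, here is a minimal NumPy sketch. It assumes the prediction and the target are arrays of shape (T_p, 5, 3), so that averaging over all entries gives the 1/(T_p · 5 · 3) normalization; the function name is illustrative.

```python
import numpy as np

def fingertip_mse(pred, target):
    """Mean squared error over time steps, fingertips, and xyz coordinates."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean((pred - target) ** 2))
```

For example, a prediction that is off by 1 in every coordinate of every fingertip at every step has a loss of exactly 1.0.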

3.5 Robot deployment

3.5.1 Fingertip to joint angle conversion

To deploy the predicted fingertip trajectory on the robot:

  1. Use the robot hand's forward kinematics (FK) to compute the current fingertip positions
  2. Solve an inverse kinematics (IK) optimization for the joint angles that reach the target fingertip positions
  3. Send the joint angles to the robot controller
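The FK-then-IK loop can be illustrated on a toy problem. The sketch below is not the paper's full arm-hand IK module: it uses a hypothetical planar two-link finger with made-up link lengths and a damped least-squares update, just to show the shape of a step that maps a desired fingertip and the current joints to the next joints.

```python
import numpy as np

L1, L2 = 0.30, 0.25   # hypothetical link lengths in meters

def fk(q):
    """Fingertip position of a planar 2-link chain for joint angles q."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def jacobian(q):
    """Analytic Jacobian d(fingertip) / d(joint angles)."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [L1 * c1 + L2 * c12, L2 * c12]])

def ik_step(target, q_current, damping=1e-2, iters=50):
    """Damped least-squares IK: next joints from target tip and current joints."""
    q = np.array(q_current, dtype=np.float64)
    for _ in range(iters):
        err = target - fk(q)
        J = jacobian(q)
        # dq = J^T (J J^T + lambda^2 I)^-1 err; damping keeps the step bounded near singularities
        dq = J.T @ np.linalg.solve(J @ J.T + damping ** 2 * np.eye(2), err)
        q = q + dq
    return q

q_now = np.array([0.3, 0.8])
tip_goal = fk(np.array([0.5, 1.0]))
q_next = ik_step(tip_goal, q_now)
```

Seeding the solver with the current joints, as in J^{t+1} = I(F_{t+1}, J^t), keeps consecutive solutions close together, which matters for smooth motion when several joint configurations reach the same fingertip target.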

3.5.2 Grasping Threshold

Human demos carry no force information, so a heuristic is applied for grasping tasks:

  • Grasping starts once the distance between fingertips drops below a threshold
  • Fingertip positions are held fixed during the grasp
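One way to read the threshold rule is sketched below. It follows the variant stated in the Ping review (fingers within 5 cm of the predicted thumb are driven closer together); the function name, the thumb-at-index-0 ordering, and the `squeeze` fraction are assumptions made for the example, not details from the paper.

```python
import numpy as np

def apply_grasp_threshold(fingertips, threshold_m=0.05, squeeze=0.5):
    """Pull finger targets toward the thumb when they are close to it.

    fingertips: (5, 3) predicted fingertip targets, index 0 = thumb (assumed).
    squeeze: fraction of the thumb-finger gap to close (hypothetical value).
    """
    tips = np.asarray(fingertips, dtype=np.float64)
    thumb = tips[0]
    out = tips.copy()
    for i in range(1, len(tips)):
        offset = tips[i] - thumb
        if np.linalg.norm(offset) < threshold_m:
            # Shrink the gap so the commanded targets press into the object.
            out[i] = thumb + (1.0 - squeeze) * offset
    return out
```

Commanding targets slightly inside the object surface is a common stand-in for grasp force when only positions are available.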


4. Analysis of experimental results

4.1 Tasks and setup

AINA was evaluated on nine everyday manipulation tasks:

| Task | Description | Difficulty |
|---|---|---|
| Stowing | Put an object into a box | Medium |
| Oven Turning | Turn an oven dial | Medium |
| Oven Opening | Open an oven door | High |
| Drawer Opening | Open a drawer | High |
| Cup Pouring | Tilt a cup to pour | High |
| Planar Reorientation | Rotate an object on a plane | Medium |
| Toaster Press | Press a toaster lever | Low |
| Toy Picking | Pick up a toy and move it | Low |
| Wiping | Wiping motion | Low |

Robot setup (as described in the Ping review):

  • Kinova Gen3 robot arm
  • Psyonic Ability Hand (five fingers)

4.2 Baseline comparison

Table I: Performance by data configuration

| Method | Toaster Press | Toy Picking |
|---|---|---|
| In-Scene Only | 3/10 (30%) | 1/10 (10%) |
| In-The-Wild Only | 0/10 (0%) | 0/10 (0%) |
| In-Scene Transform & In-The-Wild | 0/10 (0%) | 1/10 (10%) |
| In-Scene Training & In-The-Wild | 6/10 (60%) | 2/10 (20%) |
| AINA | 13/15 (87%) | 13/15 (87%) |

Key insights:

  • In-the-wild data alone is not enough to learn from (domain gap)
  • In-scene data alone is too little data
  • AINA's alignment strategy makes the two data sources work together

Table II: Comparison with RGB-input baselines

| Method | Oven Opening | Drawer Opening |
|---|---|---|
| Masked BAKU | 6/15 (40%) | 1/15 (7%) |
| Masked BAKU with History | 0/15 (0%) | 0/15 (0%) |
| AINA | 12/15 (80%) | 11/15 (73%) |

Analysis:

  • RGB-based policies (BAKU) are sensitive to background changes
  • Performance stays limited even with masking
  • The results support the advantage of the 3D point-based representation

4.3 Generalization experiments

4.3.1 Height generalization

Evaluates how well AINA adapts when the workspace height changes:

Toy Picking:

  • Height 1: 5/10 (50%)
  • Height 2: 6/10 (60%)
  • Height 3: 2/10 (20%)

Wiping:

  • Height 1: 5/10 (50%)
  • Height 2: 5/10 (50%)
  • Height 3: 8/10 (80%)

Analysis:

  • The policy is retrained with one additional in-scene demo at the new height
  • It adapts to height changes with minimal human effort
  • The drop in Toy Picking at Height 3 is attributed to increased task difficulty

4.3.2 Object generalization

Generalization to objects not seen during training:

Toy Picking:

  • Popcorn Package, Bowl: 1/10 (10%)
  • Toy, Bowl: 2/10 (20%)

Wiping:

  • Sponge: 7/10 (70%)
  • Eraser: 5/10 (50%)

Toaster:

  • Different Toaster: 6/10 (60%)

Analysis:

  • Generalizes relatively well to objects of similar shape and weight
  • Failures increase when shape or weight differs substantially
  • Object segmentation through a language prompt contributes to generalization


5. Technical deep dive

5.1 Why a 3D point-based representation?

This section looks closely at why AINA uses 3D point clouds instead of RGB images.

(1) Background Invariance

The fundamental problem with RGB image-based policies:

Input: I_rgb ∈ R^(H×W×3)
→ includes background, lighting, and texture
→ the policy overfits to background patterns
→ fails in new environments

The 3D point-based approach:

Input: P_object ∈ R^(N×3), P_fingertip ∈ R^(5×3)
→ contains only geometric information
→ background information is excluded automatically
→ robust to environment changes

(2) Minimizing the embodiment gap

Differences between the human hand and a robot hand:

  • Morphological differences (number of fingers, lengths, joint structure)
  • Kinematic differences (workspace, joint ranges)

Advantages of the fingertip representation:

  • Focuses only on the contact points (fingertips), not the whole hand
  • The five human fingertips can be mapped onto the robot hand's fingertips
  • Based on functional similarity rather than shape

(3) Natural integration of SO(3) equivariance

3D points combine naturally with Vector Neurons:

P ∈ R^(N×3) → VN-MLP → F ∈ R^(C×3)

Applying a rotation R to the input points:

PR → VN-MLP → FR

That is, a rotated input produces a rotated output → the network adapts automatically to viewpoint changes

5.2 The role of the in-scene demo

The single in-scene demo plays a decisive role in AINA:

  1. Spatial anchor: aligns the coordinate frame of the in-the-wild demos with the robot environment
  2. Scale reference: provides absolute size information
  3. Environment context: reflects the characteristics of the robot deployment environment

Analogy:

> AINA without an in-scene demo = traveling abroad without a map
> AINA with an in-scene demo = traveling with a local guide

5.3 What the choice of FoundationStereo means

Choosing FoundationStereo for depth estimation is a strategic decision:

Limitations of the alternatives:

  • Monocular depth: scale ambiguity, limited accuracy
  • RGB-D camera: needs extra hardware, reduces portability
  • Conventional stereo matching: domain-specific, hard to generalize

Advantages of FoundationStereo:

  • Zero-shot generalization: works out of the box in in-the-wild environments
  • High accuracy: state of the art on the KITTI, Middlebury, and ETH3D benchmarks
  • Side-tuning adapter: reuses the pretrained knowledge of DepthAnythingV2

5.4 Failure mode analysis

The paper does not state these explicitly, but the following failure modes can be anticipated:

  1. Occlusion: 3D tracking fails when the hand or object is occluded
  2. Fast Motion: estimation errors from motion blur during fast movements
  3. Transparent/Reflective Objects: a fundamental limit of depth estimation
  4. Novel Object Shapes: objects far outside the training distribution
  5. Force-sensitive Tasks: tasks such as precision assembly are hard without force information

6. Comparison with related work

6.1 HuDOR (Human Demonstration to Robot)

HuDOR is the work most directly comparable to AINA.

| Aspect | HuDOR | AINA |
|---|---|---|
| Data source | A single in-scene human video | In-the-wild + in-scene |
| Learning method | RL-based policy improvement | Pure imitation learning |
| Robot data | Required (RL process) | Not required |
| Reward design | Based on object motion similarity | N/A (supervised) |
| Scalability | Limited (RL cost) | High (only more demos needed) |

Strengths of HuDOR:

  • Adapts to robot dynamics through RL
  • Works from a single video

Strengths of AINA:

  • Offline learning without robot interaction
  • More diversity through in-the-wild data
  • A simpler pipeline

6.2 UMI (Universal Manipulation Interface)

UMI is a work that proposes a universal manipulation interface.

| Aspect | UMI | AINA |
|---|---|---|
| Interface | Custom handheld gripper | Aria Gen 2 glasses |
| Target robot | Mainly two-finger grippers | Multi-fingered hand |
| Data collection | Handheld manipulation | Natural hand use |
| Policy | Diffusion Policy | Point-Policy |

Strengths of UMI:

  • Supports a wider range of robot end effectors
  • Well suited to industrial settings

Strengths of AINA:

  • Hands-free data collection
  • Specialized for multi-fingered manipulation
  • Captures more natural human motion

6.3 DexCap

DexCap is a work that proposes a wearable data collection system.

| Aspect | DexCap | AINA |
|---|---|---|
| Sensing | Motion capture gloves + SLAM | Aria Gen 2 (all-in-one) |
| Learning method | Diffusion Policy + online correction | Point-Policy (offline) |
| Online correction | Required | Not required |
| 3D representation | Various inputs | Point cloud |

Strengths of DexCap:

  • Precise hand tracking (MoCap)
  • The multimodal learning capacity of Diffusion Policy

Strengths of AINA:

  • Simpler hardware (glasses only)
  • Deployable without online correction
  • A low barrier to entry


7. Limitations and future directions

7.1 Current limitations

(1) Limited to single-object manipulation

  • The current pipeline is optimized for tracking a single object
  • Multi-object interaction needs further research

(2) No force/tactile information

  • Force information cannot be obtained from human demos
  • This limits precision assembly and the handling of soft objects

(3) Data efficiency

  • An average of 15 minutes of demos is needed per task
  • Combining with foundation models could improve this

(4) Real-time performance

  • Inference speed is not currently reported
  • Validation on real-time reactive tasks is needed

(5) Bimanual manipulation

  • Extension to two-handed manipulation has not been validated

7.2 Promising future directions

(1) Foundation model integration

Now: task-specific policy learning
Future: general-purpose manipulation policies built on VLMs/LLMs

Example: language-conditioned manipulation using GPT-4V, Gemini, and similar models

(2) Sim-to-real hybrid

AINA's in-the-wild data + synthetic simulation data
→ more robust policy learning

(3) Tactile sensing integration

Combine with tactile sensors such as DIGIT and GelSight
→ obtain force information → enable precise manipulation

(4) Continuous learning

Collect failure cases during deployment
→ update the policy online
→ continuous improvement

(5) Multi-robot learning

Use the same human demos across different robot platforms
→ general-purpose manipulation policies

8. Practical implications

8.1 A checklist for roboticists

Points to consider when building an AINA-style system:

Hardware:

  • [ ] Egocentric camera (captures hand and object together)
  • [ ] On-board hand tracking or a high-quality estimator
  • [ ] A camera layout that allows stereo depth estimation
  • [ ] Portable, lightweight form factor

Software:

  • [ ] A robust object segmentation model
  • [ ] Zero-shot stereo depth estimation
  • [ ] An SO(3)-equivariant network implementation
  • [ ] An efficient IK solver

Data:

  • [ ] An in-the-wild demo collection protocol
  • [ ] An in-scene anchor demo
  • [ ] A domain alignment pipeline
  • [ ] A data quality verification procedure

8.2 When should you use the AINA approach?

Good fit:

  • Research on multi-fingered hand manipulation
  • Environments where robot data is hard to collect
  • Cases that need generalization across diverse environments
  • Cases that need demos collected by non-experts

Poor fit:

  • Tasks where precise force control is essential
  • Tasks where real-time reaction is critical
  • Two-finger grippers (simpler methods exist)
  • Cases where data is already abundant


9. Conclusion

AINA marks an important milestone in robot manipulation learning. It moves one step closer to the long-standing goal of learning multi-fingered robot policies from in-the-wild human video.

Summary of key contributions

  1. Paradigm shift: learning multi-fingered manipulation from human demos alone, with no robot data
  2. Practical hardware: built on Aria Gen 2, a device with a path to commercialization
  3. 3D point representation: achieves background invariance and a smaller embodiment gap at the same time
  4. Simple but effective: works without complex RL or online correction

Implications for the future of robotics

"A world where robots learn by watching people"

This is the future AINA envisions. There is still a long way to go, but this work lays out a concrete and practical path in that direction.

As wearable devices like Aria Gen 2 advance, foundation models like FoundationStereo mature, and algorithms like AINA improve, truly general-purpose robot manipulation systems will come closer.

  1. Guzey, I., et al. (2025). “Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations.” arXiv:2511.16661.
  2. Deng, C., et al. (2021). “Vector Neurons: A General Framework for SO(3)-Equivariant Networks.” ICCV 2021.
  3. Wen, B., et al. (2025). “FoundationStereo: Zero-Shot Stereo Matching.” CVPR 2025 (Best Paper Nomination).
  4. Meta. (2025). “Introducing Aria Gen 2: Unlocking New Research in Machine Perception, Contextual AI, Robotics, and More.”
  5. Guzey, I., et al. (2024). “HuDOR: Bridging the Human to Robot Dexterity Gap through Object-Oriented Rewards.” arXiv:2410.23289.

⛏️ Dig Review

⛏️ Dig: Go deep, uncover the layers. Dive into technical detail.

Summary of key contributions: This work introduces AINA, a new framework built on Aria Gen 2 smart glasses, and presents it as the first system to learn multi-fingered manipulation policies solely from human demonstrations collected in everyday environments. The Aria Gen 2 glasses combine a high-resolution RGB camera, onboard 3D hand and head pose estimation, and wide-angle stereo views, so depth can be recovered reliably against arbitrary backgrounds. AINA learns multi-fingered robot policies directly from “human video” alone, with no robot data (including simulation or reinforcement learning), which sharply reduces the large-scale robot control data collection that earlier methods required.

  • Data anyone can collect anywhere: Wearing Aria Gen 2 glasses, a person can easily record manipulation in real settings such as kitchens, offices, and labs. According to the paper, an average of 15 minutes of recorded human demonstrations is enough data to train an autonomous robot policy.
  • No robot data required: AINA trains on human demonstration data only and uses no simulation or robot-side data (e.g., reinforcement learning or online correction). This sets it apart from prior work that relies on such data to bridge the human-robot embodiment gap.
  • Domain alignment: As a preprocessing step, every in-the-wild demonstration is aligned to a single in-scene demonstration. The object and hand positions of each demonstration are translated using their centroid and rotated about the hand's gravity axis, so that coordinate frames from different environments coincide.
  • 3D point-based policy architecture: The authors designed a network that takes 3D keypoint trajectories of the fingertips and the object as input. The input points are embedded by a Vector Neuron MLP (layers that are SO(3)-equivariant with respect to the 3D input) and then processed as tokens by a transformer encoder. An MLP on the transformer output predicts the future fingertip trajectory, and the network is trained with an MSE loss. Using only 3D information narrows the visual difference between human and robot observations and yields a policy that is robust to background changes.
  • Strong performance: Experiments cover nine everyday manipulation tasks (e.g., pressing a toaster, picking up a toy, opening an oven, pushing a drawer). AINA reached a success rate of about 87% (13/15) on toaster pressing and toy picking, far above training on a single environment (30% or lower) or on in-the-wild data alone (0%). On oven opening and drawer opening it scored 80% (12/15) and 73% (11/15), clearly higher than the Masked-BAKU image-based policy (at most 6/15).
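The SO(3)-equivariance mentioned in the architecture bullet can be made concrete with a small numerical check. The sketch below is not the AINA network; it is a minimal NumPy illustration of the linear layer from the Vector Neurons formulation, where features are stacks of 3D vectors and the weight matrix only mixes channels. The channel sizes, the random weights, and the helper names `vn_linear` and `rotation_z` are assumptions made for this example.

```python
import numpy as np


def vn_linear(features, weight):
    """Vector Neuron style linear layer.

    features: (C_in, 3) array, each row is one 3D vector feature.
    weight:   (C_out, C_in) array that mixes channels only.
    Returns a (C_out, 3) array.
    """
    return weight @ features


def rotation_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))  # 4 input channels (hypothetical size)
weight = rng.normal(size=(8, 4))    # 8 output channels (hypothetical size)
R = rotation_z(0.7)

# Rows are 3D vectors, so rotating every vector is a right-multiplication by R.T.
rotate_then_embed = vn_linear(features @ R.T, weight)
embed_then_rotate = vn_linear(features, weight) @ R.T

# Equivariance: both orders give the same result up to floating-point error.
equivariance_gap = float(np.max(np.abs(rotate_then_embed - embed_then_rotate)))
```

Because the weights act on the channel dimension and the rotation acts on the 3D dimension, the two operations commute, which is why rotating the input simply rotates the embedding.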

Meta's Aria Gen 2 smart glasses (image): portable smart glasses with 3D hand and head pose estimation and high-resolution stereo cameras. AINA uses them to collect human manipulation data in unconstrained environments.

Methodology Analysis

  1. Data collection and processing: A researcher wears the Aria Gen 2 glasses and records human manipulation demonstrations within the cameras' field of view. A language-prompted segmentation and tracking model such as Grounded-SAM identifies and tracks the object frame by frame. In parallel, depth maps are estimated with FoundationStereo from the glasses' SLAM cameras. The resulting 2D object tracks are combined with the corresponding depth maps to obtain 3D object positions and hand joint trajectories. For a scene where a cup is picked up, for example, the 3D trajectories of both the cup and the fingertips are computed and passed to the next step.
  2. Domain alignment: In-the-wild demonstrations are collected with different backgrounds, heights, and camera positions, so their coordinate frames do not match the robot workspace. AINA takes a single demonstration recorded in the robot environment as the anchor (reference) and brings the coordinate frames of the remaining demonstrations into agreement with it. Concretely, the object and hand keypoints of each demonstration are translated using their centroid and rotated about the hand's gravity (vertical) axis. After this step all demonstrations share one reference frame, which reduces errors caused by mismatched initial coordinates.
  3. Policy learning: A closed-loop policy is trained to take the history of aligned 3D fingertip and object keypoints and predict the future fingertip trajectory. Each input point is encoded by a Vector Neuron MLP into an SO(3)-equivariant representation, and these vectors serve as tokens for a Transformer encoder. Positional encoding is applied to the fingertip keypoints specifically. An MLP on the Transformer output predicts future fingertip coordinates, and the network is trained end-to-end with an MSE loss between the predicted and ground-truth fingertip trajectories. Because the input consists of 3D points, the resulting policy is robust to background and lighting changes. During training, data augmentation with random 3D translation, rotation, and scaling improves generalization.
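The first two steps above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' pipeline: it assumes a pinhole camera model with made-up intrinsics (`fx`, `fy`, `cx`, `cy`), assumes the gravity axis coincides with the z axis of the frame the points are expressed in, and takes the yaw angle as a given input rather than estimating it. The function names `backproject`, `rotation_about_z`, and `align_to_anchor` are hypothetical.

```python
import numpy as np


def backproject(pixels, depths, fx, fy, cx, cy):
    """Lift (N, 2) pixel coordinates and (N,) metric depths to (N, 3) camera-frame points."""
    u, v = pixels[:, 0], pixels[:, 1]
    x = (u - cx) * depths / fx
    y = (v - cy) * depths / fy
    return np.stack([x, y, depths], axis=1)


def rotation_about_z(yaw):
    """Rotation matrix for a yaw angle (radians) about the z axis."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def align_to_anchor(points, demo_centroid, anchor_centroid, yaw):
    """Center on the demo centroid, rotate about z, then move to the anchor centroid."""
    return (points - demo_centroid) @ rotation_about_z(yaw).T + anchor_centroid


# Step 1: combine 2D tracks with depth. The first pixel sits at the principal point.
pixels = np.array([[320.0, 240.0], [820.0, 240.0]])
depths = np.array([2.0, 1.0])
points_3d = backproject(pixels, depths, fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# Step 2: simulate an in-the-wild demo as a rotated, shifted copy of the anchor,
# then check that alignment maps it back onto the anchor keypoints.
anchor_points = np.array([[0.4, 0.0, 0.1], [0.5, 0.1, 0.1], [0.6, -0.1, 0.3]])
anchor_centroid = anchor_points.mean(axis=0)
true_yaw = 0.5
demo_centroid = np.array([2.0, -1.0, 0.8])
demo_points = (anchor_points - anchor_centroid) @ rotation_about_z(-true_yaw).T + demo_centroid

aligned = align_to_anchor(demo_points, demo_centroid, anchor_centroid, true_yaw)
alignment_error = float(np.max(np.abs(aligned - anchor_points)))
```

Restricting the rotation to a single yaw angle keeps the vertical direction untouched, which matches the idea of rotating only about the gravity axis.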

The paper also addresses the transfer to the robot deployment environment. The real-world experiments used a Kinova Gen3 robot arm (7 DOF) and a five-fingered Psyonic Ability hand, with a custom IK module that converts fingertip positions into robot motor commands appropriate for their joints. Whether the hand is grasping during a manipulation is decided from the distance between fingers, which approximates, to a limited degree, the force information implicit in the human demonstration.
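A distance-based grasp test like the one described above might look as follows. This is only a sketch: the rule (any fingertip close to the thumb tip), the 3 cm threshold, and the name `is_grasping` are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def is_grasping(thumb_tip, other_tips, threshold_m=0.03):
    """Return True when any other fingertip is within threshold_m of the thumb tip.

    thumb_tip:  (3,) array, thumb fingertip position in meters.
    other_tips: (K, 3) array, remaining fingertip positions in meters.
    threshold_m is a hypothetical value and would need tuning per hand and task.
    """
    distances = np.linalg.norm(other_tips - thumb_tip, axis=1)
    return bool(np.min(distances) < threshold_m)


thumb = np.array([0.00, 0.00, 0.00])
open_hand = np.array([[0.08, 0.00, 0.00], [0.09, 0.02, 0.00]])
closed_hand = np.array([[0.02, 0.00, 0.00], [0.09, 0.02, 0.00]])

open_result = is_grasping(thumb, open_hand)
closed_result = is_grasping(thumb, closed_hand)
```

A geometric rule of this kind carries no real force measurement, which is consistent with the missing-force limitation discussed later in this review.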

Experiments and Interpretation of Results

The experiments were run on a real robot across nine everyday manipulation tasks (e.g., wiping, toy pick and place, oven opening, drawer opening, toaster pressing). The main comparisons are summarized below:

  • Comparison across training data: In Table I, AINA (in-scene + in-the-wild) reached a success rate of about 87% (13/15) on both toaster pressing and toy picking. A policy trained only on demonstrations from a single environment reached just 30% and 10%, respectively, and one trained only on in-the-wild demonstrations scored 0%. This indicates that in-the-wild data supplies spatial diversity for deployment, while the in-scene demonstration stabilizes policy learning.
  • RGB images vs. 3D points: In Table II, Masked-BAKU, an existing image-based Transformer imitation method, reached at most 40% success on the oven opening and drawer opening tasks (6/15 and 1/15, respectively). AINA reached 80% (12/15) and 73% (11/15), a much stronger result. This suggests that a 3D point-based policy is more robust to the viewpoint changes of the camera used for human demonstrations.
  • Height variation: When tested at different desk heights, AINA retained partial success at new heights without prior training on them. On toy picking, success at heights 1 to 3 (low → high) was 50%, 60%, and 20%; on wiping it was 50%, 50%, and 80%. This shows that when the robot workspace height changes, the policy can generalize with only minimal additional demonstrations (one in-scene demonstration per height).
  • Generalization to new objects: The trained policies were also evaluated on the same tasks with different objects. When shape and weight were similar to the training object (e.g., sponge → a similar sponge, toaster → a different toaster), success stayed relatively high (70% or above), but with a very different object (e.g., a popcorn pack in place of the plush toy) performance dropped sharply (20% or below). This points to a limitation of the current approach when an object's physical properties differ.
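The percentages quoted in the bullets above follow directly from the trial counts. As a quick check, the short script below recomputes them from the 15-trial counts cited in this review; the dictionary keys are labels chosen for this example.

```python
def success_rate(successes, total):
    """Success rate as a percentage."""
    return 100.0 * successes / total


# (successes, trials) as quoted in the review text above.
trial_counts = {
    "AINA toaster press / toy picking": (13, 15),
    "AINA oven opening": (12, 15),
    "AINA drawer opening": (11, 15),
    "Masked-BAKU oven opening": (6, 15),
    "Masked-BAKU drawer opening": (1, 15),
}

rates = {name: round(success_rate(s, n), 1) for name, (s, n) in trial_counts.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

With only 15 trials per condition, one extra success or failure moves the rate by roughly 6.7 percentage points, so small differences between conditions should be read with care.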

Potential Applications

AINA's approach could be applied to a range of real robot systems. For home service robots, a user wearing smart glasses can naturally demonstrate tasks such as using kitchen tools, tidying, and cleaning, and the robot can then automate similar tasks from that data. The experimental tasks, such as pouring into a cup or wiping a drawer, are everyday chores and therefore map naturally onto household robots. Industrial collaborative robots could use the approach for small-part assembly or tool handling. With a multi-fingered hand in place of a simple gripper, the same idea could extend to handling complex assemblies or musical instruments. Recording an expert's motions in an AR/VR environment and using them to build a teleoperation system is another possibility.

The feasibility of these applications is supported by the result that a policy can be trained from an average of just 15 minutes of data. Non-experts can provide manipulation demonstrations with a single pair of smart glasses, which greatly reduces the burden of large-scale data collection. The fact that Meta is developing Aria Gen 2 itself as research equipment for robotics and AR also supports the practicality of the approach. Going forward, AINA's generality can be tested on more task families (e.g., varied tool use, multi-step action sequences) and on different robot hand platforms.

Limitations and Future Directions

The work is innovative but has several constraints. First, there is no force feedback. Smart glasses can only measure the joint positions of the hand, so contact forces between the fingers and tactile information about the object are unavailable. This limits delicate, high-precision manipulation and fine grasping. Second, there is a camera synchronization problem. The Aria glasses' RGB camera and SLAM cameras differ in shutter timing, so fast head motion can introduce error between the RGB image and the depth map. Because this mismatch can make object pixels disagree with their true 3D positions, the person recording is currently instructed to avoid abrupt head movements. Third, there is the question of real-time applicability. The experiments used the Aria glasses to collect training data but a RealSense RGB-D camera at deployment. This can introduce a slight shift in the input data distribution, and real-time depth estimation on Aria is still being optimized. In addition, the current method uses only pre-recorded offline data, so it is hard to adapt in real time to dynamic changes in the environment or variation in human behavior.

Future work is needed to address these limitations. For example, pressure sensors or electromyography (EMG) sensors could be attached to capture hand force information alongside the video. 3D object tracking that tolerates fast head motion, or mesh-based object modeling, could provide stable registration even in dynamic scenarios. If real-time depth processing on the Aria glasses is optimized so that the same sensor can be used from collection through deployment, the domain gap could shrink further. Finally, extending AINA to more diverse tasks and to real-time correction (e.g., coupling with reinforcement learning) would broaden its generality and help overcome the remaining limits on real-world robot applications.

Copyright 2026, JungYeon Lee