Curieux.JY
  • JungYeon Lee

📃 EgoScale Review

humanoid
vla
egocentric
human-robot-transfer
Scaling Dexterous Manipulation with Diverse Egocentric Human Data
Published

February 26, 2026

  • Paper Link
  • Code Link
  1. 🤖 EgoScale uses more than 20,854 hours of egocentric human video to build a dexterous manipulation transfer framework on large-scale human data, and finds a log-linear scaling law between data scale and action-prediction validation loss.
  2. 🚀 The framework combines large-scale human pretraining with a small amount of aligned human-robot mid-training in a two-stage training scheme, enabling long-horizon dexterous manipulation and one-shot task adaptation.
  3. 🦾 As a result, the final policy improves average success rate by 54% over a no-pretraining baseline on a 22-DoF dexterous robotic hand, and it transfers effectively to lower-DoF robot hands, showing that it provides a reusable, embodiment-agnostic motor prior.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

This work presents an effective approach to fine-grained dexterous manipulation with high-DoF robots, using a large volume of human egocentric data. Prior work relied on small datasets or focused on low-DoF hands, so it was unclear how far large-scale human data could support complex dexterous manipulation. To address this question, the paper introduces EgoScale, a human-to-robot transfer framework built on large-scale egocentric human data.

Core Methodology:

EgoScale aims to learn, from large-scale egocentric human video, representations that can be used directly for dexterous robot control. It makes two key design choices to get there.

  1. Human Action Representation:
    • Raw Sensor Streams: The pipeline uses egocentric RGB video from a head-mounted camera, together with camera motion (T_{t}^{w \leftarrow c} \in SE(3)) and human hand pose (21 keypoints, each a rigid transform H_{t}^{c,i} \in SE(3) in the camera frame) obtained from SLAM (Simultaneous Localization and Mapping) and a hand pose estimation pipeline.
    • Wrist-level Arm Motion: To obtain motion commands that are invariant to global camera motion, the method uses relative wrist motion between timesteps. It is defined as \Delta W_t = (W_w^{0})^{-1} W_w^{t}, where W_w^{t} is the wrist pose in the world frame at timestep t, and it is the main arm-level action abstraction, used in the same form for robot execution.
    • Hand Articulation: The 21 human hand keypoints are retargeted into the 22-DoF joint space of the Sharpa robot hand. This is done with an optimization-based procedure that respects joint limits and kinematic constraints, so the fine finger motion of the human hand is preserved.
  2. Data Sources and Processing:
    • Stage 1: Large-Scale Egocentric Human Pretraining Data:
      • A total of 20,854 hours of egocentric human activity video is used for pretraining. Most of it consists of in-the-wild recordings collected in real environments (household, industrial, retail, educational, and others) covering 9,869 scenes, 6,015 tasks, and 43,237 objects. These recordings are noisy but span a wide range of manipulation behavior.
      • In addition, 829 hours of the EgoDex dataset, which provides accurate wrist and hand tracking captured with Apple Vision Pro, are included to stabilize pretraining.
    • Stage 2: Aligned Human-Robot Mid-Training Data:
      • To narrow the embodiment gap between human demonstration and robot execution, a smaller dataset containing both human and teleoperated robot data is introduced.
      • It consists of 344 tabletop manipulation tasks, each with roughly 30 human trajectories and 5 robot trajectories (about 50 hours of human data and 4 hours of robot data in total).
      • Human demonstrations are collected with the same camera configuration as the robot (matching viewpoints, calibrated intrinsics), and wrist pose and full hand pose are recorded with Vive trackers and Manus gloves. The dataset is small, but it is explicitly embodiment-aligned to the robot's workspace and kinematics.
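The wrist-level action above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes poses are 4x4 homogeneous transforms stored as nested lists, and it assumes the world-frame wrist pose is obtained by composing the camera pose with the camera-frame wrist pose (W_w^t = T_t^{w<-c} H_t^{c,wrist}). The function names are hypothetical.

```python
# Sketch only: relative wrist motion Delta W_t = (W_w^0)^{-1} W_w^t.
# Assumption: W_w^t = T_t^{w<-c} @ H_t^{c,wrist}; all names are illustrative.

def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(t):
    """Invert a rigid transform [R p; 0 1] as [R^T, -R^T p; 0 1]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def translation(x, y, z):
    """Pure-translation transform, handy for small examples."""
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]

def relative_wrist_motion(cam_poses, wrist_in_cam):
    """Return Delta W_t for every timestep, relative to timestep 0."""
    wrist_world = [matmul(T, H) for T, H in zip(cam_poses, wrist_in_cam)]
    ref_inv = invert_rigid(wrist_world[0])
    return [matmul(ref_inv, W) for W in wrist_world]
```

If the head (camera) moves while the wrist stays fixed in the world, the camera-frame wrist pose changes but the relative wrist motion stays at the identity, which is the invariance the representation is after.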

๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜ ๋ฐ ํ›ˆ๋ จ (Model Architecture and Training):

  • ๋ชจ๋ธ์€ GR00T N1 [19]๊ณผ ์œ ์‚ฌํ•œ ํ”Œ๋กœ์šฐ ๊ธฐ๋ฐ˜ VLA(Vision-Language-Action) ์•„ํ‚คํ…์ฒ˜๋ฅผ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค. ๊ฐ ํƒ€์ž„์Šคํ… t์—์„œ ๋ชจ๋ธ์€ ์ด๋ฏธ์ง€(I_t)์™€ ์–ธ์–ด ์ง€์‹œ(l_t)๋กœ ๊ตฌ์„ฑ๋œ ๊ด€์ธก๊ฐ’ o_t = (I_t, l_t)์— ์กฐ๊ฑด์„ ๋ถ€์—ฌํ•˜์—ฌ vision-language embedding \phi_t๋กœ ์ธ์ฝ”๋”ฉํ•œ ํ›„, ํ”Œ๋กœ์šฐ ๋งค์นญ(flow-matching) ๋ชฉํ‘œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฏธ๋ž˜ ๋™์ž‘ ๋ฉ์–ด๋ฆฌ(chunk)๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
  • ๋กœ๋ด‡ ๋ฐ์ดํ„ฐ๋Š” ๋กœ๋ด‡ ๊ณ ์œ  ์ƒํƒœ(proprioceptive state) q_t๋ฅผ ์กฐ๊ฑด์œผ๋กœ ํ•˜์ง€๋งŒ, ์ธ๊ฐ„ ์‹œ์—ฐ์—๋Š” ์ด๋Ÿฌํ•œ ์‹ ํ˜ธ๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค. ๊ณ ์œ  ์ƒํƒœ๊ฐ€ ์—†์„ ๋•Œ๋Š” ํ•™์Šต ๊ฐ€๋Šฅํ•œ ํ”Œ๋ ˆ์ด์Šคํ™€๋” ํ† ํฐ์œผ๋กœ ๋Œ€์ฒดํ•˜์—ฌ ํ†ต์ผ๋œ ๋ชจ๋ธ ๊ตฌ์„ฑ์„ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค.
  • ๋‹ค์–‘ํ•œ ๋กœ๋ด‡ ์‹ ์ฒด์— ๋Œ€์‘ํ•˜๊ธฐ ์œ„ํ•ด ๊ฒฝ๋Ÿ‰์˜ ์‹ ์ฒด ์กฐ๊ฑด๋ถ€ MLP ์–ด๋Œ‘ํ„ฐ(embodiment-conditioned MLP adapters)๋ฅผ ์ž…๋ ฅ ๋ฐ ์ถœ๋ ฅ ์ธํ„ฐํŽ˜์ด์Šค์— ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด ์–ด๋Œ‘ํ„ฐ๋“ค์€ ์‹ ์ฒด ํŠน์ • ๊ณ ์œ  ์ƒํƒœ๋ฅผ ์ธ์ฝ”๋”ฉํ•˜๊ณ  ์† ๋™์ž‘์„ ๋””์ฝ”๋”ฉํ•˜๋ฉฐ, ์ƒ๋Œ€ ์†๋ชฉ ์›€์ง์ž„ ์˜ˆ์ธก, vision-language backbone, DiT action expert๋Š” ์™„์ „ํžˆ ๊ณต์œ ๋ฉ๋‹ˆ๋‹ค.
  • ํ›ˆ๋ จ ๋ ˆ์‹œํ”ผ (Training Recipe):
    1. 1๋‹จ๊ณ„ (์ธ๊ฐ„ ์‚ฌ์ „ ํ•™์Šต): 20,000์‹œ๊ฐ„์˜ egocentric ์ธ๊ฐ„ ๋ฐ์ดํ„ฐ๋กœ 100,000 ์Šคํ… ๋™์•ˆ ํ•™์Šตํ•˜๋ฉฐ, VLA ๋ชจ๋ธ์˜ ๋ชจ๋“  ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์™„์ „ํžˆ ํ•ด์ œ(unfreezing)ํ•˜์—ฌ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ๋ฅผ ํก์ˆ˜ํ•ฉ๋‹ˆ๋‹ค.
    2. 2๋‹จ๊ณ„ (์ •๋ ฌ๋œ ์ค‘๊ฐ„ ํ•™์Šต): ์ •๋ ฌ๋œ ์ธ๊ฐ„-๋กœ๋ด‡ ํ”Œ๋ ˆ์ด ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ 50,000 ์Šคํ… ๋™์•ˆ ํ•™์Šตํ•˜๋ฉฐ, vision-language backbone์€ ๊ณ ์ •ํ•˜๊ณ (freezing) vision encoder์™€ DiT action expert๋งŒ ์—…๋ฐ์ดํŠธํ•˜์—ฌ ํ‘œํ˜„์„ ๋กœ๋ด‡ ๊ฐ๊ฐ ๋ฐ ์ œ์–ด์— ๊ณ ์ •(anchor)์‹œํ‚ต๋‹ˆ๋‹ค.
    3. 3๋‹จ๊ณ„ (ํ›„์† ํ•™์Šต): ์ž‘์—…๋ณ„ ๋กœ๋ด‡ ์‹œ์—ฐ ๋ฐ์ดํ„ฐ๋กœ 10,000 ์Šคํ… ๋™์•ˆ ๋ฏธ์„ธ ์กฐ์ •(fine-tuning)ํ•ฉ๋‹ˆ๋‹ค. ์ค‘๊ฐ„ ํ•™์Šต์ด ์‚ฌ์šฉ๋œ ๊ฒฝ์šฐ vision encoder๋Š” ๊ณ ์ •ํ•˜๊ณ , ๊ทธ๋ ‡์ง€ ์•Š์€ ๊ฒฝ์šฐ ํ•ด์ œํ•˜์—ฌ ์ƒˆ๋กœ์šด ์‹ ์ฒด์— ์ ์‘ํ•ฉ๋‹ˆ๋‹ค.
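The three-stage recipe can be written down as a small configuration sketch. It only encodes what the list above states (step counts and which modules are held fixed); the module names and the `Stage` type are hypothetical, and anything the review does not specify, such as whether the vision-language backbone stays frozen in stage 3, is deliberately left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    steps: int
    frozen_modules: tuple  # modules held fixed in this stage (illustrative names)

def build_recipe(use_midtraining: bool = True) -> list:
    """Return the training stages in order, following the recipe described above."""
    stages = [Stage("human_pretraining", "egocentric human video", 100_000, ())]
    if use_midtraining:
        stages.append(Stage("aligned_midtraining", "aligned human-robot play data",
                            50_000, ("vl_backbone",)))
    # The vision encoder is frozen in post-training only if mid-training was used.
    post_frozen = ("vision_encoder",) if use_midtraining else ()
    stages.append(Stage("post_training", "task-specific robot demonstrations",
                        10_000, post_frozen))
    return stages

for stage in build_recipe():
    print(stage.name, stage.steps, stage.frozen_modules)
```

Keeping the frozen set explicit per stage makes the one conditional in the recipe (vision encoder frozen or not in stage 3) easy to see and to toggle.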

Experiments and Results:

  • RQ1: Effect of large-scale human pretraining: Human pretraining improved average task completion by more than 55% over training from scratch across all tasks. Large-scale, noisy, unconstrained human pretraining outperformed the mid-training-only baseline on most tasks. This suggests that the scale and diversity of human demonstrations provide strong inductive biases for dexterous manipulation. The best overall performance came from combining human pretraining with a small amount of aligned mid-training.
  • RQ2: Data Scaling Law: Increasing the amount of human pretraining data from 1k to 20k hours steadily raised average task completion from 0.30 to 0.71. During training, the model's action-prediction validation loss improved stably and monotonically as data scale grew. Between the best validation loss at convergence and data scale, a log-linear scaling law L = 0.024 - 0.003 \cdot \ln(D) (R^2 = 0.9983) was observed, and this offline scaling behavior correlated strongly with real-robot performance.
  • RQ3: One-shot Transfer and Generalization: Aligned human-robot mid-training enables one-shot transfer to previously unseen skills. On the 'Fold Shirt' and 'Unscrewing Water Bottles' tasks, which were absent from the mid-training data, the Pretrain + Midtrain model was given a single robot demonstration augmented with aligned human demonstrations and reached success rates of 0.88 on 'Fold Shirt' and 0.55 on 'Unscrewing Water Bottles'. This shows that mid-training enables effective generalization to new tasks through shared action structure.
  • RQ4: Cross-embodiment Transfer: Representations learned through human pretraining also transfer to robots with markedly different kinematics and a 7-DoF tri-finger hand, such as the Unitree G1. When G1 play data was included in mid-training, performance on the 'Pen in Bin' and 'Dish in Rack' tasks was substantially higher than when training on G1 data alone. This suggests that human pretraining provides a reusable, embodiment-independent motor prior.
  • RQ5: Hand Action Space Design: Representing human hand actions in the 22-DoF retargeted joint space gave the most consistent performance. The wrist-only representation performed poorly on tasks that require dexterous manipulation, and the fingertip-based representation was unstable because it can lead to infeasible joint configurations.

Conclusion:

This work demonstrates that human-to-robot transfer for dexterous manipulation is fundamentally a scaling phenomenon. Trained on more than 20,000 hours of egocentric human manipulation data, EgoScale reveals a clear log-linear scaling law between human action-prediction loss and data scale, and shows that this loss is a strong predictor of real-robot performance. A simple and effective transfer recipe, combining large-scale human pretraining with a small amount of aligned human-robot mid-training, enables strong long-horizon manipulation, emergent one-shot adaptation, and robust transfer across robot embodiments with markedly different kinematics. This points to a future in which humans can be treated as a truly scalable embodiment for learning general embodied intelligence.


🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

One-line summary:
The most scalable way to teach robots manipulation skills is not to collect more robot data but to use video of the hand motions humans have already performed for thousands of hours, and this approach follows a predictable scaling law.


Introduction: Why are robots still so bad with their hands?

The heart of the problem: there is no data

Think about it for a moment. Over your life you have picked up, turned, inserted, and folded things millions of times. Using chopsticks, putting a key in a car ignition, buttoning a shirt: all of these are motor knowledge that your body picked up without any formal instruction.

And robots? To hand this knowledge to a robot, someone has to demonstrate each skill by teleoperation. You sit in front of a robot arm, drive it with a joystick or a glove, and repeat the same motion tens or hundreds of times to accumulate data. It costs an enormous amount of time and money.

LLMs and computer vision could scale explosively thanks to the text and images overflowing on the internet. But robot manipulation? There is no internet-scale corpus of robot data. Large projects such as Open X-Embodiment have released thousands of hours of data, but even that is a drop in the bucket compared with the data scale language models train on.

There is a natural question to ask in order to break this fundamental bottleneck:

"Can human behavior data serve as the primary data source for robot policy learning?"

Limits of prior work

The idea itself is not new. There has been work on extracting affordances from human video, and on tracking hand keypoints and converting them into robot actions. But these efforts shared common limitations:

  • Small scale: most experiments used only hundreds to a few thousand hours of data
  • Restricted settings: methods specialized to particular tasks or environments
  • No high-DoF hand control: they handled only wrist motion, or were hard to apply to finger-level dexterous control
  • No confirmed scaling law: it was unclear whether adding more data actually improves performance

EgoScale takes on all of these limitations head-on.

EgoScale's core claims

Led by the NVIDIA GEAR team, this work delivers three key messages:

Important: EgoScale's three core claims
  1. A scaling law exists: there is a log-linear relationship between human data scale and validation loss, and this loss correlates strongly with real-robot performance.
  2. A transfer recipe: combining large-scale human pretraining with a small amount of aligned human-robot mid-training is effective.
  3. Embodiment-agnostic representations: representations learned from human data transfer to robots with different hand designs.

Methodology: dissecting the structure of EgoScale

Pipeline overview

EgoScale is organized as a three-stage training pipeline. You can think of it as similar to how a child learns a language: first listen to a huge amount of speech and absorb its patterns (pretraining), then practice speaking to adapt to actual utterance (mid-training), and finally learn the expressions that fit specific situations (post-training).

flowchart LR
    subgraph PRE["① Pre-training"]
        D1["20,854 hours\nof egocentric\nhuman video"]
        A1["Wrist 6-DoF +\n22-DoF hand joint\naction prediction"]
        D1 --> A1
    end

    subgraph MID["② Mid-training"]
        D2["Aligned\nhuman-robot\nplay data"]
        A2["Robot sensing/control\ndomain adaptation"]
        D2 --> A2
    end

    subgraph POST["③ Post-training"]
        D3["Small per-task\nrobot data\n(including 1-shot)"]
        A3["Downstream\ntask execution"]
        D3 --> A3
    end

    PRE -->|"Representation transfer"| MID
    MID -->|"Policy adaptation"| POST

    style PRE fill:#e8f4f8,stroke:#2196F3
    style MID fill:#e8f8e8,stroke:#4CAF50
    style POST fill:#fff3e0,stroke:#FF9800

The EgoScale three-stage training pipeline


Stage 1: Pretraining on large-scale human data

Dataset: 20,854 hours of hands

The dataset used in this work is more than 20 times larger than the data used so far in human-to-robot transfer research. It merges several public egocentric datasets, and action labels are generated by automatically extracting wrist pose and finger joint information from each video.

Table 1: Overview of the EgoScale pretraining data

| Item | Details |
| --- | --- |
| Total training hours | 20,854 hours |
| Relative to the largest prior scale | More than about 20x |
| Data type | Egocentric human manipulation video |
| Action labels | Wrist 6-DoF + 22-DoF hand joint angles |
| Acquisition method | Automatically applied hand tracking (Apple Vision Pro and others) |

Key design decision: which action representation to use?

This question has a large effect on the results. The EgoScale team compared three options:

Note: Comparing action representation options

Option 1: Wrist-only
Predict only the position and orientation of the wrist. The simplest choice, but it carries no finger-level dexterity information.

Option 2: Fingertip SE(3)
Predict the SE(3) trajectory of each fingertip and convert it to joint angles with an MLP. This is the EgoVLA approach.

Option 3: 22-DoF joint space (EgoScale default) ✓
Predict the 22 hand joint angles directly. The richest information, and directly compatible with robot hand retargeting.

In the experiments, the 22-DoF joint-space representation gave the best downstream performance. This makes intuitive sense: dexterous manipulation is only possible if the model learns the motion patterns of each individual finger.

๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜: Flow ๊ธฐ๋ฐ˜ VLA

EgoScale์˜ ๋ชจ๋ธ์€ VLM ๋ฐฑ๋ณธ + DiT(Diffusion Transformer) ์•ก์…˜ ์ „๋ฌธ๊ฐ€๋กœ ๊ตฌ์„ฑ๋œ flow-based VLA๋‹ค. ฯ€โ‚€(pi-zero)์—์„œ ์˜๊ฐ์„ ๋ฐ›์€ ์ด ๊ตฌ์กฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

flowchart TB
    subgraph INPUT["Inputs"]
        I1["Egocentric\n RGB image"]
        I2["Language instruction\n(Task description)"]
        I3["Current wrist/hand\nproprioceptive state"]
    end

    subgraph VLM["VLM backbone (Vision-Language Model)"]
        V1["Vision encoder\n(Visual Tokens)"]
        V2["Language encoder\n(Text Tokens)"]
        V3["Cross-attention\nfusion"]
    end

    subgraph ADAPT["Lightweight embodiment adapters"]
        AD1["Human\nproprioception embedding"]
        AD2["Robot\nproprioception embedding"]
    end

    subgraph EXPERT["DiT action expert"]
        E1["Noisy action input xₜ"]
        E2["Flow Matching\ndenoising"]
        E3["Predicted action output\n(wrist 6-DoF + 22-DoF hand)"]
    end

    I1 --> V1
    I2 --> V2
    I3 --> ADAPT
    V1 & V2 --> V3
    V3 & ADAPT --> E1
    E1 --> E2 --> E3

    style VLM fill:#e3f2fd,stroke:#1565C0
    style EXPERT fill:#fce4ec,stroke:#c62828
    style ADAPT fill:#f3e5f5,stroke:#6a1b9a

EgoScale model architecture

What is Flow Matching?
Put simply, the model learns a process that starts from an action scrambled with noise and gradually refines it into a real action. You can think of it as a relative of diffusion policy. Mathematically it can be written as:

\mathcal{L}_\text{flow} = \mathbb{E}_{t, x_0, x_1}\left[\|v_\theta(x_t, t, c) - (x_1 - x_0)\|^2\right]

Here x_t = (1-t)x_0 + tx_1 is the linear interpolation from noise x_0 to the real action x_1, v_\theta is the model that predicts the velocity field, and c is the language and vision context.
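The objective above can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the EgoScale implementation: actions are short lists of floats, `velocity_model` is any callable `(x_t, t, context) -> velocity`, and sampling uses simple Euler integration from t = 0 (noise) to t = 1 (action). All function names are hypothetical.

```python
import random

def interpolate(x0, x1, t):
    """x_t = (1 - t) * x0 + t * x1, the straight path from noise to action."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_model, actions, context, rng):
    """Monte Carlo estimate of E ||v(x_t, t, c) - (x1 - x0)||^2 over a batch."""
    total = 0.0
    for x1 in actions:
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]      # noise sample
        t = rng.random()                             # t in [0, 1)
        x_t = interpolate(x0, x1, t)
        target = [b - a for a, b in zip(x0, x1)]     # constant velocity x1 - x0
        pred = velocity_model(x_t, t, context)
        total += sum((p - g) ** 2 for p, g in zip(pred, target))
    return total / len(actions)

def sample(velocity_model, dim, context, rng, steps=10):
    """Euler-integrate dx/dt = v(x, t, c) from noise (t=0) toward an action (t=1)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for k in range(steps):
        t = k / steps
        v = velocity_model(x, t, context)
        x = [xi + vi / steps for xi, vi in zip(x, v)]
    return x
```

In a real system `velocity_model` would be the DiT action expert conditioned on the vision-language embedding; here it is left abstract so the loss and the sampler can be read on their own.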

Why use Diffusion/Flow?
High-DoF hand motion is inherently multimodal: in the same situation there are several valid finger configurations. Training with a plain MSE loss outputs the "average" of all these possibilities, which produces blurry, invalid motions. Flow matching can model this distribution properly.
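The averaging problem is easy to see on a toy case. The sketch below assumes a single joint with two equally valid target angles (the values are made up for illustration) and shows that the MSE-optimal constant prediction sits between the modes and matches neither.

```python
def best_constant_under_mse(samples):
    """The constant that minimizes mean squared error is the sample mean."""
    return sum(samples) / len(samples)

def mse(pred, samples):
    return sum((pred - s) ** 2 for s in samples) / len(samples)

# Two equally valid joint targets (hypothetical values, in radians).
modes = [-1.0, 1.0]
demos = modes * 50                         # 100 demonstrations, half per mode

pred = best_constant_under_mse(demos)      # lands between the two modes
gap = min(abs(pred - m) for m in modes)    # distance to the nearest valid mode
print(pred, gap, mse(pred, demos))
```

A generative objective such as flow matching instead learns to sample one of the modes, which is why it suits multimodal action distributions better than direct regression.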


Stage 2: Aligned human-robot mid-training

This is one of EgoScale's smartest ideas.

The pretrained model has learned to look at human hands and predict human actions. In deployment, however, the input is camera video of a robot arm with a robot hand, and the output must be robot joint commands. How do you bridge this gap?

The core idea of mid-training: paired data

A human performs a manipulation, and the robot performs the same manipulation as well. The two sets of data are then used together for training:

flowchart LR
    subgraph HUMAN["Human play data"]
        H1["Human wearing\nwrist/hand sensors"]
        H2["Egocentric camera"]
        H3["Human hand actions\n(22-DoF)"]
        H1 --> H2 & H3
    end

    subgraph ROBOT["Robot play data"]
        R1["Robot arm in the\nsame/similar environment"]
        R2["Robot-mounted camera"]
        R3["Robot joint commands\n(22-DoF retargeted)"]
        R1 --> R2 & R3
    end

    subgraph ALIGN["Alignment mapping"]
        A1["Shared action space\n(wrist pose + hand joints)"]
        A2["Domain adapter\ntraining"]
    end

    H3 & R3 --> A1 --> A2

    style HUMAN fill:#e8f5e9,stroke:#388e3c
    style ROBOT fill:#e3f2fd,stroke:#1976d2
    style ALIGN fill:#fff8e1,stroke:#f57f17

How aligned data is collected for mid-training

The important point is that the mid-training data is small. Specifically, the paper reports that about 100 human-demonstrated trajectories per object are enough. Compared with the more than 20,000 hours of pretraining data, this is a tiny amount.

Why does this work?
The pretrained representation already contains the rich structure of physical manipulation: how a hand approaches an object, applies force, and releases. Mid-training does not have to "rediscover" this structure. It only performs a "translation" that maps human sensing to robot sensing and human joints to robot joints.

Retargeting: human hand → robot hand
The human hand (22 DoF) and the robot hand differ kinematically. Optimization-based retargeting is used for this conversion:

\hat{q}_\text{robot} = \arg\min_{q} \sum_{i \in \text{fingertips}} \|f_i^\text{robot}(q) - f_i^\text{human}\|^2 + \lambda \|q\|^2

This optimization matches fingertip positions as closely as possible while keeping joint angles small. A perfect conversion is impossible, but mid-training compensates for this approximation error.
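To make the retargeting objective concrete, here is a toy version. It is a sketch under strong assumptions, not the paper's solver: the "finger" is a planar two-joint chain with unit link lengths, there is a single fingertip instead of a sum over fingertips, the gradient is taken by central differences, and joint limits are enforced by clamping after each gradient step. All names and constants are hypothetical.

```python
import math

def fingertip(q, l1=1.0, l2=1.0):
    """Forward kinematics of a planar two-joint finger (toy stand-in for f_i^robot)."""
    a, b = q
    return (l1 * math.cos(a) + l2 * math.cos(a + b),
            l1 * math.sin(a) + l2 * math.sin(a + b))

def objective(q, target, lam):
    """Fingertip mismatch plus lam * ||q||^2, as in the retargeting formula."""
    x, y = fingertip(q)
    return (x - target[0]) ** 2 + (y - target[1]) ** 2 + lam * sum(v * v for v in q)

def retarget(target, lam=1e-4, lr=0.05, iters=2000, limits=(0.0, math.pi / 2)):
    """Projected gradient descent on the objective, starting from a fixed guess."""
    q = [0.3, 0.3]
    h = 1e-6
    for _ in range(iters):
        grad = []
        for i in range(len(q)):
            hi, lo = list(q), list(q)
            hi[i] += h
            lo[i] -= h
            grad.append((objective(hi, target, lam) - objective(lo, target, lam)) / (2 * h))
        # Gradient step, then clamp each joint to its limits.
        q = [min(max(v - lr * g, limits[0]), limits[1]) for v, g in zip(q, grad)]
    return q

# Target taken from a known joint configuration so the result can be checked.
human_tip = fingertip((0.6, 0.8))
q_hat = retarget(human_tip)
```

The regularizer pulls the solution slightly toward small angles, so the recovered fingertip is close to the target rather than exactly on it; that residual is the kind of approximation error the text says mid-training absorbs.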


Stage 3: Downstream task post-training

In this stage the model is fine-tuned on a small amount of robot data for the task it will actually perform. The striking part is one-shot adaptation: a single robot demonstration is enough to adapt to a new task.

This is possible because 100 human demonstrations are provided in the mid-training stage. The robot experiences the task directly only once, and learns the rest by watching humans perform it in the same context. It is like a student who watches a teacher demonstrate 100 times and then tries once.


Scaling law: the paper's most important finding

Discovering the log-linear relationship

The EgoScale team measured validation loss while growing the data to 1k, 2k, 4k, 10k, and 20k hours. The result is remarkably clean:

\mathcal{L}_\text{val} = a \cdot \log(D) + b

Here D is the amount of training data (in hours of video), and a and b are constants. The coefficient of determination reported in the paper is R^2 = 0.9983, an almost perfect log-linear relationship.

xychart-beta
    title "Scaling law: data scale vs. validation loss (conceptual)"
    x-axis ["1k hrs", "2k hrs", "4k hrs", "10k hrs", "20k hrs"]
    y-axis "Validation Loss" 0.5 --> 2.5
    line [2.3, 1.9, 1.6, 1.2, 1.0]

Relationship between human data scale and validation loss (conceptual diagram)
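A log-linear fit of this kind is an ordinary least-squares line in ln(D). The sketch below shows the computation and the R^2 calculation. The data points are synthetic: they are generated from the fit quoted in the Ping Review (L = 0.024 - 0.003 ln D), not taken from the paper's measurements. Treating D in thousands of hours is an assumption made here so that the quoted coefficients give positive losses over the 1k to 20k range; the review does not state the unit.

```python
import math

def fit_log_linear(data_sizes, losses):
    """Least-squares fit of loss = a * ln(D) + b; returns (a, b, r_squared)."""
    xs = [math.log(d) for d in data_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    a = sxy / sxx
    b = mean_y - a * mean_x
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, losses))
    ss_tot = sum((y - mean_y) ** 2 for y in losses)
    return a, b, 1.0 - ss_res / ss_tot

# Synthetic points from the quoted fit (assumption: D in thousands of hours).
data_k_hours = [1, 2, 4, 10, 20]
synthetic_loss = [0.024 - 0.003 * math.log(d) for d in data_k_hours]

a, b, r2 = fit_log_linear(data_k_hours, synthetic_loss)
print(a, b, r2)
```

With real measurements the same function would return the fitted slope, intercept, and R^2; extrapolating the line beyond the measured range is a separate question the fit itself cannot answer.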

Why does this matter?

The existence of a scaling law means predictability. Thanks to the Chinchilla law, the LLM community became able to compute in advance that "this much compute yields this much performance." EgoScale shows this kind of predictable scaling relationship for the first time in robot manipulation.

More importantly, validation loss correlates strongly with real-robot performance. The performance here is not a simulated proxy metric but task completion in real-robot experiments. In other words, lowering the loss translates directly into the robot doing better.

Tip: Implications for researchers

This scaling law suggests that collecting more human data keeps improving performance, and there is no sign of saturation yet within the range studied. What happens at 100k hours, or 1M hours? The authors leave this as an open question.


Experiments: what was tested, and how

Experimental setup

Robot platform: a Unitree G1 humanoid robot fitted with a 22-DoF five-finger dexterous hand (some experiments also use robots with other hand designs)

Evaluation tasks, five difficult dexterous manipulation tasks:

Table 2: Summary of evaluation tasks

| Task | Description | Key difficulty |
| --- | --- | --- |
| Shirt Rolling | Roll a T-shirt into a cylinder and place it in a basket | Bimanual coordination, deformable object |
| Tong | Pick up an object with tongs | Tool use, precision grasp |
| Card Sorting | Sort cards | Thin objects, precise manipulation |
| Bottle | Unscrew/screw a bottle cap | Screwing motion, force control |
| Syringe | Operate a syringe | Extreme precision required |

Key result 1: the effect of pretraining

The most important ablation compares performance with and without pretraining:

xychart-beta
    title "Performance by training configuration (average task completion score)"
    x-axis ["No Pretrain", "Midtrain Only", "Human Pretrain", "Human Pretrain\n+ Midtrain"]
    y-axis "Task Completion Score (%)" 0 --> 100
    bar [20, 32, 51, 74]

Average task completion by training configuration (conceptual comparison)

Key number: the Human Pretrain + Midtrain combination achieves a 54% improvement in average success rate over the baseline without pretraining.

Key result 2: performance improves with scale

As the data grows from 1k to 20k hours, downstream robot performance increases monotonically. The small dataset (1k hours) shows signs of overfitting, while larger datasets show stable, monotonic gains.

This matters because there had been a concern that "beyond some point, adding more human data no longer helps." EgoScale shows that there is no saturation within the range explored (up to 20k hours).

Key result 3: comparing action representations

| Action representation | Average score | Notes |
| --- | --- | --- |
| Wrist-only | Low | No finger information |
| Fingertip SE(3) | Medium | EgoVLA approach |
| 22-DoF Joint (EgoScale) | Highest | Default setting |

The advantage of the 22-DoF joint representation shows how much finger-level fine control information matters for dexterous manipulation.

Key result 4: Cross-Embodiment Transfer

One of the more interesting experiments transfers the same pretrained model to a robot with a different hand design.

Robots other than the G1, with lower-DoF hands, show similar gains once mid-training is added. This suggests that pretraining learned the general structure of manipulation rather than "a representation that only fits a 22-DoF hand."

Key result 5: one-shot transfer

Results when only one robot demonstration per task is provided in post-training, after mid-training:

  • Prior approach (no pretraining): barely works with a single demonstration
  • EgoScale: achieves a meaningful success rate

Intuitively, the model has already learned "how to use the hand in this context" from the 100 human demonstrations, and a single robot demonstration is enough to fine-tune "how to execute the same principle with my own body."


Critical discussion: strengths, weaknesses, and open questions

Strengths

1. A new paradigm for data efficiency
The idea of bypassing the robot data collection bottleneck with human data is itself practical and scalable. The spread of wearable sensors such as Apple Vision Pro and Meta Aria glasses raises the potential of this direction further.

2. A validated scaling law (R^2 = 0.9983)
Showing quantitatively predictable scaling, rather than just "more is better," is an important signal for both the research community and investors.

3. Practicality of one-shot adaptation
If a new task can be learned from a single robot demonstration, the cost of field deployment drops dramatically. This has high practical value beyond being a performance metric.

4. Embodiment-agnostic representations
Working on both the Unitree G1 with a 22-DoF hand and robots with other designs means a single pretrained model can be reused across multiple hardware platforms.

Weaknesses and limitations

1. Cost of collecting mid-training data
Collecting the "aligned human-robot paired data" needed for mid-training still requires a robot. The approach does not learn entirely without robots. A finer-grained analysis is needed of how sensitive final performance is to the scale and quality of the mid-training data.

2. Limits of the wrist-centered representation
Accurately tracking hand joints in egocentric video is hard in itself. Noise is especially severe when the hand is occluded or moving fast. A deeper analysis is needed of how the quality of action labels in the training data affects the results.

3. Absence of tactile information
For delicate manipulation, such as picking up an egg without breaking it or picking up a thin card, tactile feedback is decisive. Egocentric video inherently provides only visual information, so this limitation is structural.

4. Scalability to bimanual manipulation is unconfirmed
The Shirt Rolling task does use both hands, but performance on tasks that need more complex bimanual coordination (for example, holding a lid with one hand and twisting with the other) has not yet been sufficiently validated.

5. Acknowledged limits in long-horizon planning
As the authors themselves acknowledge, long-horizon manipulation planning over dozens of steps remains an open challenge. Current results focus mainly on tasks with one or a few steps.

6. The data diversity vs. data quality trade-off
Merging large public datasets increases diversity, but the quality and label noise of each data source are hard to control. A more systematic analysis of data curation strategy is needed.


Comparison with Related Work

Lineage of Human-to-Robot Transfer Research

timeline
    title Research flow of robot learning from human data
    section Foundational work
        2022 : Ego4D (Facebook/Meta)
             : Large-scale egocentric video collection
             : No hand-pose labels
    section Hand tracking + robots
        2023-2024 : R3M, VIP
                  : Representation learning from human video
                  : Wrist/gripper-centric
    section Extension to dexterous hands
        2025 : EgoDex (Apple Vision Pro)
             : 829 hours, 194 tasks
             : 22-DoF hand tracking
        2025 : EgoVLA
             : Human VLA, IK+retargeting
             : Small-scale data
    section Large-scale scaling
        2026 : EgoScale (NVIDIA)
             : 20,854 hours
             : Scaling law discovered
             : One-shot adaptation

Development lineage of human-to-robot transfer research

Quantitative Comparison with Major Competing Work

Table 3: Comparison of related work

| Method | Data scale | DoF | Scaling law | One-shot | Cross-embodiment |
| --- | --- | --- | --- | --- | --- |
| EgoVLA | ~hundreds of hours | 6 DoF + fingertip | ✗ | ✗ | Limited |
| EgoDex | 829 hours | 22 DoF | Partial | ✗ | ✗ |
| In-N-On | ~1M episodes | Variable | ✗ | ✗ | Partial |
| EgoScale | 20,854 hours | 22 DoF | ✓ (R² = 0.9983) | ✓ | ✓ |

Relation to π₀ (pi-zero)

EgoScale's model architecture was inspired by Physical Intelligence's π₀. Where π₀ pursued generalization through diverse robot data, EgoScale is a complementary approach that takes human data as its main source of scale. The two directions complement each other rather than compete, and mixed training on human data plus robot data looks likely to become the strongest direction in the future.

Relation to GR00T N1

GR00T N1, another NVIDIA project, is a general-purpose VLA for humanoid robots. EgoScale complements the GR00T series: where GR00T focused on architecture and on integrating diverse robot data, EgoScale proposes the new direction of scaling human data. EgoScale's pretraining approach is likely to be folded into the pretraining stage of systems like GR00T.


Summary and Conclusion: What This Paper Tells the Robotics Community

Summary of Key Contributions

Note: Five Key Contributions of EgoScale
  1. Discovery of a scaling law: empirically confirms that a log-linear scaling law holds between human egocentric data and dexterous manipulation policy learning (R^2 = 0.9983)

  2. An effective transfer recipe: combining large-scale human pretraining with a small amount of aligned mid-training improves robot performance by 54% on average

  3. One-shot task adaptation: after mid-training, a new manipulation task can be performed from a single robot demonstration per task

  4. Embodiment-agnostic representation: the learned motor prior also transfers to robots with different hand designs

  5. Importance of the 22-DoF action representation: a joint-space representation at the finger level, not just the wrist, is decisive for dexterous manipulation
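
To make contribution 1 concrete, here is a minimal sketch of how a log-linear relation between pretraining hours and validation loss can be fitted and scored with R^2. The data points are made-up placeholders, not values from the paper, and `fit_log_linear` is a name introduced here purely for illustration.

```python
import numpy as np

def fit_log_linear(hours, val_loss):
    """Least-squares fit of val_loss ~ a + b * log10(hours).

    Returns (a, b, r2), where r2 is the coefficient of determination.
    """
    x = np.log10(np.asarray(hours, dtype=float))
    y = np.asarray(val_loss, dtype=float)
    b, a = np.polyfit(x, y, deg=1)  # polyfit returns slope first, then intercept
    pred = a + b * x
    ss_res = float(np.sum((y - pred) ** 2))      # residual sum of squares
    ss_tot = float(np.sum((y - y.mean()) ** 2))  # total sum of squares
    return float(a), float(b), 1.0 - ss_res / ss_tot

# Hypothetical (hours, validation loss) pairs, NOT taken from the paper.
hours = [1_000, 2_000, 5_000, 10_000, 20_000]
val_loss = [0.601, 0.569, 0.531, 0.499, 0.470]

a, b, r2 = fit_log_linear(hours, val_loss)
print(f"loss ~ {a:.3f} + ({b:.3f}) * log10(hours), R^2 = {r2:.4f}")
```

A negative slope `b` with R^2 close to 1 is what a log-linear scaling law looks like in this form: each tenfold increase in data lowers the loss by roughly the same amount.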

Future Directions This Work Opens Up

Data perspective:
If there is no saturation at 20k hours, what happens at 100k hours or 1M hours? If YouTube, security cameras in public places, and smartphone footage can all be used, data is effectively unlimited. The natural next step in this direction is learning from web-scale video.
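
As a back-of-the-envelope reading of that question, suppose the log-linear law simply kept holding beyond the fitted range. This is a strong assumption that the available data cannot confirm, and the symbols L, D, α, and β are notation introduced here, not the paper's.

```latex
\begin{align*}
L(D) &= \alpha - \beta \ln D, \qquad \beta > 0 \\
L(D_2) - L(D_1) &= -\beta \ln\frac{D_2}{D_1} \\
L(100\text{k}) - L(20\text{k}) &= -\beta \ln 5 \approx -1.61\,\beta \\
L(1\text{M}) - L(20\text{k}) &= -\beta \ln 50 \approx -3.91\,\beta
\end{align*}
```

Under this assumption every tenfold increase in data buys the same absolute loss reduction (β ln 10 ≈ 2.30 β), so the gains are steady per decade of data but shrink per added hour.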

Model perspective:
Beyond data scale, the interaction with model-capacity scaling is still unexplored territory. There is ample room to investigate what synergy emerges when a larger VLM backbone and more human data are scaled up together.

Tactile integration:
Going beyond the limits of vision-based learning requires multimodal human data collected with wearable tactile sensors. This direction is technically harder, but it could bring a qualitative leap in precision manipulation.

Complex bimanual manipulation:
Human bimanual coordination goes far beyond simple grasping. Extending EgoScale's approach to complex bimanual manipulation in cooking, repair, and manufacturing settings is an important challenge.

Practical Message for Roboticists

If you work on dexterous manipulation, EgoScale suggests the following:

  1. Rethink your data strategy: relying only on robot teleoperation has a fundamental scale limit. It is worth considering a pipeline that uses human data for pretraining.

  2. Invest in the hand action representation: the 22-DoF joint space is consistently superior to wrist/fingertip representations. Building a high-DoF hand tracking pipeline is worth the time.

  3. Check the scaling relationship first: when evaluating a new data source or architecture, first checking how validation loss correlates with real-world performance is an efficient way to spend research resources.

  4. If you work with the Allegro Hand or another high-DoF hand platform: it is well worth exploring whether EgoScale-style pretraining also works on your platform. The cross-embodiment transfer results send a hopeful signal about this possibility.
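
The check in point 3 can be sketched as a rank correlation between per-checkpoint validation loss and measured task success. The numbers below are hypothetical placeholders, not results from the paper, and the helper names `rank` and `spearman` are introduced here only for illustration (the ranking assumes no tied values).

```python
import numpy as np

def rank(values):
    """Return ranks 0..n-1 in ascending order of value (assumes no ties)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(x), rank(y))[0, 1])

# Hypothetical checkpoints: validation loss and real-robot success rate.
val_loss = [0.60, 0.57, 0.53, 0.50, 0.47]
success_rate = [0.22, 0.31, 0.30, 0.48, 0.55]

rho = spearman(val_loss, success_rate)
print(f"Spearman rho = {rho:.2f}")  # prints "Spearman rho = -0.90"
```

A strongly negative rho suggests that validation loss is a usable proxy for real performance, so cheaper offline sweeps can guide where to spend robot evaluation time. A weak correlation is a warning that the proxy should not be trusted for that data source or architecture.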


References

  • Zheng, R., Niu, D., Xie, Y., et al. (2026). EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data. arXiv:2602.16710. https://arxiv.org/abs/2602.16710
  • Black, K., et al. (2024). π₀: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence.
  • Hoque, R., et al. (2025). EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video. arXiv:2505.11709.
  • Yang, Z., et al. (2025). EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos. arXiv:2507.12440.
  • Bjorck, J., et al. (2025). GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. NVIDIA.
  • Grauman, K., et al. (2022). Ego4D: Around the World in 3,000 Hours of Egocentric Video. CVPR 2022.
  • O’Neill, J., et al. (2024). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. ICRA 2024.

Note: Paper Information

Title: EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data
Authors: Ruijie Zheng, Dantong Niu, Yuqi Xie*, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, Trevor Darrell, Furong Huang, Yuke Zhu†, Danfei Xu†, Linxi Fan†
Affiliations: NVIDIA GEAR, UC Berkeley, University of Maryland
arXiv: 2602.16710
Project page: https://research.nvidia.com/labs/gear/egoscale/
Submitted: February 18, 2026

Copyright 2026, JungYeon Lee