
📃 DeFM Review

depth
representation
ssl
Learning Foundation Representations from Depth for Robotics
Published

January 30, 2026

🔍 Ping. 🔔 Ring. ⛏️ Dig. A tiered review series: quick look, key ideas, deep dive.

  • Paper Link
  • Project
  • Code
  • huggingface.co/leggedrobotics/defm
  1. 🤔 DeFM is the first depth-only foundation model, pretrained with DINOv2-style self-distillation on a curated dataset of 60 million depth images for robotics applications.
  2. ✨ The model introduces a new 3-channel input normalization strategy that preserves metric awareness, and it is also distilled into smaller models such as ViT-S and CNNs for efficient deployment on robots.
  3. 🚀 DeFM achieves SOTA performance on classification, semantic segmentation, and a range of robot tasks such as navigation, manipulation, and locomotion, and shows strong sim-to-real transfer.


🔍 Ping Review

🔍 Ping: a light tap on the surface. Get the gist in seconds.

This paper points out that, despite the importance of depth images in robotics, there is no large-scale, general-purpose foundation model (FM) specialized for this modality. Existing approaches either reuse RGB-pretrained models on depth images or train task-specific encoders from scratch, which leads to limitations such as distribution mismatch and degraded generalization. To close this gap, the paper proposes DeFM (Depth Foundation Model), a depth-only FM trained with DINOv2-style self-supervised learning on a curated dataset of 60.4 million depth images.

Core methodology:

DeFM adapts the DINOv2 framework to the depth modality. It follows a self-distillation scheme in which a student network (f_s) is optimized to predict the output distribution of a momentum-updated teacher network (f_t). For an input depth image x, G large global crops (x_g) and L small local crops (x_l) are prepared with various geometric and photometric augmentations. The teacher processes the global crops to produce the target distribution p_t, while the student processes the local crops and the partially masked global crops (x'_g). Training uses three main loss functions:

  1. DINO global crop loss (\mathcal{L}_{Global}): aligns the student's representation of a partially masked global crop (x'_g) with the teacher's representation of an unmasked global crop (x_g). This is a DINO loss computed on the class token (cls token) features of the Vision Transformer (ViT): \mathcal{L}_{Global} = \sum_{i=1}^G \sum_{j=1, j \neq i}^G \mathcal{L}_{DINO}(f_s(x'_{g_i}), f_t(x_{g_j}))
  2. DINO local crop loss (\mathcal{L}_{Local}): aligns the student's representation of a local crop (x_l) with the teacher's representation of a global crop (x_g). It is also computed between cls tokens: \mathcal{L}_{Local} = \sum_{g=1}^G \sum_{l=1}^L \mathcal{L}_{DINO}(f_s(x_l), f_t(x_g))
  3. iBOT patch loss (\mathcal{L}_{iBOT}): essential for learning dense spatial features. For randomly masked input patches, a cross-entropy loss is applied between the student's prediction (p_{s_i}) and the teacher's target distribution for the same patch (p_{t_i}): \mathcal{L}_{iBOT} = - \sum_{i \in \text{masked}} p_{t_i} \log p_{s_i}

The total loss is a weighted sum of these three terms plus a KoLeo regularizer that prevents feature-space collapse.

The DeFM training dataset contains 60.4 million depth images in total. It mixes images converted from RGB datasets via monocular depth estimation (MDE), synthetic (simulation) data, and real sensor data. Together these cover the diversity, scale, and noise characteristics of depth data, so that the encoder generalizes robustly across a wide range of environments.

In particular, a new input normalization strategy is introduced to handle the wide scale range of depth images (from millimeters to hundreds of meters). Because near-range depth variation matters more for robot decision-making, the model uses a log-compressed depth representation made of the following three channels:

  1. Global log-scale depth (C_1): normalizes the log-compressed depth using the minimum (D_{\min}) and maximum (D_{\max}) depth in the current image, preserving relative geometric structure. The log transform is defined as \text{logp}(D) = \log(1+D): C_1 = \frac{\text{logp}(D) - \text{logp}(D_{\min})}{\text{logp}(D_{\max}) - \text{logp}(D_{\min})}

  2. Mid-range normalization (C_2): emphasizes the depth range most relevant to manipulation and indoor interaction: C_2 = \frac{\text{logp}(D)}{\text{logp}(10)}

  3. Far-range normalization (C_3): emphasizes the depth range suited to long-range navigation and outdoor scenes: C_3 = \frac{\text{logp}(D)}{\text{logp}(100)}

The final input stacks the three channels as X_{in} = [C_1, C_2, C_3], followed by global mean and standard-deviation normalization. This preserves global metric depth while retaining fine near-range structure and stable gradients.
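The three channels defined above can be sketched in a few lines of NumPy. This is only an illustration of the formulas as stated in this review: the function names are hypothetical and not taken from the DeFM codebase, the small epsilon guarding a constant-depth image is an assumption, and the final global mean/standard-deviation step is omitted because its statistics are not given here.

```python
import numpy as np

def logp(d):
    """Log compression as defined in the text: logp(D) = log(1 + D)."""
    return np.log1p(d)

def three_channel_depth(depth_m):
    """Stack C1, C2, C3 for a metric depth map of shape (H, W), in meters."""
    depth_m = np.asarray(depth_m, dtype=np.float64)
    lo, hi = logp(depth_m.min()), logp(depth_m.max())
    # C1: per-image min-max in log space (relative structure).
    # The epsilon for constant images is an assumption of this sketch.
    c1 = (logp(depth_m) - lo) / max(hi - lo, 1e-12)
    # C2 and C3: fixed metric references at 10 m and 100 m.
    c2 = logp(depth_m) / logp(10.0)
    c3 = logp(depth_m) / logp(100.0)
    return np.stack([c1, c2, c3], axis=0)

depth = np.array([[0.5, 1.0], [10.0, 100.0]])
x_in = three_channel_depth(depth)  # shape (3, 2, 2)
```

Because C_2 and C_3 divide by fixed constants, two scenes with the same relative layout but different absolute distances produce different inputs, which is what keeps the representation metric-aware.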

DeFM's largest model, ViT-L/14 (307 million parameters), was trained with an FSDP (Fully-Sharded Data Parallel) implementation. To account for the resource constraints of robot systems, DeFM-L/14 was then used as the teacher for knowledge distillation into small models with 3 to 30 million parameters, including ViT-S, ResNet, RegNet, and EfficientNet. In particular, a BiFPN (Bi-directional Feature Pyramid Network) is added on top of the CNN encoder to fuse feature maps at multiple resolutions, so that CNN students can effectively learn the dense spatial features of the ViT teacher.

Experimental results:

The robustness and generalization of DeFM are demonstrated through extensive experiments.

  • Qualitative evaluation: PCA (Principal Component Analysis) shows that features extracted by the DeFM-L/14 encoder form semantic clusters in depth images (e.g. cup handles) without any texture or color information. The clusters are consistent across sensor modalities, which suggests the model has learned priors useful for robot manipulation.
  • Classification: on the ImageNet-Depth-1K benchmark (generated via MDE), DeFM-L/14 achieves SOTA performance, surpassing state-of-the-art RGB-based FMs (DINOv2, DINOv3, C-RADIOv3). DeFM-S/14 in particular outperforms existing models in the same size class by up to 10%. The distilled small CNN models also outperform some larger RGB ViT-S based models.
  • Semantic segmentation: across diverse depth datasets, including ScanNet and SUN-RGBD (indoor), OFFSED and TartanGround (outdoor), and GraspNet-1B (manipulation), DeFM shows robust generalization and beats the existing baselines in most cases (up to 30% mIoU improvement at ViT-S size).
  • Robotics applications:
    • Navigation (Habitat Point-Goal Nav): DeFM-based models (DeFM-S/14, DeFM-ResNet-50) match or exceed the SPL (Success weighted by Path Length) of a ResNet-50 trained from scratch, demonstrating that DeFM is usable out of the box.
    • Navigation (Embodiment Aware Point-Goal Nav - Unitree B2W): on a real-world long-range navigation task with the Unitree B2W robot, policies built on the DeFM encoder achieve a higher success rate (SR) than a VAE (Variational Auto Encoder) baseline. DeFM is notably better at recognizing and avoiding OOD (Out-of-Distribution) obstacles, which is attributed to its better geometric and semantic understanding of the environment. Robust sim-to-real transfer is demonstrated in a variety of real environments.
    • Manipulation (Dexterous Grasping - KUKA-Allegro): on a dexterous arm-hand grasping task trained with a teacher-student paradigm, DeFM models (especially the fine-tuned version) record the highest success rate and are robust to a variety of noise models.
    • Locomotion (Quadrupedal Ladder Climbing - ANYmal): on a ladder-climbing task for a quadruped robot, the DeFM-based encoder reaches performance similar to a CNN baseline trained from scratch while requiring far fewer computational resources.

In conclusion, DeFM is the first large-scale self-supervised foundation model for depth images, learning robust and generalizable geometric and semantic features. It is immediately usable for a wide range of robot perception and control tasks, including classification, segmentation, navigation, locomotion, and manipulation, and it enables robust sim-to-real transfer in diverse real environments. The distilled small models show that it can be deployed effectively on resource-constrained robot systems. Proposed future work includes mitigating ViT architecture artifacts, expanding task diversity, applying the approach to LiDAR data, and continuing to scale the dataset and the model.


🔔 Ring Review

🔔 Ring: an idea that echoes. Grasp the core and its value.

Introduction: Why do depth images need a foundation model?

The heart of the problem

In robotics, depth sensors are taken for granted, almost like air. They ship on nearly every robot platform in one form or another: Intel RealSense, ZED cameras, LiDAR, and so on. But here is an interesting fact. In the RGB world, huge foundation models such as DINOv2, CLIP, and SAM are reshaping computer vision, yet no foundation model built solely for depth images existed.

Why is this a problem? Most depth-based robot systems today work in one of the following ways:

  1. Training from scratch: an encoder is trained from scratch for every new task
  2. Task-specific: separate encoders for navigation, manipulation, and locomotion
  3. Borrowing RGB models: an RGB foundation model such as DINOv2 is applied to depth images as-is (causing domain mismatch)

This is like asking a native English speaker to translate a Korean document. It will work, but capturing the nuances of Korean is hard.

Enter DeFM

The Robotic Systems Lab (RSL) at ETH Zurich took this problem head-on. DeFM (Depth Foundation Model) is the first depth-only foundation model for robotics, pretrained on 60 million depth images.

"TL;DR - A DINO-style encoder, but for depth image inputs."

Source: DeFM GitHub README

The core idea is simple yet powerful: the hypothesis that the scaling laws of self-supervised learning that worked for RGB will also hold for the depth modality.

Summary of contributions

The main contributions of DeFM are:

| Contribution | Description |
|---|---|
| First depth foundation model | Large-scale pretrained model trained on 60M depth images |
| Metric-aware normalization | New input normalization that preserves scale from millimeters up to 100 meters |
| Efficient model distillation | Models of various sizes, from 307M down to 3M parameters |
| General robotics benchmarks | SOTA on classification, segmentation, navigation, manipulation, and locomotion |
flowchart TB
    subgraph Dataset["🗂️ Dataset construction (60M depth images)"]
        A1[Simulation data] --> A3[Curated dataset]
        A2[Real sensor data] --> A3
    end

    subgraph Training["🔄 Self-distillation training"]
        B1[Teacher ViT-L/14] --> B2[Student ViT-L/14]
        B2 --> B1
        B3[DINO Loss + iBOT Loss]
    end

    subgraph Distillation["📦 Model distillation"]
        C1[DeFM ViT-L 307M] --> C2[ViT-S 22M]
        C1 --> C3[ResNet-18~50]
        C1 --> C4[EfficientNet B0~B6]
        C1 --> C5[RegNet 4~12M]
    end

    subgraph Applications["🤖 Robotics applications"]
        D1[Navigation]
        D2[Manipulation]
        D3[Locomotion]
        D4[Segmentation]
    end

    A3 --> Training
    Training --> Distillation
    Distillation --> Applications
Figure 1: Overview of the full DeFM pipeline

Methodology: How does DeFM understand depth?

Self-Distillation: learning by teaching yourself

DeFM's core training method is DINOv2-style self-distillation. A simple analogy helps to explain the idea.

Imagine you are an art teacher and a student at the same time. After sketching one landscape from various angles and at various sizes, your teacher self teaches your student self that "all of these different sketches depict the same landscape". Without any labels at all, you come to understand the essence of the landscape through a dialogue with yourself.

Structure of the training framework

flowchart LR
    subgraph Input["Input image"]
        I[Depth image]
    end

    subgraph Augment["Data augmentation"]
        I --> G1[Global Crop 1]
        I --> G2[Global Crop 2]
        I --> L1[Local Crop 1]
        I --> L2[Local Crop 2]
    end

    subgraph Teacher["Teacher Network<br/>(Momentum Update)"]
        G1 --> T[ViT-L/14]
        G2 --> T
    end

    subgraph Student["Student Network<br/>(Gradient Update)"]
        G1 --> S[ViT-L/14]
        G2 --> S
        L1 --> S
        L2 --> S
    end

    subgraph Loss["Loss functions"]
        T --> DINO[DINO Loss]
        S --> DINO
        T --> iBOT[iBOT Patch Loss]
        S --> iBOT
    end
Figure 2: DeFM's self-distillation training structure

Written as equations, the training process looks like this:

DINO loss (global consistency):

\mathcal{L}_{\text{DINO}} = -\sum_{x \in \{x_1^g, x_2^g\}} \sum_{x' \neq x} P_t(x) \log P_s(x')

Here P_t and P_s are the output probability distributions of the teacher and the student, respectively. The key point is that no gradient flows through the teacher's output (stop-gradient).
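For one teacher/student pair, the cross-entropy term inside this sum can be sketched in NumPy as follows. The temperatures `t_temp` and `s_temp` are assumptions borrowed from common DINO setups, not values stated in this review, and NumPy has no autograd, so the stop-gradient is only implied: in a deep-learning framework the teacher output would be detached before computing the loss.

```python
import numpy as np

def softmax(logits, temperature):
    """Temperature-scaled softmax over the last axis (numerically stable)."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_cross_entropy(teacher_logits, student_logits, t_temp=0.04, s_temp=0.1):
    """Mean over the batch of -sum_k P_t[k] * log P_s[k]."""
    p_t = softmax(teacher_logits, t_temp)        # fixed target, no gradient
    log_p_s = np.log(softmax(student_logits, s_temp))
    return float(-(p_t * log_p_s).sum(axis=-1).mean())

teacher = np.array([[2.0, 0.0, 0.0]])
loss_aligned = dino_cross_entropy(teacher, np.array([[2.0, 0.0, 0.0]]))
loss_misaligned = dino_cross_entropy(teacher, np.array([[0.0, 2.0, 0.0]]))
```

The loss is small when the student puts its probability mass where the teacher does, and grows when the two disagree.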

iBOT loss (patch-level learning):

\mathcal{L}_{\text{iBOT}} = -\sum_{i \in \mathcal{M}} P_t^{(i)} \log P_s^{(i)}

iBOT trains the student to predict the teacher's patch tokens for the masked patches \mathcal{M}. This is how the model comes to understand local geometric structure.

Teacher update (EMA):

\theta_t \leftarrow m \cdot \theta_t + (1-m) \cdot \theta_s

The teacher's parameters \theta_t are updated as an exponential moving average (EMA) of the student's parameters \theta_s. The momentum m is typically in the range 0.996 to 0.999.
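The EMA rule is easy to check numerically. The sketch below stores parameters in plain dictionaries of NumPy arrays, which is an assumption made for illustration; a real training loop would iterate over framework parameters instead.

```python
import numpy as np

def ema_update(teacher, student, m=0.996):
    """Apply theta_t <- m * theta_t + (1 - m) * theta_s to every parameter."""
    for name, theta_s in student.items():
        teacher[name] = m * teacher[name] + (1.0 - m) * theta_s
    return teacher

teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
for _ in range(100):
    ema_update(teacher, student)
# With a constant student, the teacher follows 1 - m**n after n steps.
```

With m close to 1 the teacher moves slowly, which is why it acts as a stable target for the student.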

Metric-Aware Input Normalization: preserving the scale of depth

One of DeFM's most original contributions is its three-channel log normalization strategy.

Why is a special normalization needed?

Consider the distinctive properties of depth images:

  1. Scale diversity: manipulation deals with depths of a few centimeters, navigation with tens of meters
  2. Dynamic range: the depth difference between near and far objects is extreme
  3. Metric information: actual distance is essential for robot control

What happens if we use ordinary min-max normalization?

# Ordinary min-max normalization
normalized = (depth - depth.min()) / (depth.max() - depth.min())

The problem with this approach is that metric information disappears completely. A manipulation scene spanning 1 meter and an outdoor scene spanning 100 meters are both compressed into the same [0, 1] range.
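A tiny numerical check makes the loss of metric information concrete. The two depth arrays are made-up values for illustration: the second is the first scaled by 100, so the scenes differ only in absolute distance.

```python
import numpy as np

def min_max(depth):
    """Per-image min-max normalization to [0, 1]."""
    return (depth - depth.min()) / (depth.max() - depth.min())

tabletop = np.array([0.2, 0.6, 1.0])   # meters, manipulation-like scene
outdoor = tabletop * 100.0             # 20 m, 60 m, 100 m

same_after_norm = np.allclose(min_max(tabletop), min_max(outdoor))
```

Both scenes normalize to the same values, so an encoder fed only this input cannot tell a tabletop from a street.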

DeFM's solution: three-channel log compression

DeFM applies log normalization at three different scales, one per channel:

\text{Channel}_k(d) = \text{clip}\left(\frac{\log(d + \epsilon) - \log(d_{\min}^{(k)})}{\log(d_{\max}^{(k)}) - \log(d_{\min}^{(k)})}, 0, 1\right)

| Channel | Range | Use |
|---|---|---|
| Channel 1 (Near-field) | 0.01 m ~ 1 m | Manipulation, nearby objects |
| Channel 2 (Mid-range) | 0.1 m ~ 10 m | Indoor navigation, locomotion |
| Channel 3 (Far-field) | 1 m ~ 100 m | Outdoor navigation, large-scale environments |
# Pseudocode for DeFM depth preprocessing
import numpy as np

def preprocess_depth_image(depth_meters, target_size=518, patch_size=14):
    """
    Convert a depth image into DeFM's 3-channel metric-aware format.

    Args:
        depth_meters: depth map in meters, shape (H, W)
        target_size: output image size (not applied in this sketch)
        patch_size: ViT patch size (not applied in this sketch)

    Returns:
        normalized_depth: normalized 3-channel depth, shape (3, H, W)
    """
    # Per-channel depth ranges in meters (normalized in log space)
    scales = [
        (0.01, 1.0),    # Near-field: 1cm ~ 1m
        (0.1, 10.0),    # Mid-range: 10cm ~ 10m
        (1.0, 100.0)    # Far-field: 1m ~ 100m
    ]

    # Log depth is shared by all channels; the epsilon avoids log(0)
    log_depth = np.log(np.asarray(depth_meters, dtype=np.float64) + 1e-6)

    channels = []
    for d_min, d_max in scales:
        log_min, log_max = np.log(d_min), np.log(d_max)
        normalized = (log_depth - log_min) / (log_max - log_min)
        normalized = np.clip(normalized, 0, 1)
        channels.append(normalized)

    return np.stack(channels, axis=0)

Let's look at the benefit of this approach visually:

Ordinary min-max normalization:
Near objects  ████████████████████░░░░░░░░░░ (takes up most of the range)
Far objects   ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░█ (almost indistinguishable)

Log normalization (DeFM):
Near objects  ████████░░░░░░░░░░░░░░░░░░░░░░ (reasonable share)
Far objects   ░░░░░░░░░░░░░░░░░████████████░ (sufficient resolution)

The log scale also resembles human depth perception. We too distinguish small differences well at close range, but only notice large differences far away.

Knowledge distillation into CNNs: balancing efficiency and performance

Running a 307M-parameter ViT-L model in real time on a robot is hard in practice. On an NVIDIA Jetson Orin, ViT-L/14 inference takes 72.82 ms, which is unsuitable for real-time control.

DeFM solves this with teacher-student distillation:

flowchart TB
    subgraph Teacher["Teacher (Frozen)"]
        T1[DeFM ViT-L/14] --> T2[Spatial Tokens]
        T1 --> T3[Class Token]
    end

    subgraph Student["Student (Trainable)"]
        S1[CNN Encoder<br/>ResNet/EfficientNet] --> S2[BiFPN Module]
        S2 --> S3[Dense Features]
        S1 --> S4[Global Pool]
    end

    subgraph Loss["Distillation loss"]
        T2 --> L1[Spatial Distillation]
        S3 --> L1
        T3 --> L2[Global Distillation]
        S4 --> L2
    end
Figure 3: CNN distillation framework

The distillation loss has two components:

Spatial token distillation: \mathcal{L}_{\text{spatial}} = \frac{1}{HW}\sum_{i,j} \|f_s^{(i,j)} - f_t^{(i,j)}\|_2

Global token distillation: \mathcal{L}_{\text{global}} = \|g_s - g_t\|_2

Total distillation loss: \mathcal{L}_{\text{distill}} = \lambda_s \mathcal{L}_{\text{spatial}} + \lambda_g \mathcal{L}_{\text{global}}

The BiFPN (Bidirectional Feature Pyramid Network) module aligns the CNN's multi-scale features with the spatial tokens of the teacher ViT. This lets even the 3M-parameter EfficientNet-B0 inherit part of ViT-L's representational power.
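The two loss terms above can be written directly in NumPy. This is a sketch under stated assumptions: the student features are taken to be already resized and projected to the teacher's grid and channel width, and the default weights `lam_s = lam_g = 1.0` are placeholders, since the actual values are not given in this review.

```python
import numpy as np

def distill_loss(f_s, f_t, g_s, g_t, lam_s=1.0, lam_g=1.0):
    """Spatial + global distillation loss.

    f_s, f_t: student/teacher spatial features, shape (H, W, C)
    g_s, g_t: student/teacher global features, shape (C,)
    """
    # (1 / HW) * sum over locations of the per-location L2 distance
    spatial = np.linalg.norm(f_s - f_t, axis=-1).mean()
    # L2 distance between the global (class-token-like) features
    global_ = np.linalg.norm(g_s - g_t)
    return lam_s * spatial + lam_g * global_

f_t, f_s = np.zeros((2, 2, 4)), np.ones((2, 2, 4))
g_t, g_s = np.zeros(2), np.array([3.0, 4.0])
loss = distill_loss(f_s, f_t, g_s, g_t)  # 2.0 (spatial) + 5.0 (global)
```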

Model Zoo: a range of options

DeFM provides 11 model variants in total for different deployment scenarios:

| Model | Parameters | Jetson Orin latency (ms) | ImageNet-1k-Depth Top-5 KNN |
|---|---|---|---|
| DeFM ViT-L/14 | 307M | 72.82 | 84.79% |
| DeFM ViT-S/14 | 22.1M | 11.92 | 78.06% |
| DeFM ResNet-50 | 26.2M | 17.79 | 77.63% |
| DeFM ResNet-34 | 21.8M | 13.54 | 72.72% |
| DeFM ResNet-18 | 11.7M | 8.67 | 69.69% |
| DeFM EfficientNet-B6 | 29M | 54.11 | 77.81% |
| DeFM EfficientNet-B0 | 3M | 21.04 | 67.98% |
| DeFM RegNetY-1.6GF | 12.4M | 41.82 | 76.21% |
| DeFM RegNetY-400MF | 4.1M | 25.17 | 72.87% |

Selection guide:

  • Best performance: DeFM ViT-L/14 (offline analysis, high-performance servers)
  • Balanced: DeFM ResNet-50 or EfficientNet-B4 (Jetson Orin-class edge devices)
  • Ultra-lightweight (fewest parameters): DeFM EfficientNet-B0 (embedded systems, battery-limited robots)
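If latency rather than parameter count is the hard constraint, the table can be queried directly. The helper below is a hypothetical convenience function, not part of DeFM; it only encodes the rows listed above and picks the most accurate model that fits a given Jetson Orin latency budget.

```python
# (name, parameters in millions, Jetson Orin latency in ms, top-5 kNN accuracy in %)
MODEL_ZOO = [
    ("DeFM ViT-L/14", 307.0, 72.82, 84.79),
    ("DeFM ViT-S/14", 22.1, 11.92, 78.06),
    ("DeFM ResNet-50", 26.2, 17.79, 77.63),
    ("DeFM ResNet-34", 21.8, 13.54, 72.72),
    ("DeFM ResNet-18", 11.7, 8.67, 69.69),
    ("DeFM EfficientNet-B6", 29.0, 54.11, 77.81),
    ("DeFM EfficientNet-B0", 3.0, 21.04, 67.98),
    ("DeFM RegNetY-1.6GF", 12.4, 41.82, 76.21),
    ("DeFM RegNetY-400MF", 4.1, 25.17, 72.87),
]

def best_under_budget(zoo, max_latency_ms):
    """Return the most accurate entry whose latency fits the budget, or None."""
    candidates = [m for m in zoo if m[2] <= max_latency_ms]
    return max(candidates, key=lambda m: m[3]) if candidates else None

choice = best_under_budget(MODEL_ZOO, max_latency_ms=20.0)
```

Note that, per the table, the model with the fewest parameters (EfficientNet-B0) is not the fastest one on Jetson Orin, so it is worth checking both columns before choosing.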

Experiments and results: Does DeFM really work?

Dataset: what makes up the 60 million depth images

DeFM was trained on 60M depth images curated from a variety of sources:

pie title Dataset source distribution (estimated)
    "TartanAir (simulation)" : 35
    "Hypersim (simulation)" : 20
    "ScanNet (real)" : 15
    "Isaac Sim custom" : 20
    "Other sources" : 10
Figure 4: Composition of the DeFM training dataset

What stands out is the high share of simulation data. This is possible thanks to the distinctive properties of depth images:

  1. Texture invariance: unlike RGB, depth does not depend on texture, so the sim-to-real gap is small
  2. Accurate ground truth: simulation provides perfect depth labels
  3. Diverse environments: environments that are hard to access in reality (ladder climbing, hazardous areas) can be simulated

The emergence of semantic clustering: understanding meaning from depth alone

One of DeFM's most surprising results is that semantic features are learned without any color or texture.

The PCA visualization experiments show the following:

  • Cup handles: across cups captured with several different sensors (RealSense L515, D435i, ZED 2i, ZED X), the handle region is consistently clustered into the same color (yellow)
  • Drawer handles: drawer and cabinet handles on various pieces of furniture are highlighted automatically
  • Robot arm: in a kitchen scene, the robot arm, the countertop, the background, and the objects to manipulate are clearly separated

This suggests that the functional parts of objects (affordances) can be learned from the geometric features of depth images alone. After all, a handle has a distinctive 3D shape of its own.
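A PCA visualization of this kind is typically produced by projecting per-patch features onto their top three principal components and displaying them as RGB. The sketch below shows that projection with NumPy; random features stand in for encoder output, and the 37 x 37 grid assumes a 518-pixel input with patch size 14, so both are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def pca_rgb(patch_features):
    """Map (N, C) patch features to (N, 3) values in [0, 1] via PCA."""
    x = patch_features - patch_features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:3].T
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / np.maximum(hi - lo, 1e-12)

rng = np.random.default_rng(0)
feats = rng.normal(size=(37 * 37, 64))      # placeholder for encoder features
colors = pca_rgb(feats).reshape(37, 37, 3)  # one RGB value per patch
```

Patches that end up with similar colors have similar features, which is how regions such as handles show up as consistent clusters.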

Benchmark 1: Classification and segmentation

ImageNet-1k-Depth classification (linear probing):

| Model | Parameters | Top-1 | Top-5 |
|---|---|---|---|
| DINOv3 ViT-L/16 (RGB→Depth) | 307M | 58.2% | 81.3% |
| Scratch ResNet-50 | 26M | 42.1% | 65.7% |
| DeFM ViT-L/14 | 307M | 71.72% | 84.79% |
| DeFM ViT-S/14 | 22M | 61.54% | 78.06% |

DeFM achieves a Top-1 accuracy about 13.5 percentage points higher (71.72% vs. 58.2%) than applying an RGB foundation model (DINOv3) directly to depth.

Semantic segmentation (mIoU, linear probing):

| Dataset | DINOv3 ViT-L/16 | DeFM ViT-L/14 | Improvement |
|---|---|---|---|
| ScanNet | 28.52 | 31.34 | +2.82 |
| SUN-RGBD | 32.74 | 31.26 | -1.48 |
| OFFSED (outdoor) | 54.42 | 57.62 | +3.20 |
| TartanGround | 62.16 | 67.69 | +5.53 |
| GraspNet-1B | 23.89 | 27.85 | +3.96 |

DeFM achieves SOTA on four of the five datasets. The improvement is especially pronounced at ViT-S size: DeFM-S/14 shows up to 30% higher mIoU than DINOv3-S/16.

Benchmark 2: Robot navigation

The real value of DeFM lies in applying it directly to robotics tasks.

Experimental setup:
  • Platform: ANYmal quadruped robot
  • Environment: Isaac Lab simulation → real world (sim-to-real)
  • Task: 100-meter waypoint navigation
  • Baselines: VAE encoder, DINOv3 features

Results:

| Encoder | Training mode | Success rate (simulation) | Success rate (real world) |
|---|---|---|---|
| VAE (Baseline) | Scratch | 78% | 65% |
| DINOv3 ViT-L | Frozen | 82% | 71% |
| DeFM RegNet | Frozen | 89% | 85% |

Distinctive advantages of DeFM:

  1. Avoiding unusual obstacles: better recognition of "rare" obstacles such as street lamps, traffic signs, and fences
  2. Smaller sim-to-real gap: transfer from simulation to the real world is more stable
  3. No task-specific preprocessing: works without hand-crafted pipelines such as elevation maps

Benchmark 3: Dexterous manipulation

Experimental setup:
  • Platform: Kuka arm + Allegro Hand V4
  • Environment: Isaac Lab (256 parallel environments × 8 GPUs)
  • Task: DextrAH-style precision grasping
  • Noise models: speckle, dropout, stick noise, Kinect noise model

Results:

| Encoder | Training mode | Success rate (Sim) | Success rate (Kinect noise) |
|---|---|---|---|
| ImageNet ResNet-18 | Frozen | 45.2% | 38.1% |
| DINOv3 ResNet-18 (distilled) | Frozen | 52.1% | 44.3% |
| Scratch CNN | Full train | 61.8% | 51.2% |
| DeFM ResNet-18 | Frozen | 67.3% | 58.9% |
| DeFM ResNet-18 | Fine-tuned | 76.1% | 68.4% |

Key findings:

  • Frozen DeFM exceeds all frozen baselines by 23%: strong performance even without task-specific fine-tuning
  • Fine-tuned DeFM exceeds all baselines by 9%: fine-tuning brings an additional gain
  • Noise robustness: the performance drop under the Kinect noise model is relatively small

Benchmark 4: Ladder-climbing locomotion

The most challenging experiment is ladder climbing with a quadruped robot.

Experimental setup:
  • Platform: ANYmal quadruped robot
  • Task: climbing an industrial ladder (perceptive locomotion)
  • Baselines: VAE encoder, scratch CNN

Results:

| Encoder | Training mode | Climbing success rate |
|---|---|---|
| VAE Baseline | Scratch | 85.3% |
| Scratch CNN | Scratch | 90.45% |
| DeFM RegNet | Frozen | 90.45% |

What matters is that frozen DeFM reaches the same performance as training from scratch. This means:

  1. DeFM features are rich enough even without task-specific training
  2. A large advantage in training time and data efficiency
  3. The potential for fast transfer to new locomotion tasks

PCA visualizations from the real world show that the ladder structure is clustered consistently despite heavy sensor noise.


Critical discussion: strengths, weaknesses, and limitations

Strengths

1. A true "drop-in replacement"

DeFM's biggest advantage is that it can be integrated into an existing pipeline with minimal changes.

# Existing code
encoder = ResNet18(pretrained_imagenet=True)
features = encoder(rgb_image)

# Replace with DeFM (only two lines change)
import torch
encoder = torch.hub.load('leggedrobotics/defm:main', 'defm_resnet18', pretrained=True)
depth_normalized = preprocess_depth_image(depth_meters)
features = encoder(depth_normalized)

2. Sensor-agnostic

DeFM learns consistent representations across a variety of depth sensors:
  • Active stereo (RealSense D4xx)
  • Time-of-flight (RealSense L5xx)
  • Stereo matching (ZED)
  • LiDAR projection

This means it can be used without retraining even when the sensor is swapped.

3. Natural sim-to-real transfer

Thanks to the intrinsic properties of depth images (texture invariance, illumination invariance), features learned in simulation transfer well to the real world.

4. Open source with a complete model zoo

# 11 ready-to-use models
# Install first (shell): pip install torch torchvision huggingface_hub
import torch
model = torch.hub.load('leggedrobotics/defm:main', 'defm_vit_l14')

Weaknesses and limitations

1. ViT artifacts

As the paper acknowledges, visual artifacts caused by inherent limitations of the ViT architecture occasionally appear. This can be a problem for high-resolution segmentation in particular.

Possible remedy: adopt register tokens (introduced in follow-up work on DINOv2)

2. Limited real-world experiments

"Our real-world experiments, though promising, are currently limited in terms of task diversity due to hardware constraints."

Real-world experiments are currently confined to specific platforms (ANYmal, Kuka + Allegro).

3. Uncertainty in handling dynamic objects

DeFM has been evaluated mainly in static environments. Its performance in dynamic environments with fast-moving people or vehicles needs further validation.

4. Hyperparameters of the metric normalization

The ranges of the three-channel normalization (0.01-1 m, 0.1-10 m, 1-100 m) may not be optimal for some applications, for example:
  • Micro-manipulation (millimeter scale)
  • Autonomous driving (ranges of hundreds of meters)

Future research directions

flowchart LR
    ROOT((DeFM evolution))

    ROOT --> A[Architecture improvements]
    A --> A1[Register tokens]
    A --> A2[Mamba/State Space adaptation]
    A --> A3[Multi-scale attention]

    ROOT --> B[Multimodal extension]
    B --> B1[RGB-D fusion]
    B --> B2[Tactile integration]
    B --> B3[Point cloud integration]

    ROOT --> C[Broader applications]
    C --> C1[Underwater robots]
    C --> C2[Aerial drones]
    C --> C3[Surgical robots]

    ROOT --> D[Efficiency]
    D --> D1[INT8 quantization]
    D --> D2[Pruning]
    D --> D3[Mobile optimization]
Figure 5: Future research directions building on DeFM

Comparison with Related Work

RGB Foundation Models

| Model | Training approach | Depth support | Robotics-specific |
|---|---|---|---|
| DINOv2 | Self-distillation | RGB only | No |
| DINOv3 | Self-distillation + Gram | RGB only | No |
| CLIP | Contrastive (Vision-Language) | RGB only | No |
| SAM | Segmentation-specific | RGB only | No |
| DeFM | Self-distillation | Depth only | Yes |

Depth-Related Models

| Model | Purpose | Input direction | Difference from DeFM |
|---|---|---|---|
| Depth Anything | RGB→Depth estimation | RGB input | DeFM encodes depth directly |
| MiDaS | Relative depth estimation | RGB input | DeFM preserves metric depth |
| ZoeDepth | Metric depth estimation | RGB input | DeFM uses depth sensor data |

Robotics VFMs

| Model | Characteristic | Relation to DeFM |
|---|---|---|
| Theia | Distillation of multiple VFMs | Complementary to DeFM (depth expertise) |
| VC-1 | General-purpose robot vision | DeFM performs better thanks to its depth specialization |
| R3M | Temporal contrastive learning | Can be combined with DeFM (RGB-D) |

Practical Guide: Using DeFM

Installation and Basic Usage

# Install dependencies (run in a shell, not in Python):
#   pip install torch torchvision numpy huggingface_hub omegaconf

# Load via PyTorch Hub
import torch
from defm import preprocess_depth_image

# Load the model
model = torch.hub.load('leggedrobotics/defm:main', 'defm_vit_l14', pretrained=True)
model.eval().to("cuda")

# Preprocess depth (input must be in meters)
depth_meters = load_depth_sensor_data()  # placeholder: replace with your own depth loader
normalized_depth = preprocess_depth_image(depth_meters, target_size=518, patch_size=14)

# Extract features
with torch.no_grad():
    output = model.get_intermediate_layers(
        normalized_depth.to("cuda"),
        n=1,
        reshape=True,
        return_class_token=True
    )

spatial_features = output[0][0]  # (B, C, H', W') - spatial features
class_token = output[0][1]       # (B, C) - global feature
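
The shapes in the comments above follow from simple arithmetic: with `target_size=518` and `patch_size=14`, the patch grid is 37 × 37. The sketch below uses a NumPy array as a stand-in for `spatial_features` to show how such a map could be flattened into tokens or pooled into one vector per image. The channel width of 1024 is an assumption (the usual ViT-L width), not something the text confirms for DeFM.

```python
import numpy as np

target_size, patch_size = 518, 14          # values from the call above
assert target_size % patch_size == 0
grid = target_size // patch_size           # patches per side

batch, channels = 2, 1024                  # channels=1024 is an assumed ViT-L width
spatial = np.zeros((batch, channels, grid, grid), dtype=np.float32)  # stand-in array

# Flatten the map into a token sequence: (B, C, H', W') -> (B, H'*W', C)
tokens = spatial.reshape(batch, channels, grid * grid).transpose(0, 2, 1)
# Global average pooling gives one vector per image: (B, C)
pooled = spatial.mean(axis=(2, 3))
print(grid, tokens.shape, pooled.shape)  # 37 (2, 1369, 1024) (2, 1024)
```

A pooled vector like this is one option for feeding a small policy head. The token sequence suits heads that need spatial detail.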

Integration with RL Policy Training

# Isaac Lab-style policy training example
import torch
import torch.nn as nn
from defm import preprocess_depth_image

class DepthPolicyNetwork(nn.Module):
    def __init__(self, action_dim):
        super().__init__()
        # DeFM encoder (frozen)
        self.encoder = torch.hub.load(
            'leggedrobotics/defm:main',
            'defm_resnet18',
            pretrained=True
        )
        for param in self.encoder.parameters():
            param.requires_grad = False
        self.encoder.eval()  # also keep BatchNorm statistics fixed

        # Policy head (512 assumes the encoder returns a 512-dim feature vector)
        self.policy_head = nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim)
        )

    def forward(self, depth_meters):
        # Depth must be in meters before normalization
        depth_norm = preprocess_depth_image(depth_meters)
        with torch.no_grad():
            features = self.encoder(depth_norm)
        return self.policy_head(features)

Performance Optimization Tips

  1. Batching: process several depth images together as a batch whenever possible
  2. Mixed precision: run FP16 inference with torch.autocast (torch.cuda.amp.autocast in older PyTorch releases)
  3. TensorRT conversion: for production deployment, converting to TensorRT can provide further optimization
# Mixed precision example
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    features = model(normalized_depth.to("cuda"))

Conclusion

DeFM supplies a puzzle piece that has long been missing in the robotics community.

Key Takeaways

  1. The depth modality can also benefit from foundation models
    • Self-distillation, which succeeded on RGB, also applies to depth
    • Semantic features can be learned without color or texture
  2. Metric-aware normalization is key
    • Three-channel log normalization preserves metric information across scales
    • A single model covers the range from millimeters to 100 meters
  3. Practicality has been validated
    • Consistent performance gains in navigation, manipulation, and locomotion
    • Strong sim-to-real transfer with frozen features alone
  4. Accessibility is high
    • Open source, with a range of model sizes
    • Easy to integrate into existing pipelines

DeFM is not “just another paper.” It is a tool. In your next project:

  • Training a depth encoder from scratch? → Start from DeFM as a pretrained model
  • Forcing an RGB foundation model onto depth? → DeFM is the better alternative
  • Struggling with the sim-to-real gap? → Use DeFM's domain generalization

Depth sensing will remain a core modality in robotics. As the first foundation model for it, DeFM is likely to become the starting point for much follow-up research.


References

  • Patel, M., Frey, J., Mittal, M., Yang, F., Hansson, A., Bar, A., Cadena, C., & Hutter, M. (2026). DeFM: Learning Foundation Representations from Depth for Robotics. arXiv preprint arXiv:2601.18923.
  • Oquab, M., et al. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193.
  • Siméoni, O., et al. (2025). DINOv3. arXiv preprint.
  • Wang, W., et al. (2020). TartanAir: A Dataset to Push the Limits of Visual SLAM. IROS 2020.
  • Roberts, M., et al. (2021). Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding. ICCV 2021.

⛏️ Dig Review

⛏️ Dig: Go deep, uncover the layers. Dive into technical detail.

🌟 Introduction: Why a Foundation Model for Depth?

For a robot to understand a real environment and turn that understanding into action (decision-making), it must convert sensory input into a consistent, generalizable representation. In the image domain, RGB-based Vision Foundation Models (VFMs) are the mainstream. These models, however, are sensitive to lighting, color, and texture, so RGB-based representations are hard to apply as-is in the environments where robots actually operate.

Depth images, by contrast, have the following strengths:

  • Robust to changes in lighting and color
  • Distance and structural information is built in
  • Can narrow the gap between simulation and reality (sim-to-real)

Even so, there have been almost no large-scale representations learned from depth alone. Existing approaches had to either force depth into an RGB-pretrained model or train an encoder that serves only one specific task.

DeFM (Depth Foundation Model) is a large-scale, depth-only, self-supervised foundation model proposed to fill this gap.

📌 Key Contributions

The main contributions of the DeFM paper for roboticists are:

  1. The first self-supervised VFM built on depth alone
  2. A depth image dataset of 60 million images
  3. A strategy that preserves metric awareness in the depth representation
  4. Lightweight distilled model architectures
  5. State-of-the-art results across classification, segmentation, navigation, locomotion, and manipulation

Using large-scale data and self-distillation, the paper shows that a representation capturing both geometric and semantic information can be learned from depth images as well.


🧠 Method: What Does DeFM Learn from Depth?

DeFM's design philosophy can be described along two axes:

  1. Self-Supervised Learning (SSL)
  2. Depth-Specific Input Normalization

1) Self-Distillation-Based Foundation Pretraining

DeFM uses a DINO-style training objective from the self-distillation family. The intuition is as follows.

🧠 Self-Distillation

There are two networks:

  • 👩‍🏫 Teacher: updated by momentum (a moving average of the student) and provides stable target features
  • 👨‍🎓 Student: trained to match the teacher's output as closely as possible

The two models receive different augmented views of the same depth image, and the student learns to imitate the teacher's feature distribution.

This kind of self-distillation has already proven very effective for RGB representation learning (e.g., DINOv2).
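
The teacher-student loop described above can be sketched in a few lines. This is a minimal NumPy illustration of a DINO-style objective, not DeFM's training code. The temperatures (0.1, 0.04) and momentum (0.996) are illustrative values borrowed from common DINO settings. The function names are hypothetical.

```python
import numpy as np

def log_softmax(logits, temperature):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def self_distillation_loss(student_logits, teacher_logits, center,
                           student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution and the student."""
    # The teacher output is a fixed target: no gradient would flow through it
    teacher_probs = np.exp(log_softmax(teacher_logits - center, teacher_temp))
    student_log_probs = log_softmax(student_logits, student_temp)
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow the student as an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

rng = np.random.default_rng(0)
student_out = rng.normal(size=(4, 8))  # student logits for one augmented view
teacher_out = rng.normal(size=(4, 8))  # teacher logits for another view of the same images
center = teacher_out.mean(axis=0)      # DINO keeps a running center; one batch mean here
loss = self_distillation_loss(student_out, teacher_out, center)

teacher_w = ema_update([np.zeros(3)], [np.ones(3)])
print(loss, teacher_w[0])
```

Only the student would be updated by gradient descent on this loss. The teacher changes solely through `ema_update`, which keeps its targets stable.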

🧠 2) Input Normalization

That depth carries absolute distance values is a major advantage, but scale and distribution can differ from sensor to sensor. To feed the model a unified input, DeFM adopts the following strategy:

Normalize the distribution of depth distances with global statistics (for example, mean and variance) to reduce scale mismatch across sensors, environments, and simulators. The Ring Review above gives the concrete form as a three-channel log normalization over the ranges 0.01-1 m, 0.1-10 m, and 1-100 m.

This makes the model less sensitive to the conditions under which the depth was captured.

📌 Distillation and Lightweight Models

DeFM is first trained as a large Vision Transformer (ViT).

It is then distilled into lightweight models that are practical on robot hardware:

  • Small ConvNet-based networks (ResNet, EfficientNet, RegNet)
  • A compact ViT-Small

This enables fast inference within a robot's compute budget.
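
The text does not specify the distillation objective, so the following is only a generic feature-distillation sketch. It assumes the compact student is trained so that its projected features align with frozen teacher features under a cosine loss. The feature widths (1024 and 512) and the function name are hypothetical.

```python
import numpy as np

def cosine_distillation_loss(student_feats, teacher_feats, projection):
    """1 - mean cosine similarity between projected student and teacher features."""
    projected = student_feats @ projection  # (B, Ds) @ (Ds, Dt) -> (B, Dt)
    p = projected / np.linalg.norm(projected, axis=1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=1, keepdims=True)
    return float(1.0 - (p * t).sum(axis=1).mean())

rng = np.random.default_rng(0)
teacher_feats = rng.normal(size=(16, 1024))  # stand-in for frozen large-ViT embeddings
student_feats = rng.normal(size=(16, 512))   # stand-in for compact CNN embeddings
projection = rng.normal(size=(512, 1024)) * 0.01  # learnable in practice, random here
loss = cosine_distillation_loss(student_feats, teacher_feats, projection)
print(loss)
```

The projection layer exists only because the student and teacher widths differ. In this setup it would be trained jointly with the student and discarded at deployment.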

📊 Experiments: Is DeFM Really That Good?

The experiments validating DeFM fall along two main axes.

🧪 1) Perception Tasks

📌 Classification & Semantic Segmentation

DeFM outperformed the baselines on the following tasks:

| Task | Baselines | DeFM result |
|---|---|---|
| Depth classification | Scratch depth net / RGB-specific VFM | Superior |
| Depth segmentation | Conventional CNN / RGB-VFM + Depth | Best performance |

These experiments show that representations extracted from depth capture semantic information effectively, not just geometry.

👁 Qualitative Visualization of Perception Embeddings

PCA visualizations reveal the following structure:

Depth Embeddings
  ↑ distance/structure
  → object semantics
Clusters reflect semantic components

As with the feature distributions of RGB models, this indicates that DeFM captures both semantic and geometric information.
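
A PCA visualization of this kind is typically produced by projecting each patch feature onto the top three principal components and reading them as RGB. Below is a minimal NumPy sketch of that standard procedure, with random features standing in for a real DeFM feature map. The function name, the 37 × 37 grid, and the 64 channels are illustrative assumptions.

```python
import numpy as np

def pca_to_rgb(patch_features):
    """Project (H, W, C) patch features onto their top 3 principal components, scaled to [0, 1]."""
    h, w, c = patch_features.shape
    flat = patch_features.reshape(-1, c)
    centered = flat - flat.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T  # (H*W, 3)
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    rgb = (projected - lo) / np.maximum(hi - lo, 1e-12)  # per-component min-max scaling
    return rgb.reshape(h, w, 3)

rng = np.random.default_rng(0)
features = rng.normal(size=(37, 37, 64))  # stand-in for a patch-feature map
image = pca_to_rgb(features)
print(image.shape)
```

With real features, patches on the same object or surface tend to receive similar colors. That is the clustering effect the visualization is meant to expose.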

🤖 2) Robotic Task Benchmark

DeFM also generalized well across a range of robot reinforcement learning and control tasks:

| Category | Example task | DeFM vs. baselines |
|---|---|---|
| Navigation | Point-goal navigation | Superior performance |
| Embodiment-Aware Navigation | Adapting to sensor and physical models | Strong sim-to-real |
| Dexterous Manipulation | Grasping tasks | High success rate |
| Locomotion | Quadruped ladder climbing | Strong generalization |

In these experiments DeFM was used as a frozen backbone with no fine-tuning, yet it remained robust across different sensors and environments.

📌 Mermaid Diagram: DeFM Training Pipeline

Below is a rough outline of the DeFM training pipeline.

flowchart TD
    A["Depth image dataset (60M)"] --> B[Augmentation]
    B --> C[Teacher Network]
    B --> D[Student Network]
    C --> E[Self-Distillation Loss]
    D --> E
    E --> F[Backprop training]
    F --> G[Pretrained DeFM]
    G --> H[Distillation to Compact Models]

🧠 Critical Discussion: Strengths, Weaknesses, and Limitations

✅ Strengths

  • Depth-only representation: learning from depth itself rather than RGB strengthens the robot's understanding of its environment.
  • Sim-to-real generalization: models trained on simulated depth remain robust on real sensors.
  • Lightweight models provided: directly deployable on robot hardware.
  • SOTA on many tasks: supports a broad range of tasks, from perception to manipulation.

❌ Weaknesses and Future Work

  • Limited multimodal fusion: a unified representation of depth with RGB or language is still lacking.
  • Limited policy-level validation: end-to-end integration experiments with RL policies are limited.
  • No 3D sequence or dynamic information: use of temporal 3D information across depth frames is still limited.

🧩 Comparison with Related Work

| Model/Method | Depth included | Foundation scale | Robotics focus |
|---|---|---|---|
| R3M | ❌ RGB-centric | △ | Some manipulation |
| MVP | △ Depth as auxiliary | △ | Partial |
| FP3 | ✔ Point cloud | ✔ RL focus | Manipulation |
| DeFM | ✔ Depth only | ✔ Foundation | Navigation, Locomotion, Manipulation |

DeFM plays a pioneering role for pure depth and can serve as a starting point for research on 3D foundation models coupled with RL.

🧠 Summary and Conclusion

DeFM is a depth-centric, self-supervised foundation model that provides strong representations for the depth perception problems that matter in robotics today. It captures geometry and semantics from depth alone, with strong sim-to-real performance and broad applicability.

It provides pretrained models that can be put to work on real robot tasks right away, which makes it valuable as an immediate tool for robot perception and control research.


Resources

  • arXiv - DeFM Paper
  • arXiv - DeFM Paper (HTML)
  • BAAI Hub - DeFM
  • Hugging Face - leggedrobotics/defm

Copyright 2026, JungYeon Lee