Curieux.JY
  • JungYeon Lee


📃 DeFM Review

depth
representation
ssl
Learning Foundation Representations from Depth for Robotics
Published

January 30, 2026

🔍 Ping. 🔔 Ring. ⛏️ Dig. A tiered review series: quick look, key ideas, deep dive.

  • Paper Link
  • Project
  • Code
  1. 🤔 DeFM is the first depth-only foundation model, pretrained with DINOv2-style self-distillation on a dataset of 60 million depth images curated for robotics applications.
  2. ✨ The model introduces a new three-channel input normalization strategy that preserves metric awareness, and it is also distilled into smaller models such as ViT-S and CNNs for efficient deployment on robots.
  3. 🚀 DeFM achieves SOTA performance on classification, semantic segmentation, and a range of robotic tasks such as navigation, manipulation, and locomotion, and it shows strong sim-to-real transfer.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

The paper points out that, despite the importance of depth images in robotics, there is no large-scale, general-purpose foundation model (FM) specialized for this modality. Existing approaches either reuse RGB-pretrained models on depth images or train task-specific encoders from scratch, which leads to limitations such as distribution mismatch and poor generalization. To close this gap, the paper proposes DeFM (Depth Foundation Model). DeFM is a depth-only FM trained with DINOv2-style self-supervised learning on a curated dataset of 60.4 million depth images.

Core methodology:

DeFM adapts the DINOv2 framework to the depth modality. It follows a self-distillation scheme in which a student network (f_s) is optimized to predict the output distribution of a momentum-updated teacher network (f_t). For an input depth image x, G large global crops (x_g) and L small local crops (x_l) are prepared, each with various geometric and photometric augmentations. The teacher network processes the global crops to produce the target distribution p_t, while the student network processes the local crops and the partially masked global crops (x'_g). Training uses the following three main loss functions:

  1. DINO global crop loss (\mathcal{L}_{Global}): aligns the student's representation of the partially masked global crops (x'_g) with the teacher's representation of the unmasked global crops (x_g). It is a DINO loss computed on the class token (cls token) features of the Vision Transformer (ViT): \mathcal{L}_{Global} = \sum_{i=1}^G \sum_{j=1, j \neq i}^G \mathcal{L}_{DINO}(f_s(x'_{g_i}), f_t(x_{g_j}))
  2. DINO local crop loss (\mathcal{L}_{Local}): aligns the student's representation of the local crops (x_l) with the teacher's representation of the global crops (x_g). It is also computed between cls tokens: \mathcal{L}_{Local} = \sum_{g=1}^G \sum_{l=1}^L \mathcal{L}_{DINO}(f_s(x_l), f_t(x_g))
  3. iBOT patch loss (\mathcal{L}_{iBOT}): essential for learning dense spatial features. For randomly masked input patches, it applies a cross-entropy loss between the student's prediction (p_{s_i}) and the teacher's target distribution for the same patch (p_{t_i}): \mathcal{L}_{iBOT} = - \sum_{i \in \text{masked}} p_{t_i} \log p_{s_i}
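The three losses above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: it assumes \mathcal{L}_{DINO} is the cross-entropy between teacher and student distributions (as in DINO), it omits teacher centering, and the function names and temperatures are hypothetical.

```python
import numpy as np

def softmax(logits, temperature):
    """Turn head outputs into a distribution; the temperature is a hypothetical knob."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p_t, p_s, eps=1e-12):
    """Assumed form of L_DINO: -sum_k p_t[k] * log p_s[k] for one pair of cls distributions."""
    return float(-(p_t * np.log(p_s + eps)).sum())

def global_loss(student_masked_global, teacher_global):
    """Sum over all pairs (i, j), i != j, of masked student crop i vs. teacher crop j."""
    G = len(teacher_global)
    return sum(
        cross_entropy(teacher_global[j], student_masked_global[i])
        for i in range(G) for j in range(G) if i != j
    )

def local_loss(student_local, teacher_global):
    """Sum over every (global crop, local crop) pair."""
    return sum(cross_entropy(p_t, p_s) for p_t in teacher_global for p_s in student_local)

def ibot_loss(p_t_patches, p_s_patches, mask, eps=1e-12):
    """Cross-entropy summed over masked patches only; inputs are (N, K), mask is (N,) bool."""
    return float(-(p_t_patches[mask] * np.log(p_s_patches[mask] + eps)).sum())
```

In this sketch the teacher distributions would come from `softmax(teacher_logits, t_teacher)` with a sharper (lower) temperature than the student's, which is the usual DINO convention; the summary above does not state the values used for DeFM.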

The total loss is a weighted sum of these three terms plus a KoLeo regularizer that prevents the feature space from collapsing.
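To make the role of the regularizer concrete, here is a small sketch of a KoLeo-style term and of the weighted sum. It assumes the standard KoLeo form used in DINOv2 (negative mean log distance to the nearest neighbor within a batch of L2-normalized features). The weights `w_*` are hypothetical placeholders, since the summary does not give the values.

```python
import numpy as np

def koleo(features, eps=1e-8):
    """KoLeo-style regularizer: -mean(log(nearest-neighbor distance)) within a batch.

    features: (B, D) array. Features are L2-normalized first. The value grows
    when points crowd together, so minimizing it spreads the features out.
    """
    x = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)  # (B, B) pairwise
    np.fill_diagonal(dist, np.inf)  # ignore each point's distance to itself
    nearest = dist.min(axis=1)
    return float(-np.log(nearest + eps).mean())

def total_loss(l_global, l_local, l_ibot, l_koleo,
               w_global=1.0, w_local=1.0, w_ibot=1.0, w_koleo=0.1):
    """Weighted sum of the three loss terms plus the regularizer (weights are hypothetical)."""
    return w_global * l_global + w_local * l_local + w_ibot * l_ibot + w_koleo * l_koleo
```

A nearly collapsed batch gives a much larger `koleo` value than a well-spread one, which is what pushes the encoder away from collapse.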

The DeFM training set contains 60.4 million depth images in total. It mixes images converted from RGB datasets through Monocular Depth Estimation (MDE), synthetic simulation data, and real sensor data. This covers the diversity, scale, and noise characteristics of depth data, so that the encoder generalizes robustly across a wide range of environments.

In particular, a new input normalization strategy is introduced to handle the wide scale range of depth images (from millimeters to hundreds of meters). Because near-range depth variation matters more for robot decision-making, the model uses a log-compressed depth representation made of the following three channels:

  1. Global log-scale depth (C_1): normalizes the log-compressed depth with the minimum (D_{\min}) and maximum (D_{\max}) depth of the current image, which preserves relative geometric structure. The log transform is defined as \text{logp}(D) = \log(1+D): C_1 = \frac{\text{logp}(D) - \text{logp}(D_{\min})}{\text{logp}(D_{\max}) - \text{logp}(D_{\min})}

  2. Mid-range normalization (C_2): emphasizes the depth range best suited to manipulation and indoor interaction: C_2 = \frac{\text{logp}(D)}{\text{logp}(10)}

  3. Far-range normalization (C_3): emphasizes the depth range suited to long-range navigation and outdoor scenes: C_3 = \frac{\text{logp}(D)}{\text{logp}(100)}

The final input stacks the three channels as X_{in} = [C_1, C_2, C_3], followed by global mean and standard deviation normalization. This scheme preserves global metric depth while keeping fine near-range structure and stable gradients.
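The three-channel input can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: depth is taken to be in meters (so the constants 10 and 100 are meters), the function name is hypothetical, and the global `mean` / `std` statistics are left as optional arguments because their values are not given in the summary above. Handling of invalid pixels and clipping is not covered.

```python
import numpy as np

def defm_style_input(depth_m, mean=None, std=None, eps=1e-8):
    """Build X_in = [C_1, C_2, C_3] from a metric depth map of shape (H, W).

    depth_m is assumed to hold non-negative depth in meters. mean and std are
    optional per-channel statistics (hypothetical; values not given here).
    """
    d = np.asarray(depth_m, dtype=np.float64)
    logp = np.log1p(d)                       # logp(D) = log(1 + D)

    lo, hi = logp.min(), logp.max()          # logp(D_min), logp(D_max) of this image
    c1 = (logp - lo) / max(hi - lo, eps)     # per-image relative structure, in [0, 1]
    c2 = logp / np.log1p(10.0)               # mid range: reaches 1 at D = 10
    c3 = logp / np.log1p(100.0)              # far range: reaches 1 at D = 100

    x = np.stack([c1, c2, c3], axis=0)       # channels first: (3, H, W)
    if mean is not None and std is not None:
        x = (x - np.asarray(mean).reshape(-1, 1, 1)) / np.asarray(std).reshape(-1, 1, 1)
    return x
```

C_1 depends on the image's own min and max and therefore loses absolute scale. C_2 and C_3 use fixed denominators, so the same metric depth always maps to the same value, and this is how the stacked input stays metric-aware.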

DeFM's largest model, ViT-L/14 (307 million parameters), was trained with an FSDP (Fully-Sharded Data Parallel) implementation. Given the resource constraints of robotic systems, DeFM-L/14 was then used as the teacher for knowledge distillation into smaller models with 3 to 30 million parameters, including ViT-S, ResNet, RegNet, and EfficientNet. In particular, a BiFPN (Bi-directional Feature Pyramid Network) is added on top of the CNN encoder to fuse feature maps at different resolutions, so that CNN students can effectively learn the dense spatial features of the ViT teacher.

Experimental results:

The robustness and generalization of DeFM are demonstrated through extensive experiments.

  • Qualitative evaluation: PCA (Principal Component Analysis) shows that features extracted by the DeFM-L/14 encoder form semantic clusters in depth images (for example, cup handles) even without texture or color information. The clusters are consistent across different sensor modalities, which suggests that the model has learned priors useful for robotic manipulation.
  • Classification: on the ImageNet-Depth-1K benchmark (generated through MDE), DeFM-L/14 achieves SOTA performance, surpassing state-of-the-art RGB-based FMs (DINOv2, DINOv3, C-RADIOv3). DeFM-S/14 in particular outperforms existing models of the same size class by up to 10%. The distilled small CNN models also outperform some larger RGB ViT-S-based models.
  • Semantic segmentation: across diverse depth datasets, including ScanNet and SUN-RGBD (indoor), OFFSED and TartanGround (outdoor), and GraspNet-1B (manipulation), DeFM shows robust generalization and beats most existing baselines (up to a 30% mIoU improvement for ViT-S).
  • Robotics applications:
    • Navigation (Habitat Point-Goal Nav): DeFM-based models (DeFM-S/14, DeFM-ResNet-50) match or exceed the SPL (Success weighted by Path Length) of a ResNet-50 trained from scratch, showing that DeFM is usable out of the box.
    • Navigation (Embodiment Aware Point-Goal Nav - Unitree B2W): on a real-world long-range navigation task with the Unitree B2W robot, a policy built on the DeFM encoder achieves a higher success rate (SR) than a VAE (Variational Auto Encoder) baseline. DeFM is notably better at recognizing and avoiding OOD (Out-of-Distribution) obstacles, which the authors attribute to a better geometric and semantic understanding of the environment. Robust sim-to-real transfer is demonstrated in a variety of real environments.
    • Manipulation (Dexterous Grasping - KUKA-Allegro): on a dexterous arm-hand grasping task trained with a teacher-student paradigm, DeFM models (especially the fine-tuned version) record the highest success rate and remain robust under various noise models.
    • Locomotion (Quadrupedal Ladder Climbing - ANYmal): on a ladder-climbing task for a quadruped robot, the DeFM-based encoder achieves performance similar to a CNN baseline trained from scratch while requiring far fewer computational resources.
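For readers unfamiliar with the SPL metric mentioned in the Habitat result, the sketch below computes it using its standard definition: the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the success flag, l_i the shortest-path length, and p_i the length of the path actually taken. The function and argument names are illustrative, and the numbers in the usage note are made up, not results from the paper.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes: iterable of 0/1 flags.
    shortest_lengths: shortest-path length per episode (assumed > 0).
    path_lengths: length of the path the agent actually took.
    """
    episodes = list(zip(successes, shortest_lengths, path_lengths))
    total = 0.0
    for success, shortest, taken in episodes:
        # A failed episode contributes 0; a success is discounted by the detour ratio.
        total += float(success) * shortest / max(taken, shortest)
    return total / len(episodes)
```

For example, `spl([1, 1, 0], [10.0, 10.0, 10.0], [10.0, 20.0, 5.0])` averages a perfect episode (1.0), a success with a path twice the optimal length (0.5), and a failure (0.0).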

In conclusion, DeFM is the first large-scale self-supervised foundation model for depth images, and it learns robust, generalizable geometric and semantic features. It can be used out of the box for a wide range of robot perception and control tasks, including classification, segmentation, navigation, locomotion, and manipulation, and it enables robust sim-to-real transfer in diverse real environments. The distilled small models further show that it can be deployed effectively on resource-constrained robotic systems. Proposed future work includes mitigating artifacts of the ViT architecture, broadening task diversity, applying the approach to LiDAR data, and continuing to scale the dataset and the model.


🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.


⛏️ Dig Review

⛏️ Dig: Go deep, uncover the layers. Dive into technical detail.

Copyright 2026, JungYeon Lee