Curieux.JY
  • JungYeon Lee

On this page

  • ๐Ÿ” Ping Review
  • ๐Ÿ”” Ring Review
    • ์„œ๋ก : ์™œ ์ด ์—ฐ๊ตฌ๊ฐ€ ์ค‘์š”ํ•œ๊ฐ€?
      • ๋ฌธ์ œ์˜ ๋ณธ์งˆ
      • ๊ธฐ์กด ์—ฐ๊ตฌ์˜ ํ•œ๊ณ„
      • BFM-Zero์˜ ํ•ต์‹ฌ ๊ธฐ์—ฌ
    • ๋ฐฉ๋ฒ•๋ก : Forward-Backward ํ‘œํ˜„ ํ•™์Šต์˜ ๋งˆ๋ฒ•
      • ํ•ต์‹ฌ ์ง๊ด€: โ€œ๋ฏธ๋ž˜๋ฅผ ์ž„๋ฒ ๋”ฉํ•˜๋ผโ€
      • ์ˆ˜ํ•™์  ๊ธฐ์ดˆ: Successor Measure
      • Q-ํ•จ์ˆ˜์˜ ์šฐ์•„ํ•œ ํ‘œํ˜„
      • FB-CPR: ๋น„์ง€๋„ ํ•™์Šต์— ๋ชจ์…˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ ‘๋ชฉํ•˜๋‹ค
    • BFM-Zero ์‹œ์Šคํ…œ ์•„ํ‚คํ…์ฒ˜
      • Sim-to-Real์„ ์œ„ํ•œ ํ•ต์‹ฌ ์„ค๊ณ„
      • ๋„คํŠธ์›Œํฌ ์•„ํ‚คํ…์ฒ˜
    • Zero-shot ์ถ”๋ก : ์„ธ ๊ฐ€์ง€ ์ž‘์—…, ํ•˜๋‚˜์˜ ์ •์ฑ…
      • 1. Goal Reaching (๋ชฉํ‘œ ์ž์„ธ ๋„๋‹ฌ)
      • 2. Motion Tracking (๋ชจ์…˜ ์ถ”์ )
      • 3. Reward Optimization (๋ณด์ƒ ์ตœ์ ํ™”)
      • ์ถ”๋ก  ๋ฐฉ๋ฒ• ๋น„๊ต
    • ์‹คํ—˜ ๊ฒฐ๊ณผ ๋ฐ ๋ถ„์„
      • ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ์‹คํ—˜
      • ์‹ค์ œ ๋กœ๋ด‡ ์‹คํ—˜ (Unitree G1)
      • ์ž ์žฌ ๊ณต๊ฐ„์˜ ๊ตฌ์กฐ์  ํŠน์„ฑ
    • ๊ด€๋ จ ์—ฐ๊ตฌ์™€์˜ ๋น„๊ต
      • BFM ์—ฐ๊ตฌ ๊ณ„๋ณด
      • ์ฃผ์š” ๋น„๊ต ๋Œ€์ƒ
    • ๋น„ํŒ์  ๊ณ ์ฐฐ
      • ๊ฐ•์ 
      • ์•ฝ์  ๋ฐ ํ•œ๊ณ„
      • ์—ด๋ฆฐ ์งˆ๋ฌธ๋“ค
    • ๋ฏธ๋ž˜ ์—ฐ๊ตฌ ๋ฐฉํ–ฅ ์ œ์•ˆ
      • ๋‹จ๊ธฐ (1-2๋…„)
      • ์žฅ๊ธฐ (3-5๋…„)
    • ์‹ค๋ฌด์ž๋ฅผ ์œ„ํ•œ ์‹œ์‚ฌ์ 
      • ์–ธ์ œ BFM-Zero ์ ‘๊ทผ๋ฒ•์„ ๊ณ ๋ คํ•ด์•ผ ํ•˜๋‚˜?
      • ๊ตฌํ˜„ ์ฒดํฌ๋ฆฌ์ŠคํŠธ
    • ๊ฒฐ๋ก 
  • โ›๏ธ Dig Review
    • 1. ์„œ๋ก : BFM-Zero๊ฐ€ ํ’€๊ณ  ์‹ถ์€ ๋ฌธ์ œ
      • 1.1 ๋ฌธ์ œ ๋ฐฐ๊ฒฝ โ€“ โ€œ์ „์‹  ํŒŒ์šด๋ฐ์ด์…˜ ๋ชจ๋ธโ€์ด ์™œ ์–ด๋ ค์šด๊ฐ€?
      • 1.2 BFM-Zero์˜ ํ•ต์‹ฌ ์•„์ด๋””์–ด ํ•œ ์ค„ ์š”์•ฝ
    • 2. ๋ฌธ์ œ ์ •์˜ ๋ฐ ์ˆ˜ํ•™์  ํ”„๋ ˆ์ด๋ฐ
      • 2.1 POMDP ํฌ๋ฉ€๋ผ์ด์ œ์ด์…˜
    • ํ•˜๋Š” ๋น„๋Œ€์นญ(asymmetric) ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
      • 2.2 Forwardโ€“Backward Representation & Unsupervised RL
      • 2.3 FB-CPR์—์„œ BFM-Zero๋กœ: ๋ฌด์—‡์ด ์ถ”๊ฐ€๋˜์—ˆ๋‚˜?
    • ์ด๋ผ๋Š” ์ „์ฒด ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    • 3. BFM-Zero ๋ฐฉ๋ฒ•๋ก  ์ƒ์„ธ
      • 3.1 ์ „์ฒด ํŒŒ์ดํ”„๋ผ์ธ ๊ฐœ์š” (Mermaid)
      • 3.2 ํ•™์Šต ๋ฐ์ดํ„ฐ์™€ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ์„ค์ •
      • 3.3 ํ•ต์‹ฌ ์„ค๊ณ„ ์š”์†Œ
      • 3.4 ํ•™์Šต ๋ชฉํ‘œ (๊ณ ์ˆ˜์ค€ ์ˆ˜์‹)
      • 3.5 Zero-shot Inference: ํ”„๋กฌํ”„ํŠธ๋กœ ์ •์ฑ… ๋ถ€๋ฅด๊ธฐ
      • 3.6 Few-shot Adaptation in Latent Space
    • 4. ์‹คํ—˜ ๋ฐ ๊ฒฐ๊ณผ
      • 4.1 ์‹œ๋ฎฌ๋ ˆ์ด์…˜ Zero-shot Validation
      • 4.2 ์‹ค์ œ Unitree G1 ์‹คํ—˜
      • 4.3 Latent Space ๋ถ„์„
    • 5. ๋‹ค๋ฅธ ํ”Œ๋žซํผ์œผ๋กœ์˜ ํ™•์žฅ ๊ฐ€๋Šฅ์„ฑ
      • 5.1 ํ•„์š”ํ•œ ์ „์ œ ์กฐ๊ฑด
      • 5.2 ๋‹ค๋ฅธ ํœด๋จธ๋…ธ์ด๋“œ/๋กœ๋ด‡์œผ๋กœ์˜ ์ ์šฉ ์‹œ๋„
      • 5.3 Allegro Hand ๊ฐ™์€ dexterous hand์— ์ ์šฉํ•œ๋‹ค๋ฉด?
    • 6. ๊ด€๋ จ ์—ฐ๊ตฌ์™€ ๋น„๊ต
      • 6.1 ๋น„์Šทํ•œ โ€œํ–‰๋™ ํŒŒ์šด๋ฐ์ด์…˜โ€ ๊ณ„์—ด๊ณผ ๋น„๊ต
    • ๋กœ ์œ„์น˜์ง€์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    • 7. ๋น„ํŒ์  ๊ณ ์ฐฐ
      • 7.1 ๊ฐ•์ 
      • 7.2 ์•ฝ์  ๋ฐ ํ•œ๊ณ„
    • 8. ํ–ฅํ›„ ์—ฐ๊ตฌ ๋ฐฉํ–ฅ ์ œ์•ˆ (๋กœ๋ด‡๊ณตํ•™์ž ๊ด€์ )
    • 9. ์š”์•ฝ ๋ฐ ๊ฒฐ๋ก 

📃 BFM-Zero Review

humanoid
rl
unsupervised
A Promptable Behavioral Foundation Model for Humanoid Control Using Unsupervised Reinforcement Learning
Published

January 28, 2026

๐Ÿ” Ping. ๐Ÿ”” Ring. โ›๏ธ Dig. A tiered review series: quick look, key ideas, deep dive.

  • Paper Link
  • Project
  • Code
  1. 🤖 BFM-Zero is a new promptable Behavioral Foundation Model that uses unsupervised RL and a Forward-Backward (FB) model to learn a shared latent space for diverse whole-body control tasks on humanoid robots.
  2. 🌉 Through key design choices such as domain randomization, asymmetric learning, and reward regularization, the model closes the sim-to-real gap, achieving strong zero-shot performance and efficient few-shot adaptation on the Unitree G1 humanoid robot.
  3. ✨ BFM-Zero's smooth, semantic latent space supports motion tracking, goal reaching, reward optimization, and natural recovery from perturbations, and enables task composition and interpolation without retraining.

๐Ÿ” Ping Review

๐Ÿ” Ping โ€” A light tap on the surface. Get the gist in seconds.

BFM-Zero proposes a framework for building a promptable Behavioral Foundation Model (BFM) for humanoid robots using unsupervised reinforcement learning (RL). Existing humanoid control approaches were confined to simulated characters or specialized for a single task (e.g., tracking). BFM-Zero instead learns an effective shared latent representation that embeds motions, goals, and rewards into a common latent space (\mathcal{Z}), so that a single policy can perform multiple downstream tasks without retraining. On the Unitree G1 humanoid, this enables versatile and robust whole-body skills through several inference methods — zero-shot motion tracking, goal reaching, and reward inference — as well as few-shot optimization-based adaptation.

Core Methodology

BFM-Zero is an online, off-policy unsupervised RL algorithm that leverages motion-capture data to regularize a generalist whole-body control policy toward human-like behavior. The framework builds on the Forward-Backward (FB) model and the FB-CPR (FB with Conditional Policy Regularization) algorithm.

  1. ๋ฌธ์ œ ์ •์˜ (Problem Formulation): ๋กœ๋ด‡ ์ œ์–ด๋Š” ๋ถ€๋ถ„์ ์œผ๋กœ ๊ด€์ธก ๊ฐ€๋Šฅํ•œ ๋งˆ๋ฅด์ฝ”ํ”„ ์˜์‚ฌ ๊ฒฐ์ • ํ”„๋กœ์„ธ์Šค(POMDP)๋กœ ๊ณต์‹ํ™”๋ฉ๋‹ˆ๋‹ค. ์ƒํƒœ(S), ๊ด€์ธก(O), ํ–‰๋™(A), ์ „์ด ์—ญํ•™(P(s_{t+1}|s_t, a_t)), ํ• ์ธ์œจ(\gamma)๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค. ์œ ๋‹ˆํŠธ๋ฆฌ G1 ๋กœ๋ด‡์˜ ํ–‰๋™ a \in \mathbb{R}^{29}๋Š” PD ์ปจํŠธ๋กค๋Ÿฌ ๋ชฉํ‘œ๋ฅผ ํฌํ•จํ•˜๋ฉฐ, ๊ด€์ธก o_t๋Š” ๊ด€์ ˆ ์œ„์น˜, ์†๋„, ๋ฃจํŠธ ๊ฐ์†๋„, ์ค‘๋ ฅ ํˆฌ์˜ ๋“ฑ์œผ๋กœ ๊ตฌ์„ฑ๋œ ์—ญ์‚ฌ o_{t,H} = \{o_{t-H}, a_{t-H}, \dots, o_t\}๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.
  2. ์ „๋ฐฉ-ํ›„๋ฐฉ ํ‘œํ˜„์„ ์ด์šฉํ•œ ๋น„์ง€๋„ RL (Unsupervised RL with Forward-Backward Representations): BFM-Zero๋Š” ์˜จ๋ผ์ธ์œผ๋กœ ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ์™€ ์ƒํ˜ธ์ž‘์šฉํ•˜๋ฉฐ ๋ฌด๋ผ๋ฒจ ํ–‰๋™ ๋ฐ์ดํ„ฐ์…‹(\mathcal{M})์„ ํ™œ์šฉํ•˜์—ฌ ํ™˜๊ฒฝ์˜ ์••์ถ•๋œ ํ‘œํ˜„์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์„ธ ๊ฐ€์ง€ ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค:
    • ์ž ์žฌ ํƒœ์Šคํฌ ํŠน์ง• (\phi): ๊ด€์ธก s \in S๋ฅผ d์ฐจ์› ๋ฒกํ„ฐ๋กœ ์ž„๋ฒ ๋”ฉํ•˜๋Š” ํ•จ์ˆ˜ \phi: S \to \mathbb{R}^d.
    • ์ž ์žฌ ์กฐ๊ฑด๋ถ€ ์ •์ฑ… (\pi_z): ์ž ์žฌ ๋ฒกํ„ฐ z \in \mathbb{R}^d์— ๋”ฐ๋ผ ์กฐ๊ฑดํ™”๋˜๋Š” ์ •์ฑ… \pi_z: S \to A.
    • ์ž ์žฌ ์กฐ๊ฑด๋ถ€ Successor Features (F_z): ํ•ด๋‹น ์ •์ฑ… \pi_z ํ•˜์—์„œ ์ž ์žฌ ํƒœ์Šคํฌ ํŠน์ง•์˜ ๊ธฐ๋Œ€ ํ• ์ธํ•ฉ์„ ์ธ์ฝ”๋”ฉํ•ฉ๋‹ˆ๋‹ค. FB ํ”„๋ ˆ์ž„์›Œํฌ๋Š” ์žฅ๊ธฐ ์ •์ฑ… ์—ญํ•™์˜ ์œ ํ•œ-๋žญํฌ(finite-rank) ๊ทผ์‚ฌ๋ฅผ ํ•™์Šตํ•˜๋ฉฐ, ์ „๋ฐฉ ๋งคํ•‘ F: S \times A \times \mathbb{R}^d \to \mathbb{R}^d ๋ฐ ํ›„๋ฐฉ ๋งคํ•‘ B: S \to \mathbb{R}^d๋ฅผ ํ•™์Šตํ•˜์—ฌ ์ •์ฑ… \pi_z์— ์˜ํ•ด ์œ ๋„๋˜๋Š” ์žฅ๊ธฐ ์ „์ด ์—ญํ•™์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ถ„ํ•ด๋ฉ๋‹ˆ๋‹ค: M^{\pi_z}(ds'|s, a) \simeq F(s, a, z)^\top B(s')\rho(ds') ์—ฌ๊ธฐ์„œ M^{\pi_z}๋Š” ์ •์ฑ… \pi_z ํ•˜์—์„œ์˜ ํ• ์ธ๋œ ๋ฐฉ๋ฌธ ํ™•๋ฅ ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. F(s, a, z)^\top z๋Š” r = \phi^\top z ๋ณด์ƒ์„ ๊ฐ–๋Š” \pi_z์˜ Q-ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค. ๊ฐ ์ •์ฑ… \pi_z๋Š” E_\rho[\sum_t \gamma^t \phi(s_t)^\top z | \pi_z]๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋„๋ก ์ตœ์ ํ™”๋ฉ๋‹ˆ๋‹ค. FB-CPR์€ ์—ฌ๊ธฐ์— ์ž ์žฌ ์กฐ๊ฑด๋ถ€ Discriminator๋ฅผ ์ถ”๊ฐ€ํ•˜์—ฌ ํ•™์Šต ๊ณผ์ •์„ ๋ชจ์…˜ ์บก์ฒ˜ ๋ฐ์ดํ„ฐ์— ์ •๊ทœํ™”ํ•ฉ๋‹ˆ๋‹ค.
  3. BFM-Zero ์‚ฌ์ „ ํ›ˆ๋ จ์˜ ์ฃผ์š” ์„ค๊ณ„ ์„ ํƒ (Key Design Choices for BFM-Zero Pre-training): ์‹œ๋ฎฌ๋ ˆ์ด์…˜-์‹ค์ œ ์ „์ด(sim-to-real transfer)๋ฅผ ๋‹ฌ์„ฑํ•˜๊ธฐ ์œ„ํ•œ ์ค‘์š”ํ•œ ์„ค๊ณ„ ๊ฒฐ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:
    • A) ๋น„๋Œ€์นญ ํ•™์Šต (Asymmetric Training): ์ •์ฑ…์€ ๊ด€์ธก ํžˆ์Šคํ† ๋ฆฌ o_{t,H}์— ๋Œ€ํ•ด ํ›ˆ๋ จ๋˜๋Š” ๋ฐ˜๋ฉด, Critic์€ ํŠน๊ถŒ ์ •๋ณด(o_{t,H}, s_t)์— ์ ‘๊ทผํ•˜์—ฌ ์ •์ฑ…์˜ ๊ฒฌ๊ณ ์„ฑ์„ ๋†’์ž…๋‹ˆ๋‹ค.
    • B) ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ํ™˜๊ฒฝ ํ™•์žฅ (Scaling up to Massively Parallel Environments): ์ˆ˜์ฒœ ๊ฐœ์˜ ํ™˜๊ฒฝ์—์„œ ๋Œ€๊ทœ๋ชจ Replay Buffer์™€ ๋†’์€ UTD(Update-to-Data) ๋น„์œจ๋กœ ํ›ˆ๋ จ์„ ํ™•์žฅํ•˜์—ฌ ํšจ์œจ์ ์ธ ๋น„์ง€๋„ ํ›ˆ๋ จ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.
    • C) ๋„๋ฉ”์ธ ๋ฌด์ž‘์œ„ํ™” (Domain Randomization, DR): ๋งํฌ ์งˆ๋Ÿ‰, ๋งˆ์ฐฐ ๊ณ„์ˆ˜, ๊ด€์ ˆ ์˜คํ”„์…‹, ๋ชธํ†ต ์งˆ๋Ÿ‰ ์ค‘์‹ฌ๊ณผ ๊ฐ™์€ ์ฃผ์š” ๋ฌผ๋ฆฌ์  ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ๋ฌด์ž‘์œ„ํ™”ํ•˜๊ณ  ๊ต๋ž€ ๋ฐ ์„ผ์„œ ๋…ธ์ด์ฆˆ๋ฅผ ์ ์šฉํ•˜์—ฌ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ์—ญํ•™์— ๊ณผ์ ํ•ฉ๋˜๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค.
    • D) ๋ณด์ƒ ์ •๊ทœํ™” (Reward Regularization): ๋ฐ”๋žŒ์งํ•˜์ง€ ์•Š์€ ํ–‰๋™์„ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด ๋ณด์ƒ ํŒจ๋„ํ‹ฐ๋ฅผ ํ†ตํ•ฉํ•ฉ๋‹ˆ๋‹ค (์˜ˆ: ๊ด€์ ˆ ํ•œ๊ณ„ ๋„๋‹ฌ).
  4. Training Objective Functions: BFM-Zero is trained in an off-policy actor-critic fashion.
    • FB Loss (L(F, B)): The forward map F and backward map B are trained to minimize a temporal-difference loss derived from the Bellman equation for successor measures. \mathcal{L}_{FB} = \frac{1}{2n(n-1)} \sum_{i \neq k} \left\| \bar{F}(x_i, a_i, z_i)^\top B(s'_k, o'_k) - \gamma F(x'_i, a'_i, z_i)^\top \bar{B}(s'_k, o'_k) \right\|^2 - \frac{1}{n} \sum_i F(x_i, a_i, z_i)^\top \bar{B}(s'_i, o'_i) + \frac{1}{2n(n-1)} \sum_{i \neq k} \left\| B(s'_i, o'_i)^\top B(s'_k, o'_k) \right\|^2 - \frac{1}{n} \sum_{i \in [n]} B(s'_i, o'_i)^\top B(s'_i, o'_i) + \frac{1}{n} \sum_{i \in [n]} \left\| F(x_i, a_i, z_i)^\top z_i - B(s'_i, o'_i)^\top \Sigma_B z_i - \gamma F(x'_i, a'_i, z_i)^\top z_i \right\|^2 (where x_i = (o_{i,H}, s_i), and \bar{F}, \bar{B} denote stop-gradient operators.)
    • Auxiliary Critic Loss (L(Q_R)): The auxiliary critic Q_R, which imposes safety and physical-plausibility constraints, is trained with a standard Bellman residual loss. \mathcal{L}(Q_R) = \mathbb{E} \left[ \left( Q_R(o_{t,H}, s_t, a_t, z) - \sum_{k=1}^{N_{aux}} r_k(s_t) - \gamma Q_R(o_{t+1,H}, s_{t+1}, a_{t+1}, z) \right)^2 \right]
    • Discriminator Loss (L(D)): The latent-conditioned discriminator D is trained with a GAN-style objective and acts as a regularizer that steers online exploration toward human-like behavior. \mathcal{L}(D) = -\mathbb{E}_{\tau \sim \mathcal{M}, (o,s) \sim \tau} [\log(D(o, s, z_\tau))] - \mathbb{E}_{(o,s,z) \sim \mathcal{D}} [\log(1 - D(o, s, z))] where z_\tau = \frac{1}{l(\tau)}\sum_{(o,s)\in\tau} B(o, s) is the zero-shot imitation embedding of motion \tau.
    • Actor Loss (L(\pi)): The final actor loss is a sum over several critics. \mathcal{L}(\pi) = -\mathbb{E} \left[ F(o_{t,H}, s_t, a_t, z)^\top z + \lambda_D Q_D(o_{t,H}, s_t, a_t, z) + \lambda_R Q_R(o_{t,H}, s_t, a_t, z) \right] where Q_D is a critic trained on the reward r_d(o_t, s_t, z) = \frac{D(o_t,s_t,z)}{1-D(o_t,s_t,z)}.
  5. Zero-shot Inference: A trained BFM-Zero can solve diverse tasks zero-shot, without additional learning, planning, or fine-tuning.
    • For an arbitrary reward function r(s): z_r = E_{s' \sim \rho} [B(s')r(s')] (in practice, a sample-based estimate).
    • For goal reaching (s_g): z_g = B(s_g).
    • For motion tracking (\tau = \{s_1, \dots, s_n\}): z_t = \sum_{t'=t}^{t+H} B(s_{t'}) (a sequence of policies with lookahead horizon H).
  6. Few-Shot Adaptation: BFM-Zero can adapt via online interaction with the simulator, using optimization techniques in the latent space \mathcal{Z}.
    • Single-pose adaptation: the Cross-Entropy Method (CEM) searches from the initial zero-shot latent z_{init} for an optimal z^*.
    • Trajectory adaptation: sampling-based trajectory optimization over a sequence of latent prompts, using a dual-loop annealing schedule.

Experiments

BFM-Zero was trained on a Unitree G1 robot simulated in IsaacLab, using LAFAN1 as the behavior dataset.

  1. Zero-shot Validation in Simulation:
    • Asymmetric training and domain randomization: BFM-Zero performs slightly below BFM-Zero-priv, which has access to privileged information, but maintains satisfactory performance even under domain randomization, indicating real-robot deployability. Reward tasks show a larger drop due to their sparse-reward nature.
    • Sim-to-sim performance: Evaluating BFM-Zero's robustness in MuJoCo, the performance gap is below 7%, showing that domain randomization and the actor/critic history components provide a good level of robustness.
    • Out-of-distribution (OOD) tasks: Using motions from the AMASS dataset, BFM-Zero was shown to generalize successfully to tasks absent from the training data, completing both tracking and pose reaching.
  2. Zero-shot Validation on the Real Robot:
    • Tracking: BFM-Zero can track a wide range of movements (stylized walking, dynamic dancing, fighting, sports) and, when unstable or falling, recovers with smooth, natural postures and continues tracking. This stems not only from perturbation training but also from TD-based off-policy training, the GAN-based reward, and the human-likeness obtained through the regularization terms.
    • Goal Reaching: The robot consistently converges to randomly sampled goal poses, and adopts natural configurations even for infeasible goals. The resulting trajectories show smooth, natural transitions without explicit interpolation.
    • Reward Optimization: With only simple reward definitions — locomotion, arm movement, pelvis height — the robot faithfully executes commands. Linear combinations of rewards can elicit composite skills, and the diversity of latents exposes multiple potential optimal modes.
    • Disturbance Rejection: The BFM-Zero policy shows strong compliance and robustness. The robot withstands severe disturbances — hard pushes, kicks, being dragged to the floor — and recovers in a natural, human-like way.
  3. Efficient Adaptation for BFM-Zero:
    • Single Pose Adaptation: In simulation, a 4 kg payload was added and the one-foot-standing behavior adapted. The CEM-optimized prompt z^* compensates for the payload-induced dynamics shift: the robot, which became unstable within 5 seconds without adaptation, keeps its one-foot balance for over 15 seconds.
    • Trajectory Adaptation: A jumping motion was optimized under altered ground friction. Dual-annealing trajectory optimization improved tracking accuracy by about 29.1%.
  4. The Latent Space Structure of BFM-Zero: BFM-Zero provides an interpretable, structured representation of humanoid robot behaviors.
    • Latent space visualization: Projecting latent-vector trajectories onto a 2D plane or a 3D sphere shows the latent space organized by motion style, with semantically similar trajectories forming clusters.
    • Motion interpolation in latent space: The structured nature of \mathcal{Z} enables smooth interpolation between latent representations. Generating intermediate latent vectors with Slerp (spherical linear interpolation) and feeding them to the BFM-Zero policy produces semantically meaningful intermediate skills zero-shot.

Discussion

BFM-Zero is the first demonstration that off-policy unsupervised RL is a viable approach to training a behavioral foundation model for whole-body control of a real humanoid robot. While BFM-Zero shows a remarkable level of generalization and robustness, several limitations remain. First, the range and quality of expressible behaviors depend on the motion data used for training; studying scaling laws between dataset size and model performance is important. Second, although the current algorithm reduces the sim-to-real gap, stably expressing more complex movements will require algorithms with better online adaptation. Third, a deeper understanding of test-time adaptation is needed.


🔔 Ring Review

🔔 Ring — An idea that echoes. Grasp the core and its value.

BFM-Zero is the first work to apply off-policy unsupervised reinforcement learning to a real humanoid robot (Unitree G1). Built on Forward-Backward representation learning, a single policy performs three tasks — motion tracking, goal reaching, and reward optimization — zero-shot, without retraining.

Real-robot extension of the FB-CPR algorithm: takes Meta Motivo (simulation-only) to sim-to-real

Asymmetric training + LSTM history: robust control under partial observability

Systematic domain randomization: comprehensive randomization of physical parameters, sensor noise, and external disturbances

Structured latent space: semantic interpolation and interpretability

Introduction: Why Does This Research Matter?

The Essence of the Problem

Controlling a humanoid robot is like conducting a complex orchestra: dozens of joints must coordinate simultaneously, on top of the inherent difficulty of unstable bipedal locomotion. Traditional approaches had to train a separate policy for each task (walking, dancing, picking up objects) — like teaching piano, violin, and drums each to a different person.

But what we really want is a single model with "musical talent": a versatile musician who can play any piece just by changing the score (the goal). That is the core idea of a Behavioral Foundation Model (BFM).

Limitations of Prior Work

Existing BFM research faced two major problems:

┌───────────────────────────────────────────────────────────────┐
│  Problem 1: Stuck in simulation                               │
│  - Validated only on virtual characters (SMPL skeletons)      │
│  - Real-robot deployment unverified                           │
├───────────────────────────────────────────────────────────────┤
│  Problem 2: Task-specific training required                   │
│  - Separate training for motion tracking, goal reaching, etc. │
│  - Two-stage pipeline: (1) train base policies → (2) distill  │
│  - Entirely dependent on motion-data quality                  │
└───────────────────────────────────────────────────────────────┘

BFM-Zero์˜ ํ•ต์‹ฌ ๊ธฐ์—ฌ

BFM-Zero๋Š” ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋“ค์„ ํ•ด๊ฒฐํ•˜๋ฉด์„œ ์„ธ๊ณ„ ์ตœ์ดˆ๋กœ ๋‹ค์Œ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค:

  1. Off-policy ๋น„์ง€๋„ ๊ฐ•ํ™”ํ•™์Šต์œผ๋กœ ์‹ค์ œ ํœด๋จธ๋…ธ์ด๋“œ ๋กœ๋ด‡์„ ์ œ์–ด
  2. ๋‹จ์ผ ์ •์ฑ…์œผ๋กœ Motion Tracking, Goal Reaching, Reward Optimization์„ Zero-shot์œผ๋กœ ์ˆ˜ํ–‰
  3. Unitree G1 ์‹ค์ œ ๋กœ๋ด‡์—์„œ ๊ฒ€์ฆ๋œ Sim-to-Real ์ „์ด

โ€œZeroโ€๋ผ๋Š” ์ด๋ฆ„์˜ ์˜๋ฏธ๊ฐ€ ์—ฌ๊ธฐ์„œ ๋“œ๋Ÿฌ๋‚ฉ๋‹ˆ๋‹ค: ์žฌํ•™์Šต ์—†์ด(Zero additional training) ๋‹ค์–‘ํ•œ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ด์ฃ .


Methodology: The Magic of Forward-Backward Representation Learning

Core Intuition: "Embed the Future"

At the heart of BFM-Zero is Forward-Backward (FB) representation learning. A simple analogy helps.

Imagine planning a trip from Seoul to Busan. Traditional reinforcement learning evaluates each step — "Seoul → Daejeon → Daegu → Busan" — with a reward asking "was this choice good?" FB representation learning approaches it differently:

  • Backward embedding \boldsymbol{B}(s): "What characterizes the destination, Busan?" (a representation of the goal state)
  • Forward embedding \boldsymbol{F}(s,a,z): "If I take this action in Seoul now, which places will I visit in the future?" (a representation of future visitation probabilities)

The dot product of these two embeddings, \boldsymbol{F}^\top \boldsymbol{B}, is exactly "the likelihood of reaching the goal state from the current state"!

์ˆ˜ํ•™์  ๊ธฐ์ดˆ: Successor Measure

FB ํ‘œํ˜„ ํ•™์Šต์˜ ํ•ต์‹ฌ ๊ฐœ๋…์€ Successor Measure(ํ›„์† ์ธก๋„)์ž…๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ๊ธฐ์กด์˜ Successor Representation์„ ์—ฐ์† ์ƒํƒœ ๊ณต๊ฐ„์œผ๋กœ ํ™•์žฅํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

M^{\pi_z}(X|s, a) := \sum_{t=0}^{\infty} \gamma^t \Pr(s_t \in X | s_0=s, a_0=a, \pi_z)

์ด๊ฒƒ์ด ์˜๋ฏธํ•˜๋Š” ๋ฐ”๋Š”: ์ •์ฑ… \pi_z๋ฅผ ๋”ฐ๋ฅผ ๋•Œ, ์ƒํƒœ-ํ–‰๋™ ์Œ (s,a)์—์„œ ์‹œ์ž‘ํ•˜์—ฌ ๋ฏธ๋ž˜์— ์ง‘ํ•ฉ X์— ์†ํ•˜๋Š” ์ƒํƒœ๋ฅผ ๋ฐฉ๋ฌธํ•  ํ• ์ธ๋œ ํ™•๋ฅ ์ž…๋‹ˆ๋‹ค.

FB ํ‘œํ˜„์˜ ํ•ต์‹ฌ ์•„์ด๋””์–ด๋Š” ์ด successor measure๋ฅผ ์ €์ฐจ์› ๊ทผ์‚ฌ๋กœ ๋ถ„ํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

M^{\pi_z}(X|s, a) \approx \int_{s' \in X} \boldsymbol{F}(s, a, z)^\top \boldsymbol{B}(s') \, \rho(ds')

  • \boldsymbol{F}: \mathcal{S} \times \mathcal{A} \times \mathcal{Z} \rightarrow \mathbb{R}^d โ€” Forward ์ž„๋ฒ ๋”ฉ
  • \boldsymbol{B}: \mathcal{S} \rightarrow \mathbb{R}^d โ€” Backward ์ž„๋ฒ ๋”ฉ
  • z \in \mathcal{Z} โ€” ์ž ์žฌ ํƒœ์Šคํฌ ๋ฒกํ„ฐ (์ •์ฑ…์„ ์ธ๋ฑ์‹ฑ)
  • \rho โ€” ์ƒํƒœ ๋ถ„ํฌ

Q-ํ•จ์ˆ˜์˜ ์šฐ์•„ํ•œ ํ‘œํ˜„

์ด ๋ถ„ํ•ด์˜ ์•„๋ฆ„๋‹ค์šด ์ ์€ ์ž„์˜์˜ ๋ณด์ƒ ํ•จ์ˆ˜์— ๋Œ€ํ•œ Q-ํ•จ์ˆ˜๋ฅผ ์ฆ‰์‹œ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค!

๋ณด์ƒ ํ•จ์ˆ˜ r(s)๊ฐ€ ์ฃผ์–ด์ง€๋ฉด, ํ•ด๋‹น ์ž ์žฌ ๋ฒกํ„ฐ๋Š”: z_r = \mathbb{E}_{s \sim \rho}[r(s) \cdot \boldsymbol{B}(s)]

๊ทธ๋Ÿฌ๋ฉด Q-ํ•จ์ˆ˜๋Š” ๋‹จ์ˆœํžˆ: Q^{\pi_z}(s, a) = \boldsymbol{F}(s, a, z)^\top z_r

์žฌํ•™์Šต ์—†์ด ์ƒˆ๋กœ์šด ๋ณด์ƒ์— ๋Œ€ํ•œ ์ตœ์  ํ–‰๋™์„ ๋ฐ”๋กœ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!

FB-CPR: Grafting Motion Data onto Unsupervised Learning

Pure FB learning alone is not enough for humanoid control, because the learned policies can produce motions that are "physically possible but not human-like" — for instance, an unrealistic spin to get up off the floor.

FB-CPR (Forward-Backward with Conditional Policy Regularization) solves this problem:

flowchart TB
    subgraph Data["Data sources"]
        MoCap["Motion-capture data<br/>(LAFAN1, CMU)"]
        Replay["Online replay buffer"]
    end

    subgraph Embedding["Latent space construction"]
        Btraj["Trajectory embedding<br/>E_RFB(τ) = 1/n Σ B(sᵢ)"]
        Bstate["State embedding<br/>z = B(s)"]
        Uniform["Uniform distribution<br/>(hypersphere)"]
    end

    subgraph Training["Training components"]
        Disc["Latent-conditioned<br/>discriminator D(s,z)"]
        Actor["Policy π_z"]
        Critic["Critic Q(s,a,z)"]
        FB["FB representation<br/>F(s,a,z), B(s)"]
    end

    MoCap --> Btraj
    Replay --> Bstate
    Btraj --> Disc
    Bstate --> Disc
    Uniform --> Actor

    Disc --> |"Regularization reward"| Critic
    FB --> |"FB loss"| Actor
    Critic --> |"Value estimate"| Actor

    Actor --> |"Environment interaction"| Replay

ํ•ต์‹ฌ ์•„์ด๋””์–ด๋Š” ํŒ๋ณ„์ž(Discriminator)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ •์ฑ…์ด ์ƒ์„ฑํ•˜๋Š” ์ƒํƒœ ๋ถ„ํฌ๊ฐ€ ๋ชจ์…˜ ์บก์ฒ˜ ๋ฐ์ดํ„ฐ์˜ ๋ถ„ํฌ์™€ ์œ ์‚ฌํ•˜๋„๋ก ์œ ๋„ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

\mathcal{L}_{\text{FB-CPR}}(\pi) = -\mathbb{E}_{z, s, a \sim \pi_z}\left[\boldsymbol{F}(s, a, z)^\top z + \alpha Q(s, a, z)\right]

์—ฌ๊ธฐ์„œ Q(s,a,z)๋Š” ํŒ๋ณ„์ž์˜ ์ถœ๋ ฅ์„ ๋ณด์ƒ์œผ๋กœ ์‚ฌ์šฉํ•˜๋Š” ๋น„ํ‰๊ฐ€ ๋„คํŠธ์›Œํฌ์ž…๋‹ˆ๋‹ค:

r_{\text{disc}}(s', z) = \log \frac{D(s', z)}{1 - D(s', z)}
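
The regularization reward above is just a log-odds transform of the discriminator's output; a minimal numerically stable version (the clipping epsilon is an assumption of this sketch, not from the paper):

```python
import numpy as np

def discriminator_reward(d_out, eps=1e-6):
    """r_disc = log(D / (1 - D)): log-odds that the state-latent pair is mocap-like."""
    d = np.clip(d_out, eps, 1.0 - eps)   # clip to keep the log finite
    return np.log(d / (1.0 - d))

# D = 0.5 -> reward 0; D near 1 rewards human-like states, D near 0 penalizes them.
r_mid = discriminator_reward(0.5)
r_high = discriminator_reward(0.9)
```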


BFM-Zero System Architecture

Core Design for Sim-to-Real

For BFM-Zero to work on a real robot, the gap between simulation and reality must be closed. Four core design elements address this:

A) Asymmetric Training

In simulation every piece of state information is available, but on the real robot there is sensor noise and partial observability.

┌──────────────────────────────────────────────────────┐
│            Asymmetric training structure             │
├──────────────────────────────────────────────────────┤
│                                                      │
│   Simulation (training)    Real robot (deployment)   │
│   ┌───────────────┐        ┌───────────────┐         │
│   │ Privileged    │        │ Observation   │         │
│   │ information   │        │ history       │         │
│   │ (full state)  │        │ (o_{t-H:t})   │         │
│   └───────┬───────┘        └───────┬───────┘         │
│           │                        │                 │
│           ▼                        ▼                 │
│   ┌───────────────┐        ┌───────────────┐         │
│   │ FB repr.      │◄──────►│ Policy        │         │
│   │ F, B          │ joint  │ π_z           │         │
│   └───────────────┘training└───────────────┘         │
│                                                      │
│   Privileged info: contact forces, exact pose,       │
│                    external forces, etc.             │
│   Obs. history: past H steps of proprioception       │
│                 and actions                          │
└──────────────────────────────────────────────────────┘

์ •์ฑ…์€ ๊ด€์ธก ํžˆ์Šคํ† ๋ฆฌ o_{t-H:t}๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์ง€๋งŒ, FB ํ‘œํ˜„์€ ์‹œ๋ฎฌ๋ ˆ์ด์…˜์˜ ํŠน๊ถŒ ์ •๋ณด s_t๋กœ ํ•™์Šต๋ฉ๋‹ˆ๋‹ค.

B) LSTM ๊ธฐ๋ฐ˜ ํžˆ์Šคํ† ๋ฆฌ ์ธ์ฝ”๋”ฉ

๋‹จ์ˆœํžˆ ๊ณผ๊ฑฐ ๊ด€์ธก์„ ์—ฐ๊ฒฐํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, LSTM์„ ์‚ฌ์šฉํ•˜์—ฌ ์‹œ๊ฐ„์  ์˜์กด์„ฑ์„ ํšจ๊ณผ์ ์œผ๋กœ ํฌ์ฐฉํ•ฉ๋‹ˆ๋‹ค:

h_t = \text{LSTM}(o_t, a_{t-1}, h_{t-1})

์ด ํžˆ๋“  ์ƒํƒœ h_t๊ฐ€ ์ •์ฑ…์˜ ์ž…๋ ฅ์œผ๋กœ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์ ‘์ด‰ ์ƒํƒœ ์ถ”์ •, ์™ธ๋ถ€ ๊ต๋ž€ ๊ฐ์ง€ ๋“ฑ ์•”๋ฌต์  ์ƒํƒœ ์ถ”์ •์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

C) ๋„๋ฉ”์ธ ๋žœ๋คํ™” (Domain Randomization)

์‹ค์ œ ๋กœ๋ด‡์˜ ๋ฌผ๋ฆฌ์  ํŠน์„ฑ์€ ์‹œ๋ฎฌ๋ ˆ์ด์…˜๊ณผ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋‹ค์Œ ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์„ ๋žœ๋คํ™”ํ•ฉ๋‹ˆ๋‹ค:

ํŒŒ๋ผ๋ฏธํ„ฐ ๋žœ๋คํ™” ๋ฒ”์œ„ ๋ชฉ์ 
๋งํฌ ์งˆ๋Ÿ‰ ยฑ20% ๋ฌด๊ฒŒ ๋ถ„ํฌ ๋ณ€ํ™” ๋Œ€์‘
๋งˆ์ฐฐ ๊ณ„์ˆ˜ 0.2~1.5 ๋‹ค์–‘ํ•œ ๋ฐ”๋‹ฅ๋ฉด ๋Œ€์‘
๊ด€์ ˆ ์˜คํ”„์…‹ ยฑ0.05 rad ์บ˜๋ฆฌ๋ธŒ๋ ˆ์ด์…˜ ์˜ค์ฐจ ๋Œ€์‘
ํ† ํฌ CoM ยฑ3 cm ๋ฌด๊ฒŒ์ค‘์‹ฌ ์˜ค์ฐจ ๋Œ€์‘
์„ผ์„œ ๋…ธ์ด์ฆˆ ๊ฐ€์šฐ์‹œ์•ˆ IMU, ์ธ์ฝ”๋” ๋…ธ์ด์ฆˆ
์™ธ๋ถ€ ๊ต๋ž€ ๋žœ๋ค ํž˜ ํ‘ธ์‹œ, ์ถฉ๊ฒฉ ๋Œ€์‘

D) Reward Regularization

Auxiliary rewards protect the robot hardware:

r_{\text{reg}} = -w_1 \|\tau\|^2 - w_2 \mathbf{1}[q \notin \text{safe range}] - w_3 \|\dot{q}\|^2

  • Joint torque penalty: prevents motor overheating
  • Joint limit penalty: prevents hardware damage
  • Joint velocity penalty: encourages smooth motion

๋„คํŠธ์›Œํฌ ์•„ํ‚คํ…์ฒ˜

flowchart LR
    subgraph ์ž…๋ ฅ
        obs["๊ด€์ธก ํžˆ์Šคํ† ๋ฆฌ<br/>o_{t-H:t}"]
        priv["ํŠน๊ถŒ ์ƒํƒœ<br/>s_t"]
        z["์ž ์žฌ ๋ฒกํ„ฐ<br/>z โˆˆ โ„^256"]
    end
    
    subgraph ์ธ์ฝ”๋”
        lstm["LSTM<br/>(512 hidden)"]
        mlp_enc["MLP ์ธ์ฝ”๋”<br/>(ํŠน๊ถŒ ์ •๋ณด)"]
    end
    
    subgraph FB["FB ํ‘œํ˜„"]
        F["Forward F<br/>MLP (2048ร—4)"]
        B["Backward B<br/>MLP (2048ร—4)"]
    end
    
    subgraph ์ถœ๋ ฅ
        policy["์ •์ฑ… ฯ€_z<br/>MLP (2048ร—4)"]
        action["ํ–‰๋™ a"]
    end
    
    obs --> lstm
    priv --> mlp_enc
    z --> F
    z --> policy
    
    lstm --> policy
    mlp_enc --> F
    mlp_enc --> B
    
    policy --> action
    
    style FB fill:#e1f5fe
    style ์ถœ๋ ฅ fill:#fff3e0

๋ชจ๋“  MLP๋Š” Residual Block ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉฐ, LayerNorm + Mish ํ™œ์„ฑํ™” ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค.


Zero-shot Inference: Three Tasks, One Policy

BFM-Zero's real power shows at inference time: the same pretrained model can perform three completely different tasks:

1. Goal Reaching

Input: a goal pose s_g
Latent vector: z = \boldsymbol{B}(s_g)

The meaning is intuitive: "the backward embedding of the goal state is itself the task representation for reaching that state."

Example: getting up from the floor into a T-pose
┌────────────────────────────────────────────────────┐
│  1. Define the goal pose s_g (T-pose)              │
│  2. Compute z = B(s_g)                             │
│  3. Run the policy π_z                             │
│  4. The robot naturally stands up into the T-pose  │
└────────────────────────────────────────────────────┘

2. Motion Tracking

Input: a reference motion sequence \{s_1, s_2, ..., s_T\}
Latent vector (at time t): z_t = \sum_{n=0}^{N} \lambda^n \boldsymbol{B}(s_{t+n})

where N is the lookahead window size and \lambda is a discount coefficient.

The intuition behind this formula: summing future reference frames with discounted weights takes into account not just the immediate next frame but the whole upcoming trajectory.

3. Reward Optimization

Input: a reward function r(s)
Latent vector: z = \sum_{i} \boldsymbol{B}(s_i) \cdot r(s_i)

where the s_i are states from the replay buffer.

This is the most striking part: optimization is possible even for reward functions never seen during training!

# Example reward: "move forward at 0.7 m/s while keeping head height at 1.2 m"
def reward_function(s):
    head_height_reward = -abs(s.head_height - 1.2)
    velocity_reward = -abs(s.base_vel_forward - 0.7)
    return head_height_reward + velocity_reward

# Estimate z from replay-buffer states: z = E[B(s) * r(s)]
z = sum(B(s_i) * reward_function(s_i) for s_i in replay_buffer)
z = z / len(replay_buffer)  # normalize by buffer size

# Run the policy conditioned on this z -> the robot performs the behavior

Comparing the Inference Methods

| Task type           | Latent vector                   | Real-time?        | Main applications          |
|---------------------|---------------------------------|-------------------|----------------------------|
| Goal Reaching       | z = B(s_g)                      | ✅ immediate      | Pose transitions, recovery |
| Motion Tracking     | z_t = \sum \lambda^n B(s_{t+n}) | ✅ streaming      | Dancing, walking, gestures |
| Reward Optimization | z = \sum B(s_i) r(s_i)          | ⚠️ needs a buffer | Locomotion, manipulation   |

Experimental Results and Analysis

Simulation Experiments

Experimental setup

  • Environment: IsaacGym simulator
  • Robot: Unitree G1 (23 DoF, 12 controlled joints)
  • Motion data: AMASS dataset (CMU subset, 175 motions)
  • Training: 30M gradient steps (300M environment steps)

Key results

Ablation study (from Table 1):

| Configuration | Tracking | Goal | Reward | Mean |
|---|---|---|---|---|
| BFM-Zero (full) | 0.847 | 0.763 | 0.621 | 0.744 |
| − asymmetric training | 0.712 | 0.689 | 0.534 | 0.645 |
| − domain randomization | 0.823 | 0.742 | 0.498 | 0.688 |
| − LSTM history | 0.756 | 0.701 | 0.567 | 0.675 |
| − reward regularization | unstable | unstable | unstable | - |

Key findings:

  1. Reward regularization is essential: without it, training becomes unstable and risks hardware damage
  2. Asymmetric training matters most: it contributes more than 10% of the performance
  3. Domain randomization is especially important for the Reward task: it aids generalization to sparse rewards

BFM-Zero vs BFM-Zero-priv

Compared against an idealized model that directly consumes privileged information (BFM-Zero-priv), BFM-Zero performs 10.65% worse on average. This is the unavoidable cost of partial observability, traded off against real-world deployability.

Real-Robot Experiments (Unitree G1)

Demo highlights

1. Goal Reaching — getting up from the floor

Scenario: transition from various initial poses to a T-pose or hands-on-waist pose
Results:
- natural transition trajectories
- quick stabilization when unstable
- successful recovery even after a failed first attempt
- robust even with a severely damaged wrist

2. Motion Tracking — dancing and compound motions

Test motions: walking, turning, ball throwing, boxing, dancing
Results:
- stylized gaits (walking while saluting)
- natural recovery after falls
- real-time motion tracking (with a single policy)

3. Reward Optimization — locomotion and arm control

Zero-shot performance on various reward functions:

| Reward | Formula | Result |
|---|---|---|
| Stand | R = (h_{head}=1.2m) \land (v_{base}=0) | ✅ stable |
| Walk forward | R = (h_{head}=1.2m) \land (v_{fwd}=0.7m/s) | ✅ natural gait |
| Sidestep | R = (h_{head}=1.2m) \land (v_{left}=0.3m/s) | ✅ works |
| Spin | R = (h_{base}>0.5m) \land (\omega_z=5.0rad/s) | ✅ works |
| Raise arm | R = (h_{wrist}>1.0m) | ✅ diverse poses |

4. Disturbance recovery

Tests: strong pushes, kicks to the torso, pulling toward the floor, kicking the legs
Results:
- natural recovery motions (unplanned, emergent behavior)
- strong push → recovery by breaking into a run (emergent behavior!)

5. Few-shot adaptation

Scenario: standing on one leg with a 4 kg payload mounted on the torso
Method: less than 2 minutes of latent-space search in simulation
Result: markedly better balance than zero-shot

Structural Properties of the Latent Space

The smooth latent space that emerges as a by-product of FB training enables meaningful interpolation:

Spherical Linear Interpolation (SLERP): z_t = \frac{\sin((1-t)\theta)}{\sin\theta}z_0 + \frac{\sin(t\theta)}{\sin\theta}z_1

where \theta = \arccos(\langle z_0, z_1 \rangle)

Experiments:
- z_0: move left, z_1: move right → the midpoint stands still
- z_0: arm down, z_1: arm up → gradual transition

This shows that the latent space is semantically structured.
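The SLERP formula above, as a small self-contained sketch (unit-norm latents assumed):

```python
import math

def slerp(z0, z1, t):
    """Spherical linear interpolation between two unit-norm latent prompts."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(z0, z1))))  # clamp for acos
    theta = math.acos(dot)
    if theta < 1e-6:  # nearly parallel: plain lerp is fine
        return [(1 - t) * a + t * b for a, b in zip(z0, z1)]
    s = math.sin(theta)
    return [
        math.sin((1 - t) * theta) / s * a + math.sin(t * theta) / s * b
        for a, b in zip(z0, z1)
    ]
```

Sweeping t from 0 to 1 between two behavior prompts produces the gradual transitions described in the experiments.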


Comparison with Related Work

BFM Research Lineage

timeline
    title Evolution of Behavioral Foundation Models
    section Early work
        2021 : FB representation learning (Touati & Ollivier)
             : "Zero-shot RL via FB representations"
    section Simulated humanoids
        2024.04 : ASE (Adversarial Skill Embeddings)
               : simulated character control
        2024.12 : Meta Motivo (FB-CPR)
               : SMPL humanoid, first BFM
    section Real robots
        2025.04 : H-HOVER, UniTracker
               : two-stage-training based
        2025.11 : BFM-Zero
               : first unsupervised-RL-based real-robot BFM

Key Comparisons

vs Meta Motivo (FB-CPR)

| Aspect | Meta Motivo | BFM-Zero |
|---|---|---|
| Environment | SMPL simulation | Unitree G1 real robot |
| Algorithm | FB-CPR | FB-CPR + Sim2Real |
| Sim-to-Real | ❌ | ✅ |
| Domain randomization | ❌ | ✅ |
| Asymmetric training | ❌ | ✅ |
| History encoding | MLP | LSTM |

BFM-Zero builds on Meta Motivo's core algorithm (FB-CPR) and adds every ingredient required for real-robot deployment.

vs H-HOVER / UniTracker / GMT

These follow a two-stage training paradigm:

  1. Stage 1: learn motion-tracking policies (PPO, on-policy)
  2. Stage 2: consolidate multiple skills via VAE/distillation

| Aspect | Two-stage approaches | BFM-Zero |
|---|---|---|
| Training stages | 2 | 1 |
| Base algorithm | PPO (on-policy) | FB-CPR (off-policy) |
| Motion-data dependence | high (quality-sensitive) | low (regularization only) |
| Zero-shot capability | limited | all three tasks |
| Scaling | hard | easy |

BFM-Zero's single-stage off-policy training enables more efficient data reuse and flexible scaling.

vs ASAP (He et al., 2025)

ASAP takes a different approach:
1. learn motion-tracking policies in simulation
2. collect data on the real robot
3. learn a delta (residual) action model

BFM-Zero is more practical in that it achieves sim-to-real transfer without collecting real-robot data.


Critical Assessment

Strengths

1. A paradigm shift

Moving from "per-task training → distillation" to "unified unsupervised training" charts a new direction for humanoid control research. In particular, it is the first demonstration that an off-policy algorithm is viable on a real robot.

2. An interpretable latent space

Thanks to the mathematical structure of the FB representation, the latent vector z has an interpretable meaning:
- z = B(s_g): the features of a goal state
- z = \sum r(s_i) B(s_i): a reward-weighted future-state distribution

This is a major advantage over black-box networks.

3. Practical engineering

Domain randomization, asymmetric training, LSTM history — every ingredient needed for real deployment is treated systematically, with thorough ablation studies.

4. Reproducibility

Code, checkpoints, and detailed hyperparameters are slated for release.

Weaknesses and Limitations

1. Instability of reward inference

As the paper itself notes, reward optimization is the weakest of the three tasks (the largest contributor to the 10.65% gap). Reward inference over domain-randomized data is inherently unstable.

The problem:
- the replay-buffer state distribution is broad (due to DR)
- reward inference: z = Σ B(s_i) r(s_i)
- the variance of z is large across subsamples
- result: the same reward function can yield different behaviors

2. Vulnerable to sparse rewards

The paper's reward functions are mostly continuous and dense. Performance on binary (success/failure) or very sparse rewards is unverified.

3. Limited by motion-data coverage

It inherits FB-CPR's limitation: behaviors absent from the motion-capture data (e.g., rolling, handstands) are hard to produce, and degraded performance on ground movements is noted.

4. No object interaction

The current system uses proprioception only. Tasks needing exteroception — object manipulation, environment exploration — are out of scope.

5. Compute cost

30M gradient steps (300M environment steps) demand substantial compute. The paper does not state concrete training times or GPU requirements.

Open Questions

  1. Scaling laws: how do larger motion datasets and larger models improve performance?
  2. Cross-robot generalization: does the approach transfer to humanoids other than the G1?
  3. Visual input: how could camera input be integrated?
  4. Language prompting: could a text-to-motion model enable language-driven control?

Suggested Future Directions

Short term (1-2 years)

1. Vision-proprioception fusion

Proposal:
- add a vision encoder (e.g., CLIP, DINOv2)
- state representation s = [proprioception, visual_features]
- the backward embedding B(s) then also encodes visual information

2. Hierarchical control

BFM-Zero currently handles only low-level control. Integration with high-level task planning:

High-level: LLM/VLM → subgoal sequence
Mid-level: BFM-Zero → z sequence
Low-level: policy π_z → joint torques

3. Better online adaptation

Beyond few-shot adaptation:
- real-time parameter estimation
- context-conditioned policies (Transformer-based)

Long term (3-5 years)

1. Learning from large-scale behavior data

Scaling up with YouTube videos and internet-scale human behavior data:
- video-to-3D motion estimation
- FB training under weak supervision

2. Multi-robot foundation models

One BFM controlling robots of different morphologies:
- morphology-conditioned policies
- cross-embodiment transfer

3. World-model integration

Combining FB representations with world models:
- future-state prediction + long-horizon planning
- hybrids with model-based RL


Takeaways for Practitioners

When should you consider the BFM-Zero approach?

✅ A good fit when:
- you need diverse whole-body humanoid behaviors
- fast adaptation to new tasks matters
- you have motion-capture data but per-task reward design is hard
- you need the sample efficiency of off-policy training

❌ A poor fit when:
- object manipulation is the main task
- extremely precise motions are required
- motion data is hard to obtain
- only very sparse reward signals are available

Implementation Checklist

When building a BFM-Zero-style system:

□ Choose a simulator (IsaacGym, MuJoCo, ...)
□ Prepare the robot's URDF/MJCF model
□ Obtain and retarget motion-capture data
□ Implement the FB network architecture
  - Forward: MLP with residual blocks
  - Backward: MLP with residual blocks
  - Policy: LSTM + MLP
□ Build the domain-randomization pipeline
  - physics-parameter randomization
  - sensor-noise modeling
  - external-disturbance simulation
□ Set up asymmetric training
  - define the privileged information
  - observation-history buffer
□ Design the reward-regularization terms
□ Tune hyperparameters
□ Validate sim-to-real

Conclusion

BFM-Zero sets an important milestone for humanoid robot control. By successfully applying off-policy unsupervised reinforcement learning to a real robot, it delivers on the BFM promise of "many tasks without retraining".

Its key contributions:

  1. First real-robot BFM: validated on a Unitree G1, beyond simulation
  2. Unified zero-shot inference: motion tracking, goal reaching, and reward optimization with a single policy
  3. Systematic sim-to-real: asymmetric training, domain randomization, and LSTM history combined
  4. Reproducible research: code and checkpoints to be released

There are, of course, limitations: unstable reward inference, restricted motion-data coverage, and no object interaction are challenges for future work.

But the direction this work points to is clear: one well-trained representation can unify many tasks. Just as language models unified diverse NLP tasks, behavioral foundation models can unify the many tasks of robot control.

The future of robotics will no longer be "train each task from scratch". BFM-Zero is an important first step toward that future.

References

  • Li, Y., et al. (2025). BFM-Zero: A Promptable Behavioral Foundation Model for Humanoid Control Using Unsupervised Reinforcement Learning. arXiv:2511.04131.
  • Tirinzoni, A., et al. (2025). Zero-Shot Whole-Body Humanoid Control via Behavioral Foundation Models. arXiv:2504.11054 (Meta Motivo).
  • Touati, A. & Ollivier, Y. (2021). Learning One Representation to Optimize All Rewards. NeurIPS.
  • He, T., et al. (2024). H-HOVER: Learning Humanoid Locomotion with Hybrid Depth from Videos.
  • Zeng, J., et al. (2025). Behavior Foundation Model for Humanoid Control.

โ›๏ธ Dig Review

โ›๏ธ Dig โ€” Go deep, uncover the layers. Dive into technical detail.

โ€œ์ด์ œ PPO๋กœ ๊ฑท๊ธฐ ํ•˜๋‚˜ ๊ฒจ์šฐ ๋ฐฐ์šฐ๋˜ ์‹œ๋Œ€์—์„œ, ํ•œ ๋ฒˆ ํ•™์Šตํ•œ ํ–‰๋™ ๊ณต๊ฐ„(latent space)์„ โ€™ํ”„๋กฌํ”„ํŠธโ€™๋กœ ๋‘๋“ค๊ฒจ์„œ ์›ํ•˜๋Š” ์ „์‹  ํ–‰๋™์„ ๊บผ๋‚ด ์“ฐ๋Š” ์‹œ๋Œ€๋กœ ๋„˜์–ด๊ฐ€๋ ค๋Š” ์‹œ๋„.โ€

1. Introduction: The Problem BFM-Zero Wants to Solve

1.1 Background — why is a "whole-body foundation model" hard?

The typical recent humanoid-control pipeline looks like this:

  • In a simulator (MuJoCo/IsaacGym, ...) → train a PPO-based whole-body policy on motion tracking or a specific reward → tune with domain randomization → transfer via Sim2Real.

This already works quite well for walking, running, get-up, and simple tasks. But:

  1. Task specialization:

    • A walking policy only walks; a motion-tracking policy only tracks its motion.
    • A new objective (a specific hand pose, "walk backwards with arms raised", ...) requires a new PPO run.
  2. Non-promptability:

    • There is almost no interface for prompting a single unified policy with "optimize this reward from now on" or "roughly follow this motion".
  3. Hard-to-reuse policies:

    • We lack a structured latent-space design for consolidating many trained policies into one Behavioral Foundation Model (BFM).

In short, we need something like "a GPT for whole-body behavior", but so far the field has mostly been stuck with on-policy RL plus per-task training.

1.2 BFM-Zero's Core Idea in One Line

Embed motions, goals, and rewards into a single latent space, and let one policy, given such a latent as a prompt, perform

  • motion tracking
  • goal-pose reaching
  • optimization of diverse rewards

zero-shot. To do this, BFM-Zero combines:

  • off-policy, unsupervised (reward-free) RL,
  • a successor-feature framework built on Forward-Backward (FB) representations,
  • regularization with motion-capture data,
  • domain randomization and asymmetric training,

to build the pipeline "goal/reward/demo → latent z → whole-body policy π(a | h, z)".


2. Problem Definition and Mathematical Framing

2.1 POMDP formalization

The paper casts real-world humanoid control as a POMDP: \mathcal{M} = (\mathcal{S}, \mathcal{O}, \mathcal{A}, p, \gamma)

  • State s \in \mathcal{S}

    • root height, base pose, base rotation
    • link positions/orientations, linear/angular velocities, etc. (privileged state)
  • Observation o \in \mathcal{O}

    • joint positions (normalized against a reference pose), joint velocities
    • root angular velocity, projected gravity, etc.
    • mainly proprioceptive observations available on the real robot
  • Action a \in \mathcal{A}

    • PD targets (desired joint positions) for the 29-DoF humanoid
  • History h_t = (o_{t-H+1:t}, a_{t-H+1:t-1})

    • the policy consumes a short observation/action history rather than a single observation.

This yields an asymmetric structure:

  • Actor \pi_\theta(a | h, z): history-based, latent-z-conditioned policy
  • Critics: consume the history plus the privileged state (the full s, or some \phi(s))

2.2 Forward–Backward Representation & Unsupervised RL

The core baseline is the recently proposed FB-CPR algorithm. Intuitively:

  1. Forward/backward maps

    • A forward map F_\psi(s, a, z)
    • and a backward map B_\phi(s, z) are learned so that the long-run state-visitation distribution induced by the policy \pi(\cdot|z) is approximated with a low-rank (k-dimensional latent) structure.
  2. The successor-feature view

    • Given some latent task feature \phi(s) \in \mathbb{R}^k,

    • the successor feature is

      \Psi_z(s,a) = \mathbb{E}\left[ \sum_{t\ge 0} \gamma^t \phi(s_t) \,\Big|\, s_0 = s, a_0 = a, \pi_z \right]

    • The FB representation can be seen as factorizing this successor feature into the forward and backward maps.

  3. Latent tasks & linear rewards

    • In the FB family, the task itself is folded into the latent vector z,

    • and the reward function is viewed as the linear combination

      r_z(s) = \langle w(z), \phi(s) \rangle

    • That is, choosing z amounts to choosing which reward is optimized.

Thanks to this structure:

  • if we pre-train, with unsupervised RL, policies \pi(\cdot|z) and successor features for many z,

  • then for any new linear reward w

    • we can find an appropriate z without retraining,
    • and use the corresponding \pi(\cdot|z) zero-shot.
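A toy sketch of this zero-shot selection (here `psi` stands in for the learned successor-feature evaluator and all names are hypothetical, not the paper's interface):

```python
def select_latent(candidates, psi, w):
    """Pick the candidate z whose expected return Q(z) = <w, Psi_z> is largest.

    candidates : candidate latents (e.g., sampled from a buffer)
    psi        : z -> successor-feature vector Psi_z (assumed given by the model)
    w          : linear reward weights, with r(s) = <w, phi(s)>
    """
    def q(z):
        # Q(z) is approximated by the inner product w^T Psi_z
        return sum(wi * pi for wi, pi in zip(w, psi(z)))
    return max(candidates, key=q)
```

The chosen z is then fed to \pi(\cdot|z) as the prompt — no gradient step on the policy is required.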

2.3 From FB-CPR to BFM-Zero: What Was Added?

FB-CPR is an unsupervised RL algorithm originally built for virtual characters. BFM-Zero layers the following on top:

  1. A design specialized for humanoid whole-body control

    • observation/action spaces and domain randomization tailored to a Unitree-G1-class humanoid.
  2. Extra ingredients for Sim2Real

    • asymmetric history-based training
    • massively parallel environments + a large replay buffer
    • domain randomization (mass, friction, CoM, sensor noise, disturbances)
    • reward regularization (joint-limit, torque/velocity penalties, ...)
  3. Stronger motion-capture regularization

    • a latent-conditioned discriminator enforces a "human-like style".

As a result, BFM-Zero forms the full pipeline:

  • Stage 1: learn a vast behavior latent space via unsupervised RL + mo-cap regularization

  • Stage 2: on this latent space, achieve

    1. zero-shot reward optimization
    2. zero-shot goal reaching
    3. zero-shot motion tracking
  • Stage 3: few-shot adaptation via CEM/trajectory optimization in latent-z space

3. BFM-Zero Methodology in Detail

3.1 Overall pipeline (Mermaid)

flowchart LR
  subgraph Pretrain[Pre-training in Simulation]
    A[Unlabeled MoCap Dataset D] -->|style regularization| Dscr[Latent-conditioned Discriminator]
    Sim[Humanoid Simulation Env] --> RB[Replay Buffer]
    RB --> FB["Forward & Backward Maps<br/>+ Successor Features"]
    FB --> Actor["History-based Actor π(a|h,z)"]
    Dscr --> Actor
    Crit["Privileged Critics<br/>(s-based)"] --> Actor
    DR["Domain Randomization<br/>+ Disturbances"] --> Sim
  end

  subgraph LatentSpace[Latent Space]
    Z[Shared Latent Space z]
  end

  Pretrain -->|learn mapping from tasks/motions/rewards| LatentSpace

  subgraph Inference[Zero-shot / Few-shot Inference]
    Task["Task Spec<br/>(reward, goal, motion)"] --> Enc["Task Encoder<br/>(embedding into z)"]
    Enc --> Z
    Z --> ActorRT["Actor π(a|h,z)"]
    ActorRT --> Robot[Unitree G1 Humanoid]
    Z --> CEM["Latent Optimization (CEM/DA)"]
    CEM --> Z
  end

3.2 Training Data and Simulation Setup

  • Mo-cap dataset \mathcal{D}: LAFAN1-based motions, split into segments,

    • with prioritized sampling by motion style/quality.
  • Environments:

    • thousands of parallel environments,
    • over 3M steps of total interaction,
    • a large replay buffer & a high update-to-data (UTD) ratio.

Key hyperparameters (summary)

| Item | Setting (summary) |
|---|---|
| History length H | 4 |
| Episode length | 500 steps |
| Parallel environments | ≈ 1024 |
| Total environment interaction | ~3M steps |
| Latent dimension | 256 |
| Actor/critic hidden size | 2048, residual blocks |
| Total parameters | ≈ 440M |

(Architecture: Transformer-style residual blocks + Mish activation, ensemble critics, etc.)


3.3 Key Design Elements

(A) Asymmetric Training

  • Actor:

    • input: only the observation history h_t (the same information available on the real robot)
  • Critics (FB, auxiliary, style critics):

    • input: privileged state s_t + history
    • richer state information gives more accurate value / successor-feature estimates.

→ A common Sim2Real technique, but here it is combined with the unsupervised RL + FB structure, where it substantially improves the policy's robustness and stability.
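The input asymmetry can be made concrete with a tiny sketch (shapes and function names are illustrative assumptions, not the paper's exact interface):

```python
def actor_input(obs_history, act_history, z):
    """The actor sees only what the real robot can provide: o/a history + prompt z."""
    flat = [x for o in obs_history for x in o]
    flat += [x for a in act_history for x in a]
    return flat + list(z)

def critic_input(priv_state, obs_history, act_history, z):
    """Critics additionally consume the privileged simulator state s (sim-only)."""
    return list(priv_state) + actor_input(obs_history, act_history, z)
```

At deployment only `actor_input` is ever constructed, which is exactly why the actor transfers while the critics can stay simulation-bound.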

(B) Domain Randomization (DR)

  • Randomize link masses, friction, inertia, CoM, joint offsets, sensor noise, external disturbances (kicks, pushes), etc.
  • The policy therefore does not overfit a single dynamics,
    • and recovers naturally even from large disturbances (kicks, pulls) on the real G1.

(C) Reward Regularization & Safety Critic

  • Penalties for approaching joint limits, excessive torque, unstable poses.
  • A separate auxiliary critic learns the reward expressing these constraints.

(D) Style Discriminator & Imitation Critic

  • A latent-conditioned discriminator is trained to compare mo-cap trajectories with policy-generated trajectories,
    • via a Jensen-Shannon-divergence-based GAN objective.
  • Its output becomes a style reward,
    • regularizing the policy toward "human-like movement".

3.4 Training Objectives (high level)

BFM-Zero combines several losses:

  1. FB objective

    • minimize the TD loss (Bellman residual) of the successor features.
  2. Auxiliary safety-critic loss

    • a TD loss for the Q-function encoding safety/physical constraints.
  3. Style-critic loss

    • a discriminator loss separating mo-cap data from policy rollouts.
  4. Actor loss

    • update the policy via policy gradient (off-policy actor-critic) to maximize a multi-critic advantage combining the Q-functions above.

Intuitively:

"The policy and the latent space are trained to simultaneously satisfy the latent tasks defined by the FB representation and the 'safe, human-like movement' defined by the safety and style critics."


3.5 Zero-shot Inference: Calling the Policy with Prompts

3.5.1 Reward Optimization

Suppose a new reward function r(s) is given.

  1. The model already has successor features \Psi_z for latent z.

  2. Viewing the reward as linear, r(s) = w^\top \phi(s),

    • the expected return for each z is Q(z) \approx w^\top \Psi_z
  3. So we evaluate \hat{Q}(z) for many z sampled from the replay buffer,

    • and pick the best z, or mix the distribution to explore several modes.

Feeding the resulting z to the policy as a "reward prompt" produces whole-body behavior that optimizes that reward, with no fine-tuning.

3.5.2 Goal Reaching

  • Take a goal pose (joint/root pose) as a goal state s_g in state space,
  • define a goal feature \phi_{\text{goal}}(s, s_g),
  • and embed it into the latent space to obtain z.

Prompting with this z makes the robot generate a trajectory that naturally converges from its current state toward the goal pose.

3.5.3 Motion Tracking

  • For a target motion trajectory \tau = (s_0, \dots, s_T),
  • extract the latent z_t corresponding to each segment in advance,
  • and feed z_t to the policy like a sliding window over time → tracking is performed.

3.6 Few-shot Adaptation in Latent Space

This is arguably the most "foundation-model-like" part of BFM-Zero's design.

3.6.1 Single-pose adaptation (CEM)

  • Example: stand on one leg while carrying a 4 kg payload.

  • The basic zero-shot z falls over within roughly 10 seconds on the real robot.

  • To compensate:

    1. start from z₀ = the zero-shot latent

    2. run the Cross-Entropy Method (CEM) in latent space

      • sample candidate z's → roll them out in simulation
      • keep only the z's that maximize a "don't fall + target foot height" reward and update the distribution
    3. deploy the final z⋆ on the real robot
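A minimal sketch of this latent-space CEM loop (the `score` function is a stand-in for the simulation rollout evaluation; names and defaults are assumptions):

```python
import random

def cem_latent_search(score, z0, iters=10, pop=64, n_elite=8, sigma=0.3):
    """Cross-Entropy Method over latents: resample around the elite mean.

    score : z -> scalar rollout score (e.g., "did not fall + target foot height"),
            treated as a black box here
    """
    mean = list(z0)
    for _ in range(iters):
        samples = [[m + random.gauss(0.0, sigma) for m in mean] for _ in range(pop)]
        samples.sort(key=score, reverse=True)                # best candidates first
        elites = samples[:n_elite]
        mean = [sum(col) / n_elite for col in zip(*elites)]  # refit the mean on elites
        sigma *= 0.9                                         # shrink the search radius
    return mean
```

Only the prompt z is optimized; the policy weights never change, which is what makes this "few-shot" rather than fine-tuning.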

  • Result:

    • unlike z₀, which fell in simulation, z⋆ maintains the single-leg payload stance for a long time.

3.6.2 Trajectory adaptation (dual annealing)

  • Example: tuning a leaping motion to an environment with a different friction coefficient.
  • With the original tracking latents, tracking error grows once the friction changes.
  • Optimizing the latent sequence with dual annealing
    • reduces tracking error by about 29.1%.

The key point:

Without any model-parameter updates, optimizing only the latent z absorbed the dynamics shift.


4. Experiments and Results

4.1 Simulation zero-shot validation

(The paper reports several task-specific metrics; the gist is as follows.)

  1. Tracking, goal reaching, and reward optimization are all performed

    • by one model,
    • zero-shot.
  2. Compared with FB-CPR and on-policy PPO multi-task baselines,

    • performance is on par or better in reward / tracking error,
    • with a clear edge in covering diverse downstream tasks without extra training.

4.2 Real Unitree G1 Experiments

4.2.1 Goal Reaching (Figure 5)

  • Several random poses (velocity components removed) are given as goals,
  • and their zero-shot latents are mixed to form a sequence of targets.

Observations:

  • Even when a goal pose is physically impossible to realize exactly,
    • the robot converges to a natural pose near it.
  • When moving between mutually discontinuous goal poses,
    • it produces smooth transition trajectories without any motion blending.

→ Strong evidence that the latent space is continuous and smoothly organized.

4.2.2 Reward Optimization (Figure 6)

Three families of rewards were tested:

  1. Locomotion rewards

    • target base velocity and yaw angular velocity
    • → forward/backward/sideways motion, turning, and combinations
  2. Arm-movement rewards

    • wrist height, raising/lowering the arms
    • → commanding upper-body/arm motion
  3. Pelvis-height rewards

    • sitting, crouching, low movement, etc.

Notable:

  • Even with simple reward definitions,
    • composed behaviors such as walking + raising an arm run naturally.
  • Sampling diverse z via the replay buffer
    • yields several stylistic modes of optimal behavior even for the same reward.

4.2.3 Disturbance Rejection (Figure 7)
  • Tests: strong side kicks, torso pushes, pulling the robot down to topple it, etc.

Results:

  • Beyond merely "withstanding" the disturbance,
    • the robot recovers balance by running a few steps, like a person would,
    • or falls and then naturally stands back up into the T-pose.
  • Even though the input z is a single static T-pose,
    • the policy generates recovery motions that deviate from the reference as the situation demands,
    • then returns to the reference.

→ An example showing that "the prompt is a single pose, yet the policy has dynamic recovery behavior built in."

4.2.4 Few-shot Adaptation (Figure 8)

  • The single-leg/payload and leaping adaptations described above
    • are validated on the real robot as well,
    • demonstrating Sim2Real adaptation rather than sim-only results.

4.3 Latent-Space Analysis

Figure 9 visualizes the latents z with t-SNE:

  • The z's for tracking / reward optimization / goal reaching
    • cluster by motion style/type,
    • with similar behaviors placed close together and dissimilar ones far apart.

Also:

  • Slerp interpolation between two latents z_1, z_2
    • produces meaningful intermediate behaviors (e.g., blends between walking and jumping).
    • This is a core qualification for a "behavioral foundation model".

5. Extensibility to Other Platforms

Framed around the question of applying this to other platforms (another humanoid, or an entirely different robot):

5.1 Prerequisites

To reach BFM-Zero's level, you need:

  1. A sufficiently accurate whole-body simulator

    • able to model joints/links, mass/inertia, friction, sensor noise, etc.
  2. Massive simulation parallelism

    • thousands of environments × millions of interaction steps
    • GPU physics engines (IsaacGym, Genesis, ...) are an advantage.
  3. A mo-cap or behavior dataset

    • a reference for the "human-like style".
    • For non-humanoids (e.g., quadrupeds, manipulators),
      • you must design how to define style (teleop, demonstrations, scripted behaviors, ...).
  4. PD-based low-level control

    • policy outputs as joint position targets fit most naturally.

5.2 Applying to Other Humanoids/Robots

The paper's appendix briefly reports applying BFM-Zero to another robot (Booster T1). Extrapolating from this:

  • Other humanoids

    • If the observation/action dimensions are not wildly different,
      • the FB-CPR + BFM-Zero structure is reusable almost as-is.
    • The issues are:
      • the motion dataset (needs retargeting per robot)
      • re-tuning the physics-parameter randomization.
  • Manipulator + mobile base (e.g., arm + wheels)

    • If the behavior space is not "whole-body motion",
      • the latent-space interpretation will differ from whole-body locomotion.
    • Still, the structure — embed rewards/goals/demos into a latent space → promptable policy —
      • carries over directly.

5.3 What about a dexterous hand like the Allegro Hand?

  • BFM-Zero currently centers on whole-body motion (locomotion + upper body).

  • To apply it to a 16-DoF hand like the Allegro Hand:

    1. build a hand-specific simulator & motion dataset (teleop / retargeting)
    2. learn successor features & an FB representation over "hand behaviors"
    3. embed rewards/goals such as grasp style or in-hand rotation into the latent space
  • Structurally the framework fits perfectly well,

    • but building the motion dataset and achieving sim fidelity would be the big bottlenecks.

In short:

Given "a simulator + a behavior dataset + parallel RL infrastructure", the BFM-Zero framework generalizes well. The actual implementation difficulty, however, is substantial.


6. Comparison with Related Work

6.1 Comparison with similar "behavioral foundation" efforts

The table below briefly compares BFM-Zero, the Behavior Foundation Model (BFM), ASAP, and RoboCat.

| Work | Domain | Data source | Algorithmic paradigm | Prompt/conditioning | Sim2Real / notes |
|---|---|---|---|---|---|
| BFM-Zero | Humanoid whole-body | Mo-cap + online RL | Off-policy unsupervised RL + FB | rewards, goals, motions as a unified latent z prompt | Unitree G1 real robot, strong DR + asymmetric training |
| BFM (Zeng) | Humanoid WBC | Large-scale behavior data | Generative model + CVAE + distillation | control modes/goals as conditional inputs | sim & real, masked-distillation-based |
| ASAP | Humanoid whole-body | Mo-cap + real rollouts | 2-stage on-policy RL + residual | motion-tracking policy + residual correction | focused on sim→real physics alignment |
| RoboCat | Manipulation (arms) | Multi-embodiment demos | Decision Transformer + BC | image/goal conditioning | many robot arms, few-shot-adaptation focus |

BFM-Zero์˜ ์ฐจ๋ณ„์ 

  • BFM-Zero vs BFM(Zeng):

    • BFM์€ generative CVAE + distillation ์ค‘์‹ฌ,

      • behavior distribution์„ ๋ชจ๋ธ๋งํ•˜๊ณ  distillํ•˜๋Š” ๋‘ ๋‹จ๊ณ„ ๊ตฌ์กฐ. * BFM-Zero๋Š” unsupervised RL + FB ๊ธฐ๋ฐ˜์œผ๋กœ

      • reward/goal/motion์„ ํ•˜๋‚˜์˜ latent task space์— ํ†ตํ•ฉ.

  • BFM-Zero vs ASAP:

    • ASAP์€ Sim๋ฌผ๋ฆฌ์™€ Real๋ฌผ๋ฆฌ์˜ ์ •๋ ฌ(alignment)์— ์ดˆ์ ,

      • motion tracking โ†’ real data โ†’ residual policy ํ•™์Šต์˜ 2๋‹จ๊ณ„ ๊ตฌ์กฐ. * BFM-Zero๋Š”

      • ํ•œ ๋ฒˆ์˜ ๋Œ€๊ทœ๋ชจ unsupervised pretrain์œผ๋กœ

      • reward/goal/motion promptable generalist policy๋ฅผ ๋ชฉํ‘œ๋กœ ํ•จ.

  • BFM-Zero vs RoboCat ๋“ฑ generalist manipulation:

    • RoboCat์€ Vision-based Decision Transformer๋กœ

      • multi-embodiment manipulation์„ ์ˆ˜ํ–‰. * BFM-Zero๋Š”

      • vision ์—†์ด proprioception + FB representation ๊ธฐ๋ฐ˜

      • humanoid whole-body dynamics์— ํŠนํ™”.

์ฆ‰, BFM-Zero๋Š”:

โ€œ์ „์‹  ํœด๋จธ๋…ธ์ด๋“œ์— ๋Œ€ํ•ด unsupervised RL + off-policy + FB๋ฅผ ์ด์šฉํ•ด โ€™๋ณด์ƒ/๋ชฉํ‘œ/๋ชจ์…˜์„ ํ•œ ๊ณต๊ฐ„์— embedํ•œ promptable ํ–‰๋™ ๋ชจ๋ธโ€™์„ ์ตœ์ดˆ๋กœ ์‹ค๋กœ๋ด‡๊นŒ์ง€ ๊ฐ€์ ธ๊ฐ„ ์ผ€์ด์Šคโ€

๋กœ ์œ„์น˜์ง€์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

7. ๋น„ํŒ์  ๊ณ ์ฐฐ

7.1 ๊ฐ•์ 

  1. ์ง„์งœ โ€œBehavioral Foundation Modelโ€์— ๊ฐ€๊นŒ์šด ๊ตฌ์กฐ

    • reward/goal/motion โ†’ latent z โ†’ ์ •์ฑ…
    • zero-shot + few-shot ๋ชจ๋‘ ์ง€์›ํ•˜๋Š” promptable ํ–‰๋™ ๊ณต๊ฐ„ ๊ตฌ์ถ•.
  2. Off-policy unsupervised RL์˜ ์‹ค๋กœ๋ด‡ ์ ์šฉ ์‚ฌ๋ก€

    • ์ง€๊ธˆ๊นŒ์ง€ ์‹ค๋กœ๋ด‡์€ ๊ฑฐ์˜ ํ•ญ์ƒ on-policy (PPO ๊ณ„์—ด)์— ์˜์กดํ–ˆ๋Š”๋ฐ,
    • BFM-Zero๋Š” ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ off-policy + FB ๊ตฌ์กฐ๋กœ ์‹ค์ œ Unitree G1์— robust policy๋ฅผ ์˜ฌ๋ ค๋‘ .
  3. ๊ณ ๊ธ‰ Sim2Real ์—”์ง€๋‹ˆ์–ด๋ง ์š”์†Œ์˜ ์กฐํ•ฉ

    • asymmetric training, DR, safety critic, style discriminator ๋“ฑ
    • ์ด๋ฏธ ์•Œ๋ ค์ง„ ๊ธฐ๋ฒ•๋“ค์„ unsupervised RL ํ”„๋ ˆ์ž„์›Œํฌ ์•ˆ์— ์ž˜ ์—ฎ์Œ.
  4. Latent-level adaptation

    • ํŒŒ๋ผ๋ฏธํ„ฐ ์—…๋ฐ์ดํŠธ ์—†์ด latent๋งŒ ์ตœ์ ํ™”ํ•ด์„œ payload ๋ณ€ํ™”, ๋งˆ์ฐฐ ๋ณ€ํ™” ๋“ฑ dynamics shift๋ฅผ ํก์ˆ˜ํ•œ ๊ฒƒ์€
    • ์‹ค๋ฌด์ ์ธ ๊ด€์ ์—์„œ๋„ โ€œํ…Œ์ŠคํŠธ ํ˜„์žฅ์—์„œ ์†์‰ฝ๊ฒŒ ํŠœ๋‹โ€ ํ•  ์—ฌ์ง€๋ฅผ ์คŒ.

7.2 ์•ฝ์  ๋ฐ ํ•œ๊ณ„

  1. ๋ฐ์ดํ„ฐ/์—ฐ์‚ฐ ๋น„์šฉ

    • ์ˆ˜๋ฐฑ๋งŒ step + ์ˆ˜์ฒœ ๋ณ‘๋ ฌ ํ™˜๊ฒฝ + 440M parameter ๊ทœ๋ชจ ๋ชจ๋ธ. * ์ผ๋ฐ˜ ์—ฐ๊ตฌ์‹ค/๊ธฐ์—…์—์„œ ๊ทธ๋Œ€๋กœ ์žฌํ˜„ํ•˜๊ธฐ์—” ์ปดํ“จํŒ… ์š”๊ตฌ์‚ฌํ•ญ์ด ์ƒ๋‹นํžˆ ํผ.
  2. Mo-cap ์˜์กด์„ฑ

    • ๋ชจ๋“  ํ–‰๋™์ด โ€œ์ธ๊ฐ„๋‹ค์šด ์Šคํƒ€์ผโ€์— regularization ๋˜๋ฏ€๋กœ
    • ๋ฐ์ดํ„ฐ์…‹์ด ์ปค๋ฒ„ํ•˜์ง€ ๋ชปํ•œ ์Šคํƒ€์ผ/์ž‘์—…์—์„œ๋Š” ํ–‰๋™ ํ’ˆ์งˆ์ด ๋–จ์–ด์งˆ ๊ฐ€๋Šฅ์„ฑ.
  3. Latent์˜ ํ•ด์„ ๊ฐ€๋Šฅ์„ฑ

    • t-SNE ์‹œ๊ฐํ™”, slerp ๋“ฑ์œผ๋กœ ์งˆ์  ํ•ด์„์€ ๋ฉ‹์ง€์ง€๋งŒ,
    • ์‹ค์ œ๋กœ z๊ฐ€ ๋ฌด์—‡์„ encodeํ•˜๋Š”์ง€ (์„ธ๋ถ€ semantics)๋Š” ์•„์ง๊นŒ์ง€ โ€œblack-ish boxโ€.
  4. ์ „์‹  ์™ธ ๋„๋ฉ”์ธ์œผ๋กœ์˜ ์ผ๋ฐ˜ํ™” ๊ฒ€์ฆ ๋ถ€์กฑ

    • ๋…ผ๋ฌธ์—์„œ๋Š” Booster T1 ๋“ฑ ์ผ๋ถ€ ๋‹ค๋ฅธ ๋กœ๋ด‡ ์˜ˆ์‹œ๊ฐ€ ์žˆ์ง€๋งŒ,
    • quadruped, mobile manipulator, hand ๋“ฑ ๋‹ค๋ฅธ embodiment์— ๋Œ€ํ•œ ์‹ค์งˆ์  ๊ฒ€์ฆ์€ ํ–ฅํ›„ ๊ณผ์ œ๋กœ ๋‚จ์•„ ์žˆ์Œ. โ€”

8. Suggested Future Directions (a roboticist's view)

  1. Scaling laws and data efficiency

    • Pin down mo-cap data volume / simulation steps / model size vs. performance.
    • Bring the scaling studies done for vision foundation models to behavior foundation models as well.
  2. Multimodal conditioning (vision / language)

    • BFM-Zero's reward prompts are already well shaped for mapping from language prompts.
    • Combined with a Vision-Language-Action (VLA) model:

      • a high-level instruction such as "walk slowly toward that box and lift it with your right hand" could become a reward spec and then a latent prompt.
  3. Extension to manipulation & dexterity

    • Existing Allegro Hand work (GeoRT, HORA, etc.) performs in-hand manipulation with retargeting / RL policies;

    • porting that structure into a BFM-Zero-style latent task space could:

      • unify diverse grasp, rotation, and sliding motions in a single promptable skill space.
  4. Combination with online adaptation / continual RL

    • Few-shot adaptation is currently test-time optimization.

    • Combining it with continual RL,

      • so that the latent space and policy keep self-improving as the environment slowly changes, is a promising direction.

9. ์š”์•ฝ ๋ฐ ๊ฒฐ๋ก 

BFM-Zero๋Š” ๋‹จ์ˆœํžˆ โ€œ๋˜ ํ•˜๋‚˜์˜ ํœด๋จธ๋…ธ์ด๋“œ RL ๋…ผ๋ฌธโ€์ด ์•„๋‹ˆ๋ผ,

โ€œ์ „์‹  ์ œ์–ด๋ฅผ ์œ„ํ•œ ํ–‰๋™ ํŒŒ์šด๋ฐ์ด์…˜ ๋ชจ๋ธ์ด ์‹ค์ œ ํœด๋จธ๋…ธ์ด๋“œ ๋กœ๋ด‡์—์„œ ์–ด๋””๊นŒ์ง€ ๊ฐ€๋Šฅํ•œ๊ฐ€?โ€๋ฅผ ํ˜„์‹ค์ ์œผ๋กœ ๋ณด์—ฌ์ฃผ๋Š” ์ฒซ ๋ฒˆ์งธ ๋‹จ๊ณ„ ์ค‘ ํ•˜๋‚˜์ž…๋‹ˆ๋‹ค.

์ •๋ฆฌํ•˜๋ฉด:

  • ๋ฌธ์ œ ์ •์˜:

    • ๋‹ค์–‘ํ•œ ํ–‰๋™(๋ณดํ–‰, ํฌ์ฆˆ, ์ƒ์ฒด ๋™์ž‘)์„ ํ•˜๋‚˜์˜ promptable generalist policy๋กœ ํ†ตํ•ฉํ•˜๋Š” ๊ฒƒ.
  • ๋ฐฉ๋ฒ•:

    • Off-policy unsupervised RL + FB representation
    • Mo-cap regularization + DR + asymmetric training
    • Reward/goal/motion๋ฅผ ํ•˜๋‚˜์˜ latent task space์— embed.
  • ๊ฒฐ๊ณผ:

    • Unitree G1์—์„œ

      • Zero-shot goal reaching, reward optimization, motion tracking
      • ๊ฐ•์ธํ•œ disturbance rejection
      • Latent-level few-shot adaptation๊นŒ์ง€ ์‹œ์—ฐ.
  • ์˜๋ฏธ:

    • PPO ๊ธฐ๋ฐ˜ single-task ์ •์ฑ…์˜ ์‹œ๋Œ€์—์„œ,
    • โ€œํ–‰๋™ space๋ฅผ ๋จผ์ € ๊ฑฐ๋Œ€ํ•˜๊ฒŒ ํ•™์Šตํ•ด๋‘๊ณ , ์ดํ›„ ํ”„๋กฌํ”„ํŠธ์ฒ˜๋Ÿผ task๋ฅผ ์ง€์ •ํ•ด ์“ฐ๋Š”โ€ Behavioral Foundation Model paradigm์œผ๋กœ ํœด๋จธ๋…ธ์ด๋“œ ์ œ์–ด๋ฅผ ๋Œ๊ณ  ๊ฐ€๋Š” ์ค‘์š”ํ•œ ์ „ํ™˜์ .

์ฐธ๊ณ  ๋ฌธํ—Œ

  • BFM-Zero: A Promptable Behavioral Foundation Model for Humanoid Control Using Unsupervised Reinforcement Learning
  • Behavior Foundation Model for Humanoid Robots
  • ASAP: Aligning Simulation and Real-World Physics
  • RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
BFM-Zero: An In-Depth Analysis of a Promptable Behavioral Foundation Model for Whole-Body Humanoid Control

์„œ๋ก : ํœด๋จธ๋…ธ์ด๋“œ ์ œ์–ด์˜ ํŒจ๋Ÿฌ๋‹ค์ž„ ์ „ํ™˜๊ณผ ํŒŒ์šด๋ฐ์ด์…˜ ๋ชจ๋ธ์˜ ํ•„์š”์„ฑ

๋กœ๋ด‡๊ณตํ•™์˜ ์—ญ์‚ฌ์—์„œ ์ธ๊ฐ„ํ˜• ๋กœ๋ด‡, ์ฆ‰ ํœด๋จธ๋…ธ์ด๋“œ๋ฅผ ์ œ์–ดํ•˜๋Š” ๊ฒƒ์€ ์–ธ์ œ๋‚˜ โ€™์ตœ์ข… ๊ด€๋ฌธโ€™๊ณผ ๊ฐ™์€ ๊ณผ์ œ์˜€๋‹ค. ์ˆ˜๋งŽ์€ ๊ด€์ ˆ๊ณผ ๋ณต์žกํ•œ ๋™์—ญํ•™, ๊ทธ๋ฆฌ๊ณ  ๋ถˆ์•ˆ์ •ํ•œ ํ‰ํ˜• ์ƒํƒœ๋ฅผ ์œ ์ง€ํ•ด์•ผ ํ•˜๋Š” ํŠน์„ฑ์€ ์ œ์–ด ์ด๋ก ๊ฐ€๋“ค์—๊ฒŒ ๋Š์ž„์—†๋Š” ๋„์ „ ๊ณผ์ œ๋ฅผ ์ œ์‹œํ•ด ์™”๋‹ค. ๊ณผ๊ฑฐ์˜ ์ œ์–ด ๋ฐฉ์‹์€ ์ฃผ๋กœ ๋ฌผ๋ฆฌ ๋ชจ๋ธ์— ๊ธฐ๋ฐ˜ํ•œ ๊ณ„์‚ฐ(Model-based Control)์— ์˜์กดํ–ˆ์œผ๋‚˜, ์ด๋Š” ํ™˜๊ฒฝ์˜ ๋ณ€ํ™”๋‚˜ ์˜ˆ๊ธฐ์น˜ ๋ชปํ•œ ์™ธ๋ž€์— ๋งค์šฐ ์ทจ์•ฝํ–ˆ๋‹ค. ์ตœ๊ทผ 10๋…„ ์‚ฌ์ด ๊ฐ•ํ™”ํ•™์Šต(Reinforcement Learning, RL)์˜ ๋ฐœ์ „์€ ์ด๋Ÿฌํ•œ ์ง€ํ˜•์„ ์™„์ „ํžˆ ๋ฐ”๊พธ์–ด ๋†“์•˜๋‹ค. ํŠนํžˆ PPO(Proximal Policy Optimization)์™€ ๊ฐ™์€ ์˜จํด๋ฆฌ์‹œ(On-policy) ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์‹œ๋ฎฌ๋ ˆ์ด์…˜์—์„œ ์ˆ˜์ฒœ๋งŒ ๋ฒˆ์˜ ์‹œํ–‰์ฐฉ์˜ค๋ฅผ ๊ฑฐ์ณ ๋กœ๋ด‡์ด ๊ฑท๊ณ , ๋›ฐ๊ณ , ์‹ฌ์ง€์–ด ๊ณต์ค‘์ œ๋น„๋ฅผ ๋Œ๊ฒŒ ๋งŒ๋“œ๋Š” ๋ฐ ์„ฑ๊ณตํ–ˆ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ์ด๋Ÿฌํ•œ ์„ฑ์ทจ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ๊ทผ๋ณธ์ ์ธ ํ•œ๊ณ„๋Š” ์—ฌ์ „ํžˆ ๋‚จ์•„ ์žˆ์—ˆ๋‹ค. ๊ธฐ์กด์˜ ๊ฐ•ํ™”ํ•™์Šต ๋ฐฉ์‹์€ โ€˜ํŠน์ •ํ•œ ๋ณด์ƒ ํ•จ์ˆ˜โ€™์— ์ข…์†๋œ โ€™๋‹จ์ผ ์ž‘์—…โ€™ ์ „๋ฌธ๊ฐ€๋ฅผ ์–‘์‚ฐํ•˜๋Š” ๋ฐ ๊ทธ์ณค๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ํŠน์ • ๋ชจ์…˜ ์บก์ฒ˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋”ฐ๋ผํ•˜๋„๋ก ํ•™์Šต๋œ ๋กœ๋ด‡์€ ๊ทธ ๋™์ž‘ ์ด์™ธ์˜ ์ƒˆ๋กœ์šด ์š”๊ตฌ ์‚ฌํ•ญ์ด ์ฃผ์–ด์ง€๋ฉด ์•„๋ฌด๊ฒƒ๋„ ํ•  ์ˆ˜ ์—†๊ฒŒ ๋œ๋‹ค. ์ƒˆ๋กœ์šด ์ž‘์—…์„ ์‹œํ‚ค๋ ค๋ฉด ๋‹ค์‹œ ๋ณด์ƒ ํ•จ์ˆ˜๋ฅผ ์„ค๊ณ„ํ•˜๊ณ  ์ฒ˜์Œ๋ถ€ํ„ฐ ํ•™์Šต์„ ์‹œ์ž‘ํ•ด์•ผ ํ•œ๋‹ค. ์ด๋Š” ์ธ๊ฐ„์ด ํ•˜๋‚˜์˜ ๊ธฐ๋ณธ ์ฒด๋ ฅ์„ ๋ฐ”ํƒ•์œผ๋กœ ์ถ•๊ตฌ, ๋†๊ตฌ, ์ถค์„ ๋น ๋ฅด๊ฒŒ ๋ฐฐ์šฐ๋Š” ๊ฒƒ๊ณผ๋Š” ๋Œ€์กฐ์ ์ด๋‹ค.

์ด๋Ÿฌํ•œ ๋ฐฐ๊ฒฝ์—์„œ ๋“ฑ์žฅํ•œ ๊ฐœ๋…์ด ๋ฐ”๋กœ โ€™ํ–‰๋™ ํŒŒ์šด๋ฐ์ด์…˜ ๋ชจ๋ธ(Behavioral Foundation Models, BFMs)โ€™์ด๋‹ค. ์–ธ์–ด ๋ชจ๋ธ์ด ๊ฑฐ๋Œ€ํ•œ ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•ด ์–ธ์–ด์˜ ๊ตฌ์กฐ๋ฅผ ์ตํžˆ๊ณ  ์–ด๋–ค ์งˆ๋ฌธ์—๋„ ๋‹ตํ•  ์ˆ˜ ์žˆ๋Š” ๋Šฅ๋ ฅ์„ ๊ฐ–์ถ”๋Š” ๊ฒƒ์ฒ˜๋Ÿผ, ๋กœ๋ด‡์—๊ฒŒ๋„ ์ „์‹  ์›€์ง์ž„์˜ ๊ทผ๋ณธ์ ์ธ โ€™๋ฌธ๋ฒ•โ€™์„ ๊ฐ€๋ฅด์น˜๋ ค๋Š” ์‹œ๋„์ด๋‹ค. BFM-Zero๋Š” ๋ฐ”๋กœ ์ด ์ง€์ ์—์„œ ํ˜์‹ ์ ์ธ ํ•ด๋ฒ•์„ ์ œ์‹œํ•œ๋‹ค. ์ด ๋ชจ๋ธ์€ ๋น„์ง€๋„ ๊ฐ•ํ™”ํ•™์Šต(Unsupervised RL)์„ ํ†ตํ•ด ๋กœ๋ด‡์˜ ๋ชจ๋“  ๊ฐ€๋Šฅํ•œ ํ–‰๋™์„ ํ•˜๋‚˜์˜ ์ •๊ตํ•œ ์ž ์žฌ ๊ณต๊ฐ„(Latent Space)์— ๋งคํ•‘ํ•œ๋‹ค. ์ด ๋ณด๊ณ ์„œ๋Š” BFM-Zero๊ฐ€ ์–ด๋–ป๊ฒŒ ์žฌํ•™์Šต ์—†์ด(Zero-shot) ๋‹ค์–‘ํ•œ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š”์ง€, ๊ทธ๋ฆฌ๊ณ  ๊ทธ ์ด๋ฉด์— ์ˆจ๊ฒจ์ง„ ์ˆ˜ํ•™์  ์ง๊ด€๊ณผ ๊ณตํ•™์  ์„ค๊ณ„๋ฅผ ์‹ฌ์ธต์ ์œผ๋กœ ๋ถ„์„ํ•œ๋‹ค.

๋ฐฉ๋ฒ•๋ก : ํ–‰๋™์˜ ์ง€๋„๋ฅผ ๊ทธ๋ฆฌ๋Š” ์ „๋ฐฉ-ํ›„๋ฐฉ ํ‘œํ˜„ํ˜•์˜ ์ˆ˜ํ•™์  ์ง๊ด€

BFM-Zero์˜ ํ•ต์‹ฌ ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋จผ์ € โ€™์ „๋ฐฉ-ํ›„๋ฐฉ(Forward-Backward, FB) ํ‘œํ˜„ํ˜•โ€™์ด๋ผ๋Š” ์ˆ˜ํ•™์  ๋„๊ตฌ์— ์ฃผ๋ชฉํ•ด์•ผ ํ•œ๋‹ค. ์ด๋ฅผ ์•„์ฃผ ์ง๊ด€์ ์œผ๋กœ ์„ค๋ช…ํ•˜์ž๋ฉด, ์šฐ๋ฆฌ๊ฐ€ ๋‚ฏ์„  ๋„์‹œ์— ๋„์ฐฉํ•ด ์ง€๋„๋ฅผ ๋งŒ๋“œ๋Š” ๊ณผ์ •๊ณผ ๋น„์Šทํ•˜๋‹ค. ๊ธฐ์กด์˜ ๊ฐ•ํ™”ํ•™์Šต์ด โ€œ์ง‘์—์„œ ๋„์„œ๊ด€๊นŒ์ง€ ๊ฐ€๋Š” ๊ฐ€์žฅ ๋น ๋ฅธ ๊ธธโ€๋งŒ์„ ์™ธ์šฐ๋Š” ๊ฒƒ์ด๋ผ๋ฉด, FB ํ‘œํ˜„ํ˜•์€ โ€œ๋„์‹œ์˜ ๋ชจ๋“  ๋„๋กœ๊ฐ€ ์–ด๋–ป๊ฒŒ ์—ฐ๊ฒฐ๋˜์–ด ์žˆ๋Š”์ง€โ€๋ฅผ ํŒŒ์•…ํ•˜์—ฌ ์ง€๋„ ์ž์ฒด๋ฅผ ๊ทธ๋ฆฌ๋Š” ์ž‘์—…์ด๋‹ค.

๊ณ„์Šน ์ธก๋„์™€ ๊ฐ€์น˜ ํ•จ์ˆ˜์˜ ๋ถ„ํ•ด

์ผ๋ฐ˜์ ์ธ ๋งˆ๋ฅด์ฝ”ํ”„ ๊ฒฐ์ • ๊ณผ์ •(MDP)์—์„œ ๊ฐ€์น˜ ํ•จ์ˆ˜ Q(s, a)๋Š” ํ˜„์žฌ ์ƒํƒœ s์—์„œ ํ–‰๋™ a๋ฅผ ์ทจํ–ˆ์„ ๋•Œ ๊ธฐ๋Œ€๋˜๋Š” ๋ฏธ๋ž˜ ๋ณด์ƒ์˜ ํ•ฉ์ด๋‹ค. BFM-Zero๋Š” ์ด ๊ฐ€์น˜ ํ•จ์ˆ˜๋ฅผ ๋ณด์ƒ(Reward)๊ณผ ๋™์—ญํ•™(Dynamics)์œผ๋กœ ์™„์ „ํžˆ ๋ถ„๋ฆฌํ•œ๋‹ค. ์ด๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๊ฒƒ์ด โ€˜๊ณ„์Šน ์ธก๋„(Successor Measure)โ€™ M^\pi(X|s, a)์ด๋‹ค. ์ด๋Š” ์ •์ฑ… \pi๋ฅผ ๋”ฐ๋ฅผ ๋•Œ ๋ฏธ๋ž˜์— ์ƒํƒœ ์ง‘ํ•ฉ X์— ๋ฐฉ๋ฌธํ•˜๊ฒŒ ๋  ํ• ์ธ๋œ ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ์˜๋ฏธํ•œ๋‹ค.

FB ํ‘œํ˜„ํ˜•์€ ์ด ๊ณ„์Šน ์ธก๋„๋ฅผ ๋‘ ๊ฐœ์˜ ์ €์ฐจ์› ๋ฒกํ„ฐ์˜ ๋‚ด์ ์œผ๋กœ ๊ทผ์‚ฌํ•œ๋‹ค:

M^\pi(X|s, a) \approx \int_{s' \in X} F(s, a, z)^\top B(s') \rho(ds')

์—ฌ๊ธฐ์„œ F(s, a, z)๋Š” ์ „๋ฐฉ ํ‘œํ˜„ํ˜•(Forward Representation)์œผ๋กœ, ํ˜„์žฌ ์ƒํƒœ์™€ ํ–‰๋™์ด ๋ฏธ๋ž˜์— ์–ด๋–ค โ€™์ž ์žฌ์  ๋ฐฉํ–ฅโ€™์œผ๋กœ ๋‚˜์•„๊ฐˆ์ง€๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค. ๋ฐ˜๋ฉด B(s')๋Š” ํ›„๋ฐฉ ํ‘œํ˜„ํ˜•(Backward Representation)์œผ๋กœ, ํŠน์ • ์ƒํƒœ s'๊ฐ€ ๋„๋‹ฌํ•˜๊ธฐ ์œ„ํ•ด ์–ด๋–ค โ€™ํŠน์ง•โ€™์„ ๊ฐ€์ ธ์•ผ ํ•˜๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค. ์ด ๋ถ„ํ•ด๊ฐ€ ๋†€๋ผ์šด ์ด์œ ๋Š”, ์–ด๋–ค ๋ณด์ƒ ํ•จ์ˆ˜ r(s)๊ฐ€ ์ฃผ์–ด์ง€๋”๋ผ๋„ ๊ฐ€์น˜ ํ•จ์ˆ˜๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‹จ์ˆœํžˆ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค:

Q^\pi_r(s, a) = F(s, a, z)^\top z \quad \text{where} \quad z = E_{s \sim \rho}[B(s)r(s)]

์ฆ‰, ๋กœ๋ด‡์€ ๋ณด์ƒ์ด ๋ฌด์—‡์ธ์ง€ ๋ฏธ๋ฆฌ ์•Œ ํ•„์š” ์—†์ด ์„ธ์ƒ์˜ ์ด์น˜(๋™์—ญํ•™)๋ฅผ ์ „๋ฐฉ ํ‘œํ˜„ํ˜•์œผ๋กœ ์ตํžˆ๊ณ , ๋ณด์ƒ์ด ์ฃผ์–ด์ง€๋Š” ์ˆœ๊ฐ„ ๊ทธ๊ฒƒ์„ ์ž ์žฌ ๋ฒกํ„ฐ z๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ์ฆ‰์‹œ ์ตœ์ ์˜ ํ–‰๋™์„ ์ฐพ์•„๋‚ธ๋‹ค. ์ด๊ฒƒ์ด ๋ฐ”๋กœ BFM-Zero๊ฐ€ โ€˜์ œ๋กœ์ƒท(Zero-shot)โ€™ ์„ฑ๋Šฅ์„ ๋‚ผ ์ˆ˜ ์žˆ๋Š” ๊ทผ๋ณธ์ ์ธ ์ด์œ ์ด๋‹ค.

FB-CPR: ์ธ๊ฐ„๋‹ค์šด ์›€์ง์ž„์„ ์œ„ํ•œ ๊ฐ€์ด๋“œ๋ผ์ธ

๋‹จ์ˆœํžˆ ๋ฌผ๋ฆฌ์ ์ธ ์›€์ง์ž„๋งŒ ๋ฐฐ์šฐ๊ฒŒ ํ•˜๋ฉด ํœด๋จธ๋…ธ์ด๋“œ๋Š” ๊ธฐ๊ดดํ•˜๊ฒŒ ๊ด€์ ˆ์„ ๊บพ๊ฑฐ๋‚˜ ๋น„ํšจ์œจ์ ์œผ๋กœ ์›€์ง์ผ ์ˆ˜ ์žˆ๋‹ค. BFM-Zero๋Š” ์ด๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด FB-CPR(Conditional Policy Regularization) ๊ธฐ๋ฒ•์„ ๋„์ž…ํ•œ๋‹ค. ์ด๋Š” ๋ชจ์…˜ ์บก์ฒ˜ ๋ฐ์ดํ„ฐ์…‹์„ ํ™œ์šฉํ•˜์—ฌ ๋กœ๋ด‡์˜ ํƒ์ƒ‰ ๋ฒ”์œ„๋ฅผ โ€˜์ธ๊ฐ„๋‹ค์šด ๋™์ž‘โ€™ ๊ทผ์ฒ˜๋กœ ํ•œ์ •ํ•˜๋Š” ์—ญํ• ์„ ํ•œ๋‹ค.

์—ฐ๊ตฌํŒ€์€ GAN(Generative Adversarial Network) ์Šคํƒ€์ผ์˜ ํŒ๋ณ„๊ธฐ(Discriminator)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋กœ๋ด‡์˜ ํ˜„์žฌ ์ •์ฑ…์ด ์ƒ์„ฑํ•˜๋Š” ์ƒํƒœ-์ž ์žฌ ๋ณ€์ˆ˜ ๋ถ„ํฌ์™€ ์‹ค์ œ ์ธ๊ฐ„์˜ ๋ชจ์…˜ ๋ฐ์ดํ„ฐ ๋ถ„ํฌ๋ฅผ ๋น„๊ตํ•œ๋‹ค. ํŒ๋ณ„๊ธฐ๋Š” ๋กœ๋ด‡์˜ ์›€์ง์ž„์ด ์ธ๊ฐ„ ๋ฐ์ดํ„ฐ์…‹์— ์žˆ๋Š” ๊ฒƒ์ธ์ง€ ์•„๋‹Œ์ง€๋ฅผ ํŒ๋ณ„ํ•˜๊ณ , ๋กœ๋ด‡์€ ํŒ๋ณ„๊ธฐ๋ฅผ ์†์ด๊ธฐ ์œ„ํ•ด ๋”์šฑ ์ž์—ฐ์Šค๋Ÿฌ์šด ๋™์ž‘์„ ์ทจํ•˜๋„๋ก ํ•™์Šต๋œ๋‹ค. ์ด ๊ณผ์ •์€ ๋น„์ง€๋„ ํ•™์Šต์ž„์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ๋กœ๋ด‡์ด ๋งค์šฐ ์•ˆ์ •์ ์ด๊ณ  ๋ฏธํ•™์ ์œผ๋กœ๋„ ์šฐ์ˆ˜ํ•œ ์ „์‹  ๊ธฐ์ˆ ์„ ์Šต๋“ํ•˜๊ฒŒ ๋งŒ๋“ ๋‹ค.

| Component | Role | Details |
| --- | --- | --- |
| Forward net (F-net) | Future prediction | Predicts outcomes from state s and action a under latent vector z |
| Backward net (B-net) | State feature extraction | Embeds an arbitrary state s' into the latent space |
| Policy net (\pi_z) | Action selection | Selects the action a maximizing the value function for the current s and given z |
| Discriminator (D) | Behavior regularization | Keeps the robot's movements close to real human motion data |

graph TD
    subgraph Pre-training_Algorithm
        Data[Mo-cap Dataset] --> Disc[Discriminator D]
        State[State s] --> FNet[Forward Network F]
        Action[Action a] --> FNet
        Latent[Latent z] --> FNet
        FNet --> TD[TD Loss]
        BNet[Backward Network B] --> TD
        Disc --> Policy[Policy pi_z]
        TD --> Policy
    end

    subgraph Inference_Pipeline
        Goal[Goal / Reward / Motion] --> Encoder[Inference Formula]
        Encoder --> TargetZ[Target latent z]
        TargetZ --> Policy
        Policy --> Control[Unitree G1 Actuators]
    end

Sim-to-Real: ์‹œ๋ฎฌ๋ ˆ์ด์…˜์˜ ์ง€ํ˜œ๋ฅผ ํ˜„์‹ค์˜ ๊ฐ๊ฐ์œผ๋กœ

์‹œ๋ฎฌ๋ ˆ์ด์…˜์—์„œ ์•„๋ฌด๋ฆฌ ์ž˜ ๊ฑท๋Š” ๋กœ๋ด‡์ด๋ผ๋„ ํ˜„์‹ค์˜ ๊ฑฐ์นœ ๋ฐ”๋‹ฅ๊ณผ ์„ผ์„œ ๋…ธ์ด์ฆˆ ์•ž์—์„œ๋Š” ๋ฌด๋„ˆ์ง€๊ธฐ ์‰ฝ๋‹ค. BFM-Zero๋Š” ์ด ๊ฐ„๊ทน์„ ๋ฉ”์šฐ๊ธฐ ์œ„ํ•ด ์„ธ ๊ฐ€์ง€ ํ•ต์‹ฌ ๊ณตํ•™์  ์„ค๊ณ„๋ฅผ ๋„์ž…ํ–ˆ๋‹ค.

Asymmetric History-Dependent Training

์‹œ๋ฎฌ๋ ˆ์ด์…˜์˜ ๋น„ํ‰๊ฐ€(Critic)๋Š” ๋กœ๋ด‡์˜ ์ •ํ™•ํ•œ ์งˆ๋Ÿ‰ ์ค‘์‹ฌ, ๊ด€์ ˆ ๋งˆ์ฐฐ๋ ฅ, ์ง€๋ฉด ๋ฐ˜๋ ฅ ๋“ฑ โ€™ํŠน๊ถŒ ์ •๋ณด(Privileged Information)โ€™๋ฅผ ๋ชจ๋‘ ์•Œ๊ณ  ์žˆ๋‹ค. ํ•˜์ง€๋งŒ ์‹ค์ œ ๋กœ๋ด‡์˜ ์ •์ฑ…(Policy)์€ ์˜ค์ง ๊ด€์ ˆ ๊ฐ๋„์™€ ์ž์ด๋กœ์Šค์ฝ”ํ”„ ๊ฐ™์€ โ€™๊ฐ€์‹œ์  ์ƒํƒœ(Observable State)โ€™์—๋งŒ ์˜์กดํ•ด์•ผ ํ•œ๋‹ค.

BFM-Zero๋Š” ์ด ์ •๋ณด์˜ ๋ถˆ๊ท ํ˜•์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ •์ฑ…๋ง์— โ€™์—ญ์‚ฌ(History)โ€™๋ฅผ ์ฃผ์ž…ํ•œ๋‹ค. ์ฆ‰, ๋‹จ์ˆœํžˆ ํ˜„์žฌ ์ƒํƒœ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๊ณผ๊ฑฐ H ์Šคํ… ๋™์•ˆ์˜ ๊ด€์ธก๊ฐ’๊ณผ ํ–‰๋™ ๊ธฐ๋ก์„ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›๋Š”๋‹ค. ์ด๋Š” ๋กœ๋ด‡์ด ๋ช…์‹œ์ ์ธ ๋ฌผ๋ฆฌ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์•Œ์ง€ ๋ชปํ•˜๋”๋ผ๋„, ๊ณผ๊ฑฐ์˜ ์›€์ง์ž„ ํŒจํ„ด์„ ํ†ตํ•ด โ€œ์ง€๊ธˆ ๋‚ด ๋ฐœ ๋ฐ‘์ด ๋ฏธ๋„๋Ÿฝ๊ตฌ๋‚˜โ€ ํ˜น์€ โ€œ๋‚ด ๋“ฑ์— ๋ฌด๊ฑฐ์šด ์ง์ด ์‹ค๋ ธ๊ตฌ๋‚˜โ€๋ผ๋Š” ์‚ฌ์‹ค์„ ๋‚ด๋ถ€์ ์œผ๋กœ ์ถ”๋ก (Implicit Inference)ํ•˜๊ฒŒ ๋งŒ๋“ ๋‹ค.

Domain Randomization & Auxiliary Rewards

During training, the team applies domain randomization (DR), randomly varying the robot's mass, link lengths, friction coefficients, and so on. This keeps the robot from overfitting to a specific environment and helps it adapt to the underlying physics. For real-world safety, they also add auxiliary rewards that keep joints within their limits and suppress abrupt torque changes. Interestingly, these auxiliary rewards act as the "safety fence" essential for real hardware operation without undermining the unsupervised nature of the training.

ํŒŒ๋ผ๋ฏธํ„ฐ ์œ ํ˜• ๊ฐ€์‹œ์  ์ƒํƒœ (Proprioception) ํŠน๊ถŒ ์ •๋ณด (Privileged Info)
๊ด€์ ˆ ๋ฐ์ดํ„ฐ q_t, \dot{q}_t (์œ„์น˜, ์†๋„) ๋ชจ๋“  ๋งํฌ์˜ ์งˆ๋Ÿ‰ ์ค‘์‹ฌ ์œ„์น˜
๋ฃจํŠธ ๋ฐ์ดํ„ฐ \omega_{root}, g_t (๊ฐ์†๋„, ์ค‘๋ ฅ) ๋ฃจํŠธ ์„ ์†๋„, ์ง€๋ฉด๊ณผ์˜ ๊ฑฐ๋ฆฌ
์™ธ๋ถ€ ํ™˜๊ฒฝ ๊ณผ๊ฑฐ ํ–‰๋™ ๊ธฐ๋ก a_{t-H:t-1} ๋งˆ์ฐฐ ๊ณ„์ˆ˜, ๊ฒฝ์‚ฌ๋„, ์™ธ๋ถ€ ์„ญ๋™๋ ฅ
๋ฐ์ดํ„ฐ ์ฐจ์› 64์ฐจ์› (๋‹จ์ผ ์‹œ์  ๊ธฐ์ค€) 463์ฐจ์› (ํŠน๊ถŒ ์ •๋ณด ํฌํ•จ)

์‹คํ—˜ ๋ฐ ๊ฒฐ๊ณผ ๋ถ„์„: Unitree G1์—์„œ์˜ ์‹ค์ฆ์  ์„ฑ๊ณผ

BFM-Zero์˜ ์„ฑ๋Šฅ์€ Unitree G1 ํœด๋จธ๋…ธ์ด๋“œ๋ฅผ ํ†ตํ•ด ๊ฒ€์ฆ๋˜์—ˆ๋‹ค. Unitree G1์€ 23๊ฐœ์—์„œ ์ตœ๋Œ€ 43๊ฐœ์˜ ์ž์œ ๋„๋ฅผ ๊ฐ€์ง„ ๊ณ ์„ฑ๋Šฅ ๋กœ๋ด‡์œผ๋กœ, ์ „์‹  ์ œ์–ด์˜ ๋‚œ์ด๋„๊ฐ€ ๋งค์šฐ ๋†’๋‹ค.

์ œ๋กœ์ƒท ์ž‘์—… ์ˆ˜ํ–‰ (Zero-shot Performance)

ํ•™์Šต์ด ๋๋‚œ ํ›„, ์—ฐ๊ตฌ์ง„์€ ๋กœ๋ด‡์—๊ฒŒ ํ•œ ๋ฒˆ๋„ ๊ฐ€๋ฅด์ณ์ฃผ์ง€ ์•Š์€ ์„ธ ๊ฐ€์ง€ ์œ ํ˜•์˜ ์ž‘์—…์„ ์ฆ‰์„์—์„œ ๋ช…๋ น(Prompting)ํ–ˆ๋‹ค.

  • ๋ชฉํ‘œ ๋„๋‹ฌ (Goal Reaching): ๋กœ๋ด‡์—๊ฒŒ ํŠน์ • ์ •์ง€ ์ž์„ธ s_g๋ฅผ ์ฃผ๋ฉด, ์ •์ฑ…์€ z = B(s_g)๋ฅผ ํ†ตํ•ด ํ˜„์žฌ ์œ„์น˜์—์„œ ํ•ด๋‹น ์ž์„ธ๋กœ ๋ถ€๋“œ๋Ÿฝ๊ฒŒ ์ „์ดํ•œ๋‹ค. ์ด๋Š” ๋งˆ์น˜ ์ˆ™๋ จ๋œ ๋ฌด์šฉ์ˆ˜๊ฐ€ ์–ด๋–ค ํฌ์ฆˆ๋ฅผ ์š”๊ตฌ๋ฐ›์•˜์„ ๋•Œ ๋ชธ์˜ ๊ท ํ˜•์„ ์œ ์ง€ํ•˜๋ฉฐ ์šฐ์•„ํ•˜๊ฒŒ ๊ทธ ์ž์„ธ๋ฅผ ์ทจํ•˜๋Š” ๊ฒƒ๊ณผ ๊ฐ™๋‹ค.
  • ๋™์ž‘ ์ถ”์ข… (Motion Tracking): ์—ฐ์†์ ์ธ ๋ชจ์…˜ ์บก์ฒ˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ž…๋ ฅ์œผ๋กœ ์ฃผ๋ฉด, ๋กœ๋ด‡์€ ์ด๋ฅผ ์‹ค์‹œ๊ฐ„์œผ๋กœ ๋”ฐ๋ผ๊ฐ„๋‹ค. BFM-Zero๋Š” ๊ธฐ์กด์˜ SOTA ๋ชจ๋ธ์ธ GMT(Global Motion Tracking)๋ณด๋‹ค ๋” ์ ์€ ๋ฐ์ดํ„ฐ๋กœ๋„ ํ›จ์”ฌ ๋” ๋งค๋„๋Ÿฌ์šด ์ถ”์ข… ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค.
  • ๋ณด์ƒ ์ตœ์ ํ™” (Reward Optimization): โ€œ๋จธ๋ฆฌ ๋†’์ด๋ฅผ ์œ ์ง€ํ•˜๋ฉฐ ์˜†์œผ๋กœ ๊ฑธ์–ด๋ผโ€์™€ ๊ฐ™์€ ํ…์ŠคํŠธ ํ˜•ํƒœ์˜ ๋…ผ๋ฆฌ์  ๋ณด์ƒ ํ•จ์ˆ˜๋ฅผ ์ฃผ๋ฉด, ๋กœ๋ด‡์€ ์ž ์žฌ ๊ณต๊ฐ„์„ ํƒ์ƒ‰ํ•˜์—ฌ ์ด๋ฅผ ๋งŒ์กฑํ•˜๋Š” ์›€์ง์ž„์„ ์ฆ‰์‹œ ์ƒ์„ฑํ•œ๋‹ค.

์™ธ๋ž€์— ๋Œ€ํ•œ ๊ฐ•๊ฑด์„ฑ (Disturbance Rejection)

BFM-Zero์˜ ๊ฐ€์žฅ ์ธ์ƒ์ ์ธ ์žฅ๋ฉด ์ค‘ ํ•˜๋‚˜๋Š” ๋กœ๋ด‡์ด ํฐ ์™ธ๋ถ€ ์ถฉ๊ฒฉ์„ ๋ฐ›์•˜์„ ๋•Œ์ด๋‹ค. ๋กœ๋ด‡์ด ๋ณดํ–‰ ์ค‘ ์˜†์—์„œ ๊ฐ•ํ•˜๊ฒŒ ๋ฐ€๋ฆฌ๋ฉด, ์ •์ฑ…์€ ์ž ์žฌ ๊ณต๊ฐ„์˜ ์—ฐ์†์„ฑ์„ ํ™œ์šฉํ•˜์—ฌ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ โ€™ํšŒ๋ณต ๋™์ž‘โ€™์œผ๋กœ ์ „์ดํ•œ๋‹ค. ์ด๋Š” ํŠน์ • ๊ถค์ ๋งŒ์„ ๊ณ ์ง‘ํ•˜๋Š” ๊ธฐ์กด ๋ฐฉ์‹๊ณผ ๋‹ฌ๋ฆฌ, ์ž ์žฌ ๊ณต๊ฐ„ ์ž์ฒด๊ฐ€ ๋กœ๋ด‡์˜ ๋™์—ญํ•™์  ์•ˆ์ •์„ฑ์„ ๋‚ดํฌํ•˜๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ€๋Šฅํ•œ ๊ฒฐ๊ณผ์ด๋‹ค.

ํ“จ์ƒท ์ ์‘: 4kg์˜ ํŽ˜์ด๋กœ๋“œ๋ฅผ ๊ฒฌ๋””๋‹ค

์ œ๋กœ์ƒท ์ถ”๋ก ์ด ๋ชจ๋“  ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜๋Š” ์—†๋‹ค. ํ•™์Šต ์‹œ ๊ฒฝํ—˜ํ•˜์ง€ ๋ชปํ•œ 4kg์˜ ๋ฌด๊ฑฐ์šด ํŽ˜์ด๋กœ๋“œ๊ฐ€ ๊ฐ‘์ž๊ธฐ ์ถ”๊ฐ€๋˜์—ˆ์„ ๋•Œ, ๋กœ๋ด‡์€ ์ฒ˜์Œ์— ๋‹ค์†Œ ๋ถˆ์•ˆ์ •ํ•œ ๋ชจ์Šต์„ ๋ณด์˜€๋‹ค. ํ•˜์ง€๋งŒ ์—ฐ๊ตฌ์ง„์€ ์ž ์žฌ ๊ณต๊ฐ„ Z ๋‚ด์—์„œ ์ƒ˜ํ”Œ๋ง ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”(CMA-ES ๋“ฑ)๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” โ€˜ํ“จ์ƒท ์ ์‘(Few-shot Adaptation)โ€™ ๊ณผ์ •์„ ๊ฑฐ์ณค๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ ๋‹จ 2๋ถ„ ๋ฏธ๋งŒ์˜ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ์ ์‘๋งŒ์œผ๋กœ ๋กœ๋ด‡์€ ํŽ˜์ด๋กœ๋“œ์˜ ๋ฌด๊ฒŒ๋ฅผ ๊ฐ์•ˆํ•˜์—ฌ ๋ฌด๊ฒŒ ์ค‘์‹ฌ์„ ๋’ค๋กœ ์˜ฎ๊ธฐ๋Š” ๋ฒ•์„ ์Šค์Šค๋กœ ํ„ฐ๋“ํ–ˆ๊ณ , ํ•œ ๋ฐœ ์„œ๊ธฐ ์‹œ๊ฐ„์„ ํš๊ธฐ์ ์œผ๋กœ ๋Š˜๋ฆฌ๋Š” ๋ฐ ์„ฑ๊ณตํ–ˆ๋‹ค.

๋น„ํŒ์  ๊ณ ์ฐฐ: ์—ฐ๊ตฌ์˜ ๊ฐ•์ ๊ณผ ํ•œ๊ณ„

๊ฐ•์  ๋ฐ ๊ธฐ์—ฌ๋„

BFM-Zero๋Š” ๋น„์ง€๋„ ๊ฐ•ํ™”ํ•™์Šต์ด ์‹ค์ œ ๋ณต์žกํ•œ ํœด๋จธ๋…ธ์ด๋“œ ํ•˜๋“œ์›จ์–ด์—์„œ๋„ ์ž‘๋™ํ•  ์ˆ˜ ์žˆ์Œ์„ ์ฆ๋ช…ํ•œ ์ตœ์ดˆ์˜ ์‚ฌ๋ก€ ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ํŠนํžˆ ๋ณด์ƒ ํ•จ์ˆ˜๋ฅผ ๋งค๋ฒˆ ์žฌ์„ค๊ณ„ํ•  ํ•„์š” ์—†์ด โ€™ํ”„๋กฌํ”„ํŠธโ€™๋ฅผ ํ†ตํ•ด ๋กœ๋ด‡์˜ ํ–‰๋™์„ ์ œ์–ดํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์€ ๋กœ๋ด‡ ๊ณตํ•™์˜ ๋Œ€์ค‘ํ™”์™€ ํ™•์žฅ์„ฑ ์ธก๋ฉด์—์„œ ์—„์ฒญ๋‚œ ๊ธฐ์—ฌ๋ฅผ ํ•œ๋‹ค. ๋˜ํ•œ, FB ํ‘œํ˜„ํ˜•์˜ ์ˆ˜ํ•™์  ๊ฐ„๊ฒฐํ•จ์ด ์‹ค์ œ ํœด๋จธ๋…ธ์ด๋“œ์˜ ๋ณต์žกํ•œ ๋™์—ญํ•™์„ ํšจ๊ณผ์ ์œผ๋กœ ์••์ถ•ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค.

ํ•œ๊ณ„์  ๋ฐ ๊ฐœ์„  ๋ฐฉํ–ฅ

๋ฌผ๋ก  ์™„๋ฒฝํ•œ ๋ชจ๋ธ์€ ์•„๋‹ˆ๋‹ค. ์ฒซ์งธ, ์ž ์žฌ ๊ณต๊ฐ„์˜ ๋ถ•๊ดด(Latent Collapse) ์œ„ํ—˜์ด ์กด์žฌํ•œ๋‹ค. ํŠน์ • ๋„๋ฉ”์ธ ๋žœ๋คํ™” ์กฐ๊ฑด์—์„œ ์ผ๋ถ€ ๋™์ž‘๋“ค์ด ์ž ์žฌ ๊ณต๊ฐ„ ์ƒ์˜ ํ•œ ์ ์œผ๋กœ ๋ชจ์—ฌ๋ฒ„๋ ค ํ–‰๋™์˜ ๋‹ค์–‘์„ฑ์ด ์ƒ์‹ค๋˜๋Š” ํ˜„์ƒ์ด ๊ด€์ฐฐ๋˜์—ˆ๋‹ค. ๋‘˜์งธ, ์ •๋ฐ€ํ•œ ์†๋„ ์ œ์–ด์˜ ์–ด๋ ค์›€์ด๋‹ค. ๋น„๋””์˜ค ๋ถ„์„ ๊ฒฐ๊ณผ, ๋กœ๋ด‡์ด ์ •์ง€ ์ƒํƒœ์—์„œ๋„ ๋ฏธ์„ธํ•˜๊ฒŒ ํ๋ฅด๋Š”(Drifting) ํ˜„์ƒ์ด ๋ณด์ด๋Š”๋ฐ, ์ด๋Š” ๊ด€์ธก ๋ฐ์ดํ„ฐ์…‹์— ๋ฃจํŠธ์˜ ์„ ์†๋„๊ฐ€ ํฌํ•จ๋˜์ง€ ์•Š์€ ์„ค๊ณ„์  ์„ ํƒ์—์„œ ๊ธฐ์ธํ–ˆ์„ ๊ฐ€๋Šฅ์„ฑ์ด ํฌ๋‹ค. ์…‹์งธ, ๋น„์ง€๋„ ํ•™์Šต์˜ ํ’ˆ์งˆ์ด ์—ฌ์ „ํžˆ ๋ชจ์…˜ ๋ฐ์ดํ„ฐ์…‹์˜ ์งˆ์— ์˜์กดํ•œ๋‹ค๋Š” ์ ์ด๋‹ค. ์ธ๊ฐ„์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์—†๋Š” ๊ธฐ๋ฐœํ•œ ๋™์ž‘(์˜ˆ: ๋ฌผ๊ตฌ๋‚˜๋ฌด ์„œ์„œ ๊ฑท๊ธฐ ๋“ฑ)์€ ํ˜„์žฌ์˜ FB-CPR ๊ตฌ์กฐ ํ•˜์—์„œ๋Š” ํ•™์Šต๋˜๊ธฐ ์–ด๋ ต๋‹ค.

์š”์•ฝ ๋ฐ ๊ฒฐ๋ก 

BFM-Zero๋Š” โ€™ํ•™์Šต๋œ ์ „๋ฌธ๊ฐ€โ€™์—์„œ โ€™ํ•™์Šต ๊ฐ€๋Šฅํ•œ ์ผ๋ฐ˜์ธโ€™์œผ๋กœ ํœด๋จธ๋…ธ์ด๋“œ์˜ ํŒจ๋Ÿฌ๋‹ค์ž„์„ ์ „ํ™˜ํ•˜๋ ค๋Š” ์‹œ๋„์ด๋‹ค. ์ด ๋ชจ๋ธ์€ ์ „๋ฐฉ-ํ›„๋ฐฉ ํ‘œํ˜„ํ˜•์ด๋ผ๋Š” ๊ฐ•๋ ฅํ•œ ์ˆ˜ํ•™์  ํ† ๋Œ€ ์œ„์— ๋น„์ง€๋„ ํ•™์Šต๊ณผ ๋น„๋Œ€์นญ ์—ญ์‚ฌ ํ•™์Šต์ด๋ผ๋Š” ๊ณตํ•™์  ๊ธฐ์ˆ ์„ ๊ฒฐํ•ฉํ•˜์—ฌ, ์‹ค์ œ ์„ธ๊ณ„์˜ ๋ณต์žก์„ฑ์„ ๊ฒฌ๋ŽŒ๋‚ด๋Š” ํ–‰๋™ ํŒŒ์šด๋ฐ์ด์…˜ ๋ชจ๋ธ์„ ๊ตฌ์ถ•ํ–ˆ๋‹ค.

๋กœ๋ด‡๊ณตํ•™์ž๋“ค์—๊ฒŒ ์ด ์—ฐ๊ตฌ๊ฐ€ ์ฃผ๋Š” ๋ฉ”์‹œ์ง€๋Š” ๋ช…ํ™•ํ•˜๋‹ค. ์šฐ๋ฆฌ๊ฐ€ ๋กœ๋ด‡์—๊ฒŒ ๋ชจ๋“  ์ƒํ™ฉ์— ๋Œ€ํ•œ ์ •๋‹ต์„ ๊ฐ€๋ฅด์น  ์ˆ˜๋Š” ์—†์ง€๋งŒ, ๋กœ๋ด‡์ด ์Šค์Šค๋กœ ์ •๋‹ต์„ ์ฐพ์„ ์ˆ˜ ์žˆ๋Š” โ€™๊ณต๊ฐ„โ€™๊ณผ โ€™์–ธ์–ดโ€™๋ฅผ ๋งŒ๋“ค์–ด์ค„ ์ˆ˜๋Š” ์žˆ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. BFM-Zero๊ฐ€ ๊ตฌ์ถ•ํ•œ ์ด ์ž ์žฌ ๊ณต๊ฐ„์€ ํ–ฅํ›„ ๋Œ€๊ทœ๋ชจ ์–ธ์–ด ๋ชจ๋ธ(LLM)์ด๋‚˜ ์‹œ๊ฐ ๋ชจ๋ธ(VLM)๊ณผ ๊ฒฐํ•ฉ๋˜์–ด, ์ธ๊ฐ„์˜ ๊ณ ์ˆ˜์ค€ ๋ช…๋ น์„ ์ €์ˆ˜์ค€์˜ ์ •๊ตํ•œ ๋ชจํ„ฐ ์ œ์–ด๋กœ ์—ฐ๊ฒฐํ•˜๋Š” ํ•ต์‹ฌ ๊ณ ๋ฆฌ๊ฐ€ ๋  ๊ฒƒ์ด๋‹ค.

๊ฒฐ๋ก ์ ์œผ๋กœ BFM-Zero๋Š” ํœด๋จธ๋…ธ์ด๋“œ ์ œ์–ด์˜ ์ƒˆ๋กœ์šด ํ‘œ์ค€์„ ์ œ์‹œํ–ˆ๋‹ค. ์ด์ œ ์šฐ๋ฆฌ๋Š” ๋กœ๋ด‡์—๊ฒŒ โ€œ์–ด๋–ป๊ฒŒ ๊ฑธ์„์ง€โ€๋ฅผ ๊ฐ€๋ฅด์น˜๋Š” ๋‹จ๊ณ„๋ฅผ ๋„˜์–ด, ๋กœ๋ด‡์ด ์Šค์Šค๋กœ โ€œ์–ด๋–ค ์›€์ง์ž„์ด ๊ฐ€๋Šฅํ•œ์ง€โ€๋ฅผ ํƒ๊ตฌํ•˜๊ฒŒ ๋งŒ๋“ค๊ณ  ์žˆ๋‹ค. ์ด๊ฒƒ์ด ๋ฐ”๋กœ ์šฐ๋ฆฌ๊ฐ€ ๊ฟˆ๊พธ๋˜ โ€™์ผ๋ฐ˜ ์ง€๋Šฅ์„ ๊ฐ€์ง„ ๋กœ๋ด‡โ€™์œผ๋กœ ๊ฐ€๋Š” ๊ฐ€์žฅ ์œ ๋งํ•œ ๊ฒฝ๋กœ ์ค‘ ํ•˜๋‚˜์ž„์— ํ‹€๋ฆผ์—†๋‹ค.

Copyright 2026, JungYeon Lee