📃 Residual Off-Policy RL Review

rl
off-policy
residual
Residual Off-Policy RL for Finetuning Behavior Cloning Policies
Published

November 28, 2025

๐Ÿ” Ping. ๐Ÿ”” Ring. โ›๏ธ Dig. A tiered review series: quick look, key ideas, deep dive.

  • Paper
  • Code
  • Project
  1. 🦾 To address the data limitations of Behavior Cloning (BC) policies and the difficulty of applying Reinforcement Learning (RL) on real robots, this work proposes ResFiT (Residual Fine-Tuning), a residual learning framework that combines the strengths of BC and RL.
  2. 🤖 ResFiT treats a pre-trained BC policy as a black box and learns a lightweight per-step residual correction with sample-efficient off-policy RL, effectively improving the BC policy's performance on high-degree-of-freedom systems.
  3. 🏆 The proposed method achieves state-of-the-art performance in simulation with roughly 200x better sample efficiency, and demonstrates the first successful real-world RL training on a humanoid robot with five-fingered hands, showing its practical applicability in robotics.

๐Ÿ” Ping Review

๐Ÿ” Ping โ€” A light tap on the surface. Get the gist in seconds.

์ด ๋…ผ๋ฌธ์€ Behavior Cloning (BC)์˜ ์žฅ์ ๊ณผ Reinforcement Learning (RL)์˜ ์žฅ์ ์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๊ณ ์ž์œ ๋„(high-degree-of-freedom, DoF) ๋กœ๋ด‡ ์‹œ์Šคํ…œ์—์„œ ํšจ๊ณผ์ ์ธ visuomotor ์ œ์–ด ์ •์ฑ…์„ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•œ Residual Off-Policy RL (ResFiT) ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ์กด BC๋Š” ๋ฐ๋ชจ ๋ฐ์ดํ„ฐ์˜ ํ’ˆ์งˆ, ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ ๋…ธ๋ ฅ, ๊ทธ๋ฆฌ๊ณ  ์˜คํ”„๋ผ์ธ ๋ฐ์ดํ„ฐ์˜ ํ•œ๊ณ„๋กœ ์ธํ•ด ์ •์ฑ… ์„ฑ๋Šฅ์ด ํฌํ™”๋˜๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด RL์€ ์ž์œจ์ ์ธ ํ™˜๊ฒฝ ์ƒํ˜ธ์ž‘์šฉ์„ ํ†ตํ•ด ํ•™์Šตํ•˜์ง€๋งŒ, ์ƒ˜ํ”Œ ๋น„ํšจ์œจ์„ฑ, ์•ˆ์ „ ๋ฌธ์ œ, ๊ทธ๋ฆฌ๊ณ  ํฌ์†Œํ•œ ๋ณด์ƒ(sparse reward)์œผ๋กœ๋ถ€ํ„ฐ ์žฅ๊ธฐ์ ์ธ ์ž‘์—…์„ ํ•™์Šตํ•˜๋Š” ์–ด๋ ค์›€ ๋•Œ๋ฌธ์— ์‹ค์„ธ๊ณ„ ๋กœ๋ด‡์— ์ง์ ‘ ์ ์šฉํ•˜๊ธฐ๊ฐ€ ์–ด๋ ต์Šต๋‹ˆ๋‹ค.

ResFiT๋Š” ์‚ฌ์ „ ํ•™์Šต๋œ BC ์ •์ฑ…์„ ๋ธ”๋ž™๋ฐ•์Šค(black-box) ๊ธฐ๋ณธ ์ •์ฑ…์œผ๋กœ ํ™œ์šฉํ•˜๊ณ , ๊ทธ ์œ„์— ๊ฒฝ๋Ÿ‰์˜ ๋‹จ๊ณ„๋ณ„(per-step) ์ž”์—ฌ(residual) ๋ณด์ •๊ฐ’์„ ์ƒ˜ํ”Œ ํšจ์œจ์ ์ธ ์˜คํ”„-์ •์ฑ…(off-policy) RL์„ ํ†ตํ•ด ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฐฉ์‹์€ ๊ธฐ๋ณธ ์ •์ฑ…์˜ ํŒŒ๋ผ๋ฏธํ„ฐํ™”๋‚˜ ์•ก์…˜ ์ฒญํ‚น(action chunking) ๋ฐฉ์‹์— ๊ตฌ์• ๋ฐ›์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ์‹œ๋ฎฌ๋ ˆ์ด์…˜๊ณผ ์‹ค์„ธ๊ณ„ ๋ชจ๋‘์—์„œ ๊ณ ์ž์œ ๋„ ์‹œ์Šคํ…œ์˜ ์กฐ์ž‘ ์ •์ฑ…์„ ํšจ๊ณผ์ ์œผ๋กœ ๊ฐœ์„ ํ•˜๋ฉฐ, ํŠนํžˆ 5-fingered hand๋ฅผ ๊ฐ€์ง„ ํœด๋จธ๋…ธ์ด๋“œ ๋กœ๋ด‡์—์„œ ์‹ค์„ธ๊ณ„ RL ํ›ˆ๋ จ์„ ์„ฑ๊ณต์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•œ ์ตœ์ดˆ์˜ ์‚ฌ๋ก€๋ผ๊ณ  ์ฃผ์žฅํ•ฉ๋‹ˆ๋‹ค.

Core Methodology

ResFiT consists of two stages:

  1. Base policy training (Behavior Cloning with Action Chunking):
    • At each timestep t the agent receives an observation o_t and executes an action a_t.
    • First, a demonstration dataset D_{demos} of successful trajectories \tau = (o_0, a_0, o_1, a_1, \dots) is collected via human teleoperation.
    • This dataset is used to train a base policy \pi_\psi(a_{t:t+k}|o_t) with action chunking: at each timestep the policy predicts a sequence of k future actions.
    • The training objective is to maximize the log-likelihood of the action chunks in the demonstrations: \min_\psi - \mathbb{E}_{o_t, a_{t:t+k} \in D_{demos}} \log \pi_\psi(a_{t:t+k}|o_t).
    • Action chunking improves performance and helps mitigate compounding errors in imitation learning. The trained \pi_{base} is then frozen.
  2. Fine-tuning with Off-Policy Residual RL:
    • On top of the frozen base policy \pi_{base}, a new policy \pi_{res} is trained with RL to correct the mistakes of \pi_{base} and improve overall performance.
    • The residual policy \pi_{res} is agnostic to the internal parameterization and training scheme of \pi_{base}, and the magnitude of the residual can be constrained to keep exploration stable.
    • The problem is modeled as a Markov decision process (MDP) with states s_t \in \mathcal{S}, actions a_t \in \mathcal{A}, rewards r_t = R(s_t, a_t), discount factor \gamma, and horizon H.
    • Whereas standard off-policy RL methods learn Q_\phi(s_t, a_t) and \pi_\theta(s_t), ResFiT reparameterizes them for the residual setting:
      • The critic Q_\phi learns Q_\phi(s_t, a_{base_t} + \pi_\theta(s_t, a_{base_t})), where a_{base_t} is the base action obtained from \pi_{base}(s_t).
      • The policy \pi_\theta outputs the residual action and is therefore written as \pi_\theta(s_t, a_{base_t}).
      • The full action is a_t = a_{base_t} + a_{res_t} with a_{res_t} = \pi_\theta(s_t, a_{base_t}); the critic estimates the value of this full action.
    • Critic learning:
      • To approximate the optimal action-value function Q^\star(s_t, a_t) via the Bellman equation, a mean-squared Bellman error (MSBE) loss is used.
      • Given a dataset of transitions D=(s_t, a_t, r_t, s_{t+1}, d_t), the loss is: L(\phi) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}, d_t) \sim D} \left[ \left( Q_\phi(s_t, a_t) - \left( r_t + \gamma(1 - d_t) Q_\phi(s_{t+1}, a_{base_{t+1}} + \pi_\theta(s_{t+1}, a_{base_{t+1}})) \right) \right)^2 \right], where a_{base_{t+1}} = \pi_{base}(s_{t+1}).
    • Policy learning:
      • The policy \pi_\theta(s_t, a_{base_t}) is trained by gradient ascent on the value function (a code sketch of both losses follows this list): L(\theta) = - \mathbb{E}_{(s_t, a_{base_t}) \sim D} \left[ Q_\phi(s_t, a_{base_t} + \pi_\theta(s_t, a_{base_t})) \right]
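
A minimal sketch (not the authors' code) of the two losses above in the residual parameterization, written with PyTorch; the network sizes, dimensions, and the omission of target networks and critic ensembles are simplifying assumptions:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, batch = 10, 4, 32
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
res_pi = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())

def residual_action(s, a_base):
    # pi_theta(s, a_base): residual conditioned on the state and the base action
    return res_pi(torch.cat([s, a_base], dim=-1))

def q_value(s, a):
    return q_net(torch.cat([s, a], dim=-1)).squeeze(-1)

# Fake transition batch (s, a, r, s', done) plus frozen base actions at t and t+1.
s, s_next = torch.randn(batch, obs_dim), torch.randn(batch, obs_dim)
a_base, a_base_next = torch.randn(batch, act_dim), torch.randn(batch, act_dim)
a = a_base + residual_action(s, a_base).detach()  # action that was actually executed
r, done, gamma = torch.randn(batch), torch.zeros(batch), 0.99

# Critic loss: MSBE with the bootstrapped target built from base + residual at s'.
with torch.no_grad():
    a_next = a_base_next + residual_action(s_next, a_base_next)
    target = r + gamma * (1.0 - done) * q_value(s_next, a_next)
critic_loss = ((q_value(s, a) - target) ** 2).mean()

# Actor loss: gradient ascent on Q, adjusting only the residual policy.
actor_loss = -q_value(s, a_base + residual_action(s, a_base)).mean()
print(critic_loss.item(), actor_loss.item())
```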

Key Design Decisions

The following design decisions are what make residual fine-tuning stable and sample-efficient (a hypothetical configuration sketch follows this list):

  • Update-to-Data (UTD) ratio: a UTD ratio greater than 1 is used to improve sample efficiency.
  • n-step returns: n-step returns (n=3 in the paper), effective for long-horizon, sparse-reward tasks, are used: \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n Q(s_{t+n}, a_{t+n}).
  • Layer normalization in the critic: layer normalization is added to the critic to mitigate Q-function overestimation caused by function approximation.
  • TD3 (Twin Delayed Deep Deterministic Policy Gradient) techniques: delayed actor updates, target-network updates with Polyak averaging, and target policy smoothing are applied to reduce instability.
  • Randomized Ensembled Double Q-Learning (REDQ): an ensemble of Q-functions is used to reduce overestimation bias; TD targets take the minimum over a random subset of the ensemble, while policy updates use the full ensemble.
  • Visual input processing: a shallow ViT encoder with DrQ-style random shift augmentations is applied to visual inputs.
  • Symmetric sampling: to exploit demonstration data during the online RL phase, 50% of each batch is sampled from the fixed offline demonstrations and 50% from the continuously growing online buffer.
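
To make the recipe concrete, the decisions above can be collected into a small configuration object. The field names below are ours, and any value not stated in the text is an illustrative placeholder:

```python
from dataclasses import dataclass

@dataclass
class ResidualRLConfig:
    utd_ratio: int = 4                # gradient updates per environment step
    n_step: int = 3                   # n-step return horizon (value stated in this summary)
    num_q_networks: int = 10          # REDQ-style critic ensemble size (placeholder)
    q_target_subset: int = 2          # critics sampled for the min in the TD target (placeholder)
    actor_update_period: int = 2      # delayed actor updates: critic steps per actor step (placeholder)
    target_noise_std: float = 0.2     # target policy smoothing noise (placeholder)
    target_noise_clip: float = 0.5    # smoothing noise clip (placeholder)
    demo_batch_fraction: float = 0.5  # symmetric sampling: half demos, half online data
    polyak_tau: float = 0.005         # target-network Polyak averaging rate (placeholder)
```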

Experimental Results

  • Simulation results:
    • ResFiT converges to near-perfect policies on all simulation tasks and shows roughly 200x higher sample efficiency than PPO, an on-policy RL method.
    • On difficult high-DoF, long-horizon tasks such as BoxCleanup, CanSort, and Coffee, ResFiT reaches high performance more efficiently than the other baselines and ablated variants.
    • Filtered BC was stable but yielded only minimal improvement over the initial BC policy, suggesting that explicit value maximization is needed for precision-critical tasks.
    • The importance of the UTD ratio and n-step returns was also confirmed; on sparse-reward tasks, n-step returns with n greater than 1 are essential.
  • Real-world RL results:
    • ResFiT was applied to two tasks, WoollyBallPnP and PackageHandover, on Vega, a 29-DoF bimanual humanoid robot.
    • On WoollyBallPnP, ResFiT improved the base policy's 14% success rate to 64% after 134 rollouts (about 15 minutes of robot execution data).
    • On PackageHandover, it raised the base policy's 23% success rate to 64% after 343 RL episodes (about 76 minutes of data).
    • This is reported as the first demonstration of real-world RL on a bimanual humanoid robot with five-fingered hands.

Conclusion

By decoupling the pre-training and fine-tuning stages, the paper presents an effective way to improve BC policies while sidestepping RL's optimization difficulties. The base policy not only provides a good initialization but also acts as an implicit safety constraint and a strong exploration prior, making RL with sparse rewards feasible in high-DoF settings. The main limitation of ResFiT is that the learned behavior can remain constrained to the neighborhood of the base policy, but the real-world validation proves that sample-efficient RL can work on bimanual manipulation platforms.

🔔 Ring Review

🔔 Ring — An idea that echoes. Grasp the core and its value.

Introduction: Two Paradigms of Robot Learning Meet

If you work in robot manipulation, you keep running into one fundamental dilemma. Behavior Cloning (BC) produces impressive visuomotor policies, but it is ultimately capped by the quality of human demonstrations. No matter how much data you collect, performance plateaus at some point, and the policy faithfully inherits the teleoperator's mistakes and imprecision. Reinforcement Learning (RL), on the other hand, can learn by interacting with the environment autonomously, but sample inefficiency and safety issues on real robots, along with the difficulty of learning from sparse rewards on high-DoF systems, remain hard barriers.

The paper reviewed here is an attempt to combine the strengths of these two paradigms. More precisely, it treats a BC-trained policy as a "black box" and learns a lightweight per-step residual correction on top of it with off-policy RL. The authors call this ResFiT (Residual Fine-Tuning) and demonstrate it not only in simulation but also on a real humanoid robot, in what they report as the world's first real-world RL training on a robot with dexterous hands.

In this review we dig into the technical details of the paper: why the approach works, which design decisions matter, and what practical insights robotics researchers can take away.


1. Research Motivation: Beyond the Limits of BC and RL

1.1 Structural Limitations of Behavior Cloning

BC has made remarkable progress in recent years. Models such as Diffusion Policy, ACT (Action Chunking with Transformers), and π0 use large neural networks with millions of parameters to learn complex visuomotor policies. In particular, action chunking, i.e. predicting several timesteps of actions at once, has become an effective technique for reducing compounding errors in imitation learning.

BC nevertheless has fundamental limitations:

  1. A ceiling set by demonstration quality: no matter how much data is collected, it is hard to surpass the teleoperator's performance
  2. Diminishing returns: as recent studies consistently show, performance gains flatten as demonstration data grows
  3. Distribution shift: errors accumulate in states unseen during training, which is especially fatal for long-horizon tasks
  4. Lack of reactivity: action chunking shortens the effective horizon, but open-loop execution reduces fine-grained responsiveness

1.2 Challenges of Applying Reinforcement Learning in the Real World

RL allows autonomous improvement in theory, but applying it on real robots faces many obstacles:

  1. Sample inefficiency: on-policy methods (e.g., PPO) in particular can require tens of millions of steps
  2. Safety: exploration risks damaging the robot or the environment
  3. Sparse rewards: in high-DoF systems, reaching success through random exploration is nearly impossible
  4. Architecture compatibility: applying RL directly to action-chunking or diffusion-based BC models is difficult

1.3 Limitations of Prior Hybrid Approaches

There have been earlier attempts to combine BC and RL:

  • IBRL: uses a BC policy to propose actions and bootstrap target values, but struggles on complex tasks
  • PA-RL: optimizes actions with a Q-function but does not improve the policy itself
  • Policy Decorator: learns a residual over the entire action chunk, but is limited to simulation and single-arm tasks
  • ResiP: learns the residual with on-policy PPO, whose sample efficiency is too low for real-world use

2. ResFiT: The Core of the Method

2.1 Framework Overview

The core idea of ResFiT is surprisingly simple:

  1. Phase 1 (BC): train a base policy π_base from human demonstrations using action chunking
  2. Phase 2 (Residual RL): freeze the base policy and train a per-step residual correction π_res with off-policy RL

The final action is computed as:

a_t = a_t^base + a_t^res

where a_t^base is the frozen base policy's output (the current step of its action chunk) and a_t^res is the correction predicted by the residual policy.
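
A minimal sketch of what this execution scheme could look like, assuming the base policy replans a chunk of k actions every k steps while the residual is queried at every step; `env`, `base_policy`, and `residual_policy` are hypothetical stand-ins, not the paper's interfaces:

```python
def rollout(env, base_policy, residual_policy, chunk_size=16, max_steps=300):
    """Per-step residual on top of an open-loop action chunk (illustrative only)."""
    obs = env.reset()
    chunk, idx = None, 0
    for _ in range(max_steps):
        if chunk is None or idx >= chunk_size:
            chunk, idx = base_policy(obs), 0       # predict k future actions at once
        a_base = chunk[idx]                        # current step of the action chunk
        a_res = residual_policy(obs, a_base)       # per-step correction
        obs, reward, done, info = env.step(a_base + a_res)
        idx += 1
        if done:
            break
```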

2.2 Base Policy: Action Chunking and BC

Given an observation o_t, the base policy predicts a sequence of k future actions:

π_ψ(a_{t:t+k} | o_t)

Training maximizes the log-likelihood of the demonstration data:

min_ψ -E_{o_t, a_{t:t+k} ∈ D_demos} [log π_ψ(a_{t:t+k} | o_t)]

The benefits of action chunking are well known:
  • it reduces the task's effective horizon
  • it mitigates compounding errors
  • it improves temporal consistency
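
As an illustration, under the common simplification of a deterministic chunked policy with a fixed-variance Gaussian likelihood, the objective above reduces to a mean-squared error over action chunks; all names and sizes below are assumptions:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, k, batch = 10, 4, 16, 32
base_policy = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, act_dim * k))

obs = torch.randn(batch, obs_dim)              # o_t from the demo dataset
action_chunk = torch.randn(batch, k, act_dim)  # a_{t:t+k} from the demo dataset
pred_chunk = base_policy(obs).view(batch, k, act_dim)
# Proportional to the negative log-likelihood under the fixed-variance Gaussian assumption.
bc_loss = ((pred_chunk - action_chunk) ** 2).mean()
```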

2.3 Design of the Residual Policy

The residual policy takes as input:
  • the current observation s_t (or o_t)
  • the base policy's current-step action a_t^base

and outputs a correction a_t^res:

π_res(s_t, a_t^base) → a_t^res

The key advantages of this design (a sketch of such a network follows this list):

  1. Base-policy agnostic: applicable regardless of the base policy's architecture (diffusion, transformer, etc.)
  2. Stability: by bounding the magnitude of the residual, exploration stays safely close to the base policy
  3. Per-step correction: corrections are applied at individual steps rather than over an entire action chunk, improving reactivity
  4. Lightweight model: unlike the base policy with its millions of parameters, the residual policy is a relatively small network
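
A hypothetical residual-policy head in this spirit: a small MLP over the concatenated observation and base action, with a tanh squashing and a scale factor that bounds the correction (the bound value and layer sizes are our assumptions):

```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256, max_residual: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )
        self.max_residual = max_residual  # keeps the correction small relative to the base action

    def forward(self, obs: torch.Tensor, a_base: torch.Tensor) -> torch.Tensor:
        # Residual conditioned on both the observation and the base action.
        return self.max_residual * self.net(torch.cat([obs, a_base], dim=-1))
```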

2.4 Mathematical Foundations of Off-Policy RL

The authors use a DDPG-style actor-critic structure.

Critic (Q-function)

The Q-function is learned from the Bellman equation:

Q*(s, a) = E[r + γ max_{a'} Q*(s', a')]

Extending this to the residual setting:

L_critic(φ) = E_{(s,a,r,s') ~ D} [(Q_φ(s, a^base + a^res) - y)²]

where the target y is:

y = r + γ Q_φ'(s', a'^base + π_θ'(s', a'^base))

Actor (Residual Policy)

The policy is updated through the gradient of the Q-function:

∇_θ J = E_s [∇_{a^res} Q_φ(s, a^base + a^res) · ∇_θ π_θ(s, a^base)]

In this structure the critic evaluates the value of the full "base + residual" action, while the actor adjusts only the residual.


3. Key Design Decisions: The Secret Behind Making Real-World RL Possible

The authors did not simply apply off-policy RL to residual learning. They carefully combined a large number of detailed design decisions so that it works on real robots, and this is the paper's real contribution.

3.1 Update-to-Data (UTD) Ratio

The UTD ratio is the number of model updates performed per collected data point.

The authors use UTD=4 as the default. Their observations:
  • UTD=0.5: learning is noticeably slow
  • UTD=4: captures most of the gains
  • UTD=8+: additional gains are marginal and training can become unstable

This is about finding the balance between sample efficiency and training stability. Knowing that UTD=4 is a good operating point for tasks with horizons of 150-250 steps is a practically useful guideline.
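
Schematically, a UTD ratio of 4 simply means four gradient updates per environment step; in the sketch below `collect_step`, `update_agent`, and the buffer interface are placeholders passed in by the caller:

```python
def train(env, agent, replay, collect_step, update_agent, num_env_steps=200_000, utd_ratio=4):
    """Schematic loop: one environment step, then `utd_ratio` gradient updates."""
    obs = env.reset()
    for _ in range(num_env_steps):
        obs = collect_step(env, agent, replay, obs)   # add one transition to the buffer
        for _ in range(utd_ratio):                    # reuse the buffer several times per new sample
            update_agent(agent, replay.sample())
```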

3.2 N-step Returns

In sparse-reward settings, 1-step TD learning takes far too long to propagate the reward signal into the Q-function.

The authors use n=5 step returns:

G_t^{(n)} = r_t + γr_{t+1} + ... + γ^{n-1}r_{t+n-1} + γ^n Q(s_{t+n}, a_{t+n})

The pattern observed in their experiments:
  • n=1: degraded performance under sparse rewards
  • n=5: best performance
  • n>10: performance drops again due to increased bias

This is a textbook example of the variance-vs-bias tradeoff.
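
A small sketch of how such an n-step target could be computed from a stored reward segment and a bootstrapped Q-value (function and argument names are illustrative):

```python
def n_step_target(rewards, bootstrap_q, gamma=0.99):
    """G_t^(n) = sum_i gamma^i * r_{t+i} + gamma^n * Q(s_{t+n}, a_{t+n})."""
    target = bootstrap_q * gamma ** len(rewards)
    for i, r in enumerate(rewards):
        target += (gamma ** i) * r
    return target

# Example: a 5-step target where a sparse reward arrives only on the last step.
print(n_step_target([0.0, 0.0, 0.0, 0.0, 1.0], bootstrap_q=0.0))  # ≈ 0.9606
```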

3.3 Layer Normalization

One of the chronic problems of off-policy RL is catastrophic overestimation by the Q-function. When Q-values for out-of-distribution actions are estimated to be abnormally high, the policy exploits them and training collapses.

Inspired by RLPD, the authors apply layer normalization to the critic MLP. This implicitly bounds the range of Q-values and mitigates overestimation.

In the ablations, training fails without layer norm, especially on complex tasks such as Coffee.
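
A critic MLP with layer normalization after each hidden layer, in the spirit described above; the layer sizes are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

def make_critic(obs_dim: int, act_dim: int, hidden: int = 256) -> nn.Module:
    # LayerNorm after each hidden linear layer regularizes the Q-value scale.
    return nn.Sequential(
        nn.Linear(obs_dim + act_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

q = make_critic(10, 4)
print(q(torch.randn(8, 14)).shape)  # torch.Size([8, 1])
```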

3.4 Randomized Ensembled Double Q-Learning (REDQ)

REDQ maintains an ensemble of Q-functions:
  • when computing TD targets: use a random subset of the Q-functions
  • when updating the policy: use the average over the full ensemble

This also reduces overestimation bias.
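
A sketch of the REDQ-style TD target, taking the minimum over a randomly sampled subset of the critic ensemble; the ensemble and subset sizes here are assumptions:

```python
import random
import torch

def redq_target(q_values_next, reward, done, gamma=0.99, subset_size=2):
    """q_values_next: list of per-critic Q(s', a') tensors, each of shape (batch,)."""
    subset = random.sample(q_values_next, subset_size)            # random critics for the target
    min_q = torch.min(torch.stack(subset, dim=0), dim=0).values   # pessimistic (min) estimate
    return reward + gamma * (1.0 - done) * min_q

ensemble = [torch.randn(32) for _ in range(10)]
target = redq_target(ensemble, reward=torch.zeros(32), done=torch.zeros(32))
```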

3.5 Symmetric Sampling

Introduced in RLPD, this technique samples 50% of each training batch from the offline demonstration data and the remaining 50% from the online buffer.

Why this matters:
  • demonstration data provides high-quality state-action pairs
  • it helps the critic learn stable values early in training
  • it covers parts of the state space that online data alone would struggle to reach

3.6 Delayed Actor Updates

Originating from TD3, this technique updates the actor less frequently than the critic (e.g., one actor update per 2-8 critic updates).

Updating the policy with a still-inaccurate Q-function destabilizes training, so the critic is given time to converge first.

3.7 Target Policy Smoothing

Noise is added to the target policy's action when computing the Q target:

a' = π_θ'(s') + clip(ε, -c, c), ε ~ N(0, σ)

This prevents the Q-function from forming sharp peaks around particular actions and encourages smoother value estimates.

3.8 Visual Encoder: Shallow ViT + DrQ Augmentation

To process image observations:
  • Shallow ViT encoder: a relatively shallow Vision Transformer is used instead of a deep CNN
  • DrQ-style random shift augmentation: random shifts are applied to images to prevent overfitting

In vision-based RL, data augmentation is critical for sample efficiency.
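
A minimal DrQ-style random-shift sketch: pad the image batch and re-crop each element at a random offset (the padding size is an assumption, and real implementations vectorize this):

```python
import torch
import torch.nn.functional as F

def random_shift(images: torch.Tensor, pad: int = 4) -> torch.Tensor:
    """images: (B, C, H, W). Returns the same shape, randomly shifted per batch element."""
    b, c, h, w = images.shape
    padded = F.pad(images, (pad, pad, pad, pad), mode="replicate")
    out = torch.empty_like(images)
    for i in range(b):
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out

aug = random_shift(torch.rand(2, 3, 84, 84))
```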


4. Experimental Environments and Setup

4.1 Simulation Environments

The authors keep realistic constraints even in simulation:
  • a single simulation environment (no parallelization)
  • image + robot joint-state observations (no privileged object-state information)
  • sparse binary rewards

Robomimic tasks:
  • Can: pick up a can and place it in a bin (7-DoF Franka, single arm)
  • Square: assemble a square object onto a precise location (high precision required)

DexMimicGen tasks (bimanual + dexterous hands):
  • BoxCleanup: pick up a box lid with both arms and place it precisely on the box (dual Franka)
  • CanSort: hand a cylinder from one hand to the other (GR1 humanoid)
  • Coffee: pick up a coffee pod, insert it into the coffee machine, and close the lid (GR1, longest horizon)

The bimanual tasks have a 24-dimensional action space (6-DoF EE pose per arm + 6-DoF joints per hand).

4.2 Real-World Environment

Robot platform: Dexmate Vega wheeled humanoid
  • 7-DoF dual arms
  • 6-DoF OyMotion dexterous hands (both hands)
  • Zed camera (head-mounted)
  • 29-dimensional action space in total (absolute joint-position control)

Tasks:
  • WoollyBallPnP: pick up a wool ball from a random position on the table and drop it into a tote
  • PackageHandover: grasp a deformable package with the right hand, hand it to the left hand, and place it in the left tote

Safety infrastructure:
  • safety limits using wrist force-torque sensors
  • self-collision avoidance checks

Asynchronous actor-learner: data collection and model training run in separate processes to remove the bottleneck caused by the high UTD ratio.

4.3 Baselines

  1. Tuned RLPD: applies all of ResFiT's off-policy design decisions, but learns a single-step policy from scratch without a base policy
  2. IBRL: uses a BC policy for action proposals and target bootstrapping
  3. Filtered BC: keeps fine-tuning the base policy on successful online trajectories (a 0/1 version of reward-weighted regression)
  4. On-policy Residual (PPO): on-policy residual RL in the style of ResiP

5. In-Depth Analysis of the Experimental Results

5.1 Sample Efficiency: On-policy vs Off-policy

Comparison on the BoxCleanup task:
  • PPO (on-policy): converges at around 40M steps
  • ResFiT (off-policy): converges at around 200k steps

That is a 200x improvement in sample efficiency. Collecting 40M steps of data on a real robot is practically impossible. Assuming a 20 Hz control loop:
  • 40M steps ≈ 556 hours (about 23 days of continuous operation)
  • 200k steps ≈ 2.8 hours

This difference is the dividing line between "real-world RL is feasible" and "it is not".

5.2 ํƒœ์Šคํฌ๋ณ„ ์„ฑ๋Šฅ ๋ถ„์„

Can (๋‹จ์ˆœ ํƒœ์Šคํฌ): - ๋ชจ๋“  ๋ฐฉ๋ฒ•์ด 150k ์Šคํ… ๋‚ด ๋†’์€ ์„ฑ๊ณต๋ฅ  ๋‹ฌ์„ฑ - ResFiT์ด ๊ฐ€์žฅ ๋น ๋ฅด๊ฒŒ ์ˆ˜๋ ด (75k ์Šคํ…)

Square (์ •๋ฐ€๋„ ์š”๊ตฌ): - ResFiT๊ณผ Tuned RLPD๋งŒ 90% ์ด์ƒ ๋‹ฌ์„ฑ - IBRL, Filtered BC๋Š” ์ •์ฒด

BoxCleanup (bimanual ํ˜‘์กฐ): - Baseline๋“ค์€ 0%๋กœ ๋ถ•๊ดดํ•˜๊ฑฐ๋‚˜ ๋А๋ฆฐ ์ˆ˜๋ ด - ResFiT์€ ์•ˆ์ •์ ์œผ๋กœ 95%+ ๋„๋‹ฌ

CanSort (hand-to-hand transfer): - ์œ ์‚ฌํ•œ ํŒจํ„ด, ResFiT ์šฐ์œ„

Coffee (๊ฐ€์žฅ ๊ธด horizon + ๋†’์€ ์ •๋ฐ€๋„): - Action chunking ์—†๋Š” ๋ชจ๋“  ๋ฐฉ๋ฒ• ์‹คํŒจ - ResFiT๋งŒ ์•ˆ์ •์  ํ•™์Šต - Tuned RLPD๋„ ์‹คํŒจ (action chunking์˜ ์ค‘์š”์„ฑ ์ž…์ฆ)

5.3 Limitations of Filtered BC

Interestingly, Filtered BC, which looks like a reasonable approach (keep doing BC on successful trajectories only), yields almost no improvement in practice.

The authors' analysis: when the dominant failure mode is precision, it is hard to improve without explicit value maximization. BC merely imitates demonstrations without judging which actions are better.

5.4 Real-World Results

WoollyBallPnP:
  • BC (ACT): 14% success rate
  • after ResFiT: 64% success rate (+50 percentage points)
  • data used: 134 episodes (about 71 minutes of robot execution data)

PackageHandover:
  • BC (ACT): 23% success rate
  • after ResFiT: 64% success rate (+41 percentage points)
  • data used: 343 episodes (about 123 minutes)

The authors claim this is the first RL demonstration trained entirely in the real world on a bimanual dexterous humanoid.

5.5 Summary of the Ablation Studies

Effect of removing each design decision:
  • Layer norm removed: complete failure on Coffee, degraded performance on the other tasks
  • Demos removed during RL: little effect on Coffee, slight degradation on the other tasks
  • n-step returns with n=1: severe degradation on sparse-reward tasks
  • UTD < 1: markedly slower convergence

6. Technical Insights and Discussion

6.1 The Dual Role of the Base Policy

Important insights the authors highlight:

1. Implicit safety constraint: policies trained without a base policy learn faster but behave more aggressively, which is unsuitable for real-world deployment. The frozen base policy keeps exploration within a safe region.

2. Strong exploration prior: learning from sparse rewards in a high-DoF space is nearly impossible with random exploration. The base policy provides "roughly correct" behavior, and the residual fine-tunes it.

6.2 Why Per-step Residuals?

Methods like Policy Decorator learn a residual over the entire action chunk. ResFiT instead chose per-step corrections.

The reasons:
  • Reactivity: the policy can respond to observations at every step
  • Smaller action space: only a single-step action is predicted rather than a whole chunk
  • Independence from chunk size: the residual policy stays the same even if the base policy's chunk size changes

6.3 The Critical Importance of Off-policy Learning

The 200x efficiency gap over on-policy methods (PPO) means more than simply "faster".

For real-world robot RL:
  • robot wear and fatigue
  • the time cost of human supervisors
  • the difficulty of environment resets

all of these make sample efficiency the critical factor.

6.4 The Two Sides of Action Chunking

On the Coffee task, every method without action chunking failed. This shows that the benefits action chunking brings to BC carry over to RL:
  • temporal consistency on long-horizon tasks
  • implicit mitigation of compounding errors

At the same time, action chunking makes RL optimization harder (the action space blows up). ResFiT keeps the benefit in the base policy while preserving optimization tractability through per-step residuals.

6.5 Limitations

Limitations stated by the authors:

1. Dependence on the base policy: the residual can only improve within the range of strategies the base policy already encodes; fundamentally different skills or strategies cannot be discovered.

2. Need for human supervision: the current system requires human supervision for resets and reward labeling. Fully autonomous skill improvement would require automatic resets and success-detection mechanisms.

3. Frozen-base constraint: freezing the base policy provides stability, but the base policy itself cannot be improved.


7. Future Research Directions

Directions suggested by the authors:

7.1 Relaxing the Frozen-Base Constraint

Can the base policy be improved together with the residual while keeping training stable? This boils down to asking how far the constraint can be relaxed before learning diverges.

7.2 Knowledge Distillation

Distilling the improved behavior of the combined policy (base + residual) back into the base policy leaves room for the residual to improve further. Repeating this could enable iterative improvement.

7.3 Multi-task Generalization

Distilling per-task residual improvements into an increasingly capable generalist. Since ResFiT is base-model agnostic, it could also be applied to fine-tuning large multi-task behavior models.

7.4 Automated Infrastructure

With automatic resets, success detection, and safety rails in place, autonomous skill improvement without human supervision becomes possible.


8. Practical Implications: Takeaways for Robotics Researchers

8.1 The Value of the "Recipe"

The paper's real contribution is not any single technique but the careful combination of several:
  • UTD=4
  • n=5 step returns
  • layer normalization
  • symmetric sampling
  • delayed actor updates
  • REDQ ensembles
  • DrQ augmentation

Each is already known; the key is combining them correctly for residual RL.

8.2 The Practicality of the BC + RL Hybrid

Quickly obtaining a "roughly working" policy with BC and then fine-tuning it with RL is a very practical recipe:
  • initial BC: 50-80% success rate from 50-100 demonstrations
  • ResFiT fine-tuning: 90%+ reachable with 1-2 hours of real-world interaction

8.3 The Promise of Sparse Rewards

Being able to learn high-DoF bimanual tasks from sparse binary rewards alone, without dense reward shaping, matters a great deal in practice, because reward engineering is often a painful trial-and-error process.

8.4 RL Compatibility of Action-Chunking BC

If you already use action-chunking BC models such as Diffusion Policy or ACT, ResFiT-style per-step residuals can be a practical path to RL fine-tuning.


9. Comparison with Related Work

9.1 vs ResiP (Ankile et al., 2024)

Earlier work by the same first author. ResiP:
  • uses on-policy PPO → sample-inefficient
  • is validated in simulation only
  • relies on a sim-to-real pipeline

ResFiT:
  • is based on off-policy TD3 → about 200x more efficient
  • is validated by learning directly in the real world

9.2 vs Policy Decorator

  • chunk-level residual → per-step residual
  • simulation only → real-world validation
  • single arm → bimanual + dexterous

9.3 vs IBRL

IBRL uses the BC policy for action proposals and value bootstrapping, but:
  • it does not directly fine-tune the BC policy
  • its performance is limited on complex tasks

9.4 vs SERL

SERL is pioneering work on real-world RL:
  • parallel-jaw gripper + single arm
  • relatively simple tasks

ResFiT extends this to higher degrees of freedom (29-DoF) and dexterous manipulation.


10. Conclusion

ResFiT offers a practical and elegant way to combine the strengths of BC and RL. By treating the BC policy as a black box and learning a lightweight per-step residual with off-policy RL, it:

  1. applies regardless of the base policy's architecture
  2. learns high-DoF tasks from sparse binary rewards alone
  3. achieves meaningful performance gains with 1-2 hours of real-world data
  4. demonstrates, for the first time, real-world RL on a humanoid with dexterous hands

The dependence on the base policy is a real limitation, but it is a tradeoff against stability. If the frozen-base constraint can be relaxed and iterative improvement enabled through knowledge distillation, a new paradigm of robot learning could open up.

For robot manipulation researchers, the "recipe" this paper lays out is a practical guideline that can be applied right away. Hybrid approaches that start from BC and improve with RL are likely to become increasingly standard.

References

  • Ball et al. (2023). RLPD: Efficient Online RL with Offline Data
  • Zhao et al. (2023). ACT: Learning Fine-grained Bimanual Manipulation
  • Chi et al. (2023). Diffusion Policy
  • Luo et al. (2024). SERL: Sample-Efficient Robotic RL
  • Ankile et al. (2024). ResiP: Residual for Precise Manipulation
  • Yuan et al. (2024). Policy Decorator

โ›๏ธ Dig Review

โ›๏ธ Dig โ€” Go deep, uncover the layers. Dive into technical detail.

Residual Off-Policy RL์„ ํ†ตํ•œ Behavior Cloning ์ •์ฑ… ์ •๊ตํ™” (ResFiT)

์ตœ๊ทผ Behavior Cloning(BC) ๊ธฐ๋ฒ•์€ ๋ณต์žกํ•œ ์‹œ๊ฐ๊ธฐ๋ฐ˜ ์กฐ์ž‘ ์ •์ฑ…์„ ์‹คํ˜„ํ–ˆ์ง€๋งŒ, ์ฃผ๋กœ ๋ฐ๋ชจ ๋ฐ์ดํ„ฐ์˜ ํ’ˆ์งˆ๊ณผ ์–‘์— ์˜์กดํ•ด ์„ฑ๋Šฅ์ด ํ•œ๊ณ„์— ๋ด‰์ฐฉํ•œ๋‹ค. ์ด๋ฅผ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•ด Residual Off-Policy RL(ResFiT) ๊ธฐ๋ฒ•์€ BC๋กœ ํ•™์Šตํ•œ ์ •์ฑ…์„ โ€œ๊ธฐ์ €(base) ์ •์ฑ…โ€์œผ๋กœ ์ทจ๊ธ‰ํ•˜๊ณ , ์—ฌ๊ธฐ์— ์ž‘์€ ์ž”์ฐจ(residual) ์ˆ˜์ •ํ•ญ์„ ํ•™์Šตํ•˜๋Š” ํ˜•ํƒœ๋กœ RL์„ ์ ์šฉํ•œ๋‹ค. ๊ทธ๋ฆผ 1์€ ResFiT์˜ ๋‘ ๋‹จ๊ณ„ ์ ‘๊ทผ๋ฒ•์„ ๋ณด์ธ๋‹ค. ์ฒซ ๋‹จ๊ณ„์—์„œ๋Š” ์‹œ์—ฐ ๋ฐ์ดํ„ฐ๋กœ BC๋ฅผ ์ˆ˜ํ–‰ํ•ด ๊ธฐ์ € ์ •์ฑ…์„ ์–ป๊ณ  ์ด๋ฅผ ๊ณ ์ •(frozen)์‹œํ‚จ๋‹ค. ๋‘ ๋ฒˆ์งธ ๋‹จ๊ณ„์—์„œ๋Š” ์ด ๊ณ ์ •๋œ ๊ธฐ์ € ์ •์ฑ…์˜ ํ–‰๋™์— ๋ง๋ถ™์ผ ์ž”์ฐจ ์ˆ˜์ •(action residual)์„ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•ด ์ƒ˜ํ”Œ ํšจ์œจ์ ์ธ Off-Policy RL์„ ์ˆ˜ํ–‰ํ•œ๋‹ค. ์ฆ‰, ์—์ด์ „ํŠธ๋Š” ํ˜„์žฌ ๊ด€์ธก(observation)๊ณผ ๊ธฐ์ € ์ •์ฑ…์ด ์ถœ๋ ฅํ•œ ํ–‰๋™ a_{base}๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์•„ ์ž‘์€ ์ˆ˜์ • a_{res}์„ ์ƒ์„ฑํ•˜๋ฉฐ, ์‹ค์ œ ํ™˜๊ฒฝ์— ์‹คํ–‰๋˜๋Š” ํ–‰๋™์€ a = a_{base} + a_{res} ๊ฐ€ ๋œ๋‹ค. ์ด๋กœ์จ ResFiT๋Š” ๊ธฐ์ € ์ •์ฑ… ์•„ํ‚คํ…์ฒ˜์— ๊ตฌ์• ๋ฐ›์ง€ ์•Š๊ณ , ์•ก์…˜ ์ฒญํ‚น(action chunking)์ด๋‚˜ ํ™•์‚ฐ ์ •์ฑ…(difusion) ๊ฐ™์€ ๋ณต์žกํ•œ ๊ตฌ์กฐ๋ฅผ ๊ทธ๋Œ€๋กœ ๋‘” ์ฑ„ ํŽธ๋ฆฌํ•˜๊ฒŒ RL์„ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ธฐ์ € ์ •์ฑ…์€ ์ „๋ฌธ๊ฐ€ ์‹œ์—ฐ์—์„œ ์–ป์€ ํ•ฉ๋ฆฌ์ ์ธ ํ–‰๋™ ๋ถ„ํฌ๋ฅผ ์ œ๊ณตํ•˜๋ฏ€๋กœ ์•ˆ์ •์  ํƒํ—˜๊ณผ ์ •์ฑ… ์ดˆ๊ธฐํ™”์— ์œ ๋ฆฌํ•˜๋ฉฐ, ์ž”์ฐจ ํ•™์Šต์€ ์˜ค์ง ์ˆ˜์ • ๋ณด์ƒ๊ฐ’(Residual Reward)์„ ์ตœ๋Œ€ํ™”ํ•จ์œผ๋กœ์จ ์ •๋ฐ€๋„๋ฅผ ๋†’์ธ๋‹ค.

๊ทธ๋ฆผ 1: ResFiT(Residual Off-Policy RL) ๊ฐœ์š”. (์ขŒ) ๊ธฐ์ € BC ์ •์ฑ…์œผ๋กœ๋ถ€ํ„ฐ ์–ป์€ ํ–‰๋™ a_{base}์— ์˜คํ”„ํด๋ฆฌ์‹œ RL๋กœ ํ•™์Šต๋œ ์ž”์ฐจ ์ •์ฑ…์ด a_{res}๋ฅผ ๋”ํ•˜์—ฌ ์ตœ์ข… ํ–‰๋™ a๋ฅผ ๋งŒ๋“ ๋‹ค. (์šฐ) ๋‘ ๋‹จ๊ณ„ ํ•™์Šต: ์‹œ์—ฐ ๋ฐ์ดํ„ฐ๋กœ BC๋ฅผ ์ˆ˜ํ–‰ํ•œ ๋’ค ์ •์ฑ…์„ ๊ณ ์ •ํ•˜๊ณ , ์ดํ›„ ์‹œ์—ฐ ๋ฒ„ํผ์™€ ์˜จ๋ผ์ธ ์ƒํ˜ธ์ž‘์šฉ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ‘ํ•ฉํ•ด ์˜คํ”„ํด๋ฆฌ์‹œ RL๋กœ ์ž”์ฐจ ์ •์ฑ…์„ ํ•™์Šตํ•œ๋‹ค.

Mathematical Formulation and Core Principles

ResFiT operates in an MDP (s,a,r,\gamma); the base policy \pi_{\psi}(a_{t:t+k}|s_t) is trained with action-chunking BC on demonstration data. This \pi_{\psi} is then frozen, and a new residual policy \pi_{\theta}(s_t, a_{base_t}) is trained. The full action is defined as

a_t = a_{base_t} + a_{res_t},\quad a_{base_t} = \pi_{base}(s_t)

and the critic Q_\phi approximates the value function of this full action. ResFiT adopts a DDPG-style actor-critic structure. First, the critic is trained to satisfy the Bellman equation for the standard Q-function,

Q^\pi(s,a) = r(s,a) + \gamma \mathbb{E}_{s'}[Q^\pi(s', \pi(s'))]

and the MSBE (mean-squared Bellman error) is minimized by gradient descent, adapted to the residual setting. For transitions (s_t,a_t,r_t,s_{t+1},d_t) drawn from a buffer \mathcal{D}, the critic loss is:

\mathcal{L}(\phi) = \mathbb{E}_{\mathcal{D}}\Big[ \big(Q_\phi(s_t,a_t) - (r_t + \gamma (1-d_t) Q_\phi(s_{t+1}, a_{base_{t+1}} + \pi_\theta(s_{t+1},a_{base_{t+1}})))\big)^2 \Big]

where a_{base_{t+1}}=\pi_{base}(s_{t+1}). The better the critic approximates the optimal values, the more useful it becomes for updating the actor (the residual policy). The actor adjusts its parameters to maximize the Q-function and is trained by gradient ascent through the differentiable Q. Its loss is:

\mathcal{L}(\theta) = -\mathbb{E}_{(s_t,a_{base_t})\sim\mathcal{D}}\Big[ Q_\phi\big(s_t, a_{base_t} + \pi_{\theta}(s_t,a_{base_t})\big)\Big].

In short, ResFiT follows the structure of conventional off-policy RL but redefines the full action as the sum of base and residual, so that only the residual behavior is learned.

For sample efficiency and stability, ResFiT adopts several techniques. The update-to-data (UTD) ratio is set above 1 so that multiple model updates are performed per data point, and n-step returns are used to propagate information faster in sparse-reward settings. Standard TD3-style techniques are applied (delayed actor updates, target-network updates with Polyak averaging, target policy smoothing), and Q-ensembles with layer normalization mitigate overestimation. To prevent overfitting on visual inputs, a ViT vision encoder and DrQ-style data augmentation are used, and online-buffer data and demonstration data are sampled in parallel to stabilize training.

Differences from Conventional BC Fine-tuning Approaches

There have long been attempts to fine-tune BC policies with RL to improve their performance. Direct RL fine-tuning, however, has become very difficult with today's large networks, action chunking, and diffusion architectures. For example, IBRL (Imitation Bootstrapped RL) first trains an imitation policy and uses it to initialize RL exploration and value estimation, but it still modifies the complex model itself. PA-RL (Policy-Agnostic RL) learns only a Q-function instead of the complex policy, but it does not directly update the policy, which limits it. Diffusion-based approaches such as DSRL optimize actions in the latent noise space of a frozen diffusion policy, but they depend on a specific policy form and are hard to generalize.

ResFiT, in contrast, combines the BC policy and RL through residual learning. Earlier residual RL studies were useful in simulation or simple settings, whereas ResFiT applies the idea to high-dimensional real-world robots. Among prior residual RL work, ResiP learned single-step residuals with an on-policy method such as PPO, which was sample-inefficient. Policy Decorator learned residuals over entire action chunks, so corrections operate at the chunk level rather than per step and are coarser, and EXPO applied residual RL to policies without chunking but stayed within single-arm tasks and simulation. ResFiT instead learns per-step residuals and uses off-policy RL, greatly improving sample efficiency. Because ResFiT treats the base policy as a black box, it can be applied regardless of the policy architecture: for example, it can easily add closed-loop, per-step precision corrections to large action-chunking policies, and by limiting the residual magnitude it guarantees safe exploration even early in training. In summary, by starting from a frozen BC policy and learning only a small off-policy correction, ResFiT greatly improves training stability and flexibility compared with conventional BC fine-tuning.

Interpreting the Results: Performance in Simulation and on Real Robots

ResFiT was evaluated on a variety of manipulation tasks in the Robosuite (MuJoCo-based) simulator. The tasks include single-arm tasks (Franka: Can, Square) and bimanual tasks (BoxCleanup, CanSort, Coffee); the former have a 7-dimensional action space and the latter a 24-dimensional one, raising the complexity. Figure 4 shows the main results. ResFiT converges quickly to high success rates on every task; on the simple Can task, for instance, it reaches near-perfect performance in only 70k steps. On the harder bimanual tasks (BoxCleanup, CanSort, Coffee) in particular, ResFiT quickly reaches stable performance (above 90%) while other algorithms collapse to 0% or learn slowly. Simple improvement schemes such as Filtered BC (which keeps reusing the base policy) barely improve on the initial policy, whereas ResFiT exploits value-function optimization to achieve the precision gains that matter.

Figure 4: Performance comparison of ResFiT across simulation tasks. ResFiT quickly converges to high success rates on all tasks and achieves more stable, stronger performance than existing off-policy RL or residual-RL alternatives.

ResFiT was also applied to a real robot (a 29-DoF humanoid) on two tasks (WoollyBallPnP: single arm, PackageHandover: bimanual), both in settings where an existing BC policy such as ACT was already available. On WoollyBallPnP, for example, the base policy succeeded only 14% of the time at grasping a very small ball, but after residual fine-tuning with off-policy RL (134 episodes) the success rate jumped to 64%. PackageHandover went from an initial 23% to 64% after 343 episodes, which the authors regard as the first case of RL performed entirely in the real world on a humanoid with two arms and five-fingered hands. The experiments show clear improvements where the base policy was weak (e.g., precise grasping), but for safe operation the runs still required a limited amount of human labeling (start positions, reward signals, etc.) and manual resets.

In summary, ResFiT achieves state-of-the-art performance in the MuJoCo-based environments and demonstrates meaningful, if still constrained, autonomous improvement on a real robot. In particular, as shown in Tables 5-6, the off-policy approach converges tens of times faster than on-policy PPO in terms of sample efficiency, and the ablation studies confirm the effect of design choices such as the UTD ratio and n-step returns.

Robotics Applications and Considerations

ResFiT offers a practical path for combining offline BC with online RL on high-dimensional manipulation tasks. The base BC policy plays more than an initialization role: in the experiments it implicitly defined a safe range of behavior and served as a useful exploration prior under sparse rewards. Several constraints remain, however. First, the policy only improves in a neighborhood of the base policy, which limits the discovery of entirely new strategies. Second, practical deployment requires automatic reset mechanisms, real-time success detection, and safety systems. As the authors note, fully removing human oversight is still difficult in practice (resets, reward labeling, and so on).

Overall, ResFiT is a powerful technique for efficiently improving a good BC initialization with RL on high-DoF robots. Future work could distill the gains obtained by the residual policy back into the base policy to build stronger general policies, or relax the frozen-base constraint to enable more fundamental skill acquisition. The idea looks especially promising for multi-task learning, and this work can be seen as a step forward for the feasibility of RL on high-dimensional real robots.

Copyright 2026, JungYeon Lee