📃 RLPD Review

rl
offline-data
sample-efficiency
Efficient Online Reinforcement Learning with Offline Data
Published: March 20, 2026

  • Paper Link
  • Code Link
  1. 💡 This paper proposes RLPD, an efficient online reinforcement learning method that leverages offline data with only minimal modifications to an existing off-policy RL method.
  2. ⚙️ RLPD achieves this through symmetric sampling, Layer Normalization to mitigate value over-extrapolation, and large ensembles for sample-efficient learning.
  3. 🚀 Extensive experiments across 30 diverse tasks show that RLPD delivers reliable state-of-the-art performance, up to 2.5x better than prior methods, with no additional computational overhead.

🔍 Ping Review

🔍 Ping: a light tap on the surface. Get the gist in seconds.

This work proposes leveraging offline data to address two central problems in online reinforcement learning (RL): sample efficiency and exploration. Whereas prior methods required extensive modifications or additional complexity to use offline data effectively, this paper asks whether existing off-policy RL methods can incorporate offline data into online learning as-is. The authors show that a small set of important, necessary changes suffices for reliable performance, and they name the resulting method RLPD (Reinforcement Learning with Prior Data).

Naively running an existing off-policy RL algorithm together with offline data can produce unsatisfying results. For example, in Figure 1, 'SAC + Offline Data' underperforms 'IQL + Finetuning'. RLPD addresses this with a few key design choices.

  1. Design Choice 1: A Simple and Efficient Strategy to Incorporate Offline Data (Symmetric Sampling). RLPD incorporates previously collected offline data through a simple 'symmetric sampling' scheme: in each minibatch, 50% of the data is sampled from the current replay buffer and the remaining 50% from the offline data buffer. This strategy is effective across diverse domains without hyperparameter tuning. However, naively applying it to a standard off-policy method such as SAC is not enough for strong performance.

  2. Design Choice 2: Layer Normalization Mitigates Catastrophic Overestimation. When a learned Q-function is queried on out-of-distribution (OOD) actions, standard off-policy RL algorithms tend to overestimate the true values due to function approximation, leading to training instability and potential divergence. Figure 2 shows Q-values diverging when symmetric sampling is applied on its own. To mitigate this, RLPD applies Layer Normalization (LayerNorm) to the critic network. LayerNorm effectively limits the network's extrapolation while never explicitly constraining the policy to stay on the offline data, so it does not hinder exploration of new regions. In particular, LayerNorm bounds the Q-values by the norm of the weight layer: for a Q-function Q_{\theta,w}(s, a) with parameters \theta, w and intermediate representation \psi_\theta(s, a), we have \Vert Q_{\theta,w}(s, a)\Vert = \Vert w^T \text{relu}(\psi_\theta(s, a))\Vert \le \Vert w\Vert \Vert \text{relu}(\psi_\theta(s, a))\Vert \le \Vert w\Vert \Vert \psi_\theta(s, a)\Vert \le \Vert w\Vert, where the last step uses the fact that LayerNorm keeps \Vert \psi_\theta(s, a)\Vert \le 1. This property guarantees that Q-values for OOD actions cannot grow far beyond what the data supports, greatly reducing the effect of erroneous action extrapolation. Figures 2 and 7 show that LayerNorm mitigates critic divergence and substantially improves performance.

  3. Design Choice 3: Sample Efficient RL. To make effective use of the offline data, Bellman backups must be performed as sample-efficiently as possible. RLPD therefore increases the update-to-data (UTD) ratio. Since a high UTD ratio can induce statistical overfitting, the critic is regularized with random ensemble distillation, which provides stronger regularization than L2 normalization or Dropout (Figure 9). In pixel-based environments, random shift augmentations are used as well.

  4. Per-Environment Design Choices. The paper highlights how sensitive deep RL algorithms are to implementation details and points out that certain design choices should vary with the environment.

    • Clipped Double Q-Learning (CDQ): proposed to mitigate value overestimation in Q-learning, but it can be too conservative in some settings (e.g., sparse-reward tasks). RLPD suggests that using 1 Q-function instead of 2 can perform better there.
    • Maximum Entropy RL: useful for encouraging exploration, but whether to include the entropy term, and its weight \alpha, have environment-dependent optima.
    • Architecture: the number of layers in the actor and critic networks (2 or 3) also affects performance. For practitioners, RLPD proposes a workflow that tests these per-environment design choices in order.

RLPD Algorithm Overview (Algorithm 1)

RLPD builds on SAC and integrates the key components described above.

  • Choose LayerNorm, a large ensemble size (E), the number of gradient steps (G), and the network architecture.
  • Initialize critic parameters (\theta_i) and actor parameters (\phi).
  • Build each minibatch via symmetric sampling (Line 12: N/2 samples from the replay buffer R; Line 13: N/2 samples from the offline data buffer D).
  • For the critic update, sample a subset of Z ensemble critics to compute the target Q-value (Lines 15-16: y = r + \gamma \min_{i \in Z} Q_{\theta'_i}(s', \tilde{a}'), \tilde{a}' \sim \pi_\phi(\cdot|s')).
  • Optionally add an entropy term (Line 17: y = y + \gamma\alpha \log \pi_\phi(\tilde{a}'|s')).
  • Update the critic and actor networks.

Experiments

RLPD is evaluated on more than 30 diverse tasks spanning Sparse Adroit, D4RL AntMaze, D4RL Locomotion, and V-D4RL. Compared against prior state-of-the-art methods (Prior SoTA) and SACfD (which initializes the replay buffer with offline data), RLPD matches or exceeds the prior best on every benchmark, with up to a 2.5x improvement on hard sparse-reward tasks (Figures 4, 5). RLPD achieves this without any pre-training, showing rapid online improvement.

Ablation Study

  • Importance of LayerNorm: Figure 7 shows LayerNorm is critical for strong performance in the Adroit domain; when the data is limited or narrowly distributed (Expert Adroit Sparse tasks), performance collapses without it.
  • Workflow validation: Figure 8 shows that the proposed per-environment design choices (whether to use CDQ, whether to use the entropy term, number of network layers) drive strong performance and that adjusting them appropriately matters.
  • Critic regularization: Figure 9 shows random ensemble distillation generally outperforms weight decay and Dropout, especially on sparse-reward tasks.
  • Sampling proportion sensitivity: Figure 12 shows the 50% symmetric sampling ratio performs best across diverse scenarios and that RLPD is not very sensitive to this ratio. Initializing the buffer with offline data (Figure 11) yields good initial performance but limits asymptotic improvement.

In conclusion, this work shows that using existing off-policy RL algorithms with offline data for online learning is highly effective. The distinctive combination of symmetric sampling, LayerNorm-based regularization of Q-value extrapolation, and sample-efficient learning (large ensembles) is central to RLPD's success. The approach has negligible impact on computational efficiency and integrates easily into existing methods, offering practical guidance for practitioners.

🔔 Ring Review

🔔 Ring: an idea that echoes. Grasp the core and its value.

ICML 2023 (Short Presentation)

Introduction: Why Is This Problem Hard?

Imagine teaching a robot arm to pick up an object. We already have hundreds of human demonstrations. How inefficient is it if the reinforcement learning (RL) agent, despite seeing this data, starts out flailing the arm completely at random? Having data that already shows "where to go" and failing to use it is like wandering lost with a map in your pocket.

RL's sample efficiency problem is especially painful for robotics practitioners. You cannot run a real robot hundreds of thousands of times: hardware wears out, safety incidents happen, and above all there is no time. Researchers have therefore explored two main directions.

  1. Offline RL: learn a policy from the collected data alone, with no environment interaction at all. Representative methods include IQL (Implicit Q-Learning) and CQL (Conservative Q-Learning).
  2. Online fine-tuning: first initialize a policy with offline RL, then improve it through online interaction.

Both approaches share a dilemma. Offline RL is overly conservative about out-of-distribution actions, which suppresses exploration after the switch to online learning. Relax that conservatism, and the Q-values explode.

The paper's core question is simple yet provocative:

"Without offline pre-training or explicit constraints, can we just feed offline data into an existing off-policy algorithm?"

The authors' answer is: "Yes, given a few key design choices." The paper systematically identifies and justifies those choices.


Method: RLPD's Three Key Design Choices

The proposed algorithm is named RLPD (Reinforcement Learning with Prior Data). Its base algorithm is SAC (Soft Actor-Critic), to which three key modifications are made. Let's dissect them one by one.

Design Choice 1: Symmetric Sampling

This is the simplest and yet most powerful of the ideas.

How did prior approaches handle offline data? Broadly in two ways.

  • Seeded buffer: pre-fill the replay buffer with the offline data. As online experience accumulates, the proportion of offline data shrinks.
  • Pre-training: first train a policy on the offline data, then switch to online learning.

RLPD is different. In every minibatch, exactly 50% is sampled from the online replay buffer and 50% from the offline data buffer, and this ratio is kept fixed throughout training. This is called symmetric sampling.

Why is this simple strategy effective? Think about it intuitively:

  • The problem with buffer seeding: as online experience accumulates, the offline data is diluted in the buffer, so it is barely used late in training. The scarcer the demonstrations, the worse this gets.
  • The advantage of symmetric sampling: the offline data is used at a constant rate from start to finish, so it continuously boosts the reward density the learner sees.

The experiments confirm this (paper Figure 10). In the Adroit Pen environment, symmetric sampling keeps the in-batch reward density consistently high, and in the Door environment it greatly improves stability. Remarkably, the 50% ratio is more robust across diverse domains than 25%, 75%, or 100% (paper Figure 12). It is close to a "free lunch" that works well without hyperparameter tuning.
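For concreteness, here is a minimal numpy sketch of symmetric sampling; the list-backed buffers and the function name are illustrative, not the released implementation.

import numpy as np

def symmetric_sample(online_buffer, offline_buffer, batch_size, rng):
    # Half of every minibatch comes from the online replay buffer R and
    # half from the fixed offline dataset D, regardless of buffer sizes.
    half = batch_size // 2
    online_idx = rng.integers(0, len(online_buffer), size=half)
    offline_idx = rng.integers(0, len(offline_buffer), size=batch_size - half)
    return ([online_buffer[i] for i in online_idx]
            + [offline_buffer[i] for i in offline_idx])

# Usage: buffers hold (s, a, r, s') tuples (dummy data here).
rng = np.random.default_rng(0)
R = [(0.0, 0.0, 0.0, 0.0)] * 10_000   # online transitions
D = [(1.0, 1.0, 1.0, 1.0)] * 500      # offline transitions
batch = symmetric_sample(R, D, batch_size=256, rng=rng)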

Design Choice 2: Suppressing Q-Value Divergence with Layer Normalization

This is the most important and most interesting insight in RLPD.

The essence of the problem: offline data covers only part of the state-action space. The Q-network must also predict values for actions outside the training distribution, and the Q-values of those out-of-distribution actions can be explosively overestimated. In offline RL this has long been known as a fatal pathology.

Figure 2 of the paper shows it starkly. With symmetric sampling alone, Q-values blow up on a log scale in complex environments like AntMaze Large, and performance does not improve at all.

The traditional remedy in offline RL is a conservative penalty. CQL explicitly pushes down the Q-values of out-of-distribution actions, and TD3+BC adds a behavior cloning term. These approaches work, but they have the side effect of suppressing exploration.

RLPD's solution is surprisingly elegant: apply Layer Normalization (LN) to the critic network.

Why does LN help? The authors show it mathematically. For an LN-equipped Q-function Q_{\theta,w}:

\|Q_{\theta,w}(s, a)\| = \|w^T \text{relu}(\psi_\theta(s, a))\| \leq \|w\| \cdot \|\psi_\theta(s, a)\|

LN normalizes the intermediate representation \psi_\theta onto the unit sphere, so:

\|\psi_\theta(s, a)\| \leq 1 \quad \Rightarrow \quad \|Q_{\theta,w}(s, a)\| \leq \|w\|

Conclusion: the Q-values are bounded above by the norm of the final layer's weights, including for out-of-distribution actions. Unbounded Q-value explosion becomes structurally impossible.

Figure 3 of the paper shows this intuitively. When fitting data on a circle of radius 0.5, a standard MLP's values grow without bound outside the distribution (outside the circle), while an MLP with LN stays within bounds even out of distribution.

There is an important subtlety here. LN upper-bounds Q-values, but it never explicitly marks particular actions as "bad". Unlike CQL, it does not artificially push down the Q-values of out-of-distribution actions, so it does not suppress exploration. This is why LN is especially powerful in online RL: it prevents the divergence caused by offline data while leaving the agent free to explore new regions.
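The bound is easy to check numerically. Below is a toy numpy sketch (not the paper's code): a random one-hidden-layer critic evaluated on inputs pushed ever farther from a nominal training range, with and without LayerNorm on the hidden activations.

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 4, 256
W1 = rng.normal(size=(d_in, d_hidden))
w = rng.normal(size=d_hidden)

def q_value(x, use_layernorm):
    h = x @ W1
    if use_layernorm:
        # LayerNorm without the learned affine: zero mean, unit variance
        h = (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + 1e-6)
    return np.maximum(h, 0.0) @ w  # Q = w^T relu(psi)

for scale in [1, 10, 100]:  # push inputs farther "out of distribution"
    x = rng.normal(size=(1000, d_in)) * scale
    print(scale,
          np.abs(q_value(x, use_layernorm=False)).max(),  # grows with scale
          np.abs(q_value(x, use_layernorm=True)).max())   # stays flat

Without LayerNorm, the maximum |Q| grows roughly linearly with the input scale; with it, the output stays within a fixed bound set by the norm of w (up to the constant factor that LayerNorm's unit-variance scaling introduces).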

Design Choice 3: Sample-Efficient RL via Large Ensembles

To exploit the offline data as quickly as possible, more learning has to happen per environment step. There are two levers for this.

① Increasing the UTD (update-to-data) ratio
Run several gradient updates per environment step. With UTD=20, every environment step triggers 20 gradient updates, so the offline data is "digested" faster.

However, raising UTD invites statistical overfitting: the Q-function overfits to minibatches and generalizes worse.

② Critic ensembles
To counter this overfitting, RLPD adopts REDQ (Randomized Ensembled Double Q-Learning)-style large ensembles. E critics Q_{\theta_1}, \ldots, Q_{\theta_E} are trained simultaneously, and each TD backup takes the minimum over a randomly chosen subset of them:

y = r + \gamma \min_{i \in \mathcal{Z}} Q_{\theta'_i}(s', \tilde{a}'), \quad \tilde{a}' \sim \pi_\phi(\cdot|s')

Here |\mathcal{Z}| is set to 1 or 2 depending on the environment (see the per-environment design choices section below). A code sketch of this backup follows.
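A minimal sketch, assuming the E target critics have already been evaluated into an array of shape (E, batch); names and shapes are illustrative.

import numpy as np

def redq_td_target(r, gamma, q_next_all, subset_size, rng):
    # q_next_all: (E, batch) array of Q_{theta'_i}(s', a~') for all E target critics.
    # Draw a random subset Z of critics, then take the elementwise minimum over Z.
    num_critics = q_next_all.shape[0]
    idx = rng.choice(num_critics, size=subset_size, replace=False)
    return r + gamma * q_next_all[idx].min(axis=0)

# Usage: E=10 critics; |Z|=2 behaves like clipped double Q, |Z|=1 disables it.
rng = np.random.default_rng(0)
q_next_all = rng.normal(size=(10, 256))
y = redq_td_target(r=np.zeros(256), gamma=0.99,
                   q_next_all=q_next_all, subset_size=2, rng=rng)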

The authors compare several regularizers for the critic (paper Figure 9):

  • Weight decay: inferior to the ensemble in every domain.
  • Dropout: fine in dense-reward environments, but fails in sparse-reward ones.
  • Ensemble (RLPD): the most consistently strong performance.

In short, the ensemble is the most general and robust regularization strategy.


Pseudocode: The Full RLPD Structure

Algorithm: RLPD (Online RL with Offline Data)

Inputs:
  - Offline dataset D = {(s, a, r, s') tuples}
  - Ensemble size E, gradient steps G per env step
  - Architecture: LayerNorm, number of layers

Initialize:
  - E critic networks {theta_i}, targets {theta'_i = theta_i}
  - Actor network phi
  - Empty online replay buffer R

While training:
  Receive initial state s_0
  For each env step t:
    a_t ~ pi_phi(.|s_t)                      # Act
    Store (s_t, a_t, r_t, s_{t+1}) in R      # Collect

    For g = 1..G:                             # Multiple gradient steps
      Sample N/2 from R  (online data)
      Sample N/2 from D  (offline data)       # Symmetric Sampling
      Combine into batch b of size N

      Sample a random subset Z of |Z| indices from {1..E}
      Compute TD target:
        y = r + gamma * min_{i in Z} Q_{theta'_i}(s', a'_tilde)
        [optionally + gamma * alpha * log pi_phi(a'_tilde|s')]

      For i = 1..E:
        Update theta_i: minimize (y - Q_{theta_i}(s,a))^2  # LayerNorm in Q-net

      Update actor phi: maximize (1/E) * sum_i Q_{theta_i}(s, a_tilde)
      Update target networks: theta'_i <- rho*theta'_i + (1-rho)*theta_i

Per-Environment Design Choices

Beyond the three key choices above, RLPD involves "environment-sensitive" choices that must be tuned per environment. The authors stress that these are usually taken for granted in the RL literature but actually deserve re-examination.

① Clipped Double Q-Learning (CDQ)
CDQ, standard in TD3 and SAC, uses the minimum of two critics as the target. This has the effect of subtracting roughly one standard deviation from the true target Q-value, making it conservative. In sparse-reward environments that conservatism can impede learning. In the paper's AntMaze Large Diverse experiment (Figure 8), removing CDQ and subsetting a single critic greatly improves performance.

② Maximum entropy term (MaxEnt / entropy backups)
SAC's entropy term aids exploration, but in some environments (Adroit Relocate, Humanoid Walk) it actually degrades performance. The authors recommend removing it as a starting point.

③ Network depth
The paper compares 2-layer vs. 3-layer MLPs: 3 layers help in complex environments (e.g., Adroit, Humanoid), while 2 layers suffice in simple ones.

Practical workflow: the authors recommend probing these three per-environment choices first, in the order below (a configuration sketch follows the steps).

Step 1: Try subsetting 1 critic (disable CDQ)  -->  Observe improvement?
Step 2: Try removing entropy backups           -->  Observe improvement?
Step 3: Try a deeper 3-layer MLP               -->  Observe improvement?
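In code, one way to organize this search is a small config object with the three toggles; this is a hypothetical sketch, not the released implementation.

from dataclasses import dataclass, replace

@dataclass
class PerEnvConfig:
    subset_size: int = 2          # |Z|: 2 keeps CDQ, 1 disables it
    entropy_backup: bool = True   # SAC entropy term in the TD target
    hidden_layers: int = 2        # try 3 on harder domains

# Following the workflow above: flip one switch at a time and re-evaluate.
base = PerEnvConfig()
step1 = replace(base, subset_size=1)            # Step 1: disable CDQ
step2 = replace(step1, entropy_backup=False)    # Step 2: drop entropy backups
step3 = replace(step2, hidden_layers=3)         # Step 3: deeper MLP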

Overall Structure Diagram

flowchart TD
    A[Start: Offline Dataset D\n& Empty Replay Buffer R] --> B[Environment Interaction]
    B --> C[Collect Transition\nstore in R]
    C --> D{For G gradient steps}
    D --> E["Symmetric Sampling\n50% from R + 50% from D\n--> Batch b"]
    E --> F["TD Target Computation\ny = r + gamma * min_{Z} Q_i(s', a')\n[with optional entropy]"]
    F --> G["Critic Update x E\nLayerNorm prevents Q-value divergence\nEnsemble provides regularization"]
    G --> H["Actor Update\nmaximize mean Q over ensemble"]
    H --> I["Target Network EMA Update"]
    I --> D
    D --> B

    style E fill:#4CAF50,color:#fff
    style G fill:#2196F3,color:#fff
    style F fill:#9C27B0,color:#fff


Summary Table: RLPD's Key Design Choices

| Design choice | What it solves | How it works | Added cost |
|---|---|---|---|
| Symmetric sampling | Offline-data dilution | Fixed 50:50 mix in every batch | None |
| Layer Normalization | Q-value divergence / overestimation | Bounds Q-values by \Vert w\Vert | Negligible |
| Critic ensemble (REDQ) | Statistical overfitting | E critics, random subsets | E× memory |
| Higher UTD ratio | Slow data utilization | G updates per env step | G× compute |
| CDQ adjustment | Excessive conservatism | Subset a single critic | None |
| Entropy-term adjustment | Per-environment exploration trade-off | On/off per environment | None |

Experiments: Which Environments, and by How Much?

Experimental Setup

The authors validate RLPD on a total of 30 tasks, in three main groups.

Group 1: Sparse Adroit (3 tasks)
Dexterous-hand manipulation: spinning a pen (Pen), opening a door (Door), and relocating a ball (Relocate). Rewards are sparse; the offline data consists of a few human demonstrations plus a large number of BC-policy trajectories. Baseline: IQL + fine-tuning.

Group 2: D4RL AntMaze (6 tasks)
An Ant robot navigating mazes. Rewards are extremely sparse (only on reaching the goal). The offline data consists solely of suboptimal trajectories. Baseline: IQL + fine-tuning.

Group 3: D4RL Locomotion (12 tasks)
Hopper, HalfCheetah, Walker, and Ant with offline datasets of varying quality. Dense rewards. Baseline: Off2On.

All experiments report 10 seeds with 1 standard deviation.

Main Results

Paper Figure 4 (results aggregated over all tasks):

  • Adroit: RLPD beats IQL + fine-tuning by a wide margin, including a 2.5x improvement over the prior SoTA on the Door task.
  • AntMaze: RLPD reaches performance on par with or better than the prior SoTA within a third of the steps the prior SoTA used. The authors claim it is the first method to effectively solve all six AntMaze tasks.
  • Locomotion: comparable to Off2On, competitive without any pre-training.

Particularly noteworthy: prior SoTA methods such as IQL + fine-tuning start with high initial performance thanks to offline pre-training. RLPD starts from scratch yet catches up to that initial performance within roughly 10k steps and then surpasses it.

Transfer to Pixel-Based Environments

The authors also apply RLPD to V-D4RL (vision-based D4RL). The offline data consists of pixel-observation trajectories generated by state-based policies, so a partial-observability problem is baked in.

The evaluation budget is "10% DMC": only 10% of the total timesteps that DrQ-v2 uses.

  • On Walker Walk and Cheetah Run, RLPD is consistently more sample efficient than DrQ-v2.
  • On Humanoid Walk, the BC baseline fails due to visual occlusion, while RLPD achieves meaningful learning.
  • Raising UTD to 10 yields a dramatic improvement on Cheetah Run Expert, the first demonstration that a high-UTD approach is effective for pixel-based continuous control.

For image-based tasks, random shift augmentation is additionally applied, which mitigates TD-learning overfitting.
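For reference, a DrQ-style random shift can be sketched in a few lines of numpy: pad each image with edge replication, then crop back to the original size at a random offset. The 4-pixel pad is the commonly used value; the function itself is illustrative.

import numpy as np

def random_shift(imgs, pad=4, rng=None):
    # imgs: (N, H, W, C) batch of observations.
    rng = rng or np.random.default_rng()
    n, h, w, _ = imgs.shape
    padded = np.pad(imgs, ((0, 0), (pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.empty_like(imgs)
    for i in range(n):
        dy, dx = rng.integers(0, 2 * pad + 1, size=2)  # random crop offset
        out[i] = padded[i, dy:dy + h, dx:dx + w]
    return out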

Ablation: The Role of LayerNorm

Paper Figure 7 demonstrates the importance of LN under a range of conditions:

  • Adroit Sparse (full data): removing LN greatly increases variance and lowers mean performance.
  • Expert Adroit Sparse (only 22 trajectories): with extremely limited data, training fails completely without LN, making no progress on any task; with LN, RLPD still beats the prior SoTA.
  • AntMaze Large: LN contributes to sample efficiency.
  • V-D4RL Humanoid Walk: LN's stabilizing effect also holds in complex, high-dimensional pixel environments.

These results show that LN is not a "nice to have" extra but an essential component whenever the data is limited or narrowly distributed.


Comparison: Positioning Among Related Work

graph LR
    A["Offline RL\n(IQL, CQL, TD3+BC)"] -->|"+ Online Finetuning"| B["IQL + Finetuning\n(Kostrikov et al., 2022)"]
    A --> C["Off2On\n(Lee et al., 2021)"]
    D["Online RL\n(SAC, TD3)"] -->|"+ Offline Buffer Init"| E["SACfD\n(Vecerรญk et al., 2017)"]
    D -->|"+ REDQ + LN + Sym. Sampling"| F["RLPD (Ours)"]
    B -->|"Requires offline pretraining\nRestricts exploration"| G["Drawbacks"]
    E -->|"Offline data diluted\nNo divergence control"| G
    F -->|"No pretraining\nNo explicit constraints\nExploration-friendly"| H["Advantages"]

| Method | Offline pre-training | Explicit constraints | Exploration suppressed | Complexity |
|---|---|---|---|---|
| IQL + fine-tuning | Yes | Yes (conservative Q) | Partially | High |
| Off2On | Yes | Yes (pessimistic Q-ensemble) | Partially | High |
| SACfD | No | No | No | Low |
| RLPD | No | No | No | Low |

RLPD's most direct competitor is Off2On (Lee et al., 2021). Off2On also uses large ensembles and a high UTD ratio, but it requires offline pre-training and introduces a separate balancing mechanism. RLPD achieves similar or better performance without any pre-training.


Critical Discussion: Strengths and Limitations

Strengths

1. Practical simplicity
RLPD's key ideas amount to "changing a few lines of an existing SAC implementation, with no extra machinery": add LayerNorm to the critic, change the sampling scheme, grow the ensemble. That alone dominates prior methods. Reproducibility is high as well: the codebase is released in JAX, and with no heavyweight pre-training pipeline like IQL's, it is easy to get started.

2. Robustness to data quality
RLPD works well on everything from an extremely limited set of 22 expert demonstrations to large volumes of suboptimal trajectories. This matters greatly for real-robot applications, where offline data quality is hard to guarantee.

3. Exploration-friendly
LN caps Q-values without penalizing specific actions, so the agent keeps the freedom to try new actions outside the offline data distribution. This resolves offline RL's chronic problem of being "trapped in the distribution".

4. Theoretical justification for LayerNorm
Rather than merely observing that LN works empirically, the paper derives an upper bound on Q-values that shows why LN prevents divergence.

Limitations and Weaknesses

1. Memory cost of ensembles
An ensemble with E=10 means 10x the parameters. Deployment can be difficult on memory-constrained edge devices for real robots. The paper claims no computational overhead, but that presumes a high-end GPU where the ensemble runs in parallel.

2. Per-environment hyperparameter search
Whether to use CDQ, the entropy term, and the network depth still require search. The authors provide a workflow, but that search cost recurs for every new environment. The "no hyperparameters" claim applies only to the three core components (symmetric sampling, LN, ensembles).

3. Dependence on reward design
RLPD is still RL, so it needs a reward function. Sparse success/failure rewards are relatively easy to define on a real robot, but designing dense rewards for complex manipulation tasks is a separate, hard problem.

4. Possible short-term exploration inefficiency
Symmetric sampling keeps drawing 50% from the offline data forever. If the offline data is extremely limited and narrowly distributed, this fixed ratio could introduce unnecessary bias late in training. A theoretical analysis of this is missing.

5. Assumed reward labels in the offline data
RLPD assumes the offline data comes as (s, a, r, s') tuples, i.e., with reward labels r. In practice one often has only reward-free video or motion-capture data, and RLPD cannot be applied directly in that case.


Application Guide for Robotics Practitioners

Scenarios where RLPD is especially useful:

  1. A few human demonstrations + online RL: bootstrap RL from demonstrations gathered via teleoperation or kinesthetic teaching, e.g., learning finger gaiting or regrasping on a dexterous hand such as the Allegro Hand.

  2. Exploiting suboptimal prior data: trajectories from failed past experiments can also serve as offline data, thanks to RLPD's low sensitivity to data quality.

  3. Sim-to-real pipelines: a hybrid approach that treats simulator-generated trajectories as offline data and real-robot interaction as online data.

Practical implementation checklist (a sketch of the first two items follows the list):

[  ] Start from a SAC-based implementation
[  ] Add LayerNorm to every hidden layer of the critic network
[  ] Set the critic ensemble size to E=10 (proprioceptive)
[  ] Keep a separate offline data buffer D (fixed, never updated)
[  ] Every batch: sample N/2 from R and N/2 from D
[  ] UTD ratio: start at UTD=1, increase later if needed
[  ] Per-environment tuning: search CDQ / entropy / layer depth, in that order
[  ] Pixel-based: add random shift augmentation
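To make the first two checklist items concrete, here is a numpy-only sketch of a critic forward pass with LayerNorm after every hidden layer; parameter shapes and names are illustrative, not the released code.

import numpy as np

def critic_forward(params, s, a):
    # params: list of (W, b) pairs. LayerNorm (learned affine omitted for
    # brevity) follows every hidden affine layer, then a ReLU.
    x = np.concatenate([s, a], axis=-1)
    for W, b in params[:-1]:
        x = x @ W + b
        x = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)
        x = np.maximum(x, 0.0)
    W, b = params[-1]
    return (x @ W + b).squeeze(-1)  # one scalar Q per (s, a) pair

# Usage: 2 hidden layers of width 256; obs/action dims are illustrative.
rng = np.random.default_rng(0)
dims = [17 + 6, 256, 256, 1]
params = [(0.05 * rng.normal(size=(i, o)), np.zeros(o))
          for i, o in zip(dims[:-1], dims[1:])]
q = critic_forward(params, rng.normal(size=(32, 17)), rng.normal(size=(32, 6)))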

Summary and Conclusion

What RLPD teaches us is simple but profound: combining offline data with online RL does not require complex architectural changes. Three things are key.

  1. Symmetric sampling: use the offline data at a constant rate from start to finish.
  2. Layer Normalization: structurally prevent Q-value divergence without suppressing exploration.
  3. Large ensembles: provide the statistical regularization that keeps a high UTD ratio from overfitting.

This combination yields performance up to 2.5x beyond prior methods, with no added computational overhead.

From a robotics standpoint, the paper's most important message is: "with good offline data, you can kickstart online RL without a complex pre-training pipeline." It is a directly useful recipe for researchers bootstrapping RL on dexterous manipulation platforms like the Allegro Hand from a handful of teleoperated demonstrations.

Of course, practical issues such as reward design, domain randomization, and safety constraints on real robots still have to be solved separately. But RLPD gives a clear, practical answer to the core question of how to use prior data.

Complexity comes from a lack of understanding. True understanding converges to simplicity.


References (Selected)

  • Ball et al. (2023). Efficient Online Reinforcement Learning with Offline Data. ICML 2023.
  • Haarnoja et al. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning. ICML.
  • Chen et al. (2021). Randomized Ensembled Double Q-Learning. ICLR.
  • Kostrikov et al. (2022). Offline Reinforcement Learning with Implicit Q-Learning. ICLR.
  • Lee et al. (2021). Offline-to-Online RL via Balanced Replay and Pessimistic Q-Ensemble. CoRL.
  • Fu et al. (2020). D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv.
  • Ba et al. (2016). Layer Normalization. arXiv.
