๐Ÿ“ƒ Offline RL Survey Review

rl
offline-rl
survey
A Survey on Offline Reinforcement Learning: Taxonomy, Review, and Open Problems
Published

July 2, 2025

  • Paper Link
  1. ์˜คํ”„๋ผ์ธ RL(Offline RL)์€ ํ™˜๊ฒฝ๊ณผ์˜ ์ƒํ˜ธ์ž‘์šฉ ์—†์ด ์ •์  ๋ฐ์ดํ„ฐ์…‹๋งŒ์„ ์ด์šฉํ•ด ํ•™์Šตํ•˜๋Š” ํŒจ๋Ÿฌ๋‹ค์ž„์œผ๋กœ, ์‹ค์ œ ํ™˜๊ฒฝ ์ ์šฉ์— ํ•„์ˆ˜์ ์ด์ง€๋งŒ ๋ฐ์ดํ„ฐ ๋ถ„ํฌ ๋ณ€ํ™”(distributional shift) ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  2. ๋ณธ ๋…ผ๋ฌธ์€ ์˜คํ”„๋ผ์ธ RL ๊ธฐ๋ฒ•์„ ๋ถ„๋ฅ˜ํ•˜๋Š” ์ƒˆ๋กœ์šด Taxonomy๋ฅผ ์ œ์•ˆํ•˜๊ณ , ์ตœ์‹  ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ฐ ๋ฒค์น˜๋งˆํฌ๋ฅผ ์ข…ํ•ฉ์ ์œผ๋กœ ๊ฒ€ํ† ํ•˜๋ฉฐ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ ํŠน์„ฑ์— ๋”ฐ๋ฅธ ๊ธฐ๋ฒ•๋ณ„ ์„ฑ๋Šฅ์„ ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค.
  3. ๋”๋ถˆ์–ด ์˜คํ”„๋ผ์ธ ์ •์ฑ… ํ‰๊ฐ€(Off-Policy Evaluation, OPE)๋ฅผ ํฌํ•จํ•œ ๋ฏธํ•ด๊ฒฐ ๊ณผ์ œ๋“ค์„ ๋…ผ์˜ํ•˜๊ณ  ๋ถ„์•ผ์˜ ํ–ฅํ›„ ์—ฐ๊ตฌ ๋ฐฉํ–ฅ์— ๋Œ€ํ•œ ํ†ต์ฐฐ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
[Figure: Timeline illustrating the key developments in the field of offline RL]

Brief Review

๋ณธ ๋…ผ๋ฌธ์€ ์ •์  ๋ฐ์ดํ„ฐ์…‹(\mathcal{D})์œผ๋กœ๋ถ€ํ„ฐ ํ•™์Šตํ•˜๋ฉฐ ํ™˜๊ฒฝ๊ณผ์˜ ์ถ”๊ฐ€ ์ƒํ˜ธ์ž‘์šฉ ์—†์ด ์ •์ฑ…(\pi_{\text{off}})์„ ๋„์ถœํ•˜๋Š” Offline Reinforcement Learning (์˜คํ”„๋ผ์ธ ๊ฐ•ํ™”ํ•™์Šต) ๋ถ„์•ผ์— ๋Œ€ํ•œ ํฌ๊ด„์ ์ธ ์„œ๋ฒ ์ด ๋…ผ๋ฌธ์ž…๋‹ˆ๋‹ค. ์˜จ๋ผ์ธ ๋˜๋Š” Off-policy RL (์˜คํ”„-ํด๋ฆฌ์‹œ ๊ฐ•ํ™”ํ•™์Šต)๊ณผ ๋‹ฌ๋ฆฌ ์˜คํ”„๋ผ์ธ RL์€ ๊ณ ๋น„์šฉ ๋˜๋Š” ์œ„ํ—˜์„ฑ์œผ๋กœ ์ธํ•ด ํ™˜๊ฒฝ ์ƒํ˜ธ์ž‘์šฉ์ด ์–ด๋ ค์šด ์‹ค์ œ ์‘์šฉ ๋ถ„์•ผ(์˜ˆ: ๊ต์œก, ํ—ฌ์Šค์ผ€์–ด, ๋กœ๋ณดํ‹ฑ์Šค)์— ํŠนํžˆ ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.

์˜คํ”„๋ผ์ธ RL์˜ ํ•ต์‹ฌ ๊ณผ์ œ๋Š” ํ•™์Šต๋œ ์ •์ฑ…(\pi_{\theta})์ด ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์˜ ๋ถ„ํฌ(\pi_{\beta}๋˜๋Š”d^{\pi_\beta})์—์„œ ๋ฒ—์–ด๋‚  ๋•Œ ๋ฐœ์ƒํ•˜๋Š” Distributional Shift (๋ถ„ํฌ ๋ณ€ํ™”) ๋ฌธ์ œ์ž…๋‹ˆ๋‹ค. ํŠนํžˆ function approximator์˜ ๊ณผ๋Œ€ ์ถ”์ •(overestimation)๊ณผ ์˜ค์ฐจ ๋ˆ„์ (compounding error)์ด ๋ฌธ์ œ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ๊ฐ€์น˜ ๊ธฐ๋ฐ˜ ๋ฐฉ๋ฒ•(value-based method)์˜ ๊ฒฝ์šฐ, ๋ฒจ๋งŒ ์—๋Ÿฌ(Bellman error) ์ตœ์†Œํ™” ๋ชฉํ‘œ ํ•จ์ˆ˜

J(\phi) = \mathbb{E}_{s, a, s' \sim \mathcal{D}}[(r(s, a) + \gamma \mathbb{E}_{a' \sim \pi_{\text{off}}(\cdot|s')}[Q^{\pi}_{\phi}(s', a')] - Q^{\pi}_{\phi}(s, a))^2]

์—์„œa'์ด ๋ฐ์ดํ„ฐ์…‹์˜ ํ–‰๋™ ๋ถ„ํฌ\pi_{\beta}์™€ ๋‹ค๋ฅผ ๋•Œ ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.

Taxonomy

๋…ผ๋ฌธ์€ ์˜คํ”„๋ผ์ธ RL ๋ฐฉ๋ฒ•๋ก ์„ ๋ถ„๋ฅ˜ํ•˜๊ธฐ ์œ„ํ•œ ์ƒˆ๋กœ์šด Taxonomy (๋ถ„๋ฅ˜์ฒด๊ณ„)๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ์ƒ์œ„ ์ˆ˜์ค€์—์„œ๋Š” ํ•™์Šต ๋Œ€์ƒ์„ ๊ธฐ์ค€์œผ๋กœ Model-Based (๋ชจ๋ธ ๊ธฐ๋ฐ˜), One-step (์›์Šคํ…), Imitation Learning (๋ชจ๋ฐฉ ํ•™์Šต) ๋ฐฉ๋ฒ•์œผ๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ์†์‹ค ํ•จ์ˆ˜๋‚˜ ํ›ˆ๋ จ ์ ˆ์ฐจ์— ๋Œ€ํ•œ ๋ณ€ํ˜•์ธ Policy Constraints (์ •์ฑ… ์ œ์•ฝ), Regularization (์ •๊ทœํ™”), Uncertainty Estimation (๋ถˆํ™•์‹ค์„ฑ ์ถ”์ •)์„ ๋ถ€๊ฐ€์ ์ธ ํŠน์„ฑ์œผ๋กœ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.

  • Policy Constraints: constrain the learned policy $\pi_{\theta}$ to stay close to the behavior policy $\pi_{\beta}$.

    • Direct: explicitly estimate $\pi_{\beta}$ and impose a constraint such as $\mathcal{D}(\pi_{\theta}(\cdot|s), \hat{\pi}_{\beta}(\cdot|s)) \le \epsilon$ (e.g., using an $f$-divergence) (BCQ, BRAC). These methods are sensitive to errors in the estimate of $\pi_{\beta}$.
    • Implicit: constrain the policy implicitly through a modified objective, without estimating $\pi_{\beta}$. The advantage-weighted regression form $J(\theta) = \mathbb{E}_{s,a \sim \mathcal{D}}[\log \pi_{\theta}(a|s) \exp(\frac{1}{\lambda} \hat{A}^{\pi}(s, a))]$ is representative (BEAR, AWR, AWAC, TD3+BC); a minimal sketch of this loss appears after this list.
  • Importance Sampling (IS): used for off-policy policy evaluation. The product of per-step likelihood ratios over a trajectory ($w_{i:j}$) makes the variance very high, so variance-reduction techniques (per-decision IS, doubly robust estimators, marginalized IS) have been proposed. Marginalized IS mitigates the variance problem by working with the state marginal ratio $\rho_{\pi}(s)$ or the state-action marginal ratio $\rho_{\pi}(s, a)$, which satisfies the Bellman flow equation $d^{\pi_\beta}(s')\rho_\pi(s') = (1-\gamma)d_0(s') + \gamma \sum_{s,a} d^{\pi_\beta}(s)\rho_\pi(s)\pi(a|s)T(s'|s,a)$ (GenDICE).

  • Regularization: add penalty terms to the policy or the value function to induce desirable properties.

    • Policy Regularization: maximize the policy's entropy to increase its stochasticity (SAC).
    • Value Regularization: force Q-value estimates for OOD actions to be low, yielding conservative value estimates. CQL adds a regularization term of the form $\max_{\mu} \mathbb{E}_{s \sim \mathcal{D}, a \sim \mu(\cdot|s)}[Q^{\pi}_{\phi}(s, a)] - \mathbb{E}_{s \sim \mathcal{D}, a \sim \hat{\pi}_{\beta}(\cdot|s)}[Q^{\pi}_{\phi}(s, a)] + \mathcal{R}(\mu)$ so that the learned value function is a lower bound on the true values on the dataset.
  • Uncertainty Estimation: estimate the uncertainty of the learned policy, value function, or model and use it to modulate the degree of conservatism dynamically. An ensemble is typically used, with the variance of its predictions serving as the uncertainty measure (REM).

  • Model-Based Methods: learn the transition dynamics $T$ and the reward function $r$ from the dataset $\mathcal{D}$. The learned model is used for planning or to generate synthetic data through model rollouts. To avoid the model's own distributional shift, one line of work learns a conservative model whose reward is penalized by an uncertainty estimate, $\tilde{r}_{\psi_r}(s, a) = r_{\psi_r}(s, a) - \lambda U_r(s, a)$ (MOReL, MOPO, COMBO). COMBO instead achieves conservatism without uncertainty quantification by applying value regularization to model-based rollouts.

  • One-Step Methods: rather than iterating policy evaluation and policy improvement, accurately learn the behavior policy's value function $Q^{\pi_{\beta}}$ and then perform a single policy improvement step, thereby avoiding value estimates of OOD actions. IQL (Implicit Q-Learning) learns the value function $V^{\pi}$ with an expectile regression loss, approximating an upper expectile of the Q-values of the 'good' actions within the data distribution.

  • Imitation Learning: mimic the behavior policy. Plain Behavior Cloning (BC) clones the entire dataset, while more advanced techniques use a value function to filter out suboptimal actions (BAIL, CRR) or learn policies conditioned on a desired outcome such as a goal or return (RvS).

  • Trajectory Optimization: learn the joint state-action distribution $p_{\pi_{\beta}}(\tau)$ over entire trajectories $\tau = (s_0, a_0, \dots, s_H)$ with a sequence model (e.g., a Transformer). Planning is then performed by conditioning the learned distribution on a desired return-to-go or similar quantity (TT, DT). These methods are notably strong on sparse-reward problems.
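As referenced in the Policy Constraints item above, here is a minimal sketch of the advantage-weighted regression objective used by implicit-constraint methods such as AWR/AWAC (my own illustration under simplifying assumptions; `policy_mean`, `q_net`, and `v_net` are hypothetical networks, and the advantage is formed as Q minus V). Only dataset actions appear inside the log-likelihood, which is what keeps the policy close to the behavior distribution.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2

# Hypothetical networks: a Gaussian policy head, a Q-function,
# and a state-value baseline used to form the advantage estimate.
policy_mean = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
v_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def awr_policy_loss(s, a, lam=1.0, log_std=-1.0):
    """Advantage-weighted regression:
    maximize E_{(s,a)~D}[ log pi_theta(a|s) * exp(A_hat(s,a) / lambda) ].

    Dataset actions are re-weighted by their exponentiated advantage, so the
    policy improves without ever evaluating actions outside the dataset.
    """
    with torch.no_grad():
        adv = q_net(torch.cat([s, a], dim=-1)) - v_net(s)      # A_hat(s, a)
        weight = torch.exp(adv / lam).clamp(max=100.0)         # clipped for numerical stability
    dist = torch.distributions.Normal(policy_mean(s), torch.exp(torch.tensor(log_std)))
    log_prob = dist.log_prob(a).sum(dim=-1, keepdim=True)      # log pi_theta(a|s)
    return -(weight * log_prob).mean()                         # minimize the negative objective

# Usage on a random batch standing in for the static dataset D.
s, a = torch.randn(32, state_dim), torch.randn(32, action_dim)
print(awr_policy_loss(s, a))
```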

Evaluation

Off-policy evaluation (OPE) is one of the major open problems in offline RL. Accurately estimating a policy's performance and tuning hyperparameters offline, without any environment interaction, is essential for practical offline RL. The main OPE approaches are model-based methods, importance sampling, and Fitted Q Evaluation (FQE). Empirical studies find that FQE often performs well, but no method is yet consistently superior across all settings (DOPE benchmark).
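For intuition on FQE, the following sketch (my own, with hypothetical `pi` and `q_net`; a target network is omitted) fits the target policy's Q-function by repeated Bellman regression on the fixed dataset and then estimates the policy's value as the mean Q at initial states.

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 8, 2, 0.99

# Hypothetical deterministic target policy to be evaluated, and its Q-network.
pi = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def fqe_step(s, a, r, s_next, done):
    """One Fitted Q Evaluation regression step on the static dataset.
    The next action is taken from the *evaluated* policy pi."""
    with torch.no_grad():
        a_next = pi(s_next)
        target = r + gamma * (1 - done) * q_net(torch.cat([s_next, a_next], dim=-1))
    loss = ((q_net(torch.cat([s, a], dim=-1)) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def estimate_value(initial_states):
    """Offline estimate of J(pi): mean Q(s0, pi(s0)) over initial states."""
    with torch.no_grad():
        return q_net(torch.cat([initial_states, pi(initial_states)], dim=-1)).mean().item()

# Toy usage with random tensors standing in for dataset transitions.
fqe_step(torch.randn(32, state_dim), torch.randn(32, action_dim),
         torch.randn(32, 1), torch.randn(32, state_dim), torch.zeros(32, 1))
print(estimate_value(torch.randn(16, state_dim)))
```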

์˜คํ”„๋ผ์ธ RL Benchmark๋กœ๋Š” D4RL๊ณผ RL Unplugged๊ฐ€ ๋„๋ฆฌ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

์ด๋“ค์€ Narrow and Biased Data Distributions (์ข๊ณ  ํŽธํ–ฅ๋œ ๋ฐ์ดํ„ฐ ๋ถ„ํฌ), Undirected and Multitask Data (์ง€ํ–ฅ๋˜์ง€ ์•Š์€ ๋‹ค์ค‘ ์ž‘์—… ๋ฐ์ดํ„ฐ), Sparse Rewards (ํฌ์†Œ ๋ณด์ƒ), Suboptimal Data (์ฐจ์„  ๋ฐ์ดํ„ฐ), Nonrepresentable Behavior Policies (ํ‘œํ˜„ ๋ถˆ๊ฐ€๋Šฅํ•œ ํ–‰๋™ ์ •์ฑ…), Non-Markovian Behavior Policies (๋น„ ๋งˆ๋ฅด์ฝ”ํ”„ ํ–‰๋™ ์ •์ฑ…), Realistic Domains (ํ˜„์‹ค์ ์ธ ๋„๋ฉ”์ธ) ๋“ฑ ์‹ค์ œ ์‘์šฉ์— ์ค‘์š”ํ•œ Dataset Design Factors (๋ฐ์ดํ„ฐ์…‹ ์„ค๊ณ„ ์š”์†Œ)๋ฅผ ํฌํ•จํ•˜๋Š” ๋‹ค์–‘ํ•œ ํ™˜๊ฒฝ๊ณผ ๋ฐ์ดํ„ฐ์…‹์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

However, stochastic dynamics, nonstationarity, risky biases, and multiagent settings are still underrepresented. According to the performance analysis on the D4RL benchmark, recent methods (TT, IQL) and the trajectory optimization and one-step families show strengths on sparse-reward and multitask data and emerge as promising classes of methods.

Suggested directions for future research include improving the reliability of OPE, exploiting unlabeled data with unsupervised RL techniques, developing online fine-tuning strategies via incremental RL, and safety-critical RL (e.g., CVaR-based objectives). Effective data collection and curation are as important as algorithm development.
