📃 Physics Informed RL Survey Review

rl
physics-informed
survey
A Survey on Physics Informed Reinforcement Learning - Review and Open Problems
Published

July 3, 2025

  • Paper Link
  1. This paper surveys research trends in Physics-Informed Reinforcement Learning (PIRL) and proposes a new taxonomy for how physics information is integrated into RL.
  2. PIRL integrates physics priors such as equations, constraints, and simulators into different parts of the RL pipeline (state, action, reward, network, and model) to improve the efficiency and safety of RL.
  3. The paper's analysis covers the application areas of PIRL together with open problems and future research directions, contributing to the growth of the field.
PIRL taxonomy and further categories

Brief Review

This paper is a comprehensive survey of Physics-Informed Reinforcement Learning (PIRL). PIRL is an approach that integrates physical constraints and physical laws into the learning process to improve the performance of machine learning frameworks, reinforcement learning (RL) in particular.

Introduction

RL is a promising approach that solves decision-making and optimization problems through trial and error. It has succeeded in areas such as autonomous driving, robotics, and continuous control, but it still faces several problems: poor sample efficiency on real-world data, difficulty handling high-dimensional continuous state/action spaces, safe exploration, defining an appropriate reward function, and the gap between simulators and the real environment. PIML (Physics-Informed Machine Learning), which integrates physics information into ML models, learns more efficiently from incomplete physics information and data, generalizes better, and yields physically plausible solutions. RL is well suited to physics integration because most RL problems relate to the real world and have an explainable physical structure. Recent work addresses the challenges above by integrating physics information into the RL pipeline. Examples include using physics information to reduce high-dimensional continuous states to intuitive representations, building better simulations, and incorporating physical constraints for safe learning into the reward function. PIRL research has grown steadily over the past six years and is attracting attention. The survey's contributions are as follows.

  1. Taxonomy: proposes a unified taxonomy of which physics knowledge/processes are modeled, how they are represented, and how they are integrated into RL approaches.
  2. Algorithmic Review: reviews state-of-the-art physics-informed RL methods using a unified notation and functional diagrams.
  3. Training and evaluation benchmark Review: analyzes the evaluation benchmarks used in the reviewed literature and identifies the popular platforms/tools.
  4. Analysis: analyzes in detail, across model-based and model-free RL applications in various domains, how physics information is integrated into specific RL approaches, which physical processes are modeled/integrated, and which network architectures or augmentations are used.
  5. Open Problems: offers a perspective on current challenges, open research questions, and future research directions.

PIML: An Overview

An overview of physics-informed machine learning

PIML aims to integrate mathematical physics models and observational data into the learning process so that physically consistent solutions can be found even in complex scenarios that are incomplete, uncertain, and high-dimensional. Adding physics knowledge to ML models brings several benefits: guaranteed physical/scientific consistency, better data efficiency, faster training, improved generalization, and greater transparency/interpretability. The three main strategies for integrating physics knowledge are as follows.

  • Observational bias: trains a DNN on multi-modal data that reflects physical principles. The data comes from various sources, such as observations, simulations, data generated from physics equations, maps, and extracted physics data.
  • Learning bias: reinforces prior knowledge by adding physics-based penalty terms to the loss function. PINNs (Physics-Informed Neural Networks), which include a PDE in the network's loss function, are the representative example.
  • Inductive biases: integrate physical principles as 'hard' constraints through custom neural network architectures. Examples include Hamiltonian NNs and Lagrangian Neural Networks (LNNs).
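
The learning-bias strategy can be illustrated with a minimal sketch of a loss that adds a physics penalty to a data-fit term. Everything here is an assumption made for illustration and is not taken from the survey: the ODE du/dt + k*u = 0, the one-parameter surrogate `model`, and the weight `lam`. A real PINN would use a neural network and automatic differentiation; a central finite difference stands in for the derivative here.

```python
import math

def model(theta, t):
    """Hypothetical one-parameter surrogate: u(t) = exp(-theta * t)."""
    return math.exp(-theta * t)

def physics_residual(theta, t, k, h=1e-4):
    """Residual of the assumed ODE du/dt + k*u = 0 (central difference)."""
    du_dt = (model(theta, t + h) - model(theta, t - h)) / (2.0 * h)
    return du_dt + k * model(theta, t)

def pinn_style_loss(theta, data, collocation, k, lam=1.0):
    """Mean squared data error plus lam times the mean squared ODE residual."""
    data_loss = sum((model(theta, t) - u) ** 2 for t, u in data) / len(data)
    phys_loss = sum(physics_residual(theta, t, k) ** 2
                    for t in collocation) / len(collocation)
    return data_loss + lam * phys_loss
```

With data sampled from the true solution for k = 2, the loss is close to zero at theta = 2 and larger elsewhere; the physics term penalizes parameters that violate the ODE even at collocation points where no data exists.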

PIRL: Fundamentals, Taxonomy and Examples

Physics-informed reinforcement learning: fundamentals, taxonomy, and examples

RL fundamentals

RL solves sequential decision-making problems that follow the MDP (Markov Decision Process) framework. An agent and an environment interact: the agent observes a state (s_t) and selects an action (a_t), and the environment returns the next state (s_{t+1}) and a reward (r_t). The goal is to find the parameters \phi of a policy \pi_\phi(a_t|s_t) that maximizes the cumulative reward. An MDP is represented by the tuple (S, A, R, P, \gamma), where S is the state space, A is the action space, R is the reward function, P(s_{t+1}|s_t, a_t) is the environment model (transition probability), and \gamma \in [0, 1] is the discount factor. The objective is

J(\phi) = \mathbb{E}_{\tau \sim p_\phi(\tau)} \left[ \sum_{t=1}^{T} \gamma^{t-1} R(a_t, s_{t+1}) \right]

where \tau is the state-action sequence of an episode. RL algorithms can be divided into model-free (learning without an environment model) and model-based (using an environment model for planning/learning). They are also classified as online (using data collected with the latest policy), off-policy (using data from an experience replay buffer), and offline (using a fixed dataset).
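
As a small worked illustration of the objective J(\phi), the sketch below computes the discounted return of one episode and averages it over sampled episodes as a Monte Carlo estimate of the expectation. The function names are hypothetical, and each episode is assumed to be given simply as a list of rewards r_1, ..., r_T.

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=1}^{T} gamma^(t-1) * r_t for one episode."""
    g = 0.0
    # Accumulate from the last reward backwards: g_t = r_t + gamma * g_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def estimate_objective(episodes, gamma):
    """Monte Carlo estimate of J: mean discounted return over episodes."""
    return sum(discounted_return(ep, gamma) for ep in episodes) / len(episodes)
```

For example, rewards [1, 1, 1] with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.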

PIRL: Introduction

PIRL is the concept of integrating physical structure, priors, and real physical variables into the policy learning or optimization process. Doing so improves the effectiveness and sample efficiency of RL algorithms and accelerates training.

PIRL Taxonomy

The paper organizes its PIRL taxonomy around three axes: the type of physics information, the PIRL method that integrates it, and the RL pipeline.

  • Physics information (types): representation of physics priors
    1. Differential and algebraic equations (DAE): representations of system dynamics such as PDEs/ODEs and boundary conditions (BC) (e.g., PINNs).
    2. Barrier certificate and physical constraints (BPC): safety constraints such as CLF, BF, and CBF/CBC (e.g., regulating exploration in safety-critical applications).
    3. Physics parameters, primitives and physical variables (PPV): physical values extracted or derived from the environment/system (e.g., jam-avoiding distance, dynamic movement primitives).
    4. Offline data and representation (ODR): offline data used to improve simulator-based training, or learned low-dimensional representations that are physically relevant.
    5. Physics simulator and model (PS): simulators used as testbeds for RL algorithms or to impart physical correctness (e.g., learning a system model in MBRL).
    6. Physical properties (PPR): knowledge of fundamental physical structure/properties, such as system morphology and symmetry.
  • PIRL methods: physics prior augmentations to RL
    1. State design: modifying/extending the observed state space (e.g., state fusion, feature extraction).
    2. Action regulation: imposing constraints on action values (e.g., a safety filter).
    3. Reward design: designing an effective reward or augmenting the reward function.
    4. Augment policy or value N/W: changing the update rule, loss, or structure of the policy or value function.
    5. Augment simulator or model: improving the simulator/model by incorporating underlying physics knowledge.
  • RL Pipeline
    1. Problem Representation: modeling the real-world problem as an MDP (defining states, actions, and rewards).
    2. Learning strategy: deciding how the agent and environment interact, the learning architecture, and the choice of algorithm.
    3. Network design: designing the detailed structure of the policy/value networks.
    4. Training: training the networks (including training augmentations such as sim-to-real).
    5. Trained policy deployment: deploying the trained policy.
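
To make the "Reward design" method concrete, here is a minimal sketch of a reward augmented with a physics-based term. It is a hypothetical example and not a formula from the survey: it assumes a single rotary actuator and penalizes the mechanical work |torque * angular velocity| * dt, with `energy_weight` as an assumed tuning parameter.

```python
def physics_informed_reward(task_reward, torque, angular_velocity, dt,
                            energy_weight=0.01):
    """Task reward minus a penalty on mechanical work |tau * omega| * dt."""
    energy = abs(torque * angular_velocity) * dt
    return task_reward - energy_weight * energy
```

The task reward stays in charge of what the agent should achieve, while the physics term discourages solutions that reach the goal with physically wasteful actuation.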

Further categorization

The paper additionally uses two categories to describe PIRL implementations.

  • Bias: analyzes how the bias concepts used in PIML (Observational, Learning, Inductive) relate to PIRL approaches.
  • Learning architecture: classifies approaches by the changes they introduce to the conventional RL learning architecture in order to integrate physics information.
    1. Safety filter: includes a module that regulates the agent's actions to guarantee safety constraints.
    2. PI reward: modifies the reward function with physics information.
    3. Residual learning: combines a physics-informed controller with a data-driven policy.
    4. Physics embedded network: integrates physics information, such as system dynamics, directly into the policy or value function network.
    5. Differentiable simulator: uses a differentiable physics simulator so that loss gradients can be computed directly with respect to the control actions.
    6. Sim-to-Real: trains in a simulator and then transfers to the real environment.
    7. Physics variable: adds physical parameters, variables, or primitives to the state, reward, and so on.
    8. Hierarchical RL: integrates physics information in hierarchical or curriculum learning settings.
    9. Data augmentation: replaces/augments the input state with, for example, a low-dimensional representation to derive physically relevant features.
    10. PI model identification: integrates physics information into the model identification process in MBRL settings.
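
The residual learning architecture in the list above can be sketched as follows. This is an illustration under stated assumptions, not a method from a specific paper: the physics-informed controller is assumed to be a simple PD law, `residual_policy` is a hypothetical callable standing in for the learned data-driven policy, and `limit` is an assumed actuator bound.

```python
def pd_controller(position, velocity, target, kp=2.0, kd=0.5):
    """Assumed physics-based prior: a PD control law."""
    return kp * (target - position) - kd * velocity

def residual_action(position, velocity, target, residual_policy, limit=1.0):
    """Sum the controller output and the learned residual, then clip."""
    u_physics = pd_controller(position, velocity, target)
    u_residual = residual_policy(position, velocity, target)
    u = u_physics + u_residual
    return max(-limit, min(limit, u))
```

The design intent is that the controller already provides reasonable behavior, so the RL policy only has to learn a correction on top of it.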

PIRL: Review and Analysis

  • Algorithmic review: groups and discusses the studies according to the PIRL method and learning architecture categories presented above. For State design, examples include physics-based state fusion in CAV control and the use of jam-avoiding distance in adaptive cruise control. For Action regulation, the emphasis is on action constraints built from CBF/CBC in safety-critical systems, and the paper mentions safety conditions expressed with B_\epsilon(x) and the Lie derivative \mathcal{L}_{f(x, u_{RL})} B_\epsilon(x). For Reward design, physics-based reward functions are presented from areas such as robot locomotion, energy management, and fluid dynamics. Augment simulator or model covers learning system models with LNNs, augmenting simulators to improve sim-to-real transfer, and using differentiable simulators. Augment policy and/or value N/W introduces Neural Dynamic Policies (NDP), which embed a dynamical system as a differentiable layer in a neural network policy, and approaches that treat the value function as a PINN solving the HJB PDE.

  • Simulation/evaluation benchmarks: presents the simulators and evaluation environments used in the studies, grouped into standard benchmarks such as OpenAI Gym, MuJoCo, PyBullet, and DeepMind Control Suite, domain-specific platforms such as SUMO, CARLA, and the IEEE distribution system benchmarks, and a large number of custom environments.

  • Analysis:

    • Research trends and statistics: the most frequently used RL algorithm is PPO, followed by DDPG, SAC, and others. Among physics information types, physics simulators, system models, and barrier certificates/physical constraints are the most common. Among learning architectures, PI reward and safety filter integrate physics mainly through learning bias, while physics embedded networks do so through inductive bias. About 85% of application domains relate to control or policy design, dominated by Miscellaneous control, Safe control and exploration, and Dynamic control.
    • RL challenges addressed: PIRL helps with the following RL challenges. Sample efficiency (simulator/model augmentation), Curse of dimensionality (learning physically relevant low-dimensional representations), Safe exploration (using control theory such as CBF/CLF), Partial observability (state augmentation/fusion), and Under-defined reward function (physics-based reward design/augmentation).
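
The barrier-based action regulation mentioned in the algorithmic review can be sketched with a deliberately simple safety filter. This is not the barrier-certificate condition from the survey. It is a simplified control-barrier-function example under assumptions chosen for illustration: a 1D single integrator x' = u, a safe set h(x) = x_max - x >= 0, and the condition dh/dt >= -alpha * h(x), which reduces to the closed-form bound u <= alpha * (x_max - x). The names `cbf_safety_filter` and `rollout` are hypothetical.

```python
def cbf_safety_filter(x, u_rl, x_max=1.0, alpha=1.0):
    """Limit the RL action so that u <= alpha * (x_max - x)."""
    u_bound = alpha * (x_max - x)
    return min(u_rl, u_bound)

def rollout(x0, policy, steps=200, dt=0.05):
    """Simulate x' = u with forward Euler, filtering every action."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        u = cbf_safety_filter(x, policy(x))
        x = x + dt * u
        trajectory.append(x)
    return trajectory
```

Even with a policy that always pushes toward the boundary, such as `lambda x: 5.0`, the filtered trajectory approaches x_max without crossing it, while actions that are already safe pass through unchanged. In this discrete-time sketch the guarantee relies on alpha * dt <= 1.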

Open Challenges and Research Directions

  1. High Dimensional Spaces: learning informative, physically relevant low-dimensional representations in high-dimensional spaces remains a challenge.
  2. Safety in Complex and Uncertain Environments: safe exploration and control approaches that are model-agnostic and generalizable in complex, uncertain environments still need to be developed. Generalized approaches for integrating physics into data-driven model learning are also important.
  3. Choice of physics prior: choosing the right physics prior for a problem is difficult and requires domain-specific expertise. A comprehensive framework that can handle new physical tasks is needed.
  4. Evaluation and benchmarking platform: comprehensive benchmarking and evaluation environments for PIRL research are lacking, which makes it hard to compare and evaluate new methods. Studies tend to rely on customized, domain-specific environments.

Conclusions: the paper introduces the PIRL paradigm and proposes a taxonomy based on the types of physics priors and the ways physics information is integrated (RL methods). It also provides further categorization by learning architecture and bias to help readers better understand PIRL implementations. It reviews the recent literature, analyzes how physics information is integrated into the different stages of the RL pipeline, and summarizes the benchmarks used. Finally, it discusses the limitations and open problems of current PIRL research and suggests future research directions. PIRL has the potential to improve RL algorithms by increasing physical plausibility, precision, data efficiency, and real-world applicability.

Copyright 2024, Jung Yeon Lee