Curieux.JY
  • JungYeon Lee


📃 Simulation to Online RL

sim2real
online-rl
simulation
What Matters for Simulation to Online Reinforcement Learning on Real Robots
Published

May 6, 2026

  • Paper Link
  • Code
  1. 🤖 This paper empirically studies the "sim-to-online" reinforcement learning (RL) setting on three real robot platforms and identifies the key design choices for stable, efficient policy finetuning despite the mismatch between simulation and reality.
  2. 💡 The authors show that retaining simulation or prior-trial data, using warm starts, and applying asymmetric actor-critic updates (e.g., delayed actor updates) are important for mitigating policy instability and preventing unlearning.
  3. 🛠️ Validated across more than 100 real-robot training runs, these findings give practical guidance to researchers and practitioners applying online RL to real robots and reduce the engineering burden.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

This paper investigates the specific design choices that enable successful online reinforcement learning (RL) on real robots. The authors carry out more than 100 real-world training runs on three robot platforms and systematically analyze algorithmic, system, and experimental decisions that prior work handled only implicitly. The study finds that some widely used default settings can be harmful, and that robust, easy-to-apply design choices within standard RL practice enable stable learning across tasks and hardware. It is the first large-sample empirical study of these design choices, and it helps deploy online RL with less engineering effort.

1. Introduction

Despite the successes of RL in robotics, learning in most existing systems happens offline, using simulators or fixed datasets, and online learning is far from standard practice. Simulators are inevitably imperfect, and acquiring high-quality real-world pre-training data for robotics costs far more than in other domains. The study starts from the view that, as tasks become more complex, future autonomous robot systems will have to learn online through embodied interaction, continually adapting to changing environments and improving their capabilities. Prior work either demonstrates a specific idea in a narrow real-world setup or addresses less realistic settings such as learning from scratch, which can lead to unsafe and inefficient exploration. In particular, the study shows empirically that the "sim-to-online" setting, in which a policy pre-trained in simulation is finetuned on the real system, can cause instability and lead the policy to "unlearn" what it learned in simulation.

Contributions:

  1. Open-source training pipeline: The authors developed and open-sourced a pipeline that pre-trains in simulation with MuJoCo Playground [9] and then continues training online on the real robot. Its flexibility is demonstrated on three robot platforms: Franka Emika Panda (manipulation), Unitree Go1 (locomotion), and a Race Car (navigation).
  2. Released Franka Emika Panda robot stack: For the Franka Emika Panda in particular, the full robot stack is open-sourced, from the hardware interface to real-world training of vision-based policies. It relies on off-the-shelf hardware, which improves reproducibility and lowers the barrier to entry for real-world RL research.
  3. Study of stability problems and mitigation techniques: Extensive real-world experiments examine the stability problems that arise when simulation-trained policies are transferred to real robots. Retaining data from real experiments and from simulation is shown to greatly improve robustness under distribution shifts. Delaying actor updates (Fujimoto et al. [10]) is shown to improve stability further.
  4. Efficient pre-training in massively parallel simulators: The authors empirically study and demonstrate effective techniques for pre-training off-policy RL algorithms in massively parallel simulators.

2. Related Work

Earlier RL work often relied on custom hardware or proprietary software, which makes it hard to reproduce, and it emphasized algorithmic innovation rather than systematically examining the practical problems of deploying RL on real robot systems. Ibarz et al. [19] give a broad review of reproducibility issues and point out the importance of data reuse, but present no empirical evidence. Tirumala et al. [20] show the benefit of data reuse in simulated environments; this study extends that result to real robots and stresses the importance of high sample efficiency.

3. Background

3.1. Problem Setting

The study considers an infinite-horizon Markov Decision Process (MDP) with continuous state space \mathcal{S} \subset \mathbb{R}^{d_\mathcal{S}} and action space \mathcal{A} \subset \mathbb{R}^{d_\mathcal{A}}. The goal is to find a policy \pi^* that maximizes the expected sum of discounted rewards:

\pi^* \in \arg \max_{\pi \in \Pi} J(\pi) := \mathbb{E}_{\pi} \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)

where \gamma \in [0, 1) is the discount factor and \rho_0 is the initial state distribution from which s_0 is drawn. The value function V^\pi(s), the action-value function Q^\pi(s, a), and the advantage function A^\pi(s, a) are defined in the usual way.
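As a small illustration of the objective above, the following sketch computes the discounted return of one finite reward sequence; J(\pi) is the expectation of this quantity over trajectories. The function name `discounted_return` and the truncation to a finite list are assumptions made for illustration, not part of the paper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite (truncated) reward sequence."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    total = 0.0
    # Work backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```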

Episodic online learning:

Learning proceeds in finite episodes. In each episode n the agent runs policy \pi_n for T time steps, after which the robot is manually reset to an initial state s_0 \sim \rho_0(\cdot). The data from episode n, \mathcal{D}_n := \{(s_t, a_t, s_{t+1}, r_t)\}_{t=0}^{T-1}, is merged into the replay buffer \mathcal{D}_{\le n} := \bigcup_{n'=0}^n \mathcal{D}_{n'} [26, 27]. This setting requires manual resets; fully autonomous learning is left to future work.
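The episodic bookkeeping described above can be sketched as a buffer that keeps the union of all episodes. The `Transition` tuple and `ReplayBuffer` class are hypothetical names for this illustration, and the dummy values stand in for real observations.

```python
from collections import namedtuple

# One transition (s_t, a_t, s_{t+1}, r_t); field names are illustrative.
Transition = namedtuple("Transition", ["s", "a", "s_next", "r"])


class ReplayBuffer:
    """Keeps every transition of every episode: the union of D_0, ..., D_n."""

    def __init__(self):
        self.transitions = []
        self.num_episodes = 0

    def add_episode(self, episode):
        """Merge one episode D_n (a list of T transitions) into the buffer."""
        self.transitions.extend(episode)
        self.num_episodes += 1

    def __len__(self):
        return len(self.transitions)


buffer = ReplayBuffer()
for n in range(3):  # three episodes of T = 4 steps each, with dummy values
    episode = [Transition(s=t, a=0.0, s_next=t + 1, r=1.0) for t in range(4)]
    buffer.add_episode(episode)
print(len(buffer))  # 12
```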

Priors:

The study considers settings in which prior knowledge is available as a simulator or as a fixed offline dataset. In the "offline-to-online" setting, a prior policy \pi_0 is learned from a dataset \mathcal{D}_0. When the simulator is the prior, \mathcal{D}_0 denotes data generated in simulation. Because of the limited coverage of \mathcal{D}_0 or the sim-to-real gap, \pi_0 may not perform optimally on the real system, so additional real-world data is needed.

3.2. Online Transfer

Sample efficiency:

For many robot tasks, the successful pipeline combines massively parallel simulators [9, 35], domain randomization [36], and a model-free on-policy method such as PPO [37]. Simulators, however, struggle to model contact-rich tasks and vision-based tasks in complex scenes accurately, so adaptation in the real world is essential. Online training is bound to real-time execution, which makes sample efficiency important. On-policy methods use only data collected by the current policy and discard earlier experience, which limits their sample efficiency and makes them less practical on real robots.

Off-policy learning:

In contrast, off-policy algorithms [38, 8, 39, 10, 40] retain past data and can even reuse data from other experiments run with suboptimal hyperparameters, which yields large gains in sample efficiency. Off-policy algorithms operate as approximate policy iteration. The action-value function Q^\pi_\varphi is learned with a Bellman backup:

\ell(\varphi) := \mathbb{E}_{(s_t, a_t, s_{t+1}, r_t) \sim \mathcal{D}_{\le n}} \frac{1}{2} \left\| Q^{\pi_n}_\varphi (s_t, a_t) - y \right\|^2

where y = r_t + \gamma \bar{V}^{\pi_n}(s_{t+1}), with \bar{V}^{\pi_n}(s_{t+1}) \approx \bar{Q}^{\pi_n}(s_{t+1}, a_{t+1}) and a_{t+1} \sim \pi_n(\cdot|s_{t+1}). \bar{Q}^{\pi_n} is a target network that tracks an earlier copy of Q^{\pi_n}_\varphi through Polyak averaging [38]:

\varphi^{\text{target}}_{k+1} = (1 - \tau) \varphi^{\text{target}}_k + \tau \varphi_k, \quad k = 0, \ldots, K

In the policy improvement step, a policy is extracted from Q^{\pi_n}_\varphi. Kakade and Langford [44] show that the cumulative performance gain after N greedy policy updates is bounded from below:

J(\pi_N) - J(\pi_0) \ge \sum_{n=0}^{N-1} \mathbb{E}_{\pi_{n+1}} \left[ \sum_{t=0}^{\infty} \left( \gamma^t \underbrace{A^{\pi_n}(s_t, a_t)}_{\text{Greedy policy improvement}} - \underbrace{2\gamma^t |\epsilon(s_t, a_t)|}_{\text{Approximation and modeling errors}} \right) \right]

where \epsilon(s, a) is the error in Q^{\pi_n}_\varphi caused by estimation, function approximation, or model mismatch.
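To make the critic objective concrete, here is a minimal scalar sketch of the one-step target y and the squared Bellman error averaged over a minibatch. It uses plain floats in place of networks; `td_target` and `critic_loss` are illustrative names, not the paper's code.

```python
def td_target(reward, next_value, gamma):
    """One-step Bellman target y = r_t + gamma * V_bar(s_{t+1})."""
    return reward + gamma * next_value


def critic_loss(q_values, targets):
    """Minibatch mean of 0.5 * (Q(s_t, a_t) - y)**2."""
    if len(q_values) != len(targets):
        raise ValueError("q_values and targets must have the same length")
    return sum(0.5 * (q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


y = td_target(reward=1.0, next_value=2.0, gamma=0.5)  # 1 + 0.5 * 2 = 2.0
print(critic_loss([2.0, 0.0], [y, y]))                # (0 + 0.5 * 4) / 2 = 1.0
```

In a real implementation next_value would come from the target network evaluated at an action sampled from the current policy, as described above.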

Distribution shift and the "downward spiral":

Distribution shift is inherent in offline and sim-to-online settings. The initial policy \pi_0 was trained to be optimal under the simulator dynamics p_0, but once it is deployed in the real world, the data collected by following \pi_0 can contain (s, a) pairs that induce large errors \epsilon(s, a) in Q^{\pi_n}_\varphi. When the sim-to-real gap is large, these errors accumulate across episodes, overwhelm the policy improvement term, and can cause "unlearning", where \pi_N performs worse than \pi_0.

4. Stabilizing Learning Under Deployment Shifts

The study presents three key techniques for stabilizing learning under the sim-to-online deployment shift. Among off-policy algorithms it focuses on Soft Actor-Critic (SAC) [8].

  1. Data retention: The distribution from which samples are drawn when updating Q^{\pi_n}_\varphi matters. If \mathcal{D}_{\le n} over-represents transitions with large approximation error, the updates can become biased. Because the pre-trained critic was originally fit on \mathcal{D}_0, its approximation error on that data is smaller, which motivates keeping \mathcal{D}_0 as a stabilizing prior. Tirumala et al. [20] and Ball et al. [45] use two buffers (\mathcal{D}_0 and \mathcal{D}_{\text{online}} := \mathcal{D}_{\le n} \setminus \mathcal{D}_0) and sample minibatches as follows:

     (s_t, a_t, s_{t+1}, r_t) \sim (1 - \alpha)\text{Unif}(\mathcal{D}_0) + \alpha\text{Unif}(\mathcal{D}_{\text{online}}), \quad \alpha \in [0, 1]

     This study extends the scheme by annealing \alpha \to 1.

  2. Warm starts: If \mathcal{D}_0 cannot be retained during online learning [17], it is approximated by collecting data with the initial policy \pi_0. Such warm-start collection is already standard in off-policy RL [8], and Zhou et al. [17] show that it is important for mitigating instability in offline-to-online RL.

  3. Asymmetric updates: Off-policy algorithms often define an update-to-data (UTD) ratio \eta := K/T, the number of actor and critic gradient updates per real-world transition [49, 50]. Raising \eta improves sample efficiency but can amplify approximation error and overfitting [51]. To mitigate this, the actor's learning rate is reduced and its updates are interleaved less often (only at steps k = M, 2M, 3M, \ldots, K). The idea was introduced by Fujimoto et al. [10] and helps stabilize learning in the high-UTD regime. (Figure 4)
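The two-buffer sampling from the data-retention item can be sketched as below. The mixture rule follows the formula in the text; the linear annealing schedule, the starting value `alpha_start=0.5`, and all function names are assumptions for illustration, since the text only states that \alpha is annealed toward 1.

```python
import random


def anneal_alpha(step, total_steps, alpha_start=0.5):
    """Move alpha linearly from alpha_start to 1.0 (the schedule shape is assumed)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return alpha_start + (1.0 - alpha_start) * frac


def sample_minibatch(d0, d_online, alpha, batch_size, rng):
    """Draw each element from d_online with probability alpha, otherwise from d0.

    Falls back to d0 while d_online is still empty; d0 must be non-empty
    whenever it can be selected.
    """
    batch = []
    for _ in range(batch_size):
        use_online = bool(d_online) and rng.random() < alpha
        batch.append(rng.choice(d_online if use_online else d0))
    return batch


rng = random.Random(0)
d0 = [("sim", i) for i in range(100)]        # retained prior data
d_online = [("real", i) for i in range(10)]  # data collected on the robot
batch = sample_minibatch(d0, d_online, anneal_alpha(0, 100), 256, rng)
print(sum(tag == "real" for tag, _ in batch) / len(batch))  # near 0.5 at the start
```

Sampling per element rather than splitting the batch into fixed halves keeps the expected online share equal to \alpha for any batch size.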

5. Experiments

The authors evaluate the effect of these design choices on three real robots.

5.1. Learning \pi_0 in Simulation

Scaling Soft Actor-Critic:

Most SAC implementations perform a single actor-critic update per parallel environment step, which effectively lowers the UTD ratio \eta as the number of parallel environments N_e grows. The study finds that the key to scaling SAC in massively parallel simulation is to increase \eta in proportion to N_e. Using N_e \sim 1000 environments is important for robust transfer. (Section A)
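The arithmetic behind this observation can be sketched as follows, under one reading of the text: each parallel step yields N_e transitions, so a fixed number of gradient updates per step gives a UTD of updates / N_e. The target value 1/16 below is a hypothetical number chosen only to make the example concrete.

```python
import math


def effective_utd(updates_per_step, num_envs):
    """Gradient updates per collected transition when num_envs step in parallel."""
    return updates_per_step / num_envs


def updates_for_target_utd(target_utd, num_envs):
    """Updates per parallel step needed to reach target_utd (rounded up, at least 1)."""
    return max(1, math.ceil(target_utd * num_envs))


for n_e in (1, 128, 1024):
    # With one update per step, the effective UTD shrinks as 1 / N_e.
    print(n_e, effective_utd(1, n_e), updates_for_target_utd(1 / 16, n_e))
```

With a single update per step the effective UTD at N_e = 1024 is about 0.001, which is why the number of updates has to grow with N_e to keep the ratio from collapsing.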

Sim-to-real gap:

The prior policies \pi_0 for the Franka Emika Panda and Unitree Go1 robots were trained with MuJoCo Playground [9]. The Race Car dynamics follow the model of Kabzan et al. [55]. (i) In the Franka Emika Panda setup, camera viewpoint, lighting, and field of view are randomized to improve robustness to visual variation (Section C). In simulation the policy detects and approaches the cube successfully, but on the real robot it often fails to grasp or lift it. The main causes are unmodeled contact dynamics between the gripper and the cube and the mismatch between rendered and real visual observations. (ii) For the quadruped, a constrained prior policy was trained by limiting the range of commanded linear and angular velocities during simulation. (iii) In the Race Car environment, motor parameters, tire friction, and car mass are sampled to improve sim-to-real transfer.

5.2. Real-World Results

Recycling data accelerates learning:

This experiment studies how retaining data from earlier experiments affects learning performance. Each experiment consists of four trials that share the same random seed. Within a trial, training uses only the online data collected in \mathcal{D}_{\text{online}}. In each subsequent trial, the online replay buffer of the previous trial is loaded into \mathcal{D}_0 and a fresh replay buffer \mathcal{D}_{\text{online}} is started. Figure 8 shows that performance improves as more data is retained.

Warm starts as a proxy for data retention:

Instead of loading \mathcal{D}_0, a frozen copy of \pi_0 is used to pre-fill \mathcal{D}_{\text{online}} for N^* iterations. For the Franka Emika Panda and Unitree Go1, 5000 transitions were collected, corresponding to N^* = 20 and N^* = 5 respectively. For the Race Car, 1250 transitions (N^* = 5 episodes) were used. Figure 9 shows that the Franka Emika Panda learns successfully even without a warm start, whereas performance degrades substantially without one on the Unitree Go1 and the Race Car.

Asymmetric updates are critical for stability:

This experiment analyzes the importance of more conservative actor updates interleaved with more frequent critic updates. Specifically, the actor is updated once every 20 critic updates, with a reduced learning rate (Section F). This is compared against a baseline that updates the actor at every critic step and uses a shared learning rate for actor and critic. Figure 10 shows that on all robots the baseline fails to improve because of training instability, whereas asymmetric updates enable efficient transfer.
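A minimal sketch of the delayed-actor schedule, using the figures quoted in this review (one actor update per 20 critic updates, 1250 updates per episode). The function only produces the schedule; the actual gradient steps are omitted, and `update_schedule` is a hypothetical name.

```python
def update_schedule(num_updates, actor_every):
    """For k = 1..num_updates, list which networks are updated at step k."""
    schedule = []
    for k in range(1, num_updates + 1):
        nets = ["critic"]         # the critic is updated at every step
        if k % actor_every == 0:  # the actor only at k = M, 2M, 3M, ...
            nets.append("actor")
        schedule.append(nets)
    return schedule


schedule = update_schedule(num_updates=1250, actor_every=20)
critic_updates = sum("critic" in nets for nets in schedule)
actor_updates = sum("actor" in nets for nets in schedule)
print(critic_updates, actor_updates)  # 1250 62
```

The baseline in the comparison corresponds to `actor_every=1`, where both networks move at every step.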

6. Conclusion

The study presents a large-scale empirical investigation of finetuning simulation-trained RL priors directly on hardware, across three robot platforms. Based on the results, it offers guidance that makes online RL more accessible to RL researchers and practitioners. The experiments show that, despite training instability caused by distribution shift, standard off-policy algorithms can finetune policies effectively without major modification, including on vision-based tasks with sparse rewards. The study also stresses that reusing data offers a way to scale efficiently to more complex tasks.

These findings help make online RL more practical, but they also raise several important research questions. How should samples be selected from the offline data \mathcal{D}_0 to best improve online sample efficiency? How can data be reused effectively across different tasks? Are there better regularization strategies that enable faster learning? Finally, the study focuses on a semi-automated episodic setting in which human intervention is still required for manual resets and safety. Developing practical algorithmic solutions that enable fully autonomous learning is a promising direction for future work.

Appendix

A. Off-Policy Training in Massively-Parallel Simulators

Most off-policy algorithms were designed for a setting in which trajectories are collected sequentially from a single environment. A major advance in RL, however, is the ability to accelerate training by rolling out thousands of simulated trajectories in parallel. That success has relied mainly on on-policy algorithms. Off-policy methods are more sample efficient, but they need subtle yet non-trivial modifications to scale effectively in parallel simulation [41]. The study shows that SAC is effective with minimal modification and enables a unified transfer from large-scale simulation to real-world finetuning.

Scale matters:

Using too few domain-randomized environments (N_e) leads to poor transfer to the real robot, even when SAC converges to a seemingly good policy in simulation. Figure 11 shows that a policy trained with N_e = 128 is less stable and earns markedly lower reward when deployed in the real world. This indicates that large-scale domain-randomized environments (N_e \sim 10^3) are essential for robust sim-to-real transfer.

B. More Experiments

Zero-shot performance compared to PPO:

The study focuses mainly on off-policy algorithms because of their better sample efficiency during online training. This experiment verifies that the drop in zero-shot deployment performance of SAC on the real system is caused by the sim-to-real gap rather than by the choice of algorithm. Figure 12 shows the zero-shot performance of PPO: both algorithms start with low performance because of the sim-to-real gap and improve through online learning.

Sim-to-sim with TD3:

Additional experiments are provided for TD3 [10], a state-of-the-art off-policy RL algorithm. TD3 delays policy updates by default (M = 2 is the default hyperparameter). Figure 13 shows that TD3 exhibits transfer dynamics similar to those of SAC.

Initial mixing \alpha:

This experiment evaluates how the initial value of \alpha affects learning stability and performance. Figure 14 shows that good performance is obtained as long as offline data is used at the start of training and online data dominates later in training.

Retaining simulation data:

This experiment examines the effect of retaining data collected in simulation during online learning and compares it with the warm-start setup of Zhou et al. [17]. Figure 15 shows that retaining simulation data greatly improves both learning efficiency and stability.

C. Franka Emika Panda

The simulated and real environments are built on the PandaPickCubeCartesian task from MuJoCo Playground [9]. The agent observes a 64x64 grayscale image together with the end-effector position and the gripper opening. The task is a manipulation task, and success means the cube ends up within 0.05 m of the target position. Figure 16 shows the domain-randomized environments.

D. Unitree Go1

The FlatTerrainGo1Joystick environment from MuJoCo Playground is used [9]. Unlike Zakka et al. [9], this study uses a narrower range of velocity commands in simulation, [0.5, 0.8, 1.2], which makes transfer more challenging. Figure 17 shows trajectories after training, with improved stability.

E. Race Car

The car dynamics are simulated with the model of Kabzan et al. [55]. Car drift is hard to model accurately, and because of the resulting sim-to-real gap the car tends to overshoot the goal position (Figure 18). The agent observes the vehicle state and outputs a continuous 2D action (steering, throttle). The reward is defined as:

r_t(s_t, a_t) := d_{t-1} - d_t + \mathbf{1}[d_t \le \epsilon] - \lambda_c \|a_t\|_2^2 - \lambda_l \|a_t - a_{t-1}\|_2^2

where d_t = \|\mathbf{x}_t - \mathbf{x}_{\text{goal}}\|_2 is the Euclidean distance to the goal, \mathbf{1}[d_t \le \epsilon] is a bonus for being within \epsilon = 0.3 meters of it, \lambda_c penalizes control effort, and \lambda_l penalizes changes in the action.
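The reward above translates directly into code. In this sketch the penalty weights `lambda_c` and `lambda_l` default to 0.01, which is a placeholder chosen for illustration because the review does not quote their values; only \epsilon = 0.3 comes from the text.

```python
import math


def race_car_reward(pos, prev_pos, goal, action, prev_action,
                    eps=0.3, lambda_c=0.01, lambda_l=0.01):
    """Progress toward the goal + goal bonus - control cost - action-change cost."""
    d_prev = math.dist(prev_pos, goal)         # d_{t-1}
    d_now = math.dist(pos, goal)               # d_t
    bonus = 1.0 if d_now <= eps else 0.0       # indicator 1[d_t <= eps]
    control_cost = lambda_c * sum(a * a for a in action)
    change_cost = lambda_l * sum((a - b) ** 2 for a, b in zip(action, prev_action))
    return (d_prev - d_now) + bonus - control_cost - change_cost


# Moving from 2 m to 1 m away with zero action: progress only.
print(race_car_reward((1.0, 0.0), (2.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0)))  # 1.0
```

The progress term d_{t-1} - d_t is positive when the car gets closer and negative when it overshoots and moves away again, which is the failure mode described for the real car.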

F. Implementation Details

Hyperparameters:

Unless stated otherwise, the actor uses a learning rate of 10^{-5} and is updated once every 20 critic updates. All robots use 1250 updates per episode, which gives \eta = 5 for the Franka Emika Panda and the Race Car and \eta \approx 1 for the Unitree Go1.
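These UTD values can be cross-checked against the warm-start figures quoted in Section 5.2. The sketch below infers the episode length T as transitions divided by episodes and then applies \eta = K/T. Note that the episode lengths are an inference from the quoted numbers, not values stated in the review, and `implied_utd` is a hypothetical helper.

```python
def implied_utd(warm_start_transitions, warm_start_episodes, updates_per_episode=1250):
    """Infer episode length T and the UTD ratio eta = K / T from quoted figures."""
    steps_per_episode = warm_start_transitions / warm_start_episodes   # T
    return steps_per_episode, updates_per_episode / steps_per_episode  # (T, eta)


# (warm-start transitions, warm-start episodes N*) as quoted in Section 5.2
quoted = {
    "Franka Emika Panda": (5000, 20),
    "Unitree Go1": (5000, 5),
    "Race Car": (1250, 5),
}
for robot, (transitions, episodes) in quoted.items():
    T, eta = implied_utd(transitions, episodes)
    print(f"{robot}: T = {T:.0f}, eta = {eta:.2f}")
```

Under this inference the Panda and the Race Car come out at \eta = 5 and the Go1 at 1.25, which is consistent with the stated \eta \approx 1.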

Pitfalls:

Early in development, several subtle issues were observed that strongly affect learning dynamics and final performance:

  1. Optimizer state not restored: Restoring only the model weights and not the optimizer state (momentum, second-moment estimates, learning-rate scheduler, and so on) changes the optimizer dynamics and can make learning behave very differently.
  2. Critic restored without its target network: Loading only Q^{\pi_n}_\varphi and not the target network \bar{Q}^{\pi_n} produces inconsistent targets, which can make the critic and the actor forget what they have learned.
  3. SAC temperature \alpha (and its optimizer) not restored: The temperature \alpha (not to be confused with the buffer mixing ratio \alpha in Section 4) changes during pre-training. If its value and optimizer state are not restored, the scale of the entropy bonus in the actor and critic updates changes, which can cause instability.
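The three pitfalls amount to a checklist of what a checkpoint must contain before finetuning starts. The sketch below checks a plain dictionary for those entries. The key names are hypothetical and would need to match whatever the training code actually saves.

```python
# Hypothetical key names; adapt them to the checkpoint format in use.
REQUIRED_KEYS = (
    "actor_params", "actor_opt_state",
    "critic_params", "critic_opt_state",
    "target_critic_params",          # pitfall 2: target network
    "log_alpha", "alpha_opt_state",  # pitfall 3: SAC temperature and its optimizer
)


def missing_checkpoint_keys(checkpoint):
    """Return the required entries that are absent (or None) in a checkpoint dict."""
    return [key for key in REQUIRED_KEYS if checkpoint.get(key) is None]


def load_for_finetuning(checkpoint):
    """Refuse to start finetuning from an incomplete checkpoint."""
    missing = missing_checkpoint_keys(checkpoint)
    if missing:
        raise KeyError("checkpoint is missing: " + ", ".join(missing))
    return {key: checkpoint[key] for key in REQUIRED_KEYS}


weights_only = {"actor_params": [0.1], "critic_params": [0.2]}
print(missing_checkpoint_keys(weights_only))
```

Failing loudly at load time is a deliberate choice here: each of these omissions still lets training run, so the damage would otherwise show up only as degraded learning curves.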

Synchronous updates:

Standard off-policy algorithms are usually implemented so that an actor-critic update happens after every real-world transition. Gradient computation, however, tends to be slower than the real-time control period, which is especially problematic at high UTD. The study proposes an asynchronous, episodic, batch-style learning scheme that decouples data collection from optimization.

🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Introduction: The Simulator's Promise and Betrayal

Every roboticist has lived this scene at least once. In the simulator the policy works perfectly. Every reward curve climbs up and to the right, and the evaluation videos are clean. Now put it on the physical robot. It falls apart from the first episode. What comes next is sadder. You tell yourself, "It's fine, online finetuning will bring it back," and leave training running, but performance gets worse instead of recovering. The policy starts forgetting even what it had learned well in simulation.

This paper (Yarden As et al.), released in February 2026 by a team from ETH Zürich and Google DeepMind, tackles exactly this phenomenon head-on. The authors do not perform algorithmic stunts or propose a new loss function. Instead they run more than 100 real-hardware training experiments on three robot platforms and systematically ablate design decisions in the standard off-policy RL pipeline that should have been common knowledge but were only passed along implicitly. The conclusion is elegant: there is no need to invent a new algorithm, because applying three simple techniques properly (data retention, warm starts, asymmetric updates) is enough to make sim-to-real finetuning work reliably.

This post takes that prescription apart at a roboticist's eye level: the intuition for why each step is needed, what the equations mean, what the experimental results imply, and, most importantly, what to check when putting it on your own robot.

Note: One-line summary

The main enemy of sim-to-online RL is the "downward spiral." To stop it: (1) don't throw away prior data, (2) soften the distribution shock with a warm-up, and (3) update the actor more slowly than the critic. That is almost all there is to it.


Problem Setting: A New Frame Called "Sim-to-Online"

The first thing the authors do is sort out the terminology. Let us line up the expressions we commonly use.

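+-------------------+-----------------------------------------------------+---------------------------------------------------------+
| Term              | Meaning                                             | Limitation                                              |
+-------------------+-----------------------------------------------------+---------------------------------------------------------+
| Sim-to-Real       | Train in simulation, deploy zero-shot on hardware   | Performance hits a ceiling when the gap is large        |
| Offline-to-Online | Train on a fixed real dataset, finetune on hardware | High-quality prior data is expensive to obtain          |
| Sim-to-Online     | Pre-train in simulation, keep learning on hardware  | Unstable under distribution shift (this paper's target) |
+-------------------+-----------------------------------------------------+---------------------------------------------------------+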

The third is the setting this paper defines. Running RL on hardware from scratch is unsafe and expensive. Learning only in simulation and stopping there has clear limits. So the natural compromise is to build a good prior policy \pi_0 in simulation and then continue training it on hardware. As we will soon see, though, that "continue" is the truly tricky part.

Experimental Platforms at a Glance

The three robots the authors use deliberately represent different kinds of difficulty.

+------------------+----------------------+----------------------+----------------------+
|     Platform     |   Franka Panda       |    Unitree Go1       |     Race Car         |
+------------------+----------------------+----------------------+----------------------+
| Task             | Pick & lift cube     | Joystick locomotion  | Park at goal         |
| Observation      | 64x64 grayscale img  | proprioceptive       | 2D pose + velocity   |
|                  | + EE pose + gripper  |                      |                      |
| Action dim       | 4 (dx,dy,dz,grip)    | 12 joint positions   | 2 (steer, throttle)  |
| Control rate     | episodic / step      | high-rate locomotion | 60 Hz                |
| Sim-to-real gap  | contact + visuals    | friction             | tire/drift dynamics  |
| Why hard         | Vision-based RL      | Stable gait transfer | Fast, agile dynamics |
+------------------+----------------------+----------------------+----------------------+

On all three robots the simulation prior \pi_0 scores nearly perfectly in simulation, but when it is dropped onto hardware zero-shot its performance plunges to roughly 30–60% (see Figure 6 of the paper). This is the gap we have to close.


Quick review: the mathematical skeleton of off-policy RL

To understand the prescription you need to know the patient's anatomy. The paper uses SAC (Soft Actor-Critic) as its standard tool, but the core logic applies unchanged to any off-policy actor-critic algorithm.

Learning the action-value function

The critic is trained by minimizing the following loss:

\ell(\varphi) = \mathbb{E}_{(s_t, a_t, s_{t+1}, r_t)\sim\mathcal{D}_{\le n}} \left[ \tfrac{1}{2}\Big(Q^{\pi_n}_\varphi(s_t,a_t) - y\Big)^2 \right]

The target is a one-step Bellman backup: y = r_t + \gamma \bar{V}^{\pi_n}(s_{t+1}). Here \bar{V} is evaluated with the target network, a mirror-like copy that trails slowly behind via Polyak averaging:

\varphi^{\text{target}}_{k+1} = (1-\tau)\varphi^{\text{target}}_k + \tau \varphi_k

Here is the intuition. Without a target network, the critic learns by chasing its own shadow. Just as two mirrors facing each other produce infinite reflections, the estimates risk diverging. Choosing a small \tau so that the target follows slowly makes the regression target look fixed for a while, which stabilizes learning.
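The critic loss, the one-step target, and the Polyak update above can be sketched in a few lines. This is a minimal illustration with plain Python lists rather than neural networks; the function names `td_target`, `critic_loss`, and `polyak_update`, and the value `tau=0.005`, are assumptions made for this example and are not taken from the paper's code.

```python
def td_target(reward, gamma, next_value):
    """One-step Bellman backup y = r + gamma * V_bar(s')."""
    return reward + gamma * next_value


def critic_loss(q_values, targets):
    """Minibatch mean of 0.5 * (Q - y)^2."""
    return sum(0.5 * (q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


def polyak_update(target_params, online_params, tau):
    """Return (1 - tau) * target + tau * online, element by element."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, online_params)]


# With a small tau the target parameters move only slightly per update.
target = [0.0, 0.0]
online = [1.0, -2.0]
for _ in range(3):
    target = polyak_update(target, online, tau=0.005)
```

After three updates the target has moved only about 1.5% of the way toward the online parameters, which is the "slowly trailing mirror" behavior described above.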

Policy improvement and the Kakade-Langford inequality

Once the critic has settled to some degree, the actor moves to maximize it. The classic result of Kakade and Langford (2002) gives the following lower bound on the cumulative performance improvement of greedy policy updates:

J(\pi_N) - J(\pi_0) \;\ge\; \sum_{n=0}^{N-1}\mathbb{E}_{\pi_{n+1}}\!\!\left[\sum_{t=0}^{\infty}\gamma^t\Big(\underbrace{A^{\pi_n}(s_t,a_t)}_{\text{policy improvement}} - \underbrace{2\gamma^t |\epsilon(s_t,a_t)|}_{\text{approximation error}}\Big)\right]

This expression is a small parable that runs through the whole paper. The guaranteed improvement is the sum of advantages minus the sum of estimation errors. For it to be positive, the advantage signal we estimate must exceed the noise. If it does not, that is, if the error term is larger, the bound no longer guarantees improvement and the policy can get worse the more it trains. Intuitively, this is why the "downward spiral" occurs.


Diagnosis: the mechanism of the downward spiral

One of the paper's clearest contributions is that it explains the training failures frequently observed in sim-to-online with a single mechanism. Let's draw it.

flowchart LR
    A["Deploy: run π_n<br/>collect real-world data"] --> B["High-error (s,a) pairs<br/>pile up in the replay buffer"]
    B --> C["Critic evaluation:<br/>Q-values in high-error regions<br/>are misestimated (overestimated)"]
    C --> D["Policy improvement:<br/>maximize the wrong Q<br/>→ move into riskier regions"]
    D --> A
    style A fill:#fff4e6
    style B fill:#ffe6e6
    style C fill:#ffe6e6
    style D fill:#ffe6e6

An analogy makes the cycle easy to grasp: suppose a student prepares for an exam with a flawed textbook. At first the answers are slightly wrong. The student then takes those wrong answers and reviews again with the same textbook. On the second exam the student writes even more wrong answers with even more confidence. Each cycle, confidence (the value estimate) goes up while the distance from the correct answer (the true value) grows. This is exactly what runaway action-value estimation under distribution shift looks like.

The paper artificially injects a mild dynamics mismatch into the simulated Race Car and observes this phenomenon directly (Figure 3). In unstable runs, the distribution of Q^{\pi_n}_\varphi - Q^{\pi_n}_{\text{MC}} (a comparison against the true value measured by Monte Carlo) grows steadily heavier on the positive side over time; that is, the more the agent trains, the more confidently it overestimates. In stable runs the distribution stays tightly clustered near 0.
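The diagnostic above compares the critic's estimate with a Monte Carlo return. A minimal sketch of that comparison for a single finished episode follows. The function names are hypothetical, and the sketch ignores details a real SAC evaluation would need (entropy bonus terms in the return, truncated episodes), so it illustrates the idea rather than reproducing the paper's measurement.

```python
def discounted_returns(rewards, gamma):
    """Monte Carlo return G_t = sum_k gamma^k * r_{t+k} for every step of one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def overestimation_gaps(q_estimates, rewards, gamma):
    """Per-step gap Q_phi(s_t, a_t) - G_t; positive values mean overestimation."""
    mc = discounted_returns(rewards, gamma)
    return [q - g for q, g in zip(q_estimates, mc)]
```

Collecting these gaps over many episodes and plotting their histogram at several points during training gives the kind of picture described above: a mass drifting toward positive values signals a critic that is becoming overconfident.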

Tip: The intuition in brief

The essence of the downward spiral is "a self-reinforcing loop in which value overestimation under distribution shift pushes the policy into worse regions, and data from those regions then mistrains the value function again." There are essentially only two ways to break it: (a) keep the value function from being swayed by bad data, or (b) move the policy slowly until the value function has stabilized. All three of the paper's prescriptions are variations on these two axes.


Prescription 1: Don't throw data away carelessly (Data Retention)

This is the simplest prescription and one of the most effective. Keep using the data \mathcal{D}_0 collected in simulation after real-world training starts instead of discarding it.

In equations, it looks like this. Keep two buffers:

  • \mathcal{D}_0: data collected during simulation pretraining (or real-world data from a previous trial)
  • \mathcal{D}_{\text{online}}: data currently being collected on hardware

Minibatches are mixed as follows:

(s_t,a_t,s_{t+1},r_t) \sim (1-\alpha)\, \text{Unif}(\mathcal{D}_0) + \alpha\, \text{Unif}(\mathcal{D}_{\text{online}}), \quad \alpha\in[0,1]

The authors' key variation is to anneal \alpha over time. Training starts with \alpha around 0.5, so that half of each batch is simulation data, and as training progresses \alpha\to 1, so that eventually only real-world data is used.
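The mixture sampling and the annealing of \alpha can be sketched as follows. The linear schedule and the names `alpha_schedule`, `sample_mixed_batch`, and `anneal_steps` are assumptions made for illustration; the text above only states that \alpha starts near 0.5 and goes to 1, not the exact shape of the schedule.

```python
import random


def alpha_schedule(step, anneal_steps, alpha_start=0.5, alpha_end=1.0):
    """Linearly anneal the online-data fraction alpha (assumed schedule shape)."""
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return alpha_start + frac * (alpha_end - alpha_start)


def sample_mixed_batch(d0, d_online, batch_size, alpha, rng):
    """Draw each transition from d_online with probability alpha, otherwise from d0."""
    batch = []
    for _ in range(batch_size):
        source = d_online if rng.random() < alpha else d0
        batch.append(rng.choice(source))
    return batch


rng = random.Random(0)
d0 = [("sim", i) for i in range(100)]          # stand-in for simulation transitions
d_online = [("real", i) for i in range(10)]    # stand-in for real-world transitions
batch = sample_mixed_batch(d0, d_online, 8, alpha_schedule(0, 1000), rng)
```

Sampling per transition, rather than fixing the split of each batch, matches the mixture distribution written above; a fixed split with the same proportions would be a lower-variance alternative.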

Why this works

Look at the Bellman loss again. The minibatch distribution is the weighting distribution of the learning signal. If transitions with large error |\epsilon(s,a)| are over-represented in \mathcal{D}_{\le n}, the critic updates are dragged toward those points. If the policy then moves further into that region, the distribution shift grows: the downward spiral.

Conversely, the critic is already well trained on \mathcal{D}_0, so \epsilon is small there on average. Mixing \mathcal{D}_0 into the minibatch therefore drops an "anchor" on the learning signal coming from risky regions. Simulation data is not the perfect answer, but it is a stable signal. Mixing a stable signal with an inaccurate one in the right proportion keeps the critic from being swept away suddenly.

However, simulated and real dynamics differ, so \mathcal{D}_0 should not be used forever; ultimately the optimization has to happen on the real system. That is where annealing comes in: stability early, accuracy late. It is like learning a foreign language: at first you keep a native-language dictionary at your side, but eventually you have to think like a speaker of that language.

Experimental results

Figure 8 of the paper shows the effect of data retention cleanly. Four trials are run back to back with the same random seed, each time loading the previous trial's \mathcal{D}_{\text{online}} as the new trial's \mathcal{D}_0. The results:

  • Franka Panda: pickup failures are frequent in trial 0, but by around trial 3 the success rate is nearly perfect
  • Unitree Go1: falls often in trial 0, but reaches a stable gait through cumulative training
  • Race Car: often misses the goal at first, but later parks quickly and precisely

In the authors' words, after roughly 10 minutes of training (including hardware resets and network communication overhead) the Franka Panda reaches a near-perfect success rate, and it does so with sparse grayscale vision input in both simulation and on hardware.

Warning: Practical tip

"Reusing the previous trial's data in the next trial" is not a mere implementation detail. It means that when you train on a single robot over several days, you do not start from scratch each time. If your setup does not dump the replay buffer to disk at the end of every trial, it is worth changing the code right now.
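As a rough illustration of what dumping and reloading the buffer could look like, here is a minimal sketch using Python's `pickle`. The function names and the file layout are hypothetical; a real replay buffer holding images would more likely use an array format, and `pickle` files should only be loaded from trusted sources.

```python
import pickle
from pathlib import Path


def save_buffer(transitions, path):
    """Write the transitions to a temp file, then rename it over the target path."""
    path = Path(path)
    tmp = path.with_suffix(path.suffix + ".tmp")
    with open(tmp, "wb") as f:
        pickle.dump(transitions, f)
    tmp.replace(path)  # avoids leaving a half-written buffer if the process dies


def load_buffer(path):
    """Return the stored transitions, or an empty list if no earlier trial exists."""
    path = Path(path)
    if not path.exists():
        return []
    with open(path, "rb") as f:
        return pickle.load(f)
```

At the start of a trial, `load_buffer` would supply the retained data (the role of \mathcal{D}_0), and `save_buffer` would be called when the trial ends.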


Prescription 2: Warm Starts (securing a critical mass of data)

There are situations where retaining data is hard, for example when the simulation pretraining data is too large (hundreds of millions of transitions from a parallel simulator) or when storage constraints make it awkward to keep on disk. Zhou et al. (2025) showed that, in the offline-to-online setting, stable fine-tuning is possible without retaining the prior data. This paper brings that idea to sim-to-online.

The method is simple. Right after the policy is deployed on hardware, hold off on learning updates and simply roll out \pi_0 for N^* episodes. The data collected during this phase seeds \mathcal{D}_{\text{online}}. Only after that do the actor-critic updates begin.

In pseudocode:

WarmStartPhase:                 # no parameter updates
    for n in 1..N*:
        rollout pi_0 on real robot
        store transitions in D_online

LearningPhase:                  # standard SAC begins
    for n in N*+1..N:
        rollout pi_n
        store transitions
        update Q_phi using Eq.(3)
        update pi every M critic steps
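The two phases above can be turned into a small runnable skeleton. This is only a sketch: `rollout` and `update` are hypothetical callables standing in for the robot interface and the SAC update, and the single update per episode is a simplification of the inner update loop.

```python
def run_trial(rollout, update, pi0, num_episodes, warm_start_episodes):
    """Collect data every episode; start learning only after the warm-start phase.

    rollout(policy) -> list of transitions; update(policy, buffer) -> new policy.
    Both callables are placeholders supplied by the caller.
    """
    d_online, policy, num_updates = [], pi0, 0
    for episode in range(1, num_episodes + 1):
        d_online.extend(rollout(policy))       # always store transitions
        if episode > warm_start_episodes:      # no parameter updates before N*
            policy = update(policy, d_online)
            num_updates += 1
    return policy, d_online, num_updates


# Toy usage: the "policy" is an integer version counter.
policy, data, n_updates = run_trial(
    rollout=lambda p: [p] * 3,
    update=lambda p, buf: p + 1,
    pi0=0,
    num_episodes=5,
    warm_start_episodes=2,
)
```

In the toy run, the first nine stored transitions all come from the unchanged prior policy, which is the anchoring data the warm start is meant to provide.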

In the paper's experiments:

  • Franka Panda: N^* = 20 episodes (about 5000 transitions)
  • Unitree Go1: N^* = 5 episodes (about 5000 transitions)
  • Race Car: N^* = 5 episodes (about 1250 transitions)

Why this works

What a warm start does is essentially a "mini version" of data retention. If the simulation data cannot be brought along to the hardware, quickly generate data on hardware that follows the \pi_0 distribution and use that as the anchor. By the time the first actor-critic update happens, \mathcal{D}_{\text{online}} already contains data from the region where the policy works well. The distribution shift is therefore not explosive from the very first gradient step.

An interesting result: on the Franka Panda, training works well even without a warm start (Figure 9). Because the reward of the pick-and-place task is very sparse, the information value of the warm-up appears to be relatively low. By contrast, on Unitree Go1 and Race Car training nearly fails without the warm-up. It is notable that the strength of the prescription depends on the characteristics of the task.

Data retention vs. warm start: which to use when

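+------------------------------------------------+-----------------------------------+
| Condition                                      | Recommendation                    |
+------------------------------------------------+-----------------------------------+
| Sim data can be stored; dynamics gap is modest | Data retention (\alpha annealing) |
| Sim data too large, or gap very large          | Warm start                        |
| Sparse reward + decent zero-shot performance   | Both may have only a weak effect  |
| Dense reward + fast dynamics                   | Apply both                        |
+------------------------------------------------+-----------------------------------+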

Figure 15 in the paper's appendix shows that retaining simulation data makes training more stable and faster than using a warm start alone. In other words, if data retention is possible it comes first; when it is not, a warm start is a reasonable second best.


Prescription 3: Give the actor and critic different tempos (Asymmetric Updates)

The third prescription is the most subtle, but experimentally it is the most decisive. The key message: update the actor far less often than the critic, and with a smaller learning rate.

Update-to-Data Ratio (UTD) and its pitfall

Define \eta := K/T as the number of gradient updates per transition (K gradient updates for every T collected transitions). Raising the UTD improves sample efficiency because the same data is squeezed harder. This is especially attractive for real-world training under real-time constraints. But there is a pitfall: the higher the UTD, the more estimation error is amplified and the worse overfitting becomes (Nauman et al., 2024).

The fix is a trick inspired by TD3 (Fujimoto et al., 2018): update the critic at every step, but update the actor only once every M critic steps. At the same time, use a smaller learning rate for the actor.

for k in 1..K:
    update Q_phi  using Eq.(3) with lr_critic = 3e-4
    if k % M == 0:
        update pi    with lr_actor = 1e-5    # M=20 in paper
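To make the cadence concrete, the schedule can be written as a small generator and counted. This is a sketch of the update pattern only; the function name `asymmetric_update_schedule` is hypothetical and the actual gradient steps are replaced by counters.

```python
def asymmetric_update_schedule(num_critic_steps, actor_period):
    """Yield (k, update_actor) for k = 1..K; the actor fires once every M critic steps."""
    for k in range(1, num_critic_steps + 1):
        yield k, (k % actor_period == 0)


LR_CRITIC = 3e-4  # value quoted in the pseudocode above
LR_ACTOR = 1e-5   # value quoted in the pseudocode above

critic_updates = actor_updates = 0
for k, update_actor in asymmetric_update_schedule(num_critic_steps=100, actor_period=20):
    critic_updates += 1      # critic: every step, with LR_CRITIC
    if update_actor:
        actor_updates += 1   # actor: every M-th step, with LR_ACTOR
```

With K = 100 and M = 20 the critic takes 100 gradient steps while the actor takes 5, each at a learning rate 30 times smaller than the critic's.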

Why this works: the two-timescale intuition

This is a well-known idea from stochastic approximation: in a dynamical system with two coupled variables, if one variable moves more slowly than the other, the fast variable learns as if the slow one were fixed.

Applied to actor-critic:

  • Fast timescale (critic): updated at every step. Its goal is to evaluate the value of the policy \pi_n accurately while that policy is held fixed.
  • Slow timescale (actor): updated once every M steps. By then the critic has converged sufficiently, so the actor improves the policy on top of a trustworthy Q^{\pi_n}_\varphi.

With symmetric updates (actor = critic), the policy changes at every step, so the very target the critic is tracking shifts at every step. The critic ends up guiding the actor without having accurately learned the value of any policy, which inflates the |\epsilon(s,a)| term in Eq. (5), the Kakade-Langford bound. This is another entrance to the downward spiral.

As an analogy, suppose you are learning to drive in a new city. The map (critic) is best updated often. The driving style (actor), however, is safer to change only after the map has become reasonably accurate. Change both at every moment and you crash.
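For reference, the textbook two-timescale condition can be written as follows. This is a sketch in generic stochastic-approximation notation, not the paper's: \theta (the actor parameters), the step sizes \alpha_k and \beta_k, and the update directions g and h are symbols assumed here for illustration.

\varphi_{k+1} = \varphi_k + \alpha_k\, g(\varphi_k, \theta_k), \qquad \theta_{k+1} = \theta_k + \beta_k\, h(\varphi_k, \theta_k), \qquad \frac{\beta_k}{\alpha_k} \to 0

When \beta_k / \alpha_k \to 0, the critic parameters \varphi see an almost stationary \theta. The recipe discussed here uses constant learning rates and an update period M rather than decaying step sizes, so it matches this condition only in spirit: the actor's step is 30 times smaller (1e-5 versus 3e-4) and is taken 20 times less often.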

Experimental results: the most dramatic ablation

Figure 10 of the paper shows the power of this prescription most dramatically. On all three robots, the symmetric-update baseline fails to learn at all: performance stagnates or even declines. With the same code and otherwise the same hyperparameters, merely slowing actor updates to M=20 and lowering the actor learning rate from 3\times 10^{-4} to 1\times 10^{-5} puts training back on track.

Another interesting point: even with a warm start added, symmetric updates still fail. In other words, asymmetric updates have an independent stabilizing effect that the other prescriptions cannot replace.

Important: Key point

The general rule that "higher UTD is faster" is wrong in sim-to-online. More precisely, to harvest a high UTD safely, the actor must move far more conservatively than the critic. Otherwise, far from gaining sample efficiency, training itself breaks down.


Bonus: making SAC work in massively parallel simulators

This part is buried in the appendix, but for anyone using massively parallel simulators such as Isaac Lab or MuJoCo Playground it may matter more than the main text. It should resonate especially with those who, like JungYeon, have been through an IsaacGym → Isaac Lab migration.

The "why does SAC do worse than PPO in parallel simulation" mystery

For RL in parallel simulation, PPO is well known and works well. SAC, by contrast, often fails to train well in the same environments. Hence the folk wisdom that "SAC does not suit parallel simulation"; Raffin's (2025) popular blog post also pointed out this difficulty.

The paper argues that this is not a problem inherent to the algorithm but a problem of hyperparameter scaling.

Key diagnosis: as N_e grows, \eta must grow with it

Common SAC implementations such as CleanRL perform one actor-critic update per environment step, regardless of the number of parallel environments. That is adequate with 10 environments, but at N_e = 8192 a single "step" pours in 8192 transitions while there is still only 1 update. The effective UTD is \eta = 1/N_e \to 0, so the agent is severely undertrained relative to its data.
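The arithmetic behind this diagnosis is short enough to write down. The sketch below uses the per-transition definition of \eta given earlier; the function names are hypothetical, and the numbers only restate the example in the text.

```python
def effective_utd(updates_per_env_step, num_envs):
    """Gradient updates per collected transition when num_envs run in parallel."""
    return updates_per_env_step / num_envs


def updates_for_target_utd(target_utd, num_envs):
    """Updates needed per vectorized step to reach target_utd per transition."""
    return target_utd * num_envs


utd_small = effective_utd(1, 10)      # one update per step, 10 environments
utd_large = effective_utd(1, 8192)    # one update per step, 8192 environments
```

Going from 10 to 8192 environments with a fixed single update per step lowers the per-transition update ratio by a factor of roughly 800, which is the undertraining described above.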

The fix: scale \eta up in proportion to N_e. It does not need to grow without bound, though. The sweep results from Figure 5 of the paper:

Franka Panda (Ne=512):   eta in {4..128}   ->  saturation around eta ~= 32
Unitree Go1 (Ne=8192):   eta in {4..128}   ->  similar saturation pattern

Raising the UTD further reduces the number of transitions needed, but wall-clock time grows proportionally. In practice, the answer is to find the saturation point for each task.

The number of domain-randomized environments N_e also matters

The authors run one more interesting ablation: when Unitree Go1 is trained with N_e=128 versus N_e=8192, both reach similar performance in simulation, but a large gap appears in real-world zero-shot performance (Figure 11). The N_e=128 policy is far less stable on hardware.

This is a quantitative confirmation of the well-known result that the variance of domain randomization must be large enough for the policy to become robust. The authors' empirical conclusion is that N_e \sim 10^3 is the threshold for robust sim-to-real.

Tip: A point especially relevant to JungYeon

While migrating the HORA environment from IsaacGym to Isaac Lab, you have probably already seen that details such as actuator gain or angular_damping change the training results. On top of that, if you plan to switch to an off-policy algorithm, scaling \eta to the number of environments is critical. Simply dropping in SAC code as-is can lead to the wrong conclusion that "SAC doesn't work well."


Experiments overall: the story the three robots tell

Franka Emika Panda (Manipulation, Vision-based)

What makes this setup especially interesting is that the reproducible hardware stack is kept deliberately simple: a single RealSense D455 camera, 64×64 grayscale input, end-effector pose, and gripper opening. Most labs already own equipment at this level. The authors have open-sourced the entire stack (panda-rl-kit).

Training dynamics:

  • Simulation pretraining: with domain randomization (lighting, camera perspective, color), the policy is good at seeing and approaching the cube
  • Zero-shot failure mode: gripper-cube contact dynamics differ from simulation, plus a rendering gap → grasping and lifting fail
  • Near-perfect success rate after about 10 minutes of real-world training (Figure 7)

Interestingly, even though this is a vision policy, it fine-tunes sample-efficiently. The combination of DrQ data augmentation and the BRO critic architecture (Nauman et al., 2024) is key.

Unitree Go1 (Locomotion)

Locomotion is a domain where sim-to-real typically works well. So the authors deliberately make it harder: in simulation they train only on a restricted range of velocity commands. As a result, zero-shot performance on hardware is weak for commands outside the trained range. Online fine-tuning closes that gap.

The experimental result (Figure 17) is telling. In trial 0 the policy falls often, but as trials accumulate it comes to track even the new command range robustly. It is a clean demonstration of using sim-to-online to adapt to a distribution never seen in simulation.

Race Car (Navigation, Fast Dynamics)

This is probably the most ambitious experiment: 60 Hz control, a system in which tire friction and drift are central, and a simulator based on a kinematic bicycle model. The authors deliberately keep the sim-to-real gap large: they pretrain on a semi-kinematic bicycle model and then fine-tune on more accurate dynamics (including a friction model).

The zero-shot failure mode is intuitive: the car overshoots the goal. The prior policy is optimal for the simulated dynamics but does not account for the greater inertia and slip of the real system. After about 20 trials of fine-tuning it reaches nearly exact parking (Figure 18).

With its combination of fast dynamics, sparse reward, and a large sim-to-real gap, this task is close to a stress test showing that all three prescriptions are critical.

Summary graph: Zero-shot vs After Finetuning

Figure 6 of the paper, rendered as text:

                         Sim performance     Real zero-shot      After finetuning
Franka Emika Panda            ~1.0              ~0.5                ~1.0
Unitree Go1                   ~1.0              ~0.6                ~1.0
Race Car                      ~1.0              ~0.4                ~1.0
                                                  ^^^                  ^^^
                                              this is the gap     finetune closes it

On all three tasks the zero-shot gap is large, but sim-to-online fine-tuning closes it almost completely. This is the paper's one-figure summary.


Critical discussion

Strengths

  1. Seriousness about reproducibility. More than 100 real-world training runs are very rare in RL research. The results are credible because multiple random seeds, multiple trials, and multiple ablations were actually run.
  2. No additional algorithm needed. Getting there with standard SAC, without a new loss function, a new regularizer, or a new representation-learning module, is a strength. It is easy for others to follow.
  3. Honest about negative results. The paper makes the limits of its prescriptions clear, for example that warm starts have a weak effect under sparse reward and that symmetric actor-critic updates are not rescued by the other prescriptions.
  4. Open-sourcing the hardware stack. Releasing the full Franka vision-based RL environment in particular is a practical contribution that lowers the barrier to entry.
  5. The value of the Pitfalls section. The list of pitfalls in Appendix F (restoring optimizer state, target network, SAC temperature, and so on) is the kind of detail only someone who has actually lost days to them could write.

Weaknesses and limitations

  1. Limits of the episodic setting. All experiments assume manual resets by a human. Reset-free RL, the holy grail of autonomous real-world learning, remains unsolved. The authors acknowledge this explicitly.
  2. Reward design is still done by hand. The vision-based pick-and-place task uses a progress-based dense reward. For genuinely hard manipulation tasks, building that reward is itself difficult. This is a big question the paper does not answer.
  3. Three diverse platforms, but only one task each. Each robot has a single task. Evaluating sample efficiency on several tasks per robot would have demonstrated the generality of the prescriptions more strongly.
  4. No dexterous manipulation. No high-DoF hand such as the Allegro Hand is included. Contact-rich in-hand manipulation is one of the areas with the largest sim-to-real gap, and how these prescriptions behave there is a separate question.
  5. No tactile/force sensing. All tasks use only visual or proprioceptive input. How sim-to-online changes for tasks with tactile feedback remains open.
  6. Statistical power of the number of trials N. Running each experiment with 3 seeds is standard in RL, but strong statistical power may require more seeds. Given the cost of real-world experiments, though, it is a reasonable trade-off.
  7. Dependence on the critic architecture. The BRO architecture is one of the key components, but there is no in-depth analysis of whether the same conclusions hold with a vanilla MLP.

Map of related work

Let's place the paper on a single map.

flowchart TB
    subgraph A["Sim-to-Real (zero-shot)"]
        A1["Hwangbo et al. 2019<br/>(Legged locomotion)"]
        A2["Tang et al. 2023<br/>(IndustReal)"]
    end
    subgraph B["Online RL on Real Robots (from scratch)"]
        B1["Haarnoja et al. 2018<br/>(SAC on real robot)"]
        B2["Smith et al. 2022<br/>(Walk in the park)"]
    end
    subgraph C["Offline-to-Online RL"]
        C1["Nair et al. 2020 (AWAC)"]
        C2["Ball et al. 2023 (RLPD)"]
        C3["Zhou et al. 2025<br/>(no offline retention)"]
    end
    subgraph D["Sim-to-Online (this paper)"]
        D1["As et al. 2026<br/>What Matters..."]
    end
    A -->|"Pretrain only,<br/>no finetune"| D1
    B -->|"No simulation prior"| D1
    C -->|"Adapts ideas to<br/>simulated priors"| D1
    style D1 fill:#fff4e6,stroke:#ff9800,stroke-width:3px

Key differences:

  • Pure sim-to-real studies stop at zero-shot. This paper studies the fine-tuning stage that comes after.
  • From-scratch online RL (Haarnoja, Smith) starts without the help of simulation priors. Its safety and time costs are far higher than in this paper.
  • In offline-to-online RL (AWAC, RLPD, Zhou et al.) the prior is a fixed dataset, and the relationship to simulation is not addressed. This paper adapts the techniques of that line (data retention, warm start) to simulation priors.
  • The paper's closest neighbor is Yin et al. (2025, "Rapidly adapting policies via simulation-guided fine-tuning"), which takes an algorithmic approach that reshapes the reward. This paper, by contrast, barely touches the algorithm and solves the same problem through system design decisions. The two messages point in opposite directions: "don't invent clever tricks; get the basics right."

It can also be seen as a natural follow-up to Tirumala et al. (2024), in that it is the first to validate their "Replay across experiments" idea quantitatively on hardware.


One step further

The paper's biggest implication is the view that sim-to-real is a beginning, not an end. We have invested in ever more precise physics engines and ever more elaborate domain randomization in order to produce perfect policies in simulation. This paper offers a different view: a gap will remain anyway, so build the infrastructure to close it quickly on hardware.

This is a future-proof view. As tasks grow more complex, simulators will be fundamentally insufficient (the open-world assumption). The last mile of robot learning will then have to be learning on hardware. This paper is a collection of engineering prescriptions that make that last mile safe and inexpensive.

There is no algorithmic elegance here, and that is the point. RL research badly needed a recipe that "just works" and can be pulled out in the field. This paper fills that gap honestly and meticulously.


References

  • Paper: Yarden As, Dhruva Tirumala, René Zurbrügg, Chenhao Li, Stelian Coros, Andreas Krause, Markus Wulfmeier. What Matters for Sim-to-Online Reinforcement Learning on Real Robots. arXiv:2602.20220, 2026.
  • Code/hardware stack: github.com/yardenas/panda-rl-kit
  • Related background:
    • Haarnoja et al., Soft Actor-Critic, ICML 2018
    • Fujimoto et al., TD3 / Addressing function approximation error, ICML 2018
    • Tirumala et al., Replay across experiments, ICLR 2024
    • Zhou et al., Efficient online RL fine-tuning need not retain offline data, ICLR 2025
    • Nauman et al., BRO architecture, NeurIPS 2024
    • Zakka et al., MuJoCo Playground, 2025

Copyright 2026, JungYeon Lee