
On this page

  • 1 Introduction
  • 2 DreamWaQ
    • 2.1 Key Contribution
    • 2.2 Implicit Terrain Imagination
    • 2.3 Asymmetric Actor-Critic
    • 2.4 Context-Aided Estimator Network
  • 3 Experiments
    • 3.1 Simulation Result
    • 3.2 Real-world Result
  • 4 Conclusion
  • 5 Reference

📃 DreamWaQ Review

context
rl
paper
Learning Robust Quadrupedal Locomotion With Implicit Terrain Imagination via Deep Reinforcement Learning
Published

July 2, 2023

์ด๋ฒˆ ํฌ์ŠคํŒ…์€ DeepMind์—์„œ ๋ฐœํ‘œ๋œ DreamWaQ: Learning Robust Quadrupedal Locomotion With Implicit Terrain Imagination via Deep Reinforcement Learning ๋…ผ๋ฌธ์„ ์ฝ๊ณ  ์ •๋ฆฌํ•œ ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค. ์ตœ๊ทผ ICRA 2023 ๋Ÿฐ๋˜์—์„œ 5์›” 30์ผ๋ถ€ํ„ฐ 6์›” 1์ผ๊นŒ์ง€ ์ง„ํ–‰๋œ Autonomous Quadruped Robot Challenge (QRC)์—์„œ KAIST ์—ฐ๊ตฌํŒ€์ด 1๋“ฑ์„ ํ•˜์—ฌ ํฐ ์ด์Šˆ๊ฐ€ ๋˜์—ˆ์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฒˆ ํฌ์ŠคํŒ…์—์„œ ๋ฆฌ๋ทฐํ•˜๋Š” ์ด ๋…ผ๋ฌธ์ด ๋ฐ”๋กœ ๋Œ€ํšŒ์—์„œ ์‚ฌ์šฉ๋˜์—ˆ๋˜ ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜์˜ ๋ณดํ–‰์ œ์–ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ๋Œ€ํ•œ ๋‚ด์šฉ์„ ๋‹ด๊ณ  ์žˆ๋Š” ๋…ผ๋ฌธ์ž…๋‹ˆ๋‹ค.

2023 Autonomous Quadruped Robot Challenge

1 Introduction

๋…ผ๋ฌธ์„ ์†Œ๊ฐœํ•ด๋“œ๋ฆฌ๋ฉด์„œ ๋ง์”€๋“œ๋ฆฐ๋ฐ”์™€ ๊ฐ™์ด ์˜ค๋Š˜ ๋ฆฌ๋ทฐํ•  DreamWaQ๋ผ๋Š” ๋…ผ๋ฌธ์— ๋‚˜์™€์žˆ๋Š” ๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ™œ์šฉํ•œ ๋ณดํ–‰์ œ์–ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ KAIST ์—ฐ๊ตฌํŒ€์ด MIT ์—ฐ๊ตฌํŒ€์„ ์ œ์น˜๊ณ  1๋“ฑ์„ ํ•˜์—ฌ ๋‹ค์‹œํ•œ๋ฒˆ ์šฐ๋ฆฌ๋‚˜๋ผ ๊ธฐ์ˆ ๋ ฅ์„ ์„ธ๊ณ„์— ์•Œ๋ฆฐ ๊ธฐํšŒ๊ฐ€ ๋˜์—ˆ๋‹ค๋Š” ์ข‹์€ ๋‰ด์Šค๋ฅผ ๋“ค์„ ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

Finals Team DREAM STEP KAIST

The video above shows the final run of KAIST's Team DREAM STEP at a competition joined by robotics research teams from around the world. The course used to test autonomous quadrupedal locomotion over rough ground featured a remarkable variety of terrain, and it is clearly a very difficult course for a robot to complete reliably. If you watch the videos of the other participating teams, you can see their robots struggling through each section of the course in their own way. Let's take a closer look at the DreamWaQ algorithm, the reinforcement-learning-based recipe behind the first-place finish with which the KAIST team, as reported in the news, beat research teams from other well-known universities.

2 DreamWaQ

2.1 Key Contribution

Overview of DreamWaQ

DreamWaQ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ „์ฒด์ ์ธ ํ๋ฆ„์€ ์œ„์˜ ์‚ฌ์ง„๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. โ€œDreamโ€์ด๋ผ๋Š” ์›Œ๋”ฉ๊ณผ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฐœ๊ด„๋„์—์„œ ์ƒ๊ฐ ํ’์„  ๋ชจ์–‘ ํ‘œํ˜„์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด DreamWaQ ๋…ผ๋ฌธ์˜ ์ฃผ์š” Contribution์œผ๋กœ๋Š” Implicit Terrain Imagination์„ ํ•  ์ˆ˜ ์žˆ๋„๋ก Context-Aided Estimator Network(CENet)์„ ๋„์ž…ํ•˜์˜€๊ณ  ์•ˆ์ •์ ์œผ๋กœ Policy๊ฐ€ ํ•™์Šต๋  ์ˆ˜ ์žˆ๋„๋ก Adaptive Bootstrapping(AdaBoot)๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•˜์—ฌ ๊ฐ•ํ™”ํ•™์Šต ๋ณดํ–‰ ์ œ์–ด๊ธฐ๋ฅผ ์„ค๊ณ„ํ•œ ์ ์„ ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

2.2 Implicit Terrain Imagination

As the challenge environment above shows, the ability to walk over diverse terrain is essential for a quadrupedal robot. So what properties can describe a terrain? Its friction coefficient, restitution coefficient, the obstacles lying on it, how uneven it is, and so on; many such properties characterize a terrain. The key question is how to use those characteristics so that the robot can overcome the difficulties of walking on four legs and move in the commanded direction.

์ด๋Ÿฐ ์ง€ํ˜•์˜ ํŠน์ง•์„ ํŒŒ์•…ํ•˜๊ธฐ ์œ„ํ•ด ๋งŽ์€ ์—ฐ๊ตฌ๋“ค์ด ์นด๋ฉ”๋ผ๋‚˜ ๋ผ์ด๋‹ค์™€ ๊ฐ™์€ ๋น„์ ผ์„ผ์„œ๋ฅผ ๋ถ€์ฐฉํ•˜์—ฌ ํ™˜๊ฒฝ์„ ์ธ์‹ํ•œ ๋’ค ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์„ ๊ณ ์•ˆํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ์—ฐ๊ตฌ๋˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ KAIST ์—ฐ๊ตฌ์ง„์ด ์ œ์•ˆํ•œ DreamWaQ์—์„œ๋Š” ์ง€ํ˜•์„ ์ธ์‹ํ•  ์ˆ˜ ์žˆ๋Š” ๋ถ€์ฐจ์ ์ธ ๋น„์ ผ์„ผ์„œ ์—†์ด ๋กœ๋ด‡์˜ ์ž์ฒด์˜ ์ •๋ณด(proprioception)๋ฅผ ์ด์šฉํ•˜์—ฌ ์ง€ํ˜•์„ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด explicitํ•œ ํ™˜๊ฒฝ ์ •๋ณด๊ฐ€ ์•„๋‹Œ, implicitํ•œ terrain imagination์„ ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•๋ก ์€ ์ œ์‹œํ–ˆ์Šต๋‹ˆ๋‹ค.

In fact, there has been a good deal of research on letting an RL agent implicitly perceive the terrain and environment around the robot. The dominant approach used a Teacher-Student framework, in which a Teacher network trained with full environment information is later imitated by a Student network, but this is data-inefficient because the Teacher and the Student must be trained in two separate stages. DreamWaQ instead uses an Asymmetric Actor-Critic, a slight variation on the standard Actor-Critic architecture, so that terrain information can be folded into the Actor-Critic structure implicitly without the two-stage training of a Teacher-Student pipeline.

2.3 Asymmetric Actor-Critic

๊ธฐ์กด์˜ PPO, SAC์™€ ๊ฐ™์€ Policy Gradient์˜ ๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์˜ ์ฃผ์š” ๊ตฌ์„ฑ์š”์†Œ๋กœ Actor Network์™€ Critic(Value) Network๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. Actor๋Š” ๊ฐ•ํ™”ํ•™์Šต ์—์ด์ „ํŠธ๊ฐ€ ์ทจํ•ด์•ผํ•˜๋Š” action ๊ฐ’์„ ์ถœ๋ ฅํ•˜๋Š” ๋„คํŠธ์›Œํฌ์ด๋ฉฐ Critic๋Š” ์—์ด์ „ํŠธ์˜ ํ•™์Šต ๋ฐฉํ–ฅ์„ ๋ณด์—ฌ์ฃผ๋Š” value๊ฐ’์„ ์ถœ๋ ฅํ•˜์—ฌ ์ด 2๊ฐœ์˜ ๋„คํŠธ์›Œํฌ๋“ค์ด Policy Gradient ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋ชฉ์ ์‹์„ ๋”ฐ๋ผ Return(๋ˆ„์  ๋ณด์ƒ)๊ฐ’์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šตํ•˜๊ฒŒ ๋˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋ณดํ†ต 2๊ฐœ์˜ ๋„คํŠธ์›Œํฌ ๋ชจ๋‘์—๊ฒŒ ๊ฐ™์€ state(ํ˜น์€ observation) ์ •๋ณด๊ฐ€ ์ž…๋ ฅ๊ฐ’์œผ๋กœ ๋“ค์–ด๊ฐ€๊ฒŒ ๋˜๊ธฐ ๋•Œ๋ฌธ์— Actor ๋„คํŠธ์›Œํฌ์™€ Critic ๋„คํŠธ์›Œํฌ๋Š” ์„œ๋กœ Symmetricํ•˜๋‹ค๊ณ  ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

However, if terrain information that the robot cannot obtain without dedicated sensors were fed into the networks of the RL algorithm, the controller could not run on the real robot, because at deployment time there would be no terrain information to supply. DreamWaQ therefore gives the Actor network and the Critic network different inputs, so that the agent can imagine the terrain from the temporal information it gathers while interacting with the environment; this is what makes the structure asymmetric.

Asymmetric Actor-Critic

์œ„์— ๋ณด์ด์‹œ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ Actor Network์—๋Š” Observation o_t, estimated velocity v_t, latent vector z_t๊ฐ€ ์ž…๋ ฅ์œผ๋กœ ๋“ค์–ด๊ฐ€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. v_t์™€ z_t๋Š” ๋‹ค์Œ ํŒŒํŠธ์—์„œ ์ข€ ๋” ์‚ดํŽด๋ณผ ์˜ˆ์ •์ด๋ฏ€๋กœ ์—ฌ๊ธฐ์—์„œ๋Š” ์šฐ์„  observation vecter์ธ o_t์— ์ดˆ์ ์„ ๋งž์ถ”์–ด์„œ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. observation ์ •๋ณด๋Š” ๊ฐ•ํ™”ํ•™์Šต MDP๋ฅผ ์ •์˜ํ•˜๋Š” ํ•œ ์š”์†Œ๋กœ ๊ฐ•ํ™”ํ•™์Šต ์—์ด์ „ํŠธ๊ฐ€ ํ•™์Šตํ•  ๋•Œ ๊ด€์ธก(ํ˜น์€ ์ ‘๊ทผ ๊ฐ€๋Šฅํ•œ ์ •๋ณด)ํ•˜๋Š” ์ •๋ณด์ž…๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋กœ๋ด‡์— ํŠน๋ณ„ํ•œ ๋น„์ ผ ์„ผ์„œ ์ถ”๊ฐ€ ์—†์ด ๋กœ๋ด‡ ์ž์ฒด ํ•˜๋“œ์›จ์–ด์—์„œ ์–ป์„ ์ˆ˜ ์žˆ๋Š” ์ •๋ณด์ธ proprioceptive ์ •๋ณด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ชธ์ฒด์˜ ๊ฐ์†๋„ \omega_t, ์ค‘๋ ฅ๋ฐฉํ–ฅ ๋ฒกํ„ฐ g_t ๋“ฑ๋“ฑ์˜ ์ •๋ณด๊ฐ€ observation vector์˜ ์š”์†Œ๋กœ ๋“ค์–ด๊ฐ€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด, Critic Network์—๋Š” State s_t๊ฐ€ ์ž…๋ ฅ๊ฐ’์œผ๋กœ ๋“ค์–ด๊ฐ€๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋Š”๋ฐ ์ด๋Š” ์œ„์—์„œ Observation๊ณผ State๋ฅผ ๋น„๊ตํ•ด๋†“์€ ๊ฒƒ๊ณผ ๊ฐ™์ด state๊ฐ€ observation๋ณด๋‹ค ๋งŽ์€ ์ •๋ณด๋ฅผ ํฌํ•จํ•œ ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์—์„œ ์ฃผ๋ชฉํ•ด์„œ ๋ณผ ์ˆ˜ ์žˆ๋Š” ์ ์ด ๋ฐ”๋กœ ์ง€ํ˜•์— ๋Œ€ํ•œ ์ •๋ณด์ธ heightmap scan h_t๊ฐ€ ํ•œ ์š”์†Œ์ž„์„ ์•Œ ์ˆ˜ ์žˆ๊ณ  ์ด๋ฅผ ํ†ตํ•ด implicitํ•œ terrain imagination์ด ๊ฐ€๋Šฅํ•œ ๊ฒƒ ์ž…๋‹ˆ๋‹ค. Heightmap scan์— ๋Œ€ํ•ด ์กฐ๊ธˆ ๋” ์„ค๋ช…์„ ๋ง๋ถ™์ด์ž๋ฉด, ์ง€ํ˜•์˜ heightmap scan ์ •๋ณด๋Š” ์‹ค์ œ ๋กœ๋ด‡์—์„œ ์–ป์„ ์ˆ˜ ์žˆ๋Š” ์ •๋ณด๋Š” ์•„๋‹ˆ๊ณ  ๊ฐ•ํ™”ํ•™์Šต ์—์ด์ „ํŠธ๊ฐ€ ํ•™์Šตํ•˜๊ฒŒ ๋˜๋Š” ์‹œ๋ฎฌ๋ ˆ์ด์…˜์—์„œ๋งŒ ์–ป์„ ์ˆ˜ ์žˆ๋Š” ์ •๋ณด๋กœ ์ง€ํ˜•์˜ z์ถ• ๋ฐฉํ–ฅ์˜ ๋†’์ด ์ •๋ณด๋ฅผ ๋งํ•ฉ๋‹ˆ๋‹ค.

ํ™˜๊ฒฝ์„ ์ •์˜ํ•˜๋Š” ๋ณ€์ˆ˜์ด๊ณ  ์‹œ๋ฎฌ๋ ˆ์ด์…˜์—์„œ๋Š” ๊ฐ€์ƒ๊ณต๊ฐ„์ด๊ธฐ ๋•Œ๋ฌธ์— ํ”„๋กœ๊ทธ๋žจ์—์„œ ์–ป์„ ์ˆ˜ ์žˆ๋Š” ๋ฌผ๋ฆฌ์  ์ •๋ณด์ด์ง€๋งŒ ์‹ค์ œ๋กœ ๋กœ๋ด‡์ด ์ด์šฉํ•  ์ˆ˜ ์—†๋Š” ์ •๋ณด๋ฅผ privileged observation์ด๋ผ๊ณ  ๋ถ€๋ฅด๊ธฐ๋„ ํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ๊ธฐ์กด์— ๊ฐ•ํ™”ํ•™์Šต์—์„œ State๊ฐ€ ํ™˜๊ฒฝ์—์„œ ์—์ด์ „ํŠธ๊ฐ€ ๋†“์—ฌ์žˆ๋Š” ์ƒํ™ฉ์„ ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋“  ์ •๋ณด๋ฅผ ๋งํ•˜๊ณ  Observation์ด ํ™˜๊ฒฝ์— ๋†“์—ฌ์žˆ๋Š” ์—์ด์ „ํŠธ๊ฐ€ ๊ด€์ฐฐํ•  ์ˆ˜ ์žˆ๋Š” ์ผ๋ถ€ ์ƒํƒœ ์ •๋ณด๋ฅผ ๋œปํ•˜๊ธฐ ๋•Œ๋ฌธ์— State = Observation + Privileged Observation ํฌํ•จ๊ด€๊ณ„๋กœ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.(๋…ผ๋ฌธ์—์„œ๋Š” privileged observation์ด๋ผ๋Š” ํ‘œ๊ธฐ๋ฅผ state๋ฅผ ๋œปํ•˜๋Š” ๊ฒƒ์œผ๋กœ ํ‘œ๊ธฐํ•˜๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ํ—ท๊ฐˆ๋ฆด ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.)

2.4 Context-Aided Estimator Network

์ด๋ฒˆ ํŒŒํŠธ์—์„œ ์‚ดํŽด๋ณด๊ฒŒ ๋  Context-Aided Estimator Network๋Š” ์„ผ์„œ๋กœ ์ธ์‹ํ•  ์ˆ˜ ์—†๋Š” ์ง€ํ˜• ์ •๋ณด๋ฅผ ์—์ด์ „ํŠธ๊ฐ€ ์œ ์ถ”ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ์ผ๋“ฑ๊ณต์‹  ์•„์ด๋””์–ด ์ž…๋‹ˆ๋‹ค.

The architecture of CENet

CENet์˜ ๊ตฌ์กฐ๋Š” ์œ„์™€ ๊ฐ™์ด \beta-VAE๊ตฌ์กฐ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ผ์ • time horizon H๋™์•ˆ ๋ชจ์€ observation์ด Encoder์— ๋“ค์–ด๊ฐ€๋ฉด latent vector z์™€ ๋ชธ์ฒด์˜ ์„ ์†๋„ ์ถ”์ •๊ฐ’์ธ v_t๊ฐ€ ์ถœ๋ ฅ๊ฐ’์œผ๋กœ ๋‚˜์˜ค๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. Auto-Encoder์˜ ์ผ๋ฐ˜์ ์ธ ๊ตฌ์กฐ๋ฅผ ๋”ฐ๋ผ ์ด ๊ฐ’๋“ค์ด Decoder์˜ ์ธํ’‹์œผ๋กœ ๋“ค์–ด๊ฐ€๊ณ  Decoder์˜ ์ถœ๋ ฅ๊ฐ’์œผ๋กœ๋Š” time horizon์„ ์ง€๋‚œ ๋‹ค์Œ observation vector o_{t+1}์„ reconstructionํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•™์Šตํ•˜๊ฒŒ ๋˜๋Š” ๊ฒƒ ์ž…๋‹ˆ๋‹ค.

The loss of CENet

๊ทธ๋ž˜์„œ CENet์˜ loss function์€ ํฌ๊ฒŒ 2๊ฐœ์˜ ํŒŒํŠธ L_{est}์™€ L_{VAE}๋กœ ๋‚˜๋ˆ„์–ด์ ธ ์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋จผ์ € L_{est}๋Š” ๋ณดํ–‰ํ•˜๋Š” ๋กœ๋ด‡ ์—์ด์ „ํŠธ์˜ ์†๋„ ์ถ”์ •์„ CENet์—์„œ ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•œ ๋ถ€๋ถ„์œผ๋กœ, ๋กœ๋ด‡ ๋ชธ์ฒด์˜ ์„ ์†๋„ ์ถ”์ •๊ฐ’ \tilde{v}_t๋Š” ์‹ค์ œ ์ •๋‹ต๊ฐ’ v_t๋Š” ์‹œ๋ฎฌ๋ ˆ์ด์…˜์—์„œ๋Š” ์–ป์„ ์ˆ˜ ์žˆ๋Š” ๊ฐ’์ด๊ธฐ ๋•Œ๋ฌธ์— Encoder์—์„œ ์ถ”์ •ํ•œ ๊ฐ’ \tilde{v}_t์™€์˜ MSE(mean square error)๋ฅผ ๊ตฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹ค์Œ์œผ๋กœ L_{VAE}๋Š” time horizon H๋™์•ˆ ๋ˆ„์ ๋˜ ์—ฌ๋Ÿฌ๊ฐœ์˜ observation ์ •๋ณด๋ฅผ ๊ฐ€์ง€๊ณ  ๋‹ค์Œ observation o_{t+1}์„ ์˜คํ† ์ธ์ฝ”๋” ๊ตฌ์กฐ๋กœ ์ž˜ reconstructionํ•œ์ง€๋ฅผ ๋ณด๋Š” ์ฒซ๋ฒˆ์งธ term๊ณผ ์ถ”์ • ๋ถ„ํฌ๋ฅผ ๋งž์ถ”๋Š” ๋ถ€๋ถ„์ธ KL-divergence ์ œ์•ฝ ์กฐ๊ฑด ๋‘๋ฒˆ์งธ term์œผ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์Šต๋‹ˆ๋‹ค. (VAE loss์— ๋Œ€ํ•ด์„œ ๋” ์ž์„ธํ•œ ์ •๋ณด๋ฅผ ์•Œ๊ณ  ์‹ถ์œผ์‹  ๋ถ„์€ ์ด์ „์— VAE ๋…ผ๋ฌธ์„ ๋ฆฌ๋ทฐํ•œ ํฌ์ŠคํŒ…์„ ์ฐธ๊ณ ํ•ด์ฃผ์„ธ์š”.)

์ด์™€ ๊ฐ™์€ loss ๊ตฌ์„ฑ์œผ๋กœ ํ•™์Šต๋œ CENet์€ ์—ฌ๋Ÿฌ ํƒ€์ž„ ์Šคํ…๋™์•ˆ ๊ด€์ฐฐ๋œ observation ์ •๋ณด๋“ค์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์—์ด์ „ํŠธ๊ฐ€ privileged observation์„ ์œ ์ถ”ํ•  ๊ฒƒ์œผ๋กœ ๊ธฐ๋Œ€ํ•  ์ˆ˜ ์žˆ๋Š” ์ด์œ ๋Š” privileged observation์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ฐ€์น˜๋ฅผ ์ถ”์ •ํ•˜๋Š” Critic(Value) Network๋ฅผ ํ†ตํ•ด์„œ Actor Network๊ฐ€ ์—…๋ฐ์ดํŠธ ๋˜๋Š” Policy gradient๊ณผ์ •์„ ๊ฑฐ์น˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ Asymmetric Actor-Critic๊ตฌ์กฐ์™€์˜ ์‹œ๋„ˆ์ง€ ํšจ๊ณผ๊ฐ€ ๊ธฐ์กด์˜ Context RL ๋ถ„์•ผ์—์„œ๋„ ์‚ฌ์šฉ๋˜๋Š” ์•„์ด๋””์–ด ์ธ๋ฐ(์ฐธ์กฐ๋…ผ๋ฌธ: AACC) ์ด์™€ ๋น„๊ตํ•ด๋ณด์•˜์„ ๋•Œ, Critic Network๊ฐ€ deploy๋˜๋Š” ๊ณผ์ •์—์„œ ์“ฐ์ด์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— Actor๋ณด๋‹ค ๋” ๋งŽ์€ ์ •๋ณด๋ฅผ ๋ฐ›์•„์„œ ๋” ์ •ํ™•ํ•œ ๊ฐ€์น˜๋ฅผ ์ถ”์ •ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•œ๋‹ค๋Š” ๊ธฐ์กฐ๋Š” ๋น„์Šทํ•˜์ง€๋งŒ time-invarientํ•œ context vector๋ฅผ ๋งŒ๋“œ๋Š” Context RL์—์„œ์˜ Asymmetric Actor-Critic๊ณผ ๋‹ค๋ฅด๊ฒŒ DreamWaQ์—์„œ๋Š” time-varientํ•œ ๋ณ€์ˆ˜๋“ค์„ ์ถ”์ •ํ•˜์—ฌ implicitํ•˜๊ฒŒ ์ถ”์ •ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ–ˆ๋‹ค๋Š” ์ ์ด ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

Adaptive Bootstrapping (AdaBoot)

Adaptive bootstrapping์€ policy ํ•™์Šต๊ณผ์ • ์ค‘์— Estimator network์ธ CENet์ด ์•ˆ์ •์ ์œผ๋กœ ํ•™์Šต๋˜๋„๋ก ํ•˜๊ธฐ ์œ„ํ•ด domain randomized๋กœ ๋‹ค์–‘ํ™”๋œ ์—ฌ๋Ÿฌ ํ™˜๊ฒฝ์š”์†Œ์— ๋Œ€ํ•ด ์—ํ”ผ์†Œ๋“œ๋ณ„ reward์˜ ํ‰๊ท ๊ฐ’์— ๋Œ€ํ•œ ํ‘œ์ค€ ํŽธ์ฐจ์˜ ๋น„์œจ์ธ ๋ณ€๋™ ๊ณ„์ˆ˜(CV)์— ์˜ํ•ด ์ œ์–ด๋˜๋Š” ๋ฐฉ๋ฒ•์„ ๋งํ•ฉ๋‹ˆ๋‹ค. ํ•ต์‹ฌ ์•„์ด๋””์–ด๋Š” ๋ถ€์ •ํ™•ํ•œ ๊ฐ€์น˜ ์ถ”์ •์— ๋Œ€ํ•œ ์ •์ฑ…์„ ๋ณด๋‹ค ๊ฒฌ๊ณ ํ•˜๊ฒŒ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด m๊ฐœ์˜ ์—์ด์ „ํŠธ reward์˜ CV๊ฐ€ ์ž‘์„ ๋•Œ ๋ถ€ํŠธ์ŠคํŠธ๋ž˜ํ•‘์„ ํ•˜๊ฒŒ๋ฉ๋‹ˆ๋‹ค. ๋ฐ˜๋Œ€๋กœ ์—์ด์ „ํŠธ๊ฐ€ ์ถฉ๋ถ„ํžˆ ํ•™์Šตํ•˜์ง€ ์•Š์€ ๊ฒฝ์šฐ์—๋Š” reward์—์„œ ํฐ CV๋กœ ํ‘œ์‹œ๋œ ๊ฒƒ์ฒ˜๋Ÿผ ๋ถ€ํŠธ์ŠคํŠธ๋žฉ์„ ํ•ด์„œ๋Š” ์•ˆํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.

Adaptive Bootstrapping Probability
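A minimal sketch of the idea follows, with an illustrative mapping from CV to bootstrap probability; the paper's exact schedule is what the probability figure above shows, and the clipped linear mapping and `scale` parameter here are only assumptions.

```python
import numpy as np

def bootstrap_probability(episodic_rewards, scale=1.0):
    """Compute a bootstrap probability from the coefficient of variation (CV) of
    per-agent episodic rewards. A small CV (stable learning) -> probability near 1,
    so value targets are bootstrapped; a large CV (under-trained policy) -> probability
    near 0, so bootstrapping is suppressed. The clipped linear mapping and the `scale`
    parameter are illustrative assumptions."""
    rewards = np.asarray(episodic_rewards, dtype=np.float64)
    cv = rewards.std() / (np.abs(rewards.mean()) + 1e-8)  # coefficient of variation
    return float(np.clip(1.0 - scale * cv, 0.0, 1.0))

# Example: rewards of m = 4 agents that agree closely -> bootstrap almost always.
print(bootstrap_probability([9.8, 10.1, 10.0, 9.9]))  # ~0.99
```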

3 Experiments

DreamWaQ์˜ ํšจ๊ณผ๋ฅผ ์‹คํ—˜์„ ํ†ตํ•ด ์‚ดํŽด๋ณด๊ธฐ ์œ„ํ•ด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋น„๊ต ๋ชจ๋ธ๊ตฐ์„ ์„ค์ •ํ–ˆ์Šต๋‹ˆ๋‹ค.

Compared Methods

  • Baseline: the basic model architecture with no adaptation component
  • AdaptationNet: a model that learns implicit environment information through two-stage training with a Teacher-Student structure
  • EstimatorNet: a model with an estimator network that explicitly estimates environment information, without context estimation
  • DreamWaQ w/o AdaBoot: DreamWaQ trained without AdaBoot
  • DreamWaQ w/ AdaBoot: [proposed method] DreamWaQ trained with AdaBoot

3.1 Simulation Result

Episodic reward during training

Looking at how the episodic reward evolves while training with the PPO algorithm in the Isaac Gym simulator, EstimatorNet initially achieves a higher mean episodic reward than AdaptationNet, but its performance degrades after more training steps because it then faces harder terrain. DreamWaQ, by contrast, outperforms all the other methods even as the training terrain becomes progressively harder. Even though it walks without exteroception, DreamWaQ performs on par with an oracle policy that has full access to the heightmap of the surrounding terrain.

Explicit Estimation Comparison

์‹œ๋ฎฌ๋ ˆ์ด์…˜์—์„œ ํ•œ๋ฒˆ ์ง€ํ˜•์ •๋ณด๋ฅผ Implicit๊ฐ€ ์•„๋‹Œ Explicitํ•˜๊ฒŒ ์•Œ๋ ค์ฃผ๊ณ  ํ•™์Šตํ•œ๋‹ค๋ฉด ์–ด๋–ค ์œ ์˜๋ฏธํ•œ ์ฐจ์ด๊ฐ€ ์žˆ๋Š”์ง€ ์•Œ์•„๋ณด๋Š” ์‹คํ—˜๋„ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค.

Foot stumble comparison on stair terrain

Timestep์ด ๋Š˜์–ด๋‚  ์ˆ˜๋ก ๋” ์–ด๋ ค์šด ๊ณ„๋‹จ์ง€ํ˜•์—์„œ ๋ณดํ–‰ํ•˜๋„๋ก ํ•™์Šต์‹œํ‚จ ๊ฒฐ๊ณผ Explicitํ•˜๊ฒŒ ์ง€ํ˜•์ •๋ณด๋ฅผ ํ•™์Šตํ•œ Estimator๋Š” ์ง€ํ˜•์ด ์–ด๋ ค์›Œ์ง€์ž Foot stumble ํ˜„์ƒ์ด ์‹ฌํ•˜๊ฒŒ ์žˆ์—ˆ์ง€๋งŒ DreamWaQ๋Š” ์ง€ํ˜•์ด ์–ด๋ ค์›Œ์ ธ๋„ ์ž‘์€ foot stumble์ด ์žˆ์Œ์„ ํ™•์ธํ•˜์—ฌ ์˜คํžˆ๋ ค Implicitํ•˜๊ฒŒ ์ง€ํ˜•์ •๋ณด๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ธ robustํ•œ ๋ณดํ–‰์„ ํ•˜๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

3.2 Real-world Result

Plotting the command tracking error on the real robot platform also shows that DreamWaQ has smaller errors than the other compared models. In particular, the difference in error magnitude with and without AdaBoot confirms that the AdaBoot method is needed for policy learning.

4 Conclusion

Some parts of the paper are described rather ambiguously, which I found a pity while reviewing it, but enabling a legged robot to traverse unstructured and varied terrain is still a hard open problem, and solving it without any additional vision-sensor information through the CENet and AdaBoot ideas, and then showing strong performance at a competition held at an actual conference, makes this a very good piece of research in my view.

5 Reference

  • Original Paper: DreamWaQ

  • ICRA 2023 Quadruped Robot Challenges

  • ์นด์ด์ŠคํŠธ ์‚ฌ์กฑ๋ณดํ–‰ ๋กœ๋ด‡ ยทยทยท MIT ์ œ์น˜๊ณ  ์„ธ๊ณ„๋Œ€ํšŒ ์šฐ์Šน ๋น„๊ฒฐ์€?

  • AACC: Asymmetric Actor-Critic in Contextual Reinforcement Learning

Copyright 2024, Jung Yeon Lee