
📃 WASABI Review

rl
gan
quadruped
backflip
paper
Learning Agile Skills via Adversarial Imitation of Rough Partial Demonstrations
Published: March 12, 2023

์ด๋ฒˆ ํฌ์ŠคํŒ…์€ WASABI: Learning Agile Skills via Adversarial Imitation of Rough Partial Demonstrations ๋…ผ๋ฌธ์„ ์ฝ๊ณ  ์ •๋ฆฌํ•œ ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค. 4์กฑ ๋ณดํ–‰ ๋กœ๋ด‡ ์—ฐ๊ตฌ์—์„œ ๋งŽ์€ ์—ฐ๊ตฌ ์„ฑ๊ณผ๋“ค์„ ๋ฐœํ‘œํ•˜๋Š” ์Šค์œ„์Šค์˜ ETH Robotic System Lab๊ณผ ๋…์ผ์˜ Max Plank Institude for Intelligent Systems์—์„œ ๋ฐœํ‘œํ•œ ๋…ผ๋ฌธ์œผ๋กœ, ๊ฐ•ํ™”ํ•™์Šต์—์„œ ์ค‘์š”ํ•œ ๋ถ€๋ถ„๋“ค ์ค‘ ํ•˜๋‚˜์ธ reward design์— ๋Œ€ํ•œ ๊ณ ๋ฏผ์„ generatvie adversarial method(WGAN, Wasserstein GAN)๋ฅผ ํ†ตํ•ด ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค.

In legged-robot motion control, research is actively pushing robot performance beyond basic locomotion toward a variety of dynamic motions. By dynamic motions I mean motions like a backflip, where the robot has to make a full turn in mid-air: motions that are very hard to control in a rule-based way built on traditional locomotion-control research. When it is difficult to specify such motions mathematically in full detail and to control them while accounting for every physical factor, reinforcement learning, an AI framework that learns motions through trial and error guided by a reward signal, intuitively looks like a very good solution.

However, for the robot to treat each dynamic motion as a task and learn it the way we intend, the reward has to be well defined, and this process is surprisingly tricky: an analytical approach is even harder here than in control based on a mathematical dynamics model. Only once this reward-design problem is solved can reinforcement learning actually deliver the dynamic motions we want. This is exactly the part the paper tackles with WGAN, one of the well-known GAN-family generative models, and the approach I found most interesting is that it views the reinforcement-learning policy as the generator of a GAN and builds a framework that infers the reward. (Looking into related work afterward, generative models and reinforcement learning seem to have a lot in common; if you are curious, I also recommend the paper Connecting Generative Adversarial Networks and Actor-Critic Methods as a light read.)

Introduction

๊ฐ•ํ™”ํ•™์Šต์€ ์ •๋ง ๋งค๋ ฅ์ ์ธ ์ธ๊ณต์ง€๋Šฅ ํ•™์Šต๋ฒ• ์ค‘ ํ•˜๋‚˜๋ผ๊ณ  ์ƒ๊ฐํ•ฉ๋‹ˆ๋‹ค. ์ €๋„ ์ง๊ด€์ ์ด๊ณ , ์–ด๋–ป๊ฒŒ ๋ณด๋ฉด ๊ฐ€๋” ์šฐ๋ฆฌ๋„ค ์ธ์ƒ์˜ ๋ชจ์Šต์„ ๋‹จ์ˆœํ•˜์ง€๋งŒ ๋ช…๋ฃŒํ•˜๊ฒŒ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ ๊ฐ™์•„ ๊ทธ๋Ÿฐ ๊ฐ•ํ™”ํ•™์Šต์˜ ๋งค๋ ฅ์— ๋น ์ ธ ์ง€๊ธˆ๊นŒ์ง€๋„ ์—ด์‹ฌํžˆ ์ดํ•ดํ•˜๊ณ  ๊ณต๋ถ€ํ•˜๋ ค๊ณ  ๋…ธ๋ ฅํ•˜๊ณ  ์žˆ๋Š” ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๋งŽ์€ ๋ถ„๋“ค์ด ์ธ๊ณต์ง€๋Šฅ์„ ์ฒ˜์Œ์— ํ•™์Šตํ•  ๋•Œ ๋งˆ์ฃผํ•˜๊ฒŒ ๋˜๋Š” ๊ฒƒ์€ โ€œ์ง€๋„ํ•™์Šต(Supervised Learning)โ€์ธ๋ฐ ์ด๋ก  ๊ณต๋ถ€๋ฅผ ์–ด๋А์ •๋„ ๋งˆ์นœ ํ›„, ๊ด€๋ จํ•ด์„œ vision์ด๋‚˜ ์ž์—ฐ์–ด ๋“ฑ์˜ ํ”„๋กœ์ ํŠธ๋ฅผ ์‹œ์ž‘ํ•˜๋ฉด ์ฒ˜์Œ์— ๋งˆ์ฃผ์น˜๋Š” ๋‚œ๊ด€์€ ๋ฐ์ดํ„ฐ์…‹ ๊ตฌ์ถ•์ด๋ผ๊ณ  ์ƒ๊ฐ๋ฉ๋‹ˆ๋‹ค. ๋น…๋ฐ์ดํ„ฐ ๊ธฐ๋ฐ˜์œผ๋กœ ๋™์ž‘๋˜๋Š” ๋ฐฉ๋ฒ•๋ก ์ด๋‹ค ๋ณด๋‹ˆ Garbage In, Garbage Out์ด ์•ˆ๋˜๋„๋ก ์กฐ์‹ฌํ•ด์•ผํ•˜๊ณ  ๋‚ด๊ฐ€ ์›ํ•˜๋Š” ์ปค์Šคํ…€ ๋ฐ์ดํ„ฐ ์…‹์„ ๊ตฌ์ถ•ํ•˜๋Š” ๋ฐ๋งŒ ์—„์ฒญ๋‚œ ์—๋„ˆ์ง€๋ฅผ ์Ÿ์•„์•ผ ํ•ฉ๋‹ˆ๋‹ค. (์˜คํ”ˆ ๋ฐ์ดํ„ฐ์…‹์ด๋‚˜ transfer learning ๊ธฐ๋ฒ• ๋“ฑ์„ ์ด์šฉํ•ด์„œ ํ•ด๊ฒฐํ•˜๊ธฐ๋„ ํ•˜์ง€๋งŒ์š”.)

Reinforcement learning, on the other hand, needs no dataset! As the RL framework runs, it produces interaction data through trial and error, and learning proceeds on that data. Unfortunately, though, RL has a difficulty that is just as hard as dataset construction (or harder): building the environment. A senior researcher at DeepMind went as far as saying "Behind every great agent, there's a great environment," and it is no exaggeration to say that in RL the success or failure of learning depends on the environment.

๊ฐ•ํ™”ํ•™์Šต์˜ ํ˜„์‹ค์— ๋Œ€ํ•ด ์ข€ ๋” ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.(๋กœ๋ด‡ํ‹ฑ์Šค ๋ถ„์•ผ ๊ฐ•ํ™”ํ•™์Šต ์—ฐ๊ตฌ์ž์˜ ๊ด€์ ์ด๋ฏ€๋กœ ๋‹ค๋ฅธ ๋ถ„์•ผ์—์„œ ๊ฐ•ํ™”ํ•™์Šต์„ ๋„์ž…ํ•  ๋•Œ์˜ ๊ด€์ ๊ณผ๋Š” ์ฐจ์ด๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.) ๋จผ์ € ์ฒซ๋ฒˆ์งธ๋กœ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ๊ฐ•ํ™”ํ•™์Šต ๋˜ํ•œ ๋น…๋ฐ์ดํ„ฐ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•™์Šต๋˜๋Š” ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— (1)๋งŽ์€ interaction data๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋กœ๋ด‡ํ‹ฑ์Šค์— ๊ฐ•ํ™”ํ•™์Šต์„ ๋„์ž…ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋กœ๋ด‡์„ ์—ฌ๋Ÿฌ๋ฒˆ ๋Œ๋ฆฌ๋ฉฐ ๋ฐ์ดํ„ฐ๋ฅผ ์–ป์–ด์•ผ ํ•˜๋Š”๋ฐ (์—ฐ๊ตฌ ์ดˆ๊ธฐ์—๋Š” ์‹ค์ œ๋กœ ๋กœ๋ด‡์„ ์—ฐ๊ตฌ์ž๊ฐ€ ์—ฌ๋Ÿฌ๋ฒˆ ๋‹ค์‹œ ์…‹ํŒ…ํ•˜๊ณ  ์‹คํ—˜์„ ํ•˜๋ฉฐ ๋ฐ์ดํ„ฐ๋ฅผ ์–ป์—ˆ๋‹ค๊ณ ๋Š” ํ•˜์ง€๋งŒ..) ์‚ฌ์‹ค์ƒ ๋ถˆ๊ฐ€๋Šฅ์— ๊ฐ€๊น๊ธฐ ๋•Œ๋ฌธ์— ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ๋ฅผ ์–ป๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ด ์ ์—์„œ ์‹ค์ œ ๋ฌผ๋ฆฌ์ ์ธ ์„ธ๊ณ„์—์„œ ๋กœ๋ด‡์ด ๊ตฌ๋™๋˜์–ด ์–ป์–ด์ง€๋Š” ๋ฐ์ดํ„ฐ์™€ ๋ฌผ๋ฆฌ์ ์ธ ์„ธ๊ณ„๋ฅผ ๋ชจ์‚ฌํ•œ ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ์—์„œ ์–ป๊ฒŒ๋œ ๋ฐ์ดํ„ฐ๋Š” ์ฐจ์ด๊ฐ€ ์กด์žฌํ•  ์ˆ˜ ๋ฐ–์— ์—†๊ธฐ ๋•Œ๋ฌธ์— Sim-to-real์ด๋ผ๋Š” ๋˜ ํ•˜๋‚˜์˜ ์—ฐ๊ตฌ๊ณผ์ œ๊ฐ€ ๋งŒ๋“ค์–ด์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

๋‹ค์Œ์œผ๋กœ๋Š” ์•ž์„œ ์ด์•ผ๊ธฐ ํ–ˆ๋˜, (2)๊ฐ•ํ™”ํ•™์Šต์˜ ํ™˜๊ฒฝ ๊ตฌ์ถ•์ด ์ž˜ ๋˜์–ด์•ผ ์ œ๋Œ€๋กœ ํ•™์Šต์ด ๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ํ™˜๊ฒฝ ๊ตฌ์ถ•, ํ˜น์€ ๊ฐ•ํ™”ํ•™์Šต์˜ ์ˆ˜ํ•™์  ๋ชจ๋ธ๋ง์ธ MDP(Markov Decision Process)์˜ ์š”์†Œ๋“ค์„ ์ž˜ ์ •์˜ํ•ด์ฃผ์–ด์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์€ ์‚ฌ์ง„์—์„œ ๋ณด์ด๋Š” ๊ฐ•ํ™”ํ•™์Šต ํ”„๋ ˆ์ž„์›Œํฌ์— ์žˆ๋Š” State, Reward, Action ๋“ฑ์„ ํ’€๊ณ ์ž ํ•˜๋Š” ๋ฌธ์ œ์— ๋งž๊ฒŒ ์ž˜ ์ •ํ•ด์ฃผ์–ด์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ €๋Š” ๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์—ฐ๊ตฌ์ž๊ฐ€ ์•„๋‹ˆ๊ณ  ๊ฐ•ํ™”ํ•™์Šต์„ ํ™œ์šฉํ•œ ๋กœ๋ด‡์ œ์–ด ์—ฐ๊ตฌ์ž์ด๊ธฐ์— ๊ฐ™์€ ๊ฐ•ํ™”ํ•™์Šต ๋ฐฉ๋ฒ•๋ก ์„ ๋ณด๋”๋ผ๋„ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์—ฐ๊ตฌ์ž์™€ ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜ ์—ฐ๊ตฌ์ž๊ฐ€ ๋ณด๋Š” ํ™˜๊ฒฝ์˜ ๋””ํ…Œ์ผ์ด ๋งŽ์ด ๋‹ค๋ฅธ ๊ฒƒ์„ ๋А๊ผˆ์—ˆ์Šต๋‹ˆ๋‹ค. ์œ„ ์‚ฌ์ง„์—์„œ ๊ฐ™์€ quadruped walking robot์˜ locomotion(๋ณดํ–‰) task๋ฅผ ์ƒ๊ฐํ•  ๋•Œ, ๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋…ผ๋ฌธ๋“ค์€ Ant์™€ ๊ฐ™์€ ๋‹จ์ˆœํ•œ rigid model์„ ์ƒ๊ฐํ•˜๊ณ  ์‹คํ—˜์„ ํ•˜์ง€๋งŒ ๊ฐ•ํ™”ํ•™์Šต์„ ์‹ค์ œ ๋กœ๋ด‡์— ์ ์šฉํ•˜๋ ค๊ณ  ๋ณด๋ฉด ๋กœ๋ด‡์˜ ๊ฐ ๋ชจํ„ฐ์˜ ํŠน์„ฑ, ์„ผ์„œ๋“ฑ์„ ๊ณ ๋ คํ•œ State, Reward, Action์„ ์ •์˜ํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ›จ์”ฌ ๋ณต์žกํ•ฉ๋‹ˆ๋‹ค. ์‚ฌ์‹ค ํ™˜๊ฒฝ์˜ ์š”์†Œ๋“ค ์ค‘, State์™€ Action์€ ๊ฐ ๋„๋ฉ”์ธ ๋งˆ๋‹ค ๊ด€๋ก€์ ์ธ ์ •์˜ ๋ฐฉ๋ฒ•๋“ค์ด ์žˆ๊ณ  ๋กœ๋ด‡์˜ ์„ผ์„œ๋“ค์ด ํ•œ์ •์ ์ด๊ธฐ ๋•Œ๋ฌธ์— ์–ด๋А์ •๋„ ์ •ํ•ด์ ธ์žˆ๋‹ค(limited)๊ณ  ๋ณผ ์ˆ˜ ์žˆ์ง€๋งŒ Reward๋Š” ๊ฐ•ํ™”ํ•™์Šต์—์„œ ํ•™์Šต์˜ motivation์ด ๋˜๋Š” ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๋ถ€๋ถ„์ด์ž ์ˆ˜ํ–‰ํ•˜๊ณ ์ž ํ•˜๋Š” task์— ์˜ํ–ฅ์„ ๊ฐ€์žฅ ๋งŽ์ด ๋ฐ›๋Š” ๋ถ€๋ถ„์ด๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ€์žฅ ์ •์˜ํ•˜๊ธฐ๊ฐ€ ์–ด๋ ต์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ด๋Ÿฐ ์–ด๋ ค์›€์„ ํ•ด๊ฒฐํ•˜๊ณ ์ž ํ•˜๋Š” ๋˜ ํ•˜๋‚˜์˜ ์—ฐ๊ตฌ ๋ฐฉํ–ฅ์„ Reward Engineering์ด๋ผ๊ณ  ์ง€์นญํ•˜๊ธฐ๋„ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฒˆ ๋…ผ๋ฌธ์—์„œ๋Š” ๋ฐ”๋กœ ์ด์ ์„ ํŒŒ๊ณ ๋“  ๊ฒƒ์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.


When you think about learning some decision process, what is the most intuitive approach? Simply copying someone who already does it well. This is called Imitation Learning or Behavior Cloning (below I sometimes call it plain imitation learning to distinguish it). For example, if you want an agent that drives well, you can train it to reproduce the data of a good driver. This treats the expert's state-action pairs as a dataset and trains the agent by supervised learning, but since it only ever learns from the expert's data, errors creep in and it generalizes poorly.

์ด์— ๋Œ€ํ•œ ๋ณด์™„์œผ๋กœ GAIL(Generative Adversarial Imitation Learning)์ด๋ผ๋Š” ๋ฐฉ๋ฒ•์ด ์ œ์•ˆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฆ„์—์„œ๋„ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด Generative Adversarial Network(์ ๋Œ€์  ์‹ ๊ฒฝ๋ง)์™€ Imitation Learning(๋ชจ๋ฐฉ ํ•™์Šต)์ด ํ•ฉ์ณ์ง„ ํ•™์Šต ๋ฐฉ๋ฒ•์ธ๋ฐ, Expert์˜ state-action ๋ถ„ํฌ๋ฅผ True data distribution์œผ๋กœ, ํ•™์Šตํ•˜๋Š” Agent์˜ Policy๋ฅผ True data distribution์„ ๋”ฐ๋ผ๊ฐ€๊ณ ์ž ํ•˜๋Š” Generator๋กœ ๋ณด๋Š” ๊ฒƒ ์ž…๋‹ˆ๋‹ค. ์•ž์„œ ์ด์•ผ๊ธฐํ•œ plain imitation learning๊ณผ ๋น„๊ตํ•ด๋ณด๋ฉด pair data point์— ๋Œ€ํ•ด ๋งž์ถฐ๊ฐ€๋Š” ํ•™์Šต์ด ์•„๋‹Œ data distribution์ด๋ผ๋Š” ํ™•๋ฅ ์  ์ŠคํŽ™ํŠธ๋Ÿผ์„ ์ด์šฉํ•ด์„œ ๋” generalization์„ ์ž˜ํ•  ์ˆ˜ ์žˆ๋Š” ํ•ด๊ฒฐ์ฑ…์„ ์ œ์•ˆํ•œ ๊ฒƒ์œผ๋กœ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. Data distribution์„ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•๋“ค์€ ์ƒ์„ฑ ๋ชจ๋ธ ๋ถ„์•ผ์—์„œ ํ™œ๋ฐœํžˆ ์—ฐ๊ตฌ๋˜๊ณ  ์žˆ๊ณ , ์ด ์ค‘ GAN์ด๋ผ๋Š” ์ ๋Œ€์  ์‹ ๊ฒฝ๋ง ๋ฐฉ๋ฒ•์—์„œ Generator์™€ Discriminator๋ผ๋Š” ๊ฐœ๋…์„ Imitation Learning์— ์ ์šฉํ•œ ๊ฒƒ์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

GAIL์˜ ๋ฐฉ๋ฒ•๋ก ๋“ค ์ค‘ ํ•˜๋‚˜๋กœ, AMP(Adversarial Motion Priors)๋ผ๋Š” ๋ฐฉ๋ฒ•์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฒˆ ํฌ์ŠคํŒ…์—์„œ ์†Œ๊ฐœ๋˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ธ WASABI์™€ AMP ๋ชจ๋‘ GAIL์ด๋ผ๋Š” ๋ฐฉ๋ฒ•๋ก  ์•ˆ์— ์†ํ•ด์žˆ๊ณ , ๋‘˜์„ ๋น„๊ตํ•ด์„œ ์ƒ๊ฐํ•ด๋ณด๋ฉด ์ข‹๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ„๋žตํ•˜๊ฒŒ ์งš๊ณ  ๋„˜์–ด๊ฐ€๋ณด๋ ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. AMP๋Š” Motion data, ์˜ˆ๋ฅผ ๋“ค๋ฉด ๋™๋ฌผ์˜ ์›€์ง์ž„์—์„œ ๋”ฐ์˜จ expert data๋ฅผ ๊ฐ€์ง€๊ณ  ๋กœ๋ด‡ agent์˜ ๋ชจ์…˜์ด ์ข€ ๋” ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์›€์ง์ž„์„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ฉ๋‹ˆ๋‹ค. Discriminator๊ฐ€ Motion data์—์„œ ๋‚˜์˜จ State-transition(S_t \rightarrow S_{t+1})์ธ์ง€ ์•„๋‹ˆ๋ฉด ํ•™์Šต ์ค‘์ธ Policy(Generator ์—ญํ• )์—์„œ ๋‚˜์˜จ State-transition์ธ์ง€๋ฅผ ๊ตฌ๋ณ„ํ•˜์—ฌ ์‹ค์ œ ๋™๋ฌผ์˜ ์›€์ง์ž„์ฒ˜๋Ÿผ ์ž์—ฐ์Šค๋Ÿฌ์šด ์Šคํƒ€์ผ์„ ํ•™์Šต ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋ณด์กฐ์ ์ธ Style Reward(r_{style})์„ ๊ธฐ์กด์˜ ๊ฐ•ํ™”ํ•™์Šต ํ”„๋ ˆ์ž„์›Œํฌ ์•ˆ์— ์ถ”๊ฐ€ํ•ด์ค๋‹ˆ๋‹ค. State-action pair๋ฅผ ๊ฐ€์ง€๊ณ  ํ•™์Šตํ•˜๋Š” Plain imitation learning๊ณผ ๋‹ค๋ฅด๊ฒŒ, State-transition์„ ๋ณด๊ณ  Discriminator๊ฐ€ ํŒ๋‹จํ•˜๋Š” ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— expert์˜ Action์— ๋Œ€ํ•œ ์ •๋ณด๋Š” ํ•„์š”๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค.

It is worth pausing on the fact that AMP focuses on natural motion. The reward is not for the main dynamic motion task such as walking or jumping; it literally pins down an auxiliary motion style so that training does not drift off into unnatural motions. To learn this naturalness, though, the motion data must spell out the robot's pose configuration one element at a time. In other words, the task must be well defined: there has to be motion data that numerically specifies how the robot's joint positions should move at every timestep. So while AMP brought GAN into the design of the auxiliary style reward rather than the main task reward, WASABI's biggest difference is that it brings GAN into the task reward itself.

GAN

Let me start with the basic theory of adversarial networks. GAN is one methodology for training generative models: Generative, as in creating some new data, and Adversarial, as in two algorithmic modules, a Discriminator and a Generator, competing against each other like a game while they learn. In the example shown below, think of a painter who looks at the real Mona Lisa (a real example) and sells imitations: that painter is the Generator. The art appraiser, the Discriminator, then has to judge whether a given work is the real Mona Lisa or the painter's fake. Naturally, the Generator keeps painting ever more convincing Mona Lisas (new data) to make the Discriminator's job harder, while the Discriminator looks for ever finer and more sensitive differences between real and fake to catch the Generator's imitations.

GAN training involves both supervised and unsupervised learning. The Discriminator does supervised learning: given input data carrying real/fake labels, it picks one of the two categories. The Generator does unsupervised learning: given a latent code, a kind of trigger vector, as input, it generates new data that is close to the true data distribution.

์ž ๊น data distribution์ด๋ผ๋Š” ๊ฐœ๋…์ด GAN์—์„œ๋Š” ์ค‘์š”ํ•œ ๊ฐœ๋…์ด๋ฏ€๋กœ Probability Distribution(ํ™•๋ฅ  ๋ถ„ํฌ)์„ ๊ฐ„๋‹จํ•˜๊ฒŒ ์งš๊ณ  ๋„˜์–ด๊ฐ€๊ฒ ์Šต๋‹ˆ๋‹ค. ํ™•๋ฅ  ๋ถ„ํฌ๋ž€ ์–ด๋–ค ์‚ฌ๊ฑด์„ ๋Œ€๋ณ€ํ•˜๋Š” ๋žœ๋ค ๋ณ€์ˆ˜๋“ค์˜ ํ™•๋ฅ  ๋ถ„ํฌ๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ฃผ์‚ฌ์œ„๋ฅผ ์ด 6๋ฒˆ ๋˜์ ธ์„œ 1, 2, 3, 5๊ฐ€ ๊ฐ๊ฐ 1๋ฒˆ์”ฉ ๊ทธ๋ฆฌ๊ณ  6์ด 2๋ฒˆ ๋‚˜์™”๋‹ค๋ฉด ์•„๋ž˜์™€ ๊ฐ™์€ ํ™•๋ฅ  ๋ถ„ํฌ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ฆด ์ˆ˜ ์žˆ๊ณ , ์ด๋•Œ์˜ Expectation(๊ธฐ๋Œ“๊ฐ’)์„ ๊ตฌํ•ด๋ณด๋ฉด 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{0}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{2}{6} = \frac{23}{6} \eqsim 3.8 ์ž„์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

If we call an image a data point x and say the face-image dataset we own has the distribution on the left, it has several modes: at the highest-probability mode are blond women's faces, and at relatively lower probability there are faces of dark-haired men wearing glasses. In the very-low-probability tails of the distribution, away from the modes, you get very strange face images.

The goal of a generative model is exactly to learn a data distribution (blue) similar to the distribution of the image dataset we own (red), and GAN does this learning with a Discriminator and a Generator.

Discriminator์˜ Objective Function(V)์„ ๋ณด๋ฉด, ๋จผ์ € ์ฒซ๋ฒˆ์งธ term์€ ๋ฐ์ดํ„ฐ x๋Š” true dataset distribution์ธ p_{data}์—์„œ ์ƒ˜ํ”Œ๋ง ๋˜์—ˆ์„ ๋•Œ Discriminator๋Š” ์ด๋ฅผ ์ง„์งœ๋ผ๊ณ  ํŒ๋ณ„ํ•ด์•ผ ํ•˜๊ณ  ์ด๋Š” output 1(true label)์„ ์ถœ๋ ฅํ•ด์•ผํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šต๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋‘๋ฒˆ์งธ term์€ fake dataset distribution์ธ, ์ฆ‰ generator๊ฐ€ ๋งŒ๋“  ๋ฐ์ดํ„ฐ์ผ ๊ฒฝ์šฐ์— ๊ฐ€์งœ๋ผ๊ณ  ํŒ๋ณ„ํ•ด์•ผ ํ•˜๊ณ  output 0(fake label)์„ ์ถœ๋ ฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ 2๊ฐœ์˜ term์„ ๋ชจ๋‘ maxmizationํ•˜๋Š” ๊ฒƒ์ด Discriminator์˜ ๋ชฉํ‘œ์ด๊ธฐ ๋•Œ๋ฌธ์— \text{max}_DV(\cdot)์ด ๋ฉ๋‹ˆ๋‹ค.

Generator์˜ Objective Function์„ ๋ณด๋ฉด, ์ฒซ๋ฒˆ์งธ true dataset distribution์—์„œ ์ƒ˜ํ”Œ๋ง ๋˜๋Š” ๋ถ€๋ถ„์€ Generator์™€ ์ƒ๊ด€์ด ์—†์Šต๋‹ˆ๋‹ค. ๋‘๋ฒˆ์งธ term์—์„œ Generator์—์„œ ๋‚˜์˜จ ouput new data๋ฅผ Discriminator์—๊ฒŒ ๋„˜๊ฒจ์ฃผ์—ˆ์„ ๋•Œ 1(true label)๋กœ ์ฐฉ๊ฐํ•˜๋„๋ก ๋งŒ๋“ค์–ด์•ผ ํ•˜๋ฏ€๋กœ \text{min}_GV(\cdot)์ด ๋ฉ๋‹ˆ๋‹ค.

WGAN

์œ„์—์„œ ์„ค๋ช…ํ•œ ๊ธฐ๋ณธ์ ์ธ GAN์„ ์ž˜ ํ•™์Šตํ–ˆ์„ ๋•Œ ํ™•๋ฅ ๋ถ„ํฌ๋ฅผ ๊ทธ๋ ค๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์ด Discriminator์˜ ํŒ๋ณ„ ๋ถ„ํฌ๊ฐ€ ๋นจ๊ฐ„์ƒ‰ ๊ทธ๋ž˜ํ”„์ฒ˜๋Ÿผ ๊ทธ๋ ค์ง€๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์™„๋ฒฝํ•˜๊ฒŒ true distribution์ธ p_{data}์— ๋Œ€ํ•ด์„œ๋Š” 1์„, generated distribution p_G์— ๋Œ€ํ•ด์„œ๋Š” 0์„ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ์ง€๋งŒ ์ด๋Ÿฐ ์ƒํ™ฉ์—์„œ๋Š” ์œ ์˜๋ฏธํ•œ ํ•™์Šต์ด ์ผ์–ด๋‚˜๊ธฐ ํž˜๋“ญ๋‹ˆ๋‹ค.

If we assume an optimal Discriminator and look at the objective function again, p_{data} and p_G are so far apart that the computed value of V(\cdot) is effectively constant at 0, so the gradient vanishes. The Generator should be trained in the direction that brings the two distributions closer, but the classic GAN objective function contains no mathematical modeling of that information to pass along.

๋”ฐ๋ผ์„œ ๋ถ„ํฌ๋“ค๊ฐ„์˜ ๋จผ ์ •๋„๋ฅผ ๋ชจ๋ธ๋งํ•  ์ˆ˜ ์žˆ๋Š” WGAN(Wasserstein GAN)์ด ์ œ์•ˆ๋˜์—ˆ๊ณ  ์ด์— ๋Œ€ํ•ด์„œ๋Š” ์ˆ˜ํ•™์ ์œผ๋กœ ๋งค์šฐ ๋”ฅํ•œ ๋‚ด์šฉ์ด ์žˆ์ง€๋งŒ ๋ณธ ํฌ์ŠคํŒ…์—์„œ๋Š” ๊ฐ„๋‹จํ•˜๊ฒŒ ๊ฐœ๋…์ ์œผ๋กœ ๊ณต์‚ฌ์žฅ์˜ ํฌํฌ๋ ˆ์ธ์„ ์ด์šฉํ•˜์—ฌ ์ดํ•ดํ•˜๊ณ  ๋„˜์–ด๊ฐ€๊ฒ ์Šต๋‹ˆ๋‹ค. Wassertein Distance๋Š” Earth moverโ€™s distance๋ผ๊ณ ๋„ ๋ถˆ๋ฆฌ๋Š”๋ฐ ์ด๋ฆ„์—์„œ ์ง๊ด€์ ์œผ๋กœ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋“ฏ์ด, ๋‘ ๋ถ„ํฌ๋ฅผ ์–ด๋–ค ํ™๋”๋ฏธ๋ผ๊ณ  ์ƒ๊ฐํ•˜๊ณ  ์šฐ๋ฆฌ๊ฐ€ Generated Distribution์— ์žˆ๋Š” ํ™๋“ค์„ Real Distribution์˜ ๋ชจ์–‘๋Œ€๋กœ ํ™๋“ค์„ ์˜ฎ๊ธด๋‹ค๊ณ  ํ–ˆ์„ ๋•Œ ๋“œ๋Š” cost๊ฐ€ distance๋กœ ์ •์˜๋œ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. (์ˆ˜ํ•™์ ์œผ๋กœ ๋” ๊ถ๊ธˆํ•˜์‹  ๋ถ„๋“ค์€ Implicit DGM 29 | Wasserstein Distance with GAN์„ ์ถ”์ฒœํ•ฉ๋‹ˆ๋‹ค.) ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ์ด WGAN์„ ์ด์šฉํ•˜์—ฌ reward ๋””์ž์ธ์„ ํ–ˆ์Šต๋‹ˆ๋‹ค.

RL with GAN

Because GAN is usually explained with image-generation examples, which are intuitive and easy, you may suddenly wonder: so how is GAN used in reinforcement learning? Recall that one of RL's difficulties is that the task reward is hard to define well, and an idea comes to mind: couldn't the Discriminator decide the task reward for us? The sequences of states from a demonstration that can serve as a motion reference become the true distribution, the sequences of states coming out of the policy become the generated distribution, and if we define the task reward as the degree to which the Discriminator cannot tell the two distributions apart, it becomes a signal the policy can learn from to imitate the dynamic motions shown in the demonstration. Previously, when the task reward for each motion such as locomotion or backflip was hand designed, you had to consider one by one how the legged robot's feet should move and what the body velocity should be, and take a weighted sum of many reward terms; with this GAN approach, the agent's policy looks at the demonstration states for each motion and figures out by itself which motion to imitate and how, learning in the direction that raises the task reward.
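
To make the idea concrete, here is a minimal PyTorch-style sketch of a WGAN critic used as a task-reward model, under my own assumptions (the module names, network sizes, and the training loop around it are mine, not the paper's):

```python
import torch
import torch.nn as nn

# Minimal sketch (not the paper's code): a WGAN critic scores short state
# histories; its score on policy-generated histories serves as the task reward.
class Critic(nn.Module):
    def __init__(self, state_dim: int, horizon: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim * horizon, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),  # unbounded score: a WGAN critic has no sigmoid
        )

    def forward(self, state_seq: torch.Tensor) -> torch.Tensor:
        # state_seq: (batch, horizon, state_dim) -> (batch,) scores
        return self.net(state_seq.flatten(1)).squeeze(-1)

def critic_loss(critic: Critic, demo_seq: torch.Tensor,
                policy_seq: torch.Tensor) -> torch.Tensor:
    # The critic pushes demo histories to high scores and policy histories
    # to low scores; the score gap approximates the Wasserstein distance.
    return -(critic(demo_seq).mean() - critic(policy_seq).mean())

def task_reward(critic: Critic, policy_seq: torch.Tensor) -> torch.Tensor:
    # Rewarding the policy with the critic's score pulls the distribution
    # of its state histories toward the demonstration's.
    with torch.no_grad():
        return critic(policy_seq)
```

A real WGAN setup would also enforce a Lipschitz constraint on the critic (e.g., a gradient penalty), which I omit here for brevity.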

Method

Problem Definition

์ด์ „์— AMP ๋ฐฉ์‹์—์„œ ๋ชจ์…˜์˜ ์ž์—ฐ์Šค๋Ÿฌ์›€์„ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด Motion data๊ฐ€ ๋งค์šฐ well-defined ๋˜์–ด ์žˆ์–ด์•ผ ํ•œ๋‹ค๊ณ  ํ–ˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ด๋Ÿฌํ•œ Motion data(ํ˜น์€ demonstration)์„ ์–ป๊ธฐ๋Š” ์–ด๋ ต๊ณ  ํŠนํžˆ๋‚˜ ๋ณดํ–‰๊ณผ ๊ฐ™์ด ์ด๋ฏธ ๋งŽ์ด ์—ฐ๊ตฌ๊ฐ€ ๋˜์–ด์™”๊ณ  ๋™๋ฌผ๋“ค์˜ ๋ชจ์Šต์—์„œ๋„ ๋งŽ์ด ๊ด€์ฐฐ๋  ์ˆ˜ ์žˆ๋Š” task์™€๋Š” ๋‹ค๋ฅด๊ฒŒ ๋‹ค์ด๋‚˜๋ฏนํ•œ backflipํ•˜๋Š” ๋ชจ์…˜ task๋“ค์€ ์ฐธ๊ณ ํ•  ๋ฐ์ดํ„ฐ๋“ค๋„ ๋งค์šฐ ์ ๊ณ  ๋งŒ๋“ค์–ด๋‚ด๊ธฐ๋„ ์–ด๋ ต์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฐ ๋ฌธ์ œ ์ƒํ™ฉ์„ ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” Roughํ•˜๊ณ  Partialํ•œ demonstration๋งŒ ์žˆ๋Š” ๋ฌธ์ œ๋กœ ํŒŒ์•…ํ•˜๊ณ  Roughํ•œ ๋ชจ์…˜ ๋ฐ์ดํ„ฐ๋ผ๋Š” ๊ฒƒ์€ ์‹ค์ œ ๋กœ๋ด‡์ด๋‚˜ ๋™๋ฌผ์ด ์›€์ง์—ฌ์„œ ์–ป์€ ๋ฐ์ดํ„ฐ๊ฐ€ ์•„๋‹Œ ์‚ฌ๋žŒ์ด ๋กœ๋ด‡์„ ๋‹จ์ˆœํžˆ ๋“ค๊ณ  ์›€์ง์—ฌ์„œ ์–ป์€ ๋ฐ์ดํ„ฐ๋ฅผ ๋งํ•˜๋ฉฐ Partialํ•˜๋‹ค๋Š” ๊ฒƒ์€ ๋กœ๋ด‡์˜ ๋ชจ์…˜ ๋ฐ์ดํ„ฐ๋ผ๊ณ  ํ•ด์„œ ๋กœ๋ด‡์„ ๊ตฌ์„ฑํ•˜๊ณ  ์žˆ๋Š” ๋ชจ๋“  joint๋“ค์˜ ์›€์ง์ž„์— ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ๊ฐ€ ์•„๋‹Œ ๋กœ๋ด‡์˜ ๋ชธ์ฒด์— ๋Œ€ํ•œ ์ •๋ณด๋งŒ ์žˆ๋Š” ๋ชจ์…˜๋ฐ์ดํ„ฐ๋งŒ ์žˆ๋Š” ๊ฒƒ์„ ๋งํ•ฉ๋‹ˆ๋‹ค.

๋ง๋กœ๋งŒ ๋“ค์œผ๋ฉด ์ž˜ ์™€๋‹ฟ์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— ์œ„์— ์‚ฌ์ง„์—์„œ ํ•œ ์—ฐ๊ตฌ์ž๊ฐ€ backflipํ•˜๋Š” demonstration ๋ฐ์ดํ„ฐ๋ฅผ ์–ป๊ธฐ ์œ„ํ•ด ๋กœ๋ด‡์„ ๋“ค๊ณ  ์†์œผ๋กœ ๊ทธ๋ƒฅ ํ•œ๋ฒˆ ๋’ค์ง‘์–ด์ฃผ๋Š” ๋ชจ์Šต์„ ๋ณด๋ฉด์„œ ๋‹ค์‹œํ•œ๋ฒˆ ์„ค๋ช…์„ ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ์•ž์„œ ์„ค๋ช…ํ–ˆ๋“ฏ์ด ๋กœ๋ด‡์ด backflipํ•˜๋Š” ์ž‘๋™์„ ํ•ด์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์–ป์ง€ ์•Š๊ณ  ์‚ฌ๋žŒ์ด ๋‹จ์ˆœํžˆ ๋กœ๋ด‡์„ ๋“ค๊ณ  ์›ํ•˜๋Š” ๋ชจ์…˜์˜ demonstration ๋ฐ์ดํ„ฐ๋ฅผ ์–ป์Šต๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ Backflip demonstration ๋ฐ์ดํ„ฐ๋Š” ๋กœ๋ด‡์˜ 12๊ฐœ์˜ joint๋“ค์— ๋Œ€ํ•œ ์ •๋ณด๋Š” ์—†์ด ๋ชธ์ฒด์— ๋Œ€ํ•œ ์ •๋ณด(base linear, angular velocity, projected gravity, base height)๋งŒ์„ ํฌํ•จํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ demonstration ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ๋†€๋ผ์šด ์ ์€ ๋กœ๋ด‡์ด ์ง์ ‘ ์›€์ง์—ฌ์„œ ์–ป์€ ๋ฐ์ดํ„ฐ๋„ ์•„๋‹ˆ๊ณ  ์‹ค์ œ ๋™๋ฌผ์˜ ๋ชจ์…˜ ๋ฐ์ดํ„ฐ๋„ ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์— ๋ฌผ๋ฆฌ์ ์œผ๋กœ๋„ ์‹œ๊ฐ„์ ์œผ๋กœ๋„ ๋กœ๋ด‡ ํ”Œ๋žซํผ์—์„œ๋Š” ์‚ฌ์‹ค์ƒ ๋”ฐ๋ผํ•˜๊ธฐ ์–ด๋ ค์šด ๋ฐ์ดํ„ฐ๋ผ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด๋Ÿฐ demo ๋ฐ์ดํ„ฐ๋งŒ ์žˆ๋‹ค๊ณ  ๋ฌธ์ œ์ƒํ™ฉ์„ ๊ฐ€์ •ํ•œ ์ด์œ ๋Š” backflip๊ณผ ๊ฐ™์ด ๋‹ค์ด๋‚˜๋ฏนํ•˜๊ณ  ๋‹ค์–‘ํ•œ ๋ชจ์…˜์— ๋Œ€ํ•ด์„œ๋Š” reference๊ฐ€ ๋  ๋งŒํ•œ motion data๋ฅผ well-definedํ•˜๊ธฐ ์–ด๋ ต๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

์ด์ฏค์—์„œ ๋‹ค์‹œํ•œ๋ฒˆ AMP์™€ WASABI๋ฅผ ๋‹ค์‹œ ๋น„๊ตํ•ด๋ณด๋ฉด, ๋‘๊ฐ€์ง€ ๋ฐฉ๋ฒ• ๋ชจ๋‘ expert์˜ action์ด ์—†์ด๋„ reference๊ฐ€ ๋  ์ˆ˜ ์žˆ๋Š” motion data(ํ˜น์€ demonstration)๋ฅผ ๊ฐ€์ง€๊ณ  reward engineering์„ ์ž˜ํ•ด์„œ ๋ชจ์…˜ ์ œ์–ด๋ฅผ ํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค๋Š” ์ ์—์„œ ๊ณตํ†ต์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ AMP๋Š” well-definedํ•œ ๋ชจ์…˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ์–ด์•ผ ๊ฐ€๋Šฅํ•œ ๋ฐฉ๋ฒ•๋ก ์ธ ๋ฐ˜๋ฉด WASABI๋Š” ๋กœ๋ด‡์˜ ๋ชธ์ฒด์— ๋Œ€ํ•œ partialํ•œ ๋ชจ์…˜ ๋ฐ์ดํ„ฐ๋งŒ ์žˆ์œผ๋ฉด ํ•™์Šตํ•  ์ˆ˜ ์žˆ์—ˆ๊ณ  AMP๋Š” ๋ชจ์…˜์˜ ์ฃผ์š” reward๋ฅผ ๋””์ž์ธํ•œ ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ์ž์—ฐ์Šค๋Ÿฌ์›€์„ ์œ„ํ•œ ๋ณด์กฐ์ ์ธ style reward ๋””์ž์ธ์„ ํ–ˆ๊ณ  WASABI๋Š” ๊ฐ ๋ชจ์…˜์— ๋Œ€ํ•œ task reward๋ฅผ ๋””์ž์ธ ํ•œ ๊ฒƒ์ด ํฐ ์ฐจ์ด์ ์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Reward Design

Given only partial and rough motion demos, how can we define a reward for dynamic motions?

WASABI์—์„œ ์ œ์•ˆํ•œ ์ „์ฒด์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์กฐ๋Š” ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค. r^I, r^R, r^T ๋ผ๋Š” ๊ฐ๊ฐ์˜ reward๊ฐ€ ํ•ฉ์ณ์ง€๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋Š”๋ฐ์š” ์ด์ œ๋ถ€ํ„ฐ ๊ฐ๊ฐ์˜ reward๊ฐ€ ์–ด๋–ค ์˜๋ฏธ์™€ ๋ชฉ์ ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๊ฒƒ์ธ์ง€ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

Imitation(Task) Reward

First of all, the task reward should make the agent imitate the dynamic-motion demo well. That is why it is called the imitation reward or the task reward, and this is the part defined using the WGAN method. To say it once more: the core of this method is not to use a hand-designed imitation reward function that says "to learn a backflip, lift the robot's base into the air and rotate 360 degrees in pitch!", but to let the policy learn a generated distribution that follows the demo (true) distribution.

์ž ๊น ์•ž์—์„œ ์ด์•ผ๊ธฐ ํ–ˆ๋“ฏ์ด ์šฐ๋ฆฌ๊ฐ€ ์‚ฌ์šฉํ•˜๋Š” demo ๋ฐ์ดํ„ฐ๋Š” well-definedํ•œ ๋ฐ์ดํ„ฐ๊ฐ€ ์•„๋‹Œ ์‚ฌ๋žŒ์ด ๋กœ๋ด‡์„ ๋“ค๊ณ  ๋ชจ์€ ๋ฐ์ดํ„ฐ์ด๊ธฐ ๋•Œ๋ฌธ์— ๋กœ๋ด‡์˜ base์— ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ(O)๋กœ ํ•œ์ •์ ์ž…๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ policy์—์„œ generated๋œ observation ๋ฐ์ดํ„ฐ(S)๋Š” ๋กœ๋ด‡์˜ ๊ฐ joint์— ๋Œ€ํ•œ ์ •๋ณด ๋“ฑ ๋” ๋งŽ์€ ์ •๋ณด๊ฐ€ ์žˆ๋Š” vector space์ด๊ธฐ ๋•Œ๋ฌธ์— true distribution๊ณผ generated distribution์„ ๋น„๊ต๊ฐ€๋Šฅํ•œ ์ƒํƒœ๋กœ ๋งŒ๋“ค์–ด์ฃผ๊ธฐ ์œ„ํ•ด Mapping function \phi๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋งž์ถฐ์ค๋‹ˆ๋‹ค. ์‰ฝ๊ฒŒ ์ƒ๊ฐํ•˜์ž๋ฉด ์ •๋ณด๋Ÿ‰์ด ๋” ๋งŽ์€ S๋ฅผ ์ฐจ์›์ด ์ ์€ O๋กœ ๋งž์ถฐ์ฃผ๊ธฐ ์œ„ํ•ด joint position, velocity, last action๊ณผ ๊ฐ™์€ ๋ถ€๋ถ„์„ ๊ฐ€๋ฆฌ๊ณ  data distribution์„ Discriminator์—๊ฒŒ ๋„˜๊ฒจ์ฃผ๋Š” ๊ฒƒ์œผ๋กœ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

The dimension-matched \phi(s) and o are the sequences of states (observations) fed to the Discriminator in the GAN objective function, gathered over a fixed time horizon H as vectors of states, as shown below. Using these state sequences, the reward distributions produced by the Discriminator can be written with the objective functions of LSGAN (Least Squares GAN) and WGAN respectively, shown below. Here LSGAN is another GAN algorithm serving as the baseline against which WGAN is compared. Interpreting LSGAN's objective function: the reward distribution obtained from the policy's state history is pushed toward -1, and the reward distribution obtained from the demo is pushed toward +1. WGAN, in contrast, trains in the direction that shrinks the Wasserstein distance between these two distributions. What the two GANs have in common is that both train the task-reward distribution induced by the policy's sequences of states to match the task-reward distribution induced by the demo's sequences of states.
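
Written out, the two discriminator objectives described here take roughly the following form (notation simplified by me, with o the demo observation sequences and \phi(s) the mapped policy state sequences):

\text{LSGAN:}\quad \min_D \; \mathbb{E}_{o \sim \text{demo}}\big[(D(o) - 1)^2\big] + \mathbb{E}_{s \sim \pi}\big[(D(\phi(s)) + 1)^2\big]

\text{WGAN:}\quad \max_D \; \mathbb{E}_{o \sim \text{demo}}\big[D(o)\big] - \mathbb{E}_{s \sim \pi}\big[D(\phi(s))\big]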

The task reward coming out of the Discriminator is not used directly; it first goes through a standardization step that makes it zero-mean and unit-variance, and only then does it become the final Task (Imitation) Reward.
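
Schematically (my notation), with \mu_D and \sigma_D running estimates of the mean and standard deviation of the Discriminator's outputs:

r^I = \frac{D(\phi(s)) - \mu_D}{\sigma_D}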

Regularization Reward

์ด์ „์— AMP์—์„œ์˜ Style reward์˜ ์—ญํ• ์„ WASABI์—์„œ๋Š” Regularization Reward๊ฐ€ ๋Œ€์‹ ํ•œ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด reward๋Š” task-dependentํ•˜์ง€ ์•Š์€ task-agnosticํ•œ term๋“ค๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์–ด์„œ backflip ๋ชจ์…˜์„ ํ•˜๋“  locomotion ๋ชจ์…˜์„ ํ•˜๋“  ๋กœ๋ด‡์˜ ์ž์—ฐ์Šค๋Ÿฝ๊ณ  ์—๋„ˆ์ง€ ํšจ์œจ์ ์ธ ๋ชจ์…˜์„ ์œ„ํ•ด ๋ถ€๊ฐ€์ ์œผ๋กœ ๋”ํ•ด์ง€๋Š” reward๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Termination Reward

Finally, to prevent the failure mode where the agent decides that ending the episode early is more profitable than learning the motion properly, a Termination Reward is added. T acts as an indicator, 0 or 1, of whether the episode was ended too early, and the termination term is set as below using \sigma from the imitation-reward distribution and the discount factor \gamma.
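
The post pins the term down only to this level of detail; as an explicitly labeled assumption of my own about the sign and scale, one natural reading is a penalty on the order of the standardized imitation reward the agent would forgo by terminating early:

r^T = -\frac{\sigma}{1 - \gamma}\, T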

Total Reward

The Total reward is computed by summing the Imitation reward r^I, the Regularization reward r^R, and the Termination reward r^T described above, and this sum is sent to the agent as its learning feedback. Here r^I and r^T can be defined differently for each motion task, so they are the task-related part, while r^R is the same reward term regardless of the motion task, so it is the task-agnostic part. The part where this work's contribution stands out is, of course, the Imitation reward r^I.
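
That is, at each timestep the agent receives

r = r^I + r^R + r^T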

Result

์‹คํ—˜์—์„œ ์‚ฌ์šฉํ•œ ๋กœ๋ด‡ ํ”Œ๋žซํผ์€ Solo 8์ด๋ผ๋Š” 4์กฑ ๋ณดํ–‰ ๋กœ๋ด‡์ž…๋‹ˆ๋‹ค. ๋กœ๋ด‡์˜ ๊ฐ ๋‹ค๋ฆฌ๋Š” 2๊ฐœ์˜ joint๊ฐ€ ์žˆ๊ณ  ์ƒํ•˜์ขŒ์šฐ ๋Œ€์นญ์ ์œผ๋กœ ๋‹ค๋ฆฌ์˜ joint๋ฅผ ๊บพ์„ ์ˆ˜ ์žˆ์œผ๋ฉฐ ๋‹ค๋ฅธ 4์กฑ ๋ณดํ–‰ ๋กœ๋ด‡๋“ค์— ๋น„ํ•ด ๋น„๊ต์  ์†Œํ˜• ํ”Œ๋žซํผ์ด๊ณ  jumping์ด ๊ฐ€๋Šฅํ•˜๋‹ค๋Š” ํŠน์ง•์„ ๊ฐ€์ง„ ์˜คํ”ˆ ์†Œ์Šค ํ”Œ๋žซํผ์ž…๋‹ˆ๋‹ค.

์ด 4๊ฐ€์ง€ ๋ชจ์…˜ task๋ฅผ ์‹คํ—˜ํ–ˆ์œผ๋ฉฐ ๊ฐœ๊ตฌ๋ฆฌ์ฒ˜๋Ÿผ ํด์งํด์ง ๋›ฐ๋Š” ๋“ฏํ•œ LEAP, ๋ชธ์ฒด๋ฅผ ์›จ์ด๋ธŒ ํƒ€๋“ฏ ์›€์ง์ด๋ฉด์„œ ๊ฑท๋Š” WAVE, ๋’ท ๋‹ค๋ฆฌ 2๊ฐœ๋ฅผ ๊ฐ€์ง€๊ณ  2์กฑ ๋ณดํ–‰์œผ๋กœ ์„œ๋Š” STANDUP, ๋งˆ์ง€๋ง‰์œผ๋กœ ๊ณต์ค‘์—์„œ 360๋„ ๋„๋Š” BACKFLIP๊นŒ์ง€ 4๊ฐœ์˜ ๋ชจ์…˜์„ ํ•™์Šตํ–ˆ์Šต๋‹ˆ๋‹ค.

Induced Imitation Reward Distributions

First, to see whether the Imitation Reward Distribution was really learned meaningfully (whether an informative reward distribution was produced), the authors visualized the reward distribution. It is worth pinning down what an informative distribution means here. In the two distribution graphs on the right of the figure below, the peaked distribution (green) is more informative than the flat one (blue), because it assigns discriminating y values (probabilities) across the x values. (For more detail, information theory is worth a look.)

์™ผ์ชฝ์˜ 2๊ฐœ์˜ ๊ทธ๋ž˜ํ”„๋Š” ๊ฐ๊ฐ LSGAN๊ณผ WGAN(WASABI)๋ฅผ ๊ฐ€์ง€๊ณ  ํ•™์Šตํ–ˆ์„ ๋•Œ, O์˜ ์š”์†Œ๋“ค ์ค‘ ๊ณ ์ •๋œ pitch rate(\dot\theta)์™€ height(z)๋ฅผ ๊ฐ€์ง€๊ณ  Imitation reward ๋ถ„ํฌ๋ฅผ ์‹œ๊ฐํ™”ํ•œ ๊ทธ๋ž˜ํ”„์ž…๋‹ˆ๋‹ค. LSGAN๋ณด๋‹ค WGAN์œผ๋กœ ํ•™์Šตํ•œ ๋ถ„ํฌ๊ฐ€ reward range๋„ ๋” ๋„“๊ณ  ๋” ๊ตฌ๋ถ„๋˜๋Š” ๋ถ„ํฌ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ ์„ธ๋ฒˆ์งธ ๊ทธ๋ž˜ํ”„๋Š” ํ•™์Šต ๊ณผ์ • ์ค‘์— r^I์˜ ๋ถ„ํฌ๋ฅผ ๊ทธ๋ฆฐ ๊ฒƒ์œผ๋กœ LSGAN์€ -1๊ณผ 1, ๊ฐ๊ฐ์œผ๋กœ reward targeting์„ ํ•˜๊ฒŒ ๋˜๋Š” objective function์„ ๊ฐ€์ง€๊ณ  ์žˆ์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ๋„“๊ณ  ๋‹ค์–‘ํ•œ reward distribution์„ ๊ฐ€์ง€์ง€ ๋ชปํ•œ ๋ชจ์Šต์„ ๋ณผ ์ˆ˜ ์žˆ๊ณ  ๊ทธ์— ๋ฐ˜ํ•ด WGAN์€ ์•ฝ -5~2 ์ •๋„์˜ range๋ฅผ ๊ฐ€์ง€๋Š” ๋„“์€ reward distribution์„ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

Learning to Mimic Rough Demonstrations

So how well could the policy really learn to follow the demo motion data? A high reward alone cannot answer this; we need a separate metric that judges the similarity of motions.

Dynamic Time Warping

Dynamic Time Warping (DTW) is a method for comparing two time series whose durations differ and whose numbers of data points differ. Where a plain Euclidean distance either cannot be computed or compares poorly, DTW can judge the similarity between time series while accounting for temporal shifts and missing data points. This is the method used to measure the difference between the demo motion data that a person produced by carrying the robot and the motion data produced by the policy after training.
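
For readers who have not met it before, here is a plain textbook DTW in Python (a simplified sketch of the metric, not the paper's exact implementation):

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """O(len(a) * len(b)) dynamic-programming DTW between two trajectories
    of shape (T, dim), allowing the two series to differ in length."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # pointwise distance
            # extend the cheapest of: match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```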

\tau_\pi denotes a trajectory produced by the policy and \tau_M a trajectory taken from the demo; the results table below shows the DTW values for the 4 tasks under WASABI and under LSGAN. A lower DTW means higher similarity to the demo data, i.e., the motion was imitated well. (Stand Still below is the DTW between the demo data and data of the robot simply standing still; it serves as an upper-bound reference for comparison.)

Handcrafted Task Reward

Another metric: if scoring the motion with a handcrafted task reward for that motion task gives a higher score, we can judge that the motion was learned well. For example, STANDUP is being performed well if the base pitch angle is close to 90 degrees, the base is held high, and the base z-axis is perpendicular to the direction of gravity. Computing such a handcrafted task reward for the desired motion at every training iteration and plotting it, as in the figure on the right, shows that the reward scores of policies trained with WASABI are generally higher than those trained with LSGAN. The table below lists the handcrafted reward scores for each task after training, with the bottom row as the upper-bound reference score. The bold entries mark the cases judged successful when the roll-outs were inspected visually, and WASABI shows successful learning results on all 4 tasks.
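
As an illustration of what such a handcrafted score could look like for STANDUP, following the three criteria in the text (the equal weighting and the target height below are hypothetical choices of mine, not the paper's):

```python
import numpy as np

def standup_score(pitch: float, base_height: float,
                  z_axis_world: np.ndarray, target_height: float = 0.5) -> float:
    """Toy handcrafted STANDUP score: pitch near 90 degrees, base held high,
    base z-axis perpendicular to gravity. Each term lies in [0, 1]."""
    gravity_dir = np.array([0.0, 0.0, -1.0])
    pitch_term = np.exp(-abs(pitch - np.pi / 2))                     # pitch ~ 90 deg
    height_term = np.clip(base_height / target_height, 0.0, 1.0)    # base held high
    axis_term = 1.0 - abs(float(np.dot(z_axis_world, gravity_dir)))  # z-axis perpendicular to gravity
    return float(pitch_term + height_term + axis_term) / 3.0
```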

Evaluation on Real Robot

ํ•™์Šต์ด ์‹œ๋ฎฌ๋ ˆ์ด์…˜์—์„œ๋งŒ ๋ฉˆ์ถ˜๋‹ค๋ฉด ๋‹น์—ฐํžˆ ์˜๋ฏธ๊ฐ€ ์—†๋Š” ๊ฒƒ์ด๋ฏ€๋กœ ์‹ค์ œ ๋กœ๋ด‡์„ ๊ฐ€์ง€๊ณ  ํ•ด๋‹น policy์˜ ํ•™์Šต ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•ด๋ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ WASABI๋กœ ํ•™์Šตํ•œ policy๋ฅผ ๊ฐ€์ง€๊ณ  ์‹ค์ œ ๋กœ๋ด‡์œผ๋กœ ์ž‘๋™์„ ํ•ด๋ณด๊ณ  ์ด๋•Œ 10๊ฐœ์˜ marker๋ฅผ ์ด์šฉํ•ด์„œ ๋ชจ์…˜ ๋ฐ์ดํ„ฐ๋ฅผ ์–ป์–ด DTW๋ฅผ ์ธก์ •ํ•ด๋ณด์•˜์Šต๋‹ˆ๋‹ค. ๊ทธ ๊ฒฐ๊ณผ ํ‘œ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด Sim-to-Real์˜ ํผํฌ๋จผ์Šค ์ฐจ์ด๊ฐ€ ๊ฑฐ์˜ ์—†์—ˆ๊ณ  ์‹ค์ œ ๋กœ๋ด‡์—์„œ๋„ 4๊ฐ€์ง€ task ๋ชจ๋‘ ๋‹ค ์ž˜ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ถ€๋ถ„์€ ์‹คํ—˜์˜์ƒ์—์„œ ์ง์ ‘ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Leap

Wave

Stand up

Backflip

Cross-platform Imitation

In general, results learned by reinforcement learning on one robot platform are hard to transfer directly to a robot platform with a different configuration. But because WASABI trains from rough and partial demo data in the first place, trying it on another robot platform was possible: the demonstrations originally used with the Solo 8 platform could be reused for the Anymal-C platform, accounting for nothing but the size difference between the platforms by adjusting the base height to a value 0.25 m larger, without any other special adaptation. Here too, checking the DTW values showed low DTW, and the robot behaved well when rolled out in simulation.

Conclusion

๋กœ๋ด‡์˜ ๋ชจ์…˜์ œ์–ด๋ฅผ ๊ฐ•ํ™”ํ•™์Šต์œผ๋กœ ํ’€์–ด๊ฐ€๋ ค๊ณ  ํ•  ๋•Œ ๊ฐ€์žฅ ์–ด๋ ค์šด ๋ถ€๋ถ„์ธ task reward๋ฅผ ๋” ์ด์ƒ handcrafted ์ ์ธ ๋””์ž์ธ์— ์˜์กดํ•˜์ง€ ์•Š๊ณ  reward distribution์˜ ๊ด€์ ์œผ๋กœ ์ ‘๊ทผํ•˜์—ฌ ์ƒ์„ฑ ๋ชจ๋ธ ๋ถ„์•ผ์˜ ์•„์ด๋””์–ด์ธ GAN์˜ ์•„์ด๋””์–ด๋ฅผ ๋นŒ๋ ค ์ ‘๊ทผํ•œ ๊ฒƒ์ด ์ •๋ง ์‹ ์„ ํ•œ ๋…ผ๋ฌธ์ด์—ˆ์Šต๋‹ˆ๋‹ค. Policy๋ฅผ GAN์—์„œ์˜ Generator๋กœ ๋ฐ”๋ผ๋ณด๊ณ  ๋ฌธ์ œ๋ฅผ ๋””์ž์ธํ•œ ๊ฒƒ๋„ ์ •๋ง ์‹ ๊ธฐํ–ˆ์œผ๋ฉฐ ์—ฌ๋Ÿฌ๊ฐ€์ง€ GAN ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ค‘์—์„œ LSGAN๊ณผ WGAN์˜ ์ฐจ์ด๋ฅผ ๋ช…ํ™•ํžˆ ๋ณด์—ฌ์ฃผ๋ฉฐ ๋น„๊ต๋ฅผ ์ˆ˜์น˜์ ์œผ๋กœ ๋ณด์—ฌ์ฃผ๊ณ  ํ•ด์„ํ•œ ์ ๋„ ์ธ์ƒ์ ์ธ ์—ฐ๊ตฌ์˜€์Šต๋‹ˆ๋‹ค.

Reference

  • Original Paper: Learning Agile Skills via Adversarial Imitation of Rough Partial Demonstrations
  • Original Project Homepage: CoRL2022-WASABI
  • CoRL 2022 Oral Presentation
  • Learning Quadrupedal Locomotion over Challenging Terrain
  • Joonho Lee: Learning Quadrupedal Locomotion over Challenging Terrain
  • Advanced Skills through Multiple Adversarial Motion Priors in Reinforcement Learning
  • What Are GANs?
  • Mastering GAN (Generative Adversarial Network) in One Hour
  • CS 182: Lecture 19: Part 3: GANs
  • GANs for Synthetic Data Generation
  • An Open Torque-Controlled Modular Robot Architecture for Legged Locomotion Research
  • DTW(Dynamic Time Warping)
  • ํŒŒ์ด์ฌ ์ฝ”๋”ฉ์œผ๋กœ ๋งํ•˜๋Š” ๋ฐ์ดํ„ฐ ๋ถ„์„ - 10. DTW (Dynamic time wrapping)
  • Dynamic time warping 1: Motivation
