
📃 Legged Robots that Keep on Learning (Review)

quadruped
rl
redq
paper
Fine-Tuning Locomotion Policies in the Real World
Published: June 26, 2022

0. Abstract

Legged robots are physically capable of traversing a wide range of challenging environments but designing controllers that are sufficiently robust to handle this diversity has been a long-standing challenge in robotics. Reinforcement learning presents an appealing approach for automating the controller design process and has been able to produce remarkably robust controllers when trained in a suitable range of environments. However, it is difficult to predict all likely conditions the robot will encounter during deployment and enumerate them at training-time. What if instead of training controllers that are robust enough to handle any eventuality, we enable the robot to continually learn in any setting it finds itself in? This kind of real-world reinforcement learning poses a number of challenges, including efficiency, safety, and autonomy. To address these challenges, we propose a practical robot reinforcement learning system for fine-tuning locomotion policies in the real world. We demonstrate that a modest amount of real-world training can substantially improve performance during deployment, and this enables a real A1 quadrupedal robot to autonomously fine-tune multiple locomotion skills in a range of environments, including an outdoor lawn and a variety of indoor terrains.

I. Introduction

๊ฐ•ํ™”ํ•™์Šต์ด ๋กœ๋ด‡ ์ œ์–ด ๋ถ„์•ผ์—์„œ ๊ฐ๊ด‘ ๋ฐ›๋Š” ์ด์œ ๊ฐ€ ๋ฌด์—‡์ผ๊นŒ? ๊ธฐ์กด์˜ ๋กœ๋ด‡ ์ œ์–ด ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์€ ์ •๋ง ๋งŽ์€ engineering ์ ์ธ ๊ณ ๋ ค์™€ ๋ณต์žกํ•œ ์ˆ˜ํ•™์  ๋ชจ๋ธ๋ง์ด ํ•„์š”ํ•˜๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ๊ทธ๋งˆ์ € ์—”์ง€๋‹ˆ์–ด๊ฐ€ ๋ฏธ์ฒ˜ ๊ณ ๋ คํ•˜์ง€ ๋ชปํ•œ ์ž‘๋™์„ ํ•ด์•ผ ํ•  ๋•Œ๋Š” ๋ฐ”๋กœ ์‹คํŒจํ•œ controller ๋””์ž์ธ์ด ๋˜์–ด ๋ฒ„๋ฆฌ๊ธฐ ๋•Œ๋ฌธ์— ๋กœ๋ด‡ ์ œ์–ด๋Š” ์‰ฝ์ง€ ์•Š์€ ๋ฌธ์ œ์˜€๋‹ค. ์ด๋Ÿฐ ๋ฉด์—์„œ ๊ฐ•ํ™”ํ•™์Šต์€ controller๋ฅผ trial-and-error๋กœ ๋กœ๋ด‡ agent๊ฐ€ ์•Œ์•„์„œ ์–ด๋–ป๊ฒŒ ์ž‘๋™ํ•ด์•ผ ํ• ์ง€ ํ•™์Šตํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ณตํ•™์ž์—๊ฒŒ controller ๋””์ž์ธ์— ๋Œ€ํ•œ ๋ถ€๋‹ด์„ ์ค„์—ฌ์ฃผ์—ˆ๊ณ  ์ด๋Ÿฐ ์ ์— ๊ฐ•ํ™”ํ•™์Šต์ด ๋กœ๋ด‡ ์ œ์–ด ๋ถ„์•ผ์—์„œ ์ฃผ๋ชฉ ๋ฐ›๋Š” ์ด์œ ์˜€๋‹ค.

Unfortunately, while reinforcement learning reduces the burden of building the controller, it shifts that burden onto designing the environment. As described above, learning by trial-and-error is attractive, but it only works when the agent is given a good environment to learn in. As the saying often heard in the RL community goes, behind every good agent is a good environment: if the agent cannot gather good experience from its environment, good results cannot be expected. Controller design and environment design therefore end up in a kind of trade-off, and either way the engineer is left with a hard task.

The larger the gap between the environment the agent experiences during training and the environment it faces at test time (in actual use), the worse the agent performs: its learned policy cannot choose good actions for experiences it was never trained on. Solving this in general would require zero-shot generalization, i.e., the ability to generalize well to data the agent has never seen. This paper instead starts from the assumption that perfect zero-shot generalization is unattainable and asks how the problem should be tackled under that assumption.

The proposed answer is to fine-tune quickly in the test environment so that the agent works well there. If this is possible, the robot can keep adapting (being fine-tuned) to whatever new environment it encounters while operating, and keep working well.

🎯 The goal of this paper is to design a complete system with which a robot's locomotion policies can be fine-tuned in the real world.

System Process

  1. ์œ„์˜ ์‚ฌ์ง„์— ๋ณด์ด๋Š” ๊ณต์›๊ณผ ๊ฐ™์€ ์ƒˆ๋กœ์šด ํ™˜๊ฒฝ์—์„œ ๋จผ์ € ๋กœ๋ด‡ agent๊ฐ€ ์ฒซ๋ฒˆ์งธ ์‹œ๋„๋กœ locomotion task๋ฅผ ์ง„ํ–‰ํ•œ๋‹ค.
  2. ๋งŒ์•ฝ์— ๋•…์ด ๊ณ ๋ฅด์ง€ ๋ชปํ•ด์„œ agent์˜ ํ•™์Šต๋œ policy๋ฅผ ํ™œ์šฉํ•  ์ˆ˜ ์—†๋Š” ์ƒํ™ฉ์ด ๋˜์–ด์„œ ๋„˜์–ด์ง€๊ฒŒ ๋˜๋Š” ์ƒํ™ฉ์ด ๋  ์ˆ˜ ๋„ ์žˆ๋‹ค.
  3. ์ด๋•Œ reset controller๋ฅผ ์ด์šฉํ•ด์„œ ๋น ๋ฅด๊ฒŒ ๋‹ค์‹œ ์ผ์–ด๋‚œ๋‹ค.
  4. ์‹ค์ œ task์—์„œ ์ข€ ๋” ๋ช‡ ๋ฒˆ ์‹œ๋„๋ฅผ ํ•˜๋ฉด์„œ 1~3์˜ ๊ณผ์ •์„ ๋ช‡ ๋ฒˆ ๋ฐ˜๋ณตํ•˜๊ฒŒ ๋˜๊ณ  ์ด ๊ณผ์ •์—์„œ policy๊ฐ€ ์—…๋ฐ์ดํŠธ ๋˜๊ฒŒ ๋œ๋‹ค.
  5. ์—…๋ฐ์ดํŠธ๊ฐ€ ๋˜๋ฉด์„œ policy๋Š” ์ƒˆ๋กœ์šด test ํ™˜๊ฒฝ์—์„œ ์ œ๋Œ€๋กœ ์ž‘๋™ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋œ๋‹ค.

How

  • ๊ฐ•ํ™”ํ•™์Šต์˜ reward ๊ฐ€ robot์˜ on-board ์„ผ์„œ๋กœ ์ธก์ •๋˜๋Š” ๊ฐ’๋“ค๋กœ๋งŒ ๋””์ž์ธ ๋˜์–ด์•ผ ์‹ค์ œ Real-world์—์„œ ์ž‘๋™ํ•˜๋ฉด์„œ fine tuning์„ ํ•  ์ˆ˜ ์žˆ๋‹ค.
  • Agileํ•œ behavior๋ฅผ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด์„œ Motion imitation ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ–ˆ๋‹ค.
  • ๋กœ๋ด‡์˜ ๋„˜์–ด์ง€๊ณ  ๋‚˜์„œ ๋น ๋ฅด๊ฒŒ ์ •์ƒ์ž์„ธ๋กœ ํšŒ๋ณตํ•  ์ˆ˜ ์žˆ๋„๋ก Recovery policy๋ฅผ ํ•™์Šตํ–ˆ๋‹ค.
  • ๊ฐ•ํ™”ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค ์ค‘์—์„œ REDQ(Randomized Ensembled Double Q-Learning) ๋ผ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ–ˆ๋Š”๋ฐ, ์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์—ฌ๋Ÿฌ๊ฐœ Q-network๋“ค์˜ ์•™์ƒ๋ธ”์„ ํ†ตํ•ด randomization์„ ํ•ด์„œ Q-learning ๊ณ„์—ด์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์˜ sample-efficiency์™€ ์•ˆ์ •์„ฑ์„ ํ–ฅ์ƒ์‹œํ‚จ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค.

Main Contribution

๋ณธ ๋…ผ๋ฌธ์˜ ์ฃผ์š” contribution์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • An automated fine-tuning system for learning agile locomotion skills of a quadruped robot in the real world is proposed.
  • It is shown for the first time that fine-tuning in the real world is possible with automated resets and on-board state estimation.
  • With an A1 robot, dynamic skills are learned: pacing forward and backward on an outdoor lawn, and side-stepping in three environments with different terrain characteristics.

Details with Hash tags

์› ๋…ผ๋ฌธ์˜ II. Related Work section ์ฐธ๊ณ 

#Cumbersome controller designs

  • ์ด์ „์˜ ๋กœ๋ด‡ controller๋“ค์€ footstep planning, trajectory optimization, model-predictive control (MPC) ๋“ฑ์˜ ์กฐํ•ฉ์œผ๋กœ ๋งŒ๋“ค์–ด์ง€๊ณ  ์žˆ์—ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด๋Ÿฐ ๋ฐฉ๋ฒ•๋“ค์€ ๋กœ๋ด‡์˜ ๋™์—ญํ•™๊ณผ ๊ฐ ๋กœ๋ด‡๋งˆ๋‹ค ๋‹ค๋ฅด๊ณ  ๊ฐ skill๋งˆ๋‹ค ๋‹ค๋ฅธ ๋งŽ์€ ์š”์†Œ๋“ค์„ ๊ณ ๋ คํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ •๋ง ์–ด๋ ค์› ๋‹ค.

#Sim2Real

  • Because RL algorithms depend heavily on trial-and-error data and raise hardware safety issues, robot RL agents are usually trained in simulation. However, it is practically impossible to anticipate at training time every condition the robot will meet in the real world, so even the most robust policy cannot be said to generalize to all situations.

#Real-world

  • Previous work on learning complex motions relied on instrumenting the environment with various external devices to provide rich state information. Here, because fine-tuning must run on a robot operating in the real world, only state estimates available from the robot's on-board sensors are used; no motion capture or other external equipment is involved.

  • Rather than learning walking gaits from scratch on structurally simple robots, the A1 robot learns skills such as pacing and side-stepping that are very natural, somewhat unstable, and demand fine balancing. (Earlier work tended to focus on slow, less natural walking gaits that prioritize stability.) Motion imitation and real-world fine-tuning play a central role in making these dynamic tasks succeed. In addition, when the robot falls while operating in the real world, it is not reset or recovered manually; instead, an RL-trained controller resets it automatically.

#Few-shot adaptation

  • ๊ธฐ์กด์˜ Adaptation structure๋ผ๋Š” ๊ตฌ์กฐ๋ฅผ ๋งŒ๋“ค์–ด์„œ ํ•™์Šต์‹œ์ผœ์„œ latent ๋˜๋Š” explicitํ•œ ํ™˜๊ฒฝ์— ๋Œ€ํ•œ descriptor๋กœ adaptiveํ•œ policy๋ฅผ ๋งŒ๋“œ๋Š” ์—ฐ๊ตฌ๋“ค์ด ์žˆ์—ˆ์œผ๋‚˜, ์ด ๊ธฐ๋ฒ•๋“ค ๋˜ํ•œ ๊ฒฐ๊ตญ training์—์„œ ๊ฒฝํ—˜ํ–ˆ๋˜ ๊ฒƒ๋“ค์„ ๊ธฐ๋ฐ˜์œผ๋กœ adaptiveํ•จ์„ ๋ณด์ด๋Š” ๊ฒƒ์ด๋ฏ€๋กœ ์‹ค์ œ test ํ™˜๊ฒฝ์ด ์ด ํ—ˆ์šฉ ๋ฒ”์œ„์—์„œ ๋งŽ์ด ๋ฒ—์–ด๋‚  ๊ฒฝ์šฐ ์ œ๋Œ€๋กœ ์ž‘๋™์•ˆ๋˜๋Š” ๊ฒƒ์€ ๋˜‘๊ฐ™๋‹ค. ๋”ฐ๋ผ์„œ ๊ฐ•ํ™”ํ•™์Šต์œผ๋กœ ์ง€์†์ ์ธ ์ ์‘์ ์ธ ํ•™์Šต๋Šฅ๋ ฅ์„ ๋ณด์žฅํ•ด์„œ ์–ด๋–ค test ํ™˜๊ฒฝ์—์„œ๋“  ์ž˜ ์ž‘๋™ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ–ˆ๋‹ค.

#RL Algorithm

  • For the RL algorithm, the authors draw on off-policy model-free RL methods widely used for vision-based grasping tasks with manipulators, and successfully apply them to legged-robot locomotion with a floating base, which is more challenging than a fixed manipulator.

II. Fine-tuning Locomotion in the Real World

Training is set up so that the RL algorithm can handle multiple tasks:

  • The REDQ algorithm is used to improve sample efficiency.
  • A learned reset policy is used to stitch together consecutive episodes during training.

Overview

As the system overview figure below shows, each policy learns a single desired skill: one policy handles forward pacing, another backward pacing, and another the reset. Because the framework is built to carry out these different tasks, it is a multitask framework.


Pseudo Algorithm

As seen in the system overview, Algorithm 2 in the paper, which describes the whole system, proceeds in two main phases (a rough code sketch of this flow follows the list below).

  1. The agent's policies are pretrained in simulation. (Algorithm 2, lines 2-7)
    • At the end of each episode, the learned recovery policy prepares the robot so that the next rollout can begin.
    • The policy for each skill is trained independently, and the recovery policy is likewise trained independently.
  2. Fine-tuning then continues the training process in the real physical environment. (Algorithm 2, lines 8-14)
    • To account for the difference between simulation and the real world, each policy's replay buffer is reinitialized. (Algorithm 2, line 12)

Motion Imitation & Off-policy RL

Motion Imitation

Motion imitation is used so that skills are learned by imitating reference motion clips, following the method proposed in Learning Agile Robotic Locomotion Skills by Imitating Animals. (Algorithm 1, lines 1-4)

Given a reference motion M, the agent's sequence of poses is compared against it and the policy is trained with the reward function introduced in Section III-B.

  • With this approach, simply swapping the reference motion data is enough to learn several different skills.
  • To train the recovery policy, the robot can be made to imitate a standing pose. (See III-C.)

Off-policy RL

The off-policy REDQ algorithm is used. (Algorithm 1, lines 5-9)

  • It builds on and improves the SAC algorithm.
  • It raises the number of gradient steps taken per environment time step, which improves the sample efficiency of the RL algorithm.
  • The overestimation issue that can arise from taking so many gradient steps is mitigated with an ensemble technique (illustrated in the sketch below).
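
As a rough illustration of the in-target minimization REDQ adds on top of SAC, the snippet below computes a Bellman target from a random pair of target Q-value estimates drawn from the ensemble. The shapes, the entropy coefficient, and the toy inputs are assumptions for illustration, not values from the paper.

    import numpy as np

    def redq_target(rewards, dones, next_q_ensemble, next_logp,
                    gamma=0.99, alpha=0.2, num_sampled=2):
        """Bellman target using a random subset of the target Q-networks.

        next_q_ensemble: (N, batch) array of each target network's Q(s', a')
        for next actions a' sampled from the current policy."""
        idx = np.random.choice(next_q_ensemble.shape[0], size=num_sampled, replace=False)
        min_q = next_q_ensemble[idx].min(axis=0)        # elementwise min over the subset
        soft_value = min_q - alpha * next_logp          # SAC-style entropy bonus
        return rewards + gamma * (1.0 - dones) * soft_value

    # Toy usage: an ensemble of 10 Q-networks and a batch of 256 transitions.
    target = redq_target(rewards=np.zeros(256), dones=np.zeros(256),
                         next_q_ensemble=np.random.randn(10, 256),
                         next_logp=np.random.randn(256))

Every Q-network in the ensemble is then regressed toward this shared target, and many such gradient steps are taken per environment step, which is where the sample-efficiency gain comes from.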

III. System Design

Settings (collected into a small config sketch below):

  • A1 robot from Unitree
  • PyBullet simulator
  • To obtain the motion imitation skills:
    • a dog pacing mocap clip from publicly available datasets was retargeted to the robot
    • an A1 side-step motion was animated using the robot's inverse kinematics
  • REDQ algorithm
  • Adam optimizer
  • learning rate of 10^-4
  • batch size of 256 transitions
  • TensorFlow
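
Collected in one place, the settings above might be written as a plain configuration dictionary like the following; this is just a restatement of the list, not a config file from the authors.

    train_config = {
        "robot": "Unitree A1",
        "simulator": "PyBullet",
        "algorithm": "REDQ",
        "optimizer": "Adam",
        "learning_rate": 1e-4,       # learning rate of 10^-4
        "batch_size": 256,           # transitions per gradient step
        "framework": "TensorFlow",
    }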

A. State & Action Spaces

  1. State space

    • The state is defined from the following information over 3 consecutive timesteps:
      • Root orientation (read from the IMU)
      • Joint angles
      • Previous actions
    • Besides this proprioceptive input, the policy also receives a goal g_t as input.
      • g_t contains the target poses (root position, root rotation, joint angles) computed from the reference motion at future timesteps.
      • The 4 future target poses cover roughly the next 1 second from the current timestep.
  2. Action space

    • The action consists of PD position targets for the 12 joints.
    • Commands are applied at a frequency of 33 Hz.
    • For natural motion, the PD targets are passed through a low-pass filter before being applied to the robot (see the sketch after this list).

B. Reward Function

\begin{gathered}r_{t}=w^{\mathrm{p}} r_{t}^{\mathrm{p}}+w^{\mathrm{v}} r_{t}^{\mathrm{v}}+w^{\mathrm{e}} r_{t}^{\mathrm{e}}+w^{\mathrm{rp}} r_{t}^{\mathrm{rp}}+w^{\mathrm{rv}} r_{t}^{\mathrm{rv}} \\w^{\mathrm{p}}=0.5, w^{\mathrm{v}}=0.05, w^{\mathrm{e}}=0.2, w^{\mathrm{rp}}=0.15, w^{\mathrm{rv}}=0.1\end{gathered}


  • r_{t}^{\mathrm{p}} : reward term that encourages the robot's joint rotations to match the joint rotations of the reference motion (a short code sketch of these terms appears after this list)

    r_{t}^{\mathrm{p}}=\exp \left[-5 \sum_{j}\left\|\hat{q}_{t}^{j}-q_{t}^{j}\right\|^{2}\right]

    • \hat{q}_{t}^{j} : local rotation of the j-th joint of the reference motion at time t
    • q_{t}^{j} : local rotation of the robot's j-th joint
  • r_{t}^{\mathrm{v}} : joint velocities

  • r_{t}^{\mathrm{e}} : end-effector positions

  • reward terms that encourage the robot to track the reference root motion

    • r_{t}^{\mathrm{rp}} : root pose reward
    • r_{t}^{\mathrm{rv}} : root velocity reward
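
As a minimal sketch, the pose term and the weighted sum can be written directly from the formulas above; the velocity, end-effector, and root terms are left as plain inputs here since their exact expressions are analogous.

    import numpy as np

    WEIGHTS = {"p": 0.5, "v": 0.05, "e": 0.2, "rp": 0.15, "rv": 0.1}

    def pose_reward(ref_joint_rotations, robot_joint_rotations):
        """r_t^p = exp(-5 * sum_j ||q_hat_t^j - q_t^j||^2)"""
        diff = np.asarray(ref_joint_rotations) - np.asarray(robot_joint_rotations)
        return float(np.exp(-5.0 * np.sum(diff ** 2)))

    def total_reward(r_p, r_v, r_e, r_rp, r_rv):
        """r_t = w^p r^p + w^v r^v + w^e r^e + w^rp r^rp + w^rv r^rv"""
        return (WEIGHTS["p"] * r_p + WEIGHTS["v"] * r_v + WEIGHTS["e"] * r_e
                + WEIGHTS["rp"] * r_rp + WEIGHTS["rv"] * r_rv)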

As emphasized throughout, the reward function is designed from on-board sensor values so that the fine-tuning process can run in the real world; when the robot operates in the physical environment, the reward is computed from these values through state estimation. The state estimation method described below is therefore a key factor determining how well fine-tuning works.

  • A Kalman filter is used to estimate the robot's linear root velocity in the real world.
    • The Kalman filter reads acceleration and orientation values from the IMU and corrects them using the foot contact sensors.
    • Treating the velocity of a foot in contact as zero, the body velocity is computed from that leg's joint velocities and used to correct the value estimated from the IMU.
  • The linear velocity computed this way is integrated into the robot's position estimate (a rough sketch of this fusion follows below).
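
A rough sketch of the idea, simplified to a constant blending gain instead of a full Kalman filter: for each foot the contact sensors report as planted, the body velocity implied by that leg's joint velocities (under the assumption that the foot is stationary) is blended with the IMU-based estimate, and the result is integrated into the position estimate. All quantities and the gain are illustrative.

    import numpy as np

    def fuse_velocity(imu_velocity, leg_velocity_estimates, contact_flags, gain=0.5):
        """leg_velocity_estimates: (4, 3) body velocity implied by each leg's
        joint velocities, assuming that leg's foot is pinned to the ground.
        contact_flags: (4,) booleans from the foot contact sensors."""
        contacts = np.asarray(contact_flags, dtype=bool)
        if not contacts.any():
            return np.asarray(imu_velocity, dtype=float)   # airborne: trust the IMU alone
        kinematic = np.asarray(leg_velocity_estimates)[contacts].mean(axis=0)
        return (1.0 - gain) * np.asarray(imu_velocity) + gain * kinematic

    def integrate_position(position, fused_velocity, dt=1.0 / 33.0):
        """Accumulate the fused linear velocity into the (slowly drifting) position estimate."""
        return np.asarray(position, dtype=float) + dt * np.asarray(fused_velocity)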

์œ„์˜ ๊ทธ๋ž˜ํ”„๋“ค์— ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด(์•„๋ž˜์—์„œ ์œ„ ๋ฐฉํ–ฅ์œผ๋กœ),

  • angular velocity์™€ orientation ์„ผ์„œ ๊ฐ’๋“ค์€ ๋งค์šฐ ์ •ํ™•ํ–ˆ๋‹ค.
  • linear velocity๋Š” ๋งค์šฐ ์ •ํ™•ํ•˜์ง„ ์•Š์•˜์ง€๋งŒ ํ—ˆ์šฉ๊ฐ€๋Šฅํ–ˆ๋‹ค.(reasonable)
  • position drifts๋Š” ์ƒ๋‹นํžˆ ๋ฒ—์–ด๋‚˜๋Š” ๋ถ€๋ถ„์ด ์žˆ์—ˆ์ง€๋งŒ, ๊ฐ ์—ํ”ผ์†Œ๋“œ์—์„œ reward function์„ ๊ณ„์‚ฐํ•  ์ •๋„๋กœ์˜ ์ ํ•ฉํ•œ ๊ฐ’๋“ค์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค.

C. Reset Controller

  • To train the reset policy in simulation, training starts from a wide variety of initial states.

→ The robot is dropped from random heights and orientations to produce the diverse initial states shown in the picture below.

  • The motion imitation objective is modified to train a single, streamlined reset policy.

  • Rather than giving the robot a reference motion that tells it exactly how to get up, the reset policy is trained as follows (a sketch of this staged reward appears after the list):

  1. The policy first trains with only a reward for rolling right side up.
  2. Once the robot succeeds in getting upright, the motion imitation reward is added to the training.
    • The reference motion at this stage is a standing pose, so the robot learns to stand up straight.
  • The reset policy trained this way transferred well to the various test terrains without any fine-tuning.
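
The curriculum can be summarized as a reward switch: until the robot is upright only the rolling-right-side-up term is active, and once it is upright the standing-pose imitation term is added. The threshold and the exact form of each term below are illustrative placeholders, not the paper's definitions.

    def reset_reward(up_axis_z, standing_imitation_reward, upright_threshold=0.8):
        """Staged reward for the reset policy.

        up_axis_z: z-component of the robot body's up axis in the world frame
        (about 1.0 when upright, about -1.0 when lying on its back).
        standing_imitation_reward: motion imitation reward against a standing pose."""
        upright_reward = 0.5 * (up_axis_z + 1.0)            # stage 1: roll right side up
        if up_axis_z < upright_threshold:
            return upright_reward
        return upright_reward + standing_imitation_reward   # stage 2: add standing imitation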

IV. Experiments

💡 Three questions to keep in mind when looking at the experimental results:

  1. Does the fine-tuning-based method proposed in this paper make good use of simulation training and adapt in the real physical environment better than previous methods?
  2. How do the system design choices proposed in this paper affect the feasibility of real-world training?
  3. In how diverse a set of real physical environments does the autonomous, online fine-tuning method improve the robot's skills?

A. Simulation Experiments

  • The agent's policies are first trained in a particular simulation setting and then "deployed" into a different simulation setting, and the results are examined.
  • The experiments check how quickly the learned forward pacing gait adapts to the test environments.
  • Pre-training is done on flat ground with standard dynamics randomization (varying mass, inertia, motor strength, friction, and latency).

The test terrains

Three test terrains are used: test environment [1] is similar to the simulation setting used for pre-training, while test environments [2] and [3] differ from it to varying degrees (a small PyBullet sketch of such terrains follows the list below).

  1. a flat ground
  2. randomized heightfield: uneven terrain whose heights are set randomly
  3. a low friction surface: terrain with a low friction coefficient, slippery like ice (the friction coefficient lies far outside the distribution experienced during training)
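
For reference, test terrains of this kind can be set up in PyBullet roughly as follows (using the standard plane asset and the heightfield collision shape); the friction value and the height range are illustrative guesses, not the paper's settings.

    import numpy as np
    import pybullet as p
    import pybullet_data

    p.connect(p.DIRECT)
    p.setAdditionalSearchPath(pybullet_data.getDataPath())

    # [1]/[3]: a flat plane, optionally made slippery by lowering its friction coefficient.
    plane_id = p.loadURDF("plane.urdf")
    p.changeDynamics(plane_id, -1, lateralFriction=0.2)   # far below the training distribution

    # [2]: a randomized heightfield (uneven terrain with random heights).
    rows, cols = 64, 64
    heights = (0.05 * np.random.rand(rows * cols)).tolist()
    terrain_shape = p.createCollisionShape(
        shapeType=p.GEOM_HEIGHTFIELD,
        meshScale=[0.1, 0.1, 1.0],
        heightfieldData=heights,
        numHeightfieldRows=rows,
        numHeightfieldColumns=cols,
    )
    p.createMultiBody(baseMass=0, baseCollisionShapeIndex=terrain_shape)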

๋น„๊ต๊ตฐ

  1. Latent space: to learn efficiently across diverse dynamics parameters, behaviors represented in a latent space are learned
  2. RMA: a dynamics-randomization model trained with the Adaptation Module mentioned above
  3. Vanilla SAC: trained with the Soft Actor-Critic algorithm
  4. Ours (REDQ): trained with 10 Q-functions, randomly sampling 2 of them for each target

Looking at the results, RMA performed well only in the training environment, which clearly exposes the limits of the Adaptation Module. Compared with SAC, REDQ (Ours) not only had better sample efficiency but also converged to a higher return.

B. Real-World Experiments

Agents trained in simulation were tested in four real-world environments (one outdoor, three indoor). All real-world terrain experiments used an agent pre-trained on flat ground in simulation; the replay buffer was first seeded with 5,000 samples, and the policy was then fine-tuned in the real-world test environment.

  1. Outdoor grassy lawn:

    • The surface is slippery, so the feet can slide on the grass or sink into the soil.

    • The forward and backward pacing gaits were fine-tuned here. (Pacing gait: the two legs on the same side move together.)

    • The pre-trained forward pacing policy could only make a little forward progress, and the pre-trained backward pacing policy tended to fall over.

    • After about 2 hours of operation, the robot could pace forward and backward consistently and stably (with only a very occasional fall).

  2. Indoor

    • Carpeted room: a high-friction surface; because the carpet is soft, the robot's rubber-tipped feet can make unstable contacts unlike anything learned in simulation.

    • Doormat with crevices: the feet can get caught in the gaps of the mat's surface.

    • Memory foam: a roughly 4 cm thick memory foam mattress; the feet sink in, and compared with a flat, hard floor the gait can change considerably in this environment.

    • Indoors, the pre-trained side-stepping policy was very unstable while moving and fell over before finishing the motion.

    • However, within 2.5 hours in each terrain setting, the robot became able to perform the skill without stumbling.

C. Semi-autonomous training

  • Across all of the experiments, the recovery policy was 100% successful.
  • The reset controller trained with the method proposed in this paper was compared with the built-in rollover controller provided by Unitree.
    • On hard surfaces: both controllers worked effectively, but the built-in controller was considerably slower than the learned policy.
    • On the memory foam: the built-in controller performed noticeably worse.

V. Conclusion

  • A system is proposed for fine-tuning locomotion policies in diverse real-world settings such as grass, carpets, doormats, and memory foam.
  • It demonstrates the combination of autonomous data collection and data-efficient model-free RL.
  • State estimation is done with the robot's on-board sensors so that falls can be followed by automated recoveries, and an effective reward calculation based on this information is proposed.
  • Data-efficient fine-tuning is demonstrated for a variety of locomotion skills.
  • As future work, the authors envision a lifelong learning system for legged robots that can cope with complex, diverse, constantly changing real-world environments.

Review

My subjective pros and cons after reviewing the paper are as follows.

  • Pros 👍
    • The problem framing is a strength: the clear limitation of robot operation is that the environments a robot must work in cannot help but keep changing.
    • The method seems like it would also be practical in real industrial settings.
    • The reset policy's success rate was impressive.
  • Cons 👎
    • The outdoor experiment did not compare against multiple algorithms.
    • It would have been better to also include, as a baseline, the same algorithm with a single policy instead of three separate policies.

Reference

  • Original Paper
  • Project Homepage
  • Randomized Ensembled Double Q-Learning: Learning Fast Without a Model
  • REDQ REVIEW

Copyright 2024, Jung Yeon Lee