
📃 NerveNet Review

gnn
rl
paper
Learning Structured Policy with Graph Neural Networks
Published

June 10, 2022

Abstract

We address the problem of learning structured policies for continuous control. In traditional reinforcement learning, policies of agents are learned by multi-layer perceptrons (MLPs) which take the concatenation of all observations from the environment as input for predicting actions. In this work, we propose NerveNet to explicitly model the structure of an agent, which naturally takes the form of a graph. Specifically, serving as the agentโ€™s policy network, NerveNet first propagates information over the structure of the agent and then predict actions for different parts of the agent. In the experiments, we first show that our NerveNet is comparable to state-of-the-art methods on standard MuJoCo environments. We further propose our customized reinforcement learning environments for benchmarking two types of structure transfer learning tasks, i.e., size and disability transfer, as well as multi-task learning. We demonstrate that policies learned by NerveNet are significantly more transferable and generalizable than policies learned by other models and are able to transfer even in a zero-shot setting.

In conventional reinforcement learning, an agent's policy is a multi-layer perceptron (MLP), so the observations the agent receives from the environment are simply stacked together (concatenation) and fed to the policy network as input. But just as hand velocity and foot velocity belong to the same category (velocity) yet should be distinguished because they come from different places, a policy that reflects the agent's structural characteristics should be able to tell observation sources apart. To capture this structural relationship, the authors use a graph instead of an MLP and propose NerveNet. In NerveNet, the policy network is a graph: information propagates across its nodes, and each node representing a part of the agent predicts an action for that part. On standard MuJoCo environments, NerveNet matches MLP-based baselines; in transfer learning tasks it still learns well under variations in the agent's size and under disability (some parts of the agent not working); and in multi-task learning it performs well across the walker group of environments. These results show that NerveNet is not only transferable but even works in a zero-shot setting.

  • transferable - a network (weights) trained on task A can be reused for learning task B, enabling faster and more efficient training than learning task B from scratch. You can think of it as carrying the reasoning acquired on task A over to task B.
  • zero-shot - a term from meta learning: a network trained on task A performs well on an unseen new task B immediately, without any fine tuning.

Introduction

๋งŽ์€ ๊ฐ•ํ™”ํ•™์Šต ๋ฌธ์ œ๋“ค์—์„œ agent๋“ค์€ ์—ฌ๋Ÿฌ๊ฐœ์˜ ๋…๋ฆฝ์ ์ธ controller๋“ค๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค. ์˜ˆ๋ฅผ๋“ค์–ด ๋กœ๋ด‡์˜ ์ œ์–ด์—์„œ ๊ฐ•ํ™”ํ•™์Šต์ด ๋งŽ์ด ์ ์šฉ๋˜๊ณ  ์žˆ๋Š”๋ฐ, ๋กœ๋ด‡์€ ์—ฌ๋Ÿฌ๊ฐœ์˜ ๋งํฌ(link)๋“ค๊ณผ ์กฐ์ธํŠธ(joint)๋“ค๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๊ณ  ์›€์ง์ž„์ด ์ผ์–ด๋‚˜๋Š” joint๋“ค์„ ๊ฐ ๊ฐœ๋ณ„์ ์ธ controller๋กœ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ์ด๋•Œ link๋Š” ๋กœ๋ด‡์˜ ๋ฌผ๋ฆฌ์ ์ธ ํ˜•ํƒœ๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ๋ผˆ๋Œ€์ฒ˜๋Ÿผ ์ƒ๊ฐํ•˜๋ฉด ๋˜๊ณ  joint๋Š” ๋กœ๋ด‡์˜ ๋ชจ์…˜์„ ๊ฒฐ์ •ํ•˜๋Š” ๊ด€์ ˆ๋กœ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค.(๋ณดํ†ต ํšŒ์ „์ด ๊ฐ€๋Šฅํ•œ revolute joint๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.) ๋กœ๋ด‡์€ link-joint-link- ... -joint-link์™€ ๊ฐ™์ด link์™€ joint๊ฐ€ ์ฒด์ธ์ฒ˜๋Ÿผ ์—ฐ๊ฒฐ๋˜์–ด์„œ ๊ตฌ์„ฑ๋˜๋Š”๋ฐ, ๊ฐ link์™€ joint์˜ ์›€์ง์ž„์€ ์ž์‹ ์˜ ์ƒํƒœ์—๋งŒ ์˜์กดํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ์—ฐ๊ฒฐ๋œ ์ฃผ๋ณ€ link์™€ joint๋“ค์—๊ฒŒ์„œ๋„ ์˜ํ–ฅ์„ ๋ฐ›์„ ์ˆ˜ ๋ฐ–์— ์—†๋‹ค. ๋กœ๋ด‡์„ ์›€์ง์ด๋„๋ก ์ œ์–ด๋ฅผ ํ•œ๋‹ค๋Š” ๊ฒƒ์€ ๋ฐ”๋กœ ๋ชจ์…˜์„ ๋งŒ๋“œ๋Š” joint์˜ ์›€์ง์ž„์„ ๊ฒฐ์ •ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๋”ฐ๋ผ์„œ robot agent๋Š” ๊ฐ•ํ™”ํ•™์Šต์—์„œ์˜ action์„ joint์„ ์ œ์–ดํ•˜๋Š” ๊ฒƒ์œผ๋กœ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๊ณ  action์œผ๋กœ ๋กœ๋ด‡์˜ joint์˜ ๊ฐ๋„๋ฅผ ๋ฐ”๊ฟ”๊ฐ€๋ฉฐ ๋กœ๋ด‡์„ ์›€์ง์ด๊ฒŒ ๋œ๋‹ค.

In reinforcement learning the agent's policy is usually an MLP. An MLP-based policy is the simplest network structure: the agent's observations are concatenated and fed in as input. Returning to the robot example, the observations include various kinds of information such as each joint's rotation angle, angular velocity, and position. Since these come from each joint, the observation dimension is typically (information obtainable per joint) x (number of joints). The more joints the robot has, the larger this dimension grows, and simply concatenating all of this into a policy network demands more training time and more interaction between the agent and the environment. This paper therefore proposes building the policy as a graph that exploits the agent's structural characteristics (made of links and joints), feeding observations into it, and training it.

The physical structure of robots and animals resembles a graph. The chain-like connectivity of links and joints described above is a natural fit for a Graph Neural Network. In NerveNet, information propagation therefore follows this graph structure: the different parts of the agent's body are defined as the graph's nodes and edges, and the actions of the body nodes where motion occurs are decided from it.

NerveNet

First, let's settle the reinforcement learning notation. Since the paper targets locomotion control problems, it formulates them as an infinite-horizon discounted Markov decision process (MDP). Continuous control problems are usually modeled as an MDP with an infinite time horizon and a temporal discount factor (in practice a max step is set, but to a very large number).

Let S be the state (observation) space and A the action space. The stochastic policy \pi_{\theta}\left(a^{\tau} \mid s^{\tau}\right) with parameters \theta produces an action a based on the current state s. Given the agent's a and s, the environment returns a reward r\left(s^{\tau}, a^{\tau}\right), and the agent learns to maximize the rewards it receives.

This MDP setup does not deviate much from standard reinforcement learning notation.

Graph Construction

The MuJoCo agents used in the paper already have a tree structure. To build the graph at the core of NerveNet, three kinds of nodes are defined: body, joint, and root. A body node represents the coordinate system of a link in the robotics sense, and a joint node represents degrees of freedom of motion and connects two body nodes.

์•„๋ž˜๋Š” Ant ํ™˜๊ฒฝ์˜ ์˜ˆ์‹œ์ธ๋ฐ, ํ•œ ๊ฐ€์ง€ ๊ทธ๋ฆผ์—์„œ ํ—ท๊ฐˆ๋ฆฌ์ง€ ๋ง์•„์•ผ ํ•  ์ ์€ ๊ทธ๋ฆผ์—์„œ๋Š” ๋งˆ์น˜ body์™€ root ๋…ธ๋“œ๋งŒ ๋…ธ๋“œ๋กœ ๋งŒ๋“ ๊ฒƒ ์ฒ˜๋Ÿผ ๋ณด์ด์ง€๋งŒ root์™€ body, body์™€ body๋ฅผ ์—ฐ๊ฒฐํ•˜๋Š” ์—ฃ์ง€๋“ค๋„ ์‹ค์ œ๋กœ๋Š” joint ๋…ธ๋“œ๋“ค์ด๋‹ค.(we omit the joint nodes and use edges to represent the physical connections of joint nodes.)root๋ผ๋Š” ๋…ธ๋“œ๋Š” agent์˜ ์ถ”๊ฐ€์ ์ธ ์ •๋ณด๋“ค์„ ๋‹ด์„ ๋ถ€๋ถ„์œผ๋กœ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด ์ถ”๊ฐ€ํ•œ ๋…ธ๋“œ ์ข…๋ฅ˜๋กœ, ์˜ˆ๋ฅผ ๋“ค์–ด agent๊ฐ€ ๋„๋‹ฌํ•ด์•ผ ํ•˜๋Š” target position์— ๋Œ€ํ•œ ์ •๋ณด ๋“ฑ์ด ๋‹ด๊ฒจ์žˆ๋‹ค.

NerveNet as Policy

We will look at NerveNet in three main parts: after (0) Notation, we go through (1) the Input model, (2) the Propagation model, and (3) the Output model in order.

0. Notation

Graph notation is as follows: a graph G consists of a node set V and an edge set E.

G=(V, E)

Since the graph that makes up the NerveNet policy is a directed graph, in- and out-neighbors are specified separately for each node:

  • the neighbor nodes with edges coming into node u are \mathcal{N}_{in}(u)
  • the neighbor nodes with edges going out of node u are \mathcal{N}_{out}(u)

๊ทธ๋ž˜ํ”„์˜ ๋ชจ๋“  ๋…ธ๋“œ u๋Š” ํƒ€์ž…์„ ๊ฐ€์ง€๊ฒŒ ๋˜๊ณ  ์ด๋ฅผ p_{u} \in\{1,2, \ldots, P\} (associated note type)๋กœ ๋‚˜ํƒ€๋‚ด๋ฉฐ ์—ฌ๊ธฐ์—์„œ๋Š” ์œ„์— ์„ค๋ช…ํ•œ ๊ฒƒ๊ณผ ๊ฐ™์ด body, joint, root 3๊ฐ€์ง€ ํƒ€์ž…์ด ์žˆ๋‹ค.

Not only nodes but also edges can be typed: c_{(u, v)} \in\{1,2, \ldots, C\} denotes the edge type associated with the node pair (u, v). (Multiple edge types could be defined for a single edge, but here, in a simple-is-best spirit, each edge has exactly one type.)

By assigning types per node and per edge:

  • node types help capture the different importance of different nodes, and
  • edge types represent the different relationships between nodes, and information is propagated differently depending on the kind of relationship.

Now for the time notation. NerveNet has two notions of time (time step):

  1. \tau, the interaction time step between the environment and the agent, as in standard reinforcement learning
  2. t, the propagation step inside NerveNet's graph policy

Put together: at reinforcement learning time step \tau the agent receives an observation from the environment, and based on that observation, propagation runs inside NerveNet's graph over internal steps t.
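
The nesting of the two time scales can be sketched roughly as follows (the `env`/`policy` interfaces are assumptions for illustration, not the paper's code):

```python
# Sketch of the two nested time scales (illustrative structure only).
T = 4  # number of internal propagation steps per environment step

def rollout(env, policy, episode_len):
    s = env.reset()
    for tau in range(episode_len):       # RL interaction step tau
        h = policy.encode(s)             # input model: h^0 for every node
        for t in range(T):               # internal propagation step t
            h = policy.propagate(h)      # h^t -> h^{t+1}
        a = policy.act(h)                # output model reads h^T
        s, reward, done, info = env.step(a)
        if done:
            break
```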

1. Input model

์œ„์—์„œ ๋งํ–ˆ๋“ฏ์ด ํ™˜๊ฒฝ๊ณผ ์ƒํ˜ธ์ž‘์šฉ์œผ๋กœ observation s^{\tau} \in \mathcal{S}์„ ๋ฐ›๊ฒŒ ๋œ๋‹ค(time step \tau). ์ด s^{\tau}๋Š” concatenation๋œ ๊ฐ ๋…ธ๋“œ์˜ observation์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ์ด์ œ ๊ฐ•ํ™”ํ•™์Šต interaction ์ˆ˜์ค€์˜ \tau ์Šคํ…์€ ์ž ์‹œ ๋ฉˆ์ถฐ๋‘๊ณ  ๊ทธ๋ž˜ํ”„ ๋‚ด๋ถ€์˜ ํƒ€์ž„ ์Šคํ…์ธ t ์ˆ˜์ค€์—์„œ ์ƒ๊ฐํ•ด๋ณด์ž. observation์€ node u์— ํ•ด๋‹นํ•˜๋Š” x_{u}๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๊ณ  x_{u}๋Š” input network F_{\mathrm{in}}(MLP)๋ฅผ ๊ฑฐ์ณ์„œ ๊ณ ์ •๋œ ํฌ๊ธฐ์˜ state vector์ธ h_{u}^{0}๊ฐ€ ๋œ๋‹ค. h_{u}^{0}์˜ ๋…ธํ…Œ์ด์…˜์„ ํ’€์–ด์„œ ํ•ด์„ํ•˜๋ฉด ๋…ธ๋“œ u์˜ propagation step 0 ์—์„œ์˜ state vector์ธ ๊ฒƒ์ด๋‹ค. ์ด๋•Œ observation vector x_{u}๊ฐ€ ๋…ธ๋“œ๋งˆ๋‹ค ํฌ๊ธฐ๊ฐ€ ๋‹ค๋ฅผ ๊ฒฝ์šฐ zero padding์œผ๋กœ ๋งž์ถฐ์„œ input network์— ๋„ฃ์–ด์ฃผ๊ฒŒ ๋œ๋‹ค.

h_{u}^{0}=F_{\mathrm{in}}\left(x_{u}\right)
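
A minimal sketch of the input model, assuming per-node observations of varying length that are zero-padded to a common size before a shared MLP F_in (PyTorch, illustrative names):

```python
import torch
import torch.nn as nn

# Sketch of the input model F_in (illustrative, not the official code).
class InputModel(nn.Module):
    def __init__(self, max_obs_dim, hidden_dim):
        super().__init__()
        self.max_obs_dim = max_obs_dim
        self.f_in = nn.Sequential(nn.Linear(max_obs_dim, hidden_dim), nn.Tanh())

    def forward(self, obs_per_node):
        # obs_per_node: list of 1-D tensors x_u, possibly of different lengths
        padded = [torch.cat([x, x.new_zeros(self.max_obs_dim - x.shape[0])])
                  for x in obs_per_node]        # zero-pad to a fixed size
        return self.f_in(torch.stack(padded))   # h^0 for every node
```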

2. Propagation model

NerveNet์˜ propagation ๊ณผ์ • ๋…ธ๋“œ๋“ค ๊ฐ„์— ์ฃผ๊ณ  ๋ฐ›๋Š” ์ •๋ณด๋ฅผ message๋ผ๊ณ  ํ•˜๊ฒŒ ๋˜๊ณ  ์ด๋Š” ๋…ธ๋“œ๋“ค ๊ฐ„์— ์ฃผ๊ณ  ๋ฐ›๋Š” ์ƒํ˜ธ์ž‘์šฉ์ด๋ผ๊ณ  ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค. Propagation model์€ 3๊ฐ€์ง€ ๋‹จ๊ณ„๋กœ ๋‚˜๋ˆ„์–ด์„œ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

  1. Message Computation
    • Compute the messages to send.

    • At propagation step t, every node u has a state vector h_{u}^{t}.

    • A message is computed for every edge: node u sends a message along each of its out-going edges (u, v). Here M is an MLP, and as the subscript c_{(u, v)} of M shows, edges of the same type share the same message function M.

      m_{(u, v)}^{t}=M_{c_{(u, v)}}\left(h_{u}^{t}\right)

    • For example, the figure below shows CentipedeEight: the left side is the actual agent, and the right side is the agent drawn as a graph. The second torso uses the same message function M_{1} when sending to the first and third torsos, and message function M_{2} when sending to the LeftHip and RightHip.

  2. Message Aggregation
    • Once message computation is done for all nodes, each node aggregates the (computed) messages coming from its in-coming neighbor nodes. Various aggregation functions can be used here: summation, average, max-pooling, and so on.

      \bar{m}_{u}^{t}=A\left(\left\{m_{(v, u)}^{t} \mid v \in \mathcal{N}_{in}(u)\right\}\right)

  3. States Update
    • Now just update the state vector based on the aggregated messages!

      h_{u}^{t+1}=U_{p_{u}}\left(h_{u}^{t}, \bar{m}_{u}^{t}\right)

    • The update function U can be a gated recurrent unit (GRU), a long short-term memory (LSTM) unit, or an MLP.

    • As the subscript p_{u} of the update function shows, nodes of the same type share the same update function U. The updated state vector becomes h_{u}^{t+1}, one internal time step t+1 later.

These three internal propagation stages (Message Computation, Message Aggregation, States Update) run for T steps, and each node's final state vector is h_{u}^{T}.
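
Putting the three stages together, one propagation step could be sketched like this (sum aggregation and a per-node-type GRU update, with per-edge-type message MLPs; illustrative, not the official code):

```python
import torch
import torch.nn as nn

# One NerveNet-style propagation step: message computation, aggregation, state update.
class PropagationStep(nn.Module):
    def __init__(self, hidden_dim, num_edge_types, num_node_types):
        super().__init__()
        # M_c: one message MLP per edge type c
        self.message_fns = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in range(num_edge_types))
        # U_p: one GRU cell per node type p
        self.update_fns = nn.ModuleList(
            nn.GRUCell(hidden_dim, hidden_dim) for _ in range(num_node_types))

    def forward(self, h, edges, edge_type, node_type):
        # h: (num_nodes, hidden_dim); edges: directed (u, v) pairs
        agg = [torch.zeros_like(h[0]) for _ in range(h.shape[0])]
        for (u, v) in edges:
            m = self.message_fns[edge_type[(u, v)]](h[u])   # 1. message m_(u,v)
            agg[v] = agg[v] + m                              # 2. sum aggregation
        h_next = [self.update_fns[node_type[u]](             # 3. GRU update U_p
                      agg[u].unsqueeze(0), h[u].unsqueeze(0)).squeeze(0)
                  for u in range(h.shape[0])]
        return torch.stack(h_next)                           # h^{t+1}
```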

3. Output model

In a typical RL MLP policy, the network outputs the mean of a Gaussian distribution for each action, and the std is a trainable vector. NerveNet treats the std similarly, but produces an action prediction per node.

Let O be the set of nodes connected to actuators. The final state vector h_{u \in \mathcal{O}}^{T} of each node in this set is fed into the output model O_{q_{u}}, an MLP, which outputs the mean \mu of the Gaussian action distribution for that actuator. Here we see a new notation q_{u}: it is the output type, i.e., the type of the output-producing node u. Since q_{u} appears as the subscript of the output function, nodes with the same output type can share the same output function. In other words, controllers can be shared across output node types. In the Centipede example above, all LeftHip nodes can share a controller.

\mu_{u \in \mathcal{O}}=O_{q_{u}}\left(h_{u}^{T}\right)

๋…ผ๋ฌธ์—์„œ ์‹ค์ œ๋กœ ์‹คํ—˜์„ ํ•ด๋ดค์„ ๋•Œ ๋‹ค๋ฅธ ํƒ€์ž…์˜ ์ปจํŠธ๋กค๋Ÿฌ๋“ค์„ ํ•˜๋‚˜๋กœ ํ†ตํ•ฉํ–ˆ๋”๋ผ๋„(O function์„ ๋‹ค ๊ฐ™์€ MLP๋กœ ์‚ฌ์šฉ) ํผํฌ๋จผ์Šค๊ฐ€ ๊ทธ๋ ‡๊ฒŒ ํ•ด์ณ์ง€์ง€ ์•Š์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค๊ณ  ํ•œ๋‹ค.

Putting it all together with the graph notation, the graph-based Gaussian stochastic policy is:

\pi_{\theta}\left(a^{\tau} \mid s^{\tau}\right)=\prod_{u \in \mathcal{O}} \pi_{\theta, u}\left(a_{u}^{\tau} \mid s^{\tau}\right)=\prod_{u \in \mathcal{O}} \frac{1}{\sqrt{2 \pi \sigma_{u}^{2}}} e^{-\left(a_{u}^{\tau}-\mu_{u}\right)^{2} /\left(2 \sigma_{u}^{2}\right)}
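
In code, the output model and the factorized Gaussian could look like the following sketch (one mean-producing MLP per output type; the scalar log-std is a simplification, since the paper uses a trainable std vector; names are illustrative):

```python
import torch
import torch.nn as nn

# Sketch of the output model O_q and the factorized Gaussian policy (illustrative).
class OutputModel(nn.Module):
    def __init__(self, hidden_dim, num_output_types):
        super().__init__()
        # O_q: nodes of the same output type q share the same mean MLP
        self.out_fns = nn.ModuleList(
            nn.Linear(hidden_dim, 1) for _ in range(num_output_types))
        self.log_std = nn.Parameter(torch.zeros(1))  # simplification of the trainable std vector

    def forward(self, h_T, output_nodes, output_type):
        mu = torch.cat([self.out_fns[output_type[u]](h_T[u]) for u in output_nodes])
        return torch.distributions.Normal(mu, self.log_std.exp())

# Usage: dist = output_model(h_T, output_nodes, output_type)
#        a = dist.sample(); logp = dist.log_prob(a).sum()  # product over u in O
```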


The figure below summarizes each stage of NerveNet at a glance, using the Walker-Ostrich environment as an example.

Learning Algorithm

Having gone through each stage inside NerveNet at the level of propagation step t in the previous part, let's now look at the training objective and algorithm at the level of RL time step \tau. The objective is no different from typical RL: maximize the return with respect to the policy parameters \theta.

J(\theta)=\mathbb{E}_{\pi}\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} r\left(s^{\tau}, a^{\tau}\right)\right]

For the RL algorithm, PPO and GAE are used. Their details do not differ from the original formulations, so this review omits them; see the respective papers.

Plugging PPO and GAE into the objective J above, NerveNet's objective becomes:

\begin{aligned} \tilde{J}(\theta) &= J(\theta)-\beta L_{KL}(\theta)-\alpha L_{V}(\theta) \\ &= \mathbb{E}_{\pi_{\theta}}\left[\sum_{\tau=0}^{\infty} \min \left(\hat{A}^{\tau} r^{\tau}(\theta),\ \hat{A}^{\tau} \operatorname{clip}\left(r^{\tau}(\theta), 1-\epsilon, 1+\epsilon\right)\right)\right] \\ &\quad -\beta\, \mathbb{E}_{\pi_{\theta}}\left[\sum_{\tau=0}^{\infty} \operatorname{KL}\left[\pi_{\theta}\left(\cdot \mid s^{\tau}\right) \,\|\, \pi_{\theta_{\mathrm{old}}}\left(\cdot \mid s^{\tau}\right)\right]\right]-\alpha\, \mathbb{E}_{\pi_{\theta}}\left[\sum_{\tau=0}^{\infty}\left(V_{\theta}\left(s^{\tau}\right)-V^{\mathrm{target}}\left(s^{\tau}\right)\right)^{2}\right] \end{aligned}
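
For reference, a sketch of this combined objective in code (tensor inputs over a batch of timesteps; \epsilon, \beta, \alpha as in the equation above; illustrative, not the official implementation):

```python
import torch

# Sketch of the clipped-surrogate + KL + value objective above (to be maximized).
def nervenet_objective(logp, logp_old, adv, v_pred, v_target, kl,
                       eps=0.2, beta=1.0, alpha=0.5):
    ratio = torch.exp(logp - logp_old)                       # r^tau(theta)
    surrogate = torch.min(adv * ratio,
                          adv * torch.clamp(ratio, 1 - eps, 1 + eps)).mean()
    value_loss = (v_pred - v_target).pow(2).mean()           # L_V
    return surrogate - beta * kl.mean() - alpha * value_loss
```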

์œ„์˜ ์ˆ˜์‹์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋Š” value network V๋ฅผ ์–ด๋–ป๊ฒŒ ๋””์ž์ธํ•  ๊ฒƒ์ธ์ง€๊ฐ€ ์ด๋ฒˆ ๋…ผ๋ฌธ์˜ ๋‹ค๋ฅธ ํฌ์ธํŠธ๋กœ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋…ผ๋ฌธ์˜ ๊ธฐ๋ณธ ์•„์ด๋””์–ด๋Š” policy network๋ฅผ ๊ทธ๋ž˜ํ”„๋กœ ํ‘œํ˜„ํ•˜๋Š” ๊ฒƒ์ด๊ณ , value network๋Š” ์–ด๋–ป๊ฒŒ ํ• ์ง€ ์—ฌ๋Ÿฌ ์„ ํƒ์ง€๋“ค์ด ๋‚จ์•„์žˆ๋‹ค. ๊ทธ๋ž˜์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” value network์˜ ๋””์ž์ธ์„ ๋‘๊ณ  ํฌ๊ฒŒ 3๊ฐ€์ง€ NerveNet์˜ ๋ณ€ํ˜• ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์„ ์‹คํ—˜ํ•ด๋ณด์•˜๋‹ค.

  1. NerveNet-MLP : the policy network is one GNN and the value network is an MLP

  2. NerveNet-2 : the policy network is one GNN and the value network is another GNN (two GNNs in total, without sharing the parameters of the two GNNs)

  3. NerveNet-1 : the policy network and the value network are a single shared GNN (one GNN in total)
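
A tiny configuration sketch of the three variants (stub classes, purely for illustration):

```python
# Stub sketch of how the three variants wire the policy and value networks.
class NerveNetGNN:                      # stands in for the graph network
    def __init__(self, graph): self.graph = graph

class MLPValueNet:                      # stands in for a plain MLP value network
    pass

def make_networks(variant, graph):
    policy = NerveNetGNN(graph)                 # the policy is always a GNN
    if variant == "NerveNet-MLP":
        value = MLPValueNet()                   # separate MLP value network
    elif variant == "NerveNet-2":
        value = NerveNetGNN(graph)              # second GNN, parameters not shared
    else:                                       # "NerveNet-1"
        value = policy                          # one GNN shared by policy and value
    return policy, value
```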

Experiments

We first verify NerveNet's effectiveness in the MuJoCo simulator, then check its transferability and multi-task learning ability in several customized environments.

1. Comparison on standard benchmarks of MuJoCo

  • ๋น„๊ต๊ตฐ์œผ๋กœ MLP, TreeNet(๋ชจ๋“  ๋…ธ๋“œ๋“ค์ด ์—ฐ๊ฒฐ ๋˜์–ด ์žˆ๋Š” ๊ทธ๋ž˜ํ”„, depth 1)์„ ์‚ฌ์šฉ
  • ์ด 8๊ฐœ์˜ ํ™˜๊ฒฝ์—์„œ ์‹คํ—˜ - Reacher, InvertedPendulum, InvertedDoublePendulum, Swimmer, HalfCheetah, Hopper, Walker2d, Ant
  • ์ถฉ๋ถ„ํžˆ ํ•™์Šตํ•˜๋Š” ์Šคํ…์„ ์ฃผ๊ธฐ ์œ„ํ•ด์„œ 1 million์„ max๋กœ ๋‘ 
  • ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ๊ฒฝ์šฐ ๊ทธ๋ฆฌ๋“œ ์„œ์น˜๋กœ ์ฐพ์•˜์œผ๋ฉฐ(Appendix ์ฐธ๊ณ ) ๊ฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํผํฌ๋จผ์Šค๋ฅผ ์ธก์ •ํ•  ๋•Œ 3๋ฒˆ์˜ run์„ ๋žœ๋ค ์‹œ๋“œ๋ฅผ ๋ฐ”๊ฟ”๊ฐ€๋ฉฐ ์‹คํ–‰์‹œํ‚จ ํ›„ ํ‰๊ท ์„ ๊ตฌํ•ด์„œ ๊ธฐ๋ก
  • ๋Œ€๋ถ€๋ถ„์˜ ํ™˜๊ฒฝ์—์„œ MLP๊ฐ€ ์ž˜๋๊ณ  NerveNet๋„ ์ด์™€ ๋น„๋“ฑํ•œ ํผํฌ๋จผ์Šค๋ฅผ ๋ƒˆ๋‹ค.

(Learning curves for three cases; in the other cases NerveNet and MLP were generally similar.)

  • HalfCheetah: MLP and NerveNet are similar; TreeNet is much worse
  • InvertedDoublePendulum: MLP gets slightly better results
  • Swimmer: NerveNet outperforms MLP
  • In most environments TreeNet was worse than NerveNet, which shows how important it is to keep the physical graph structure.

2. Structure transfer learning

  • MuJoCo์˜ ํ™˜๊ฒฝ ํ•˜๋‚˜๋ฅผ ์ปค์Šคํ…€ํ•ด์„œ size์™€ disability์˜ ๋ณ€ํ™”๊ฐ€ ์žˆ์„ ๋•Œ transferable ํ•จ์„ ๊ฒ€์ฆ
    • size transfer - ์ž‘์€ ์‚ฌ์ด์ฆˆ์˜ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ฐ€์ง„ agent๋ฅผ ํ•™์Šต ์‹œํ‚จ ํ›„ ๋” ํฐ ์‚ฌ์ด์ฆˆ์˜ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ฐ€์ง„ agent๋กœ transferable ํ•œ์ง€
    • disability transfer - ๋ชจ๋“  ํŒŒํŠธ๋“ค์ด ์ •์ƒ์ž‘๋™ํ•˜๋Š” agent๋กœ ํ•™์Šตํ•œ ํ›„ ์ผ๋ถ€ ํŒŒํŠธ๋“ค์ด ์ž‘๋™ํ•˜์ง€ ์•Š๋Š” ์ƒํ™ฉ์˜ agent๋กœ transferable ํ•œ์ง€
  • 2๊ฐœ ์ข…๋ฅ˜์˜ ํ™˜๊ฒฝ์„ ์ปค์Šคํ…€ํ•˜์—ฌ ์‹คํ—˜ - centipede์™€ snake
    1. centipede - ์ง€๋„ค์™€ ๊ฐ™์ด ์ƒ๊ธด agent๋กœ torso body๋“ค์ด ์—ฌ๋Ÿฌ๊ฐœ ์ฒด์ธ์ฒ˜๋Ÿผ ์—ฐ๊ฒฐ ๋˜์–ด ์žˆ๊ณ  torso๋ฅผ ์ค‘์‹ฌ์œผ๋กœ ์–‘์ชฝ์— ๋‹ค๋ฆฌ๊ฐ€ 1์Œ์œผ๋กœ ๋ถ™์–ด ์žˆ๋‹ค. ํ•˜๋‚˜์˜ ๋‹ค๋ฆฌ๋Š” thigh์™€ shin์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๊ณ  hinge actuator๋กœ ๊ตฌํ˜„๋˜์–ด ์žˆ๋‹ค. ์ปค์Šคํ…€์€ ๋‹ค๋ฆฌ์˜ ๊ฐฏ์ˆ˜๋ฅผ ๋‹ค์–‘ํ•˜๊ฒŒ ํ•ด์„œ ์—ฌ๋Ÿฌ ์ปค์Šคํ…€ ํ™˜๊ฒฝ๋“ค์„ ๋งŒ๋“ค์—ˆ๋Š”๋ฐ, ๊ฐ€์žฅ ์งง์€ agent๋กœ๋Š” CentipedeFour ๋ถ€ํ„ฐ ๊ฐ€์žฅ ๊ธด agent๋กœ๋Š” CentipedeFourty ๋กœ ๋‹ค๋ฆฌ๊ฐ€ 40๊ฐœ๊นŒ์ง€(20์Œ) ์žˆ๋Š” ํ™˜๊ฒฝ์„ ๋งŒ๋“ค์ˆ˜ ์žˆ์—ˆ๋‹ค. disability๋กœ ์ผ๋ถ€ ํŒŒํŠธ๊ฐ€ ์ž‘๋™ํ•˜์ง€ ์•Š๋Š” ํ™˜๊ฒฝ์€ Cp(Cripple)๋กœ ๋”ฐ๋กœ ํ‘œ๊ธฐํ–ˆ๋‹ค. ์ด ํ™˜๊ฒฝ์—์„œ y-direction์œผ๋กœ ๋นจ๋ฆฌ ์•ž์œผ๋กœ ๊ฐ€๋Š”๊ฒŒ ๋ชฉํ‘œ๋‹ค.

    2. snake - customized from the swimmer environment; the goal is to move as fast as possible in the direction of travel.

๋น„๊ต๊ตฐ

  • NerveNet : the model trained on the small agent can be applied to the large agent directly. Since the agent's structure is repetitive, it is enough to replicate the repeating parts.
  • MLP Pre-trained (MLPP): since the input size changes as the agent grows, the most straightforward option is to keep everything from the first hidden layer to the output layer, enlarge only the input layer to the new size, and randomly initialize the new input weights.
  • MLP Activation Assigning (MLPAA): the small agent's weights are placed directly into the large agent's model, and the remaining weights are initialized to zero (see the sketch after this list).
  • TreeNet: scaled up and zero-initialized like MLPAA.
  • Random : a policy that samples uniformly from the action space.
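
For instance, the MLPAA weight reuse could be sketched as copying the small agent's weight matrix into the corresponding block of a larger zero matrix (illustrative, and assuming the ordering of the shared observation dimensions is preserved):

```python
import numpy as np

# Sketch of MLP Activation Assigning (MLPAA): reuse small-agent weights, zeros elsewhere.
def mlpaa_init(w_small, in_large, out_large):
    # w_small: (in_small, out_small) weight matrix trained on the small agent
    w_large = np.zeros((in_large, out_large))
    rows, cols = w_small.shape
    w_large[:rows, :cols] = w_small   # dimensions shared with the small agent keep weights
    return w_large                    # dimensions new to the large agent start at zero

w = mlpaa_init(np.random.randn(4, 8), in_large=6, out_large=8)
```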

Result

  1. Centipede

    1-1. Pretraining

    • Performance of NerveNet, MLP, and TreeNet was compared on the 6-leg and 4-leg models. These three models are the same baselines used in the benchmark comparison above.

    • On the 4-leg model NerveNet reaches the highest reward, while on the 6-leg model MLP does. TreeNet is lowest in both environments.
    • After pretraining on the 6-leg and 4-leg models, transferability was tested.

    1-2. Zero-shot

    • Performance was measured without fine tuning.
    • For easy comparison, the average reward and average running-length are normalized and shown as colors, as below (green-good, red-bad).

    • As is clearly visible, NerveNet transfers far better than the other baselines.
    • Moreover, as the learning curves show, NerveNet+Pretrain starts from a much higher reward than the other pretrained baselines and reaches the solved score in fewer timesteps, so it clearly exploits the structural advantage of the graph.

    • NerveNet์˜ agent๋“ค์€ ๋‹ค๋ฅธ ๋น„๊ต๊ตฐ agent๋“ค์—์„œ ๋ณด์ด์ง€ ์•Š๋Š” walk-cycle์„ ๊ฐ€์ง€๊ณ  ์žˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋Š”๋ฐ, ์ด๋Š” ๋ณดํ–‰ ๋กœ๋ด‡๋“ค์€ ๊ฑธ์Œ์ƒˆ์—์„œ ๋ฐ˜๋ณต์ ์ธ ์›€์ง์ž„์„ ํ•˜๊ฒŒ ๋˜์–ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ cycle์„ ๊ฐ€์ง€๊ฒŒ ๋˜๋Š” ๊ฒƒ์„ agent๊ฐ€ ํ•™์Šตํ–ˆ์Œ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. (๋ฐ˜๋ฉด MLP๋Š” 8-๋‹ค๋ฆฌ ๋ชจ๋ธ์—์„œ ๋ชจ๋“  ๋‹ค๋ฆฌ๋ฅผ ์›€์ง์ด์ง€ ์•Š๋Š” ๋ชจ์Šต์„ ๋ณด์ด๊ธฐ๋„ ํ–ˆ๋‹ค.)
  2. Snake

    • In the snake environment too, NerveNet's reward far exceeds the other baselines, demonstrating transferability as shown in the table below.
    • About 350 points can be considered solved on SnakeThree, and NerveNet's starting scores are mostly in the 300s, indicating substantial zero-shot capability.
    • It is also interesting that the other baselines overfit so badly that they perform worse than Random.

    • Not only in zero-shot but also in the fine tuning learning curves, NerveNet exploits pretraining better than the other baselines: NerveNet+Pretrain starts with a high reward, and in certain size transfer experiments NerveNet+Pretrain catches up to an MLP score that scratch NerveNet could not surpass.

3. Multi-task learning

NerveNet์€ ๋„คํŠธ์›Œํฌ์— structure prior๋ฅผ ํฌํ•จํ•œ ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— multi-task learning์— ์œ ๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ด๋ฅผ ์‹คํ—˜ํ•˜๊ธฐ ์œ„ํ•ด Walker multi-task learning์„ ์ง„ํ–‰ํ–ˆ๋‹ค.

  • five 2d-walker environments - Walker-HalfHumanoid, Walker-Hopper, Walker-Horse, Walker-Ostrich, Walker-Wolf
  • trained with a single unified network

๋น„๊ต๊ตฐ

  • NerveNet : since the agents have different shapes and thus necessarily different weights, only the propagation weight matrices and the output model are shared.
  • MLP Sharing : the weight matrices between hidden layers are shared
  • MLP Aggregation : observations of different dimensions are aggregated down to the size of the first hidden layer and fed in as input
  • TreeNet: TreeNet can also share weights, but it cannot know the agent's structural information, since all node information is simply aggregated around the root node.
  • MLPs: a separate MLP policy trained per agent (single-task)

Result: since this is a multi-task learning experiment, you cannot judge from one or two learning curves alone; the five learning curves have to be read together. Except for the single-task policies, NerveNet performs best in every environment.

  • ํ…Œ์ด๋ธ”์—์„œ Ratio๊ฐ€ single-task policy์— ๋น„ํ•ด multi-task policy์˜ ์„ฑ๋Šฅ์„ percentage๋กœ ๋‚˜ํƒ€๋‚ธ ์ˆ˜์น˜์ธ๋ฐ, MLP์˜ ํผํฌ๋จผ์Šค๊ฐ€ single-task์—์„œ multi-task๋กœ ๋„˜์–ด๊ฐ”์„ ๋•Œ 42%๋‚˜ ํผํฌ๋จผ์Šค๊ฐ€ ์ค„์–ด๋“œ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. (Average-58.6%) ๋ฐ˜๋ฉด์— NerveNet์€ ์„ฑ๋Šฅ์ด ์ „ํ˜€ ๋–จ์–ด์ง€์ง€ ์•Š๋Š” ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค.

4. Robustness of learnt policies

Robustness is an important metric in reinforcement learning control: we need to check how large an error in physical quantities such as mass or force a policy can tolerate while still working well.

  • Experiments in the five Walker-group environments
  • Taking the pretrained agents, the agent's mass and joint strength are perturbed and performance is measured
  • In most environments and variations, NerveNet is more robust than MLP.

5. Interpreting the learned representations

To see what representations the policies actually learned, 2D and 1D PCA was run on the final state vectors of the agent trained in the CentipedeEight environment.

๊ฐ ๋‹ค๋ฆฌ์Œ๋“ค(Left Hip-Right Hip)๋“ค์€ agent์˜ ์ „์ฒด ๋ชธ์ฒด์—์„œ ๊ฐ๊ธฐ ๋‹ค๋ฅธ ์œ„์น˜์— ์žˆ์Œ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  invariant representation์„ ๋ฐฐ์šธ ์ˆ˜ ์žˆ์—ˆ์Œ์„ PCA๋ฅผ ํ†ตํ•ด์„œ ์•Œ ์ˆ˜ ์žˆ์—ˆ๋‹ค.

The walk-cycle briefly mentioned in the Centipede transfer learning results above also showed clear periodicity.

6. Comparison of model variants

Depending on how the value network is designed, there can be several NerveNet variants. Comparing them on Swimmer, Reacher, and HalfCheetah, NerveNet-MLP, whose value network is an MLP, performed best, and NerveNet-1 came second, close to NerveNet-MLP. A potential reason is that sharing weights between the value network and the policy network makes the weight \alpha in the trust-region based optimization of the PPO algorithm more sensitive.

Conclusion

  • Proposes NerveNet, a policy built on a graph structure that exploits the RL agent's body structure
    • Takes the observations of each body and joint and models the computation and propagation of non-linear messages with a GNN
    • Propagation follows the inherent dependencies given by the physical connectivity between joints, represented as edges
  • Experimentally, NerveNet shows performance comparable to MLP-based state-of-the-art methods across multiple environments in the MuJoCo simulator
  • Size and disability transfer were verified on several customized environments, and transferability was shown even in a zero-shot setting

Review

My subjective pros and cons after reviewing the paper:

  • Pros 👍
    • Shows that feature embedding based on the robot's structural characteristics is efficient
    • Works well across diverse robot configurations
    • The transfer learning and multi-task learning experiments, which demonstrate the model's extensibility, were impressive and are a major strength
  • Cons 👎
    • A pity that the experiments were run only in simulation
    • The GNN model is more basic than expected; there does not seem to be much design effort put into the edges
    • Honestly, comparing the synergy with various RL algorithms would have made the paper enormous, but such a comparison would still have been nice to see

Reference

  • Original Project Homepage: http://www.cs.toronto.edu/~tingwuwang/nervenet.html
  • Code
    • Official: https://github.com/WilsonWangTHU/NerveNet
    • Unofficial: https://github.com/HannesStark/gnn-reinforcement-learning

Copyright 2024, Jung Yeon Lee