Curieux.JY
  • JungYeon Lee
📃 GAN-RL

rl
gan
theory
Connecting Generative Adversarial Networks and Actor-Critic Methods
Published

April 21, 2026

  • Paper Link (arXiv:1610.01945)
  1. 📜 The paper formally connects GANs to actor-critic methods, describing a GAN as a modified actor-critic method in a stateless MDP where the actor cannot affect the reward.
  2. 💡 GANs and actor-critic methods are both multilevel optimization problems that are hard to optimize, and the paper reviews and compares the training-stabilization strategies developed in each field.
  3. 🤝 By highlighting this formal connection, the authors encourage the GAN and RL communities to exchange ideas and to develop general, stable multilevel optimization algorithms for deep networks.

🔍 Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

The paper highlights a formal connection between Generative Adversarial Networks (GANs) from unsupervised learning and Actor-Critic (AC) methods from reinforcement learning (RL). It stresses that both are multilevel optimization problems that are hard to optimize, and that they share a similar information-flow structure and similar training instabilities. The authors show that a GAN can be viewed as a modified actor-critic method in an environment (a stateless MDP) where the actor cannot affect the reward. They then compare the stabilization strategies that the two communities developed independently, hoping that each side can draw on the other to build better algorithms.

Core Methodology

Both GANs and AC methods can be formalized as bilevel optimization problems of the following form.

x^* = \arg\min_{x \in X} F(x, y^*(x)), \qquad y^*(x) = \arg\min_{y \in Y} f(x, y)

Here x is the variable of the upper-level problem and y is the variable of the lower-level problem. The upper-level optimization depends on the optimal lower-level solution y^*(x).
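The bilevel structure above can be made concrete with a toy problem. The sketch below is only an illustration: the quadratic F and f, the step sizes, and the function names are all assumptions chosen so the answer can be checked by hand, and the nested-loop scheme (many inner steps per outer step) is one simple way to approximate y^*(x), not a method taken from the paper.

```python
# Toy bilevel problem (F and f are illustrative assumptions):
#   inner:  y*(x) = argmin_y f(x, y),     f(x, y) = (y - x)**2         -> y*(x) = x
#   outer:  x*    = argmin_x F(x, y*(x)), F(x, y) = (x - 1)**2 + y**2
# Substituting y*(x) = x gives (x - 1)**2 + x**2, which is minimized at x = 0.5.


def inner_grad(x, y):
    """Gradient of f(x, y) = (y - x)**2 with respect to y."""
    return 2.0 * (y - x)


def outer_grad(x, y):
    """Total derivative dF/dx along y = y*(x); dy*/dx = 1 for this toy f."""
    return 2.0 * (x - 1.0) + 2.0 * y * 1.0


def solve_bilevel(x=3.0, y=-2.0, outer_steps=200, inner_steps=20,
                  outer_lr=0.1, inner_lr=0.3):
    """Approximate the inner optimum with a few steps, then take one outer step."""
    for _ in range(outer_steps):
        for _ in range(inner_steps):
            y -= inner_lr * inner_grad(x, y)   # lower-level problem
        x -= outer_lr * outer_grad(x, y)       # upper-level problem
    return x, y


if __name__ == "__main__":
    print(solve_bilevel())
```

With deep networks the inner problem is rarely solved this accurately, which is one source of the instabilities discussed later in the review.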

1. Generative Adversarial Networks (GANs)

A GAN is a zero-sum game between two neural networks, a generator G and a discriminator D. The generator produces data from random noise z, and the discriminator classifies its input as real data or generated (fake) data.

  • Objective: the standard GAN optimizes the following minimax game.

    \min_G \max_D V(D, G) = \mathbb{E}_{w \sim p_{data}}[\log D(w)] + \mathbb{E}_{z \sim N(0,I)}[\log(1 - D(G(z)))]

    Here p_{data} is the real data distribution and N(0,I) is the noise distribution. D tries to maximize V(D,G), while G tries to minimize it by driving 1 - D(G(z)) down, that is, by driving D(G(z)) up.
  • Bilevel view: each player is optimized against the other's best response. In the notation used here, F is the discriminator's loss and f is the generator's loss.

    F(D, G) = -\mathbb{E}_{w \sim p_{data}}[\log D(w)] - \mathbb{E}_{z \sim N(0,I)}[\log(1 - D(G(z)))] (D's objective: the cross-entropy classification loss)

    f(D, G) = -\mathbb{E}_{z \sim N(0,I)}[\log D(G(z))] (G's objective: push D(G(z)) toward 1)

    In practice the generator is trained to maximize \log D(G(z)) rather than to minimize \log(1 - D(G(z))), which mitigates vanishing gradients when the discriminator is accurate.
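The vanishing-gradient remark above can be checked numerically. The sketch below assumes the discriminator output is D = sigmoid(logit) on a generated sample and compares the gradient of the two generator losses with respect to that logit; the function names are illustrative and not from the paper.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def saturating_grad(logit):
    """d/dlogit of log(1 - D), the original minimax generator loss."""
    return -sigmoid(logit)


def non_saturating_grad(logit):
    """d/dlogit of -log(D), the non-saturating generator loss."""
    return sigmoid(logit) - 1.0


if __name__ == "__main__":
    # A very negative logit means the discriminator is confident the sample is fake.
    for logit in (-8.0, -2.0, 0.0, 2.0):
        print(logit, saturating_grad(logit), non_saturating_grad(logit))
```

When the discriminator confidently rejects a sample (very negative logit), the saturating gradient is close to 0 while the non-saturating gradient stays close to -1, so the generator still receives a usable signal.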

2. Actor-Critic Methods

AC methods in reinforcement learning train two components at once: an actor that learns a policy \pi and a critic that learns a value function Q. The critic evaluates the actor's policy, and that evaluation is used either to estimate the policy gradient or to update the policy directly.

  • Objective: in a Markov Decision Process (MDP), the goal of actor-critic is to learn the value function Q^\pi(s,a) and to find the policy \pi^* that is optimal with respect to it.

    Q^\pi(s, a) = \mathbb{E}_{s_{t+k} \sim P,\, r_{t+k} \sim R,\, a_{t+k} \sim \pi}\Big[\sum_{k=1}^\infty \gamma^k r_{t+k} \,\Big|\, s_t=s, a_t=a\Big] (expected discounted reward)

    \pi^* = \arg\max_\pi \mathbb{E}_{s_0 \sim p_0,\, a_0 \sim \pi}[Q^\pi(s_0, a_0)] (optimal policy)
  • Critic optimization: the critic usually learns the value function by minimizing a loss based on the Bellman equation.

    Q^\pi = \arg\min_Q \mathbb{E}_{s_t, a_t \sim \pi}\big[\mathcal{D}(\mathbb{E}_{s_{t+1}, r_t, a_{t+1}}[r_t + \gamma Q(s_{t+1}, a_{t+1})] \,\Vert\, Q(s_t, a_t))\big]

    Here \mathcal{D}(\cdot \Vert \cdot) is a divergence measure (not the GAN discriminator D).
  • Bilevel view: the critic Q is fit to the current policy, and the policy \pi is optimized against the critic's estimate. F is the critic's loss and f is the actor's loss.

    F(Q, \pi) = \mathbb{E}_{s_t, a_t \sim \pi}\big[\mathcal{D}(\mathbb{E}_{s_{t+1}, r_t, a_{t+1}}[r_t + \gamma Q(s_{t+1}, a_{t+1})] \,\Vert\, Q(s_t, a_t))\big]

    f(Q, \pi) = -\mathbb{E}_{s_0 \sim p_0,\, a_0 \sim \pi}[Q^\pi(s_0, a_0)]
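As a minimal sketch of the critic loss F above, the code below takes the divergence to be squared error and uses a tabular Q stored in a dictionary. Both choices are assumptions made for illustration, as is the single-sample estimate of the inner expectation (exact only when transitions are deterministic).

```python
def bellman_residual_loss(Q, transitions, gamma=0.9):
    """Mean squared Bellman residual over (s, a, r, s_next, a_next) samples.

    Q maps (state, action) pairs to values. Each sample's target
    r + gamma * Q[s_next, a_next] stands in for the inner expectation.
    """
    total = 0.0
    for s, a, r, s_next, a_next in transitions:
        target = r + gamma * Q[(s_next, a_next)]
        total += (target - Q[(s, a)]) ** 2
    return total / len(transitions)


if __name__ == "__main__":
    # One state, one action, reward 1 on every step (hypothetical toy MDP).
    loop = [(0, 0, 1.0, 0, 0)]
    print(bellman_residual_loss({(0, 0): 0.0}, loop))    # untrained critic
    print(bellman_residual_loss({(0, 0): 10.0}, loop))   # fixed point of Q = 1 + 0.9 * Q
```

The residual vanishes at Q = 10 because that value satisfies Q = r + \gamma Q for this self-loop.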

Connection between GANs and AC

The paper interprets a GAN as an actor-critic method in a particular MDP.

  1. Stateless MDP: the actor (generator) produces an image, and the environment randomly shows either that generated image or a real image. The actor has no knowledge of the current state (a blind actor), so it cannot simply pass a real image through. In other words, the actor cannot causally affect the reward.
  2. Reward scheme: the reward is 1 if the environment shows a real image and 0 if it shows a generated image.
  3. Critic (discriminator): from this reward the critic learns to classify its input as real (1) or fake (0). Its loss is the GAN cross-entropy loss rather than the mean-squared Bellman residual.
  4. Actor update: the actor adjusts its action (the generated image) using gradient information from the critic. When the environment shows a real image, the actor's parameters are not updated.

From this perspective, a GAN is an actor-critic method in an environment where the actor cannot causally affect the reward.


🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Introduction

Most machine learning problems are formulated as the optimization of a single objective. Some problems, however, have no single cost: several models exchange information while each minimizes its own private loss, which gives a hybrid or multilevel structure. Applying plain gradient descent to such problems often produces pathological behavior such as oscillation or collapse onto a degenerate solution. Even so, multilevel-loss models have considerable potential (there is a hypothesis that the brain itself works as a combination of many local losses), and AC and GANs are the leading examples.

The two are strikingly similar.

  • Information flow: one model (the actor in AC, the generator in a GAN) produces an output, and a second model (the critic in AC, the discriminator in a GAN) evaluates it, in a simple feedforward arrangement.
  • Access to environment information: only the second model directly sees the special information from the environment (the reward in AC, the real samples in a GAN).
  • Learning signal: the first model learns only from the error signal supplied by the second model.

Both suffer from stability problems, and the two communities developed their stabilization techniques almost independently. The purpose of the paper is to highlight the strong connection between the two model classes.

Method: two algorithms and the bridge between them

Bilevel optimization as a common framework

GANs and AC are both bilevel (or two-time-scale) problems in which one model is optimized with respect to the optimum of the other.

x^* = \arg\min_{x\in X} F(x, y^*(x)), \qquad y^*(x) = \arg\min_{y\in Y} f(x, y)

Such problems have long been studied in operations research, but mostly in linear or convex settings. The difference here is that the objects being optimized are deep neural networks.

GAN

A GAN is a zero-sum game between a generator G (which takes noise z\sim\mathcal N(0,I) and produces a sample) and a discriminator D (which classifies samples as real or fake).

\min_G \max_D \ \mathbb{E}_{w\sim p_{\text{data}}}[\log D(w)] + \mathbb{E}_{z\sim\mathcal N(0,I)}[\log(1 - D(G(z)))]

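So that the generator still receives gradients when the discriminator is very accurate, the generator loss is usually written as "maximize the probability of being classified as real" instead of "minimize the probability of being classified as fake" (the non-saturating loss). Written as a bilevel problem, with F the discriminator's loss and f the generator's loss: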

F(D, G) = -\mathbb{E}_{w\sim p_{\text{data}}}[\log D(w)] - \mathbb{E}_{z}[\log(1 - D(G(z)))]

f(D, G) = -\mathbb{E}_{z}[\log D(G(z))]

Actor-Critic

AC trains an actor (the policy \pi) and a critic (the value function Q^\pi) at the same time. The action-value function predicts the expected discounted reward.

Q^\pi(s,a) = \mathbb{E}\Big[\textstyle\sum_{k=1}^{\infty} \gamma^k r_{t+k} \,\big|\, s_t=s, a_t=a\Big]

Q^\pi can be expressed as the minimizer of a divergence on the Bellman residual, so the overall problem is again bilevel, with F the critic's loss and f the actor's loss.

F(Q, \pi) = \mathbb{E}_{s_t,a_t\sim\pi}\big[\mathcal D(\mathbb{E}[r_t + \gamma Q(s_{t+1}, a_{t+1})] \,\Vert\, Q(s_t, a_t))\big]

f(Q, \pi) = -\mathbb{E}_{s_0\sim p_0, a_0\sim\pi}[Q^\pi(s_0, a_0)]

The authors focus on methods for continuous actions: DPG (deterministic policy gradient), its stochastic extension SVG(0), and NFQCA (neurally-fitted Q-learning with continuous actions). What these have in common is that, instead of passing the TD error directly to the actor, they backpropagate the gradient of the estimated value with respect to the action. This corresponds exactly to the way a GAN generator receives gradients from the discriminator.
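The shared mechanism described above, passing \partial Q/\partial a back to the actor, can be sketched in a few lines. Everything in this example is an assumption made for illustration: the critic is a fixed quadratic with a known maximum at a = 2, the actor is a single parameter with no state input, and none of the names come from DPG, SVG(0), or NFQCA implementations.

```python
def critic_value(action):
    """Hypothetical fitted critic: Q(a) = -(a - 2)^2, maximized at a = 2."""
    return -(action - 2.0) ** 2


def critic_grad_wrt_action(action):
    """dQ/da for the critic above."""
    return -2.0 * (action - 2.0)


def train_actor(theta=0.0, lr=0.1, steps=100):
    """Deterministic actor a = theta, updated by gradient ascent on Q(a)."""
    for _ in range(steps):
        action = theta                        # policy output
        dq_da = critic_grad_wrt_action(action)
        da_dtheta = 1.0                       # chain rule through the actor
        theta += lr * dq_da * da_dtheta
    return theta


if __name__ == "__main__":
    theta = train_actor()
    print(theta, critic_value(theta))
```

The actor never sees a reward or a TD error here. It only follows the critic's gradient, which is the same information path a GAN generator uses.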

Construction that reduces a GAN to AC (the core result)

flowchart LR
    subgraph GAN["GAN"]
        Z["z (noise)"] --> G["G (generator = actor)"]
        G --> D["D (discriminator = critic)"]
        X["real sample x"] --> D
        Y["label y (reward)"] --> D
        D -.->|gradient| G
    end
    subgraph AC["Actor-Critic"]
        S["s_t (state)"] --> PI["π (actor)"]
        PI --> Q["Q (critic)"]
        R["r_t (reward)"] --> Q
        Q -.->|∂Q/∂a gradient| PI
    end

An MDP equivalent to a GAN is built as follows.

  • Action: set every pixel of an image.
  • Environment: randomly shows either the actor's image or a real image.
  • Reward: 1 if the real image was chosen, 0 otherwise.
  • The actor's image has no effect on future data, so this MDP is stateless.

Training AC on this MDP gives something almost identical to the GAN game. A few adjustments are needed to make the match exact.

  1. Blind actor: if the actor could see the state, it could simply pass the real image through, so the state must be hidden from it (this does not hinder learning because the MDP is stateless).
  2. Cross-entropy loss: the critic uses cross-entropy, matching the GAN loss, instead of the usual mean-squared Bellman residual.
  3. Scaling term: the actor receives the gradient of the value rather than of the Bellman residual, so a term proportional to \partial \mathcal D/\partial Q is needed (in practice this is handled by a separate generator loss).
  4. No actor update on real images: when the reward is 1, the critic's gradient with respect to the action is set to 0.

→ GAN = a modified actor-critic with a blind actor in a stateless MDP.
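The four adjustments above can be put together in a single update step. The sketch below is an assumption-heavy toy: images are scalars, the critic is logistic regression D(x) = sigmoid(w*x + b), the actor is one parameter theta plus noise z, and the actor uses the non-saturating loss -log D. It illustrates the construction and is not the paper's code.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def gan_as_ac_step(theta, w, b, real_sample, z, show_real, lr=0.05):
    """One actor-critic update in the stateless 'GAN MDP' (scalar toy version)."""
    action = theta + z                      # blind actor: input is noise only, no state
    shown = real_sample if show_real else action
    reward = 1.0 if show_real else 0.0      # reward depends only on the environment's coin

    d = sigmoid(w * shown + b)              # critic's predicted reward
    grad_logit = d - reward                 # cross-entropy gradient w.r.t. the logit
    new_w = w - lr * grad_logit * shown
    new_b = b - lr * grad_logit

    if show_real:
        actor_grad = 0.0                    # real image shown: no gradient reaches the actor
    else:
        # d/da of -log D(a) = (-1/D) * dD/da; the factor -1/D is the scaling term
        actor_grad = (d - 1.0) * w
    new_theta = theta - lr * actor_grad
    return new_theta, new_w, new_b, actor_grad


if __name__ == "__main__":
    print(gan_as_ac_step(0.0, 1.0, 0.0, real_sample=3.0, z=0.0, show_real=False))
    print(gan_as_ac_step(0.0, 1.0, 0.0, real_sample=3.0, z=0.0, show_real=True))
```

In the first call the actor moves toward values the critic currently rates as more real. In the second call only the critic changes.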

Why it becomes adversarial

In ordinary AC the actor and the critic optimize complementary losses rather than opposing ones. A GAN is adversarial because its MDP is an environment in which the actor has no causal effect on the reward, so the true policy gradient is always 0. The critic cannot learn the causal structure of the game from input examples alone, so it moves toward features that predict the reward. The actor moves in the direction that the critic's best estimate says will increase the reward, yet it cannot increase the true reward. The critic therefore soon assigns low value to that direction, and two updates that would ideally be orthogonal become adversarial.
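The claim that the actor cannot change the true reward is easy to verify by simulation. The sketch below assumes the environment shows a real image with probability 0.5 and models the actor as a Gaussian with an adjustable mean; both are illustrative assumptions.

```python
import random


def average_reward(actor_mean, episodes=20000, seed=0):
    """Monte Carlo estimate of the true expected reward for a given actor."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        _action = rng.gauss(actor_mean, 1.0)   # the actor's "image"; the reward ignores it
        show_real = rng.random() < 0.5         # environment's coin flip
        total += 1.0 if show_real else 0.0
    return total / episodes


if __name__ == "__main__":
    for mean in (-5.0, 0.0, 5.0):
        print(mean, average_reward(mean))
```

Every actor gets the same expected reward of about 0.5, so the true policy gradient is zero. Any gradient the actor does receive comes from the critic's estimate, not from the environment.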

The consequences of partial observability also matter. In a fully observed MDP a deterministic optimal policy always exists, whereas in a GAN the fixed point of the minimax game is a generator that matches the real distribution, which is in general a stochastic policy.

Stabilization strategies: tricks from the two communities (Table 1)

The authors catalogue each field's "tricks of the trade" and discuss whether they can be transplanted to the other field.

Technique                   GAN               AC
Freezing learning           ✅                ✅
Label smoothing             ✅                ❌ (not tried)
Historical averaging        ✅                ❌ (not tried)
Minibatch discrimination    ✅                ❌ (not tried)
Batch normalization         ✅                ✅
Target networks             n/a               ✅
Replay buffers              ❌ (not tried)    ✅
Entropy regularization      ❌ (not tried)    ✅
Compatibility               ❌                ✅

Key cross-field insights:

  • Freezing learning: GAN training freezes one model when it becomes too strong. AC likewise freezes actor or critic learning when the magnitude of the TD error leaves a threshold range. It is the same idea.
  • Label smoothing: replacing 0/1 labels with \epsilon/(1-\epsilon) prevents vanishing gradients. It should also be applicable in RL settings where rewards are 0/1 and the critic's gradients vanish.
  • Historical averaging: inspired by fictitious play in game theory, it adds a drag term to steps that move away from the average of past parameters, which damps oscillation. It is related to Polyak-Ruppert averaging (analyzed in RL but not adopted as a standard tool). The replay buffer in DPG is also conceptually similar to fictitious play, although it cannot be applied to the actor.
  • Minibatch discrimination: to prevent collapse onto a single sample, the discriminator classifies a whole minibatch, which raises the generator's entropy. This corresponds to the underexploration problem in RL, usually handled with entropy penalties, and something like minibatch discrimination could be an alternative for exploration in continuous spaces.
  • Replay buffers: effective for decorrelating samples in AC, but applicable only to the critic (the actor cannot learn from gradients for different past actions). A buffer of past generated images was tried for GANs, but it failed to produce asymptotically correct samples even on simple distributions.
  • Target networks: because the GAN MDP is stateless, the second Q in the Bellman recursion disappears and discriminator training becomes ordinary regression, so target networks do not apply to GANs. They remain useful for other multilevel problems that have Q-learning as a subproblem.
  • Entropy regularization (AC) ↔ mode collapse (GAN): exploration techniques from continuous control could be transplanted to increase GAN sample diversity.
  • Compatibility: the compatible critic in AC is an elegant theory that, at the optimum, yields an unbiased natural gradient. In the GAN MDP, however, the true value of every policy is always 0.5, so the true policy gradient is 0, and an adversarial critic is preferred to a compatible one.
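Several of the tricks in the list above reduce to a few lines each. The helpers below are minimal sketches under stated assumptions: parameters are plain lists of floats, the historical-averaging penalty is the squared distance to the mean of past parameters, and the target-network update is the soft (Polyak-style) variant. The names and default values are hypothetical.

```python
import random
from collections import deque


def smooth_labels(labels, eps=0.1):
    """Label smoothing: map hard 0/1 labels to eps / 1 - eps."""
    return [eps if y == 0 else 1.0 - eps for y in labels]


def historical_average_penalty(params, history):
    """Drag term: squared distance from the mean of past parameter vectors."""
    n = len(history)
    mean = [sum(past[i] for past in history) / n for i in range(len(params))]
    return sum((p - m) ** 2 for p, m in zip(params, mean))


def soft_update(target, online, tau=0.01):
    """Target network: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


class ReplayBuffer:
    """Fixed-size buffer of past transitions, sampled uniformly (critic only)."""

    def __init__(self, capacity, seed=0):
        self.items = deque(maxlen=capacity)   # oldest entries are dropped first
        self.rng = random.Random(seed)

    def add(self, transition):
        self.items.append(transition)

    def sample(self, batch_size):
        pool = list(self.items)
        return self.rng.sample(pool, min(batch_size, len(pool)))


if __name__ == "__main__":
    print(smooth_labels([0, 1, 1]))
    print(historical_average_penalty([1.0, 1.0], [[0.0, 0.0], [2.0, 2.0]]))
    print(soft_update([0.0], [1.0], tau=0.5))
```

The penalty is zero in the example because the current parameters coincide with the historical mean. It grows as an update moves away from that mean, which is what damps oscillation.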

More complex information flows: extensions

The authors also survey more complex models that can be viewed as multilevel optimization (in the supplementary material).

  • GAN extensions: f-GAN (generalizes the GAN loss to a lower bound on an f-divergence), EBGAN (energy-based discriminator), VAE/GAN, BiGAN/ALI (adds an inference network), Adversarial Autoencoder, and InfoGAN (mutual-information maximization). Adding an inference network or a third model makes the optimization more complex.
  • AC extensions: A3C (learns only the state value V, so action gradients cannot be backpropagated; it is less closely related to GANs but successful in continuous control) and SVG(1) (combines an actor, a critic, and a model f).
  • Imitation learning and inverse reinforcement learning (IRL): GAIL reduces cost-function learning to minimizing a distance between occupancy distributions, which gives almost the same form as a GAN. Replacing direct policy optimization with AC yields a three-level optimization that has both a GAN and AC as subproblems. Finn et al. show that the GAN objective is equivalent to the MaxEnt IRL objective and that GAN training is equivalent to guided cost learning.
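To make the resemblance between GAIL and a GAN concrete, the GAIL objective is commonly written in the following form. The symbols \pi_E (expert policy), H (policy entropy), and \lambda (entropy weight) do not appear in the review above and are introduced here only for this sketch.

```latex
\min_{\pi} \max_{D} \;
  \mathbb{E}_{(s,a)\sim\pi}\big[\log D(s,a)\big]
  + \mathbb{E}_{(s,a)\sim\pi_E}\big[\log(1 - D(s,a))\big]
  - \lambda H(\pi)
```

Apart from the entropy term and the swapped label convention, this has the same two-expectation shape as the GAN minimax objective, with state-action pairs from the policy playing the role of generated samples and expert pairs playing the role of real data.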

Critical discussion

Strengths

  • The power of the conceptual insight. Recasting a GAN as "AC in an MDP where the actor cannot affect the reward" lets us view the instability and adversarial nature of both fields through the same multilevel-optimization lens. Explaining the origin of adversarial behavior as "true policy gradient = 0" is especially clear.
  • A cross-map of practical tricks. Table 1 and the cross-field insights name concrete candidates for moving a stabilization technique from one field to the other, which gives follow-up research a starting point.
  • A unified view of extended models. By grouping VAE/GAN, BiGAN, GAIL, and IRL under multilevel optimization, the paper sketches a map of the whole adversarial-learning landscape.

Weaknesses and limitations

  • No experiments (it is a note). The paper is a position or review note that highlights a conceptual connection and contains no new algorithm or empirical validation. The practical value of the proposed transplants (for example, minibatch discrimination for RL or replay buffers for GANs) is untested, and the paper itself reports that the GAN replay buffer failed.
  • Asymmetry of the connection. As the authors note, this paper shows "AC in a particular MDP = every GAN", whereas Finn et al. show "a particular GAN extension = guided cost learning in all cases". The connection is not fully symmetric in both directions.
  • Historical scope. The paper reflects GANs and AC as of 2016 (DPG, SVG, NFQCA, DCGAN, and so on), so readers must work out for themselves how it fits later developments such as WGAN, diffusion models, and recent RL (this last point is the reviewer's speculation).

Summary and conclusion

The paper places the GAN from unsupervised learning and actor-critic from reinforcement learning in the common framework of bilevel optimization, and it constructs precisely the sense in which a GAN is "AC with a blind actor in a stateless MDP where the actor cannot affect the reward". Because the true policy gradient is 0 in this environment, AC updates that should be orthogonal become adversarial, and this is the root of GAN instability.

Building on this connection, the paper collects both communities' stabilization tricks (freezing, label smoothing, historical averaging, replay buffers, target networks, entropy regularization, compatibility, and others) in a single table. It points out which techniques can be transplanted in which direction, and where the reduction breaks down (for example, a GAN needs no target network because its MDP is stateless).

There are no experiments, but the value of the note lies in expressing two hard multilevel optimization problems in one language, and in calling for a free flow of ideas between the fields and for general, scalable, and stable algorithms. Including its links to GAIL and MaxEnt IRL, it serves as a conceptual map connecting adversarial learning and reinforcement learning, and it remains a foundational reference that later work continues to cite.

Copyright 2026, JungYeon Lee