📃 HIL-SERL Review

human-in-the-loop
rl
il
Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning
Published

January 23, 2026

🔍 Ping. 🔔 Ring. ⛏️ Dig. A tiered review series: quick look, key ideas, deep dive.

  • Paper Link
  • Project
  • Code
  1. 🤖 HIL-SERL (Human-in-the-Loop Sample-Efficient Robotic Reinforcement Learning) is a system that learns robotic manipulation skills by combining human corrections with a sample-efficient RL (Reinforcement Learning) algorithm.
  2. 🚀 With only 1 to 2.5 hours of training, the system reaches near-perfect success rates and fast cycle times on a range of complex manipulation tasks, including Jenga block whipping, timing belt assembly, motherboard assembly, and dual-arm coordination.
  3. ✨ HIL-SERL achieves about 2x the success rate and 1.8x faster execution than existing imitation learning approaches, demonstrating that RL can efficiently learn complex, precise, vision-based manipulation policies in the real world.

๐Ÿ” Ping Review

🔍 Ping: A light tap on the surface. Get the gist in seconds.

This paper proposes HIL-SERL, a human-in-the-loop, vision-based RL system that addresses the challenges standing between Reinforcement Learning (RL) and real-world robotic manipulation skill acquisition. The system performs a wide range of dexterous manipulation tasks, including dynamic manipulation, precision assembly, and dual-arm coordination, with impressive performance.

Core Methodology

HIL-SERL combines several components to address the problems that make RL hard to apply in practice: sample complexity, optimization stability, and the lack of an accurate reward function.

  1. Sample-Efficient RL Algorithm and Data Integration:
    • The core RL algorithm is based on RLPD (Ball et al., 2023), chosen for its sample efficiency and its ability to incorporate prior data.
    • During training, the actor process runs the current policy on the robot, interacts with the environment, and sends the collected data to the replay buffer. The learner process samples from the replay buffer to optimize the policy.
    • Two replay buffers are used: the demo buffer stores human demonstrations, and the RL buffer stores on-policy data. The learner process samples equally from the demo buffer and the RL buffer to update the policy.
  2. Human-in-the-Loop (HIL) Interaction:
    • Human demonstrations and human corrections are integrated to accelerate learning.
    • Within the actor process, a human operator can intervene and take control of the robot with an input device such as a SpaceMouse. This matters most when the robot reaches an unrecoverable or undesirable state, or when it is stuck in a local optimum.
    • When the human intervenes, the human action (\mathbf{a}_{intv}) is applied to the robot instead of the policy action (\mathbf{a}_{RL}). Intervention data is stored in both the demo buffer and the RL buffer, while the policy's own transitions are added only to the RL buffer.
    • Interventions are frequent early on and taper off as the policy improves, steering the policy toward fully autonomous execution.
  3. System-Level Design Choices:
    • Pretrained Vision Backbones: To improve learning efficiency, images are processed by a pretrained vision backbone such as a ResNet-10 model pretrained on ImageNet. This benefits not only robustness and generalization but also optimization stability and exploration efficiency. Image embeddings from multiple cameras are concatenated with proprioceptive information and used for learning.
    • Reward Function: A sparse reward function is used. A binary classifier that decides whether each task has succeeded is trained offline and used to assign rewards. This avoids the difficulty of hand-designing rewards.
    • Downstream Robotic System:
      • To promote spatial generalization, the robot's proprioceptive state is expressed in a relative coordinate system, which enables an ego-centric formulation. The end-effector pose is randomized at the start of each episode so that the policy can adapt to movement of the object.
      • For safety in contact-rich tasks an impedance controller is used. For dynamic tasks, feedforward wrenches are commanded directly in the end-effector frame to accelerate the robot arm.
    • Gripper Control: A separate critic network evaluates discrete grasping actions. This removes the difficulty of approximating discrete gripper actions such as "open", "close", and "stay" with a continuous distribution. Two separate MDPs are solved: the continuous-action MDP \mathcal{M}_1 = \{\mathcal{S}, \mathcal{A}_1, \rho_1, \mathcal{P}_1, r, \gamma\} and the discrete-action MDP \mathcal{M}_2 = \{\mathcal{S}, \mathcal{A}_2, \rho_2, \mathcal{P}_2, r, \gamma\}. The critic of \mathcal{M}_2 follows the DQN (Mnih et al., 2013) update rule:

      \mathcal{L}(\theta) = \mathbb{E}_{\mathbf{s},\mathbf{a},\mathbf{s}'}\left[\left(r + \gamma Q_{\theta'}\left(\mathbf{s}', \arg\max_{\mathbf{a}'} Q_{\theta}\left(\mathbf{s}', \mathbf{a}'\right)\right) - Q_{\theta}(\mathbf{s}, \mathbf{a})\right)^2\right]

      At training or inference time, the continuous action is first queried from the policy of \mathcal{M}_1, the discrete action is then queried from the critic of \mathcal{M}_2 via argmax, and the combined action is applied to the robot.
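The two-stage action query described for gripper control can be sketched as follows. This is a minimal illustration, not the authors' implementation: `select_action`, `toy_policy`, and `toy_critic` are hypothetical names, and the toy functions stand in for the learned networks.

```python
from typing import Callable, Dict, List, Tuple

GRIPPER_ACTIONS = ("open", "close", "stay")

def select_action(
    state: List[float],
    continuous_policy: Callable[[List[float]], List[float]],
    gripper_critic: Callable[[List[float]], Dict[str, float]],
) -> Tuple[List[float], str]:
    """Query the continuous policy first, then take the argmax of the gripper critic."""
    twist = continuous_policy(state)          # continuous part (M_1)
    q_values = gripper_critic(state)          # one Q-value per discrete action (M_2)
    gripper = max(GRIPPER_ACTIONS, key=lambda a: q_values[a])
    return twist, gripper

# Hypothetical stand-ins for the learned actor and grasp critic.
def toy_policy(state: List[float]) -> List[float]:
    return [0.1 * s for s in state]

def toy_critic(state: List[float]) -> Dict[str, float]:
    return {"open": 0.2, "close": 0.9, "stay": 0.5}

twist, gripper = select_action([1.0, 2.0, 3.0, 0.0, 0.0, 0.0], toy_policy, toy_critic)
print(twist, gripper)
```

The combined pair `(twist, gripper)` is what would be sent to the robot in this sketch.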

Experiments and Results

The study evaluates HIL-SERL on seven diverse tasks: Motherboard Assembly, IKEA Assembly, Car Dashboard Assembly, Object Handover, Timing Belt Assembly, Jenga Whipping, and Object Flipping.

  • Performance: HIL-SERL reached an average success rate of 100% and an average cycle time of 5.4 s, far ahead of the imitation learning (IL) baseline HG-DAgger (49.7% average success rate, 9.6 s average cycle time). The gap is most pronounced on complex tasks (Jenga Whipping, RAM stick insertion, Timing Belt Assembly).
  • Training time: On most tasks, near-perfect success rates were reached within 1 to 2.5 hours of real-world training.
  • Declining human intervention: As training progresses, both the frequency and the duration of human interventions gradually decrease, showing that the policy improves toward autonomy.
  • Robustness: Policies trained with HIL-SERL adapt dynamically to unexpected situations such as external perturbations, forcibly opened grippers, and poor grasping poses, and they learn to retry or regrasp on their own.
  • Comparison with other baselines: HIL-SERL also showed consistently higher success rates than recent methods such as Diffusion Policy, Residual RL, DAPG, and IBRL. Notably, RL alone without any initial data reached 0% success, and simply adding more demonstrations without online human corrections failed on complex tasks.

Result Analysis

  1. Reliability of the Learned Policies:
    • HIL-SERL's high reliability comes from RL's inherent capacity for self-correction. Through policy sampling, the agent keeps learning from both successes and failures.
    • State visitation heatmaps for the RAM insertion task show that the policy gradually forms a "funnel" leading from the initial states to the goal location, which indicates growing confidence and precision.
    • Q-function variance analysis identified "critical states" (states with large Q-function variance). These states are important to policy success and are associated with high Q-values. This suggests that RL, through dynamic programming, makes these regions robust by connecting important states with high Q-values.
  2. Reactive Policy and Predictive Policy:
    • HIL-SERL shows that both types of policy can be learned within a single algorithmic framework.
    • Reactive Policy: In precise manipulation tasks such as RAM insertion, the policy shows high variance at first, which drops quickly as it approaches the goal. This corresponds to closed-loop behavior that reacts to sensory feedback in real time, much like continuous visual servoing. Its hallmark is the ability to correct errors over several attempts, for example by breaking contact after a collision and approaching again.
    • Predictive Policy: In dynamic manipulation tasks such as Jenga Whipping, the standard deviation stays consistently low. This indicates that the agent predicts the outcome of its actions through interaction with the environment and refines its motion to minimize prediction error, producing a consistent open-loop motion.

Conclusion

This work shows that, with the right design choices, model-free RL can efficiently solve a variety of complex manipulation tasks in the real world within practical training times. HIL-SERL effectively integrates human demonstrations and corrections, builds on a sample-efficient algorithm such as RLPD, and reaches high performance and robustness through specific robot system design choices (for example, a relative coordinate system and a separate critic for gripper control). The work has potential impact on industrial applications such as High-Mix Low-Volume (HMLV) manufacturing, and it could also serve as a means of generating high-quality data for future robot foundation models.


🔔 Ring Review

🔔 Ring: An idea that echoes. Grasp the core and its value.

Introduction: Why Does This Paper Matter?

The Long-Standing Dream of Robotic Manipulation

Imagine, for a moment, a robot inserting a RAM card into a motherboard, assembling IKEA furniture, and even whipping a single block out of a Jenga tower. It sounds like a scene from a science fiction film, yet a research team at UC Berkeley has made it real, and with only 1 to 2.5 hours of real-world training.

Reinforcement learning (RL) has long been regarded as a Holy Grail of robotics. The idea of learning optimal behavior through trial and error is appealing, but in practice it has repeatedly hit a wall:

  • Sample complexity: if millions of trials are needed, training on a real robot is infeasible
  • Reward function design: defining "good" behavior is harder than it looks
  • Optimization instability: policies fail to converge on high-dimensional image inputs

This paper, HIL-SERL (Human-in-the-Loop Sample-Efficient Robotic Reinforcement Learning), tackles all of these problems head-on. Professor Feynman might have put it this way: "Nature is often simpler than we think. We just have to find the right point of view."

Problem Definition: What We Are Trying to Solve

Let us define the robotic manipulation problem as a Markov decision process (MDP):

\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \rho, P, r, \gamma\}

where:

| Symbol | Meaning | Practical example |
|---|---|---|
| \mathcal{S} | State space | Camera images + robot joint information |
| \mathcal{A} | Action space | End-effector twist (6D) + gripper command |
| \rho(\mathbf{s}_0) | Initial state distribution | Randomized task start position |
| P | Transition probability | Physics of the robot and environment (unknown) |
| r | Reward function | +1 on task success, 0 on failure |
| \gamma | Discount factor | Present value of future rewards (typically 0.99) |

Our goal is to find the policy \pi^* that maximizes the expected cumulative reward:

\pi^* = \arg\max_\pi \mathbb{E}\left[\sum_{t=0}^{H} \gamma^t r(\mathbf{s}_t, \mathbf{a}_t)\right]

Here the expectation is taken over the initial state distribution, the transition probabilities, and the policy \pi.
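As a small worked example of the objective above, the snippet below evaluates the discounted sum for a sparse reward that is 1 only at the step where the task succeeds. The episode lengths are made-up illustrations, and \gamma = 0.99 is the typical value quoted in the table. One reading of the result is that discounting alone already favors policies that succeed sooner.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one episode."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Hypothetical episodes: success (reward 1) at step 9 versus step 49.
fast = discounted_return([0] * 9 + [1])
slow = discounted_return([0] * 49 + [1])
print(fast, slow)
```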

Note: Key Insight

HIL-SERL's innovation is simple: use human interventions (corrections) as a tool for making RL exploration more efficient. This is not plain imitation learning. RL uses the human data and can still surpass it.

Contributions

The paper's main contributions are:

  1. Shorter real-world training: close to 100% success within 1 to 2.5 hours
  2. Superiority over imitation learning: 101% average improvement in success rate and 1.8x faster cycle time
  3. Broad task coverage: dynamic manipulation, precision assembly, and dual-arm coordination, all within a single framework
  4. First real-world results of their kind:
    • Dual-arm coordination RL from image inputs
    • Jenga whipping
    • Timing belt assembly

Methodology: How HIL-SERL Works

System Architecture Overview

HIL-SERL consists of three core components: an actor process, a learner process, and replay buffers:

flowchart TB
    subgraph Actor["🤖 Actor Process"]
        ENV[Environment]
        ROBOT[Robot Controller]
        HUMAN[Human Intervention<br>SpaceMouse]
        POLICY[RL Policy π]
        
        ENV --> |observation| POLICY
        POLICY --> |action a_rl| ROBOT
        HUMAN --> |action a_itv| ROBOT
        ROBOT --> |execute| ENV
    end
    
    subgraph Learner["🧠 Learner Process"]
        RLPD[RLPD Update]
        DQN[DQN Update<br>Gripper Control]
        CRITIC[Critic Q]
        ACTOR_NET[Actor π]
        GRASP_CRITIC[Grasp Critic]
    end
    
    subgraph Buffer["💾 Replay Buffers"]
        DEMO[Demo Buffer<br>20-30 demos]
        RL_BUF[RL Buffer<br>On-policy data]
    end
    
    Actor --> |transitions| RL_BUF
    Actor --> |interventions| DEMO
    Actor --> |interventions| RL_BUF
    
    DEMO --> |50% sampling| Learner
    RL_BUF --> |50% sampling| Learner
    
    Learner --> |updated params| Actor
    
    style Actor fill:#e1f5fe
    style Learner fill:#fff3e0
    style Buffer fill:#e8f5e9
Figure 1: HIL-SERL system architecture

The key to this architecture is asynchronous communication. The actor collects data while interacting with the environment, and the learner updates the policy in the background. It is like a chess player reviewing past games while still playing the current one.

Core Algorithm: RLPD-Based Learning

RLPD (Reinforcement Learning with Prior Data)

At the heart of HIL-SERL is the RLPD algorithm. Its core idea is simple: at every training step, sample prior data and online data in a 50:50 ratio.
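A minimal sketch of this 50:50 sampling, assuming both buffers are plain Python lists and that sampling is uniform with replacement. `sample_batch` and the tagged tuples are hypothetical and only make the split visible.

```python
import random

def sample_batch(demo_buffer, rl_buffer, batch_size, rng):
    """Draw half of each batch from the demo buffer and half from the RL buffer."""
    half = batch_size // 2
    demo = [rng.choice(demo_buffer) for _ in range(half)]
    online = [rng.choice(rl_buffer) for _ in range(batch_size - half)]
    batch = demo + online
    rng.shuffle(batch)  # mix the two sources before the gradient step
    return batch

# Hypothetical buffers: a few demo transitions, many more online transitions.
demo_buffer = [("demo", i) for i in range(5)]
rl_buffer = [("rl", i) for i in range(50)]
batch = sample_batch(demo_buffer, rl_buffer, 8, random.Random(0))
print(batch)
```

Even though the RL buffer here is ten times larger, demo transitions still make up half of every batch, which is the point of symmetric sampling.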

The loss functions for the Q-function and the policy are:

Critic (Q-function) update: \mathcal{L}_Q(\phi) = \mathbb{E}_{\mathbf{s}, \mathbf{a}, \mathbf{s}'} \left[ \left( Q_\phi(\mathbf{s}, \mathbf{a}) - \left( r(\mathbf{s}, \mathbf{a}) + \gamma \mathbb{E}_{\mathbf{a}' \sim \pi_\theta} [Q_{\bar{\phi}}(\mathbf{s}', \mathbf{a}')] \right) \right)^2 \right]

Actor (policy) update: \mathcal{L}_\pi(\theta) = -\mathbb{E}_{\mathbf{s}} \left[ \mathbb{E}_{\mathbf{a} \sim \pi_\theta} [Q_\phi(\mathbf{s}, \mathbf{a})] + \alpha \mathcal{H}(\pi_\theta(\cdot|\mathbf{s})) \right]

Here Q_{\bar{\phi}} is the target network and \alpha is the entropy regularization weight.
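To make the critic update concrete, here is a single-transition worked example that follows the critic loss exactly as written above. The numbers are invented, the inner expectation is approximated by averaging target-network values over a few sampled next actions, and the `done` flag is an added assumption that does not appear in the equation.

```python
def td_target(reward, gamma, next_q_samples, done=False):
    """r + gamma * E_{a'}[Q_target(s', a')], with the expectation as a sample mean."""
    if done:  # assumption: no bootstrapping after a terminal step
        return reward
    return reward + gamma * sum(next_q_samples) / len(next_q_samples)

def critic_loss(q_pred, target):
    """Squared TD error for one transition."""
    return (q_pred - target) ** 2

# Hypothetical numbers: no reward yet, three sampled next actions.
target = td_target(reward=0.0, gamma=0.99, next_q_samples=[0.8, 0.9, 1.0])
loss = critic_loss(q_pred=0.5, target=target)
print(target, loss)
```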

Tip: Intuition

The Q-function predicts "how good will things be from here if I take this action in this state?" The policy learns to "choose actions with high Q-values more often." The entropy term encourages exploration so that the policy does not lock onto a single action too early.

Gripper Control: A Separate DQN

The key is to separate continuous actions (end-effector twist) from discrete actions (gripper open/close/stay). The gripper is trained with DQN:

\mathcal{L}(\theta) = \mathbb{E}_{\mathbf{s}, \mathbf{a}, \mathbf{s}'} \left[ \left( r + \gamma Q_{\theta'}(\mathbf{s}', \arg\max_{\mathbf{a}'} Q_\theta(\mathbf{s}', \mathbf{a}')) - Q_\theta(\mathbf{s}, \mathbf{a}) \right)^2 \right]

The discrete action space \mathcal{A}_2 is:

| Single gripper | Dual-arm grippers |
|---|---|
| {open, close, stay} | {open, close, stay}² = 9 combinations |
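The dual-arm action count in the table can be checked by enumerating the Cartesian product directly. This is only an illustration of the counting argument. The names `SINGLE` and `DUAL` are hypothetical.

```python
from itertools import product

SINGLE = ("open", "close", "stay")

# Every (left gripper, right gripper) pair is one discrete action for the grasp critic.
DUAL = list(product(SINGLE, repeat=2))

for index, pair in enumerate(DUAL):
    print(index, pair)
```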

Network Architecture

flowchart LR
    subgraph Input["Input"]
        IMG1[Wrist Camera<br>128×128]
        IMG2[Side Camera<br>128×128]
        PROP[Proprioception<br>Joint position/velocity/force]
    end
    
    subgraph Vision["Vision Backbone"]
        RESNET1[ResNet-10<br>ImageNet Pretrained]
        RESNET2[ResNet-10<br>Shared Weights]
    end
    
    subgraph Fusion["Feature Fusion"]
        CONCAT[Concatenate]
        MLP1[MLP Layers]
    end
    
    subgraph Output["Output"]
        ACTOR[Actor Head<br>μ, σ for 6D twist]
        CRITIC["Critic Head<br>Q(s,a)"]
        GRASP["Grasp Critic<br>Q(s, a_gripper)"]
    end
    
    IMG1 --> RESNET1
    IMG2 --> RESNET2
    RESNET1 --> CONCAT
    RESNET2 --> CONCAT
    PROP --> CONCAT
    CONCAT --> MLP1
    MLP1 --> ACTOR
    MLP1 --> CRITIC
    MLP1 --> GRASP
    
    style Input fill:#e3f2fd
    style Vision fill:#f3e5f5
    style Fusion fill:#e8f5e9
    style Output fill:#fff8e1
Figure 2: HIL-SERL network architecture

Why the Pretrained Vision Backbone Matters

Why use a ResNet-10 pretrained on ImageNet? It is not only about generalization:

  1. Optimization stability: a randomly initialized network produces unstable features early in training
  2. Exploration efficiency: meaningful visual features lead to a better initial policy
  3. Shorter training time: visual representations do not have to be learned from scratch

Note: A Feynman-Style Analogy

A pretrained backbone is like the way fluency in your native language helps when you learn a foreign one. The language is entirely new, but a grounding in how languages are structured lets you learn much faster.

Human-in-the-Loop: The Magic of Human Intervention

Intervention Mechanism

sequenceDiagram
    participant H as Human
    participant A as Actor
    participant E as Environment
    participant B as Buffer
    
    Note over A,E: Autonomous rollout starts (t₀)
    loop t₀ to tₙ
        A->>E: action a_rl
        E->>A: observation, reward
        A->>B: store (s, a_rl, r, s')
    end
    
    Note over H,A: Human detects a problem (tᵢ)
    H->>A: SpaceMouse takeover
    
    rect rgb(255,230,230)
        Note over H,E: Human intervention segment
        loop tᵢ to tᵢ₊ₙ
            H->>E: action a_itv
            E->>A: observation, reward
            A->>B: store to Demo + RL buffer
        end
    end
    
    Note over A,E: Control returns to the policy
    A->>E: continue with a_rl
Figure 3: Human intervention mechanism

Key rules:

  1. Dual storage of intervention data: human interventions are stored in both the demo buffer and the RL buffer
  2. Policy transition data: state-action pairs before and after an intervention are stored only in the RL buffer
  3. Gradually fewer interventions: frequent early in training, decreasing as the policy improves
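The storage rules above can be sketched as a small routing function. This is an illustration under simple assumptions: buffers are plain lists, each transition is a dictionary, and the episode with a takeover at steps 3 and 4 is made up.

```python
def store_transition(transition, intervened, demo_buffer, rl_buffer):
    """Every transition goes to the RL buffer; human-controlled ones also go to the demo buffer."""
    rl_buffer.append(transition)
    if intervened:
        demo_buffer.append(transition)

demo_buffer, rl_buffer = [], []

# Hypothetical episode: the human takes over at steps 3 and 4, then hands control back.
episode = [(0, "policy"), (1, "policy"), (2, "policy"),
           (3, "human"), (4, "human"), (5, "policy")]
for step, who in episode:
    store_transition({"step": step, "actor": who}, who == "human", demo_buffer, rl_buffer)

print(len(rl_buffer), len(demo_buffer))
```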

Differences from HG-DAgger

| Aspect | HG-DAgger | HIL-SERL |
|---|---|---|
| Learning method | Supervised learning (Behavioral Cloning) | Reinforcement learning (RLPD) |
| Use of reward | None | Optimizes the task reward |
| Data weighting | All data weighted equally | Dynamic weighting according to Q-values |
| Performance ceiling | Level of the human demonstrations | Can exceed human performance |

Important: The Key Difference

HG-DAgger copies whatever the human shows it ("do it like this"). HIL-SERL takes the human's signal that "you made a mistake here" and uses that information to find a better way on its own.

Reward Function Design: The Power of a Binary Classifier

Instead of elaborate reward shaping, HIL-SERL uses a simple binary classifier:

r(\mathbf{s}) = \begin{cases} 1 & \text{if classifier predicts success} \\ 0 & \text{otherwise} \end{cases}
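In code, this reward amounts to thresholding the classifier output. The sketch below assumes the classifier returns a success probability and uses a 0.5 decision threshold. Both are illustrative assumptions, not details taken from the paper.

```python
def sparse_reward(success_probability, threshold=0.5):
    """Return 1.0 if the classifier predicts success, otherwise 0.0."""
    return 1.0 if success_probability >= threshold else 0.0

# Hypothetical classifier outputs over one episode: only the last frame looks successful.
probabilities = [0.02, 0.05, 0.10, 0.97]
rewards = [sparse_reward(p) for p in probabilities]
print(rewards)
```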

Classifier Training Procedure

flowchart LR
    subgraph Collection["Data Collection (~5 min)"]
        POS[Positive samples<br>~200]
        NEG[Negative samples<br>~1000]
    end
    
    subgraph Training["Classifier Training"]
        DATA[Image data]
        CNN[CNN Classifier]
        EVAL[Evaluation<br>>95% accuracy]
    end
    
    subgraph Deployment["Deployment"]
        REWARD[Real-time<br>reward decision]
    end
    
    POS --> DATA
    NEG --> DATA
    DATA --> CNN
    CNN --> EVAL
    EVAL --> REWARD
    
    style Collection fill:#e3f2fd
    style Training fill:#fff3e0
    style Deployment fill:#e8f5e9
Figure 4: Reward classifier training pipeline

Why does this approach work?

  1. Ease of design: you only need to define "what counts as success?"
  2. Generality: the same procedure applies to every task
  3. Synergy with human demos: the demos solve the exploration problem that sparse rewards create

Downstream Robotic System Design

Ego-centric Coordinate Frame

For spatial generalization, all observations and actions are expressed relative to the current end-effector frame:

\mathbf{x}_{ego} = \mathbf{T}_{ee}^{-1} \cdot \mathbf{x}_{world}

Why does this matter? Even if the object's position shifts slightly, the relative geometry looks the same from the end-effector's point of view. It is like picking up a cup: wherever the cup sits on the table, we approach it as something "in front of the hand."
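The transform above can be illustrated with a small rigid-body example. For a pose with rotation R and position p, applying \mathbf{T}_{ee}^{-1} to a point gives R^T (x - p). The sketch restricts the rotation to yaw about the z-axis for brevity, and the helper names and scene coordinates are hypothetical.

```python
import math

def make_pose(yaw, position):
    """End-effector pose as (rotation matrix about z, position)."""
    c, s = math.cos(yaw), math.sin(yaw)
    rotation = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    return rotation, list(position)

def world_to_ego(pose, point_world):
    """Apply T_ee^{-1}: subtract the position, then multiply by R transposed."""
    rotation, position = pose
    diff = [point_world[i] - position[i] for i in range(3)]
    return [sum(rotation[r][c] * diff[r] for r in range(3)) for c in range(3)]

# Scene A: end-effector at (1, 2, 0), yawed 90 degrees; object one unit ahead of it.
ego_a = world_to_ego(make_pose(math.pi / 2, (1.0, 2.0, 0.0)), (1.0, 3.0, 0.0))
# Scene B: the whole scene shifted by (5, -1, 0) in the world frame.
ego_b = world_to_ego(make_pose(math.pi / 2, (6.0, 1.0, 0.0)), (6.0, 2.0, 0.0))
print(ego_a, ego_b)
```

Both scenes yield the same ego-centric coordinates, which is why a policy trained on one can transfer to the other.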

Impedance Controller

To ensure safety in contact-rich tasks, an impedance controller is used:

\mathbf{F} = K_p(\mathbf{x}_{des} - \mathbf{x}) + K_d(\dot{\mathbf{x}}_{des} - \dot{\mathbf{x}})

  • K_p: stiffness matrix
  • K_d: damping matrix
  • Reference limiting prevents abrupt motions
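A per-axis sketch of the control law above. It assumes diagonal K_p and K_d, and it implements reference limiting by clipping the position error to `max_error`, which is one simple way to bound the commanded force. The gains and limits are made-up values, not the paper's settings.

```python
def clip(value, limit):
    """Clamp value to the interval [-limit, limit]."""
    return max(-limit, min(limit, value))

def impedance_force(x, x_dot, x_des, x_dot_des, kp, kd, max_error):
    """F = Kp * clip(x_des - x) + Kd * (x_dot_des - x_dot), axis by axis."""
    force = []
    for i in range(len(x)):
        error = clip(x_des[i] - x[i], max_error)  # reference limiting (assumed form)
        force.append(kp[i] * error + kd[i] * (x_dot_des[i] - x_dot[i]))
    return force

# Hypothetical gains and state: a 0.5 m error on the first axis is limited to 0.02 m.
force = impedance_force(
    x=[0.0, 0.0, 0.0], x_dot=[0.1, 0.0, 0.0],
    x_des=[0.5, 0.01, 0.0], x_dot_des=[0.0, 0.0, 0.0],
    kp=[1000.0] * 3, kd=[50.0] * 3, max_error=0.02,
)
print(force)
```

Without the clip, the first axis would command 1000 × 0.5 = 500 N from stiffness alone. With it, the stiffness term is capped at 20 N.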

For dynamic tasks (Jenga whipping, object flipping), feedforward wrenches are commanded directly in the end-effector frame.


Experiments: A Range of Challenging Tasks

Task Overview

HIL-SERL was evaluated on seven main tasks, grouped here into four categories:

Experimental task categories

| Category | Tasks |
|---|---|
| Precision assembly | RAM insertion, SSD assembly, USB insertion, cable clipping |
| Large-part assembly | IKEA shelf, car dashboard |
| Dual-arm coordination | Object handover, timing belt |
| Dynamic manipulation | Jenga whipping, object flipping |

Task Details

1. Motherboard Assembly

| Subtask | Difficulty | Key challenge |
|---|---|---|
| RAM insertion | ⭐⭐⭐ | Fine alignment + appropriate force control |
| SSD assembly | ⭐⭐⭐ | Avoiding pin damage + two-stage insertion |
| USB insertion | ⭐⭐⭐⭐ | Grasping a freely placed cable + handling uncertainty |
| Cable clipping | ⭐⭐ | Deformable cable + tight insertion |

For RAM insertion, too much force tilts the RAM card inside the gripper, and too little force causes the insertion to fail. The policy has to learn this delicate balance.

2. Timing Belt Assembly

This task is part of the NIST board assembly challenge and is one of the most demanding:

  • Deformable object: the belt deforms unpredictably
  • Dual-arm coordination: the belt must be fitted onto the pulleys with precise timing
  • Tensioner manipulation: the tensioner must be adjusted while the belt is being fitted

3. Jenga Whipping

This task is fundamentally different from the others:

  • High-speed dynamic manipulation: the whip moves extremely fast
  • Complex contact dynamics: air resistance, friction between blocks, and so on
  • Open-loop behavior: the motion is too fast for real-time feedback

Tip: Intuition

Jenga whipping is comparable to a tennis serve. Once the swing starts, it cannot be adjusted midway. Through many rounds of trial and error, the policy has to internalize the intuitive physics of "swing at this angle with this force and that block comes out."

Main Experimental Results

Success Rate and Cycle Time Comparison

The following table compares HIL-SERL with HG-DAgger (imitation learning, listed as BC):

| Task | Training time | BC success rate | HIL-SERL success rate | Improvement | BC cycle time | HIL-SERL cycle time | Speedup |
|---|---|---|---|---|---|---|---|
| RAM insertion | 1.5h | 29% | 100% | +245% | 8.3s | 4.8s | 1.7× |
| SSD assembly | 1h | 79% | 100% | +27% | 6.7s | 3.3s | 2.0× |
| USB insertion | 2.5h | 26% | 100% | +285% | 13.4s | 6.7s | 2.0× |
| Cable clipping | 1.25h | 95% | 100% | +5% | 7.2s | 4.2s | 1.7× |
| IKEA side panel 1 | 2h | 77% | 100% | +30% | 6.5s | 2.7s | 2.4× |
| IKEA side panel 2 | 1.75h | 79% | 100% | +27% | 5.0s | 2.4s | 2.1× |
| IKEA top panel | 1h | 35% | 100% | +186% | 8.9s | 2.4s | 3.7× |
| Dashboard assembly | 2h | 41% | 100% | +144% | 20.3s | 8.8s | 2.3× |
| Object handover | 2.5h | 79% | 100% | +27% | 16.1s | 13.6s | 1.2× |
| Timing belt | 6h | 2% | 100% | +4900% | 9.1s | 7.2s | 1.3× |
| Jenga whipping | 1.25h | 8% | 100% | +1150% | - | - | - |
| Object flipping | 1h | 46% | 100% | +117% | 3.9s | 3.8s | 1.0× |
| Average | - | 49.7% | 100% | +101% | 9.6s | 5.4s | 1.8× |

Important: Key Findings
  1. 100% success rate on every task
  2. 101% average improvement in success rate and 1.8x faster cycle time
  3. The gap widens as tasks get more complex (timing belt: +4900%)
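The improvement and speedup columns follow from simple ratios, which the snippet below reproduces for a few rows. The input numbers are taken from the table above. The formulas are the standard relative-improvement and ratio definitions, which appear to match the reported values.

```python
def relative_improvement(baseline, ours):
    """Percentage gain of `ours` over `baseline`."""
    return (ours - baseline) / baseline * 100.0

def speedup(baseline_seconds, ours_seconds):
    """How many times faster the new cycle time is."""
    return baseline_seconds / ours_seconds

ram = relative_improvement(29.0, 100.0)      # RAM insertion row
belt = relative_improvement(2.0, 100.0)      # timing belt row
average = relative_improvement(49.7, 100.0)  # average row
avg_speedup = speedup(9.6, 5.4)              # average cycle times

print(round(ram), round(belt), round(average), round(avg_speedup, 1))
```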

Comparison with Other Methods

| Method | RAM insertion | Dashboard | Object flipping | Average |
|---|---|---|---|---|
| Diffusion Policy | 27% | 18% | 56% | 34% |
| HG-DAgger | 29% | 41% | 46% | 39% |
| BC (200 demos) | 12% | 35% | 46% | 31% |
| IBRL | 75% | 0% | 95% | 57% |
| Residual RL | 0% | 0% | 97% | 32% |
| DAPG | 8% | 18% | 72% | 33% |
| HIL-SERL (no demo, no itv) | 0% | 0% | 0% | 0% |
| HIL-SERL (no itv) | 48% | 0% | 100% | 49% |
| HIL-SERL (full) | 100% | 100% | 100% | 100% |

Learning Curve Analysis

RAM insertion task: success rate over time (HIL-SERL vs HG-DAgger)

| Training time (min) | HIL-SERL | HG-DAgger |
|---|---|---|
| 0 | 10% | 15% |
| 20 | 45% | 30% |
| 40 | 75% | 35% |
| 60 | 95% | 28% |
| 80 | 100% | 32% |

HIL-SERL's learning curves show a clear pattern:

  1. Success rate: rises quickly and converges to 100%
  2. Intervention rate: decreases gradually until it reaches 0%
  3. Cycle time: keeps decreasing as training progresses

HG-DAgger, in contrast:

  1. Success rate: fluctuates and plateaus at a fixed level
  2. Intervention rate: does not decrease over time
  3. Cycle time: does not improve

Robustness Results

The learned policies are robust to a variety of external disturbances:

| Task | Disturbance | Policy response |
|---|---|---|
| RAM insertion | Motherboard moved | Tracks in real time and completes the insertion |
| Handover | Gripper forced open | Regrasps and retries the task |
| Timing belt | External disturbance to the belt | Adaptive readjustment |
| USB insertion | Poor grasp pose | Regrasps, then inserts |
| Dashboard | Both grippers forced open | Regrasps sequentially, then retries |

None of these robust behaviors was explicitly programmed. They emerged naturally from RL's autonomous exploration.


Result Analysis: Why Does HIL-SERL Work?

Reliability of the Learned Policies

The Funnel Formation Mechanism

To understand why HIL-SERL reaches a 100% success rate, let us look at the state visitation distribution for the RAM insertion task.

flowchart TB
    subgraph Early["Early Training"]
        E1[Widely scattered<br>state visitation]
        E2[Uncertain<br>trajectories]
    end
    
    subgraph Mid["Mid Training"]
        M1[Funnel shape<br>emerges]
        M2[Begins converging<br>to the success region]
    end
    
    subgraph Late["Late Training"]
        L1[Clear funnel]
        L2[Region of concentrated<br>high Q-values]
    end
    
    Early --> Mid --> Late
    
    style Early fill:#ffcdd2
    style Mid fill:#fff9c4
    style Late fill:#c8e6c9
Figure 5: ์ •์ฑ… ํ›ˆ๋ จ ์ค‘ ํผ๋„ ํ˜•์„ฑ

Key observations:

  1. Funnel shape: a “funnel” forms that connects the initial states to the goal
  2. Q-value concentration: states inside the funnel have high Q-values
  3. Q-value variance: the Q-value variance is large at important states (meaning the choice of action matters)

The Q-value variance is computed as follows:

\text{Var}[Q(\mathbf{s}, \mathbf{a})] = \mathbb{E}_{\epsilon \sim \mathcal{U}[-c, c]} \left[ (Q(\mathbf{s}, \mathbf{a} + \epsilon) - \mathbb{E}_{\epsilon}[Q(\mathbf{s}, \mathbf{a} + \epsilon)])^2 \right]

Here \epsilon is a perturbation drawn uniformly from [-c, c] and added to the action. A large variance means the state is a “critical state”: taking the wrong action there makes the Q-value drop sharply.
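As a rough illustration of the variance above, the following sketch estimates it by Monte Carlo sampling. The function `q_variance`, the toy critics `flat_q` and `peaked_q`, and the values of `c` and `n_samples` are hypothetical choices made for this example; they are not taken from the paper's code.

```python
import numpy as np

def q_variance(q_fn, state, action, c=0.1, n_samples=256, seed=0):
    """Monte Carlo estimate of Var[Q(s, a + eps)] with eps ~ Uniform[-c, c]."""
    rng = np.random.default_rng(seed)
    action = np.asarray(action, dtype=float)
    # One independent uniform perturbation per sample and per action dimension.
    eps = rng.uniform(-c, c, size=(n_samples, action.shape[0]))
    q_values = np.array([q_fn(state, action + e) for e in eps])
    # Mean squared deviation from the sample mean (population variance).
    return float(np.mean((q_values - q_values.mean()) ** 2))

# Toy critics (assumptions): one ignores the action, one is sharply peaked at a = 0.
def flat_q(state, action):
    return 1.0

def peaked_q(state, action):
    return float(np.exp(-50.0 * np.sum(np.square(action))))

state = np.zeros(3)
action = np.zeros(2)
var_flat = q_variance(flat_q, state, action)      # action choice does not matter
var_peaked = q_variance(peaked_q, state, action)  # small errors change Q a lot
```

In this toy setting the peaked critic plays the role of a critical state: the same perturbation budget `c` produces a much larger spread of Q-values than for the flat critic.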

RL vs DAgger: The Difference in Exploration

The state visitation distribution of HG-DAgger is far sparser and more uniform. Why?

  • RL: explores autonomously and uses dynamic programming to optimize toward reward
  • DAgger: explores only around the current policy and focuses on imitating human demonstrations

As a result, DAgger needs far more demonstrations and corrections to reach comparable performance.

Reactive Policy vs Predictive Policy

Depending on the nature of the task, HIL-SERL learns two different types of policy:

Reactive Policy

This type appears in precise manipulation tasks such as RAM insertion and dashboard assembly.

Characteristics:

  • High initial variance: uncertainty during the approach phase
  • Gradually decreasing variance: actions become more precise as the goal gets closer
  • Closed-loop behavior: relies on continuous sensory feedback
sequenceDiagram
    participant S as Sensor
    participant P as Policy
    participant A as Actuator
    participant E as Environment

    loop Every timestep
        E->>S: Observation
        S->>P: Visual + tactile feedback
        P->>P: Error estimation
        P->>A: Corrective action
        A->>E: Execute
    end

    Note over P: High σ → low σ<br>Approach → insertion
Figure 6: Behavior pattern of a reactive policy

Predictive Policy

This type appears in dynamic manipulation tasks such as Jenga whipping and object flipping.

Characteristics:

  • Consistently low variance: confident actions from start to finish
  • Open-loop behavior: the motion is too fast for feedback during execution
  • Reflex-like behavior: learned intuitive physics
sequenceDiagram
    participant S as Sensor
    participant P as Policy
    participant A as Actuator
    participant E as Environment

    S->>P: Initial observation
    P->>P: Outcome prediction
    P->>A: Planned motion sequence

    rect rgb(255,245,238)
        Note over A,E: High-speed execution (no feedback)
        A->>E: Execute
        E->>E: Outcome
    end

    Note over P: σ ≈ 0 (throughout)
Figure 7: Behavior pattern of a predictive policy
Note: Key Insight

HIL-SERL does not explicitly analyze the physical characteristics of a task. It learns an appropriate control strategy automatically through interaction with the environment. This is what lets a single algorithm solve such a variety of tasks.

Implicit Learning of Contact Dynamics

An interesting behavior observed in the dashboard assembly task:

  1. Detects that it is stuck while in contact
  2. Quickly lifts both arms to break contact
  3. Re-approaches and reaches the goal
  4. Completes the insertion

This “stuck → lift → re-approach → insert” pattern was not explicitly programmed. Earlier contact-rich manipulation research formulated such behaviors as mixed-integer programs (MIP), but that approach:

  • becomes computationally intractable as the planning horizon grows
  • requires an accurate state estimator

HIL-SERL treats these complex contact dynamics as part of the solution rather than part of the problem.


Extending to Dexterous Hands: Beyond the Gripper

Why a Dexterous Hand?

HIL-SERL used a parallel gripper as its end-effector. But what if we used a multi-jointed dexterous hand such as the Allegro Hand V4? This is not a mere hardware swap; it opens up an entirely new dimension of manipulation capability.

flowchart LR
    subgraph Gripper["Parallel gripper"]
        G1[1 DoF<br>open/close]
        G2[Discrete actions<br>3 options]
        G3[Limited<br>grasp types]
    end

    subgraph Hand["Dexterous Hand<br>(e.g., Allegro Hand V4)"]
        H1[16 DoF<br>4×4 joints]
        H2[Continuous actions<br>high-dimensional]
        H3[Diverse<br>grasps/manipulation]
    end

    Gripper --> |"Extension"| Hand

    style Gripper fill:#ffcdd2
    style Hand fill:#c8e6c9
Figure 8: Gripper vs dexterous hand comparison

| Property | Parallel gripper | Allegro Hand V4 |
|---|---|---|
| Degrees of freedom (DoF) | 1 | 16 (4 fingers × 4 joints) |
| Action space | Discrete (open/close/stay) | Continuous (16-D joint position/torque) |
| Grasp types | Power grasp only | Precision, power, pinch, etc. |
| In-hand manipulation | Not possible | Possible |
| Tactile sensing | Limited | Rich tactile feedback possible |

Applying HIL-SERL to a Dexterous Hand

Challenge 1: An Exploding Action Space

The Allegro Hand's 16 DoF imply an action space exponentially larger than the gripper's 1 DoF.

Strategies:

  1. Hierarchical Action Space
flowchart TB
    subgraph High["High-level policy"]
        ARM[Arm Policy<br>6D twist]
        GRASP_TYPE[Grasp Type<br>Selector]
    end

    subgraph Low["Low-level policy"]
        FINGER[Finger Policy<br>16D joint]
        SYNERGY[Synergy-based<br>Control]
    end

    ARM --> |"Position target"| FINGER
    GRASP_TYPE --> |"Grasp type"| SYNERGY
    SYNERGY --> |"Joint commands"| FINGER

    style High fill:#e3f2fd
    style Low fill:#fff3e0
Figure 9: Hierarchical action space design
  2. Synergy-based Dimensionality Reduction

Human hand motion is in fact largely explained by a few principal synergies:

\mathbf{q}_{hand} = \mathbf{S} \cdot \mathbf{z} + \mathbf{q}_0

  • \mathbf{q}_{hand} \in \mathbb{R}^{16}: full joint positions
  • \mathbf{S} \in \mathbb{R}^{16 \times k}: synergy matrix (typically k = 2 \sim 6)
  • \mathbf{z} \in \mathbb{R}^k: low-dimensional synergy coordinates
  • \mathbf{q}_0: default posture

This reduces the 16-D problem to a 2- to 6-D problem!

  3. Modifying HIL-SERL: unify everything into continuous actions

Original HIL-SERL: \mathcal{A} = \mathcal{A}_{arm} \times \mathcal{A}_{gripper}^{discrete}

Dexterous hand version: \mathcal{A} = \mathcal{A}_{arm} \times \mathcal{A}_{hand}^{continuous}

Or, when using synergies: \mathcal{A} = \mathcal{A}_{arm} \times \mathcal{A}_{synergy}^{continuous}
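The synergy mapping q_hand = S z + q_0 can be sketched with plain PCA, which is one common way to obtain synergies. Everything here is an assumption made for illustration: the helper names `fit_synergies`, `to_synergy` and `to_joints`, the synthetic joint data, and the choice k = 4. Real synergies would be fit to recorded hand postures.

```python
import numpy as np

def fit_synergies(joint_samples, k):
    """Fit a k-dimensional linear synergy model q ~ S z + q0 using PCA."""
    q0 = joint_samples.mean(axis=0)
    # SVD of the centred data; the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(joint_samples - q0, full_matrices=False)
    S = vt[:k].T  # shape (n_joints, k) with orthonormal columns
    return S, q0

def to_synergy(q, S, q0):
    """Project a full joint vector onto the low-dimensional coordinates z."""
    return S.T @ (q - q0)

def to_joints(z, S, q0):
    """Expand synergy coordinates back to a full joint command."""
    return S @ z + q0

# Synthetic 16-joint postures that truly live on a 4-D subspace (assumption).
rng = np.random.default_rng(0)
true_S = rng.normal(size=(16, 4))
q_rest = rng.normal(size=16)
joint_samples = rng.normal(size=(500, 4)) @ true_S.T + q_rest

S, q0 = fit_synergies(joint_samples, k=4)
q = joint_samples[0]
z = to_synergy(q, S, q0)     # 4-D action the policy would output
q_rec = to_joints(z, S, q0)  # 16-D joint command sent to the hand
```

With real data the reconstruction is only approximate, and the number of retained components k trades off dexterity against the size of the action space the policy has to explore.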

Challenge 2: The Complexity of Human Intervention

Directly controlling 16 DoF with a SpaceMouse is not feasible.

Strategies:

  1. A more capable teleoperation interface

| Interface | Suitability | Notes |
|---|---|---|
| SpaceMouse | ❌ Unsuitable | Supports only 6 DoF |
| Teleoperation Glove (e.g., MANUS) | ✅ Suitable | Maps finger motion directly |
| VR Controller + Hand Tracking | ✅ Suitable | Can use devices such as Quest 3 |
| Vision-based Retargeting | ✅ Suitable | Tracks the hand with a camera |

  2. Redesigning how interventions are given
flowchart TB
    subgraph Traditional["Original HIL-SERL"]
        T1[SpaceMouse] --> T2[6D + Gripper]
    end
    
    subgraph Dexterous["Dexterous hand version"]
        D1[Teleoperation Glove]
        D2[Vision Retargeting]
        D3[VR Hand Tracking]
        
        D1 --> D4[Full Hand Pose]
        D2 --> D4
        D3 --> D4
        
        D4 --> D5[Synergy Projection]
        D5 --> D6[Low-dim Intervention]
    end
    
    style Traditional fill:#ffcdd2
    style Dexterous fill:#c8e6c9
Figure 10: Intervention scheme for a dexterous hand

Challenge 3: Integrating Tactile Sensing

The real power of a dexterous hand comes from tactile feedback. If tactile sensors are mounted on the Allegro Hand:

Modified observation space: \mathcal{S} = \{\mathbf{I}_{wrist}, \mathbf{I}_{side}, \mathbf{q}_{arm}, \mathbf{q}_{hand}, \mathbf{\tau}_{tactile}\}

where \mathbf{\tau}_{tactile} is the tactile information from the fingertips.

Modified network architecture:

flowchart LR
    subgraph Input["Input"]
        IMG[RGB Images]
        PROP[Proprioception<br>Arm + Hand]
        TACT[Tactile<br>Fingertip sensors]
    end
    
    subgraph Encoders["Encoders"]
        VIS_ENC[Vision Encoder<br>ResNet-10]
        PROP_ENC[Prop Encoder<br>MLP]
        TACT_ENC[Tactile Encoder<br>1D CNN / MLP]
    end
    
    subgraph Fusion["Fusion"]
        CONCAT[Concatenate]
        ATTN[Cross-Modal<br>Attention]
    end
    
    IMG --> VIS_ENC
    PROP --> PROP_ENC
    TACT --> TACT_ENC
    
    VIS_ENC --> CONCAT
    PROP_ENC --> CONCAT
    TACT_ENC --> CONCAT
    
    CONCAT --> ATTN
    ATTN --> OUTPUT[Policy + Critic]
    
    style Input fill:#e3f2fd
    style Encoders fill:#f3e5f5
    style Fusion fill:#e8f5e9
Figure 11: Network with tactile integration

Concrete Application Scenarios: Allegro Hand V4

Scenario 1: Precision insertion (an extension of RAM insertion)

Advantages of a dexterous hand:

  • Adaptive grasping: even if the RAM card is tilted, the fingers can readjust it
  • Force distribution: several fingers apply an even force
  • In-hand manipulation: the grasp pose is adjusted without letting go
# Pseudocode: HIL-SERL for a dexterous hand
def dexterous_hil_serl():
    # Define the action space
    arm_action_dim = 6  # Cartesian twist
    hand_action_dim = 6  # Synergy-based (reduced space)
    
    # Observation space
    obs = {
        'images': [wrist_cam, side_cam],  # vision
        'arm_proprio': arm_joint_states,   # arm proprioception
        'hand_proprio': hand_joint_states, # hand proprioception (16D)
        'tactile': fingertip_forces        # tactile (optional)
    }
    
    # Unify into a single continuous action space (removes the DQN)
    policy = ContinuousPolicy(
        obs_dim=compute_obs_dim(obs),
        action_dim=arm_action_dim + hand_action_dim
    )
    
    # Human intervention: use a teleoperation glove
    if human_intervenes():
        glove_data = get_glove_data()
        hand_action = project_to_synergy(glove_data)
        arm_action = get_arm_action_from_glove()

Scenario 2: In-hand re-manipulation (not possible with the original HIL-SERL)

flowchart LR
    A[Grasp object] --> B[Detect need<br>to rotate]
    B --> C[Reposition<br>fingers]
    C --> D[Rotate object]
    D --> E[Perform insertion]

    style A fill:#e3f2fd
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#fff3e0
    style E fill:#c8e6c9
Figure 12: In-hand re-manipulation scenario

This task is fundamentally impossible with a gripper, but can be performed naturally with the Allegro Hand.

Expected Challenges and Solutions

| Challenge | Cause | Solution |
|---|---|---|
| Long training time | High-dimensional action space | Synergy-based dimensionality reduction |
| Difficult human intervention | Complex teleoperation | Glove/VR interface |
| Unstable learning | Collisions between fingers | Restrict exploration to a safe region |
| Hardware damage | Excessive contact force | Impedance control + force limits |

Recommended Research Roadmap

flowchart TB
    subgraph Phase1["Phase 1: Foundations"]
        P1A[Synergy analysis<br>dimensionality reduction]
        P1B[Build teleoperation<br>interface]
        P1C[Implement safety<br>controller]
    end

    subgraph Phase2["Phase 2: Simple tasks"]
        P2A[Simple grasping<br>task]
        P2B[Basic insertion<br>task]
    end

    subgraph Phase3["Phase 3: Advanced tasks"]
        P3A[In-hand<br>re-manipulation]
        P3B[Complex<br>assembly]
        P3C[Tool use]
    end

    subgraph Phase4["Phase 4: Integration"]
        P4A[Tactile integration]
        P4B[VLA model<br>integration]
    end

    Phase1 --> Phase2 --> Phase3 --> Phase4

    style Phase1 fill:#e3f2fd
    style Phase2 fill:#fff3e0
    style Phase3 fill:#f3e5f5
    style Phase4 fill:#c8e6c9
Figure 13: Research roadmap for dexterous-hand HIL-SERL

Experimental Proposal: Allegro Hand V4 + HIL-SERL

Initial Experimental Setup

# Recommended initial configuration
hardware:
  arm: Franka Emika Panda (or a similar 7-DoF arm)
  hand: Allegro Hand V4
  cameras:
    - wrist_mounted: Intel RealSense D435
    - side_view: Intel RealSense D435
  teleoperation: MANUS Prime 3 Glove

action_space:
  arm: 6D Cartesian twist
  hand: 4D synergy (first 4 principal components)
  
observation_space:
  images: 2 × 128×128 RGB
  arm_proprio: 7 joint positions + velocities
  hand_proprio: 16 joint positions + velocities
  
training:
  demo_collection: 30-50 demonstrations
  intervention_device: MANUS Glove
  expected_training_time: 2-4 hours (longer than with a gripper)

Expected Benefits

  1. New task categories: in-hand manipulation, tool use
  2. More robust grasping: handles a variety of object shapes
  3. Adaptive manipulation: adjusts the grasp pose in real time
  4. Human-level flexibility: makes complex assembly tasks possible
Important: Key Message

The HIL-SERL framework extends naturally to dexterous hands. The keys are (1) an efficient representation of the action space, (2) a suitable human intervention interface, and (3) effective integration of tactile information. Platforms such as the Allegro Hand V4 are a natural next step for this line of research.


Critical Discussion: Strengths and Limitations

Strengths

1. Practical training time

Most tasks are trained within 1 to 2.5 hours. This contrasts with earlier RL work, which required either “simulation-to-real” transfer or days to weeks of real-world training.

2. Generality

Range of tasks that a single framework can solve:

  • Precision assembly (micrometer scale)
  • Dynamic manipulation (millisecond scale)
  • Bimanual coordination (complex synchronization)
  • Deformable objects (unpredictable shape changes)

3. Superhuman performance

The core advantages of RL are demonstrated:

  • Cycle time: on average 1.8× faster than imitation-learning baselines trained on human demonstrations
  • Consistency: 100% success rate (even humans make mistakes)
  • Adaptability: responds to external disturbances automatically

4. System-level integration

The work shows the power of integration over any individual technique:

  • Pretrained vision + efficient RL + human intervention + a suitable controller

Limitations and Directions for Improvement

1. Long-horizon tasks

Even the longest task so far (the timing belt) takes 6 hours of training. For tasks with longer horizons, sample complexity may grow sharply.

Potential remedies:

  • Automatic task segmentation (using VLMs)
  • Hierarchical RL
  • Value function pretraining

2. Limited generalization

The experiments did not include extensive randomization or tests in unstructured environments.

Potential remedies:

  • Longer training + environment randomization
  • Vision foundation models pretrained on large datasets

3. Scaling

Each task has to be trained from scratch.

Potential remedies:

  • Pretraining a general-purpose value function
  • Multi-task RL
  • Integration with robot foundation models

4. Dependence on the quality of human interventions

The quality of human interventions affects learning. Inconsistent interventions can actually do harm.

Potential remedies:

  • Mechanisms for assessing intervention quality
  • Automated filtering of interventions
  • Adaptive weighting when incorporating interventions

Suggested Research Directions

  1. Foundation model integration: train general-purpose robot models on the high-quality data generated with HIL-SERL
  2. Value function transfer: learn manipulation “primitives” that can be shared across tasks
  3. Autonomous skill discovery: automatic task segmentation and reward generation using VLMs
  4. Industrial application: validation in HMLV (High-Mix Low-Volume) manufacturing environments

Comparison with Related Work

Real-World RL Algorithms

| Method | Sample efficiency | Use of human data | Performance |
|---|---|---|---|
| QT-Opt | Medium | None | Medium |
| SERL | High | Demonstrations only | High |
| HIL-SERL | Very high | Demonstrations + interventions | Very high |
| Model-based RL | High | Optional | Medium to high |

Key difference between HIL-SERL and SERL:

  • SERL: uses offline demonstrations only
  • HIL-SERL: adds online human interventions

This “small” difference produces a large performance gap on complex tasks.

Imitation Learning Methods

| Method | Principle | Limitation / note |
|---|---|---|
| Behavioral Cloning | Direct imitation | Compounding errors |
| DAgger | Interactive imitation | Bounded by human-level performance |
| Diffusion Policy | Learns multimodal distributions | Lacks reactivity |
| HIL-SERL | RL + human guidance | Can exceed human performance |

Prior Manipulation Approaches

Jenga task comparison:

| Work | Approach | Limitation / note |
|---|---|---|
| Fazeli et al. | Quasi-dynamic pushing | Slow, less challenging |
| HIL-SERL | Dynamic whipping | Fast, direct pixel input |

Object flipping comparison:

| Work | Approach | Limitation / note |
|---|---|---|
| Kormushev et al. | Motion capture + DMP | Requires special equipment |
| HIL-SERL | Direct pixel input | Needs only off-the-shelf cameras |

Summary and Conclusion

Key Message

HIL-SERL demonstrated that real-world robotic reinforcement learning is practical:

“With the right system-level design choices, RL can effectively solve a wide range of complex vision-based manipulation tasks in the real world.”

Summary of Key Contributions

| Aspect | Details |
|---|---|
| Time efficiency | 1 to 2.5 hours of training; practical to deploy |
| Performance | 100% success rate, superhuman, 1.8× faster |
| Generality | Dynamic manipulation, precision assembly, bimanual coordination, deformable objects |
| System integration | Pretrained vision, RLPD, human intervention, impedance control |

Practical Advice for Roboticists

  1. Start simple: begin with a simple binary reward classifier. Elaborate reward shaping is mostly unnecessary.
  2. Use pretraining: an ImageNet-pretrained backbone is a “free lunch.” Be sure to use one.
  3. Human intervention is guidance: do not treat demonstrations as “the right answer”; use them as “hints” that help RL find a better solution.
  4. Coordinate frame design matters: an egocentric representation is key to spatial generalization.
  5. A safety controller is essential: impedance control and reference limiting keep exploration safe.

Outlook

HIL-SERL has opened a new chapter in robotic manipulation research:

  • Short term: application to HMLV manufacturing on the factory floor
  • Medium term: a tool for generating high-quality data for robot foundation models
  • Long term: a stepping stone toward general-purpose robotic manipulation

Professor Feynman might have closed like this:

“Even a problem that looks complicated can become simple when viewed from the right angle. HIL-SERL shows how collaboration between humans and machines simplifies complex manipulation problems. We are still at the starting point, but what an exciting starting point it is!”


Appendix: Implementation Details

A. Pseudocode

# HIL-SERL main training loop (pseudocode: env, the buffers, the networks and
# the helper functions are placeholders)
def hil_serl_training(max_episodes, max_steps, n_demos=20, batch_size=256):
    # Initialization
    demo_buffer = load_demonstrations(n=n_demos)  # 20-30 demos, tuned per task
    rl_buffer = ReplayBuffer()
    policy = Policy(pretrained_resnet=True)
    q_function = QFunction()
    grasp_critic = GraspCritic()  # DQN for the gripper

    for episode in range(max_episodes):
        state = env.reset()

        for step in range(max_steps):
            # Check for human intervention
            intervened = human_wants_to_intervene()
            if intervened:
                action = get_human_action()  # SpaceMouse
            else:
                # Sample an action from the policy
                continuous_action = policy.sample(state)
                gripper_action = grasp_critic.argmax(state)
                action = concat(continuous_action, gripper_action)

            next_state, reward, done = env.step(action)

            # Store the transition only after the step, once reward and
            # next_state are known. Interventions go to both buffers.
            transition = (state, action, reward, next_state)
            store_to_buffer(rl_buffer, transition)
            if intervened:
                store_to_buffer(demo_buffer, transition)
            state = next_state

            # Asynchronous learning (learner process)
            if learner_ready():
                # 50:50 sampling
                demo_batch = demo_buffer.sample(batch_size // 2)
                rl_batch = rl_buffer.sample(batch_size // 2)
                batch = concat(demo_batch, rl_batch)

                # RLPD update
                update_q_function(q_function, batch)
                update_policy(policy, batch, q_function)

                # DQN update (gripper)
                update_grasp_critic(grasp_critic, batch)

            if done:
                break

B. Hyperparameters

| Parameter | Value | Description |
|---|---|---|
| Learning rate (actor) | 3e-4 | Adam optimizer |
| Learning rate (critic) | 3e-4 | Adam optimizer |
| Batch size | 256 | 128 each from the demo and RL buffers |
| Discount factor (γ) | 0.99 | Weight on future rewards |
| Target network update (τ) | 0.005 | Polyak averaging |
| Image size | 128×128 | Same for all cameras |
| Control frequency | 10 Hz | Policy execution rate |
| Demo buffer size | 20-30 episodes | Tuned per task |

C. Hardware Configuration

  • Robot: Franka Emika Panda (single/dual arm)
  • Cameras: Intel RealSense (wrist + side)
  • Input device: 3Dconnexion SpaceMouse
  • Compute: NVIDIA RTX 4090 GPU

Key references:

  1. RLPD: Ball et al. (2023). “Efficient Online Reinforcement Learning with Offline Data”
  2. SERL: Luo et al. (2024). “SERL: A Software Suite for Sample-Efficient Robotic Reinforcement Learning”
  3. SAC: Haarnoja et al. (2018). “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”
  4. HG-DAgger: Kelly et al. (2018). “HG-DAgger: Interactive Imitation Learning with Human Experts”
  5. Diffusion Policy: Chi et al. (2024). “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion”

⛏️ Dig Review

⛏️ Dig: Go deep, uncover the layers. Dive into technical detail.

Introduction: Problem Definition and Background

Robotic manipulation is one of the core challenges of robotics, and achieving human-level precise, dynamic manipulation has long been an open problem. For dynamic and dexterous tasks in particular, such as assembling parts or throwing and catching objects with fast motions, having a robot learn on its own and surpass human performance is the ideal goal. Reinforcement learning (RL) holds great promise here because it can acquire such complex skills autonomously through trial and error. Used well, an RL-trained policy is optimized for the specific physical task and is expected to outperform hand-designed controllers, and even human teleoperation.

Realizing this promise in the real world, however, has not been easy. Sample complexity, reward design, and training stability have all been obstacles. There have been successes in simulation, where RL has taught robots difficult motor skills, but applying vision-based RL on a real robot to learn physically complex tasks quickly was long regarded as “inefficient and dangerous.” Many methods also presupposed a carefully hand-designed reward function; for complex assembly tasks, deciding how much reward to give for which partial success is so hard that it is close to impossible in practice.

Against this background, Professor Levine's group at UC Berkeley set an ambitious goal: “Could a real robot, in real time and from visual input alone, learn difficult tasks almost perfectly in just one to two hours?” Their answer is HIL-SERL (Human-in-the-Loop Sample-Efficient Robotic Reinforcement Learning). In short, the system is “sample-efficient reinforcement learning with a human in the loop”: by carefully combining several components, it obtains high-performing vision-based manipulation policies on real robots after only a short period of training.

The core idea is off-policy reinforcement learning that exploits human demonstrations and mid-training corrections. First, a human expert provides a modest amount of demonstration data to bootstrap learning; then, whenever the robot makes a mistake during training, a person intervenes and steers it back to a good state. By folding these demonstrations and corrections into a sample-efficient RL algorithm, the authors obtained policies that, within 1 to 2.5 hours, succeed on nearly every trial and complete the task faster than a human. The approach was applied consistently to a range of hard tasks, including dynamic object handling (e.g., flipping an object in a frying pan), precision assembly (e.g., inserting components), and dual-arm coordination, and showed on average a 2× improvement in success rate and 1.8× faster execution compared with imitation learning and earlier RL methods. These results are a meaningful demonstration that RL is feasible in the real world and that, with appropriate system design, it can reach “superhuman” performance that surpasses even human experts.

This review analyzes the core content of the HIL-SERL paper in depth from a robotics researcher's perspective. The introduction has covered the problem statement and the big picture of the approach; the method section that follows explains the system structure and algorithm of HIL-SERL in detail, unpacking the paper's equations intuitively and using analogies to aid understanding. The experiments section reviews the tasks the authors ran, their setups and results, and interprets the characteristics and performance of the resulting policies. The critical discussion then evaluates the strengths and weaknesses of HIL-SERL and suggests future directions for improvement and application. Finally, the summary and conclusion collect the key insights and point out what this work implies for readers who are roboticists.

Method: Design of the HIL-SERL System and Detailed Analysis of the Algorithm

HIL-SERL addresses the difficulties of real-robot reinforcement learning through a careful integration of several components. We first give an overview of the overall system architecture and training flow, and then look at each component in turn (visual input processing, reward design, the robot control stack, gripper control, and the human intervention algorithm).

System Overview and the Reinforcement Learning Algorithm

HIL-SERL is a vision-based off-policy RL system built on an actor-critic algorithm. The actor process runs the policy on the real robot to collect data, and the learner process uses this data to update the policy (actor) and value-function (critic) networks. The two run asynchronously and in parallel, so data collection and parameter updates proceed efficiently side by side. Collected data is stored in a replay buffer, from which the learner samples at random to train the networks. HIL-SERL keeps two buffers: one for demonstration and correction data (the demo buffer) and one for the robot's own attempts (the RL buffer). During training, each batch is built by sampling from the two buffers in equal proportion, so that prior (offline) data and online experience are used in balance. This 50:50 offline-online sampling strategy is central to the RLPD algorithm the authors use: it exploits the knowledge in prior data quickly without neglecting new exploration.
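The 50:50 sampling described above can be sketched in a few lines. The minimal `ReplayBuffer` class, the `sample_mixed_batch` helper and the buffer sizes are assumptions made for this illustration; they do not reproduce the SERL code base.

```python
import random

class ReplayBuffer:
    """Minimal list-backed buffer (illustrative only)."""
    def __init__(self):
        self.storage = []

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, n, rng):
        # Sample with replacement so a small demo buffer can still fill half a batch.
        return [rng.choice(self.storage) for _ in range(n)]

def sample_mixed_batch(demo_buffer, rl_buffer, batch_size, rng):
    """Build a batch with half the samples from each buffer."""
    half = batch_size // 2
    return demo_buffer.sample(half, rng) + rl_buffer.sample(batch_size - half, rng)

rng = random.Random(0)
demo_buffer, rl_buffer = ReplayBuffer(), ReplayBuffer()
for i in range(20):      # few human transitions
    demo_buffer.add(("demo", i))
for i in range(1000):    # many on-robot transitions
    rl_buffer.add(("rl", i))

batch = sample_mixed_batch(demo_buffer, rl_buffer, batch_size=256, rng=rng)
n_demo = sum(1 for source, _ in batch if source == "demo")
```

Even though the demo buffer here holds only 2% of all stored transitions, it still supplies half of every batch, which is how scarce human data keeps a strong influence on each update.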

The visual front end of the policy network uses a pretrained vision backbone. For example, a CNN feature extractor trained on a large dataset such as ImageNet provides the initial weights, which improves stability early in training. This mitigates the optimization instability caused by high-dimensional image inputs and helps the policy converge in a short time.

To state the RL problem briefly: in an MDP (\mathcal{S}, \mathcal{A}, P, R, \gamma), the policy \pi_\theta(a|s) is trained to maximize the expected cumulative reward J(\pi). Here \mathcal{S} is the state space (e.g., camera images plus robot joint/end-effector state), \mathcal{A} is the action space (e.g., end-effector velocity/force commands, gripper open/close), P(s'|s,a) is the environment dynamics, and R(s) is the reward function. The optimal policy \pi^* is defined as the one satisfying:

\pi^{*} = \arg\max_{\pi}J(\pi),\quad\quad\text{where }J(\pi) = \mathbb{E}_{s_{0} \sim \rho_{0},a_{t} \sim \pi}\left\lbrack \sum_{t = 0}^{\infty}\gamma^{t}R\left( s_{t} \right) \right\rbrack.

That is, the goal is to maximize the expected future reward under a discount factor \gamma \in [0,1]. In HIL-SERL, R(s) is a sparse reward that only indicates whether the task has succeeded, so a successful episode that ends at step T has return \gamma^T, and maximizing J(\pi) essentially amounts to maximizing the episode success probability. The value of \gamma was set per task between 0.96 and 0.985 (e.g., mostly 0.97, sometimes 0.98). Choosing \gamma < 1 makes “the sooner the success, the better the return,” which pushes the policy to shorten task execution time. Indeed, thanks to this design the learned policies tended to find faster paths than the human demonstrations (discussed further in the experimental results).
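A small worked example makes the effect of the discount concrete. With a single +1 reward at the success step and zero reward elsewhere, the discounted return is \gamma^T. The helper `sparse_return` and the step counts 30 and 60 are illustrative assumptions.

```python
def sparse_return(success_step, gamma):
    """Discounted return of an episode whose only reward is +1 at success_step."""
    return gamma ** success_step

gamma = 0.97
fast = sparse_return(30, gamma)  # success after 30 control steps
slow = sparse_return(60, gamma)  # same task, twice as slow
```

Because `slow` equals `fast` squared, halving the time to success raises the return substantially, so the critic prefers faster trajectories even though the reward itself never mentions time.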

The learning algorithm is off-policy actor-critic, which the authors identify as RLPD (Ball et al., 2023). RLPD can be viewed as an improved member of the Soft Actor-Critic (SAC) family. It maintains two critic (value function) networks Q_{\theta}(s,a) (double Q) and trains them with a temporal-difference (TD) loss that uses a target network Q_{\theta'}. Written as an equation, the Q-function update loss L_Q for the continuous action space is:

L_{Q}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{B}}\left[ \left( Q_{\theta}(s,a) - \left( r + \gamma\, Q_{\theta'}\left( s', a' \right) \right) \right)^{2} \right], \quad a' = \pi_{\phi}(s'),

where \mathcal{B} is a batch sampled from the replay buffer and a' = \pi_\phi(s') is the action chosen by the current policy (actor) \pi_\phi in the next state. \theta' denotes the parameters of the target network, which track \theta with a delay through Polyak averaging to stabilize training. This loss is TD(0) learning that fits the Bellman equation Q(s,a) = r + \gamma Q(s', a') in the mean-squared-error sense; in the actual implementation, double Q-learning and the target network guard against overestimation and divergence.
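The TD target and the Polyak update can be sketched numerically as follows. Taking the minimum over two target critics and masking terminal transitions with `done` are the usual SAC-style conventions and are assumptions here; they are not spelled out in the loss above. The function names and example numbers are likewise hypothetical.

```python
import numpy as np

def td_target(reward, done, next_q1, next_q2, gamma):
    """r + gamma * min(Q1', Q2'), with no bootstrap on terminal transitions."""
    next_q = np.minimum(next_q1, next_q2)  # pessimistic estimate from two critics
    return reward + gamma * (1.0 - done) * next_q

def polyak_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

targets = td_target(
    reward=np.array([0.0, 1.0]),
    done=np.array([0.0, 1.0]),    # second transition ends the episode
    next_q1=np.array([2.0, 5.0]),
    next_q2=np.array([3.0, 4.0]),
    gamma=0.5,
)
new_target = polyak_update([np.array([0.0])], [np.array([1.0])], tau=0.005)
```

The critic is then regressed toward `targets` with a squared error, and a small `tau` keeps the target network changing slowly between updates.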

Next, the actor (policy) network \pi_\phi(a|s) is optimized, following the principle of maximum-entropy RL, with a loss that includes an entropy bonus. Put simply, the policy is trained to prefer actions with high Q-values while also keeping the entropy of its action distribution high. This entropy regularization encourages exploration and helps in finding the optimum. The actor loss L_\pi can be written as:

L_{\pi}(\phi) = \mathbb{E}_{s \sim \mathcal{B}}\left[ - Q_{\theta}\left( s, a = \pi_{\phi}(s) \right) - \alpha\,\mathcal{H}\left( \pi_{\phi}\left( \cdot |s \right) \right) \right],

where \mathcal{H} is the entropy of the policy and \alpha is the entropy weight. Because the loss is minimized, both terms carry a negative sign: a higher Q-value and a higher entropy each lower the loss. The paper notes that HIL-SERL uses automatic entropy tuning, which adjusts \alpha during training. The actor update is performed by policy gradient, so the policy ends up pursuing high Q-values and exploration at the same time.
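In sample form, the usual SAC actor loss is E[\alpha \log \pi(a|s) - Q(s,a)]; since \mathcal{H} = -E[\log \pi], lowering it favours both high Q-values and high entropy. The sketch below only evaluates that expression on given samples. The function `actor_loss` and the example numbers are hypothetical, and a real implementation would back-propagate through reparameterized action samples.

```python
import numpy as np

def actor_loss(q_values, log_probs, alpha):
    """Sample estimate of E[alpha * log pi(a|s) - Q(s, a)]."""
    return float(np.mean(alpha * log_probs - q_values))

q = np.array([1.0, 1.0])
# Same Q-values, different spread of the action distribution:
narrow = actor_loss(q, log_probs=np.array([2.0, 2.0]), alpha=0.1)   # low entropy
broad = actor_loss(q, log_probs=np.array([-1.0, -1.0]), alpha=0.1)  # high entropy
```

With equal Q-values, the broader policy gets the lower loss, which is the exploration pressure that the entropy weight \alpha controls.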

์š”์•ฝํ•˜๋ฉด, HIL-SERL์˜ RL ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ SAC ๊ธฐ๋ฐ˜ off-policy RL๋กœ ๋ณผ ์ˆ˜ ์žˆ๊ณ , ์ด์ „ ์‹œ๋ฒ”/๊ต์ • ๋ฐ์ดํ„ฐ์™€ ํ˜„์žฌ ๋ฐ์ดํ„ฐ๋ฅผ ๊ท ํ˜• ์„ž์–ด ์“ฐ๋Š” RLPD ์ „๋žต์ด ์ ์šฉ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ตฌ์„ฑ ๋•๋ถ„์—, ์ ์€ ์‹คํ—˜ ๋ฐ์ดํ„ฐ๋กœ๋„ ์•ˆ์ •์ ์ด๊ณ  ๋น ๋ฅด๊ฒŒ ํ•™์Šต์ด ๊ฐ€๋Šฅํ–ˆ์Šต๋‹ˆ๋‹ค.

๋ณด์ƒ ํ•จ์ˆ˜ ์„ค๊ณ„: ์ด์ง„ ์„ฑ๊ณต ํŒ์ •

๋ณด์ƒ ํ•จ์ˆ˜ R(s)๋Š” ๊ฐ•ํ™”ํ•™์Šต์˜ ๋ฐฉํ–ฅ์„ ๊ฒฐ์ •ํ•˜๋Š” ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค. ์•ž์„œ ์–ธ๊ธ‰ํ–ˆ๋“ฏ, HIL-SERL์€ ํฌ์†Œํ•œ(binary) ๋ณด์ƒ ์ฒด๊ณ„๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๊ฐ ์ž‘์—…์— ๋Œ€ํ•ด โ€œ์„ฑ๊ณต ์‹œ +1, ๊ทธ ์™ธ 0โ€์˜ ๋ณด์ƒ์„ ์ฃผ๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. ๋‹จ์ˆœํ•˜์ง€๋งŒ, ๋ณต์žกํ•œ ์ž‘์—…์—์„œ ์ž„์˜๋กœ dense ๋ณด์ƒ์„ ์„ค๊ณ„ํ•˜๋Š” ๋Œ€์‹  ์„ฑ๊ณต/์‹คํŒจ๋งŒ ๋ช…ํ™•ํžˆ ์ •์˜ํ•˜์—ฌ ๋ฌธ์ œ๋ฅผ ๋‹จ์ˆœํ™”ํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ฌผ๋ก , ๋กœ๋ด‡์€ ์–ด๋–ป๊ฒŒ ์„ฑ๊ณต ์—ฌ๋ถ€๋ฅผ ์•Œ๊นŒ์š”? ์ €์ž๋“ค์€ ์ด๋ฅผ ์œ„ํ•ด ์ž‘์—…๋ณ„ ์ด์ง„ ๋ถ„๋ฅ˜๊ธฐ๋ฅผ ํ•™์Šต์‹œ์ผฐ์Šต๋‹ˆ๋‹ค.

Concretely, for each task a human teleoperated the robot to collect roughly 200 examples of success states and roughly 1000 examples of failure states. For the RAM insertion task, for instance, success states are images in which the RAM is seated correctly in the slot, and failure states are images in which it is not inserted or sits in the wrong position. Image frames drawn from about 10 such demonstration episodes (a mix of successes and failures) are used to train a binary classifier C_\psi(s). This reward classifier takes images from the arm's wrist camera or a side camera and outputs whether the state counts as successful completion. Trained with a cross-entropy loss, the classifier reportedly reached an accuracy above 95%.
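The two quantities mentioned above, the cross-entropy loss and the classification accuracy, can be written down compactly. The sketch below works on already-computed success probabilities rather than images, so the classifier network itself is left out; the 0.5 decision threshold and the function names are assumptions for illustration.

```python
import math


def binary_cross_entropy(p_success, label, eps=1e-7):
    """Cross-entropy for one example; label is 1 for success, 0 for failure."""
    p = min(max(p_success, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def accuracy(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum((p > threshold) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)
```

With about 200 success and 1000 failure examples the classes are imbalanced, so in practice it is worth checking accuracy on each class separately and not only in aggregate.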

At every time step the trained reward classifier looks at the robot's state; only at the moment the state is judged a success does it give a +1 reward and end the episode. Until then the reward is 0, and if the robot does not succeed within a fixed time limit, the episode is treated as a failure and terminated with reward 0. In other words, a HIL-SERL episode ends either in success with +1 or in timeout/failure with 0. The authors stress that, even without a complex hand-shaped reward, a sparse reward is sufficient as long as demonstration and correction data are available. They describe the takeaway that, for complex tasks, it is better to define only success and failure and leave the rest to RL and human help than to rush into designing a dense reward.
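The episode structure described above (reward 0 until the classifier fires, then +1 and termination, or a timeout with 0) can be sketched as follows. The classifier is represented only by its per-step output probabilities; the 0.5 threshold and all names are hypothetical.

```python
def classifier_reward(success_prob, threshold=0.5):
    """Sparse reward: +1 only when the classifier judges the state a success."""
    return 1.0 if success_prob > threshold else 0.0


def run_episode(success_probs, max_steps):
    """Walk through per-step classifier outputs; stop at first success or at timeout."""
    for t, p in enumerate(success_probs[:max_steps]):
        if classifier_reward(p) == 1.0:
            return {"return": 1.0, "steps": t + 1, "success": True}
    # No success within the time limit: the episode ends with total reward 0.
    return {"return": 0.0, "steps": min(len(success_probs), max_steps), "success": False}
```

For example, `run_episode([0.1, 0.2, 0.9, 0.95], 10)` ends at the third step with return 1.0, while an episode whose probabilities never cross the threshold ends at `max_steps` with return 0.0.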

As a result, this reward scheme sets a clear objective: maximize the probability of success per episode. In effect, reinforcement learning becomes a process of converging toward a 100% success rate. In addition, the time discount \gamma pushes the policy to succeed as quickly as possible, which drove it to find paths more efficient than the human demonstrations.
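The link between the discount factor and speed can be made explicit. Under the reward scheme above, and assuming time steps are indexed from 0 with success detected at step T, the discounted return of a successful episode is:

```latex
G \;=\; \sum_{t=0}^{T} \gamma^{t} r_t \;=\; \gamma^{T},
\qquad
r_t = \begin{cases} 1 & t = T \ \text{(success detected)} \\ 0 & t < T \end{cases}
\\[4pt]
T_1 < T_2 \;\Longrightarrow\; \gamma^{T_1} > \gamma^{T_2} \qquad (0 < \gamma < 1)
```

As an illustrative example with \gamma = 0.95 (a value chosen here only for the arithmetic, not taken from the paper), succeeding at step 20 yields a return of about 0.36, while succeeding at step 50 yields about 0.08, so the faster trajectory is worth several times more.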

In summary, HIL-SERL's reward is determined by a single thing, whether the task was completed, and a classifier trained on success/failure examples collected beforehand provides it. Because this only praises the robot when it gets things exactly right, learning can be hard at first, but the human interventions and demonstrations introduced shortly create conditions in which learning is possible. By analogy, it is like a strict teacher who rewards a child only for correct answers, paired with a teaching assistant who takes the child's hand and guides them when needed.

Robot System Design: Coordinate Frames and Controllers

For HIL-SERL to succeed, the design of the physical robot system matters as much as the software. The authors made several key decisions in designing the observation space and the controllers.

First, for the observation (state) representation, the robot's own joint/end-effector state (the proprioceptive state) is expressed in a relative coordinate frame. Concretely, the end-effector's initial pose at the start of each episode is taken as the origin, and every subsequent change in position and orientation is expressed relative to that starting point. This way, even if the target's position varies slightly, the robot always sees the world from its own reference, which helps spatial generalization. It resembles how a person moving a hand with eyes closed orients relative to where the hand started. In addition, the robot's initial pose was slightly randomized at the start of each training episode, which likewise helps the policy succeed from varied starting conditions. In short, the robot drops the fixed idea that its hand is supposed to be at one particular place and learns only motion relative to the target, wherever it starts. Thanks to this, the policy reportedly coped well even when the target object was moved mid-episode as a disturbance (an example appears in the adaptation/robustness experiments discussed later).
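Expressing the pose relative to the episode's initial pose amounts to a change of frame. The sketch below is a minimal version using 3x3 rotation matrices as nested lists and positions as 3-element lists; the function names are hypothetical and the paper's actual state encoding may differ.

```python
def transpose(R):
    """Transpose of a 3x3 matrix (equal to the inverse for a rotation matrix)."""
    return [[R[j][i] for j in range(3)] for i in range(3)]


def matvec(R, v):
    """3x3 matrix times 3-vector."""
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]


def matmul(A, B):
    """Product of two 3x3 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def relative_pose(R0, p0, Rt, pt):
    """Express the current pose (Rt, pt) in the frame of the initial pose (R0, p0)."""
    R0_T = transpose(R0)
    dp = [pt[i] - p0[i] for i in range(3)]
    # Relative rotation R0^T Rt and relative position R0^T (pt - p0).
    return matmul(R0_T, Rt), matvec(R0_T, dp)
```

Shifting both the initial and the current pose by the same offset leaves the output unchanged, which is the property that lets the policy ignore where in the workspace the episode happened to start.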

Next, to control the arm safely and effectively, an impedance controller was used. Impedance control is well suited to tasks that require contact with the environment: it treats the robot like a spring-damper system, which regulates force and keeps interaction safe. In HIL-SERL, impedance control is applied to tasks where force is exerted, such as plugging and insertion, with additions such as reference limiting that cap excessive force or speed in real time. This is a safeguard that protects the hardware when the arm applies erratic forces or collides during learning. These stabilization techniques were used in the same group's earlier SERL system and were carried over here. Put simply, however wildly the robot behaves during learning, a safety layer is always active, so training could proceed without major accidents.
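The spring-damper idea with reference limiting can be illustrated in one dimension. This is a simplified sketch under stated assumptions: a scalar position, hand-picked gains, and a `max_ref_offset` parameter that is hypothetical; a real controller acts on the full 6-D end-effector pose.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def impedance_force(x, v, x_ref, stiffness, damping, max_ref_offset):
    """1-D spring-damper law with the reference kept close to the current position."""
    # Reference limiting: the target may not run further than max_ref_offset
    # from x, which bounds the spring term at stiffness * max_ref_offset.
    x_ref_limited = clip(x_ref, x - max_ref_offset, x + max_ref_offset)
    return stiffness * (x_ref_limited - x) - damping * v
```

Even if the policy commands a reference one meter away, with `stiffness=1000` and `max_ref_offset=0.01` the spring force cannot exceed 10 units, which is how a bad action during exploration is kept from turning into a hard collision.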

Conversely, for highly dynamic tasks (for example, tossing an object into the air or a fast whip-like motion), open-loop control was used. Concretely, forces and torques (a wrench) were commanded directly in the end-effector frame to accelerate the robot. This feedforward force control was needed to deliver a large acceleration in a very short time, as when flipping an object with a frying pan or whipping out a Jenga block. Instantaneous power that closed-loop (feedback) control would struggle to deliver because of sensor latency is simply supplied open loop. This approach lacks fine-grained feedback, but for tasks decided by an action lasting only a few hundred milliseconds it is enough, since applying the right force at that moment means success. When whipping out a Jenga block, for example, a single swing succeeds without fine adjustment if the initial angle and force are right. The HIL-SERL policy discovered a suitable combination of angle and force through learning and executed it with open-loop control.

In short, HIL-SERL's control design comes in two forms, matched to the nature of the task:

  • Static/precise tasks: impedance-based feedback control, for a safe and delicate approach.
  • Dynamic/fast tasks: feedforward force control, to impart the required momentum immediately.

All of this is designed so that thousands of attempts over 1 to 2 hours on real hardware do not strain the machine, while still leaving enough freedom for learning. A high-performance simulator such as IsaacSim could virtualize these physical risks, but the paper chose to learn directly on a real robot, so these safeguards were essential. From a robotics practitioner's point of view, implementing HIL-SERL in IsaacSim would allow the same algorithm to be applied without the risk of real physical collisions, but simulator physics fidelity and the domain gap would then become issues. By doing everything in the real world from the start, HIL-SERL sidesteps the sim-to-real problem and instead manages risk through human supervision and the controller. It is, in effect, putting the robot on the job to learn while a human safety supervisor watches.

Gripper (Hand) Control: Separating the Discrete Action

Another distinctive feature of HIL-SERL is that gripper control (the open/close hand action) is separated from the continuous control of the arm itself. Arm motion (continuous control over 6 to 7 axes) and the grasping action (a discrete binary choice: open or close) are different in nature. If both are learned in a single continuous space, as in prior work, the network has trouble representing the discrete character of the hand action and learning can be inefficient. Grasping an object, for example, is a 0-or-1 decision, and encoding it as one continuous action value (say, between -1 and +1) makes the timing of opening and closing awkward to express.

HIL-SERL therefore frames the problem as solving two MDPs in parallel. One has a continuous action space, M_c, and handles the arm's 3D translation, rotation, force regulation, and so on; the other has a discrete action space, M_d, and handles the gripper's open/close/hold. The two MDPs share the state space S (the same observations: camera images, robot state, gripper state, etc.) and differ only in whether the action space is continuous or discrete. Put simply, there is one brain for moving the arm and one for moving the hand.

To learn the discrete gripper action, a separate DQN-style critic network is used. The gripper actions are, for example, the three options {open, close, hold} (for a dual-arm robot, combining each arm's open/close can yield more), and a gripper Q-critic Q_d(s, a_d) is trained to evaluate their Q-values. Critic training is the classic Bellman-equation update, the same as in the DQN algorithm. As a formula (Eq. (3) in the HIL-SERL paper):

Q_{d}\left( s_{t}, a_{t} \right) \leftarrow r_{t} + \gamma \max_{a' \in \mathcal{A}_{d}} \hat{Q}_{d}\left( s_{t+1}, a' \right),

Here \hat{Q}_d is the target network, a delayed copy of the current Q_d. Viewed as a loss function, this update adjusts the parameters of Q_d to reduce the squared-error loss L_{Q_d} = (Q_d(s,a) - (r + \gamma \max_{a'} \hat{Q}_d(s',a')))^2. The target network is updated by Polyak averaging to stabilize training (a common technique with DQN-style critics).
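For a three-action gripper the update above is small enough to write out directly. In this sketch Q-values are plain dictionaries keyed by action name; the action names, `gamma`, and the function names are assumptions for illustration, and a real critic would be a neural network over images and proprioception.

```python
def gripper_td_target(reward, done, next_q_target, gamma=0.99):
    """r + gamma * max over a' of Q_hat_d(s', a'); no bootstrap on terminal transitions."""
    bootstrap = 0.0 if done else max(next_q_target.values())
    return reward + gamma * bootstrap


def gripper_critic_loss(q_online, action, target):
    """Squared error between Q_d(s, a) for the taken action and the TD target."""
    return (q_online[action] - target) ** 2
```

Because the action set is tiny, the max over `next_q_target` is exact, which is one reason a DQN-style critic is a natural fit for the gripper while the arm needs an actor network.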

At execution time the two are combined as follows: in the current state, the continuous policy \pi_\phi produces the arm-motion action a_c, and at the same time the highest-valued action a_d is picked from the discrete critic Q_d. The pair (a_c, a_d) is then applied to the robot as a single combined action. In an insertion task, for instance, if at some moment \pi_\phi outputs the arm motion "move forward" and Q_d rates "close the gripper" highest, the robot moves forward while closing its fingers. In this way the policy network concentrates on arm motion and covers the continuous space, while the gripper action is taken by a greedy policy (\arg\max Q_d).
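The combination step can be sketched as a single function. The arm policy and gripper critic are passed in as callables so that the example stays self-contained; their names and the action labels are hypothetical.

```python
GRIPPER_ACTIONS = ("open", "close", "hold")


def select_action(arm_policy, gripper_q, state):
    """Return (a_c, a_d): continuous arm action plus greedy discrete gripper action."""
    arm_action = arm_policy(state)        # a_c from the continuous policy
    q_values = gripper_q(state)           # dict: gripper action -> Q_d(s, a_d)
    gripper_action = max(GRIPPER_ACTIONS, key=lambda a: q_values[a])
    return arm_action, gripper_action
```

A usage example with stub functions: if `arm_policy` returns a small forward motion and `gripper_q` returns `{"open": 0.1, "close": 0.7, "hold": 0.3}`, the robot receives that forward motion together with `"close"`.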

This approach may look complicated at first glance, but in practice it stabilized learning considerably. The moment at which the gripper should close is clearly marked in human demonstrations, so the Q-learner picks it up fairly easily, and the continuous policy adapts to bring about the situations in which the hand should close. By analogy, it is like learning the pedals (discrete: brake/accelerator) and the steering (continuous) separately when learning to drive. Learning both in one network is confusing, but splitting them leads to faster mastery. The authors themselves report that approximating the discrete grip action with a continuous distribution proved difficult, and that separating it worked well in combination with demonstrations and corrections.

Human-in-the-Loop RL Procedure: Interventions and Data Collection

We now turn to the most important part of HIL-SERL, the human-in-the-loop component. Overcoming the sample-efficiency problem requires effective exploration, but on a real robot blind exploration is slow and can be dangerous. HIL-SERL addresses this through human feedback. In a word, someone stands by to point out and fix the robot's mistakes as they happen.

Note: This approach resembles imitation-learning-with-intervention methods such as DAgger (Dataset Aggregation), with the decisive difference that the collected data is used for RL updates. That is, HIL-SERL takes inspiration from HG-DAgger (Kelly et al., 2018) but takes a new direction by immediately using the collected data for reinforcement-learning updates of the policy, which yields better performance.

The intervention procedure works as follows. While the robot runs an episode during training, a human monitors it. The human can teleoperate the robot through a VR device, a SpaceMouse (a 3D joystick), or similar. At every time step t of the episode, the human looks at the current robot state s_t and decides whether to intervene, \mathbb{1}_{\text{intervene}}. If the robot makes a serious mistake or seems headed for a bad, unrecoverable state, the human raises the intervention flag (\mathbb{1}=1). Control of the robot then passes to the human, who may operate it for up to H consecutive steps to bring it back to a good state or to carry out the task on its behalf. Marking these human-controlled segments in red, the process looks like the flowchart below:

flowchart LR
    subgraph episode["One episode"]
        PolicyAction[Robot policy executes] --> State1[State]
        State1 -->|going well| NextAction[Decide next action]
        State1 -->|risky situation arises| HumanIntervene[Human intervenes]
        HumanIntervene --> CorrectAct[Human teleoperation]
        CorrectAct --> StateFix[New state]
        StateFix -->|correction complete| NextAction
        StateFix -->|still risky| HumanIntervene
    end

Figure: The human intervention procedure in HIL-SERL. When the robot gets into a risky situation, a human operator teleoperates it for a short segment through an interface such as a SpaceMouse to set it right. The robot policy then resumes execution. Several interventions (red segments) can occur within one episode, but their frequency drops as the policy improves.
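The same control flow can be written as a rollout loop. This is a sketch under simplifying assumptions: the human is modeled as a function that returns an action when intervening and `None` otherwise, the environment is a plain step function, and the cap of H consecutive intervention steps is omitted. All names are hypothetical.

```python
def rollout(policy, human, env_step, state, max_steps):
    """Run one episode, letting the human override the policy at any step."""
    transitions = []
    for _ in range(max_steps):
        human_action = human(state)               # None means "no intervention"
        intervened = human_action is not None
        action = human_action if intervened else policy(state)
        next_state, reward, done = env_step(state, action)
        # The intervened flag is kept so the transition can later be routed
        # to the right buffer(s).
        transitions.append((state, action, reward, next_state, done, intervened))
        state = next_state
        if done:
            break
    return transitions
```

Each stored tuple carries an `intervened` flag, which is what separates human-controlled segments from policy-controlled ones after the episode ends.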

Several interventions can take place within a single episode, but as the policy learns, the pattern shifts to occasional intervention only early on, with the robot performing well by itself afterwards. The research team recalls that for roughly the first 30% they had to hover like a toddler's guardian and step in, but that as the policy learned, the human could gradually take their hands off and simply watch.

How, then, is the resulting human intervention data used? The key point is that it is put to work in learning right away. Concretely, the state/action transitions (s, a_{human}, s') from human-controlled segments are stored in both the demo buffer and the RL buffer. This has the advantage that they serve as offline demonstration data and are also treated as online experience. Transitions (s, a_{robot}, s') from segments the robot executed without intervention, on the other hand, are stored only in the RL buffer. The two buffers thus play different roles: one holds only human demonstrations and corrections, the other holds all online experience, including the robot's own failures and successes. Because RLPD training, described earlier, samples half of each batch from each buffer, the human correction data is reused again and again and, mixed with the robot's own exploration data, improves the policy.
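The storage rule and the half-and-half sampling can be sketched with two plain lists. The buffer representation, the uniform sampling with replacement, and the function names are assumptions for illustration.

```python
import random


def store(transition, intervened, rl_buffer, demo_buffer):
    """Every transition goes to the RL buffer; human ones also go to the demo buffer."""
    rl_buffer.append(transition)
    if intervened:
        demo_buffer.append(transition)


def sample_batch(rl_buffer, demo_buffer, batch_size, rng=random):
    """RLPD-style batch: half drawn from the demo buffer, the rest from the RL buffer."""
    half = batch_size // 2
    demo_part = [rng.choice(demo_buffer) for _ in range(half)]
    rl_part = [rng.choice(rl_buffer) for _ in range(batch_size - half)]
    return demo_part + rl_part
```

Since the demo buffer is usually much smaller than the RL buffer, each human transition is drawn far more often than a typical robot transition, which is the reuse effect described above.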

Why does this storage strategy matter? Human intervention data is, for the most part, a sequence of correct actions heading toward success. When the robot drifts off in the wrong direction and the human takes over, they steer straight toward the goal, so that segment can be regarded as a corrected, near-optimal path. Putting this data in the demo buffer has the effect of gaining extra demonstrations. It also goes into the RL buffer because, from the robot's perspective, having a human change its course mid-action is itself an experience of failure followed by correction. The authors stress that this storage scheme was effective at making policy learning efficient.

To summarize the role of interventions: the human fills in regions of the state space that the robot has trouble exploring by itself. The larger the state/action space and the longer the task horizon, the faster the theoretically required number of samples grows (exponentially), which makes RL hard. With timely human intervention, however, the robot acquires the important experience it needs quickly and wastes fewer attempts on pointless failures. It is like having someone steady the bike when you first learn to ride: instead of falling countless times alone, a few steadying hands and course corrections are enough to learn balance quickly. Judging by HIL-SERL's results, learning on the more complex tasks was nearly impossible or very slow without such online corrections, but once human intervention was allowed, success rates climbed to 100% in a short time.

One caveat is that human intervention should not be overdone. The paper points out that if the human does more than necessary, RL may misestimate the Q-function and training can become unstable. For example, if the policy is poor but the human steps in for long stretches every time and forces a success, the algorithm risks wrongly concluding that taking almost any action leads to success. To prevent this, interventions should reportedly be short and local, limited to the necessary minimum. In practice the researchers internalized this as well, increasingly intervening only at specific moments and otherwise letting the robot fail. This delicate balance ultimately depends on the operator's experience, but it is important to leave the policy enough room to learn from its own failures too.

Summary of the Full Training Process

Bringing together the components described so far, the HIL-SERL training pipeline can be laid out step by step:

  1. Camera and sensor setup: Prepare visual observations suited to the task. A wrist camera is useful because it gives a close-up, egocentric view of the object; several side cameras are added if needed. All camera images are cropped to the region of interest and resized to 128×128 before being fed to the network. Cleaning up the input this way keeps the policy focused on the information it actually needs.
  2. Reward classifier training: Run about 10 episodes by human teleoperation while collecting success/failure frames (about 5 minutes). Use the roughly 1200 collected images (about 200 success and 1000 failure) to train the binary classifier C_\psi(s). Training uses the Adam optimizer and finishes in 100 iterations. Once a classifier with accuracy above 95% is ready, it is installed as the environment's reward module.
  3. Human demonstration collection: Collect 20 to 30 successful demonstration trajectories, teleoperated by the same or a different person, and store them in the demo buffer (B_demo). Environment resets are handled differently per task: some tasks use an automatic reset script, while for others a person resets the setup by hand. (For example, the USB insertion task was reportedly reset by a person pulling the inserted USB out by hand.)
  4. Start of reinforcement learning: Initialize the policy network \pi_\phi and the Q-network Q_\theta and begin training. Early on the policy is close to random, so the human intervenes frequently. In each episode the robot attempts the task with the current policy and the human intervenes when needed to correct it. Every transition, whether from a robot action or a human action, is recorded in the RL buffer (B_rl), and human-action transitions are additionally recorded in the demo buffer.
  5. Online RL updates: At every time step, or periodically, the learner process runs and updates the RL parameters. Specifically, a mini-batch mixed half-and-half from B_demo and B_rl is used to compute the Q-function loss (Eq. (1)) for the critic update and the policy loss (Eq. (2)) for the actor update. The gripper critic Q_d is also updated according to Eq. (3). Target networks are refreshed by Polyak averaging. These updates run frequently enough that the policy improves in real time, and the latest policy parameters are sent asynchronously to the actor (robot) process.
  6. Continuing and ending training: Training ends when the intervention frequency has dropped and the episode success rate reaches nearly 100%. In the experiments, most tasks converged in about 1 hour and even the hard ones within 2 to 2.5 hours. As training progresses, human intervention falls all the way to 0% and the policy achieves a 100% success rate with fast execution.

Figure 1: Overview of the HIL-SERL system. (1) A human teleoperator first collects success/failure examples and trains the reward classifier. (2) A small number of human demonstrations are collected to initialize the demo buffer. (3) Training starts on the real robot, and the RL algorithm updates the policy to maximize the sparse reward from the reward classifier. During this process a human intervenes to correct failures, and the correction data is also added to the buffers and used for learning. Over time the success rate rises and human intervention decreases.

In short, HIL-SERL's methodology implements a virtuous loop of "secure good data → robust RL updates → human intervention when needed". From a robotics perspective it can be seen as a well-judged combination of reinforcement learning and human expert knowledge. The human serves as early guide and safety net, and reinforcement learning ultimately achieves optimization beyond the human's own limits. Next, we look at what results this method actually produced across a variety of experiments.

Experiments: Validating Performance on Diverse Manipulation Tasks

To validate HIL-SERL, the authors applied the system to a set of distinct tasks; nine are listed below as (A) through (I). The tasks differ widely in difficulty and character, which shows whether the proposed method works in general. We go through each task, its setup, and the results in turn.

Experimental Setup and Task Overview

The robots used in the experiments are 7-DoF arms, with one or two arms used depending on the task. The paper does not appear to state the exact robot model, but judging from the photos and context it seems to be a collaborative arm such as the Franka Emika Panda (white and black arms appear in the figures). The gripper appears to be a two-finger gripper, and where needed a dual-arm configuration was used with each robot fitted with its own gripper.

For each task, wrist-camera images served as the primary visual observation, with side-camera images provided as a supplement. The state also included the end-effector position, orientation, and velocity (twist), force/torque sensor readings, and the gripper state (open/closed). As noted earlier, the action consists of a velocity/force command in end-effector space (continuous) and gripper opening/closing (discrete).

The nine tasks, labeled (A) through (I), are summarized below:

  • (A) SSD installation: Precisely fitting an SSD into a computer motherboard. The SSD connector must be inserted into a small slot, so precise positioning is required. The scenario is to hold the SSD with one hand, match the angle, and press it into place.
  • (B) RAM insertion: Inserting a RAM module into the motherboard. The tilt must be adjusted to match the slot and the module pressed all the way in. Like the SSD task, this is a high-precision insertion.
  • (C) USB insertion + cable clipping: A sequential task of picking up a USB connector, plugging it into a port, and then hooking the cable into a clip. It is a two-stage multi-step task, and the second stage in particular, fitting the cable into the clip, involves handling a flexible cable.
  • (D) IKEA shelf assembly: Joining the two side panels of an IKEA bookshelf and placing the top panel on to complete the assembly. It is a multi-stage assembly in which the joints must be fitted together without bolts, so both precision and force regulation are needed. It includes scenes where both arms cooperate to lift and align parts. (In the experiments it is measured in three stages: side panel 1, side panel 2, and top panel.)
  • (E) Car dashboard assembly: Fitting an interior dashboard panel onto the car body frame. Both arms lift a large panel together and must align several locations at once, making this a very difficult precision task. It is a puzzle-like situation in which multiple pins and slots have to line up simultaneously.
  • (F) Object handover: Passing an object between two robot arms. One arm picks up the object and hands it to the other, which then sets it back down in place. Trajectory coordination and timing between the arms are what matter here.
  • (G) Timing belt installation: Fitting an elastic timing belt onto gears/shafts. The belt can stretch and twist, so this is a form of deformable-object manipulation. Both arms hold the two sides of the belt, pull, and align it so that it catches on the teeth.
  • (H) Jenga block whipping: A highly unusual task of knocking a specific block out of a Jenga tower with a whip. The arm holds a leather whip and must swing it at exactly the right speed and angle to flick out only the target block. It is a highly dynamic, open-loop style task and a difficult skill even for humans.
  • (I) Frying-pan flip: Tossing an object such as a pancake or an egg into the air with a frying pan and catching it. The robot holds the pan and flicks it up sharply to flip the object 180 degrees. It is an example of dynamic manipulation where precise force and timing are essential.

These tasks span combinations of static vs dynamic, single-stage vs multi-stage, rigid vs deformable objects, and single-arm vs dual-arm. Several of them were previously regarded as impossible with real-world RL, or as hard enough to require dedicated special-purpose methods. Dual-arm vision-based RL, assembly of deformable parts like the timing belt, and Jenga whipping, for example, are challenges with almost no precedent. The authors took on these frontier problems to demonstrate HIL-SERL's generality and effectiveness.

The experiments were run with the system described above. Each task was trained for 1 to 2.5 hours, and all training took place on the real robot. After training, each policy was evaluated over 100 trials to measure success rate and completion time (10 trials for the full IKEA assembly). These results were compared against several baseline methods, which fall into two broad groups:

  • Imitation learning (IL) family: methods that learn from human demonstrations/corrections, such as plain Behavior Cloning (BC) and its improved variant HG-DAgger.
  • Reinforcement learning (RL) family: existing or adapted methods implemented by the authors, such as IBRL (Imitation Bootstrapped RL), Residual RL, and DAPG (Demo Augmented Policy Gradient).

Each baseline is covered in more detail in the discussion section; briefly:

  • BC (Behavior Cloning): learns from offline demonstrations only; suffers badly from errors compounding over an episode.
  • HG-DAgger: a DAgger variant in which a human teleoperates to show the correct action when the policy is struggling; the human correction data is fit with supervised learning.
  • IBRL: Imitation Bootstrapped RL (Hu et al., 2023), an RL method that first trains an imitation policy on demonstrations and uses it to help propose actions during RL (similar in spirit to HIL-SERL, but with different components).
  • Residual RL: adds an RL-learned correction on top of an existing feedback controller (e.g., an insertion heuristic); human knowledge supplies the basics and RL does the fine-tuning.
  • DAPG: builds an initial policy from demonstration data and then fine-tunes it with on-policy policy-gradient RL (Rajeswaran et al., 2018); mainly used for simulated dexterous-hand manipulation.
  • Diffusion Policy (DP): a recent, widely noticed IL method that uses a conditional diffusion model to learn the distribution of demonstration data and generates actions by sampling (Chi et al., 2023).
  • Ours (HIL-SERL): the method proposed in the paper.

Now for the results.

Performance Results: Success Rate and Speed

Table 1: Performance of HIL-SERL (the proposed method) versus BC (HG-DAgger, using the same amount of human data). Success rate is the fraction of successes over 100 trials (10 for the full IKEA assembly), with the improvement relative to BC in parentheses. Time is the average completion time per episode, with the speed-up factor relative to BC in parentheses. On every task the RL policy achieves a higher success rate than the policy trained from human demonstrations, and in most cases it also completes the task faster.

The table above compares Behavior Cloning (BC) with the HIL-SERL policy. Here BC is an imitation-learning policy trained with HG-DAgger on the same number of demonstration and correction episodes as HIL-SERL. In other words, the amount of human-provided data is held equal and only the method, RL vs IL, differs. The result is a decisive win for HIL-SERL: it reached a 100% success rate on every task, whereas BC varied widely by task, from 2% to 95%, and averaged only 49.7%. Imitation learning did worse the more complex the task: on timing belt assembly, for example, BC succeeded only 2% of the time, failing almost always, while HIL-SERL succeeded 100% of the time. Car dashboard assembly showed a similarly large gap, BC 41% vs RL 100%. Interestingly, on tasks that human demonstrations alone handle reasonably well, such as cable clipping and dual-arm handover, BC reached 80 to 95%, yet RL still finished at 100%.

Task completion time is also worth noting. The HIL-SERL policy completed most tasks faster than the BC policy (1.8x faster on average). IKEA top-panel assembly, for example, took BC 8.9 s but RL only 2.4 s, about 3.7x faster. This follows from the discounted reward discussed earlier, which leads the RL policy to seek the shortest path. Human demonstrations tend to be slow and careful, whereas for the RL policy succeeding sooner pays more, so it trims unnecessary motion and overlaps steps to gain speed. For the dynamic tasks, Jenga whipping and pancake flipping, BC was already fast (its data does come from a human teleoperator, after all) and nearly on par with RL, because these tasks are over in an instant and leave little room for speed-up. On static multi-stage tasks, however, the RL policy clearly finds parallelism and shortcuts. In the dual-arm handover, for example, the RL policy saves time by having the receiving arm start moving into position just before the object is handed over.
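The figures quoted above can be checked with a couple of lines of arithmetic. The numbers are the ones given in this section; the helper name is hypothetical.

```python
def speedup(baseline_seconds, policy_seconds):
    """How many times faster the policy is than the baseline."""
    return baseline_seconds / policy_seconds


# IKEA top-panel assembly: BC 8.9 s vs RL 2.4 s.
ikea_top_panel = speedup(8.9, 2.4)
print(f"{ikea_top_panel:.1f}x")  # prints 3.7x

# Average success rate: HIL-SERL 100% vs BC 49.7%.
success_ratio = 100.0 / 49.7
print(f"{success_ratio:.2f}x")   # prints 2.01x
```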

Comparisons against other baselines were also carried out. The paper lists success rates for a range of methods in Table 1(b); a few representative ones:

  • Diffusion Policy (DP): held up reasonably well on some tasks, in the 50 to 60% range, but fell short of HIL-SERL. As an offline IL method, DP cannot improve from its own rollouts.
  • HG-DAgger: the same baseline as the BC above, averaging 49.7%.
  • IBRL (the Luo et al. 2023 method): results varied widely by task (RAM insertion 75%, dashboard 0%, handover 95%), and it failed outright on the hard tasks. It uses human intervention within RL, but apparently without the supporting machinery HIL-SERL has.
  • Residual RL: generally failed on complex tasks (0%) and matched IL on simple ones (e.g. object flipping 97%). The takeaway is that if the base controller cannot solve the task, a residual RL correction cannot rescue it either.
  • DAPG: incorporates demonstrations into policy-gradient RL, but because it is on-policy its sample efficiency is low, and success rates were poor overall (board insertion 8%, handover 72%, and so on).
  • HIL-SERL: 100% on all tasks.

In short, HIL-SERL outperformed nearly every existing method. On the hardest tasks in particular (dual-arm assembly, deformable objects, highly dynamic motions), it was the only method that succeeded.

The size of this gap matters. The difference in success rate between HIL-SERL and BC is substantial on every task, and on average HIL-SERL is more than twice as successful (100% vs. 49.7%). The authors argue strongly that, given the same amount of human data, reinforcement learning does far better than imitation learning. Their analysis is that this reflects a fundamental advantage of RL: it corrects its own errors and explores a wider state distribution. IL, and DAgger-style methods in particular, learns only in the neighborhood of the human demonstrations, which limits it.

Learning Curves and Policy Characteristics

Looking inside HIL-SERL's training process reveals an interesting pattern in how the policy evolves. The paper plots success rate, cycle time, and intervention rate over the course of training. A representative figure compares the learning curves of HIL-SERL and HG-DAgger on RAM insertion. For HIL-SERL, the success rate rises monotonically with more episodes and quickly reaches 100%, the intervention rate drops to 0%, and the cycle time shortens. HG-DAgger (IL), by contrast, shows a jagged success rate that ends up hovering around 50%, and it keeps needing interventions. This illustrates that RL converges stably by repeatedly failing and recovering on its own state distribution, whereas IL, which learns only from human data, stays inconsistent and hits a ceiling.
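The three curves described above can be computed from simple per-episode logs. The sketch below is one way to do that bookkeeping; the `Episode` fields, the window size, and the definition of intervention rate as human-controlled steps divided by total steps are assumptions made for illustration, not the paper's logging code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    success: bool           # did the episode end in success?
    steps: int              # total control steps in the episode
    intervened_steps: int   # steps during which the human was in control
    duration_s: float       # wall-clock cycle time in seconds


def window_metrics(episodes: list[Episode], window: int = 20) -> dict[str, float]:
    """Summarise the most recent `window` episodes."""
    recent = episodes[-window:]
    total_steps = sum(e.steps for e in recent)
    if not recent or total_steps == 0:
        raise ValueError("need at least one episode with at least one step")
    return {
        "success_rate": mean(1.0 if e.success else 0.0 for e in recent),
        "intervention_rate": sum(e.intervened_steps for e in recent) / total_steps,
        "cycle_time_s": mean(e.duration_s for e in recent),
    }


# Made-up logs: early training needs help, late training does not.
early = [Episode(False, 100, 60, 10.0), Episode(True, 80, 20, 8.0)]
late = [Episode(True, 30, 0, 3.0), Episode(True, 28, 0, 2.8)]
print(window_metrics(early))
print(window_metrics(late))
```

Plotting these three numbers against the episode index gives curves of the kind discussed here: success rate going up while intervention rate and cycle time go down.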

The authors also analyze the reliability and the strategic character of the learned policies. First, HIL-SERL policies remained very robust after reaching 100% success: repeating the same task 100 times without a single failure means the outcome variance is close to zero. The paper visualizes this as a "funnel-shaped state distribution." Early in training the robot wanders over a broad region of state space, which then narrows like a funnel around the states covered by demonstrations and corrections. That is, the policy passes through many states by trial and error at first, but the final policy mostly visits the paths that lead to success. This is a consequence of how RL converges on its own experience. An IL policy follows what the human taught and is helpless once it goes off course, whereas the RL policy tends to correct its path after a small deviation and still reach the goal. Put simply, a "keep trying until it works" strategy is built into the policy. The authors credit this self-correcting ability for the superhuman success rates.
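It helps to be precise about what "100 successes in 100 trials" does and does not establish. If every one of n independent trials succeeds, the exact one-sided lower confidence bound on the true success probability has the closed form alpha^(1/n). The sketch below evaluates it; the trial count of 100 comes from the text, while the 95% confidence level and the independence of trials are assumptions.

```python
def success_lower_bound(n_trials: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) lower confidence bound on the true success
    probability when all n_trials independent trials succeeded.

    If the true success probability is p, a run of n successes has
    probability p ** n; the bound is the p at which that equals alpha.
    """
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    return alpha ** (1.0 / n_trials)


bound = success_lower_bound(100)
print(f"95% lower bound on success probability: {bound:.4f}")
```

So a perfect record over 100 trials supports a true success rate of roughly 97% or better at this confidence level, which is strong evidence of reliability but not literally zero variance.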

Another interesting analysis is reactive versus predictive control. The tasks HIL-SERL tackles mix two broad types of control:

  • Reactive control: uses feedback continuously to make small adjustments toward the goal. It shows up mainly in precision assembly tasks.
  • Predictive control: commits to a trajectory and executes it quickly, with the outcome decided by that single execution. It is needed for dynamic throwing, whipping, and similar motions.

Traditionally, reactive control has been implemented with pre-modeled or fixed strategies such as PID or force control, while predictive control has been handled separately with motion primitives or optimal-control solutions. Remarkably, the same RL method in HIL-SERL acquired both extremes. The SSD insertion policy, for example, approaches slowly and, once in contact, modulates force to seat the part: a very delicate feedback strategy. The Jenga whipping policy, on the other hand, shows a predictive strategy, setting up exactly the position and velocity it needs and then striking in one motion. That one algorithm learns such different behavior modes looks like a major strength of RL. It suggests that, as long as the reward is right, the form of the strategy emerges on its own to fit the environment. Such strategies could not have been designed in directly through human demonstrations, but RL found them.

In a further experiment, the authors tested how well the policies adapt by introducing disturbances not seen during training. Specifically, while the robot was inserting a component into the motherboard, a person nudged the motherboard to a new position. The RL policy immediately adjusted the arm to the new position, kept attempting the insertion, and succeeded. Since nothing like this appears in the human demonstrations, an IL policy would likely have been unable to cope. They also forced the object to be dropped during the handover; the RL policy recovered by picking it up and trying again. These behaviors show that the RL policy responds flexibly to both environmental changes and its own mistakes. This kind of robustness matters especially in real industrial applications. As the authors stress, consistency and reliability are prerequisites for deployment, and the HIL-SERL policies move toward meeting that bar.

Finally, a word on training time. Across all seven tasks, training took between 1 and 2.5 hours. The longest were the timing belt (about 2.5 h) and the IKEA assembly (a little over 2 h); the rest took around 1 hour. In that time the robot appears to have run on the order of a few hundred episodes (several thousand to roughly ten thousand steps). Jenga whipping in particular reached a finished policy in 1.25 hours, and the researchers themselves recall being struck that the robot learned so quickly something humans find hard. Training time at this practical level is one of HIL-SERL's major achievements. Real-robot RL used to be discussed in terms of days or even weeks; now a skill can be mastered somewhere between a lunch break and half a day. At that pace, it becomes realistic to bring a robot on site and train it on the spot, task by task.

Critical Discussion: Strengths, Limitations, and Future Directions

Having looked at HIL-SERL's innovations and results, let's now evaluate them critically. We first summarize what it does well, then point out the remaining limitations and room for improvement. We also compare it with related work to place HIL-SERL in context and discuss future research directions.

Strengths and Contributions

1) Results across a broad range of tasks: HIL-SERL succeeded not on one or two specific tasks but by applying one unified method to several different kinds of task. Precision assembly, deformable objects, dynamic throwing, and dual-arm coordination are each separate hard problems, and it is difficult to find a precedent for a single system solving all of them. Dual-arm vision-based RL and Jenga whipping in particular can be regarded as among the first demonstrations of their kind. This demonstrates the strength of the RL plus human-in-the-loop framework and suggests the approach can be carried over to many other robot tasks. It offers a glimpse of a robot that, like a general-purpose learner, could pick up almost any task given a small amount of data.

2) Sample efficiency and real-time learning: Learning within 1 to 2.5 hours is a big leap in sample efficiency. It is the product of several choices working together: adopting the RLPD algorithm, using demonstration and correction data, and using a pretrained vision model. Training on batches that mix in 50% offline data is an important topic in recent RL research, and this work shows that it is effective in practice. The scenario itself, learning in the real world with a person's help, is also practical. Fully autonomous learning is costly in both time and risk, whereas interacting with a human as HIL-SERL does makes it possible to keep the learning process under control and supply data only where it is needed. Real-time interactive learning of this kind is likely to become an important paradigm for robot learning.
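The 50% mixing mentioned above can be sketched in a few lines. This is a simplified illustration of drawing each training batch half from prior (demonstration) data and half from online data; the function name, the list-based buffers, and sampling with replacement are assumptions for the example, not the paper's implementation.

```python
import random


def sample_mixed_batch(demo_buffer, online_buffer, batch_size, rng=random):
    """Draw half of each batch from prior (demo) data and half from online data."""
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even for a 50/50 split")
    if not demo_buffer or not online_buffer:
        raise ValueError("both buffers must be non-empty")
    half = batch_size // 2
    # Sample with replacement so a small buffer early in training still works.
    demo_part = rng.choices(demo_buffer, k=half)
    online_part = rng.choices(online_buffer, k=half)
    batch = demo_part + online_part
    rng.shuffle(batch)
    return batch


# Toy buffers: 20 demo transitions versus 1000 online transitions.
demos = [("demo", i) for i in range(20)]
online = [("online", i) for i in range(1000)]
batch = sample_mixed_batch(demos, online, batch_size=256, rng=random.Random(0))
print(sum(1 for source, _ in batch if source == "demo"), "of", len(batch), "from demos")
```

The point of the fixed split is that the small set of human data keeps half the weight in every update, however large the online buffer grows.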

3) Performance beyond human level: HIL-SERL policies achieved higher success rates and speeds than expert human teleoperation. Turning Jenga whipping, which humans manage only some tens of percent of the time, into a 100% skill, and finishing in under 3 seconds an assembly that takes 8 to 9 seconds at demonstration pace (8.9 s for BC vs. 2.4 s for RL on the IKEA top panel) is very encouraging. It is a good illustration of RL's potential: the policy found solutions that a person could not have taught it. With imitation learning, human performance was the ceiling; with RL it becomes possible to produce robot workers that exceed it. From an industrial standpoint a more accurate and faster robot is an obviously attractive goal, so these results should lend weight to efforts to bring RL-based systems onto the factory floor.

4) System design insight: HIL-SERL can also be seen as a victory of design. Relative coordinate frames, a mix of impedance and open-loop control, a separate discrete controller for the gripper, a sparse-reward classifier: ideas that might each have passed by in a sentence or two add up to the success of the whole system. These engineering insights are likely to settle in as best practice for similar work. In particular, separating continuous and discrete actions can carry over to learning other robot hand motions, and the success-classifier approach can serve as a general solution for the many problems where reward design is hard. The way human intervention data is handled, registered in both the demo buffer and the RL buffer, is also valuable experience for future human-robot interactive learning.
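The double registration of intervention data can be sketched as a small routing rule. The class and field names below are hypothetical and the buffers are plain lists; the sketch only illustrates the rule described above, that every executed step goes to the RL buffer while human-controlled steps are additionally kept as demonstration data.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    obs: tuple
    action: tuple
    reward: float
    next_obs: tuple
    done: bool
    human_intervened: bool  # True if the human overrode the policy on this step


@dataclass
class Buffers:
    demo: list = field(default_factory=list)    # offline demos + human corrections
    online: list = field(default_factory=list)  # everything executed on the robot

    def add(self, t: Transition) -> None:
        # Every executed transition is on-robot experience.
        self.online.append(t)
        # Human-controlled steps are additionally kept as demonstration data.
        if t.human_intervened:
            self.demo.append(t)


buffers = Buffers()
buffers.add(Transition((0.0,), (0.1,), 0.0, (0.1,), False, human_intervened=False))
buffers.add(Transition((0.1,), (0.2,), 0.0, (0.3,), False, human_intervened=True))
print(len(buffers.online), len(buffers.demo))
```

Routing corrections into the demo buffer means they keep being replayed at a high rate even after the online buffer has grown much larger.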

Weaknesses and Limitations

Even the best research has limits. Here are the constraints of HIL-SERL and the points that could be improved:

1) Dependence on people: As the name "human-in-the-loop" says, the method does not really work without human involvement. Initial demonstrations, data for the reward classifier, and frequent interventions during training add up to a considerable human burden. A skilled operator is required, and the more complex the task, the more often they must intervene early on. The goal is of course to reduce interventions over time and end up with a fully autonomous policy, but compared with running RL from start to finish with no human at all, the preparation effort is large. This is ultimately a practicality versus autonomy trade-off, and this work chose practicality. Future work will need to minimize interventions or make human labor more efficient, for example by having one remote expert monitor several robots over video. In addition, when and how to intervene is left entirely to the person, and the effect of that intervention policy is hard to analyze theoretically. What happens if the person intervenes wrongly by mistake, or intervenes too little and the robot gets damaged? The question of how to optimize the human-in-the-loop strategy remains open. Semi-automatic intervention triggers (e.g. automatic detection of dangerous situations) or an AI training assistant could be added to help.

2) Generality across tasks (generalization): HIL-SERL policies are trained separately for each task. Multi-task learning, where one policy performs several tasks, was not attempted. When the task changes, another 1 to 2.5 hours of training is needed. That is cheap, but it is not the same as responding immediately to varied work the way a person does. For now, then, the likely use is putting one robot on a fixed task, such as a specific manufacturing step, and training it there. With a simulator such as IsaacSim, it may be possible to transfer a trained policy to several similar tasks, or to train on many variations at once. This work does not go that far, and generalization across tasks and environments remains open. The reward classifier also has to be built separately for each task, so that procedure must be repeated whenever the task definition is entirely new.

3) Automatic resets and continual learning: In the experiments, either a person helped reset the environment for each task or an automatic reset script was written. In practice, automatic resets can be hard for some tasks, and human resets are yet another cost. There is prior work on reset-free RL, but HIL-SERL does not put much emphasis on the reset problem. For fully autonomous robot learning, however, it matters that the robot can restore the environment itself after a failure, or that failures hardly occur at all. HIL-SERL policies eventually stop failing, but early on a person stepped in to fix failures, so the reset problem is effectively sidestepped. Resets are trivial in simulators such as IsaacSim, so pretraining there and transferring to the real world could reduce the reset burden (with the usual sim-to-real difficulties). In the long run, for a robot to keep learning new tasks it must be able to learn continuously without interruption, and the ability to recover on its own without environment resets or human help also needs study.

4) Failure conditions: HIL-SERL is not a cure-all, so there are certainly cases where it fails. The paper reports success on nearly all trials, but the policy may be helpless in edge cases such as sensor malfunction or abrupt environmental change. It also does not address formal safety verification, such as what happens if an accident occurs while the policy runs without human oversight. Deployment on an industrial floor requires considering this kind of worst-case handling and verifiability. Finally, HIL-SERL currently handles only single-goal tasks; extending it to tasks with arbitrary goal parameters (e.g. inserting into a hole at an arbitrary position, or assembling a variety of parts) needs further research. That would call for combining RL with planning, or for goal-conditioned policies, and how to integrate human intervention in that setting is still an open problem.

Comparison with Related Work

HIL-SERL builds on the results of a great deal of earlier research. Its characteristics relative to related areas can be summarized as follows:

  • Imitation learning vs. reinforcement learning: HIL-SERL pinpoints the weaknesses of imitation learning and addresses them. Algorithms such as HG-DAgger also receive human corrections, but they update the policy with supervised learning and so cannot fully eliminate distribution shift and compounding error. HIL-SERL, being RL, improves the policy on its own state distribution using the rewards from its own attempts. This resolves the data mismatch that IL methods suffer from and makes exploration possible. The results demonstrate empirically that, under the same data budget, RL beats IL. This has been an active debate in the field, and here the verdict comes out as "RL > IL." IL can still be more data-efficient and remains useful when only a very limited set of demonstrations is available. But when the amount of data can be grown through interventions, as in this work, switching to RL is shown to be the way to push performance to the top.
  • Offline plus online RL: In the recent offline-to-online RL line of work, a common approach is to pretrain on existing data and then tune online. The RLPD algorithm used in HIL-SERL, from the ICML paper by Ball et al. (2023), belongs to this family in a broad sense, although rather than pretraining it runs online RL while sampling the prior data alongside online data. HIL-SERL's results show that RL with offline data works on real robots as well. There are earlier examples such as QT-Opt (Kalashnikov et al.), but that system needed a very large network and hundreds of thousands of grasp attempts and still fell short of 100% success. This work reaches 100% with far less data and time, showing what sample-efficient RL is really worth. Using human correction data in RL resembles prior work such as Luo et al. (2023), but HIL-SERL takes it a step further into a complete system that covers demonstrations, corrections, and safety mechanisms together.
  • Success detection and reward design: Reward function design is a perennial headache in robot learning; it is, so to speak, the loss function that a person has to write by hand. There has been a lot of recent work on rewards derived from natural language or vision models (e.g. CLIP-based rewards, video-comparison rewards). Among these options HIL-SERL chooses a simple but effective one: a success classifier. This works when success can be defined clearly. Many manipulation tasks have a well-defined end state, so telling success from failure is easy. For more subjective or continuous objectives (e.g. "move nicely") the approach may struggle. Still, this work lays out a standard way to use a success classifier, and the approach seems likely to become part of the default setup for robot learning. The claim that a sparse reward is enough is also significant. If learning is fast with a sparse reward given enough demonstrations and corrections, without elaborate shaping, the burden of reward design is much lighter. RL researchers may do well to consider a classifier-based sparse reward plus demonstrations first, before rushing into reward engineering.
  • Simulation vs. the real world: Many RL papers report impressive results in simulators (e.g. MuJoCo, IsaacGym), but we know that things are not so easy once they move to reality. This work, by contrast, was done on real robot arms from start to finish. That means it confronted sensor noise, system latency, physical inaccuracies, and other real-world problems head on. Simulators such as IsaacSim are very sophisticated, but tuning is still needed at the sim-to-real stage. HIL-SERL keeps training risk in check with human intervention and a safety controller while letting the policy learn directly from real-world variation. The result shows that learning can be fast enough without simulation. The message is that not everything needs to be simulated. Using a simulator alongside would allow more complex environments or longer training, but this work sets an example for learning in the real environment.
  • Multimodal policies: HIL-SERL policies use both camera images and robot state, which places the work in the category of end-to-end vision-based control RL. Until a few years ago the prevailing view was that learning on real robots from vision inputs needs far too many samples; this paper breaks through that barrier. It shows that with a pretrained vision backbone such as ResNet and well-chosen data, a vision-based policy with 100% success on the real robot is possible without simulation. It is meaningful that this was achieved through careful system integration rather than a very large model or big data. Talk of combining large vision models with RL is only now emerging, but this work suggests that a well-built pipeline is sufficient, and far more efficient, even without a large model.
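The classifier-based sparse reward discussed in the list above can be sketched as a thin wrapper. Everything here is illustrative: `make_sparse_reward`, the threshold, and especially `toy_classifier` (a logistic function of a single made-up "insertion depth" feature) are hypothetical stand-ins for a trained image-based success classifier.

```python
import math
from typing import Callable, Sequence


def make_sparse_reward(
    success_prob: Callable[[Sequence[float]], float],
    threshold: float = 0.5,
) -> Callable[[Sequence[float]], tuple[float, bool]]:
    """Wrap a success classifier into a binary reward function.

    `success_prob` maps an observation to an estimated probability of success.
    The returned function gives (reward, done): reward is 1.0 and the episode
    ends once the classifier is confident enough, otherwise 0.0 and continue.
    """
    def reward_fn(obs: Sequence[float]) -> tuple[float, bool]:
        succeeded = success_prob(obs) >= threshold
        return (1.0 if succeeded else 0.0), succeeded

    return reward_fn


def toy_classifier(obs: Sequence[float]) -> float:
    # Hypothetical stand-in: obs[0] is an insertion depth in mm;
    # deeper insertion is treated as more likely to be a success.
    return 1.0 / (1.0 + math.exp(-(obs[0] - 5.0)))


reward_fn = make_sparse_reward(toy_classifier, threshold=0.9)
print(reward_fn([1.0]), reward_fn([10.0]))
```

A high threshold trades a few missed successes for fewer false positives, which matters because an RL policy will readily exploit any state the classifier wrongly labels as success.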

Future Research Directions

HIL-SERL's success raises new questions. Directions worth exploring in this area include:

  • Automating and optimizing human intervention: At present a person decides subjectively when to intervene, but this decision could be made by an AI, or the moment an intervention is needed could be predicted. For example, a failure-prediction model could run in parallel with the policy and raise an alert when risk gets high. How best to control during an intervention (e.g. many short interventions vs. one long one) also needs quantitative study. Ultimately one can imagine robots helping each other learn instead of a person (say, a more mature robot stepping in when another one fails).
  • Multi-task continual learning: This is the continual learning direction, moving from one learned task to the next while reusing previous knowledge. One could test whether a robot that has acquired a task with HIL-SERL can transfer that policy to shorten training on the next task, or learn several tasks at once. For example, since RAM insertion and SSD insertion are similar, is it more efficient to learn them together, and can the robot learn a mix of tasks without confusing them? This requires technical changes such as sharing policy representations or adding a goal encoding to the state, but achieving it would bring us a step closer to a truly general-purpose manipulation robot.
  • Working with simulators: To reduce the risk and cost of real-world learning, one option is simulation pretraining followed by real-world fine-tuning. If simulation could take over the role of human intervention in HIL-SERL, the policy might be trained freely in a virtual environment, with no person having to step into risky situations, and then need only small adjustments in reality. NVIDIA IsaacSim, with its high physical fidelity, is promising here. There are open problems, however, such as how to model human behavior in a simulator and whether the reward classifier behaves the same in reality and in simulation. Conversely, sim-to-real-to-sim feedback, using real-world training data to improve the simulator model, is also worth considering.
  • Theoretical analysis: A system like HIL-SERL has many components, which makes theoretical analysis hard. Still, convergence guarantees or sample complexity could be examined for individual components. For example: does human intervention make any part of the MDP non-Markovian, and what are the error bounds for off-policy learning combined with demonstrations? Such analysis could serve as principled guidance when designing similar algorithms.
  • Other forms of human feedback: This work deals with demonstrations (action-level interventions), but people can give many other kinds of feedback, such as language instructions, evaluation scores, or gestures. For example, there could be an RLHF-style (reinforcement learning from human feedback) setup in which the robot changes its reward function when a person says "do it more gently," or a setup in which a person labels success and failure live. Integrating such varied human-robot interaction into RL is a very interesting direction for future research.

Summary and Conclusion

The HIL-SERL paper gives the robot learning field a refreshing jolt. In one sentence: if a human helps at the right moments, reinforcement learning can teach a real robot complex manipulation skills in just a few hours, and the resulting performance exceeds the human's. It offers one answer to real-world robot reinforcement learning, long considered a hard problem.

To restate the core contributions: (1) it designs an efficient learning loop that integrates human demonstrations and online interventions into RL; (2) with it, it quickly obtains near-perfect policies on a variety of difficult tasks; and (3) it demonstrates experimentally that the RL policies far outperform imitation learning. In addition, (4) it solves the many engineering problems of real robots in an integrated way to deliver a practical system. In the spirit often attributed to Richard Feynman, that science has to be shown by actually doing it, this paper implements complex ideas on real robots and presents working evidence.

From a robotics researcher's standpoint, HIL-SERL can be seen as a step closer to "robots that learn." In the past, teaching a robot a new task meant programming it by hand or running hundreds of thousands of trials in a simulator. Now we can start to picture a robot that practices a few times alongside a person and then gets better on its own. Just as a skilled worker trains a new hire, we can demonstrate to the robot and correct it when it slips. Eventually, once robots are smart enough, they will manage without human supervision, but HIL-SERL shows that lending them human knowledge up to that point is effective. This could have a large impact on factory automation, logistics, assembly, service robots, and other areas. In high-mix low-volume production especially, programming the robot for each product is not feasible, and a technique like HIL-SERL would make it possible to retrain the robot quickly whenever the task changes.

Finally, this work can be seen as setting a new reference point for robot learning. Future papers will probably frame their comparisons as "our method is twice as data-efficient as HIL-SERL" or "we reach this level without HIL-SERL." That is the sense in which a reference level of performance and a reference methodology have been established. Open problems remain, but HIL-SERL's success deserves to be regarded as a landmark step that brings practical RL-based robotics much closer.

Copyright 2026, JungYeon Lee