Curieux.JY
  • Jung Yeon Lee

On this page

  • 1 Dexterous Manipulation through Imitation Learning
    • 1.1 Behavioral Cloning (BC)
    • 1.2 Inverse Reinforcement Learning (IRL)
    • 1.3 Generative Adversarial Imitation Learning (GAIL)
    • 1.4 Hierarchical Imitation Learning (HIL)
    • 1.5 Continual Imitation Learning (CIL)
  • 2 End Effectors for Dexterous Manipulation
  • 3 Teleoperation Systems and Data Collection
  • 4 Challenges and Future Directions
  • 5 Conclusion

📃 Dex Imitation Learning Review

paper
imitation
dexterous
Dexterous Manipulation through Imitation Learning: A Survey
Published

June 11, 2025

Paper Link

I recommend reading the survey paper itself at least once.

  1. ✨ Due to high-dimensional complexity and contact-rich dynamics, traditional methods and reinforcement learning struggle with robotic dexterous manipulation.
  2. 🤖 Imitation Learning (IL) is a promising alternative that lets robots learn complex manipulation skills directly from expert demonstrations.
  3. 📘 This survey provides a comprehensive overview of the state of the art, the open challenges, and future research directions in IL-based dexterous manipulation.

1 Dexterous Manipulation through Imitation Learning

๋ณธ ๋…ผ๋ฌธ์€ Imitation Learning (IL) ๊ธฐ๋ฐ˜์˜ Dexterous Manipulation(DM)์— ๋Œ€ํ•œ ํฌ๊ด„์ ์ธ ์„œ๋ฒ ์ด ๋…ผ๋ฌธ์ž…๋‹ˆ๋‹ค. DM์€ ๋กœ๋ด‡ ์† ๋˜๋Š” ๋‹ค์ง€(multi-fingered) End-effector๊ฐ€ ์ •๋ฐ€ํ•˜๊ฒŒ ์กฐ์œจ๋œ ์†๊ฐ€๋ฝ ์›€์ง์ž„๊ณผ ์ ์‘์ ์ธ ํž˜ ์กฐ์ ˆ์„ ํ†ตํ•ด ๊ฐ์ฒด๋ฅผ ๋Šฅ์ˆ™ํ•˜๊ฒŒ ์ œ์–ด, ์žฌ๋ฐฐํ–ฅ, ์กฐ์ž‘ํ•˜๋Š” ๋Šฅ๋ ฅ์„ ์˜๋ฏธํ•˜๋ฉฐ, ์ธ๊ฐ„ ์†์˜ dexterity์™€ ์œ ์‚ฌํ•œ ๋ณต์žกํ•œ ์ƒํ˜ธ์ž‘์šฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ๋กœ๋ด‡ ๊ณตํ•™ ๋ฐ ๊ธฐ๊ณ„ ํ•™์Šต์˜ ๋ฐœ์ „๊ณผ ํ•จ๊ป˜ ๋ณต์žกํ•˜๊ณ  ๋น„์ •ํ˜•์ ์ธ ํ™˜๊ฒฝ์—์„œ ์ž‘๋™ํ•˜๋Š” ์‹œ์Šคํ…œ์— ๋Œ€ํ•œ ์ˆ˜์š”๊ฐ€ ์ฆ๊ฐ€ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

๊ธฐ์กด์˜ ๋ชจ๋ธ ๊ธฐ๋ฐ˜(model-based) ์ ‘๊ทผ ๋ฐฉ์‹์€ DM์˜ ๋†’์€ ์ฐจ์›์„ฑ(high dimensionality)๊ณผ ๋ณต์žกํ•œ ์ ‘์ด‰ ๋™์—ญํ•™(contact dynamics)์œผ๋กœ ์ธํ•ด ์ž‘์—… ๋ฐ ๊ฐ์ฒด ๋ณ€ํ™”์— ๋Œ€ํ•œ ์ผ๋ฐ˜ํ™”(generalize)์— ์–ด๋ ค์›€์„ ๊ฒช์Šต๋‹ˆ๋‹ค. Reinforcement Learning (RL)๊ณผ ๊ฐ™์€ ๋ชจ๋ธ ํ”„๋ฆฌ(model-free) ๋ฐฉ์‹์€ ๊ฐ€๋Šฅ์„ฑ์„ ๋ณด์—ฌ์ฃผ์ง€๋งŒ, ์•ˆ์ •์„ฑ๊ณผ ํšจ๊ณผ์„ฑ์„ ์œ„ํ•ด ๊ด‘๋ฒ”์œ„ํ•œ ํ›ˆ๋ จ, ๋Œ€๊ทœ๋ชจ ์ƒํ˜ธ์ž‘์šฉ ๋ฐ์ดํ„ฐ, ์‹ ์ค‘ํ•˜๊ฒŒ ์„ค๊ณ„๋œ ๋ณด์ƒ(reward)์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. IL์€ ์ „๋ฌธ๊ฐ€ ๋ฐ๋ชจ(expert demonstrations)๋กœ๋ถ€ํ„ฐ DM ๊ธฐ์ˆ ์„ ์ง์ ‘ ์Šต๋“ํ•˜๊ฒŒ ํ•˜์—ฌ, ๋ช…์‹œ์ ์ธ ๋ชจ๋ธ๋ง์ด๋‚˜ ๋Œ€๊ทœ๋ชจ ์‹œํ–‰์ฐฉ์˜ค ์—†์ด ๋ฏธ์„ธํ•œ ์กฐ์œจ(fine-grained coordination) ๋ฐ ์ ‘์ด‰ ๋™์—ญํ•™์„ ํฌ์ฐฉํ•  ์ˆ˜ ์žˆ๋Š” ๋Œ€์•ˆ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๋ณธ ์„œ๋ฒ ์ด๋Š” IL์— ๊ธฐ๋ฐ˜ํ•œ DM ๋ฐฉ๋ฒ•์„ ๊ฐœ๊ด„ํ•˜๊ณ , ์ตœ๊ทผ์˜ ๋ฐœ์ „ ์‚ฌํ•ญ์„ ์ž์„ธํžˆ ์„ค๋ช…ํ•˜๋ฉฐ, ์ด ๋ถ„์•ผ์˜ ์ฃผ์š” ๋„์ „ ๊ณผ์ œ๋ฅผ ๋‹ค๋ฃน๋‹ˆ๋‹ค. ๋˜ํ•œ, IL ๊ธฐ๋ฐ˜ DM์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•œ ์ž ์žฌ์ ์ธ ์—ฐ๊ตฌ ๋ฐฉํ–ฅ์„ ํƒ์ƒ‰ํ•ฉ๋‹ˆ๋‹ค.

IL-based DM approaches fall broadly into five categories:

  1. Behavioral Cloning (BC),
  2. Inverse Reinforcement Learning (IRL),
  3. Generative Adversarial Imitation Learning (GAIL), and, as extended frameworks,
  4. Hierarchical Imitation Learning (HIL) and
  5. Continual Imitation Learning (CIL).

1.1 Behavioral Cloning (BC)

BC is a supervised learning paradigm that replicates expert behavior by learning directly from the state-action pairs of expert demonstrations. It is characterized by a direct mapping from states to actions, with no reward signal and no exploration. The objective is to minimize the negative log-likelihood of the demonstrated actions:

L(\pi) = -E_{(s,a)\sim p_D}[\log \pi(a | s)]

Here D = \{\tau_1, \dots, \tau_n\} is a set of n demonstrations, and each demonstration \tau_i is a sequence of state-action pairs \{(s_1, a_1), \dots, (s_{N_i}, a_{N_i})\} of length N_i. BC has performed well on relatively simple tasks such as pushing and grasping. However, it is vulnerable to distribution shift — when it encounters states unseen during training it can produce actions that deviate from expert behavior — and to compounding error, where mistakes accumulate over the sequential decision-making process. To mitigate this, hierarchical frameworks [29] and approaches that predict entire action sequences instead of per-step actions, shortening the effective decision horizon [53], have been proposed. To model the multi-modal data common in human demonstrations, energy-based modeling [26], Gaussian mixture models [58], and generative models [59] have been explored, and recently diffusion models [32, 60, 61, 62] have shown great potential for improving the robustness and generalization of BC methods. While BC-based methods struggle with generalization and with modeling multi-modal action distributions, diffusion models are improving flexibility by directly generating action sequences or guiding high-level strategies.
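
To make the objective concrete, here is a minimal runnable sketch of BC as supervised NLL minimization. Everything in it — the 2-D toy states, the linear softmax policy, the expert's decision rule — is an illustrative assumption, not from the survey:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expert: in a 2-D state s, take action 0 when s[0] > s[1], else action 1.
S = rng.normal(size=(256, 2))                # states from demonstrations
A = (S[:, 0] <= S[:, 1]).astype(int)         # expert actions

W = np.zeros((2, 2))                         # linear softmax policy: logits = W @ s

def nll(W):
    """L(pi) = -E[log pi(a|s)] over the demonstration set."""
    logits = S @ W.T
    logZ = np.log(np.exp(logits).sum(axis=1))
    return float(np.mean(logZ - logits[np.arange(len(A)), A]))

for _ in range(200):                         # plain gradient descent on the NLL
    logits = S @ W.T
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)        # pi(a|s)
    G = P.copy()
    G[np.arange(len(A)), A] -= 1.0           # dL/dlogits
    W -= 0.5 * (G.T @ S) / len(A)

loss_after = nll(W)                          # starts at log 2 and decreases
```

Note that nothing here addresses distribution shift: the policy is only ever graded on expert states, which is exactly why compounding errors arise at rollout time.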

1.2 Inverse Reinforcement Learning (IRL)

IRL์€ ์‚ฌ์ „ ์ •์˜๋œ ๋ณด์ƒ ํ•จ์ˆ˜๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์ •์ฑ…์„ ํ•™์Šตํ•˜๋Š” ๊ธฐ์กด RL ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์—ญ์ „์‹œํ‚ต๋‹ˆ๋‹ค. ๋Œ€์‹ , ์ „๋ฌธ๊ฐ€ ๋ฐ๋ชจ ์ง‘ํ•ฉ D๋ฅผ ๊ฐ€์žฅ ์ž˜ ์„ค๋ช…ํ•˜๋Š” ๊ธฐ์ €์˜ ๋ณด์ƒ ํ•จ์ˆ˜ R(s, a)๋ฅผ ์ถ”๋ก ํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค. ๋ฐ๋ชจ๋Š” ์ตœ์  ๋˜๋Š” ๊ฑฐ์˜ ์ตœ์ ์˜ ์ •์ฑ…์„ ๋”ฐ๋ฅด๋Š” ์ „๋ฌธ๊ฐ€์— ์˜ํ•ด ์ƒ์„ฑ๋˜์—ˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•ฉ๋‹ˆ๋‹ค.

The IRL problem is typically formulated within a finite Markov Decision Process M = \langle S, A, T, R, \gamma \rangle, where S and A are the state and action spaces, T(s'|s, a) is the state transition probability, R(s, a) is the reward function, and \gamma \in [0, 1] is the discount factor. The reward function is often expressed as a linear combination of feature functions, R(s_t, a_t) = w^\top\phi(s_t, a_t). The expected feature count under a policy \pi is defined as \mu_\phi(\pi) = E_\pi[\sum_{t=0}^\infty \gamma^t \phi(s_t, a_t)]. IRL is particularly advantageous in DM scenarios where reward functions are difficult to define by hand.

์ตœ๊ทผ ์—ฐ๊ตฌ๋“ค์€ reward normalization, task-specific feature masking [63], adaptive sampling [64], ์‚ฌ์šฉ์ž ํ”ผ๋“œ๋ฐฑ ํ†ตํ•ฉ [65], ๋น„์ •ํ˜• ๋ฐ๋ชจ๋กœ๋ถ€ํ„ฐ ๋ณด์ƒ ํ•จ์ˆ˜ ํ•™์Šต [67], Proximal Policy Optimization [45]๊ณผ์˜ ํ†ตํ•ฉ [68], ์‹œ๊ฐ ๊ธฐ๋ฐ˜ ์ธ๊ฐ„-๋กœ๋ด‡ ํ˜‘์—… [69] ๋“ฑ์„ ํ†ตํ•ด IRL ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๋ฐœ์ „์‹œ์ผฐ์Šต๋‹ˆ๋‹ค. IRL์€ ์ „๋ฌธ๊ฐ€ ๋ฐ๋ชจ๋กœ๋ถ€ํ„ฐ ๊ธฐ์ € ๋ณด์ƒ ํ•จ์ˆ˜๋ฅผ ์ถ”๋ก ํ•จ์œผ๋กœ์จ ๋ณต์žกํ•œ ํ–‰๋™์„ ์ผ๋ฐ˜ํ™”ํ•˜๊ณ  ๋‹ค์–‘ํ•œ ํ™˜๊ฒฝ์— ์ ์‘ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์ง€๋งŒ, ๊ณ ์ฐจ์› ์•ก์…˜ ๊ณต๊ฐ„์ด๋‚˜ ํฌ์†Œํ•œ ํ”ผ๋“œ๋ฐฑ ์‹ ํ˜ธ์—์„œ ์ •ํ™•ํ•œ ๋ณด์ƒ ํ•จ์ˆ˜ ์ถ”์ • ๋ฐ ๋Œ€๋Ÿ‰์˜ ๋ฐ๋ชจ ๋ฐ์ดํ„ฐ ์š”๊ตฌ์™€ ๊ฐ™์€ ํ•œ๊ณ„์— ์ง๋ฉดํ•ฉ๋‹ˆ๋‹ค.

1.3 Generative Adversarial Imitation Learning (GAIL)

GAIL์€ GAN [102] ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ IL ์˜์—ญ์œผ๋กœ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค. ๋ชจ๋ฐฉ ํ”„๋กœ์„ธ์Šค๋ฅผ ์ƒ์„ฑ์ž์™€ ํŒ๋ณ„์ž ์‚ฌ์ด์˜ 2์ธ ์ ๋Œ€์  ๊ฒŒ์ž„์œผ๋กœ ๊ณต์‹ํ™”ํ•ฉ๋‹ˆ๋‹ค. ์ƒ์„ฑ์ž๋Š” ์ „๋ฌธ๊ฐ€ ๋ฐ๋ชจ์™€ ์œ ์‚ฌํ•œ ํ–‰๋™์„ ์ƒ์„ฑํ•˜๋ ค๋Š” ์ •์ฑ… \pi์— ํ•ด๋‹นํ•˜๋ฉฐ, ํŒ๋ณ„์ž D(s, a)๋Š” state-action ์Œ (s, a)๊ฐ€ ์ „๋ฌธ๊ฐ€ ๋ฐ์ดํ„ฐ M์—์„œ ์™”๋Š”์ง€ ๋˜๋Š” \pi์— ์˜ํ•ด ์ƒ์„ฑ๋˜์—ˆ๋Š”์ง€ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. GAIL์€ ์ „๋ฌธ๊ฐ€์™€ ์ƒ์„ฑ์ž์˜ state-action ๋ถ„ํฌ ์‚ฌ์ด์˜ Jensen-Shannon divergence๋ฅผ ์ตœ์†Œํ™”ํ•ฉ๋‹ˆ๋‹ค.

ํŒ๋ณ„์ž๋Š” ๋‹ค์Œ ๋ชฉํ‘œ๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋„๋ก ํ›ˆ๋ จ๋ฉ๋‹ˆ๋‹ค:

\arg \min_D -E_{d_M(s,a)}[\log D(s, a)] - E_{d_\pi(s,a)}[\log(1 - D(s, a))]

์ƒ์„ฑ์ž์˜ ์ •์ฑ… \pi๋Š” ํŒ๋ณ„์ž์—์„œ ํŒŒ์ƒ๋œ ๋ณด์ƒ r_t = -\log(1 - D(s_t, a_t))์„ ์‚ฌ์šฉํ•˜์—ฌ RL๋กœ ์ตœ์ ํ™”๋ฉ๋‹ˆ๋‹ค. ์ด ์ ๋Œ€์  ํ›ˆ๋ จ ๊ณผ์ •์„ ํ†ตํ•ด GAIL์€ ๋ช…์‹œ์ ์œผ๋กœ ๋ณด์ƒ ํ•จ์ˆ˜๋ฅผ ๋ณต๊ตฌํ•˜์ง€ ์•Š๊ณ ๋„ ์ „๋ฌธ๊ฐ€ ๋ฐ๋ชจ๋กœ๋ถ€ํ„ฐ ๋ณต์žกํ•œ ํ–‰๋™์„ ํšจ๊ณผ์ ์œผ๋กœ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.

GAIL์€ DM์—์„œ ๋„๋ฆฌ ์ฑ„ํƒ๋˜์—ˆ์ง€๋งŒ, ๋ฐ๋ชจ ๋ฐ์ดํ„ฐ์˜ ํ’ˆ์งˆ ๋ฐ ๊ฐ€์šฉ์„ฑ, ๊ทธ๋ฆฌ๊ณ  ํ›ˆ๋ จ ๋ถˆ์•ˆ์ •์„ฑ(mode collapse, gradient vanishing) ๋ฌธ์ œ์— ํฌ๊ฒŒ ์˜์กดํ•ฉ๋‹ˆ๋‹ค. Hindsight Experience Replay [77], semi-supervised correction [76], Sim-to-real transfer [78] ๋“ฑ์ด ๋ฐ์ดํ„ฐ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋ ค ์‹œ๋„ํ–ˆ์œผ๋ฉฐ, Variational Autoencoders [79], Wasserstein GAN [80], self-organizing generative model [82] ๋“ฑ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ›ˆ๋ จ ์•ˆ์ •์„ฑ์„ ๊ฐœ์„ ํ•˜๊ณ  Mode collapse๋ฅผ ์™„ํ™”ํ•˜๋ ค๋Š” ๋…ธ๋ ฅ์ด ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. GAIL์€ ์ ๋Œ€์  ํ›ˆ๋ จ์˜ ๊ทผ๋ณธ์ ์ธ ํ•œ๊ณ„๋ฅผ ์ƒ์†๋ฐ›์•„ ํ›ˆ๋ จ ๋ถˆ์•ˆ์ •์„ฑ ๋ฐ ๊ณ ์ฐจ์› ์•ก์…˜ ๊ณต๊ฐ„์œผ๋กœ์˜ ํ™•์žฅ ์–ด๋ ค์›€์— ์ง๋ฉดํ•ฉ๋‹ˆ๋‹ค.

1.4 Hierarchical Imitation Learning (HIL)

HIL์€ ๋ณต์žกํ•œ ์ž‘์—…์„ ๊ณ„์ธต์  ๊ตฌ์กฐ๋กœ ๋ถ„ํ•ดํ•˜์—ฌ ํ•ด๊ฒฐํ•˜๋„๋ก ์„ค๊ณ„๋œ IL ํ”„๋ ˆ์ž„์›Œํฌ์ž…๋‹ˆ๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ 2๋‹จ๊ณ„ ๊ณ„์ธต ๊ตฌ์กฐ๋ฅผ ์ฑ„ํƒํ•˜๋ฉฐ, ์ƒ์œ„ ์ˆ˜์ค€ ์ •์ฑ…์€ ํ˜„์žฌ ์ƒํƒœ ๋ฐ ์ž‘์—… ์š”๊ตฌ ์‚ฌํ•ญ์— ๋”ฐ๋ผ ํ•˜์œ„ ์ž‘์—… ๋˜๋Š” ์›์‹œ(primitives) ์‹œํ€€์Šค๋ฅผ ์ƒ์„ฑํ•˜๊ณ , ํ•˜์œ„ ์ˆ˜์ค€ ์ •์ฑ…์€ ํ•˜์œ„ ์ž‘์—…์„ ์‹คํ–‰ํ•˜์—ฌ ์ „์ฒด ๋ชฉํ‘œ๋ฅผ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ณ„์ธต์  ๋ถ„ํ•ด๋Š” ์˜์‚ฌ ๊ฒฐ์ • ๋ฐ ์ œ์–ด๋ฅผ ๋ถ„๋ฆฌํ•˜์—ฌ ์žฅ๊ธฐ์ ์ธ ๋ณต์žกํ•œ ์ž‘์—…์„ ๋ณด๋‹ค ํšจ๊ณผ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.

์ƒ์œ„ ์ •์ฑ… \pi_h๋Š” ๋ฏธ๋ฆฌ ์ •์˜๋œ ์›์‹œ ์ง‘ํ•ฉ \{p_1, \dots, p_K\}์—์„œ ์›์‹œ p_i๋ฅผ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค: \pi_h(s_t) = p_i. ํ•ด๋‹น ํ•˜์œ„ ์ •์ฑ… \pi_{p_i}๋Š” ์„ ํƒ๋œ ์›์‹œ๋ฅผ ์‹คํ–‰ํ•  ์•ก์…˜์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค: a_t = \pi_{p_i}(s_t). ์ „์ฒด ๋ชฉํ‘œ๋Š” ๋ˆ„์  ์†์‹ค ํ•จ์ˆ˜๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

L(\pi) = \sum_{t=1}^T E_{(s_t,a_t)\sim\pi}[\ell(s_t, a_t)]

HIL์˜ ์ฃผ์š” ์žฅ์ ์€ ์ž‘์—…์„ ๊ณ„์ธต์  ๊ตฌ์กฐ๋กœ ๋ถ„ํ•ดํ•˜์—ฌ ์ง์ ‘์ ์ธ ์•ก์…˜ ๊ณต๊ฐ„ ํƒ์ƒ‰์˜ ๋ณต์žก์„ฑ์„ ์ค„์ด๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

Work such as CompILE [88], HDR-IL [89], ARCH [90], XSkill [91], and LOTUS [92] has contributed to task decomposition, skill generalization, and long-horizon task handling. Recent studies have explored using play data [93, 94] to train both levels of policy efficiently. While HIL shows substantial advantages in task decomposition and skill generalization, it still struggles with adaptability in cross-modal skill generalization and with ensuring model robustness and continuity in dynamic environments.

1.5 Continual Imitation Learning (CIL)

CIL์€ ์ง€์† ํ•™์Šต(continual learning)๊ณผ IL์„ ํ†ตํ•ฉํ•˜์—ฌ ์—์ด์ „ํŠธ๊ฐ€ ๋™์ ์œผ๋กœ ๋ณ€ํ™”ํ•˜๋Š” ํ™˜๊ฒฝ์—์„œ ์ „๋ฌธ๊ฐ€ ํ–‰๋™์„ ๋ชจ๋ฐฉํ•จ์œผ๋กœ์จ ๊ธฐ์ˆ ์„ ์ง€์†์ ์œผ๋กœ ์Šต๋“ํ•˜๊ณ  ์ ์‘ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ฉ๋‹ˆ๋‹ค. ์—์ด์ „ํŠธ๋Š” ์ดˆ๊ธฐ ๋‹จ๊ณ„์—์„œ ์ „๋ฌธ๊ฐ€ ๋ฐ๋ชจ๋กœ๋ถ€ํ„ฐ ๊ธฐ๋ณธ ๊ธฐ์ˆ ์„ ํ•™์Šตํ•˜๊ณ , ์ดํ›„ ๋‹จ๊ณ„์—์„œ ์ ์ง„์ ์œผ๋กœ ์ง€์‹์„ ์ถ•์ ํ•˜๊ณ  ์ƒˆ๋กœ์šด ์ž‘์—…์ด๋‚˜ ํ™˜๊ฒฝ์— ์ ์‘ํ•˜๋ฉฐ ์ด์ „์— ์Šต๋“ํ•œ ๊ธฐ์ˆ ์„ ์žŠ์–ด๋ฒ„๋ฆด ์œ„ํ—˜์„ ์™„ํ™”ํ•ฉ๋‹ˆ๋‹ค.

CIL์—์„œ ์ •์ฑ… \pi๋Š” ์ด์ „์— ์ ‘ํ•œ ๋ชจ๋“  ์ž‘์—…์— ๋Œ€ํ•œ ๋ˆ„์  ๋ชจ๋ฐฉ ์†์‹ค์„ ์ตœ์†Œํ™”ํ•˜์—ฌ ์ตœ์ ํ™”๋ฉ๋‹ˆ๋‹ค:

L(\pi) = -\sum_{i=1}^t \lambda^{(i)} E_{(s^{(i)},a^{(i)})\sim \rho^{(i)}_{exp}}[\log \pi(a^{(i)} | s^{(i)})]

where \lambda^{(i)} is the weight assigned to each of the t tasks and \rho^{(i)}_{exp} is the distribution of expert state-action pairs for task i.

Early work [95] enabled switching between tasks without corrupting previously acquired skills, but it required substantial storage and compute. A variety of approaches have since been proposed: task-specific adapter structures [96], unsupervised skill discovery [92], unified policy learning via behavior distillation [97], Deep Generative Replay (DGR) [98], and self-supervised learning [99]. CIL focuses on effective multi-task learning, applying DGR techniques, and abstracting skills in a self-supervised way, but practical deployment challenges remain: the quality and consistency of generated data, resource consumption, and insufficient generalization for real-world applications.
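
The cumulative loss above can be written down directly; the uniform toy policy and the two replayed task batches below are illustrative assumptions:

```python
import numpy as np

def task_nll(logits, actions):
    """-E[log pi(a|s)] for one task under a softmax policy."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(actions)), actions].mean()

def cil_loss(task_batches, lambdas, policy):
    """L(pi) = sum_i lambda_i * E_{rho_exp^(i)}[-log pi(a|s)]."""
    return sum(lam * task_nll(policy(states), actions)
               for lam, (states, actions) in zip(lambdas, task_batches))

# Toy uniform policy over 2 actions, evaluated on 2 replayed task batches
policy = lambda S: np.zeros((len(S), 2))
batches = [(np.zeros((4, 3)), np.array([0, 1, 0, 1])),
           (np.zeros((4, 3)), np.array([1, 1, 0, 0]))]
loss = cil_loss(batches, lambdas=[0.5, 0.5], policy=policy)
```

In a real CIL system the per-task batches would come from stored or generatively replayed demonstrations, and the \lambda^{(i)} would balance old tasks against new ones.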

2 End Effectors for Dexterous Manipulation

DM์„ ์œ„ํ•œ End-effector๋Š” ํฌ๊ฒŒ ๋‘ ๊ฐ€์ง€ ๊ทธ๋ฆฌํผ(two-fingered grippers), ๋‹ค์ง€ ์ธ๊ฐ„ํ˜• ์†(multi-fingered anthropomorphic hands), ์„ธ ๊ฐ€์ง€ ๋กœ๋ด‡ ํด๋กœ(three-fingered robotic claws)๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค. ๋‘ ๊ฐ€์ง€ ๊ทธ๋ฆฌํผ๋Š” ์‹ ๋ขฐ์„ฑ, ๋‹จ์ˆœ์„ฑ, ์ œ์–ด ์šฉ์ด์„ฑ์œผ๋กœ ๋„๋ฆฌ ์‚ฌ์šฉ๋˜์ง€๋งŒ (์˜ˆ: Franka robot [104], ALOHA [53], Mobile ALOHA [112]), ์† ์•ˆ์—์„œ์˜ ๊ฐ์ฒด ์žฌ๊ตฌ์„ฑ ๋Šฅ๋ ฅ์ด ์ œํ•œ์ ์ด๊ณ  ์ธ๊ฐ„ ์†๊ณผ์˜ ํ˜•ํƒœํ•™์  ์ฐจ์ด๋กœ ์ธํ•ด ์ธ๊ฐ„ ๋ฐ๋ชจ๋กœ๋ถ€ํ„ฐ ํ•™์Šตํ•˜๋Š” ๋ฐ ๋ฐฉํ•ด๊ฐ€ ๋ฉ๋‹ˆ๋‹ค [115]. ๋‹ค์ง€ ์ธ๊ฐ„ํ˜• ์†์€ ์ธ๊ฐ„๊ณผ ์œ ์‚ฌํ•œ ํ˜•ํƒœ๋ฅผ ๊ฐ€์ง€๋ฉฐ ์ธ๊ฐ„์ด ์‚ฌ์šฉํ•˜๋„๋ก ์„ค๊ณ„๋œ ๊ฐ์ฒด์™€์˜ ์ƒํ˜ธ์ž‘์šฉ์— ๋” ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค [116].

๊ตฌ๋™ ๋ฉ”์ปค๋‹ˆ์ฆ˜์— ๋”ฐ๋ผ

  • Tendon-driven (์˜ˆ: Shadow Dexterous Hand [130]),
  • Linkage-driven (์˜ˆ: INSPIRE-ROBOTS RH56 [122]),
  • Direct-driven (์˜ˆ: Allegro Hand [125]),
  • Hybrid-transmission (์˜ˆ: DLR/HIT Hand II [179]) ๋ฐฉ์‹์œผ๋กœ ๋ถ„๋ฅ˜๋ฉ๋‹ˆ๋‹ค.

Tendon-driven์€ ๋†’์€ DoF์™€ Dexterity๋ฅผ ์ œ๊ณตํ•˜์ง€๋งŒ ๋งˆ์ฐฐ, ๋งˆ๋ชจ ๋“ฑ์˜ ๋ฌธ์ œ๊ฐ€ ์žˆ๊ณ , Linkage-driven์€ ์ •๋ฐ€ํ•˜๊ณ  ๊ฐ•๊ฑดํ•˜์ง€๋งŒ DoF๊ฐ€ ์ ์€ ๊ฒฝํ–ฅ์ด ์žˆ์Šต๋‹ˆ๋‹ค. Direct-driven์€ ์ œ์–ด ์ •๋ฐ€๋„๊ฐ€ ๋†’์ง€๋งŒ ์งˆ๋Ÿ‰, ๊ด€์„ฑ ์ฆ๊ฐ€์˜ ๋‹จ์ ์ด ์žˆ์œผ๋ฉฐ, Hybrid ๋ฐฉ์‹์€ ์—ฌ๋Ÿฌ ๋ฐฉ์‹์˜ ์žฅ์ ์„ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์†๋“ค์€ ๋†’์€ Dexterity๋ฅผ ์ œ๊ณตํ•˜์ง€๋งŒ ๋ณต์žก์„ฑ, ๋น„์šฉ, ๊ณ ์žฅ ์ทจ์•ฝ์„ฑ ๋“ฑ์˜ ๊ณผ์ œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์„ธ ๊ฐ€์ง€ ๋กœ๋ด‡ ํด๋กœ (์˜ˆ: DEX-EE [204], BarrettHand [208])๋Š” ๋‘ ๊ฐ€์ง€ ๊ทธ๋ฆฌํผ์™€ ๋‹ค์ง€ ์ธ๊ฐ„ํ˜• ์† ์‚ฌ์ด์˜ ์ ˆ์ถฉ์•ˆ์œผ๋กœ, ์ผ๋ฐ˜์ ์ธ grasping ์œ ํ˜•๊ณผ ์ œํ•œ์ ์ธ in-hand manipulation์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.

3 Teleoperation Systems and Data Collection

Teleoperation systems provide an interface for human-robot collaboration, letting robot behavior follow human-level intelligence. They are highly intuitive because a human operator can draw on broad knowledge and experience to judge diverse tasks in complex scenes and respond quickly to feedback. Collecting robot states and the corresponding actions during teleoperation builds datasets for end-to-end IL. A teleoperation system consists of a local site (human operator, I/O devices) and a remote site (robot, sensors). I/O devices for DM include cameras [17], mocap gloves [16], VR/AR controllers [14], and exoskeletons and bilateral systems [53].

  • Vision-based systems estimate hand pose with computer vision but are vulnerable to occlusion, lighting changes, and similar issues. Systems such as TeachNet [222], Dexpilot [18], Robotic Telekinesis [17], AnyTeleop [19], and ACE [221] have been developed, and there is also work [20] on resolving the morphological mismatch between human and robot hands.

  • Mocap gloves track human hand motion directly and precisely through embedded sensors [16]. They are expensive but make data collection efficient.

  • VR/AR controllers provide an immersive environment and are explored as low-cost solutions [14, 234]. Simulation [245], mixed reality [234], and haptic feedback integration [235] have all been attempted.

  • Exoskeleton and bilateral systems focus on joint-space control, avoiding inverse kinematics (IK) computation [239, 240, 241]. Leader-follower structures provide force feedback [53, 242, 243].

  • Retargeting [19, 221] allows demonstration data to be shared across different robot platforms. Major datasets include MIME [250], RH20T [251], BridgeData [252, 253], and DROID [254], which provide large-scale demonstrations across diverse tasks and environments. Data augmentation [255, 256] and demonstration-generation systems [257, 258, 259] help reduce collection cost and increase data diversity. ARCTIC [260], DexGraspNet [261], and OAKINK2 [262] are datasets focused specifically on bimanual manipulation and hand-object interaction.
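
As a rough illustration of what retargeting involves, the sketch below scales human fingertip positions (in the wrist frame) onto a robot hand and clips them to a reachable radius. Real systems such as AnyTeleop instead solve an optimization over robot joint angles; the hand lengths and workspace radius here are assumed values:

```python
import numpy as np

HUMAN_HAND_LENGTH = 0.19   # meters, assumed
ROBOT_HAND_LENGTH = 0.25   # meters, assumed

def retarget(human_tips, reach=0.3):
    """Scale human fingertip targets to the robot hand, clipped to its workspace."""
    scale = ROBOT_HAND_LENGTH / HUMAN_HAND_LENGTH
    robot_tips = human_tips * scale
    norms = np.linalg.norm(robot_tips, axis=1, keepdims=True)
    # Project any target outside the reachable sphere back onto its surface.
    return np.where(norms > reach, robot_tips * reach / norms, robot_tips)

human_tips = np.array([[0.08, 0.02, 0.01],    # thumb tip, wrist frame
                       [0.17, 0.01, 0.00]])   # index tip
robot_tips = retarget(human_tips)
```

The scaled targets would then feed the robot hand's IK or a learned controller; joint-space retargeting additionally has to respect joint limits and self-collision, which this sketch ignores.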

4 Challenges and Future Directions

IL ๊ธฐ๋ฐ˜ DM์€ ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ ๋ฐ ์ƒ์„ฑ, ๋ฒค์น˜๋งˆํ‚น ๋ฐ ์žฌํ˜„์„ฑ, ์ƒˆ๋กœ์šด ํ™˜๊ฒฝ์œผ๋กœ์˜ ์ผ๋ฐ˜ํ™”, ์‹ค์‹œ๊ฐ„ ์ œ์–ด, ์•ˆ์ „์„ฑ, ๊ฐ•๊ฑด์„ฑ ๋ฐ ์‚ฌํšŒ์  ์ค€์ˆ˜ ์ธก๋ฉด์—์„œ ์—ฌ๋Ÿฌ ๋„์ „ ๊ณผ์ œ์— ์ง๋ฉดํ•ด ์žˆ์Šต๋‹ˆ๋‹ค.

  • Data Collection and Generation: heterogeneous data fusion; the difficulty of securing data quantity, quality, and diversity; high-dimensional data sparsity; and collection cost are the main problems. Future directions include multi-modal alignment techniques, cross-embodiment learning frameworks, synthetic data augmentation, domain randomization, generative models, crowdsourced teleoperation, self-supervised learning, standardized data-collection protocols, improved sim-to-real fidelity, differentiable physics engines, adaptive parameter tuning, and self-supervised real-to-sim refinement.
  • Benchmarking and Reproducibility: dependence on real-world hardware experiments and the variability of simulation environments make benchmarking and reproducing results difficult. Needed are standardized benchmarking frameworks and open-source datasets, consistent simulation physics parameters and environment representations, multi-modal data recording across diverse robot morphologies, and standard evaluation protocols.
  • Generalization to Novel Setups: task and environment variability, the limits of adaptive learning in traditional IL, sim-to-real transfer issues, and a lack of cross-embodiment adaptability are the core problems. Future directions include adaptive and continual learning frameworks (meta-learning, RL fine-tuning), uncertainty-aware models, more realistic physics simulation, hybrid learning approaches, morphology-agnostic policy learning, graph-based and latent-space representations, modular policy architectures, and few-shot adaptation.
  • Real-Time Control: computational complexity arising from high-dimensional action spaces and complex dynamics is the main problem. Needed are efficient use of model-based (MPC) and model-free (RL) methods, hybrid control strategies, accelerated learning techniques, and improved hardware architectures such as high-performance computing hardware (GPUs, TPUs), edge computing, custom ASICs, and neuromorphic computing.
  • Safety, Robustness, and Social Compliance: error detection and recovery, safety measures (collision avoidance, force regulation), and compliance with social norms are critical. Proposed future directions include large-scale failure datasets and standardized benchmarks, self-supervised multi-modal anomaly detection, robust policy training and benchmarking, compliant actuators and soft robotic designs for collision mitigation, socially compliant learning for integration into human-centered environments, interactive learning paradigms, multi-modal human-robot interaction datasets, and standardized social-compliance benchmarks.

5 Conclusion

IL์€ ๋กœ๋ด‡์ด ์ธ๊ฐ„๊ณผ ์œ ์‚ฌํ•œ ๊ธฐ์ˆ ๊ณผ ์ •๋ฐ€๋„๋กœ DM ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๋ฐ ์ƒ๋‹นํ•œ ๊ฐ€๋Šฅ์„ฑ์„ ๋ณด์—ฌ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘, ์ผ๋ฐ˜ํ™”, ์‹ค์‹œ๊ฐ„ ์ œ์–ด, ์•ˆ์ „์„ฑ, Sim-to-real transfer์™€ ๊ด€๋ จ๋œ ๋ฌธ์ œ๊ฐ€ ์‹ค์งˆ์ ์ธ ๋ฐฐํฌ๋ฅผ ๊ฐ€๋กœ๋ง‰๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ถ„์•ผ์˜ ๋ฐœ์ „์„ ์œ„ํ•ด์„œ๋Š” ์ตœ์ ํ™”๋œ IL ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฐœ๋ฐœ, ์ธ๊ฐ„-๋กœ๋ด‡ ํ˜‘์—… ๊ฐ•ํ™”, ์ฒจ๋‹จ ์„ผ์„œ ์‹œ์Šคํ…œ ํ†ตํ•ฉ์— ์ดˆ์ ์„ ๋งž์ถ˜ ๋ฏธ๋ž˜ ์—ฐ๊ตฌ๊ฐ€ ํ•„์ˆ˜์ ์ž…๋‹ˆ๋‹ค. DM์˜ ๋ฏธ๋ž˜๋Š” ์‚ฐ์—… ์ž๋™ํ™”๋ถ€ํ„ฐ ํ—ฌ์Šค์ผ€์–ด ๋ฐ ์„œ๋น„์Šค ๋กœ๋ด‡์— ์ด๋ฅด๊ธฐ๊นŒ์ง€ ํฐ ์ž ์žฌ๋ ฅ์„ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฉฐ, IL ๋ฐ ๋กœ๋ด‡ ์กฐ์ž‘์˜ ๊ฒฝ๊ณ„๋ฅผ ๊ณ„์† ํ™•์žฅํ•จ์œผ๋กœ์จ ๋” ์œ ๋Šฅํ•˜๊ณ  ์ ์‘ ๊ฐ€๋Šฅํ•˜๋ฉฐ ์ง€๋Šฅ์ ์ธ ๋กœ๋ด‡ ์‹œ์Šคํ…œ์˜ ๊ธธ์„ ์—ด ์ˆ˜ ์žˆ์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

Copyright 2024, Jung Yeon Lee