

๐ŸงฉDexterous VLAs

vla
dexterity
hand
2025
Survey of the latest VLA models using dexterous hands
Published: December 12, 2025

1 Recent VLA-Based Dexterous Manipulation Models

2023-2025

1.1 RT-2 (2023, DeepMind)

RT-2, proposed by Google DeepMind, is the first Vision-Language-Action (VLA) model, demonstrating how internet-scale vision-language data can be transferred directly to robotic control. Built on large pretrained vision-language models such as PaLI-X and PaLM-E, it is co-trained to take the robot's camera observations and a natural-language instruction and emit actions in the form of text tokens[1]. By jointly training on robot trajectory data and internet vision-language tasks such as VQA and image captioning, the model handles novel instructions (e.g., "pick up the smallest object") and semantically grounded behavior (e.g., "pick up the bag at the edge of the table") without any architectural changes. In its main evaluation of over 6,000 trials, RT-2 achieved high success rates and strong generalization across diverse manipulation tasks, remaining robust to novel objects and goals[1]. (Code: not released)
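
RT-2's central trick, emitting actions as text tokens, amounts to uniform discretization of each action dimension. Below is a minimal sketch assuming a symmetric [-1, 1] action range and 256 bins (RT-2's reported bin count); the function names are illustrative, not from the paper's code.

```python
# Hypothetical sketch of RT-2-style action tokenization: each continuous
# action dimension is mapped to one of 256 bins, and the bin index becomes
# a token the language model can predict like any other text token.

def tokenize_action(action, low=-1.0, high=1.0, n_bins=256):
    """Map each action dimension to a discrete bin index in [0, n_bins-1]."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)          # clamp to the valid range
        frac = (a - low) / (high - low)     # normalize to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens

def detokenize_action(tokens, low=-1.0, high=1.0, n_bins=256):
    """Recover the bin-center continuous value for each token."""
    return [low + (t + 0.5) / n_bins * (high - low) for t in tokens]
```

Round-tripping loses at most half a bin width per dimension, which is why a modest bin count suffices for control.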

1.2 Octo (2024, UC Berkeley et al.)

Octo is an open-source generalist robot policy: a Transformer-based diffusion policy designed to support many robots and tasks. It was pretrained on more than 800k robot demonstrations drawn from the Open X-Embodiment dataset[2], and supports nine real robot platforms (e.g., WidowX, UR5, and a dexterous hand) with varied sensor configurations. Given a language command or a goal image, it outputs a distribution over actions, and fine-tuning runs efficiently on a standard consumer GPU. In experiments across six benchmark setups, including BridgeV2, Stanford Coffee, and Peg Insert, Octo improved average success rates over RT-1-X by more than 52% and matched the 55B-parameter RT-2-X (language mode)[3][4]. Goal-image conditioning proved especially effective (+25% over language conditioning), and the model adapted quickly to new observations (e.g., force-torque sensing) and new action spaces (e.g., joint-position control)[3][4]. (Code & model: released[2])
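
Octo's diffusion head samples actions by iterative denoising rather than a single forward pass. The toy loop below only sketches that idea: `denoiser` is a stand-in for the trained network (which in Octo conditions on transformer embeddings of the language or goal-image context), and the noise schedule is invented for illustration.

```python
import random

# Toy sketch of sampling from a diffusion action head: start from Gaussian
# noise and repeatedly subtract a fraction of the predicted noise. The
# schedule and update rule are simplified stand-ins, not Octo's exact ones.

def sample_action(denoiser, context, dim=7, n_steps=20):
    x = [random.gauss(0.0, 1.0) for _ in range(dim)]   # pure noise at start
    for step in reversed(range(n_steps)):
        eps_hat = denoiser(x, context, step)           # predicted noise
        alpha = 1.0 - step / n_steps                   # toy schedule in (0, 1]
        x = [xi - (1.0 - alpha) * ei for xi, ei in zip(x, eps_hat)]
    return x
```

The key property is that the policy represents a full action distribution, so multimodal demonstrations (two valid grasps, say) do not average into an invalid one.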

1.3 OpenVLA (2024, Stanford et al.)

OpenVLA is a 7-billion-parameter open-source VLA model that combines a Llama 2 language backbone with a fused vision encoder built from DINOv2 and SigLIP features[5]. It was pretrained on 970k real robot manipulation demonstrations[5] and generalizes across 29 tasks and multiple robot embodiments. The resulting policy outperforms RT-2-X (55B parameters) by more than 16.5 percentage points in absolute success rate[6], and is particularly strong on multi-object scenes and complex language instructions. It also far exceeds from-scratch imitation baselines in generalization settings (multiple objects, complex language), and the authors show it can be fine-tuned easily on a single GPU using low-rank adaptation methods such as LoRA[7]. The model, fine-tuning code, and Open X-Embodiment tooling are all public, fueling follow-up research[8]. (Code & model: released[9])
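
The LoRA fine-tuning the authors report can be summarized numerically: the pretrained weight matrix W stays frozen while a trainable low-rank product B·A is added to its output, so only a small fraction of parameters receives gradients. A minimal sketch with plain lists (shapes and scaling are illustrative assumptions):

```python
# Minimal numeric sketch of low-rank adaptation (LoRA): y = W x + s * B(A x),
# where W is frozen and only the small matrices A (r x d) and B (d_out x r)
# are trained. Here r=1 to keep the arithmetic readable.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, scale=1.0):
    """Frozen base path plus scaled low-rank update; A, B are the adapters."""
    base = matvec(W, x)                      # frozen pretrained path
    delta = matvec(B, matvec(A, x))          # trainable rank-r correction
    return [b + scale * d for b, d in zip(base, delta)]
```

With d = d_out = 4096 and r = 16, the adapters add ~131k parameters per matrix versus ~16.8M in W itself, which is why single-GPU fine-tuning becomes practical.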

1.4 RoboMamba (2024, CMU et al.)

RoboMamba is a VLA model that brings the efficiency of the Mamba state-space model (SSM) to robot perception and reasoning[10]. An image encoder is combined with a Mamba backbone and pretrained to align visual tokens with language embeddings, strengthening both general visual common sense and robot-specific reasoning. A lightweight policy head for SE(3) pose prediction is then attached, and fine-tuning only about 0.1% of the model's parameters suffices to learn complex manipulation skills[10]. In experiments, RoboMamba scored highly on general and robot-focused reasoning benchmarks and predicted continuous manipulation poses accurately. Notably, its inference is reported to be roughly 3× faster than comparable VLA models[10], and it completed a range of physical tasks such as box assembly and table tidying. (Code: not released)
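
The efficiency claim traces back to the state-space recurrence at Mamba's core, which scans a sequence in linear time instead of attention's quadratic pairwise cost. A scalar toy version of such a recurrence is shown below; the coefficients are made up, and real Mamba additionally makes them input-dependent ("selective"):

```python
# Toy linear state-space scan: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
# One pass over the sequence with O(1) state, versus attention's O(n^2)
# token-pair interactions -- the source of the reported inference speedup.

def ssm_scan(inputs, a=0.9, b=0.5, c=1.0):
    """Scalar state-space recurrence over a 1-D input sequence."""
    h, outputs = 0.0, []
    for x in inputs:
        h = a * h + b * x          # state update (decay + input injection)
        outputs.append(c * h)      # readout
    return outputs
```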

1.5 π₀ (2024, Physical Intelligence)

π₀ is a VLA policy built on the pretrained PaliGemma vision-language model; it consolidates data from many robot platforms into a single Transformer to learn general-purpose control[11]. Pretraining used large-scale trajectory data from diverse robots, including single-arm, dual-arm, and mobile manipulators[11], covering long-horizon composite tasks such as laundry folding, table cleaning, and box assembly. Architecturally, it is a flow-matching model that maps vision-language inputs to action outputs, designed to model continuous action distributions precisely. π₀ can follow language commands zero-shot, and after pretraining it acquires new skills quickly from small amounts of data[11]. (Code: not released)
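
Flow matching generates actions by integrating a learned velocity field that transports noise samples at t=0 to the action distribution at t=1. The sketch below Euler-integrates a hand-written linear field as a stand-in for the trained network; the field and step count are illustrative assumptions.

```python
# Hedged sketch of flow-matching sampling: integrate dx/dt = v(x, t) from
# t=0 to t=1 with Euler steps. `velocity` stands in for the learned network.

def flow_sample(velocity, x0, n_steps=100):
    """Euler-integrate the velocity field over the unit time interval."""
    x, dt = list(x0), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def make_field(goal):
    """Illustrative field v = goal - x that pulls any start toward `goal`."""
    return lambda x, t: [g - xi for g, xi in zip(goal, x)]
```

Unlike token-by-token action decoding, the whole continuous action chunk emerges from one integration, which suits high-rate dexterous control.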

1.6 Being-H0 (2025, Tsinghua et al.)

Being-H0 is the first large-scale VLA-based dexterous manipulation model trained from videos of human hand sequences, bridging human motion and robot control through a unified motion representation[12][13]. Its two key techniques are (1) Physical Instruction Tuning, which pairs human hand sequences with natural-language instructions and uses them as pretraining data so the model learns to infer intent from human demonstrations, and (2) Part-Level Motion Tokenization, which quantizes robot joint data and 3D hand articulation into a shared representation space so fine-grained motion can be integrated into the vision-language model. Pretraining was performed at scale on the UniHand dataset of 150M samples (motion capture, VR recordings, RGB video, etc.)[12][13]. With this design, Being-H0 performs diverse manipulation tasks such as fingered manipulation and object rearrangement in human-robot simulation settings (e.g., Shadow Hand), translating vision-language context into hand motion. (Paper: arXiv[12]) (Code: not released)
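
Part-level motion tokenization can be pictured as vector quantization: a continuous pose vector is replaced by the index of its nearest codebook entry, so human and robot motion end up sharing one discrete vocabulary. The tiny 2-D codebook below is a hypothetical stand-in for the codebook Being-H0 learns from data.

```python
# Illustrative nearest-neighbor vector quantizer for motion tokenization:
# a continuous pose becomes the index of the closest codebook vector.

def quantize(pose, codebook):
    """Return (index, code) of the nearest codebook vector by squared L2."""
    best_i, best_d = 0, float("inf")
    for i, code in enumerate(codebook):
        d = sum((p - c) ** 2 for p, c in zip(pose, code))
        if d < best_d:
            best_i, best_d = i, d
    return best_i, codebook[best_i]
```

Once motion is discrete, the same cross-entropy training used for text applies unchanged to motion tokens.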

1.7 DexVLG (2025, CMU et al.)

DexVLG is a large-scale VLA model for dexterous grasping: given a single-view RGB-D observation and a language instruction, it predicts part-aligned grasp poses for a multi-fingered robot hand[14]. It was trained on DexGraspNet 3.0, a simulation-generated dataset of roughly 170M dexterous grasp poses over 174K objects with per-part captions[15], and generates continuous hand poses through a flow-matching head. In simulation benchmarks and real-robot experiments, DexVLG showed strong zero-shot generalization, reaching over 76% success in simulation with state-of-the-art part-alignment accuracy[16], and it succeeded at grasping specified object parts with a real robot hand. (Code: not yet released)

1.8 METIS (2025, Peking Univ.)

METIS is a VLA model trained on multi-source egocentric video, built on EgoAtlas, a dataset that unifies human and robot manipulation data[17]. EgoAtlas merges large-scale human demonstrations from the internet (e.g., EgoDex, H2O) and robot demonstrations (e.g., ActionNet) into a single action space. On top of this data, METIS introduces motion-aware dynamics, a discretized motion representation, to learn visual and motor information jointly[17]. The result is strong performance on real dexterous manipulation, with the highest average success rate across six real-world tasks[17]. It also generalizes robustly to environment changes and novel objects, showing promise as a generalist VLA model for dexterous manipulation. (Code: release planned)

1.9 Shake-VLA (2025, NRC et al., HRI 2025)

Shake-VLA is a VLA-based application system for a bartending robot[18]. Designed to mix cocktails with two robot arms, it integrates a vision module (recognizing ingredient bottles and reading labels), a speech-to-text module (interpreting spoken user commands), and a language model (generating tailored manipulation commands). It uses Retrieval-Augmented Generation (RAG) to look up recipe knowledge and an ingredient-mismatch (anomaly detection) step to flag missing ingredients. A force-torque sensor improves the accuracy of liquid measurement. In experiments the system achieved 93% speech-recognition and 91% vision accuracy, and an overall 100% cocktail-preparation success rate[18][19]. Shake-VLA is thus an early example of bimanual manipulation driven by integrated visual and language information in a real environment. (Paper: arXiv[18]) (Code: not released)
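
The retrieval step of a RAG pipeline like the recipe lookup described here reduces to ranking stored documents by similarity to the query embedding and handing the best match to the language model. A minimal sketch using a bag-of-words "embedding" as a stand-in for a learned encoder (the recipes are invented examples):

```python
import math

# Minimal RAG retrieval sketch: embed query and documents, rank by cosine
# similarity, return the best-matching document for the LLM prompt.

def embed(text):
    """Toy bag-of-words embedding: word -> count."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents):
    """Return the stored document most similar to the query."""
    q = embed(query)
    return max(documents, key=lambda d: cosine(q, embed(d)))
```

A production system would swap in a learned sentence encoder and a vector index, but the ranking logic is the same.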

1.10 Scaffolding (2025, Stanford et al.)

Scaffolding์€ ๋‹จ๊ณ„์  ์กฐ์ž‘ ํ•™์Šต์„ ์œ„ํ•ด VLM์„ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•๋ก ์ž…๋‹ˆ๋‹ค. ์–ธ์–ด ์ง€์‹œ์— ๋”ฐ๋ผ ์˜์ƒ์—์„œ ํ•ต์‹ฌ 2D ํ‚คํฌ์ธํŠธ(์˜ˆ: ๋ฌผ์ฒด ์†์žก์ด, ๋ฒ„ํŠผ ๋“ฑ)๋ฅผ ์ถ”์ถœํ•˜๊ณ , ์ด๋ฅผ 3D ๊ถค์ ์œผ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ๊ณ ์ˆ˜์ค€ ๊ณ„ํš์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค[20][21]. ์ƒ์„ฑ๋œ 3D ํ‚คํฌ์ธํŠธ ๊ฒฝ๋กœ(์†๋ชฉ๊ณผ ๊ฐ์ฒด ์›€์ง์ž„)๋ฅผ ์ €์ˆ˜์ค€ ์ œ์–ด(RL)๊ฐ€ ์ถ”์ ํ•˜๋„๋ก ํ•™์Šตํ•จ์œผ๋กœ์จ, ์žฅ๊ธฐ๊ฐ„์˜ ๋ณตํ•ฉ ์ž‘์—…(์˜ˆ: ๋ง์น˜์งˆ, ๋ฌธ์˜ ์†์žก์ด๋ฅผ ๋Œ๋ฆฌ๋Š” ์ž‘์—… ๋“ฑ)์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, VLM์ด ์–ธ์–ด์™€ ์˜์ƒ์„ ๋ฐ”ํƒ•์œผ๋กœ ๊ฑฐ์นœ ๊ฒฝ๋กœ๋ฅผ ์ œ์‹œํ•˜๊ณ , ๊ฐ•ํ™”ํ•™์Šต ์—์ด์ „ํŠธ๊ฐ€ ์ด๋ฅผ ์„ธ๋ฐ€ํžˆ ์ˆ˜ํ–‰ํ•˜๋Š” ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์ •์ฑ… ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค[20][21]. ์ดˆ๊ธฐ ์‹คํ—˜์—์„œ ์ด ๋ฐฉ๋ฒ•์€ VLM์œผ๋กœ๋ถ€ํ„ฐ ์–ป์€ ๊ณ„ํš์„ ๋”ฐ๋ผ ํšจ๊ณผ์ ์ธ ์กฐ์ž‘์„ ํ•™์Šตํ•˜๋ฉฐ, ์ธ๊ฐ„ ์ˆ˜์ค€์˜ ์ถ”๊ฐ€ ์–ธ์–ด ์ง€์‹œ ์—†์ด๋„ ๋‹ค์–‘ํ•œ dexterous ๊ณผ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค. (๋…ผ๋ฌธ ๋งํฌ: arXiv[20][21]) (์ฝ”๋“œ ๊ณต๊ฐœ: ๋ฏธ๊ณต๊ฐœ)

1.11 Model Comparison Table

๋ชจ๋ธ Vision/์–ธ์–ด ๋ชจ๋ธ ํ•™์Šต ๋ฐ์ดํ„ฐ ์ง€์› ํ”Œ๋žซํผ/์กฐ์ž‘ ์œ ํ˜• ๋ฒค์น˜๋งˆํฌ ์„ฑ๋Šฅ ์ง€ํ‘œ (์„ฑ๊ณต๋ฅ  ๋“ฑ) ์ฝ”๋“œยท๋ฐ๋ชจ ๊ณต๊ฐœ ์—ฌ๋ถ€
RT-2 (2023) ๋Œ€๊ทœ๋ชจ VLM (Google PaLI-X/PaLM-E ๊ธฐ๋ฐ˜) ๋กœ๋ด‡ ๊ถค์  ๋ฐ์ดํ„ฐ + ์ธํ„ฐ๋„ท ๋น„์ „-์–ธ์–ด ํƒœ์Šคํฌ[1] ๋‹จ์ผ/์ด์ค‘ ํŒ” ๋กœ๋ด‡ (Franka, WidowX ๋“ฑ), ๋‹ค์–‘ํ•œ ํ”ฝ&ํ”Œ๋ ˆ์ด์Šค ์ž‘์—… RT-1 ์ž‘์—… ๋ฒค์น˜๋งˆํฌ ๋“ฑ ์ œ๋กœ์ƒท ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ ์šฐ์ˆ˜, Emergent reasoning ํš๋“[1] ๋น„๊ณต๊ฐœ
Octo (2024) Transformer (diffusion) w/์–ธ์–ด+๋น„์ „ ์ž…๋ ฅ Open X-Embod (800k ์—ํ”ผ์†Œ๋“œ)[2] 9๋Œ€ ๋กœ๋ด‡ (WidowX, UR5 ๋“ฑ), ๋‹จ์ผ/์ด์ค‘ ํŒ”, ๋‹ค์–‘ํ•œ ์ž‘์—… (๋ธŒ๋ฆฟ์ง€, ์ปคํ”ผ ์ œ์กฐ ๋“ฑ) 6๊ฐœ ๋กœ๋ด‡ ์ž‘์—… ๋ฒค์น˜๋งˆํฌ RT-1-X ๋Œ€๋น„ ํ‰๊ท  ์„ฑ๊ณต๋ฅ  +52% (ํ‰๊ท  0.72), RT-2-X์™€ ์œ ์‚ฌ ์„ฑ๋Šฅ[3][4] ๊ณต๊ฐœ (GitHub/HF)[2]
OpenVLA (2024) Llama2 (์–ธ์–ด) + DINOv2/SigLIP (๋น„์ „) ์‹ค์ œ ๋กœ๋ด‡ ์กฐ์ž‘ 970k ์‹œ์—ฐ ๋ฐ์ดํ„ฐ[5] ๋‹ค์–‘ํ•œ ๋กœ๋ด‡(29 ์ž‘์—…, ๋ณต์ˆ˜ ์—”๋ฐ”๋””๋จผํŠธ) 29๊ฐœ ์ž‘์—… (Open X-Embod) RT-2-X ๋Œ€๋น„ ์„ฑ๊ณต๋ฅ  +16.5% (SOTA ๋‹ฌ์„ฑ)[6] ๊ณต๊ฐœ (GitHub)[8]
RoboMamba (2024) Vision encoder + Mamba SSM ๋น„๊ณต๊ฐœ (์ผ๋ฐ˜ ๋กœ๋ด‡ ๋ฐ์ดํ„ฐ) ์ผ๋ฐ˜ ๋กœ๋ด‡ ์กฐ์ž‘ (ํฌ์ฆˆ ์˜ˆ์ธก), ํ‘œ์ค€ ์‹œ๋ฎฌ/์‹คํ—˜ ๋กœ๋ด‡ ์ถ”๋ก  ๋ฒค์น˜ + ํฌ์ฆˆ ์˜ˆ์ธก ๊ธฐ์กด VLA ๋Œ€๋น„ 3๋ฐฐ ๋น ๋ฅธ ์ถ”๋ก  ์†๋„[10], 0.1% ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฏธ์„ธ์กฐ์ •์œผ๋กœ ์กฐ์ž‘ ํ•™์Šต ๋น„๊ณต๊ฐœ
ฯ€โ‚€ (2024) PaliGemma VLM (Vision+์–ธ์–ด) ๋‹ค์ˆ˜ ๋กœ๋ด‡ ๋ฐ์ดํ„ฐ (7์ข… ๋กœ๋ด‡ 68๊ฐœ ์ž‘์—… + OXE 22๋กœ๋ด‡)[22] ๋‹ค์–‘ํ•œ ๋กœ๋ด‡(์‹ฑ๊ธ€์•”, ๋“€์–ผ์•”, ๋ชจ๋ฐ”์ผ), ์žฅ๊ธฐ/๋‹ค๋‹จ๊ณ„ ์ž‘์—… (์„ธํƒ, ์ฒญ์†Œ ๋“ฑ) ์„ธํƒ๋ฌผ ๊ฐœ๊ธฐ, ํ…Œ์ด๋ธ” ์ •๋ฆฌ ๋“ฑ ๊ธด ํ˜ธ๋ผ์ด์ฆŒ ์ž‘์—…(์ˆ˜์‹ญ ๋ถ„) ์ˆ˜ํ–‰ ๊ฐ€๋Šฅ, ์ œ๋กœ์ƒท/๋ฏธ์„ธ์กฐ์ •์—์„œ ์šฐ์ˆ˜ํ•œ ์œ ์—ฐ์„ฑ[11] ๋น„๊ณต๊ฐœ
Being-H0 (2025) Vision-Lang Transformer (์ž์ฒด ๋ชจ๋ธ) ์ธ๊ฐ„ ์† ๊ด€๋ จ ๋Œ€๊ทœ๋ชจ ์˜์ƒยท๋ชจ์…˜ (UniHand 150M+ ๋ฐ์ดํ„ฐ)[12][13] ํœด๋จผ ํ•ธ๋“œ ๊ธฐ๋ฐ˜ ๋‹ค์ง€์ ˆ ์† (Shadow Hand ๋“ฑ), ๋ฌผ์ฒด ์กฐ์ž‘ยท์žฌ๋ฐฐ์น˜ DexGraspNet3.0 ํŒŒํŠธ ์žก๊ธฐ ์„ธ๋ฐ€ํ•œ ์†๋™์ž‘ ์ธ์ง€ยท์ƒ์„ฑ, ์–ธ์–ด-๋น„์ „ ํ†ตํ•ฉ ์ œ์–ด ๋Šฅ๋ ฅ ํš๋“[12][13] ๋น„๊ณต๊ฐœ
DexVLG (2025) ๋น„์ „-์–ธ์–ด ๋ชจ๋ธ + Flow-Matching(์ถœ๋ ฅ) DexGraspNet 3.0 (170M ๊ทธ๋ฆฝ ํฌ์ฆˆ, 174K ๋ฌผ์ฒด)[15] ํœด๋จธ๋…ธ์ด๋“œ ์†(hand) ๊ธฐ๋ฐ˜ dexterous ๊ทธ๋ฆฝ (ํ…Œ์ด๋ธ”top ๊ฐ์ฒด ํŒŒํŠธ) ํŒŒํŠธ-์–ผ๋ผ์ธ ์žก๊ธฐ (์‹œ๋ฎฌ/์‹ค์ œ) ์ œ๋กœ์ƒท ์„ฑ๊ณต๋ฅ  76%โ†‘, ์‹œ๋ฎฌ SOTA ํŒŒํŠธ-๊ทธ๋ฆฝ ์ •ํ™•๋„[16] ๋น„๊ณต๊ฐœ
METIS (2025) Vision-Language Transformer EgoAtlas (๋‹ค์ค‘ ์ถœ์ฒ˜ Egocentric ๋ฐ์ดํ„ฐ)[17] SharpaWave 22-DoF dexterous hand, ๋‹จ์ถ•ยท์žฅ์ถ• ์ž‘์—… (ํ”ฝ&ํ”Œ๋ ˆ์ด์Šค, ์ƒ์žํฌ์žฅ ๋“ฑ) 6๊ฐœ ์‹ค์ œ dexterous ๊ณผ์ œ 6๊ฐœ ์ž‘์—… ์ตœ๊ณ  ํ‰๊ท  ์„ฑ๊ณต๋ฅ  ๊ธฐ๋ก, ๊ฐ•์ธํ•œ OOD ์ผ๋ฐ˜ํ™”[17] ๊ณต๊ฐœ ์ค€๋น„ ์ค‘
Shake-VLA (2025) Object Detector + ์Œ์„ฑ์ธ์‹ + LLM ์กฐ๋ฆฌ ๋ ˆ์‹œํ”ผ DB + ์‹ค์‹œ๊ฐ„ ์ด๋ฏธ์ง€/์Œ์„ฑ ๋ฐ์ดํ„ฐ ์–‘ํŒ” ๋ฐ”ํ…๋” ๋กœ๋ด‡, ์•ก์ฒด ํ˜ผํ•ฉ (์นตํ…Œ์ผ ์ œ์กฐ) ์นตํ…Œ์ผ ์ œ์กฐ ํŒŒ์ดํ”„๋ผ์ธ ์Œ์„ฑ์ธ์‹ 93%, ๋น„์ „ 91% ์ •ํ™•๋„, ์ „์ฒด ์‹œ์Šคํ…œ 100% ์„ฑ๊ณต๋ฅ [18][19] ๋น„๊ณต๊ฐœ
Scaffolding (2025) Off-the-shelf VLM (์˜ˆ: GPT-4) ์‚ฌ์ „ ํ•™์Šต๋œ VLM + RL ์‹œ๋ฎฌ๋ฐ์ดํ„ฐ Dexterous ์† ๋กœ๋ด‡ (๋ง์น˜์งˆ, ๋„์–ด ํ•ธ๋“ค ์กฐ์ž‘, Semantic ํ”ฝ&ํ”Œ๋ ˆ์ด์Šค) ์‚ฌ์šฉ์ž ์ •์˜ ์กฐ์ž‘ ๊ณผ์ œ VLM ๊ธฐ๋ฐ˜ 3D ๊ฒฝ๋กœ ๊ณ„ํš๊ณผ RL์˜ ๊ฒฐํ•ฉ์œผ๋กœ ์–ด๋ ค์šด dexterous ๊ณผ์ œ ํ•ด๊ฒฐ ๋น„๊ณต๊ฐœ

๊ฐ ๋ชจ๋ธ์˜ ์–ธ์–ด-๋น„์ „ ํ†ตํ•ฉ ๋ฐ ์ œ์–ด ์ •์ฑ… ํ•™์Šต ํ˜์‹  ์š”์•ฝ:

  • RT-2: Represents robot actions as language tokens and co-trains them with internet vision-language tasks, achieving language understanding and zero-shot control in one model (e.g., chain-of-thought prompting for multi-step instructions)[1].
  • Octo: Proposes a multi-purpose robot controller built on a diffusion policy with flexible inputs (language + goal images). Its core design handles language and visual goals jointly and adapts quickly to new observation and action spaces[3][4].
  • OpenVLA: Couples a large language model (Llama 2) with visual encoders (SigLIP + DINOv2) to learn a generalist manipulation policy, fine-tuned on abundant robot demonstrations so high-level language instructions translate into concrete actions[5].
  • RoboMamba: Applies the efficient Mamba SSM to improve reasoning and achieve reasoning-action integration. It processes vision-language inputs with a single SSM, and fine-tuning only a small policy head is enough to raise manipulation performance[10].
  • π₀: Combines a large VLM (PaliGemma) with flow-matching action generation, supporting long-horizon composite manipulation and cross-robot generalization. Unifying many robots' data in one model enables zero-shot use and rapid fine-tuning on small datasets[11].
  • Being-H0: Pretrains on human hand sequences to unify motion and intent. It ties natural-language instructions to motion data and quantizes fine hand articulation so the VLM can connect human hand movement with language instructions[12][13].
  • DexVLG: Proposes a language-conditioned grasping model for flexible grasp policies. It predicts grasp poses aligned with object-part descriptions and models continuous motion with flow matching, yielding zero-shot grasping ability[14].
  • METIS: Unifies diverse human video and robot data in one action space to learn broad manipulation knowledge. Its motion-aware dynamics compresses visuomotor information, channeling the richness of human data into the manipulation policy[17].
  • Shake-VLA: An integrated system combining speech, vision, and a language model for real-time command handling and manipulation control. It draws background knowledge via Retrieval-Augmented Generation and closes the loop with sensor feedback to coordinate two robot arms[18][19].
  • Scaffolding: Combines VLM reasoning with reinforcement learning for hierarchical manipulation learning. The VLM supplies plans as 3D trajectories while low-level RL handles precise execution, learning complex dexterous tasks effectively[20][21].

๊ฐ ๋ชจ๋ธ์˜ ์„ค๊ณ„๋Š” ์–ธ์–ด์™€ ๋น„์ „ ์ •๋ณด๋ฅผ ํ†ตํ•ฉํ•˜์—ฌ ๋กœ๋ด‡ ์ •์ฑ…์„ ํ•™์Šตํ•˜๋Š” ์ƒˆ๋กœ์šด ๋ฐฉํ–ฅ์„ ์ œ์‹œํ•˜์˜€๊ณ , ํŠนํžˆ ๋ฏธ์„ธ ์กฐ์ž‘(dexterous manipulation) ๋ถ„์•ผ์—์„œ ๋ณต์žกํ•œ ๋ฌผ์ฒด ์กฐ์ž‘๊ณผ ์ถ”๋ก  ๋Šฅ๋ ฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๋ฐ ๊ธฐ์—ฌํ–ˆ์Šต๋‹ˆ๋‹ค.

References: the citations above reflect the core content of each model's paper; see the cited pages in each summary for details[1][5][10][11][12][14][17][18][20].

[1] [2307.15818] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

https://arxiv.org/abs/2307.15818

[2] [3] [4] Octo: An Open-Source Generalist Robot Policy

https://octo-models.github.io

[5] [6] [7] [8] [9] [2406.09246] OpenVLA: An Open-Source Vision-Language-Action Model

https://arxiv.org/abs/2406.09246

[10] [2406.04339] RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation

https://arxiv.org/abs/2406.04339

[11] [2410.24164] π₀: A Vision-Language-Action Flow Model for General Robot Control

https://arxiv.org/abs/2410.24164

[12] [13] Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

https://arxiv.org/html/2507.15597v1

[14] [15] [16] [2507.02747] DexVLG: Dexterous Vision-Language-Grasp Model at Scale

https://arxiv.org/abs/2507.02747

[17] METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model

https://arxiv.org/html/2511.17366v1

[18] [19] [2501.06919] Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing

https://arxiv.org/abs/2501.06919

[20] [21] Scaffolding Dexterous Manipulation with Vision-Language Models

https://arxiv.org/html/2506.19212v1

[22] physicalintelligence.company

https://www.physicalintelligence.company/download/pi0.pdf

Copyright 2026, JungYeon Lee