  • Jung Yeon Lee

📃 RoboTwin 2.0 Review

vla
bimanual
A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
Published

October 16, 2025

  • Paper Link
  • Homepage
  • Code Link
  1. 🤖 RoboTwin 2.0 is a scalable simulation framework for bimanual robotic manipulation. It combines an automated data generation pipeline built on a multimodal large language model (MLLM) with comprehensive domain randomization to reflect the complexity of real-world environments.
  2. 💡 The framework provides the RoboTwin-OD library of 731 objects and a dataset of more than 100,000 expert trajectories across 50 bimanual tasks. Simulation-in-the-loop feedback raises the task code generation success rate by 10.9%, and the framework adapts to different robot kinematics.
  3. 🚀 Policies trained on RoboTwin 2.0 data are up to 31.9% more robust in simulation than policies trained on data without domain randomization. In real-world experiments, combining the synthetic data with 10 real demonstrations raises the average success rate by 24.4%, a substantial gain in sim-to-real transfer and generalization.

[Figure: teaser]

Brief Review

RoboTwin 2.0 is a scalable data generation framework and benchmark for bimanual robotic manipulation. Its focus is on improving the robustness and generalization of robot policies through strong domain randomization. The framework addresses two shortcomings of existing simulation-based datasets: they lack an efficient, scalable way to generate data for new tasks, and their simulation environments are oversimplified and fail to capture real-world complexity.

RoboTwin 2.0 integrates three core components:

  1. Automated Expert Data Generation Pipeline: A multimodal large language model (MLLM) and simulation-in-the-loop feedback are used to iteratively verify and refine task execution code. The MLLM synthesizes executable task plans from natural language instructions, while a vision-language model (VLM) observer monitors execution in simulation, detects errors, and suggests corrections. This closed-loop architecture lets the code generation agent improve its programs automatically, producing robust, self-improving expert data with minimal human supervision.

  2. Comprehensive Domain Randomization: To narrow the sim-to-real gap and improve generalization, randomization is applied along five axes: language instructions, scene clutter, background textures, lighting conditions, and tabletop height. For scene clutter, distractor objects drawn from the 731 objects in RoboTwin-OD are added, and collision-aware placement keeps the scene physically plausible. For backgrounds and tabletop surfaces, a library of 11,000 high-quality textures is used; the textures were generated with LLM prompts and Stable Diffusion v2 and then filtered by humans. Lighting is randomized in color temperature, light source type, intensity, and position. Tabletop height is sampled uniformly within a feasible range. For language instructions, an MLLM generates varied task templates and object descriptions to ensure linguistic diversity.

  3. Embodiment-Aware Grasp Adaptation: To handle differences in degrees of freedom (DoF) and kinematic structure between robot arms, each object is annotated with a rich set of candidate manipulation poses covering multiple grasp axes and approach directions. A high-performance, GPU-accelerated motion planner such as cuRobo is integrated so that planning stays efficient and reliable under varied kinematic constraints. Five robot embodiments are supported: Franka, Piper, UR5, ARX-X5, and Aloha-AgileX.
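
The five randomization axes in component 2 can be pictured as a per-episode sampler. The following is a minimal sketch under stated assumptions, not the RoboTwin 2.0 implementation: `SceneConfig`, `sample_scene`, the light types, and every numeric range are hypothetical placeholders; only the library sizes (731 objects, 11,000 textures) come from this review. Collision-aware placement of distractors is deliberately left out, since it needs a simulator.

```python
import random
from dataclasses import dataclass

# Library sizes quoted in this review.
NUM_OBJECTS = 731       # RoboTwin-OD objects available as distractors
NUM_TEXTURES = 11_000   # curated background / tabletop textures

# Everything below is a hypothetical placeholder, not a value from the paper.
LIGHT_TYPES = ("point", "spot", "directional")
COLOR_TEMP_RANGE_K = (2700.0, 6500.0)
INTENSITY_RANGE = (0.5, 1.5)
LIGHT_POS_RANGE_M = (-1.0, 1.0)
TABLE_HEIGHT_RANGE_M = (0.70, 0.80)


@dataclass
class SceneConfig:
    instruction: str
    distractor_ids: list
    background_texture: int
    table_texture: int
    light_type: str
    color_temp_k: float
    light_intensity: float
    light_position: tuple
    table_height_m: float


def sample_scene(rng, templates, object_descriptions, max_distractors=5):
    """Draw one randomized scene configuration, one value per axis."""
    # Language: pair a random task template with a random object description.
    instruction = rng.choice(templates).format(obj=rng.choice(object_descriptions))
    # Clutter: pick distinct distractor IDs. Collision-aware placement is
    # NOT modeled here; a simulator would have to position the objects.
    n_distractors = rng.randint(0, max_distractors)
    distractor_ids = rng.sample(range(NUM_OBJECTS), n_distractors)
    return SceneConfig(
        instruction=instruction,
        distractor_ids=distractor_ids,
        background_texture=rng.randrange(NUM_TEXTURES),
        table_texture=rng.randrange(NUM_TEXTURES),
        light_type=rng.choice(LIGHT_TYPES),
        color_temp_k=rng.uniform(*COLOR_TEMP_RANGE_K),
        light_intensity=rng.uniform(*INTENSITY_RANGE),
        light_position=tuple(rng.uniform(*LIGHT_POS_RANGE_M) for _ in range(3)),
        table_height_m=rng.uniform(*TABLE_HEIGHT_RANGE_M),  # uniform within range
    )


rng = random.Random(0)  # fixed seed so the episode is reproducible
scene = sample_scene(rng, ["pick up the {obj}"], ["red mug", "blue bowl"])
print(scene.instruction, scene.table_height_m)
```

Passing the random generator in explicitly keeps each episode reproducible from its seed, which is convenient when a failed rollout needs to be replayed.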

RoboTwin 2.0 provides the following new resources:

  • RoboTwin-OD object library: 731 object instances across 147 categories. Each object is annotated with semantic and manipulation-relevant labels, diverse language descriptions, and keypoint-axis information (placement points, functional points, grasp points, grasp axes).
  • Large-scale dataset: more than 100,000 expert trajectories spanning 50 bimanual manipulation tasks across 5 robot embodiments.
  • Benchmark: evaluates how well policies generalize to cluttered environments and open-ended language goals.
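
To make the RoboTwin-OD annotations and their role in embodiment-aware grasping more concrete, here is a rough sketch of what one object record and a grasp selection step might look like. The class names, field names, and the `first_feasible_grasp` helper are all hypothetical; the actual RoboTwin-OD schema is not described in this review, and the reachability check here is a toy predicate standing in for a real motion planner.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GraspCandidate:
    point: tuple      # grasp point in the object frame (x, y, z)
    axis: tuple       # grasp axis
    approach: tuple   # approach direction of the gripper


@dataclass
class ObjectAnnotation:
    category: str
    descriptions: list                                   # language descriptions
    placement_points: list = field(default_factory=list)
    functional_points: list = field(default_factory=list)
    grasps: list = field(default_factory=list)           # candidate grasps


def first_feasible_grasp(obj, is_reachable):
    """Return the first annotated grasp the given arm can reach, else None."""
    for grasp in obj.grasps:
        if is_reachable(grasp):
            return grasp
    return None


# Hypothetical mug with one top-down and one side grasp.
top = GraspCandidate(point=(0.0, 0.0, 0.1), axis=(1.0, 0.0, 0.0), approach=(0.0, 0.0, -1.0))
side = GraspCandidate(point=(0.05, 0.0, 0.05), axis=(0.0, 0.0, 1.0), approach=(1.0, 0.0, 0.0))
mug = ObjectAnnotation(category="mug", descriptions=["a white ceramic mug"], grasps=[top, side])


# Toy predicate: pretend this arm cannot approach from above.
def cannot_reach_from_above(grasp):
    return grasp.approach[2] >= 0


print(first_feasible_grasp(mug, cannot_reach_from_above) is side)  # prints: True
```

Storing several candidates per object is what lets the same annotation serve arms with different kinematics: each embodiment simply filters the list with its own feasibility check.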

Experimental results demonstrate the effectiveness of RoboTwin 2.0:

  • Automated expert code generation: The pipeline that combines an MLLM with simulation-in-the-loop feedback reaches a code generation success rate (ASR) of 71.3%, a 10.9% improvement over RoboTwin 1.0. Multimodal feedback in particular increases reliability by detecting errors and guiding accurate fixes.
  • Adaptive grasping efficiency: The embodiment-aware grasp augmentation strategy improves task success rate by 8.3% on average, with the gains concentrated on low-DoF platforms such as Aloha-AgileX, Piper, and ARX-X5.
  • Impact on policy robustness: Models pretrained on RoboTwin 2.0's domain-randomized data are substantially more robust to visual and spatial variation. A vision-language-action (VLA) model trained on a mix of 10 real demonstrations and 1,000 synthetic trajectories shows a 367% relative improvement over the 10-demo baseline. A zero-shot model trained only on synthetic data, with no real data, still achieves a 228% relative improvement.
  • Sim-to-real performance: In real-world experiments, bimanual policies augmented with RoboTwin 2.0's domain-randomized synthetic trajectories show clear robustness gains. In the few-shot setting (10 real demonstrations plus 1,000 synthetic trajectories), the average success rate rises by 24.4%, and the zero-shot setting also improves by more than 20%. The gains are larger in visually complex scenes, which suggests RoboTwin 2.0 is especially effective under difficult conditions.
  • RoboTwin 2.0 benchmark: When VLA models are evaluated on the 50 benchmark tasks, the pretrained models (RDT, Pi0) are more resilient under the Hard condition (domain-randomized environments), yet their success rates still drop by 20.8% and 30.1%, respectively, relative to the Easy condition (clean environments). Robustness under domain shift therefore remains a significant open challenge.
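
The simulation-in-the-loop refinement behind the code generation result above can be pictured as a generate, execute, observe loop. The sketch below is only an illustration of that control flow: `refine_task_code`, `Report`, and the iteration budget are hypothetical, and the two `fake_*` functions are toy stand-ins for the MLLM code generator and for the simulator plus VLM observer.

```python
from dataclasses import dataclass


@dataclass
class Report:
    success: bool
    feedback: str = ""   # observer's description of what went wrong


def refine_task_code(instruction, generate, execute_and_observe, max_iters=5):
    """Regenerate task code until the observer reports success.

    Returns (code, attempts) on success, or (None, max_iters) if the
    iteration budget runs out.
    """
    code, feedback = None, None
    for attempt in range(1, max_iters + 1):
        # The generator sees the previous program and the observer's feedback.
        code = generate(instruction, code, feedback)
        report = execute_and_observe(code)
        if report.success:
            return code, attempt
        feedback = report.feedback
    return None, max_iters


# Toy stand-ins so the loop can run without a model or a simulator.
def fake_generate(instruction, previous_code, feedback):
    if feedback is None:
        return "grasp(left_arm, 'mug')"
    return "grasp(right_arm, 'mug')"


def fake_observe(code):
    if "right_arm" in code:
        return Report(success=True)
    return Report(success=False, feedback="left arm cannot reach the mug")


code, attempts = refine_task_code("pick up the mug", fake_generate, fake_observe)
print(code, attempts)  # prints: grasp(right_arm, 'mug') 2
```

Capping the number of iterations keeps a task that the generator cannot solve from looping forever; such tasks would simply count as failures in a success-rate metric like ASR.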

In conclusion, RoboTwin 2.0 offers a scalable simulation framework for generating diverse, high-quality expert data in support of robust bimanual manipulation. By integrating MLLM-based task generation, embodiment-adaptive behavior synthesis, and comprehensive domain randomization, the system addresses key limitations of existing synthetic data generators.

Copyright 2024, Jung Yeon Lee