Curieux.JY
  • JungYeon Lee

📃 SegDAC Review

visual-rl
segmentation
Segmentation-Driven Actor-Critic for Visual Reinforcement Learning
Published

February 20, 2026

🔍 Ping. 🔔 Ring. ⛏️ Dig. A tiered review series: quick look, key ideas, deep dive.

  • Paper Link

Related Post: ManiSkill3 Review

  1. 💡 SegDAC is a new Transformer-based actor-critic method for visual reinforcement learning (RL) that uses Segment Anything (SAM) and YOLO-World to extract a dynamic, variable number of object-centric representations.
  2. 🚀 It learns directly in latent space, without image reconstruction, data augmentation, or manual labels, and is the first online RL method to handle variable-length segment embeddings.
  3. 🏆 On the ManiSkill3 benchmark, SegDAC improves on prior performance by up to 2x in the hardest visual generalization setting while matching or surpassing prior methods in sample efficiency, showing that a lighter, more direct pipeline can still achieve strong results.


🔍 Ping Review

🔍 Ping: a light tap on the surface. Get the gist in seconds.

The paper addresses the challenges that arise in visual reinforcement learning (visual RL) from high-dimensional visual inputs, environment variability, and the poor robustness of policies to visual perturbations. It points out that effectively integrating existing large perception models into visual RL, so as to improve visual generalization and sample efficiency, remains difficult.

SegDAC: Improving Visual Reinforcement Learning by Extracting Dynamic Object-Centric Representations from Pretrained Vision Models

To address these problems, the paper proposes SegDAC (Segmentation-Driven Actor-Critic). SegDAC builds on the premise that object-centric representations are more useful than pixel-based or patch-based ones. It overcomes the limitations of earlier segmentation-based RL methods, whose reliance on fixed slots, precomputed masks, or strong supervision restricted their flexibility and generality.

Core Methodology

At its core, SegDAC uses pretrained vision models to extract a dynamic number of object-centric embeddings, then uses those embeddings to predict actions or estimate Q-values. Rather than training an encoder on raw pixels, SegDAC operates entirely in latent space.

  1. Grounded Segmentation Module:

    • This module takes an RGB image and a set of grounding text inputs, and segments the image according to the text.
    • YOLO-World, an open-vocabulary detector, generates bounding boxes from the provided text tags (e.g., "cube", "robot", "background"). It is used zero-shot.
    • The bounding boxes serve as prompts for EfficientViT-SAM (referred to as SAM below), which produces a segment mask for each box along with patch embeddings of the image. Both SAM and YOLO-World stay frozen during training.
    • The module outputs a variable number (N) of segments at each time step, unlike earlier methods that rely on a fixed number of object representations.
    • In particular, including a generic text tag such as "background" is shown to improve generalization, because the agent learns to ignore irrelevant regions.
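
The data flow of this module can be sketched in plain Python. This is only an illustration of the tags-to-boxes-to-masks chain and of why N varies per frame: `grounded_segmentation`, `toy_detect`, and `toy_segment` are hypothetical names, the toy functions are stand-ins for the frozen YOLO-World and SAM models (not their real APIs), and the patch embeddings that SAM also returns are left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive
Mask = List[List[int]]           # H x W binary mask


@dataclass
class Segment:
    tag: str
    box: Box
    mask: Mask


def grounded_segmentation(image, tags: Sequence[str],
                          detect: Callable, segment: Callable) -> List[Segment]:
    """Text tags -> bounding boxes (detector) -> one mask per box (segmenter)."""
    detections = detect(image, tags)  # expected: list of (tag, box) pairs
    return [Segment(tag, box, segment(image, box)) for tag, box in detections]


# Toy stand-ins so the sketch runs; a real system would call frozen pretrained models.
def toy_detect(scene: Dict[str, Box], tags: Sequence[str]):
    # The toy "image" is a dict of tag -> box; only tags present in it are detected.
    return [(tag, scene[tag]) for tag in tags if tag in scene]


def toy_segment(scene: Dict[str, Box], box: Box, size: int = 4) -> Mask:
    # Fill the box on a size x size grid, as if the object filled its box exactly.
    x0, y0, x1, y1 = box
    return [[int(x0 <= x < x1 and y0 <= y < y1) for x in range(size)]
            for y in range(size)]


tags = ["cube", "robot", "background"]
frame_a = {"cube": (0, 0, 2, 2), "robot": (2, 0, 4, 4)}
frame_b = {"robot": (2, 0, 4, 4)}  # the cube has left the view

for frame in (frame_a, frame_b):
    segments = grounded_segmentation(frame, tags, toy_detect, toy_segment)
    print([s.tag for s in segments])  # N differs between the two frames
```

The same tag list yields two segments for the first frame and one for the second, which is the variable-N behavior the downstream networks have to handle.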

  2. Segment Embeddings Extraction Module:

    • This module takes as input the N binary segment masks produced by the Grounded Segmentation Module and SAM's patch embeddings.
    • It has no learnable parameters. For each segment mask, it identifies the SAM patch embeddings that spatially overlap the mask.
    • It counts the mask's active pixels inside each patch and discards patches whose overlap falls below a small threshold (e.g., 4 pixels).
    • Global average pooling over the remaining patch embeddings yields a single embedding vector per segment (dimension S, e.g., S=256).
    • Because SAM's patch embeddings carry contextual information from the entire image, the resulting segment embeddings inherit this shared context and stay robust even when the segmentation is imperfect.
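
The overlap test and pooling described above can be written as a small dependency-free sketch. It assumes the mask height and width equal the patch grid size times `patch_size`; the function name `segment_embedding`, its parameters, and the toy sizes are illustrative assumptions, not the paper's code.

```python
from typing import List, Optional

Mask = List[List[int]]               # H x W binary mask (0 or 1)
PatchGrid = List[List[List[float]]]  # Gh x Gw grid of S-dimensional embeddings


def segment_embedding(mask: Mask, patches: PatchGrid, patch_size: int,
                      min_pixels: int = 4) -> Optional[List[float]]:
    """Average the patch embeddings that overlap `mask` by at least `min_pixels`."""
    kept = []
    for gy, row in enumerate(patches):
        for gx, emb in enumerate(row):
            y0, x0 = gy * patch_size, gx * patch_size
            # Count active mask pixels inside this patch's footprint.
            active = sum(
                mask[y][x]
                for y in range(y0, y0 + patch_size)
                for x in range(x0, x0 + patch_size)
            )
            if active >= min_pixels:
                kept.append(emb)
    if not kept:
        return None  # no patch passed the overlap threshold
    dim = len(kept[0])
    # Global average pooling over the kept patches.
    return [sum(e[d] for e in kept) / len(kept) for d in range(dim)]


# Toy example: 4x4 mask, 2x2 patch grid (patch_size=2), S=2.
mask = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
patches = [
    [[1.0, 0.0], [9.0, 9.0]],
    [[3.0, 2.0], [9.0, 9.0]],
]
print(segment_embedding(mask, patches, patch_size=2))  # [2.0, 1.0]
```

In the toy example the top-right patch overlaps the mask by a single pixel, so it is dropped, and the result is the mean of the two left-hand patches.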

  3. Actor-Critic Networks:

    • The actor and the critic each use their own transformer decoder, with separate weights, projection heads, and encoding layers.
    • The inputs are the segment embeddings, proprioception (the robot's own state information), and a learned query token.
    • Each token receives a learned token-type encoding so the model can tell segment, proprioception, and query tokens apart.
    • Segment tokens additionally get a positional encoding based on their bounding-box coordinates, which provides a spatial reference suited to the object-centric structure.
    • In the critic, the query is formed by concatenating the action vector with the learned token and projecting the result through an MLP. Keys and values are obtained by projecting the set of segment tokens, the proprioception token, and the learned token. The decoder attends over this set to produce a single output token, which a projection head maps to a Q-value.
    • The actor uses a similar design, except that the learned query token takes the place of the action input, and the output token is projected to the action space.
    • Because SegDAC operates directly on segment embeddings, it processes far fewer tokens than patch-based encoders and can focus on the important objects without supervision.
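
To see why a single query can summarize a token set of any length, here is a minimal single-query attention sketch in plain Python. It is a deliberate simplification and not the SegDAC architecture: the tokens serve directly as keys and values, and the learned projections, the query MLP, token-type and positional encodings, and the multi-head, multi-layer decoder are all omitted. The name `attend` and the toy vectors are assumptions for illustration.

```python
import math
from typing import List

Vector = List[float]


def attend(query: Vector, tokens: List[Vector]) -> Vector:
    """Single-query scaled dot-product attention over a variable-length token set.

    Simplification: the tokens act as both keys and values (no learned projections).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    # Numerically stable softmax across however many tokens are present.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tokens[0])
    # Weighted sum of the tokens: always one vector, whatever N is.
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


query = [1.0, 0.0, 0.0]
frame_with_two_segments = [[0.2, 0.1, 0.0], [0.9, 0.3, 0.5]]
frame_with_four_segments = frame_with_two_segments + [[0.0, 1.0, 0.0], [0.4, 0.4, 0.4]]

for tokens in (frame_with_two_segments, frame_with_four_segments):
    out = attend(query, tokens)
    print(len(tokens), "tokens ->", len(out), "dims")
```

Both frames produce one output vector of the same dimension, which is what lets a fixed-size projection head sit on top regardless of how many segments were detected.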

Main Contributions

  • Dynamic object-centric RL: A transformer-based actor-critic that operates directly on variable-length segment embeddings, with no image reconstruction step. This makes SegDAC the first method to learn from variable-length object embeddings computed on the fly in online RL.
  • Text-Grounded Segmentation for Online RL: The first method to use text-grounded segmentation for online RL while learning from a variable number of segment embeddings.
  • Strong visual generalization: Doubles prior performance in the hardest setting of a new visual generalization benchmark built on ManiSkill3.
  • Faster SAM-based training and inference: Lightweight segment embeddings, fast text-grounded segmentation, simple mask post-processing, and fully latent-space learning make it 2 to 5 times faster than existing SAM-based approaches.
  • New direction for visual RL: Shows that strong visual generalization is achievable with the plain SAC loss alone, without data augmentation, auxiliary losses, or external datasets, which points toward lighter, more direct pipelines.
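
For reference, the standard SAC objectives that "SAC loss" usually denotes are written out below in textbook form. The post does not say which SAC variant SegDAC uses (twin critics, target-network updates, automatic temperature tuning), so the notation is an assumption: $s$ stands for the observation (here, the set of segment and proprioception tokens), $\mathcal{D}$ for the replay buffer, $\bar\theta$ for target-critic parameters, and $\alpha$ for the entropy temperature.

```latex
J_Q(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}
  \left[ \tfrac{1}{2} \bigl( Q_\theta(s,a) - y \bigr)^2 \right],
\qquad
y = r + \gamma \, \mathbb{E}_{a' \sim \pi_\phi(\cdot \mid s')}
  \left[ Q_{\bar\theta}(s',a') - \alpha \log \pi_\phi(a' \mid s') \right]

J_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D},\; a \sim \pi_\phi(\cdot \mid s)}
  \left[ \alpha \log \pi_\phi(a \mid s) - Q_\theta(s,a) \right]
```

The point of the contribution is that nothing is added to these two terms: no reconstruction loss, no augmentation-consistency loss.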

Experimental Results

SegDAC is evaluated on a new visual generalization benchmark built on ManiSkill3 (8 manipulation tasks, 3 difficulty levels, 12 visual perturbations). It is more robust than all prior baselines (SAC-AE, DrQ-v2, SAM-G, SMG, SADA, MaDi). In the hardest setting it doubles prior performance, and in sample efficiency it matches or surpasses the state-of-the-art DrQ-v2. SegDAC also remains stable even though the number, size, and granularity of segments change dynamically, and it attends selectively to task-relevant objects. These results demonstrate the strength of the object-centric approach.

Copyright 2026, JungYeon Lee