  • Jung Yeon Lee


📃 WaveNet Review

autoregressive
generative
paper
A Generative Model for Raw Audio
Published

September 17, 2022

This post reviews WaveNet, a paper published by Google DeepMind. WaveNet is an autoregressive generative model, widely known for having been used in Google's speaker service.

Before starting, I would like to thank Jounghee Kim, whose [Paper Review] WaveNet post helped me the most and is the source of many of the images below. The source of each image is marked with its reference number next to the caption.

Background

Raw waveform of the audio [2]

WaveNet is a speech generation model, so before looking at the model itself we need to see how sound becomes a signal. Sound is the vibration of air particles and travels as a longitudinal wave. Drawn as a waveform, as in the figure below, regions where air particles are densely packed are given a large amplitude and regions with relatively few particles a small amplitude.

Sound waveform [3]

Sound represented as a wave in this way is a continuous signal. For a computer to process it, the signal has to be expressed as discrete values, and the process of turning a continuous signal into a discrete one is called sampling. That is not the end of it: a computer cannot store numbers with unlimited precision, and to process the signal more efficiently the samples also go through quantization. Quantization divides the range of sampled values into sections and maps every value that falls within a section to a single quantized value. Sound converted to binary integers this way becomes the signal drawn as red dots along the time axis (x) in the right-hand figure below. This signal-processing procedure is called Pulse-Code Modulation (PCM).

PCM [6, 8]
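The two PCM steps above, sampling and quantization, can be sketched in a few lines of Python. This is only an illustration: the function names, the 440 Hz test tone, and the 16 kHz sampling rate are my own assumptions, and amplitudes are assumed to lie in [-1, 1].

```python
import math

def sample_sine(freq_hz, sample_rate_hz, duration_s):
    """Sampling: read a unit-amplitude sine wave at evenly spaced time steps."""
    n_samples = round(sample_rate_hz * duration_s)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

def quantize_uniform(samples, bits):
    """Quantization: map amplitudes in [-1, 1] to signed integer codes."""
    max_code = 2 ** (bits - 1) - 1      # 32767 for 16-bit
    min_code = -(2 ** (bits - 1))       # -32768 for 16-bit
    codes = []
    for s in samples:
        code = int(round(s * max_code))
        codes.append(max(min_code, min(max_code, code)))  # clip to the valid range
    return codes

# 10 ms of a hypothetical 440 Hz tone sampled at 16 kHz, stored as 16-bit PCM
pcm = quantize_uniform(sample_sine(440.0, 16000, 0.01), bits=16)
```

Each entry of `pcm` corresponds to one of the red dots in the PCM figure: a time step paired with an integer amplitude level.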

Audio is usually stored as 16-bit integers (-32,768 to 32,767), but WaveNet uses an 8-bit integer representation, which adds nonlinearity and is more efficient. The method used is the µ-law companding transformation (μ-law algorithm), which mimics how people perceive sound. People are sensitive to changes in quiet sounds but insensitive to changes in loud ones, so the μ-law algorithm divides the quiet range (the center of the graph below) finely and divides the loud range (the left and right ends of the graph), where the slope is gentle, comparatively coarsely.

μ-law algorithm [9]
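As a rough sketch of the companding step, the code below applies f(x) = sign(x) · ln(1 + μ|x|) / ln(1 + μ) with μ = 255 and then quantizes to 256 levels. The function names and the rounding scheme are my own assumptions, not the paper's code, and the input is assumed to lie in [-1, 1].

```python
import math

MU = 255  # 8-bit companding gives MU + 1 = 256 quantization levels

def mu_law_encode(x, mu=MU):
    """Compress x in [-1, 1], then quantize to an integer class in [0, mu]."""
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((compressed + 1) / 2 * mu))

def mu_law_decode(index, mu=MU):
    """Invert the quantization and the companding curve (approximately)."""
    compressed = 2 * index / mu - 1
    magnitude = math.expm1(abs(compressed) * math.log1p(mu)) / mu
    return math.copysign(magnitude, compressed)

# Quiet samples are reconstructed more precisely than loud ones.
quiet_error = abs(mu_law_decode(mu_law_encode(0.01)) - 0.01)
loud_error = abs(mu_law_decode(mu_law_encode(0.9)) - 0.9)
```

Comparing `quiet_error` with `loud_error` shows the effect described above: the quantization steps are much finer near zero than near the extremes.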

WaveNet uses 8 bits rather than 16 because, as the overall flow on the right of the figure below shows, the model computes a probability for every quantized level with a softmax. With 16-bit audio the softmax layer would have to output 65,536 probabilities (one for each value from -2^{15} to 2^{15}-1), which is very expensive; with 8 bits it needs only 256.

WaveNet Structure [6]

What is TTS (Text-to-Speech)?

As mentioned above, WaveNet drew a lot of attention for being used in Google's speaker service; specifically, it was used for the TTS service. The TTS task converts a given text into a speech signal (that is, it generates speech), and it combines text analysis with speech synthesis.

TTS Process [19]

Earlier TTS techniques fell into two main families. The first, concatenative synthesis, splits a large amount of recorded speech into phoneme-level units, stores them, and combines the stored units to produce new speech, stitching speech units together much like piecing a fabric pattern in a quilt. Because it reuses fragments of real recordings, each individual unit sounds good, but there is little freedom to control the voice and a very large amount of speech data is needed.

The second, parametric synthesis, uses a statistical model: as described in detail in the appendix of the WaveNet paper, it builds an acoustic model and generates speech from it. Unlike the concatenative approach it generates new speech data, so there is more freedom to manipulate the speech signal and a large dataset is not required, but the quality of the generated speech is somewhat lower. The biggest difference from both approaches is that WaveNet does not model explicit acoustic features and instead generates the raw waveform directly.

Concatenative and Parametric methods [10]
NN-based GM for TTS [19]

WaveNet is also used as a vocoder in neural architectures that synthesize speech directly from text, such as Tacotron 2. Below is the Tacotron 2 architecture; WaveNet MoL (mixture of logistic distributions) appears at the top right.

Tacotron 2 system architecture [20]

WaveNet

The overall structure of WaveNet, shown in the figure below, is covered here in four main parts.

WaveNet 4 Main components [6]
  1. Dilated Causal Convolution
  2. Residual Connection & Gated Activation Units
  3. Skip Connection
  4. Conditional WaveNets

For the implementation, to keep the focus on understanding, I followed Reference [17], which is comparatively simple and clear (YouTube lecture). The full WaveNet class is shown first; the other module classes it uses appear below together with their explanations.

import numpy as np
import torch
import torch.nn as nn

class WaveNet(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stack_size, layer_size):
        super().__init__()
        self.stack_size = stack_size
        self.layer_size = layer_size
        self.kernel_size = kernel_size
        # Initial causal convolution (dilation 1) applied to the input
        self.casualConv1D = CasualDilatedConv1D(in_channels, in_channels, kernel_size, dilation=1)
        # Stacked residual blocks; every block also emits a skip output
        self.stackResBlock = StackOfResBlocks(self.stack_size, self.layer_size, in_channels, out_channels, kernel_size)
        # Output head applied to the skip outputs (DenseLayer is not reproduced in this post)
        self.denseLayer = DenseLayer(out_channels)

    def calculateReceptiveField(self):
        # Total number of time steps trimmed by the dilated layers of all stacks
        return np.sum([(self.kernel_size - 1) * (2 ** l) for l in range(self.layer_size)] * self.stack_size)

    def calculateOutputSize(self, x):
        # Time steps that remain once every residual block has trimmed its share
        return int(x.size(2)) - self.calculateReceptiveField()

    def forward(self, x):
        # x: (batch, channels, time)
        x = self.casualConv1D(x)
        skipSize = self.calculateOutputSize(x)
        _, skipConnections = self.stackResBlock(x, skipSize)
        dense = self.denseLayer(skipConnections)
        return dense

1. Dilated Causal Convolution

The dilated causal convolution is the first stage: it receives the audio signal after the µ-law companding transformation.

Causal Convolution [2]

"Causal" means respecting the time order of the speech signal, which is a time series: relative to the current time step t, future information cannot be used, only information up to the present (from the past through t). In the causal convolution figure on the left, the receptive field is computed as (number of layers) + (filter length) - 1. There are 4 layers, and the filter length can be taken as 2, since two values from the previous layer are combined into one value in the next layer. The receptive field is therefore 4 + 2 - 1 = 5, and the figure confirms that 5 input samples feed into 1 output value. (This shortcut holds for a filter length of 2; for a general filter length k the receptive field is (number of layers) × (k - 1) + 1.) A receptive field like this is very narrow given how many audio samples fall within even a very short time span, and widening it requires adding layers or lengthening the filter, which makes the model very large and computationally expensive.
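To make the two ideas above concrete, here is a small pure-Python sketch (my own illustration, not the reference implementation): a causal convolution that only looks backwards, and a receptive-field check for the 4-layer, filter-length-2 case. The helper names and the example signal are assumptions.

```python
def causal_conv1d(x, weights):
    """y[t] = sum_k weights[k] * x[t - k]; samples before the start count as 0."""
    return [
        sum(w * x[t - k] for k, w in enumerate(weights) if t - k >= 0)
        for t in range(len(x))
    ]

def receptive_field(num_layers, filter_length):
    """General form; equals layers + filter_length - 1 when filter_length is 2."""
    return num_layers * (filter_length - 1) + 1

# Causality: changing only the last (future) sample leaves earlier outputs untouched.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
changed = signal[:-1] + [100.0]
out_a = causal_conv1d(signal, [0.5, 0.5])
out_b = causal_conv1d(changed, [0.5, 0.5])

# Receptive field: push an impulse through 4 layers and count the outputs it reaches.
response = [1.0] + [0.0] * 7
for _ in range(4):
    response = causal_conv1d(response, [1.0, 1.0])
reached = sum(1 for v in response if v != 0.0)
```

Here `out_a[:-1]` and `out_b[:-1]` are identical, and `reached` matches `receptive_field(4, 2)`, i.e. the 5 input samples in the figure.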

Dilated Causal Convolution [2]

The method proposed to address this is dilated convolution. It can be read as "convolution with holes": as the figure above shows, each layer gathers values from the previous layer at spaced-out positions before passing them on. It looks similar to skipping or pooling, but differs in that the input and output keep the same dimensions. The receptive field is obtained by summing the dilation values of all layers and then adding 1 for the current time step. WaveNet applies dilation across 30 layers in total: 10 layers whose dilations double from the input side as 1, 2, …, 512, repeated 3 times. A single block of 10 layers with dilations 1 to 512 has a receptive field of 1024.

Dilated Causal Convolution Process [2]
Dilated Convolution Pattern [6]
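The receptive-field arithmetic above is easy to check in code. This is a minimal sketch under the assumption of filter length 2 by default; the function name is my own.

```python
def dilated_receptive_field(dilations, kernel_size=2):
    """Sum of (kernel_size - 1) * dilation over all layers, plus 1 for the current step."""
    return sum((kernel_size - 1) * d for d in dilations) + 1

one_block = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512
three_blocks = one_block * 3              # the 30-layer pattern described above

rf_one_block = dilated_receptive_field(one_block)
rf_three_blocks = dilated_receptive_field(three_blocks)
rf_no_dilation = dilated_receptive_field([1, 1, 1, 1])
```

`rf_one_block` reproduces the 1024 quoted above, while four undilated layers (`rf_no_dilation`) only reach 5 samples, which is why dilation is the cheaper way to widen the receptive field.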

In code, this can be implemented as below. To enforce causality, the class stores self.ignoreOutIndex, computed from the dilation as (kernel_size - 1) * dilation, and trims that many time steps from the output.

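class CasualDilatedConv1D(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dilation, padding=1):
        # Note: the `padding` argument is unused; the convolution always uses padding='same'
        super().__init__()
        # padding='same' keeps the convolution output as long as the input
        self.conv1D = nn.Conv1d(in_channels, out_channels, kernel_size, dilation=dilation, bias=False, padding='same')
        # Number of trailing time steps to trim (assumes kernel_size > 1)
        self.ignoreOutIndex = (kernel_size - 1) * dilation # causal

    def forward(self, x):
        # x: (batch, channels, time) -> (batch, channels, time - ignoreOutIndex)
        return self.conv1D(x)[..., :-self.ignoreOutIndex] # causal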

2. Residual Connection & Gated Activation Units

Next, we look at the Residual Connection & Gated Activation Units stage that follows the dilated causal convolution.

The gated activation units in WaveNet borrow the mechanism used in PixelCNN. The purple Dilated Conv in the figure below is the dilated causal convolution (DCC) described above; its convolution output passes through a tanh activation and a sigmoid activation to become the filter and the gate, respectively. The values from these two paths are combined into a single vector by an element-wise product. The value from before the dilated convolution is then added back through a residual connection, which reportedly helps the network stack more layers and train faster.

Residual Connection & Gated Activation Units [6]
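For reference, the gated activation unit can be written as a formula. The notation below follows the form given in the WaveNet paper as I understand it: * is a convolution, \odot an element-wise product, \sigma the sigmoid, k the layer index, and W_{f,k}, W_{g,k} the learnable filter and gate convolution weights.

\mathbf{z}=\tanh\left(W_{f, k} * \mathbf{x}\right) \odot \sigma\left(W_{g, k} * \mathbf{x}\right)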

3. Skip Connection

Skip connections let the output be built from the values of every layer, each of which has a different receptive field thanks to dilated convolution. As explained above, the residual blocks have different dilation values, so the output of each residual block has a different receptive field.

Skip Connection [6]

The residual and skip connections can be written in code as follows. The tanh and sigmoid activations of the gated activation unit described above are applied, and their product then passes through self.resConv1D. The skip connection is implemented by self.skipConv1D. The final return yields two outputs, resOutput and skipOutput.

class ResBlock(nn.Module):
    def __init__(self, res_channels, skip_channels, kernel_size, dilation):
        super().__init__()
        self.casualDilatedConv1D = CasualDilatedConv1D(res_channels, res_channels, kernel_size, dilation=dilation)
        # 1x1 convolutions for the residual path and the skip path
        self.resConv1D = nn.Conv1d(res_channels, res_channels, kernel_size=1)
        self.skipConv1D = nn.Conv1d(res_channels, skip_channels, kernel_size=1)
        self.tanh = nn.Tanh()
        self.sigmoid = nn.Sigmoid()

    def forward(self, inputX, skipSize):
        x = self.casualDilatedConv1D(inputX)
        # Gated activation: here one convolution output feeds both branches,
        # whereas the paper uses separate filter and gate convolutions
        x1 = self.tanh(x)      # filter
        x2 = self.sigmoid(x)   # gate
        x = x1 * x2
        # Residual path: add the most recent time steps of the block input
        resOutput = self.resConv1D(x)
        resOutput = resOutput + inputX[..., -resOutput.size(2):]
        # Skip path: keep only the last skipSize time steps so all blocks align
        skipOutput = self.skipConv1D(x)
        skipOutput = skipOutput[..., -skipSize:]
        return resOutput, skipOutput

As the overall architecture shows, several of these ResBlocks are stacked, so they are wrapped in a StackOfResBlocks class and placed inside WaveNet.

class StackOfResBlocks(nn.Module):

    def __init__(self, stack_size, layer_size, res_channels, skip_channels, kernel_size):
        super().__init__()
        # One list of dilations per stack; called directly instead of through np.vectorize
        dilations = self.buildDilation(stack_size, layer_size)
        self.resBlocks = []
        for s, dilationPerStack in enumerate(dilations):
            for l, dilation in enumerate(dilationPerStack):
                resBlock = ResBlock(res_channels, skip_channels, kernel_size, dilation)
                self.add_module(f'resBlock_{s}_{l}', resBlock) # Register manually so parameters are tracked
                self.resBlocks.append(resBlock)

    def buildDilation(self, stack_size, layer_size):
        # Each stack uses dilations 1, 2, 4, ..., 2 ** (layer_size - 1)
        # e.g. layer_size=10 -> [1, 2, 4, 8, 16, ..., 512]
        dilationsForAllStacks = []
        for stack in range(stack_size):
            dilations = []
            for layer in range(layer_size):
                dilations.append(2 ** layer)
            dilationsForAllStacks.append(dilations)
        return dilationsForAllStacks

    def forward(self, x, skipSize):
        resOutput = x
        skipOutputs = []
        for resBlock in self.resBlocks:
            # Each block feeds its residual output to the next block
            resOutput, skipOutput = resBlock(resOutput, skipSize)
            skipOutputs.append(skipOutput)
        # Stacked skip outputs: (num_blocks, batch, skip_channels, skipSize)
        return resOutput, torch.stack(skipOutputs)

4. Conditional WaveNets

Conditional modeling [6]

Conditional modeling is easy to apply to WaveNet because it is an autoregressive model, and this idea is also similar to PixelCNN. Adding a feature vector h to the conditioning side of the distribution lets us put conditions on the generated audio.

p(\mathbf{x} \mid \mathbf{h})=\prod_{t=1}^T p\left(x_t \mid x_1, \ldots, x_{t-1}, \mathbf{h}\right)

There are two kinds of conditioning: global and local. Global conditioning is time-invariant: it adds information that does not change over time. For example, the speaker's identity is the same condition at every point in an audio file, so it is a global condition. In this case the feature vector h goes through a linear projection and is then added to the data x.

Local conditioning is time-variant: it adds information that changes over time, and can be thought of as an ordered sequence of vectors that is shorter than the audio. Even for the same speaker, the linguistic features differ depending on which words are spoken, so a single audio file can have many local conditions. Because the feature sequence h has a different length from the audio, it is upsampled, passed through a 1x1 convolution, and then added to the data x.
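Written out as formulas, the two conditioning schemes extend the gated activation unit as follows. This follows the notation of the WaveNet paper as I understand it: V_{f,k} and V_{g,k} are learnable projections for the condition, * is a convolution, and \mathbf{y}=f(\mathbf{h}) is the local condition upsampled to the audio resolution.

Global conditioning (V^{T}\mathbf{h} is broadcast over the time dimension):

\mathbf{z}=\tanh\left(W_{f, k} * \mathbf{x}+V_{f, k}^{T} \mathbf{h}\right) \odot \sigma\left(W_{g, k} * \mathbf{x}+V_{g, k}^{T} \mathbf{h}\right)

Local conditioning (V * \mathbf{y} is a 1x1 convolution):

\mathbf{z}=\tanh\left(W_{f, k} * \mathbf{x}+V_{f, k} * \mathbf{y}\right) \odot \sigma\left(W_{g, k} * \mathbf{x}+V_{g, k} * \mathbf{y}\right)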

Experiments

The paper runs four experiments, free-form speech generation, TTS, music audio modelling, and speech recognition, but the main focus is TTS, which is evaluated with two tests: a paired comparison test and a mean opinion score (MOS) test. In the paired comparison test, subjects listen to audio generated by two models and choose the one that sounds more natural; if they have no particular preference, they can answer "No preference". In the MOS test, subjects listen to one generated sample and rate its quality from 1 to 5 (1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Excellent).

To run the paired comparison test in the TTS experiment, linguistic features extracted from the input text [L] and the logarithmic fundamental frequency (log F_0) [F], one of the characteristics of speech, were supplied as local conditions. The receptive field was 240 milliseconds, and the baselines were an HMM-driven unit selection system (concatenative) and an LSTM-RNN-based system (parametric).

Looking at the preference scores, first compare the two existing methods, LSTM and Concat (leftmost bar graph): Concat scores higher in English and LSTM scores higher in Chinese, which suggests that the concatenative approach works better for English, where more data is available. Next, comparing WaveNet with only L as the local condition against WaveNet with L+F (middle bar graph), more conditioning is better: preference is higher when L+F is given. Finally, comparing the most preferred baseline against WaveNet with all local conditions (rightmost bar graph), WaveNet is preferred in both English and Chinese.

Paired Comparison Test Result and Logarithmic fundamental frequency [12]

In the second experiment, the mean opinion score test, WaveNet scored above 4 (Good) in both English and Chinese. It also narrowed the gap between the existing models (LSTM, HMM) and real speech (ground truth converted to 8-bit or 16-bit), which shows that the performance of speech generation models improved.

Mean Opinion Score Result

Conclusion

The main contribution of the WaveNet paper is showing that audio can be generated directly at the raw-data level. To do so, it uses dilated causal convolutions together with skip and residual connections to enlarge the receptive field so that the model can learn long audio waveforms. Adding a conditioning model on top of the waveform data lets it generate more distinctive and natural speech. Finally, although the research centers on TTS, it also shows promising results on generating non-speech audio such as music, which suggests that the approach extends well.

Improved Works

Because WaveNet is autoregressive, generation is computationally heavy and slow; the Fast Wavenet Generation Algorithm addresses this. With L the number of layers in the network, naive WaveNet generation has O(2^L) complexity, but caching the redundant convolution computations reduces it to O(L).

Fast Wavenet [21]
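The caching idea can be illustrated with a toy example. The sketch below is my own simplification in the spirit of Fast Wavenet, not the paper's implementation: it assumes scalar channels, filter length 2, tanh layers without gating, and hypothetical weights. Each layer keeps a queue of its last `dilation` inputs, so generating one new sample touches every layer only once.

```python
import math
from collections import deque

def layer_step(w_past, w_now, past, now):
    """One output of a filter-length-2 dilated causal layer (toy, scalar channel)."""
    return math.tanh(w_past * past + w_now * now)

def full_stack_last_output(history, weights, dilations):
    """Reference: recompute every layer over the whole history, keep the last value."""
    activations = list(history)
    for (w_past, w_now), d in zip(weights, dilations):
        activations = [
            layer_step(w_past, w_now, activations[t - d] if t >= d else 0.0, activations[t])
            for t in range(len(activations))
        ]
    return activations[-1]

class CachedStack:
    """Per layer, a queue holding that layer's most recent `dilation` inputs."""
    def __init__(self, weights, dilations):
        self.weights = weights
        self.queues = [deque([0.0] * d, maxlen=d) for d in dilations]

    def step(self, sample):
        value = sample
        for (w_past, w_now), queue in zip(self.weights, self.queues):
            past = queue[0]        # this layer's input from `dilation` steps ago
            queue.append(value)    # the oldest entry is dropped automatically
            value = layer_step(w_past, w_now, past, value)
        return value

weights = [(0.3, 0.7), (-0.5, 0.9), (0.2, 0.4)]   # hypothetical layer weights
dilations = [1, 2, 4]
history = [0.1 * n for n in range(12)]

cached = CachedStack(weights, dilations)
fast_outputs = [cached.step(s) for s in history]
naive_outputs = [full_stack_last_output(history[:t + 1], weights, dilations)
                 for t in range(len(history))]
```

Both lists contain the same values, but `CachedStack.step` does a constant amount of work per layer for each new sample, which is where the O(L) cost per generated sample comes from.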

Reference

[1] Original paper - WaveNet: A Generative Model for Raw Audio

[2] Project page - https://www.deepmind.com/blog/wavenet-a-generative-model-for-raw-audio

[3] https://brilliant.org/practice/wave-anatomy-2/

[4] https://m.blog.naver.com/sbkim24/10084099777

[5] https://blog.naver.com/sorionclinic/221184537689

[6] https://joungheekim.github.io/2020/09/17/paper-review/

[7] https://tech.kakaoenterprise.com/66

[8] https://www.researchgate.net/publication/269935208_Psychophysics_of_musical_elements_in_the_discrete-time_representation_of_sound

[9] https://en.wikipedia.org/wiki/%CE%9C-law_algorithm

[10] https://youtu.be/m2A9g6Xu91I

[11] https://youtu.be/GyQnex_DK2k

[12] https://wiki.aalto.fi/pages/viewpage.action?pageId=149890776

[13] https://youtu.be/MNZepE1m-kI

[14] https://medium.com/@satyam.kumar.iiitv/understanding-wavenet-architecture-361cc4c2d623

[15] https://www.deepmind.com/blog/wavenet-a-generative-model-for-raw-audio

[16] https://towardsdatascience.com/wavenet-google-assistants-voice-synthesizer-a168e9af13b1

[17] https://github.com/antecessor/Wavenet

[18] https://youtu.be/nsrSrYtKkT8

[19] https://research.google/pubs/pub45882/

[20] https://arxiv.org/abs/1712.05884

[21] https://arxiv.org/abs/1611.09482

Copyright 2024, Jung Yeon Lee