Vol Building AGI
Past topics: speech synthesis, transformers, LSTM, recurrence
https://serrjoa.github.io/projects/universe/

Score-based diffusion for universal speech enhancement (55 distortion types)

Base model: 49M parameters, 5 days, 2xV100, AMP
The paper goes on to describe improvements to the model
Scaled up model: 189M parameters, 14 days 8xV100
BigVGAN: the StyleGAN3 antialiased generator meets the vocoder. Trained on all of LibriTTS, it generalizes to laughter and music.

https://arxiv.org/abs/2206.04658

https://github.com/NVIDIA/BigVGAN

https://bigvgan-demo.github.io
Try StarGAN-VC and ACVAE-VC to speak like a dog. ACVAE sounds more like a dog while StarGAN has better speech clarity.

https://arxiv.org/abs/2206.04780

https://github.com/suzuki256/dog-dataset
ACL 2022: Direct speech-to-speech translation with discrete units, Lee et al.

https://ai.facebook.com/blog/advancing-direct-speech-to-speech-modeling-with-discrete-units/
Meta does speech translation by feeding discrete units from a transformer encoder-decoder block to a vocoder. I noted how they don’t use pitch information as a HiFi-GAN input and use a mini duration prediction block from FastSpeech 2.
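A minimal sketch of the duration idea, assuming the translation model emits a "reduced" unit sequence (runs of repeated units collapsed) and a duration predictor tells the vocoder how many frames each unit should span. The function names and unit IDs are hypothetical, and the true run lengths stand in for what a learned duration predictor would output.

```python
from itertools import groupby


def reduce_units(frame_units):
    """Collapse runs of repeated units into (unit, run_length) pairs."""
    return [(unit, len(list(run))) for unit, run in groupby(frame_units)]


def expand_units(units, durations):
    """Repeat each unit by its duration in frames (FastSpeech-style length regulation)."""
    if len(units) != len(durations):
        raise ValueError("units and durations must have the same length")
    expanded = []
    for unit, duration in zip(units, durations):
        expanded.extend([unit] * duration)
    return expanded


# Hypothetical frame-level unit IDs, one per frame.
frame_units = [7, 7, 7, 42, 42, 3, 3, 3, 3]

pairs = reduce_units(frame_units)
units = [unit for unit, _ in pairs]
# Assumption: a real system would predict these instead of reading them off.
durations = [length for _, length in pairs]

restored = expand_units(units, durations)
```

With exact durations the expansion reproduces the original frame sequence, which is why a small duration predictor is enough to recover timing from reduced units.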
Turns out the 80fps vs 200fps frame rate issue was addressed in the original Tacotron 2 paper.
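For reference, the arithmetic behind the two numbers, assuming they are spectrogram frames per second produced by 12.5 ms and 5 ms frame hops respectively. The 24000 Hz sample rate in the last line is only an illustrative assumption.

```python
def frames_per_second(hop_ms):
    """Number of spectrogram frames per second for a given hop in milliseconds."""
    return 1000.0 / hop_ms


def hop_samples(hop_ms, sample_rate):
    """Hop length in samples for a given hop in milliseconds."""
    return sample_rate * hop_ms / 1000.0


fps_coarse = frames_per_second(12.5)  # 80.0
fps_fine = frames_per_second(5.0)     # 200.0

# Assumed sample rate, for illustration only.
hop_at_24k = hop_samples(12.5, 24000)  # 300.0
```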

https://arxiv.org/abs/1712.05884
LJSpeech is a noisy dataset! Compare a single utterance from LJ and HiFi-TTS speaker 92_clean
[Two audio attachments]
The sample above was actually preprocessed (at least downsampled to 16 kHz). Here’s the original one; the noise in the silent interval is audible.
Channel name was changed to «Vol Trying Synthesis»
Thanks Taras for sharing the Deep Creativity course, I’ve been stuck watching a lecture on music synthesis.

The topic of using proper strong inductive biases to achieve realistic output seems to be much more explored there. The vocoder community seems to have only just started using DWT (FreGAN), PQMF (Avocodo, RAVE) and antialiasing to gain quality and parameter efficiency, while DDSP has a much larger pool of building blocks for upsampling.

https://youtu.be/oiPWOTr44qQ
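To illustrate why a DWT is attractive as a building block, here is a sketch of a one-level Haar transform: it halves the sequence length like a stride-2 downsampling, but keeps the high-frequency part in a second band, so the input can be rebuilt exactly. Haar is picked only because it is the simplest wavelet; which wavelet the vocoders mentioned above use is not stated here, and the function names are hypothetical.

```python
import math


def haar_dwt(x):
    """One-level Haar DWT: returns (approx, detail), each half the input length."""
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleaves the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.append((a + d) / s)
        x.append((a - d) / s)
    return x


signal = [1.0, 3.0, -2.0, 0.0, 4.0, 4.0, 0.5, -0.5]
low, high = haar_dwt(signal)   # low band and high band, 4 samples each
rebuilt = haar_idwt(low, high)  # matches signal up to float rounding
```

Plain average pooling would keep only something like `low` and throw `high` away, which is the information loss the wavelet approach avoids.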
https://github.com/TariqAHassan/HiFiHybrid

Anti-aliased multi-periodicity composition backported to HiFi-GAN

The Snake1d activation replaces the more common Leaky ReLU. It is defined as snake_a(x) = x + sin^2(a*x)/a, where the trainable a controls the frequency of the periodic term. It is anti-aliased by applying low-pass filtering (blurring) after the upsampling and downsampling operations.
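A scalar sketch of the Snake formula, just to make its shape concrete. Here `a` is a plain float rather than a trainable parameter, the anti-aliasing filters are left out, and the function names are hypothetical.

```python
import math


def snake(x, a=1.0):
    """Snake activation: x + sin^2(a*x) / a."""
    if a == 0:
        raise ValueError("a must be non-zero")
    return x + math.sin(a * x) ** 2 / a


def snake_slope(x, a=1.0):
    """Derivative of snake with respect to x: 1 + sin(2*a*x), always in [0, 2]."""
    return 1.0 + math.sin(2.0 * a * x)


# Evaluate on a grid; a larger `a` gives faster ripples on top of the identity.
xs = [i / 10.0 for i in range(-50, 51)]
ys = [snake(x, a=3.0) for x in xs]
slopes = [snake_slope(x, a=3.0) for x in xs]
```

Since the slope never goes below zero, the function is the identity plus a periodic ripple and stays monotonic, unlike a bare sine.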

However, trying to replace transposed convolutions with their antialiased counterparts causes mode collapse in BigVGAN. Maybe it won’t in Diff? :)
A map of vocoders

Inside one of the slide decks of NSF https://nii-yamagishilab.github.io/samples-nsf/index.html