Vol Building AGI
Past topics: speech synthesis, transformers, LSTM, recurrence
signal processing revealed
New SOTA on TTS from Microsoft Research Asia (outside of ICASSP)

Uses 24 hours (13100 utterances) from LJSpeech, 200M text sentences for phoneme encoder pretraining and a g2p model. 8 V100 GPUs. 3000 epochs.

https://speechresearch.github.io/naturalspeech/
5297-1.pdf
888.6 KB
https://www.youtube.com/watch?v=-p_awLZWLeI

https://github.com/facebookresearch/vocoder-benchmark

VocBench from Facebook

Autoregressive vocoders: WaveNet, WaveRNN
GANs: Parallel WaveGAN, MelGAN
Diffusion: WaveGrad, DiffWave

All in one place with a common input-output interface with modern codebase from Facebook.

Might be useful for VC if it's easy to condition those vocoders on custom features.
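
A rough sketch of what a "common input-output interface" could look like, and why it matters for VC. This is NOT VocBench's actual API: the class names, the `synthesize` method and the `hop_length` default are all hypothetical. The point is only that if every vocoder maps frame-level features to samples, swapping mel spectrograms for custom VC features is a matter of changing the feature dimension.

```python
from abc import ABC, abstractmethod
from typing import List


class Vocoder(ABC):
    """Hypothetical common interface: frame-level features in, samples out."""

    def __init__(self, hop_length: int = 256):
        # Number of waveform samples generated per feature frame (assumed value).
        self.hop_length = hop_length

    @abstractmethod
    def synthesize(self, features: List[List[float]]) -> List[float]:
        """features holds one vector per frame; returns hop_length samples per frame."""


class SilenceVocoder(Vocoder):
    """Placeholder model: ignores the feature values and emits silence."""

    def synthesize(self, features: List[List[float]]) -> List[float]:
        return [0.0] * (len(features) * self.hop_length)


def run_vocoder(vocoder: Vocoder, features: List[List[float]]) -> List[float]:
    """Run any vocoder and check the shared length contract."""
    audio = vocoder.synthesize(features)
    assert len(audio) == len(features) * vocoder.hop_length
    return audio


# Same call for standard mel features and for some custom VC features.
mel = [[0.0] * 80 for _ in range(10)]      # 10 frames, 80-dim mel
custom = [[0.0] * 144 for _ in range(10)]  # 10 frames, made-up 144-dim features
audio_from_mel = run_vocoder(SilenceVocoder(), mel)
audio_from_custom = run_vocoder(SilenceVocoder(), custom)
```

In a real model the conditioning network's input size would have to match the feature dimension, which is the part that decides how easy custom conditioning actually is.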
Neural HMM: learns alignments fast

https://shivammehta007.github.io/Neural-HMM/

Promises to converge with 500 utterances, but I couldn't get it to work with that little data. I think it should with 2k utterances.
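
For intuition on where the alignments come from: as I understand it, the model is a left-to-right, no-skip HMM trained by maximizing the sequence likelihood, which sums over all monotonic alignments with the forward algorithm. Below is a minimal sketch of that recursion in the log domain. It is a simplification, not the authors' code: in the real model a network predicts emission and transition probabilities per frame, while here `log_emit` and `p_advance` are fixed, hypothetical inputs.

```python
import math

NEG_INF = float("-inf")


def logaddexp(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def forward_log_likelihood(log_emit, p_advance):
    """Log-likelihood of a left-to-right, no-skip HMM.

    log_emit[t][s]: log p(frame t | state s)
    p_advance[s]:   probability of leaving state s, with 0 < p_advance[s] < 1
    The path must start in the first state and exit from the last one.
    """
    T, S = len(log_emit), len(log_emit[0])
    alpha = [NEG_INF] * S
    alpha[0] = log_emit[0][0]
    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            # Either stay in state s, or arrive from state s - 1 (no skips).
            stay = alpha[s] + math.log(1.0 - p_advance[s])
            move = alpha[s - 1] + math.log(p_advance[s - 1]) if s > 0 else NEG_INF
            new[s] = logaddexp(stay, move) + log_emit[t][s]
        alpha = new
    return alpha[S - 1] + math.log(p_advance[S - 1])


# 3 frames, 2 states, uninformative emissions, 50/50 transitions.
ll = forward_log_likelihood([[0.0, 0.0]] * 3, [0.5, 0.5])
```

Since every alignment contributes to the gradient from the first step, there is no attention to learn first, which fits the "learns alignments fast" claim.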
Prosody annotations for Switchboard: https://groups.inf.ed.ac.uk/switchboard/index.html
Convolutional Pitch Tracker (ICASSP 2018)

https://marl.github.io/crepe/

PyTorch port with lots of usage details: https://github.com/maxrmorrison/torchcrepe
Transformer-based sprocket successor, uses TTS pretraining. Available as egs/arctic/vc1 in ESPnet. Sounds much worse than sprocket.


http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/Demos/vtn/index.html
Ephraim1985_Speech_enhancement_using_a_minimum_mean_square_error.pdf
311.1 KB
Dealing with residual vocoder noise:

LogMMSE Speech Enhancement and Noise Reduction

https://github.com/rajivpoddar/logmmse



from logmmse import logmmse
y_enh = logmmse(y, sr, output_file=None, initial_noise=1, window_size=160, noise_threshold=0.15)
The StarGANv2-VC authors mentioned this method as one achieving the highest MOS on VCC-2020 🤯
https://github.com/yl4579/StarGANv2-VC
I need to take a closer look at VTN.
VTN is T23.

T10 is an ASR and prosody encoder fed into a speaker-dependent TTS, fed into a WaveNet with single-Gaussian outputs. The alternative T10 system was an autoregressive LSTM that converted PPG into mel spectrograms and was used for two male-male parallel speakers.