Vol Building AGI

signal processing revealed

VTLP (vocal tract length perturbation) code: https://github.com/biggytruck/SpeechSplit2/blob/0911c09732e0e935c7c0a7aaf23eb2923d9889d8/utils.py#L252-L276
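
For reference, a minimal sketch of the piecewise-linear frequency warp usually meant by VTLP (vocal tract length perturbation). This is not the code from the linked repo; the function name `vtlp_warp`, the 16 kHz sample rate and the 4800 Hz `f_hi` boundary are assumptions for illustration.

```python
import numpy as np

def vtlp_warp(freqs_hz, alpha, sample_rate=16000, f_hi=4800.0):
    """Piecewise-linear VTLP warp: scale low frequencies by alpha,
    then interpolate linearly so that Nyquist maps to Nyquist."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    nyquist = sample_rate / 2.0
    scale = f_hi * min(alpha, 1.0)   # warped value at the boundary
    boundary = scale / alpha         # input frequency where the slope changes
    low = freqs_hz * alpha
    high = nyquist - (nyquist - scale) / (nyquist - boundary) * (nyquist - freqs_hz)
    return np.where(freqs_hz <= boundary, low, high)

# Warp a few frequencies between 0 Hz and Nyquist with alpha = 0.9
freqs = np.linspace(0.0, 8000.0, 5)
warped = vtlp_warp(freqs, alpha=0.9)
print(warped)
```

A full augmentation would resample each STFT frame along the warped frequency axis, typically with alpha drawn at random around 1 (e.g. 0.9 to 1.1).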

New SOTA on TTS from Microsoft Research Asia (outside of ICASSP)

Trained on 24 hours (13,100 utterances) of LJSpeech; uses 200M text sentences for phoneme encoder pretraining, and a g2p model. 8 V100 GPUs, 3000 epochs.

https://speechresearch.github.io/naturalspeech/

In the meantime, all Interspeech 2021 videos have been made available: https://www.superlectures.com/interspeech2021/tutorials

https://www.youtube.com/channel/UC2-z0HD4WpSbJONj73BgfwQ/videos

5297-1.pdf

https://www.youtube.com/watch?v=-p_awLZWLeI

https://github.com/facebookresearch/vocoder-benchmark

VocBench from Facebook

Autoregressive vocoders: WaveNet, WaveRNN
GANs: Parallel WaveGAN, MelGAN
Diffusion: WaveGrad, DiffWave

All in one place, with a common input-output interface and a modern codebase from Facebook.

Might be useful for VC if it's easy to condition those vocoders on custom features.

Neural HMM: learns alignments fast

https://shivammehta007.github.io/Neural-HMM/

Promises to converge with 500 utterances, but I couldn't get it to work with that little data. I think it should with 2k utterances.

https://github.com/mindslab-ai/assem-vc

Official code for Assem-VC (ICASSP 2022).

Prosody annotations for Switchboard: https://groups.inf.ed.ac.uk/switchboard/index.html

Neural Text to Speech Synthesis Tutorial

https://github.com/tts-tutorial/icassp2022

Survey paper: https://arxiv.org/abs/2106.15561

CREPE: Convolutional Pitch Tracker (ICASSP 2018)

https://marl.github.io/crepe/

PyTorch port with lots of usage details: https://github.com/maxrmorrison/torchcrepe
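
As far as I recall from the reference implementation, CREPE classifies each frame into 360 pitch bins spaced 20 cents apart. The sketch below maps bin indices to Hz under that assumption. The constants and the helper name `bins_to_hz` are my assumptions, not something stated in the post, so check them against the repo before relying on them.

```python
import numpy as np

# Assumed CREPE output layout: 360 bins, 20 cents per bin.
CENTS_PER_BIN = 20
N_BINS = 360
CENTS_OFFSET = 1997.3794084376191  # assumed offset of bin 0, in cents above 10 Hz

def bins_to_hz(bins):
    """Convert pitch-bin indices to frequencies in Hz."""
    cents = CENTS_PER_BIN * np.asarray(bins, dtype=float) + CENTS_OFFSET
    return 10.0 * 2.0 ** (cents / 1200.0)

# Frequencies of the lowest and highest bins
f_lo, f_hi = bins_to_hz([0, N_BINS - 1])
print(f_lo, f_hi)
```

Under these assumptions the bins span roughly six octaves, which is why predictions outside the expected vocal range are usually masked or clipped before decoding.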

VTN (Voice Transformer Network): a Transformer-based sprocket successor that uses TTS pretraining. Available as egs/arctic/vc1 in ESPnet. Sounds much worse than sprocket.

http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/Demos/vtn/index.html

Ephraim1985_Speech_enhancement_using_a_minimum_mean_square_error.pdf

Dealing with residual vocoder noise:

LogMMSE Speech Enhancement and Noise Reduction

https://github.com/rajivpoddar/logmmse


from logmmse import logmmse

y_enh = logmmse(y, sr, output_file=None, initial_noise=1, window_size=160, noise_threshold=0.15)

Re the Transformer-based sprocket successor above: the StarGANv2-VC authors mention this method as one achieving the highest MOS on VCC 2020 🤯

https://github.com/yl4579/StarGANv2-VC

I need to take a closer look at VTN

VTN is T23.

T10 is an ASR model and a prosody encoder fed into a speaker-dependent TTS model, which is fed into a WaveNet with single-Gaussian outputs. The alternative system of T10 was an autoregressive LSTM that converted PPGs into mel-spectrograms and was used for two male-male parallel speakers.
