Ephraim1985_Speech_enhancement_using_a_minimum_mean_square_error.pdf
311.1 KB
Dealing with residual vocoder noise:
LogMMSE Speech Enhancement and Noise Reduction
https://github.com/rajivpoddar/logmmse
y_enh = logmmse(y, sr, output_file=None, initial_noise=1, window_size=160, noise_threshold=0.15)
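For reference, the per-bin gain behind this kind of enhancer fits in a few lines. This is a sketch, assuming the attached 1985 paper is Ephraim and Malah's log-spectral amplitude estimator, where the gain is G = ξ/(1+ξ) · exp(½·E1(v)) with v = ξγ/(1+ξ), ξ the a priori SNR and γ the a posteriori SNR. The function names are mine, not the `logmmse` package's API, and the exponential integral E1 is hand-rolled so the snippet needs only the standard library.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp1(x, tol=1e-12, max_iter=200):
    """Exponential integral E1(x) for x > 0."""
    if x <= 0:
        raise ValueError("x must be positive")
    if x <= 1.0:
        # Power series: E1(x) = -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total = -EULER_GAMMA - math.log(x)
        term = 1.0
        for k in range(1, max_iter):
            term *= -x / k          # term == (-x)^k / k!
            delta = -term / k
            total += delta
            if abs(delta) < tol * abs(total):
                break
        return total
    # Continued fraction (modified Lentz), used for x > 1
    b = x + 1.0
    c = 1e300
    d = 1.0 / b
    h = d
    for i in range(1, max_iter):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < tol:
            break
    return h * math.exp(-x)


def logmmse_gain(xi, gamma):
    """Log-spectral amplitude gain; xi and gamma are linear SNRs (not dB)."""
    v = xi * gamma / (1.0 + xi)
    return xi / (1.0 + xi) * math.exp(0.5 * exp1(v))
```

At high SNR the E1 term vanishes and the gain approaches the Wiener gain ξ/(1+ξ); at low SNR the exponential factor keeps the gain from collapsing to zero as quickly.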
StarGANv2-VC authors mentioned this method as one achieving highest MOS on VCC-2020 🤯
https://github.com/yl4579/StarGANv2-VC
I need to take a closer look at VTN
VTN is T23.
T10 is an ASR and prosody encoder fed into a speaker-dependent TTS model, which is fed into a WaveNet with single-Gaussian outputs. T10's alternative system was an autoregressive LSTM that converted phonetic posteriorgrams (PPG) into mel-spectrograms and was used for two male-male parallel speakers.
https://prml-lab-speech-team.github.io/demo/FreGAN2/
A vocoder that uses the discrete wavelet transform (DWT) in the discriminator and has a progressive generator structure, similar to StyleGAN2, that produces the inputs to the inverse DWT (iDWT)
https://github.com/prml-lab-speech-team/demo/tree/master/FreGAN2/code
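To make the DWT/iDWT idea concrete, here is a one-level Haar transform in plain Python. Haar is my choice for brevity, not necessarily the wavelet Fre-GAN 2 uses, and the function names are illustrative. The point is that the DWT halves the time resolution without throwing away the high band, and that a pair of sub-bands can be inverted back to a waveform exactly.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(signal):
    """One-level Haar DWT: returns (low, high) sub-bands at half the length."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    low = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return low, high


def haar_idwt(low, high):
    """Inverse of haar_dwt: interleaves the reconstructed sample pairs."""
    out = []
    for a, d in zip(low, high):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out


x = [1.0, 3.0, -2.0, 0.5]
low, high = haar_dwt(x)
reconstructed = haar_idwt(low, high)
```

Compare this with average pooling, which keeps only something like `low` and loses `high` for good.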
ICLR 2022
HiFi-GAN + chunked autoregression trains faster and keeps track of pitch better
https://github.com/descriptinc/cargan
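A toy illustration of what chunked autoregression buys you. Each chunk is produced in one pass, but it also sees the tail of the previously generated output, so it can continue a running quantity such as phase. The "generator" here is a hypothetical stand-in that just accumulates its inputs (a cumulative sum, like turning per-sample frequency into phase); it is not CARGAN's network.

```python
def toy_generator(features, context):
    """Hypothetical stand-in for a chunk generator: continues a running sum
    from the last sample of the previously generated output."""
    running = context[-1] if context else 0.0
    out = []
    for f in features:
        running += f
        out.append(running)
    return out


def generate_chunked(features, chunk_size, context_size):
    """Generate chunk by chunk, feeding back the last `context_size` outputs."""
    output = []
    for start in range(0, len(features), chunk_size):
        context = output[-context_size:] if context_size > 0 else []
        output.extend(toy_generator(features[start:start + chunk_size], context))
    return output


features = [1, 2, 3, 4, 5, 6]
with_context = generate_chunked(features, chunk_size=2, context_size=1)
without_context = generate_chunked(features, chunk_size=2, context_size=0)
```

With context the chunks join into one continuous cumulative sum; without it, every chunk restarts from zero, which is the kind of discontinuity that shows up as pitch and phase errors.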
https://serrjoa.github.io/projects/universe/
Score-based diffusion for universal speech enhancement (55 distortion types)
Base model: 49M parameters, 5 days, 2xV100, AMP
The paper goes on to describe improvements to the model
Scaled up model: 189M parameters, 14 days 8xV100
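Putting the two training budgets on the same scale. This is simple arithmetic on the figures above; GPU-days ignores any difference in batch size or utilization between the two runs.

```python
base = {"params_m": 49, "days": 5, "gpus": 2}
large = {"params_m": 189, "days": 14, "gpus": 8}


def gpu_days(cfg):
    """Wall-clock days multiplied by the number of GPUs."""
    return cfg["days"] * cfg["gpus"]


param_ratio = large["params_m"] / base["params_m"]
compute_ratio = gpu_days(large) / gpu_days(base)

print(f"base: {gpu_days(base)} V100-days, large: {gpu_days(large)} V100-days")
print(f"{param_ratio:.1f}x parameters for {compute_ratio:.1f}x compute")
```

That is 10 vs 112 V100-days: roughly 3.9x the parameters for 11.2x the compute.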
Neural Phonetic Alignment with pretrained models for English:
https://github.com/lingjzhu/charsiu/
StyleGAN3 antialiasing generator meets vocoder. Trained on all of LibriTTS. Generalizes to laughter and music.
https://arxiv.org/abs/2206.04658
https://github.com/NVIDIA/BigVGAN
https://bigvgan-demo.github.io
Try StarGAN-VC and ACVAE-VC to speak like a dog. ACVAE sounds more like a dog while StarGAN has better speech clarity.
https://arxiv.org/abs/2206.04780
https://github.com/suzuki256/dog-dataset
ACL 2022: Direct speech-to-speech translation with discrete units, Lee et al.
https://ai.facebook.com/blog/advancing-direct-speech-to-speech-modeling-with-discrete-units/
Meta does speech translation by feeding discrete units from a transformer encoder-decoder block to a vocoder. I noted how they don’t use pitch information as a HiFi-GAN input and use a mini duration prediction block from FastSpeech 2.
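A sketch of why a duration predictor is needed on the vocoder side. As I understand the paper, the translation model outputs "reduced" unit sequences with consecutive repeats collapsed, so the vocoder has to decide how many frames each unit lasts. The code below only shows the collapse and the expansion; the durations are read off the original sequence rather than predicted, and all names are illustrative.

```python
from itertools import groupby


def reduce_units(units):
    """Collapse consecutive repeats; return the reduced units and run lengths."""
    reduced, durations = [], []
    for unit, run in groupby(units):
        reduced.append(unit)
        durations.append(len(list(run)))
    return reduced, durations


def expand_units(reduced, durations):
    """Repeat each unit for its duration (what a vocoder would do with
    predicted durations before upsampling to audio)."""
    return [u for u, d in zip(reduced, durations) for _ in range(d)]


frame_units = [5, 5, 5, 12, 12, 7, 5, 5]
reduced, durations = reduce_units(frame_units)
restored = expand_units(reduced, durations)
```

In the real system the run lengths are not available at inference time, which is the job the FastSpeech 2 style duration block takes over.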
Very neat TTS composer from Sonantic https://www.youtube.com/watch?v=fNtwg-lXie8
Turns out the 80fps vs 200fps frame rate issue was addressed in the original Tacotron 2 paper.
https://arxiv.org/abs/1712.05884
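The two frame rates fall straight out of the hop size. This assumes the 80 fps and 200 fps above correspond to a 12.5 ms frame hop and a 5 ms frame hop respectively (my reading of the note, not something it states), and uses 22,050 Hz, LJSpeech's sampling rate, for the samples-per-hop figure.

```python
def frames_per_second(hop_ms):
    """Number of feature frames per second for a given hop in milliseconds."""
    return 1000.0 / hop_ms


def hop_in_samples(hop_ms, sample_rate):
    """Hop length expressed in audio samples."""
    return sample_rate * hop_ms / 1000.0


for hop_ms in (12.5, 5.0):
    fps = frames_per_second(hop_ms)
    hop = hop_in_samples(hop_ms, 22050)
    print(f"{hop_ms} ms hop: {fps:.0f} fps, {hop} samples per frame at 22050 Hz")
```

So a vocoder conditioned at 80 fps has to produce 2.5x as many samples per frame as one conditioned at 200 fps.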
LJSpeech is a noisy dataset! Compare a single utterance from LJ and HiFi-TTS speaker 92_clean
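One rough way to put a number on this is to compare the quietest frames of each clip, since those are mostly background noise. A minimal sketch, assuming 16-bit mono WAV input; the percentile-of-frame-energy heuristic, the function names, and the file names in the usage note are my own placeholders, not a standard metric.

```python
import math
import struct
import wave


def read_wav_mono16(path):
    """Read a 16-bit mono WAV file into floats in [-1, 1)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("expected 16-bit mono WAV")
        raw = wf.readframes(wf.getnframes())
    return [s / 32768.0 for s in struct.unpack(f"<{len(raw) // 2}h", raw)]


def noise_floor_db(samples, frame_len=512, percentile=10):
    """RMS level in dBFS of the quietest frames (a crude noise-floor proxy)."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        levels.append(20.0 * math.log10(max(rms, 1e-10)))
    if not levels:
        raise ValueError("signal shorter than one frame")
    levels.sort()
    return levels[int(len(levels) * percentile / 100)]
```

Usage would look like `noise_floor_db(read_wav_mono16("lj_utterance.wav"))` against the same call on a HiFi-TTS clip converted to 16-bit mono WAV (both file names hypothetical); a higher (less negative) value means a noisier background.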