Vol Building AGI

Dealing with residual vocoder noise:

LogMMSE Speech Enhancement and Noise Reduction

https://github.com/rajivpoddar/logmmse


from logmmse import logmmse  # import path assumed from the logmmse package README
y_enh = logmmse(y, sr, output_file=None, initial_noise=1, window_size=160, noise_threshold=0.15)

Vol Building AGI

Transformer-based sprocket successor, uses TTS pretraining. Available as egs/arctic/vc1 in ESPnet. Sounds much worse than sprocket. http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/Demos/vtn/index.html

Vol Building AGI

The StarGANv2-VC authors mention this method as one achieving the highest MOS on VCC 2020 🤯

https://github.com/yl4579/StarGANv2-VC

I need to take a closer look at VTN

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion

Vol Building AGI

In the VCC 2020 results, VTN is system T23.

T10 is an ASR and prosody encoder fed into a speaker-dependent TTS, which is fed into a WaveNet with single-Gaussian outputs. The alternative T10 system was an autoregressive LSTM that converted PPGs (phonetic posteriorgrams) into mel spectrograms and was used for two male-to-male parallel speakers.

Vol Building AGI

On AMP (automatic mixed precision) and HiFi-GAN: may need to remove the bias from the convolution layers

Vol Building AGI

https://prml-lab-speech-team.github.io/demo/FreGAN2/

A vocoder that uses a discrete wavelet transform (DWT) in the discriminator and has a progressive generator structure, similar to StyleGAN2, that produces the arguments of the inverse DWT (iDWT)

https://github.com/prml-lab-speech-team/demo/tree/master/FreGAN2/code
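
To make "iDWT arguments" concrete, here is a minimal sketch of a one-level discrete wavelet transform and its inverse. It assumes a Haar wavelet and plain Python lists purely for illustration; the function names are hypothetical and nothing here is taken from the linked code. The idea is that a generator can output low- and high-frequency subbands at half the sample rate, and the inverse transform recombines them into a full-rate waveform.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_dwt(signal):
    """One-level Haar DWT: split an even-length signal into low/high subbands."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    low = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return low, high

def haar_idwt(low, high):
    """Inverse of haar_dwt: rebuild the full-rate signal from the two subbands."""
    if len(low) != len(high):
        raise ValueError("subbands must have the same length")
    signal = []
    for lo, hi in zip(low, high):
        signal.append((lo + hi) / SQRT2)  # even-indexed sample
        signal.append((lo - hi) / SQRT2)  # odd-indexed sample
    return signal

# Illustrative waveform (made-up values).
wave = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
low, high = haar_dwt(wave)
reconstructed = haar_idwt(low, high)
```

Each subband has half as many samples as the waveform, and the round trip reproduces the input up to floating-point error.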

ICLR 2022

HiFi-GAN + chunked autoregression trains faster and keeps track of pitch better

https://github.com/descriptinc/cargan
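
A toy sketch of what chunked autoregression means, under loud assumptions: `toy_generator` is a hypothetical stand-in for the neural generator, and the chunk and context sizes are made up. The output is produced one chunk at a time, and each chunk is conditioned on the tail of the audio generated so far, rather than generating the whole waveform in one parallel pass.

```python
def toy_generator(features, context):
    """Hypothetical stand-in for a neural generator.

    Emits one sample per feature value, continuing from the last sample of
    the previously generated audio so chunk boundaries stay continuous.
    """
    previous = context[-1] if context else 0.0
    chunk = []
    for feature in features:
        previous = 0.5 * previous + 0.5 * feature  # simple smoothing recurrence
        chunk.append(previous)
    return chunk

def generate_chunked(features, chunk_size, context_size):
    """Generate audio chunk by chunk, feeding back the most recent samples."""
    audio = []
    for start in range(0, len(features), chunk_size):
        context = audio[-context_size:] if context_size > 0 else []
        audio.extend(toy_generator(features[start:start + chunk_size], context))
    return audio

# Made-up conditioning features, one per output sample.
features = [1.0] * 8
audio = generate_chunked(features, chunk_size=4, context_size=2)
```

Because the second chunk starts from the end of the first, the toy output is identical to generating all eight samples in a single chunk; with no context, each chunk would restart from zero.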

Vol Building AGI

https://www.maxrmorrison.com/sites/cargan/

Vol Building AGI

https://serrjoa.github.io/projects/universe/

Score-based diffusion for universal speech enhancement (55 distortion types)

Base model: 49M parameters, 5 days, 2×V100, AMP
The paper goes on to describe improvements to the model.
Scaled-up model: 189M parameters, 14 days, 8×V100
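
For a rough sense of scale, the training figures above can be converted to GPU-days. This is back-of-the-envelope arithmetic on the numbers quoted in this post only, and it assumes every GPU was busy for the whole run.

```python
def gpu_days(days, num_gpus):
    """Total GPU-days, assuming all GPUs run for the full duration."""
    return days * num_gpus

base = gpu_days(5, 2)      # 10 GPU-days
scaled = gpu_days(14, 8)   # 112 GPU-days

compute_ratio = scaled / base   # 11.2x more GPU time
param_ratio = 189 / 49          # about 3.86x more parameters

print(base, scaled, compute_ratio, round(param_ratio, 2))
# prints: 10 112 11.2 3.86
```

So the scaled-up model has roughly 4 times the parameters for roughly 11 times the GPU time.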

Neural Phonetic Alignment with pretrained models for English:
https://github.com/lingjzhu/charsiu/

Vol Building AGI

Lightweight speech encoder

https://github.com/yl4579/AuxiliaryASR

AuxiliaryASR: Joint CTC-S2S Phoneme-level ASR for Voice Conversion and TTS (Text-Mel Alignment)

Vol Building AGI

StyleGAN3 antialiasing generator meets vocoder. Trained on all of LibriTTS. Generalizes to laughter and music.

https://arxiv.org/abs/2206.04658

https://github.com/NVIDIA/BigVGAN

https://bigvgan-demo.github.io
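
A minimal sketch of the anti-aliasing idea: apply the nonlinearity at a higher sample rate, then low-pass and return to the original rate, so that the new high frequencies created by the nonlinearity are attenuated instead of folding back as aliasing. The Snake activation is the periodic nonlinearity BigVGAN uses; the filters below (linear interpolation and a 3-tap smoother) are crude stand-ins chosen for brevity, and all function names are hypothetical.

```python
import math

def snake(x, alpha=1.0):
    """Snake periodic activation: x + sin^2(alpha * x) / alpha."""
    return x + math.sin(alpha * x) ** 2 / alpha

def upsample2x(signal):
    """Double the sample rate with linear interpolation (crude stand-in filter)."""
    out = []
    for i, value in enumerate(signal):
        nxt = signal[i + 1] if i + 1 < len(signal) else value
        out.append(value)
        out.append(0.5 * (value + nxt))
    return out

def lowpass_and_downsample2x(signal):
    """Smooth with a [0.25, 0.5, 0.25] kernel, then keep every second sample."""
    out = []
    for i in range(0, len(signal), 2):
        left = signal[i - 1] if i > 0 else signal[i]
        right = signal[i + 1] if i + 1 < len(signal) else signal[i]
        out.append(0.25 * left + 0.5 * signal[i] + 0.25 * right)
    return out

def antialiased_snake(signal, alpha=1.0):
    """Upsample, apply Snake, then low-pass and downsample to the input rate."""
    upsampled = upsample2x(signal)
    activated = [snake(value, alpha) for value in upsampled]
    return lowpass_and_downsample2x(activated)

# Made-up test tone: 16 samples of a sine wave.
wave = [math.sin(2 * math.pi * 0.1 * n) for n in range(16)]
out = antialiased_snake(wave)
```

The output has the same length as the input; a real implementation would use properly designed low-pass filters rather than these toy kernels.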

Vol Building AGI

Try StarGAN-VC and ACVAE-VC to speak like a dog. ACVAE sounds more like a dog while StarGAN has better speech clarity.

https://arxiv.org/abs/2206.04780

https://github.com/suzuki256/dog-dataset

Vol Building AGI

ACL 2022: Direct speech-to-speech translation with discrete units, Lee et al.

https://ai.facebook.com/blog/advancing-direct-speech-to-speech-modeling-with-discrete-units/
Meta does speech translation by feeding discrete units from a transformer encoder-decoder block to a vocoder. I noted how they don’t use pitch information as a HiFi-GAN input and use a mini duration prediction block from FastSpeech 2.
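
A minimal sketch of why a duration predictor is needed on the vocoder side, assuming the translation model emits "reduced" units in which runs of identical consecutive units are collapsed. The function names and unit IDs are made up. The durations here come from run-length encoding a known frame-level sequence, whereas a real system has to predict them.

```python
from itertools import groupby

def reduce_units(units):
    """Collapse runs of identical units into (unit, duration) pairs."""
    return [(unit, len(list(run))) for unit, run in groupby(units)]

def expand_units(reduced):
    """Repeat each unit for its duration, restoring the frame-level sequence."""
    return [unit for unit, duration in reduced for _ in range(duration)]

# Made-up frame-level unit IDs.
frame_units = [12, 12, 12, 7, 7, 40, 40, 40, 40, 7]
reduced = reduce_units(frame_units)
print(reduced)
# prints: [(12, 3), (7, 2), (40, 4), (7, 1)]
print(expand_units(reduced) == frame_units)
# prints: True
```

Without the durations, the reduced sequence 12, 7, 40, 7 says nothing about how long each unit should sound, which is the gap the duration prediction block fills.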

Vol Building AGI

https://twitter.com/ysaito_human/status/1536521048568438785

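Let's learn Japanese

Yuki Saito on Twitter:

I'll be presenting as a guest lecturer in today's Advanced Topics in Signal Processing class, starting at 1 p.m. 🤓 The slides are 👇 (they're on SlideShare, but I think you can view them without any problems) slideshare.net/YukiSaito8/neu…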

Vol Building AGI

Very neat TTS composer from Sonantic https://www.youtube.com/watch?v=fNtwg-lXie8

How Sonantic AI Voices Work