Open Preview for #ICASSP2023 is now available on @IEEEXplore! Available through June 10, you can now browse all the papers that were accepted to ICASSP 2023, free of charge. Browse research here: https://hubs.la/Q01N_PdX0
Good VC quality:
https://quickvc.github.io/quickvc-demo/
https://github.com/quickvc/QuickVC-VoiceConversion
People report that fine-tuning Whisper with PEFT + LoRA gives quite good results:
https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb
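The linked notebook uses the peft library to train int8 Whisper with LoRA adapters. As a reminder of why this is so parameter-efficient, here is a minimal from-scratch sketch of the LoRA mechanism itself (pure NumPy, not the peft API): the pretrained weight W stays frozen, and only a low-rank update (alpha / r) * B @ A is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 512, 512, 8, 16

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, rank r
B = np.zeros((d_out, r))                    # trainable, zero-initialized

def lora_forward(x):
    # Frozen path plus scaled low-rank update; only A and B get gradients.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# At init B == 0, so the adapted layer exactly matches the original one.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters: 2 * r * d instead of the full d * d.
full, lora = W.size, A.size + B.size
print(f"full: {full}, LoRA: {lora} ({100 * lora / full:.1f}%)")  # 3.1%
```

With rank 8 on a 512x512 layer, the adapter is about 3% of the layer's parameters, which is what makes whisper-large-v2 trainable on a single consumer GPU.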
EfficientSpeech, or ES for short, is an efficient neural text-to-speech (TTS) model. It generates mel spectrograms at an mRTF of 104, i.e. 104 seconds of speech per second of compute, on a Raspberry Pi 4. Its tiny version has a footprint of just 266k parameters, and generating 6 seconds of speech consumes only 90 MFLOPs.
https://github.com/roatienza/efficientspeech
https://roatienza.github.io/efficientspeech-demo/
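A quick sanity check on the numbers quoted above, to put them in perspective: 90 MFLOPs for 6 seconds of speech works out to 15 MFLOPs per second of speech, so sustaining an mRTF of 104 needs only about 1.5 GFLOP/s of effective throughput, comfortably within an RPi4's reach.

```python
# Numbers as reported for the tiny EfficientSpeech model.
mrtf = 104            # seconds of speech generated per second of wall clock
flops_total = 90e6    # FLOPs to generate 6 seconds of speech
speech_secs = 6

flops_per_speech_sec = flops_total / speech_secs
print(f"{flops_per_speech_sec / 1e6:.0f} MFLOPs per second of speech")  # 15 MFLOPs

# Effective compute throughput implied by mRTF 104:
flops_per_wallclock_sec = flops_per_speech_sec * mrtf
print(f"{flops_per_wallclock_sec / 1e9:.2f} GFLOP/s")  # 1.56 GFLOP/s
```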
🌍🗣️SUPERB benchmark is back with ML-SUPERB, its multilingual version! The challenge, as one of the #ASRU2023 challenges, includes 3 tracks:
1️⃣ML-SUPERB: For multilingual SSL
2️⃣New language: For extending to new languages!
3️⃣Research: For research papers
More to see 👉 https://multilingual.superbbenchmark.org
Timeline:
May 12, 2023: Challenge announcement
May 19, 2023: Leaderboard is online and accepting submissions
June 26, 2023: New Language Track Submission Deadline
July 07, 2023: Paper / Model Submission Deadline
July 10, 2023: Paper Revision Deadline
Universal Source Separation with Weakly Labelled Data
abs: https://arxiv.org/abs/2305.07447
paper page: https://huggingface.co/papers/2305.07447
github: https://github.com/bytedance/uss
Some people implement streaming speaker diarization manually:
https://github.com/pyannote/pyannote-audio/commit/4a6ea9c825b9447a7d03cb9bd94f5f81d661ca16
Others just ask ChatGPT to write it:
https://github.com/huseinzol05/malaya-speech/commit/564f50c0d91528126fe3b410f387d1b4ff33d364
The ChatGPT version is not that bad.
The first Arabic TTS Challenge - QASR TTS 1.0 is on! Register, build your own Arabic Anchor Voice, and contribute to enriching #ArabicAI #ASRU2023Challenge
More details: https://arabicspeech.org/qasr-challenge/
https://twitter.com/shammur_absar/status/1658429029483986944
Some nice things from industry: autoscaling GPU transcription with Triton and Kubernetes.
https://www.speechmatics.com/company/articles-and-news/autoscaling-with-gpu-transcription-models
Multilingual TTS from ElevenLabs
https://twitter.com/radamar/status/1658540025611685888
https://huggingface.co/spaces/elevenlabs/tts
Recent advances in the AudioLM family: 100x higher speed, better consistency, no quality hit - a new paper from the AudioLM team.
Give it a listen: https://google-research.github.io/seanet/soundstorm/examples/
Arxiv:
https://arxiv.org/abs/2305.09636
Final VoxCeleb Challenge
https://mm.kaist.ac.kr/datasets/voxceleb/voxsrc/competition2023.html
Timeline
May 20th Development set for verification tracks released.
May 31st Development set for diarisation tracks released.
June 1st Test set released and evaluation server open.
Early August Deadline for submission of results; invitation to workshop speakers.
August 20th Challenge workshop
Whisper is essentially an audio-conditioned LLM. Can we prompt it to do unseen tasks? Introducing PromptingWhisper!
We use simple prompts to adapt Whisper to unseen tasks zero-shot without any finetuning.
📄 Paper: http://arxiv.org/abs/2305.11095
💻 Code: https://github.com/jasonppy/PromptingWhisper
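The core trick here is that Whisper's decoder is conditioned by a sequence of special tokens (language, task, optional prepended context), so choosing those tokens steers behavior without touching the weights. A toy sketch of the idea, purely illustrative: the token names follow Whisper's special-token vocabulary, but this helper is my own sketch, not the paper's code or the actual tokenizer.

```python
def build_decoder_prompt(task="transcribe", language="en", prev_text=None):
    """Assemble a Whisper-style decoder conditioning sequence (toy sketch)."""
    tokens = []
    if prev_text is not None:
        # Context placed after <|startofprev|> can bias decoding, e.g. toward
        # code-switched vocabulary, which is one of the tasks in the paper.
        tokens += ["<|startofprev|>", prev_text]
    tokens += ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>",
               "<|notimestamps|>"]
    return tokens

# Zero-shot speech translation is coaxed by swapping task/language tokens:
print(build_decoder_prompt(task="translate", language="zh"))
```

In the real model these strings map to token IDs fed to the decoder before generation starts; the paper's contribution is showing how far such prompt manipulation alone can push Whisper on tasks it was never trained for.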
https://twitter.com/csteinmetz1/status/1659458441197355008
I was complaining that LLMs don't have ears... This paper is a solid attempt to make that happen.
abs: https://arxiv.org/abs/2305.10790
Work from Yuan Gong et al. at MIT
An interactive demo is available: https://github.com/YuanGongND/ltu