The first Arabic TTS Challenge - QASR TTS 1.0 is on!! Register and build your own Arabic Anchor Voice and contribute to enriching #ArabicAI #ASRU2023Challenge
More details: https://arabicspeech.org/qasr-challenge/
https://twitter.com/shammur_absar/status/1658429029483986944
Some nice things from industry, autoscaling with Triton and Kubernetes
https://www.speechmatics.com/company/articles-and-news/autoscaling-with-gpu-transcription-models
Speechmatics
Autoscaling with GPU Transcription models
Speechmatics has recently switched from CPUs to GPUs to run most batch transcription models. Better hardware = increased accuracy. Find out more!
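The scaling rule behind this kind of setup is simple to sketch. Below is an illustrative, hypothetical version of the arithmetic a Kubernetes HPA-style autoscaler applies to GPU transcription workers (all names are made up here, not Speechmatics' actual implementation): scale the replica count so per-worker queue depth stays near a target.

```python
import math

def desired_replicas(queued_jobs: int, target_jobs_per_replica: int = 4,
                     min_replicas: int = 1, max_replicas: int = 32) -> int:
    """HPA-style rule: scale to ceil(queue depth / per-replica target), clamped."""
    desired = math.ceil(queued_jobs / target_jobs_per_replica)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(26))   # 7 replicas for 26 queued jobs
print(desired_replicas(0))    # never below the floor: 1
```

The clamp matters in practice: a floor keeps latency bounded when the queue is empty, and a ceiling caps GPU spend during bursts.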
Multilingual TTS from ElevenLabs
https://twitter.com/radamar/status/1658540025611685888
https://huggingface.co/spaces/elevenlabs/tts
Recent advances in the AudioLM family: 100x higher speed, better consistency, no quality hit - a new paper from the AudioLM team.
Give it a listen: https://google-research.github.io/seanet/soundstorm/examples/
Arxiv:
https://arxiv.org/abs/2305.09636
Final VoxCeleb Challenge
https://mm.kaist.ac.kr/datasets/voxceleb/voxsrc/competition2023.html
Timeline
May 20th Development set for verification tracks released.
May 31st Development set for diarisation tracks released.
June 1st Test set released and evaluation server open.
Early August Deadline for submission of results; invitation to workshop speakers.
August 20th Challenge workshop
Whisper is essentially an audio-conditioned LLM. Can we prompt it to do unseen tasks? Introducing PromptingWhisper!
We use simple prompts to adapt Whisper to unseen tasks zero-shot without any finetuning.
Paper: http://arxiv.org/abs/2305.11095
Code: https://github.com/jasonppy/PromptingWhisper
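The core idea is that Whisper's decoder is conditioned on a prefix of special tokens, and changing that prefix steers the frozen model. Here is a small illustrative sketch (not the paper's code) that assembles such a prefix using Whisper's real special-token names; injecting domain text after <|startofprev|> is one simple way to bias decoding, e.g. toward code-switched vocabulary.

```python
def build_whisper_prompt(language: str, task: str, prev_text: str = "") -> str:
    """Assemble the special-token prefix that conditions Whisper's decoder."""
    parts = []
    if prev_text:
        # Context injected before the transcript start, e.g. domain terms
        # or code-switched vocabulary the model should be biased toward.
        parts.append("<|startofprev|>" + prev_text)
    parts.append("<|startoftranscript|>")
    parts.append(f"<|{language}|>")   # language tag, e.g. en, zh
    parts.append(f"<|{task}|>")       # "transcribe" or "translate"
    parts.append("<|notimestamps|>")
    return "".join(parts)

print(build_whisper_prompt("en", "transcribe"))
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|>
```

The paper explores more than this (e.g. unseen task combinations), but the mechanism is the same: no weights change, only the prompt tokens.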
GitHub - jasonppy/PromptingWhisper: Prompting Whisper for Audio-Visual Speech Recognition, Code-Switched Speech Recognition, and Zero-Shot Speech Translation
https://twitter.com/csteinmetz1/status/1659458441197355008
I was complaining that LLMs don't have ears... This paper is a solid attempt to try to make that happen.
abs: https://arxiv.org/abs/2305.10790
Work from Yuan Gong et al. at MIT
Interactive demo available https://github.com/YuanGongND/ltu
GitHub - YuanGongND/ltu: Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".
More details on Soundstorm
https://twitter.com/danlyth/status/1660608450852691968
SoundStorm does a nice job of alleviating a key shortcoming of AudioLM.
By replacing the somewhat cumbersome and slow dual Transformers required for the acoustic token generation, they use bi-directional parallel decoding, leading to a speed-up of two orders of magnitude.
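That parallel decoding scheme follows the MaskGIT recipe: start from a fully masked token sequence and, in each round, commit only the positions the model is most confident about. The toy sketch below illustrates the control flow; the "model" is a stand-in returning random tokens and confidences, not SoundStorm itself.

```python
import random

MASK = -1

def parallel_decode(length: int, rounds: int, seed: int = 0) -> list[int]:
    rng = random.Random(seed)
    tokens = [MASK] * length
    for r in range(rounds):
        # Stand-in for a bidirectional model: propose a (token, confidence)
        # pair for every still-masked position, all in parallel.
        proposals = {i: (rng.randrange(1024), rng.random())
                     for i, t in enumerate(tokens) if t == MASK}
        if not proposals:
            break
        # Commit the most confident half; unmask everything on the last round.
        n_keep = len(proposals) if r == rounds - 1 else max(1, len(proposals) // 2)
        keep = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:n_keep]
        for i in keep:
            tokens[i] = proposals[i][0]
    return tokens

out = parallel_decode(length=16, rounds=4)
assert MASK not in out  # all positions filled in a few parallel rounds
```

With a handful of rounds instead of one forward pass per token, sequence length stops dominating decoding time, which is where the two-orders-of-magnitude speed-up comes from.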
https://arxiv.org/abs/2305.11834
Pengi: An Audio Language Model for Audio Tasks
Soham Deshmukh, Benjamin Elizalde, Rita Singh, Huaming Wang
In the domain of audio processing, Transfer Learning has facilitated the rise of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches have led to the development of versatile models capable of tackling a wide array of tasks, while delivering state-of-the-art performance. However, current models inherently lack the capacity to produce the requisite language for open-ended tasks, such as Audio Captioning or Audio Question & Answering. We introduce Pengi, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks. It takes an audio recording and text as input, and generates free-form text as output. The input audio is represented as a sequence of continuous embeddings by an audio encoder. A text encoder does the same for the corresponding text input. Both sequences are combined as a prefix to prompt a pre-trained frozen language model. The unified architecture of Pengi enables open-ended tasks and close-ended tasks without any additional fine-tuning or task-specific extensions. When evaluated on 22 downstream tasks, our approach yields state-of-the-art performance in several of them. Our results show that connecting language models with audio models is a major step towards general-purpose audio understanding.
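The prefix construction described in the abstract is just sequence-axis concatenation of the two embedding streams. A minimal sketch, with illustrative shapes (the dimensions here are placeholders, not Pengi's actual configuration):

```python
import numpy as np

def build_prefix(audio_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Concatenate along the sequence axis: [audio tokens ; text tokens]."""
    assert audio_emb.shape[1] == text_emb.shape[1], "embedding dims must match"
    return np.concatenate([audio_emb, text_emb], axis=0)

audio_emb = np.zeros((8, 512))   # e.g. 8 audio frames mapped to the LM's embedding dim
text_emb = np.zeros((5, 512))    # e.g. the tokenized task instruction
prefix = build_prefix(audio_emb, text_emb)
print(prefix.shape)  # (13, 512)
```

The frozen LM then attends over this prefix and generates the answer as ordinary next-token prediction, which is what lets one architecture cover captioning, QA, and classification-style tasks alike.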
MMS: Massively Multilingual Speech.
- Can do speech-to-text and text-to-speech in 1,100 languages.
- Can identify more than 4,000 spoken languages.
- Code and models available under the CC-BY-NC 4.0 license.
- Half the word error rate of Whisper.
Code+Models: https://github.com/facebookresearch/fairseq/tree/main/examples/mms
Paper: https://scontent-lga3-2.xx.fbcdn.net/v/t39.8562-6/348836647_265923086001014_6878005808275791319_n.pdf
Blog: https://ai.facebook.com/blog/multilingual-model-speech-recognition/
GitHub
fairseq/examples/mms at main · facebookresearch/fairseq
Facebook AI Research Sequence-to-Sequence Toolkit written in Python. - facebookresearch/fairseq
CALLS: Japanese Empathetic Dialogue Speech Corpus of Complaint Handling and Attentive Listening in Customer Center
Yuki Saito, Eiji Iimori, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari
We present CALLS, a Japanese speech corpus that considers phone calls in a customer center as a new domain of empathetic spoken dialogue. The existing STUDIES corpus covers only empathetic dialogue between a teacher and student in a school. To extend the application range of empathetic dialogue speech synthesis (EDSS), we designed our corpus to include the same female speaker as the STUDIES teacher, acting as an operator in simulated phone calls. We describe a corpus construction methodology and analyze the recorded speech. We also conduct EDSS experiments using the CALLS and STUDIES corpora to investigate the effect of domain differences. The results show that mixing the two corpora during training causes biased improvements in the quality of synthetic speech due to the different degrees of expressiveness. Our project page of the corpus is this http URL.
https://arxiv.org/abs/2305.13713
https://sython.org/Corpus/STUDIES-2/
Announcing the VoiceMOS Challenge 2023!
Challenge website: https://voicemos-challenge-2023.github.io
Register to participate: https://forms.gle/kcLc69Wa4Q97rSNq7
This edition of the challenge will focus on real-world and challenging zero-shot out-of-domain mean opinion score prediction!
https://twitter.com/yamagishilab/status/1643788523886235648
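MOS prediction challenges of this kind are typically scored with rank correlations between predicted and ground-truth mean opinion scores. As a self-contained illustration (ties not handled; real evaluations would use a library implementation), here is Spearman's rank correlation from first principles:

```python
def spearman(pred, true):
    """Spearman rank correlation for tie-free lists (illustrative only)."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rp, rt = ranks(pred), ranks(true)
    n = len(pred)
    d2 = sum((a - b) ** 2 for a, b in zip(rp, rt))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Predictions that preserve the ranking of the true MOS get SRCC = 1.0,
# even when the absolute scores are off:
print(spearman([3.1, 4.2, 2.0, 4.8], [3.5, 4.0, 2.5, 4.6]))  # 1.0
```

This is why rank metrics suit zero-shot out-of-domain settings: a predictor can be systematically biased on a new domain yet still order systems correctly.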
https://pages.cs.huji.ac.il/adiyoss-lab/twist/
Textually Pretrained Speech Language Models
https://arxiv.org/pdf/2305.13009.pdf
Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language model. We show using both automatic and human evaluation that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observation, we present the largest (to the best of our knowledge) SpeechLM both in terms of number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field.
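The warm-start idea can be pictured as weight surgery: keep the pretrained transformer body, and replace only the token embedding table with one sized to the discrete speech-unit vocabulary (e.g. HuBERT cluster indices). A toy sketch with stand-in weight dictionaries (illustrative shapes and names, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

text_vocab, speech_vocab, dim = 32000, 500, 768
pretrained_body = {"layer0.weight": rng.standard_normal((dim, dim))}  # stand-in for transformer layers
text_embeddings = rng.standard_normal((text_vocab, dim))              # discarded text-token table

# Warm start: reuse the pretrained body untouched, re-initialize only the
# (much smaller) embedding table for discrete speech units.
speech_lm = dict(pretrained_body)
speech_lm["embed.weight"] = rng.standard_normal((speech_vocab, dim)) * 0.02

assert speech_lm["layer0.weight"] is pretrained_body["layer0.weight"]
print(speech_lm["embed.weight"].shape)  # (500, 768)
```

Training then continues on speech tokens only; the claim of the paper is that the textual pretraining baked into the body gives this model a better starting point than a cold-start SpeechLM.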
Another audio LM from Google Research
LMs with a Voice: Spoken Language Modeling beyond Speech Tokens
- Presents Spectron, a novel approach to adapting pre-trained LMs to perform speech continuation.
- Surpasses existing spoken LMs both in semantic content and speaker preservation
proj: https://michelleramanovich.github.io/spectron/spectron/
abs: https://arxiv.org/abs/2305.15255
VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
Is one decoder-only generative model all you need for speech recognition, synthesis, and translation?
https://arxiv.org/abs/2305.16107
https://github.com/Takaaki-Saeki/zm-text-tts
https://arxiv.org/abs/2301.12596
While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. Then we train this model with paired data in a supervised manner, while freezing a language-aware embedding layer. This allows inference even for languages not included in the paired data but present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language. All experiments were conducted using public datasets and the implementation will be made available for reproducibility.
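The "freeze the language-aware embedding layer" step just means that layer's parameters are excluded from updates during the supervised stage, so embeddings learned for text-only languages survive fine-tuning intact. A minimal sketch of the mechanics (hypothetical parameter names, hand-rolled SGD for illustration):

```python
import numpy as np

params = {
    "lang_embed": np.ones((10, 4)),   # language-aware embeddings (frozen)
    "decoder": np.ones((4, 4)),       # trained on the paired data
}
frozen = {"lang_embed"}

def sgd_step(params, grads, lr=0.1):
    for name, g in grads.items():
        if name in frozen:
            continue  # frozen parameters are skipped, not updated
        params[name] = params[name] - lr * g
    return params

grads = {name: np.ones_like(p) for name, p in params.items()}
params = sgd_step(params, grads)
print(params["lang_embed"][0, 0], params["decoder"][0, 0])  # 1.0 0.9
```

In a real framework this is a one-liner (e.g. setting a layer's gradient flag off), but the effect is the same: at inference, a language seen only in text-only pretraining still has a usable embedding.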
GitHub - Takaaki-Saeki/zm-text-tts: [IJCAI'23] Learning to Speak from Text for Low-Resource TTS