Speech Technology

EfficientSpeech, or ES for short, is an efficient neural text-to-speech (TTS) model. It generates mel spectrograms at a speed of 104 mRTF, that is, 104 seconds of speech per second, on a Raspberry Pi 4 (RPi4). Its tiny version has a footprint of just 266k parameters, and generating 6 seconds of speech takes only 90 MFLOPs.

https://github.com/roatienza/efficientspeech

https://roatienza.github.io/efficientspeech-demo/
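
To put the EfficientSpeech numbers in perspective, here is a small back-of-the-envelope calculation. It reads mRTF as "seconds of speech per second of compute", as the post states, and assumes the 104 mRTF and 90 MFLOPs figures refer to the same model variant, which the post does not say explicitly. The function and constant names are made up for this illustration, and only mel generation is covered, not a vocoder.

```python
def synthesis_seconds(speech_seconds: float, mrtf: float) -> float:
    """Compute time needed to generate mel frames for `speech_seconds` of audio."""
    if mrtf <= 0:
        raise ValueError("mrtf must be positive")
    return speech_seconds / mrtf


# Figures quoted in the post (assumed to describe the same model variant).
SPEECH_SECONDS = 6.0
MRTF_RPI4 = 104.0
TOTAL_MFLOPS = 90.0

latency_ms = 1000.0 * synthesis_seconds(SPEECH_SECONDS, MRTF_RPI4)
mflops_per_speech_second = TOTAL_MFLOPS / SPEECH_SECONDS

print(f"mel generation for 6 s of speech: about {latency_ms:.1f} ms")
print(f"compute per second of speech: {mflops_per_speech_second:.1f} MFLOPs")
```

Under these assumptions, 6 seconds of speech needs roughly 58 ms of mel generation on the RPi4, at about 15 MFLOPs per second of speech.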

Speech Technology

May 12, 2023: Challenge announcement
May 19, 2023: Leaderboard is online and accepting submissions
June 26, 2023: New Language Track Submission Deadline
July 07, 2023: Paper / Model Submission Deadline
July 10, 2023: Paper Revision Deadline

🌍🗣️ The SUPERB benchmark is back with ML-SUPERB, its multilingual version! The challenge, one of the #ASRU2023 challenges, includes 3 tracks:
1️⃣ ML-SUPERB: For multilingual SSL
2️⃣ New language: To new languages!
3️⃣ Research: For research papers

More to see 👉 https://multilingual.superbbenchmark.org

Speech Technology

https://github.com/facebookresearch/AudioDec

AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec

Speech Technology

Universal Source Separation with Weakly Labelled Data

abs: https://arxiv.org/abs/2305.07447
paper page: https://huggingface.co/papers/2305.07447
github: https://github.com/bytedance/uss

Speech Technology

Some people implement streaming speaker diarization manually

https://github.com/pyannote/pyannote-audio/commit/4a6ea9c825b9447a7d03cb9bd94f5f81d661ca16

others just ask ChatGPT to write it

https://github.com/huseinzol05/malaya-speech/commit/564f50c0d91528126fe3b410f387d1b4ff33d364

The ChatGPT version is not that bad.

Speech Technology

The first Arabic TTS Challenge - QASR TTS 1.0 is on!! Register and build your own Arabic Anchor Voice and contribute to enriching #ArabicAI #ASRU2023Challege
More details: https://arabicspeech.org/qasr-challenge/

https://twitter.com/shammur_absar/status/1658429029483986944

Speech Technology

Some nice things from industry: autoscaling with Triton and Kubernetes

https://www.speechmatics.com/company/articles-and-news/autoscaling-with-gpu-transcription-models

Speechmatics has recently switched from CPUs to GPUs to run most batch transcription models.

Speech Technology

Multilingual TTS from ElevenLabs

https://twitter.com/radamar/status/1658540025611685888

https://huggingface.co/spaces/elevenlabs/tts

Speech Technology

Recent advances in the AudioLM family: 100x higher speed, better consistency, no quality hit. A new paper from the AudioLM team.

Give it a listen: https://google-research.github.io/seanet/soundstorm/examples/

arXiv: https://arxiv.org/abs/2305.09636

Speech Technology

Final VoxCeleb Challenge

https://mm.kaist.ac.kr/datasets/voxceleb/voxsrc/competition2023.html

Timeline:
May 20th: Development set for verification tracks released.
May 31st: Development set for diarisation tracks released.
June 1st: Test set released and evaluation server open.
Early August: Deadline for submission of results; invitation to workshop speakers.
August 20th: Challenge workshop.

Speech Technology

3 nice Persian TTS datasets

https://www.kaggle.com/magnoliasis/datasets

Speech Technology

Whisper is essentially an audio-conditioned LLM. Can we prompt it to do unseen tasks? Introducing PromptingWhisper!

We use simple prompts to adapt Whisper to unseen tasks zero-shot without any finetuning.

📄 Paper: http://arxiv.org/abs/2305.11095
💻 Code: https://github.com/jasonppy/PromptingWhisper
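
For context on what "prompting" means here: Whisper's decoder is conditioned on a short sequence of special tokens (start of transcript, language, task, optional timestamp switch), and prompt-based adaptation works by changing what goes into that sequence. The sketch below only assembles the standard token layout as a string. The helper name `whisper_prompt` is hypothetical, and the specific prompts used in PromptingWhisper are in the paper and repository, not reproduced here.

```python
def whisper_prompt(language: str, task: str = "transcribe", timestamps: bool = False) -> str:
    """Assemble Whisper's standard special-token prefix as a plain string.

    Hypothetical helper for illustration; it does not call any Whisper API.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Tell the decoder not to emit timestamp tokens.
        tokens.append("<|notimestamps|>")
    return "".join(tokens)


print(whisper_prompt("en"))
print(whisper_prompt("es", task="translate"))
```

Swapping the language or task token changes what the same frozen model is asked to do, which is the lever that zero-shot prompting relies on.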

Speech Technology

https://twitter.com/csteinmetz1/status/1659458441197355008

I was complaining that LLMs don't have ears... This paper is a solid attempt to try to make that happen.

abs: https://arxiv.org/abs/2305.10790
Work from Yuan Gong et al. at MIT

Speech Technology

An interactive demo for the paper above ("Listen, Think, and Understand", https://arxiv.org/abs/2305.10790) is available: https://github.com/YuanGongND/ltu

Speech Technology

https://github.com/0nutation/SpeechGPT

SpeechGPT Series: Speech Large Language Models

Speech Technology

More details on SoundStorm

https://twitter.com/danlyth/status/1660608450852691968

SoundStorm does a nice job of alleviating a key shortcoming of AudioLM.

Instead of the somewhat cumbersome and slow dual Transformers that AudioLM requires for acoustic token generation, SoundStorm uses bidirectional parallel decoding, which leads to a speed-up of two orders of magnitude.
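
To illustrate why parallel decoding is so much faster than token-by-token generation, here is a minimal sketch of confidence-based iterative parallel decoding in the MaskGIT style. It is not SoundStorm's implementation: `dummy_predict` is a hypothetical stand-in that returns random tokens and confidences, and the cosine schedule and parameter values are assumptions chosen for the example.

```python
import math
import random

MASK = -1  # placeholder id for a position that is not decoded yet


def dummy_predict(tokens, vocab_size, rng):
    """Hypothetical stand-in for a bidirectional model.

    Returns a (token, confidence) pair for every position. A real model
    would condition on all tokens that are already unmasked.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode(length, vocab_size=1024, steps=8, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Cosine schedule: how many positions stay masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        preds = dummy_predict(tokens, vocab_size, rng)
        # Commit the most confident predictions among the masked positions.
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:n_commit]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens


decoded = parallel_decode(length=100, steps=8)
print(len(decoded), "tokens decoded in at most 8 model calls")
```

An autoregressive decoder would need one model call per token (100 here), while this loop needs at most `steps` calls regardless of sequence length.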

Speech Technology

https://arxiv.org/abs/2305.11834

Pengi: An Audio Language Model for Audio Tasks

Soham Deshmukh, Benjamin Elizalde, Rita Singh, Huaming Wang

In the domain of audio processing, Transfer Learning has facilitated the rise of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches have led to the development of versatile models capable of tackling a wide array of tasks, while delivering state-of-the-art performance. However, current models inherently lack the capacity to produce the requisite language for open-ended tasks, such as Audio Captioning or Audio Question & Answering. We introduce Pengi, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks. It takes as input, an audio recording, and text, and generates free-form text as output. The input audio is represented as a sequence of continuous embeddings by an audio encoder. A text encoder does the same for the corresponding text input. Both sequences are combined as a prefix to prompt a pre-trained frozen language model. The unified architecture of Pengi enables open-ended tasks and close-ended tasks without any additional fine-tuning or task-specific extensions. When evaluated on 22 downstream tasks, our approach yields state-of-the-art performance in several of them. Our results show that connecting language models with audio models is a major step towards general-purpose audio understanding.
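
The prefix idea in the abstract can be pictured at the level of sequence shapes. The sketch below is not Pengi's code: the encoders are replaced by a hypothetical `fake_encoder` that emits random vectors, and the embedding width and sequence lengths are arbitrary values picked for illustration.

```python
import random

DIM = 8  # hypothetical embedding width, chosen only for illustration


def fake_encoder(n_vectors, seed):
    """Stand-in for an encoder: returns n_vectors continuous embeddings."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(n_vectors)]


def build_lm_input(audio_emb, text_emb, generated_emb):
    """Place the audio and text embeddings as a prefix before the embeddings
    of the tokens the frozen language model has generated so far."""
    prefix = audio_emb + text_emb
    return prefix + generated_emb


audio_emb = fake_encoder(4, seed=1)      # audio encoder output (4 vectors)
text_emb = fake_encoder(3, seed=2)       # text encoder output (3 vectors)
generated_emb = fake_encoder(5, seed=3)  # embeddings of 5 generated tokens

lm_input = build_lm_input(audio_emb, text_emb, generated_emb)
print(len(lm_input), "vectors of width", len(lm_input[0]))
```

The language model itself stays frozen in this setup; only what is fed in front of its token embeddings changes from task to task.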

Speech Technology

MMS: Massively Multilingual Speech.
- Can do speech-to-text and text-to-speech in 1100 languages.
- Can recognize 4000 spoken languages.
- Code and models available under the CC-BY-NC 4.0 license.
- Half the word error rate of Whisper.

Code+Models: https://github.com/facebookresearch/fairseq/tree/main/examples/mms
Paper: https://scontent-lga3-2.xx.fbcdn.net/v/t39.8562-6/348836647_265923086001014_6878005808275791319_n.pdf
Blog: https://ai.facebook.com/blog/multilingual-model-speech-recognition/

Speech Technology

In case you want to play with laughter

https://twitter.com/forthshinji/status/1660990946606219266

Shinnosuke Takamichi / 高道慎之介 on X:

The corpus is now available!!
- Duration: 6.04 hours
- Sampling rate: 24 kHz
- Speakers: 584 Japanese speakers
- Utterances: 11413
https://t.co/zLuFtxJ9kf

Speech Technology

https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer

The code for the bark-voicecloning model. Training and inference.

Speech Technology

CALLS: Japanese Empathetic Dialogue Speech Corpus of Complaint Handling and Attentive Listening in Customer Center

Yuki Saito, Eiji Iimori, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari

We present CALLS, a Japanese speech corpus that considers phone calls in a customer center as a new domain of empathetic spoken dialogue. The existing STUDIES corpus covers only empathetic dialogue between a teacher and student in a school. To extend the application range of empathetic dialogue speech synthesis (EDSS), we designed our corpus to include the same female speaker as the STUDIES teacher, acting as an operator in simulated phone calls. We describe a corpus construction methodology and analyze the recorded speech. We also conduct EDSS experiments using the CALLS and STUDIES corpora to investigate the effect of domain differences. The results show that mixing the two corpora during training causes biased improvements in the quality of synthetic speech due to the different degrees of expressiveness. The project page of the corpus is linked below.

https://arxiv.org/abs/2305.13713

https://sython.org/Corpus/STUDIES-2/
