Hi all,
https://arxiv.org/abs/2301.13341
Neural Target Speech Extraction: An Overview
Katerina Zmolikova, Marc Delcroix, Tsubasa Ochiai, Keisuke Kinoshita, Jan Černocký, Dong Yu
Humans can listen to a target speaker even in challenging acoustic conditions that have noise, reverberation, and interfering speakers. This phenomenon is known as the cocktail-party effect. For decades, researchers have focused on approaching the listening ability of humans. One critical issue is handling interfering speakers because the target and non-target speech signals share similar characteristics, complicating their discrimination. Target speech/speaker extraction (TSE) isolates the speech signal of a target speaker from a mixture of several speakers with or without noises and reverberations using clues that identify the speaker in the mixture. Such clues might be a spatial clue indicating the direction of the target speaker, a video of the speaker's lips, or a pre-recorded enrollment utterance from which their voice characteristics can be derived. TSE is an emerging field of research that has received increased attention in recent years because it offers a practical approach to the cocktail-party problem and involves such aspects of signal processing as audio, visual, array processing, and deep learning. This paper focuses on recent neural-based approaches and presents an in-depth overview of TSE. We guide readers through the different major approaches, emphasizing the similarities among frameworks and discussing potential future directions.
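The clue-conditioning idea at the heart of TSE can be sketched in a few lines of PyTorch. This is a toy illustration, not any specific system from the paper: a hypothetical enrollment-derived speaker embedding multiplicatively modulates the features of a mask estimator so that the network extracts the target speaker from the mixture.

```python
import torch
import torch.nn as nn

class TinyTSE(nn.Module):
    """Toy target speech extraction: condition a mask estimator on a speaker embedding."""
    def __init__(self, n_freq=257, emb_dim=128, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_freq, hidden, batch_first=True)
        # Project the enrollment-derived speaker embedding to the hidden size
        self.spk_proj = nn.Linear(emb_dim, hidden)
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, spk_emb):
        # mix_spec: (batch, time, n_freq) magnitude spectrogram of the mixture
        # spk_emb:  (batch, emb_dim) clue vector, e.g. from an enrollment utterance
        h, _ = self.encoder(mix_spec)
        # Multiplicative conditioning: bias the features toward the target speaker
        h = h * self.spk_proj(spk_emb).unsqueeze(1)
        mask = self.mask_head(h)   # (batch, time, n_freq), values in [0, 1]
        return mask * mix_spec     # estimated target magnitude

model = TinyTSE()
mix = torch.randn(2, 50, 257).abs()
emb = torch.randn(2, 128)
est = model(mix, emb)
print(est.shape)  # torch.Size([2, 50, 257])
```

The same skeleton accommodates other clues (a lip-video encoder or a direction-of-arrival feature) by swapping what produces `spk_emb`.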
https://twitter.com/alphacep/status/1621612504840273928
NeMo 1.15 is out now! There's a whole bunch of powerful ASR features in this release, including Hybrid CTC-RNNT models, Multi-blank Transducers, Multi-Head Attention Adapters, Conformer-Longformer inference, and a Beam Search API!
First, we discuss Hybrid CTC-RNNT models. We can train a single model with both losses and then perform inference with either decoder. It turns out we attain better CTC results, and the CTC head converges 40-50% faster when jointly trained.
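The joint training boils down to interpolating the two losses computed from a shared encoder. Below is a generic sketch, not NeMo's actual implementation; the RNNT term is a dummy placeholder, since computing it for real requires a prediction and joint network.

```python
import torch
import torch.nn as nn

# Toy CTC inputs: (time, batch, vocab) log-probs from a shared encoder
torch.manual_seed(0)
logits = torch.randn(50, 2, 30, requires_grad=True)
log_probs = logits.log_softmax(-1)
targets = torch.randint(1, 30, (2, 10))        # labels in [1, vocab), 0 is blank
input_lens = torch.full((2,), 50)
target_lens = torch.full((2,), 10)

ctc_loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lens, target_lens)

# Dummy stand-in for the transducer loss from the same encoder
# (in a real hybrid model this comes from the RNNT joint network)
rnnt_loss = torch.tensor(5.0)

alpha = 0.3                                    # weight of the auxiliary CTC head
loss = alpha * ctc_loss + (1 - alpha) * rnnt_loss
loss.backward()                                # gradients flow through the CTC term
```

Because both heads share the encoder, the CTC gradient regularizes the encoder even when you deploy only the RNNT decoder, which is one plausible reading of the faster-convergence claim above.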
Next up, Multi-blank Transducers are supported in NeMo. They extend the RNNT loss with blank tokens that can skip multiple timesteps at once, allowing for highly efficient inference, even at the sample level! Refer to the paper here.
With this change, you can now easily train a multi-blank RNNT model and obtain not only better WER but also much faster inference than with regular RNNT models.
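The speed-up comes from the decoding loop: a "big blank" advances several frames in one step instead of one. A toy greedy loop makes the effect visible; `fake_predict` is a hypothetical stand-in for the joint network, not anything from NeMo.

```python
def greedy_multiblank(num_frames, predict):
    t, hyp, steps = 0, [], 0
    while t < num_frames:
        label, skip = predict(t, hyp)
        if label is not None:
            hyp.append(label)      # non-blank: emit a token, stay on the frame
        else:
            t += skip              # blank of width `skip`: jump ahead
        steps += 1
    return hyp, steps

def fake_predict(t, hyp):          # hypothetical joint-network stand-in
    if t == 0 and not hyp:
        return "a", 0              # emit one token at the first frame
    return None, 4                 # afterwards, always a "big blank" of width 4

hyp, steps = greedy_multiblank(16, fake_predict)
print(hyp, steps)  # ['a'] 5   (a standard transducer would need 17 steps here)
```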
Next up, NeMo ASR now supports Multi-Head Attention Adapters. With this approach, any NeMo module can be retrofitted with an adapter module. We see significant parameter efficiency compared to Houlsby adapters. With the newly updated scripts for adapter training, you can easily train either linear adapters or MHA adapters from the same script. More details can be found in the PR.
Long-form audio transcription has long been a challenge for Conformer-based ASR models because of the attention component. So we now support Longformer-based transcription, even for pre-trained models! You can use the transcribe_speech script for this. We find that if you further fine-tune the model after conversion to Longformer attention, you can recover most of the WER and still get excellent long-audio transcription of up to 30-40 minutes in a single forward pass.
A long-requested feature is easy-to-use beam search in NeMo ASR. So we unified the way we do CTC beam search with external libraries behind the simple model.transcribe() method! You can simply update the config and then transcribe!
We also begin supporting AIStore as a scalable framework for terabyte-scale datasets, to train ASR models on enormous real-world data.
It is interesting that for tasks like NER, recent research from Google has returned to structured prediction instead of pure transformers.
https://github.com/lyutyuh/ASP
https://arxiv.org/abs/2210.14698
Autoregressive Structured Prediction with Language Models
Tianyu Liu, Yuchen Jiang, Nicholas Monath, Ryan Cotterell, Mrinmaya Sachan
Recent years have seen a paradigm shift in NLP towards using pretrained language models (PLMs) for a wide range of tasks.
However, there are many difficult design decisions to represent structures (e.g. tagged text, coreference chains) in a way such that they can be captured by PLMs. Prior work on structured prediction with PLMs typically flattens the structured output into a sequence, which limits the quality of structural information being learned and leads to inferior performance compared to classic discriminative models. In this work, we describe an approach to model structures as sequences of actions in an autoregressive manner with PLMs, allowing in-structure dependencies to be learned without any loss.
Our approach achieves the new state-of-the-art on all the structured prediction tasks we looked at, namely, named entity recognition, end-to-end relation extraction, and coreference resolution.
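The "structures as action sequences" idea can be illustrated with a toy round trip. This is my own simplification, not the paper's exact action set: NER spans become bracket actions interleaved with the tokens, and decoding the actions recovers the spans without loss.

```python
# Encode labeled spans as bracket actions interleaved with tokens, and back.
def spans_to_actions(tokens, spans):
    # spans: list of (start, end_exclusive, label)
    actions = []
    for i, tok in enumerate(tokens):
        for s, e, lab in spans:
            if s == i:
                actions.append(f"[{lab}")   # open a span before its first token
        actions.append(tok)
        for s, e, lab in spans:
            if e == i + 1:
                actions.append("]")         # close a span after its last token
    return actions

def actions_to_spans(actions):
    spans, stack, tokens = [], [], []
    for a in actions:
        if a.startswith("["):
            stack.append((a[1:], len(tokens)))
        elif a == "]":
            lab, start = stack.pop()
            spans.append((start, len(tokens), lab))
        else:
            tokens.append(a)
    return tokens, spans

toks = ["Barack", "Obama", "visited", "Paris"]
gold = [(0, 2, "PER"), (3, 4, "LOC")]
acts = spans_to_actions(toks, gold)
print(acts)  # ['[PER', 'Barack', 'Obama', ']', 'visited', '[LOC', 'Paris', ']']
print(actions_to_spans(acts) == (toks, gold))  # True: lossless round trip
```

An autoregressive PLM trained to emit such action sequences can condition each structural decision on all previous ones, which is the in-structure dependency the abstract refers to.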
The repo provides a PyTorch implementation and pre-trained models for ASP - Autoregressive Structured Prediction with Language Models, EMNLP 2022.
https://twitter.com/DrJimFan/status/1622276293776793600
Looks like many of you are ready to embrace the Year of Sound Waves!
Here’s a big and OPEN dataset for you to get your hands dirty on AI audio modeling: EPIC-SOUNDS, 78k segments of annotated, audible events and actions.
Downloadable here: https://epic-kitchens.github.io/epic-sounds/
CMU publications are nice. High-quality TTS trained on YouTube data:
https://github.com/b04901014/MQTTS
https://arxiv.org/abs/2302.04215
A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech
Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky
Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness. The diversity of human speech, however, often goes beyond the coverage of these corpora. We believe the ability to handle such diversity is crucial for AI systems to achieve human-level communication. Our work explores the use of more abundant real-world data for building speech synthesizers. We train TTS systems using real-world speech from YouTube and podcasts. We observe the mismatch between training and inference alignments in mel-spectrogram based autoregressive models, leading to unintelligible synthesis, and demonstrate that learned discrete codes within multiple code groups effectively resolves this issue. We introduce our MQTTS system whose architecture is designed for multiple code generation and monotonic alignment, along with the use of a clean silence prompt to improve synthesis quality. We conduct ablation analyses to identify the efficacy of our methods. We show that MQTTS outperforms existing TTS systems in several objective and subjective measures.
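The "multiple code groups" idea is essentially grouped vector quantization: split each frame vector into groups and quantize each group against its own small codebook, so a frame is represented by several discrete indices instead of one. A minimal NumPy sketch with toy sizes (not the actual MQTTS configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
G, dim, K = 4, 16, 8                      # 4 groups of 4 dims, 8 codes per group
codebooks = rng.normal(size=(G, K, dim // G))

def quantize(x):
    # x: (dim,) -> (one code index per group, reconstructed vector)
    groups = x.reshape(G, dim // G)
    codes, recon = [], np.empty_like(groups)
    for g in range(G):
        d = np.linalg.norm(codebooks[g] - groups[g], axis=1)  # distance to each code
        k = int(d.argmin())
        codes.append(k)
        recon[g] = codebooks[g][k]        # replace the group with its nearest code
    return codes, recon.reshape(dim)

x = rng.normal(size=dim)
codes, x_hat = quantize(x)
print(codes)  # one discrete index per group, G indices in [0, K)
```

With G groups of K codes each, the effective codebook size is K**G while only G*K vectors are stored, which is one way small codebooks can still cover diverse spontaneous speech.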
It is interesting how quickly people implement ideas, like podcast transcripts with Whisper. Here is a selection:
https://podscript.ai/
https://podtext.ai/
https://podscription.app/
https://podsearch.page/
Discussion https://news.ycombinator.com/item?id=34727695
https://github.com/openai/whisper/discussions/937
The Whisper model is now integrated in CTranslate2, a fast inference engine for Transformer models. The project supports many useful inference features such as CPU and GPU execution, asynchronous execution, multi-GPU execution, 8-bit quantization, etc.
You can find a usage example here.
Note that it does not currently implement the full transcription loop, only the model.decode part. So you would still need to implement the transcription logic from transcribe.py on top of it (iterate over each 30-second window, accumulate the context in the prompt, etc.).
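The missing outer loop is straightforward to sketch. `decode_window` below is a hypothetical stand-in for the actual CTranslate2 decode call, not a real API; the point is just the windowing and prompt accumulation.

```python
# Slide a 30-second window over the audio, carrying previous text as the prompt.
WINDOW = 30.0

def transcribe(audio_seconds, decode_window):
    text, offset = [], 0.0
    while offset < audio_seconds:
        chunk_len = min(WINDOW, audio_seconds - offset)
        prompt = " ".join(text)[-200:]      # accumulate recent context in the prompt
        text.append(decode_window(offset, chunk_len, prompt))
        offset += chunk_len
    return " ".join(text)

# 13 minutes of audio -> 26 windows of 30 s each
calls = []
result = transcribe(13 * 60, lambda off, n, p: calls.append(off) or f"seg@{int(off)}")
print(len(calls))  # 26
```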
For example, here's the transcription time of 13 minutes of audio on a V100 for the same accuracy:
Implementation   Time with "small" model   Time with "medium" model
Baseline         1m37s                     3m16s
CTranslate2      0m25s                     0m42s
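For reference, those timings correspond to roughly a 4-5x end-to-end speed-up:

```python
# The table above, converted to seconds and ratios
baseline = {"small": 97, "medium": 196}   # 1m37s, 3m16s
ct2      = {"small": 25, "medium": 42}    # 0m25s, 0m42s
for size in baseline:
    print(size, round(baseline[size] / ct2[size], 1))
# small 3.9
# medium 4.7
```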
Abdelrahman Mohamed, Director of AI at Meta, has joined Rembrand.
https://www.rembrand.com/blog/rembrand-announces-8-million-seed-round/
Some notes on the state of software from this weekend:
The Transformers library pipeline API doesn't split long texts for NER yet: https://github.com/huggingface/transformers/pull/19735
A bug in Kaldi's matrix code was fixed to allow loading 10 GB matrices; not sure how it went unnoticed for so long: https://github.com/kaldi-asr/kaldi/pull/4823
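For scale: 10 GB of float32 values is more elements than a signed 32-bit index can address, which is the classic way such loads break. Whether that is the exact bug fixed in the PR is my assumption, not something stated there.

```python
# Element count of a 10 GiB float32 matrix vs the signed 32-bit index limit
INT32_MAX = 2**31 - 1
elements = 10 * 1024**3 // 4   # bytes / sizeof(float32)
print(elements)                # 2684354560
print(elements > INT32_MAX)    # True: row*col indexing needs a 64-bit type
```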
From FunASR
Updated onnxruntime today and optimized inference speed. Measured with paraformer-large against the modelscope pipeline, averaged over 100 runs on CPU: 2.8x faster inference, RTF 0.110 -> 0.0386, deployed using ONNX.
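As a sanity check, the RTF numbers match the quoted speed-up (RTF = processing time / audio duration, so lower is better):

```python
rtf_before, rtf_after = 0.110, 0.0386
speedup = rtf_before / rtf_after
print(round(speedup, 1))       # 2.8
# At RTF 0.0386, one hour of audio takes about 0.0386 * 3600 ≈ 139 s on CPU
```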
Users can update to the new pipeline: https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/onnxruntime/paraformer/rapid_paraformer
Generally, TRT gives an average ~26.5% speed-up. See README.md for details.
https://github.com/k2-fsa/sherpa/pull/300
Support TensorRT FP16 for Conformer Offline encoder model by wd929 · Pull Request #300 · k2-fsa/sherpa
This PR shows how to convert the offline Conformer ONNX model to TensorRT. We also show throughput and latency comparisons of TRT vs ONNX.
https://opendata.iisys.de/
Good German TTS dataset (326 hours, 5 speakers)
and 610h ASR dataset
Learned from https://arxiv.org/abs/2302.06008
ASR Bundestag: A Large-Scale Political Debate Dataset in German
We present ASR Bundestag, a dataset for automatic speech recognition in German, consisting of 610 hours of aligned audio-transcript pairs for supervised training as well as 1,038 hours of unlabeled audio snippets for self-supervised learning, based on raw audio data and transcriptions from plenary sessions and committee meetings of the German parliament. In addition, we discuss utilized approaches for the automated creation of speech datasets and assess the quality of the resulting dataset based on evaluations and finetuning of a pre-trained state of the art model. We make the dataset publicly available, including all subsets.
Coqui's AI voice studio is live!
🪄 Create new voices
🪞 Clone your voice
⚡️ Fuse voices
🎬 Direct your voices
📂 Organize your projects
https://twitter.com/coqui_ai/status/1626738634849239042
We're glad people use Vosk for real-life applications. If you have built something using Vosk, please share. Here is a great example:
Pal Robotics uses Vosk to recognize speech in ARI V2 service robot
https://pal-robotics.com/wp-content/uploads/2022/12/ARI-Datasheet.pdf
Attention ASR developers and researchers! 🚀 Great news: with the latest update of 🤗 PEFT, you can now fine-tune your Whisper-large model faster than ever before! The new update lets you fit 5x larger batches in less than 10 GB of GPU VRAM, thanks to LoRA and Tim Dettmers's bitsandbytes, packaged nicely in 🤗 PEFT. And the best part? You get comparable WER, just faster! ⚡️
But that's not all: you no longer have to compromise on training speed to maintain WER. In fact, in our experiments with the Marathi language, the WER was comparable with full fine-tuning runs of Whisper-large: 13.64 WER without PEFT (full training run) vs 14.01 WER with PEFT (trained on a @googlecolab). With 🤗 PEFT, you can now train a Whisper-large v2 model in less than 8 GB of GPU VRAM! 📉
Without 🤗 PEFT, you could hit OOM on a Colab T4, but not anymore! You can also easily save on storage and port tiny checkpoints of ~63 MB, compared to a 6.7 GB fully fine-tuned model. 🐜
And that's not all! For low latency, you can convert the PEFT model to ONNX and run it with ONNX Runtime (ORT) via 🤗 Optimum.
Start experimenting today and fine-tune your Whisper using PEFT+INT8 in Colab on a language of your choice! Join our Discord community to get involved in the conversation and discuss your results and questions. 🔬
Check out the Colab notebook examples and start your ASR development journey with 🤗 PEFT today!
https://github.com/huggingface/peft
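A back-of-the-envelope calculation makes the ~63 MB checkpoint plausible. Whisper-large has d_model=1280 with 32 encoder and 32 decoder layers; the rest is my assumption, not stated in the post: LoRA rank r=32 applied to the q and v projections of every attention module (encoder self-attention, decoder self- and cross-attention).

```python
d, r = 1280, 32
attn_modules = 32 + 2 * 32               # encoder self + decoder self and cross
lora_params_per_matrix = r * d + d * r   # A: (r, d) plus B: (d, r)
total = attn_modules * 2 * lora_params_per_matrix   # q and v per module
print(total)                 # 15728640 trainable parameters
print(total * 4 / 1e6)       # ≈ 62.9 MB in fp32, close to the ~63 MB quoted above
```

Only these adapter matrices are saved and trained; the frozen base model is loaded separately, which is why the checkpoint is ~100x smaller than the full model.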