Tried the popular https://github.com/Kyubyong/g2p. As usual, neural networks are unreliable on unseen words: missing phonemes, extra phonemes, etc. Watch outputs carefully. Example (note the spurious T, which turns "sand" into "stand"):
bio-sand B AY1 OW0 S T AE2 N D
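One common mitigation for such errors is to try a pronunciation lexicon first and only fall back to the neural model for out-of-vocabulary pieces. A minimal sketch of that idea (the lexicon excerpt and the `neural_g2p` stub are hypothetical stand-ins, not the actual g2p package API):

```python
# Sketch: guard a neural G2P model with a dictionary lookup.
# LEXICON is a tiny hand-made CMUdict-style excerpt; neural_g2p is a
# stub standing in for a seq2seq model, which can hallucinate phonemes
# on unseen words (e.g. "stand" for "sand" above).

LEXICON = {
    "bio": ["B", "AY1", "OW0"],
    "sand": ["S", "AE2", "N", "D"],
}

def neural_g2p(word):
    # Placeholder for the neural fallback; wire in a real model here.
    raise NotImplementedError("no neural model in this sketch")

def g2p_with_lexicon(word):
    """Prefer dictionary entries; split hyphenated compounds so each
    piece can hit the lexicon before falling back to the network."""
    phones = []
    for part in word.lower().split("-"):
        if part in LEXICON:
            phones.extend(LEXICON[part])
        else:
            phones.extend(neural_g2p(part))
    return phones

print(g2p_with_lexicon("bio-sand"))
# ['B', 'AY1', 'OW0', 'S', 'AE2', 'N', 'D']
```

This yields the correct pronunciation for "bio-sand" because both halves are dictionary hits; the neural model is only consulted when the lexicon misses.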
BUT placed 3rd in NIST LRE 2022. System description:
https://www.fit.vutbr.cz/research/groups/speech/publi/2022/NIST_LRE_2022_System_Description.pdf
Paraformer released models for other languages too:
We release several new UniASR models: Southern Fujian dialect, French, German, Vietnamese, and Persian.
https://github.com/alibaba-damo-academy/FunASR
How can we make inference faster when using big #speech #selfsupervised models?
Check out @salah_zaiem 's paper that compares various approaches, revealing some pretty interesting insights.
https://arxiv.org/abs/2303.06740
These techniques will soon be available in #SpeechBrain
https://twitter.com/mirco_ravanelli/status/1635678132731518976
New model from AssemblyAI. Definitely improved over its predecessors, but not as good as Speechmatics.
On a toy test: WER 10.89; the previous AssemblyAI (version 9) was at 11.04, and the version before that at 11.89. Speechmatics: 6.88. Whisper large: 8.94.
https://twitter.com/AssemblyAI/status/1636050346240884744
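For reference, WER figures like these are word-level edit distance divided by reference length. A minimal self-contained implementation (my own sketch, not the scoring tool used for the numbers above):

```python
# Word Error Rate: Levenshtein distance over words / reference length.
def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat down"))  # one insertion over 3 words
```

One insertion against a 3-word reference gives 1/3, i.e. about 33% WER, which is why short toy tests swing so much between systems.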
Introducing Conformer-1: our latest state-of-the-art speech recognition model.
Built on top of the Conformer architecture and trained on 650K hours of audio data, it achieves near-human-level performance, making up to 43% fewer errors on noisy data than other ASR models.
We use a modified version of the conformer neural net published by Google Brain.
It's built on top of an Efficient Conformer (Orange Labs, 2021), that introduces the following technical modifications:
- Progressive Downsampling to reduce the length of the encoded sequence
- Grouped Attention: A modified version of the attention mechanism that makes it agnostic to sequence-length
These changes yield speedups of 29% at inference time and 36% at training time.
To further improve our model’s accuracy on noisy audio, we implemented a modified version of Sparse Attention, a pruning method that sparsifies the model’s weights to act as regularization.
We took inspiration from the data scaling laws described in DeepMind's Chinchilla paper and adapted them to the ASR domain.
Our team curated a dataset of 650K hours of English audio - making our model the largest-trained supervised model for English available today.
Based on our results, Conformer-1 is more robust on real-world data than popular commercial and open-source ASR models, making up to 43% fewer errors on average on noisy data:
The biggest improvement with this new release is in our robustness to a wide variety of data domains and noisy audio.
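The Progressive Downsampling idea above can be illustrated with a toy example: each encoder stage shortens the frame sequence, so later self-attention layers (whose cost grows roughly quadratically with length) run on much shorter inputs. The numbers below are illustrative only, not AssemblyAI's actual configuration:

```python
# Sketch of progressive downsampling in a Conformer-style encoder:
# each stage halves the acoustic frame sequence before attention.

def downsample(frames, stride=2):
    """Keep every `stride`-th frame (real models use strided conv)."""
    return frames[::stride]

seq = list(range(1000))      # 1000 acoustic frames from the feature extractor
stage1 = downsample(seq)     # stage 1 output: 500 frames
stage2 = downsample(stage1)  # stage 2 output: 250 frames

# Quadratic attention over 250 frames costs ~16x less than over 1000,
# which is where the inference/training speedups come from.
print(len(stage1), len(stage2))  # 500 250
```

Grouped Attention then further reduces the per-layer cost so the mechanism's expense does not balloon with sequence length.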
Kincaid46 WER from the Ursa announcement:
AssemblyAI: 8.6
Speechmatics: 7.88
Microsoft: 9.70
Whisper Large-v2: 8.7
Vosk 0.42 (Gigaspeech): 15.8
Google: 12.52
Amazon: 10.94
Streaming punctuation model is interesting
https://github.com/alibaba-damo-academy/FunASR/releases/tag/v0.3.0
The number of models this guy has trained is quite outstanding
https://malaya-speech.readthedocs.io/en/latest/index.html
🚨 🔔: We've just released our GitHub repository for #ASR and #NLP tools for air traffic control communications, based on the ATCO2 dataset (@Atco2P)!
We made public 5000+ hours of audio --> research on ASR for ATC.
GitHub: https://github.com/idiap/atco2-corpus
https://twitter.com/Pablogomez3/status/1640331512389279744
12th ISCA Speech Synthesis Workshop (SSW) is now open for submissions!
Final submission deadline: May 3, 2023
Late breaking reports submission deadline: June 28, 2023
The Speech Synthesis Workshop will be held in Grenoble, France, and is organized as a satellite event of the Interspeech conference in Dublin, Ireland.
Come and join the SSW community and the people who create machines that talk!
Visit the official site for more information
https://ssw2023.org/
Forwarded from Machinelearning
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
Proposes a three-stage processing pipeline that uses ChatGPT to filter noisy data and generate high-quality captions.
🖥 Github: https://github.com/xinhaomei/wavcaps
⏩ Paper: https://arxiv.org/abs/2303.17395v1
💨 Dataset: https://paperswithcode.com/dataset/sounddescs
ai_machinelearning_big_data
This is interesting, all open source conformer implementations have bugs:
📢 We have just open-sourced a bug-free 🚫🪲 implementation of the Conformer model.
📌Check it at: https://github.com/hlt-mt/FBK-fairseq/blob/master/fbk_works/BUGFREE_CONFORMER.md
Want to discover what "bug-free" means?
➡ Take a look at our paper: https://arxiv.org/pdf/2303.16166.pdf
#opensource #conformer #speech #bug #bugfree #NLProc
https://twitter.com/sarapapi/status/1641750885524029440
📢The largest 2,000 hours multi-layer annotated corpus QASR is available @ https://arabicspeech.org/qasr/ QASR is suitable for ASR, dialect ID, punctuation, speaker ID-linking, and potentially other NLP modules for spoken data.
#nlproc #speechproc #Arabic #AI
@QatarComputing
@qcrialt
https://twitter.com/ArabicSpeech/status/1641402805951815681
https://www.openslr.org/136/
EMNS
Identifier: SLR136
Summary: An emotive single-speaker dataset for narrative storytelling. EMNS is a dataset containing transcriptions, emotion, emotion intensity, and descriptions of acted speech.
Category: Speech, text-to-speech, automatic speech recognition
License: Apache 2.0
About this resource:
The Emotive Narrative Storytelling (EMNS) corpus introduces a dataset consisting of single-speaker British English speech with high-quality labelled utterances tailored to drive interactive experiences with dynamic and expressive language. Each audio-text pair is reviewed for artefacts and quality. Furthermore, we extract critical features using natural language descriptions, including word emphasis, level of expressiveness and emotion.
EMNS data collection tool: https://github.com/knoriy/EMNS-DCT
EMNS cleaner: https://github.com/knoriy/EMNS-cleaner
https://groups.inf.ed.ac.uk/edacc/
The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR. Ramon Sanabria, Bogoychev, Markl, Carmantini, Klejch, and Bell. ICASSP 2023. Presentation of the EdAcc.
NeMo 1.17 is now released and includes a lot of improvements that users have long requested.
This includes a high level Diarization API, PyCTCDecode support for beam search, InterCTC Loss support, AWS Sagemaker tutorial and more !
https://twitter.com/alphacep/status/1644685634404073472