The streaming punctuation model is interesting:
https://github.com/alibaba-damo-academy/FunASR/releases/tag/v0.3.0
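For reference, punctuation restoration with FunASR looks roughly like this. A minimal sketch, assuming the AutoModel interface and the "ct-punc" model alias from newer FunASR releases; the v0.3.0 release linked above may expose a different API:
```python
# pip install funasr
from funasr import AutoModel

# CT-Transformer punctuation model; "ct-punc" is the alias used in recent
# FunASR versions -- check the release notes for the exact name.
model = AutoModel(model="ct-punc")

# Restore punctuation on raw (unpunctuated) ASR output.
result = model.generate(input="hello everyone today we look at streaming punctuation")
print(result[0]["text"])
```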
The number of models this guy has trained is quite outstanding
https://malaya-speech.readthedocs.io/en/latest/index.html
🚨 🔔: We've just released our GitHub repository for #ASR and #NLP tools for air traffic control communications, based on the ATCO2 dataset @Atco2P!
We made public 5,000+ hours of audio to enable research on ASR for ATC.
GitHub: https://github.com/idiap/atco2-corpus
https://twitter.com/Pablogomez3/status/1640331512389279744
12th ISCA Speech Synthesis Workshop (SSW) is now open for submissions!
Final submission deadline: May 3, 2023
Late-breaking reports submission deadline: June 28, 2023
The Speech Synthesis Workshop will be held in Grenoble, France, and is organized as a satellite event of the Interspeech conference in Dublin, Ireland.
Come and join the SSW community and the people who create machines that talk!
Visit the official site for more information
https://ssw2023.org/
Forwarded from Machinelearning
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
Proposes a three-stage processing pipeline for filtering noisy data and generating high-quality captions, in which ChatGPT is used to rewrite raw, weakly-labelled descriptions into clean captions.
🖥 Github: https://github.com/xinhaomei/wavcaps
⏩ Paper: https://arxiv.org/abs/2303.17395v1
💨 Dataset: https://paperswithcode.com/dataset/sounddescs
ai_machinelearning_big_data
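The ChatGPT stage is essentially prompt-driven caption rewriting. A minimal sketch of that idea; the prompt wording and model name are illustrative assumptions, not the paper's exact setup:
```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rewrite_as_caption(raw_description: str) -> str:
    # Turn a noisy, weakly-labelled description (e.g. a tag dump from a
    # sound-sharing site) into a single clean audio caption.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Rewrite this audio metadata as one short caption "
                       "describing only the sound events, with no file names "
                       f"or usernames: {raw_description}",
        }],
    )
    return resp.choices[0].message.content.strip()

print(rewrite_as_caption("dog_bark_03.wav by user123, big dog barking twice outdoors"))
```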
This is interesting: apparently all open-source Conformer implementations have bugs.
📢 We have just released open source a bug-free 🚫🪲implementation of the Conformer model.
📌Check it at: https://github.com/hlt-mt/FBK-fairseq/blob/master/fbk_works/BUGFREE_CONFORMER.md
Want to discover what "bug-free" means?
➡ Take a look at our paper: https://arxiv.org/pdf/2303.16166.pdf
#opensource #conformer #speech #bug #bugfree #NLProc
https://twitter.com/sarapapi/status/1641750885524029440
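One classic pitfall in batched Conformer implementations is padding leaking into real frames through the convolution module (see the paper for which bugs were actually found). A minimal sketch of the defensive fix, masking padded frames around the depthwise convolution:
```python
import torch
import torch.nn as nn

def depthwise_conv_with_mask(x, conv: nn.Conv1d, pad_mask):
    """x: (batch, time, channels); pad_mask: (batch, time), True at padded frames.

    Zeroing padded frames BEFORE the convolution prevents padding values from
    leaking into real frames through the convolution window, which would make
    outputs depend on how utterances are batched together.
    """
    x = x.masked_fill(pad_mask.unsqueeze(-1), 0.0)
    y = conv(x.transpose(1, 2)).transpose(1, 2)
    # Re-apply the mask: outputs at padded positions are not meaningful.
    return y.masked_fill(pad_mask.unsqueeze(-1), 0.0)

conv = nn.Conv1d(8, 8, kernel_size=3, padding=1, groups=8)  # depthwise conv
x = torch.randn(2, 10, 8)
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[1, 7:] = True  # second utterance is 3 frames shorter
out = depthwise_conv_with_mask(x, conv, mask)
```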
📢 QASR, the largest multi-layer annotated corpus (2,000 hours), is available at https://arabicspeech.org/qasr/. QASR is suitable for ASR, dialect ID, punctuation, speaker ID/linking, and potentially other NLP modules for spoken data.
#nlproc #speechproc #Arabic #AI
@QatarComputing
@qcrialt
https://twitter.com/ArabicSpeech/status/1641402805951815681
#nlproc #speechproc #Arabic #AI
@QatarComputing
@qcrialt
https://twitter.com/ArabicSpeech/status/1641402805951815681
https://www.openslr.org/136/
EMNS
Identifier: SLR136
Summary: An emotive single-speaker dataset for narrative storytelling. EMNS is a dataset containing transcriptions, emotion, emotion intensity, and descriptions of acted speech.
Category: Speech, text-to-speech, automatic speech recognition
License: Apache 2.0
About this resource:
The Emotive Narrative Storytelling (EMNS) corpus is a dataset of single-speaker British English speech with high-quality labelled utterances, tailored to drive interactive experiences with dynamic and expressive language. Each audio-text pair is reviewed for artefacts and quality. Furthermore, critical features are extracted using natural language descriptions, including word emphasis, level of expressiveness, and emotion.
EMNS data collection tool: https://github.com/knoriy/EMNS-DCT
EMNS cleaner: https://github.com/knoriy/EMNS-cleaner
https://groups.inf.ed.ac.uk/edacc/
The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR. Sanabria, Bogoychev, Markl, Carmantini, Klejch, and Bell. ICASSP 2023. The paper presenting the EdAcc corpus.
NeMo 1.17 is now released and includes a lot of improvements that users have long requested.
This includes a high-level Diarization API, PyCTCDecode support for beam search, InterCTC Loss support, an AWS SageMaker tutorial, and more!
https://twitter.com/alphacep/status/1644685634404073472
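As a reference point for the diarization API, here is a minimal sketch using NeMo's ClusteringDiarizer. The config file name and fields are assumptions based on NeMo's example diarization configs, not necessarily the new high-level API from 1.17:
```python
# pip install nemo_toolkit[asr]
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

# "diar_infer_telephonic.yaml" is a hypothetical local path; the layout
# follows NeMo's example diarization inference configs.
cfg = OmegaConf.load("diar_infer_telephonic.yaml")
cfg.diarizer.manifest_filepath = "input_manifest.json"  # one audio file per line
cfg.diarizer.out_dir = "diar_output"

diarizer = ClusteringDiarizer(cfg=cfg)
diarizer.diarize()  # writes RTTM files with per-speaker segments to out_dir
```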
Not sure about the claimed accuracy, but the numbers are interesting:
https://blog.deepgram.com/nova-speech-to-text-whisper-api/
A remarkable 22% reduction in word error rate (WER)
A blazing-fast 23-78x quicker inference time
A budget-friendly 3-7x lower cost starting at only $0.0043/min
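Trying it out is a single REST call. A minimal sketch, assuming the model is selected with a `model=nova` query parameter as in Deepgram's pre-recorded audio API:
```python
import requests

# Deepgram pre-recorded transcription endpoint; "nova" selects the new model.
resp = requests.post(
    "https://api.deepgram.com/v1/listen?model=nova",
    headers={
        "Authorization": "Token YOUR_DEEPGRAM_API_KEY",  # placeholder key
        "Content-Type": "audio/wav",
    },
    data=open("sample.wav", "rb").read(),
)
resp.raise_for_status()
print(resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"])
```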
AUDIT:
Audio Editing by Following Instructions with Latent Diffusion Models
Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, Jiang Bian, Sheng Zhao
Abstract. Audio editing is applicable for various purposes, such as adding background sound effects, replacing a musical instrument, and repairing damaged audio. Recently, some diffusion-based methods achieved zero-shot audio editing by using a diffusion and denoising process conditioned on the text description of the output audio. However, these methods still have some problems: 1) they have not been trained on editing tasks and cannot ensure good editing effects; 2) they can erroneously modify audio segments that do not require editing; 3) they need a complete description of the output audio, which is not always available or necessary in practical scenarios. In this work, we propose AUDIT, an instruction-guided audio editing model based on latent diffusion models. Specifically, AUDIT has three main design features: 1) we construct triplet training data (instruction, input audio, output audio) for different audio editing tasks and train a diffusion model using instruction and input (to be edited) audio as conditions and generating output (edited) audio; 2) it can automatically learn to only modify segments that need to be edited by comparing the difference between the input and output audio; 3) it only needs edit instructions instead of full target audio descriptions as text input. AUDIT achieves state-of-the-art results in both objective and subjective metrics for several audio editing tasks (e.g., adding, dropping, replacement, inpainting, super-resolution).
This research is done in alignment with Microsoft's responsible AI principles.
https://audit-demo.github.io/
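To make the triplet idea concrete, here is an illustrative training step for an instruction-conditioned latent diffusion editor. Every module and name below is a hypothetical stand-in for exposition, not the AUDIT code:
```python
import torch
import torch.nn.functional as F

def training_step(unet, text_encoder, vae, alphas_cumprod,
                  instruction, in_audio, out_audio):
    cond = text_encoder(instruction)      # edit instruction, not a full caption
    z_in = vae.encode(in_audio)           # latent of the audio to be edited
    z_out = vae.encode(out_audio)         # latent of the edited target audio

    # Standard DDPM forward process applied to the target latent.
    t = torch.randint(0, len(alphas_cumprod), (z_out.size(0),))
    noise = torch.randn_like(z_out)
    a = alphas_cumprod[t].view(-1, *([1] * (z_out.dim() - 1)))
    z_noisy = a.sqrt() * z_out + (1 - a).sqrt() * noise

    # Conditioning on the input-audio latent (here via channel concatenation)
    # lets the model learn to modify only the regions where input and output
    # differ, instead of regenerating the whole clip.
    pred = unet(torch.cat([z_noisy, z_in], dim=1), t, cond)
    return F.mse_loss(pred, noise)
```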
NaturalSpeech 2, a new powerful zero-shot TTS model in the NaturalSpeech series 🔥
1. Latent diffusion model + continuous codec, avoiding the dilemma in language model + discrete codec;
2. Strong zero-shot speech synthesis with a 3s prompt, singing synthesis with only a speech prompt!
abs: https://arxiv.org/abs/2304.09116
project page: https://speechresearch.github.io/naturalspeech2/
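A rough sketch of the inference flow this design implies; all module names are hypothetical stand-ins for exposition, not the paper's code:
```python
def zero_shot_tts(codec, prompt_encoder, duration_pitch_predictor,
                  latent_diffusion, text, prompt_audio_3s):
    # 1) The neural codec yields CONTINUOUS latents (no discrete tokens),
    #    so no autoregressive token language model is needed.
    prompt_latents = codec.encode(prompt_audio_3s)
    speaker_cond = prompt_encoder(prompt_latents)  # voice identity from 3s prompt

    # 2) Predict prosody conditioned on the text and the prompt.
    dur, pitch = duration_pitch_predictor(text, speaker_cond)

    # 3) Latent diffusion generates continuous codec latents directly.
    z = latent_diffusion.sample(text, dur, pitch, speaker_cond)

    # 4) The codec decoder turns the latents back into a waveform.
    return codec.decode(z)
```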