Speech Technology – Telegram

Speech Technology

1.59K subscribers

122 photos

4 videos

1 file

2.12K links

Download Telegram

About

Blog

Apps

Platform

Speech Technology

1.59K subscribers

Speech Technology

Forwarded from Machinelearning

WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

Propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT.

Конвейер обработки для фильтрации зашумленных данных и создания высококачественных титров.

🖥

Github: https://github.com/xinhaomei/wavcaps

⏩

Paper: https://arxiv.org/abs/2303.17395v1

💨

Dataset: https://paperswithcode.com/dataset/sounddescs

ai_machinelearning_big_data

Please open Telegram to view this post

VIEW IN TELEGRAM

294 views00:24

Speech Technology

This is interesting, all open source conformer implementations have bugs:

📢 We have just released open source a bug-free 🚫🪲implementation of the Conformer model.
📌Check it at: https://github.com/hlt-mt/FBK-fairseq/blob/master/fbk_works/BUGFREE_CONFORMER.md
Want to discover what "bug-free" means?
➡ Take a look at our paper: https://arxiv.org/pdf/2303.16166.pdf

#opensource #conformer #speech #bug #bugfree #NLProc

https://twitter.com/sarapapi/status/1641750885524029440

FBK-fairseq/fbk_works/BUGFREE_CONFORMER.md at master · hlt-mt/FBK-fairseq

Repository containing the open source code of works published at the FBK MT unit. - hlt-mt/FBK-fairseq

661 views00:37

Speech Technology

📢The largest 2,000 hours multi-layer annotated corpus QASR is available @ https://arabicspeech.org/qasr/ QASR is suitable for ASR, dialect ID, punctuation, speaker ID-linking, and potentially other NLP modules for spoken data.
#nlproc #speechproc #Arabic #AI
@QatarComputing

@qcrialt

https://twitter.com/ArabicSpeech/status/1641402805951815681

X (formerly Twitter)

Arabic Speech (@ArabicSpeech) on X

📢The largest 2,000 hours multi-layer annotated corpus QASR is available @ https://t.co/sPwMy4DSLj QASR is suitable for ASR, dialect ID, punctuation, speaker ID-linking, and potentially other NLP modules for spoken data.
#nlproc #speechproc #Arabic #AI @QatarComputing…

630 views13:36

Speech Technology

https://www.openslr.org/136/

EMNS
Identifier: SLR136

Summary: An emotive single-speaker dataset for narrative storytelling. EMNS is dataset containing transcriptions, emotion, emotion intensity, and description of acted speech.

Category: Speech, text-to-speech, automatic speech recognition

License: Apache 2.0
About this resource:

Emotive Narrative Storytelling (EMNS) corpus introduces a dataset consisting of a single speaker, British English speech with high-quality labelled utterances tailored to drive interactive experiences with dynamic and expressive language. Each audio-text pairs are reviewed for artefacts and quality. Furthermore, we extract critical features using natural language descriptions, including word emphasis, level of expressiveness and emotion.

EMNS data collection tool: https://github.com/knoriy/EMNS-DCT

EMNS cleaner: https://github.com/knoriy/EMNS-cleaner

653 viewsedited 13:45

Speech Technology

https://groups.inf.ed.ac.uk/edacc/

The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR. Ramon Sanabria, Bogoychev, Markl, Carmantini, Klejch, and Bell. ICASSP 2023. Presentation of the EdAcc.

groups.inf.ed.ac.uk

626 views01:52

Speech Technology

Learning model from Whisper

https://github.com/speechcatcher-asr

speechcatcher-asr

speechcatcher-asr has 8 repositories available. Follow their code on GitHub.

752 viewsedited 22:12

Speech Technology

https://www.youtube.com/watch?v=v73YdJQqaQo

#43 D. Povey - simple questions about researcher's life.

https://medium.com/@nadirapovey/d-povey-simple-questions-about-researchers-life-1e7328351167

0:00 What Dan does as a computer science researcher?
1:00 IBM experience
1:30 Writing big software code bases
1:45 JHU experience
2:00 Why didn’t he do tenured track?…

797 views01:44

Speech Technology

NeMo 1.17 is now released and and includes a lot of improvements that users have long requested.

This includes a high level Diarization API, PyCTCDecode support for beam search, InterCTC Loss support, AWS Sagemaker tutorial and more !

https://twitter.com/alphacep/status/1644685634404073472

RT @HaseoX94: NeMo 1.17 is now released and and includes a lot of improvements that users have long requested.

This includes a high level…

885 views12:59

Speech Technology

GPU beam search in pytorch

https://github.com/pytorch/audio/pull/3096

Add cuctc decoder by yuekaizhang · Pull Request #3096 · pytorch/audio

This PR related to #2957. It implements a CUDA based ctc prefix beam search decoder.
Attach serveral benchmark results using V100 below:

decoder type
model
datasets
decoding time (secs)
beam siz...

864 views07:54

Speech Technology

Space is closer than you think. Happy Cosmonautics day my friends.

814 views13:20

Speech Technology

Laugh is nice, Russian stress is traditionally bad

https://github.com/suno-ai/bark

GitHub - suno-ai/bark: 🔊 Text-Prompted Generative Audio Model

🔊 Text-Prompted Generative Audio Model. Contribute to suno-ai/bark development by creating an account on GitHub.

704 viewsedited 21:30

Speech Technology

Not sure about claimed accuracy but numbers are interesting

https://blog.deepgram.com/nova-speech-to-text-whisper-api/

A remarkable 22% reduction in word error rate (WER)

A blazing-fast 23-78x quicker inference time

A budget-friendly 3-7x lower cost starting at only $0.0043/min

Introducing Nova: World's Most Powerful Speech-to-Text API - Deepgram Blog ⚡️ | Deepgram

We’re introducing our next-gen speech recognition model with unmatched speed, accuracy, and cost. Plus, a fully managed Whisper API....

778 viewsedited 21:44

Speech Technology

AUDIT:
Audio Editing by Following Instructions with Latent Diffusion Models

Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, Jiang Bian, Sheng Zhao
Abstract. Audio editing is applicable for various purposes, such as adding background sound effects, replacing a musical instrument, and repairing damaged audio. Recently, some diffusion-based methods achieved zero-shot audio editing by using a diffusion and denoising process conditioned on the text description of the output audio. However, these methods still have some problems: 1) they have not been trained on editing tasks and cannot ensure good editing effects; 2) they can erroneously modify audio segments that do not require editing; 3) they need a complete description of the output audio, which is not always available or necessary in practical scenarios. In this work, we propose AUDIT, an instruction-guided audio editing model based on latent diffusion models. Specifically, AUDIT has three main design features: 1) we construct triplet training data (instruction, input audio, output audio) for different audio editing tasks and train a diffusion model using instruction and input (to be edited) audio as conditions and generating output (edited) audio; 2) it can automatically learn to only modify segments that need to be edited by comparing the difference between the input and output audio; 3) it only needs edit instructions instead of full target audio descriptions as text input. AUDIT achieves state-of-the-art results in both objective and subjective metrics for several audio editing tasks (e.g., adding, dropping, replacement, inpainting, super-resolution).

This research is done in alignment with Microsoft's responsible AI principles.

https://audit-demo.github.io/

982 views21:50

Speech Technology

NaturalSpeech 2, a new powerful zero-shot TTS model in NaturaSpeech series🔥
1. Latent diffusion model + continuous codec, avoiding the dilemma in language model + discrete codec;
2. Strong zero-shot speech synthesis with a 3s prompt, singing synthesis with only a speech prompt!

abs: https://arxiv.org/abs/2304.09116
project page: https://speechresearch.github.io/naturalspeech2/

1.43K views04:52

Speech Technology

https://www.youtube.com/watch?v=-oDQ4ggjBnQ

#42 Teaching speech recognizers new words — without retraining [Amazon AWS AI SLT 2022 paper]

https://www.amazon.science/blog/teaching-speech-recognizers-new-words-without-retraining
https://www.amazon.science/publications/personalization-of-ctc-speech-recognition-models

00:00 How is Amazon teaching speech recognizers new words?
00:17 This is a blog…

858 views19:22

Speech Technology

JaX is faster than Pytorch

https://twitter.com/sanchitgandhi99/status/1649046661816287236

1.16K views18:45

Speech Technology

https://github.com/152334H/tortoise-tts-fast

GitHub - 152334H/tortoise-tts-fast: Fast TorToiSe inference (5x or your money back!)

Fast TorToiSe inference (5x or your money back!). Contribute to 152334H/tortoise-tts-fast development by creating an account on GitHub.

703 views17:15

Speech Technology

Whisper can actually do speaker diarization with a prompt. Magic is:

or do a crude form of speaker turn tracking (e.g. " - Hey how are you doing? - I'm doing good. How are you?", note that the token for " -" is suppressed by default and will need to be enabled manually.)

https://github.com/openai/whisper/discussions/117#discussioncomment-3727051

prompt vs prefix in DecodingOptions · openai whisper · Discussion #117

DecodingOptions has the following properties that aren't really discussed in the blog post or paper: # prompt, prefix, and token suppression prompt: Optional[Union[str, List[int]]] = None # tex...

685 viewsedited 00:45

Speech Technology

http://www.asru2023.org/

Taiwan, Taipei

December 16-20, 2023

Regular & Challenge paper submission due: July 3, 2023

644 viewsedited 08:41

Speech Technology

https://slt2022.org/hackathon_projects.php

2022 IEEE Workshop on Spoken Language Technology

The 2022 IEEE Spoken Language Technology Workshop (SLT 2022) will be held on 9th - 12th January 2023 at Doha, Qatar. SLT 2022 will be the first speech conference to have visited the Middle East and the first speech conference to be held in an Arabic speaking…

602 views19:25

Speech Technology

LODR decoding in K2

https://mp.weixin.qq.com/s/HJDaZ5BN1TzEa8oWQ9CBhw

Adding LODR to the rescore process only increases the decoding time by 20% compared to beam search, but reduces the word error rate by 13.8%, which is fast and accurate.

696 views02:55