Data Science by ODS.ai 🦜

Torch, TF, Lasagne code for audio style transfer.

http://dmitryulyanov.github.io/audio-texture-synthesis-and-style-transfer/

#dl #audio #styletransfer #torch #tf #lasagne

Dmitry Ulyanov

Audio texture synthesis and style transfer

by Dmitry Ulyanov and Vadim Lebedev We present an extension of texture synthesis and style transfer method of Leon Gatys et al. for audio. We have developed the same code for three frameworks (well, it is cold in Moscow), choose your favorite: Torch TensorFlow…

2.6K views01:25

Data Science by ODS.ai 🦜

Google has set up a new milestone for speech generation: "Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model"

You can listen to generated samples at: https://google.github.io/tacotron/

Paper: https://arxiv.org/abs/1703.10135

#audio #arxiv #google #breakthrough #generative

arXiv.org

Tacotron: Towards End-to-End Speech Synthesis

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires...

5.2K views14:51

Data Science by ODS.ai 🦜

Google published new article about voice cloning: Expressive Speech Synthesis with Tacotron

link: https://research.googleblog.com/2018/03/expressive-speech-synthesis-with.html
samples: https://google.github.io/tacotron/publications/global_style_tokens/

#wavenet #audio #speech #deeplearning

Googleblog

Expressive Speech Synthesis with Tacotron

7.1K views11:11

Data Science by ODS.ai 🦜

Alhanai_Interspeech-2018.pdf

189 KB

New model to naturally detect depression in conversations

Link: http://news.mit.edu/2018/neural-network-model-detect-depression-conversations-0830

#nlp #audio #dl

11.7K views10:28

Data Science by ODS.ai 🦜

Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model

High-quality #speechrecognition systems require large amounts of data—yet many languages have little data available. Check out new research into an end-to-end system trained as a single model allowing for real-time multilingual speech recognition.

Link: https://ai.googleblog.com/2019/09/large-scale-multilingual-speech.html

#speech #audio #DL #Google

Googleblog

Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model

9.1K views17:26

Data Science by ODS.ai 🦜

Online speech recognition with wav2letter@anywhere

Facebook have open-sourced wav2letter@anywhere, an inference framework for online speech recognition that delivers state-of-the-art performance.

Link: https://ai.facebook.com/blog/online-speech-recognition-with-wav2letteranywhere/

#wav2letter #audiolearning #soundlearning #sound #acoustic #audio #facebook

10.5K views06:01

Data Science by ODS.ai 🦜

MMS: Scaling Speech Technology to 1000+ languages

Get ready for a breakthrough in speech technology that is set to revolutionize the world of communication! The field, which has so far been restricted to around a hundred languages, barely scratches the surface of the more than 7,000 languages spoken globally. The Massively Multilingual Speech (MMS) project is taking a monumental leap to bridge this gap, increasing the number of supported languages by an astounding 10 to 40 times, depending on the task. This unprecedented expansion will be a game-changer, significantly improving global access to information and creating a more inclusive digital landscape.

This incredible feat is achieved through the creation of a new dataset drawn from publicly available religious texts and the strategic implementation of self-supervised learning. The MMS project's achievements are staggering, including the development of pre-trained wav2vec 2.0 models for 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for as many languages, and a language identification model for a whopping 4,017 languages. Even more impressive is the significant improvement in accuracy - our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark, despite being trained on a significantly smaller dataset.

Paper link: https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/
Blogpost link: https://ai.facebook.com/blog/multilingual-model-speech-recognition/
Code link: https://github.com/facebookresearch/fairseq/tree/main/examples/mms

A detailed unofficial overview of the paper: https://andlukyane.com/blog/paper-review-mms
#deeplearning #speechrecognition #tts #audio

10.6K views16:33

Data Science by ODS.ai 🦜

Forwarded from Machinelearning

🌟Qwen2-Audio: Общайтесь с LLM помощью голоса.

Qwen2-Audio - аудио-языковых модель, которая способна принимать аудио и текст на вход и генерировать текст на выходе.

Предусмотрено два режима взаимодействия:

🟠

голосовой чат: пользователи могут использовать голос для передачи инструкций модели без без ввода текста;

🟠

аудио-анализ: пользователи могут предоставлять аудиоинформацию (включая речь, звук, музыку) и текстовые инструкции для анализа.

Обе опубликованные модели поддерживают 8 языков и диалектов: китайский, английский, кантонский, французский, итальянский, испанский, немецкий и японский:

🟢

Qwen2-Audio-7B

🟢

Qwen2-Audio-7B-Instruct

Инференс на transformers в cli возможен в нескольких режимах:

🟠простой инференс модели Qwen2-Audio;

🟠

пакетный инференс (например, несколько текстовых запросов к аудиофайлу);

🟠

инференс анализа аудио (в этом режиме доступны и текстовые и аудио-инструкции);

🟠

инференс голосового чата.

▶️Локальный запуск с GradioUI:


# Ensure you have latest Hugging face transformers
pip install git+https://github.com/huggingface/transformers

# to build a web UI demoinstall the following packages
pip install -r requirements_web_demo.txt

# run Gradio web UI
python demo/web_demo_audio.py

📌Лицензирование : Apache 2.0

🟡

Страница проекта

🟡

Коллекция моделей на HF

🟡

Arxiv

🟡

Сообщество в Discord

🟡

Demo

🖥