SLM adversarial training in styletts is not that useful (we confirmed that on our experiments too). Spectral loss is enough
Its fun how much compute resources were spent on it
Interesting that some teams still claimed it is useful, for example here in FlashSpeech https://arxiv.org/abs/2404.14700. I suppose they used only simple discriminators.
Its fun how much compute resources were spent on it
Interesting that some teams still claimed it is useful, for example here in FlashSpeech https://arxiv.org/abs/2404.14700. I suppose they used only simple discriminators.
This is an important paper for training audio LLM. One can keep LM aligned with synthetic TTS data.
https://arxiv.org/abs/2309.00916
BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing
Chen Wang, Minpeng Liao, Zhongqiang Huang, Jinliang Lu, Junhong Wu, Yuchen Liu, Chengqing Zong, Jiajun Zhang
The emergence of large language models (LLMs) has sparked significant interest in extending their remarkable language capabilities to speech. However, modality alignment between speech and text still remains an open problem. Current solutions can be categorized into two strategies. One is a cascaded approach where outputs (tokens or states) of a separately trained speech recognition system are used as inputs for LLMs, which limits their potential in modeling alignment between speech and text. The other is an end-to-end approach that relies on speech instruction data, which is very difficult to collect in large quantities. In this paper, we address these issues and propose the BLSP approach that Bootstraps Language-Speech Pre-training via behavior alignment of continuation writing. We achieve this by learning a lightweight modality adapter between a frozen speech encoder and an LLM, ensuring that the LLM exhibits the same generation behavior regardless of the modality of input: a speech segment or its transcript. The training process can be divided into two steps. The first step prompts an LLM to generate texts with speech transcripts as prefixes, obtaining text continuations. In the second step, these continuations are used as supervised signals to train the modality adapter in an end-to-end manner. We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
Also
https://arxiv.org/abs/2405.19041
BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation
Chen Wang, Minpeng Liao, Zhongqiang Huang, Jiajun Zhang
Recent end-to-end approaches have shown promise in extending large language models (LLMs) to speech inputs, but face limitations in directly assessing and optimizing alignment quality and fail to achieve fine-grained alignment due to speech-text length mismatch. We introduce BLSP-KD, a novel approach for Bootstrapping Language-Speech Pretraining via Knowledge Distillation, which addresses these limitations through two key techniques. First, it optimizes speech-text alignment by minimizing the divergence between the LLM's next-token prediction distributions for speech and text inputs using knowledge distillation. Second, it employs a continuous-integrate-andfire strategy to segment speech into tokens that correspond one-to-one with text tokens, enabling fine-grained alignment. We also introduce Partial LoRA (PLoRA), a new adaptation method supporting LLM finetuning for speech inputs under knowledge distillation. Quantitative evaluation shows that BLSP-KD outperforms previous end-to-end baselines and cascaded systems with comparable scale of parameters, facilitating general instruction-following capabilities for LLMs with speech inputs. This approach provides new possibilities for extending LLMs to spoken language interactions.
https://arxiv.org/abs/2309.00916
BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing
Chen Wang, Minpeng Liao, Zhongqiang Huang, Jinliang Lu, Junhong Wu, Yuchen Liu, Chengqing Zong, Jiajun Zhang
The emergence of large language models (LLMs) has sparked significant interest in extending their remarkable language capabilities to speech. However, modality alignment between speech and text still remains an open problem. Current solutions can be categorized into two strategies. One is a cascaded approach where outputs (tokens or states) of a separately trained speech recognition system are used as inputs for LLMs, which limits their potential in modeling alignment between speech and text. The other is an end-to-end approach that relies on speech instruction data, which is very difficult to collect in large quantities. In this paper, we address these issues and propose the BLSP approach that Bootstraps Language-Speech Pre-training via behavior alignment of continuation writing. We achieve this by learning a lightweight modality adapter between a frozen speech encoder and an LLM, ensuring that the LLM exhibits the same generation behavior regardless of the modality of input: a speech segment or its transcript. The training process can be divided into two steps. The first step prompts an LLM to generate texts with speech transcripts as prefixes, obtaining text continuations. In the second step, these continuations are used as supervised signals to train the modality adapter in an end-to-end manner. We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
Also
https://arxiv.org/abs/2405.19041
BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation
Chen Wang, Minpeng Liao, Zhongqiang Huang, Jiajun Zhang
Recent end-to-end approaches have shown promise in extending large language models (LLMs) to speech inputs, but face limitations in directly assessing and optimizing alignment quality and fail to achieve fine-grained alignment due to speech-text length mismatch. We introduce BLSP-KD, a novel approach for Bootstrapping Language-Speech Pretraining via Knowledge Distillation, which addresses these limitations through two key techniques. First, it optimizes speech-text alignment by minimizing the divergence between the LLM's next-token prediction distributions for speech and text inputs using knowledge distillation. Second, it employs a continuous-integrate-andfire strategy to segment speech into tokens that correspond one-to-one with text tokens, enabling fine-grained alignment. We also introduce Partial LoRA (PLoRA), a new adaptation method supporting LLM finetuning for speech inputs under knowledge distillation. Quantitative evaluation shows that BLSP-KD outperforms previous end-to-end baselines and cascaded systems with comparable scale of parameters, facilitating general instruction-following capabilities for LLMs with speech inputs. This approach provides new possibilities for extending LLMs to spoken language interactions.
arXiv.org
BLSP: Bootstrapping Language-Speech Pre-training via Behavior...
The emergence of large language models (LLMs) has sparked significant interest in extending their remarkable language capabilities to speech. However, modality alignment between speech and text...
https://github.com/SWivid/F5-TTS
A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model’s performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our Fairytaler Fakes Fluent and Faithful speech with Flow matching (F5-TTS) exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. Demo samples can be found at https://SWivid.github.io/F5-TTS. We will release all code and checkpoints to promote community development.
A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model’s performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our Fairytaler Fakes Fluent and Faithful speech with Flow matching (F5-TTS) exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. Demo samples can be found at https://SWivid.github.io/F5-TTS. We will release all code and checkpoints to promote community development.
GitHub
GitHub - SWivid/F5-TTS: Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching" - SWivid/F5-TTS
5Hz tokenization for better performance of speech LM
SyllableLM
https://twitter.com/BaadeAlan/status/1844148297562538479
Q: Why can't we get GPT-level understanding from language models on speech?
A: We need better speech tokens!
SyllableLM beats kyutai_labs Moshi on semantic understanding in 70 hours of training by making speech tokens at 5 frames/s
https://github.com/AlanBaade/SyllableLM
https://arxiv.org/abs/2410.04029
SyllableLM: Learning Coarse Semantic Units for Speech Language Models
Alan Baade, Puyuan Peng, David Harwath
Language models require tokenized inputs. However, tokenization strategies for continuous data like audio and vision are often based on simple heuristics such as fixed sized convolutions or discrete clustering, which do not necessarily align with the semantic structure of the data. For speech in particular, the high resolution of waveforms (16,000 samples/second or more) presents a significant challenge as speech-based language models have had to use several times more tokens per word than text-based language models. In this work, we introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units while still preserving semantic information. We do this by 1) extracting noisy boundaries through analyzing correlations in pretrained encoder losses and 2) iteratively improving model representations with a novel distillation technique. Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in syllabic segmentation and clustering. Using these coarse tokens, we successfully train SyllableLM, a Speech Language Model (SpeechLM) that matches or outperforms current SotA SpeechLMs on a range of spoken language modeling tasks. SyllableLM also achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
SyllableLM
https://twitter.com/BaadeAlan/status/1844148297562538479
Q: Why can't we get GPT-level understanding from language models on speech?
A: We need better speech tokens!
SyllableLM beats kyutai_labs Moshi on semantic understanding in 70 hours of training by making speech tokens at 5 frames/s
https://github.com/AlanBaade/SyllableLM
https://arxiv.org/abs/2410.04029
SyllableLM: Learning Coarse Semantic Units for Speech Language Models
Alan Baade, Puyuan Peng, David Harwath
Language models require tokenized inputs. However, tokenization strategies for continuous data like audio and vision are often based on simple heuristics such as fixed sized convolutions or discrete clustering, which do not necessarily align with the semantic structure of the data. For speech in particular, the high resolution of waveforms (16,000 samples/second or more) presents a significant challenge as speech-based language models have had to use several times more tokens per word than text-based language models. In this work, we introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units while still preserving semantic information. We do this by 1) extracting noisy boundaries through analyzing correlations in pretrained encoder losses and 2) iteratively improving model representations with a novel distillation technique. Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in syllabic segmentation and clustering. Using these coarse tokens, we successfully train SyllableLM, a Speech Language Model (SpeechLM) that matches or outperforms current SotA SpeechLMs on a range of spoken language modeling tasks. SyllableLM also achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
To understand the reality, training times on F5. On the other hand, GAN-based TTS like VITS take about the same time.
And you could simply train your own model for a new language:
* Leverage Emilia Dataset (DE EN FR JA KO ZH), as we have include script for it (NOTE. download the mentioned version of Emilia in script, cuz it's currently updated to a WebDataset ver.)
or prepare your own data pairs if not covered, just tailor a Dataset Class in model/dataset.py to your need
* For Base model (multilingual, ~300M), we use <50K hours for each language
* For Small model (e.g. Chinese-only, ~150M), we have made it work with just 1K hours data, config. mentioned in our paper also
Just one thing, the training would take a long time, especially for E2 TTS (if you choose)
And be patient, 8xRTX3090 small model for one week (200~400K updates to hear something reasonable) 8xA100 for base model similarly.
https://github.com/SWivid/F5-TTS/issues/5#issuecomment-2404160945
And you could simply train your own model for a new language:
* Leverage Emilia Dataset (DE EN FR JA KO ZH), as we have include script for it (NOTE. download the mentioned version of Emilia in script, cuz it's currently updated to a WebDataset ver.)
or prepare your own data pairs if not covered, just tailor a Dataset Class in model/dataset.py to your need
* For Base model (multilingual, ~300M), we use <50K hours for each language
* For Small model (e.g. Chinese-only, ~150M), we have made it work with just 1K hours data, config. mentioned in our paper also
Just one thing, the training would take a long time, especially for E2 TTS (if you choose)
And be patient, 8xRTX3090 small model for one week (200~400K updates to hear something reasonable) 8xA100 for base model similarly.
https://github.com/SWivid/F5-TTS/issues/5#issuecomment-2404160945
GitHub
Is it possible to train TTS for a new language? · Issue #5 · SWivid/F5-TTS
Thank you for your work. I would like to inquire about the possibility of training for a new language. If this is feasible, could you please provide more details on the following: How much data is ...
Pretty simple approach to transfer knowledge from existing task-specific models to audio LLM, however, it is interesting that careful data construction can make good results
https://github.com/kehanlu/DeSTA2
https://arxiv.org/abs/2409.20007
Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data
Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs) by incorporating pre-trained speech models. However, these SLMs often undergo extensive speech instruction-tuning to bridge the gap between speech and text modalities. This requires significant annotation efforts and risks catastrophic forgetting of the original language capabilities. In this work, we present a simple yet effective automatic process for creating speech-text pair data that carefully injects speech paralinguistic understanding abilities into SLMs while preserving the inherent language capabilities of the text-based LLM. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data, achieving impressive performance on Dynamic-SUPERB and AIR-Bench-Chat benchmarks. Furthermore, our model exhibits the ability to follow complex instructions derived from LLMs, such as specific output formatting and chain-of-thought reasoning. Our approach not only enhances the versatility and effectiveness of SLMs but also reduces reliance on extensive annotated datasets, paving the way for more efficient and capable speech understanding systems.
https://github.com/kehanlu/DeSTA2
https://arxiv.org/abs/2409.20007
Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data
Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs) by incorporating pre-trained speech models. However, these SLMs often undergo extensive speech instruction-tuning to bridge the gap between speech and text modalities. This requires significant annotation efforts and risks catastrophic forgetting of the original language capabilities. In this work, we present a simple yet effective automatic process for creating speech-text pair data that carefully injects speech paralinguistic understanding abilities into SLMs while preserving the inherent language capabilities of the text-based LLM. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data, achieving impressive performance on Dynamic-SUPERB and AIR-Bench-Chat benchmarks. Furthermore, our model exhibits the ability to follow complex instructions derived from LLMs, such as specific output formatting and chain-of-thought reasoning. Our approach not only enhances the versatility and effectiveness of SLMs but also reduces reliance on extensive annotated datasets, paving the way for more efficient and capable speech understanding systems.
GitHub
GitHub - kehanlu/DeSTA2: Code and model for ICASSP 2025 Paper "Developing Instruction-Following Speech Language Model Without Speech…
Code and model for ICASSP 2025 Paper "Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data" - kehanlu/DeSTA2
Audio tokens are not that simple, doesn't feel modern models work easily with them
https://arxiv.org/abs/2409.19283
Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models
Wenrui Liu, Zhifang Guo, Jin Xu, Yuanjun Lv, Yunfei Chu, Zhou Zhao, Junyang Lin
Building upon advancements in Large Language Models (LLMs), the field of audio processing has seen increased interest in training audio generation tasks with discrete audio token sequences. However, directly discretizing audio by neural audio codecs often results in sequences that fundamentally differ from text sequences. Unlike text, where text token sequences are deterministic, discrete audio tokens can exhibit significant variability based on contextual factors, while still producing perceptually identical audio segments. We refer to this phenomenon as \textbf{Discrete Representation Inconsistency (DRI)}. This inconsistency can lead to a single audio segment being represented by multiple divergent sequences, which creates confusion in neural codec language models and results in omissions and repetitions during speech generation. In this paper, we quantitatively analyze the DRI phenomenon within popular audio tokenizers such as EnCodec. Our approach effectively mitigates the DRI phenomenon of the neural audio codec. Furthermore, extensive experiments on the neural codec language model over LibriTTS and large-scale MLS datases (44,000 hours) demonstrate the effectiveness and generality of our method. The demo of audio samples is available online~\footnote{\url{this https URL}}.
https://arxiv.org/abs/2409.19283
Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models
Wenrui Liu, Zhifang Guo, Jin Xu, Yuanjun Lv, Yunfei Chu, Zhou Zhao, Junyang Lin
Building upon advancements in Large Language Models (LLMs), the field of audio processing has seen increased interest in training audio generation tasks with discrete audio token sequences. However, directly discretizing audio by neural audio codecs often results in sequences that fundamentally differ from text sequences. Unlike text, where text token sequences are deterministic, discrete audio tokens can exhibit significant variability based on contextual factors, while still producing perceptually identical audio segments. We refer to this phenomenon as \textbf{Discrete Representation Inconsistency (DRI)}. This inconsistency can lead to a single audio segment being represented by multiple divergent sequences, which creates confusion in neural codec language models and results in omissions and repetitions during speech generation. In this paper, we quantitatively analyze the DRI phenomenon within popular audio tokenizers such as EnCodec. Our approach effectively mitigates the DRI phenomenon of the neural audio codec. Furthermore, extensive experiments on the neural codec language model over LibriTTS and large-scale MLS datases (44,000 hours) demonstrate the effectiveness and generality of our method. The demo of audio samples is available online~\footnote{\url{this https URL}}.
arXiv.org
Analyzing and Mitigating Inconsistency in Discrete Audio Tokens...
Building upon advancements in Large Language Models (LLMs), the field of audio processing has seen increased interest in training audio generation tasks with discrete audio token sequences....
A new paper from StyleTTS author. This trick is kind of the same as genetic programming though.
https://dmdspeech.github.io/
https://arxiv.org/abs/2410.11097
DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization
Yingahao Aaron Li, Rithesh Kumar, Zeyu Jin
Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are inefficient and hinder the application of end-to-end optimization with perceptual metrics. In this paper, we propose a novel method of distilling TTS diffusion models with direct end-to-end evaluation metric optimization, achieving state-of-the-art performance. By incorporating Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss, our approach optimizes perceptual evaluation metrics, leading to notable improvements in word error rate and speaker similarity. Our experiments show that DMDSpeech consistently surpasses prior state-of-the-art models in both naturalness and speaker similarity while being significantly faster. Moreover, our synthetic speech has a higher level of voice similarity to the prompt than the ground truth in both human evaluation and objective speaker similarity metric. This work highlights the potential of direct metric optimization in speech synthesis, allowing models to better align with human auditory preferences. The audio samples are available at this https URL.
https://dmdspeech.github.io/
https://arxiv.org/abs/2410.11097
DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization
Yingahao Aaron Li, Rithesh Kumar, Zeyu Jin
Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are inefficient and hinder the application of end-to-end optimization with perceptual metrics. In this paper, we propose a novel method of distilling TTS diffusion models with direct end-to-end evaluation metric optimization, achieving state-of-the-art performance. By incorporating Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss, our approach optimizes perceptual evaluation metrics, leading to notable improvements in word error rate and speaker similarity. Our experiments show that DMDSpeech consistently surpasses prior state-of-the-art models in both naturalness and speaker similarity while being significantly faster. Moreover, our synthetic speech has a higher level of voice similarity to the prompt than the ground truth in both human evaluation and objective speaker similarity metric. This work highlights the potential of direct metric optimization in speech synthesis, allowing models to better align with human auditory preferences. The audio samples are available at this https URL.
arXiv.org
DMOSpeech: Direct Metric Optimization via Distilled Diffusion...
Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are...
SANE 2024 workshop ended today
https://www.saneworkshop.org/sane2024/
topics are somewhat interesting. For example, Google attempt to use LLM for diarization
https://www.saneworkshop.org/sane2024/#quan
Hopefully videos will be here:
https://www.youtube.com/@speechandaudiointhenortheast
https://www.saneworkshop.org/sane2024/
topics are somewhat interesting. For example, Google attempt to use LLM for diarization
https://www.saneworkshop.org/sane2024/#quan
Hopefully videos will be here:
https://www.youtube.com/@speechandaudiointhenortheast
www.saneworkshop.org
SANE 2024 - Speech and Audio in the Northeast
SANE is a series of workshops gathering researchers and students in speech and audio from the Northeast of the American continent.
After spending some hours on F5, I found passion to finalize this small post. I'm telling this for quite some time already though.
https://alphacephei.com/nsh/2024/10/18/tts-design.html
https://alphacephei.com/nsh/2024/10/18/tts-design.html
Speech Recognition With Vosk
TTS Design Thoughts
We spent last year working mostly on TTS just as in the good old Festival times. Here are some more random thoughts I have on the subject. Rants follow, I still have trouble living in a positive thinking world. That one of course has advantages as life demonstrates…
Meta shared their SpiritLM
https://github.com/facebookresearch/spiritlm
https://twitter.com/AIatMeta/status/1847383580269510670
https://github.com/facebookresearch/spiritlm
https://twitter.com/AIatMeta/status/1847383580269510670
GitHub
GitHub - facebookresearch/spiritlm: Inference code for the paper "Spirit-LM Interleaved Spoken and Written Language Model".
Inference code for the paper "Spirit-LM Interleaved Spoken and Written Language Model". - facebookresearch/spiritlm
Forwarded from Nick Fisher
https://petewarden.com/2024/10/21/introducing-moonshine-the-new-state-of-the-art-for-speech-to-text/
Pete Warden's blog
Introducing Moonshine, the new state of the art for speech to text
Can you imagine using a keyboard where it took a key press two seconds to show up on screen? That’s the typical latency for most voice interfaces, so it’s no wonder they’ve failed…
A good Chinese MLLM
https://github.com/westlake-baichuan-mllm/bc-omni
https://arxiv.org/abs/2410.08565
Baichuan-Omni Technical Report
The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Baichuan-Omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing modalities of image, video, audio, and text, while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema starting with 7B model...
https://github.com/westlake-baichuan-mllm/bc-omni
https://arxiv.org/abs/2410.08565
Baichuan-Omni Technical Report
The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart. In this paper, we introduce Baichuan-Omni, the first open-source 7B Multimodal Large Language Model (MLLM) adept at concurrently processing and analyzing modalities of image, video, audio, and text, while delivering an advanced multimodal interactive experience and strong performance. We propose an effective multimodal training schema starting with 7B model...
Quite in-depth paper on continuous vs discrete representation
https://arxiv.org/abs/2410.16048
Continuous Speech Synthesis using per-token Latent Diffusion
Arnon Turetzky, Nimrod Shabtay, Slava Shechtman, Hagai Aronowitz, David Haws, Ron Hoory, Avihu Dekel
The success of autoregressive transformer models with discrete tokens has inspired quantization-based approaches for continuous modalities, though these often limit reconstruction quality. We therefore introduce SALAD, a per-token latent diffusion model for zero-shot text-to-speech, that operates on continuous representations. SALAD builds upon the recently proposed expressive diffusion head for image generation, and extends it to generate variable-length outputs. Our approach utilizes semantic tokens for providing contextual information and determining the stopping condition. We suggest three continuous variants for our method, extending popular discrete speech synthesis techniques. Additionally, we implement discrete baselines for each variant and conduct a comparative analysis of discrete versus continuous speech modeling techniques. Our results demonstrate that both continuous and discrete approaches are highly competent, and that SALAD achieves a superior intelligibility score while obtaining speech quality and speaker similarity on par with the ground-truth audio.
https://arxiv.org/abs/2410.16048
Continuous Speech Synthesis using per-token Latent Diffusion
Arnon Turetzky, Nimrod Shabtay, Slava Shechtman, Hagai Aronowitz, David Haws, Ron Hoory, Avihu Dekel
The success of autoregressive transformer models with discrete tokens has inspired quantization-based approaches for continuous modalities, though these often limit reconstruction quality. We therefore introduce SALAD, a per-token latent diffusion model for zero-shot text-to-speech, that operates on continuous representations. SALAD builds upon the recently proposed expressive diffusion head for image generation, and extends it to generate variable-length outputs. Our approach utilizes semantic tokens for providing contextual information and determining the stopping condition. We suggest three continuous variants for our method, extending popular discrete speech synthesis techniques. Additionally, we implement discrete baselines for each variant and conduct a comparative analysis of discrete versus continuous speech modeling techniques. Our results demonstrate that both continuous and discrete approaches are highly competent, and that SALAD achieves a superior intelligibility score while obtaining speech quality and speaker similarity on par with the ground-truth audio.
arXiv.org
Continuous Speech Synthesis using per-token Latent Diffusion
The success of autoregressive transformer models with discrete tokens has inspired quantization-based approaches for continuous modalities, though these often limit reconstruction quality. We...
F5 made a splash. This is a bit more complicated but also a better version (more reasonable audio codec for example)
https://maskgct.github.io
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, Zhizheng Wu
The recent large-scale text-to-speech (TTS) systems are usually grouped as autoregressive and non-autoregressive systems. The autoregressive systems implicitly model duration but exhibit certain deficiencies in robustness and lack of duration controllability. Non-autoregressive systems require explicit alignment information between text and speech during training and predict durations for linguistic units (e.g. phone), which may compromise their naturalness. In this paper, we introduce Masked Generative Codec Transformer (MaskGCT), a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision, as well as phone-level duration prediction. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. MaskGCT follows the mask-and-predict learning paradigm. During training, MaskGCT learns to predict masked semantic or acoustic tokens based on given conditions and prompts. During inference, the model generates tokens of a specified length in a parallel manner. Experiments with 100K hours of in-the-wild speech demonstrate that MaskGCT outperforms the current state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility. Audio samples are available at this https URL. We release our code and model checkpoints at this https URL.
https://maskgct.github.io
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, Zhizheng Wu
The recent large-scale text-to-speech (TTS) systems are usually grouped as autoregressive and non-autoregressive systems. The autoregressive systems implicitly model duration but exhibit certain deficiencies in robustness and lack of duration controllability. Non-autoregressive systems require explicit alignment information between text and speech during training and predict durations for linguistic units (e.g. phone), which may compromise their naturalness. In this paper, we introduce Masked Generative Codec Transformer (MaskGCT), a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision, as well as phone-level duration prediction. MaskGCT is a two-stage model: in the first stage, the model uses text to predict semantic tokens extracted from a speech self-supervised learning (SSL) model, and in the second stage, the model predicts acoustic tokens conditioned on these semantic tokens. MaskGCT follows the mask-and-predict learning paradigm. During training, MaskGCT learns to predict masked semantic or acoustic tokens based on given conditions and prompts. During inference, the model generates tokens of a specified length in a parallel manner. Experiments with 100K hours of in-the-wild speech demonstrate that MaskGCT outperforms the current state-of-the-art zero-shot TTS systems in terms of quality, similarity, and intelligibility. Audio samples are available at this https URL. We release our code and model checkpoints at this https URL.
"We don't want 200ms latency, that's just not useful"
Will Williams is CTO of Speechmatics in Cambridge. In this sponsored episode - he shares deep technical insights into modern speech recognition technology and system architecture. The episode covers several key technical areas:
Speechmatics' hybrid approach to ASR, which focusses on unsupervised learning methods, achieving comparable results with 100x less data than fully supervised approaches. Williams explains why this is more efficient and generalizable than end-to-end models like Whisper.
Their production architecture implementing multiple operating points for different latency-accuracy trade-offs, with careful latency padding (up to 1.8 seconds) to ensure consistent user experience. The system uses lattice-based decoding with language model integration for improved accuracy.
The challenges and solutions in real-time ASR, including their approach to diarization (speaker identification), handling cross-talk, and implicit source separation. Williams explains why these problems remain difficult even with modern deep learning approaches.
Their testing and deployment infrastructure, including the use of mirrored environments for catching edge cases in production, and their strategy of maintaining global models rather than allowing customer-specific fine-tuning.
Technical evolution in ASR, from early days of custom CUDA kernels and manual memory management to modern frameworks, with Williams offering interesting critiques of current PyTorch memory management approaches and arguing for more efficient direct memory allocation in production systems.
https://www.youtube.com/watch?v=k6eXkBtYIHg
Will Williams is CTO of Speechmatics in Cambridge. In this sponsored episode - he shares deep technical insights into modern speech recognition technology and system architecture. The episode covers several key technical areas:
Speechmatics' hybrid approach to ASR, which focusses on unsupervised learning methods, achieving comparable results with 100x less data than fully supervised approaches. Williams explains why this is more efficient and generalizable than end-to-end models like Whisper.
Their production architecture implementing multiple operating points for different latency-accuracy trade-offs, with careful latency padding (up to 1.8 seconds) to ensure consistent user experience. The system uses lattice-based decoding with language model integration for improved accuracy.
The challenges and solutions in real-time ASR, including their approach to diarization (speaker identification), handling cross-talk, and implicit source separation. Williams explains why these problems remain difficult even with modern deep learning approaches.
Their testing and deployment infrastructure, including the use of mirrored environments for catching edge cases in production, and their strategy of maintaining global models rather than allowing customer-specific fine-tuning.
Technical evolution in ASR, from early days of custom CUDA kernels and manual memory management to modern frameworks, with Williams offering interesting critiques of current PyTorch memory management approaches and arguing for more efficient direct memory allocation in production systems.
https://www.youtube.com/watch?v=k6eXkBtYIHg
YouTube
One Step Closer to the Star Trek Voice AI Assistant!
Will Williams is CTO of Speechmatics in Cambridge. In this sponsored episode - he shares deep technical insights into modern speech recognition technology and system architecture. The episode covers several key technical areas:
* Speechmatics' hybrid approach…
* Speechmatics' hybrid approach…
https://twitter.com/SamueleCornell/status/1849115845516984758
https://arxiv.org/abs/2408.09215
Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition
Samuele Cornell, Jordan Darefsky, Zhiyao Duan, Shinji Watanabe
Currently, a common approach in many speech processing tasks is to leverage large scale pre-trained models by fine-tuning them on in-domain data for a particular application. Yet obtaining even a small amount of such data can be problematic, especially for sensitive domains and conversational speech scenarios, due to both privacy issues and annotation costs. To address this, synthetic data generation using single speaker datasets has been employed. Yet, for multi-speaker cases, such an approach often requires extensive manual effort and is prone to domain mismatches. In this work, we propose a synthetic data generation pipeline for multi-speaker conversational ASR, leveraging a large language model (LLM) for content creation and a conversational multi-speaker text-to-speech (TTS) model for speech synthesis. We conduct evaluation by fine-tuning the Whisper ASR model for telephone and distant conversational speech settings, using both in-domain data and generated synthetic data. Our results show that the proposed method is able to significantly outperform classical multi-speaker generation approaches that use external, non-conversational speech datasets.
https://arxiv.org/abs/2408.09215
Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition
Samuele Cornell, Jordan Darefsky, Zhiyao Duan, Shinji Watanabe
Currently, a common approach in many speech processing tasks is to leverage large scale pre-trained models by fine-tuning them on in-domain data for a particular application. Yet obtaining even a small amount of such data can be problematic, especially for sensitive domains and conversational speech scenarios, due to both privacy issues and annotation costs. To address this, synthetic data generation using single speaker datasets has been employed. Yet, for multi-speaker cases, such an approach often requires extensive manual effort and is prone to domain mismatches. In this work, we propose a synthetic data generation pipeline for multi-speaker conversational ASR, leveraging a large language model (LLM) for content creation and a conversational multi-speaker text-to-speech (TTS) model for speech synthesis. We conduct evaluation by fine-tuning the Whisper ASR model for telephone and distant conversational speech settings, using both in-domain data and generated synthetic data. Our results show that the proposed method is able to significantly outperform classical multi-speaker generation approaches that use external, non-conversational speech datasets.
X (formerly Twitter)
Samuele Cornell (@SamueleCornell) on X
In this paper we show that is possible to create synthetic 2-speakers conversations with TTS and LLMs and fine-tune successfully Whisper for multi-speaker ASR generalizing well to real-world scenarios:
https://t.co/Z4b4YrLzpR
Examples of such synth data:…
https://t.co/Z4b4YrLzpR
Examples of such synth data:…
Speech Technology
"We don't want 200ms latency, that's just not useful" Will Williams is CTO of Speechmatics in Cambridge. In this sponsored episode - he shares deep technical insights into modern speech recognition technology and system architecture. The episode covers several…
Some notes on Speechmatics interview:
Latency should be dynamic, modern advertising about small latency is not reasonable, but dynamic context-dependent latency is a thing. AudioLLMs enable that.
Lattices are not the optimal way of representation of the search space if you have may aspects of speech (emotion, etc). Vectorized representations suit GPU better, more compact and learnable. By using lattices we have some control over results but restrict ourselves at the same time.
Wav2vec-like learning Speechmatics uses is 100x faster but at the same time it is very hard to learn long distribution tail without lexical information just from the audio. Semi-supervised learning or full e2e approach definitely have an advantage.
Continuous learning (active inference) is something to think about more actively, yes, something very important for the future.
Latency should be dynamic, modern advertising about small latency is not reasonable, but dynamic context-dependent latency is a thing. AudioLLMs enable that.
Lattices are not the optimal way of representation of the search space if you have may aspects of speech (emotion, etc). Vectorized representations suit GPU better, more compact and learnable. By using lattices we have some control over results but restrict ourselves at the same time.
Wav2vec-like learning Speechmatics uses is 100x faster but at the same time it is very hard to learn long distribution tail without lexical information just from the audio. Semi-supervised learning or full e2e approach definitely have an advantage.
Continuous learning (active inference) is something to think about more actively, yes, something very important for the future.
We released new Vosk models for Persian, WER improved significantly
https://alphacephei.com/vosk/models/vosk-model-fa-0.42.zip
https://alphacephei.com/vosk/models/vosk-model-small-fa-0.42.zip
For more details see
https://github.com/alphacep/awesome-speech/blob/main/persian.md#asr-results
https://alphacephei.com/vosk/models/vosk-model-fa-0.42.zip
https://alphacephei.com/vosk/models/vosk-model-small-fa-0.42.zip
For more details see
https://github.com/alphacep/awesome-speech/blob/main/persian.md#asr-results