Large Language Models (LLMs) demonstrate impressive capabilities, yet interaction with these models is mostly facilitated through text. Using text-to-speech to synthesize LLM outputs typically results in notable latency, which is impractical for fluent voice conversations. Discrete audio codecs (or audio tokenizers) have recently regained interest due to the ability of LLMs to learn their compressed acoustic representations. Various publicly available trainable discrete tokenizers have recently demonstrated impressive results for audio tokenization, yet they mostly require high token rates to achieve high-quality reconstruction. In our study, we fine-tuned an open-source general-audio RVQGAN model on diverse open-source speech data, covering various recording conditions and quality levels. The resulting wideband (24 kHz) speech-only model achieves speech reconstruction that is nearly indistinguishable from PCM (pulse-code modulation) at a rate of 150-300 tokens per second (1500-3000 bps). The evaluation was done on comprehensive English speech data encompassing different recording conditions, including studio settings. The models are officially released.
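The quoted token and bit rates follow directly from the codec's frame rate and codebook size. A minimal sketch of that arithmetic, assuming the standard DAC 24 kHz setup of 75 frames per second and 1024-entry codebooks (10 bits per token) -- these two figures are assumptions, not stated above:

```python
import math

FRAME_RATE_HZ = 75    # assumed: 24 kHz audio with a hop size of 320 samples
CODEBOOK_SIZE = 1024  # assumed: each token indexes a 1024-entry codebook

# Bits carried by one discrete token.
bits_per_token = int(math.log2(CODEBOOK_SIZE))  # 10 bits

# Token rate and bitrate for the 2- and 4-codebook configurations.
for n_codebooks in (2, 4):
    tokens_per_sec = n_codebooks * FRAME_RATE_HZ
    bps = tokens_per_sec * bits_per_token
    print(f"{n_codebooks} codebooks: {tokens_per_sec} tokens/s, {bps} bps")
```

Under these assumptions, 2 codebooks give 150 tokens/s (1500 bps) and 4 codebooks give 300 tokens/s (3000 bps), matching the rates reported above.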
Model usage
The model is released under the CDLA-Permissive-2.0 license and can be downloaded from here.
Audio Samples
Selected samples from the listening test described in Section 3.3, comparing the following systems:
Orig. model, 2 codebooks: the original model in its 2-codebook configuration (1.5 kbps)
Orig. model, 4 codebooks: the original model in its 4-codebook configuration (3 kbps)
Retrained model, 2 codebooks: the proposed 2-codebook model (1.5 kbps)
Retrained model, 4 codebooks: the proposed 4-codebook model (3 kbps)
Raw speech: the original wave files, resampled to 24 kHz
Training Configuration
The training configuration YAML used to train the balanced speech-only DAC model described in the paper: