Large Language Models (LLMs) demonstrate impressive capabilities, yet interaction with these models is mostly facilitated through text. Using text-to-speech to synthesize LLM outputs typically results in notable latency, which is impractical for fluent voice conversations. Discrete audio codecs (or audio tokenizers) have recently regained interest due to the ability of LLMs to learn their compressed acoustic representations. Various publicly available trainable discrete tokenizers have recently demonstrated impressive results for audio tokenization, yet they mostly require high token rates to achieve high-quality reconstruction. In our study, we fine-tuned an open-source general-audio RVQGAN model on diverse open-source speech data, covering various recording conditions and quality levels. The resulting wideband (24 kHz) speech-only model achieves speech reconstruction that is nearly indistinguishable from PCM (pulse-code modulation) at a rate of 150-300 tokens per second (1500-3000 bps). The evaluation was done on comprehensive English speech data encompassing different recording conditions, including studio settings. The models are officially released.
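The quoted token and bit rates follow directly from the codec's frame rate and codebook size. A minimal sketch of that arithmetic, assuming the standard DAC 24 kHz setup of 75 frames per second and 1024-entry codebooks (10 bits per token) -- these two figures are assumptions, not stated above:

```python
import math

FRAME_RATE_HZ = 75    # assumed: 24 kHz audio with a hop size of 320 samples
CODEBOOK_SIZE = 1024  # assumed: each token indexes a 1024-entry codebook

# Bits carried by one discrete token.
bits_per_token = int(math.log2(CODEBOOK_SIZE))  # 10 bits

# Token rate and bitrate for the 2- and 4-codebook configurations.
for n_codebooks in (2, 4):
    tokens_per_sec = n_codebooks * FRAME_RATE_HZ
    bps = tokens_per_sec * bits_per_token
    print(f"{n_codebooks} codebooks: {tokens_per_sec} tokens/s, {bps} bps")
```

Under these assumptions, 2 codebooks give 150 tokens/s (1500 bps) and 4 codebooks give 300 tokens/s (3000 bps), matching the rates reported above.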
Model usage
The model is released under the CDLA-Permissive-2.0 license and can be downloaded from here.
Audio Samples
Selected samples from the listening test described in Section 3.3, comparing the following systems:
Orig. model, 2 codebooks: the original model in its 2-codebook configuration (1.5 kbps)
Orig. model, 4 codebooks: the original model in its 4-codebook configuration (3 kbps)
Retrained model, 2 codebooks: the proposed 2-codebook model (1.5 kbps)
Retrained model, 4 codebooks: the proposed 4-codebook model (3 kbps)
Raw speech: the original wave files, resampled to 24 kHz
Training Configuration
The training configuration YAML used to train the balanced speech-only DAC model described in the paper: