Here is my attempt to convert an LLM to a Full Duplex Dialogue System from scratch while being GPU poor :)
My approach breaks the training into three steps -
- Step 1: Training the LLM to understand the speech tokens. I use Kyutai's Mimi for tokenizing the speech.
- Step 2: Training the model on dialogues at utterance level without overlaps to help it understand the distribution of spoken dialogue better.
- Step 3: Converting the model to a Full-Duplex system by using time warping.
Currently, I am in the process of training the first step.
First, we convert the data into WebDataset shards so that they can be used easily for preprocessing during training. In this step we convert the audio into Mimi tokens, convert the text into Qwen tokens, and add instructions and other metadata. Converting to Qwen tokens is optional, since it can also be done in the preprocessing step in case we use another LLM. Hence, the main function of this step is to convert the speech data into Mimi tokens and standardize them so that they can be preprocessed easily later.
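To make the shard step a little more concrete, here is a minimal sketch of how Mimi codes could be mapped into an extended LLM vocabulary and stored in a WebDataset-style tar shard. It is an illustration under assumptions, not the repository's actual code: the offset layout, the function names (`flatten_codes`, `unflatten_ids`, `write_sample`, `read_samples`), the `speech.json` / `meta.json` extensions and the `text_vocab_size` value are all hypothetical. Only the 4 codebooks and the codebook size of 2048 come from the inspection command in this README. It uses only the standard library and relies on the WebDataset convention that files sharing a basename up to the first dot form one sample.

```python
import io
import json
import tarfile

NUM_CODEBOOKS = 4      # matches --num-codebooks in the inspect command
CODEBOOK_SIZE = 2048   # matches --speech-codebook-size in the inspect command


def flatten_codes(frames, text_vocab_size):
    """Map frames of per-codebook codes to flat token ids placed after the text vocab.

    Assumed layout: id = text_vocab_size + codebook_index * CODEBOOK_SIZE + code.
    """
    ids = []
    for frame in frames:
        if len(frame) != NUM_CODEBOOKS:
            raise ValueError(f"expected {NUM_CODEBOOKS} codes per frame, got {len(frame)}")
        for k, code in enumerate(frame):
            if not 0 <= code < CODEBOOK_SIZE:
                raise ValueError(f"code {code} is outside [0, {CODEBOOK_SIZE})")
            ids.append(text_vocab_size + k * CODEBOOK_SIZE + code)
    return ids


def unflatten_ids(ids, text_vocab_size):
    """Inverse of flatten_codes: recover frames of per-codebook codes."""
    if len(ids) % NUM_CODEBOOKS != 0:
        raise ValueError("number of ids is not a multiple of NUM_CODEBOOKS")
    frames = []
    for start in range(0, len(ids), NUM_CODEBOOKS):
        chunk = ids[start:start + NUM_CODEBOOKS]
        frames.append([tok - text_vocab_size - k * CODEBOOK_SIZE
                       for k, tok in enumerate(chunk)])
    return frames


def write_sample(tar, key, speech_ids, meta):
    """Add one sample to an open tar as <key>.speech.json and <key>.meta.json."""
    for ext, payload in (("speech.json", speech_ids), ("meta.json", meta)):
        data = json.dumps(payload).encode("utf-8")
        info = tarfile.TarInfo(name=f"{key}.{ext}")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))


def read_samples(fileobj):
    """Group tar members by key (the name up to the first dot) and decode the JSON."""
    samples = {}
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            key, ext = member.name.split(".", 1)
            samples.setdefault(key, {})[ext] = json.loads(tar.extractfile(member).read())
    return samples


if __name__ == "__main__":
    text_vocab_size = 1000  # placeholder; read the real size from the LLM tokenizer
    frames = [[1, 2, 3, 4], [5, 6, 7, 8]]  # two made-up frames of codes
    ids = flatten_codes(frames, text_vocab_size)

    buf = io.BytesIO()  # stands in for a shard file on disk
    with tarfile.open(fileobj=buf, mode="w") as tar:
        write_sample(tar, "sample_000000", ids, {"instruction": "transcribe"})
    buf.seek(0)

    restored = read_samples(buf)["sample_000000"]["speech.json"]
    print(unflatten_ids(restored, text_vocab_size) == frames)
```

Giving each codebook its own id range keeps the codes of different codebooks distinguishable after flattening. The real pipeline may arrange the tokens differently, so treat this layout as one possible choice.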
The directory contains three sub-directories, one for each step. Please refer to the Readme in each folder for more details.
Run all the files from the root directory. Refer to the readme of each sub-directory for more information.
For step 1, run the following for preprocessing -
python -m training.step1.preprocess \
--config training/step1/configs/preprocessing.yaml

To inspect whether the preprocessed data is stored correctly, run this -
python -m training.step1.inspect_packed_shard \
--tar path/to/tar/file \
--sample-index 0 \
--tokenizer path/to/tokenizer \
--mimi-ckpt path/to/mimi \
--num-codebooks 4 \
--speech-codebook-size 2048 \
--device cuda \
--out-dir path/to/output/directory

For step 2, run the following for preprocessing
For step 3, run the following for preprocessing
For step 1, we use curriculum learning, which can be set up in the config file. Then we run the following for training -
python -m training.step1.train \
--config training/step1/configs/train.yaml \
--num-nodes [num_of_nodes] \
--num-gpus-per-node [num_of_gpus_per_node]

For step 1, use the eval.yaml config file to run the evaluation -
srun python -m training.step1.eval \
--config training/step1/configs/eval.yaml \
--num-nodes [num_of_nodes] \
--num-gpus-per-node [num_of_gpus_per_node]
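
As a rough illustration of the curriculum learning used in step 1 training, a staged schedule could look like the sketch below. Everything in it is hypothetical: the `Stage` fields, the stage and source names, and the step boundaries are made up for the example and are not the actual schema of `train.yaml`. The idea is simply that the training step decides which stage is active, and the active stage decides how the data sources are mixed.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    until_step: int   # the stage is active while step < until_step
    weights: dict     # data source name -> sampling weight (hypothetical sources)


# Hypothetical three-stage curriculum; real values would come from the config file.
DEMO_STAGES = [
    Stage("stage_1", 1_000, {"source_a": 1.0}),
    Stage("stage_2", 5_000, {"source_a": 0.5, "source_b": 0.5}),
    Stage("stage_3", 10_000, {"source_a": 0.2, "source_b": 0.8}),
]


def stage_for_step(stages, step):
    """Return the first stage whose boundary has not been reached yet."""
    for stage in stages:
        if step < stage.until_step:
            return stage
    return stages[-1]  # past the last boundary: stay in the final stage


def pick_source(stage, rng):
    """Sample one data source according to the stage's mixing weights."""
    names = list(stage.weights)
    return rng.choices(names, weights=[stage.weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 999, 1_000, 7_500, 20_000):
        stage = stage_for_step(DEMO_STAGES, step)
        print(step, stage.name, pick_source(stage, rng))
```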