SongGen

SongGen is a deep learning model for text-to-singing-voice generation. It uses a transformer-based architecture to generate singing voice conditioned on text descriptions and lyrics.

Important Notes

Batching: The model now supports batching with configurable batch sizes per device
Distributed Training: Multi-GPU training is supported through DistributedDataParallel (DDP)
Memory Optimization: Includes gradient checkpointing and mixed precision training for efficient memory usage

Features

Text-to-singing voice generation
Support for both lyrics and descriptive text conditioning
High-quality audio output using XCodec for audio tokenization
Efficient training pipeline with distributed training support
Memory-optimized architecture with gradient checkpointing
Support for Grouped Query Attention (GQA)

Setup

Requirements

Python 3.9 or higher
CUDA-compatible GPU (recommended)

Installation

Clone the repository:

git clone https://github.com/pixaverse-studios/songgen.git
cd songgen

Create and activate a new conda environment:

conda create -n songgen python=3.9
conda activate songgen

Install PyTorch with CUDA support:

# For CUDA 11.8
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.1
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121

Install SongGen and its dependencies:

# Install in editable mode
pip install -e .

Set up XCodec:

# Clone the XCodec repository into the wrapper folder
cd songgen/encoders/xcodec
git clone https://github.com/ZhenYe234/xcodec.git

# Download the checkpoint
mkdir -p xcodec/ckpts/general_more/
wget https://huggingface.co/ZhenYe234/xcodec/resolve/main/xcodec_hubert_general_audio_v2.pth -O xcodec/ckpts/general_more/xcodec_hubert_general_audio_v2.pth
# The URL is quoted so the shell does not treat "?" as a glob character
wget "https://huggingface.co/ZhenYe234/xcodec/resolve/main/config_hubert_general.yaml?download=true" -O xcodec/ckpts/general_more/config_hubert_general.yaml

cd ../../..  # Return to the repository root (three levels up from songgen/encoders/xcodec)

Data Preparation

Prepare your training data in the following structure:

data_dir/
    ├── audio/                     # Directory containing all audio files
    │   ├── song1.wav             # Audio files must be WAV format (16kHz)
    │   ├── song2.wav
    │   └── ...
    │
    ├── train_descriptions.json    # Training data descriptions
    └── eval_descriptions.json     # Evaluation data descriptions

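The JSON files should follow this format. The # comments are annotations only and must not appear in the actual files, because JSON does not allow comments: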

[
    {
        "text": "A pop song with upbeat melody and energetic vocals",  # Required: text description
        "audio_path": "audio/song1.wav",                              # Required: path relative to data_dir
        "lyrics": "Verse 1: ...",                                     # Optional: song lyrics
        "reference_audio": "audio/ref1.wav"                          # Optional: reference audio for voice cloning
    },
    ...
]
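Checking the description files before preprocessing can catch missing fields early. The following is a minimal sketch, assuming only the required and optional fields listed above; the function name `validate_entries` is hypothetical and not part of SongGen.

```python
import json

REQUIRED_KEYS = ("text", "audio_path")          # marked "Required" in the format above
OPTIONAL_KEYS = ("lyrics", "reference_audio")   # marked "Optional" in the format above


def validate_entries(entries):
    """Return a list of human-readable problems; an empty list means no problems were found."""
    problems = []
    for i, entry in enumerate(entries):
        for key in REQUIRED_KEYS:
            value = entry.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {i}: missing or empty required field '{key}'")
        for key in entry:
            if key not in REQUIRED_KEYS + OPTIONAL_KEYS:
                problems.append(f"entry {i}: unexpected field '{key}'")
    return problems


# Inline sample instead of a real file; the second entry lacks "audio_path".
sample = json.loads("""
[
    {"text": "A pop song with upbeat melody and energetic vocals",
     "audio_path": "audio/song1.wav",
     "lyrics": "Verse 1: ..."},
    {"text": "A slow ballad"}
]
""")
print(validate_entries(sample))
```

For a real dataset, the same function could be applied to the result of `json.load` on train_descriptions.json and eval_descriptions.json.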

After preparing the raw data, process it using the preprocessing script:

python -m songgen.data.preprocessing \
    --data_dir /path/to/data_dir \
    --output_dir /path/to/output_dir \
    --train_text train_descriptions.json \
    --eval_text eval_descriptions.json

This will create the following structure in your output directory:

output_dir/
    ├── codes/                     # Directory containing extracted XCodec codes
    │   ├── song1_codes.pt        # Tensor files containing audio codes
    │   ├── song2_codes.pt
    │   └── ...
    │
    ├── train_metadata.json       # Training metadata with paths to codes
    └── eval_metadata.json        # Evaluation metadata with paths to codes
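The listing above suggests that each audio file maps to a `<stem>_codes.pt` file under `codes/`. A minimal sketch of that mapping follows, assuming the naming pattern holds for every file; the helper name `expected_codes_path` is hypothetical and not part of SongGen.

```python
from pathlib import PurePosixPath


def expected_codes_path(audio_path, output_dir):
    """Map e.g. 'audio/song1.wav' to '<output_dir>/codes/song1_codes.pt'.

    ASSUMPTION: the '<stem>_codes.pt' pattern is inferred from the example
    listing, not from the preprocessing source code.
    """
    stem = PurePosixPath(audio_path).stem
    return str(PurePosixPath(output_dir) / "codes" / f"{stem}_codes.pt")


print(expected_codes_path("audio/song1.wav", "/path/to/output_dir"))
# -> /path/to/output_dir/codes/song1_codes.pt
```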

Important Notes:

Audio files must be in WAV format with a 16 kHz sampling rate
Stereo files are automatically converted to mono
The maximum supported audio length is 30 seconds (480,000 samples at 16 kHz)
Text descriptions are limited to 256 tokens
Lyrics are limited to 512 tokens
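The sampling-rate and length constraints above can be checked before preprocessing. This is a minimal sketch using Python's standard `wave` module, which reads PCM WAV files only; the helper names `check_wav` and `write_silence` are hypothetical, and the silent test files exist purely for illustration.

```python
import os
import tempfile
import wave

EXPECTED_SAMPLE_RATE = 16_000                      # 16 kHz, as stated above
MAX_SECONDS = 30                                   # maximum supported length
MAX_SAMPLES = EXPECTED_SAMPLE_RATE * MAX_SECONDS   # 480,000 samples


def check_wav(path):
    """Return a list of problems for a PCM WAV file (empty list = none found).

    Stereo input is not flagged, since it is converted to mono automatically.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate, frames = wav.getframerate(), wav.getnframes()
    if rate != EXPECTED_SAMPLE_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_SAMPLE_RATE} Hz")
    if frames > MAX_SAMPLES:
        problems.append(f"{frames} samples exceeds the limit of {MAX_SAMPLES}")
    return problems


def write_silence(path, sample_rate, seconds):
    """Write a mono 16-bit silent WAV file, used here only as test input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, 16_000, 1)
    write_silence(bad, 44_100, 1)
    good_problems = check_wav(good)
    bad_problems = check_wav(bad)

print(good_problems)
print(bad_problems)
```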

Training

To start training:

# Single GPU
python -m songgen.scripts.train \
    --data_dir /path/to/data_dir \
    --output_dir /path/to/output_dir \
    --model_name_or_path /path/to/model \
    --description_tokenizer_name_or_path /path/to/tokenizer

# Multi-GPU training
torchrun --nproc_per_node=4 scripts/train.py \
    --data_dir songgen/output_dir/ \
    --output_dir ./checkpoints \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --learning_rate 5e-5 \
    --num_train_epochs 15 \
    --warmup_steps 1000 \
    --logging_steps 100 \
    --eval_steps 500 \
    --save_steps 2000 \
    --fp16 true \
    --ddp_backend "nccl" \
    --do_train \
    --do_eval

Training configuration can be customized through command line arguments:

--data_dir: Path to the preprocessed data directory
--output_dir: Directory to save model checkpoints and logs
--model_name_or_path: Path to pretrained model or model identifier from huggingface.co
--description_tokenizer_name_or_path: Path to pretrained tokenizer or tokenizer identifier
--per_device_train_batch_size: Batch size per GPU for training (default: 4)
--per_device_eval_batch_size: Batch size per GPU for evaluation (default: 4)
--gradient_checkpointing: Enable gradient checkpointing for memory efficiency
--fp16: Enable mixed precision training
--learning_rate: Set the initial learning rate (default: 5e-5)
--warmup_steps: Number of warmup steps for learning rate scheduler (default: 1000)
--num_train_epochs: Total number of training epochs (default: 10)
--gradient_accumulation_steps: Number of batches whose gradients are accumulated before each optimizer update (default: 1)
--logging_steps: Log every X update steps (default: 100)
--eval_steps: Run evaluation every X steps (default: 1000)
--save_steps: Save a checkpoint every X update steps (default: 1000)
--save_total_limit: Maximum number of checkpoints to keep (default: 5)
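Because the batch-size arguments are per device, the number of examples consumed per optimizer update depends on the number of GPUs and on gradient accumulation. The sketch below works through that arithmetic; the function names and the dataset size are hypothetical, and rounding the last partial batch up is an assumption about the data loader.

```python
import math


def effective_batch_size(per_device_batch_size, num_devices, gradient_accumulation_steps=1):
    """Examples consumed per optimizer update across all devices."""
    return per_device_batch_size * num_devices * gradient_accumulation_steps


def updates_per_epoch(num_examples, per_device_batch_size, num_devices,
                      gradient_accumulation_steps=1):
    """Approximate optimizer updates per epoch (last partial batch rounded up)."""
    batch = effective_batch_size(per_device_batch_size, num_devices,
                                 gradient_accumulation_steps)
    return math.ceil(num_examples / batch)


# Values from the multi-GPU command above: 4 processes, batch size 8 per device.
multi_gpu = effective_batch_size(per_device_batch_size=8, num_devices=4)
# Documented defaults on a single GPU: batch size 4, no accumulation.
single_gpu = effective_batch_size(per_device_batch_size=4, num_devices=1)

# HYPOTHETICAL dataset size, for illustration only.
num_examples = 10_000
print(multi_gpu, single_gpu, updates_per_epoch(num_examples, 8, 4))
# -> 32 4 313
```

This kind of estimate helps relate step-based settings such as --warmup_steps, --eval_steps and --save_steps to the length of an epoch.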

Generation

To generate singing voice from text and lyrics:

python scripts/generate.py \
    --ckpt_path /path/to/checkpoint \
    --text "A melodic pop song with piano and drums, following a verse-chorus structure at 120 BPM" \
    --lyrics "I see the sunrise, bringing a new day" \
    --output_path output.wav

Generation Parameters

The quality of generated audio can be controlled through sampling parameters:

Conservative (More stable, less creative):

--temperature 0.85 --top_k 120 --top_p 0.92 --repetition_penalty 1.2 --max_length 768

Balanced (Recommended starting point):

--temperature 0.95 --top_k 250 --top_p 0.95 --repetition_penalty 1.3 --max_length 768

Creative (More varied, but potentially less stable):

--temperature 1.0 --top_k 0 --top_p 0.99 --repetition_penalty 1.5 --max_length 768
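To make the effect of these presets concrete, here is a generic sketch of temperature, top-k and top-p (nucleus) filtering over a toy distribution. It is not taken from SongGen's generation script. It assumes the filters are applied in that order and that a top_k of 0 disables top-k filtering, as in common implementations. Repetition penalty is omitted, and the function names are hypothetical.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    weights = [math.exp(v - peak) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    probs = softmax([x / temperature for x in logits])
    # Token indices sorted from most to least likely.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    if top_k > 0:                       # ASSUMPTION: 0 means "no top-k limit"
        order = order[:top_k]
    mass = sum(probs[i] for i in order)
    kept, cumulative = [], 0.0
    for i in order:                     # keep the smallest prefix reaching top_p
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = filter_probs(logits, **kwargs)
    return rng.choices(list(dist), weights=list(dist.values()))[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(sorted(filter_probs(toy_logits, top_k=2)))     # two most likely tokens survive
print(sorted(filter_probs(toy_logits, top_p=0.9)))   # smallest set covering 90% of the mass
print(sample(toy_logits, temperature=0.85, top_k=120, top_p=0.92))
```

Lower temperature and smaller top_k or top_p shrink the set of candidate tokens, which is why the conservative preset is more stable and the creative preset more varied.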

Best Practices for Generation

Text Description Guidelines:
- Be specific about musical elements (genre, instruments, tempo)
- Include structural information (verse, chorus, bridge)
- Specify desired mood and energy level
- Example: "An upbeat pop song with electric guitar and drums, featuring a catchy chorus and bridge section at 120 BPM"

Lyrics Guidelines:
- Keep lyrics clear and rhythmically consistent
- Avoid complex or unusual words
- Match syllable count to desired melody length
- Example: "Verse: Walking through the city lights, Feeling alive tonight"

Optional Parameters:
- --ref_voice_path: Path to reference voice for voice cloning
- --separate: Separate vocals from reference audio
- --num_return_sequences: Generate multiple variations (default: 1)

Model Architecture

The model consists of:

Text Encoder: T5-based transformer with configurable parameters
Decoder: 24-layer transformer with:
- Hidden size: 1024
- Attention heads: 16
- FFN dimension: 4096
- Max position embeddings: 6000
- Support for RoPE embeddings
- Support for Grouped Query Attention (GQA)
XCodec: Audio tokenizer with 8 codebooks
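The decoder dimensions above imply a per-head dimension of 1024 / 16 = 64. The sketch below records the listed values in a small config object and illustrates how Grouped Query Attention shares key/value heads among query heads. The class name `DecoderConfig` is hypothetical, and the number of key/value heads is an assumption chosen for illustration, since it is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    # Values listed in the architecture description above.
    num_layers: int = 24
    hidden_size: int = 1024
    num_attention_heads: int = 16
    ffn_dim: int = 4096
    max_position_embeddings: int = 6000
    num_codebooks: int = 8
    # ASSUMPTION: the README does not state the number of key/value heads.
    num_key_value_heads: int = 4

    @property
    def head_dim(self):
        """Dimension of each attention head."""
        return self.hidden_size // self.num_attention_heads

    def kv_head_for(self, query_head):
        """Index of the key/value head shared by the given query head under GQA."""
        group_size = self.num_attention_heads // self.num_key_value_heads
        return query_head // group_size


cfg = DecoderConfig()
print(cfg.head_dim)                                   # -> 64
print([cfg.kv_head_for(h) for h in range(cfg.num_attention_heads)])
# -> [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
```

With fewer key/value heads than query heads, the key/value cache shrinks by the same ratio (4x under the assumed setting), which is the usual motivation for GQA.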

Training Configuration

Optimizer: AdamW with cosine learning rate schedule
Mixed precision training (FP16)
Gradient checkpointing for memory efficiency
Layer dropout for regularization
Configurable warmup steps and learning rate
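As an illustration of a cosine learning rate schedule with warmup, the sketch below computes the learning rate at a given step. It assumes linear warmup from zero followed by cosine decay to zero, which is a common formulation; the exact scheduler used by the training script is not stated here, and `lr_at` and the total step count are hypothetical.

```python
import math


def lr_at(step, total_steps, base_lr=5e-5, warmup_steps=1000):
    """Learning rate at `step`: linear warmup, then cosine decay to zero.

    Defaults mirror the documented --learning_rate and --warmup_steps values.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# HYPOTHETICAL total number of training steps, for illustration only.
total = 10_000
for step in (0, 500, 1000, 5500, 10_000):
    print(step, lr_at(step, total))
```

The rate rises to its peak at the end of warmup (step 1000 here), is half the peak midway through the decay phase, and reaches zero at the final step.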

License

[Add License Information]

Citation

[Add Citation Information]

Acknowledgments

XCodec model from ZhenYe234/xcodec
Thanks to the contributors and maintainers of the dependencies used in this project

Known Limitations and Future Work

Further memory optimization for larger batch sizes
Additional attention implementations (SDPA)
Support for more audio tokenization methods

Please check back for updates or contribute to help implement these features!
