This repository contains tools to clone your voice using the Sesame CSM-1B model. It provides two methods for voice cloning:
- Local execution on your own GPU
- Cloud execution using Modal
Note: While this approach captures some voice characteristics and produces a recognizable clone, it's not the best voice cloning solution available. Results are decent but not perfect. If you have ideas for improving the cloning quality, feel free to contribute!
- Python 3.10+
- CUDA-compatible GPU (for local execution; see the quick check after this list)
- Hugging Face account with access to the CSM-1B model
- Hugging Face API token
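If you plan to run locally, here's a quick way to confirm that PyTorch can see your GPU (this assumes PyTorch is already installed):

```python
import torch

# Prints True if a CUDA-compatible GPU is visible to PyTorch
print(torch.cuda.is_available())

# Prints the GPU name when one is available
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```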
- Clone this repository:
```bash
git clone https://github.com/isaiahbjork/csm-voice-cloning.git
cd csm-voice-cloning
```
- Install the required dependencies:
```bash
pip install -r requirements.txt
```
You need to set your Hugging Face token to download the model. You can do this in two ways:
- Set it as an environment variable:
```bash
export HF_TOKEN="your_hugging_face_token"
```
- Or set it directly in the `voice_clone.py` file:
```python
os.environ["HF_TOKEN"] = "your_hugging_face_token"
```
Before using the model, you need to accept the terms on Hugging Face:
- Visit the Sesame CSM-1B model page
- Click on "Access repository" and accept the terms
- Make sure you're logged in with the same account that your HF_TOKEN belongs to (the snippet below verifies this)
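To double-check which account your token actually belongs to, here's a small sketch using the `huggingface_hub` client (install with `pip install huggingface_hub` if needed):

```python
import os
from huggingface_hub import whoami

# Prints the username associated with your token; it should match
# the account that accepted the CSM-1B terms
print(whoami(token=os.environ.get("HF_TOKEN"))["name"])
```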
- Record a clear audio sample of your voice (2-3 minutes is recommended)
- Save it as an MP3 or WAV file
- Transcribe the audio using Whisper or another transcription tool to get the exact text (see the example below)
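For example, here's a minimal transcription sketch using the open-source `openai-whisper` package (an assumption; any accurate transcription tool works):

```python
import whisper  # pip install openai-whisper

# "base" is fast; "medium" or "large" are slower but more accurate
model = whisper.load_model("base")

# Transcribe the sample and print the text to use as the transcription
result = model.transcribe("path/to/your/voice/sample.mp3")
print(result["text"])
```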
- Edit the `voice_clone.py` file to set your parameters directly in the code:
```python
# Set the path to your voice sample
context_audio_path = "path/to/your/voice/sample.mp3"

# Set the transcription of your voice sample
# You need to use Whisper or another tool to transcribe your audio
context_text = "The exact transcription of your voice sample..."

# Set the text you want to synthesize
text = "Text you want to synthesize with your voice."

# Set the output filename
output_filename = "output.wav"
```
- Run the script:
```bash
python voice_clone.py
```
Modal provides cloud GPU resources for faster processing:
- Install Modal:
```bash
pip install modal
```
- Set up Modal authentication:
```bash
modal token new
```
- Edit the `modal_voice_cloning.py` file to set your parameters directly in the code:
```python
# Set the path to your voice sample
context_audio_path = "path/to/your/voice/sample.mp3"

# Set the transcription of your voice sample
# You need to use Whisper or another tool to transcribe your audio
context_text = "The exact transcription of your voice sample..."

# Set the text you want to synthesize
text = "Text you want to synthesize with your voice."

# Set the output filename
output_filename = "output.wav"
```
- Run the Modal script:
```bash
modal run modal_voice_cloning.py
```
If you encounter tensor dimension errors, you may need to adjust the model's maximum sequence length in `models.py`. The default sequence length is 2048, which works for most cases, but if you're using longer audio samples, you might need to increase this value.
Look for the `max_seq_len` parameter in the `llama3_2_1B()` and `llama3_2_100M()` functions in `models.py` and ensure they have the same value:
```python
def llama3_2_1B():
    return llama3_2.llama3_2(
        # other parameters...
        max_seq_len=2048,  # Increase this value if needed
        # other parameters...
    )
```
An audio sample of about 2 minutes and 50 seconds works fine with the default settings; for longer samples, you may need to adjust the sequence length as described above.
- Tensor dimension errors: Adjust the model sequence length as described above
- CUDA out of memory: Try reducing the audio sample length (see the trimming sketch after this list) or use a GPU with more memory
- Model download issues: Ensure you've accepted the model terms on Hugging Face and your token is correct
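For the out-of-memory case, one option is to trim your sample before cloning. Here's a minimal sketch with `torchaudio` (the 60-second cutoff is just an example value):

```python
import torchaudio

# Load the sample and keep only the first 60 seconds
waveform, sample_rate = torchaudio.load("path/to/your/voice/sample.mp3")
trimmed = waveform[:, : 60 * sample_rate]
torchaudio.save("sample_trimmed.wav", trimmed, sample_rate)
```

If you trim the audio, remember to update the transcription so that `context_text` still matches what's in the trimmed file.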
This project uses the Sesame CSM-1B model, which is subject to its own license terms. Please refer to the model page for details.