This repository contains tools to clone your voice using the Sesame CSM-1B model. It provides two methods for voice cloning:
- Local execution on your own GPU
- Cloud execution using Modal
Note: While this approach captures some voice characteristics and produces a recognizable clone, it's not the best voice cloning solution available. Results are decent but not perfect. If you have ideas for improving the cloning quality, feel free to contribute!
- Python 3.10+
- CUDA-compatible GPU (for local execution; see the quick check after this list)
- Hugging Face account with access to the CSM-1B model
- Hugging Face API token
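If you plan to run locally, here's a quick way to confirm that PyTorch can see your GPU (this assumes PyTorch is already installed):

```python
import torch

# Prints True if a CUDA-compatible GPU is visible to PyTorch
print(torch.cuda.is_available())

# Prints the GPU name when one is available
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```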
- Clone this repository:
```bash
git clone https://github.com/isaiahbjork/csm-voice-cloning.git
cd csm-voice-cloning
```
- Install the required dependencies:
```bash
pip install -r requirements.txt
```
You need to set your Hugging Face token to download the model. You can do this in two ways:
- Set it as an environment variable:
```bash
export HF_TOKEN="your_hugging_face_token"
```
- Or set it directly in the `voice_clone.py` file:
```python
os.environ["HF_TOKEN"] = "your_hugging_face_token"
```
Before using the model, you need to accept the terms on Hugging Face:
- Visit the Sesame CSM-1B model page
- Click on "Access repository" and accept the terms
- Make sure you're logged in with the same account that your HF_TOKEN belongs to (the snippet below verifies this)
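To double-check which account your token actually belongs to, here's a small sketch using the `huggingface_hub` client (install with `pip install huggingface_hub` if needed):

```python
import os
from huggingface_hub import whoami

# Prints the username associated with your token; it should match
# the account that accepted the CSM-1B terms
print(whoami(token=os.environ.get("HF_TOKEN"))["name"])
```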
- Record a clear audio sample of your voice (2-3 minutes is recommended)
- Save it as an MP3 or WAV file
- Transcribe the audio using Whisper or another transcription tool to get the exact text (see the example below)
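For example, here's a minimal transcription sketch using the open-source `openai-whisper` package (an assumption; any accurate transcription tool works):

```python
import whisper  # pip install openai-whisper

# "base" is fast; "medium" or "large" are slower but more accurate
model = whisper.load_model("base")

# Transcribe the sample and print the text to use as the transcription
result = model.transcribe("path/to/your/voice/sample.mp3")
print(result["text"])
```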
- Edit the `voice_clone.py` file to set your parameters directly in the code:
```python
# Set the path to your voice sample
context_audio_path = "path/to/your/voice/sample.mp3"

# Set the transcription of your voice sample
# You need to use Whisper or another tool to transcribe your audio
context_text = "The exact transcription of your voice sample..."

# Set the text you want to synthesize
text = "Text you want to synthesize with your voice."

# Set the output filename
output_filename = "output.wav"
```
- Run the script:
```bash
python voice_clone.py
```
Modal provides cloud GPU resources for faster processing:
- Install Modal:
```bash
pip install modal
```
- Set up Modal authentication:
```bash
modal token new
```
- Edit the `modal_voice_cloning.py` file to set your parameters directly in the code:
```python
# Set the path to your voice sample
context_audio_path = "path/to/your/voice/sample.mp3"

# Set the transcription of your voice sample
# You need to use Whisper or another tool to transcribe your audio
context_text = "The exact transcription of your voice sample..."

# Set the text you want to synthesize
text = "Text you want to synthesize with your voice."

# Set the output filename
output_filename = "output.wav"
```
- Run the Modal script:
```bash
modal run modal_voice_cloning.py
```
If you encounter tensor dimension errors, you may need to adjust the model's maximum sequence length in `models.py`. The default sequence length is 2048, which works for most cases, but if you're using longer audio samples, you might need to increase this value.
Look for the `max_seq_len` parameter in the `llama3_2_1B()` and `llama3_2_100M()` functions in `models.py` and ensure they have the same value:
```python
def llama3_2_1B():
    return llama3_2.llama3_2(
        # other parameters...
        max_seq_len=2048,  # Increase this value if needed
        # other parameters...
    )
```
An audio sample of about 2 minutes and 50 seconds works fine with the default settings; for longer samples, you may need to adjust the sequence length as described above.
- Tensor dimension errors: Adjust the model sequence length as described above
- CUDA out of memory: Try reducing the audio sample length (see the trimming sketch after this list) or use a GPU with more memory
- Model download issues: Ensure you've accepted the model terms on Hugging Face and your token is correct
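For the out-of-memory case, one option is to trim your sample before cloning. Here's a minimal sketch with `torchaudio` (the 60-second cutoff is just an example value):

```python
import torchaudio

# Load the sample and keep only the first 60 seconds
waveform, sample_rate = torchaudio.load("path/to/your/voice/sample.mp3")
trimmed = waveform[:, : 60 * sample_rate]
torchaudio.save("sample_trimmed.wav", trimmed, sample_rate)
```

If you trim the audio, remember to update the transcription so that `context_text` still matches what's in the trimmed file.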
This project uses the Sesame CSM-1B model, which is subject to its own license terms. Please refer to the model page for details.