ParsonLabs Caption is a powerful tool that automatically removes silent parts from videos and adds subtitles, making it perfect for creating engaging social media clips, YouTube Shorts, TikToks, and more.
- Smart Silence Detection: Automatically detects and removes silent parts of videos
- Smooth Transitions: Preserves context around speech for natural-sounding cuts
- Automatic Subtitles: Uses OpenAI's Whisper model to generate accurate subtitles
- GPU Acceleration: Leverages CUDA (NVIDIA), VideoToolbox (macOS), or VAAPI (Linux) for faster processing
- Cross-Platform: Works on Windows, macOS, and Linux
- Customizable: Configure silence thresholds, context preservation, and video quality
- Python 3.8+
- FFmpeg
- ImageMagick
- PyTorch (with CUDA for GPU acceleration, optional)
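Before installing, it can be useful to confirm the external tools are on your PATH and that Python is recent enough. The following check script is illustrative only and not part of ParsonLabs Caption:

```python
# check_prereqs.py — quick sanity check for the prerequisites above.
# Illustrative only; not part of ParsonLabs Caption itself.
import shutil
import sys

def check_tool(name: str) -> None:
    """Report whether an external tool is on PATH."""
    path = shutil.which(name)
    print(f"{name}: {'found at ' + path if path else 'NOT FOUND'}")

# FFmpeg and ImageMagick must be callable from the command line.
check_tool("ffmpeg")
check_tool("magick")    # ImageMagick 7
check_tool("convert")   # ImageMagick 6 (e.g., the Ubuntu `imagemagick` package)

# Python version (3.8+ required).
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")

# PyTorch is optional but enables GPU acceleration when built with CUDA.
try:
    import torch
    print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
except ImportError:
    print("PyTorch not installed (optional, needed for GPU acceleration)")
```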
# Install FFmpeg and ImageMagick
winget install ImageMagick
winget install FFmpeg
# Create a virtual environment
python -m venv init
init\Scripts\activate
# Install Python dependencies
pip install moviepy pydub openai-whisper torch torchvision torchaudio
# Install FFmpeg and ImageMagick
brew install ffmpeg imagemagick
# Create a virtual environment
python -m venv init
source init/bin/activate
# Install Python dependencies
pip install moviepy pydub openai-whisper torch torchvision torchaudio
# Install FFmpeg and ImageMagick
sudo apt-get update
sudo apt-get install ffmpeg imagemagick
# Create a virtual environment
python -m venv init
source init/bin/activate
# Install Python dependencies
pip install -r requirements.txt
git clone https://github.com/ParsonLabs/caption.git
cd caption
python main.py input_video.mp4
This will generate `input_video_processed.mp4` with silent parts removed and subtitles added.
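To run the same command over a whole folder of clips, a small wrapper around the CLI can help. This sketch assumes only the `python main.py <input>` invocation shown above; the `clips/` directory name and the `_processed` suffix check are examples:

```python
# batch_caption.py — run ParsonLabs Caption over every .mp4 in a folder.
# Sketch only; assumes the basic `python main.py <input>` CLI shown above.
import subprocess
import sys
from pathlib import Path

def process_folder(folder: str) -> None:
    for video in sorted(Path(folder).glob("*.mp4")):
        # Skip files produced by a previous run.
        if video.stem.endswith("_processed"):
            continue
        print(f"Processing {video} ...")
        subprocess.run([sys.executable, "main.py", str(video)], check=True)

if __name__ == "__main__":
    process_folder(sys.argv[1] if len(sys.argv) > 1 else "clips")
```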
python main.py input_video.mp4 --output output.mp4 --silence-length 1500 --silence-threshold -35 --context 300 --pause 500
- `--output, -o`: Specify the output file path
- `--silence-length, -sl`: Minimum silence length in milliseconds to be removed (default: 700)
- `--silence-threshold, -st`: Audio level (in dB) below which is considered silence (default: -35)
- `--context, -c`: Milliseconds of context to add before each speech segment (default: 300)
- `--pause, -p`: Milliseconds of silence to keep at the end of each segment (default: 500)
- `--no-gpu`: Disable GPU usage even if available
- `--threads, -t`: Number of threads to use for video processing (default: auto-detect)
- `--high-quality`: Use higher quality (slower) encoding
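For reference, these options and defaults correspond to an argument parser roughly like the sketch below. This is a hypothetical reconstruction for illustration, not the project's actual code:

```python
# Hypothetical argparse skeleton mirroring the options above;
# not the project's actual parser.
import argparse

parser = argparse.ArgumentParser(description="Remove silence and add subtitles")
parser.add_argument("input", help="Input video file")
parser.add_argument("--output", "-o", help="Output file path")
parser.add_argument("--silence-length", "-sl", type=int, default=700,
                    help="Minimum silence length in ms to be removed")
parser.add_argument("--silence-threshold", "-st", type=int, default=-35,
                    help="Audio level (dB) below which is considered silence")
parser.add_argument("--context", "-c", type=int, default=300,
                    help="Context in ms to add before each speech segment")
parser.add_argument("--pause", "-p", type=int, default=500,
                    help="Silence in ms to keep at the end of each segment")
parser.add_argument("--no-gpu", action="store_true",
                    help="Disable GPU usage even if available")
parser.add_argument("--threads", "-t", type=int, default=None,
                    help="Threads for video processing (default: auto-detect)")
parser.add_argument("--high-quality", action="store_true",
                    help="Use higher quality (slower) encoding")
args = parser.parse_args()
```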
python main.py input_video.mp4 --silence-length 1500
python main.py input_video.mp4 --silence-length 500 --silence-threshold -30
python main.py input_video.mp4 --high-quality
python main.py input_video.mp4 --no-gpu
- Silence Detection: Analyzes the audio track to find silent segments
- Smart Segmentation: Cuts out silent parts while preserving context around speech
- Transcription: Uses OpenAI's Whisper model to transcribe the audio
- Subtitle Generation: Creates readable, properly timed subtitles
- Video Compilation: Combines the non-silent video segments with subtitles
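The first three steps can be sketched with the libraries the project already depends on: pydub for silence analysis and Whisper for transcription. The following is a simplified illustration of the approach, not the project's actual implementation; the audio file name and the way context/pause padding is merged are assumptions, and the parameters mirror the CLI defaults above:

```python
# Simplified sketch of silence detection, smart segmentation, and transcription.
# Not the project's actual implementation; parameters mirror the CLI defaults.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent
import whisper

AUDIO_FILE = "input_audio.wav"   # assumed to be extracted from the video with FFmpeg
SILENCE_LEN = 700                # --silence-length
SILENCE_THRESH = -35             # --silence-threshold
CONTEXT_MS = 300                 # --context
PAUSE_MS = 500                   # --pause

audio = AudioSegment.from_file(AUDIO_FILE)

# 1. Silence detection: pydub returns [start_ms, end_ms] ranges that contain sound.
speech_ranges = detect_nonsilent(audio,
                                 min_silence_len=SILENCE_LEN,
                                 silence_thresh=SILENCE_THRESH)

# 2. Smart segmentation: pad each range with context before and a short pause after,
#    then merge ranges that now overlap so the cuts sound natural.
padded = [[max(0, start - CONTEXT_MS), min(len(audio), end + PAUSE_MS)]
          for start, end in speech_ranges]
segments = []
for start, end in padded:
    if segments and start <= segments[-1][1]:
        segments[-1][1] = max(segments[-1][1], end)
    else:
        segments.append([start, end])
print("Segments to keep (ms):", segments)

# 3. Transcription: Whisper produces timed segments that can become subtitles.
model = whisper.load_model("base")
result = model.transcribe(AUDIO_FILE)
for seg in result["segments"]:
    print(f"{seg['start']:.2f}-{seg['end']:.2f}: {seg['text'].strip()}")
```

The remaining steps cut the original video at these segment boundaries and render the transcribed text as subtitles; the video side of that work is what MoviePy is used for.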
Make sure ImageMagick is properly installed and in your PATH.
Check that you have:
- An NVIDIA GPU with CUDA support (Windows)
- Proper NVIDIA drivers installed
- CUDA-enabled PyTorch installation
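A quick way to verify the PyTorch side of this list, assuming a standard PyTorch install:

```python
# Verify that PyTorch can see a CUDA-capable GPU.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
else:
    print("Running on CPU; reinstall PyTorch with CUDA support if a GPU is present.")
```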
Try these options:
- Ensure GPU acceleration is enabled
- Increase the thread count (`--threads`)
- Use fast mode (the default) instead of `--high-quality`
- Process shorter videos first
Adjust these parameters:
- Lower `--silence-threshold` to a more negative value (e.g., -40) so quieter audio counts as speech
- Decrease `--silence-length` to remove shorter pauses
- Increase `--context` to keep more context around speech segments
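To see how these settings change what counts as speech before running a full pass, the detection can be previewed on the audio alone. This is a sketch using pydub, and `input_audio.wav` is a placeholder for audio extracted from your video:

```python
# Preview how different thresholds affect silence detection.
# Sketch only; `input_audio.wav` is a placeholder for extracted audio.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

audio = AudioSegment.from_file("input_audio.wav")
for thresh in (-30, -35, -40):
    ranges = detect_nonsilent(audio, min_silence_len=700, silence_thresh=thresh)
    kept = sum(end - start for start, end in ranges)
    print(f"threshold {thresh} dB: {len(ranges)} speech segments, {kept / 1000:.1f}s kept")
```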
MIT License
- OpenAI Whisper for speech transcription
- MoviePy for video processing
- PyDub for audio analysis