AI-powered text-to-video generation with neural lip-sync capabilities
Transform text prompts into professional talking-head videos with accurate lip synchronization.
Powered by CogVideoX, XTTS-v2, Wav2Lip, and Real-ESRGAN.
Features • Demo • Quick Start • API • Deployment
| Feature | Description |
|---|---|
| 🎬 Text-to-Video | Generate video from text using CogVideoX diffusion models |
| 🎤 Neural TTS | High-quality speech synthesis with XTTS-v2 (17 languages) |
| 👄 Lip Sync | Accurate lip synchronization using Wav2Lip GAN |
| 📺 HD Upscaling | 4x video enhancement with Real-ESRGAN |
| 🎭 Voice Cloning | Clone any voice from a 6-second audio sample |
| 🌐 REST API | Production-ready FastAPI backend |
| 💻 Modern UI | Beautiful web interface with real-time progress |
| 🐳 Docker Ready | One-command deployment with docker-compose |
Live Demo: text-to-video-generator.vercel.app
Enter your text prompt and watch AI generate a lip-synced video!
┌─────────────────────────────────────────────────────────────────┐
│ Text Prompt │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────┴─────────────────────┐
▼ ▼
┌───────────────────┐ ┌───────────────────┐
│ CogVideoX │ │ XTTS-v2 │
│ Video Gen │ │ Speech Gen │
└───────────────────┘ └───────────────────┘
│ │
│ ┌─────────────────────┐ │
└────────►│ Wav2Lip │◄─────────┘
│ Lip Sync │
└─────────────────────┘
│
▼
┌─────────────────────┐
│ Real-ESRGAN │ (Optional)
│ Upscaling │
└─────────────────────┘
│
▼
┌─────────────────────┐
│ Final MP4 │
│ Download │
└─────────────────────┘
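The data flow in the diagram can be sketched as a small orchestration function. This is only an illustration under stated assumptions: `run_pipeline` and the four stage callables are hypothetical names, not the project's actual `core/pipeline.py` API, and the stages run one after another here, on the assumption that only one model should occupy the GPU at a time.

```python
from typing import Callable, Optional


def run_pipeline(
    prompt: str,
    generate_video: Callable[[str], str],      # hypothetical CogVideoX wrapper: prompt -> video path
    synthesize_speech: Callable[[str], str],   # hypothetical XTTS-v2 wrapper: prompt -> audio path
    lip_sync: Callable[[str, str], str],       # hypothetical Wav2Lip wrapper: (video, audio) -> video path
    upscale: Optional[Callable[[str], str]] = None,  # hypothetical Real-ESRGAN wrapper (optional stage)
) -> str:
    """Return the path of the final video for a text prompt."""
    # Both branches depend only on the prompt; they are run sequentially
    # here so that only one model is active at any moment.
    video_path = generate_video(prompt)
    audio_path = synthesize_speech(prompt)

    # Lip sync needs the outputs of both branches.
    synced_path = lip_sync(video_path, audio_path)

    # Upscaling is optional, as in the diagram.
    if upscale is not None:
        return upscale(synced_path)
    return synced_path
```

Passing the stages in as callables keeps the sketch independent of any particular model library and makes it easy to swap in stubs for testing.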
- Python 3.10-3.11
- NVIDIA GPU with 12GB+ VRAM (RTX 3060 or better)
- CUDA 11.8+
- FFmpeg
# Clone the repository
git clone https://github.com/yourusername/text-to-video-generator.git
cd text-to-video-generator
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Download model checkpoints
python scripts/download_models.py
# Copy environment config
cp .env.example .env
# Start the server
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
Open http://localhost:8000 in your browser.
# Build and run with Docker Compose
docker-compose up -d
# Check logs
docker-compose logs -f
POST /api/generate
Request Body:
{
"prompt": "Hello! Welcome to our AI demonstration.",
"duration": 6,
"language": "en",
"upscale": true
}
Response:
{
"job_id": "job_20260121_123456_abc12345",
"status": "pending",
"message": "Video generation started"
}
GET /api/status/{job_id}
Response:
{
"job": {
"job_id": "job_20260121_123456_abc12345",
"status": "generating_video",
"progress": 35.5,
"current_step": "Generating video frames..."
}
}
GET /api/download/{job_id}
Returns the generated MP4 video file.
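A client would typically chain these three endpoints: submit a job, poll its status, then fetch the download URL. The following is a minimal sketch using only the standard library. The function names are hypothetical, and the terminal status values `"completed"` and `"failed"` are assumptions, since the examples above only show `"pending"` and `"generating_video"`.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # Quick Start address; adjust for your deployment


def http_json(method, url, body=None):
    """Send a JSON request with urllib and return the decoded JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    request = urllib.request.Request(url, data=data, method=method, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def generate_and_wait(prompt, send=http_json, poll_seconds=2.0, max_polls=300,
                      done_status="completed", failed_status="failed"):
    """Submit a prompt, poll until the job finishes, and return the download URL.

    done_status and failed_status are assumed names, not confirmed by the API docs.
    """
    job = send("POST", f"{BASE_URL}/api/generate",
               {"prompt": prompt, "duration": 6, "language": "en", "upscale": True})
    job_id = job["job_id"]

    for _ in range(max_polls):
        state = send("GET", f"{BASE_URL}/api/status/{job_id}")["job"]
        if state["status"] == done_status:
            return f"{BASE_URL}/api/download/{job_id}"
        if state["status"] == failed_status:
            raise RuntimeError(f"Job {job_id} failed at step: {state.get('current_step')}")
        time.sleep(poll_seconds)

    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")
```

The `send` parameter exists so the polling logic can be exercised with a stub instead of a running server.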
GET /health
Response:
{
"status": "healthy",
"version": "1.0.0",
"gpu_available": true,
"gpu_name": "NVIDIA RTX 4090",
"gpu_memory_gb": 24.0
}

| Code | Language | Code | Language |
|---|---|---|---|
| en | English | ru | Russian |
| es | Spanish | nl | Dutch |
| fr | French | cs | Czech |
| de | German | ar | Arabic |
| it | Italian | zh-cn | Chinese |
| pt | Portuguese | ko | Korean |
| pl | Polish | ja | Japanese |
| tr | Turkish | hi | Hindi |
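A client can check the language code against this table before submitting a job. The helper below is a hypothetical sketch, not part of the project's API. It builds a request body with the same fields as the `/api/generate` example and rejects codes that are not listed above.

```python
# Language codes copied from the table above.
SUPPORTED_LANGUAGES = {
    "en": "English", "ru": "Russian", "es": "Spanish", "nl": "Dutch",
    "fr": "French", "cs": "Czech", "de": "German", "ar": "Arabic",
    "it": "Italian", "zh-cn": "Chinese", "pt": "Portuguese", "ko": "Korean",
    "pl": "Polish", "ja": "Japanese", "tr": "Turkish", "hi": "Hindi",
}


def build_generate_request(prompt, language="en", duration=6, upscale=True):
    """Return a request body for POST /api/generate (hypothetical helper)."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    code = language.strip().lower()  # tolerate "EN" or " zh-CN "
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    return {"prompt": prompt, "duration": duration, "language": code, "upscale": upscale}
```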
text-to-video-generator/
├── api/ # FastAPI backend
│ ├── main.py # Application entry point
│ ├── routes/ # API endpoints
│ └── models/ # Pydantic schemas
├── core/ # Pipeline orchestration
│ ├── config.py # Configuration management
│ ├── pipeline.py # Main workflow
│ └── utils.py # Utility functions
├── modules/ # AI model wrappers
│ ├── video_generator/ # CogVideoX integration
│ ├── tts/ # XTTS-v2 integration
│ ├── lip_sync/ # Wav2Lip integration
│ └── upscaler/ # Real-ESRGAN integration
├── frontend/ # Web interface
│ ├── index.html # Main HTML
│ ├── styles/ # CSS styles
│ └── scripts/ # JavaScript
├── scripts/ # Utility scripts
│ └── download_models.py # Model downloader
├── outputs/ # Generated videos
├── checkpoints/ # Model weights
├── requirements.txt # Python dependencies
├── Dockerfile # Container definition
└── docker-compose.yml # Docker orchestration
Configuration is managed through environment variables. Copy .env.example to .env and customize:
| Variable | Default | Description |
|---|---|---|
| APP_API_PORT | 8000 | API server port |
| APP_TORCH_DTYPE | float16 | Model precision |
| VIDEO_MODEL_ID | THUDM/CogVideoX-2b | Video model |
| VIDEO_HEIGHT | 480 | Output height |
| VIDEO_WIDTH | 720 | Output width |
| TTS_LANGUAGE | en | Default language |
| UPSCALE_SCALE | 4 | Upscale factor |
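One way to consume these variables is a small settings object that falls back to the documented defaults. This is a sketch only: the `Settings` class and `load_settings` function are hypothetical and do not describe how `core/config.py` is actually written. It reads the process environment and does not parse the `.env` file itself.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    # Defaults mirror the table above.
    api_port: int = 8000
    torch_dtype: str = "float16"
    video_model_id: str = "THUDM/CogVideoX-2b"
    video_height: int = 480
    video_width: int = 720
    tts_language: str = "en"
    upscale_scale: int = 4


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, using defaults for unset ones."""
    d = Settings()
    return Settings(
        api_port=int(env.get("APP_API_PORT", d.api_port)),
        torch_dtype=env.get("APP_TORCH_DTYPE", d.torch_dtype),
        video_model_id=env.get("VIDEO_MODEL_ID", d.video_model_id),
        video_height=int(env.get("VIDEO_HEIGHT", d.video_height)),
        video_width=int(env.get("VIDEO_WIDTH", d.video_width)),
        tts_language=env.get("TTS_LANGUAGE", d.tts_language),
        upscale_scale=int(env.get("UPSCALE_SCALE", d.upscale_scale)),
    )
```

For example, `load_settings({"APP_API_PORT": "9000"})` overrides only the port and leaves every other field at its default.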
| Configuration | VRAM Required | Recommended GPU |
|---|---|---|
| Minimum | 12GB | RTX 3060 |
| Standard | 16GB | RTX 4070 |
| Optimal | 24GB+ | RTX 4090, A100 |
To keep peak VRAM usage down, the pipeline loads and unloads each model in sequence rather than holding all of them in GPU memory at once.
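The load-then-unload pattern can be sketched with a context manager. This is an illustration under assumptions, not the project's implementation: `loaded` is a hypothetical name, and the CUDA cache is cleared only if PyTorch is installed and a GPU is available.

```python
import gc
from contextlib import contextmanager


@contextmanager
def loaded(load_fn):
    """Load a model for the duration of a `with` block, then release it."""
    model = load_fn()
    try:
        yield model
    finally:
        # Drop this function's reference and collect garbage. The caller's
        # `as` variable still refers to the model until it is rebound or deleted.
        del model
        gc.collect()
        try:
            import torch  # optional dependency in this sketch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()  # return cached blocks to the driver
        except ImportError:
            pass
```

Each stage would then run inside its own `with loaded(...) as model:` block, so the next model is loaded only after the previous block has exited.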
Contributions are welcome! Please read our Contributing Guidelines first.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- CogVideoX - Video generation model
- Coqui TTS - Text-to-speech
- Wav2Lip - Lip synchronization
- Real-ESRGAN - Video upscaling
Made with ❤️ by AI-powered development