🎬 Text-to-Video Generator

AI-powered text-to-video generation with neural lip-sync capabilities

Transform text prompts into professional talking-head videos with accurate lip synchronization.
Powered by CogVideoX, XTTS-v2, Wav2Lip, and Real-ESRGAN.

Features • Demo • Quick Start • API • Deployment

✨ Features

| Feature | Description |
| --- | --- |
| 🎬 Text-to-Video | Generate video from text using CogVideoX diffusion models |
| 🎤 Neural TTS | High-quality speech synthesis with XTTS-v2 (17 languages) |
| 👄 Lip Sync | Accurate lip synchronization using Wav2Lip GAN |
| 📺 HD Upscaling | 4x video enhancement with Real-ESRGAN |
| 🎭 Voice Cloning | Clone any voice from a 6-second audio sample |
| 🌐 REST API | Production-ready FastAPI backend |
| 💻 Modern UI | Beautiful web interface with real-time progress |
| 🐳 Docker Ready | One-command deployment with docker-compose |

🖥️ Demo

Live Demo: text-to-video-generator.vercel.app

Enter your text prompt and watch AI generate a lip-synced video!

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Text Prompt                               │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┴─────────────────────┐
        ▼                                           ▼
┌───────────────────┐                     ┌───────────────────┐
│   CogVideoX       │                     │   XTTS-v2         │
│   Video Gen       │                     │   Speech Gen      │
└───────────────────┘                     └───────────────────┘
        │                                           │
        │         ┌─────────────────────┐          │
        └────────►│     Wav2Lip         │◄─────────┘
                  │     Lip Sync        │
                  └─────────────────────┘
                              │
                              ▼
                  ┌─────────────────────┐
                  │   Real-ESRGAN       │ (Optional)
                  │   Upscaling         │
                  └─────────────────────┘
                              │
                              ▼
                  ┌─────────────────────┐
                  │   Final MP4         │
                  │   Download          │
                  └─────────────────────┘
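
The flow in the diagram can be sketched as a small orchestration function. This is only an illustration of the stage order, not the code in `core/pipeline.py`: the four stage callables (`generate_video`, `synthesize_speech`, `lip_sync`, `upscale`) are hypothetical placeholders that would wrap CogVideoX, XTTS-v2, Wav2Lip, and Real-ESRGAN.

```python
from typing import Any, Callable, Optional


def run_pipeline(
    prompt: str,
    generate_video: Callable[[str], Any],       # hypothetical CogVideoX wrapper
    synthesize_speech: Callable[[str], Any],    # hypothetical XTTS-v2 wrapper
    lip_sync: Callable[[Any, Any], Any],        # hypothetical Wav2Lip wrapper
    upscale: Optional[Callable[[Any], Any]] = None,  # hypothetical Real-ESRGAN wrapper
) -> Any:
    """Run the stages in the order shown in the architecture diagram."""
    # Both branches start from the same text prompt.
    video = generate_video(prompt)
    audio = synthesize_speech(prompt)
    # Wav2Lip combines the silent video with the generated speech.
    synced = lip_sync(video, audio)
    # Upscaling is optional, so it is skipped when no upscaler is passed.
    return upscale(synced) if upscale is not None else synced
```

Passing the stages in as callables keeps the sketch independent of any model library and makes the ordering easy to test with stubs.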

🚀 Quick Start

Prerequisites

- Python 3.10 or 3.11
- NVIDIA GPU with 12GB+ VRAM (RTX 3060 or better)
- CUDA 11.8+
- FFmpeg
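
Before installing, the Python and FFmpeg prerequisites can be checked with a short script. The sketch below is an assumption about how such a preflight check might look; it is not part of the repository, and it does not verify the GPU or CUDA version.

```python
import shutil
import sys


def check_python(version=None):
    """Return True if the interpreter is Python 3.10 or 3.11."""
    major, minor = (version or sys.version_info)[:2]
    return major == 3 and minor in (10, 11)


def check_ffmpeg():
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


def preflight():
    """Collect human-readable problems; an empty list means both checks passed."""
    problems = []
    if not check_python():
        problems.append("Python 3.10 or 3.11 is required")
    if not check_ffmpeg():
        problems.append("FFmpeg was not found on PATH")
    return problems
```

Calling `preflight()` and printing the returned list gives a quick summary of what is missing.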

Installation

# Clone the repository
git clone https://github.com/yourusername/text-to-video-generator.git
cd text-to-video-generator

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download model checkpoints
python scripts/download_models.py

# Copy environment config
cp .env.example .env

# Start the server
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000

Open http://localhost:8000 in your browser.

Docker Deployment

# Build and run with Docker Compose
docker-compose up -d

# Check logs
docker-compose logs -f

📖 API Reference

Generate Video

POST /api/generate

Request Body:

{
  "prompt": "Hello! Welcome to our AI demonstration.",
  "duration": 6,
  "language": "en",
  "upscale": true
}

Response:

{
  "job_id": "job_20260121_123456_abc12345",
  "status": "pending",
  "message": "Video generation started"
}
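
A client can build this request with the standard library alone. The helper below is a minimal sketch: `build_generate_request` is a hypothetical name, and its default values simply mirror the example body above rather than documented API defaults.

```python
import json
from urllib.request import Request


def build_generate_request(base_url, prompt, duration=6, language="en", upscale=True):
    """Build (but do not send) a POST request for /api/generate."""
    payload = {
        "prompt": prompt,
        "duration": duration,
        "language": language,
        "upscale": upscale,
    }
    return Request(
        url=base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it against a running server:
#   from urllib.request import urlopen
#   with urlopen(build_generate_request("http://localhost:8000", "Hello!")) as resp:
#       job_id = json.load(resp)["job_id"]
```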

Check Status

GET /api/status/{job_id}

Response:

{
  "job": {
    "job_id": "job_20260121_123456_abc12345",
    "status": "generating_video",
    "progress": 35.5,
    "current_step": "Generating video frames..."
  }
}
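
Because generation runs as a background job, a client typically polls this endpoint until the job finishes. The sketch below assumes the terminal status values are `"completed"` and `"failed"`; only `"pending"` and `"generating_video"` appear in this README, so check the actual values before relying on it. `fetch_status` is a hypothetical callable that performs the GET and returns the decoded JSON.

```python
import time

# Assumed terminal states; they are not documented in this README.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(fetch_status, job_id, interval=2.0, timeout=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = clock() + timeout
    while True:
        # The status response nests the job details under the "job" key.
        job = fetch_status(job_id)["job"]
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']!r} after {timeout}s")
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.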

Download Video

GET /api/download/{job_id}

Returns the generated MP4 video file.
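
Saving the file from a client might look like the following sketch. The helper names are hypothetical, and the code assumes the endpoint streams the MP4 bytes directly in the response body.

```python
import shutil
from urllib.request import urlopen


def download_url(base_url, job_id):
    """Return the download URL for a finished job."""
    return f"{base_url.rstrip('/')}/api/download/{job_id}"


def save_video(stream, dest_path, chunk_size=1 << 20):
    """Copy a binary stream to dest_path in chunks to avoid loading it all in memory."""
    with open(dest_path, "wb") as out:
        shutil.copyfileobj(stream, out, chunk_size)
    return dest_path


def download_video(base_url, job_id, dest_path):
    """Fetch the MP4 for job_id from a running server and write it to dest_path."""
    with urlopen(download_url(base_url, job_id)) as resp:
        return save_video(resp, dest_path)
```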

Health Check

GET /health

Response:

{
  "status": "healthy",
  "version": "1.0.0",
  "gpu_available": true,
  "gpu_name": "NVIDIA RTX 4090",
  "gpu_memory_gb": 24.0
}
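
A deployment script can use this response to decide whether the server is ready for generation jobs. The function below is a sketch under the assumption that the fields shown above are always present or null; the 12 GB threshold comes from the minimum listed under Prerequisites.

```python
import json

MIN_VRAM_GB = 12.0  # minimum VRAM listed in the prerequisites


def gpu_ready(health_json, min_vram_gb=MIN_VRAM_GB):
    """Return True if the health payload reports a healthy server with enough VRAM."""
    health = json.loads(health_json)
    vram = float(health.get("gpu_memory_gb") or 0.0)  # treat missing/null as 0
    return (
        health.get("status") == "healthy"
        and bool(health.get("gpu_available"))
        and vram >= min_vram_gb
    )
```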

🌍 Supported Languages

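| Code | Language | Code | Language |
| --- | --- | --- | --- |
| `en` | English | `ru` | Russian |
| `es` | Spanish | `nl` | Dutch |
| `fr` | French | `cs` | Czech |
| `de` | German | `ar` | Arabic |
| `it` | Italian | `zh-cn` | Chinese |
| `pt` | Portuguese | `ko` | Korean |
| `pl` | Polish | `ja` | Japanese |
| `tr` | Turkish | `hi` | Hindi |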
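
A client can validate the `language` field against this table before submitting a job. The sketch below hard-codes the codes listed above; `normalize_language` is a hypothetical helper, and the server may apply its own validation.

```python
# Language codes copied from the table above.
SUPPORTED_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr",
    "ru", "nl", "cs", "ar", "zh-cn", "ko", "ja", "hi",
}


def normalize_language(code):
    """Lower-case and trim a language code, rejecting codes not in the table."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return normalized
```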

📁 Project Structure

text-to-video-generator/
├── api/                    # FastAPI backend
│   ├── main.py             # Application entry point
│   ├── routes/             # API endpoints
│   └── models/             # Pydantic schemas
├── core/                   # Pipeline orchestration
│   ├── config.py           # Configuration management
│   ├── pipeline.py         # Main workflow
│   └── utils.py            # Utility functions
├── modules/                # AI model wrappers
│   ├── video_generator/    # CogVideoX integration
│   ├── tts/                # XTTS-v2 integration
│   ├── lip_sync/           # Wav2Lip integration
│   └── upscaler/           # Real-ESRGAN integration
├── frontend/               # Web interface
│   ├── index.html          # Main HTML
│   ├── styles/             # CSS styles
│   └── scripts/            # JavaScript
├── scripts/                # Utility scripts
│   └── download_models.py  # Model downloader
├── outputs/                # Generated videos
├── checkpoints/            # Model weights
├── requirements.txt        # Python dependencies
├── Dockerfile              # Container definition
└── docker-compose.yml      # Docker orchestration

⚙️ Configuration

Configuration is managed through environment variables. Copy `.env.example` to `.env` and customize:

| Variable | Default | Description |
| --- | --- | --- |
| `APP_API_PORT` | `8000` | API server port |
| `APP_TORCH_DTYPE` | `float16` | Model precision |
| `VIDEO_MODEL_ID` | `THUDM/CogVideoX-2b` | Video model |
| `VIDEO_HEIGHT` | `480` | Output height |
| `VIDEO_WIDTH` | `720` | Output width |
| `TTS_LANGUAGE` | `en` | Default language |
| `UPSCALE_SCALE` | `4` | Upscale factor |
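
To show how these variables and defaults fit together, here is a standard-library sketch of a settings loader. It is not the code in `core/config.py`: the `Settings` class and `load_settings` function are hypothetical, and the loader reads the process environment only, without parsing the `.env` file itself.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Defaults copied from the configuration table."""
    api_port: int = 8000
    torch_dtype: str = "float16"
    video_model_id: str = "THUDM/CogVideoX-2b"
    video_height: int = 480
    video_width: int = 720
    tts_language: str = "en"
    upscale_scale: int = 4

    @property
    def upscaled_size(self):
        """(width, height) after applying the upscale factor."""
        return (self.video_width * self.upscale_scale,
                self.video_height * self.upscale_scale)


def load_settings(env=None):
    """Build Settings from a mapping of environment variables (os.environ by default)."""
    env = os.environ if env is None else env
    return Settings(
        api_port=int(env.get("APP_API_PORT", 8000)),
        torch_dtype=env.get("APP_TORCH_DTYPE", "float16"),
        video_model_id=env.get("VIDEO_MODEL_ID", "THUDM/CogVideoX-2b"),
        video_height=int(env.get("VIDEO_HEIGHT", 480)),
        video_width=int(env.get("VIDEO_WIDTH", 720)),
        tts_language=env.get("TTS_LANGUAGE", "en"),
        upscale_scale=int(env.get("UPSCALE_SCALE", 4)),
    )
```

With the defaults, a 720x480 video upscaled by a factor of 4 comes out at 2880x1920.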

🎮 GPU Requirements

| Configuration | VRAM Required | Recommended GPU |
| --- | --- | --- |
| Minimum | 12GB | RTX 3060 |
| Standard | 16GB | RTX 4070 |
| Optimal | 24GB+ | RTX 4090, A100 |

To fit within these VRAM budgets, the pipeline manages GPU memory automatically: it loads each module, runs its stage, and unloads it before loading the next one.
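
The load, run, unload pattern can be illustrated without any GPU library. The sketch below is an assumption about the general technique, not the project's implementation: `run_sequentially` is a hypothetical helper, and each stage is a `(load, run)` pair of callables.

```python
import gc


def run_sequentially(stages, data):
    """Run (load, run) stage pairs one at a time, releasing each model before the next."""
    for load, run in stages:
        model = load()
        try:
            data = run(model, data)
        finally:
            # Drop the only reference so the model can be freed before the next load.
            # A PyTorch-based implementation would likely also call
            # torch.cuda.empty_cache() here to return cached GPU memory.
            del model
            gc.collect()
    return data
```

The trade-off is extra load time per job in exchange for a peak memory footprint close to that of the largest single model.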

🤝 Contributing

Contributions are welcome! Please read our Contributing Guidelines first.

1. Fork the repository.
2. Create your feature branch (`git checkout -b feature/amazing-feature`).
3. Commit your changes (`git commit -m 'Add amazing feature'`).
4. Push to the branch (`git push origin feature/amazing-feature`).
5. Open a Pull Request.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

🙏 Acknowledgments

- CogVideoX: video generation model
- Coqui TTS: text-to-speech
- Wav2Lip: lip synchronization
- Real-ESRGAN: video upscaling

Made with ❤️ by AI-powered development
