Transilience VideoGen

AI-powered pipeline that turns any content (images, documents, presentations, videos, or website URLs) into professional marketing videos, with voice cloning, AI music, and cinematic animation.

🚀 Quick Start • 🏗️ Architecture • 📖 CLI Reference • 🌐 Web UI • 💰 Pricing

📋 Table of Contents

Overview
Quick Start
Architecture
Features
CLI Reference
Processing Pipeline
Web UI
Cost & Performance
Project Structure
Technical Stack

🎯 Overview

Transilience VideoGen automatically creates professional marketing videos from any combination of content. Drop in your files, provide a storyline, and get a polished video with AI-generated animation, voiceover, and music.

What It Does

Images / PPT / PDF / Videos / URLs  →  AI Pipeline  →  Professional Marketing Video

🎬 Veo 3.1 animates static images with cinematic motion (zoom, pan, drift)
🎙️ Unified Narrative — one AI call writes cohesive scripts for all scenes (hook → flow → CTA)
🗣️ Voice Cloning — clone any voice from a sample, or use premium ElevenLabs voices
🎵 AI Music — generate a 30s background loop, auto-repeated to fill the video
🌐 URL to Video — crawl a website, plan scenes with Gemini Vision, generate images with Imagen 4.0
🎛️ Audio Ducking — system audio auto-lowers under voiceover via sidechain compression
⚡ Fast Assembly — FFmpeg concat with stream copy (near-instant final merge)

Why VideoGen?

Aspect	Traditional Video Tools	VideoGen
Time	Hours of manual editing	Minutes (automated)
Cost	$500–5,000 per video	$0.25–0.65 per video
Input	Requires edited footage	Any files — images, slides, recordings
Voiceover	Hire talent or record yourself	AI voices + clone your own
Music	License tracks	AI-generated, royalty-free
Narrative	Write scripts manually	AI writes cohesive storyline

🚀 Quick Start

Prerequisites

Python 3.9+
FFmpeg (brew install ffmpeg / apt install ffmpeg)

Installation

pip install -r requirements.txt
playwright install chromium    # For website storyboarding
cp .env.example .env           # Fill in your local secrets; .env stays ignored

API Keys

export GOOGLE_API_KEY="..."       # Required — Gemini + Veo 3.1 + Imagen 4.0
export ELEVENLABS_API_KEY="..."   # Optional — Premium TTS, voice cloning, music
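
Before a long run it can help to confirm which keys are actually set. The sketch below is illustrative only: `check_api_keys` is a hypothetical helper, not part of `cli.py`, and it assumes the two variable names shown above.

```python
import os

# Key names taken from the export lines above.
REQUIRED_KEYS = ["GOOGLE_API_KEY"]
OPTIONAL_KEYS = ["ELEVENLABS_API_KEY"]


def check_api_keys(env=None):
    """Return (missing_required, missing_optional) for an environment mapping.

    Hypothetical helper for illustration; VideoGen may validate keys differently.
    """
    env = os.environ if env is None else env
    missing_required = [k for k in REQUIRED_KEYS if not env.get(k)]
    missing_optional = [k for k in OPTIONAL_KEYS if not env.get(k)]
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = check_api_keys()
    if required:
        print("Missing required keys:", ", ".join(required))
    if optional:
        print("Optional keys not set:", ", ".join(optional))
```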

Usage

# Run from project/videogen/
cd project/videogen

# Simplest — drop images, get video
python cli.py create ./content/ -o video.mp4

# Full production — voice clone + music + storyline
python cli.py create ./content/ -o video.mp4 \
    --clone-voice "Aman" --clone-from voice_sample.mp3 \
    --generate-music --music-prompt "upbeat corporate" \
    --storyline "A small team discovers AI automation and scales to 10x productivity" \
    --product "My App" --tone "energetic"

# Website URL → storyboard → video
python cli.py storyboard https://example.com \
    --storyline-file story.txt --scenes 9 --product "My App"
python cli.py create storyboard_output/sequence/ -o video.mp4 \
    --ai-order --generate-music

# Screen recording with preserved system audio
python cli.py create ./recordings/ -o tutorial.mp4 \
    --style tutorial --storyline "Step-by-step setup guide"

# Preview without running
python cli.py create ./content/ -o video.mp4 --dry-run

🎬 Sample Videos

Sample	Description	Source
marketing.mp4	Full marketing video — voiceover + music	Mixed content
feature_explainer.mp4	Product feature walkthrough	Screenshots
Vulnerability_tutorial.mp4	Security vulnerability tutorial	Screen recording

To view the samples, clone the repo and open the files locally, or download them from GitHub.

🏗️ Architecture

`create` — End-to-End Pipeline

flowchart TD
    A["python cli.py create ./content/ -o video.mp4<br/>--storyline 'your narrative arc'"] --> B

    subgraph Optional Phases
        P1["Phase 1: Voice Cloning<br/>--clone-voice + --clone-from<br/>ElevenLabs API"]
        P2["Phase 2: Music Generation<br/>--generate-music<br/>30s loop (cost-efficient)"]
    end

    B["Phase 3: Video Pipeline<br/>veo_pipeline.py"] --> C["Scan & Sort Files<br/>screenshot_handler.py"]

    P1 -.->|"Cloned voice name"| B
    P2 -.->|"Music MP3"| FINAL

    C --> D{File Type?}

    D -->|".png .jpg .pptx .pdf"| STATIC
    D -->|".mp4 .mov .webm"| VIDEO
    D -->|"No Voiceover flag"| PASSTHROUGH["Pass Through<br/>Keep original audio<br/>Skip all TTS"]

    subgraph STATIC["Static Content"]
        S1["1a. OCR Text Extraction<br/>Gemini Vision per scene"]
        S2["1b. Unified Narrative<br/>ONE AI call → all scripts<br/>Hook → Flow → CTA"]
        S3["2. Veo 3.1 Animation<br/>5s/scene, parallel"]
        S4["3. TTS Enhancement<br/>+ Voiceover"]
        S1 --> S2 --> S3 --> S4
    end

    subgraph VIDEO["Video Content"]
        V0["0. Detect companion mic file<br/>screen-recording-* + mic-recording-*"]
        V1["1. Transcribe audio<br/>(mic track if available)"]
        V2["2. Polish transcript"]
        V3["3. Clean video<br/>Remove idle frames"]
        V4["4. Strip or preserve audio"]
        V5["5. TTS + Audio Ducking<br/>Sidechain compression"]
        V0 --> V1 --> V2 --> V3 --> V4 --> V5
    end

    STATIC --> FINAL
    VIDEO --> FINAL
    PASSTHROUGH --> FINAL

    subgraph FINAL["Final Assembly (FFmpeg)"]
        F1["Normalize → target resolution<br/>FFmpeg scale+pad"]
        F2["Concatenate scenes<br/>FFmpeg concat (-c copy)"]
        F3["Mix background music<br/>Loop + trim + fade"]
        F4["Export: H.264 · AAC · 30fps"]
        F1 --> F2 --> F3 --> F4
    end

    FINAL --> OUT["output.mp4 + output.mp3 + output_music.mp3"]

    style A fill:#d4b44e,color:#000
    style OUT fill:#4CAF50,color:#fff
    style PASSTHROUGH fill:#FF9800,color:#fff

`storyboard` — URL to Video Assets

flowchart LR
    A["Website URL"] --> B["Phase 1<br/>Playwright Crawl<br/>Scroll + nav links<br/>16:9 screenshots"]
    B --> C["Phase 2<br/>Gemini Vision<br/>Scene Planning<br/>Pick best captures"]
    C --> D["Phase 3<br/>Imagen 4.0<br/>Generate images<br/>per scene"]
    D --> E["Phase 4<br/>sequence/ folder<br/>Screenshots + Generated"]
    E --> F["Feed into create<br/>--ai-order"]

    style A fill:#2196F3,color:#fff
    style F fill:#4CAF50,color:#fff

Multi-Agent Execution Flow

sequenceDiagram
    participant User
    participant CLI as cli.py
    participant Pipeline as veo_pipeline.py
    participant AI as Gemini / Veo / ElevenLabs
    participant FFmpeg

    User->>CLI: create ./content/ -o video.mp4
    CLI->>CLI: Phase 1: Clone voice (optional)
    CLI->>Pipeline: Phase 3: create_marketing_video_veo()

    Pipeline->>Pipeline: Scan & sort files
    Pipeline->>AI: OCR text extraction (per scene)
    Pipeline->>AI: Unified narrative (ONE call → all scripts)
    Pipeline->>AI: Veo 3.1 animation (parallel)
    Pipeline->>AI: TTS voiceover (per scene)
    Pipeline->>FFmpeg: Combine video + audio per scene
    Pipeline->>FFmpeg: Normalize resolution (scale+pad)
    Pipeline->>FFmpeg: Concatenate all scenes (-c copy)

    CLI->>CLI: Phase 4: Generate music (optional)
    CLI->>FFmpeg: Mix music into final video

    FFmpeg-->>User: output.mp4 + output.mp3 + output_music.mp3

✨ Features

Input Types

Type	Extensions	Processing
Images	`.png` `.jpg` `.jpeg` `.webp` `.gif` `.bmp`	Gemini Vision OCR + Veo 3.1 animation
PowerPoint	`.pptx` `.ppt`	Extract slides → process as images
Documents	`.pdf` `.doc` `.docx`	Extract pages → process as images
Videos	`.mp4` `.mov` `.avi` `.mkv` `.webm` `.m4v`	Transcribe → Polish → Clean → Re-voice
Screen Recordings	`screen-recording-*.webm`	Detect mic track → preserve system audio → ducking
Website URLs	`https://...`	Playwright crawl → Gemini plan → Imagen generate

Mix any types in one folder: the system processes each file according to its type and combines the results in filename order, or in an AI-chosen order when `--ai-order` is active.
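
As a rough illustration of the table above, the sketch below maps filenames to input types by extension and lists them in filename order. The function names and the `"unsupported"` label are hypothetical; the real logic lives in `screenshot_handler.py` and may differ.

```python
from pathlib import Path

# Extension groups copied from the Input Types table.
EXTENSION_TYPES = {
    "image": {".png", ".jpg", ".jpeg", ".webp", ".gif", ".bmp"},
    "presentation": {".pptx", ".ppt"},
    "document": {".pdf", ".doc", ".docx"},
    "video": {".mp4", ".mov", ".avi", ".mkv", ".webm", ".m4v"},
}


def classify(path):
    """Return the input type for a file, checking the screen-recording pattern first."""
    name = Path(path).name.lower()
    if name.startswith("screen-recording-") and name.endswith(".webm"):
        return "screen_recording"
    suffix = Path(path).suffix.lower()
    for kind, extensions in EXTENSION_TYPES.items():
        if suffix in extensions:
            return kind
    return "unsupported"


def plan(paths):
    """Pair each filename with its type, sorted by filename."""
    names = sorted(Path(p).name for p in paths)
    return [(name, classify(name)) for name in names]
```

For example, `plan(["05_cta.jpg", "01_title.png"])` lists `01_title.png` first, and both entries are classified as `image`.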

AI Capabilities

Capability	Technology	What It Does
Unified Narrative	Gemini	One call writes cohesive scripts for ALL scenes (hook → flow → CTA)
Veo 3.1 Animation	Google Veo	Cinematic zoom/pan/drift on static images, 5s per scene
TTS Enhancement	Claude	Adds delivery cues (CAPS emphasis, dashes for pauses, energy gradient)
Voice Cloning	ElevenLabs	Clone any voice from an audio sample
AI Music	ElevenLabs/Suno	Generate 30s background loop, auto-repeated
Scene Ordering	Gemini Vision	Auto-sequence scenes into logical narrative flow
Imagen 4.0	Google Imagen	Generate futuristic tech visuals from text prompts
Style Profiles	Built-in	marketing, demo, explainer, pitch, tutorial personas
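
The 30s music loop has to be repeated and trimmed to the video length. A minimal sketch of that arithmetic and one possible FFmpeg invocation follows; the 2-second fade and the function names are assumptions, and the project's own command may look different. `-stream_loop`, `-t`, `volume`, and `afade` are standard FFmpeg options.

```python
import math


def music_loop_plan(video_seconds, loop_seconds=30.0, fade_seconds=2.0):
    """Return (number of plays needed, fade-out start time in seconds)."""
    if video_seconds <= 0 or loop_seconds <= 0:
        raise ValueError("durations must be positive")
    plays = math.ceil(video_seconds / loop_seconds)
    fade_start = max(0.0, video_seconds - fade_seconds)
    return plays, fade_start


def music_loop_command(music_path, out_path, video_seconds,
                       volume=0.03, loop_seconds=30.0, fade_seconds=2.0):
    """Build an FFmpeg argument list that loops, trims, and fades the clip."""
    plays, fade_start = music_loop_plan(video_seconds, loop_seconds, fade_seconds)
    return [
        "ffmpeg", "-y",
        "-stream_loop", str(plays - 1),   # extra repeats after the first play
        "-i", music_path,
        "-t", f"{video_seconds:.3f}",     # trim to the video length
        "-af", f"volume={volume},afade=t=out:st={fade_start:.3f}:d={fade_seconds:.3f}",
        out_path,
    ]
```

For a 99-second video the clip plays four times (three extra loops) and is trimmed at 99 seconds. The default `volume=0.03` mirrors the documented `--music-volume` default.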

Audio Processing

Feature	How It Works
Audio Ducking	FFmpeg sidechain compression — system audio ducks under voiceover (attack=100ms, release=800ms)
Per-File Control	Toggle "Keep original audio" to skip voiceover for specific files
Companion Mic	Screen recordings auto-detect separate mic file; transcribe mic, preserve system audio
Triple Output	`.mp4` video + `.mp3` voice+music + `.mp3` music only
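
To make the ducking row concrete, here is one way a sidechain filter graph could be assembled. Only the attack and release values come from the table; the threshold, ratio, stream layout (input 0 = recording with system audio, input 1 = voiceover), and function names are assumptions for illustration.

```python
def ducking_filter(attack_ms=100, release_ms=800, threshold=0.05, ratio=8):
    """Return an FFmpeg filter graph that ducks input 0's audio under input 1."""
    return (
        # Split the voiceover: one copy is heard, one drives the compressor.
        "[1:a]asplit=2[vo][key];"
        # Compress the system audio whenever the voiceover (key) is active.
        f"[0:a][key]sidechaincompress=threshold={threshold}:ratio={ratio}"
        f":attack={attack_ms}:release={release_ms}[ducked];"
        # Mix the ducked system audio with the voiceover.
        "[ducked][vo]amix=inputs=2:duration=first[aout]"
    )


def ducking_command(video_path, voiceover_path, out_path):
    """Build the full FFmpeg argument list; video is copied, audio re-encoded."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", voiceover_path,
        "-filter_complex", ducking_filter(),
        "-map", "0:v", "-map", "[aout]",
        "-c:v", "copy", "-c:a", "aac",
        out_path,
    ]
```

The long release (800ms) keeps the system audio from pumping back up between short pauses in the narration.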

Video Processing

Feature	How It Works
FFmpeg Assembly	Normalize + concat with stream copy (near-instant final merge)
Video Cleaning	Remove still/idle frames before re-voicing
Avatar Overlay	Composite image/video avatar on any corner (CLI + Web UI)
Bookends	AI-generated intro/outro frames (Gemini + Imagen + Veo)
Resolution	720p, 1080p, 4K — all clips normalized to target
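
Normalizing every clip to one resolution is what makes the later stream-copy concat possible. The sketch below builds a scale+pad filter string for the three documented targets; the pixel sizes assume 16:9 frames, and `setsar=1` and `fps=30` are additions assumed here for safety rather than confirmed project behavior.

```python
# Assumed 16:9 pixel dimensions for the documented resolution names.
RESOLUTIONS = {"720p": (1280, 720), "1080p": (1920, 1080), "4k": (3840, 2160)}


def normalize_filter(resolution="1080p"):
    """Return an FFmpeg -vf string that fits a clip inside the target frame.

    scale shrinks or grows the clip while keeping its aspect ratio,
    pad centers it on a frame of exactly the target size.
    """
    width, height = RESOLUTIONS[resolution.lower()]
    return (
        f"scale={width}:{height}:force_original_aspect_ratio=decrease,"
        f"pad={width}:{height}:(ow-iw)/2:(oh-ih)/2,"
        "setsar=1,fps=30"
    )
```

A call such as `normalize_filter("720p")` can be passed to FFmpeg as the value of `-vf`.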

📖 CLI Reference

`create` — All-in-One Video Generation

python cli.py create INPUT_PATH [OPTIONS]

# Core
  -o, --output TEXT              Output file (default: output.mp4)
  --storyline TEXT               Narrative arc to guide scripts + animation
  --product TEXT                 Product name
  --tone TEXT                    Script tone (default: professional and engaging)
  --style CHOICE                 marketing|demo|explainer|pitch|tutorial
  --resolution CHOICE            720p | 1080p | 4k

# Voice
  --voice TEXT                   TTS voice (default: Smritika)
  --tts-engine CHOICE            elevenlabs | edge (free)
  --clone-voice NAME             Clone voice from audio sample
  --clone-from AUDIO_FILE        Audio sample for cloning
  --voice-speed FLOAT            Speech rate (1.0 = normal)

# Music
  --generate-music               Generate AI background music
  --music-prompt TEXT            Music style prompt
  --music PATH                   Use existing music file
  --music-volume FLOAT           Mix volume (default: 0.03)

# Pipeline
  --max-workers INT              Parallel Veo calls (default: 3)
  --script-duration INT          Target narration seconds (default: 60)
  --scene-duration INT           Seconds per Veo scene (default: 5)
  --ai-order / --no-ai-order     AI scene ordering (default: on)
  --intro / --no-intro           Generate branded intro frame
  --outro / --no-outro           Generate branded outro frame
  --blend                        Blend assets from subfolders
  --mix SPEED,VOICE,MUSIC        Audio shorthand (e.g. "1.2,6.0,0.04")
  --dry-run                      Preview plan without running
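
The `--mix` shorthand packs three numbers into one string. A hypothetical parser is sketched below to show the expected shape; the function name and the returned keys are illustrative, and how each value is applied is left to the CLI.

```python
def parse_mix(value):
    """Split a 'SPEED,VOICE,MUSIC' string into three floats.

    Hypothetical helper; cli.py may parse and validate this differently.
    """
    parts = [part.strip() for part in value.split(",")]
    if len(parts) != 3:
        raise ValueError("expected three comma-separated numbers: SPEED,VOICE,MUSIC")
    try:
        speed, voice, music = (float(part) for part in parts)
    except ValueError as exc:
        raise ValueError(f"non-numeric value in --mix: {value!r}") from exc
    return {"speed": speed, "voice": voice, "music": music}
```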

`storyboard` — Website to Visual Assets

python cli.py storyboard URL [OPTIONS]

  --storyline TEXT / --storyline-file PATH
  --scenes INT                   Number of scenes (default: 6)
  --product TEXT                 Product name
  --style CHOICE                 marketing|demo|explainer|pitch|tutorial
  --skip-imagen                  Screenshots only
  --skip-screenshots             Imagen only
  --login                        Google sign-in for auth-required sites
  --dry-run                      Preview scene plan

Other Commands

Command	Purpose	Example
`avatar`	Add avatar overlay	`python cli.py avatar video.mp4 avatar.png -o out.mp4`
`veo-marketing`	Pipeline only (no clone/music)	`python cli.py veo-marketing ./content/ -o out.mp4`
`generate`	Basic video (no Veo)	`python cli.py generate ./content/ -o video.mp4`
`voice-clone`	Clone a voice	`python cli.py voice-clone "Name" recording.mp3`
`music`	Generate music	`python cli.py music -o bg.mp3 --prompt "upbeat"`
`voices`	List voices	`python cli.py voices --engine elevenlabs`
`veo`	Direct Veo access	`python cli.py veo image.png --prompt "gentle motion"`
`engage`	Cursor animations	`python cli.py engage ./screenshots/ -o out.mp4`
`info`	Setup guide	`python cli.py info`

🔄 Processing Pipeline

Static Content (Images / PPT / PDF)

Step   What                                How
─────  ──────────────────────────────────  ──────────────────────────────────────
1a     OCR text extraction                 Gemini Vision per scene
1b     Unified narrative                   ONE AI call → all scene scripts
2      Veo 3.1 animation                   Parallel (--max-workers), 5s/scene
3      TTS enhancement                     Emphasis, pauses, energy gradient
3b     Voiceover generation                ElevenLabs or Edge TTS
4      Combine video + audio               FFmpeg merge
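
Step 2 runs several animation calls at once, bounded by `--max-workers`. The pattern can be sketched with a thread pool; `animate_scene` here is a stand-in that only renames the file, since the real Veo call is not shown in this README.

```python
from concurrent.futures import ThreadPoolExecutor


def animate_scene(image_path):
    """Placeholder for a Veo request: returns the would-be clip path."""
    return image_path.rsplit(".", 1)[0] + ".mp4"


def animate_all(image_paths, max_workers=3, animate=animate_scene):
    """Animate scenes in parallel while keeping results in scene order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(animate, image_paths))
```

Threads suit this step because each call mostly waits on a remote API, so a small pool overlaps that waiting time without much CPU load.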

Video Content

Step   What                                How
─────  ──────────────────────────────────  ──────────────────────────────────────
0      Detect companion mic file           screen-recording-* + mic-recording-*
1      Transcribe audio                    ElevenLabs Scribe (mic if available)
2      Polish transcript                   AI rewrite + scene position context
3      Clean video                         Remove still/idle frames
4      Strip or preserve audio             Preserve for screen recordings
5      TTS voiceover                       Enhanced script → voice
6      Combine with ducking                Sidechain compression (screen rec)
                                           or simple mix (normal video)
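
Step 0 pairs a screen recording with its microphone track. The README only gives the two filename patterns, so the sketch below assumes the pair shares whatever follows the prefix (ignoring the extension); that matching rule and the function name are assumptions.

```python
from pathlib import PurePath

SCREEN_PREFIX = "screen-recording-"
MIC_PREFIX = "mic-recording-"


def find_companion_mic(screen_name, filenames):
    """Return the mic file whose suffix matches the screen recording, or None."""
    if not screen_name.startswith(SCREEN_PREFIX):
        return None
    # Part after the prefix, without extension, e.g. "2024-05-01".
    key = PurePath(screen_name[len(SCREEN_PREFIX):]).stem
    for name in sorted(filenames):
        if name.startswith(MIC_PREFIX):
            if PurePath(name[len(MIC_PREFIX):]).stem == key:
                return name
    return None
```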

Final Assembly

Step   What                                How
─────  ──────────────────────────────────  ──────────────────────────────────────
1      Normalize resolution                FFmpeg scale+pad (720p/1080p/4K)
2      Concatenate scenes                  FFmpeg concat demuxer (-c copy)
3      Mix background music                Loop 30s clip + trim + fade
4      Export                              H.264 · AAC 192kbps · 30fps
5      Extract audio                       MP3 44.1kHz stereo 192kbps
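
Step 2 relies on FFmpeg's concat demuxer, which reads a text file listing the clips and copies their streams without re-encoding. A minimal sketch of building that list and the command follows; the helper names are hypothetical. Stream copy only works when all clips share codec, resolution, and frame rate, which is why normalization comes first.

```python
def concat_list(paths):
    """Return the contents of a concat-demuxer list file, one clip per line."""
    lines = []
    for path in paths:
        # Inside single quotes, a literal quote is written as '\''.
        escaped = str(path).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_command(list_file, out_path):
    """Build the FFmpeg argument list for a stream-copy concat."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",   # -safe 0 permits absolute paths
        "-i", list_file,
        "-c", "copy",                   # no re-encode, hence the fast merge
        out_path,
    ]
```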

Mixed Content Example

Input:

marketing/
├── 01_title.png          # Image
├── 02_pitch.pptx         # 3 slides
├── 03_demo.mov           # Video with voiceover (45s)
├── 04_features.pdf       # 2 pages
└── 05_cta.jpg            # Image

Result: 8 scenes, ~99 seconds — cohesive narrative with opening hook, natural flow around the video segment, and closing CTA. Three output files: .mp4 + .mp3 + _music.mp3.
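
The scene count follows from the per-file units in the listing: 1 + 3 + 1 + 2 + 1 = 8. A small sketch of that expansion, using a hypothetical `name#index` label for each scene:

```python
# Units per file, taken from the comments in the listing above.
UNITS_PER_FILE = {
    "01_title.png": 1,      # image
    "02_pitch.pptx": 3,     # 3 slides
    "03_demo.mov": 1,       # one video segment
    "04_features.pdf": 2,   # 2 pages
    "05_cta.jpg": 1,        # image
}


def expand_scenes(units_per_file):
    """List one label per scene, in filename order."""
    scenes = []
    for name in sorted(units_per_file):
        for index in range(1, units_per_file[name] + 1):
            scenes.append(f"{name}#{index}")
    return scenes


print(len(expand_scenes(UNITS_PER_FILE)))  # prints 8
```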

🌐 Web UI

Deploy with Modal:

modal deploy modal_app.py

Endpoints

Endpoint	Method	Purpose
`/`	GET	Web interface
`/create`	POST	Start video generation
`/status/{job_id}`	GET	Poll job status + logs
`/download/{job_id}`	GET	Download finished video
`/storyboard`	POST	Start storyboard generation
`/storyboard-status/{job_id}`	GET	Poll storyboard status
`/avatar`	POST	Apply avatar overlay
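
A client would typically start a job with `/create` and then poll `/status/{job_id}` until it finishes. The response schema is not documented here, so the sketch below assumes a JSON object with a `status` field whose terminal values are `"done"` and `"failed"`; both are assumptions. The HTTP call is injected as `fetch_status` so the loop stays independent of any particular client library.

```python
import time


def poll_job(fetch_status, job_id, interval=2.0, timeout=1800.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status(job_id) until the job reaches a terminal state.

    fetch_status is any callable returning a dict, for example a wrapper
    around GET /status/{job_id}. Field name and values are assumed.
    """
    deadline = clock() + timeout
    while True:
        payload = fetch_status(job_id)
        if payload.get("status") in ("done", "failed"):
            return payload
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in {timeout}s")
        sleep(interval)
```

Once the payload reports success, the finished file can be fetched from `/download/{job_id}`.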

💰 Cost & Performance

Estimated Costs (10 scenes)

Component	Cost
Gemini Vision	~$0.03
Veo 3.1 Animation	~$0.20–0.50
ElevenLabs TTS	~$0.02–0.05
Music Generation	~$0.01–0.05
Total	~$0.25–0.65
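
The total is the sum of the component ranges, rounded. Adding the rows exactly gives $0.26 to $0.63, which the table states as roughly $0.25 to $0.65. A short check of that arithmetic:

```python
# (low, high) estimates in USD, copied from the table above.
COSTS = {
    "Gemini Vision": (0.03, 0.03),
    "Veo 3.1 Animation": (0.20, 0.50),
    "ElevenLabs TTS": (0.02, 0.05),
    "Music Generation": (0.01, 0.05),
}


def total_range(costs):
    """Sum the low and high ends of every component, rounded to cents."""
    low = round(sum(low for low, _ in costs.values()), 2)
    high = round(sum(high for _, high in costs.values()), 2)
    return low, high


print(total_range(COSTS))  # prints (0.26, 0.63)
```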

💡 Low-cost alternative: Edge TTS (free) + the `generate` command (no Veo) = ~$0.03

Processing Times

Input	Time (--max-workers 3)
5 images	~8–10 min
10 slides (PPT)	~12–16 min
2 min video	~2–3 min
Mixed (5 images + 1 video)	~10–13 min

📁 Project Structure

video_generator/
├── CLAUDE.md                              # Architecture & rules
├── README.md                              # Project overview (this file)
├── requirements.txt                       # Python dependencies
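├── .env.example                           # Template for .env (copy and fill in)
├── .env                                   # API keys (not tracked)
├── .gitignore
├── modal_app.py                           # Modal deployment for the Web UI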
│
├── .claude/                               # Claude Code config
│   ├── settings.local.json
│   └── skill/                             # Skill files mirroring src/ modules
│       ├── ai/                            #   *.skills.md per AI module
│       ├── pipeline/                      #   *.skills.md per pipeline module
│       ├── processing/                    #   *.skills.md per processing module
│       ├── core/                          #   *.skills.md per core utility
│       └── generators/                    #   *.skills.md per generator
│
└── project/
    └── videogen/
        ├── cli.py                         # CLI entry point (all commands)
        ├── process_recording.py           # Standalone video processor
        ├── samples/
        │   ├── input_samples/             # Sample input files
        │   └── video_samples/             # Generated sample videos
        └── src/
            ├── ai/
            │   ├── gemini_client.py        # Gemini API (Vision, content generation)
            │   ├── ai_analyzer.py          # Image analysis + script generation
            │   ├── veo_generator.py        # Veo 3.1 image-to-video
            │   ├── tts_engine.py           # TTS + STT + voice cloning
            │   └── imagen_generator.py     # Imagen 4.0 text-to-image
            ├── pipeline/
            │   ├── veo_pipeline.py         # Main orchestrator
            │   ├── screenshot_handler.py   # File loading, PPT/PDF extraction
            │   ├── blend_handler.py        # Multi-folder blending
            │   ├── bookend_generator.py    # Intro/outro (Gemini + Imagen + Veo)
            │   ├── storyboard_planner.py   # Gemini Vision scene planning
            │   └── website_screenshotter.py # Playwright crawling
            ├── processing/
            │   ├── avatar_overlay.py       # FFmpeg avatar compositing
            │   ├── video_cleaner.py        # Remove still/idle frames
            │   ├── video_effects.py        # Visual effects
            │   └── video_extractors.py     # Keyframe extraction
            ├── core/
            │   ├── image_utils.py          # Image manipulation
            │   ├── video_utils.py          # Video inspection
            │   └── audio_utils.py          # Audio manipulation
            └── generators/
                ├── music_generator.py      # Music (ElevenLabs, Suno, Replicate)
                └── text_animator.py        # Text animation

🔧 Technical Stack

Layer	Technologies
AI / ML	Google Gemini Vision, Veo 3.1, Imagen 4.0, ElevenLabs (TTS + STT + cloning + music)
Video	FFmpeg (concat demuxer, sidechain compression, scale+pad), OpenCV, MoviePy
Web	FastAPI, Modal (serverless), Playwright (headless Chromium)
Documents	python-pptx, pdf2image, Pillow

Output Specs

Format	Details
Video	H.264, AAC 192kbps, 30fps — 720p / 1080p / 4K
Audio	MP3, 44.1kHz stereo, 192kbps
Music	MP3, 44.1kHz stereo, 192kbps (trimmed to video length)

Required API Keys

cp .env.example .env
export GOOGLE_API_KEY="..."       # Required — Gemini + Veo 3.1 + Imagen 4.0
export ELEVENLABS_API_KEY="..."   # Optional — Premium TTS, voice cloning, music
export GOOGLE_EMAIL="..."         # Optional — Auto sign-in for storyboard --login
export GOOGLE_PASSWORD="..."      # Optional — Auto sign-in for storyboard --login

Built by Transilience AI
