Video-to-Video generation powered by Google Gemini + Seed Dance 2.0. Transform any video into a fully stylized animated version — Pixar, Disney, LEGO, anime, clay, and more — with consistent characters, environments, and cinematic composition preserved across every shot.
Each demo shows the original source video (top-left, picture-in-picture) alongside the generated stylized version.
Huaqiang Watermelon Meme → Claymation
huaqiang.mp4
Avengers Doctor Strange Scene → LEGO
avengers.mp4
Empresses in the Palace → Family Guy 2D Cartoon
empresses.mp4
Input Video
↓
Step 0 Whisper audio transcription (dialogue + timestamps)
↓
Step 1 Sliding-window Gemini analysis (~60s chunks, parallel)
(character extraction + per-shot schema with scene_id, t2i/i2v prompts, dialogue)
↓
Step 2 Character reference images (Imagen 3 / Gemini) + unified Design Sheet
↓
Step 3 Prompt compilation + Gemini style rewrite (LEGO, Pixar, Disney, etc.)
Dialogue language unified: --dialogue-lang zh|en|auto
↓
Step 4 Keyframe generation — Consistency mode: 2×2 grid per scene group → crop + refine
↓
Step 4b Keyframe inspection & auto-fix (skippable with --no-inspect)
↓
Step 5 Video clips — Seed Dance 2.0 (default) or Veo 3
↓
Step 6 ffmpeg merge → final video (1280×720)
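Step 6 can be pictured as an ffmpeg concat-demuxer call. The sketch below is illustrative only and is not the pipeline's actual code: the helper names `write_concat_list` and `build_merge_command` are hypothetical, and the codec choices (`libx264`, `aac`) are assumptions. Only the 1280×720 target comes from this README.

```python
from pathlib import Path


def write_concat_list(clips, list_path):
    """Write an ffmpeg concat-demuxer list file, one `file '...'` line per clip."""
    lines = [f"file '{Path(c).resolve().as_posix()}'" for c in clips]
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return Path(list_path)


def build_merge_command(list_path, output_path, width=1280, height=720):
    """Build (but do not run) an ffmpeg command that joins and rescales clips."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",      # concat demuxer; allow absolute paths
        "-i", str(list_path),
        "-vf", f"scale={width}:{height}",  # final resolution from the diagram
        "-c:v", "libx264", "-c:a", "aac",  # assumed codecs, not confirmed by the source
        str(output_path),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the clip list has been written.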
- Python 3.10+
- ffmpeg
brew install ffmpeg      # macOS
sudo apt install ffmpeg  # Ubuntu
pip install google-genai Pillow opencv-python "scenedetect[opencv]" openai-whisper byteplus-python-sdk-v2 httpx PyJWT
macOS (Homebrew Python): add `--break-system-packages` if needed. The quotes around `scenedetect[opencv]` keep zsh from treating the brackets as a glob pattern.
| Package | Purpose |
|---|---|
| `google-genai` | Gemini API (video analysis, image generation, Veo 3) |
| `Pillow` | Image processing |
| `opencv-python` | Video reading |
| `scenedetect[opencv]` | Camera cut detection |
| `openai-whisper` | Speech-to-text transcription |
| `byteplus-python-sdk-v2` | Seed Dance 2.0 (BytePlus ARK) |
| `httpx` | Async HTTP (video download) |
| `PyJWT` | JWT auth (not needed for Seed Dance, reserved) |
Gemini (required for analysis + keyframes):
export GENAI_API_KEY="your_gemini_api_key"
Seed Dance 2.0 (required for real video generation, default):
export BYTEPLUS_API_KEY="ark-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
Get your key from BytePlus ARK.
Veo 3 (optional alternative video model):
export GENAI_API_KEY="your_gemini_api_key"  # same key as above
Runware (optional, for GPT Image 2 keyframe generation):
export RUNWARE_API_KEY="your_runware_api_key"
Get your key from Runware. Used when `--keyframe-model gpt-image` is set.
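To see which keys a given run needs, the rules above can be written as a small pre-flight check. This is a sketch only: `missing_keys` is a hypothetical helper, not part of the pipeline, and it assumes the Gemini key is always needed because analysis runs on Gemini regardless of the keyframe or video model.

```python
import os


def missing_keys(real_video=False, video_model="seeddance",
                 keyframe_model="gemini", env=None):
    """Return the required environment variables that are unset or empty."""
    env = os.environ if env is None else env
    required = ["GENAI_API_KEY"]  # Gemini analysis (also used by Veo 3)
    if real_video and video_model == "seeddance":
        required.append("BYTEPLUS_API_KEY")
    if keyframe_model == "gpt-image":
        required.append("RUNWARE_API_KEY")
    return [name for name in required if not env.get(name)]
```

For example, `missing_keys(real_video=True)` returns the names still to be exported before a Seed Dance run.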
# Seed Dance 2.0 (default), clay style, Chinese dialogue
python v2/pipeline.py my_video.mp4 --style clay --real-video --dialogue-lang zh
# With source-frame layout reference — preserves original compositions
python v2/pipeline.py my_video.mp4 --style clay --real-video --source-frame-grid
# GPT Image 2 (Runware) for keyframes
RUNWARE_API_KEY=xxx python v2/pipeline.py my_video.mp4 --style pixar --keyframe-model gpt-image --real-video
# Veo 3 alternative
python v2/pipeline.py my_video.mp4 --style pixar --real-video --video-model veo
# Dev mode (fast, no API cost for video)
python v2/pipeline.py my_video.mp4 --style disney
# Full options
python v2/pipeline.py my_video.mp4 \
--style clay \
--shots 100 \
--mode consistency \
--real-video \
--video-model seeddance \
--dialogue-lang zh \
--output-dir ./output \
--yes \
--no-inspect

| Flag | Default | Description |
|---|---|---|
| `--style` | `disney` | Target visual style |
| `--shots` | `10` | Max shots to generate |
| `--mode` | `consistency` | Keyframe generation mode |
| `--keyframe-model` | `gemini` | Image model: `gemini` (default) or `gpt-image` (Runware GPT Image 2) |
| `--source-frame-grid` | off | Layout reference: extract the midpoint frame from each source-video shot and compose a 2×2 reference grid to guide keyframe layout. Unlocks concurrent grid generation. |
| `--real-video` | off | Use real video model (Seed Dance / Veo 3) |
| `--video-model` | `seeddance` | Video model: `seeddance` or `veo` |
| `--dialogue-lang` | `auto` | Dialogue language in i2v prompt: `auto`, `zh`, `en` |
| `--yes` | off | Auto-confirm when >16 shots detected |
| `--no-whisper` | off | Skip audio transcription |
| `--no-inspect` | off | Skip Step 4b keyframe inspection |
| `--output-dir` | `.` | Output directory |
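The flag table maps naturally onto an `argparse` parser. The sketch below mirrors the defaults listed above; it is not the parser in `v2/pipeline.py`, and restricting values with `choices` is an assumption based on the descriptions in the table.

```python
import argparse


def build_parser():
    """Hypothetical parser reproducing the documented flags and defaults."""
    p = argparse.ArgumentParser(prog="pipeline.py")
    p.add_argument("video", help="input video path")
    p.add_argument("--style", default="disney")
    p.add_argument("--shots", type=int, default=10)
    p.add_argument("--mode", default="consistency",
                   choices=["consistency", "default", "camera_tree"])
    p.add_argument("--keyframe-model", default="gemini",
                   choices=["gemini", "gpt-image"])
    p.add_argument("--source-frame-grid", action="store_true")
    p.add_argument("--real-video", action="store_true")
    p.add_argument("--video-model", default="seeddance",
                   choices=["seeddance", "veo"])
    p.add_argument("--dialogue-lang", default="auto",
                   choices=["auto", "zh", "en"])
    p.add_argument("--yes", action="store_true")
    p.add_argument("--no-whisper", action="store_true")
    p.add_argument("--no-inspect", action="store_true")
    p.add_argument("--output-dir", default=".")
    return p
```

Parsing `["my_video.mp4", "--style", "clay", "--real-video"]` with this parser yields `style="clay"`, `real_video=True`, and the documented defaults for everything else.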
| Style | Description |
|---|---|
| `pixar` | Pixar 3D: warm soft lighting, subsurface skin glow, richly detailed environments |
| `disney` | Disney 3D: vibrant colors, expressive characters, polished CG render |
| `anime` | Japanese anime: clean linework, vivid colors, cinematic composition |
| `japanese_anime` | Manga/anime: dynamic poses, expressive faces, bold outlines |
| `clay` | Claymation: visible clay texture, warm handcrafted look |
| `lego` | LEGO: blocky minifigures, bright primary colors, brick-built environments |
| `family_guy` | American cartoon: flat colors, thick outlines, comedic proportions |
| `realistic` | Photorealistic cinematic: 35mm film look |
| Model | Flag | Notes |
|---|---|---|
| Seed Dance 2.0 | `--video-model seeddance` | Default. BytePlus ARK `seedance-1-5-pro-251215`. High quality, ~7-17 MB per clip. Requires `BYTEPLUS_API_KEY`. |
| Veo 3 | `--video-model veo` | Google `veo-3.0-generate-001`. Requires `GENAI_API_KEY`. Subject to Google IP/safety filters. |
| Mode | Description |
|---|---|
| `consistency` | Default. Groups shots by scene, generates 2×2 grids for visual consistency, then crops + refines each cell. |
| `default` | Each shot generated independently with character refs. |
| `camera_tree` | Groups shots by camera setup (DAG scheduling). Best for complex multi-angle scenes. |
All intermediate results are cached — rerunning with the same input skips completed steps:
| Cache | Location |
|---|---|
| Analysis JSON | {output_dir}/input_720p_analysis.json |
| Character images | {output_dir}/char_character_NN.png |
| Design sheet | {output_dir}/design_sheet.png |
| Keyframes | {output_dir}/shot_N.png |
| Video clips | {output_dir}/shot_N_video.mp4 |
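The skip-if-cached behaviour can be sketched for the analysis JSON as follows. This is an assumption about how such a cache might work, not the pipeline's own code: `analysis_cache_path` and `cached_or_compute` are hypothetical names, and only the file name `input_720p_analysis.json` comes from the table above.

```python
import json
from pathlib import Path


def analysis_cache_path(output_dir):
    """Location of the cached analysis, as listed in the cache table."""
    return Path(output_dir) / "input_720p_analysis.json"


def cached_or_compute(path, compute):
    """Return (result, was_cached). Runs compute() only when the file is missing."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8")), True
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    return result, False
```

Deleting a cached file is then enough to force that step to run again.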
Two complete examples with all intermediate files (analysis JSON, character refs, keyframes, video clips, final output) are included:
python v2/pipeline.py example/huaqiang_watermelon/input_720p.mp4 \
--style clay --shots 100 --mode consistency --yes --real-video \
--video-model seeddance --dialogue-lang zh --source-frame-grid \
--output-dir example/huaqiang_watermelon

- 18 shots, 6 characters, original Mandarin dialogue preserved
- Final video: `example/huaqiang_watermelon/final_output.mp4`
python v2/pipeline.py example/titanic/input_720p.mp4 \
--style pixar --shots 100 --mode consistency --yes --real-video \
--video-model seeddance --dialogue-lang en \
--output-dir example/titanic

- 12 shots, 5 characters, English dialogue
- Final video: `example/titanic/final_output.mp4`
- Input video resized to 720p (cached)
- Whisper transcribes audio with timestamps
- Video split into ~60s chunks, each analyzed by Gemini in parallel
- Each chunk returns ~10 narrative shots (not one per camera cut)
- Shots merged and scene IDs normalized across chunks
- Analysis cached as `input_720p_analysis.json` for reuse
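The chunk, analyze, and merge flow described above can be sketched as below. Everything here is a simplified assumption rather than the pipeline's implementation: the function names are hypothetical, windows do not overlap, `analyze_chunk` stands in for the Gemini call, and scene IDs are simply renumbered per chunk (a real merge would also have to join scenes that span a chunk boundary).

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_windows(duration, chunk_len=60.0):
    """Split [0, duration) into consecutive (start, end) windows of ~chunk_len seconds."""
    windows, start = [], 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        windows.append((start, end))
        start = end
    return windows


def analyze_parallel(windows, analyze_chunk, max_workers=4):
    """Run analyze_chunk(start, end) for every window; results keep window order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda w: analyze_chunk(*w), windows))


def merge_shots(chunk_results):
    """Concatenate per-chunk shot lists and renumber scene IDs globally."""
    merged, mapping = [], {}
    for chunk_idx, shots in enumerate(chunk_results):
        for shot in shots:
            key = (chunk_idx, shot["scene_id"])
            mapping.setdefault(key, len(mapping) + 1)  # first sighting gets next ID
            merged.append({**shot, "scene_id": mapping[key]})
    return merged
```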
- For each character extracted in Step 1, Gemini renders a reference image: a face close-up alongside a full-body outfit shot
- All characters are then composed into a unified Design Sheet — a single image showing every character side-by-side, used as the master reference for cross-shot consistency
- Both per-character images and the design sheet are cached under `{output_dir}/` and reused on subsequent runs
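A side-by-side design sheet like the one described above can be composed with Pillow. This is a minimal sketch under assumed layout choices (common height, fixed padding, white background); `compose_design_sheet` is a hypothetical helper, and the actual sheet in the pipeline may be laid out differently.

```python
from PIL import Image


def compose_design_sheet(images, cell_height=512, padding=16, background="white"):
    """Paste character images left to right on one canvas, scaled to a common height."""
    scaled = []
    for img in images:
        width = max(1, round(img.width * cell_height / img.height))
        scaled.append(img.convert("RGB").resize((width, cell_height)))
    total_width = sum(im.width for im in scaled) + padding * (len(scaled) + 1)
    sheet = Image.new("RGB", (total_width, cell_height + 2 * padding), background)
    x = padding
    for im in scaled:
        sheet.paste(im, (x, padding))
        x += im.width + padding
    return sheet
```

Given the per-character files, something like `compose_design_sheet([Image.open(p) for p in paths]).save("design_sheet.png")` would produce the sheet.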
- Raw scene descriptions → Gemini rewrites into target style language for keyframe generation
- Dialogue preserved separately (not rewritten); its language is set with `--dialogue-lang zh|en|auto`
- `@character_XX` tokens are kept in t2i prompts (so the keyframe model can map them to the reference images) but replaced with natural-language aliases (e.g. `the man in black zip-up jacket`) in i2v prompts
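The token-to-alias substitution for i2v prompts can be illustrated with a regular expression. This is a sketch: `to_i2v_prompt` is a hypothetical name, and it assumes the `XX` in `@character_XX` is numeric.

```python
import re

TOKEN = re.compile(r"@character_\d+")  # assumed token shape, e.g. @character_01


def to_i2v_prompt(prompt, aliases):
    """Replace each @character_XX token with its natural-language alias."""
    def swap(match):
        # Tokens without a known alias are left untouched.
        return aliases.get(match.group(0), match.group(0))
    return TOKEN.sub(swap, prompt)
```

The t2i prompt is left as-is, so the same source text serves both stages.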
- Shots grouped by `scene_id` into batches of 4
- Reference images per batch (only the characters appearing in that batch's shots; smart per-grid assignment):
  - Default mode: previous grid(s) from the same scene act as the layout reference (grids run sequentially due to this chain)
  - Source-frame-grid mode (`--source-frame-grid`): for each batch, extract the midpoint frame of each shot from the original video and compose them into a 2×2 grid. This becomes the layout reference instead of previous AI grids, and since there is no inter-grid dependency, all grids run concurrently.
- Grid prompts longer than 4000 characters are auto-compressed by Gemini before generation
- Gemini generates a 2×2 grid image for the batch
- Grid cropped into 4 cells → each cell refined with per-character reference images (refinement runs concurrently, up to 10 workers)
- Result: visually consistent keyframes across all shots in a scene
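Two mechanical parts of this step, batching shots by scene and cutting a 2×2 grid into cells, can be sketched as follows. The function names are hypothetical and the code makes simplifying assumptions: shots are dicts with a `scene_id` key, and the grid is split into equal cells with no margins.

```python
from PIL import Image


def batch_by_scene(shots, batch_size=4):
    """Group shots by scene_id (first-seen order), then split each group into batches."""
    groups = {}
    for shot in shots:
        groups.setdefault(shot["scene_id"], []).append(shot)
    batches = []
    for group in groups.values():
        for i in range(0, len(group), batch_size):
            batches.append(group[i:i + batch_size])
    return batches


def crop_grid(grid, rows=2, cols=2):
    """Cut a grid image into rows*cols equal cells, ordered row by row."""
    cell_w, cell_h = grid.width // cols, grid.height // rows
    return [
        grid.crop((c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h))
        for r in range(rows)
        for c in range(cols)
    ]
```

A scene with five shots therefore yields one full batch of four and one batch with a single shot.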
- `@character_XX` tokens replaced with natural-language aliases (e.g. `the man in black zip-up light jacket`) before being sent to the I2V model
- Dialogue speakers use role labels directly from `analysis.json` (e.g. `Jack`, `Rose`, `Huaqiang`, `Vendor 1`)
- Off-screen voiceover (speaker IDs like `narrator`, `background-voice`, `crowd`) gets a special "no visible character speaks" instruction so the I2V model doesn't lip-sync them to on-screen characters
- I2V runs concurrently, with up to 17 workers in parallel
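The off-screen speaker rule and the concurrent dispatch can be sketched together. This is illustrative only: the exact instruction wording sent to the I2V model is not given in this README, so the strings below are placeholders, and `generate_clip` stands in for the real Seed Dance or Veo 3 call.

```python
from concurrent.futures import ThreadPoolExecutor

OFF_SCREEN_SPEAKERS = {"narrator", "background-voice", "crowd"}


def dialogue_instruction(speaker, line):
    """Build the dialogue part of an I2V prompt for one spoken line."""
    if speaker in OFF_SCREEN_SPEAKERS:
        # Placeholder wording; tells the model not to lip-sync anyone on screen.
        return f'Off-screen voice says: "{line}". No visible character speaks.'
    return f'{speaker} says: "{line}"'


def generate_clips(shots, generate_clip, max_workers=17):
    """Call generate_clip(shot) for all shots concurrently; results keep shot order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_clip, shots))
```

Threads suit this workload because each call mostly waits on a remote API rather than using the CPU.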