Soap2Soap

Video-to-Video generation powered by Google Gemini + Seed Dance 2.0. Transform any video into a fully stylized animated version — Pixar, Disney, LEGO, anime, clay, and more — with consistent characters, environments, and cinematic composition preserved across every shot.

Showcase

Each demo shows the original source video (top-left, picture-in-picture) alongside the generated stylized version.

Huaqiang Watermelon Meme → Claymation

huaqiang.mp4

Avengers Doctor Strange Scene → LEGO

avengers.mp4

Empresses in the Palace → Family Guy 2D Cartoon

empresses.mp4

How It Works

Input Video
    ↓
Step 0  Whisper audio transcription (dialogue + timestamps)
    ↓
Step 1  Sliding-window Gemini analysis (~60s chunks, parallel)
        (character extraction + per-shot schema with scene_id, t2i/i2v prompts, dialogue)
    ↓
Step 2  Character reference images (Imagen 3 / Gemini) + unified Design Sheet
    ↓
Step 3  Prompt compilation + Gemini style rewrite (LEGO, Pixar, Disney, etc.)
        Dialogue language unified: --dialogue-lang zh|en|auto
    ↓
Step 4  Keyframe generation — Consistency mode: 2×2 grid per scene group → crop + refine
    ↓
Step 4b Keyframe inspection & auto-fix (skippable with --no-inspect)
    ↓
Step 5  Video clips — Seed Dance 2.0 (default) or Veo 3
    ↓
Step 6  ffmpeg merge → final video (1280×720)
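
The final merge step can be pictured with a short Python sketch. It assumes the clips are joined with ffmpeg's concat demuxer and scaled to 1280×720; the README does not state which ffmpeg invocation the pipeline uses, and the helper names `clip_order`, `concat_list_text`, and `merge_command` are hypothetical.

```python
import re


def clip_order(name):
    """Sort key: the integer N in a clip name such as shot_N_video.mp4."""
    match = re.search(r"shot_(\d+)_video\.mp4$", name)
    if match is None:
        raise ValueError(f"unexpected clip name: {name}")
    return int(match.group(1))


def concat_list_text(clips):
    """Build the text of an ffmpeg concat list, clips ordered by shot number."""
    lines = []
    for clip in sorted(clips, key=clip_order):
        # The concat list format needs single quotes inside a path escaped.
        escaped = clip.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def merge_command(list_path, output_path, width=1280, height=720):
    """Return the ffmpeg argument list (assumed invocation, not the pipeline's)."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",
        "-i", list_path,
        "-vf", f"scale={width}:{height}",
        output_path,
    ]
```

Sorting on the numeric shot index keeps `shot_10_video.mp4` after `shot_2_video.mp4`, which a plain string sort would not. The command list can be passed to `subprocess.run` once the list text has been written to `list_path`.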

Setup

1. Prerequisites

Python 3.10+

ffmpeg

brew install ffmpeg        # macOS
sudo apt install ffmpeg    # Ubuntu

2. Install Python Dependencies

pip install google-genai Pillow opencv-python "scenedetect[opencv]" openai-whisper byteplus-python-sdk-v2 httpx PyJWT

macOS (Homebrew Python): add --break-system-packages if needed.

Package	Purpose
`google-genai`	Gemini API (video analysis, image generation, Veo 3)
`Pillow`	Image processing
`opencv-python`	Video reading
`scenedetect[opencv]`	Camera cut detection
`openai-whisper`	Speech-to-text transcription
`byteplus-python-sdk-v2`	Seed Dance 2.0 (BytePlus ARK)
`httpx`	Async HTTP (video download)
`PyJWT`	JWT auth (reserved; not needed for Seed Dance)

3. Set API Keys

Gemini (required for analysis + keyframes):

export GENAI_API_KEY="your_gemini_api_key"

Seed Dance 2.0 (required for real video generation, default):

export BYTEPLUS_API_KEY="ark-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

Get your key from BytePlus ARK.

Veo 3 (optional alternative video model):

export GENAI_API_KEY="your_gemini_api_key"   # same key as above

Runware (optional — for GPT Image 2 keyframe generation):

export RUNWARE_API_KEY="your_runware_api_key"

Get your key from Runware. Used when --keyframe-model gpt-image is set.
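
A quick pre-flight check can catch a missing key before a long run starts. The sketch below is not part of the pipeline; it only encodes the key requirements listed above, and the function name `missing_keys` and its parameters are hypothetical.

```python
import os


def missing_keys(video_model="seeddance", keyframe_model="gemini",
                 real_video=False, env=None):
    """Return the environment variables that are needed but unset."""
    env = os.environ if env is None else env
    # Gemini is always needed: analysis and (by default) keyframes use it.
    needed = ["GENAI_API_KEY"]
    if keyframe_model == "gpt-image":
        needed.append("RUNWARE_API_KEY")
    if real_video and video_model == "seeddance":
        needed.append("BYTEPLUS_API_KEY")
    # Veo 3 reuses GENAI_API_KEY, so it adds nothing to the list.
    return [name for name in dict.fromkeys(needed) if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys(real_video=True)
    if missing:
        print("Missing API keys:", ", ".join(missing))
```

Passing `env` explicitly makes the check easy to exercise without touching the real environment.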

Quick Start

# Seed Dance 2.0 (default), clay style, Chinese dialogue
python v2/pipeline.py my_video.mp4 --style clay --real-video --dialogue-lang zh

# With source-frame layout reference — preserves original compositions
python v2/pipeline.py my_video.mp4 --style clay --real-video --source-frame-grid

# GPT Image 2 (Runware) for keyframes
RUNWARE_API_KEY=xxx python v2/pipeline.py my_video.mp4 --style pixar --keyframe-model gpt-image --real-video

# Veo 3 alternative
python v2/pipeline.py my_video.mp4 --style pixar --real-video --video-model veo

# Dev mode: omit --real-video (fast, no API cost for video)
python v2/pipeline.py my_video.mp4 --style disney

# Full options
python v2/pipeline.py my_video.mp4 \
  --style clay \
  --shots 100 \
  --mode consistency \
  --real-video \
  --video-model seeddance \
  --dialogue-lang zh \
  --output-dir ./output \
  --yes \
  --no-inspect

All CLI Options

Flag	Default	Description
`--style`	`disney`	Target visual style
`--shots`	`10`	Max shots to generate
`--mode`	`consistency`	Keyframe generation mode
`--keyframe-model`	`gemini`	Image model: `gemini` (default) or `gpt-image` (Runware GPT Image 2)
`--source-frame-grid`	off	Layout reference: extract the midpoint frame from each source-video shot and compose a 2×2 reference grid to guide keyframe layout. Unlocks concurrent grid generation.
`--real-video`	off	Use real video model (Seed Dance / Veo 3)
`--video-model`	`seeddance`	Video model: `seeddance` or `veo`
`--dialogue-lang`	`auto`	Dialogue language in i2v prompt: `auto`, `zh`, `en`
`--yes`	off	Auto-confirm when >16 shots detected
`--no-whisper`	off	Skip audio transcription
`--no-inspect`	off	Skip Step 4b keyframe inspection
`--output-dir`	`.`	Output directory
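
For readers who want to script around the CLI, the flag table maps naturally onto an `argparse` parser. This is a sketch reconstructed from the table alone, not the actual parser in `v2/pipeline.py`; the positional argument name `video` and the `choices` lists are assumptions.

```python
import argparse


def build_parser():
    """Parser mirroring the documented flags and defaults (reconstruction)."""
    parser = argparse.ArgumentParser(prog="pipeline.py")
    parser.add_argument("video", help="input video path")
    parser.add_argument("--style", default="disney")
    parser.add_argument("--shots", type=int, default=10)
    parser.add_argument("--mode", default="consistency",
                        choices=["consistency", "default", "camera_tree"])
    parser.add_argument("--keyframe-model", default="gemini",
                        choices=["gemini", "gpt-image"])
    parser.add_argument("--source-frame-grid", action="store_true")
    parser.add_argument("--real-video", action="store_true")
    parser.add_argument("--video-model", default="seeddance",
                        choices=["seeddance", "veo"])
    parser.add_argument("--dialogue-lang", default="auto",
                        choices=["auto", "zh", "en"])
    parser.add_argument("--yes", action="store_true")
    parser.add_argument("--no-whisper", action="store_true")
    parser.add_argument("--no-inspect", action="store_true")
    parser.add_argument("--output-dir", default=".")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Flags listed as "off" in the table become `store_true` switches, so they default to `False` when omitted.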

Styles

Style	Description
`pixar`	Pixar 3D — warm soft lighting, subsurface skin glow, richly detailed environments
`disney`	Disney 3D — vibrant colors, expressive characters, polished CG render
`anime`	Japanese anime — clean linework, vivid colors, cinematic composition
`japanese_anime`	Manga/anime — dynamic poses, expressive faces, bold outlines
`clay`	Claymation — visible clay texture, warm handcrafted look
`lego`	LEGO — blocky minifigures, bright primary colors, brick-built environments
`family_guy`	American cartoon — flat colors, thick outlines, comedic proportions
`realistic`	Photorealistic cinematic — 35mm film look

Video Models

Model	Flag	Notes
Seed Dance 2.0	`--video-model seeddance`	Default. BytePlus ARK model `seedance-1-5-pro-251215`. High quality, roughly 7 to 17 MB per clip. Requires `BYTEPLUS_API_KEY`.
Veo 3	`--video-model veo`	Google `veo-3.0-generate-001`. Requires `GENAI_API_KEY`. Subject to Google IP/safety filters.

Generation Modes

Mode	Description
`consistency`	Default. Groups shots by scene, generates 2×2 grids for visual consistency, then crops + refines each cell.
`default`	Each shot generated independently with character refs.
`camera_tree`	Groups shots by camera setup (DAG scheduling). Best for complex multi-angle scenes.

Smart Caching

All intermediate results are cached — rerunning with the same input skips completed steps:

Cache	Location
Analysis JSON	`{output_dir}/input_720p_analysis.json`
Character images	`{output_dir}/char_character_NN.png`
Design sheet	`{output_dir}/design_sheet.png`
Keyframes	`{output_dir}/shot_N.png`
Video clips	`{output_dir}/shot_N_video.mp4`
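
To see which steps a rerun would skip, the cache locations above can be checked directly. The sketch below assumes that `NN` is a zero-padded two-digit character index and `N` is an unpadded shot number; the helper names are hypothetical and the pipeline's own cache logic may differ.

```python
from pathlib import Path


def cache_paths(output_dir, shot_numbers, character_numbers):
    """Map a label to the expected cache file for each intermediate result."""
    out = Path(output_dir)
    paths = {
        "analysis": out / "input_720p_analysis.json",
        "design_sheet": out / "design_sheet.png",
    }
    for n in character_numbers:
        # Assumption: NN is zero-padded to two digits.
        paths[f"character_{n:02d}"] = out / f"char_character_{n:02d}.png"
    for n in shot_numbers:
        paths[f"keyframe_{n}"] = out / f"shot_{n}.png"
        paths[f"clip_{n}"] = out / f"shot_{n}_video.mp4"
    return paths


def missing_artifacts(paths):
    """Return the labels whose cache file does not exist yet."""
    return sorted(name for name, path in paths.items() if not path.exists())
```

Deleting one of these files is then a simple way to force only that artifact to be regenerated, assuming the pipeline checks for the file's presence.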

Examples

Two complete examples with all intermediate files (analysis JSON, character refs, keyframes, video clips, final output) are included:

Huaqiang Watermelon Meme — Clay Style

python v2/pipeline.py example/huaqiang_watermelon/input_720p.mp4 \
  --style clay --shots 100 --mode consistency --yes --real-video \
  --video-model seeddance --dialogue-lang zh --source-frame-grid \
  --output-dir example/huaqiang_watermelon

18 shots, 6 characters, original Mandarin dialogue preserved
Final video: example/huaqiang_watermelon/final_output.mp4

Titanic — Pixar Style

python v2/pipeline.py example/titanic/input_720p.mp4 \
  --style pixar --shots 100 --mode consistency --yes --real-video \
  --video-model seeddance --dialogue-lang en \
  --output-dir example/titanic

12 shots, 5 characters, English dialogue
Final video: example/titanic/final_output.mp4

Architecture

Video Analysis (Step 1)

Input video resized to 720p (cached)
Whisper transcribes audio with timestamps
Video split into ~60s chunks, each analyzed by Gemini in parallel
Each chunk returns ~10 narrative shots (not one per camera cut)
Shots merged and scene IDs normalized across chunks
Analysis cached as input_720p_analysis.json for reuse
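
The chunk-and-merge flow can be sketched in a few lines. The sketch assumes non-overlapping windows, chunk-local shot timestamps, and a simple renumbering of scene IDs in order of first appearance; the real pipeline may overlap windows or match scenes across chunks differently. The shot fields `start`, `end`, and `scene_id` follow the per-shot schema only loosely, and both function names are hypothetical.

```python
def chunk_windows(duration_s, chunk_s=60.0):
    """Split [0, duration_s) into consecutive windows of at most chunk_s seconds."""
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


def merge_chunks(chunk_results):
    """Merge per-chunk shots into one timeline with globally unique scene IDs.

    chunk_results is a list of (window_start, shots) pairs; each shot is a
    dict with chunk-local 'start' and 'end' times and a chunk-local 'scene_id'.
    """
    merged = []
    scene_map = {}
    for chunk_index, (offset, shots) in enumerate(chunk_results):
        for shot in shots:
            key = (chunk_index, shot["scene_id"])
            if key not in scene_map:
                scene_map[key] = len(scene_map) + 1
            merged.append({
                **shot,
                "start": shot["start"] + offset,
                "end": shot["end"] + offset,
                "scene_id": scene_map[key],
            })
    merged.sort(key=lambda s: s["start"])
    return merged
```

Renumbering by `(chunk_index, scene_id)` only guarantees that IDs from different chunks never collide; it does not detect a scene that continues across a chunk boundary.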

Character References (Step 2)

For each character extracted in Step 1, Gemini renders a reference image: a face close-up alongside a full-body outfit shot
All characters are then composed into a unified Design Sheet — a single image showing every character side-by-side, used as the master reference for cross-shot consistency
Both per-character images and the design sheet are cached under {output_dir}/ and reused on subsequent runs

Prompt Pipeline (Step 3)

Raw scene descriptions → Gemini rewrites into target style language for keyframe generation
Dialogue preserved separately (not rewritten): --dialogue-lang zh|en|auto
@character_XX tokens are kept in t2i prompts (so the keyframe model can map them to the reference images) but replaced with natural-language aliases (e.g. the man in black zip-up jacket) in i2v prompts
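
The token handling can be illustrated with a small regex substitution. This sketch assumes tokens have the form `@character_` followed by digits and that aliases come from a plain dictionary; the function name `to_i2v_prompt` and the alias table are hypothetical.

```python
import re

# Assumption: tokens look like @character_01, @character_02, ...
TOKEN = re.compile(r"@character_\d+")


def to_i2v_prompt(t2i_prompt, aliases):
    """Replace each known @character_XX token with its natural-language alias."""
    def swap(match):
        token = match.group(0)
        # Unknown tokens are left untouched rather than dropped.
        return aliases.get(token, token)
    return TOKEN.sub(swap, t2i_prompt)


if __name__ == "__main__":
    aliases = {"@character_01": "the man in black zip-up jacket"}
    print(to_i2v_prompt("@character_01 picks up a watermelon", aliases))
```

The t2i prompt is passed through unchanged for keyframes, and only the i2v copy goes through this substitution.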

Keyframe Generation (Step 4 — Consistency Mode)

Shots grouped by scene_id into batches of 4
Reference images per batch: character references cover only the characters appearing in that batch's shots (smart per-grid assignment). The layout reference depends on the flag:
- Without --source-frame-grid (the default): previous grid(s) from the same scene act as the layout reference, so grids run sequentially along this chain
- With --source-frame-grid: for each batch, the midpoint frame of each shot is extracted from the original video and the frames are composed into a 2×2 grid. This grid replaces previous AI grids as the layout reference, and since there is no inter-grid dependency, all grids run concurrently.
Grid prompts longer than 4000 characters are auto-compressed by Gemini before generation
Gemini generates a 2×2 grid image for the batch
Grid cropped into 4 cells → each cell refined with per-character reference images (refinement runs concurrently, up to 10 workers)
Result: visually consistent keyframes across all shots in a scene
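
The batching and cropping steps are easy to sketch without any image library. The code below assumes each shot is a dict with a `scene_id` and a hypothetical `characters` list, and that grid cells are read in row-major order; the crop boxes use the `(left, upper, right, lower)` convention that Pillow's `Image.crop` accepts.

```python
from itertools import groupby


def batch_shots(shots, batch_size=4):
    """Group shots by scene_id, then split each scene into batches of 4."""
    batches = []
    # sorted() is stable, so shot order within a scene is preserved.
    ordered = sorted(shots, key=lambda s: s["scene_id"])
    for _, group in groupby(ordered, key=lambda s: s["scene_id"]):
        scene_shots = list(group)
        for i in range(0, len(scene_shots), batch_size):
            batches.append(scene_shots[i:i + batch_size])
    return batches


def batch_characters(batch):
    """Characters appearing in a batch, i.e. the references that grid needs."""
    return sorted({c for shot in batch for c in shot.get("characters", [])})


def grid_cells(width, height):
    """Crop boxes (left, upper, right, lower) for a 2x2 grid, row-major."""
    half_w, half_h = width // 2, height // 2
    return [
        (0, 0, half_w, half_h),
        (half_w, 0, width, half_h),
        (0, half_h, half_w, height),
        (half_w, half_h, width, height),
    ]
```

A scene with five shots yields one full batch and one batch of a single shot, so some grids will have unused cells under this scheme.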

I2V Prompt Pipeline (Step 5)

@character_XX tokens replaced with natural-language aliases (e.g. the man in black zip-up light jacket) before being sent to the I2V model
Dialogue speakers use role labels directly from analysis.json (e.g. Jack, Rose, Huaqiang, Vendor 1)
Off-screen voiceover (speaker IDs like narrator, background-voice, crowd) gets a special "no visible character speaks" instruction so the I2V model doesn't lip-sync them to on-screen characters
I2V runs concurrently — up to 17 workers in parallel
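
The speaker handling and the worker pool can be sketched together. The off-screen speaker IDs and the limit of 17 workers come from the list above; the instruction wording, the function names, and the `generate_clip` callable are hypothetical stand-ins for whatever the pipeline sends to the I2V model.

```python
from concurrent.futures import ThreadPoolExecutor

OFFSCREEN_SPEAKERS = {"narrator", "background-voice", "crowd"}
MAX_I2V_WORKERS = 17


def dialogue_instruction(speaker, line):
    """Build the dialogue part of an I2V prompt for one spoken line."""
    if speaker.lower() in OFFSCREEN_SPEAKERS:
        # Off-screen voices must not be lip-synced to on-screen characters.
        return f'Off-screen voice says: "{line}". No visible character speaks.'
    return f'{speaker} says: "{line}"'


def run_i2v(jobs, generate_clip):
    """Run generate_clip over all jobs with up to 17 threads, keeping order."""
    with ThreadPoolExecutor(max_workers=MAX_I2V_WORKERS) as pool:
        return list(pool.map(generate_clip, jobs))
```

`pool.map` returns results in the order of `jobs`, so clips line up with their shots even when they finish out of order.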
