System design and component overview.
┌─────────────────────────────────────────────────────────────────────────┐
│ User Interface │
│ │
│ ./montage-ai.sh run hitchcock --cgpu --upscale │
│ │ │
└──────────────────────────────┼───────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Docker Container │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Creative │ │ Style │ │ Footage │ │
│ │ Director │────▶│ Templates │────▶│ Manager │ │
│ │ (LLM) │ │ (JSON) │ │ (Story Arc) │ │
│ └────────┬────────┘ └─────────────────┘ └────────┬────────┘ │
│ │ │ │
│ └───────────────────┬───────────────────────────┘ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Editor │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Beat │ │ Scene │ │ Clip │ │ Video │ │ │
│ │ │ Detection│─▶│ Detection│─▶│ Assembly │─▶│ Rendering│ │ │
│ │ │(FFmpeg) │ │(scenedet)│ │ │ │ (FFmpeg) │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Shorts Studio (v1.2+) │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Smart │ │ Caption │ │ Audio │ │ Highlight│ │ │
│ │ │ Reframe │ │ Burn │ │ Polish │ │ Detect │ │ │
│ │ │(MediaPipe)│ │ (Styles) │ │(Sidechain)│ │ (Multi) │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Enhancement Pipeline │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Upscale │ │ Stabilize│ │ Color │ │ Sharpen │ │ │
│ │ │(ESRGAN) │ │ (FFmpeg) │ │ Grade │ │ │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
└───────────────────────────────┼──────────────────────────────────────────┘
▼
/data/output/
montage_001.mp4
For systems with limited resources (e.g., laptops), Montage AI supports a hybrid mode where heavy compute tasks are offloaded to the cloud via cgpu.
See cgpu-setup.md for setup details.
- LLM Inference: Offloaded to Google Gemini via `cgpu serve`.
- Upscaling: Offloaded to Google Colab GPUs via `cgpu run`.
- Local: Orchestration, cutting, and basic rendering.
The editing engine has been refactored into a modular pipeline.
MontageBuilder (montage_builder.py)
The central orchestrator that executes the editing pipeline in phases:
- Setup: Initialize workspace and logging.
- Analyze: Process audio (beats/energy) and video (scenes/content).
- Plan: Select clips and map them to the timeline based on the story arc.
- Enhance: Apply stabilization, upscaling, and color grading.
- Render: Generate the final video file.
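A minimal sketch of this phase sequence (class shape and method names here are illustrative, not the real `montage_builder.py` API):

```python
class MontageBuilder:
    """Illustrative orchestrator: runs the five phases in order."""

    PHASES = ("setup", "analyze", "plan", "enhance", "render")

    def __init__(self, job_id: str):
        self.job_id = job_id
        self.completed: list[str] = []

    def run(self) -> list[str]:
        # Each phase runs in order; an exception aborts the pipeline early.
        for phase in self.PHASES:
            getattr(self, f"_{phase}")()
            self.completed.append(phase)
        return self.completed

    def _setup(self):   pass  # init workspace + logging
    def _analyze(self): pass  # beats/energy + scenes/content
    def _plan(self):    pass  # map clips to timeline via story arc
    def _enhance(self): pass  # stabilize, upscale, color grade
    def _render(self):  pass  # final encode
```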
Components:
| Module | Purpose |
|---|---|
| `audio_analysis.py` | Beat detection, tempo extraction, energy profiling (FFmpeg astats/tempo; librosa optional) |
| `scene_analysis.py` | Scene detection, content analysis, visual similarity with LRU cache |
| `auto_reframe.py` | Auto Reframe Engine: 9:16 conversion using convex optimization (L2) for cinematic camera paths |
| `video_metadata.py` | Technical metadata extraction (ffprobe wrapper) |
| `clip_enhancement.py` | Stabilization, upscaling, color matching (local/cloud hybrid) |
| `audio_enhancer.py` | Audio Polish: voice isolation, auto-ducking, loudness normalization (Pro Polish) |
| `shorts_workflow.py` | Shorts Studio: dedicated pipeline for vertical content generation |
| `ffmpeg_config.py` | GPU encoder detection (NVENC/VAAPI/QSV), encoding parameters |
| Optimization | Implementation | Impact |
|---|---|---|
| Preview Pipeline | Low-res (360p) + `ultrafast` preset | 10x faster iteration loop |
| LRU Histogram Cache | `@lru_cache` for frame extraction | 91% cache hit rate, 2-3x faster clip selection |
| Parallel Scene Detection | `ProcessPoolExecutor` (`max_workers = min(len(videos), max(4, cpu_count // 2), MAX_SCENE_WORKERS)`; ThreadPool fallback) | 3-4x speedup on multi-core |
| FFmpeg Beat Detection | `astats` + `tempo` filters (primary path) | Portable, no heavy deps; librosa optional via try/except |
| Auto GPU Encoding | NVENC > VAAPI > QSV > CPU | 2-6x encoding speedup |
| Hardware-Adjacent (Web) | Server-Sent Events (SSE) + `os.nice(10)` | Zero polling overhead, responsive UI under load |
| Lazy Loading (CLI) | Import heavy libs only when needed | Instant CLI startup |
| Cluster Efficiency | `imagePullPolicy: IfNotPresent` (dev overlay uses `Always`) | Minimized network traffic for cached images |
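The worker-count formula from the Parallel Scene Detection row, as a standalone function (the `MAX_SCENE_WORKERS` value here is illustrative; the real cap lives in the codebase):

```python
import os

MAX_SCENE_WORKERS = 8  # illustrative cap, not the project's actual constant

def scene_workers(num_videos: int) -> int:
    """Bounded by the number of videos, half the CPUs (at least 4),
    and a hard cap -- the formula from the optimization table."""
    cpu_count = os.cpu_count() or 1
    return min(num_videos, max(4, cpu_count // 2), MAX_SCENE_WORKERS)
```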
A thin wrapper that initializes the MontageBuilder and handles CLI arguments.
Translates natural language to editing parameters.
Responsibilities:
- Parse user prompts
- Query LLM (Ollama or Gemini)
- Validate JSON responses
- Map to style parameters
- Incorporate RegisseurMemory hints when available
- Emit schema-versioned outputs for compatibility (`schema_version`)
Backends:
| Backend | Protocol | Model |
|---|---|---|
| Ollama | REST API | llama3.1:70b |
| cgpu/Gemini | OpenAI-compatible | gemini-2.0-flash |
Flow:
User Prompt
│
▼
┌─────────────────┐
│ System Prompt │
│ + Style Options │
│ + Examples │
└────────┬────────┘
│
▼
┌─────────────────┐
│ LLM (Ollama or │
│ Gemini) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ JSON Validation │
│ + Schema Check │
└────────┬────────┘
│
▼
Editing Instructions
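The JSON validation step can be sketched with the stdlib alone. The field names below are hypothetical; the real schema is defined by the Creative Director:

```python
import json

# Hypothetical required fields -- the real schema may differ.
REQUIRED = {"style", "cut_speed", "schema_version"}

def parse_llm_response(raw: str) -> dict:
    """Validate the LLM's JSON reply before mapping it to style parameters."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM returned invalid JSON: {exc}") from exc
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["schema_version"] != 1:
        raise ValueError(f"unsupported schema_version: {data['schema_version']}")
    return data
```

Rejecting malformed replies here, before they reach the editing pipeline, is what makes an unreliable LLM backend safe to depend on.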
Professional-grade clip management with story arc awareness.
Key concepts:
| Concept | Description |
|---|---|
| UsageStatus | Track if clip is UNUSED, USED, or RESERVED |
| SceneType | Classify clips: ESTABLISHING, ACTION, DETAIL, PORTRAIT, SCENIC |
| StoryPhase | Timeline position: INTRO, BUILD, CLIMAX, SUSTAIN, OUTRO |
| FootageClip | Data class with all clip metadata |
| FootagePoolManager | Manages available clip pool |
| StoryArcController | Maps timeline position to requirements |
Selection algorithm:
Current Position → Story Phase → Required Energy + Scene Type
│
▼
┌───────────────────────┐
│ Score Available Clips │
│ - Energy match │
│ - Scene type match │
│ - Visual interest │
│ - Variety bonus │
└───────────┬───────────┘
│
▼
Select Highest Score
Mark as USED
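A compressed sketch of the scoring loop above. The weights and data fields are illustrative, not the real `FootagePoolManager` implementation:

```python
from dataclasses import dataclass
from enum import Enum, auto

class SceneType(Enum):
    ESTABLISHING = auto()
    ACTION = auto()
    DETAIL = auto()
    PORTRAIT = auto()
    SCENIC = auto()

@dataclass
class Clip:
    name: str
    energy: float            # 0..1
    scene_type: SceneType
    visual_interest: float   # 0..1
    used: bool = False

def score(clip: Clip, want_energy: float, want_type: SceneType,
          recent_types: list) -> float:
    """Weighted scoring per the diagram; weights are made up for illustration."""
    s = 1.0 - abs(clip.energy - want_energy)          # energy match
    s += 1.0 if clip.scene_type is want_type else 0.0 # scene type match
    s += 0.5 * clip.visual_interest                   # visual interest
    if clip.scene_type not in recent_types:           # variety bonus
        s += 0.25
    return s

def select(pool: list, want_energy: float, want_type: SceneType,
           recent_types: list) -> Clip:
    best = max((c for c in pool if not c.used),
               key=lambda c: score(c, want_energy, want_type, recent_types))
    best.used = True  # mark as USED so it is not picked again
    return best
```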
Loads and validates JSON style presets.
Responsibilities:
- Discover preset files
- Parse and validate JSON
- Merge defaults with overrides
- Cache loaded templates
File discovery order:
- Built-in: `src/montage_ai/styles/*.json`
- Env override: `STYLE_PRESET_DIR/*.json`
- Single file: `STYLE_PRESET_PATH`
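One reading of that discovery order as code, with the most specific source winning; the real loader may merge sources rather than short-circuit:

```python
import os
from pathlib import Path

def discover_presets(builtin_dir: str = "src/montage_ai/styles") -> list:
    """Resolve style preset files: explicit file beats env dir beats built-ins."""
    single = os.environ.get("STYLE_PRESET_PATH")
    if single:
        return [Path(single)]
    override = os.environ.get("STYLE_PRESET_DIR")
    if override:
        return sorted(Path(override).glob("*.json"))
    return sorted(Path(builtin_dir).glob("*.json"))
```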
Offloads AI upscaling to free cloud GPUs.
Flow:
Video Frames (local)
│
▼
┌─────────────────┐
│ cgpu connect │
│ (Google Colab) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Real-ESRGAN │
│ (T4/A100 GPU) │
└────────┬────────┘
│
▼
Upscaled Frames (local)
Real-time decision logging for debugging.
Logged events:
- Clip selection decisions
- Beat alignment choices
- Energy level changes
- Phase transitions
- Performance metrics
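A minimal sketch of such a decision logger; the real event names and output sink may differ:

```python
import json
import time

class DecisionLog:
    """Append-only, real-time decision log, streamed as JSON lines."""

    def __init__(self):
        self.events = []

    def log(self, kind: str, **details):
        event = {"t": time.time(), "kind": kind, **details}
        self.events.append(event)
        print(json.dumps(event))  # stream to stdout for live debugging

log = DecisionLog()
log.log("clip_selected", clip="beach_042.mp4", score=2.31, phase="BUILD")
log.log("phase_transition", from_phase="BUILD", to_phase="CLIMAX")
```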
/data/input/*.mp4
│
├──▶ Scene Detection ──▶ Scene List
│
├──▶ Energy Analysis ──▶ Clip Scores
│
└──▶ Metadata Extraction ──▶ Clip Database
/data/music/*.mp3
│
├──▶ Beat Detection ──▶ Beat Timestamps
│
├──▶ Tempo Analysis ──▶ BPM
│
└──▶ Energy Curve ──▶ Energy Timeline
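The FFmpeg side of the energy curve can be sketched as follows. The `astats`/`ametadata` filter graph is standard FFmpeg; the exact graph used by `audio_analysis.py` may differ:

```python
def astats_command(audio_path: str) -> list:
    """Build an FFmpeg command that prints one RMS reading per analysis
    window (reset=1), which parse_rms turns into an energy curve."""
    return [
        "ffmpeg", "-i", audio_path, "-af",
        "astats=metadata=1:reset=1,"
        "ametadata=print:key=lavfi.astats.Overall.RMS_level:file=-",
        "-f", "null", "-",
    ]

def parse_rms(lines: list) -> list:
    """Pull the RMS values (in dB) out of ametadata's printed output."""
    values = []
    for line in lines:
        if "RMS_level=" in line:
            values.append(float(line.split("=", 1)[1]))
    return values
```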
Beat Timeline + Clip Database + Style Parameters
│
▼
┌─────────────────┐
│ For each beat: │
│ - Get story │
│ phase │
│ - Score clips │
│ - Select best │
│ - Mark used │
└────────┬────────┘
│
▼
Clip Sequence
Clip Sequence
│
├──▶ Crop/Scale to STANDARD_WIDTH x STANDARD_HEIGHT (1080x1920)
│
├──▶ Optional: Upscale (Real-ESRGAN)
│
├──▶ Optional: Stabilize (vidstab 2-pass)
│
├──▶ Color Grade (20+ presets)
│
└──▶ Progressive Renderer
│
├──▶ Batch clips (default 25)
├──▶ Write segments to disk
├──▶ FFmpeg concat (-c copy)
├──▶ Optional: xfade transitions
└──▶ Audio mix + Logo overlay
│
▼
/data/output/montage.mp4
The system uses ProgressiveRenderer (in segment_writer.py) to prevent OOM crashes.
┌─────────────────────────────────────────────────────────────┐
│ Progressive Renderer │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Clip 1 │ │ Clip 2 │ │ ... │ │ Clip 25 │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │ │
│ └─────────────┴─────────────┴─────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ flush_batch() │ │
│ │ - Normalize │ │
│ │ - Write segment │ │
│ │ - GC + cleanup │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ segment_0001.mp4 (disk) │
│ │
│ ... repeat for all batches ... │
│ │
│ ┌──────────────────┐ │
│ │ finalize() │ │
│ │ - FFmpeg concat │ │
│ │ - Audio mix │ │
│ │ - Logo overlay │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ output.mp4 │
└─────────────────────────────────────────────────────────────┘
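The batching logic condenses to something like this. Method names mirror the diagram; the real `segment_writer.py` also normalizes clips, encodes each segment, and runs GC/cleanup:

```python
class ProgressiveRenderer:
    """Buffer clips, flush each full batch to a disk segment, then
    concatenate the segments losslessly at the end."""

    def __init__(self, batch_size: int = 25):
        self.batch_size = batch_size
        self.buffer = []
        self.segments = []

    def add(self, clip: str):
        self.buffer.append(clip)
        if len(self.buffer) >= self.batch_size:
            self.flush_batch()

    def flush_batch(self):
        if not self.buffer:
            return
        seg = f"segment_{len(self.segments) + 1:04d}.mp4"
        # real code: normalize clips, encode the segment, gc + cleanup
        self.segments.append(seg)
        self.buffer.clear()

    def finalize(self) -> list:
        self.flush_batch()  # flush any partial final batch
        # real code: ffmpeg -f concat ... -c copy, audio mix, logo overlay
        return self.segments
```

Because only one batch of clips is ever in memory, peak RAM stays flat no matter how long the montage is.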
Key Constants (Dynamically Determined):
These constants are automatically determined from input footage using determine_output_profile():
| Constant | Default | Determination Method |
|---|---|---|
| `STANDARD_WIDTH` | 1080 | Weighted median of input widths, snapped to standard presets |
| `STANDARD_HEIGHT` | 1920 | Weighted median of input heights, snapped to standard presets |
| `STANDARD_FPS` | 30 | Weighted median of input frame rates |
| `STANDARD_PIX_FMT` | yuv420p | Dominant pixel format from inputs (by duration) |
| `TARGET_CODEC` | libx264 | Dominant codec from inputs, or env `OUTPUT_CODEC` |
| `TARGET_PROFILE` | high | Auto-selected based on resolution (level 4.1 for HD, 5.1 for 4K) |
| `TARGET_BITRATE` | auto | Weighted median of input bitrates, or calculated from pixel count |
Output Profile Heuristics:
- Orientation (horizontal/vertical/square) determined by weighted aspect ratios
- Resolution snapped to common presets (1080p, 720p, 4K) if within 12% variance
- Avoids upscaling beyond maximum input resolution
- Honors environment overrides: `OUTPUT_CODEC`, `OUTPUT_PIX_FMT`, `OUTPUT_PROFILE`, `OUTPUT_LEVEL`
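The weighted-median-plus-snapping heuristic, sketched below. Weighting by clip duration and the 12% variance window follow the text above; the function names are illustrative:

```python
def weighted_median(values: list, weights: list) -> float:
    """Median of `values` where each value counts with its weight
    (e.g. clip duration), so long clips dominate the output profile."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    acc = 0.0
    for value, weight in pairs:
        acc += weight
        if acc >= half:
            return value
    return pairs[-1][0]

def snap(value: float, presets=(720, 1080, 2160), tolerance=0.12) -> int:
    """Snap to the nearest standard preset if within 12% variance,
    otherwise keep the (rounded) measured value."""
    for preset in presets:
        if abs(value - preset) / preset <= tolerance:
            return preset
    return round(value)
```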
| Dependency | Purpose | Version |
|---|---|---|
| FFmpeg | Video encoding/processing (beat detection via astats/tempo) | Latest |
| MoviePy | Video manipulation | 2.2+ |
| OpenCV | Frame processing | 4.12+ |
| scenedetect | Scene detection | 0.6+ |
| numpy | Numerical operations | 2.0+ |
| Pydantic | Data validation | 2.0+ |
| Real-ESRGAN | AI upscaling | Latest |
| OpenAI SDK | cgpu/Gemini client | 1.0+ |
Note: librosa has been removed. Beat detection now uses FFmpeg `astats`/`tempo` filters as the primary engine (portable, no heavy deps). See `audio_analysis.py`.
- Clip enhancement runs in parallel threads
- Frame upscaling can use cloud GPU
- FFmpeg uses multi-threading
- Clips loaded on-demand, not all at once
- Temporary files cleaned after processing
- Large videos processed in chunks
- Auto-detection of available GPU encoders
- Fallback chain: NVENC → VAAPI → QSV → CPU
- Cloud GPU option for heavy workloads
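The fallback chain can be probed by parsing `ffmpeg -encoders` output; a sketch follows (the encoder names are the standard FFmpeg ones, but `ffmpeg_config.py` may instead verify each encoder with a test encode):

```python
import subprocess

# Preference order from the fallback chain: NVENC > VAAPI > QSV > CPU.
PREFERENCE = ["h264_nvenc", "h264_vaapi", "h264_qsv", "libx264"]

def pick_encoder(encoders_output: str) -> str:
    """Return the first preferred encoder mentioned in `ffmpeg -encoders` output."""
    for enc in PREFERENCE:
        if enc in encoders_output:
            return enc
    return "libx264"  # CPU fallback is always available

def detect() -> str:
    out = subprocess.run(["ffmpeg", "-hide_banner", "-encoders"],
                         capture_output=True, text=True).stdout
    return pick_encoder(out)
```

Note that an encoder being listed does not guarantee the matching hardware is usable, which is why a test encode is a more robust probe.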