Architecture

System design and component overview.


High-Level Flow

┌─────────────────────────────────────────────────────────────────────────┐
│                           User Interface                                 │
│                                                                          │
│   ./montage-ai.sh run hitchcock --cgpu --upscale                        │
│                              │                                           │
└──────────────────────────────┼───────────────────────────────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        Docker Container                                  │
│                                                                          │
│  ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐   │
│  │ Creative        │     │ Style           │     │ Footage         │   │
│  │ Director        │────▶│ Templates       │────▶│ Manager         │   │
│  │ (LLM)           │     │ (JSON)          │     │ (Story Arc)     │   │
│  └────────┬────────┘     └─────────────────┘     └────────┬────────┘   │
│           │                                               │             │
│           └───────────────────┬───────────────────────────┘             │
│                               ▼                                          │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                         Editor                                    │   │
│  │                                                                   │   │
│  │   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐        │   │
│  │   │ Beat     │  │ Scene    │  │ Clip     │  │ Video    │        │   │
│  │   │ Detection│─▶│ Detection│─▶│ Assembly │─▶│ Rendering│        │   │
│  │   │(FFmpeg)  │  │(scenedet)│  │          │  │ (FFmpeg) │        │   │
│  │   └──────────┘  └──────────┘  └──────────┘  └──────────┘        │   │
│  │                                                                   │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                               │                                          │
│                               ▼                                          │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                       Shorts Studio (v1.2+)                       │   │
│  │                                                                   │   │
│  │   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐        │   │
│  │   │ Smart    │  │ Caption  │  │ Audio    │  │ Highlight│        │   │
│  │   │ Reframe  │  │ Burn     │  │ Polish   │  │ Detect   │        │   │
│  │   │(MediaPipe)│ │ (Styles) │  │(Sidechain)│ │ (Multi)  │        │   │
│  │   └──────────┘  └──────────┘  └──────────┘  └──────────┘        │   │
│  │                                                                   │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                               │                                          │
│                               ▼                                          │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                    Enhancement Pipeline                           │   │
│  │                                                                   │   │
│  │   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐        │   │
│  │   │ Upscale  │  │ Stabilize│  │ Color    │  │ Sharpen  │        │   │
│  │   │(ESRGAN)  │  │ (FFmpeg) │  │ Grade    │  │          │        │   │
│  │   └──────────┘  └──────────┘  └──────────┘  └──────────┘        │   │
│  │                                                                   │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                               │                                          │
└───────────────────────────────┼──────────────────────────────────────────┘
                               ▼
                        /data/output/
                        montage_001.mp4

Hybrid Architecture (Cloud Offloading)

For systems with limited resources (e.g., laptops), Montage AI supports a hybrid mode where heavy compute tasks are offloaded to the cloud via cgpu.

See cgpu-setup.md for setup details.

  • LLM Inference: Offloaded to Google Gemini via cgpu serve.
  • Upscaling: Offloaded to Google Colab GPUs via cgpu run.
  • Local: Orchestration, cutting, and basic rendering.

Module Responsibilities

Core Pipeline (src/montage_ai/core/)

The editing engine has been refactored into a modular pipeline.

MontageBuilder (montage_builder.py) is the central orchestrator. It executes the editing pipeline in five phases:

  1. Setup: Initialize workspace and logging.
  2. Analyze: Process audio (beats/energy) and video (scenes/content).
  3. Plan: Select clips and map them to the timeline based on the story arc.
  4. Enhance: Apply stabilization, upscaling, and color grading.
  5. Render: Generate the final video file.
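The phased structure above can be sketched as a small ordered runner. This is an illustration only: the phase names come from the list, while `PipelineContext`, `run_pipeline`, and the handler signature are hypothetical and do not reflect the actual MontageBuilder API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Phase names come from the list above; all other names are hypothetical.
PHASES = ("setup", "analyze", "plan", "enhance", "render")


@dataclass
class PipelineContext:
    """State shared between phases (a stand-in for the real builder state)."""
    data: Dict[str, object] = field(default_factory=dict)
    completed: List[str] = field(default_factory=list)


PhaseHandler = Callable[[PipelineContext], None]


def run_pipeline(handlers: Dict[str, PhaseHandler]) -> PipelineContext:
    """Run every phase in order; an exception in one phase stops the run."""
    missing = [name for name in PHASES if name not in handlers]
    if missing:
        raise KeyError(f"no handler registered for phases: {missing}")
    ctx = PipelineContext()
    for name in PHASES:
        handlers[name](ctx)
        ctx.completed.append(name)
    return ctx


# Usage: each placeholder handler only records that it ran.
handlers = {name: (lambda ctx, n=name: ctx.data.update({n: "ok"})) for name in PHASES}
result = run_pipeline(handlers)
print(result.completed)
```

Keeping the phase order in one tuple means a later phase can never run before the data it depends on has been produced.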

Components:

| Module | Purpose |
| --- | --- |
| audio_analysis.py | Beat detection, tempo extraction, energy profiling (FFmpeg astats/tempo; librosa optional) |
| scene_analysis.py | Scene detection, content analysis, visual similarity with LRU cache |
| auto_reframe.py | Auto Reframe Engine. 9:16 conversion using Convex Optimization (L2) for cinematic camera paths |
| video_metadata.py | Technical metadata extraction (ffprobe wrapper) |
| clip_enhancement.py | Stabilization, upscaling, color matching (Local/Cloud hybrid) |
| audio_enhancer.py | Audio Polish. Voice isolation, auto-ducking, loudness normalization (Pro Polish) |
| shorts_workflow.py | Shorts Studio. Dedicated pipeline for vertical content generation |
| ffmpeg_config.py | GPU encoder detection (NVENC/VAAPI/QSV), encoding parameters |

Performance Optimizations

| Optimization | Implementation | Impact |
| --- | --- | --- |
| Preview Pipeline | Low-res (360p) + ultrafast preset | 10x faster iteration loop |
| LRU Histogram Cache | @lru_cache for frame extraction | 91% cache hit rate, 2-3x faster clip selection |
| Parallel Scene Detection | ProcessPoolExecutor (max_workers = min(len(videos), max(4, cpu_count // 2), MAX_SCENE_WORKERS); ThreadPool fallback) | 3-4x speedup on multi-core |
| FFmpeg Beat Detection | astats + tempo filters (primary path) | Portable, no heavy deps; librosa optional via try/except |
| Auto GPU Encoding | NVENC > VAAPI > QSV > CPU | 2-6x encoding speedup |
| Hardware-Adjacent (Web) | Server-Sent Events (SSE) + os.nice(10) | Zero polling overhead, responsive UI under load |
| Lazy Loading (CLI) | Import heavy libs only when needed | Instant CLI startup time |
| Cluster Efficiency | imagePullPolicy: IfNotPresent (dev overlay uses Always) | Minimized network traffic for cached images |
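The worker-count rule for parallel scene detection can be written out as a small function. This is a sketch under assumptions: the formula is the one quoted in the table, but the function name, the explicit `cpu_count` parameter, and the lower bound of one worker (to guard against an empty video list) are additions made for illustration.

```python
import os
from typing import Optional


def scene_worker_count(
    num_videos: int,
    max_scene_workers: int,
    cpu_count: Optional[int] = None,
) -> int:
    """Return min(len(videos), max(4, cpu_count // 2), MAX_SCENE_WORKERS).

    The result is clamped to at least 1 (an assumption, not from the table),
    because ProcessPoolExecutor rejects max_workers <= 0.
    """
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(num_videos, max(4, cpus // 2), max_scene_workers))


# Ten videos on a 16-core machine with a cap of 8 workers.
print(scene_worker_count(10, max_scene_workers=8, cpu_count=16))  # 8
# Two videos never need more than two workers.
print(scene_worker_count(2, max_scene_workers=8, cpu_count=16))   # 2
```

The `max(4, cpu_count // 2)` term keeps at least four workers available on small machines, while the other two terms stop the pool from growing past the number of inputs or the configured cap.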

editor.py (CLI Entry Point)

A thin wrapper that initializes the MontageBuilder and handles CLI arguments.


creative_director.py (LLM Interface)

Translates natural language to editing parameters.

Responsibilities:

  • Parse user prompts
  • Query LLM (Ollama or Gemini)
  • Validate JSON responses
  • Map to style parameters
  • Incorporate RegisseurMemory hints when available
  • Emit schema-versioned outputs for compatibility (schema_version)

Backends:

| Backend | Protocol | Model |
| --- | --- | --- |
| Ollama | REST API | llama3.1:70b |
| cgpu/Gemini | OpenAI-compatible | gemini-2.0-flash |

Flow:

User Prompt
    │
    ▼
┌─────────────────┐
│ System Prompt   │
│ + Style Options │
│ + Examples      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ LLM (Ollama or  │
│ Gemini)         │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ JSON Validation │
│ + Schema Check  │
└────────┬────────┘
         │
         ▼
Editing Instructions
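The validation step at the end of this flow might look roughly like the sketch below. Only the `schema_version` field is named in this document; the supported version string, the `style` key, and the function name are hypothetical. The project lists Pydantic as a dependency, so the real check is likely model-based; the standard library is used here only to keep the example self-contained.

```python
import json

# Hypothetical values: the document only states that outputs carry schema_version.
SUPPORTED_SCHEMA_VERSIONS = {"1.0"}
REQUIRED_KEYS = {"schema_version", "style"}


def parse_llm_response(raw: str) -> dict:
    """Parse an LLM reply and reject anything that is not a usable instruction set."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM response is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("LLM response must be a JSON object")
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"LLM response is missing keys: {sorted(missing)}")
    if payload["schema_version"] not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported schema_version: {payload['schema_version']!r}")
    return payload


instructions = parse_llm_response('{"schema_version": "1.0", "style": "hitchcock"}')
print(instructions["style"])  # hitchcock
```

Raising a single exception type for every failure lets the caller retry the LLM or fall back to a default style without inspecting the cause.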

footage_manager.py (Clip Selection)

Professional-grade clip management with story arc awareness.

Key concepts:

| Concept | Description |
| --- | --- |
| UsageStatus | Track if clip is UNUSED, USED, or RESERVED |
| SceneType | Classify clips: ESTABLISHING, ACTION, DETAIL, PORTRAIT, SCENIC |
| StoryPhase | Timeline position: INTRO, BUILD, CLIMAX, SUSTAIN, OUTRO |
| FootageClip | Data class with all clip metadata |
| FootagePoolManager | Manages available clip pool |
| StoryArcController | Maps timeline position to requirements |

Selection algorithm:

Current Position → Story Phase → Required Energy + Scene Type
                                           │
                                           ▼
                               ┌───────────────────────┐
                               │ Score Available Clips │
                               │ - Energy match        │
                               │ - Scene type match    │
                               │ - Visual interest     │
                               │ - Variety bonus       │
                               └───────────┬───────────┘
                                           │
                                           ▼
                               Select Highest Score
                               Mark as USED
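A minimal sketch of the score-and-select step follows. The four scoring criteria and the UsageStatus values come from this section; the `Clip` class is a simplified stand-in for FootageClip, and the weights (0.5 for visual interest, 0.25 for variety) are invented for illustration and are not the project's actual values.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class UsageStatus(Enum):
    UNUSED = "unused"
    USED = "used"
    RESERVED = "reserved"


@dataclass
class Clip:
    """Simplified stand-in for FootageClip; field names are assumptions."""
    name: str
    energy: float           # 0.0 (calm) to 1.0 (intense)
    scene_type: str         # e.g. "ACTION", "SCENIC"
    visual_interest: float  # 0.0 to 1.0
    status: UsageStatus = UsageStatus.UNUSED


def score_clip(clip: Clip, energy: float, scene_type: str, last_type: Optional[str]) -> float:
    """Combine the four criteria from the diagram with illustrative weights."""
    energy_match = 1.0 - abs(clip.energy - energy)
    type_match = 1.0 if clip.scene_type == scene_type else 0.0
    variety_bonus = 0.25 if clip.scene_type != last_type else 0.0
    return energy_match + type_match + 0.5 * clip.visual_interest + variety_bonus


def select_clip(pool: List[Clip], energy: float, scene_type: str,
                last_type: Optional[str] = None) -> Optional[Clip]:
    """Pick the best UNUSED clip and mark it USED; return None if none remain."""
    candidates = [c for c in pool if c.status is UsageStatus.UNUSED]
    if not candidates:
        return None
    best = max(candidates, key=lambda c: score_clip(c, energy, scene_type, last_type))
    best.status = UsageStatus.USED
    return best


pool = [Clip("chase", 0.9, "ACTION", 0.5), Clip("sunset", 0.2, "SCENIC", 0.5)]
print(select_clip(pool, energy=0.8, scene_type="ACTION").name)  # chase
```

Marking the winner as USED inside `select_clip` keeps selection and bookkeeping in one place, so a clip cannot be chosen twice by accident.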

style_templates.py (Style Loader)

Loads and validates JSON style presets.

Responsibilities:

  • Discover preset files
  • Parse and validate JSON
  • Merge defaults with overrides
  • Cache loaded templates

File discovery order:

  1. Built-in: src/montage_ai/styles/*.json
  2. Env override: STYLE_PRESET_DIR/*.json
  3. Single file: STYLE_PRESET_PATH
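The discovery order can be sketched with pathlib. The environment variable names come from the list above; the function names are hypothetical, and the rule that a later source overrides an earlier preset with the same file stem is an assumption that this document does not confirm.

```python
import json
import os
from pathlib import Path
from typing import Dict, List, Mapping, Optional


def discover_preset_files(builtin_dir: Path,
                          env: Optional[Mapping[str, str]] = None) -> List[Path]:
    """Return preset files in discovery order: built-in, env dir, single file."""
    env = os.environ if env is None else env
    files = sorted(builtin_dir.glob("*.json"))
    preset_dir = env.get("STYLE_PRESET_DIR")
    if preset_dir:
        files += sorted(Path(preset_dir).glob("*.json"))
    preset_path = env.get("STYLE_PRESET_PATH")
    if preset_path and Path(preset_path).is_file():
        files.append(Path(preset_path))
    return files


def load_presets(files: List[Path]) -> Dict[str, dict]:
    """Load presets keyed by file stem; later files replace earlier ones (assumed)."""
    presets: Dict[str, dict] = {}
    for path in files:
        with path.open(encoding="utf-8") as handle:
            presets[path.stem] = json.load(handle)
    return presets
```

Passing `env` explicitly, for example `discover_preset_files(Path("src/montage_ai/styles"), {"STYLE_PRESET_DIR": "/data/styles"})`, makes the lookup easy to exercise without touching the process environment.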

cgpu_upscaler.py (Cloud GPU)

Offloads AI upscaling to free cloud GPUs.

Flow:

Video Frames (local)
        │
        ▼
┌─────────────────┐
│ cgpu connect    │
│ (Google Colab)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Real-ESRGAN     │
│ (T4/A100 GPU)   │
└────────┬────────┘
         │
         ▼
Upscaled Frames (local)

monitoring.py (Logging)

Real-time decision logging for debugging.

Logged events:

  • Clip selection decisions
  • Beat alignment choices
  • Energy level changes
  • Phase transitions
  • Performance metrics

Data Flow

Input Processing

/data/input/*.mp4
        │
        ├──▶ Scene Detection ──▶ Scene List
        │
        ├──▶ Energy Analysis ──▶ Clip Scores
        │
        └──▶ Metadata Extraction ──▶ Clip Database

Audio Processing

/data/music/*.mp3
        │
        ├──▶ Beat Detection ──▶ Beat Timestamps
        │
        ├──▶ Tempo Analysis ──▶ BPM
        │
        └──▶ Energy Curve ──▶ Energy Timeline

Assembly

Beat Timeline + Clip Database + Style Parameters
                      │
                      ▼
            ┌─────────────────┐
            │ For each beat:  │
            │ - Get story     │
            │   phase         │
            │ - Score clips   │
            │ - Select best   │
            │ - Mark used     │
            └────────┬────────┘
                     │
                     ▼
              Clip Sequence
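The "get story phase" step of this loop can be illustrated by mapping each beat timestamp to a phase. The phase names and their order come from the StoryPhase concept above; the boundary fractions (15%, 40%, 70%, 90%) and the function names are hypothetical, since the actual thresholds are not given in this document.

```python
from bisect import bisect_right
from enum import Enum
from typing import List


class StoryPhase(Enum):
    INTRO = "intro"
    BUILD = "build"
    CLIMAX = "climax"
    SUSTAIN = "sustain"
    OUTRO = "outro"


# Hypothetical boundaries, as fractions of the total duration.
_BOUNDARIES = [0.15, 0.40, 0.70, 0.90]
_PHASES = list(StoryPhase)  # Enum iteration preserves definition order


def phase_at(position: float, duration: float) -> StoryPhase:
    """Return the story phase for a timeline position, in seconds."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    fraction = min(max(position / duration, 0.0), 1.0)
    return _PHASES[bisect_right(_BOUNDARIES, fraction)]


def phases_for_beats(beat_times: List[float], duration: float) -> List[StoryPhase]:
    """Map every beat timestamp to its story phase."""
    return [phase_at(t, duration) for t in beat_times]


print([p.name for p in phases_for_beats([0.0, 30.0, 50.0, 80.0, 100.0], 100.0)])
```

With these example boundaries the five beats fall into INTRO, BUILD, CLIMAX, SUSTAIN, and OUTRO respectively.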

Rendering

Clip Sequence
      │
      ├──▶ Crop/Scale to STANDARD_WIDTH x STANDARD_HEIGHT (default 1080x1920)
      │
      ├──▶ Optional: Upscale (Real-ESRGAN)
      │
      ├──▶ Optional: Stabilize (vidstab 2-pass)
      │
      ├──▶ Color Grade (20+ presets)
      │
      └──▶ Progressive Renderer
            │
            ├──▶ Batch clips (default 25)
            ├──▶ Write segments to disk
            ├──▶ FFmpeg concat (-c copy)
            ├──▶ Optional: xfade transitions
            └──▶ Audio mix + Logo overlay
                    │
                    ▼
            /data/output/montage.mp4

Memory-Efficient Progressive Rendering

The system uses ProgressiveRenderer (in segment_writer.py) to prevent OOM crashes.

┌─────────────────────────────────────────────────────────────┐
│                    Progressive Renderer                      │
│                                                              │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐     │
│  │ Clip 1  │   │ Clip 2  │   │ ...     │   │ Clip 25 │     │
│  └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘     │
│       │             │             │             │            │
│       └─────────────┴─────────────┴─────────────┘            │
│                         │                                    │
│                         ▼                                    │
│              ┌──────────────────┐                           │
│              │ flush_batch()    │                           │
│              │ - Normalize      │                           │
│              │ - Write segment  │                           │
│              │ - GC + cleanup   │                           │
│              └────────┬─────────┘                           │
│                       │                                      │
│                       ▼                                      │
│              segment_0001.mp4 (disk)                        │
│                                                              │
│  ... repeat for all batches ...                             │
│                                                              │
│              ┌──────────────────┐                           │
│              │ finalize()       │                           │
│              │ - FFmpeg concat  │                           │
│              │ - Audio mix      │                           │
│              │ - Logo overlay   │                           │
│              └────────┬─────────┘                           │
│                       │                                      │
│                       ▼                                      │
│              output.mp4                                      │
└─────────────────────────────────────────────────────────────┘
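The batching idea in the diagram can be sketched as follows. The method names `flush_batch` and `finalize` and the default batch size of 25 come from this document; the class name, constructor signature, and the `write_segment` callback are hypothetical, and the real ProgressiveRenderer also handles normalization, audio mixing, and the logo overlay, which are omitted here. The `file '...'` lines follow the input format of FFmpeg's concat demuxer.

```python
from typing import Callable, List


class BatchingWriter:
    """Collect clips and write them to disk in fixed-size batches."""

    def __init__(self, write_segment: Callable[[int, List[str]], str],
                 batch_size: int = 25) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._write_segment = write_segment  # renders one batch, returns its path
        self._batch_size = batch_size
        self._pending: List[str] = []
        self.segments: List[str] = []

    def add_clip(self, clip_path: str) -> None:
        self._pending.append(clip_path)
        if len(self._pending) >= self._batch_size:
            self.flush_batch()

    def flush_batch(self) -> None:
        """Write pending clips as one segment, then drop them from memory."""
        if not self._pending:
            return
        index = len(self.segments) + 1
        self.segments.append(self._write_segment(index, list(self._pending)))
        self._pending.clear()

    def finalize(self) -> List[str]:
        """Flush the last partial batch and return all segment paths."""
        self.flush_batch()
        return list(self.segments)


def concat_list_text(segments: List[str]) -> str:
    """Build the list file consumed by `ffmpeg -f concat -safe 0 -i list.txt -c copy out.mp4`."""
    lines = []
    for path in segments:
        escaped = path.replace("'", "'\\''")  # concat-demuxer quoting for single quotes
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"
```

Because only one batch of clips is held at a time, peak memory depends on the batch size rather than on the length of the whole montage.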

Key Constants (Dynamically Determined):

These constants are automatically determined from input footage using determine_output_profile():

| Constant | Default | Determination Method |
| --- | --- | --- |
| STANDARD_WIDTH | 1080 | Weighted median of input widths, snapped to standard presets |
| STANDARD_HEIGHT | 1920 | Weighted median of input heights, snapped to standard presets |
| STANDARD_FPS | 30 | Weighted median of input frame rates |
| STANDARD_PIX_FMT | yuv420p | Dominant pixel format from inputs (by duration) |
| TARGET_CODEC | libx264 | Dominant codec from inputs, or env OUTPUT_CODEC |
| TARGET_PROFILE | high | H.264 profile; the matching level is auto-selected based on resolution (4.1 for HD, 5.1 for 4K) |
| TARGET_BITRATE | auto | Weighted median of input bitrates, or calculated from pixels |

Output Profile Heuristics:

  • Orientation (horizontal/vertical/square) determined by weighted aspect ratios
  • Resolution snapped to common presets (1080p, 720p, 4K) if within 12% variance
  • Avoids upscaling beyond maximum input resolution
  • Honors environment overrides: OUTPUT_CODEC, OUTPUT_PIX_FMT, OUTPUT_PROFILE, OUTPUT_LEVEL
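Two of these heuristics, the weighted median and the 12% preset snap, can be sketched as below. The 12% tolerance and the 720p/1080p/4K presets come from the list above; the function names are hypothetical, the weights are assumed to be clip durations, and the preset heights assume horizontal output. This is not the actual determine_output_profile() implementation.

```python
from typing import Sequence

# Heights for the presets named above (720p, 1080p, 4K), horizontal orientation assumed.
PRESET_HEIGHTS = (720, 1080, 2160)


def weighted_median(values: Sequence[float], weights: Sequence[float]) -> float:
    """Return the value at which the cumulative weight first reaches half the total."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("total weight must be positive")
    running = 0.0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= total / 2:
            return value
    return max(values)  # only reachable through floating-point rounding


def snap_to_preset(value: float, presets: Sequence[int] = PRESET_HEIGHTS,
                   tolerance: float = 0.12) -> int:
    """Snap to the nearest preset if it lies within the tolerance, else keep the value."""
    nearest = min(presets, key=lambda p: abs(p - value))
    if abs(nearest - value) / nearest <= tolerance:
        return nearest
    return int(value)


# Heights weighted by clip duration in seconds: 1080 dominates.
print(weighted_median([720, 1080, 2160], [10, 60, 5]))  # 1080
print(snap_to_preset(1000))  # 1080
print(snap_to_preset(1440))  # 1440
```

Weighting by duration keeps a few seconds of 4K footage from dictating the output profile when most of the material is 1080p.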

External Dependencies

| Dependency | Purpose | Version |
| --- | --- | --- |
| FFmpeg | Video encoding/processing (beat detection via astats/tempo) | Latest |
| MoviePy | Video manipulation | 2.2+ |
| OpenCV | Frame processing | 4.12+ |
| scenedetect | Scene detection | 0.6+ |
| numpy | Numerical operations | 2.0+ |
| Pydantic | Data validation | 2.0+ |
| Real-ESRGAN | AI upscaling | Latest |
| OpenAI SDK | cgpu/Gemini client | 1.0+ |

Note: librosa is no longer a required dependency. Beat detection uses FFmpeg astats/tempo filters as the primary engine (portable, no heavy deps), and librosa is used only when it is installed. See audio_analysis.py.


Scaling Considerations

Parallel Processing

  • Clip enhancement runs in parallel threads
  • Frame upscaling can use cloud GPU
  • FFmpeg uses multi-threading

Memory Management

  • Clips loaded on-demand, not all at once
  • Temporary files cleaned after processing
  • Large videos processed in chunks

GPU Utilization

  • Auto-detection of available GPU encoders
  • Fallback chain: NVENC → VAAPI → QSV → CPU
  • Cloud GPU option for heavy workloads
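The fallback chain can be sketched as a preference list checked against the encoders that the local FFmpeg build reports. The order NVENC → VAAPI → QSV → CPU comes from this document; the function names are hypothetical, and the choice of the H.264 encoder names (`h264_nvenc`, `h264_vaapi`, `h264_qsv`, `libx264`) is an assumption about which codec is targeted. An encoder being listed does not guarantee that the hardware is usable, so a real detector would probably also run a short test encode.

```python
import shutil
import subprocess
from typing import Set

# Preference order from the fallback chain above; H.264 variants assumed.
ENCODER_PREFERENCE = ("h264_nvenc", "h264_vaapi", "h264_qsv", "libx264")


def parse_ffmpeg_encoders(output: str) -> Set[str]:
    """Extract encoder names from the text printed by `ffmpeg -encoders`."""
    names = set()
    for line in output.splitlines():
        parts = line.split()
        # Encoder rows look like "V....D h264_nvenc <description>";
        # legend rows have "=" as the second field and are skipped.
        if len(parts) >= 2 and parts[0][0] in "VAS" and parts[1] != "=":
            names.add(parts[1])
    return names


def pick_encoder(available: Set[str]) -> str:
    """Return the first preferred encoder that is available, else the CPU encoder."""
    for name in ENCODER_PREFERENCE:
        if name in available:
            return name
    return "libx264"


def detect_encoder() -> str:
    """Query the local ffmpeg binary; fall back to CPU encoding if it is missing."""
    if shutil.which("ffmpeg") is None:
        return "libx264"
    result = subprocess.run(["ffmpeg", "-hide_banner", "-encoders"],
                            capture_output=True, text=True, check=False)
    return pick_encoder(parse_ffmpeg_encoders(result.stdout))
```

Separating parsing from selection lets the preference logic be checked with captured FFmpeg output on machines that have no GPU at all.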