FOFPred: Language-Driven Future Optical Flow Prediction

[Figure: FOFPred overview]

FOFPred is a diffusion-based model that predicts future optical flow from a single image guided by natural language instructions. Given an input image and a text prompt describing a desired action (e.g., "Moving the water bottle from right to left"), FOFPred generates optical flow predictions that visualize how objects would move to accomplish that action.


🚀 Quick Start

Install the pinned diffusers release:

pip install diffusers==0.34.0

Then run the following Python script:

import torch
from diffusers import DiffusionPipeline
from PIL import Image

pipeline = DiffusionPipeline.from_pretrained(
    "Salesforce/FOFPred",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

input_image = Image.open("/UPDATE/IMAGE/PATH")

generator = torch.Generator(device="cuda").manual_seed(42)
results = pipeline(
    prompt="UPDATE/PROMPT",
    input_images=[input_image],
    width=256,
    height=256,
    max_input_image_side_length=512,
    max_pixels=65536,
    num_inference_steps=1,
    max_sequence_length=1024,
    text_guidance_scale=5.0,
    image_guidance_scale=2.0,
    negative_prompt="",
    generator=generator,
    output_type="pt",
    frame_count=4,
)

output_tensor = results.images[0]  # [F, C, H, W]
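
The `[F, C, H, W]` layout noted above can be turned into displayable images roughly as follows. This is a minimal sketch under two assumptions the README does not state: that the values lie in [0, 1], and that the tensor has already been converted to NumPy (for a bfloat16 tensor, something like `output_tensor.float().cpu().numpy()`, since NumPy has no bfloat16 dtype). The helper name `frames_to_uint8` is hypothetical, and a random array stands in for real model output.

```python
import numpy as np


def frames_to_uint8(frames: np.ndarray) -> np.ndarray:
    """Convert [F, C, H, W] float frames to [F, H, W, C] uint8 images.

    Assumption: values are in [0, 1]; anything outside is clipped.
    """
    if frames.ndim != 4:
        raise ValueError(f"expected [F, C, H, W], got shape {frames.shape}")
    clipped = np.clip(frames.astype(np.float32), 0.0, 1.0)
    # Move channels last so each frame can be handed to an image library.
    channels_last = clipped.transpose(0, 2, 3, 1)
    return (channels_last * 255.0).round().astype(np.uint8)


# Stand-in for the pipeline output: 4 frames, 3 channels, 256x256.
dummy = np.random.default_rng(0).random((4, 3, 256, 256), dtype=np.float32)
images = frames_to_uint8(dummy)
print(images.shape, images.dtype)
```

Each `images[i]` could then be passed to `PIL.Image.fromarray` for saving, if the value-range assumption holds for your checkpoint.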

✨ Features

  • Language-Guided Flow Prediction — Control motion predictions using natural language descriptions
  • Single-Image Input — Predict future motion from just one frame
  • Multi-Frame Flow Output — Generates 4 sequential flow frames showing temporal progression
  • Interactive Visualization — CoTracker-style arrow overlays for intuitive flow visualization
  • Efficient Inference — Single-step inference capability

🏗️ Architecture

FOFPred builds on the OmniGen2 project and combines the following components:

| Component | Model | Description |
| --- | --- | --- |
| V-LLM | Qwen2.5-VL-3B-Instruct | Multimodal understanding of images and text |
| DiT | OmniGen2Transformer3DModel | Modification of OmniGen2Transformer to generate frame sequences |
| VAE | black-forest-labs/FLUX.1-dev | VAE (AutoencoderKL model) |
| Scheduler | FlowMatchEulerDiscreteScheduler | Efficient flow-matching sampler used in OmniGen2 |

📦 Installation

To create your own environment for training, run the following.

conda create -n fofpred python=3.11
conda activate fofpred
pip install torch==2.6.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu124
curl -LsSf https://astral.sh/uv/install.sh | sh
uv pip install -r requirements.txt
uv pip install flash-attn==2.7.4.post1 --no-build-isolation

If your system does not already have ffmpeg, install it as well (the torchcodec library requires it).

conda install ffmpeg
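
To confirm that the steps above took effect, a small check like the following may help. It is only a sketch: the function name `check_environment` is hypothetical, and the package list simply mirrors the names that appear in the install commands. It uses the standard library alone, so it runs even if the installation is incomplete.

```python
import shutil
from importlib import metadata


def check_environment(packages=("torch", "torchvision", "flash-attn")):
    """Report the ffmpeg location and installed package versions.

    Missing items are reported as None rather than raising.
    """
    report = {"ffmpeg": shutil.which("ffmpeg")}  # path to the binary, or None
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


report = check_environment()
for name, found in report.items():
    print(f"{name}: {found if found else 'NOT FOUND'}")
```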

🏃 Inference

Interactive Demo

Launch the Gradio web interface:

export PYTHONPATH=$PYTHONPATH:$PWD
python app.py

Then open http://localhost:7860 in your browser.

📊 Output Visualization

FOFPred provides three visualization modes in the demo:

  1. Arrow Visualization — CoTracker-style sparse grid arrows showing motion direction
  2. Raw Flow Output — HSV-encoded optical flow (hue = direction, saturation = magnitude)
  3. Alpha Blend — Flow overlaid on input image for context
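
The three modes above can be approximated with plain NumPy. The sketch below is not the demo's actual code: it assumes the flow is available as an `[H, W, 2]` array of `(dx, dy)` displacements, and the function names `flow_to_rgb`, `alpha_blend`, and `sample_arrows` are hypothetical. How the model's output channels map to such a two-channel field is not covered here.

```python
import numpy as np


def flow_to_rgb(flow: np.ndarray) -> np.ndarray:
    """Encode [H, W, 2] flow as [H, W, 3] uint8 RGB via HSV.

    Hue encodes direction, saturation encodes magnitude (scaled by the
    largest magnitude in the field), and value is fixed at 1.
    """
    dx, dy = flow[..., 0], flow[..., 1]
    magnitude = np.hypot(dx, dy)
    hue = (np.arctan2(dy, dx) + np.pi) / (2 * np.pi)  # in [0, 1]
    sat = magnitude / max(float(magnitude.max()), 1e-8)
    val = np.ones_like(hue)
    # Vectorized HSV -> RGB conversion over six hue sectors.
    h6 = hue * 6.0
    sector = np.floor(h6).astype(int) % 6
    frac = h6 - np.floor(h6)
    p = val * (1 - sat)
    q = val * (1 - frac * sat)
    t = val * (1 - (1 - frac) * sat)
    r = np.choose(sector, [val, q, p, p, t, val])
    g = np.choose(sector, [t, val, val, q, p, p])
    b = np.choose(sector, [p, p, t, val, val, q])
    return (np.stack([r, g, b], axis=-1) * 255).round().astype(np.uint8)


def alpha_blend(image: np.ndarray, overlay: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend two uint8 images of equal shape; alpha weights the overlay."""
    mixed = (1 - alpha) * image.astype(np.float32) + alpha * overlay.astype(np.float32)
    return mixed.round().astype(np.uint8)


def sample_arrows(flow: np.ndarray, step: int = 32) -> np.ndarray:
    """Sample flow on a sparse grid; returns rows of (x, y, dx, dy)."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    vectors = flow[ys, xs]
    return np.stack([xs, ys, vectors[..., 0], vectors[..., 1]], axis=-1).reshape(-1, 4)


# Toy input: uniform motion of one pixel to the right on a black image.
flow = np.zeros((256, 256, 2), dtype=np.float64)
flow[..., 0] = 1.0
rgb = flow_to_rgb(flow)
blended = alpha_blend(np.zeros((256, 256, 3), dtype=np.uint8), rgb)
arrows = sample_arrows(flow)
print(rgb.shape, blended.shape, arrows.shape)
```

The rows returned by `sample_arrows` are in a form that a plotting call such as matplotlib's `quiver` could draw over the input image.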

Optional Arguments:

| Argument | Description | Default |
| --- | --- | --- |
| --share | Create a public Gradio link | False |
| --port | Port for the web server | 7860 |
| --enable_model_cpu_offload | Offload model to CPU (saves VRAM) | False |
| --enable_sequential_cpu_offload | Sequential CPU offload (minimal VRAM) | False |
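
As an illustration of how these flags and defaults fit together, here is a minimal `argparse` sketch. It is an assumption about the shape of the command line, not a copy of `app.py`, and `build_parser` is a hypothetical name; the real script may parse or validate its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented demo options."""
    parser = argparse.ArgumentParser(description="FOFPred demo options (illustrative sketch)")
    parser.add_argument("--share", action="store_true",
                        help="Create a public Gradio link")
    parser.add_argument("--port", type=int, default=7860,
                        help="Port for the web server")
    parser.add_argument("--enable_model_cpu_offload", action="store_true",
                        help="Offload model to CPU (saves VRAM)")
    parser.add_argument("--enable_sequential_cpu_offload", action="store_true",
                        help="Sequential CPU offload (minimal VRAM)")
    return parser


# Equivalent to: python app.py --port 8080 --enable_model_cpu_offload
args = build_parser().parse_args(["--port", "8080", "--enable_model_cpu_offload"])
print(args)
```

Boolean flags default to False and switch on when present, which matches the Default column above.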

Python API

import torch
from fofpred.pipelines.fofpred.pipeline_fofpred import FOFPredPipeline
from fofpred.schedulers.scheduling_flow_match_euler_discrete import FlowMatchEulerDiscreteScheduler
from PIL import Image

# Load the pipeline
pipeline = FOFPredPipeline.from_pretrained(
    "path/to/pretrained_models/hf_upload",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Load input image
input_image = Image.open("example_images/small_office.jpeg")

# Set scheduler
pipeline.scheduler = FlowMatchEulerDiscreteScheduler()

# Generate optical flow prediction
results = pipeline(
    prompt="Moving the water bottle from right to left.",
    input_images=[input_image],
    width=256,
    height=256,
    num_inference_steps=1,
    num_images_per_prompt=4,
    frame_count=4,
    generator=torch.Generator(device="cuda").manual_seed(42),
    output_type="pt",
)

# Access generated flow frames: shape [B, F, C, H, W]
flow_frames = results.images
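
With several samples per prompt, it can be handy to view the whole `[B, F, C, H, W]` result at once. The sketch below tiles it into a single grid with one row per sample and one column per frame. It assumes the result has already been converted to a NumPy array; `contact_sheet` is a hypothetical helper, and a synthetic array stands in for real output.

```python
import numpy as np


def contact_sheet(frames: np.ndarray) -> np.ndarray:
    """Tile [B, F, C, H, W] into one [B*H, F*W, C] image grid."""
    if frames.ndim != 5:
        raise ValueError(f"expected [B, F, C, H, W], got shape {frames.shape}")
    b, f, c, h, w = frames.shape
    # Reorder to [B, H, F, W, C] so that rows and columns merge correctly.
    return frames.transpose(0, 3, 1, 4, 2).reshape(b * h, f * w, c)


# Stand-in for results.images: 4 samples, 4 frames, 3 channels, 256x256.
dummy = np.zeros((4, 4, 3, 256, 256), dtype=np.float32)
dummy[1, 2] = 1.0  # mark sample 1, frame 2 so its tile is easy to find
sheet = contact_sheet(dummy)
print(sheet.shape)
```

In the resulting grid, the marked tile sits in the second row and third column, confirming the row/column convention.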

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.


📄 License

This project is licensed under the Apache License 2.0. See LICENSE.txt for details.

Copyright (c) 2025 Salesforce, Inc.


🔗 Acknowledgement

We thank the authors of the following projects for their codebases and model checkpoints.