FOFPred: Language-Driven Future Optical Flow Prediction

FOFPred is a diffusion-based model that predicts future optical flow from a single image guided by natural language instructions. Given an input image and a text prompt describing a desired action (e.g., "Moving the water bottle from right to left"), FOFPred generates optical flow predictions that visualize how objects would move to accomplish that action.

🚀 Quick Start

pip install diffusers==0.34.0

import torch
from diffusers import DiffusionPipeline
from PIL import Image

pipeline = DiffusionPipeline.from_pretrained(
    "Salesforce/FOFPred",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).cuda()

input_image = Image.open("/UPDATE/IMAGE/PATH")

generator = torch.Generator(device="cuda").manual_seed(42)
results = pipeline(
    prompt="UPDATE/PROMPT",
    input_images=[input_image],
    width=256,
    height=256,
    max_input_image_side_length=512,
    max_pixels=65536,
    num_inference_steps=1,
    max_sequence_length=1024,
    text_guidance_scale=5.0,
    image_guidance_scale=2.0,
    negative_prompt="",
    generator=generator,
    output_type="pt",
    frame_count=4,
)

output_tensor = results.images[0]  # [F, C, H, W]

✨ Features

Language-Guided Flow Prediction — Control motion predictions using natural language descriptions
Single-Image Input — Predict future motion from just one frame
Multi-Frame Flow Output — Generates 4 sequential flow frames showing temporal progression
Interactive Visualization — CoTracker-style arrow overlays for intuitive flow visualization
Efficient Inference — Single-step inference capability

🏗️ Architecture

FOFPred builds on the OmniGen2 project and combines the following components:

| Component | Model | Description |
|---|---|---|
| V-LLM | `Qwen2.5-VL-3B-Instruct` | Multimodal understanding of images and text |
| DiT | `OmniGen2Transformer3DModel` | Modification of OmniGen2Transformer to generate frame sequences |
| VAE | `black-forest-labs/FLUX.1-dev` | VAE (AutoencoderKL model) |
| Scheduler | `FlowMatchEulerDiscreteScheduler` | Efficient flow-matching sampler used in OmniGen2 |
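
The single-step setting (`num_inference_steps=1`) is easier to picture with a toy Euler integrator. The sketch below is not the FOFPred or OmniGen2 scheduler: `euler_flow_sample`, `straight_velocity`, and the time convention (t=0 is noise, t=1 is the sample) are assumptions made for illustration, and the real scheduler may use a different convention. It shows that when the velocity field follows a straight path, a single Euler step already lands on the target.

```python
from typing import Callable, List

Vector = List[float]


def euler_flow_sample(
    velocity_fn: Callable[[Vector, float], Vector],
    x_start: Vector,
    num_steps: int = 1,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    x = list(x_start)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        # Euler update: move along the predicted velocity for one step.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy data (hypothetical): a "noise" start point and a "target" end point.
noise = [0.5, -1.0, 2.0]
target = [1.0, 0.0, -1.0]


def straight_velocity(x: Vector, t: float) -> Vector:
    """Velocity of the straight path x_t = (1 - t) * noise + t * target."""
    return [d - n for d, n in zip(target, noise)]


one_step = euler_flow_sample(straight_velocity, noise, num_steps=1)
four_steps = euler_flow_sample(straight_velocity, noise, num_steps=4)
```

In a trained model the velocity comes from the DiT rather than a closed-form function, so the path is only approximately straight and more steps may trade speed for accuracy.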

📦 Installation

To create your own environment for training, run the following:

conda create -n fofpred python=3.11
conda activate fofpred
pip install torch==2.6.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu124
curl -LsSf https://astral.sh/uv/install.sh | sh
uv pip install -r requirements.txt
uv pip install flash-attn==2.7.4.post1 --no-build-isolation

If ffmpeg is not already available on your system, install it as well; the torchcodec library depends on it.

conda install ffmpeg

🏃 Inference

Interactive Demo

Launch the Gradio web interface:

export PYTHONPATH=$PYTHONPATH:$PWD
python app.py

Then open http://localhost:7860 in your browser.

📊 Output Visualization

FOFPred provides three visualization modes in the demo:

Arrow Visualization — CoTracker-style sparse grid arrows showing motion direction
Raw Flow Output — HSV-encoded optical flow (color = direction, saturation = magnitude)
Alpha Blend — Flow overlaid on input image for context
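
The HSV encoding and the alpha blend described above can be sketched for a single pixel using only the standard library. This is an illustration, not the demo's implementation: the function names, the `max_magnitude` normalization, and the fixed value channel of 1.0 are assumptions.

```python
import colorsys
import math
from typing import Tuple

RGB = Tuple[float, float, float]


def flow_to_rgb(u: float, v: float, max_magnitude: float) -> RGB:
    """Encode one flow vector as RGB: hue = direction, saturation = magnitude."""
    if max_magnitude <= 0:
        raise ValueError("max_magnitude must be positive")
    angle = math.atan2(v, u) % (2.0 * math.pi)        # direction in [0, 2*pi]
    hue = angle / (2.0 * math.pi)                     # scaled to [0, 1]
    saturation = min(math.hypot(u, v) / max_magnitude, 1.0)
    return colorsys.hsv_to_rgb(hue, saturation, 1.0)  # value fixed at 1.0


def rgb_to_flow(rgb: RGB, max_magnitude: float) -> Tuple[float, float]:
    """Invert flow_to_rgb (exact only for unclipped magnitudes)."""
    hue, saturation, _ = colorsys.rgb_to_hsv(*rgb)
    angle = hue * 2.0 * math.pi
    magnitude = saturation * max_magnitude
    return magnitude * math.cos(angle), magnitude * math.sin(angle)


def alpha_blend(flow_rgb: RGB, image_rgb: RGB, alpha: float = 0.5) -> RGB:
    """Overlay the flow color on an image pixel with weight alpha."""
    return tuple(alpha * f + (1.0 - alpha) * i for f, i in zip(flow_rgb, image_rgb))


still = flow_to_rgb(0.0, 0.0, max_magnitude=1.0)      # no motion
rightward = flow_to_rgb(1.0, 0.0, max_magnitude=1.0)  # full-speed motion along +x
recovered = rgb_to_flow(flow_to_rgb(0.3, -0.4, 1.0), max_magnitude=1.0)
blended = alpha_blend(rightward, (0.0, 0.0, 1.0), alpha=0.25)
```

Under this encoding a motionless pixel maps to white, because zero magnitude means zero saturation, so static background regions stay visually neutral in the overlay.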

Optional arguments for the demo (`app.py`):

| Argument | Description | Default |
|---|---|---|
| `--share` | Create a public Gradio link | `False` |
| `--port` | Port for the web server | `7860` |
| `--enable_model_cpu_offload` | Offload model to CPU (saves VRAM) | `False` |
| `--enable_sequential_cpu_offload` | Sequential CPU offload (minimal VRAM) | `False` |
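
To make the defaults in the table concrete, here is a hypothetical `argparse` sketch that mirrors the four flags. It is not taken from `app.py`, whose actual parsing code may differ; `build_parser` is an illustrative name, and treating the boolean options as `store_true` switches is an assumption based on their `False` defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented demo flags."""
    parser = argparse.ArgumentParser(description="FOFPred demo options (illustrative)")
    parser.add_argument("--share", action="store_true",
                        help="Create a public Gradio link")
    parser.add_argument("--port", type=int, default=7860,
                        help="Port for the web server")
    parser.add_argument("--enable_model_cpu_offload", action="store_true",
                        help="Offload model to CPU (saves VRAM)")
    parser.add_argument("--enable_sequential_cpu_offload", action="store_true",
                        help="Sequential CPU offload (minimal VRAM)")
    return parser


# No flags: every option takes its documented default.
defaults = build_parser().parse_args([])

# Example: custom port with model-level CPU offload enabled.
custom = build_parser().parse_args(["--port", "8080", "--enable_model_cpu_offload"])
```

On the command line, the second case would correspond to `python app.py --port 8080 --enable_model_cpu_offload`.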

Python API

import torch
from fofpred.pipelines.fofpred.pipeline_fofpred import FOFPredPipeline
from fofpred.schedulers.scheduling_flow_match_euler_discrete import FlowMatchEulerDiscreteScheduler
from PIL import Image

# Load the pipeline
pipeline = FOFPredPipeline.from_pretrained(
    "path/to/pretrained_models/hf_upload",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Load input image
input_image = Image.open("example_images/small_office.jpeg")

# Set scheduler
pipeline.scheduler = FlowMatchEulerDiscreteScheduler()

# Generate optical flow prediction
results = pipeline(
    prompt="Moving the water bottle from right to left.",
    input_images=[input_image],
    width=256,
    height=256,
    num_inference_steps=1,
    num_images_per_prompt=4,
    frame_count=4,
    generator=torch.Generator(device="cuda").manual_seed(42),
    output_type="pt",
)

# Access generated flow frames: shape [B, F, C, H, W]
flow_frames = results.images
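
Once the frames are available, a common next step is converting them to 8-bit images for saving or display. The helper below is a minimal sketch under two assumptions that the source does not confirm: that each batch element has layout `[F, C, H, W]` with values in `[0, 1]`, and that the tensor has already been converted to a NumPy array. `frames_to_uint8` is a hypothetical name, and the random array only stands in for real model output.

```python
import numpy as np


def frames_to_uint8(frames: np.ndarray) -> np.ndarray:
    """Convert [F, C, H, W] floats (assumed in [0, 1]) to [F, H, W, C] uint8."""
    if frames.ndim != 4:
        raise ValueError(f"expected 4 dimensions [F, C, H, W], got {frames.ndim}")
    clipped = np.clip(frames, 0.0, 1.0)           # guard against out-of-range values
    channels_last = clipped.transpose(0, 2, 3, 1)  # [F, C, H, W] -> [F, H, W, C]
    return (channels_last * 255.0).round().astype(np.uint8)


# Stand-in for one batch element of the pipeline output (4 frames, 3 channels).
dummy_frames = np.random.default_rng(0).random((4, 3, 256, 256), dtype=np.float32)
images = frames_to_uint8(dummy_frames)

# Out-of-range inputs are clipped rather than wrapped around.
edge_cases = frames_to_uint8(np.array([[[[-1.0, 0.0], [1.0, 2.0]]]]))
```

With a real result, a bfloat16 tensor must be cast before conversion because NumPy has no bfloat16 dtype, for example `frames_to_uint8(flow_frames[0].float().cpu().numpy())`.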

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

📄 License

This project is licensed under the Apache License 2.0. See LICENSE.txt for details.

🔗 Acknowledgement

We thank the authors of the following projects for their codebases and model checkpoints.
