Investigating whether language structure can serve as soft physics constraints in generative video models.
Video diffusion models produce visually compelling motion but frequently violate physical laws: objects drift against gravity, momentum reverses spontaneously, collisions produce implausible outcomes. The standard fix involves physics simulation layers or physics-informed loss functions, both of which require architectural changes and domain-specific training.
This project tests an alternative: can physics constraints be communicated through prompt structure alone?
The hypothesis is that language carries residual physical information. Video diffusion models trained on captioned footage have absorbed correlations between linguistic patterns and motion characteristics. They don't simulate Newtonian mechanics—they reenact stories about physics. Certain words and constructions reliably co-occur with certain motion patterns in training data.
If true, carefully constructed prompts should activate latent priors corresponding to physically consistent motion. The project aims to show where semantics can substitute for equations in generative systems, and to offer insight into why language remains such a strong interface for control.
- Prompt Taxonomy — Categorized linguistic structures mapped to physics domains (gravity, momentum, collision, fluid dynamics, articulated motion)
- Evaluation Dataset — Generated videos with controlled prompt variations + automated and human annotations
- Empirical Analysis — Statistical relationships between linguistic features and motion coherence metrics
- Practical Guidelines — Actionable prompt engineering principles for physics-consistent video generation
We systematically vary prompt structure while holding generation parameters constant:
Minimal prompt: a ball falls
Elaborated variants:
- Verb substitution: a ball drops / a ball plummets / a ball descends
- Temporal chaining: a ball tips off the edge and falls / a ball falls and bounces twice
- Force attribution: gravity pulls a ball downward / a ball falls under gravitational acceleration
- Manner specification: a ball falls slowly / a ball falls rapidly, accelerating
We then measure motion characteristics through optical flow analysis, temporal coherence metrics, and physics-specific heuristics. Human evaluation validates automated metrics.
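As one concrete example of the physics-specific heuristics: the gravity_alignment metric listed later can be derived from dense optical flow. The sketch below is illustrative, assuming OpenCV's Farneback flow; it is not necessarily what src/evaluation/physics_heuristics.py ships.

```python
# Sketch of a physics heuristic derived from dense optical flow
# (illustrative; not necessarily the shipped implementation).
import cv2
import numpy as np

def gravity_alignment(frames: list[np.ndarray]) -> float:
    """Share of total motion pointing downward, in [-1, 1].

    frames: consecutive BGR frames of shape (H, W, 3) from one clip.
    """
    scores = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        dy = flow[..., 1]                    # +y is downward in image coordinates
        mag = np.linalg.norm(flow, axis=-1)  # per-pixel motion magnitude
        scores.append(float(dy.sum() / (mag.sum() + 1e-6)))
        prev = curr
    return float(np.mean(scores)) if scores else 0.0
```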
- GPU: 16GB VRAM minimum (tested on RTX 4080, RTX 3090)
- Storage: ~50GB for models + generated videos
- Platform: Linux or WSL2
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential git curl wget unzip ffmpeg

# Install Miniconda if needed
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
source ~/.bashrc
# Create environment
conda create -n ipp python=3.10 -y
conda activate ipp

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install diffusers transformers accelerate safetensors
pip install opencv-python-headless scikit-image scipy
pip install pandas numpy matplotlib seaborn
pip install einops omegaconf pyyaml tqdm
pip install pytorch-fid lpips av

git clone https://github.com/[username]/implicit-physics-prompting.git
cd implicit-physics-prompting
python scripts/setup_models.py  # Downloads AnimateDiff weights
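scripts/setup_models.py is not reproduced here; a minimal version might simply pull the motion-module weights from the Hugging Face Hub. The repository ID and filename below are placeholders for illustration, not confirmed artifact names.

```python
# scripts/setup_models.py -- illustrative sketch only
from huggingface_hub import hf_hub_download

def main() -> None:
    # Repo ID and filename are assumptions; point these at whatever
    # weights config/generation.yaml expects under models/.
    hf_hub_download(
        repo_id="guoyww/animatediff",   # placeholder repository
        filename="v3_sd15_mm.ckpt",     # placeholder motion module
        local_dir="models",
    )

if __name__ == "__main__":
    main()
```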
implicit-physics-prompting/
│
├── README.md
├── requirements.txt
├── config/
│ ├── generation.yaml
│ ├── evaluation.yaml
│ └── paths.yaml
│
├── prompts/
│ ├── taxonomy.yaml
│ └── templates/
│ ├── gravity.yaml
│ ├── momentum.yaml
│ ├── collision.yaml
│ └── fluid.yaml
│
├── src/
│ ├── generation/
│ │ ├── pipeline.py
│ │ ├── batch_generate.py
│ │ └── utils.py
│ │
│ ├── evaluation/
│ │ ├── metrics.py
│ │ ├── optical_flow.py
│ │ ├── temporal.py
│ │ └── physics_heuristics.py
│ │
│ ├── analysis/
│ │ ├── aggregate.py
│ │ ├── statistics.py
│ │ └── visualize.py
│ │
│ └── prompts/
│ ├── parser.py
│ ├── generator.py
│ └── linguistic.py
│
├── data/
│ ├── generated/
│ ├── metrics/
│ └── annotations/
│
├── notebooks/
│ ├── 01_exploration.ipynb
│ ├── 02_metric_validation.ipynb
│ └── 03_analysis.ipynb
│
├── scripts/
│ ├── setup_models.py
│ ├── run_experiment.py
│ └── aggregate_results.py
│
└── docs/
├── methodology.md
├── prompt_taxonomy.md
└── results_log.md
conda activate ipp
python -c "from src.generation.pipeline import generate; generate('a ball rolling on a table')"Should produce a .mp4 in data/generated/.
Define physics domains and linguistic variations in prompts/templates/. Example structure:
domain: gravity
scenarios:
  - id: ball_fall
    base: "a ball falls"
    variations:
      verb_swap:
        - "a ball drops"
        - "a ball plummets"
      temporal_chain:
        - "a ball tips off the edge and falls"
      force_explicit:
        - "gravity pulls a ball downward"
      manner:
        - "a ball falls slowly"
        - "a ball falls rapidly, accelerating"
    physics_expectations:
      direction: "downward"
      acceleration: "constant"
python scripts/run_experiment.py --config config/generation.yaml --domain gravity
Generates all prompt variations × seeds, outputs to data/generated/ with JSON metadata sidecars.
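run_experiment.py presumably expands each template into a flat list of generation jobs before looping over seeds. A minimal reading of the YAML above (assumed logic, not the shipped src/prompts/generator.py):

```python
# Sketch: expanding a template file into generation jobs
# (assumed logic; field names follow the template example above).
import itertools
import yaml

def expand_jobs(template_path: str, seeds: list[int]) -> list[dict]:
    with open(template_path) as f:
        spec = yaml.safe_load(f)
    jobs = []
    for scenario in spec["scenarios"]:
        prompts = {"base": [scenario["base"]], **scenario["variations"]}
        for (variation, texts), seed in itertools.product(prompts.items(), seeds):
            for text in texts:
                jobs.append({
                    "domain": spec["domain"],
                    "scenario": scenario["id"],
                    "variation": variation,
                    "prompt": text,
                    "seed": seed,
                })
    return jobs

# e.g. expand_jobs("prompts/templates/gravity.yaml", seeds=[42, 123])
```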
python scripts/compute_metrics.py --input data/generated/ --output data/metrics/
Computes optical flow, temporal coherence, and physics heuristics for all videos.
python scripts/aggregate_results.py
Compiles results, runs statistical tests, generates figures.
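As a sketch of the kind of test the analysis stage might run (an assumption, not the shipped src/analysis/statistics.py), here is a nonparametric comparison of one metric across prompt variation types; the CSV path and column names are illustrative:

```python
# Sketch: compare gravity_alignment across prompt variation types.
# Assumes metrics were aggregated into a CSV with "variation" and
# "gravity_alignment" columns (both names are illustrative).
import pandas as pd
from scipy.stats import kruskal

df = pd.read_csv("data/metrics/gravity_metrics.csv")
groups = [g["gravity_alignment"].values for _, g in df.groupby("variation")]
stat, p = kruskal(*groups)
print(f"Kruskal-Wallis H={stat:.2f}, p={p:.4f}")
```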
| Metric | Description | Range |
|---|---|---|
| `flow_magnitude_mean` | Average motion intensity | 0-50 |
| `flow_direction_entropy` | Motion direction consistency | 0-2 |
| `temporal_lpips` | Frame-to-frame perceptual change | 0-1 |
| `warp_error` | Flow-based reconstruction error | 0-1 |
| `gravity_alignment` | Downward flow dominance | -1 to 1 |
| `acceleration_smoothness` | Jerk minimization | 0-∞ |
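For instance, temporal_lpips can be computed with the lpips package installed earlier. The sketch below is one plausible implementation, not necessarily what src/evaluation/temporal.py does:

```python
# Sketch: mean frame-to-frame LPIPS over a clip (assumed implementation).
import lpips
import numpy as np
import torch

def temporal_lpips(frames: list[np.ndarray]) -> float:
    """frames: RGB uint8 arrays of shape (H, W, 3)."""
    loss_fn = lpips.LPIPS(net="alex")

    def to_tensor(f: np.ndarray) -> torch.Tensor:
        # Scale to [-1, 1] and reshape to (1, 3, H, W) as LPIPS expects.
        t = torch.from_numpy(f).permute(2, 0, 1).float() / 127.5 - 1.0
        return t.unsqueeze(0)

    scores = [loss_fn(to_tensor(a), to_tensor(b)).item()
              for a, b in zip(frames[:-1], frames[1:])]
    return float(np.mean(scores))
```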
config/generation.yaml
model:
  name: "animatediff"
  checkpoint: "models/animatediff-v3.safetensors"
  motion_module: "models/mm_sd15_v3.safetensors"

inference:
  num_frames: 16
  fps: 8
  height: 512
  width: 512
  guidance_scale: 7.5
  num_inference_steps: 25
  seeds: [42, 123, 456, 789, 1011]

output:
  format: "mp4"
  codec: "libx264"
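Given omegaconf in the environment, the pipeline would likely consume this file along these lines (a sketch, assuming the structure above):

```python
# Sketch: loading generation settings with OmegaConf (assumed usage).
from omegaconf import OmegaConf

cfg = OmegaConf.load("config/generation.yaml")
print(cfg.model.name)              # "animatediff"
print(cfg.inference.num_frames)    # 16
for seed in cfg.inference.seeds:   # iterate the pinned seeds
    print(seed)
```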
Based on embodied cognition research, we anticipate:
- Temporal chaining shows the strongest effects (sequential structure maps to temporal video structure)
- Manner specification reliably modulates speed and acceleration
- Force attribution may show weaker effects (models may not learn explicit physics vocabulary)
- Gravity and momentum domains respond better than complex domains like fluid dynamics
Null results would also be informative, indicating that prompt-based approaches are insufficient and architectural solutions are necessary.
| Issue | Solution |
|---|---|
| CUDA out of memory | Reduce resolution to 384×384 or frames to 12 |
| Black frames | Adjust guidance_scale (try 5-9) |
| Frozen motion | Try different seed or increase motion module weight |
| WSL GPU not detected | Update NVIDIA drivers on Windows host |
Key papers informing this work:
Video Diffusion:
- Blattmann et al. (2023) — Stable Video Diffusion
- Guo et al. (2023) — AnimateDiff
Embodied Cognition:
- Lakoff & Johnson (1980) — Metaphors We Live By
- Barsalou (2008) — Grounded Cognition
- Pulvermüller (2005) — Brain mechanisms linking language and action
Physics-Informed ML:
- Raissi et al. (2019) — Physics-informed neural networks
- Karniadakis et al. (2021) — Physics-informed machine learning review
MIT
Issues and PRs welcome. See docs/methodology.md for design rationale before proposing changes to the taxonomy or metrics.