DynaVisR-Billiards is a procedural dataset generator for evaluating whether a model can combine:
- visual trajectory simulation,
- bounce-indexed state updates, and
- overlap and layer-order reasoning.
Each example is a synthetic billiard world with a ball, a rectangular table, and named rectangular obstacles (A–D).
The solver must mentally simulate the ball’s reflections while also applying visibility rules that change after specific bounce counts.
At a queried moment, the solver must identify the next hit object, determine which obstacles are visible,
and recover the bottom-to-top order of the visible overlapping subset.
- Modality: image-grounded visual reasoning
- Core challenge: coupled physics simulation + dynamic state updates
- Task outputs: next hit object, visible objects, overlapping visible subset, layer order
- Artifacts produced: question image, answer image, metadata text, JSON record, JSONL dataset, manifests, checksums
- Reproducibility: seed-controlled generation, deterministic ordering, SHA-256 manifests
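The task outputs listed above can be illustrated with a minimal sketch. It is not the generator's code: the `Obstacle` fields `layer` and `hidden_after`, and the rule "an obstacle disappears once the bounce count reaches `hidden_after`", are assumptions made for illustration. The sketch also returns a flat bottom-to-top list, whereas the dataset stores layer groups.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Obstacle:
    name: str
    x0: float
    y0: float
    x1: float
    y1: float
    layer: int                          # hypothetical: larger value is drawn on top
    hidden_after: Optional[int] = None  # hypothetical: hidden once bounces >= this


def is_visible(ob: Obstacle, bounces: int) -> bool:
    # Bounce-indexed visibility rule (assumed form).
    return ob.hidden_after is None or bounces < ob.hidden_after


def overlaps(a: Obstacle, b: Obstacle) -> bool:
    # Axis-aligned rectangles overlap when they intersect on both axes.
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def snapshot(obstacles, bounces):
    """Return (visible names, overlapping visible names ordered bottom to top)."""
    visible = [o for o in obstacles if is_visible(o, bounces)]
    overlapping = [
        o for o in visible
        if any(o is not p and overlaps(o, p) for p in visible)
    ]
    order = [o.name for o in sorted(overlapping, key=lambda o: o.layer)]
    return sorted(o.name for o in visible), order


world = [
    Obstacle("A", 0, 0, 4, 4, layer=0),
    Obstacle("B", 2, 2, 6, 6, layer=2),
    Obstacle("C", 3, 3, 5, 5, layer=1, hidden_after=2),
    Obstacle("D", 8, 8, 9, 9, layer=3),
]
print(snapshot(world, bounces=1))
print(snapshot(world, bounces=2))
```

In this toy world, obstacle C drops out of both the visible set and the layer order once the second bounce has occurred, which is the kind of state change the queried moment is meant to probe.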
This benchmark is designed to reduce shortcutting by requiring models to solve a coupled reasoning problem rather than classify a familiar static pattern. A correct answer requires:
- exact reflection reasoning against walls and currently visible obstacles,
- correct application of visibility transitions after bounce counts,
- filtering to the visible subset at the queried moment,
- identifying which visible objects overlap,
- sorting those objects into bottom-to-top layers.
Because the generator computes gold answers by exact simulation and rejects ambiguous or low-clarity worlds, the resulting labels are precise and auditable.
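As a rough illustration of the reflection step, the sketch below traces a point ball inside an axis-aligned table with walls only. It assumes a table spanning `[0, width] x [0, height]`, ignores obstacles and ball radius, and uses hypothetical function names; the generator's own simulation also handles obstacles and visibility changes.

```python
def next_wall_hit(pos, vel, width, height):
    """Return (hit_point, reflected_velocity) for a point ball in the table."""
    x, y = pos
    vx, vy = vel
    if vx == 0 and vy == 0:
        raise ValueError("velocity must be non-zero")
    # Time to reach the wall the ball is heading toward on each axis.
    tx = (width - x) / vx if vx > 0 else (-x / vx if vx < 0 else float("inf"))
    ty = (height - y) / vy if vy > 0 else (-y / vy if vy < 0 else float("inf"))
    t = min(tx, ty)
    hit = (x + vx * t, y + vy * t)
    # Mirror the velocity component normal to the wall that was hit.
    # An exact corner hit (tx == ty) flips both components.
    if tx <= ty:
        vx = -vx
    if ty <= tx:
        vy = -vy
    return hit, (vx, vy)


def trace(pos, vel, width, height, bounces):
    """Collect the first `bounces` wall-hit points."""
    points = []
    for _ in range(bounces):
        pos, vel = next_wall_hit(pos, vel, width, height)
        points.append(pos)
    return points


print(trace((1, 1), (1, 2), width=4, height=3, bounces=3))
# [(2.0, 3.0), (3.5, 0.0), (4.0, 1.0)]
```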
The repository includes an analysis notebook with publication-ready figures.
| Model | Mean total score | 95% CI |
|---|---|---|
| Gemini 3.1 Pro Preview | 0.898 | 0.859-0.934 |
| Gemini 3 Flash Preview | 0.758 | 0.698-0.814 |
| Qwen 3 235B A22B Instruct | 0.601 | 0.545-0.657 |
| Claude Sonnet 4.6 | 0.583 | 0.513-0.649 |
| Claude Opus 4.7 | 0.412 | 0.346-0.476 |
The full notebook, which also documents the methodology, includes:
- overall model comparison,
- clustered subtask-group comparison,
- score-distribution analysis across examples.
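This README does not state how the 95% intervals in the table were computed. One common choice for per-example scores is a percentile bootstrap over examples, sketched below with hypothetical scores and a hypothetical function name; the notebook's actual method may differ.

```python
import random
from statistics import fmean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the mean of per-example scores."""
    rng = random.Random(seed)  # fixed seed so the interval is repeatable
    n = len(scores)
    # Resample examples with replacement and record each resample's mean.
    means = sorted(fmean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return fmean(scores), lo, hi


# Hypothetical per-example total scores in [0, 1]; not benchmark results.
example_scores = [0.0, 0.25, 0.5, 0.75, 1.0] * 20
mean, lo, hi = bootstrap_ci(example_scores)
print(f"mean={mean:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```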
For every generated example, the pipeline writes:
- a question image with the canonical board layout,
- an answer image with the trajectory up to the queried hit and the hit point marked,
- a metadata text file containing the prompt and gold answers,
- a JSON record for the single example,
- a dataset.jsonl file for the full split,
- a manifest.json file with per-file SHA-256 hashes and build metadata,
- manifest.sha256 and dataset.sha256 checksum files.
The generator is designed for repeatable dataset creation.
- The dataset is generated from an explicit --seed.
- JSON and JSONL output use deterministic key ordering.
- The output manifest sorts files deterministically.
- A clean output directory is required for reproducible builds.
- Every emitted file is hashed with SHA-256.
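The properties above can be sketched in a few lines of standard-library Python. This is an illustration, not the generator's code: `canonical_json`, `sha256_hex`, and `sample_world` are hypothetical names, and the record contents are made up.

```python
import hashlib
import json
import random


def canonical_json(record) -> str:
    # sort_keys gives a stable key order regardless of insertion order.
    return json.dumps(record, sort_keys=True, ensure_ascii=False,
                      separators=(",", ":"))


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def sample_world(seed: int):
    # A dedicated RNG instance keeps generation independent of global state.
    rng = random.Random(seed)
    return {"start": [rng.randint(0, 100), rng.randint(0, 100)], "seed": seed}


a = {"sample_id": "00000", "world": sample_world(7)}
b = {"world": sample_world(7), "sample_id": "00000"}  # same data, other order

# Same seed and same content give byte-identical output and the same hash.
assert canonical_json(a) == canonical_json(b)
print(sha256_hex(canonical_json(a).encode("utf-8")))
```

Setting `PYTHONHASHSEED` matters for a separate reason: string hashes are otherwise randomized per process, so any code path that iterates over a set of strings could vary between runs.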
Example reproducible build:

    PYTHONHASHSEED=0 python billiard_benchmark_generator.py \
      --output-dir dataset/v1 \
      --num-examples 100 \
      --seed 7 \
      --snapshot-after-bounce 2 \
      --require-overlap-at-snapshot any

Clone the repository:

    git clone https://github.com/akaliutau/dynavisr-bench.git
    cd dynavisr-bench

Create and activate a Conda environment:

    conda create -n dynavisr python=3.12 -y
    conda activate dynavisr

Install dependencies:

    pip install -r requirements.txt

Generate a dataset:

    PYTHONHASHSEED=0 python billiard_benchmark_generator.py \
      --output-dir dataset/v1 \
      --num-examples 10 \
      --seed 7 \
      --snapshot-after-bounce 2

Convert the generated JSONL to a Kaggle-ready CSV:

    python convert_jsonl_to_csv.py \
      dataset/v1/dataset.jsonl \
      dataset/v1/benchmark.csv \
      --image-folder images

Output layout:

    dataset/v1/
    ├── dataset.jsonl
    ├── dataset.sha256
    ├── manifest.json
    ├── manifest.sha256
    └── images/
        ├── 00000_question.png
        ├── 00000_answer.png
        ├── 00001_question.png
        └── 00001_answer.png
Each line of dataset.jsonl is a JSON object with:
- `sample_id`: stable example identifier
- `image_path`: relative path to the question image
- `answer_image_path`: relative path to the answer visualization
- `metadata_txt_path`: relative path to the text metadata file
- `prompt`: natural-language task prompt
- `world`: serialized world configuration
- `answers.q1_hit_object`: gold label for the queried hit object
- `answers.q2_visible_objects`: visible objects at the queried moment
- `answers.q3a_visible_overlapping_objects`: visible objects that overlap at that moment
- `answers.q3b_layer_groups_bottom_to_top`: overlap-layer groups sorted from bottom to top
- `debug`: exact simulation details for auditing
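A minimal sketch of reading these records with the standard library follows. The field names come from the list above; the helper names and the example values (including the shape of the layer groups) are illustrative assumptions.

```python
import json

ANSWER_KEYS = (
    "q1_hit_object",
    "q2_visible_objects",
    "q3a_visible_overlapping_objects",
    "q3b_layer_groups_bottom_to_top",
)


def load_records(lines):
    """Parse an iterable of JSONL lines (for example an open file handle)."""
    return [json.loads(line) for line in lines if line.strip()]


def gold_answers(record):
    """Pull the gold labels out of one record, keyed by answer field."""
    answers = record["answers"]
    return {key: answers[key] for key in ANSWER_KEYS}


# Illustrative line only; real records also carry prompt, world, debug, etc.
example_lines = [
    '{"sample_id": "00000", "image_path": "images/00000_question.png", '
    '"answers": {"q1_hit_object": "A", "q2_visible_objects": ["A", "B"], '
    '"q3a_visible_overlapping_objects": ["A", "B"], '
    '"q3b_layer_groups_bottom_to_top": [["A"], ["B"]]}}',
    "",
]
for rec in load_records(example_lines):
    print(rec["sample_id"], gold_answers(rec)["q1_hit_object"])
```

With a generated dataset, the same function can be called as `load_records(open("dataset/v1/dataset.jsonl", encoding="utf-8"))`.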
The manifest records:
- generator version,
- seed and generation parameters,
- Python and Pillow versions,
- PYTHONHASHSEED,
- source file hashes,
- per-file SHA-256 checksums for all payload files,
- aggregate dataset payload hash.
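Per-file checksums like these can be re-verified after download. The manifest.json schema is not shown here, so the sketch below takes a plain mapping from relative path to expected SHA-256 hex digest; how that mapping is extracted from the manifest is left as an assumption, and `find_mismatches` is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(root, expected):
    """Return sorted relative paths that are missing or whose hash differs.

    `expected` maps relative path -> SHA-256 hex digest (assumed shape).
    """
    root = Path(root)
    bad = []
    for rel_path, want in sorted(expected.items()):
        target = root / rel_path
        if not target.is_file() or file_sha256(target) != want.lower():
            bad.append(rel_path)
    return bad
```

An empty return value means every listed file is present and matches its recorded digest.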
The generator rejects worlds that are visually confusing or geometrically ambiguous. Rejection filters include:
- ambiguous simultaneous hits,
- corner collisions that are too close to obstacle or wall corners,
- trajectories that pass too close to obstacles without hitting them,
- short trajectory legs that are hard to inspect,
- unreadable same-orientation overlaps,
- crowded starts too close to walls or obstacles.
These filters improve label validity and visual legibility for both human inspection and model evaluation.
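Two of these filters, short trajectory legs and hits too close to a corner, can be sketched as simple geometric checks. The thresholds `min_leg` and `corner_margin`, the function names, and the `(x0, y0, x1, y1)` rectangle format are assumptions for illustration; the generator's actual filters and tolerances are not specified here.

```python
import math


def has_short_leg(points, min_leg):
    """True if any segment between consecutive trajectory points is too short."""
    return any(math.dist(p, q) < min_leg for p, q in zip(points, points[1:]))


def near_corner(hit, rect, margin):
    """True if a hit point lies within `margin` of any corner of `rect`."""
    x0, y0, x1, y1 = rect
    corners = ((x0, y0), (x0, y1), (x1, y0), (x1, y1))
    return any(math.dist(hit, c) < margin for c in corners)


def accept(points, rects, min_leg=0.5, corner_margin=0.2):
    """Keep a world only if every leg is long enough and no hit is near a corner.

    `points[0]` is the start position; the remaining points are hit points.
    """
    if has_short_leg(points, min_leg):
        return False
    return not any(
        near_corner(hit, rect, corner_margin)
        for hit in points[1:]
        for rect in rects
    )


table = (0, 0, 4, 3)  # the table itself is treated as one more rectangle
print(accept([(1, 1), (2, 3), (3.5, 0), (4, 1)], [table]))  # True
print(accept([(1, 1), (3.9, 3.0)], [table]))                # False: near a corner
```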
This generator is best positioned as a benchmark for Executive Functions, with Attention as a secondary capability.
- Executive Functions: multistep planning, sequential rule application, and working-memory-like state maintenance
- Attention: tracking the currently relevant visible subset under dynamic updates
If you use this generator in a benchmark, cite the repository or benchmark writeup and include the dataset seed, snapshot configuration, and dataset hash from manifest.json.
Built exclusively for:
Measuring Progress Toward AGI - Cognitive Abilities
Google DeepMind Hackathon
https://www.kaggle.com/competitions/kaggle-measuring-agi/writeups/dynamic-visual-reasoning

