This document replaces an older diagnosis written when the project still relied heavily on compact one-hot style inputs. It summarizes the evolution of the encodings actually present in the repo today.
The original compact encoding is no longer the recommended path. The current approach combines:
- relative/geometric encodings
- canonical dataset reduction
- search-based correction on top of neural evaluation where needed
- Historical baseline
- Piece-position one-hot style representation
- Larger input size
- Kept mainly for older experiments and comparisons
This encoding is no longer the preferred route for current work.
- Strong breakthrough for 3-piece endgames
- Encodes piece coordinates, piece identities, pairwise geometry, and side to move
- Much smaller than compact one-hot baselines
This is the main reason the project moved away from purely positional one-hot inputs.
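As a rough illustration of the feature groups listed above (piece coordinates, piece identities, pairwise geometry, side to move), the following is a minimal sketch and not the repository's actual feature layout. The piece codes, the normalization, the choice of Chebyshev distance, and the function names are all assumptions made for this example.

```python
from itertools import combinations

# Hypothetical piece-type codes; the repository's real identifiers may differ.
PIECE_CODES = {"K": 1, "Q": 2, "R": 3, "B": 4, "N": 5, "P": 6}


def square_to_coords(square):
    """Convert algebraic notation such as 'e4' to zero-based (file, rank)."""
    return ord(square[0]) - ord("a"), int(square[1]) - 1


def encode_relative(pieces, white_to_move):
    """Encode a position as a flat feature list.

    pieces: list of (color, piece_letter, square), color is +1 (white) or -1 (black).
    """
    features = []
    coords = []
    for color, letter, square in pieces:
        f, r = square_to_coords(square)
        coords.append((f, r))
        # Per-piece block: normalized coordinates plus a signed identity code.
        features += [f / 7.0, r / 7.0, color * PIECE_CODES[letter] / 6.0]
    for (f1, r1), (f2, r2) in combinations(coords, 2):
        df, dr = f2 - f1, r2 - r1
        # Pairwise block: signed offsets and Chebyshev (king-move) distance.
        features += [df / 7.0, dr / 7.0, max(abs(df), abs(dr)) / 7.0]
    # Side to move is an explicit input.
    features.append(1.0 if white_to_move else 0.0)
    return features


kqk = [(+1, "K", "e1"), (+1, "Q", "d1"), (-1, "K", "e8")]
vector = encode_relative(kqk, white_to_move=True)
print(len(vector))  # 3 pieces * 3 + 3 pairs * 3 + 1 = 19
```

For comparison, a plain one-hot over 64 squares for the same three pieces would need 3 × 64 = 192 inputs, which gives a sense of why a geometric layout can be much smaller.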
- Added move-distance and relationship features
- Useful as an intermediate stage in the repo history
- Still relevant when reading older experiment summaries
- Focused on pawn endgames and race-sensitive structure
- Used in early KPvKP and KRPvKP work
- Important historical milestone, but no longer the whole story
Older docs that present v4 as the current state of the project are stale.
- Current active branch for the canonical KPvKP dataset under `data/v5/`
- Used by the present training run documented in `logs/`
- Coexists with helper code that still handles the same dimensionalities as V4/V5 together
The metadata file next to the dataset is the authoritative source for the encoding version.
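Since the metadata file is the authoritative source, a quick check can be scripted. The sketch below looks for `*_metadata.json` files in a dataset directory and reports one field from each. The field name `encoding_version` is a hypothetical placeholder; open a real metadata file to find the key the generator actually writes.

```python
import json
from pathlib import Path


def read_encoding_version(dataset_dir, key="encoding_version"):
    """Return {metadata filename: value of `key`} for each *_metadata.json found.

    `key` is a hypothetical field name used for illustration only.
    """
    versions = {}
    for path in sorted(Path(dataset_dir).glob("*_metadata.json")):
        with path.open(encoding="utf-8") as handle:
            metadata = json.load(handle)
        # A missing key yields None rather than raising an error.
        versions[path.name] = metadata.get(key)
    return versions
```

A call such as `read_encoding_version("data/v5")` returns an empty dictionary when no metadata file is present, which is itself a signal that the dataset should be regenerated or inspected by hand.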
If you are deciding what to use now:
- Use `src/generate_datasets_parallel.py`
- Prefer `--relative`
- Use `--version 5` for the active KPvKP line
- Use `--canonical --canonical-mode auto` unless you have a specific reason not to
- Check the generated `*_metadata.json`
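Put together, the recommendations above correspond to a single invocation. The sketch below only assembles the flags named in this document. The helper name `build_generation_command` is invented for the example, and the script may require further arguments (for instance an endgame selection or an output path) that are not listed here, so check its argument parser before running.

```python
import shlex


def build_generation_command(version=5, canonical_mode="auto"):
    """Assemble an argument list from the flags named in this document only."""
    return [
        "python", "src/generate_datasets_parallel.py",
        "--relative",
        "--version", str(version),
        "--canonical",
        "--canonical-mode", canonical_mode,
    ]


command = build_generation_command()
# Print the command for review instead of running it blindly.
print(shlex.join(command))
```

Once the full argument set is confirmed, the list can be passed to `subprocess.run(command, check=True)`.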
Older analysis documents described the compact encoding as the "actual" encoding and flagged the absence of side-to-move information as a central flaw. That was accurate for an earlier stage of the project, but it no longer describes the recommended pipeline in this repository.
For current relative encodings, side to move and geometric structure are part of the feature design.
- Smaller input dimensionality
- Better inductive bias for chess geometry
- Stronger empirical performance in the repo's experiments
- Better fit for canonicalized datasets
- Cleaner handoff into search-based correction workflows
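To make the canonicalization point concrete: in pawn endgames without castling rights, mirroring a position left to right does not change its game-theoretic value, so two mirrored positions can share one dataset entry. The sketch below shows that idea only. It is not the logic behind `--canonical-mode auto`, and the rule of using the first listed piece as the reference is an assumption chosen for simplicity.

```python
def mirror_file(square):
    """Reflect a square across the vertical axis between the d- and e-files."""
    file_index = ord(square[0]) - ord("a")
    return chr(ord("a") + 7 - file_index) + square[1]


def canonicalize(pieces):
    """Mirror the position so that the first listed piece sits on files a-d.

    pieces: list of (piece_label, square) pairs, e.g. ("WK", "g2").
    """
    reference_square = pieces[0][1]
    if reference_square[0] >= "e":
        return [(label, mirror_file(square)) for label, square in pieces]
    return list(pieces)


position = [("WK", "g2"), ("WP", "f4"), ("BK", "c7"), ("BP", "b5")]
mirrored = [("WK", "b2"), ("WP", "c4"), ("BK", "f7"), ("BP", "g5")]
# Both positions map to the same canonical representative.
assert canonicalize(position) == canonicalize(mirrored)
```

A complete scheme would also need a tie-breaking rule for positions whose reference piece could sit on either side after other transformations; that is omitted here.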
There is still some naming drift in the codebase:
- some scripts classify encodings by input dimensionality
- some metadata labels the dataset as `v5`
- some helper comments still mention V4/V5 together
That is a documentation and code-clarity issue, not evidence that the project is still primarily on the old compact path.
When checking which encoding version applies, consult:
- dataset metadata
- checkpoint metadata
- active logs
- `README.md`
- older result notes
Last updated: March 20, 2026