|
| 1 | +# SRD PDF vs Database Comparison Tool |
| 2 | + |
| 3 | +**Date:** 2026-05-22 |
| 4 | +**Status:** Approved |
| 5 | + |
| 6 | +## Context |
| 7 | + |
| 8 | +The open5e-api project stores D&D SRD 5.2 content as Django fixture JSON files under |
| 9 | +`data/v2/wizards-of-the-coast/srd-2024/` (8,580 records across 33 files), which are loaded |
| 10 | +into the database via `manage.py import`. These fixtures were generated by five markdown-parsing |
| 11 | +scripts in `data/raw_sources/srd_5_2/scripts/`. |
| 12 | + |
| 13 | +The source of truth is the official SRD PDF at |
| 14 | +`data/raw_sources/srd_5_2/SRD_CC_v5.2.pdf` (5.7 MB, CC-BY-4.0). There is currently no |
| 15 | +automated way to verify that the database faithfully represents the PDF — missing records and |
| 16 | +mis-parsed field values are invisible until manually noticed. |
| 17 | + |
| 18 | +**Goal:** A fast management command that extracts entities directly from the PDF and |
| 19 | +compares them against the database, reporting both missing records and field-value mismatches |
| 20 | +in a Rich terminal display. Target runtime: under 10 seconds for all entity types combined. |
| 21 | + |
| 22 | +This PDF parsing layer also serves as the foundation for replacing the markdown-based conversion |
| 23 | +scripts with a direct PDF → fixture generation pipeline in a future iteration. |
| 24 | + |
| 25 | +## Architecture |
| 26 | + |
| 27 | +``` |
| 28 | +SRD_CC_v5.2.pdf |
| 29 | + │ |
| 30 | + └─ (pdfplumber, main thread) ─→ full_text: str |
| 31 | + │ |
| 32 | + ┌───────────────────────────────┼───────────────────────┐ |
| 33 | + ▼ ▼ ▼ |
| 34 | + spells.py(full_text) creatures.py(full_text) items.py(full_text) |
| 35 | + [ThreadPoolExecutor — text strings are passed, not the open PDF object] |
| 36 | + │ │ │ |
| 37 | + └───────────────────────────────┼───────────────────────┘ |
| 38 | + ▼ |
| 39 | + ORM .values() queries |
| 40 | + │ |
| 41 | + ▼ |
| 42 | + Rich terminal report |
| 43 | +``` |
| 44 | + |
| 45 | +**Key constraint:** pdfplumber is not thread-safe when sharing an open PDF object. The |
| 46 | +management command extracts all page text to a single `str` in the main thread first, then |
| 47 | +passes that string to parsers running in a `ThreadPoolExecutor`. Each parser operates only |
| 48 | +on a plain string — no shared file handles. |
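The extract-once, parse-in-parallel pattern described above could look roughly like the sketch below. The two counting functions are hypothetical stand-ins for the real parsers, and `run_parsers` is an illustrative name, not part of the planned module layout; in the real command the string would come from `extract_full_text(pdf_path)`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


# Hypothetical stand-ins for the real parsers; each receives only a plain string.
def count_spell_lines(full_text: str) -> int:
    return sum(1 for line in full_text.splitlines() if line.startswith("Spell:"))


def count_creature_lines(full_text: str) -> int:
    return sum(1 for line in full_text.splitlines() if line.startswith("Creature:"))


def run_parsers(
    full_text: str, parsers: Dict[str, Callable[[str], object]]
) -> Dict[str, object]:
    """Run every parser on the same immutable string, one task per entity type."""
    with ThreadPoolExecutor(max_workers=max(1, len(parsers))) as pool:
        futures = {name: pool.submit(fn, full_text) for name, fn in parsers.items()}
        # .result() re-raises any exception raised inside the worker thread.
        return {name: fut.result() for name, fut in futures.items()}


# Made-up sample text standing in for the extracted PDF text.
sample_text = "Spell: Fireball\nCreature: Aboleth\nSpell: Shield\n"
results = run_parsers(
    sample_text,
    {"spells": count_spell_lines, "creatures": count_creature_lines},
)
```

Because the workers share only an immutable `str`, no locking is needed. Note that pure-Python regex parsing is CPU-bound, so threads mainly help with the I/O-bound parts of each task (such as database queries) rather than the parsing itself.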
| 49 | + |
| 50 | +**Module layout:** |
| 51 | +``` |
| 52 | +data/raw_sources/srd_5_2/parsers/ |
| 53 | + __init__.py |
| 54 | + base.py shared: text cleaning, slug generation, field normalisation |
| 55 | + spells.py parse spell blocks from extracted PDF text |
| 56 | + creatures.py parse creature stat blocks |
| 57 | + items.py weapons, armor, adventuring gear, magic items |
| 58 | + |
| 59 | +api_v2/management/commands/ |
| 60 | + compare_srd.py orchestrates extraction, ORM queries, diff, Rich render |
| 61 | +``` |
| 62 | + |
| 63 | +### New dependencies |
| 64 | + |
| 65 | +Add to `pyproject.toml` (runtime): |
| 66 | + |
| 67 | +| Package | Purpose | License | |
| 68 | +|---------|---------|---------| |
| 69 | +| `pdfplumber>=0.11` | PDF text + table extraction | MIT | |
| 70 | +| `rich>=13` | terminal tables and panels | MIT | |
| 71 | + |
| 72 | +## Component Details |
| 73 | + |
| 74 | +### `parsers/base.py` |
| 75 | + |
| 76 | +Shared utilities used by all section parsers. All parsers receive `full_text: str` (the |
| 77 | +concatenated text of all PDF pages), not a file path. |
| 78 | + |
| 79 | +- `extract_full_text(pdf_path) -> str`: the single point that opens pdfplumber, iterates |
| 80 | + pages, concatenates `page.extract_text()` output (with per-page table text inserted inline; |
| 81 | + see Page-break handling below), and closes the file. Called once in the main thread before any parallelism. |
| 82 | +- `extract_section(full_text, start_re, end_re) -> str`: return the substring of `full_text` |
| 83 | + between the first match of `start_re` and the first subsequent match of `end_re` (both |
| 84 | + compiled regex patterns). Used by each parser to isolate its section before splitting on individual entries. |
| 85 | +- `clean_text(s) -> str`: expand ligatures (fi, fl, ffi, ffl rendered as single glyphs) to |
| 86 | + their component letters, normalize whitespace, remove soft hyphens, strip leading/trailing whitespace. |
| 87 | +- `slugify(name) -> str`: lowercase, replace spaces with hyphens, strip non-alphanumeric |
| 88 | + except hyphens. Used as the dict key for name-based matching. |
| 89 | +- `parse_cost(s) -> dict | None`: parse a cost such as "10 gp", "5 sp", or "2 cp"; for example, "10 gp" becomes |
| 90 | + `{"amount": 10, "unit": "gp"}`. Returns None if no match. Used for the weapon `cost` field, |
| 91 | + the armor `cost` field (where present), and the gear `cost` field. |
| 92 | +- `parse_dice(s) -> dict | None`: parse "2d6+3" into `{"count": 2, "die": 6, "bonus": 3}`. |
| 93 | + Used for the creature `hit_dice` field and the weapon `damage` field. |
| 94 | + |
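The string helpers listed above might be sketched as follows. This is a minimal illustration under stated assumptions, not the final implementation: the `ep`/`pp` units, the thousands-separator handling in `parse_cost`, and the acceptance of a Unicode minus sign in `parse_dice` are all assumptions that go beyond what this document specifies.

```python
import re
import unicodedata
from typing import Optional

SOFT_HYPHEN = "\u00ad"


def clean_text(s: str) -> str:
    # NFKC expands compatibility ligatures (U+FB00..U+FB04) into plain letters.
    s = unicodedata.normalize("NFKC", s)
    s = s.replace(SOFT_HYPHEN, "")
    # Collapse all whitespace runs to one space and trim both ends.
    return re.sub(r"\s+", " ", s).strip()


def slugify(name: str) -> str:
    s = clean_text(name).lower().replace(" ", "-")
    return re.sub(r"[^a-z0-9-]", "", s)


# Assumption: ep/pp units and "1,500"-style amounts may appear in cost fields.
_COST_RE = re.compile(r"^\s*(\d[\d,]*)\s*(cp|sp|ep|gp|pp)\s*$", re.IGNORECASE)


def parse_cost(s: str) -> Optional[dict]:
    m = _COST_RE.match(s)
    if m is None:
        return None
    return {"amount": int(m.group(1).replace(",", "")), "unit": m.group(2).lower()}


# Assumption: the sign may be ASCII "+"/"-" or the Unicode minus U+2212.
_DICE_RE = re.compile(r"^\s*(\d+)d(\d+)\s*(?:([+\-\u2212])\s*(\d+))?\s*$")


def parse_dice(s: str) -> Optional[dict]:
    m = _DICE_RE.match(s)
    if m is None:
        return None
    count, die, sign, bonus = m.groups()
    value = int(bonus) if bonus else 0
    if sign in ("-", "\u2212"):
        value = -value
    return {"count": int(count), "die": int(die), "bonus": value}
```

Returning `None` on a failed match (rather than raising) lets the caller decide whether an unparseable cost or dice string is a mismatch or a parser bug.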
| 95 | +### `parsers/spells.py` |
| 96 | + |
| 97 | +`extract_spells(full_text: str) -> list[SpellRecord]` |
| 98 | + |
| 99 | +**Section isolation:** Use `extract_section()` with `start_re` matching the spells chapter |
| 100 | +heading and `end_re` matching the next chapter heading. This confines parsing to the spells |
| 101 | +section and avoids false matches on spell-name-like strings elsewhere in the PDF. |
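A minimal sketch of `extract_section()` as used for section isolation. The heading patterns in the sample are hypothetical (the real chapter headings need to be confirmed against the extracted text), and raising `ValueError` on a missing start heading and running to the end of the text on a missing end heading are assumptions about the desired behaviour.

```python
import re


def extract_section(full_text: str, start_re: re.Pattern, end_re: re.Pattern) -> str:
    start = start_re.search(full_text)
    if start is None:
        raise ValueError(f"section start not found: {start_re.pattern!r}")
    # Search for the end marker only after the start marker.
    end = end_re.search(full_text, start.end())
    stop = end.start() if end else len(full_text)
    return full_text[start.end():stop]


# Made-up sample text; "Spells" and "Monsters" are placeholder chapter headings.
sample = "Intro\nSpells\nAcid Arrow\nFireball\nMonsters\nAboleth\n"
section = extract_section(
    sample,
    re.compile(r"^Spells$", re.MULTILINE),
    re.compile(r"^Monsters$", re.MULTILINE),
)
```

Searching for `end_re` from `start.end()` onward avoids picking up an earlier occurrence of the end heading, for example in a table of contents.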
| 102 | + |
| 103 | +**Entry splitting:** Split on the heading pattern that marks individual spell entries. The |
| 104 | +SRD PDF uses a consistent bold/large-text heading for each spell name (rendered in extracted |
| 105 | +text as a line that matches `r"^[A-Z][A-Za-z '/-]+$"` at the start of a block). Validate |
| 106 | +each candidate entry has at least a level/school line before accepting it as a spell. |
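The split-then-validate step could be sketched like this. The heading regex is the one quoted above; the level/school line formats ("Level 2 Evocation (...)" and "Evocation Cantrip (...)") are an assumption about how the PDF text reads and should be checked against real extracted pages. `split_spell_entries` is an illustrative name.

```python
import re
from typing import List, Tuple

HEADING_RE = re.compile(r"^[A-Z][A-Za-z '/-]+$")
SCHOOLS = (
    "Abjuration|Conjuration|Divination|Enchantment|"
    "Evocation|Illusion|Necromancy|Transmutation"
)
# Assumed formats: "Level 3 Evocation (...)" or "Evocation Cantrip (...)".
LEVEL_SCHOOL_RE = re.compile(rf"^(?:Level (\d) ({SCHOOLS})|({SCHOOLS}) Cantrip)\b")


def split_spell_entries(section: str) -> List[Tuple[str, List[str]]]:
    """Return (name, body_lines) pairs; a heading counts only if the next
    line looks like a level/school line."""
    lines = [ln.strip() for ln in section.splitlines() if ln.strip()]
    entries: List[Tuple[str, List[str]]] = []
    for i, line in enumerate(lines):
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        if HEADING_RE.match(line) and LEVEL_SCHOOL_RE.match(nxt):
            entries.append((line, []))
        elif entries:
            entries[-1][1].append(line)
    return entries


# Made-up sample text in the assumed layout.
sample = """Acid Arrow
Level 2 Evocation (Wizard)
Casting Time: Action
Range: 90 feet
A shimmering green arrow streaks toward a target
Fire Bolt
Evocation Cantrip (Sorcerer, Wizard)
Casting Time: Action
"""
entries = split_spell_entries(sample)
```

In the sample, the prose line "A shimmering green arrow streaks toward a target" matches the heading regex on its own, which is why the look-ahead validation is needed.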
| 107 | + |
| 108 | +**Fields extracted per entry:** |
| 109 | +- `name`: str |
| 110 | +- `level`: int (0 = cantrip) |
| 111 | +- `school`: str (abjuration | conjuration | divination | enchantment | evocation | illusion | necromancy | transmutation) |
| 112 | +- `casting_time`: str |
| 113 | +- `range_text`: str |
| 114 | +- `verbal`: bool |
| 115 | +- `somatic`: bool |
| 116 | +- `material`: bool |
| 117 | +- `material_specified`: str | None |
| 118 | +- `duration`: str |
| 119 | +- `concentration`: bool |
| 120 | +- `ritual`: bool |
| 121 | +- `higher_level`: str | None |
| 122 | + |
| 123 | +**Skip fields** (not compared against DB — expected to differ in whitespace/formatting): |
| 124 | +- `desc` |
| 125 | + |
| 126 | +**Sanity check:** If fewer than 300 spells are parsed, abort with a descriptive error |
| 127 | +(`ValueError`) rather than silently producing a misleading comparison. |
| 128 | + |
| 129 | +Return `list[SpellRecord]` (frozen dataclass). |
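For reference, the record type and sanity check might be sketched as below. The field list is taken from this section; `check_spell_count` and `MIN_SPELLS` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SpellRecord:
    name: str
    level: int  # 0 = cantrip
    school: str
    casting_time: str
    range_text: str
    verbal: bool
    somatic: bool
    material: bool
    material_specified: Optional[str]
    duration: str
    concentration: bool
    ritual: bool
    higher_level: Optional[str]


MIN_SPELLS = 300  # threshold from the sanity check above


def check_spell_count(
    records: List[SpellRecord], minimum: int = MIN_SPELLS
) -> List[SpellRecord]:
    """Raise instead of returning a suspiciously short list."""
    if len(records) < minimum:
        raise ValueError(
            f"parsed only {len(records)} spells (expected at least {minimum}); "
            "section isolation or heading detection probably failed"
        )
    return records


# Made-up example record.
fire_bolt = SpellRecord(
    name="Fire Bolt", level=0, school="evocation", casting_time="Action",
    range_text="120 feet", verbal=True, somatic=True, material=False,
    material_specified=None, duration="Instantaneous", concentration=False,
    ritual=False, higher_level=None,
)
```

A frozen dataclass is hashable and cannot be mutated after construction, which suits records that are shared across worker threads.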
| 130 | + |
| 131 | +### `parsers/creatures.py` |
| 132 | + |
| 133 | +`extract_creatures(full_text: str) -> list[CreatureRecord]` |
| 134 | + |
| 135 | +**Section isolation:** Use `extract_section()` to isolate the monsters A–Z chapter (and the |
| 136 | +animals appendix if present) before splitting on creature headings. |
| 137 | + |
| 138 | +**Entry detection:** pdfplumber's `extract_text()` returns plain text — there are no markdown |
| 139 | +`##` markers. In the SRD PDF, creature names appear as a line of title-case text (e.g., |
| 140 | +`"Adult Black Dragon"`) with no surrounding punctuation, followed immediately by a line |
| 141 | +containing comma-separated size, type, and alignment (e.g., `"Huge dragon, chaotic evil"`). |
| 142 | +The two-line pattern `r"^[A-Z][A-Za-z '\-]+$"` followed by a line matching |
| 143 | +`r"^(Tiny|Small|Medium|Large|Huge|Gargantuan)\s+\w+"` is the creature-entry boundary. |
| 144 | +This two-line heuristic disambiguates creature names from chapter headings (which are not |
| 145 | +followed by a size/type/alignment line). |
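The two-line heuristic can be sketched as follows, using the two regexes quoted above. `find_creature_starts` and `parse_header` are illustrative names, and splitting the second line on its last comma is an assumption about how size/type and alignment are separated.

```python
import re
from typing import Dict, List

NAME_RE = re.compile(r"^[A-Z][A-Za-z '\-]+$")
SIZE_LINE_RE = re.compile(r"^(Tiny|Small|Medium|Large|Huge|Gargantuan)\s+\w+")


def find_creature_starts(lines: List[str]) -> List[int]:
    """Indices of name lines that are immediately followed by a size/type line."""
    return [
        i
        for i in range(len(lines) - 1)
        if NAME_RE.match(lines[i]) and SIZE_LINE_RE.match(lines[i + 1])
    ]


def parse_header(name_line: str, meta_line: str) -> Dict[str, str]:
    # Assumption: the last comma separates "size type" from alignment.
    size_type, _, alignment = meta_line.rpartition(",")
    size, _, creature_type = size_type.strip().partition(" ")
    return {
        "name": name_line.strip(),
        "size": size,
        "type": creature_type.strip(),
        "alignment": alignment.strip(),
    }


# Made-up sample lines in the layout described above.
lines = [
    "Monsters A-Z",
    "Adult Black Dragon",
    "Huge dragon, chaotic evil",
    "Armor Class 19",
    "Ape",
    "Medium beast, unaligned",
]
starts = find_creature_starts(lines)
header = parse_header(lines[starts[0]], lines[starts[0] + 1])
```

The chapter heading "Monsters A-Z" matches the name regex but is rejected because the following line is not a size/type line.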
| 146 | + |
| 147 | +**Page-break handling:** `extract_full_text()` concatenates all page text before parsing, so |
| 148 | +stat blocks that span a page break appear as contiguous text. The ability score table |
| 149 | +(`STR | DEX | CON | INT | WIS | CHA`) is extracted using pdfplumber's table extractor |
| 150 | +*per page* during `extract_full_text()` and inserted inline with a recognizable delimiter, |
| 151 | +so the string-level parsers can find it reliably. |
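One way to carry per-page tables through concatenation is sketched below on plain data, so it runs without a PDF. The sentinel strings are hypothetical, and the table shape (a list of rows, each a list of cell strings or `None`) is assumed to match what pdfplumber's table extraction returns. As a simplification, this sketch appends tables after the page text instead of inserting them at their exact position.

```python
import re
from typing import List, Optional, Tuple

Table = List[List[Optional[str]]]

# Hypothetical sentinel lines; any string unlikely to occur in the SRD would do.
TABLE_START = "<<TABLE>>"
TABLE_END = "<<END TABLE>>"
TABLE_RE = re.compile(
    re.escape(TABLE_START) + r"\n(.*?)\n" + re.escape(TABLE_END), re.DOTALL
)


def render_page(text: str, tables: List[Table]) -> str:
    """Page text followed by each table as delimiter-wrapped, pipe-joined rows."""
    parts = [text.rstrip("\n")]
    for table in tables:
        rows = [" | ".join(cell or "" for cell in row) for row in table]
        parts.append("\n".join([TABLE_START, *rows, TABLE_END]))
    return "\n".join(parts)


def join_pages(pages: List[Tuple[str, List[Table]]]) -> str:
    return "\n".join(render_page(text, tables) for text, tables in pages)


# Made-up sample pages: (page text, tables found on that page).
pages = [
    (
        "Aboleth\nLarge aberration, lawful evil",
        [[["STR", "DEX", "CON", "INT", "WIS", "CHA"],
          ["21", "9", "15", "18", "15", "18"]]],
    ),
    ("Actions\nMultiattack.", []),
]
full_text = join_pages(pages)
tables_found = TABLE_RE.findall(full_text)
```

A string-level parser can then locate ability scores with `TABLE_RE` regardless of where the page break fell.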
| 152 | + |
| 153 | +**Fields extracted per entry:** |
| 154 | +- `name`: str |
| 155 | +- `size`: str |
| 156 | +- `type`: str |
| 157 | +- `alignment`: str |
| 158 | +- `armor_class`: int |
| 159 | +- `hit_points`: int |
| 160 | +- `hit_dice`: str (e.g. "10d10+30"), parsed with `parse_dice()` for comparison |
| 161 | +- `speed`: dict (walk/fly/swim/burrow/climb, each int ft) |
| 162 | +- `strength`, `dexterity`, `constitution`, `intelligence`, `wisdom`, `charisma`: int |
| 163 | +- `saving_throws`: dict[str, int] (only listed saves) |
| 164 | +- `skills`: dict[str, int] (only listed skills) |
| 165 | +- `senses`: dict (darkvision_range, blindsight_range, etc., int ft) |
| 166 | +- `languages`: list[str] |
| 167 | +- `challenge_rating`: float |
| 168 | +- `damage_immunities`: list[str] |
| 169 | +- `damage_resistances`: list[str] |
| 170 | +- `damage_vulnerabilities`: list[str] |
| 171 | +- `condition_immunities`: list[str] |
| 172 | + |
| 173 | +**Skip fields:** `traits`, `actions`, `desc` (action text is prose; not compared at this stage). |
| 174 | + |
| 175 | +**Sanity check:** Abort if fewer than 250 creatures parsed. |
| 176 | + |
| 177 | +Return `list[CreatureRecord]`. |
| 178 | + |
| 179 | +### `parsers/items.py` |
| 180 | + |
| 181 | +`extract_weapons(full_text) -> list[WeaponRecord]` |
| 182 | +`extract_armor(full_text) -> list[ArmorRecord]` |
| 183 | +`extract_items(full_text) -> list[ItemRecord]` |
| 184 | +`extract_magic_items(full_text) -> list[MagicItemRecord]` |
| 185 | + |
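| 186 | +Each function isolates its own section via `extract_section()` before parsing. The `##` / `###` headings, bold labels, and italics below describe the logical shape of each entry; `extract_text()` output carries no markdown or font styling, so entry boundaries and labels must be matched on the plain text, as for creatures. |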
| 187 | + |
| 188 | +**Weapons:** `## <Name>` blocks with labeled fields **Cost**, **Damage** (uses `parse_dice()`), |
| 189 | +**Weight**, **Properties**, **Mastery**, optionally **Range** / **Long Range**. |
| 190 | + |
| 191 | +**Armor:** `## <Name>` blocks with **AC Base** (int), **AC Add Dex** (bool), **AC Cap Dex** |
| 192 | +(int | None), **Strength Required** (int | None), **Stealth Disadvantage** (bool). |
| 193 | + |
| 194 | +**Adventuring gear:** `## <Name> (Cost)` or `## <Name>` with labeled **Cost** and **Weight** |
| 195 | +fields and a prose description. |
| 196 | + |
| 197 | +**Magic items:** `### <Name>` with a rarity/type/attunement line in italics. |
| 198 | + |
| 199 | +**Skip fields (per type):** |
| 200 | +- Weapons: `desc`, `mastery_desc` |
| 201 | +- Armor: `desc` |
| 202 | +- Items: `desc` |
| 203 | +- Magic items: `desc` |
| 204 | + |
| 205 | +**Sanity checks:** |
| 206 | +- Weapons: abort if fewer than 30 parsed |
| 207 | +- Armor: abort if fewer than 10 parsed |
| 208 | +- Items: abort if fewer than 150 parsed |
| 209 | +- Magic items: abort if fewer than 700 parsed |
| 210 | + |
| 211 | +### `compare_srd.py` (management command) |
| 212 | + |
| 213 | +```bash |
| 214 | +python manage.py compare_srd [--pdf PATH] [--document SLUG] [--entity TYPE] |
| 215 | +``` |
| 216 | + |
| 217 | +| Argument | Default | Description | |
| 218 | +|----------|---------|-------------| |
| 219 | +| `--pdf` | `data/raw_sources/srd_5_2/SRD_CC_v5.2.pdf` | Path to SRD PDF | |
| 220 | +| `--document` | `srd-2024` | Document slug to filter ORM queries | |
| 221 | +| `--entity` | `all` | `spells`, `creatures`, `weapons`, `armor`, `items`, `magic_items`, or `all` | |
| 222 | + |
| 223 | +**Execution flow:** |
| 224 | + |
| 225 | +1. Validate PDF path exists; abort with a clear error if not. |
| 226 | +2. Call `extract_full_text(pdf_path)` once in the main thread → `full_text: str`. |
| 227 | +3. Submit each requested entity type to `ThreadPoolExecutor`. Each task: |
| 228 | + a. Runs the appropriate parser on `full_text` → list of dataclass records. |
| 229 | + b. Bulk-fetches from ORM using `values()` filtered by document slug (Django gives each worker thread its own DB connection, which the task should close when done). |
| 230 | + c. Builds `dict[slug, record]` on each side using `slugify()` for keys. |
| 231 | + d. Computes three sets: `missing` (slugs in PDF not in DB), `extra` (slugs in DB not in PDF), |
| 232 | + `mismatches` (slugs in both but at least one field differs after applying skip lists and |
| 233 | + normalisation rules). |
| 234 | +4. Collect results and render with Rich: |
| 235 | + - Summary table with columns: Entity type | In PDF | In DB | Missing | Extra | Mismatches |
| 236 | + - Per-type bulleted list of missing names |
| 237 | + - Per-type bulleted list of extra names |
| 238 | + - Per-type field-mismatch table: Entity | Field | PDF value | DB value |
| 239 | + - Footer: total elapsed time |
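Step 3d above amounts to set arithmetic on the slug keys. A minimal sketch, assuming both sides have already been reduced to `dict[slug, dict_of_fields]`; plain `!=` stands in here for the field comparison rules, and `diff_by_slug` is an illustrative name.

```python
from typing import Any, Dict, FrozenSet, List, Tuple

Mismatch = Tuple[str, str, Any, Any]  # (slug, field, pdf_value, db_value)


def diff_by_slug(
    pdf: Dict[str, Dict[str, Any]],
    db: Dict[str, Dict[str, Any]],
    skip: FrozenSet[str] = frozenset(),
) -> Tuple[List[str], List[str], List[Mismatch]]:
    missing = sorted(pdf.keys() - db.keys())  # in PDF, not in DB
    extra = sorted(db.keys() - pdf.keys())    # in DB, not in PDF
    mismatches: List[Mismatch] = []
    for slug in sorted(pdf.keys() & db.keys()):
        for field in sorted(pdf[slug].keys() - skip):
            # Placeholder comparison; the real command would apply the
            # per-type field comparison rules instead of plain inequality.
            if field in db[slug] and pdf[slug][field] != db[slug][field]:
                mismatches.append((slug, field, pdf[slug][field], db[slug][field]))
    return missing, extra, mismatches


# Made-up sample data.
pdf_side = {
    "fireball": {"name": "Fireball", "level": 3, "range": 150, "desc": "pdf text"},
    "acid-arrow": {"name": "Acid Arrow", "level": 2, "range": 90, "desc": "a"},
    "shield": {"name": "Shield", "level": 1, "range": 0, "desc": "s"},
}
db_side = {
    "fireball": {"name": "Fireball", "level": 3, "range": 120, "desc": "db text"},
    "acid-arrow": {"name": "Acid Arrow", "level": 2, "range": 90, "desc": "b"},
    "wish": {"name": "Wish", "level": 9, "range": 0, "desc": "w"},
}
missing, extra, mismatches = diff_by_slug(pdf_side, db_side, skip=frozenset({"desc"}))
```

Sorting the slugs and fields keeps the report order stable between runs, which makes successive outputs easy to compare.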
| 240 | + |
| 241 | +**Field comparison rules:** |
| 242 | + |
| 243 | +| Field type | Comparison | |
| 244 | +|-----------|-----------| |
| 245 | +| `str` | `clean_text(a).lower() == clean_text(b).lower()` | |
| 246 | +| `int` | `a == b` | |
| 247 | +| `float` | `abs(a - b) < 0.001` | |
| 248 | +| `bool` | `a == b` | |
| 249 | +| `list` | `sorted(a) == sorted(b)` | |
| 250 | +| `dict` | recursive key-by-key using rules above | |
| 251 | + |
| 252 | +Fields in the skip list for an entity type are excluded from comparison entirely. |
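The comparison table could be implemented along these lines. This is a sketch under assumptions: `values_match` is an illustrative name, the inline `_clean` is a simplified stand-in for `clean_text()`, lists are assumed to hold sortable values such as strings, and `Decimal` values from the database are assumed to be compared numerically with the float tolerance.

```python
import numbers
import unicodedata
from decimal import Decimal
from typing import Any


def _clean(s: str) -> str:
    # Simplified stand-in for clean_text(): expand ligatures, drop soft
    # hyphens, collapse whitespace.
    s = unicodedata.normalize("NFKC", s).replace("\u00ad", "")
    return " ".join(s.split())


def values_match(a: Any, b: Any) -> bool:
    if isinstance(a, str) and isinstance(b, str):
        return _clean(a).lower() == _clean(b).lower()
    if isinstance(a, dict) and isinstance(b, dict):
        # Recursive key-by-key comparison; differing key sets are a mismatch.
        return a.keys() == b.keys() and all(values_match(a[k], b[k]) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        return sorted(a) == sorted(b)
    if isinstance(a, numbers.Number) and isinstance(b, numbers.Number):
        if isinstance(a, int) and isinstance(b, int):  # also covers bool
            return a == b
        # At least one side is a float (or Decimal): compare with tolerance.
        return abs(float(a) - float(b)) < 0.001
    return a == b  # None and any other types
```

For example, `values_match("Evocation ", "evocation")` is true, while `values_match({"walk": 30}, {"walk": 40})` is false because the nested `int` values differ.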
| 253 | + |
| 254 | +## ORM Queries |
| 255 | + |
| 256 | +Each entity type uses a targeted `values()` query: |
| 257 | + |
| 258 | +```python |
| 259 | +# Example: spells (document holds the --document slug, e.g. "srd-2024") |
| 260 | +Spell.objects.filter(document_id=document).values( |
| 261 | + "name", "level", "school__name", "casting_time", |
| 262 | + "range", "range_unit", "concentration", "ritual", |
| 263 | + "verbal", "somatic", "material", "duration" |
| 264 | +) |
| 265 | +``` |
| 266 | + |
| 267 | +`values()` returns flat dicts — no full model instantiation, no N+1 queries. Related fields |
| 268 | +(e.g. `school__name`) use the double-underscore traversal built into `values()`. |
| 269 | + |
| 270 | +## Illustrative Output |
| 271 | + |
| 272 | +``` |
| 273 | +┌────────────────────────────────────────────────────────────────────┐ |
| 274 | +│ SRD 5.2 PDF vs Database — document: srd-2024 │ |
| 275 | +│ Completed in 5.2s │ |
| 276 | +├─────────────────┬────────┬───────┬─────────┬───────┬──────────────┤ |
| 277 | +│ Entity type │ In PDF │ In DB │ Missing │ Extra │ Mismatches │ |
| 278 | +├─────────────────┼────────┼───────┼─────────┼───────┼──────────────┤ |
| 279 | +│ Spells │ 339 │ 339 │ 0 │ 0 │ 12 │ |
| 280 | +│ Creatures │ 330 │ 328 │ 2 │ 0 │ 23 │ |
| 281 | +│ Weapons │ 38 │ 38 │ 0 │ 0 │ 1 │ |
| 282 | +│ Armor │ 13 │ 13 │ 0 │ 0 │ 0 │ |
| 283 | +│ Items │ 203 │ 203 │ 0 │ 0 │ 4 │ |
| 284 | +│ Magic Items │ 757 │ 757 │ 0 │ 0 │ 8 │ |
| 285 | +└─────────────────┴────────┴───────┴─────────┴───────┴──────────────┘ |
| 286 | + |
| 287 | +Missing from DB — Creatures |
| 288 | + • Aboleth |
| 289 | + • Banshee |
| 290 | + |
| 291 | +Field mismatches — Spells |
| 292 | +┌──────────────┬───────────────┬──────────────┬──────────────┐ |
| 293 | +│ Spell │ Field │ PDF value │ DB value │ |
| 294 | +├──────────────┼───────────────┼──────────────┼──────────────┤ |
| 295 | +│ Fireball │ range │ 150 │ 120 │ |
| 296 | +│ Acid Arrow │ school │ evocation │ conjuration │ |
| 297 | +└──────────────┴───────────────┴──────────────┴──────────────┘ |
| 298 | +``` |
| 299 | + |
| 300 | +## Known PDF Extraction Risks |
| 301 | + |
| 302 | +- **Multi-column layouts:** D&D SRD PDFs sometimes typeset creature stat blocks in two columns. |
| 303 | + pdfplumber's default text extraction reads left-to-right across the full page width and may |
| 304 | + interleave text from both columns. Mitigate by testing on a sample of creature pages early and, |
| 305 | + if needed, cropping each column to its own bounding box (`page.crop()`) before extracting text. Tuning `x_tolerance` / `y_tolerance` only changes how characters are grouped into words and lines, so it is unlikely to fix column interleaving on its own. |
| 306 | + |
| 307 | +- **Page-break mid-block:** Stat blocks that span a page break rely on concatenation of page |
| 308 | + text. This works for prose sections; ability score tables (detected per page with |
| 309 | + pdfplumber's table extractor) are inserted with a sentinel delimiter so that they survive concatenation. |
| 310 | + |
| 311 | +- **Ligatures:** Common in PDF-embedded fonts (fi, fl, ffi → single glyphs). `clean_text()` |
| 312 | + handles these via Unicode normalization + explicit substitution before any string comparison. |
| 313 | + |
| 314 | +## Implementation Order |
| 315 | + |
| 316 | +Build and verify one entity type before moving to the next: |
| 317 | + |
| 318 | +1. **Spells** — most consistent PDF formatting, 339 records, easy to spot-check manually |
| 319 | +2. **Creatures** — complex stat blocks; test pdfplumber table extraction on a sample first |
| 320 | +3. **Weapons & Armor** — table-like format, small record counts |
| 321 | +4. **Adventuring Gear / Items** — larger set, varied formats |
| 322 | +5. **Magic Items** — 757 records, mostly name + rarity + attunement |
| 323 | + |
| 324 | +## Verification |
| 325 | + |
| 326 | +```bash |
| 327 | +# Full compare against live DB |
| 328 | +python manage.py compare_srd |
| 329 | + |
| 330 | +# Spot-check a single type |
| 331 | +python manage.py compare_srd --entity spells |
| 332 | + |
| 333 | +# Against a different document (future use) |
| 334 | +python manage.py compare_srd --document srd-2014 |
| 335 | +``` |
| 336 | + |
| 337 | +Expected: runtime under 10 seconds; zero missing records for a correct import; field mismatches |
| 338 | +surface real data quality issues in the existing fixtures. |
| 339 | + |
| 340 | +## Future Work (out of scope) |
| 341 | + |
| 342 | +- `manage.py generate_fixtures --pdf ...` to replace the markdown conversion scripts |
| 343 | +- `--output json` flag for CI (straightforward addition once comparison logic exists) |
| 344 | +- Cross-document diff mode (`--compare srd-2014 srd-2024`) to see what changed between versions |