Add SRD PDF vs database comparison tool design spec

augustjohnson · claude · augustjohnson · commit 93f6fbb88ea0 · 2026-05-22T16:49:18.000-05:00
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
diff --git a/docs/superpowers/specs/2026-05-22-srd-pdf-compare-design.md b/docs/superpowers/specs/2026-05-22-srd-pdf-compare-design.md
@@ -0,0 +1,344 @@
+# SRD PDF vs Database Comparison Tool
+
+**Date:** 2026-05-22  
+**Status:** Approved
+
+## Context
+
+The open5e-api project stores D&D SRD 5.2 content as Django fixture JSON files under
+`data/v2/wizards-of-the-coast/srd-2024/` (8,580 records across 33 files), which are loaded
+into the database via `manage.py import`. These fixtures were generated by five markdown-parsing
+scripts in `data/raw_sources/srd_5_2/scripts/`.
+
+The source of truth is the official SRD PDF at
+`data/raw_sources/srd_5_2/SRD_CC_v5.2.pdf` (5.7 MB, CC-BY-4.0). There is currently no
+automated way to verify that the database faithfully represents the PDF — missing records and
+mis-parsed field values are invisible until manually noticed.
+
+**Goal:** A fast management command that extracts entities directly from the PDF and
+compares them against the database, reporting both missing records and field-value mismatches
+with a Rich terminal display. Target runtime: under 10 seconds for all entity types.
+
+This PDF parsing layer also serves as the foundation for replacing the markdown-based conversion
+scripts with a direct PDF → fixture generation pipeline in a future iteration.
+
+## Architecture
+
+```
+SRD_CC_v5.2.pdf
+  │
+  └─ (pdfplumber, main thread) ─→ full_text: str
+                                        │
+        ┌───────────────────────────────┼───────────────────────┐
+        ▼                               ▼                       ▼
+  spells.py(full_text)        creatures.py(full_text)    items.py(full_text)
+  [ThreadPoolExecutor — text strings are passed, not the open PDF object]
+        │                               │                       │
+        └───────────────────────────────┼───────────────────────┘
+                                        ▼
+                              ORM .values() queries
+                                        │
+                                        ▼
+                              Rich terminal report
+```
+
+**Key constraint:** pdfplumber is not thread-safe when sharing an open PDF object. The
+management command extracts all page text to a single `str` in the main thread first, then
+passes that string to parsers running in a `ThreadPoolExecutor`. Each parser operates only
+on a plain string — no shared file handles.
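+
+A minimal sketch of this extract-then-fan-out pattern, using only the standard library. The
+name `run_parsers` and the two stand-in parsers are hypothetical; in the real command the
+callables would be `extract_spells`, `extract_creatures`, and so on, and `full_text` would come
+from `extract_full_text()`.
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+
+
+def run_parsers(full_text, parsers):
+    """Run every parser on the same plain string, one worker thread per parser."""
+    with ThreadPoolExecutor(max_workers=len(parsers) or 1) as pool:
+        # Only the string is handed to the workers, never an open PDF object.
+        futures = {name: pool.submit(fn, full_text) for name, fn in parsers.items()}
+        # .result() re-raises any exception raised inside a worker.
+        return {name: fut.result() for name, fut in futures.items()}
+
+
+# Stand-in parsers (hypothetical) that only need a string.
+def count_lines(text):
+    return len(text.splitlines())
+
+
+def count_words(text):
+    return len(text.split())
+
+
+results = run_parsers(
+    "Fireball\nLevel 3 Evocation",
+    {"lines": count_lines, "words": count_words},
+)
+```
+
+Because a `str` is immutable, the workers can read it concurrently without locking.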
+
+**Module layout:**
+```
+data/raw_sources/srd_5_2/parsers/
+    __init__.py
+    base.py        shared: text cleaning, slug generation, field normalisation
+    spells.py      parse spell blocks from extracted PDF text
+    creatures.py   parse creature stat blocks
+    items.py       weapons, armor, adventuring gear, magic items
+
+api_v2/management/commands/
+    compare_srd.py  orchestrates extraction, ORM queries, diff, Rich render
+```
+
+### New dependencies
+
+Add to `pyproject.toml` (runtime):
+
+| Package | Purpose | License |
+|---------|---------|---------|
+| `pdfplumber>=0.11` | PDF text + table extraction | MIT |
+| `rich>=13` | terminal tables and panels | MIT |
+
+## Component Details
+
+### `parsers/base.py`
+
+Shared utilities used by all section parsers. All parsers receive `full_text: str` (the
+concatenated text of all PDF pages), not a file path.
+
+- `extract_full_text(pdf_path) -> str` — the single point that opens pdfplumber, iterates
+  pages, concatenates `page.extract_text()` output (with per-page ability score tables inlined
+  behind a delimiter, as described under `parsers/creatures.py`), and closes the file. Called
+  once in the main thread before any parallelism.
+- `extract_section(full_text, start_re, end_re) -> str` — return the substring of `full_text`
+  between the first match of `start_re` and `end_re` (compiled regex patterns). Used by each
+  parser to isolate its section before splitting on individual entries.
+- `clean_text(s) -> str` — expand ligatures (fi, fl, ffi, ffl rendered as single glyphs) back
+  to their component letters, normalize whitespace, remove soft hyphens, strip leading/trailing
+  whitespace.
+- `slugify(name) -> str` — lowercase, replace spaces with hyphens, strip non-alphanumeric
+  except hyphens. Used as the dict key for name-based matching.
+- `parse_cost(s) -> dict | None` — parse "10 gp" / "5 sp" / "2 cp" into
+  `{"amount": 10, "unit": "gp"}`. Returns None if no match. Used for the `cost` field of
+  weapons, armor (where present), and gear.
+- `parse_dice(s) -> dict | None` — parse "2d6+3" into `{"count": 2, "die": 6, "bonus": 3}`.
+  Called by creature `hit_dice` field and weapon `damage` field.
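+
+The helpers above could be sketched roughly as follows. This is an illustration under stated
+assumptions, not the final implementation: the ligature table, the extra coin units (`ep`,
+`pp`), a default `bonus` of 0, and `extract_section` excluding the start heading itself are all
+choices the spec does not fix.
+
+```python
+import re
+
+# Assumed ligature code points (Unicode "Alphabetic Presentation Forms").
+LIGATURES = {"\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl", "\ufb03": "ffi", "\ufb04": "ffl"}
+
+
+def clean_text(s):
+    for glyph, letters in LIGATURES.items():
+        s = s.replace(glyph, letters)
+    s = s.replace("\u00ad", "")  # drop soft hyphens
+    return re.sub(r"\s+", " ", s).strip()
+
+
+def slugify(name):
+    slug = clean_text(name).lower().replace(" ", "-")
+    return re.sub(r"[^a-z0-9-]", "", slug)
+
+
+def parse_cost(s):
+    # "ep" and "pp" are assumptions; the spec only lists gp, sp, and cp.
+    m = re.search(r"(\d+)\s*(cp|sp|ep|gp|pp)\b", s, re.IGNORECASE)
+    if m is None:
+        return None
+    return {"amount": int(m.group(1)), "unit": m.group(2).lower()}
+
+
+def parse_dice(s):
+    m = re.fullmatch(r"\s*(\d+)d(\d+)\s*(?:([+-])\s*(\d+))?\s*", s)
+    if m is None:
+        return None
+    bonus = int(m.group(4)) if m.group(4) else 0  # assumed default when no modifier
+    if m.group(3) == "-":
+        bonus = -bonus
+    return {"count": int(m.group(1)), "die": int(m.group(2)), "bonus": bonus}
+
+
+def extract_section(full_text, start_re, end_re):
+    start = start_re.search(full_text)
+    if start is None:
+        return ""
+    # Search for the end marker only after the start marker.
+    end = end_re.search(full_text, start.end())
+    stop = end.start() if end else len(full_text)
+    return full_text[start.end():stop]
+```
+
+For example, `slugify("Melf's Acid Arrow")` gives `"melfs-acid-arrow"`, so an apostrophe
+difference between the PDF and the database does not break name matching.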
+
+### `parsers/spells.py`
+
+`extract_spells(full_text: str) -> list[SpellRecord]`
+
+**Section isolation:** Use `extract_section()` with `start_re` matching the spells chapter
+heading and `end_re` matching the next chapter heading. This confines parsing to the spells
+section and avoids false matches on spell-name-like strings elsewhere in the PDF.
+
+**Entry splitting:** Split on the heading pattern that marks individual spell entries. The
+SRD PDF uses a consistent bold/large-text heading for each spell name (rendered in extracted
+text as a line that matches `r"^[A-Z][A-Za-z '/-]+$"` at the start of a block). Validate
+each candidate entry has at least a level/school line before accepting it as a spell.
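+
+A rough sketch of this split-and-validate step. The heading regex is the one quoted above; the
+shape of the level/school line (`Level 3 Evocation (...)` or `Evocation Cantrip (...)`) is an
+assumption about the extracted text and should be checked against real PDF output. The names
+`split_spell_entries` and `parse_level_school` are hypothetical.
+
+```python
+import re
+
+HEADING_RE = re.compile(r"^[A-Z][A-Za-z '/-]+$")
+# Assumed level/school line shapes; verify against the real extracted text.
+LEVEL_RE = re.compile(r"^(?:Level (\d) (\w+)|(\w+) Cantrip)\b")
+
+
+def parse_level_school(line):
+    m = LEVEL_RE.match(line)
+    if m is None:
+        return None
+    if m.group(1) is not None:
+        return int(m.group(1)), m.group(2).lower()
+    return 0, m.group(3).lower()  # cantrips are level 0
+
+
+def split_spell_entries(section):
+    lines = section.splitlines()
+    entries = []
+    for i, line in enumerate(lines):
+        next_line = lines[i + 1] if i + 1 < len(lines) else ""
+        # A heading only counts when the following line is a level/school line.
+        if HEADING_RE.match(line) and LEVEL_RE.match(next_line):
+            entries.append({"name": line.strip(), "body": []})
+        elif entries:
+            entries[-1]["body"].append(line)
+    return entries
+
+
+sample = "\n".join([
+    "Fire Bolt",
+    "Evocation Cantrip (Sorcerer, Wizard)",
+    "Casting Time: Action",
+    "Fireball",
+    "Level 3 Evocation (Sorcerer, Wizard)",
+    "Casting Time: Action",
+    "A bright streak flashes",
+])
+spells = split_spell_entries(sample)
+```
+
+The last sample line looks like a heading to `HEADING_RE` alone; the look-ahead at the next
+line keeps it inside the Fireball body instead of starting a bogus entry.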
+
+**Fields extracted per entry:**
+- `name`: str
+- `level`: int (0 = cantrip)
+- `school`: str (abjuration | conjuration | divination | enchantment | evocation | illusion | necromancy | transmutation)
+- `casting_time`: str
+- `range_text`: str
+- `verbal`: bool
+- `somatic`: bool
+- `material`: bool
+- `material_specified`: str | None
+- `duration`: str
+- `concentration`: bool
+- `ritual`: bool
+- `higher_level`: str | None
+
+**Skip fields** (not compared against DB — expected to differ in whitespace/formatting):
+- `desc`
+
+**Sanity check:** If fewer than 300 spells are parsed, abort with a descriptive error
+(`ValueError`) rather than silently producing a misleading comparison.
+
+Return `list[SpellRecord]` (frozen dataclass).
+
+### `parsers/creatures.py`
+
+`extract_creatures(full_text: str) -> list[CreatureRecord]`
+
+**Section isolation:** Use `extract_section()` to isolate the monsters A–Z chapter (and the
+animals appendix if present) before splitting on creature headings.
+
+**Entry detection:** pdfplumber's `extract_text()` returns plain text — there are no markdown
+`##` markers. In the SRD PDF, creature names appear as a line of title-case text (e.g.,
+`"Adult Black Dragon"`) with no surrounding punctuation, followed immediately by a line
+containing comma-separated size, type, and alignment (e.g., `"Huge dragon, chaotic evil"`).
+The two-line pattern `r"^[A-Z][A-Za-z '\-]+$"` followed by a line matching
+`r"^(Tiny|Small|Medium|Large|Huge|Gargantuan)\s+\w+"` is the creature-entry boundary.
+This two-line heuristic disambiguates creature names from chapter headings (which are not
+followed by a size/type/alignment line).
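+
+The two-line heuristic can be sketched as below, using the two patterns quoted above. The
+function name `find_creature_starts` and the sample lines are illustrative only.
+
+```python
+import re
+
+NAME_RE = re.compile(r"^[A-Z][A-Za-z '\-]+$")
+SIZE_RE = re.compile(r"^(Tiny|Small|Medium|Large|Huge|Gargantuan)\s+\w+")
+
+
+def find_creature_starts(lines):
+    """Return indexes of name lines that are followed by a size/type line."""
+    return [
+        i
+        for i in range(len(lines) - 1)
+        if NAME_RE.match(lines[i]) and SIZE_RE.match(lines[i + 1])
+    ]
+
+
+lines = [
+    "Monsters A to Z",            # chapter heading: no size line follows
+    "Adult Black Dragon",
+    "Huge dragon, chaotic evil",
+    "Armor Class 19",
+    "Aboleth",
+    "Large aberration, lawful evil",
+]
+starts = find_creature_starts(lines)
+```
+
+Each stat block then runs from one start index up to the next, which gives the per-creature
+text that the field extractors work on.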
+
+**Page-break handling:** `extract_full_text()` concatenates all page text before parsing, so
+stat blocks that span a page break appear as contiguous text. The ability score table
+(`STR | DEX | CON | INT | WIS | CHA`) is extracted using pdfplumber's table extractor
+*per page* during `extract_full_text()` and inserted inline with a recognizable delimiter,
+so the string-level parsers can find it reliably.
+
+**Fields extracted per entry:**
+- `name`: str
+- `size`: str
+- `type`: str
+- `alignment`: str
+- `armor_class`: int
+- `hit_points`: int
+- `hit_dice`: str (e.g. "10d10+30") — uses `parse_dice()`
+- `speed`: dict (walk/fly/swim/burrow/climb, each int ft)
+- `strength`, `dexterity`, `constitution`, `intelligence`, `wisdom`, `charisma`: int
+- `saving_throws`: dict[str, int] (only listed saves)
+- `skills`: dict[str, int] (only listed skills)
+- `senses`: dict (darkvision_range, blindsight_range, etc., int ft)
+- `languages`: list[str]
+- `challenge_rating`: float
+- `damage_immunities`: list[str]
+- `damage_resistances`: list[str]
+- `damage_vulnerabilities`: list[str]
+- `condition_immunities`: list[str]
+
+**Skip fields:** `traits`, `actions`, `desc` (action text is prose; not compared at this stage).
+
+**Sanity check:** Abort if fewer than 250 creatures parsed.
+
+Return `list[CreatureRecord]`.
+
+### `parsers/items.py`
+
+`extract_weapons(full_text) -> list[WeaponRecord]`  
+`extract_armor(full_text) -> list[ArmorRecord]`  
+`extract_items(full_text) -> list[ItemRecord]`  
+`extract_magic_items(full_text) -> list[MagicItemRecord]`
+
+Each function isolates its own section via `extract_section()` before parsing.
+
+**Weapons:** `## <Name>` blocks with labeled fields **Cost**, **Damage** (uses `parse_dice()`),
+**Weight**, **Properties**, **Mastery**, optionally **Range** / **Long Range**.
+
+**Armor:** `## <Name>` blocks with **AC Base** (int), **AC Add Dex** (bool), **AC Cap Dex**
+(int | None), **Strength Required** (int | None), **Stealth Disadvantage** (bool).
+
+**Adventuring gear:** `## <Name> (Cost)` or `## <Name>` with labeled **Cost** and **Weight**
+fields and a prose description.
+
+**Magic items:** `### <Name>` with a rarity/type/attunement line in italics.
+
+**Skip fields (per type):**
+- Weapons: `desc`, `mastery_desc`
+- Armor: `desc`
+- Items: `desc`
+- Magic items: `desc`
+
+**Sanity checks:**
+- Weapons: abort if fewer than 30 parsed
+- Armor: abort if fewer than 10 parsed
+- Items: abort if fewer than 150 parsed
+- Magic items: abort if fewer than 700 parsed
+
+### `compare_srd.py` (management command)
+
+```bash
+python manage.py compare_srd [--pdf PATH] [--document SLUG] [--entity TYPE]
+```
+
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--pdf` | `data/raw_sources/srd_5_2/SRD_CC_v5.2.pdf` | Path to SRD PDF |
+| `--document` | `srd-2024` | Document slug to filter ORM queries |
+| `--entity` | `all` | `spells`, `creatures`, `weapons`, `armor`, `items`, `magic_items`, or `all` |
+
+**Execution flow:**
+
+1. Validate PDF path exists; abort with a clear error if not.
+2. Call `extract_full_text(pdf_path)` once in the main thread → `full_text: str`.
+3. Submit each requested entity type to `ThreadPoolExecutor`. Each task:
+   a. Runs the appropriate parser on `full_text` → list of dataclass records.
+   b. Bulk-fetches from ORM using `values()` filtered by document slug.
+   c. Builds `dict[slug, record]` on each side using `slugify()` for keys.
+   d. Computes three sets: `missing` (slugs in PDF not in DB), `extra` (slugs in DB not in PDF),
+      `mismatches` (slugs in both but at least one field differs after applying skip lists and
+      normalisation rules).
+4. Collect results and render with Rich:
+   - Summary table with columns: Entity type | In PDF | In DB | Missing | Extra | Mismatches
+   - Per-type bulleted list of missing names
+   - Per-type bulleted list of extra names
+   - Per-type field-mismatch table: Entity | Field | PDF value | DB value
+   - Footer: total elapsed time
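+
+Step 3d can be sketched as plain set arithmetic over the two slug-keyed dicts. `diff_records`
+is a hypothetical name, and the field check here is a plain `!=` standing in for the
+normalisation rules, which are applied per field type in the real command.
+
+```python
+def diff_records(pdf_records, db_records, skip_fields=frozenset()):
+    """Compare two dict[slug, dict] mappings and report the three result sets."""
+    pdf_keys, db_keys = set(pdf_records), set(db_records)
+    missing = sorted(pdf_keys - db_keys)  # in PDF, not in DB
+    extra = sorted(db_keys - pdf_keys)    # in DB, not in PDF
+    mismatches = []
+    for slug in sorted(pdf_keys & db_keys):
+        pdf_rec, db_rec = pdf_records[slug], db_records[slug]
+        for field in sorted(pdf_rec.keys() - skip_fields):
+            # Plain inequality is a stand-in for the field comparison rules.
+            if field in db_rec and pdf_rec[field] != db_rec[field]:
+                mismatches.append((slug, field, pdf_rec[field], db_rec[field]))
+    return missing, extra, mismatches
+
+
+pdf = {
+    "fireball": {"name": "Fireball", "range": 150, "desc": "pdf wording"},
+    "aboleth": {"name": "Aboleth"},
+}
+db = {
+    "fireball": {"name": "Fireball", "range": 120, "desc": "db wording"},
+    "homebrew": {"name": "Homebrew"},
+}
+missing, extra, mismatches = diff_records(pdf, db, skip_fields={"desc"})
+```
+
+Each mismatch tuple maps directly onto one row of the Entity | Field | PDF value | DB value
+table.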
+
+**Field comparison rules:**
+
+| Field type | Comparison |
+|-----------|-----------|
+| `str` | `clean_text(a).lower() == clean_text(b).lower()` |
+| `int` | `a == b` |
+| `float` | `abs(a - b) < 0.001` |
+| `bool` | `a == b` |
+| `list` | `sorted(a) == sorted(b)` |
+| `dict` | recursive key-by-key using rules above |
+
+Fields in the skip list for an entity type are excluded from comparison entirely.
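+
+One way to express the rules in the table as a single recursive function. This is a sketch:
+`fields_equal` is a hypothetical name, `_norm` is a simplified stand-in for
+`clean_text(...).lower()`, and treating values of different types as unequal is an assumption.
+
+```python
+import re
+
+
+def _norm(s):
+    # Simplified stand-in for clean_text(s).lower().
+    return re.sub(r"\s+", " ", s).strip().lower()
+
+
+def fields_equal(a, b):
+    # bool is tested first because bool is a subclass of int in Python.
+    if isinstance(a, bool) or isinstance(b, bool):
+        return isinstance(a, bool) and isinstance(b, bool) and a == b
+    if isinstance(a, str) and isinstance(b, str):
+        return _norm(a) == _norm(b)
+    if isinstance(a, float) or isinstance(b, float):
+        both_numeric = isinstance(a, (int, float)) and isinstance(b, (int, float))
+        return both_numeric and abs(a - b) < 0.001
+    if isinstance(a, dict) and isinstance(b, dict):
+        return a.keys() == b.keys() and all(fields_equal(a[k], b[k]) for k in a)
+    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
+        return sorted(a) == sorted(b)
+    return a == b
+```
+
+With this, `fields_equal({"walk": 30, "fly": 60}, {"walk": 30, "fly": 80})` is false, while
+`fields_equal(["fire", "cold"], ["cold", "fire"])` is true because list order is ignored.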
+
+## ORM Queries
+
+Each entity type uses a targeted `values()` query:
+
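+```python
+from api_v2.models import Spell  # assumed import path within the api_v2 app
+
+# Example: spells. `document` is the slug passed via --document.
+document = "srd-2024"
+spells = Spell.objects.filter(document_id=document).values(
+    "name", "level", "school__name", "casting_time",
+    "range", "range_unit", "concentration", "ritual",
+    "verbal", "somatic", "material", "duration",
+)
+```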
+```
+
+`values()` returns flat dicts — no full model instantiation, no N+1 queries. Related fields
+(e.g. `school__name`) use the double-underscore traversal built into `values()`.
+
+## Illustrative Output
+
+```
+┌────────────────────────────────────────────────────────────────────┐
+│  SRD 5.2 PDF vs Database — document: srd-2024                      │
+│  Completed in 5.2s                                                 │
+├─────────────────┬────────┬───────┬─────────┬───────┬──────────────┤
+│ Entity type     │ In PDF │ In DB │ Missing │ Extra │  Mismatches  │
+├─────────────────┼────────┼───────┼─────────┼───────┼──────────────┤
+│ Spells          │    339 │   339 │       0 │     0 │           12 │
+│ Creatures       │    330 │   328 │       2 │     0 │           23 │
+│ Weapons         │     38 │    38 │       0 │     0 │            1 │
+│ Armor           │     13 │    13 │       0 │     0 │            0 │
+│ Items           │    203 │   203 │       0 │     0 │            4 │
+│ Magic Items     │    757 │   757 │       0 │     0 │            8 │
+└─────────────────┴────────┴───────┴─────────┴───────┴──────────────┘
+
+Missing from DB — Creatures
+  • Aboleth
+  • Banshee
+
+Field mismatches — Spells
+┌──────────────┬───────────────┬──────────────┬──────────────┐
+│ Spell        │ Field         │ PDF value    │ DB value     │
+├──────────────┼───────────────┼──────────────┼──────────────┤
+│ Fireball     │ range         │ 150          │ 120          │
+│ Acid Arrow   │ school        │ evocation    │ conjuration  │
+└──────────────┴───────────────┴──────────────┴──────────────┘
+```
+
+## Known PDF Extraction Risks
+
+- **Multi-column layouts:** D&D SRD PDFs sometimes typeset creature stat blocks in two columns.
+  pdfplumber's default text extraction reads left-to-right across the full page width and may
+  interleave text from both columns. Mitigate by testing on a sample of creature pages early and
+  using pdfplumber's `x_tolerance` / `y_tolerance` parameters or bounding-box cropping if needed.
+
+- **Page-break mid-block:** Stat blocks that span a page break rely on concatenation of page
+  text. This works for prose, but ability score tables are detected per page with pdfplumber's
+  table extractor, so they are inserted into the text with a sentinel delimiter to survive
+  concatenation.
+
+- **Ligatures:** Common in PDF-embedded fonts (fi, fl, ffi → single glyphs). `clean_text()`
+  handles these via Unicode normalization + explicit substitution before any string comparison.
+
+## Implementation Order
+
+Build and verify one entity type before moving to the next:
+
+1. **Spells** — most consistent PDF formatting, 339 records, easy to spot-check manually
+2. **Creatures** — complex stat blocks; test pdfplumber table extraction on a sample first
+3. **Weapons & Armor** — table-like format, small record counts
+4. **Adventuring Gear / Items** — larger set, varied formats
+5. **Magic Items** — 757 records, mostly name + rarity + attunement
+
+## Verification
+
+```bash
+# Full compare against live DB
+python manage.py compare_srd
+
+# Spot-check a single type
+python manage.py compare_srd --entity spells
+
+# Against a different document (future use)
+python manage.py compare_srd --document srd-2014
+```
+
+Expected: runtime under 10 seconds; zero missing records for a correct import; field mismatches
+surface real data quality issues in the existing fixtures.
+
+## Future Work (out of scope)
+
+- `manage.py generate_fixtures --pdf ...` to replace the markdown conversion scripts
+- `--output json` flag for CI (straightforward addition once comparison logic exists)
+- Cross-document diff mode (`--compare srd-2014 srd-2024`) to see what changed between versions