Commit 93f6fbb

augustjohnson and claude committed
Add SRD PDF vs database comparison tool design spec
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 0eac2c0 commit 93f6fbb

1 file changed

Lines changed: 344 additions & 0 deletions

@@ -0,0 +1,344 @@
# SRD PDF vs Database Comparison Tool

**Date:** 2026-05-22
**Status:** Approved

## Context

The open5e-api project stores D&D SRD 5.2 content as Django fixture JSON files under
`data/v2/wizards-of-the-coast/srd-2024/` (8,580 records across 33 files), which are loaded
into the database via `manage.py import`. These fixtures were generated by five markdown-parsing
scripts in `data/raw_sources/srd_5_2/scripts/`.

The source of truth is the official SRD PDF at
`data/raw_sources/srd_5_2/SRD_CC_v5.2.pdf` (5.7 MB, CC-BY-4.0). There is currently no
automated way to verify that the database faithfully represents the PDF: missing records and
mis-parsed field values go undetected until someone notices them manually.

**Goal:** A fast management command that extracts entities directly from the PDF and
compares them against the database, reporting both missing records and field-value mismatches
in a Rich terminal display. Target runtime: under 10 seconds for all entity types.

This PDF parsing layer also serves as the foundation for replacing the markdown-based conversion
scripts with a direct PDF → fixture generation pipeline in a future iteration.

## Architecture

```
SRD_CC_v5.2.pdf
      │
      └─ (pdfplumber, main thread) ─→ full_text: str
                                          │
          ┌───────────────────────────────┼───────────────────────┐
          ▼                               ▼                       ▼
spells.py(full_text)           creatures.py(full_text)   items.py(full_text)
   [ThreadPoolExecutor: text strings are passed, not the open PDF object]
          │                               │                       │
          └───────────────────────────────┼───────────────────────┘
                                          ▼
                                ORM .values() queries
                                          │
                                          ▼
                                Rich terminal report
```

**Key constraint:** pdfplumber is not thread-safe when sharing an open PDF object. The
management command therefore extracts all page text to a single `str` in the main thread first,
then passes that string to parsers running in a `ThreadPoolExecutor`. Each parser operates only
on a plain string, with no shared file handles.

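
The fan-out described above can be sketched as follows. This is a minimal illustration, not the command's actual code: the two parser functions are hypothetical stand-ins that filter lines of a toy string, and `run_parsers` is an assumed helper name. The point is only that worker threads receive an immutable `str`, never a PDF handle.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


# Hypothetical stand-ins for the real parsers; each takes only a plain string.
def extract_spells(full_text: str) -> list[str]:
    return [line for line in full_text.splitlines() if line.startswith("Spell:")]


def extract_creatures(full_text: str) -> list[str]:
    return [line for line in full_text.splitlines() if line.startswith("Creature:")]


PARSERS: dict[str, Callable[[str], list[str]]] = {
    "spells": extract_spells,
    "creatures": extract_creatures,
}


def run_parsers(full_text: str, entity_types: list[str]) -> dict[str, list[str]]:
    """Run each requested parser in a worker thread on the same string."""
    with ThreadPoolExecutor(max_workers=len(entity_types) or 1) as pool:
        futures = {name: pool.submit(PARSERS[name], full_text) for name in entity_types}
        # result() re-raises any exception from the worker, so parser errors surface here.
        return {name: future.result() for name, future in futures.items()}


sample_text = "Spell: Fireball\nCreature: Aboleth\nSpell: Shield"
results = run_parsers(sample_text, ["spells", "creatures"])
```

Because strings are immutable in Python, no locking is needed around `full_text`; the only shared mutable state is the result dictionary built in the main thread.
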
**Module layout:**
```
data/raw_sources/srd_5_2/parsers/
    __init__.py
    base.py          shared: text cleaning, slug generation, field normalisation
    spells.py        parse spell blocks from extracted PDF text
    creatures.py     parse creature stat blocks
    items.py         weapons, armor, adventuring gear, magic items

api_v2/management/commands/
    compare_srd.py   orchestrates extraction, ORM queries, diff, Rich render
```

### New dependencies

Add to `pyproject.toml` (runtime):

| Package | Purpose | License |
|---------|---------|---------|
| `pdfplumber>=0.11` | PDF text + table extraction | MIT |
| `rich>=13` | terminal tables and panels | MIT |

## Component Details

### `parsers/base.py`

Shared utilities used by all section parsers. All parsers receive `full_text: str` (the
concatenated text of all PDF pages), not a file path.

- `extract_full_text(pdf_path) -> str`: the single point that opens pdfplumber, iterates
  pages, concatenates `page.extract_text()` output, and closes the file. Called once in the
  main thread before any parallelism.
- `extract_section(full_text, start_re, end_re) -> str`: return the substring of `full_text`
  between the first match of `start_re` and the first subsequent match of `end_re` (both
  compiled regex patterns). Used by each parser to isolate its section before splitting on
  individual entries.
- `clean_text(s) -> str`: expand ligatures (fi, fl, ffi, ffl rendered as single glyphs) into
  their component letters, normalize whitespace, remove soft hyphens, strip leading/trailing
  whitespace.
- `slugify(name) -> str`: lowercase, replace spaces with hyphens, strip non-alphanumeric
  characters except hyphens. Used as the dict key for name-based matching.
- `parse_cost(s) -> dict | None`: parse "10 gp" / "5 sp" / "2 cp" into
  `{"amount": 10, "unit": "gp"}`. Returns None if no match. Used for the weapon `cost` field,
  the armor `cost` field (where present), and the gear `cost` field.
- `parse_dice(s) -> dict | None`: parse "2d6+3" into `{"count": 2, "die": 6, "bonus": 3}`.
  Used for the creature `hit_dice` field and the weapon `damage` field.

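
A minimal sketch of the pure-string helpers listed above, assuming standard-library tools only. The regexes, the accepted currency units, and the choice of NFKC normalization are illustrative assumptions; NFKC also rewrites other compatibility characters (for example non-breaking spaces), which may or may not be wanted.

```python
import re
import unicodedata
from typing import Optional

SOFT_HYPHEN = "\u00ad"


def clean_text(s: str) -> str:
    # NFKC expands compatibility ligatures such as U+FB01 into plain letters ("fi").
    s = unicodedata.normalize("NFKC", s)
    s = s.replace(SOFT_HYPHEN, "")
    return re.sub(r"\s+", " ", s).strip()


def slugify(name: str) -> str:
    s = clean_text(name).lower().replace(" ", "-")
    return re.sub(r"[^a-z0-9-]", "", s)


def extract_section(full_text: str, start_re: re.Pattern, end_re: re.Pattern) -> str:
    start = start_re.search(full_text)
    if start is None:
        raise ValueError(f"section start not found: {start_re.pattern!r}")
    # Search for the end marker only after the start marker.
    end = end_re.search(full_text, start.end())
    return full_text[start.end() : end.start() if end else len(full_text)]


def parse_cost(s: str) -> Optional[dict]:
    m = re.search(r"(\d[\d,]*)\s*(cp|sp|ep|gp|pp)\b", s, re.IGNORECASE)
    if m is None:
        return None
    return {"amount": int(m.group(1).replace(",", "")), "unit": m.group(2).lower()}


def parse_dice(s: str) -> Optional[dict]:
    m = re.fullmatch(r"\s*(\d+)d(\d+)\s*(?:([+-])\s*(\d+))?\s*", s)
    if m is None:
        return None
    bonus = int(m.group(4)) if m.group(4) else 0
    if m.group(3) == "-":
        bonus = -bonus
    return {"count": int(m.group(1)), "die": int(m.group(2)), "bonus": bonus}
```

Keeping these helpers free of any pdfplumber or Django import means they can be unit-tested with plain strings.
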
### `parsers/spells.py`

`extract_spells(full_text: str) -> list[SpellRecord]`

**Section isolation:** Use `extract_section()` with `start_re` matching the spells chapter
heading and `end_re` matching the next chapter heading. This confines parsing to the spells
section and avoids false matches on spell-name-like strings elsewhere in the PDF.

**Entry splitting:** Split on the heading pattern that marks individual spell entries. The
SRD PDF uses a consistent bold/large-text heading for each spell name (rendered in extracted
text as a line that matches `r"^[A-Z][A-Za-z '/-]+$"` at the start of a block). Validate
each candidate entry has at least a level/school line before accepting it as a spell.

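
One way to sketch the split-then-validate step is shown below. The heading regex is the one quoted above; the shape of the level/school line (`Level 2 Evocation (...)` or `Evocation Cantrip (...)`) is an assumption that should be checked against the real extracted text, and `split_spell_entries` is a hypothetical helper name.

```python
import re

HEADING_RE = re.compile(r"^[A-Z][A-Za-z '/-]+$")
# Assumed shape of the level/school line; verify against real extracted text.
LEVEL_SCHOOL_RE = re.compile(r"^(?:Level \d [A-Za-z]+|[A-Za-z]+ Cantrip)\b")


def split_spell_entries(section: str) -> list[tuple[str, list[str]]]:
    """Return (name, body_lines) pairs; a heading counts only if a level line follows."""
    lines = [ln.strip() for ln in section.splitlines() if ln.strip()]
    entries: list[tuple[str, list[str]]] = []
    for i, line in enumerate(lines):
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        if HEADING_RE.match(line) and LEVEL_SCHOOL_RE.match(next_line):
            entries.append((line, []))  # validated spell heading starts a new entry
        elif entries:
            entries[-1][1].append(line)  # everything else belongs to the current entry
    return entries


# Made-up sample in the assumed layout.
sample_section = """Acid Arrow
Level 2 Evocation (Wizard)
Casting Time: Action
Range: 90 feet
Fire Bolt
Evocation Cantrip (Sorcerer, Wizard)
Casting Time: Action
"""
entries = split_spell_entries(sample_section)
```

Lines that look like headings but are not followed by a level/school line are treated as body text, which is the validation the paragraph above asks for.
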
**Fields extracted per entry:**
- `name`: str
- `level`: int (0 = cantrip)
- `school`: str (abjuration | conjuration | divination | enchantment | evocation | illusion | necromancy | transmutation)
- `casting_time`: str
- `range_text`: str
- `verbal`: bool
- `somatic`: bool
- `material`: bool
- `material_specified`: str | None
- `duration`: str
- `concentration`: bool
- `ritual`: bool
- `higher_level`: str | None

**Skip fields** (not compared against DB; expected to differ in whitespace/formatting):
- `desc`

**Sanity check:** If fewer than 300 spells are parsed, abort with a descriptive error
(`ValueError`) rather than silently producing a misleading comparison.

Return `list[SpellRecord]` (frozen dataclass).

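
The record type could look roughly like this. The field names and types come from the list above; `parse_components` and the sample values (including the spell name "Example Bolt") are hypothetical and only illustrate how the three component flags and `material_specified` might be filled from a components string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpellRecord:
    name: str
    level: int  # 0 = cantrip
    school: str
    casting_time: str
    range_text: str
    verbal: bool
    somatic: bool
    material: bool
    material_specified: Optional[str]
    duration: str
    concentration: bool
    ritual: bool
    higher_level: Optional[str]


def parse_components(text: str) -> tuple[bool, bool, bool, Optional[str]]:
    """Parse a components string such as 'V, S, M (a pinch of sand)'."""
    head, _, tail = text.partition("(")
    letters = {part.strip() for part in head.split(",")}
    tail = tail.strip()
    if tail.endswith(")"):
        tail = tail[:-1].strip()
    return "V" in letters, "S" in letters, "M" in letters, tail or None


# Illustrative values only, not taken from the SRD.
verbal, somatic, material, specified = parse_components("V, S, M (a pinch of sand)")
example = SpellRecord(
    name="Example Bolt", level=1, school="evocation", casting_time="Action",
    range_text="60 feet", verbal=verbal, somatic=somatic, material=material,
    material_specified=specified, duration="Instantaneous",
    concentration=False, ritual=False, higher_level=None,
)
```

Freezing the dataclass makes records hashable and prevents a worker thread from mutating a record after it has been handed back for comparison.
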
### `parsers/creatures.py`

`extract_creatures(full_text: str) -> list[CreatureRecord]`

**Section isolation:** Use `extract_section()` to isolate the monsters A–Z chapter (and the
animals appendix if present) before splitting on creature headings.

**Entry detection:** pdfplumber's `extract_text()` returns plain text, so there are no markdown
`##` markers. In the SRD PDF, creature names appear as a line of title-case text (e.g.,
`"Adult Black Dragon"`) with no surrounding punctuation, followed immediately by a line
containing comma-separated size, type, and alignment (e.g., `"Huge dragon, chaotic evil"`).
A line matching `r"^[A-Z][A-Za-z '\-]+$"` followed by a line matching
`r"^(Tiny|Small|Medium|Large|Huge|Gargantuan)\s+\w+"` marks the creature-entry boundary.
This two-line heuristic disambiguates creature names from chapter headings (which are not
followed by a size/type/alignment line).

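
The two-line heuristic can be sketched as below, using the two regexes quoted above. The helper names and the sample text are made up for illustration; real stat blocks will need the section isolated first.

```python
import re

NAME_RE = re.compile(r"^[A-Z][A-Za-z '\-]+$")
SIZE_LINE_RE = re.compile(r"^(Tiny|Small|Medium|Large|Huge|Gargantuan)\s+\w+")


def find_creature_starts(lines: list[str]) -> list[int]:
    """Indexes of name lines that are immediately followed by a size/type line."""
    return [
        i
        for i in range(len(lines) - 1)
        if NAME_RE.match(lines[i]) and SIZE_LINE_RE.match(lines[i + 1])
    ]


def split_creature_blocks(section: str) -> dict[str, list[str]]:
    """Map each creature name to the lines of its stat block."""
    lines = [ln.strip() for ln in section.splitlines() if ln.strip()]
    starts = find_creature_starts(lines)
    blocks: dict[str, list[str]] = {}
    for start, end in zip(starts, starts[1:] + [len(lines)]):
        blocks[lines[start]] = lines[start + 1 : end]
    return blocks


# Made-up sample; "Monsters A-Z" matches NAME_RE but has no size line after it.
sample = """Monsters A-Z
Adult Black Dragon
Huge dragon, chaotic evil
Armor Class 19
Goblin
Small humanoid, neutral evil
Armor Class 15
"""
blocks = split_creature_blocks(sample)
```

Note that the chapter heading in the sample passes the name regex on its own; it is rejected only because the following line is not a size/type line.
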
**Page-break handling:** `extract_full_text()` concatenates all page text before parsing, so
stat blocks that span a page break appear as contiguous text. The ability score table
(`STR | DEX | CON | INT | WIS | CHA`) is extracted using pdfplumber's table extractor
*per page* during `extract_full_text()` and inserted inline with a recognizable delimiter,
so the string-level parsers can find it reliably.

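
A rough sketch of how `extract_full_text()` might combine page text with delimiter-wrapped tables. The sentinel strings and `format_table` are hypothetical, and as a simplification this version appends each page's tables after that page's text rather than placing them at their exact position. It relies on pdfplumber's `open()`, `pdf.pages`, `page.extract_text()`, and `page.extract_tables()`.

```python
from typing import Optional

# Hypothetical sentinel strings; any marker unlikely to occur in the SRD text would do.
TABLE_START = "<<TABLE>>"
TABLE_END = "<<END_TABLE>>"


def format_table(rows: list[list[Optional[str]]]) -> str:
    """Serialise one extracted table as delimiter-wrapped, pipe-separated lines."""
    body = "\n".join(" | ".join(cell or "" for cell in row) for row in rows)
    return f"{TABLE_START}\n{body}\n{TABLE_END}"


def extract_full_text(pdf_path: str) -> str:
    """Open the PDF once, in the calling thread, and return all text as one string."""
    import pdfplumber  # imported lazily so format_table() is usable without it

    chunks: list[str] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            chunks.append(page.extract_text() or "")
            # extract_tables() returns a list of tables, each a list of rows of cells.
            for table in page.extract_tables():
                chunks.append(format_table(table))
    return "\n".join(chunks)
```

A creature parser could then locate the ability scores with a plain substring search for the start and end sentinels.
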
**Fields extracted per entry:**
- `name`: str
- `size`: str
- `type`: str
- `alignment`: str
- `armor_class`: int
- `hit_points`: int
- `hit_dice`: str (e.g. "10d10+30"); uses `parse_dice()`
- `speed`: dict (walk/fly/swim/burrow/climb, each int ft)
- `strength`, `dexterity`, `constitution`, `intelligence`, `wisdom`, `charisma`: int
- `saving_throws`: dict[str, int] (only listed saves)
- `skills`: dict[str, int] (only listed skills)
- `senses`: dict (darkvision_range, blindsight_range, etc., int ft)
- `languages`: list[str]
- `challenge_rating`: float
- `damage_immunities`: list[str]
- `damage_resistances`: list[str]
- `damage_vulnerabilities`: list[str]
- `condition_immunities`: list[str]

**Skip fields:** `traits`, `actions`, `desc` (action text is prose; not compared at this stage).

**Sanity check:** Abort if fewer than 250 creatures parsed.

Return `list[CreatureRecord]`.

### `parsers/items.py`

`extract_weapons(full_text) -> list[WeaponRecord]`
`extract_armor(full_text) -> list[ArmorRecord]`
`extract_items(full_text) -> list[ItemRecord]`
`extract_magic_items(full_text) -> list[MagicItemRecord]`

Each function isolates its own section via `extract_section()` before parsing.

**Weapons:** `## <Name>` blocks with labeled fields **Cost**, **Damage** (uses `parse_dice()`),
**Weight**, **Properties**, **Mastery**, optionally **Range** / **Long Range**.

**Armor:** `## <Name>` blocks with **AC Base** (int), **AC Add Dex** (bool), **AC Cap Dex**
(int | None), **Strength Required** (int | None), **Stealth Disadvantage** (bool).

**Adventuring gear:** `## <Name> (Cost)` or `## <Name>` with labeled **Cost** and **Weight**
fields and a prose description.

**Magic items:** `### <Name>` with a rarity/type/attunement line in italics.

**Skip fields (per type):**
- Weapons: `desc`, `mastery_desc`
- Armor: `desc`
- Items: `desc`
- Magic items: `desc`

**Sanity checks:**
- Weapons: abort if fewer than 30 parsed
- Armor: abort if fewer than 10 parsed
- Items: abort if fewer than 150 parsed
- Magic items: abort if fewer than 700 parsed

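
The per-type sanity checks could share one small helper. The thresholds below are the ones stated in this spec; the dictionary name, the function name, and the error wording are assumptions.

```python
# Minimum record counts from the sanity checks in this spec.
MIN_EXPECTED = {
    "spells": 300,
    "creatures": 250,
    "weapons": 30,
    "armor": 10,
    "items": 150,
    "magic_items": 700,
}


def check_minimum(entity_type: str, records: list) -> None:
    """Raise ValueError when a parser returns implausibly few records."""
    minimum = MIN_EXPECTED[entity_type]
    if len(records) < minimum:
        raise ValueError(
            f"Parsed only {len(records)} {entity_type}; expected at least {minimum}. "
            "The section boundaries or entry pattern probably failed to match."
        )


# Demonstration: a single parsed weapon is far below the threshold of 30.
try:
    check_minimum("weapons", ["Club"])
except ValueError as exc:
    error_message = str(exc)
```

Failing loudly here keeps a broken regex from showing up as hundreds of spurious "extra in DB" rows in the report.
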
### `compare_srd.py` (management command)

```bash
python manage.py compare_srd [--pdf PATH] [--document SLUG] [--entity TYPE]
```

| Argument | Default | Description |
|----------|---------|-------------|
| `--pdf` | `data/raw_sources/srd_5_2/SRD_CC_v5.2.pdf` | Path to SRD PDF |
| `--document` | `srd-2024` | Document slug to filter ORM queries |
| `--entity` | `all` | `spells`, `creatures`, `weapons`, `armor`, `items`, `magic_items`, or `all` |

**Execution flow:**

1. Validate PDF path exists; abort with a clear error if not.
2. Call `extract_full_text(pdf_path)` once in the main thread → `full_text: str`.
3. Submit each requested entity type to `ThreadPoolExecutor`. Each task:
   a. Runs the appropriate parser on `full_text` → list of dataclass records.
   b. Bulk-fetches from ORM using `values()` filtered by document slug.
   c. Builds `dict[slug, record]` on each side using `slugify()` for keys.
   d. Computes three sets: `missing` (slugs in PDF not in DB), `extra` (slugs in DB not in PDF),
      `mismatches` (slugs in both but at least one field differs after applying skip lists and
      normalisation rules).
4. Collect results and render with Rich:
   - Summary table with columns: Entity type | In PDF | In DB | Missing | Extra | Mismatches
   - Per-type bulleted list of missing names
   - Per-type bulleted list of extra names
   - Per-type field-mismatch table: Entity | Field | PDF value | DB value
   - Footer: total elapsed time

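
Step 3d above amounts to set arithmetic on slug keys. The sketch below assumes both sides have already been converted to `dict[slug, dict]`; `diff_records` is a hypothetical name, and it uses plain `!=` where the real command would apply its field comparison rules.

```python
def diff_records(
    pdf_records: dict[str, dict],
    db_records: dict[str, dict],
    skip_fields: frozenset = frozenset(),
) -> tuple[list[str], list[str], list[tuple]]:
    """Return (missing, extra, mismatches) for one entity type."""
    pdf_slugs, db_slugs = set(pdf_records), set(db_records)
    missing = sorted(pdf_slugs - db_slugs)  # in the PDF, not in the DB
    extra = sorted(db_slugs - pdf_slugs)    # in the DB, not in the PDF
    mismatches = []
    for slug in sorted(pdf_slugs & db_slugs):
        pdf_row, db_row = pdf_records[slug], db_records[slug]
        for field in sorted(set(pdf_row) - skip_fields):
            # Plain inequality here; the real rules are type-aware.
            if field in db_row and pdf_row[field] != db_row[field]:
                mismatches.append((slug, field, pdf_row[field], db_row[field]))
    return missing, extra, mismatches


# Made-up rows for illustration.
pdf_side = {
    "fireball": {"level": 3, "range": 150, "desc": "pdf wording"},
    "shield": {"level": 1, "range": 0, "desc": "pdf wording"},
}
db_side = {
    "fireball": {"level": 3, "range": 120, "desc": "db wording"},
    "wish": {"level": 9, "range": 0, "desc": "db wording"},
}
missing, extra, mismatches = diff_records(pdf_side, db_side, frozenset({"desc"}))
```

Each mismatch tuple maps directly onto one row of the Entity | Field | PDF value | DB value table.
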
**Field comparison rules:**

| Field type | Comparison |
|-----------|-----------|
| `str` | `clean_text(a).lower() == clean_text(b).lower()` |
| `int` | `a == b` |
| `float` | `abs(a - b) < 0.001` |
| `bool` | `a == b` |
| `list` | `sorted(a) == sorted(b)` |
| `dict` | recursive key-by-key using rules above |

Fields in the skip list for an entity type are excluded from comparison entirely.

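
The rules in the table can be expressed as one recursive function. This is a sketch under assumptions: `normalise` is a simplified stand-in for `clean_text(...).lower()`, the function name `values_equal` is hypothetical, and lists are assumed to hold mutually comparable items (such as strings) so that `sorted()` works.

```python
def normalise(s: str) -> str:
    # Simplified stand-in for clean_text(s).lower(): collapse whitespace, lowercase.
    return " ".join(s.split()).lower()


def values_equal(a, b) -> bool:
    """Compare two field values using the per-type rules from the table above."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(a, bool) or isinstance(b, bool):
        return a == b
    if isinstance(a, str) and isinstance(b, str):
        return normalise(a) == normalise(b)
    if isinstance(a, float) or isinstance(b, float):
        if not all(isinstance(x, (int, float)) for x in (a, b)):
            return False
        return abs(a - b) < 0.001
    if isinstance(a, dict) and isinstance(b, dict):
        # Same keys on both sides, then recurse value by value.
        return a.keys() == b.keys() and all(values_equal(a[k], b[k]) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        return sorted(a) == sorted(b)
    return a == b
```

Treating a missing dict key as a mismatch (rather than ignoring it) is a design choice made here; the spec only says dicts are compared key by key.
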
## ORM Queries

Each entity type uses a targeted `values()` query:

```python
# Example: spells
from api_v2.models import Spell  # assumed import path for the v2 Spell model

# `document` holds the --document slug, e.g. "srd-2024".
spell_rows = Spell.objects.filter(document_id=document).values(
    "name", "level", "school__name", "casting_time",
    "range", "range_unit", "concentration", "ritual",
    "verbal", "somatic", "material", "duration"
)
```

`values()` returns flat dicts, so there is no full model instantiation and no N+1 queries.
Related fields (e.g. `school__name`) use the double-underscore traversal built into `values()`.

## Illustrative Output

```
┌───────────────────────────────────────────────────────────────────┐
│ SRD 5.2 PDF vs Database - document: srd-2024                      │
│ Completed in 5.2s                                                 │
├─────────────────┬────────┬───────┬─────────┬───────┬──────────────┤
│ Entity type     │ In PDF │ In DB │ Missing │ Extra │ Mismatches   │
├─────────────────┼────────┼───────┼─────────┼───────┼──────────────┤
│ Spells          │ 339    │ 339   │ 0       │ 0     │ 12           │
│ Creatures       │ 330    │ 328   │ 2       │ 0     │ 23           │
│ Weapons         │ 38     │ 38    │ 0       │ 0     │ 1            │
│ Armor           │ 13     │ 13    │ 0       │ 0     │ 0            │
│ Items           │ 203    │ 203   │ 0       │ 0     │ 4            │
│ Magic Items     │ 757    │ 757   │ 0       │ 0     │ 8            │
└─────────────────┴────────┴───────┴─────────┴───────┴──────────────┘

Missing from DB: Creatures
  • Aboleth
  • Banshee

Field mismatches: Spells
┌──────────────┬───────────────┬──────────────┬──────────────┐
│ Spell        │ Field         │ PDF value    │ DB value     │
├──────────────┼───────────────┼──────────────┼──────────────┤
│ Fireball     │ range         │ 150          │ 120          │
│ Acid Arrow   │ school        │ evocation    │ conjuration  │
└──────────────┴───────────────┴──────────────┴──────────────┘
```

## Known PDF Extraction Risks

- **Multi-column layouts:** D&D SRD PDFs sometimes typeset creature stat blocks in two columns.
  pdfplumber's default text extraction reads left-to-right across the full page width and may
  interleave text from both columns. Mitigate by testing on a sample of creature pages early and
  using pdfplumber's `x_tolerance` / `y_tolerance` parameters or bounding-box cropping if needed.

- **Page-break mid-block:** Stat blocks that span a page break rely on concatenation of page
  text. This works for prose, but ability score tables are detected per page with pdfplumber's
  table extractor, so they are inserted with a sentinel delimiter that survives concatenation.

- **Ligatures:** Common in PDF-embedded fonts (fi, fl, ffi → single glyphs). `clean_text()`
  handles these via Unicode normalization + explicit substitution before any string comparison.

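
If bounding-box cropping turns out to be necessary for multi-column pages, it might look like the sketch below. The helper names are hypothetical and the even column split is an assumption about the layout; the pdfplumber calls used are `page.bbox`, `page.crop()`, and `extract_text()`.

```python
def column_bboxes(bbox: tuple, columns: int = 2) -> list[tuple]:
    """Split a page bounding box (x0, top, x1, bottom) into equal-width columns."""
    x0, top, x1, bottom = bbox
    width = (x1 - x0) / columns
    return [
        # min() keeps the right edge inside the page despite float rounding.
        (x0 + i * width, top, min(x0 + (i + 1) * width, x1), bottom)
        for i in range(columns)
    ]


def extract_page_text_by_column(page, columns: int = 2) -> str:
    """Extract each column separately so lines from different columns do not interleave.

    `page` is expected to be a pdfplumber page object.
    """
    parts = []
    for bbox in column_bboxes(page.bbox, columns):
        parts.append(page.crop(bbox).extract_text() or "")
    return "\n".join(parts)
```

This only helps on pages that really are split down the middle; full-width pages should keep using the default extraction, so the choice would need to be made per page or per section.
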
## Implementation Order

Build and verify one entity type before moving to the next:

1. **Spells**: most consistent PDF formatting, 339 records, easy to spot-check manually
2. **Creatures**: complex stat blocks; test pdfplumber table extraction on a sample first
3. **Weapons & Armor**: table-like format, small record counts
4. **Adventuring Gear / Items**: larger set, varied formats
5. **Magic Items**: 757 records, mostly name + rarity + attunement

## Verification

```bash
# Full compare against live DB
python manage.py compare_srd

# Spot-check a single type
python manage.py compare_srd --entity spells

# Against a different document (future use)
python manage.py compare_srd --document srd-2014
```

Expected: runtime under 10 seconds; zero missing records for a correct import; any reported
field mismatches point to real data quality issues in the existing fixtures.

## Future Work (out of scope)

- `manage.py generate_fixtures --pdf ...` to replace the markdown conversion scripts
- `--output json` flag for CI (straightforward addition once comparison logic exists)
- Cross-document diff mode (`--compare srd-2014 srd-2024`) to see what changed between versions
