
Commit 72659ed

brockwebb and claude committed
Apply bug fixes to raw_kg_schema.md (v3.0 → v3.1)
Implements all 8 bug fixes from GPT-5.2 structural review (round 3):

Breaking changes:
- BUG FIX 1: Add APPLIES_TO relationship for product-scoping
  - Prevents cross-product contamination via generic SurveyProcess nodes
  - Harvest queries MUST use APPLIES_TO, not IMPLEMENTS → SurveyProcess path
- BUG FIX 2: Add DEFINED_FOR relationship (ConceptDefinition → DataProduct)
  - Fixes cross-survey concept misalignment query path
- BUG FIX 3: Consolidate provenance to SOURCED_FROM edges
  - Eliminates redundancy between node properties and relationship
  - All extraction metadata (source_section, source_page, raw_text, extraction_model, extraction_date) now lives on edges
  - New §4.6 Provenance Model section
- BUG FIX 4: QualityAttribute typed values
  - Split value: str into value_number: float|null and value_string: str|null
  - Enables numeric comparisons in harvest queries
  - Fractions must be 0-1, not percentages
- BUG FIX 5: REQUIRES typed rules
  - Add rule_type: "numeric_threshold" | "categorical_match" | etc.
  - Split threshold into threshold_number and threshold_string
  - Dispatch harvest queries by rule_type (§6.1a/6.1b)

Quality improvements:
- BUG FIX 6: Temporal values as ISO dates (YYYY-MM-DD)
  - Specify format in MethodologicalChoice valid_from/valid_until
  - Use Neo4j date() function in queries
- BUG FIX 7: Harvest query families by rule_type
  - §6.1a: Numeric threshold violations (uses APPLIES_TO)
  - §6.1b: Categorical mismatch violations (uses DEFINED_FOR)
  - §6.4: Null-safe date logic
- BUG FIX 8: CONFOUNDS interaction_type property
  - Controlled vocabulary: bias_interaction, variance_interaction, comparability_break, coverage_interaction

Additional updates:
- Relationship count: 14 → 16 types
- §8.2 REQUIRES examples updated with rule_type
- Appendix A: Added GPT-5.2 round 3 review summary
- All validation checks passed

Status: Approved for Phase 1 implementation (post bug-fix review)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent 64e5800 commit 72659ed

# ADR-007: KG-First Authoring with LLM Graph Builder

**Date:** 2026-02-08
**Status:** Accepted
**Supersedes:** None (refines ADR-001 workflow)
**Related:** ADR-001 (authoring/runtime separation), ADR-002 (grounding not RAG)

## Context

ADR-001 established Neo4j as the authoring environment and SQLite as runtime. In practice, the authoring workflow devolved into humans writing JSON files by hand — effectively bypassing Neo4j as the primary authoring tool. This happened because no automated pipeline existed to populate Neo4j from source documents. The round-trip scripts (`staging_to_neo4j.py`, `neo4j_to_staging.py`) became symmetric, implying JSON and Neo4j were co-equal sources of truth. They shouldn't be.

The original vision was: pragmatics are graph threads managed via Cypher. Source documents feed the graph. Curated subgraphs are harvested as pragmatic packs. JSON staging files are an export artifact for version control and compilation — not an authoring surface.

Meanwhile, CPS documentation is fragmented across 12+ PDFs spanning 20 years of methodological changes. Manual extraction doesn't scale. The ACS handbook (89 pages, single document) took a full session to produce 7 findings. CPS would take weeks of manual reading.

Two open-source tools address this:

- **neo4j-labs/llm-graph-builder**: LLM-powered extraction from PDFs directly into Neo4j. LangChain-based, supports Anthropic, configurable entity schemas.
- **HKUDS/RAG-Anything**: MinerU-based PDF parsing with multimodal support (tables, equations, images) feeding LightRAG knowledge graphs.

## Decision

**Neo4j is the upstream source of truth for all pragmatics content. JSON staging is a downstream build artifact. The arrow goes one direction.**

### Authoring Pipeline

```
Source PDFs
    ↓
llm-graph-builder (or equivalent LLM extraction)
    ↓
Neo4j: raw knowledge graph ("the quarry")
    - Entities: concepts, methods, thresholds, caveats, definitions
    - Relationships: applies_to, contradicts, qualifies, supersedes
    - Properties: source document, page, section, extraction confidence
    ↓
Opus/human traverses raw KG via Cypher
    - Identifies pragmatic threads (fitness-for-use expert judgments)
    - Harvests subgraphs worth packaging
    - Assigns latitude, triggers, provenance
    ↓
Curated Context nodes in Neo4j pragmatics namespace
    - Schema: ContextItem model (context_id, domain, category, etc.)
    - Managed via Cypher, not JSON
    ↓
neo4j_to_staging.py (EXPORT ONLY — one direction)
    ↓
staging/*.json (version-controlled build artifact)
    ↓
compile_pack.py → packs/*.db (shipped SQLite)
```

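The export step at the bottom of this pipeline can be sketched as a pure transformation. This is a hypothetical illustration, not the actual `neo4j_to_staging.py` — the record fields, the Cypher query, and the file layout are assumptions:

```python
import json

# Hypothetical Cypher the export step might run (illustrative only):
#   MATCH (c:ContextItem) RETURN c.context_id, c.domain, c.category

def records_to_staging(records):
    """Serialize curated ContextItem records to a deterministic staging
    JSON string. One direction: Neo4j records in, JSON artifact out."""
    # Stable ordering keeps version-control diffs small between exports.
    items = sorted(records, key=lambda r: r["context_id"])
    return json.dumps({"items": items}, indent=2, sort_keys=True)

records = [
    {"context_id": "ACS-GEN-002", "domain": "acs", "category": "caveat"},
    {"context_id": "ACS-GEN-001", "domain": "acs", "category": "definition"},
]
print(records_to_staging(records))
```

Because the output is sorted and key-ordered, re-exporting an unchanged graph produces a byte-identical JSON artifact — which is what makes the staging files usable under version control.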
### Two Neo4j Namespaces

1. **Raw KG** (`USE raw` or separate database): Everything extracted from source documents. Messy, comprehensive, unfiltered. This is the quarry.
2. **Pragmatics** (`USE pragmatics`): Curated expert judgments conforming to ContextItem schema. This is the cut stone.

### Tool Selection

- **Primary extraction**: neo4j-labs/llm-graph-builder — writes directly to Neo4j, configurable schema, supports Anthropic models.
- **PDF parsing** (if LangChain loaders are insufficient): MinerU standalone for structure-preserving extraction of tables, equations, and complex layouts (CPS docs need this).
- **Graph mining**: Cypher queries + Opus reasoning over the raw KG to identify pragmatic threads.

### Script Roles (Clarified)

| Script | Role | Direction |
|--------|------|-----------|
| `neo4j_to_staging.py` | **Primary export** — produces staging JSON from curated pragmatics | Neo4j → JSON |
| `staging_to_neo4j.py` | **Bootstrap/recovery only** — seeds Neo4j from existing JSON; not for regular authoring | JSON → Neo4j |
| `compile_pack.py` | Build step — staging JSON → SQLite packs | JSON → SQLite |
| `catalog_report.py` | Inventory — coverage tracking from compiled packs | Read-only |

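As a rough sketch of the build step, a `compile_pack.py`-style compiler loads the staging JSON and emits a SQLite pack. The table name, columns, and item fields below are illustrative assumptions, not the project's actual pack schema:

```python
import json
import sqlite3

# A tiny stand-in for a staging/*.json artifact (fields are assumed).
STAGING_JSON = json.dumps({"items": [
    {"context_id": "ACS-GEN-001", "domain": "acs", "tags": "median,income",
     "text": "Prefer 5-year estimates for small geographies."},
]})

def compile_pack(staging_json, conn):
    """Build step: staging JSON -> SQLite pack. One direction, idempotent."""
    items = json.loads(staging_json)["items"]
    conn.execute("""CREATE TABLE IF NOT EXISTS context_items (
        context_id TEXT PRIMARY KEY, domain TEXT, tags TEXT, text TEXT)""")
    # INSERT OR REPLACE makes recompilation from the same staging a no-op.
    conn.executemany(
        "INSERT OR REPLACE INTO context_items VALUES "
        "(:context_id, :domain, :tags, :text)", items)
    conn.commit()
    return conn

conn = compile_pack(STAGING_JSON, sqlite3.connect(":memory:"))
count = conn.execute("SELECT COUNT(*) FROM context_items").fetchone()[0]
print(count)  # → 1
```

A real compiler would write to `packs/*.db` on disk rather than `:memory:`; the in-memory connection just keeps the sketch self-contained.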
### Relationship to ADR-002 (Grounding Not RAG)

The raw KG is effectively a RAG store for the **authoring environment** — you query it to find what's worth extracting. This does NOT change the shipped product. The runtime system remains grounding-only: pre-compiled SQLite packs with tag-based retrieval, no embeddings, no vector search. The RAG lives in the workshop, not the product.

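A minimal sketch of what "tag-based retrieval, no embeddings, no vector search" means at runtime — the pack schema and comma-separated tag format here are assumptions for illustration:

```python
import sqlite3

# Build a toy pack in memory (a shipped pack would be a packs/*.db file).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE context_items (context_id TEXT, tags TEXT, text TEXT)")
conn.executemany("INSERT INTO context_items VALUES (?, ?, ?)", [
    ("ACS-GEN-001", "median,income,small-geo", "Prefer 5-year estimates."),
    ("ACS-GEN-002", "margin-of-error", "Always report the MOE."),
])

def retrieve_by_tag(conn, tag):
    """Grounding-style lookup: exact tag membership, no vectors, no ranking
    model. The whole retrieval path is deterministic and inspectable."""
    rows = conn.execute("SELECT context_id, tags FROM context_items").fetchall()
    return [cid for cid, tags in rows if tag in tags.split(",")]

print(retrieve_by_tag(conn, "income"))  # → ['ACS-GEN-001']
```

The point of the contrast with ADR-002: everything probabilistic (LLM extraction, graph mining) happens upstream in the authoring environment; what ships answers queries by exact tag match alone.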
## Consequences

**Positive:**
- Scales to arbitrary document volume (CPS, ACS, SIPP, decennial)
- Cypher is the natural authoring language for graph data — not JSON
- Raw KG preserves everything; pragmatics are selective
- LLM-assisted extraction + human/Opus curation = quality at scale
- Provenance catalog tracks what's been extracted from where
- "Once ingested, it's done" — each document is a completed extraction

**Negative:**
- Neo4j becomes a harder dependency for contributors (was optional, now essential for authoring)
- Two namespaces to manage (raw KG + pragmatics)
- llm-graph-builder adds a LangChain dependency to the dev toolchain (not runtime)
- Raw KG quality depends on extraction model quality — garbage in, garbage out
- Need to define raw KG schema conventions (what entity/relationship types)

**Risks:**
- Raw KG could become a junk drawer if entity types aren't disciplined
- Opus traversal requires well-crafted Cypher — need to develop a library of mining queries
- llm-graph-builder may need customization for the Census domain (statistical terminology, table structures)

## Alternatives Considered
99+
100+
1. **Continue manual JSON authoring**: Rejected — doesn't scale, already causing pain with CPS docs.
101+
2. **RAG-Anything as primary tool**: Rejected — writes to LightRAG internal store, not Neo4j. MinerU component useful for PDF parsing but graph builder goes to wrong target.
102+
3. **Custom extraction pipeline from scratch**: Rejected — llm-graph-builder already solves the PDF→Neo4j problem. Don't rebuild.
103+
4. **Keep JSON as co-equal source of truth**: Rejected — this is what caused the workflow confusion. One source of truth, one direction.
104+
105+
## Implementation Notes
106+
107+
- llm-graph-builder requires Neo4j 5.23+ with APOC. Verify current instance compatibility.
108+
- Start with `cps_handbook_of_methods.pdf` (552K, manageable) as proof of concept before ingesting all 12 CPS docs.
109+
- Raw KG schema conventions need a short design doc before first extraction (entity types, relationship types, required properties).
110+
- Existing 25 ACS pragmatics were authored from LLM training data, not source docs (discovered 2026-02-08). These need re-verification against the raw KG once ACS-GEN-001 is ingested.
