Commit d3db8ac
Feature/containerized tests (#25)

* feat: containerize QA test execution
  - Add tests/Dockerfile (Python 3.11-slim) with test deps baked in
  - Refactor Taskfile.yml: test tasks now run via docker with bind mounts
  - Replace test:install with test:build (builds test runner image)
  - Update README.md task runner table and QA section
  - Update about-src.html QA section with Docker note

  No local Python environment is needed to run tests anymore.

* fix: add per-chapter progress logging and unbuffered stdout
  - Set PYTHONUNBUFFERED=1 in tests/Dockerfile
  - Add [n/total] progress output per chapter in algorithmic_checker.py
  - All print() calls use flush=True for container environments

* feat: incremental test report with same-commit resume
  - Add --report flag to write PDF_TEXT_TESTS.md incrementally per chapter
  - Parse existing report to skip already-tested chapters on same commit
  - Include git commit hash and timestamp in report header
  - Rolling summary updated after each chapter completes
  - PYTHONUNBUFFERED=1 in Dockerfile for streaming output
  - Per-chapter progress logging with flush in console output

* fix: pass GIT_COMMIT env var into container for report commit hash

* fix: strip XML tags from PDF text and skip empty diffs
  - Add _XML_TAG_RE to strip embedded HTML/XML tags from PDF text (e.g. `<ahref='...'>`, `</a>`) before comparison
  - Skip diffs where both sides are empty after stripping
  - Eliminates false HIGH severity reports from PDF markup artifacts

* docs: update tests/README.md + fix page header stripping
  - Rewrite tests/README.md with Docker workflow, incremental reporting, XML tag stripping, --report flag, severity rules, and resume feature
  - Generalize _PDF_FOOTER_RE to match any page header/footer pattern (e.g. 'Introduction : 1 01/28/2021'), not just 'Chapter' prefix

* refactor: reorganize test tasks — test:unit, test:pdf, test:all
  - test:unit → runs CompendiumUI Vitest suite
  - test:pdf → runs PDF↔HTML text comparison (Docker)
  - test:all → runs both unit + pdf tests
  - test:chapter, test:llm unchanged
1 parent 5febfa1 commit d3db8ac

File tree: 8 files changed, +315 −57 lines

CompendiumUI/public/about-src.html (2 additions, 1 deletion)

```diff
@@ -87,7 +87,8 @@
 Each identified issue is classified by severity (HIGH, MEDIUM, or LOW) to help
 prioritize corrections. A GitHub Actions workflow automatically runs these
 quality checks whenever HTML files are updated, ensuring ongoing accuracy as the
-site evolves.
+site evolves. The test suite is containerized with Docker so that it can be run
+locally without installing Python or any dependencies.
 </paragraph>
 <paragraph>
 For detailed information about the QA methodology, see the <a
```

README.md (6 additions, 5 deletions)

```diff
@@ -61,10 +61,11 @@ A [`Taskfile.yml`](Taskfile.yml) is included for common workflows. Install [Task
 | `task pdf-to-text` | Extract text from PDFs into `.txt` files |
 | `task process-pdfs` | Convert PDFs to XHTML via Gemini API |
 | `task process-pdfs-chunked` | Convert large PDFs in chunks via Gemini API |
-| `task test` | Run algorithmic QA checks on all chapters |
-| `task test:chapter -- ch200` | Run QA on a specific chapter |
-| `task test:llm` | Run LLM-based QA checks (requires `GOOGLE_API_KEY`) |
-| `task test:all` | Run all QA checks with full reports |
+| `task test:build` | Build the Python test-runner Docker image |
+| `task test` | Run algorithmic QA checks on all chapters (in Docker) |
+| `task test:chapter -- ch200` | Run QA on a specific chapter (in Docker) |
+| `task test:llm` | Run LLM-based QA checks in Docker (requires `GOOGLE_API_KEY`) |
+| `task test:all` | Run all QA checks with full reports (in Docker) |
 
 # Using LLMs to convert pdf to xhtml
 
@@ -119,7 +120,7 @@ The Compendium viewer now includes experimental browser-based translation suppor
 
 ## Quality Assurance
 
-An automated content-checking engine compares the web HTML against original PDF text to detect conversion errors. See [QA.md](QA.md) and [tests/README.md](tests/README.md) for details.
+An automated content-checking engine compares the web HTML against original PDF text to detect conversion errors. The QA tests run inside a Docker container (via `task test`), so no local Python environment is required. See [QA.md](QA.md) and [tests/README.md](tests/README.md) for details.
 
 ### Frontend Testing
 
```

Taskfile.yml (22 additions, 12 deletions)

```diff
@@ -67,28 +67,38 @@ tasks:
     cmds:
       - python scripts/process_pdfs_chunked.py --directory {{.PDF_DIR}}
 
-  # ── QA / Tests ─────────────────────────────────────────
-  test:install:
-    desc: Install Python test dependencies
+  # ── Tests ─────────────────────────────────────────────
+  test:build:
+    desc: Build the Python test-runner Docker image
     cmds:
-      - pip install -r tests/requirements.txt
+      - docker build -t {{.IMAGE_NAME}}-tests -f tests/Dockerfile tests
 
-  test:
-    desc: Run algorithmic QA checks on all chapters
+  test:unit:
+    desc: Run CompendiumUI unit tests (Vitest)
+    dir: CompendiumUI
+    cmds:
+      - npm test
+
+  test:pdf:
+    desc: Run PDF↔HTML text comparison on all chapters (in Docker)
+    deps: [test:build]
     cmds:
-      - python -m tests.run_qa --algo
+      - docker run --rm -v {{.ROOT_DIR}}:/workspace -e GIT_COMMIT="$(git log -1 --format='%H %s')" {{.IMAGE_NAME}}-tests --algo --report PDF_TEXT_TESTS.md
 
   test:chapter:
-    desc: Run algorithmic QA on a specific chapter (e.g. task test:chapter -- ch200)
+    desc: Run PDF↔HTML QA on a specific chapter (e.g. task test:chapter -- ch200)
+    deps: [test:build]
     cmds:
-      - python -m tests.run_qa --algo --chapters {{.CLI_ARGS}}
+      - docker run --rm -v {{.ROOT_DIR}}:/workspace {{.IMAGE_NAME}}-tests --algo --chapters {{.CLI_ARGS}}
 
   test:llm:
     desc: Run LLM-based QA checks (requires GOOGLE_API_KEY)
+    deps: [test:build]
     cmds:
-      - python -m tests.run_qa --llm
+      - docker run --rm -v {{.ROOT_DIR}}:/workspace -e GOOGLE_API_KEY {{.IMAGE_NAME}}-tests --llm
 
   test:all:
-    desc: Run both algorithmic and LLM QA checks with full reports
+    desc: Run all tests (unit + PDF text comparison)
     cmds:
-      - python -m tests.run_qa --all --format all --output-dir tests/reports
+      - task: test:unit
+      - task: test:pdf
```

tests/Dockerfile (6 additions, 0 deletions)

```diff
@@ -0,0 +1,6 @@
+FROM python:3.11-slim
+WORKDIR /workspace
+ENV PYTHONUNBUFFERED=1
+COPY requirements.txt /tmp/requirements.txt
+RUN pip install --no-cache-dir -r /tmp/requirements.txt
+ENTRYPOINT ["python", "-m", "tests.run_qa"]
```
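The `PYTHONUNBUFFERED=1` setting matters because stdout inside a container is a pipe rather than a TTY, so CPython block-buffers it by default and progress lines would only reach `docker logs` once the buffer fills. A minimal sketch of the equivalent per-call fix (the `log` helper name is illustrative, not from the project):

```python
import sys

def log(msg: str) -> None:
    # When stdout is a pipe (as under `docker run` without -t), CPython
    # block-buffers it; flush=True forces each line out immediately.
    # ENV PYTHONUNBUFFERED=1 in the Dockerfile achieves the same globally.
    print(msg, flush=True)

# Whether stdout is attached to a TTY (line-buffered) or a pipe can be probed:
is_tty = sys.stdout.isatty()
```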

tests/README.md (57 additions, 15 deletions)

````diff
@@ -4,20 +4,41 @@ Automated quality assurance for the Copyright Compendium web conversion. Compare
 
 ## Quick Start
 
+Tests run in a Docker container — no local Python installation needed.
+
 ```bash
-# Install dependencies (from project root)
-pip install -r tests/requirements.txt
+# Run all chapters (results written incrementally to PDF_TEXT_TESTS.md)
+task test
+
+# Run a specific chapter
+task test:chapter -- ch200
 
-# Run algorithmic check on a chapter
-python -m tests.run_qa --algo --chapters ch200
+# Run LLM-based checks (requires GOOGLE_API_KEY)
+task test:llm
 
-# Run with full report output
-python -m tests.run_qa --algo --chapters ch200 --format all --output-dir tests/reports
+# Run both algorithmic + LLM checks with full reports
+task test:all
 ```
 
+### Incremental Report & Resume
+
+`task test` writes results to `PDF_TEXT_TESTS.md` as each chapter completes. If the process is interrupted and re-run **on the same commit**, already-tested chapters are skipped automatically. The report includes:
+
+- Git commit hash and timestamp
+- Per-chapter severity breakdown (HIGH / MEDIUM / LOW)
+- Expandable HIGH severity details
+- Rolling summary updated after each chapter
+
 ## How It Works
 
-The engine has two comparison modes:
+### Containerized Execution
+
+Tests run inside a `python:3.11-slim` Docker image (see [`Dockerfile`](Dockerfile)). The project root is bind-mounted into the container at `/workspace`, giving the test runner access to:
+
+- HTML source files in `CompendiumUI/public/`
+- Pre-extracted PDF text in `copyright_compendium_pdfs/`
+
+The `GIT_COMMIT` environment variable is passed from the host so the report captures the correct commit hash.
 
 ### Algorithmic Checker (`--algo`)
````
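The same-commit resume described above works by parsing the existing report header before any chapter runs. A minimal sketch, assuming a `Commit: <hash>` header line and `## Chapter chNNN` per-chapter headings (both formats are assumptions, not the project's exact report layout):

```python
import re

def chapters_already_tested(report_text: str, current_commit: str) -> set[str]:
    """Return chapter IDs that can be skipped because the existing report
    was produced on the same commit; otherwise retest everything."""
    m = re.search(r"^Commit:\s*(\S+)", report_text, re.MULTILINE)
    if not m or m.group(1) != current_commit:
        return set()  # report missing or from a different commit
    return set(re.findall(r"^## Chapter (ch\d+)$", report_text, re.MULTILINE))
```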

````diff
@@ -26,19 +47,26 @@ Uses character-level comparison on **space-stripped text** to find genuine conte
 **Why space-stripped?** PDF text extraction commonly joins or splits words at line breaks (e.g., `ofthe` instead of `of the`, `pra ctices` instead of `practices`). By removing all whitespace before comparing, these artifacts become invisible — only real character-level content differences are reported.
 
 The pipeline:
+
 1. **Extract** text from HTML source files and pre-extracted PDF `.txt` files
-2. **Normalize** both texts (Unicode normalization, PDF header/footer removal, bullet markers, word fragment rejoining)
+2. **Normalize** both texts:
+   - Unicode normalization (curly quotes → straight, em-dashes, non-breaking spaces)
+   - **Strip embedded XML/HTML tags** from PDF text (e.g., `<ahref="...">`, `</a>`)
+   - Remove PDF headers, footers, TOC dot-leaders, bullet markers
+   - Rejoin word fragments split by PDF line breaks
+   - Collapse whitespace
 3. **Strip** all whitespace from both texts
 4. **Diff** the stripped texts character-by-character using `difflib.SequenceMatcher`
-5. **Map** each character difference back to the original text for readable context
-6. **Classify** each difference as HIGH, MEDIUM, or LOW severity
+5. **Filter** empty diffs (both sides empty after stripping are skipped)
+6. **Map** each character difference back to the original text for readable context
+7. **Classify** each difference as HIGH, MEDIUM, or LOW severity
 
 ### LLM Checker (`--llm`)
 
 Uses Google Gemini API for semantic analysis. Sends both texts with a structured prompt that asks the model to identify and classify discrepancies. Requires a `GOOGLE_API_KEY` environment variable.
 
 ```bash
-GOOGLE_API_KEY=your-key python -m tests.run_qa --llm --chapters ch200
+GOOGLE_API_KEY=your-key task test:llm
 ```
 
 ## Severity Classification
````
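The space-stripped comparison at the heart of the pipeline (strip, then diff, then map back) can be sketched as follows; function names are illustrative, not the project's actual API:

```python
from difflib import SequenceMatcher

def strip_spaces(text: str) -> tuple[str, list[int]]:
    # Remove all whitespace, keeping a map from stripped index -> original
    # index so each diff can later be reported with readable context.
    chars, offsets = [], []
    for i, ch in enumerate(text):
        if not ch.isspace():
            chars.append(ch)
            offsets.append(i)
    return "".join(chars), offsets

def diff_stripped(pdf_text: str, html_text: str) -> list[tuple]:
    pdf_s, pdf_map = strip_spaces(pdf_text)
    html_s, _ = strip_spaces(html_text)
    sm = SequenceMatcher(None, pdf_s, html_s, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            continue
        # Map the stripped start index back to the original PDF text.
        orig = pdf_map[i1] if i1 < len(pdf_map) else len(pdf_text)
        out.append((op, pdf_s[i1:i2], html_s[j1:j2], orig))
    return out
```

With this scheme, line-break artifacts such as `ofthe` vs `of the` produce no diff at all, while a genuine character change still surfaces with its original position.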
```diff
@@ -47,7 +75,19 @@ GOOGLE_API_KEY=your-key python -m tests.run_qa --llm --chapters ch200
 |----------|---------|----------|
 | **HIGH** | Substantive content change that may affect meaning | Changed section numbers, missing paragraphs, altered legal references |
 | **MEDIUM** | Formatting difference worth noting | Section delimiters (`. 202`), punctuation changes, minor reference formatting |
-| **LOW** | Expected artifact, safe to ignore | PDF headers/footers, TOC page numbers, whitespace-only differences |
+| **LOW** | Expected artifact, safe to ignore | PDF headers/footers, TOC page numbers, whitespace-only differences, case-only changes |
+
+### Classification Rules (in priority order)
+
+1. Whitespace/hyphenation-only difference → **LOW**
+2. PDF header/footer content → **LOW**
+3. Case-only change (e.g., `NO.` → `No.`) → **LOW**
+4. Punctuation-only change → **MEDIUM**
+5. TOC content (dense section number listings) → **LOW**
+6. Section number delimiter (e.g., `.202`) → **MEDIUM**
+7. Changed numbers or section references → **HIGH**
+8. Substantial text addition/deletion (>50% length difference) → **HIGH**
+9. Default → **MEDIUM**
 
 ## CLI Reference
```
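The priority ordering of the classification rules above can be sketched as a chain of early returns. This is a simplified illustration covering a subset of the rules (inputs are assumed to be already space-stripped, so the whitespace rule is handled upstream); the names and thresholds are not the project's actual implementation:

```python
import re
from enum import Enum

class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

def classify(pdf_chars: str, html_chars: str) -> Severity:
    """Simplified priority-ordered severity rules (illustrative only)."""
    def drop_punct(s: str) -> str:
        return re.sub(r"[^\w]", "", s)

    if pdf_chars.replace("-", "") == html_chars.replace("-", ""):
        return Severity.LOW       # rule 1: hyphenation-only difference
    if pdf_chars.lower() == html_chars.lower():
        return Severity.LOW       # rule 3: case-only change (NO. vs No.)
    if drop_punct(pdf_chars) == drop_punct(html_chars):
        return Severity.MEDIUM    # rule 4: punctuation-only change
    if any(c.isdigit() for c in pdf_chars + html_chars):
        return Severity.HIGH      # rule 7: changed numbers/references
    longer = max(len(pdf_chars), len(html_chars))
    if longer and min(len(pdf_chars), len(html_chars)) / longer < 0.5:
        return Severity.HIGH      # rule 8: substantial addition/deletion
    return Severity.MEDIUM        # rule 9: default
```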

```diff
@@ -57,7 +97,7 @@ python -m tests.run_qa [OPTIONS]
 # Check mode (at least one required)
 --algo                 Run algorithmic comparison
 --llm                  Run LLM-based semantic check
---both                 Run both checks
+--all                  Run both checks
 
 # Chapter selection (default: all mapped chapters)
 --chapters CH [CH...]  Specific chapter IDs (e.g., ch200 ch800)
```
````diff
@@ -67,14 +107,16 @@ python -m tests.run_qa [OPTIONS]
 --format {console,markdown,json,all}  Output format (default: console)
 --output-dir DIR       Directory for markdown/json reports
 --severity-filter {HIGH,MEDIUM,LOW}   Show only this severity and above
+--report PATH          Write incremental markdown report (e.g. PDF_TEXT_TESTS.md)
 ```
 
 ## Project Structure
 
 ```
 tests/
+├── Dockerfile            ← Python 3.11-slim test runner image
 ├── README.md             ← This file
-├── requirements.txt      ← Python dependencies
+├── requirements.txt      ← Python dependencies (beautifulsoup4, lxml, google-generativeai)
 ├── __init__.py           ← Package marker
 ├── conftest.py           ← Chapter-to-file mapping and path constants
 ├── text_extractor.py     ← Text extraction and normalization pipeline
@@ -83,7 +125,7 @@ tests/
 ├── llm_checker.py        ← Gemini API-based semantic checker
 ├── llm_prompt.txt        ← Prompt template for LLM checks
 ├── report.py             ← Report generators (console, markdown, JSON)
-├── run_qa.py             ← CLI entry point
+├── run_qa.py             ← CLI entry point (with incremental report support)
 └── reports/              ← Generated reports (git-ignored)
 ```
````

tests/algorithmic_checker.py (20 additions, 6 deletions)

```diff
@@ -101,6 +101,12 @@ def compare_chapter(chapter_id: str) -> list[Discrepancy]:
         if diff_len < _MIN_DIFF_SIZE:
             continue
 
+        # Skip empty diffs (both sides empty after stripping)
+        pdf_chars = pdf_stripped[i1:i2]
+        html_chars = html_stripped[j1:j2]
+        if not pdf_chars.strip() and not html_chars.strip():
+            continue
+
         # Map back to original positions
         if i1 < len(pdf_map) and i2 > 0:
             pdf_orig_start = pdf_map[i1] if i1 < len(pdf_map) else len(pdf_text)
@@ -118,9 +124,7 @@ def compare_chapter(chapter_id: str) -> list[Discrepancy]:
         else:
             html_snippet = ""
 
-        # Get the actual character diff for classification
-        pdf_chars = pdf_stripped[i1:i2]
-        html_chars = html_stripped[j1:j2]
+        # Use the character diff (already extracted above) for description
 
         if op == "replace":
             desc = f"Text differs: '{pdf_chars[:50]}' → '{html_chars[:50]}'"
@@ -155,10 +159,20 @@ def compare_chapters(chapter_ids: list[str]) -> dict[str, list[Discrepancy]]:
         Dict mapping chapter_id → list of Discrepancy objects.
     """
     results = {}
-    for chapter_id in chapter_ids:
+    total = len(chapter_ids)
+    for idx, chapter_id in enumerate(chapter_ids, 1):
+        print(f"  [{idx}/{total}] Checking {chapter_id}...", flush=True)
         try:
-            results[chapter_id] = compare_chapter(chapter_id)
+            discs = compare_chapter(chapter_id)
+            results[chapter_id] = discs
+            high = sum(1 for d in discs if d.severity.name == "HIGH")
+            med = sum(1 for d in discs if d.severity.name == "MEDIUM")
+            low = sum(1 for d in discs if d.severity.name == "LOW")
+            print(
+                f"    → {len(discs)} issues ({high} HIGH, {med} MEDIUM, {low} LOW)",
+                flush=True,
+            )
         except (FileNotFoundError, KeyError) as e:
-            print(f"WARNING: Skipping {chapter_id}: {e}")
+            print(f"  WARNING: Skipping {chapter_id}: {e}", flush=True)
             results[chapter_id] = []
     return results
```
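The two normalization fixes named in the commit message — stripping markup that leaks into extracted PDF text, and generalizing the page header/footer pattern beyond a `Chapter` prefix — can be sketched with regexes like these. Both patterns are illustrative equivalents; the project's actual `_XML_TAG_RE` and `_PDF_FOOTER_RE` in `text_extractor.py` may differ:

```python
import re

# Markup fragments leaked into PDF text, e.g. <ahref='...'> or </a>
_XML_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")

# Running page header/footer lines like "Introduction : 1 01/28/2021"
# or "Chapter 200 : 15 01/28/2021" (any title, not just "Chapter ...")
_PDF_FOOTER_RE = re.compile(
    r"^.{0,60}:\s*\d+\s+\d{2}/\d{2}/\d{4}\s*$",
    re.MULTILINE,
)

def normalize_pdf_text(text: str) -> str:
    text = _XML_TAG_RE.sub("", text)     # strip embedded XML/HTML tags
    text = _PDF_FOOTER_RE.sub("", text)  # drop page headers/footers
    return text
```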
