Commit d3db8ac
Feature/containerized tests (#25)

* feat: containerize QA test execution
  - Add tests/Dockerfile (Python 3.11-slim) with test deps baked in
  - Refactor Taskfile.yml: test tasks now run via docker with bind mounts
  - Replace test:install with test:build (builds test runner image)
  - Update README.md task runner table and QA section
  - Update about-src.html QA section with Docker note

  No local Python environment is needed to run tests anymore.

* fix: add per-chapter progress logging and unbuffered stdout
  - Set PYTHONUNBUFFERED=1 in tests/Dockerfile
  - Add [n/total] progress output per chapter in algorithmic_checker.py
  - All print() calls use flush=True for container environments

* feat: incremental test report with same-commit resume
  - Add --report flag to write PDF_TEXT_TESTS.md incrementally per chapter
  - Parse existing report to skip already-tested chapters on same commit
  - Include git commit hash and timestamp in report header
  - Rolling summary updated after each chapter completes
  - PYTHONUNBUFFERED=1 in Dockerfile for streaming output
  - Per-chapter progress logging with flush in console output

* fix: pass GIT_COMMIT env var into container for report commit hash

* fix: strip XML tags from PDF text and skip empty diffs
  - Add _XML_TAG_RE to strip embedded HTML/XML tags from PDF text (e.g. `<ahref='...'>`, `</a>`) before comparison
  - Skip diffs where both sides are empty after stripping
  - Eliminates false HIGH severity reports from PDF markup artifacts

* docs: update tests/README.md + fix page header stripping
  - Rewrite tests/README.md with Docker workflow, incremental reporting, XML tag stripping, --report flag, severity rules, and resume feature
  - Generalize _PDF_FOOTER_RE to match any page header/footer pattern (e.g. 'Introduction : 1 01/28/2021'), not just 'Chapter' prefix

* refactor: reorganize test tasks — test:unit, test:pdf, test:all
  - test:unit → runs CompendiumUI Vitest suite
  - test:pdf → runs PDF↔HTML text comparison (Docker)
  - test:all → runs both unit + pdf tests
  - test:chapter, test:llm unchanged
1 parent 5febfa1 commit d3db8ac

File tree: 8 files changed, +315 −57 lines

CompendiumUI/public/about-src.html (2 additions, 1 deletion)

```diff
@@ -87,7 +87,8 @@
 Each identified issue is classified by severity (HIGH, MEDIUM, or LOW) to help
 prioritize corrections. A GitHub Actions workflow automatically runs these
 quality checks whenever HTML files are updated, ensuring ongoing accuracy as the
-site evolves.
+site evolves. The test suite is containerized with Docker so that it can be run
+locally without installing Python or any dependencies.
 </paragraph>
 <paragraph>
 For detailed information about the QA methodology, see the <a
```

README.md (6 additions, 5 deletions)

```diff
@@ -61,10 +61,11 @@ A [`Taskfile.yml`](Taskfile.yml) is included for common workflows. Install [Task
 | `task pdf-to-text` | Extract text from PDFs into `.txt` files |
 | `task process-pdfs` | Convert PDFs to XHTML via Gemini API |
 | `task process-pdfs-chunked` | Convert large PDFs in chunks via Gemini API |
-| `task test` | Run algorithmic QA checks on all chapters |
-| `task test:chapter -- ch200` | Run QA on a specific chapter |
-| `task test:llm` | Run LLM-based QA checks (requires `GOOGLE_API_KEY`) |
-| `task test:all` | Run all QA checks with full reports |
+| `task test:build` | Build the Python test-runner Docker image |
+| `task test` | Run algorithmic QA checks on all chapters (in Docker) |
+| `task test:chapter -- ch200` | Run QA on a specific chapter (in Docker) |
+| `task test:llm` | Run LLM-based QA checks in Docker (requires `GOOGLE_API_KEY`) |
+| `task test:all` | Run all QA checks with full reports (in Docker) |
 
 # Using LLMs to convert pdf to xhtml
 
@@ -119,7 +120,7 @@ The Compendium viewer now includes experimental browser-based translation suppor
 
 ## Quality Assurance
 
-An automated content-checking engine compares the web HTML against original PDF text to detect conversion errors. See [QA.md](QA.md) and [tests/README.md](tests/README.md) for details.
+An automated content-checking engine compares the web HTML against original PDF text to detect conversion errors. The QA tests run inside a Docker container (via `task test`), so no local Python environment is required. See [QA.md](QA.md) and [tests/README.md](tests/README.md) for details.
 
 ### Frontend Testing
 
```

Taskfile.yml (22 additions, 12 deletions)

```diff
@@ -67,28 +67,38 @@ tasks:
     cmds:
       - python scripts/process_pdfs_chunked.py --directory {{.PDF_DIR}}
 
-  # ── QA / Tests ─────────────────────────────────────────
-  test:install:
-    desc: Install Python test dependencies
+  # ── Tests ─────────────────────────────────────────────
+  test:build:
+    desc: Build the Python test-runner Docker image
     cmds:
-      - pip install -r tests/requirements.txt
+      - docker build -t {{.IMAGE_NAME}}-tests -f tests/Dockerfile tests
 
-  test:
-    desc: Run algorithmic QA checks on all chapters
+  test:unit:
+    desc: Run CompendiumUI unit tests (Vitest)
+    dir: CompendiumUI
+    cmds:
+      - npm test
+
+  test:pdf:
+    desc: Run PDF↔HTML text comparison on all chapters (in Docker)
+    deps: [test:build]
     cmds:
-      - python -m tests.run_qa --algo
+      - docker run --rm -v {{.ROOT_DIR}}:/workspace -e GIT_COMMIT="$(git log -1 --format='%H %s')" {{.IMAGE_NAME}}-tests --algo --report PDF_TEXT_TESTS.md
 
   test:chapter:
-    desc: Run algorithmic QA on a specific chapter (e.g. task test:chapter -- ch200)
+    desc: Run PDF↔HTML QA on a specific chapter (e.g. task test:chapter -- ch200)
+    deps: [test:build]
     cmds:
-      - python -m tests.run_qa --algo --chapters {{.CLI_ARGS}}
+      - docker run --rm -v {{.ROOT_DIR}}:/workspace {{.IMAGE_NAME}}-tests --algo --chapters {{.CLI_ARGS}}
 
   test:llm:
     desc: Run LLM-based QA checks (requires GOOGLE_API_KEY)
+    deps: [test:build]
     cmds:
-      - python -m tests.run_qa --llm
+      - docker run --rm -v {{.ROOT_DIR}}:/workspace -e GOOGLE_API_KEY {{.IMAGE_NAME}}-tests --llm
 
   test:all:
-    desc: Run both algorithmic and LLM QA checks with full reports
+    desc: Run all tests (unit + PDF text comparison)
     cmds:
-      - python -m tests.run_qa --all --format all --output-dir tests/reports
+      - task: test:unit
+      - task: test:pdf
```

tests/Dockerfile (6 additions, 0 deletions)

```diff
@@ -0,0 +1,6 @@
+FROM python:3.11-slim
+WORKDIR /workspace
+ENV PYTHONUNBUFFERED=1
+COPY requirements.txt /tmp/requirements.txt
+RUN pip install --no-cache-dir -r /tmp/requirements.txt
+ENTRYPOINT ["python", "-m", "tests.run_qa"]
```
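The `PYTHONUNBUFFERED=1` setting matters because stdout inside a container is a pipe rather than a TTY, so CPython block-buffers it by default and progress lines would only reach `docker logs` once the buffer fills. A minimal sketch of the equivalent per-call fix (the `log` helper name is illustrative, not from the project):

```python
import sys

def log(msg: str) -> None:
    # When stdout is a pipe (as under `docker run` without -t), CPython
    # block-buffers it; flush=True forces each line out immediately.
    # ENV PYTHONUNBUFFERED=1 in the Dockerfile achieves the same globally.
    print(msg, flush=True)

# Whether stdout is attached to a TTY (line-buffered) or a pipe can be probed:
is_tty = sys.stdout.isatty()
```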

tests/README.md (57 additions, 15 deletions)

````diff
@@ -4,20 +4,41 @@ Automated quality assurance for the Copyright Compendium web conversion. Compare
 
 ## Quick Start
 
+Tests run in a Docker container — no local Python installation needed.
+
 ```bash
-# Install dependencies (from project root)
-pip install -r tests/requirements.txt
+# Run all chapters (results written incrementally to PDF_TEXT_TESTS.md)
+task test
+
+# Run a specific chapter
+task test:chapter -- ch200
 
-# Run algorithmic check on a chapter
-python -m tests.run_qa --algo --chapters ch200
+# Run LLM-based checks (requires GOOGLE_API_KEY)
+task test:llm
 
-# Run with full report output
-python -m tests.run_qa --algo --chapters ch200 --format all --output-dir tests/reports
+# Run both algorithmic + LLM checks with full reports
+task test:all
 ```
 
+### Incremental Report & Resume
+
+`task test` writes results to `PDF_TEXT_TESTS.md` as each chapter completes. If the process is interrupted and re-run **on the same commit**, already-tested chapters are skipped automatically. The report includes:
+
+- Git commit hash and timestamp
+- Per-chapter severity breakdown (HIGH / MEDIUM / LOW)
+- Expandable HIGH severity details
+- Rolling summary updated after each chapter
+
 ## How It Works
 
-The engine has two comparison modes:
+### Containerized Execution
+
+Tests run inside a `python:3.11-slim` Docker image (see [`Dockerfile`](Dockerfile)). The project root is bind-mounted into the container at `/workspace`, giving the test runner access to:
+
+- HTML source files in `CompendiumUI/public/`
+- Pre-extracted PDF text in `copyright_compendium_pdfs/`
+
+The `GIT_COMMIT` environment variable is passed from the host so the report captures the correct commit hash.
 
 ### Algorithmic Checker (`--algo`)
````
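The same-commit resume described above works by parsing the existing report header before any chapter runs. A minimal sketch, assuming a `Commit: <hash>` header line and `## Chapter chNNN` per-chapter headings (both formats are assumptions, not the project's exact report layout):

```python
import re

def chapters_already_tested(report_text: str, current_commit: str) -> set[str]:
    """Return chapter IDs that can be skipped because the existing report
    was produced on the same commit; otherwise retest everything."""
    m = re.search(r"^Commit:\s*(\S+)", report_text, re.MULTILINE)
    if not m or m.group(1) != current_commit:
        return set()  # report missing or from a different commit
    return set(re.findall(r"^## Chapter (ch\d+)$", report_text, re.MULTILINE))
```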

````diff
@@ -26,19 +47,26 @@ Uses character-level comparison on **space-stripped text** to find genuine conte
 **Why space-stripped?** PDF text extraction commonly joins or splits words at line breaks (e.g., `ofthe` instead of `of the`, `pra ctices` instead of `practices`). By removing all whitespace before comparing, these artifacts become invisible — only real character-level content differences are reported.
 
 The pipeline:
+
 1. **Extract** text from HTML source files and pre-extracted PDF `.txt` files
-2. **Normalize** both texts (Unicode normalization, PDF header/footer removal, bullet markers, word fragment rejoining)
+2. **Normalize** both texts:
+   - Unicode normalization (curly quotes → straight, em-dashes, non-breaking spaces)
+   - **Strip embedded XML/HTML tags** from PDF text (e.g., `<ahref="...">`, `</a>`)
+   - Remove PDF headers, footers, TOC dot-leaders, bullet markers
+   - Rejoin word fragments split by PDF line breaks
+   - Collapse whitespace
 3. **Strip** all whitespace from both texts
 4. **Diff** the stripped texts character-by-character using `difflib.SequenceMatcher`
-5. **Map** each character difference back to the original text for readable context
-6. **Classify** each difference as HIGH, MEDIUM, or LOW severity
+5. **Filter** empty diffs (both sides empty after stripping are skipped)
+6. **Map** each character difference back to the original text for readable context
+7. **Classify** each difference as HIGH, MEDIUM, or LOW severity
 
 ### LLM Checker (`--llm`)
 
 Uses Google Gemini API for semantic analysis. Sends both texts with a structured prompt that asks the model to identify and classify discrepancies. Requires a `GOOGLE_API_KEY` environment variable.
 
 ```bash
-GOOGLE_API_KEY=your-key python -m tests.run_qa --llm --chapters ch200
+GOOGLE_API_KEY=your-key task test:llm
 ```
 
 ## Severity Classification
````
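The space-stripped comparison at the heart of the pipeline (strip, then diff, then map back) can be sketched as follows; function names are illustrative, not the project's actual API:

```python
from difflib import SequenceMatcher

def strip_spaces(text: str) -> tuple[str, list[int]]:
    # Remove all whitespace, keeping a map from stripped index -> original
    # index so each diff can later be reported with readable context.
    chars, offsets = [], []
    for i, ch in enumerate(text):
        if not ch.isspace():
            chars.append(ch)
            offsets.append(i)
    return "".join(chars), offsets

def diff_stripped(pdf_text: str, html_text: str) -> list[tuple]:
    pdf_s, pdf_map = strip_spaces(pdf_text)
    html_s, _ = strip_spaces(html_text)
    sm = SequenceMatcher(None, pdf_s, html_s, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            continue
        # Map the stripped start index back to the original PDF text.
        orig = pdf_map[i1] if i1 < len(pdf_map) else len(pdf_text)
        out.append((op, pdf_s[i1:i2], html_s[j1:j2], orig))
    return out
```

With this scheme, line-break artifacts such as `ofthe` vs `of the` produce no diff at all, while a genuine character change still surfaces with its original position.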
```diff
@@ -47,7 +75,19 @@ GOOGLE_API_KEY=your-key python -m tests.run_qa --llm --chapters ch200
 |----------|---------|----------|
 | **HIGH** | Substantive content change that may affect meaning | Changed section numbers, missing paragraphs, altered legal references |
 | **MEDIUM** | Formatting difference worth noting | Section delimiters (`. 202`), punctuation changes, minor reference formatting |
-| **LOW** | Expected artifact, safe to ignore | PDF headers/footers, TOC page numbers, whitespace-only differences |
+| **LOW** | Expected artifact, safe to ignore | PDF headers/footers, TOC page numbers, whitespace-only differences, case-only changes |
+
+### Classification Rules (in priority order)
+
+1. Whitespace/hyphenation-only difference → **LOW**
+2. PDF header/footer content → **LOW**
+3. Case-only change (e.g., `NO.` → `No.`) → **LOW**
+4. Punctuation-only change → **MEDIUM**
+5. TOC content (dense section number listings) → **LOW**
+6. Section number delimiter (e.g., `.202`) → **MEDIUM**
+7. Changed numbers or section references → **HIGH**
+8. Substantial text addition/deletion (>50% length difference) → **HIGH**
+9. Default → **MEDIUM**
 
 ## CLI Reference
```
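The priority ordering of the classification rules above can be sketched as a chain of early returns. This is a simplified illustration covering a subset of the rules (inputs are assumed to be already space-stripped, so the whitespace rule is handled upstream); the names and thresholds are not the project's actual implementation:

```python
import re
from enum import Enum

class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

def classify(pdf_chars: str, html_chars: str) -> Severity:
    """Simplified priority-ordered severity rules (illustrative only)."""
    def drop_punct(s: str) -> str:
        return re.sub(r"[^\w]", "", s)

    if pdf_chars.replace("-", "") == html_chars.replace("-", ""):
        return Severity.LOW       # rule 1: hyphenation-only difference
    if pdf_chars.lower() == html_chars.lower():
        return Severity.LOW       # rule 3: case-only change (NO. vs No.)
    if drop_punct(pdf_chars) == drop_punct(html_chars):
        return Severity.MEDIUM    # rule 4: punctuation-only change
    if any(c.isdigit() for c in pdf_chars + html_chars):
        return Severity.HIGH      # rule 7: changed numbers/references
    longer = max(len(pdf_chars), len(html_chars))
    if longer and min(len(pdf_chars), len(html_chars)) / longer < 0.5:
        return Severity.HIGH      # rule 8: substantial addition/deletion
    return Severity.MEDIUM        # rule 9: default
```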

```diff
@@ -57,7 +97,7 @@ python -m tests.run_qa [OPTIONS]
 # Check mode (at least one required)
 --algo                 Run algorithmic comparison
 --llm                  Run LLM-based semantic check
---both                 Run both checks
+--all                  Run both checks
 
 # Chapter selection (default: all mapped chapters)
 --chapters CH [CH...]  Specific chapter IDs (e.g., ch200 ch800)
```
````diff
@@ -67,14 +107,16 @@ python -m tests.run_qa [OPTIONS]
 --format {console,markdown,json,all}  Output format (default: console)
 --output-dir DIR       Directory for markdown/json reports
 --severity-filter {HIGH,MEDIUM,LOW}   Show only this severity and above
+--report PATH          Write incremental markdown report (e.g. PDF_TEXT_TESTS.md)
 ```
 
 ## Project Structure
 
 ```
 tests/
+├── Dockerfile            ← Python 3.11-slim test runner image
 ├── README.md             ← This file
-├── requirements.txt      ← Python dependencies
+├── requirements.txt      ← Python dependencies (beautifulsoup4, lxml, google-generativeai)
 ├── __init__.py           ← Package marker
 ├── conftest.py           ← Chapter-to-file mapping and path constants
 ├── text_extractor.py     ← Text extraction and normalization pipeline
@@ -83,7 +125,7 @@ tests/
 ├── llm_checker.py        ← Gemini API-based semantic checker
 ├── llm_prompt.txt        ← Prompt template for LLM checks
 ├── report.py             ← Report generators (console, markdown, JSON)
-├── run_qa.py             ← CLI entry point
+├── run_qa.py             ← CLI entry point (with incremental report support)
 └── reports/              ← Generated reports (git-ignored)
 ```
````

tests/algorithmic_checker.py (20 additions, 6 deletions)

```diff
@@ -101,6 +101,12 @@ def compare_chapter(chapter_id: str) -> list[Discrepancy]:
         if diff_len < _MIN_DIFF_SIZE:
             continue
 
+        # Skip empty diffs (both sides empty after stripping)
+        pdf_chars = pdf_stripped[i1:i2]
+        html_chars = html_stripped[j1:j2]
+        if not pdf_chars.strip() and not html_chars.strip():
+            continue
+
         # Map back to original positions
         if i1 < len(pdf_map) and i2 > 0:
             pdf_orig_start = pdf_map[i1] if i1 < len(pdf_map) else len(pdf_text)
@@ -118,9 +124,7 @@ def compare_chapter(chapter_id: str) -> list[Discrepancy]:
         else:
             html_snippet = ""
 
-        # Get the actual character diff for classification
-        pdf_chars = pdf_stripped[i1:i2]
-        html_chars = html_stripped[j1:j2]
+        # Use the character diff (already extracted above) for description
 
         if op == "replace":
             desc = f"Text differs: '{pdf_chars[:50]}' → '{html_chars[:50]}'"
@@ -155,10 +159,20 @@ def compare_chapters(chapter_ids: list[str]) -> dict[str, list[Discrepancy]]:
         Dict mapping chapter_id → list of Discrepancy objects.
     """
     results = {}
-    for chapter_id in chapter_ids:
+    total = len(chapter_ids)
+    for idx, chapter_id in enumerate(chapter_ids, 1):
+        print(f"  [{idx}/{total}] Checking {chapter_id}...", flush=True)
         try:
-            results[chapter_id] = compare_chapter(chapter_id)
+            discs = compare_chapter(chapter_id)
+            results[chapter_id] = discs
+            high = sum(1 for d in discs if d.severity.name == "HIGH")
+            med = sum(1 for d in discs if d.severity.name == "MEDIUM")
+            low = sum(1 for d in discs if d.severity.name == "LOW")
+            print(
+                f"    → {len(discs)} issues ({high} HIGH, {med} MEDIUM, {low} LOW)",
+                flush=True,
+            )
         except (FileNotFoundError, KeyError) as e:
-            print(f"WARNING: Skipping {chapter_id}: {e}")
+            print(f"  WARNING: Skipping {chapter_id}: {e}", flush=True)
             results[chapter_id] = []
     return results
```
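The two normalization fixes named in the commit message — stripping markup that leaks into extracted PDF text, and generalizing the page header/footer pattern beyond a `Chapter` prefix — can be sketched with regexes like these. Both patterns are illustrative equivalents; the project's actual `_XML_TAG_RE` and `_PDF_FOOTER_RE` in `text_extractor.py` may differ:

```python
import re

# Markup fragments leaked into PDF text, e.g. <ahref='...'> or </a>
_XML_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")

# Running page header/footer lines like "Introduction : 1 01/28/2021"
# or "Chapter 200 : 15 01/28/2021" (any title, not just "Chapter ...")
_PDF_FOOTER_RE = re.compile(
    r"^.{0,60}:\s*\d+\s+\d{2}/\d{2}/\d{4}\s*$",
    re.MULTILINE,
)

def normalize_pdf_text(text: str) -> str:
    text = _XML_TAG_RE.sub("", text)     # strip embedded XML/HTML tags
    text = _PDF_FOOTER_RE.sub("", text)  # drop page headers/footers
    return text
```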
