This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
**READ `CLAUDE_PRE_COMMIT_CHECKLIST.md` BEFORE EVERY COMMIT.**

ALWAYS run before committing:

```bash
pre-commit run --all-files
```

Every CI failure costs money. Every failed push wastes time. CHECK BEFORE YOU PUSH.
## Project Overview

DigitalChild (GRIMdata / LittleRainbowRights) is a Python 3.12 data pipeline for scraping, processing, and analyzing human rights documents, policies, and reports, with a focus on child and LGBTQ+ digital protection.

Key tech: BeautifulSoup4, Selenium, pandas, pypdf, pytest, Flask (API backend).
## Setup

```bash
python init_project.py
```

Creates the directory structure and placeholder files. Safe to re-run (idempotent).

```bash
pip install -r requirements.txt
```

For CI/development, also install:

```bash
pip install pytest pytest-cov pre-commit
```

For API development (Phase 4):

```bash
pip install -r api_requirements.txt
```

Install hooks (REQUIRED before committing):

```bash
pre-commit install
```

Run all checks (MUST pass before pushing):

```bash
pre-commit run --all-files
```

Pre-commit runs: black, isort, flake8, markdownlint, trailing-whitespace, end-of-file-fixer, check-yaml, check-json, detect-private-key.

Linting config: line length 88 (black standard); flake8 ignores E203, E501, W503.
## Testing

IMPORTANT: Activate the virtual environment first to ensure all dependencies (pypdf, python-docx) are available.

```bash
# Activate virtual environment
source .LittleRainbow/bin/activate  # On Windows: .LittleRainbow\Scripts\activate

# Full test suite (~106 seconds, 170 pipeline tests + 39 API tests)
pytest tests/ -v

# Specific test file
pytest tests/test_year_extraction.py -v

# With coverage (as CI does)
pytest tests/ --maxfail=1 --disable-warnings -q --cov=processors --cov=scrapers --cov-report=term-missing
```

## Running the Pipeline

```bash
# Basic run (AU policies)
python pipeline_runner.py --source au_policy

# With options
python pipeline_runner.py --source au_policy --tags-version latest
python pipeline_runner.py --source upr --country kenya
python pipeline_runner.py --mode scorecard --scorecard-action all

# Demo pipeline
python utils/pipeline_runner_DEMO.py
```

IMPORTANT: Always run from the repository root for imports to work.
## API Backend (Phase 4)

```bash
# Install API dependencies
pip install -r api_requirements.txt

# Run development server
python run_api.py

# Test API endpoints
python test_api.py

# Quick curl test
curl http://localhost:5000/api/health
```

The API is available at http://127.0.0.1:5000. See `api/README.md` for endpoint documentation.
## Architecture

Pipeline flow: Scraper → Raw Files → Processor → Text → Tagger → Metadata → Exports

- Scrapers (`scrapers/`) fetch documents from web sources → `data/raw/[source]/`
- Processors (`processors/`) convert PDFs/DOCX/HTML to text → `data/processed/[region]/[org]/text/`
- Tagger applies regex rules from `configs/tags_*.json`
- Metadata stored in `data/metadata/metadata.json` (tracks tags history, recommendations)
- Exports generate CSV summaries in `data/exports/`
`pipeline_runner.py` is the main entry point, with three modes:

- `scraper` mode: run scrapers, process docs, tag, export
- `urls` mode: process from static URL dictionaries in `configs/url_dict/`
- `scorecard` mode: enrich/export/validate scorecard data
`SCRAPER_MAP` in `pipeline_runner.py:46-60` maps source names to:

- Scraper module
- Output directory path
- Document type

Each source has both a requests-based scraper and a Selenium variant (`_sel` suffix).
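The mapping's shape can be sketched as follows; the real entries live in `pipeline_runner.py:46-60`, and the field names, source entries, and module paths below are illustrative assumptions, not the actual code:

```python
# Illustrative sketch of the SCRAPER_MAP structure. The real mapping in
# pipeline_runner.py:46-60 may use different field names or a tuple layout.
SCRAPER_MAP = {
    "au_policy": {
        "module": "scrapers.au_policy",          # requests-based scraper
        "module_sel": "scrapers.au_policy_sel",  # Selenium variant
        "output_dir": "data/raw/au_policy",
        "doc_type": "Policy",
    },
    "upr": {
        "module": "scrapers.upr",
        "module_sel": "scrapers.upr_sel",
        "output_dir": "data/raw/upr",
        "doc_type": "Report",
    },
}
```

The runner can then resolve `--source au_policy` to a scraper module and output directory with a single dict lookup.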
Fallback Handler (`processors/fallback_handler.py`):

- Tries processors in sequence until one succeeds
- Used for unknown file types
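A minimal sketch of that try-in-sequence pattern (the converter names and signature are hypothetical; the actual logic lives in `processors/fallback_handler.py`):

```python
# Sketch of a try-in-sequence fallback: call each converter in order and
# return the first non-None result. Converters here are hypothetical
# callables with the (input_path, output_dir) signature used by processors.
def convert_with_fallback(input_path, output_dir, converters):
    """Try each converter in order; return the first successful result."""
    for convert in converters:
        try:
            result = convert(input_path, output_dir)
            if result is not None:
                return result
        except Exception:
            continue  # fall through to the next converter
    return None
```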
Year Extraction (`pipeline_runner.py:212-244`):

- Pattern: `(19|20)\d{2}` with boundary checks
- Sources: filename first, then first 1000 chars of text
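A sketch of that extraction order, assuming the boundary checks are lookarounds that reject years embedded in longer digit runs (the real implementation at `pipeline_runner.py:212-244` may differ):

```python
import re

# Sketch: lookarounds stand in for the "boundary checks", so "2024" in
# "AU_Digital_Compact_2024.pdf" matches but "2019" inside "12019" does not.
YEAR_RE = re.compile(r"(?<!\d)(19|20)\d{2}(?!\d)")

def extract_year(filename, text):
    """Prefer a year found in the filename; fall back to the first 1000 chars."""
    for candidate in (filename, text[:1000]):
        match = YEAR_RE.search(candidate)
        if match:
            return int(match.group(0))
    return None
```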
Country/Region Detection (`utils/detectors.py`):

- Uses filename, URL keys, and text content
- Normalizes via `json_normalizer.py` (preserves `_raw` fields)
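The `_raw` preservation might look like this sketch; the mapping table and function name are hypothetical illustrations, not the actual `json_normalizer.py` API:

```python
# Sketch: normalize a detected country/region while keeping the original
# detected string for provenance. The mapping table is hypothetical.
NORMALIZED = {
    "African Union": "African_Union",
    "Sub-Saharan Africa": "Africa",
}

def normalize_entry(entry, field, raw_value):
    """Store the normalized value under `field`, the original under `field_raw`."""
    entry[f"{field}_raw"] = raw_value
    entry[field] = NORMALIZED.get(raw_value, raw_value.replace(" ", "_"))
    return entry
```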
## Scorecard System

Separate workflow for country-level indicators (10 metrics per country):

- Load: `processors/scorecard.py` reads `data/scorecard/scorecard_main.xlsx` (canonical source)
- Enrich: `processors/scorecard_enricher.py` adds indicators to document metadata
- Export: `processors/scorecard_export.py` generates CSV exports
- Validate: `processors/scorecard_validator.py` checks source URLs
- Diff: `processors/scorecard_diff.py` monitors sources for changes

Run the full scorecard workflow:

```bash
python pipeline_runner.py --mode scorecard --scorecard-action all
```

Or individual steps:

```bash
python processors/scorecard_enricher.py
python -c "from processors.scorecard_export import export_scorecard; export_scorecard()"
```

## Package Structure

- `scrapers/`: NO `__init__.py` (modules imported directly by name)
- `processors/`: HAS `__init__.py` (proper package)
- `tests/conftest.py`: adds the project root to `sys.path` for imports
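That `conftest.py` trick is conventionally a couple of lines like the following sketch of the standard pattern (not necessarily the file's exact contents):

```python
# tests/conftest.py (sketch): put the project root on sys.path so tests can
# import `processors` and `scrapers` regardless of the working directory.
import os
import sys

sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
```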
## Metadata Schema

Documents are tracked in `data/metadata/metadata.json`:

```json
{
  "id": "AU_Digital_Compact.pdf",
  "source": "au_policy",
  "country": "African_Union",
  "country_raw": "African Union",
  "region": "Africa",
  "region_raw": "Sub-Saharan Africa",
  "year": 2024,
  "year_extracted_from": "first_page",
  "doc_type": "Policy",
  "file_type": "PDF",
  "ingestion_method": "scraper",
  "tags_history": [
    {
      "tags": ["AI", "DigitalPolicy"],
      "version": "tags_v3",
      "timestamp": "2025-08-28T15:22:00Z"
    }
  ],
  "recommendations_history": [],
  "scorecard": {
    "matched_country": "Albania",
    "enriched_at": "2024-01-15T10:30:00Z",
    "indicators": {
      "AI_Policy_Status": {"value": "...", "source": "..."},
      ...
    }
  },
  "last_processed": "2025-08-28T15:22:00Z"
}
```

## Tags Configuration

Tags configs (`configs/tags_*.json`) define regex patterns:
```json
{
  "rules": {
    "ChildRights": ["child", "children", "youth", "minor"],
    "LGBTQ": ["lgbt", "lgbtq", "sexual orientation"],
    "AI": ["artificial intelligence", "\\bAI\\b", "machine learning"]
  }
}
```

Version resolution via `configs/tags_main.json`:

```json
{
  "versions": {
    "latest": "tags_v3.json",
    "v1": "tags_v1.json",
    "v2": "tags_v2.json"
  }
}
```

## Logging

Unified + per-module logging (`processors/logger.py`):
```python
from processors.logger import get_logger, set_run_logfile

# At pipeline start
set_run_logfile("source_run", module_logs=True)  # Creates logs/source_run.log

# In modules
logger = get_logger("module_name")
logger.info("Message")
```

Disable module logs with the `--no-module-logs` flag.
## Continuous Integration

GitHub Actions runs on push/PR to the main, homebase, and basecamp branches.

Job 1: test (Python 3.12, ubuntu-latest)

- Install deps: `pip install -r requirements.txt pytest pytest-cov pre-commit`
- Run pre-commit: `pre-commit run --all-files --show-diff-on-failure`
- Run tests: `pytest tests/ --maxfail=1 --disable-warnings -q --cov=processors --cov=scrapers --cov-report=term-missing`

Job 2: docs (Python 3.11, ubuntu-latest)

- Install: `pip install mdformat`
- Check: `mdformat --check README.md docs/`

CRITICAL: If pre-commit fails, CI fails. Always run `pre-commit run --all-files` before pushing.
## Adding a New Scraper

- Create `scrapers/new_source.py`
- Implement a `scrape()` function returning a list of file paths
- Add the source to `SCRAPER_MAP` in `pipeline_runner.py`
- Add tests in `tests/test_new_source.py`
Scraper template:

```python
# scrapers/new_source.py
import os

import requests

from processors.logger import get_logger

RAW_DIR = "data/raw/new_source"
URLS = {}  # name -> download URL; fill in for your source

logger = get_logger(__name__)


def scrape(base_url=None, countries=None):
    """Download documents to RAW_DIR. Returns list of file paths."""
    os.makedirs(RAW_DIR, exist_ok=True)
    downloaded = []
    for name, url in URLS.items():
        filepath = os.path.join(RAW_DIR, f"{name}.pdf")
        if os.path.exists(filepath):
            logger.info(f"Skipping (exists): {filepath}")
            continue
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        with open(filepath, "wb") as f:
            f.write(resp.content)
        downloaded.append(filepath)
    return downloaded
```

## Adding a New Processor

- Create `processors/new_processor.py`
- Implement a `convert(input_path, output_dir)` function
- Update `fallback_handler.py` if needed
- Add tests in `tests/test_new_processor.py`
Processor template:

```python
# processors/new_processor.py
import os

from processors.logger import get_logger

logger = get_logger(__name__)


def convert(input_path, output_dir):
    """
    Convert input file to text.
    Returns path to .txt file or None if failed.
    """
    os.makedirs(output_dir, exist_ok=True)
    basename = os.path.splitext(os.path.basename(input_path))[0]
    output_path = os.path.join(output_dir, f"{basename}.txt")
    try:
        # Your conversion logic here (extract_text is a placeholder)
        text = extract_text(input_path)
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(text)
        return output_path
    except Exception as e:
        logger.error(f"Conversion failed for {input_path}: {e}")
        return None
```

## Adding a New Tags Version

- Edit or create `configs/tags_vX.json`
- Update the version mapping in `configs/tags_main.json` if needed
- Run tests: `pytest tests/test_tagger.py -v`
- Test with the demo: `python utils/pipeline_runner_DEMO.py`
- Verify exports in `data/exports/`
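How the tagger applies those rules can be sketched as follows; the any-pattern, case-insensitive matching shown here is an assumption for illustration, not the actual tagger implementation:

```python
import re

# Sketch: a tag fires when any of its regex patterns matches the document
# text (case-insensitive). Rules mirror the configs/tags_*.json format.
def apply_tags(text, rules):
    """Return the list of tag names whose patterns match `text`."""
    matched = []
    for tag, patterns in rules.items():
        if any(re.search(p, text, re.IGNORECASE) for p in patterns):
            matched.append(tag)
    return matched

rules = {
    "ChildRights": ["child", "children", "youth", "minor"],
    "AI": ["artificial intelligence", "\\bAI\\b", "machine learning"],
}
```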
## Updating Scorecard Data

- Edit `data/scorecard/scorecard_main.xlsx` (the canonical source) with new data
- Re-enrich: `python processors/scorecard_enricher.py`
- Re-export: `python -c "from processors.scorecard_export import export_scorecard; export_scorecard()"`
- Validate: `python processors/scorecard_validator.py`
- Update the convenience copy: `cp data/scorecard/scorecard_main.xlsx scorecard.xlsx`
## File Naming

Use underscores and include the year when available:

- ✅ `AU_Digital_Compact_2024.pdf`
- ✅ `Kenya_UPR_Report_2020.pdf`
- ❌ `digital compact final.pdf` (spaces, no year)
- ❌ `doc1.pdf` (non-descriptive)

Year extraction depends on the `(19|20)\d{2}` regex with boundary checks.
## Troubleshooting

- Error: `ModuleNotFoundError: No module named 'requests'`
  Fix: `pip install -r requirements.txt`
- Error: `FileNotFoundError: data/metadata/metadata.json`
  Fix: `python init_project.py`
- Error: `ModuleNotFoundError: No module named 'processors'`
  Fix: run commands from the project root, not from subdirectories
Always run `pre-commit run --all-files` before committing. Common auto-fixes:

- Trailing whitespace
- Missing end-of-file newline
- Black formatting: `pre-commit run black --all-files`
- Import order: `pre-commit run isort --all-files`
## Key Facts

- ALWAYS run `python init_project.py` on a fresh clone
- Full test suite takes ~106 seconds (170 pipeline tests + 39 API tests)
- The PyPDF2 deprecation warning is expected (migration to pypdf planned)
- Pre-commit hooks are CRITICAL: CI fails if they fail
- Line length: 88 characters (black standard)
- Import sorting: isort with `--profile black`
- Scrapers have NO `__init__.py`; processors HAVE `__init__.py`
- Use `from processors.logger import get_logger` for logging
- Country/region normalization preserves `_raw` fields for provenance
- Tags history tracks versions and timestamps
- Scorecard enrichment is separate from the main pipeline
- Static URL dictionaries live in `configs/url_dict/`
- All data directories (`data/raw/`, `data/processed/`, `data/exports/`, `logs/`) are gitignored
## Additional Documentation

- `docs/guides/FIRST_RUN_ERRORS.md` - Troubleshooting first run
- `docs/notes/PIPELINE_FLOW.md` - Detailed pipeline flow
- `docs/guides/SCORECARD_WORKFLOW.md` - Complete scorecard system guide
- `docs/standards/METADATA_SCHEMA.md` - Metadata structure
- `docs/standards/TAGS_CONFIG_FORMAT.md` - Tags configuration format
- `docs/standards/SCRAPER_STRUCTURE.md` - Scraper implementation guide
- `docs/standards/FILE_NAMING_STANDARDS.md` - File naming conventions
- `docs/notes/PIPELINE_LOGGING.md` - Logging system details