This document describes the complete scorecard system for the DigitalChild project, which tracks 10 human rights indicators across countries and enriches document metadata.
The scorecard system provides country-level data on digital child protection policies and LGBTQ+ rights. It consists of:
- Data Source: `scorecard_main.xlsx` - Excel file with country indicators
- Loader: `processors/scorecard.py` - Loads and caches scorecard data
- Enricher: `processors/scorecard_enricher.py` - Adds scorecard data to document metadata
- Exporter: `processors/scorecard_export.py` - Creates CSV exports for website/analysis
- Validator: `processors/scorecard_validator.py` - Checks source URLs for broken links
- Diff Checker: `processors/scorecard_diff.py` - Monitors sources for changes
The scorecard tracks 10 indicators (each with value + source URL):
- AI_Policy_Status - National AI policy/strategy status
- Data_Protection_Law - Data protection/privacy legislation
- Children_Data_Safeguards - Child-specific data protection measures
- SOGI_Sensitive_Data - Sexual orientation/gender identity data protections
- DPA_Independence - Data Protection Authority independence
- DPIA_Required_High_Risk_AI - Data protection impact assessments for AI
- LGBTQ_Legal_Status - Legal status of LGBTQ+ people
- Promotion_Propaganda_Offences - Anti-LGBTQ+ propaganda laws
- COP_Strategy - Child online protection strategy
- SIM_Biometric_ID_Linkage - SIM registration and biometric requirements
File: data/scorecard/scorecard_main.xlsx
The scorecard Excel file contains multiple sheets:
- UN_194 (primary sheet): 194 UN member states with all 10 indicators
- SADC: 16 SADC member states (regional subset)
- ECOWAS: 13 ECOWAS member states (regional subset)
- Global: Scoring rules and methodology documentation
IMPORTANT: The scorecard.py loader reads from the UN_194 sheet by default. This sheet contains the complete dataset for all 194 countries.
Sheet Structure (UN_194):
- Column 1: RowNumber
- Column 2: Country (full country name)
- Columns 3-4: Region - Broad, Region - Specific
- Columns 5+: Indicator value columns paired with _Source columns
Example: AI_Policy_Status (value) + AI_Policy_Status_Source (URL)
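The value/source pairing convention above can be sketched in Python. This is an illustrative helper, not the project's actual loader code; the header list below only includes two of the ten indicators:

```python
# Meta columns that precede the indicator columns in the UN_194 sheet.
META_COLUMNS = {"RowNumber", "Country", "Region - Broad", "Region - Specific"}

def indicator_pairs(columns):
    """Return (value_column, source_column) pairs from a header list."""
    return [(col, f"{col}_Source") for col in columns
            if col not in META_COLUMNS and not col.endswith("_Source")]

header = ["RowNumber", "Country", "Region - Broad", "Region - Specific",
          "AI_Policy_Status", "AI_Policy_Status_Source",
          "Data_Protection_Law", "Data_Protection_Law_Source"]
print(indicator_pairs(header))
# [('AI_Policy_Status', 'AI_Policy_Status_Source'),
#  ('Data_Protection_Law', 'Data_Protection_Law_Source')]
```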
scorecard_main.xlsx
(UN_194 sheet)
↓
scorecard.py (loader)
↓
┌───────────┴───────────┐
↓ ↓
scorecard_enricher.py scorecard_export.py
(add to metadata) (CSV exports)
↓
metadata.json
(enriched documents)
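The enrichment branch of this flow can be sketched as follows. The `SCORECARD` mapping, the `enrich` function, and the URL are placeholders standing in for the loaded UN_194 data, not the project's real implementation:

```python
from datetime import datetime, timezone

# Placeholder lookup standing in for the loaded scorecard data.
SCORECARD = {
    "Albania": {
        "AI_Policy_Status": {
            "value": "Draft policy under development (2023)",
            "source": "https://example.org/albania-ai",  # placeholder URL
        },
    },
}

def enrich(doc):
    """Attach matched scorecard indicators under a 'scorecard' key."""
    row = SCORECARD.get(doc.get("country"))
    if row is None:
        return doc  # no match: leave the document unchanged
    doc["scorecard"] = {
        "matched_country": doc["country"],
        "enriched_at": datetime.now(timezone.utc).isoformat(),
        "indicators": row,
    }
    return doc

doc = enrich({"id": "doc-1", "country": "Albania"})
print(doc["scorecard"]["matched_country"])  # Albania
```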
# 1. Place scorecard_main.xlsx in project root
# 2. Test scorecard loads correctly
python -c "from processors.scorecard import load_scorecard; print(load_scorecard())"
# 3. Run tests to verify
pytest tests/test_scorecard.py -v

Add scorecard indicators to documents based on their country:
# Enrich all documents in metadata.json
python processors/scorecard_enricher.py
# Dry run (don't save changes)
python processors/scorecard_enricher.py --dry-run
# Show enrichment summary only
python processors/scorecard_enricher.py --summary

Programmatic usage:
from processors.scorecard_enricher import enrich_document, enrich_all_metadata
# Enrich single document
doc = {"id": "doc-1", "country": "Albania"}
enriched_doc = enrich_document(doc)
# Enrich all metadata
stats = enrich_all_metadata(save=True)
print(f"Enriched {stats['enriched']} documents")

Output format:
{
  "id": "doc-1",
  "country": "Albania",
  "scorecard": {
    "matched_country": "Albania",
    "enriched_at": "2024-01-15T10:30:00Z",
    "indicators": {
      "AI_Policy_Status": {
        "value": "Draft policy under development (2023)",
        "source": "https://..."
      },
      "Data_Protection_Law": {
        "value": "Law No. 9887 (2008), aligned with GDPR",
        "source": "https://..."
      }
      // ... 8 more indicators
    }
  }
}

Generate CSV exports for website/analysis:
# From Python code
from processors.scorecard_export import export_scorecard
exports = export_scorecard()
# Returns:
# {
# "summary": "data/exports/scorecard_summary.csv",
# "sources": "data/exports/scorecard_sources.csv",
# "indicator_counts": "data/exports/scorecard_indicator_counts.csv"
# }

Export types:
- Summary CSV: All countries with all indicators (for main table)
- Sources CSV: All source URLs (for verification/citation)
- Indicator Counts: Distribution of values per indicator (for charts)
- By Indicator: Individual CSV per indicator
- By Region: Countries filtered by region
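The "Indicator Counts" export is essentially a value histogram per indicator. A minimal sketch of the counting step, using made-up values rather than real scorecard data:

```python
from collections import Counter

# Count how often each value appears for one indicator column
# (the values below are illustrative, not real scorecard data).
values = ["Criminalized", "Legal", "Legal", "Criminalized", "Legal"]
counts = Counter(values)
print(counts.most_common())  # [('Legal', 3), ('Criminalized', 2)]
```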
Programmatic usage:
from processors.scorecard_export import ScorecardExporter
exporter = ScorecardExporter()
# Export specific region
exporter.export_by_region("Africa", "data/exports/africa.csv")
# Export specific indicator
exporter.export_by_indicator("LGBTQ_Legal_Status", "data/exports/lgbtq_status.csv")
# Export all at once
exports = exporter.export_all()

Check all source URLs for broken/redirected links:
# Run validation
python processors/scorecard_validator.py
# Custom worker count
python processors/scorecard_validator.py --workers 20
# Don't save reports
python processors/scorecard_validator.py --no-save

Output:
- `data/exports/scorecard_url_validation.json` - Full validation report
- `data/exports/scorecard_broken_links.csv` - Broken links only (for review)
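How the validator distinguishes the three outcomes is not spelled out here; one plausible classification rule, assuming an HTTP status code and a redirect flag are available for each checked URL, looks like this (an assumption, not the project's exact logic):

```python
def classify(status_code, was_redirected):
    """Hypothetical ok/broken/redirected rule based on the final response."""
    if status_code >= 400:
        return "broken"      # client or server error: link is dead
    if was_redirected:
        return "redirected"  # reachable, but the URL has moved
    return "ok"

print(classify(200, False))  # ok
print(classify(404, False))  # broken
print(classify(200, True))   # redirected
```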
Programmatic usage:
from processors.scorecard_validator import run_validation
report = run_validation(save_reports=True)
print(f"{report['ok']} OK, {report['broken']} broken, {report['redirected']} redirected")

Check monitored sources for content changes:
# Check all monitored sources
python processors/scorecard_diff.py
# Check specific country sources
python processors/scorecard_diff.py --country "South Africa"
# Check sources only (skip stale entry detection)
python processors/scorecard_diff.py --sources-only

Monitored sources:
- UNESCO AI Policy Observatory
- UNCTAD Data Protection Tracker
- ILGA World Maps
- Human Dignity Trust
- GSMA SIM Registration
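A common way to implement this kind of change monitoring is to hash the fetched content and compare it against the hash stored from the previous run. The sketch below illustrates that idea; the project's actual diff logic may differ:

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a page's text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def has_changed(new_text, cached_hash):
    """True when the fetched content no longer matches the cached hash."""
    return content_hash(new_text) != cached_hash

cached = content_hash("policy text, version 1")
print(has_changed("policy text, version 1", cached))  # False
print(has_changed("policy text, version 2", cached))  # True
```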
Programmatic usage:
from processors.scorecard_diff import run_diff_check, check_country_sources
# Full check
report = run_diff_check(save_report=True)
# Check specific country
results = check_country_sources("Kenya")

The scorecard enrichment is not part of the main pipeline (pipeline_runner.py) by default. It is a separate step run after documents are processed.
Typical workflow:
# 1. Run pipeline to scrape and process documents
python pipeline_runner.py --source upr --country "Kenya"
# 2. Enrich metadata with scorecard
python processors/scorecard_enricher.py
# 3. Export scorecard data for website
python -c "from processors.scorecard_export import export_scorecard; export_scorecard()"

To integrate scorecard enrichment into the pipeline:
# In pipeline_runner.py, after process_documents():
from processors.scorecard_enricher import enrich_all_metadata
# After processing is complete
if args.enrich_scorecard:
    logger.info("Enriching metadata with scorecard indicators...")
    stats = enrich_all_metadata(save=True)
    logger.info(f"Enriched {stats['enriched']} documents")

To update scorecard data:

- Edit `scorecard_main.xlsx` with new data
- Force reload: `load_scorecard(force_reload=True)`
- Re-enrich metadata: `python processors/scorecard_enricher.py`
- Re-export: `python -c "from processors.scorecard_export import export_scorecard; export_scorecard()"`
# Check for broken links
python processors/scorecard_validator.py
# Check for stale entries
python processors/scorecard_diff.py
# Run all scorecard tests
pytest tests/test_scorecard.py -v

To add a new indicator:

1. Add a column pair to `scorecard_main.xlsx`: `New_Indicator` (value column) and `New_Indicator_Source` (source URL column)
2. Update `INDICATOR_COLUMNS` in `processors/scorecard.py`:

   INDICATOR_COLUMNS = [
       # ... existing indicators
       ("New_Indicator", "New_Indicator_Source"),
   ]

3. Re-run enrichment and exports
- Source Data: `scorecard_main.xlsx` (project root)
- Exports: `data/exports/scorecard_*.csv`
- Validation Reports: `data/exports/scorecard_url_validation.json`
- Diff Reports: `data/exports/scorecard_diff_report.json`
- Cache: `data/cache/scorecard_sources/*.json`
# Run all scorecard tests
pytest tests/test_scorecard.py -v
# Run specific test class
pytest tests/test_scorecard.py::TestScorecardLoader -v
# Run with coverage
pytest tests/test_scorecard.py --cov=processors/scorecard --cov-report=html

Problem: ValueError: Worksheet named 'X' not found
Solution: The scorecard file has multiple sheets. The loader expects the UN_194 sheet by default (as of 2026-01-24). If you see this error:
- Check that `scorecard_main.xlsx` contains a sheet named "UN_194"
- Verify the sheet has 194 rows (countries) with all indicator columns
- The sheet name is hard-coded in `processors/scorecard.py` line 65: `df = pd.read_excel(filepath, sheet_name="UN_194")`
Historical Note: Prior to 2026-01-24, the code expected a sheet named "Sheet1". This was updated to use the properly named "UN_194" sheet for clarity.
Problem: Document country doesn't match scorecard country names
Solution: The loader tries multiple normalization methods:
- Exact match (case-insensitive)
- ISO code lookup
- Fuzzy matching
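The matching cascade above can be sketched with stdlib tools. The ISO lookup table and the 0.8 fuzzy-match cutoff are assumptions for illustration; the real loader may implement each step differently:

```python
import difflib

def match_country(name, scorecard_countries, iso_lookup=None):
    """Try exact, then ISO-code, then fuzzy matching."""
    iso_lookup = iso_lookup or {}
    by_lower = {c.lower(): c for c in scorecard_countries}
    if name.lower() in by_lower:            # 1. exact match (case-insensitive)
        return by_lower[name.lower()]
    if name.upper() in iso_lookup:          # 2. ISO code lookup
        return iso_lookup[name.upper()]
    close = difflib.get_close_matches(name, scorecard_countries, n=1, cutoff=0.8)
    return close[0] if close else None      # 3. fuzzy match, or no match

countries = ["Kenya", "South Africa"]
print(match_country("kenya", countries))                         # Kenya
print(match_country("ZAF", countries, {"ZAF": "South Africa"}))  # South Africa
```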
Check country names in metadata vs scorecard:
from processors.scorecard import get_countries_list
countries = get_countries_list()
print(countries)  # List all scorecard countries

Problem: Enriched document missing some indicators
Solution: Check for empty cells in scorecard_main.xlsx. Empty values are skipped.
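Empty-cell skipping can be illustrated like this. NaN is what pandas produces for blank Excel cells; the helper name is hypothetical:

```python
import math

def drop_empty(row):
    """Drop None, NaN, and blank-string cells so they are skipped."""
    cleaned = {}
    for key, value in row.items():
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        if isinstance(value, str) and not value.strip():
            continue
        cleaned[key] = value
    return cleaned

row = {"AI_Policy_Status": "Adopted", "COP_Strategy": "", "DPA_Independence": float("nan")}
print(drop_empty(row))  # {'AI_Policy_Status': 'Adopted'}
```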
Problem: URL validation takes too long
Solution: Reduce worker count or increase timeout:
from processors.scorecard_validator import validate_all_urls
report = validate_all_urls(max_workers=5)  # Slower but more reliable

Planned enhancements:

- Auto-update from sources: Automatically scrape monitored sources and update scorecard
- Version tracking: Track scorecard changes over time
- API endpoint: Serve scorecard data via REST API for website
- Visualization: Generate charts/maps from scorecard data
- Comparison mode: Compare countries side-by-side
- Timeline view: Show indicator changes over time per country
- METADATA_SCHEMA.md - Document metadata structure
- PIPELINE_FLOW.md - Main pipeline workflow
- ISO_MAPPING.md - Country code standards