This repository includes an automated monthly maintenance workflow for examples/TRUSTED_SOURCES.yaml to ensure the list of trusted accessibility resources remains high-quality and up-to-date.
The maintenance script (.github/scripts/maintain_trusted_sources.py) performs:
- Checks that every URL is reachable (404 detection)
- First 404: Marks source as "not active"
- Second 404: Removes the entry entirely
- Tracks error history in .github/data/url_error_history.json
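
The two-strike rule above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `record_404` and the returned action strings are hypothetical, and the sketch assumes the history is never cleared when a URL recovers. The record layout mirrors the url_error_history.json format used by this workflow.

```python
from datetime import datetime, timezone
from typing import Optional


def record_404(history: dict, source_id: str, now: Optional[datetime] = None) -> str:
    """Append a 404 record for source_id and return the action to take.

    Hypothetical helper: returns "mark_not_active" on the first recorded
    404 and "remove" on the second (or any later) one.
    """
    now = now or datetime.now(timezone.utc)
    # Create the nested structure on first use: {"errors": {source_id: [...]}}
    entries = history.setdefault("errors", {}).setdefault(source_id, [])
    entries.append(
        {
            "date": now.strftime("%Y-%m-%dT%H:%M:%S"),
            "status_code": 404,
            "error": "404 Not Found",
        }
    )
    return "mark_not_active" if len(entries) == 1 else "remove"
```

Calling `record_404(history, "source-id")` twice on an empty `history` dict yields the two actions in order, and the dict can then be written back to the history file with `json.dump`.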
- owner: Detects from domain name or description patterns
- license: Attempts to detect common licenses (Creative Commons, MIT, Apache, GPL, etc.)
  - Sets to "unknown" if not found (never leaves as null)
- last_reviewed: Updates timestamp when metadata is reviewed
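
License detection with an "unknown" fallback might look roughly like the sketch below. The regular expressions and the `detect_license` name are assumptions made for illustration; the patterns the real script uses are not documented here.

```python
import re

# Hypothetical patterns, checked in order; the first match wins.
LICENSE_PATTERNS = [
    ("Creative Commons", re.compile(r"creative\s+commons|\bCC[ -]BY\b", re.IGNORECASE)),
    ("MIT", re.compile(r"\bMIT\s+licen[sc]e\b", re.IGNORECASE)),
    ("Apache", re.compile(r"\bApache\s+licen[sc]e\b", re.IGNORECASE)),
    ("GPL", re.compile(r"\bGPL\b|GNU\s+General\s+Public\s+licen[sc]e", re.IGNORECASE)),
]


def detect_license(page_text):
    """Return a license family name, or "unknown" (never None)."""
    text = page_text or ""
    for name, pattern in LICENSE_PATTERNS:
        if pattern.search(text):
            return name
    return "unknown"
```

Returning the string "unknown" rather than `None` keeps the field from being serialized as null in the YAML file.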
- Checks Last-Modified headers to detect site updates
- Marks as "not active" if no content updates in over 1 year
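
As a rough sketch of the staleness check, the Last-Modified header value can be parsed with the standard library and compared against a one-year window. The name `is_stale`, the 365-day constant, and the choice to treat a missing or unparseable header as "not stale" are all assumptions, not confirmed behavior of the script.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

STALE_AFTER = timedelta(days=365)  # "over 1 year", taken as 365 days here


def is_stale(last_modified_header, now=None):
    """Return True if the header shows no update for more than a year.

    A missing or unparseable header returns False (assumed policy:
    do not penalize servers that omit Last-Modified).
    """
    if not last_modified_header:
        return False
    try:
        modified = parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        return False
    if modified.tzinfo is None:
        # Treat naive timestamps as UTC so the subtraction is well defined.
        modified = modified.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now - modified > STALE_AFTER
```

With requests, the header would typically be read as `response.headers.get("Last-Modified")` from a HEAD or GET response.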
- Validates YAML structure
- Ensures required fields are present
- Checks for valid status values
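
The structural checks could be sketched as below, operating on the data that `yaml.safe_load` returns for the sources file. The required field names and the set of valid statuses are placeholders: only "not active" appears in this document, so "active" and the field list are assumptions.

```python
# Hypothetical schema; adjust to the real fields in TRUSTED_SOURCES.yaml.
REQUIRED_FIELDS = ("id", "url", "status")
VALID_STATUSES = ("active", "not active")  # "active" is assumed


def validate_sources(sources):
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(sources, list):
        return ["top-level structure must be a list of entries"]
    problems = []
    for index, entry in enumerate(sources):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: expected a mapping")
            continue
        label = entry.get("id", f"entry {index}")
        for field in REQUIRED_FIELDS:
            if entry.get(field) in (None, ""):
                problems.append(f"{label}: missing required field '{field}'")
        status = entry.get("status")
        if status is not None and status not in VALID_STATUSES:
            problems.append(f"{label}: invalid status '{status}'")
    return problems
```

Collecting every problem instead of stopping at the first one lets a single run report all entries that need attention.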
The workflow runs:
- Automatically: Monthly on the 1st at 00:00 UTC
- Manually: Via workflow_dispatch with options:
  - full_scan: Perform annual topic_tags update
  - skip_validation: Skip URL checks (metadata only)
- Automated Execution: GitHub Action runs monthly
- Change Detection: Script validates URLs and enriches metadata
- Pull Request Creation: If changes are found, creates a PR for review
- Human Review: Maintainer reviews and merges the PR
    python .github/scripts/maintain_trusted_sources.py --full-scan
    python .github/scripts/maintain_trusted_sources.py --validate-only
    python .github/scripts/maintain_trusted_sources.py --skip-validation

- Python 3.11+
- Dependencies in .github/scripts/requirements.txt:
  - PyYAML
  - requests
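
For orientation, the three command-line flags used in the local-run commands could be parsed with argparse along these lines. Only the flag names come from this document; the help texts for `--validate-only` and the decision to make it mutually exclusive with `--skip-validation` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the flags named in this README (a sketch only)."""
    parser = argparse.ArgumentParser(prog="maintain_trusted_sources.py")
    parser.add_argument(
        "--full-scan",
        action="store_true",
        help="Perform the annual topic_tags update",
    )
    # Assumed: validating only and skipping validation cannot be combined.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument(
        "--validate-only",
        action="store_true",
        help="Assumed meaning: run checks without enriching metadata",
    )
    mode.add_argument(
        "--skip-validation",
        action="store_true",
        help="Skip URL checks (metadata only)",
    )
    return parser
```

argparse converts the dashes to underscores, so the parsed values are available as `args.full_scan`, `args.validate_only`, and `args.skip_validation`.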
The script maintains a history file at .github/data/url_error_history.json:
{
"errors": {
"source-id": [
{
"date": "2026-02-25T12:00:00",
"status_code": 404,
"error": "404 Not Found"
}
]
}
}

The script respects the ai_scraping field in TRUSTED_SOURCES.yaml:
- allowed (default): Content can be used
- prohibited: Do not scrape/train on content
- restricted: Use only for reference/citation
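
One way a consumer might honor the ai_scraping field is sketched below. The function name `may_use_content` is hypothetical, and treating unrecognized values as not permitted is a conservative assumption rather than documented behavior.

```python
def may_use_content(entry):
    """Return True only when the entry's ai_scraping policy is "allowed".

    A missing or null field falls back to the documented default, "allowed".
    "prohibited", "restricted", and any unrecognized value return False
    (the last case is an assumed, conservative choice).
    """
    policy = entry.get("ai_scraping") or "allowed"
    return policy == "allowed"
```

Entries marked restricted would still be usable for reference or citation; this helper only answers whether the content itself may be used.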
- ✅ Validates all URLs monthly
- ✅ Tracks 404 errors with two-strike removal
- ✅ Enriches missing metadata
- ✅ Detects content staleness
- ✅ Creates PRs for human review
- ⚠️ Verify sources marked "not active" are truly inactive
- ⚠️ Confirm removed sources should be deleted
- ⚠️ Check accuracy of detected metadata (owner, license)
- ⚠️ Add context for topic_tags during annual scans
Potential improvements:
- Topic Tags: Implement ML/NLP analysis for automatic tag suggestions
- RSS/Atom: Check blog feeds for recent posts
- Archive.org: Suggest archived versions for removed content
- WCAG Version: Parse content for WCAG 2.1/2.2/3.0 references
- Authority Scoring: Analyze citations, backlinks, author credentials
- Workflow: .github/workflows/maintain-trusted-sources.yml
- Script: .github/scripts/maintain_trusted_sources.py
- Data: examples/TRUSTED_SOURCES.yaml
- History: .github/data/url_error_history.json
For issues or questions about the maintenance workflow, please:
- Check existing GitHub Issues
- Review the workflow logs in GitHub Actions
- Open a new issue with the maintenance label