AI-enhanced data mining for cholera surveillance data to fill missing observations in WHO African Region historical records through systematic multi-agent search, validation, and integration workflows.
This project addresses a critical gap in global cholera surveillance by using the Anthropic Opus 4 LLM to code an advanced agentic AI workflow to systematically mine, validate, and integrate official and unofficial cholera data sources. With existing surveillance reporting through the WHO AWD Dashboard and historical reports through the JHU Cholera Taxonomy Repository supplying an initial tranche of data (particularly for later time periods), our AI data mining focuses on time periods not already represented in these databases. This work directly supports epidemiological modeling in the MOSAIC framework, which aims to model endemic cholera transmission across the WHO African Region.
Historical cholera surveillance data in the WHO African Region contains substantial gaps, with approximately 50% of weekly surveillance records missing from official databases spanning 1970 to present. These gaps significantly impair epidemiological modeling accuracy, outbreak prediction capabilities, and evidence-based public health planning across 40 MOSAIC framework countries.
Geographic Focus: 40 MOSAIC Framework Countries in the WHO African Region
- Angola, Burundi, Benin, Burkina Faso, Botswana, Central African Republic, Côte d'Ivoire, Cameroon, Democratic Republic of Congo, Congo, Eritrea, Ethiopia, Gabon, Ghana, Guinea, Gambia, Guinea-Bissau, Equatorial Guinea, Kenya, Liberia, Mali, Mozambique, Mauritania, Malawi, Namibia, Niger, Nigeria, Rwanda, Senegal, Sierra Leone, Somalia, South Sudan, Eswatini, Chad, Togo, Tanzania, Uganda, South Africa, Zambia, Zimbabwe
Temporal Coverage: 1970-present with primary focus on the reporting gaps in each country
Data Sources: 486+ pre-authorized domains across 4 reliability tiers
- Gap Identification: Systematically identify missing surveillance periods across 40 countries
- Data Discovery: Deploy AI agents to mine official reports and unofficial cholera data sources
- Quality Assurance: Implement rigorous 4-stage validation protocols
- Integration: Produce standardized data sets that emulate the formatting of the JHU Cholera Taxonomy Repository with enhanced dual-reference indexing
- Documentation: Create comprehensive metadata with confidence weights for each data source and observation
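The gap identification step can be illustrated with a minimal sketch. It assumes weekly records keyed by date and weeks that start on Monday; the function names (`week_start`, `find_missing_weeks`, `group_into_ranges`) are hypothetical and are not taken from the repository's scripts.

```python
from datetime import date, timedelta


def week_start(d: date) -> date:
    """Return the Monday of the week containing d (assumed week convention)."""
    return d - timedelta(days=d.weekday())


def find_missing_weeks(observed, start, end):
    """List week-start dates between start and end that have no record.

    observed: iterable of dates for which a weekly record exists.
    """
    seen = {week_start(d) for d in observed}
    missing = []
    current = week_start(start)
    last = week_start(end)
    while current <= last:
        if current not in seen:
            missing.append(current)
        current += timedelta(weeks=1)
    return missing


def group_into_ranges(weeks):
    """Collapse consecutive missing weeks into (first_week, last_week) ranges."""
    ranges = []
    for w in weeks:
        if ranges and w - ranges[-1][1] == timedelta(weeks=1):
            # Extend the current range by one week.
            ranges[-1] = (ranges[-1][0], w)
        else:
            ranges.append((w, w))
    return ranges


if __name__ == "__main__":
    # Illustrative input only: two observed weeks in January 2024.
    observed = [date(2024, 1, 1), date(2024, 1, 22)]
    gaps = find_missing_weeks(observed, date(2024, 1, 1), date(2024, 1, 28))
    print(group_into_ranges(gaps))
```

Grouping missing weeks into ranges gives each search agent a small set of target periods rather than a long list of individual weeks.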
The methodology employs a systematic 6-agent workflow designed for comprehensive data discovery and validation:
Agent 1: Baseline enhancement and priority source coverage using the "Ultra-Deep Search Protocol" with deep dive modules for WHO DON reports, WHO WER, UNICEF, and MSF
Agent 2: Geographic expansion around found sources (provincial/district-level data)
Agent 3: Zero-transmission validation and absence period documentation with mandatory cross references
Agent 4: Obscure source exploration and archive mining
Agent 5: Query permutation and expanded search engines
Agent 6: Quality audit, validation, formatting checks, and final reporting
- Multi-engine parallel processing: 15+ search engines/databases per country
- Query framework: 7 mandatory categories, 50+ unique queries minimum
- Advanced techniques: Citation following, institution deep-dives, country-specific multi-language searches
- Source validation: 4-stage quality control with source confidence weighting
- Priority period targeting: Focus on specific missing date ranges
- Surveillance coverage analysis: Distinguish disease absence vs. reporting gaps
- Cross-border validation: Regional consistency checks
- Temporal stratification: Decade-specific systematic coverage
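As a rough illustration of query permutation with multi-language terms and priority-period targeting, the sketch below combines term lists with a target year range. The term lists and the `build_queries` helper are hypothetical; the 7 mandatory query categories are not reproduced here.

```python
from itertools import product


def build_queries(country_names, disease_terms, context_terms, years):
    """Return unique search strings for every combination of the inputs."""
    queries = []
    seen = set()
    for country, disease, context, year in product(
        country_names, disease_terms, context_terms, years
    ):
        query = f"{country} {disease} {context} {year}"
        key = query.casefold()  # de-duplicate case-insensitively
        if key not in seen:
            seen.add(key)
            queries.append(query)
    return queries


if __name__ == "__main__":
    # Hypothetical terms for one country, in English and Portuguese.
    queries = build_queries(
        country_names=["Mozambique", "Moçambique"],
        disease_terms=["cholera", "cólera"],
        context_terms=["cases", "outbreak", "situation report"],
        years=range(2006, 2013),
    )
    print(len(queries), queries[0])
```

Restricting `years` to a known missing date range keeps the query count manageable while still covering each language variant.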
4-Stage Validation Process:
- Authentication: URL verification, author credentials, domain validation
- Data Quality: Epidemiological validation (CFR 0.1-15%, attack rates 0.01-10%)
- Cross-Reference: Multi-source confirmation for major outbreaks (>1000 cases)
- Duplication: Systematic detection and resolution protocols
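The Data Quality and Cross-Reference stages lend themselves to a simple automated check. The following is a minimal sketch, not the project's validation code: `check_record` and its parameters are hypothetical, and only the numeric thresholds come from the list above.

```python
CFR_RANGE_PCT = (0.1, 15.0)           # case fatality ratio bounds, percent
ATTACK_RATE_RANGE_PCT = (0.01, 10.0)  # attack rate bounds, percent
MAJOR_OUTBREAK_CASES = 1000           # above this, require a second source


def check_record(cases, deaths, population=None, n_sources=1):
    """Return a list of review flags; an empty list means nothing was flagged."""
    if cases < 0 or deaths < 0:
        return ["negative count"]
    flags = []
    if deaths > cases:
        flags.append("deaths exceed cases")
    # Assumption: CFR is only checked when deaths were reported, so that
    # small zero-death clusters are not flagged automatically.
    if cases > 0 and deaths > 0:
        cfr = 100.0 * deaths / cases
        low, high = CFR_RANGE_PCT
        if not low <= cfr <= high:
            flags.append(f"CFR {cfr:.2f}% outside {low}-{high}%")
    # The attack rate needs a population denominator, which may be unknown.
    if population and cases > 0:
        attack_rate = 100.0 * cases / population
        low, high = ATTACK_RATE_RANGE_PCT
        if not low <= attack_rate <= high:
            flags.append(f"attack rate {attack_rate:.3f}% outside {low}-{high}%")
    if cases > MAJOR_OUTBREAK_CASES and n_sources < 2:
        flags.append("major outbreak needs a second source")
    return flags


if __name__ == "__main__":
    # Illustrative values only.
    print(check_record(cases=500, deaths=10, population=100_000))
    print(check_record(cases=2000, deaths=20, n_sources=1))
```

Returning flags rather than rejecting records outright leaves the final decision to the audit step, which seems closer to a review workflow than a hard filter.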
Source Reliability Classification:
- Level 1 (0.9-1.0): WHO, MoH, peer-reviewed journals
- Level 2 (0.7-0.9): UNICEF, OCHA, established NGOs
- Level 3 (0.3-0.6): Reputable news, local government reports
- Level 4 (0.1-0.3): Local media, preliminary reports
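One way to apply these ranges in code is to clamp a proposed confidence weight to the range of its source level. This is a sketch under that assumption; `clamp_weight` and `levels_for_weight` are hypothetical helpers. Note that, as listed, adjacent ranges share endpoints (0.9 and 0.3) and no level covers weights between 0.6 and 0.7.

```python
# Weight ranges copied from the reliability classification above.
RELIABILITY_LEVELS = {
    1: (0.9, 1.0),  # WHO, MoH, peer-reviewed journals
    2: (0.7, 0.9),  # UNICEF, OCHA, established NGOs
    3: (0.3, 0.6),  # reputable news, local government reports
    4: (0.1, 0.3),  # local media, preliminary reports
}


def clamp_weight(level, proposed):
    """Force a proposed weight into the range allowed for the source level."""
    low, high = RELIABILITY_LEVELS[level]
    return min(max(proposed, low), high)


def levels_for_weight(weight):
    """Return every level whose range contains the weight (zero, one, or two)."""
    return [
        level
        for level, (low, high) in RELIABILITY_LEVELS.items()
        if low <= weight <= high
    ]


if __name__ == "__main__":
    print(clamp_weight(3, 0.8))     # pulled down to the Level 3 upper bound
    print(levels_for_weight(0.9))   # shared endpoint of Levels 1 and 2
    print(levels_for_weight(0.65))  # falls in the uncovered 0.6-0.7 interval
```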
Output Format: JHU-compatible CSV with enhanced dual-reference indexing
Geographic Coding: AFR::{ISO}::{PROVINCE}::{DISTRICT} standardization
Metadata Documentation: 14-column comprehensive source attribution
Quality Weighting: 4-tier confidence scoring (0.1-1.0)
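The geographic coding pattern AFR::{ISO}::{PROVINCE}::{DISTRICT} can be built and parsed with a few lines of code. The sketch below assumes that national-level records omit the province and district parts and that province-level records omit the district; `build_location` and `parse_location` are hypothetical names.

```python
SEP = "::"
PREFIX = "AFR"


def build_location(iso, province=None, district=None):
    """Build a location code, omitting unused trailing levels (assumption)."""
    if district and not province:
        raise ValueError("a district requires a province")
    parts = [PREFIX, iso.strip().upper()]
    if province:
        parts.append(province.strip())
    if district:
        parts.append(district.strip())
    return SEP.join(parts)


def parse_location(code):
    """Split a location code into iso, province, and district (None if absent)."""
    parts = code.split(SEP)
    if not 2 <= len(parts) <= 4 or parts[0] != PREFIX:
        raise ValueError(f"unrecognized location code: {code!r}")
    # Pad missing trailing levels with None so the result always has 3 keys.
    values = parts[1:] + [None] * (4 - len(parts))
    return dict(zip(("iso", "province", "district"), values))


if __name__ == "__main__":
    # Illustrative place names only.
    code = build_location("moz", "Cabo Delgado", "Pemba")
    print(code)
    print(parse_location(code))
```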
Data Inclusion Criteria:
- Geographic specificity: Must represent actual administrative units
- Quantitative requirements: Specific case/death counts or validated absence periods
- Source authentication: Working URLs, institutional credibility verification
- Cholera-specific: Disease incidence data only (not vaccination/capacity metrics)
Quantitative Achievements:
- Data gap reduction: 454 of 796 weekly records (57%) enhanced
- Sources discovered: 25 working URLs across 6 source categories
- New observations: 35 data points spanning 1971-2025
- Quality distribution: 60% Level 1-2 sources, 40% Level 3-4 sources
- Validation success: 94% of extracted data passed all quality stages
Key periods filled: Critical gaps in 2006-2012 and 2016-2018
Geographic detail: Provincial-level data added for major outbreaks
- Search comprehensiveness: Multi-engine approach identified sources missed by single-engine searches
- Quality control effectiveness: Rigorous validation caught and corrected 12% of initial extractions
- Duplication prevention: Systematic checking prevented inclusion of 8 duplicate records
- Cross-reference validation: Historical validation identified and resolved 3 data inconsistencies
Primary Outputs:
- cholera_data.csv: Enhanced surveillance data with standardized formatting
- metadata.csv: Comprehensive source documentation and validation records
- search_report.txt: Summary of discoveries, gaps filled, and quality assessment
- Individual agent logs: Detailed search and validation documentation
Secondary Products:
- Interactive dashboard: Real-time progress tracking across all countries
- Timeline visualizations: Coverage plots showing data enhancement impact
MOSAIC Framework Integration:
- Enhanced time series for more complete surveillance records across 40 countries
- Quality-weighted data observations facilitate the modeling weights built into the likelihood functions
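To show how per-observation confidence weights could enter a likelihood, here is a minimal sketch of a weighted log-likelihood. The Poisson observation model is an assumption chosen for simplicity and is not stated to be the likelihood MOSAIC uses; `weighted_loglik` is a hypothetical name.

```python
import math


def poisson_logpmf(y, mu):
    """Log probability of observing count y under a Poisson mean mu > 0."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def weighted_loglik(observed, expected, weights):
    """Sum of weight * log-probability over non-missing observations.

    observed: reported counts, with None marking a missing week
    expected: model-predicted counts for the same weeks (all > 0)
    weights:  confidence weights in [0.1, 1.0], one per observation
    """
    total = 0.0
    for y, mu, w in zip(observed, expected, weights):
        if y is None:
            continue  # missing weeks contribute nothing
        total += w * poisson_logpmf(y, mu)
    return total


if __name__ == "__main__":
    # Illustrative values: the second week is missing, and the third comes
    # from a lower-confidence source so it is down-weighted.
    print(weighted_loglik([12, None, 30], [10.0, 15.0, 25.0], [1.0, 1.0, 0.3]))
```

Under this scheme a Level 4 observation still informs the fit, but a mismatch with the model costs less than the same mismatch in a Level 1 record.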
# Required Python packages
pip install pandas numpy matplotlib pillow
# Required data dependencies (automatically checked by setup script)
# - MOSAIC surveillance data (../MOSAIC-data/processed/cholera/weekly/)
# - JHU cholera database (../jhu_cholera_data/data/)
# - WHO dashboard data (../ees-cholera-mapping/data/cholera/who/awd/)
# Clone repository
git clone https://github.com/InstituteforDiseaseModeling/ai-cholera-data-mining.git
cd ai-cholera-data-mining
# Run complete automated setup (handles all initialization)
bash setup.sh
The setup script automatically:
- ✅ Generates integrated baseline gap analysis using JHU and WHO data
- ✅ Creates country codes and mappings for 40 MOSAIC framework countries
- ✅ Integrates JHU cholera database as baseline data
- ✅ Integrates WHO dashboard surveillance data (2023-2025)
- ✅ Creates 40 country directories with baseline data files
- ✅ Generates country-specific search protocols from templates
- ✅ Generates 6-agent workflow files from templates
- ✅ Sets up dashboard tracking system
# If you prefer manual step-by-step setup:
python py/analyze_integrated_coverage_gaps.py # Updated gap analysis
python py/get_iso_codes.py # Country mappings
python py/convert_jhu_to_workflow.py # JHU baseline integration
python py/convert_who_to_workflow.py # WHO data integration
python py/configure_countries.py # Template generation
# Countries now start with integrated JHU/WHO baseline data
# Use country-specific workflow files:
# ./data/{ISO_CODE}/agentic_workflow_{ISO_CODE}.txt
# Update unified dashboard after agent completion
bash update_dashboard.sh
# Generate specialized analysis datasets (optional)
# python py/generate_weekly_surveillance_longform.py # REMOVED - no longer needed
python py/generate_monthly_surveillance_matrix_v2.py
- Integrated Baseline: Countries start with combined JHU + WHO data instead of empty datasets
- Data Enhancement: Mission changed from data collection to baseline enhancement
- Gap-Targeted Search: AI agents focus on specific missing periods identified in baseline analysis
- Template-Based Generation: All workflow files generated from centralized templates
- Unified Dashboard: Single command (bash update_dashboard.sh) updates all dashboard components
ai-cholera-data-mining/
├── setup.sh  # Automated setup script (NEW)
├── data/  # Country-specific data and logs
│   └── {ISO_CODE}/
│       ├── cholera_data.csv  # Integrated baseline + AI enhancements
│       ├── metadata.csv  # Source documentation (dual-reference)
│       ├── agentic_workflow_{ISO}.txt  # Country-specific 6-agent workflow
│       ├── search_protocol_{ISO}.txt  # Country-specific search protocol
│       ├── search_log_agent_*.txt  # Individual agent logs (1-6)
│       └── search_report.txt  # Final summary report
├── dashboard/  # Real-time progress tracking
│   ├── completion_checklist.csv  # Automated country status tracking
│   ├── dashboard.html  # Interactive dashboard
│   └── timeline_plots_dual/  # Dual timeline visualization (national vs sub-national)
├── py/  # Core Python utilities
│   ├── analyze_integrated_coverage_gaps.py  # NEW: Integrated gap analysis
│   ├── convert_jhu_to_workflow.py  # NEW: JHU baseline integration
│   ├── convert_who_to_workflow.py  # NEW: WHO data integration
│   ├── update_dashboard_data.py  # Unified dashboard updater
│   ├── configure_countries.py  # Template-based setup
│   └── *.py  # Specialized analysis tools
├── reference/  # Reference data and mappings
│   ├── country_mapping.json  # MOSAIC country definitions
│   ├── agent_quick_reference.csv  # Gap-targeting data (auto-generated)
│   ├── observed_time_periods.csv  # Baseline coverage analysis (NEW)
│   ├── priority_data_gaps.csv  # Identified gaps for targeting (NEW)
│   └── priority_sources.txt  # Pre-authorized domains (486 sources)
└── templates/  # Workflow templates (NEW)
    ├── template_agentic_workflow.txt  # 6-agent workflow template
    └── template_search_protocol.txt  # Search protocol template
This project is part of the MOSAIC modeling framework for cholera surveillance enhancement. For contributions/collaborations contact [email protected]
This work is licensed under a Creative Commons Attribution 4.0 International License.
