Version: 0.1
Last Updated: 2025-10-11 13:45 CT
Maintainers: Dataset Team (Victor + Cameron)
Storage Limit: ≤ 0.4 GB total (.parquet, Snappy-compressed)
This contract defines the structure, validation rules, and exchange standards for the datasets powering the Rural Opportunity Navigator prototype.
All data must meet this contract before ingestion into the backend API or dashboard.
| Dataset | Description | File | Target Size | Update Frequency |
|---|---|---|---|---|
| Programs Clean | Normalized USDA + federal program data | data-samples/programs_clean.parquet | ≤ 150 MB | Static |
| Programs + Income | Joined with Census/ERS income tiers | data-samples/programs_with_income.parquet | ≤ 250 MB | Static |
| Field | Type | Example | Description | Validation |
|---|---|---|---|---|
| `state` | string (2) | "MO" | U.S. state code | Must match /^[A-Z]{2}$/ |
| `county_fips` | string (5) | "29095" | County FIPS code | 5-digit numeric |
| `program_name` | string | "Farm Loan Program" | Program title | Non-null ≤ 120 chars |
| `agency` | string | "USDA Rural Development" | Admin agency | Optional |
| `intent_category` | string (enum) | "equipment_purchase" | User intent | Must match set |
| `industry` | string | "Livestock" | Sector | Optional |
| `income_band` | string (enum) | "Low" | Derived tier | Low / Mid / High |
| `funding_type` | string | "Loan" | Grant / Loan / Aid | Optional |
| `application_deadline` | date | 2024-12-31 | Deadline | ISO-8601 |
| `resolved_flag` | bool | true | Case resolved | true/false |
| `contact_reference` | string (URL) | https://rd.usda.gov/contact | Contact link | Valid URL |
| `source` | string | "USDA_RD_Portal" | Provenance | Required |
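
The field rules above can be sketched as a per-row check. This is a stdlib-only illustration, not the project's actual pydantic models in `scripts/validate_schema.py`. The function name `validate_record`, the choice to treat `state`, `county_fips`, `program_name`, and `source` as required while checking the other fields only when present, and the one-item `intent_categories` set are all assumptions, since the contract does not list the full set of allowed intents.

```python
import re
from datetime import date
from urllib.parse import urlparse

STATE_RE = re.compile(r"[A-Z]{2}")
FIPS_RE = re.compile(r"[0-9]{5}")
INCOME_BANDS = {"Low", "Mid", "High"}


def _is_iso_date(value):
    """Accept date objects or ISO-8601 date strings such as 2024-12-31."""
    if isinstance(value, date):
        return True
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def _is_url(value):
    """Assumption: a 'valid URL' means http(s) with a host part."""
    if not isinstance(value, str):
        return False
    try:
        parts = urlparse(value)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def validate_record(rec, intent_categories):
    """Return the names of fields that break a rule (empty list = valid)."""
    errors = []
    state, fips = rec.get("state"), rec.get("county_fips")
    name, source = rec.get("program_name"), rec.get("source")

    # Fields treated as required in this sketch.
    if not (isinstance(state, str) and STATE_RE.fullmatch(state)):
        errors.append("state")
    if not (isinstance(fips, str) and FIPS_RE.fullmatch(fips)):
        errors.append("county_fips")
    if not (isinstance(name, str) and len(name) <= 120):
        errors.append("program_name")
    if not (isinstance(source, str) and source):
        errors.append("source")

    # Remaining constrained fields are checked only when a value is present.
    intent = rec.get("intent_category")
    if intent is not None and not (isinstance(intent, str) and intent in intent_categories):
        errors.append("intent_category")
    band = rec.get("income_band")
    if band is not None and band not in INCOME_BANDS:
        errors.append("income_band")
    deadline = rec.get("application_deadline")
    if deadline is not None and not _is_iso_date(deadline):
        errors.append("application_deadline")
    flag = rec.get("resolved_flag")
    if flag is not None and not isinstance(flag, bool):
        errors.append("resolved_flag")
    url = rec.get("contact_reference")
    if url is not None and not _is_url(url):
        errors.append("contact_reference")
    return errors


# Example row built from the "Example" column of the table.
example = {
    "state": "MO", "county_fips": "29095", "program_name": "Farm Loan Program",
    "agency": "USDA Rural Development", "intent_category": "equipment_purchase",
    "industry": "Livestock", "income_band": "Low", "funding_type": "Loan",
    "application_deadline": "2024-12-31", "resolved_flag": True,
    "contact_reference": "https://rd.usda.gov/contact", "source": "USDA_RD_Portal",
}
# Placeholder set: only the one intent shown in the table is known here.
errors = validate_record(example, intent_categories={"equipment_purchase"})
```

Returning field names instead of raising on the first failure makes it easy to report every problem in a row at once; a pydantic model would give similar per-field error reporting.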
- No nulls in `state`, `county_fips`, `program_name`.
- ≤ 200 k rows; ≤ 30 columns.
- Combined size ≤ 0.4 GB.
- UTF-8 encoding; `.parquet` with Snappy compression.
- Drop or hash PII before commit.
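
The dataset-level rules in this list can be sketched the same way. The helper names `check_dataset` and `hash_pii` are hypothetical, rows are modelled as plain dicts rather than a Parquet table, and keyed HMAC-SHA-256 is only one reasonable reading of "hash PII"; the contract does not name an algorithm.

```python
import hashlib
import hmac

REQUIRED_NON_NULL = ("state", "county_fips", "program_name")


def check_dataset(rows, max_rows=200_000, max_columns=30):
    """Return a list of rule violations for a dataset given as a list of dicts."""
    problems = []
    if len(rows) > max_rows:
        problems.append(f"row count {len(rows)} exceeds {max_rows}")

    # Count the distinct column names seen across all rows.
    columns = set()
    for row in rows:
        columns.update(row)
    if len(columns) > max_columns:
        problems.append(f"column count {len(columns)} exceeds {max_columns}")

    # A missing key counts as a null for the required fields.
    for field in REQUIRED_NON_NULL:
        nulls = sum(1 for row in rows if row.get(field) is None)
        if nulls:
            problems.append(f"{field}: {nulls} null value(s)")
    return problems


def hash_pii(value, key):
    """Replace a PII string with a keyed SHA-256 digest (assumed scheme)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

A keyed hash is used here so that digests cannot be reversed by hashing a list of likely values without the key; if the team only needs the column removed, dropping it is simpler.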
- Run `scripts/validate_schema.py`.
- All pydantic models pass.
- Check total size ≤ 400 MB via `du -sh data-samples/`.
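
For an automated version of the size check, the sketch below sums file sizes under `data-samples/` and compares them with the per-file targets and the 400 MB total stated in this contract. It assumes decimal megabytes (1 MB = 1,000,000 bytes) and measures file length rather than disk blocks, so its numbers can differ slightly from `du -sh`. The function name `check_sizes` is hypothetical.

```python
from pathlib import Path

MB = 1_000_000  # assumption: decimal megabytes

# Per-file targets and total limit taken from the dataset table.
TARGETS = {
    "programs_clean.parquet": 150 * MB,
    "programs_with_income.parquet": 250 * MB,
}
TOTAL_LIMIT = 400 * MB


def check_sizes(data_dir, targets=None, total_limit=TOTAL_LIMIT):
    """Return a list of size violations for the files under data_dir."""
    data_dir = Path(data_dir)
    targets = TARGETS if targets is None else targets
    problems = []

    # Each expected file must exist and stay within its target size.
    for name, limit in targets.items():
        path = data_dir / name
        if not path.is_file():
            problems.append(f"{name}: missing")
        elif path.stat().st_size > limit:
            problems.append(f"{name}: {path.stat().st_size} bytes exceeds {limit}")

    # The combined size of everything in the directory must fit the total.
    total = sum(p.stat().st_size for p in data_dir.rglob("*") if p.is_file())
    if total > total_limit:
        problems.append(f"total: {total} bytes exceeds {total_limit}")
    return problems


if __name__ == "__main__":
    for problem in check_sizes("data-samples"):
        print(problem)
```

Run from the repository root, the script prints nothing when every limit is met.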