You are Claude Code. This brief is the source of truth. Read it once, then execute.
Add three new national-registry ingesters to the Commonweave directory, following the existing pipeline conventions. Each one targets a country whose national registry has an explicit nonprofit / charity / public-benefit entity class — meaning we can pull aligned orgs by legal form, not just keyword scoring on descriptions.
The three openers, in priority order:
- Australia — ACNC Charity Register (highest yield, cleanest API)
- Bulgaria — Register of Non-profit Legal Entities (NPO Register)
- Brazil — Mapa das OSCs (IPEA) + a thin CNPJ enrichment pass
Goal: ~50,000–80,000 new aligned rows, all legibility='formal', all geographically outside the existing US/UK skew.
Today's directory is 88% US/UK because IRS BMF + UK Charity Commission + Wikidata are the only big drawers we've opened. The "new wave" research doc names ~30 confirmed-reachable national registries with native nonprofit/charity classes — most outside the Anglosphere. Wave A picks three high-yield, three-different-continents openers to confirm the pattern works before scaling.
Australia (~70K registered charities), Bulgaria (~20K NPOs), and Brazil (~1M+ OSCs, of which we'll tag the aligned ones) cover Oceania, Eastern Europe, and Latin America in one wave.
The bigger win is methodological: this is the first ingest where we're scoring on legal form rather than English-keyword presence. See "Field-family scorer" below.
- Read
commonweave/NEW-WAVE-INGEST-PLAN.mdfirst. It's the full strategic context. This brief is the execution layer. - Follow the existing pipeline. Model new scripts on
ingest_grounded_solutions.pyandingest_unions.py. Same conventions:--dry-run,--refresh, cache todata/sources/<source>-cache/, log todata/ingest-<source>-run.log, idempotent onsource_id. legibility='formal'on every row. All three sources are official registries.- No em dashes. Feynman voice in all comments and docs.
- Surgical. No drive-by refactors of the existing pipeline. If you find a bug, log it in a comment and keep going.
- Polite scraping. Real User-Agent, 1–2s sleep between requests, cache aggressively, honor robots.txt.
- Dedup is the existing pipeline's job. Just ingest idempotently (upsert by
source_id).dedup_merge.pyruns separately and requires location agreement before merging. - No commits before Simon reviews. Stage everything. Push only after a smoke test on
--dry-runfor each ingester.
commonweave/NEW-WAVE-INGEST-PLAN.md— strategic contextcommonweave/data/_common.py— DB helperscommonweave/data/ingest_grounded_solutions.py— best template (caching, fallbacks, idempotent upserts)commonweave/data/ingest_unions.pyandingest_ituc.py— multi-source patterncommonweave/data/ingest_wikidata_bulk.py— SPARQL pattern (Bulgaria fallback may use this)commonweave/data/taxonomy.yamlcommonweave/PIPELINE.md— the field guide; especially Stages 2–4 and 7commonweave/data/phase2_filter.py— the scorer; you'll extend it (see below)commonweave/data/migrate_legibility.py
Transcribe the country atlas from commonweave org finder new wave.md (on Simon's desktop — ask Larry to paste the contents into the workspace if you can't reach the desktop) into a single YAML catalog. Schema per entry:
- country_code: AU
country_name: Australia
registrar: Australian Charities and Not-for-profits Commission
registry_name: ACNC Charity Register
url: https://www.acnc.gov.au/charity
bulk_data_url: https://data.gov.au/dataset/acnc/resource/... # if known
languages: [en]
entity_classes_in_scope: [registered_charity, public_benevolent_institution, religious_charity_secular_arms]
api: true
bulk_csv: true
auth_required: false
update_frequency: monthly
last_checked: 2026-04-25
notes: National charity register; excellent CSV download; size ~70k rows.
wave_a_target: trueIf a country in the new wave doc is "authority-anchor only" (no confirmed public search portal), still include the row with url: null and notes describing the situation. This file becomes the canonical "what we know about each country's drawer."
Don't try to do all ~190 countries today. Do the 17 Wave A countries named in NEW-WAVE-INGEST-PLAN.md, plus any of the ~30 confirmed-reachable countries the new wave doc explicitly verified. Mark wave_a_target: true on the three this brief covers; mark the rest with wave_a_target: false (Wave B/C will pick them up later).
Source order:
- Bulk CSV from data.gov.au — the ACNC publishes a quarterly/monthly dump of the full register. URL pattern: search
https://data.gov.au/dataset/acncfor the most recent "Register of Charities" CSV resource. Cache the CSV. - If the bulk file is unavailable, fall back to the ACNC public API:
https://acnc-public-api.azure-api.net/v1/...(search ACNC developer docs). - If both fail, scrape the public search at
https://www.acnc.gov.au/charitypaginated. Last resort.
Mapping:
name: Charity namecountry_code:AUcountry_name: Australiastate_province: state if availablecity: town/suburbregistration_id: ABNregistration_type:ACNC_REGISTRATIONdescription: Charity purpose + activity description (concatenate)website: charity website if listedframework_area: classify via existing keyword map (re-usephase2_filter.pykeyword bank); fallback tounknownand let the second-pass classifier handle itsource:acnc_charity_registersource_id: ABN (Australian Business Number — unique)legibility:formalmodel_type:nonprofitdefault; promote tocooperativeorfoundationif name/text hints
Filters at ingest time:
- Skip charities marked as revoked or deregistered (status field)
- Skip pure religious-only charities with no secular activity (use the ACNC sub-type classification — keep "Public Benevolent Institution," "Health Promotion Charity," charity types serving advancement of education / health / social welfare; drop charities whose ONLY sub-type is "Advancement of religion" with no secondary classification)
Expected row count: ~50,000–60,000 after filtering.
Flags: --dry-run, --refresh, --limit N.
Source order:
- Bulgarian Registry Agency — Register of Non-profit Legal Entities. URL: search
https://portal.registryagency.bg/for the NPO public search. Bulk export availability uncertain — check first. Cache HTML/JSON aggressively. - Fallback: Wikidata SPARQL for
instance of (P31) wdt:Q163740(nonprofit organization) ANDcountry (P17) wdt:Q219(Bulgaria). Mark these rows withsource='wikidata_bg_npo'(separate source so we know provenance). - Last resort: scrape the public list at the registry's web portal with proper pagination and language headers (
bg).
Language note: Bulgarian (Cyrillic). Keep the Bulgarian name in name; if a Latin transliteration is available, store it in name_translit (add the column via _common.ensure_column if it doesn't exist; otherwise concatenate in parens at end of name). The multilingual term bank in i18n_terms.py already handles Bulgarian alignment scoring.
Mapping:
country_code:BGregistration_type:BG_NPO_REGISTERsource:bg_npo_register(orwikidata_bg_npofor the fallback)source_id: EIK (single-ID code) if available; else slug
Expected row count: ~10,000–20,000 if the bulk pull works; ~1,000 if Wikidata fallback only.
Flags: same.
Source order:
- Mapa das OSCs (IPEA) public API. URL base:
https://mapaosc.ipea.gov.br/— find their JSON API endpoint (likely/api/...). The platform has ~860,000 OSCs catalogued with Brazilian government CNAE classification. - If the API requires registration, fall back to bulk CSV (Mapa publishes annual datasets — check the open data section).
- Last resort: query Receita Federal CNPJ public bulk data and filter by legal form code
3220(Associação Privada) and3999(Outras Fundações). This is a multi-GB download — only do this if Mapa's API is unreachable.
Critical filter: Brazil's Mapa is broad. Pre-filter rows by purpose / CNAE activity code to keep only:
- Health and social services (CNAE 86, 87, 88)
- Education (CNAE 85)
- Environmental protection (CNAE 81.30, 39.00)
- Cooperatives (legal form code 2143 — Cooperativa)
- Cultural/civic associations (CNAE 94)
- Food sovereignty / agriculture co-ops (CNAE 01.13.0, 01.62, 03)
This prevents pulling 800K rows of religious associations and HOAs. Aim for ~30,000–60,000 aligned rows.
Mapping:
country_code:BRregistration_id: CNPJregistration_type:BR_CNPJsource:mapa_oscs_brazilsource_id: CNPJdescription: combine "razão social," "objeto social" (purpose clause), and CNAE description- Language: Portuguese; multilingual term bank in
i18n_terms.pyhandlespt-BRalready.
Flags: same. Add --cnae-filter flag to override the default activity filter list.
Today the filter scores on description keywords. Add a second signal:
LEGAL_FORM_BUMPS = {
'cooperative': 3,
'community_land_trust': 4,
'community_company': 3,
'non_profit_company': 3,
'charitable_organization': 2,
'association': 2,
'foundation': 2,
'social_enterprise': 3,
'community_interest_company': 3,
'section_8_company': 3, # India
'osc': 2, # Brazil
'public_benevolent_institution': 3, # Australia
'benefit_corporation': 2,
}Apply by inspecting model_type and registration_type columns. The bump stacks with the existing keyword score, capped at alignment_score=5. This is what lets a Bulgarian or Brazilian row score high without an English description.
Run the re-score after all three ingesters land.
python data/ingest_acnc.py --dry-run— confirm row count and field mappingpython data/ingest_acnc.py— real run- Same for Bulgaria, then Brazil
python data/phase2_filter.py— re-score with new legal-form bumpspython data/check_counts.py— confirm directory size and country distribution shifted- Update
commonweave/DATA.mdwith the new sources, row counts, and updated "Honest Numbers" table
-
data/sources/REGISTRIES.yamlexists with at least the 17 Wave A countries documented - All three ingesters run cleanly on
--dry-runand real - Each ingester is idempotent (re-running adds zero new rows)
- Each ingester has a cache dir under
data/sources/ - Each ingester logs to
data/ingest-<source>-run.log -
phase2_filter.pyextended with legal-form score bumps -
DATA.mdupdated with new source entries and current row count - US/UK percentage drops below 80% (was 88%)
- No regressions:
python data/check_counts.pyshows expected growth, no orphaned rows - All commits in Feynman voice. One commit per ingester. No squashing.
Send a system event with the summary:
openclaw system event --text "Wave A ingest done: AU=<n>, BG=<n>, BR=<n>. Total directory: <n>. US/UK now <n>%." --mode now
If you hit a hard blocker (registry offline, API requires paid auth Simon hasn't approved, etc.), STOP and announce:
openclaw system event --text "Wave A blocked at <step>: <reason>. Need decision." --mode now
Don't guess. Ask.