Brief: Wave A — Mission-First National Registry Ingest

You are Claude Code. This brief is the source of truth. Read it once, then execute.

Mission

Add three new national-registry ingesters to the Commonweave directory, following the existing pipeline conventions. Each one targets a country whose national registry has an explicit nonprofit / charity / public-benefit entity class — meaning we can pull aligned orgs by legal form, not just keyword scoring on descriptions.

The three openers, in priority order:

Australia — ACNC Charity Register (highest yield, cleanest API)
Bulgaria — Register of Non-profit Legal Entities (NPO Register)
Brazil — Mapa das OSCs (IPEA) + a thin CNPJ enrichment pass

Goal: ~50,000–80,000 new aligned rows, all legibility='formal', all geographically outside the existing US/UK skew.

Why we're doing this

Today's directory is 88% US/UK because IRS BMF + UK Charity Commission + Wikidata are the only big drawers we've opened. The "new wave" research doc names ~30 confirmed-reachable national registries with native nonprofit/charity classes — most outside the Anglosphere. Wave A picks three high-yield, three-different-continents openers to confirm the pattern works before scaling.

Australia (~70K registered charities), Bulgaria (~20K NPOs), and Brazil (~1M+ OSCs, of which we'll tag the aligned ones) cover Oceania, Eastern Europe, and Latin America in one wave.

The bigger win is methodological: this is the first ingest where we're scoring on legal form rather than English-keyword presence. See "Field-family scorer" below.

Hard rules

Read commonweave/NEW-WAVE-INGEST-PLAN.md first. It's the full strategic context. This brief is the execution layer.
Follow the existing pipeline. Model new scripts on ingest_grounded_solutions.py and ingest_unions.py. Same conventions: --dry-run, --refresh, cache to data/sources/<source>-cache/, log to data/ingest-<source>-run.log, idempotent on source_id.
legibility='formal' on every row. All three sources are official registries.
No em dashes. Feynman voice in all comments and docs.
Surgical. No drive-by refactors of the existing pipeline. If you find a bug, log it in a comment and keep going.
Polite scraping. Real User-Agent, 1–2s sleep between requests, cache aggressively, honor robots.txt.
Dedup is the existing pipeline's job. Just ingest idempotently (upsert by source_id). dedup_merge.py runs separately and requires location agreement before merging.
No commits before Simon reviews. Stage everything. Push only after a smoke test on --dry-run for each ingester.

What to read first

commonweave/NEW-WAVE-INGEST-PLAN.md — strategic context
commonweave/data/_common.py — DB helpers
commonweave/data/ingest_grounded_solutions.py — best template (caching, fallbacks, idempotent upserts)
commonweave/data/ingest_unions.py and ingest_ituc.py — multi-source pattern
commonweave/data/ingest_wikidata_bulk.py — SPARQL pattern (Bulgaria fallback may use this)
commonweave/data/taxonomy.yaml
commonweave/PIPELINE.md — the field guide; especially Stages 2–4 and 7
commonweave/data/phase2_filter.py — the scorer; you'll extend it (see below)
commonweave/data/migrate_legibility.py

Work items

0. Pre-flight: `data/sources/REGISTRIES.yaml`

Transcribe the country atlas from commonweave org finder new wave.md (on Simon's desktop — ask Larry to paste the contents into the workspace if you can't reach the desktop) into a single YAML catalog. Schema per entry:

- country_code: AU
  country_name: Australia
  registrar: Australian Charities and Not-for-profits Commission
  registry_name: ACNC Charity Register
  url: https://www.acnc.gov.au/charity
  bulk_data_url: https://data.gov.au/dataset/acnc/resource/...   # if known
  languages: [en]
  entity_classes_in_scope: [registered_charity, public_benevolent_institution, religious_charity_secular_arms]
  api: true
  bulk_csv: true
  auth_required: false
  update_frequency: monthly
  last_checked: 2026-04-25
  notes: National charity register; excellent CSV download; size ~70k rows.
  wave_a_target: true

If a country in the new wave doc is "authority-anchor only" (no confirmed public search portal), still include the row with url: null and notes describing the situation. This file becomes the canonical "what we know about each country's drawer."

Don't try to do all ~190 countries today. Do the 17 Wave A countries named in NEW-WAVE-INGEST-PLAN.md, plus any of the ~30 confirmed-reachable countries the new wave doc explicitly verified. Mark wave_a_target: true on the three this brief covers; mark the rest with wave_a_target: false (Wave B/C will pick them up later).

1. `commonweave/data/ingest_acnc.py` — Australia ACNC Charity Register

Source order:

Bulk CSV from data.gov.au — the ACNC publishes a quarterly/monthly dump of the full register. URL pattern: search https://data.gov.au/dataset/acnc for the most recent "Register of Charities" CSV resource. Cache the CSV.
If the bulk file is unavailable, fall back to the ACNC public API: https://acnc-public-api.azure-api.net/v1/... (search ACNC developer docs).
If both fail, scrape the public search at https://www.acnc.gov.au/charity paginated. Last resort.

Mapping:

name: Charity name
country_code: AU
country_name: Australia
state_province: state if available
city: town/suburb
registration_id: ABN
registration_type: ACNC_REGISTRATION
description: Charity purpose + activity description (concatenate)
website: charity website if listed
framework_area: classify via existing keyword map (re-use phase2_filter.py keyword bank); fallback to unknown and let the second-pass classifier handle it
source: acnc_charity_register
source_id: ABN (Australian Business Number — unique)
legibility: formal
model_type: nonprofit default; promote to cooperative or foundation if name/text hints

Filters at ingest time:

Skip charities marked as revoked or deregistered (status field)
Skip pure religious-only charities with no secular activity (use the ACNC sub-type classification — keep "Public Benevolent Institution," "Health Promotion Charity," charity types serving advancement of education / health / social welfare; drop charities whose ONLY sub-type is "Advancement of religion" with no secondary classification)

Expected row count: ~50,000–60,000 after filtering.

Flags: --dry-run, --refresh, --limit N.

2. `commonweave/data/ingest_bulgaria_npo.py` — Bulgaria NPO Register

Source order:

Bulgarian Registry Agency — Register of Non-profit Legal Entities. URL: search https://portal.registryagency.bg/ for the NPO public search. Bulk export availability uncertain — check first. Cache HTML/JSON aggressively.
Fallback: Wikidata SPARQL for instance of (P31) wdt:Q163740 (nonprofit organization) AND country (P17) wdt:Q219 (Bulgaria). Mark these rows with source='wikidata_bg_npo' (separate source so we know provenance).
Last resort: scrape the public list at the registry's web portal with proper pagination and language headers (bg).

Language note: Bulgarian (Cyrillic). Keep the Bulgarian name in name; if a Latin transliteration is available, store it in name_translit (add the column via _common.ensure_column if it doesn't exist; otherwise concatenate in parens at end of name). The multilingual term bank in i18n_terms.py already handles Bulgarian alignment scoring.

Mapping:

country_code: BG
registration_type: BG_NPO_REGISTER
source: bg_npo_register (or wikidata_bg_npo for the fallback)
source_id: EIK (single-ID code) if available; else slug

Expected row count: ~10,000–20,000 if the bulk pull works; ~1,000 if Wikidata fallback only.

Flags: same.

3. `commonweave/data/ingest_brazil_oscs.py` — Brazil Mapa das OSCs

Source order:

Mapa das OSCs (IPEA) public API. URL base: https://mapaosc.ipea.gov.br/ — find their JSON API endpoint (likely /api/...). The platform has ~860,000 OSCs catalogued with Brazilian government CNAE classification.
If the API requires registration, fall back to bulk CSV (Mapa publishes annual datasets — check the open data section).
Last resort: query Receita Federal CNPJ public bulk data and filter by legal form code 3220 (Associação Privada) and 3999 (Outras Fundações). This is a multi-GB download — only do this if Mapa's API is unreachable.

Critical filter: Brazil's Mapa is broad. Pre-filter rows by purpose / CNAE activity code to keep only:

Health and social services (CNAE 86, 87, 88)
Education (CNAE 85)
Environmental protection (CNAE 81.30, 39.00)
Cooperatives (legal form code 2143 — Cooperativa)
Cultural/civic associations (CNAE 94)
Food sovereignty / agriculture co-ops (CNAE 01.13.0, 01.62, 03)

This prevents pulling 800K rows of religious associations and HOAs. Aim for ~30,000–60,000 aligned rows.

Mapping:

country_code: BR
registration_id: CNPJ
registration_type: BR_CNPJ
source: mapa_oscs_brazil
source_id: CNPJ
description: combine "razão social," "objeto social" (purpose clause), and CNAE description
Language: Portuguese; multilingual term bank in i18n_terms.py handles pt-BR already.

Flags: same. Add --cnae-filter flag to override the default activity filter list.

4. Cross-cutting: extend `phase2_filter.py` with a legal-form score axis

Today the filter scores on description keywords. Add a second signal:

LEGAL_FORM_BUMPS = {
    'cooperative': 3,
    'community_land_trust': 4,
    'community_company': 3,
    'non_profit_company': 3,
    'charitable_organization': 2,
    'association': 2,
    'foundation': 2,
    'social_enterprise': 3,
    'community_interest_company': 3,
    'section_8_company': 3,           # India
    'osc': 2,                          # Brazil
    'public_benevolent_institution': 3, # Australia
    'benefit_corporation': 2,
}

Apply by inspecting model_type and registration_type columns. The bump stacks with the existing keyword score, capped at alignment_score=5. This is what lets a Bulgarian or Brazilian row score high without an English description.

Run the re-score after all three ingesters land.

5. Run order + verification

python data/ingest_acnc.py --dry-run — confirm row count and field mapping
python data/ingest_acnc.py — real run
Same for Bulgaria, then Brazil
python data/phase2_filter.py — re-score with new legal-form bumps
python data/check_counts.py — confirm directory size and country distribution shifted
Update commonweave/DATA.md with the new sources, row counts, and updated "Honest Numbers" table

Done criteria

When you finish

Send a system event with the summary:

openclaw system event --text "Wave A ingest done: AU=<n>, BG=<n>, BR=<n>. Total directory: <n>. US/UK now <n>%." --mode now

If you hit a hard blocker (registry offline, API requires paid auth Simon hasn't approved, etc.), STOP and announce:

openclaw system event --text "Wave A blocked at <step>: <reason>. Need decision." --mode now

Don't guess. Ask.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Brief: Wave A — Mission-First National Registry Ingest

Mission

Why we're doing this

Hard rules

What to read first

Work items

0. Pre-flight: `data/sources/REGISTRIES.yaml`

1. `commonweave/data/ingest_acnc.py` — Australia ACNC Charity Register

2. `commonweave/data/ingest_bulgaria_npo.py` — Bulgaria NPO Register

3. `commonweave/data/ingest_brazil_oscs.py` — Brazil Mapa das OSCs

4. Cross-cutting: extend `phase2_filter.py` with a legal-form score axis

5. Run order + verification

Done criteria

When you finish

FilesExpand file tree

WAVE-A-INGEST-BRIEF.md

Latest commit

History

WAVE-A-INGEST-BRIEF.md

File metadata and controls

Brief: Wave A — Mission-First National Registry Ingest

Mission

Why we're doing this

Hard rules

What to read first

Work items

0. Pre-flight: data/sources/REGISTRIES.yaml

1. commonweave/data/ingest_acnc.py — Australia ACNC Charity Register

2. commonweave/data/ingest_bulgaria_npo.py — Bulgaria NPO Register

3. commonweave/data/ingest_brazil_oscs.py — Brazil Mapa das OSCs

4. Cross-cutting: extend phase2_filter.py with a legal-form score axis

5. Run order + verification

Done criteria

When you finish

0. Pre-flight: `data/sources/REGISTRIES.yaml`

1. `commonweave/data/ingest_acnc.py` — Australia ACNC Charity Register

2. `commonweave/data/ingest_bulgaria_npo.py` — Bulgaria NPO Register

3. `commonweave/data/ingest_brazil_oscs.py` — Brazil Mapa das OSCs

4. Cross-cutting: extend `phase2_filter.py` with a legal-form score axis