ukgeo: steer wanted on a community geocoder built on OS Open Names + Open Roads #132

Description

@ThomasHSimm

Hi osdatahub,

Flagging an early-stage Python package in case it's useful context for you, and to ask for a steer on whether it's worth developing further in the direction I'm going.

ukgeo (v0.5, alpha) is a UK free-text geocoder for messy location strings — STATS19-style road references, motorway junctions, colloquial place names. It's built on OS Open Names and OS Open Roads (OGL-attributed), with postcodes.io and OSM filling gaps. Pip-installable, MIT licensed.

It came out of road-safety risk modelling work where the OS Names API wasn't the right shape for two specific reasons:

  1. Bulk batches. Your own product page notes the API isn't intended for bulk searches, and the 600/min live rate limit confirms it. We had hundreds of thousands of dirty STATS19 strings to resolve in one pass.
  2. Restricted-network environments. The analytical environment had no outbound API access, and we didn't want location strings leaving the network. ukgeo loads from a local parquet at startup and runs entirely offline. (Optional OS Names API fallback exists for long-tail infrastructure cases, off by default.)
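To make the offline-first shape concrete, here is a minimal sketch, not ukgeo's actual code. `OfflineResolver`, `normalise`, and the `remote_fallback` parameter are hypothetical names, the sample row uses made-up coordinates, and a plain list of tuples stands in for rows read from the local parquet at startup.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Coord = Tuple[float, float]


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so lookups tolerate messy spacing."""
    return " ".join(text.lower().split())


class OfflineResolver:
    """Hypothetical resolver: index built once in memory, no network by default."""

    def __init__(
        self,
        rows: Iterable[Tuple[str, Coord]],
        remote_fallback: Optional[Callable[[str], Optional[Coord]]] = None,
    ) -> None:
        # In a real build these rows would come from a local parquet file;
        # an in-memory iterable stands in for it here (assumption).
        self._index: Dict[str, Coord] = {normalise(n): c for n, c in rows}
        # Fallback is off unless the caller explicitly passes one in.
        self._remote_fallback = remote_fallback

    def resolve(self, query: str) -> Optional[Coord]:
        hit = self._index.get(normalise(query))
        if hit is None and self._remote_fallback is not None:
            hit = self._remote_fallback(query)
        return hit

    def resolve_batch(self, queries: Iterable[str]) -> List[Optional[Coord]]:
        # One pass over the whole batch; no per-row rate limit applies.
        return [self.resolve(q) for q in queries]


# Illustrative row only; the coordinates are placeholders, not real data.
resolver = OfflineResolver([("M62 Junction 26", (1.0, 2.0))])
results = resolver.resolve_batch(["  m62   junction 26 ", "somewhere unknown"])
```

Because the fallback is an injected callable that defaults to `None`, no location string can leave the network unless the caller opts in.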

Beyond those two, the things that have been useful in practice:

  • Fuzzy multi-token matching on dirty strings (junction names, colloquial roundabouts, county-context disambiguation).
  • Transparent output — every result returns confidence, level_resolved, match_type, candidates_considered, notes. Helpful for analyst triage of low-confidence rows.
  • The pipeline is agnostic to feature type — the same scorer handles roads, junctions, places, stations. Extending coverage to the rest of the OS Open Names theme set (hospitals, schools, airports, ferries, etc.) is a parquet-build change rather than a code change.
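As a rough illustration of the three points above, the sketch below scores a dirty string token by token and returns a result carrying the fields listed. It is an assumption-laden stand-in rather than ukgeo's scorer: `GeocodeResult`, `token_score`, `match`, the 0.8 threshold, and the use of `difflib.SequenceMatcher` are all hypothetical choices, and only the output field names come from the description.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List, Optional, Sequence, Tuple


@dataclass
class GeocodeResult:
    # Field names follow the issue text; the types are assumptions.
    query: str
    matched_name: Optional[str]
    confidence: float
    level_resolved: str
    match_type: str
    candidates_considered: int
    notes: List[str] = field(default_factory=list)


def token_score(query: str, candidate: str) -> float:
    """Average, over query tokens, of the best similarity to any candidate token."""
    q_tokens = query.lower().split()
    c_tokens = candidate.lower().split()
    if not q_tokens or not c_tokens:
        return 0.0
    best = [
        max(SequenceMatcher(None, q, c).ratio() for c in c_tokens)
        for q in q_tokens
    ]
    return sum(best) / len(best)


def match(
    query: str,
    candidates: Sequence[Tuple[str, str]],  # (name, feature_type) pairs
    threshold: float = 0.8,  # hypothetical cut-off
) -> GeocodeResult:
    if not candidates:
        return GeocodeResult(query, None, 0.0, "none", "none", 0, ["no candidates"])
    # Feature type is just data, so one scorer serves roads, junctions, stations.
    scored = [(token_score(query, name), name, ftype) for name, ftype in candidates]
    score, name, ftype = max(scored, key=lambda s: s[0])
    if score < threshold:
        note = f"best candidate {name!r} scored below threshold"
        return GeocodeResult(query, None, score, "none", "none", len(scored), [note])
    same_text = " ".join(query.lower().split()) == " ".join(name.lower().split())
    match_type = "exact" if same_text else "fuzzy"
    return GeocodeResult(query, name, score, ftype, match_type, len(scored))


candidates = [("M62 Junction 26", "junction"), ("Leeds Station", "station")]
result = match("m62 juction 26", candidates)
```

Adding a new feature type in this sketch means adding rows to `candidates`, which mirrors the point that wider coverage is a data-build change rather than a code change.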

What it isn't / current gaps, honestly:

  • No reverse geocoding yet — planned, that's the biggest functional gap vs. OS Names.
  • No address-level resolution. OS Places is the right tool for that.
  • Data freshness depends on rebuilding parquets locally.
  • Welsh / Gaelic / multilingual coverage is uncertain in the current build.
  • Test data is regional (Yorkshire / NW / Midlands); national-scale accuracy is partly assumed rather than measured.

It works for us in two settings (the ORR pipeline and ad-hoc geocoding), and I think the offline / bulk angle could be useful for civil-service and research users who run into the same constraints. But it's early, and before investing further I'd value a steer on:

  • Whether something like this overlaps with anything you're already planning or have seen demand for, and any improvements you'd suggest.
  • Whether the "offline + bulk + dirty strings" framing matches a real gap you see from your end, or whether I'm pattern-matching off a narrow use-case.

No specific ask beyond that — happy if it's just noted.
Cheers,
Thomas
