Add Normalised Dataset for the UK Railway Network by iiAnderson · Pull Request #3149 · awslabs/open-data-registry

iiAnderson · 2026-05-18T07:14:03Z

Summary

Adding a normalised, analysis-ready dataset of every train service on the UK rail network, derived from the National Rail Darwin Push Port feed.

~30,000 services per day, one row per service, updated daily
Pre-aggregated delay metrics, cancellation status, passenger loading, and a nested per-station stops array
Hive-partitioned Apache Parquet (Snappy-compressed), partitioned by service date
Data available from January 2025 onwards
Stored in S3 (eu-west-1, requester pays)
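
As a rough illustration of the layout described above, the sketch below builds the S3 prefix for one service date and the arguments a requester-pays listing would need. The bucket name `example-darwin-normalised` and the partition key `service_date` are placeholders rather than values taken from this PR; the linked documentation should be checked for the real ones. `RequestPayer="requester"` is the standard S3 request parameter for requester-pays buckets.

```python
from datetime import date

# Hypothetical values: this PR does not state the bucket or partition key name.
BUCKET = "example-darwin-normalised"
PARTITION_KEY = "service_date"


def partition_prefix(service_date: date) -> str:
    """Return the Hive-style prefix (key=value/) for one service date."""
    return f"{PARTITION_KEY}={service_date.isoformat()}/"


def list_request(service_date: date) -> dict:
    """Build keyword arguments for an S3 ListObjectsV2 call on one partition.

    RequestPayer="requester" acknowledges that the caller pays for the request.
    """
    return {
        "Bucket": BUCKET,
        "Prefix": partition_prefix(service_date),
        "RequestPayer": "requester",
    }
```

With boto3, these arguments could be passed as `s3.list_objects_v2(**list_request(day))` on a client created for `eu-west-1`.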

Schema highlights

| Field | Description |
| --- | --- |
| `rid`, `toc`, `train_id` | Service identifiers |
| `origin_tpl`, `destination_tpl` | Route endpoints (TIPLOC codes) |
| `cancellation_status` | `ran`, `cancelled`, or `partially_cancelled` |
| `avg_delay_mins`, `max_delay_mins` | Aggregate delay metrics |
| `avg_loading` | Passenger loading percentage |
| `stops` | Nested array of per-station scheduled/actual times and delays |
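
To make the relationship between `stops` and the aggregate delay columns concrete, here is a minimal sketch of how per-service metrics could be derived from a stops array. The stop field names (`tpl`, `delay_mins`), the sample values, and the aggregation rule (mean and maximum over stops that have a recorded delay) are assumptions for illustration; the actual pipeline may compute these differently.

```python
from statistics import mean
from typing import Optional


def delay_metrics(stops: list[dict]) -> dict[str, Optional[float]]:
    """Aggregate per-stop delays into service-level metrics.

    Stops without a recorded delay (None or missing) are skipped.
    """
    delays = [s["delay_mins"] for s in stops if s.get("delay_mins") is not None]
    if not delays:
        return {"avg_delay_mins": None, "max_delay_mins": None}
    return {"avg_delay_mins": mean(delays), "max_delay_mins": max(delays)}


# Made-up stops for one service; field names are hypothetical.
example_stops = [
    {"tpl": "EUSTON", "delay_mins": 0},
    {"tpl": "MKNSCEN", "delay_mins": 4},
    {"tpl": "BHAMNWS", "delay_mins": None},  # no actual time recorded
    {"tpl": "MNCRPIC", "delay_mins": 8},
]
metrics = delay_metrics(example_stops)
```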

Why this dataset is useful

UK rail performance data is widely discussed but hard to access in a structured, queryable format. This dataset enables researchers, journalists, and transport analysts to run queries like:

Which train operating companies have the worst cancellation rates?
How do delays propagate across the network by time of day?
What are the busiest and most delayed stations?

The data is directly queryable via AWS Athena or any Parquet-compatible tool (DuckDB, pandas, Spark).
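
The first question above can be sketched in plain Python over rows that carry the `toc` and `cancellation_status` fields from the schema. The sample rows and operator codes are made up, and counting only `cancelled` (not `partially_cancelled`) as a cancellation is an assumption; an analyst may prefer a different definition.

```python
from collections import defaultdict


def cancellation_rate_by_toc(rows: list[dict]) -> dict[str, float]:
    """Return the share of fully cancelled services for each operator."""
    totals: dict[str, int] = defaultdict(int)
    cancelled: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["toc"]] += 1
        # Assumption: partially cancelled services do not count as cancelled.
        if row["cancellation_status"] == "cancelled":
            cancelled[row["toc"]] += 1
    return {toc: cancelled[toc] / totals[toc] for toc in totals}


# Made-up rows; "AA" and "BB" are placeholder operator codes.
sample_rows = [
    {"toc": "AA", "cancellation_status": "ran"},
    {"toc": "AA", "cancellation_status": "cancelled"},
    {"toc": "AA", "cancellation_status": "ran"},
    {"toc": "AA", "cancellation_status": "ran"},
    {"toc": "BB", "cancellation_status": "cancelled"},
    {"toc": "BB", "cancellation_status": "partially_cancelled"},
]
rates = cancellation_rate_by_toc(sample_rows)
worst_first = sorted(rates, key=rates.get, reverse=True)
```

The same grouping maps directly onto a `GROUP BY toc` query in Athena or DuckDB once the Parquet files are registered as a table.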

Documentation & usage

Documentation: https://blog.robbiea.co.uk/posts/the-darwin-normalised-dataset/
Source code & pipeline: https://github.com/iiAnderson/darwin-connect
License: CC-BY-NC-4.0 (upstream data subject to National Rail T&Cs)

Note: This replaces my earlier PR #2474, which offered the raw (un-normalised) tables. The normalised dataset is significantly more useful for consumers: a single row per service with pre-computed metrics eliminates the need for complex multi-table joins.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Daily-updated, analysis-ready dataset of every UK rail service derived from the National Rail Darwin Push Port feed. One row per service with pre-aggregated delay metrics, cancellation status, and passenger loading. Hive-partitioned Parquet in S3 (eu-west-1, requester pays).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

iiAnderson changed the title ~~Add UK National Rail Darwin Normalised Dataset~~ Add Normalised Dataset for the UK Railway Network May 18, 2026

iiAnderson added 2 commits May 23, 2026 10:44

Update to add SNS topic

486fd4d

Update email

ccea148

iiAnderson wants to merge 3 commits into awslabs:main from iiAnderson:add-normalised-dataset
