Add Normalised Dataset for the UK Railway Network #3149

Open
iiAnderson wants to merge 3 commits into awslabs:main from iiAnderson:add-normalised-dataset

@iiAnderson iiAnderson commented May 18, 2026

Summary

Adding a normalised, analysis-ready dataset of every train service on the UK rail network, derived from the National Rail Darwin Push Port feed.

  • ~30,000 services per day, one row per service, updated daily
  • Pre-aggregated delay metrics, cancellation status, passenger loading, and a nested per-station stops array
  • Hive-partitioned Apache Parquet (Snappy-compressed), partitioned by service date
  • Data available from January 2025 onwards
  • Stored in S3 (eu-west-1, requester pays)
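
As a rough illustration of what date-based Hive partitioning and requester-pays access imply for a consumer, here is a minimal sketch. The bucket name, table prefix, and partition key (`example-bucket`, `normalised`, `service_date`) are all hypothetical, since this description does not state them. `RequestPayer="requester"` is the standard S3 request parameter for requester-pays buckets.

```python
from datetime import date, timedelta

# Hypothetical names: the PR description does not state the bucket,
# the key prefix, or the partition column name.
BUCKET = "example-bucket"
TABLE_PREFIX = "normalised"
PARTITION_KEY = "service_date"


def partition_prefix(day: date) -> str:
    """Return the Hive-style key prefix for one service date."""
    return f"{TABLE_PREFIX}/{PARTITION_KEY}={day.isoformat()}/"


def prefixes_between(start: date, end: date) -> list[str]:
    """Return one prefix per day in the inclusive range [start, end]."""
    days = (end - start).days
    return [partition_prefix(start + timedelta(days=i)) for i in range(days + 1)]


def list_request(day: date) -> dict:
    """Build keyword arguments for boto3's S3 list_objects_v2 call.

    RequestPayer="requester" acknowledges that the caller pays for the request.
    """
    return {
        "Bucket": BUCKET,
        "Prefix": partition_prefix(day),
        "RequestPayer": "requester",
    }
```

Reading only the prefixes for the dates of interest keeps the amount of data scanned, and therefore the requester's cost, proportional to the date range queried.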

Schema highlights

| Field | Description |
| --- | --- |
| `rid`, `toc`, `train_id` | Service identifiers |
| `origin_tpl`, `destination_tpl` | Route endpoints (TIPLOC codes) |
| `cancellation_status` | `ran`, `cancelled`, or `partially_cancelled` |
| `avg_delay_mins`, `max_delay_mins` | Aggregate delay metrics |
| `avg_loading` | Passenger loading percentage |
| `stops` | Nested array of per-station scheduled/actual times and delays |
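
To make the relationship between the nested `stops` array and the aggregate delay fields concrete, the sketch below shows one plausible derivation. It is an assumption, not the dataset's documented method: the per-stop field names (`tpl`, `delay_mins`), the sample values, and the choice to skip stops with no recorded delay are all hypothetical.

```python
from statistics import mean

# Hypothetical service row. The identifier values and the per-stop
# field names (tpl, delay_mins) are illustrative assumptions.
service = {
    "rid": "EXAMPLE-RID",
    "toc": "XX",
    "cancellation_status": "ran",
    "stops": [
        {"tpl": "AAA", "delay_mins": 0},
        {"tpl": "BBB", "delay_mins": 4},
        {"tpl": "CCC", "delay_mins": None},  # no actual time recorded
        {"tpl": "DDD", "delay_mins": 8},
    ],
}


def delay_summary(stops):
    """Aggregate per-stop delays, ignoring stops with no recorded delay."""
    delays = [s["delay_mins"] for s in stops if s["delay_mins"] is not None]
    if not delays:
        return {"avg_delay_mins": None, "max_delay_mins": None}
    return {"avg_delay_mins": mean(delays), "max_delay_mins": max(delays)}


summary = delay_summary(service["stops"])
```

Consumers who need a different aggregation (for example, delay at the final stop only) could recompute it from `stops` in the same way.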

Why this dataset is useful

UK rail performance data is widely discussed but hard to access in a structured, queryable format. This dataset enables researchers, journalists, and transport analysts to run queries like:

  • Which train operating companies have the worst cancellation rates?
  • How do delays propagate across the network by time of day?
  • What are the busiest and most delayed stations?
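
The first question above might be answered along these lines. This is a sketch over made-up in-memory rows, not real data: the operator codes are placeholders, and whether `partially_cancelled` counts towards the rate is a choice left to the analyst (exposed here as the hypothetical `count_partial` flag).

```python
from collections import Counter

# Made-up sample rows using the toc and cancellation_status fields.
rows = [
    {"toc": "AA", "cancellation_status": "ran"},
    {"toc": "AA", "cancellation_status": "cancelled"},
    {"toc": "AA", "cancellation_status": "ran"},
    {"toc": "AA", "cancellation_status": "partially_cancelled"},
    {"toc": "BB", "cancellation_status": "ran"},
    {"toc": "BB", "cancellation_status": "ran"},
]


def cancellation_rates(rows, count_partial=False):
    """Return (toc, rate) pairs sorted from worst to best cancellation rate."""
    cancelled_states = {"cancelled"}
    if count_partial:
        cancelled_states.add("partially_cancelled")

    totals, cancelled = Counter(), Counter()
    for row in rows:
        totals[row["toc"]] += 1
        if row["cancellation_status"] in cancelled_states:
            cancelled[row["toc"]] += 1

    rates = {toc: cancelled[toc] / totals[toc] for toc in totals}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
```

Because each service is a single row, the rate is a plain count ratio per operator; the same grouping translates directly to a `GROUP BY toc` query in Athena or DuckDB.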

The data is directly queryable via AWS Athena or any Parquet-compatible tool (DuckDB, pandas, Spark).

Documentation & usage

Note: This replaces my earlier PR #2474, which offered the raw (un-normalised) tables. The normalised dataset is easier for consumers to work with: each service is a single row with pre-computed metrics, so no complex multi-table joins are needed.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Daily-updated, analysis-ready dataset of every UK rail service derived
from the National Rail Darwin Push Port feed. One row per service with
pre-aggregated delay metrics, cancellation status, and passenger loading.
Hive-partitioned Parquet in S3 (eu-west-1, requester pays).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@iiAnderson changed the title from "Add UK National Rail Darwin Normalised Dataset" to "Add Normalised Dataset for the UK Railway Network" on May 18, 2026