Bug: 1:m relationship between route_id and recent_combined_name #1527

@amandaha8

Where did the bug occur?
Select from the below, and be sure to affix the appropriate label to this issue (e.g. dataset, jupyterhub, metabase, analysis.calitp.org)

  • Data (the warehouse)
  • JupyterHub
  • Metabase
  • analysis.calitp.org
  • Other (add detail)

Describe the bug

  • There are 200+ instances in which one recent_combined_name value maps to multiple route_id values.
  • Figure out why this is happening. It could be that these operators publish different versions of the same route, or that an operator (like LA Metro) deliberately chooses a new route_id value each time it publishes data for the same recent_combined_name value.
  • Aggregate the dataframe up to recent_combined_name, service_date, and portfolio_organization_name. This will require some finesse for metrics like avg_scheduled_minutes and speed_mph, because I'm unsure of the correct way to represent these values after aggregation.
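
One possible way to do this aggregation is sketched below. It is not a settled method. It assumes the dataframe has a per-row trip count, here called n_trips (a hypothetical column name that is not confirmed to exist), and uses it to weight avg_scheduled_minutes and speed_mph so that a route_id with more trips contributes more to the combined value. A plain mean or a distance-based weight may turn out to be more appropriate. All values in the toy dataframe are invented.

```python
import pandas as pd

# Toy data: the values and the n_trips column are assumptions for illustration.
df = pd.DataFrame(
    {
        "portfolio_organization_name": ["Operator A"] * 3,
        "recent_combined_name": ["1 Main St"] * 3,
        "service_date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-02-01"]),
        "route_id": ["1-old", "1-new", "1-new"],
        "n_trips": [10, 30, 40],
        "avg_scheduled_minutes": [20.0, 40.0, 30.0],
        "speed_mph": [12.0, 8.0, 10.0],
    }
)

group_cols = ["portfolio_organization_name", "recent_combined_name", "service_date"]
metric_cols = ["avg_scheduled_minutes", "speed_mph"]

# Multiply each metric by its weight so a grouped sum gives the weighted total.
weighted = df[group_cols + ["n_trips"]].copy()
for col in metric_cols:
    weighted[col] = df[col] * df["n_trips"]

# route_id is dropped on purpose: the result has one row per name and date.
aggregated = weighted.groupby(group_cols, as_index=False).sum()

# Divide the weighted totals by the summed weights to get weighted means.
for col in metric_cols:
    aggregated[col] = aggregated[col] / aggregated["n_trips"]

print(aggregated)
```

In this toy case the two route_id rows for 2024-01-01 collapse into one row, with the metrics pulled toward the route_id that has 30 trips rather than the one with 10.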

To Reproduce
Steps to reproduce the behavior:

  1. Load the dataframe f"{RT_SCHED_GCS}{DIGEST_RT_SCHED}.parquet", produced at the end of gtfs_digest/merge_data.py, as df and count the distinct route_id values per recent_combined_name:
    unique_route_ids = (
        df.groupby(["portfolio_organization_name", "recent_combined_name"])
        .agg({"route_id": "nunique"})
        .reset_index()
    )

  2. Filter to the names with more than one route_id: unique_route_ids2 = unique_route_ids.loc[unique_route_ids.route_id > 1]
  3. See which recent_combined_name values have more than one route_id.
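
To see which route_id values sit behind each flagged name, and whether they overlap in time, the check in the steps above could be extended along the lines of this sketch. It runs on a small made-up dataframe (the operator, route names, and IDs are invented); only the column names come from this issue. If the date ranges of the IDs do not overlap, the operator may have re-issued the route_id; if they do overlap, several versions of the route are probably published at once.

```python
import pandas as pd

# Toy stand-in for the digest dataframe; all values are invented.
df = pd.DataFrame(
    {
        "portfolio_organization_name": ["Operator A"] * 5,
        "recent_combined_name": [
            "1 Main St", "1 Main St", "1 Main St", "2 Oak Ave", "2 Oak Ave",
        ],
        "route_id": ["1-old", "1-old", "1-new", "2", "2"],
        "service_date": pd.to_datetime(
            ["2024-01-01", "2024-02-01", "2024-03-01", "2024-01-01", "2024-02-01"]
        ),
    }
)

group_cols = ["portfolio_organization_name", "recent_combined_name"]

# Same check as in the steps above: count distinct route_id values per name.
unique_route_ids = (
    df.groupby(group_cols).agg({"route_id": "nunique"}).reset_index()
)
flagged = unique_route_ids.loc[unique_route_ids.route_id > 1]

# For flagged names only, show the first and last service_date of each route_id.
date_ranges = (
    df.merge(flagged[group_cols], on=group_cols)
    .groupby(group_cols + ["route_id"])
    .agg(first_date=("service_date", "min"), last_date=("service_date", "max"))
    .reset_index()
)
print(date_ranges)
```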

Expected behavior
It's OK for a recent_combined_name value to have more than one route_id, but these rows need to be grouped before the data makes it into the Operator Grain of GTFS Digest, or else the charts will be disjointed.
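
A guard along these lines could run before the data reaches the Operator Grain, to confirm that each combination of organization, route name, and service date has exactly one row. This is only a sketch: find_duplicate_keys is a hypothetical helper, not an existing function in the repository, and the toy rows are invented.

```python
import pandas as pd

KEY_COLS = ["portfolio_organization_name", "recent_combined_name", "service_date"]


def find_duplicate_keys(df: pd.DataFrame) -> pd.DataFrame:
    """Return every row whose key combination appears more than once."""
    return df.loc[df.duplicated(subset=KEY_COLS, keep=False)]


# Toy data: two route_id values share one name on the same service date.
df = pd.DataFrame(
    {
        "portfolio_organization_name": ["Operator A"] * 3,
        "recent_combined_name": ["1 Main St", "1 Main St", "2 Oak Ave"],
        "service_date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-01"]),
        "route_id": ["1-old", "1-new", "2"],
    }
)

dupes = find_duplicate_keys(df)
if not dupes.empty:
    print(f"{len(dupes)} rows still share a key and need to be grouped:")
    print(dupes)
```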

Additional context

Metadata

Labels

admin (Administrative work), bug (Something isn't working)
