Description
Where did the bug occur?
Select from the below, and be sure to affix the appropriate label to this issue (e.g. dataset, jupyterhub, metabase, analysis.calitp.org)
- Data (the warehouse)
- JupyterHub
- Metabase
- analysis.calitp.org
- Other (add detail)
Describe the bug
- There are 200+ instances in which one `recent_combined_name` value maps to multiple `route_id` values.
- Figure out why this is happening. It could be that these operators have different versions of the same route, or that an operator (like LA Metro) actively chooses new `route_id` values each time it publishes data for the same `recent_combined_name` value.
- Aggregate the dataframe up to `recent_combined_name`, `service_date`, and `portfolio_organization_name`. This will require some finesse for metrics like `avg_scheduled_minutes` and `speed_mph` because I'm unsure what the correct way to represent these values is.
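One possible way to do the aggregation in the last bullet is a weighted mean, sketched below on toy data. This is only an illustration under assumptions: the weight column `n_trips` and the helper `aggregate_routes` are hypothetical names that do not come from `gtfs_digest`, and whether a trip-weighted mean is the right representation for `speed_mph` is exactly the open question raised above.

```python
import pandas as pd

GROUP_COLS = ["portfolio_organization_name", "recent_combined_name", "service_date"]


def aggregate_routes(df, metric_cols, weight_col):
    """Collapse rows that differ only by route_id into one row per group.

    Each metric becomes a mean weighted by weight_col (a hypothetical
    column, e.g. a trip count); the weight itself is summed.
    """
    work = df[GROUP_COLS + [weight_col]].copy()
    for col in metric_cols:
        # Weighted numerator: metric value times its weight
        work[col] = df[col] * df[weight_col]
    summed = work.groupby(GROUP_COLS, as_index=False).sum()
    for col in metric_cols:
        # Divide the summed numerator by the summed weight
        summed[col] = summed[col] / summed[weight_col]
    return summed


# Toy data: "1 Main St" is published under two route_id values
toy = pd.DataFrame(
    {
        "portfolio_organization_name": ["Org A", "Org A", "Org A"],
        "recent_combined_name": ["1 Main St", "1 Main St", "2 Oak Ave"],
        "service_date": ["2025-01-01", "2025-01-01", "2025-01-01"],
        "route_id": ["1-13", "1-14", "2-13"],
        "n_trips": [10, 30, 20],  # hypothetical weight column
        "avg_scheduled_minutes": [40.0, 60.0, 25.0],
        "speed_mph": [12.0, 16.0, 10.0],
    }
)

result = aggregate_routes(toy, ["avg_scheduled_minutes", "speed_mph"], "n_trips")
print(result)
```

The two "1 Main St" rows collapse into one, and `route_id` is dropped from the output because it is no longer meaningful at this grain.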
To Reproduce
Steps to reproduce the behavior:
- Use the dataframe `f"{RT_SCHED_GCS}{DIGEST_RT_SCHED}.parquet"` from the end of `gtfs_digest/merge_data.py`.
- Count the unique `route_id` values for each `recent_combined_name`:
      unique_route_ids = (
          df.groupby(["portfolio_organization_name", "recent_combined_name"])
          .agg({"route_id": "nunique"})
          .reset_index()
      )
- Keep only the rows with more than one `route_id`:
      unique_route_ids2 = unique_route_ids.loc[unique_route_ids.route_id > 1]
- See which `recent_combined_name` values have more than one `route_id`.
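To help with the "figure out why" part, it may be useful to see the actual `route_id` values next to each affected name, not just the count. The sketch below runs on toy data; the helper `list_route_ids` is a hypothetical name, and only the column names are taken from this issue.

```python
import pandas as pd


def list_route_ids(df):
    """Return one row per (organization, route name) that has more than
    one route_id, with the distinct route_ids joined into a string."""
    keys = ["portfolio_organization_name", "recent_combined_name"]
    summary = (
        df.groupby(keys)["route_id"]
        .agg(
            n_route_ids="nunique",
            route_ids=lambda s: ", ".join(sorted(s.dropna().astype(str).unique())),
        )
        .reset_index()
    )
    return summary.loc[summary.n_route_ids > 1].reset_index(drop=True)


# Toy data: "1 Main St" appears under two route_id values across dates
toy = pd.DataFrame(
    {
        "portfolio_organization_name": ["Org A"] * 4,
        "recent_combined_name": ["1 Main St", "1 Main St", "1 Main St", "2 Oak Ave"],
        "service_date": ["2025-01-01", "2025-02-01", "2025-03-01", "2025-01-01"],
        "route_id": ["1-13", "1-14", "1-13", "2-13"],
    }
)

multi = list_route_ids(toy)
print(multi)
```

Sorting the real output by `n_route_ids` could show whether a few operators account for most of the 200+ cases.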
Expected behavior
It's OK for `recent_combined_name` values to have more than one `route_id`, but these rows need to be grouped before the data makes it into the Operator Grain of GTFS Digest, or else the charts will be disjointed.
Additional context
- Build off the work Tiffany did in Research Request - Route identification over time #924 and this PR.
