This is the third iteration of our GTFS Realtime (RT) downloader aka archiver.
huey is a lightweight task queue library that we use to enqueue tasks for asynchronous, parallel execution by workers.
The full archiver application is composed of three pieces:
- A ticker pod that creates fetch tasks every 20 seconds, based on the latest download configurations
  - Configurations are fetched from GCS and cached for 5 minutes; they are generated upstream by `generate_gtfs_download_configs`
  - Fetches are enqueued as Huey tasks
- A Redis instance holding the Huey queue
  - We deploy a single instance per environment namespace (e.g. `gtfs-rt-v3`, `gtfs-rt-v3-test`) with no disk space and no horizontal scaling; we do not care about persistence because only fresh fetch tasks are relevant anyway.
  - In addition, the RT archiver relies on low I/O latency with Redis to minimize the latency of fetch starts. Due to these considerations, these Redis instances should NOT be used for any other applications.
- Some number (greater than 1) of consumer pods that execute enqueued fetch tasks, making HTTP requests and saving the raw responses (and metadata such as headers) to GCS
  - Each consumer pod runs some number of worker threads
  - As of 2023-04-10, the production archiver has 6 consumer pods, each managing 24 worker threads
These deployments are defined in the relevant kubernetes manifests and overlaid with kustomize per-environment (e.g. gtfs-rt-archiver-v3-test).
We've created a Grafana dashboard to display the metrics for this application, based on our desired goals of capturing data to the fullest extent possible and being able to track 20-second update frequencies in the feeds. Counts of task successes are our overall sign of health (i.e. we are capturing enough data), while other metrics such as task delay or download time are useful for identifying bottlenecks or the need for increased resources.
There are two important alerts defined in Grafana based on these metrics.
Both of these alerts can fire if the archiver is only partially degraded, but the first alert is our best catch-all detection mechanism for any downtime. There are other potential issues (e.g. outdated download configs) that are flagged in the dashboard but do not currently have configured alerts.
We log errors and exceptions (both caught and uncaught) to our Sentry instance via the Python SDK for Sentry. Common problems include:
- Failure to connect to Redis following a node upgrade; this is typically fixed by restarting the archiver.
- `RTFetchException`, a custom class specific to failures during feed download; these can be provider-side (i.e. the agency/vendor) or consumer-side (i.e. us) and are usually fixed (if possible) by changing download configurations. Common examples (and HTTP error code if relevant) include:
  - Missing or invalid authentication (401/403)
  - Changed URLs (404)
  - Intermittent outages/errors (may be a `ConnectionError` or a 500 response)
You must have `kubectl` installed and authenticated before executing these commands (you will need GKE permissions in GCP for this). It's also useful to set your default cluster to our data-infra-apps cluster.
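For cluster setup, something like the following should work; the project and region values below are placeholders, not the actual values for our environment.

```bash
# Install the GKE auth plugin used by recent kubectl versions
gcloud components install gke-gcloud-auth-plugin

# Fetch credentials for the cluster and set it as the default kubectl context
# (substitute our actual GCP project and cluster location)
gcloud container clusters get-credentials data-infra-apps \
  --project <gcp-project> --region <cluster-region>

# Confirm kubectl is pointed at the right cluster
kubectl config current-context
```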
These `kubectl` commands assume your shell is in the `kubernetes` directory, but you could run them from root and just prepend `kubernetes/` to the file paths.
Rolling restarts with `kubectl` use the following syntax.

```bash
kubectl rollout restart deployment.apps/<deployment> -n <namespace>
```
So for example, to restart all 3 deployments in test, you would run the following.
```bash
kubectl rollout restart deployment.apps/redis -n gtfs-rt-v3-test
kubectl rollout restart deployment.apps/gtfs-rt-archiver-ticker -n gtfs-rt-v3-test
kubectl rollout restart deployment.apps/gtfs-rt-archiver-consumer -n gtfs-rt-v3-test
```
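The same syntax works for a single deployment in another namespace; for example, to restart only the consumers in `gtfs-rt-v3` (the non-test namespace from the example above) and wait for the rollout to finish:

```bash
kubectl rollout restart deployment.apps/gtfs-rt-archiver-consumer -n gtfs-rt-v3
kubectl rollout status deployment.apps/gtfs-rt-archiver-consumer -n gtfs-rt-v3
```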
Environment-agnostic configurations live in app vars while environment-specific configurations live in channel vars. You can edit these files and deploy the changes with `kubectl`.

```bash
kubectl apply -k apps/overlays/gtfs-rt-archiver-v3-<env>
```
For example, you can apply the configmap values in test with the following.
```bash
kubectl apply -k apps/overlays/gtfs-rt-archiver-v3-test
```
Running `apply` will also deploy the archiver from scratch if it is not deployed yet, as long as the proper namespace exists.
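One quick way to confirm that the deployments exist and the pods came up after an `apply` is to list them in the relevant namespace; for example, in test:

```bash
kubectl get deployments,pods -n gtfs-rt-v3-test
```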
Code changes require building and pushing a new Docker image, as well as applying `kubectl` changes to point the deployment at the new image.
- Make code changes and increment version in `pyproject.toml`
  - Ex. `poetry version 2023.4.10`
- Open a pull request and verify that the test container image build succeeds
- Merge the pull request and obtain the new image tag from the GitHub Actions build output or from https://github.com/cal-itp/data-infra/pkgs/container/data-infra%2Fgtfs-rt-archiver-v3
- Change the image tag version in the environment's `kustomization.yaml`
  - Ex. change the value of `newTag` to `2023.4.10-a66f90`
- Finally, apply changes in production by opening and merging a second PR that includes the `kustomization.yaml` changes (see the verification example below).
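After the production overlay has been applied, one way to verify that the rollout actually picked up the new image is to inspect the deployment's container image; the deployment name below is taken from the rolling restart examples above, using the non-test namespace.

```bash
# Print the image (including tag) currently configured on the consumer deployment
kubectl get deployment gtfs-rt-archiver-consumer -n gtfs-rt-v3 \
  -o jsonpath='{.spec.template.spec.containers[*].image}'
```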
GTFS download configurations (for both Schedule and RT) are sourced from the GTFS Dataset table in the California Transit Airtable base, and we have specific documentation for modifying the table. (Both of these Airtable links require authentication/access to Airtable.) You may need to make URL or authentication adjustments in this table. This data is downloaded daily into our infrastructure and will propagate to the GTFS Schedule and RT downloads; you may execute the Airtable download job manually after making edits to "deploy" the changes more quickly.
Another possible intervention is updating or adding authentication information in Secret Manager. You may create new versions of existing secrets, or add entirely new secrets. Secrets must be tagged with `gtfs_rt: true` to be loaded as secrets in the archiver; secrets are refreshed every 5 minutes by the ticker.
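As a rough sketch (the secret name below is hypothetical, and this assumes the `gtfs_rt: true` tag corresponds to a Secret Manager label), creating and populating a new secret with `gcloud` might look like:

```bash
# Create a secret labeled so the archiver's ticker will pick it up
gcloud secrets create example-agency-api-key --labels=gtfs_rt=true

# Add (or later rotate) the secret value as a new version
echo -n "<api-key-value>" | gcloud secrets versions add example-agency-api-key --data-file=-
```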