Description
Recently we had a production incident in which 2 things were broken:
- some datasets did not have a computed key block cache at all
- some contained an incomplete key block cache (we stored only the `AddPushSource` event, but no `Seed` or `SetDataSchema`, as earlier events were somehow lost).
The cause of the incident has already been resolved (incorrect streaming of dataset entries during re-indexing of key blocks), although it left some "data consequences" that were not 100% cleanly recovered.
So, the idea would be to enforce a few invariants that detect broken consistency in our state:
- All datasets in the `dataset_entries` table should be:
  - present in the `dataset_references` table as unique rows for the "Head" ref
  - present in the `dataset_statistics` table as unique rows
  - present in the `dataset_key_blocks` table (at least once)
- Records in `dataset_key_blocks` must respect validation criteria:
  - `Seed` event is present and stays at seq number 0
- The derivative datasets should be represented in the `dataset_dependecies` table as a "downstream" edge.
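As a rough illustration of what such detection queries could look like, here is a minimal sketch that runs against an in-memory SQLite database instead of the real Postgres schema. Only the table names come from this issue; every column name (`dataset_id`, `block_ref_name`, `event_type`, `sequence_number`), the ref value `'head'`, and the problem names are assumptions and would need to be adapted to the actual schema. The derivative-dataset check is left out because it depends on how the dataset kind is stored.

```python
import sqlite3

# Hypothetical minimal schema: table names are from the issue,
# all column names are assumptions made for illustration only.
SCHEMA = """
CREATE TABLE dataset_entries    (dataset_id TEXT PRIMARY KEY);
CREATE TABLE dataset_references (dataset_id TEXT, block_ref_name TEXT);
CREATE TABLE dataset_statistics (dataset_id TEXT);
CREATE TABLE dataset_key_blocks (dataset_id TEXT, event_type TEXT,
                                 sequence_number INTEGER);
"""

# One query per problem type; each returns the offending dataset ids.
ANOMALY_QUERIES = {
    # Every dataset must have exactly one "head" reference row.
    "head_ref_not_unique": """
        SELECT e.dataset_id
        FROM dataset_entries e
        LEFT JOIN dataset_references r
               ON r.dataset_id = e.dataset_id AND r.block_ref_name = 'head'
        GROUP BY e.dataset_id
        HAVING COUNT(r.dataset_id) <> 1
    """,
    # Every dataset must have exactly one statistics row.
    "statistics_not_unique": """
        SELECT e.dataset_id
        FROM dataset_entries e
        LEFT JOIN dataset_statistics s ON s.dataset_id = e.dataset_id
        GROUP BY e.dataset_id
        HAVING COUNT(s.dataset_id) <> 1
    """,
    # Every dataset must have at least one key block.
    "key_blocks_missing": """
        SELECT e.dataset_id
        FROM dataset_entries e
        WHERE NOT EXISTS (SELECT 1 FROM dataset_key_blocks k
                          WHERE k.dataset_id = e.dataset_id)
    """,
    # Datasets that have key blocks but no Seed event at sequence number 0.
    "seed_not_at_zero": """
        SELECT DISTINCT k.dataset_id
        FROM dataset_key_blocks k
        WHERE NOT EXISTS (SELECT 1 FROM dataset_key_blocks s
                          WHERE s.dataset_id = k.dataset_id
                            AND s.event_type = 'Seed'
                            AND s.sequence_number = 0)
    """,
}


def count_anomalies(conn):
    """Return {problem_name: number of offending datasets}."""
    return {name: len(conn.execute(sql).fetchall())
            for name, sql in ANOMALY_QUERIES.items()}


def demo_connection():
    """Build a toy database reproducing both failure modes of the incident."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for ds in ("ok", "no_cache", "partial"):
        conn.execute("INSERT INTO dataset_entries VALUES (?)", (ds,))
        conn.execute("INSERT INTO dataset_references VALUES (?, 'head')", (ds,))
        conn.execute("INSERT INTO dataset_statistics VALUES (?)", (ds,))
    conn.executemany(
        "INSERT INTO dataset_key_blocks VALUES (?, ?, ?)",
        [
            ("ok", "Seed", 0),
            ("ok", "SetDataSchema", 1),
            # "no_cache" has no key blocks at all.
            # "partial" lost its earlier events, including Seed.
            ("partial", "AddPushSource", 2),
        ],
    )
    return conn


if __name__ == "__main__":
    print(count_anomalies(demo_connection()))
```

In the toy data, `no_cache` trips the "key blocks missing" check and `partial` trips the "Seed not at 0" check, mirroring the two failure modes described above.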
For these invariants:
- implement SQL queries that detect violations
- deliver Prometheus metric per problem type (i.e. gauge for anomalies count)
- configure alerts for new incidents
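To make the "gauge per problem type" idea concrete, here is a small sketch that renders anomaly counts in the Prometheus text exposition format using only the standard library. The metric name `dataset_consistency_anomalies`, the `problem` label, and the problem names are hypothetical; in practice an existing exporter or client library would likely be used instead of hand-rendering.

```python
def render_anomaly_gauge(counts, metric="dataset_consistency_anomalies"):
    """Render {problem: count} as one Prometheus gauge with a `problem` label.

    Assumes problem names are plain identifiers, so no label escaping is done.
    """
    lines = [
        f"# HELP {metric} Number of datasets violating a consistency invariant.",
        f"# TYPE {metric} gauge",
    ]
    # Sort for a stable output order between scrapes.
    for problem in sorted(counts):
        lines.append(f'{metric}{{problem="{problem}"}} {counts[problem]}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_anomaly_gauge({"key_blocks_missing": 1, "seed_not_at_zero": 0}))
```

Using a single metric with a label keeps alert rules simple: one rule of the form "value greater than 0" covers every problem type, and new invariants only add label values.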
Since we are talking about quite heavyweight checks, it could be naive to bind Prometheus to those queries directly.
Consider the following implementation strategy that provides more control over performance:
- detect anomalies in the form of a Materialized View in Postgres
- refresh it concurrently every 30-60 minutes (configure a cron job using `pg_cron` or something similar)
- bind Prometheus metrics to that materialized view, so that it's just quickly checking the latest snapshot instead of heavy compute
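The strategy above could translate into roughly three Postgres statements; the helper below is a sketch that only builds them as strings. The view name, the `(problem, dataset_id)` column pair, and the job name are hypothetical. It relies on two things worth double-checking against the deployed versions: `REFRESH MATERIALIZED VIEW CONCURRENTLY` requires a unique index on the view, and `cron.schedule(job_name, schedule, command)` is the named-job form of the `pg_cron` API.

```python
def refresh_setup_statements(view, detection_sql, every_minutes=30):
    """Build the SQL needed to snapshot anomalies and refresh them on a schedule.

    `view` and `detection_sql` are trusted constants here (no escaping is done).
    `detection_sql` is assumed to return the columns (problem, dataset_id).
    """
    if not 1 <= every_minutes <= 60 or 60 % every_minutes != 0:
        raise ValueError("every_minutes must divide 60 evenly")
    # Standard 5-field cron syntax: hourly, or every N minutes.
    schedule = "0 * * * *" if every_minutes == 60 else f"*/{every_minutes} * * * *"
    return [
        # Heavy detection query is evaluated only on refresh.
        f"CREATE MATERIALIZED VIEW {view} AS {detection_sql}",
        # A unique index is a precondition for CONCURRENTLY refreshes.
        f"CREATE UNIQUE INDEX {view}_uq ON {view} (problem, dataset_id)",
        # Periodic refresh via pg_cron.
        f"SELECT cron.schedule('refresh-{view}', '{schedule}', "
        f"'REFRESH MATERIALIZED VIEW CONCURRENTLY {view}')",
    ]


if __name__ == "__main__":
    for stmt in refresh_setup_statements(
        "dataset_anomalies",
        "SELECT 'key_blocks_missing' AS problem, dataset_id FROM dataset_entries",
    ):
        print(stmt + ";")
```

The metrics side would then only need a cheap aggregate over the snapshot, for example `SELECT problem, COUNT(*) FROM dataset_anomalies GROUP BY problem`.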
In addition, it could make sense to trigger a refresh after each deployment, since deployments account for 90%+ of incidents.
The refresh should run after the deployment stabilizes: this could be manual, or automated with a delay or a post-deploy reaction.