Incident reaction: implement advanced consistency health checks for dataset tables #279

@zaychenko-sergei

Description

Recently we had a production incident in which two things were broken:

  • some datasets did not have a computed key block cache at all
  • some contained an incomplete key block cache (only the AddPushSource event was stored, without Seed or SetDataSchema, because earlier events were somehow lost).

The root cause of the incident (incorrect streaming of dataset entries during re-indexing of key blocks) has already been resolved, but it left some "data consequences" that were not 100% cleanly recovered.

So the idea is to check a few invariants whose violation indicates broken consistency in our state:

  1. All datasets in the dataset_entries table should be:

    • present in dataset_references table as unique rows for "Head" ref
    • present in dataset_statistics table as unique rows
    • present in dataset_key_blocks table (at least once)
  2. Records in dataset_key_blocks must respect validation criteria:

    • Seed event is present and stays at seq number 0
  3. Derivative datasets should be represented in the dataset_dependencies table as a "downstream" edge.
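
As a rough illustration of how the invariants above could be expressed as SQL, here is a minimal sketch that runs the checks against an in-memory SQLite database from Python. The column names (dataset_id, kind, block_ref_name, event_type, sequence_number, downstream_dataset_id), the 'head' and 'derivative' values, and the problem type names are all assumptions made for the example, not the real schema; the queries themselves use only standard SQL and should translate to Postgres with little change.

```python
import sqlite3

# Hypothetical minimal schema: column names are assumptions, not the real DDL.
SCHEMA = """
CREATE TABLE dataset_entries (dataset_id TEXT PRIMARY KEY, kind TEXT);
CREATE TABLE dataset_references (dataset_id TEXT, block_ref_name TEXT);
CREATE TABLE dataset_statistics (dataset_id TEXT);
CREATE TABLE dataset_key_blocks (dataset_id TEXT, event_type TEXT, sequence_number INTEGER);
CREATE TABLE dataset_dependencies (upstream_dataset_id TEXT, downstream_dataset_id TEXT);
"""

# One query per invariant; each returns the ids of the violating datasets.
CHECKS = {
    "head_ref_not_unique": """
        SELECT e.dataset_id FROM dataset_entries e
        LEFT JOIN dataset_references r
          ON r.dataset_id = e.dataset_id AND r.block_ref_name = 'head'
        GROUP BY e.dataset_id HAVING COUNT(r.dataset_id) <> 1
    """,
    "statistics_not_unique": """
        SELECT e.dataset_id FROM dataset_entries e
        LEFT JOIN dataset_statistics s ON s.dataset_id = e.dataset_id
        GROUP BY e.dataset_id HAVING COUNT(s.dataset_id) <> 1
    """,
    "no_key_blocks": """
        SELECT e.dataset_id FROM dataset_entries e
        WHERE NOT EXISTS (
            SELECT 1 FROM dataset_key_blocks k WHERE k.dataset_id = e.dataset_id)
    """,
    # Only datasets that have some key blocks, but no Seed at sequence number 0.
    "seed_not_at_zero": """
        SELECT DISTINCT k.dataset_id FROM dataset_key_blocks k
        WHERE NOT EXISTS (
            SELECT 1 FROM dataset_key_blocks s
            WHERE s.dataset_id = k.dataset_id
              AND s.event_type = 'Seed' AND s.sequence_number = 0)
    """,
    "derivative_without_dependency": """
        SELECT e.dataset_id FROM dataset_entries e
        WHERE e.kind = 'derivative' AND NOT EXISTS (
            SELECT 1 FROM dataset_dependencies d
            WHERE d.downstream_dataset_id = e.dataset_id)
    """,
}


def find_anomalies(conn):
    """Run every check and return {problem_type: sorted list of dataset ids}."""
    return {name: sorted(row[0] for row in conn.execute(sql))
            for name, sql in CHECKS.items()}


def demo_connection():
    """Build an in-memory database with one healthy and three broken datasets."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    ids = ["ds-ok", "ds-nocache", "ds-partial", "ds-deriv"]
    conn.executemany(
        "INSERT INTO dataset_entries VALUES (?, ?)",
        [("ds-ok", "root"), ("ds-nocache", "root"),
         ("ds-partial", "root"), ("ds-deriv", "derivative")],
    )
    conn.executemany("INSERT INTO dataset_references VALUES (?, 'head')",
                     [(i,) for i in ids])
    conn.executemany("INSERT INTO dataset_statistics VALUES (?)",
                     [(i,) for i in ids])
    conn.executemany(
        "INSERT INTO dataset_key_blocks VALUES (?, ?, ?)",
        [("ds-ok", "Seed", 0), ("ds-ok", "SetDataSchema", 1),
         ("ds-partial", "AddPushSource", 2),  # earlier events lost
         ("ds-deriv", "Seed", 0)],            # no dependency edge recorded
    )
    return conn


if __name__ == "__main__":
    for problem, dataset_ids in find_anomalies(demo_connection()).items():
        print(problem, dataset_ids)
```

The sample data mirrors the two failure modes from the incident (a dataset with no key blocks, and one that kept only AddPushSource) plus a derivative dataset with no dependency edge. Keeping one query per problem type makes it straightforward to report a separate count for each.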

For these invariants:

  • implement SQL queries that detect violations
  • deliver a Prometheus metric per problem type (e.g. a gauge with the anomaly count)
  • configure alerts for new incidents
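
To make the "gauge per problem type" idea concrete, the sketch below renders anomaly counts in the Prometheus text exposition format using only the Python standard library. The metric name dataset_consistency_anomalies and the problem_type label are hypothetical; a real exporter would more likely use an existing Prometheus client library or a SQL exporter.

```python
METRIC = "dataset_consistency_anomalies"  # hypothetical metric name


def render_metrics(counts):
    """Render {problem_type: count} as one labeled gauge in Prometheus text format."""
    lines = [
        f"# HELP {METRIC} Datasets violating a consistency invariant.",
        f"# TYPE {METRIC} gauge",
    ]
    for problem_type in sorted(counts):
        # One sample per problem type, distinguished by the problem_type label.
        lines.append(f'{METRIC}{{problem_type="{problem_type}"}} {counts[problem_type]}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics({"no_key_blocks": 1, "seed_not_at_zero": 2}), end="")
```

With a gauge shaped like this, an alert could be defined on an expression such as dataset_consistency_anomalies > 0. Reporting an explicit 0 for healthy problem types, rather than omitting the series, helps distinguish "no anomalies" from "metric not exported".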

Since these checks are quite heavyweight, it would be naive to bind Prometheus to those queries directly.
Consider the following implementation strategy that provides more control over performance:

  • detect anomalies in the form of a Materialized View in Postgres
  • refresh it concurrently every 30-60 minutes (configure a cron job using pg_cron or something similar)
  • bind Prometheus metrics to that materialized view, so that each scrape only reads the latest snapshot instead of running the heavy queries
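
The strategy above could look roughly like the following sketch, which keeps the Postgres statements as Python constants and applies them through any DB-API cursor (for example one from psycopg, assumed here because of the %s placeholder). All object names, the job name, and the columns are hypothetical, and only one check is shown in the view body. It also assumes the pg_cron extension is installed and recent enough to support named jobs.

```python
# All object and column names below are hypothetical; only one check is shown.
CREATE_VIEW = """
CREATE MATERIALIZED VIEW IF NOT EXISTS dataset_consistency_anomalies AS
SELECT 'no_key_blocks'::text AS problem_type, e.dataset_id
FROM dataset_entries e
WHERE NOT EXISTS (
    SELECT 1 FROM dataset_key_blocks k WHERE k.dataset_id = e.dataset_id
)
-- further checks would be appended here with UNION ALL
"""

# REFRESH ... CONCURRENTLY requires a unique index on the materialized view.
CREATE_INDEX = """
CREATE UNIQUE INDEX IF NOT EXISTS dataset_consistency_anomalies_uq
ON dataset_consistency_anomalies (problem_type, dataset_id)
"""

REFRESH = "REFRESH MATERIALIZED VIEW CONCURRENTLY dataset_consistency_anomalies"

# pg_cron job: run the refresh every 30 minutes.
SCHEDULE = "SELECT cron.schedule('refresh-consistency-anomalies', '*/30 * * * *', %s)"

# Cheap query for the metrics exporter: it reads the snapshot, not the base tables.
COUNTS = """
SELECT problem_type, COUNT(*) AS anomalies
FROM dataset_consistency_anomalies
GROUP BY problem_type
"""


def install(cursor):
    """Create the view and its unique index, then schedule the periodic refresh."""
    cursor.execute(CREATE_VIEW)
    cursor.execute(CREATE_INDEX)
    cursor.execute(SCHEDULE, (REFRESH,))


def read_counts(cursor):
    """Return {problem_type: count} from the latest snapshot."""
    cursor.execute(COUNTS)
    return dict(cursor.fetchall())
```

Because COUNTS groups over existing rows only, a problem type with no anomalies produces no row; the exporter would need to fill in zeros for the known problem types itself.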

In addition, it could make sense to trigger a refresh after each deployment, since deployments account for 90%+ of incidents.
The refresh should run once the deployment has stabilized: it could be triggered manually, or automated with a delay or as a post-deploy hook.
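
One simple way to automate the delayed variant is a post-deploy step that waits a fixed time and then triggers a single refresh. The sketch below is only an illustration: the function name, the 10-minute default delay, and the run_refresh callback (which would execute the refresh statement against the database) are all assumptions.

```python
import time


def refresh_after_deploy(run_refresh, delay_seconds=600, sleep=time.sleep):
    """Wait for the deployment to stabilize, then trigger one refresh.

    run_refresh is a caller-supplied callable; sleep is injectable for testing.
    """
    sleep(delay_seconds)
    run_refresh()


if __name__ == "__main__":
    # Short delay for demonstration purposes only.
    refresh_after_deploy(lambda: print("refresh triggered"), delay_seconds=1)
```

A fixed delay is the crudest option; a readiness check on the deployed services could replace it where one is available.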

Labels: enhancement
