
Create centralised way to publish production versions of data to Azure #123

Open · @penelopeysm
Right now, our production data lives in the Azure blob storage container https://popgetter.blob.core.windows.net/popgetter-dagster-test/test_2, and one of us populates it by setting ENV=prod and running all the Dagster pipelines locally :)
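For reference, the manual process looks roughly like the sketch below (the `popgetter` module name and the `--select '*'` asset selection are assumptions; the exact invocation may differ):

```python
import os
import subprocess

# Point the pipelines at the production blob storage container.
os.environ["ENV"] = "prod"

# Materialise every asset in the Dagster code location (module name assumed).
subprocess.run(
    ["dagster", "asset", "materialize", "--select", "*", "-m", "popgetter"],
    check=True,
)
```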

I think it would be useful to have a single, centralised way to generate all production data and upload it to another Azure blob storage container (one with a less test-y name :-)). There are several benefits:

  1. Reproducibility — It is clear which data is being uploaded and how it is being generated.
  2. Handles the top-level countries.txt file cleanly — The CLI uses this file to determine which countries are present, as it cannot traverse the Azure directory structure. Right now the file is generated manually, which can easily lead to inconsistencies between what it says and the data that is actually there (see the sketch after this list).
  3. Statelessness — The pipeline should wipe the entire blob storage container before re-uploading everything. That way we don't end up with some datasets updated and others not (which would be bad if, e.g., the metadata schema changes).
  4. Continuous deployment — The pipeline can be automatically triggered by new versions/releases on GitHub.
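As a rough sketch of points 2 and 3, the publish step could wipe the container first and then regenerate countries.txt from the top-level prefixes that were actually uploaded, so the file can never drift from the data. (The container name, connection string env var, and use of azure-storage-blob here are all assumptions.)

```python
import os

from azure.storage.blob import ContainerClient

# Hypothetical production container; the real name is still to be decided.
container = ContainerClient.from_connection_string(
    conn_str=os.environ["AZURE_STORAGE_CONNECTION_STRING"],
    container_name="popgetter-prod",
)

# 3. Statelessness: delete everything before re-uploading.
for blob in container.list_blobs():
    container.delete_blob(blob.name)

# ... run the Dagster pipelines here to upload fresh data ...

# 2. Regenerate countries.txt from the top-level "directories" that now exist.
countries = sorted({b.name.split("/")[0] for b in container.list_blobs() if "/" in b.name})
container.upload_blob("countries.txt", "\n".join(countries) + "\n", overwrite=True)
```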

I can throw together a quick Dockerfile for this and maybe investigate running it on GitHub Actions / Azure!

GHA has usage limits (https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration); in particular, "each job in a workflow can run for up to 6 hours of execution time", so it is not a deployment method that will scale well if we have many countries to run. For what we have now (BE + NI), I think it is still workable.
