Publishing the data #10

@olizilla

Ideally, a published dataset is immutable and easily accessible.

The latest dataset is published as a Google Sheet. Sheets is a good fit as the tool for collecting the data: multiple people can work on the same sheets, and it's a familiar interface.

For publishing versions of the data it's less useful. It works for people who also want to work with the data in Sheets, but for other uses, like visualisations or querying, it's less helpful. The data can't be used directly: you have to export it, edit it, and host it somewhere. It's a faff, and it moves the data away from being a published, immutable snapshot.

A nice pattern would be to publish each version of the dataset as JSON and CSV under the dpimap.org website at a friendly URL like

https://dpimap.org/data/<ISO 8601 release date>.<dpi type>.csv

e.g. https://dpimap.org/data/2025-03-01.payment.csv (a proposal, will not work yet)

This has been referred to as the "baked data" pattern.
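A minimal sketch of how a release URL could be built from the proposed pattern. The base URL is the proposal from this issue and does not resolve yet; the function name `release_url` and the `fmt` parameter are hypothetical, added only for illustration.

```python
from datetime import date

# Proposed location from this issue; nothing is published here yet.
BASE_URL = "https://dpimap.org/data"


def release_url(release_date: date, dpi_type: str, fmt: str = "csv") -> str:
    """Build <base>/<ISO 8601 release date>.<dpi type>.<format>."""
    if fmt not in ("csv", "json"):
        raise ValueError(f"unsupported format: {fmt!r}")
    # date.isoformat() gives the YYYY-MM-DD form used in the example URL.
    return f"{BASE_URL}/{release_date.isoformat()}.{dpi_type}.{fmt}"
```

Because the release date is part of the path, each URL only ever points at one snapshot, which is what keeps a published version immutable.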

JSON and CSV are among the most widely understood data formats. Publishing each new version of the dataset in those formats under the dpimap.org domain will make it more accessible and easy to use directly, without people having to edit the data or host it themselves. It will also improve citations, as the reference URL will be under the dpimap.org domain.

We can continue to use Google Sheets as the tool for collecting and editing the data. All we need is a bit of code that can pull the data from the sheets and create the JSON and CSV files. That should be easy with tools like https://sheetjson.com/ but something in our data is preventing that from working. Last time the problem was a cell containing a number in a column where Sheets expected only text. I think the problem this time is that we have multiple header rows, so it's not possible to guess how to transform the data. CSV data typically contains 0 or 1 header rows.
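As a rough way to spot the first kind of problem before running a converter, a check like the following could flag columns that mix numbers and text. This is a sketch only: it assumes the sheet has already been exported as CSV text with the first row as the header, and `mixed_type_columns` is a hypothetical helper, not part of any existing tool.

```python
import csv
import io


def is_number(cell: str) -> bool:
    """True if the cell parses as a float (note: this also accepts 'nan' and 'inf')."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def mixed_type_columns(csv_text: str) -> list:
    """Return the names of columns holding both numeric and non-numeric cells."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    mixed = []
    for i, name in enumerate(header):
        # Ignore empty cells and rows that are shorter than the header.
        cells = [r[i].strip() for r in data if i < len(r) and r[i].strip()]
        numeric = sum(is_number(c) for c in cells)
        if 0 < numeric < len(cells):
            mixed.append(name)
    return mixed
```

An extra header row would likely show up here too, as a text cell in an otherwise numeric column, but the check cannot detect extra headers sitting above text-only columns.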

As a first step we should remove the extra header rows from the sheets and see if we can get the Sheets CSV export to produce a coherent CSV file; then I think sheetjson will start working too. Once we have those, it will be easy to publish new (and old) releases to the website.
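Once the export has a single header row, producing the JSON version could be as small as the sketch below. It assumes the CSV text has already been downloaded from Sheets; `csv_to_records` and `records_to_json` are hypothetical names, and the header check is just one guess at what "coherent" might mean in practice.

```python
import csv
import io
import json


def csv_to_records(csv_text: str) -> list:
    """Parse CSV text with exactly one header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    names = reader.fieldnames or []
    # Blank or repeated column names usually mean the header row is not clean yet.
    if any(not n.strip() for n in names) or len(set(names)) != len(names):
        raise ValueError("header row has blank or duplicate column names")
    return list(reader)


def records_to_json(records: list) -> str:
    """Serialise the records as readable JSON, keeping non-ASCII text as-is."""
    return json.dumps(records, indent=2, ensure_ascii=False)
```

All values come out as strings, since CSV carries no type information; converting selected columns to numbers would be a separate, explicit step.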
