Standardize organization of our build outputs on S3/GCS #5280

@zaneselvans

Overview

  • We currently generate several distinct dataset outputs in the nightly builds, including the FERC XBRL & DBF databases and the main PUDL outputs.
  • Currently the top-level output of our distributed builds includes all the PUDL files, while the FERC databases have subdirectories for their Parquet files.
  • This means the path patterns for accessing the different outputs are not uniform.
  • It also means that we can't use the canonical datapackage.json filename for the PUDL Parquet outputs. That filename turns out to be a very strongly held norm among datapackage users, and it is required for a standards-compliant data package that validates cleanly.

Proposal

  • We should figure out a clean pattern for how to publish multiple datasets alongside each other, with the possibility of multiple formats for each dataset, and overhaul the layout of our outputs to match that pattern.

  • This will be potentially disruptive to users who have hard-coded the path. If there are any other changes we want to make, we should make them all at once so there's only one disruption. It seems like the dimensions we're juggling are dataset and format, with the additional complication that in some cases (dbf vs. xbrl) the formats pertain to mutually exclusive spans of time, while in others (sqlite vs. duckdb vs. parquet) they may all cover the same timespan.

  • Another possible change is moving from pudl.catalyst.coop to data.catalyst.coop, which is what we've kind of standardized on for the PUDL Data Viewer URL. It would be nice if they matched: it would suggest (correctly) that both point at the same data, and make it more obvious that any data we publish would be under that bucket, not just "PUDL" data. For example:

  • s3://data.catalyst.coop/nightly/pudl/datapackage.json

  • s3://data.catalyst.coop/nightly/pudl/core_epacems__hourly_emissions.parquet

  • s3://data.catalyst.coop/nightly/ferc1_dbf/ferc1_dbf.sqlite

  • s3://data.catalyst.coop/nightly/ferc1_dbf/ferc1_dbf.duckdb

  • s3://data.catalyst.coop/nightly/ferc1_dbf/f1_steam.parquet

  • s3://data.catalyst.coop/nightly/ferc1_dbf/datapackage.json

  • s3://data.catalyst.coop/nightly/ferc1_xbrl/ferc1_xbrl.sqlite
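
The example paths above all follow the shape `<scheme>://<bucket>/<build>/<dataset>/<filename>`. A minimal sketch of that pattern, purely for illustration: the `Output` class and both helper functions are hypothetical names, not existing PUDL code, and reusing the same bucket name under a `gs://` scheme is an assumption.

```python
from dataclasses import dataclass

# Values copied from the example paths in this issue; not an existing PUDL config.
BUCKET = "data.catalyst.coop"
BUILD = "nightly"


@dataclass(frozen=True)
class Output:
    """One published file, identified by the dataset it belongs to."""

    dataset: str   # e.g. "pudl", "ferc1_dbf", "ferc1_xbrl"
    filename: str  # e.g. "datapackage.json", "f1_steam.parquet"

    def url(self, scheme: str = "s3") -> str:
        # Every dataset gets its own directory, so the path pattern is uniform.
        return f"{scheme}://{BUCKET}/{BUILD}/{self.dataset}/{self.filename}"


def descriptor(dataset: str) -> Output:
    """Each dataset directory can hold its own canonical datapackage.json."""
    return Output(dataset, "datapackage.json")


def database_outputs(dataset: str, formats=("sqlite", "duckdb")) -> list[Output]:
    """Whole-database formats are named after the dataset, one file per format."""
    return [Output(dataset, f"{dataset}.{fmt}") for fmt in formats]


if __name__ == "__main__":
    for ds in ("pudl", "ferc1_dbf", "ferc1_xbrl"):
        print(descriptor(ds).url())
    for out in database_outputs("ferc1_dbf"):
        print(out.url())
```

With one directory per dataset, the dataset dimension lives in the directory name and the format dimension lives in the file extension, so users only need to know those two values to construct a path.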

Metadata

Assignees

No one assigned

    Labels

    cloud: Stuff that has to do with adapting PUDL to work in cloud computing context.
    datapkg: Frictionless data package input, output, metadata, manipulation.
    metadata: Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.
    nightly-builds: Anything having to do with nightly builds or continuous deployment.

    Projects

    Status
    Icebox

    Milestone

    No milestone
