Standardize organization of our build outputs on S3/GCS #5280

# Overview

- We currently generate several distinct dataset outputs in the nightly builds, including the FERC XBRL & DBF databases and the main PUDL outputs.
- Currently the top-level output of our distributed builds includes all the PUDL files, while the FERC databases have subdirectories for their Parquet files.
- This means the path patterns for accessing the different outputs are not uniform.
- It also means that we can't use the canonical `datapackage.json` filename for the PUDL Parquet outputs. It turns out that filename is a very strongly adhered-to norm among datapackage users, and it is required for a standards-compliant data package that validates cleanly.

## Proposal

- We should figure out a clean pattern for how to publish multiple datasets alongside each other, with the possibility of multiple formats for each dataset, and overhaul the layout of our outputs to match that pattern.
- This will potentially be disruptive to users who have hard-coded the path. If there are any other changes we want to make, we should make them all at once so there's only one disruption. It seems like the dimensions we're juggling are dataset and format, with the additional complication that in some cases (dbf vs. xbrl) the formats pertain to mutually exclusive spans of time, while in others (sqlite vs. duckdb vs. parquet) they may all cover the same timespan.
- Another possible change is moving from `pudl.catalyst.coop` to `data.catalyst.coop`, which is what we've kind of standardized on for the PUDL Data Viewer URL. It would be nice if they matched: it would suggest (correctly) that they're both looking at the same data, and make it more obvious that any data we publish would be under that bucket, not just "PUDL" data.

- `s3://data.catalyst.coop/nightly/pudl/datapackage.json`
- `s3://data.catalyst.coop/nightly/pudl/core_epacems__hourly_emissions.parquet` 
- `s3://data.catalyst.coop/nightly/ferc1_dbf/ferc1_dbf.sqlite`
- `s3://data.catalyst.coop/nightly/ferc1_dbf/ferc1_dbf.duckdb` 
- `s3://data.catalyst.coop/nightly/ferc1_dbf/f1_steam.parquet`
- `s3://data.catalyst.coop/nightly/ferc1_dbf/datapackage.json`
- `s3://data.catalyst.coop/nightly/ferc1_xbrl/ferc1_xbrl.sqlite`
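
The example paths above all seem to follow one pattern, `<bucket>/<build>/<dataset>/<file>`. Here is a minimal Python sketch of that pattern, to make the two dimensions (dataset and format) concrete. The helper names and the `build` parameter are hypothetical and are not part of any existing PUDL API. The sketch also assumes that database files are named after their dataset and that Parquet files are named after their table, as in the examples.

```python
# Hypothetical helpers illustrating the proposed layout; not existing PUDL code.
BUCKET = "s3://data.catalyst.coop"  # proposed bucket name from this issue


def output_path(dataset: str, filename: str, build: str = "nightly") -> str:
    """Return <bucket>/<build>/<dataset>/<filename>."""
    return f"{BUCKET}/{build}/{dataset}/{filename}"


def datapackage_path(dataset: str, build: str = "nightly") -> str:
    """Each dataset directory gets its own canonical datapackage.json."""
    return output_path(dataset, "datapackage.json", build)


def table_path(dataset: str, table: str, build: str = "nightly") -> str:
    """Parquet outputs: one file per table, named after the table."""
    return output_path(dataset, f"{table}.parquet", build)


def database_path(dataset: str, fmt: str, build: str = "nightly") -> str:
    """Single-file databases (e.g. sqlite, duckdb), named after the dataset."""
    return output_path(dataset, f"{dataset}.{fmt}", build)


if __name__ == "__main__":
    print(datapackage_path("pudl"))
    print(table_path("ferc1_dbf", "f1_steam"))
    print(database_path("ferc1_xbrl", "sqlite"))
```

With a uniform pattern like this, a user switches datasets or formats by changing one argument instead of learning a different path shape for each output.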
