Standardize organization of our build outputs on S3/GCS #5280
Open
Labels
cloud: Stuff that has to do with adapting PUDL to work in cloud computing context.
datapkg: Frictionless data package input, output, metadata, manipulation
metadata: Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.
nightly-builds: Anything having to do with nightly builds or continuous deployment.
Overview
datapackage.json filename for the PUDL parquet outputs, which it turns out is a very strongly adhered-to norm among datapackage users (and required to have a standards-compliant data package that validates cleanly)

Proposal
We should figure out a clean pattern for how to publish multiple datasets alongside each other, with the possibility of multiple formats for each dataset, and overhaul the layout of our outputs to match that pattern.
This will potentially be disruptive to users who have hard-coded the path. If there are any other changes we want to make, we should make them all at once so there's only one disruption. It seems like the dimensions we're juggling are dataset and format, with the additional complication that in some cases (dbf vs. xbrl) the formats pertain to mutually exclusive spans of time, while in others (sqlite vs. duckdb vs. parquet) they may all cover the same timespan.
Another possible change is moving from pudl.catalyst.coop to data.catalyst.coop, which is what we've kind of standardized on for the PUDL Data Viewer URL. It would be nice if they matched: it would suggest (correctly) that they're both looking at the same data, and make it more obvious that any data we publish would be under that bucket, not just "PUDL" data.

Example paths:
s3://data.catalyst.coop/nightly/pudl/datapackage.json
s3://data.catalyst.coop/nightly/pudl/core_epacems__hourly_emissions.parquet
s3://data.catalyst.coop/nightly/ferc1_dbf/ferc1_dbf.sqlite
s3://data.catalyst.coop/nightly/ferc1_dbf/ferc1_dbf.duckdb
s3://data.catalyst.coop/nightly/ferc1_dbf/f1_steam.parquet
s3://data.catalyst.coop/nightly/ferc1_dbf/datapackage.json
s3://data.catalyst.coop/nightly/ferc1_xbrl/ferc1_xbrl.sqlite
s3://data.catalyst.coop/nightly/ferc1_xbrl/ferc1_xbrl.sqlite
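
To make the pattern concrete, here is a rough sketch of how the release / dataset / filename layout could be generated. The function names (`output_url`, `database_url`, `table_url`, `datapackage_url`) are hypothetical and not part of PUDL. The sketch assumes that whole-database files are named after their dataset and that per-table files are parquet, as the paths in this issue suggest.

```python
# Hypothetical helpers illustrating the proposed layout; not existing PUDL code.
BUCKET = "s3://data.catalyst.coop"  # bucket name taken from this proposal


def output_url(release: str, dataset: str, filename: str) -> str:
    """Join release, dataset, and filename into one object URL."""
    parts = (release, dataset, filename)
    for part in parts:
        # Each component must be a single, non-empty path segment.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return "/".join((BUCKET, *parts))


def database_url(release: str, dataset: str, fmt: str) -> str:
    """URL of a whole-database file, assumed to be named after the dataset."""
    return output_url(release, dataset, f"{dataset}.{fmt}")


def table_url(release: str, dataset: str, table: str) -> str:
    """URL of a single table, assumed to be stored as parquet."""
    return output_url(release, dataset, f"{table}.parquet")


def datapackage_url(release: str, dataset: str) -> str:
    """URL of the dataset's descriptor, always named datapackage.json."""
    return output_url(release, dataset, "datapackage.json")
```

With this shape, `database_url("nightly", "ferc1_dbf", "duckdb")` yields `s3://data.catalyst.coop/nightly/ferc1_dbf/ferc1_dbf.duckdb`, and every dataset directory carries its own `datapackage.json`, so each one could validate as a standalone data package.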