-
-
Notifications
You must be signed in to change notification settings - Fork 3
Open
Labels
performanceResource consumption like memory or CPU intensityResource consumption like memory or CPU intensity
Description
Overview
- DuckDB now allows remote querying of their native file format over HTTPS / S3 / GCS.
- We are not fully processing all the FERC data that we extract from XBRL, and republish it as whole databases (unlike the PUDL data which is available as Parquet).
- Providing the outputs in a format that allows remote querying would make it easier for users to pull data down without needing to manage local files and paths, and make it easier to share SQL queries and code snippets that would work regardless of where they're being run.
- Encouraging direct access to the data in cloud storage would also potentially provide us with useful usage stats, so we could see which databases and tables within them are getting the most use.
- We already need to convert all of these databases to DuckDB (or another remote-friendly format like Parquet) to make them accessible in the PUDL Viewer and making that format the original and only format would reduce the complexity of the overall translation pipeline / reduce the potential for translation errors.
- DuckDB files are also compressed internally, so we could hopefully stop zipping the files.
- DuckDB data types are richer than SQLite.
- We don't really maintain FK constraints in these databases anyway.
- We could also use a top-level schema to organize all the FERC data in a single database instead of like 8-10 different SQLite files... e.g.
ferc1_xbrl.table_nameandferc1_dbf.table_nameandferc2_xbrl.table_nameetc. all within a singleferc_raw.duckdb(or whatever), with each top level element corresponding to one of the existing SQLite DBs. This would be a single large file which would be less fun to download, but it would be remotely queryable and keep everything together in one internally self-consistent bundle. - We would need a separate update within the PUDL repo to manage doing the DBF to SQLite part, or could split that out into a separate repo analogous to this one for cleaner organization and the ability to test & manage the FERC Extraction independent of PUDL if we wanted to.
Update (10-13-25)
We are now producing duckdb files from XBRL, but to ease the transition with PUDL we've decided to continue producing SQLite as well. There's no real reason to continue producing SQLite other than the fact that it's what PUDL uses now, so it's still worth deprecating.
Remaining work
- Update PUDL to use new duckdb data
- Update extractor to stop producing SQLite
- Fix generated datapackages to point to duckdb files
- Update CLI to remove SQLite specific options
Metadata
Metadata
Assignees
Labels
performanceResource consumption like memory or CPU intensityResource consumption like memory or CPU intensity
Type
Projects
Status
Icebox