The people working on PUDL are distributed all over North America, so collaboration takes place online. We make extensive use of GitHub's built-in project management tools and we work in public. You can follow our progress in our GitHub Projects.
We use GitHub issues to track bugs, enhancements, support requests, and just about any other work that goes into the project. Try to make sure that issues have informative tags so we can find them easily.
When a new issue is discovered, we need to determine how urgent it is to address.
Our core offering is complete, connected, and granular data. Issues that interrupt the availability of that data are of highest importance.
However, not all datasets in PUDL are the same; some are mature and have many downstream users; others are more experimental. We've split them into "tier 1" and "tier 2" groups below.
Tier 1 datasets

* FERC 1 schedules XYZ
* EIA 860 - XYZ tables
* EIA 923 - XYZ tables
* EPA CEMS

Tier 2 datasets

* Everything else
This then informs some reliability goals:
For tier 1 tables:
- latest source data is incorporated into PUDL within 1 month of publication
- nightly data build using latest PUDL code is available within 3 business days of any code changes
- missing/incorrect data starts to be addressed within 2 weeks
For tier 2 tables we only shoot for nightly builds being available within 3 business days.
Which then, in turn, informs our bug triage guidelines:
Urgent (find some way to address ASAP)

- nightly build failures
- datasette not available
- incorrect data in distribution buckets

High (prioritize in the upcoming sprint planning)

- missing/incorrect data in Tier 1 tables
- new Tier 1 source data available

Medium (stuff in a backlog and don't forget about it)

- new Tier 2 source data available
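The triage guidelines above amount to a lookup from the kind of issue to a priority level. A minimal sketch of that mapping follows. The issue-kind strings and the ``triage`` helper are hypothetical names made up for illustration; they are not labels or code from the PUDL repository.

```python
from enum import Enum
from typing import Optional


class Priority(Enum):
    URGENT = "urgent"  # find some way to address ASAP
    HIGH = "high"      # prioritize in the upcoming sprint planning
    MEDIUM = "medium"  # keep in the backlog and don't forget about it


# Hypothetical issue kinds, one per bullet in the guidelines above.
TRIAGE_RULES = {
    "nightly-build-failure": Priority.URGENT,
    "datasette-unavailable": Priority.URGENT,
    "bad-data-in-distribution": Priority.URGENT,
    "tier1-bad-data": Priority.HIGH,
    "tier1-new-source-data": Priority.HIGH,
    "tier2-new-source-data": Priority.MEDIUM,
}


def triage(kind: str) -> Optional[Priority]:
    """Return the priority for an issue kind, or None if the guidelines don't cover it."""
    return TRIAGE_RULES.get(kind)
```

Issue kinds that the guidelines don't mention return ``None`` instead of a guessed priority, so they still need a human decision.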
- We have 3 persistent branches: ``main`` (the default branch), ``nightly``, and ``stable``.
- We create temporary feature branches off of ``main`` and make pull requests into ``main`` throughout our 2 week long sprints. All code that's merged into ``main`` should have passed our CI tests and been reviewed by at least one other person.
- Every night the ``main`` branch is used to run the :ref:`nightly-data-builds`. If the builds are successful, then the ``nightly`` branch is automatically updated to point to the latest commit on ``main``. If the builds fail, then the ``nightly`` branch is left unchanged.
- Every time we do a versioned data release, the ``stable`` branch is updated to point to the commit associated with the most recent release.
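The rule for moving the ``nightly`` branch can be stated as a small pure function. This is only a sketch of the decision described above, not the actual automation; the function name and the commit identifiers in the usage note are hypothetical.

```python
def next_nightly_commit(main_head: str, nightly_head: str, build_succeeded: bool) -> str:
    """Return the commit that ``nightly`` should point to after a nightly build.

    On success, ``nightly`` moves to the latest commit on ``main``.
    On failure, ``nightly`` is left where it was.
    """
    return main_head if build_succeeded else nightly_head
```

For example, ``next_nightly_commit("abc123", "9f8e7d", build_succeeded=False)`` returns ``"9f8e7d"``, so ``nightly`` always points at the most recent commit whose build passed.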
- Before making a PR, make sure the tests run and pass locally, including the code linters and pre-commit hooks. See :ref:`linting` for details.
- Don't forget to merge any new commits to the ``main`` branch into your feature branch before making a PR.
- If for some reason the continuous integration tests fail for your PR, try to figure out why and fix it, or ask for help. If the tests fail, we don't want to merge the PR into ``main``. You can see the status of the CI builds in the GitHub Actions for the PUDL repo.
- Please don't decrease the overall test coverage. If you introduce new code, it also needs to be exercised by the tests. See :doc:`testing` for details.
- Write good docstrings using the Google format.
- Pull Requests should update the documentation to reflect changes to the code, especially if it changes something user-facing, like how one of the command line scripts works.
- The PUDL data processing pipeline isn't intended to be used as a library that other Python packages depend on. Rather, it's an end-use application that produces data which other applications and analyses can consume. Because of this, we no longer release installable packages on PyPI or ``conda-forge``.
- Periodically, we tag a versioned release on ``main`` using a calendar based version, like ``v2023.07.15``. This triggers a snapshot of the repository being archived on Zenodo.
- The nightly build outputs associated with any tagged release will also get archived on Zenodo and be made available longer term in the AWS Open Data Registry.
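As an illustration of the calendar based version scheme, the sketch below builds a tag name from a release date. The zero-padded ``vYYYY.MM.DD`` pattern is inferred from the single example ``v2023.07.15`` above, and ``calendar_version`` is a hypothetical helper, not part of PUDL.

```python
from datetime import date


def calendar_version(release_date: date) -> str:
    """Format a date as a calendar version tag, assuming a vYYYY.MM.DD pattern."""
    # date objects accept strftime-style codes in format specs.
    return f"v{release_date:%Y.%m.%d}"
```

Calling ``calendar_version(date(2023, 7, 15))`` yields ``"v2023.07.15"``, matching the example tag.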
We don't (yet) have funding to do user support, so it's currently all community and volunteer based. To ensure that others can find the answers to questions that have already been asked, we try to do all support in public using GitHub Discussions.