diff --git a/docs/dev/nightly_data_builds.rst b/docs/dev/nightly_data_builds.rst
index aac531a0ca..a5367655c2 100644
--- a/docs/dev/nightly_data_builds.rst
+++ b/docs/dev/nightly_data_builds.rst
@@ -14,12 +14,29 @@ pushes a Docker image with PUDL installed to
 `Docker Hub `__.
+
+If it has failed, they will:
+
+1. Create a GitHub issue for the nightly build failure to hold the investigation & discussion.
+2. Track the GitHub issue and the build status in the above spreadsheet.
+3. Look in the logs and determine whether it was an "infrastructure failure," i.e. something went wrong with the code that *runs*
+the nightly build, or a "PUDL failure," i.e. something that went wrong with the PUDL ETL itself.
+4. Investigate the source of the issue & explore ways to fix it. Get help from the folks whose PRs broke the build.
+
+Avoiding breaking the builds
+----------------------------
 Because of how long the full build & tests take, we don't typically run them individually before merging every PR into ``main``. However, running ``make nuke``
@@ -28,33 +45,9 @@ data or made other changes that would be expected to break the data validations,
 the appropriate changes can be made prior to those changes hitting ``main`` and the nightly builds.
-If your PR causes the build to fail, you are probably the best person to fix the
-problem, since you already have context on all of the changes that went into it.
-
-Having multiple PRs merged into ``main`` simultaneously when the builds are breaking
-makes it ambiguous where the problem is coming from, makes debugging harder, and
-diffuses responsibility for the breakage across several people, so it's important to fix
-the breakage quickly. In some cases we may delay merging additional PRs into ``main``
-if the builds are failing to avoid ambiguity and facilitate debugging.
-
-Therefore, we've adopted the following etiquette regarding build breakage: On the
-morning after you merge a PR into ``main``, you should check whether the nightly builds
-succeeded by looking in the ``pudl-deployments`` Slack channel (which all team members
-should be subscribed to). If the builds failed, look at the logging output (which is
-included as an attachment to the notification) and figure out what kind of failure
-occurred:
-
- * If the failure is due to your changes, then you are responsible for fixing the
- problem and making a new PR to ``main`` that resolves it, and it should be a high
- priority. If you're stumped, ask for help!
- * If the failure is due to an infrastructural issue like the build server running out
- of memory and the build process getting killed, then you need to notify the member
- who is in charge of managing the builds (Currently :user:`bendnorman`), and hand off
- responsibility for debugging and fixing the issue.
- * If the failure is the result of a transient problem outside of our control like a
- network connection failing, then wait until the next morning and repeat the above
- process. If the "transient" problem persists, bring it up with the person
- managing the builds.
+Once the nightly build is broken, we can't know if any new changes on ``main``
+are valid or not. So we should avoid merging unrelated changes to ``main``
+until the builds pass again.
 
 Debugging a Broken Build
 ------------------------
diff --git a/docs/dev/project_management.rst b/docs/dev/project_management.rst
index 219b2b5074..233f6268a1 100644
--- a/docs/dev/project_management.rst
+++ b/docs/dev/project_management.rst
@@ -15,6 +15,54 @@ track bugs, enhancements, support requests, and just about any other work that g
 into the project. Try to make sure that issues have informative tags so we can find them easily.
+
+-------------------------------------------------------------------------------
+Bug triage
+-------------------------------------------------------------------------------
+
+When a new issue is discovered, we need to determine how urgent it is to
+address.
+
+Our core offering is complete, connected, and granular data. Issues that
+interrupt the availability of that data are of highest importance.
+
+However, not all datasets in PUDL are the same; some are mature and have many
+downstream users; others are more experimental. We've split them into "tier 1"
+and "tier 2" groups below.
+
+**Tier 1 datasets**
+* FERC 1 schedules XYZ
+* EIA 860 - XYZ tables
+* EIA 923 - XYZ tables
+* EPA CEMS
+
+**Tier 2 datasets**
+Everything else
+
+This then informs some reliability goals:
+
+For *tier 1* tables:
+
+- latest source data is incorporated into PUDL within 1 month of publication
+- nightly data build using latest PUDL code is available within 3 business days of any code changes
+- missing/incorrect data starts to be addressed within 2 weeks
+
+For *tier 2* tables we only shoot for nightly builds being available within 3 business days.
+
+Which then, in turn, informs our bug triage guidelines:
+
+**Urgent (find some way to address ASAP)**
+- nightly build failures
+- datasette not available
+- incorrect data in distribution buckets
+
+**High (prioritize in the upcoming sprint planning)**
+- missing/incorrect data in Tier 1 tables
+- new Tier 1 source data available
+
+**Medium (stuff in a backlog and don't forget about it)**
+- new Tier 2 source data available
+
 -------------------------------------------------------------------------------
 Our GitHub Workflow
 -------------------------------------------------------------------------------