catalyst-cooperative · jdangerx · Jan 12, 2024 · jdangerx · Jan 12, 2024 · e-belfer
diff --git a/docs/dev/nightly_data_builds.rst b/docs/dev/nightly_data_builds.rst
@@ -14,12 +14,29 @@ pushes a Docker image with PUDL installed to `Docker Hub <https://hub.docker.com
 and deploys the image as a container to a GCE instance. The container runs the ETL and
 tests, then copies the outputs to a public AWS S3 bucket for distribution.
 
+
 Breaking the Builds
 -------------------
 The nightly data builds based on the ``main`` branch are our comprehensive integration
-tests. When they pass, we consider the results fit for public consumption.  The builds
-are expected to pass. If they don't then someone needs to take responsibility for
-getting them working again with some urgency.
+tests. When they pass, we consider the results fit for public consumption.
+
+When they don't pass, we need to fix them.
+
+Every morning, someone from inframundo will check the #pudl-deployments slack
+channel to see if the build has succeeded.
+
+If it has succeeded, they will track it in `this spreadsheet <https://docs.google.com/spreadsheets/d/11YdknGi4br51kxz03nNmxD-lOvilWFgm3jkUpiEdAzU/edit#gid=1678819446>`__.
+
+If it has failed, they will:
+
+1. Create a GitHub issue for the nightly build failure to hold the investigation & discussion.
+2. Track the GitHub issue and the build status in the above spreadsheet.
+3. Look in the logs and determine whether it was an "infrastructure failure," i.e. something went wrong with the code that *runs*
+the nightly build, or a "PUDL failure," i.e. something that went wrong with the PUDL ETL itself.
+4. Investigate the source of the issue & explore ways to fix it. Get help from the folks whose PRs broke the build.
+
+Avoiding breaking the builds
+----------------------------
 
 Because of how long the full build & tests take, we don't typically run them
 individually before merging every PR into ``main``. However, running ``make nuke``
@@ -28,33 +45,9 @@ data or made other changes that would be expected to break the data validations,
 the appropriate changes can be made prior to those changes hitting ``main`` and the
 nightly builds.
 
-If your PR causes the build to fail, you are probably the best person to fix the
-problem, since you already have context on all of the changes that went into it.
-
-Having multiple PRs merged into ``main`` simultaneously when the builds are breaking
-makes it ambiguous where the problem is coming from, makes debugging harder, and
-diffuses responsibility for the breakage across several people, so it's important to fix
-the breakage quickly. In some cases we may delay merging additional PRs into ``main``
-if the builds are failing to avoid ambiguity and facilitate debugging.
-
-Therefore, we've adopted the following etiquette regarding build breakage: On the
-morning after you merge a PR into ``main``, you should check whether the nightly builds
-succeeded by looking in the ``pudl-deployments`` Slack channel (which all team members
-should be subscribed to). If the builds failed, look at the logging output (which is
-included as an attachment to the notification) and figure out what kind of failure
-occurred:
-
-  * If the failure is due to your changes, then you are responsible for fixing the
-    problem and making a new PR to ``main`` that resolves it, and it should be a high
-    priority. If you're stumped, ask for help!
-  * If the failure is due to an infrastructural issue like the build server running out
-    of memory and the build process getting killed, then you need to notify the member
-    who is in charge of managing the builds (Currently :user:`bendnorman`), and hand off
-    responsibility for debugging and fixing the issue.
-  * If the failure is the result of a transient problem outside of our control like a
-    network connection failing, then wait until the next morning and repeat the above
-    process. If the "transient" problem persists, bring it up with the person
-    managing the builds.
+Once the nightly build is broken, we can't know if any new changes on ``main``
+are valid or not. So we should avoid merging unrelated changes to ``main``
+until the builds pass again.
 
 Debugging a Broken Build
 ------------------------

diff --git a/docs/dev/project_management.rst b/docs/dev/project_management.rst
@@ -15,6 +15,54 @@ track bugs, enhancements, support requests, and just about any other work that g
 into the project. Try to make sure that issues have informative tags so we can find
 them easily.
 
+
+-------------------------------------------------------------------------------
+Bug triage
+-------------------------------------------------------------------------------
+
+When a new issue is discovered, we need to determine how urgent it is to
+address.
+
+Our core offering is complete, connected, and granular data. Issues that
+interrupt the availability of that data are of highest importance.
+
+However, not all datasets in PUDL are the same; some are mature and have many
+downstream users; others are more experimental. We've split them into "tier 1"
+and "tier 2" groups below.
+
+**Tier 1 datasets**
+* FERC 1 schedules XYZ
+* EIA 860 - XYZ tables
+* EIA 923 - XYZ tables
+* EPA CEMS
+
+**Tier 2 datasets**
+Everything else
+
+This then informs some reliability goals:
+
+For *tier 1* tables:
+
+- latest source data is incorporated into PUDL within 1 month of publication
+- nightly data build using latest PUDL code is available within 3 business days of any code changes
+- missing/incorrect data starts to be addressed within 2 weeks
+
+For *tier 2* tables we only shoot for nightly builds being available within 3 business days.
+
+Which then, in turn, informs our bug triage guidelines:
+
+**Urgent (find some way to address ASAP)**
+- nightly build failures
+- datasette not available
+- incorrect data in distribution buckets
+
+**High (prioritize in the upcoming sprint planning)**
+- missing/incorrect data in Tier 1 tables
+- new Tier 1 source data available
+
+**Medium (stuff in a backlog and don't forget about it)**
+- new Tier 2 source data available
+
 -------------------------------------------------------------------------------
 Our GitHub Workflow
 -------------------------------------------------------------------------------