-
-
Notifications
You must be signed in to change notification settings - Fork 141
Propose bug triage process + nightly build mgmt process #3238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -14,12 +14,29 @@ pushes a Docker image with PUDL installed to `Docker Hub <https://hub.docker.com | |
| and deploys the image as a container to a GCE instance. The container runs the ETL and | ||
| tests, then copies the outputs to a public AWS S3 bucket for distribution. | ||
|
|
||
|
|
||
| Breaking the Builds | ||
| ------------------- | ||
| The nightly data builds based on the ``main`` branch are our comprehensive integration | ||
| tests. When they pass, we consider the results fit for public consumption. The builds | ||
| are expected to pass. If they don't then someone needs to take responsibility for | ||
| getting them working again with some urgency. | ||
| tests. When they pass, we consider the results fit for public consumption. | ||
|
|
||
| When they don't pass, we need to fix them. | ||
|
|
||
| Every morning, someone from inframundo will check the #pudl-deployments slack | ||
| channel to see if the build has succeeded. | ||
|
|
||
| If it has succeeded, they will track it in `this spreadsheet <https://docs.google.com/spreadsheets/d/11YdknGi4br51kxz03nNmxD-lOvilWFgm3jkUpiEdAzU/edit#gid=1678819446>`__. | ||
|
|
||
| If it has failed, they will: | ||
|
|
||
| 1. Create a GitHub issue for the nightly build failure to hold the investigation & discussion. | ||
| 2. Track the GitHub issue and the build status in the above spreadsheet. | ||
| 3. Look in the logs and determine whether it was an "infrastructure failure," i.e. something went wrong with the code that *runs* | ||
| the nightly build, or a "PUDL failure," i.e. something that went wrong with the PUDL ETL itself. | ||
| 4. Investigate the source of the issue & explore ways to fix it. Get help from the folks whose PRs broke the build. | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you think we need more guidance here? TBH I sort of ran out of doc-writing steam for this afternoon. But I also think this is probably good enough? |
||
|
|
||
| Avoiding breaking the builds | ||
| ---------------------------- | ||
|
|
||
| Because of how long the full build & tests take, we don't typically run them | ||
| individually before merging every PR into ``main``. However, running ``make nuke`` | ||
|
|
@@ -28,33 +45,9 @@ data or made other changes that would be expected to break the data validations, | |
| the appropriate changes can be made prior to those changes hitting ``main`` and the | ||
| nightly builds. | ||
|
|
||
| If your PR causes the build to fail, you are probably the best person to fix the | ||
| problem, since you already have context on all of the changes that went into it. | ||
|
|
||
| Having multiple PRs merged into ``main`` simultaneously when the builds are breaking | ||
| makes it ambiguous where the problem is coming from, makes debugging harder, and | ||
| diffuses responsibility for the breakage across several people, so it's important to fix | ||
| the breakage quickly. In some cases we may delay merging additional PRs into ``main`` | ||
| if the builds are failing to avoid ambiguity and facilitate debugging. | ||
|
|
||
| Therefore, we've adopted the following etiquette regarding build breakage: On the | ||
| morning after you merge a PR into ``main``, you should check whether the nightly builds | ||
| succeeded by looking in the ``pudl-deployments`` Slack channel (which all team members | ||
| should be subscribed to). If the builds failed, look at the logging output (which is | ||
| included as an attachment to the notification) and figure out what kind of failure | ||
| occurred: | ||
|
|
||
| * If the failure is due to your changes, then you are responsible for fixing the | ||
| problem and making a new PR to ``main`` that resolves it, and it should be a high | ||
| priority. If you're stumped, ask for help! | ||
| * If the failure is due to an infrastructural issue like the build server running out | ||
| of memory and the build process getting killed, then you need to notify the member | ||
| who is in charge of managing the builds (Currently :user:`bendnorman`), and hand off | ||
| responsibility for debugging and fixing the issue. | ||
| * If the failure is the result of a transient problem outside of our control like a | ||
| network connection failing, then wait until the next morning and repeat the above | ||
| process. If the "transient" problem persists, bring it up with the person | ||
| managing the builds. | ||
| Once the nightly build is broken, we can't know if any new changes on ``main`` | ||
| are valid or not. So we should avoid merging unrelated changes to ``main`` | ||
| until the builds pass again. | ||
|
|
||
| Debugging a Broken Build | ||
| ------------------------ | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -15,6 +15,54 @@ track bugs, enhancements, support requests, and just about any other work that g | |
| into the project. Try to make sure that issues have informative tags so we can find | ||
| them easily. | ||
|
|
||
|
|
||
| ------------------------------------------------------------------------------- | ||
| Bug triage | ||
| ------------------------------------------------------------------------------- | ||
|
|
||
| When a new issue is discovered, we need to determine how urgent it is to | ||
| address. | ||
|
|
||
| Our core offering is complete, connected, and granular data. Issues that | ||
| interrupt the availability of that data are of highest importance. | ||
|
|
||
| However, not all datasets in PUDL are the same; some are mature and have many | ||
| downstream users; others are more experimental. We've split them into "tier 1" | ||
| and "tier 2" groups below. | ||
|
|
||
| **Tier 1 datasets** | ||
| * FERC 1 schedules XYZ | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is just some placeholder stuff. I think the main thrust of this proposal is:
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The issue here is that the most integrated data has the most testing infrastructure written into it and is the most likely to fail. I think more useful here would be that the first step of triage is actually to scope out the issue and write some proposed solutions. Then we make the step of deciding whether we want to fix it in the most minimal way (relax the restriction, xfail the test), actually fix the core issue, or implement a more extensive design change. I think the pause between "here's the problem and what needs to be done" and "which version of this fix should we implement now" is probably the thing we most often fail to do and would help us prioritize. |
||
| * EIA 860 - XYZ tables | ||
| * EIA 923 - XYZ tables | ||
| * EPA CEMS | ||
|
|
||
| **Tier 2 datasets** | ||
| Everything else | ||
|
|
||
| This then informs some reliability goals: | ||
|
|
||
| For *tier 1* tables: | ||
|
|
||
| - latest source data is incorporated into PUDL within 1 month of publication | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is faster than our actual funding levels allow for at present, we're running on a quarterly integration calendar for our sub-annual datasets for now. |
||
| - nightly data build using latest PUDL code is available within 3 business days of any code changes | ||
| - missing/incorrect data starts to be addressed within 2 weeks | ||
|
|
||
| For *tier 2* tables we only shoot for nightly builds being available within 3 business days. | ||
|
|
||
| Which then, in turn, informs our bug triage guidelines: | ||
|
|
||
| **Urgent (find some way to address ASAP)** | ||
| - nightly build failures | ||
| - datasette not available | ||
| - incorrect data in distribution buckets | ||
|
|
||
| **High (prioritize in the upcoming sprint planning)** | ||
| - missing/incorrect data in Tier 1 tables | ||
| - new Tier 1 source data available | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. New data availability won't show up in our nightly builds, so I'm not 100% following the connection here? |
||
|
|
||
| **Medium (stuff in a backlog and don't forget about it)** | ||
| - new Tier 2 source data available | ||
|
|
||
| ------------------------------------------------------------------------------- | ||
| Our GitHub Workflow | ||
| ------------------------------------------------------------------------------- | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@catalyst-cooperative/inframundo is it ok to sign us up for this? I think it's mostly been @zaneselvans and @bendnorman checking in the past. I've been doing it in the new year. We could set up a rotation if we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a realistic description of this team's responsibilities, even if we wind up delegating the fix to someone outside of inframundo.