Propose bug triage process + nightly build mgmt process by jdangerx · Pull Request #3238 · catalyst-cooperative/pudl

jdangerx · 2024-01-12T22:19:09Z

Overview

Sometimes stuff goes wrong! We need to decide how urgently each problem needs to be fixed! If we don't make those decisions, we'll treat everything as urgent and sink hours and hours into stuff that we don't actually care about!

It might be worth writing down some guidelines - but maybe we can just use "discuss as a team how urgent this is" as a process for a little longer. In any case, I tried to sketch out what some written guidelines could look like.

Separately, we have a bit more of an actual process around the nightly build failures now, so I wrote that down.

Testing

Well, other people need to read it and give feedback ;)

# To-do list
- [x] Review the PR yourself and call out any questions or issues you have

jdangerx · 2024-01-12T22:20:02Z

+
+When they don't pass, we need to fix them.
+
+Every morning, someone from inframundo will check the #pudl-deployments slack


@catalyst-cooperative/inframundo is it ok to sign us up for this? I think it's mostly been @zaneselvans and @bendnorman checking in the past. I've been doing it in the new year. We could set up a rotation if we want.

jdangerx · 2024-01-12T22:20:46Z

+2. Track the GitHub issue and the build status in the above spreadsheet.
+3. Look in the logs and determine whether it was an "infrastructure failure," i.e. something went wrong with the code that *runs*
+the nightly build, or a "PUDL failure," i.e. something that went wrong with the PUDL ETL itself.
+4. Investigate the source of the issue & explore ways to fix it. Get help from the folks whose PRs broke the build.


Do you think we need more guidance here? TBH I sort of ran out of doc-writing steam for this afternoon. But I also think this is probably good enough?

jdangerx · 2024-01-12T22:22:57Z

+and "tier 2" groups below.
+
+**Tier 1 datasets**
+* FERC 1 schedules XYZ


This is just some placeholder stuff. I think the main thrust of this proposal is:

1. We should decide which datasets are "important" and thus warrant firefighting action, vs. "unimportant" and thus get slotted in with the rest.

2. We should also decide what it means for something to be "broken" - i.e. X% data missing/incorrect, new data unincorporated after X time, etc. - absolutely I need help actually defining these

If we have 1. and 2. then it will be much easier for us to make prioritization decisions about random things that blow up!
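A rough sketch of how 1. and 2. could combine into a single "does this warrant firefighting?" check. Everything here is hypothetical: the names `Tier`, `BrokenThresholds`, `is_broken`, and `needs_firefighting` are made up for illustration, and the thresholds are deliberately left as required parameters because the actual "X%" and "X time" values have not been decided.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Hypothetical dataset tiers; which datasets belong where is undecided."""

    TIER_1 = 1
    TIER_2 = 2


@dataclass(frozen=True)
class BrokenThresholds:
    """Placeholder definition of "broken". No defaults: the real values are TBD."""

    max_bad_fraction: float  # the "X% data missing/incorrect" threshold
    max_days_stale: int  # the "new data unincorporated after X time" threshold


def is_broken(bad_fraction: float, days_stale: int, thresholds: BrokenThresholds) -> bool:
    """A dataset counts as broken if it exceeds either threshold."""
    return (
        bad_fraction > thresholds.max_bad_fraction
        or days_stale > thresholds.max_days_stale
    )


def needs_firefighting(
    tier: Tier, bad_fraction: float, days_stale: int, thresholds: BrokenThresholds
) -> bool:
    """Only broken tier 1 datasets warrant firefighting; the rest get slotted in."""
    return tier is Tier.TIER_1 and is_broken(bad_fraction, days_stale, thresholds)


# Illustrative numbers only, not proposed values.
example = BrokenThresholds(max_bad_fraction=0.05, max_days_stale=90)
print(needs_firefighting(Tier.TIER_1, 0.10, 0, example))
print(needs_firefighting(Tier.TIER_2, 0.10, 0, example))
```

The point of the sketch is only that once the tiers and the thresholds are written down, the prioritization call becomes mechanical for the common cases.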

e-belfer

Left some questions and suggestions.

e-belfer · 2024-01-16T14:15:36Z

+
+When they don't pass, we need to fix them.
+
+Every morning, someone from inframundo will check the #pudl-deployments slack


I think this is a realistic description of this team's responsibilities, even if we wind up delegating the fix to someone outside of inframundo.

e-belfer · 2024-01-16T14:19:38Z

+
+For *tier 1* tables:
+
+- latest source data is incorporated into PUDL within 1 month of publication


This is faster than our actual funding levels allow for at present; we're running on a quarterly integration calendar for our sub-annual datasets for now.

e-belfer · 2024-01-16T14:24:20Z

+and "tier 2" groups below.
+
+**Tier 1 datasets**
+* FERC 1 schedules XYZ


The issue here is that the most integrated data has the most testing infrastructure written into it and is the most likely to fail. I think it would be more useful for the first step of triage to be scoping out the issue and writing up some proposed solutions. Then we decide whether we want to fix it in the most minimal way (relax the restriction, xfail the test), actually fix the core issue, or implement a more extensive design change. I think the pause between "here's the problem and what needs to be done" and "which version of this fix should we implement now" is probably the thing we most often fail to do, and taking it would help us prioritize.
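To make that pause concrete, here is a minimal sketch of a triage record that refuses to pick a fix until the problem has been scoped and at least one solution has been written down. The names `FixLevel` and `TriageRecord` are hypothetical and not part of PUDL; the three levels just mirror the options listed above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class FixLevel(Enum):
    """The three kinds of fix described above (hypothetical names)."""

    MINIMAL = "relax the restriction or xfail the test"
    CORE_FIX = "fix the core issue"
    DESIGN_CHANGE = "implement a more extensive design change"


@dataclass
class TriageRecord:
    """Step 1 fills in `problem` and `proposed_solutions`; step 2 calls `choose`."""

    problem: str
    proposed_solutions: dict[FixLevel, str] = field(default_factory=dict)
    chosen: FixLevel | None = None

    def choose(self, level: FixLevel) -> None:
        # Enforce the pause: no decision before the scoping work exists.
        if not self.problem.strip() or not self.proposed_solutions:
            raise ValueError("scope the issue and write proposed solutions first")
        if level not in self.proposed_solutions:
            raise ValueError(f"no proposed solution written for {level.name}")
        self.chosen = level


record = TriageRecord(problem="a validation test fails in the nightly build")
record.proposed_solutions[FixLevel.MINIMAL] = "xfail the test and open a follow-up issue"
record.proposed_solutions[FixLevel.CORE_FIX] = "track down and fix the bad rows"
record.choose(FixLevel.MINIMAL)
print(record.chosen.name)
```

In practice this would probably live in an issue template rather than code, but the ordering constraint is the same: scope first, then decide which version of the fix to do now.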

e-belfer · 2024-01-16T14:25:10Z

+
+**High (prioritize in the upcoming sprint planning)**
+- missing/incorrect data in Tier 1 tables
+- new Tier 1 source data available


New data availability won't show up in our nightly builds, so I'm not 100% following the connection here?

zaneselvans · 2025-10-10T00:43:42Z

@jdangerx is this still relevant with the new bug triage process that has been getting used lately? Does it need to be updated and merged? Or did the documentation end up somewhere already?

Propose bug triage process (which implies some sort of notion of reliability, which implies some sort of dataset tier structure) + write down nightly build management process.

d85b79e

jdangerx requested review from bendnorman, cmgosnell and zaneselvans January 12, 2024 22:23

e-belfer reviewed Jan 16, 2024

jdangerx moved this from Backlog to In progress in Catalyst Megaproject Jul 16, 2025

jdangerx wants to merge 1 commit into main from bug-triage-proposal
