ereport: Task faulted/panicked #2341
Currently, there is no way to programmatically access the panic message of a task which has faulted due to a Rust panic from within the Hubris userspace. This branch adds a new `read_panic_message` kipc that copies the contents of a panicked task's panic message buffer into the caller. If the requested task has not panicked, this kipc returns an error indicating this. This is intended for use by supervisor implementations or other tasks which wish to report panic messages from userspace. I've also added a test case that exercises this functionality. Fixes #2311
Co-authored-by: Matt Keeter <[email protected]>
Co-authored-by: Cliff L. Biffle <[email protected]>
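For concreteness, here is a rough sketch of how a supervisor task might use the new kipc. The userlib wrapper name and signature (`kipc::read_panic_message`), the buffer size, and the `record_panic_ereport` helper are illustrative assumptions, not necessarily the API this branch actually adds:

```rust
use userlib::kipc;

/// Illustrative only: copy out and report the panic message of the task at
/// `task_index`, if it has in fact panicked.
fn report_panic_message(task_index: usize) {
    // Fixed-size buffer for the copied-out message; the kernel would copy
    // at most this many bytes of the panicked task's panic message.
    let mut buf = [0u8; 128];
    match kipc::read_panic_message(task_index, &mut buf) {
        Ok(len) => {
            // `len` bytes of `buf` now hold the panic message; hand it to
            // whatever reporting path this supervisor uses.
            record_panic_ereport(task_index, &buf[..len]); // hypothetical helper
        }
        Err(_) => {
            // The task has not panicked (or has already been restarted), so
            // there is nothing to report.
        }
    }
}
```

The useful property here is the one the description calls out: a task that has not panicked simply yields an error, so the caller never has to guess whether the copied-out message is valid.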
Thinking about things a bit more, there are some more changes I think I want to make here before it's really ready to land. In particular:

1. Reducing the restart latency for faulted tasks (the interplay between Jefe and Packrat discussed below).
2. Recording some kind of "task faults may have occurred" ereport for the case where Packrat is notified of faults but can't find any faulted tasks.

Personally, I think we should definitely do the second point here (some kind of "task faults may have occurred" ereport) before merging this PR. I'm on the fence about whether the first point (reducing restart latency) is worth doing now or not. It's a bit more complexity in Jefe... @cbiffle, any thoughts?
    if faulted_tasks == 0 {
        // Okay, this is a bit weird. We got a notification saying tasks
        // have faulted, but by the time we scanned for faulted tasks, we
        // couldn't find any. This means one of two things:
        //
        // 1. The fault notification was spurious (in practice, this
        //    probably means someone is dorking around with Hiffy and sent
        //    a fake notification just to mess with us...)
        // 2. Tasks faulted, but we were not scheduled for at least 50ms
        //    after the faults occurred, and Jefe has already restarted
        //    them by the time we were permitted to run.
        //
        // We should probably record some kind of ereport about this.
    }
I still wanna figure out what I need to put in the ereport in this case --- what class should it be, etc. `hubris.fault.maybe_faults` or something weird like that.
It's also a bit strange because the function for recording an ereport in the ereport ringbuffer requires a task ID as part of the insert function. For all the other ereports, I've used the ID of the task that faulted for that field, rather than the ID of Packrat (who is actually generating the ereport) or Jefe (who is spiritually sort of responsible for reporting it in some vibes-based way); this felt like the right thing in general. However, when the ereport just says "some task may have faulted", I'm not totally sure what ID I want to put in here, since I don't want to incorrectly suggest that Jefe or Packrat has faulted...hmm...
I think taskID of Packrat and a distinguishing class would be fine.
So I think the "max of 50ms" interplay between Jefe and Packrat sounds promising, and doesn't seem like too much additional supervisor complexity -- particularly while crashdumps are still in Jefe. If we want to reduce complexity, I'd start there. I agree that having a "whoops" ereport if packrat finds no faulted tasks would be useful.

As one possible alternative... and I'm not sure if this is a good idea or not... Jefe could buffer the faulted taskIDs and provide packrat with a way to collect them... we could then say "this specific task fell over but the system was too loaded for me to say why exactly". TaskIDs are a lot smaller than the full Fault record.

That said, packrat is by nature typically just one priority level under Jefe, so it should be able to respond in a timely fashion in most cases. The thing most likely to starve it is ... crash dumps.
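To make that alternative concrete, here is a hypothetical sketch of what a small buffer of faulted TaskIds inside Jefe could look like; the buffer size, the structure, and the idea of draining it over an IPC are illustrative assumptions, not code from this branch or from Jefe today:

```rust
use userlib::TaskId;

/// Hypothetical: a small fixed-capacity record of tasks that faulted before
/// the observer (Packrat) had a chance to see them. A TaskId is only two
/// bytes, so this stays much cheaper than buffering full Fault records.
struct FaultedTaskBuffer {
    ids: [Option<TaskId>; 8],
    len: usize,
}

impl FaultedTaskBuffer {
    /// Jefe would call this as it restarts a faulted task. If the buffer is
    /// full, the newest entry is dropped; overwriting the oldest instead
    /// would be an equally reasonable policy.
    fn record(&mut self, id: TaskId) {
        if self.len < self.ids.len() {
            self.ids[self.len] = Some(id);
            self.len += 1;
        }
    }

    /// Served to Packrat over an IPC: return everything recorded so far and
    /// reset, so Packrat can report "this specific task fell over, but the
    /// system was too loaded for me to say exactly why".
    fn drain(&mut self) -> [Option<TaskId>; 8] {
        let out = self.ids;
        self.ids = [None; 8];
        self.len = 0;
        out
    }
}
```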
Yeah, I've also wondered about doing that; it might be a good idea. We could also do a fixed-size array of ...
Actually, upon thinking about this a bit more, there is actually a scheme where we don't need to add a new IPC to Jefe at all. Instead, we could just do something where Packrat stores an array of the last seen generation number of each task index. When Packrat is notified of faults, it can scan each task's current generation and compare it to the last one it saw to check if the task has faulted. Here's my attempt at doing that, which is both conceptually quite elegant and implementationally somewhat disgusting: eliza/fault-ereport...eliza/fault-counts#diff-48cf874f5ac8432941e2ba390792b33a94f9aea18dd933bbdb105cd23b93c9ee
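A minimal sketch of that generation-comparison idea (not the linked diff itself): Packrat keeps the last TaskId it saw for each task index and, on a fault notification, asks the kernel for each task's current TaskId, whose generation bits change whenever the task restarts. `NUM_TASKS` and `record_fault_ereport` are assumed names here.

```rust
use userlib::{sys_refresh_task_id, TaskId};

struct GenerationTracker {
    /// The TaskId (index plus generation) last observed for each task;
    /// NUM_TASKS is assumed to come from something like `hubris_num_tasks`.
    last_seen: [TaskId; NUM_TASKS],
}

impl GenerationTracker {
    /// Called when the "tasks have faulted" notification fires.
    fn scan(&mut self) {
        for slot in self.last_seen.iter_mut() {
            // The generation bits of a TaskId change each time the task is
            // restarted, so a mismatch means the task has restarted since we
            // last looked --- in practice, almost always because it faulted.
            let current = sys_refresh_task_id(*slot);
            if current != *slot {
                record_fault_ereport(current); // hypothetical helper
                *slot = current;
            }
        }
    }
}
```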
I almost suggested that, actually. My concern is mostly theoretical -- that it can't guarantee that it's a fault that restarted the task. Yeah, currently, tasks mostly restart due to faults, but that's not necessarily inherent. But for now it's basically equivalent I think?
After a bit more thinking, I'm thinking about going back to an approach where we ask Jefe to send us a list of fault counters explicitly, rather than looking at generations. This is mostly for the reason @cbiffle points out: a task can also explicitly ask to be restarted without faulting (though I'm not sure if anything in our production images actually uses this capability). It has a couple other advantages, though: it's a bit quicker for Packrat to do (one IPC to Jefe rather than ...).

On the other hand, this would mean that we can no longer uphold the property that "Packrat never makes IPC requests to other tasks", which is documented in a few places. I think an infallible IPC to the supervisor is probably safe, but I'm not sure if we're comfortable violating that property for any reason...
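A sketch of what that fault-counter approach might look like, again under assumed names: the single `fault_counters` IPC on a `Jefe` client, `NUM_TASKS`, and `record_fault_ereport` are all illustrative, not APIs that exist today.

```rust
struct FaultCounterTracker {
    /// Packrat's cached copy of Jefe's per-task fault counters.
    last_counts: [u32; NUM_TASKS],
}

impl FaultCounterTracker {
    /// On a fault notification, fetch Jefe's current counters in a single
    /// IPC and report an ereport for every task whose counter advanced.
    fn scan(&mut self, jefe: &Jefe) {
        let current: [u32; NUM_TASKS] = jefe.fault_counters(); // hypothetical IPC
        for (index, (&new, old)) in
            current.iter().zip(self.last_counts.iter_mut()).enumerate()
        {
            if new != *old {
                // Unlike generation comparison, a fault counter only moves
                // when the task actually faulted, so a voluntary restart
                // never produces a spurious ereport.
                record_fault_ereport(index, new); // hypothetical helper
                *old = new;
            }
        }
    }
}
```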
Depends on #2313
Fixes #2309
It's currently somewhat difficult to become aware of Hubris task panics and other task faults in a production environment. While MGS can ask the SP to list task dumps as part of the API for reading dumps, this requires that the control plane (or faux-mgs user) proactively ask the SP whether it has any record of panicked tasks, rather than recording panics as they occur. Therefore, we should have a proactive notification from the SP indicating that task faults have occurred.
This commit adds code to `packrat` for producing an ereport when a task has faulted. This could eventually be used by the control plane to trigger dump collection and produce a service bundle. In addition, it will provide a more permanent record that a task faulted at a particular time, even if the SP that contains the faulted task is later reset or replaced with an entirely different SP. This works using an approach similar to the one described by @cbiffle in this comment.