ereport: hubris task panicked/faulted #2309

@hawkw

In production environments, it is currently somewhat difficult to become aware of Hubris task panics and faults. While MGS can ask the SP to list task dumps as part of the API for reading dumps, this requires that the control plane (or faux-mgs user) proactively ask the SP whether it has any record of panicked tasks, rather than recording panics as they occur. In particular, if the SP resets, the whole system is removed or loses power, or the SP is no longer reachable over the management network, no record exists that a panic occurred. We should probably generate an ereport when a task panics to increase the likelihood that there is some record of panicking tasks.

This is somewhat more complex to implement than other ereports, primarily because panicked tasks must be reported by the supervisor (jefe), which runs at a higher priority than packrat, which is responsible for recording ereports. Thus, we must have a scheme where jefe exposes an API to collect ereports, and sends a notification to packrat when it would like its ereports collected. See #2127 for details on that.

This scheme implies that ereports from the supervisor must be collected asynchronously, which is different from ereports from tasks running at lower priorities than packrat. Generally, a task which wishes to record an ereport performs a synchronous IPC to packrat, which either succeeds or fails, and then the reporting task goes along with its day. For supervisor ereports, though, jefe must send packrat a notification saying "hey, I have some ereports", and then, when packrat is scheduled, it eventually comes around and collects them. During the time between when jefe notifies packrat and when packrat eventually manages to come snarf up the ereports, additional tasks may have panicked. This means that jefe must have some capacity to buffer multiple ereports or otherwise be capable of generating more than one ereport if multiple tasks have panicked before packrat comes asking.
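To make the buffering requirement concrete, here is a rough sketch of one way the supervisor side could hold fault records until the collector asks for them. This is not jefe's or packrat's actual code or API: `FaultRecord`, `PendingFaults`, and every method name below are hypothetical, and the design (a fixed-capacity FIFO plus a counter of reports dropped while the queue was full) is only one possible reading of "some capacity to buffer multiple ereports".

```rust
/// Hypothetical record of a single task fault. The fields are illustrative
/// only; the issue does not specify what a real ereport would carry.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FaultRecord {
    pub task_index: u16,
    pub generation: u8,
}

/// Fixed-capacity FIFO of fault records waiting to be collected.
/// No heap allocation is used, so the same shape would fit a `no_std` task.
pub struct PendingFaults<const N: usize> {
    slots: [Option<FaultRecord>; N],
    head: usize, // index of the oldest buffered record
    len: usize,  // number of buffered records
    lost: u32,   // records dropped because the buffer was full
}

impl<const N: usize> PendingFaults<N> {
    pub const fn new() -> Self {
        Self { slots: [None; N], head: 0, len: 0, lost: 0 }
    }

    /// Supervisor side: buffer a record when a task faults.
    /// Returns `true` if the record was stored, or `false` if the buffer was
    /// full and the record was only counted as lost. In either case the
    /// supervisor would then post its notification to the collector.
    pub fn record(&mut self, rec: FaultRecord) -> bool {
        if self.len == N {
            self.lost = self.lost.saturating_add(1);
            return false;
        }
        let tail = (self.head + self.len) % N;
        self.slots[tail] = Some(rec);
        self.len += 1;
        true
    }

    /// Collector side: take the oldest buffered record, if any. The collector
    /// would call this in a loop after receiving the notification.
    pub fn take_next(&mut self) -> Option<FaultRecord> {
        if self.len == 0 {
            return None;
        }
        let rec = self.slots[self.head].take();
        self.head = (self.head + 1) % N;
        self.len -= 1;
        rec
    }

    /// Collector side: read and reset the count of dropped records, so that
    /// the loss itself can be reported.
    pub fn take_lost(&mut self) -> u32 {
        std::mem::replace(&mut self.lost, 0)
    }
}
```

In this sketch the collector drains with `take_next` until it returns `None`, then checks `take_lost`. Keeping a lost counter rather than overwriting old entries preserves the earliest faults, which are often the root cause; the opposite trade-off would be equally defensible.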

Some things we would probably want these ereports to include are:

Labels: ⚠️ ereport
