enhancement(runtime): improve logging in supervised crashes #1415
base: main
Conversation
Force-pushed ee6714d to 765887e
I am a bit unsure about the Fatal log level. Isn't one of the main goals of the Suture module to restart services when they crash? If we log at the Fatal level, that point is basically moot, isn't it?
For the panics I think using PassThroughPanics might be the setting we want to enable to terminate when one of the services panics.
@butonic maybe you could chime in here.
```go
case suture.EventServicePanic:
	s.Log.Fatal().Str("event", e.String()).Str("service", ev.ServiceName).Str("supervisor", ev.SupervisorName).
		Bool("restarting", ev.Restarting).Float64("failures", ev.CurrentFailures).Float64("threshold", ev.FailureThreshold).
		Msgf("service panic: %v", ev.PanicMsg)
```
Hm, not sure about this. Wouldn't it be more straightforward to just set PassThroughPanics on the suture.Spec{}? If we really want that behavior.
Panics are quite rare and AFAIR we were always the ones responsible for making a panic call, or reva. So PassThroughPanics is the right thing to do IMO. I really like the improvement in log level handling.
That being said, I would move the message to a dedicated property:
```diff
-Msgf("service panic: %v", ev.PanicMsg)
+Str("message", ev.PanicMsg).Msg("service panic")
```
Pushed a change to only use PassThroughPanics.
Probably difficult to wager, since it depends on how liberal our dependencies, but also our own codebase, are with the use of panics. I'm not sure all the vendored packages we have only use panic for truly unrecoverable conditions. Arguably, and in favour of @rhafer's proposal to let panics pass through: can we at least be confident that all (or the vast majority) of calls to panic really warrant terminating the process?
might need a rebase
@butonic @dragonchaser @rhafer Let us merge this PR.
@pbleser-oc Would you mind rebasing this one?
Done, rebased on 500487f
```go
		cancel()
	}
}
s.Log.Info().Str("event", e.String()).Msg(fmt.Sprintf("supervisor: %v", e.Map()["supervisor_name"]))
```
CI says that this was the only occurrence of fmt. It needs to be removed from the imports to compile again.
When events occur in the supervisor, log them as FATAL when appropriate (panic, and termination when not restarting) instead of always using INFO, and include more information such as the service name and the current failure count, rather than the generic message that was used so far.
Fixed and added @butonic's suggestion at #1415 (comment)
Description
When events occur in the supervisor, log them as `FATAL` when appropriate (panic and termination) instead of always using `INFO`, and include more information such as the service name, the current failure count, etc., rather than the generic message that was used so far.

Related Issue
Motivation and Context
Whenever a panic crash of the single binary occurs, the logging message that contains the reason and stacktrace is potentially hidden due to its `INFO` log level, which is typically filtered out in production environments, or in development environments when the other services are too noisy.

With this change, it is correctly logged as `FATAL`, which should always be caught in production environments.
FATAL, which should always be caught in productive environments.How Has This Been Tested?
Has only been tested locally by triggering a panic in another service on purpose.
Types of changes
It could potentially be seen as a breaking change if system administrators have already set up alerting filters based on the previous log event payload, which does change now (different log level, different attributes, different message).
Checklist:
- Unit tests added
- Acceptance tests added
- Documentation added

If we do consider the log messages "public API", then we might want to include documentation about this change in the release notes.