Skip to content

Conversation

@pbleser-oc
Copy link
Contributor

@pbleser-oc pbleser-oc commented Aug 28, 2025

Description

When events occur in the supervisor, log them as FATAL when appropriate (panic and termination) instead of always using INFO, and include more information such as the service name and the current failures, etc..., rather than the generic message that was used so far.

Related Issue

Motivation and Context

Whenever a panic crash of the single binary occurs, the logging message that contains the reason and stacktrace is potentially hidden due to its INFO loglevel, which is typically filtered out in productive environments, or in development environments when the other services are too noisy.

With this change, it is correctly logged as FATAL, which should always be caught in productive environments.

How Has This Been Tested?

Has only been tested locally by triggering a panic in another service on purpose.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Technical debt
  • Tests only (no source changes)

It could potentially be seen as a breaking change if system administrators have already set up alerting filters based on the previous log event payload, which does change now (different log level, different attributes, different message.)

Checklist:

  • Code changes
  • Unit tests added
  • Acceptance tests added
  • Documentation added
  • Unit tests are difficult to implement and have been skipped for this change.
  • The behaviour of the supervised binary in case of panics/crashes is currently not part of the acceptance tests
  • There is no documentation to change because the backend error logs are currently not documented in the first place.

If we do consider the log messages "public API", then we might want to include documentation about this change in the release notes.

Copy link
Contributor

@rhafer rhafer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit unsure about the Fatal log level. Isn't one of the main goals of the of the Suture module to restart service when they crash? If we log at the Fatal level that point is basically moot, isn't it?

For the panics I think using PassThroughPanics might be the setting we want to enable to terminate when one of the services panics.

@butonic maybe you could chime in here.

case suture.EventServicePanic:
s.Log.Fatal().Str("event", e.String()).Str("service", ev.ServiceName).Str("supervisor", ev.SupervisorName).
Bool("restarting", ev.Restarting).Float64("failures", ev.CurrentFailures).Float64("threshold", ev.FailureThreshold).
Msgf("service panic: %v", ev.PanicMsg)
Copy link
Contributor

@rhafer rhafer Sep 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, not sure about this. Wouldn't it be more straightforward to just set PassThroughPanics on the suture.Spec{} ? If we really want that behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panics are quite rare and AFAIR we were always the ones resposible for making a panic call, or reva. So PassThroughPanics is the rigth thing to do IMO. I really like the improvement in log level handling.

That being said, I would move the message to a dedicated property:

Suggested change
Msgf("service panic: %v", ev.PanicMsg)
Str("message", ev.PanicMsg).Msg("service panic")

@pbleser-oc
Copy link
Contributor Author

I am a bit unsure about the Fatal log level. Isn't one of the main goals of the of the Suture module to restart service when they crash? If we log at the Fatal level that point is basically moot, isn't it?

Pushed a change to only use FATAL when the suture service decides not to restart, and using ERROR instead when it is going to restart the service.

@pbleser-oc
Copy link
Contributor Author

For the panics I think using PassThroughPanics might be the setting we want to enable to terminate when one of the services panics.

Probably difficult to wager since it depends on how liberal our dependencies but also our own codebase is with the use of panics. I'm not sure all the vendored packages we have only make use of panic() when it's a fatal error that warrants a crash, something that should typically only happen when e.g. you fail to allocate memory (although I suppose that is opinionated as well), not when a configuration setting is wrong.

Arguably, and in favour of @rhafer 's proposal to let panics pass through: can we at least be confident that all (or the vast majority) of calls to panic() everywhere are for issues that cannot be resolved by retrying (e.g. (static) configuration errors)? Because, then, there is really no point in restarting the service in the first place, if it's just going to fail again the next time it runs into the same code path...

@pbleser-oc pbleser-oc removed the request for review from dragonchaser September 4, 2025 09:53
@butonic
Copy link
Contributor

butonic commented Sep 22, 2025

might need a rebase

@micbar
Copy link
Contributor

micbar commented Nov 4, 2025

@butonic @dragonchaser @rhafer Let us merge this PR

@rhafer
Copy link
Contributor

rhafer commented Nov 4, 2025

@pbleser-oc Would you mind rebasing this one?

@pbleser-oc
Copy link
Contributor Author

Done, rebased on 500487f

cancel()
}
}
s.Log.Info().Str("event", e.String()).Msg(fmt.Sprintf("supervisor: %v", e.Map()["supervisor_name"]))
Copy link
Contributor

@micbar micbar Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ci says that this was the only occurence of fmt. It needs to be removed from the imports to compile again.

When events occur in the supervisor, log them as FATAL when appropriate
(panic and termination when not restarting) instead of always using
INFO, and include more information such as the service name and the
current failures, etc..., rather than the generic message that was used
so far.
@pbleser-oc
Copy link
Contributor Author

Fixed and added @butonic 's suggestion at #1415 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve logging in supervised crashes

4 participants