enhancement(runtime): improve logging in supervised crashes #1415

pbleser-oc · 2025-08-28T14:58:52Z

Description

When events occur in the supervisor, log them as FATAL when appropriate (panic and termination) instead of always using INFO, and include more information such as the service name and the current failures, etc..., rather than the generic message that was used so far.

Related Issue

Fixes Improve logging in supervised crashes #1414

Motivation and Context

Whenever a panic crash of the single binary occurs, the logging message that contains the reason and stacktrace is potentially hidden due to its INFO loglevel, which is typically filtered out in productive environments, or in development environments when the other services are too noisy.

With this change, it is correctly logged as FATAL, which should always be caught in productive environments.

How Has This Been Tested?

Has only been tested locally by triggering a panic in another service on purpose.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Technical debt
Tests only (no source changes)

It could potentially be seen as a breaking change if system administrators have already set up alerting filters based on the previous log event payload, which does change now (different log level, different attributes, different message.)

Checklist:

Code changes
~~Unit tests added~~
~~Acceptance tests added~~
~~Documentation added~~

Unit tests are difficult to implement and have been skipped for this change.
The behaviour of the supervised binary in case of panics/crashes is currently not part of the acceptance tests
There is no documentation to change because the backend error logs are currently not documented in the first place.

If we do consider the log messages "public API", then we might want to include documentation about this change in the release notes.

rhafer

I am a bit unsure about the Fatal log level. Isn't one of the main goals of the of the Suture module to restart service when they crash? If we log at the Fatal level that point is basically moot, isn't it?

For the panics I think using PassThroughPanics might be the setting we want to enable to terminate when one of the services panics.

@butonic maybe you could chime in here.

rhafer · 2025-09-01T12:05:38Z

opencloud/pkg/runtime/service/service.go

+			case suture.EventServicePanic:
+				s.Log.Fatal().Str("event", e.String()).Str("service", ev.ServiceName).Str("supervisor", ev.SupervisorName).
+					Bool("restarting", ev.Restarting).Float64("failures", ev.CurrentFailures).Float64("threshold", ev.FailureThreshold).
+					Msgf("service panic: %v", ev.PanicMsg)


Hm, not sure about this. Wouldn't it be more straightforward to just set PassThroughPanics on the suture.Spec{} ? If we really want that behavior.

Panics are quite rare and AFAIR we were always the ones resposible for making a panic call, or reva. So PassThroughPanics is the rigth thing to do IMO. I really like the improvement in log level handling.

That being said, I would move the message to a dedicated property:

Suggested change

Msgf("service panic: %v", ev.PanicMsg)

Str("message", ev.PanicMsg).Msg("service panic")

pbleser-oc · 2025-09-04T09:48:03Z

I am a bit unsure about the Fatal log level. Isn't one of the main goals of the of the Suture module to restart service when they crash? If we log at the Fatal level that point is basically moot, isn't it?

Pushed a change to only use FATAL when the suture service decides not to restart, and using ERROR instead when it is going to restart the service.

pbleser-oc · 2025-09-04T09:53:06Z

For the panics I think using PassThroughPanics might be the setting we want to enable to terminate when one of the services panics.

Probably difficult to wager since it depends on how liberal our dependencies but also our own codebase is with the use of panics. I'm not sure all the vendored packages we have only make use of panic() when it's a fatal error that warrants a crash, something that should typically only happen when e.g. you fail to allocate memory (although I suppose that is opinionated as well), not when a configuration setting is wrong.

Arguably, and in favour of @rhafer 's proposal to let panics pass through: can we at least be confident that all (or the vast majority) of calls to panic() everywhere are for issues that cannot be resolved by retrying (e.g. (static) configuration errors)? Because, then, there is really no point in restarting the service in the first place, if it's just going to fail again the next time it runs into the same code path...

butonic · 2025-09-22T11:41:47Z

might need a rebase

micbar · 2025-11-04T15:15:02Z

@butonic @dragonchaser @rhafer Let us merge this PR

rhafer · 2025-11-04T16:35:19Z

@pbleser-oc Would you mind rebasing this one?

pbleser-oc · 2025-11-04T20:56:14Z

Done, rebased on 500487f

micbar · 2025-11-04T22:16:58Z

opencloud/pkg/runtime/service/service.go

 					cancel()
 				}
 			}
-			s.Log.Info().Str("event", e.String()).Msg(fmt.Sprintf("supervisor: %v", e.Map()["supervisor_name"]))


Ci says that this was the only occurence of fmt. It needs to be removed from the imports to compile again.

When events occur in the supervisor, log them as FATAL when appropriate (panic and termination when not restarting) instead of always using INFO, and include more information such as the service name and the current failures, etc..., rather than the generic message that was used so far.

pbleser-oc · 2025-11-05T08:20:19Z

Fixed and added @butonic 's suggestion at #1415 (comment)

pbleser-oc requested a review from dragonchaser August 28, 2025 14:58

pbleser-oc self-assigned this Aug 28, 2025

pbleser-oc added Priority:p4-low Low priority Status:Needs-Review labels Aug 28, 2025

pbleser-oc force-pushed the main branch 2 times, most recently from ee6714d to 765887e Compare August 29, 2025 10:14

rhafer reviewed Sep 1, 2025

View reviewed changes

pbleser-oc force-pushed the main branch from 765887e to 51a369a Compare September 4, 2025 09:46

pbleser-oc removed the request for review from dragonchaser September 4, 2025 09:53

pbleser-oc force-pushed the main branch from 51a369a to 3b233cf Compare November 4, 2025 20:55

micbar reviewed Nov 4, 2025

View reviewed changes

pbleser-oc force-pushed the main branch from 3b233cf to 5b7cbb4 Compare November 5, 2025 08:19

micbar approved these changes Nov 7, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

enhancement(runtime): improve logging in supervised crashes #1415

enhancement(runtime): improve logging in supervised crashes #1415

Uh oh!

pbleser-oc commented Aug 28, 2025 •

edited

Loading

Uh oh!

rhafer left a comment

Uh oh!

rhafer Sep 1, 2025 •

edited

Loading

Uh oh!

butonic Sep 22, 2025

Uh oh!

pbleser-oc commented Sep 4, 2025

Uh oh!

pbleser-oc commented Sep 4, 2025

Uh oh!

butonic commented Sep 22, 2025

Uh oh!

micbar commented Nov 4, 2025

Uh oh!

rhafer commented Nov 4, 2025

Uh oh!

pbleser-oc commented Nov 4, 2025

Uh oh!

micbar Nov 4, 2025 •

edited

Loading

Uh oh!

pbleser-oc commented Nov 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

	Msgf("service panic: %v", ev.PanicMsg)
	Str("message", ev.PanicMsg).Msg("service panic")

enhancement(runtime): improve logging in supervised crashes #1415

Are you sure you want to change the base?

enhancement(runtime): improve logging in supervised crashes #1415

Uh oh!

Conversation

pbleser-oc commented Aug 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issue

Motivation and Context

How Has This Been Tested?

Types of changes

Checklist:

Uh oh!

rhafer left a comment

Choose a reason for hiding this comment

Uh oh!

rhafer Sep 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

butonic Sep 22, 2025

Choose a reason for hiding this comment

Uh oh!

pbleser-oc commented Sep 4, 2025

Uh oh!

pbleser-oc commented Sep 4, 2025

Uh oh!

butonic commented Sep 22, 2025

Uh oh!

micbar commented Nov 4, 2025

Uh oh!

rhafer commented Nov 4, 2025

Uh oh!

pbleser-oc commented Nov 4, 2025

Uh oh!

micbar Nov 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pbleser-oc commented Nov 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

pbleser-oc commented Aug 28, 2025 •

edited

Loading

rhafer Sep 1, 2025 •

edited

Loading

micbar Nov 4, 2025 •

edited

Loading