Skip to content

Attempt to gracefully shutdown services first, then hard kill#48

Merged
dzbarsky merged 8 commits into
hermeticbuild:masterfrom
darkrift:allow_specifying_signal_for_shutdown
Oct 29, 2025
Merged

Attempt to gracefully shutdown services first, then hard kill#48
dzbarsky merged 8 commits into
hermeticbuild:masterfrom
darkrift:allow_specifying_signal_for_shutdown

Conversation

@darkrift

@darkrift darkrift commented Oct 5, 2025

Copy link
Copy Markdown
Contributor

This PR changes the services shutdown behavior to a SIGTERM and allows for them to gracefully stop for 1 second before sending a SIGKILL (default value can be revisited).

This allows for a service to properly clean it's own resources and allow for new type of tests to be done (making sure all your services dependencies can graceful shutdown).

It is also mandatory to be able to SIGTERM to emit coverage on services, notably for bazel-contrib/rules_go#4461

@darkrift darkrift force-pushed the allow_specifying_signal_for_shutdown branch 6 times, most recently from 0039ac3 to 598aba8 Compare October 6, 2025 12:06

@dzbarsky dzbarsky left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for sending this! left a bunch of feedback

Comment thread tests/shutdown/BUILD.bazel Outdated
Comment thread tests/shutdown/dummy.sh Outdated
Comment thread svcctl/svcctl.go Outdated
Comment thread runner/service_instance.go Outdated
}

func (s *ServiceInstance) Stop(sig syscall.Signal) error {
func (s *ServiceInstance) Stop(sig *syscall.Signal) error {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't Signal an int newtype? Why do we need a pointer here?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, it's to check for presence/absence. Can we push the default behavior into the caller though? Doing it here is a bit weird, it's forcing the Ptr change in svcctl even though that spot really wants to send a specific signal and not deal with the default handling

@darkrift darkrift Oct 19, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It checks for presence or absence but the only place we allow to "override" this flag is in the svcctl http handler : https://github.com/dzbarsky/rules_itest/blob/b44b7a11c88d58c238457461f8e91d62ebfaa3c3/svcctl/svcctl.go#L87-L100

which enforces a default if the signal is not defined. Propagating the nil here allows for the service to define what signal it should use rather than enforcing SIGKILL from an external point of view.

The other places were forcing SIGKILL on shutdown, no matter what.

@darkrift darkrift Oct 19, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the same test, it sometimes make sense for some services to actually be graceful while others not. Such case includes:

  • Resources reservation/cleanup (for instance, docker start/shutdown be it remote or local, k8s tests to delete temporary namespace cleanly at end of test, etc)
  • Coverage

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I just meant there are 2 callers of Stop, one passes nil and relies on the lookup, the other one passes an explicit signal. I was thinking it might make more sense to remove the nil case, lift the default-lookup into the caller that was passing nil, and always pass an explicit signal.

i.e. it moves the concept of "default stop behavior" closer to the runner code that is managing the topology.

not a huge deal though, maybe there's something I'm missing that makes it more annoying than i'm thinking. You've definitely thought about this more deeply and more recently than I have :)

Comment thread runner/service_instance.go Outdated
Comment thread runner/service_instance.go Outdated
Comment thread private/itest.bzl Outdated
Comment thread runner/service_instance.go Outdated
@darkrift darkrift force-pushed the allow_specifying_signal_for_shutdown branch 3 times, most recently from 4a960a1 to 87e8e4c Compare October 19, 2025 17:48
Comment thread BUILD.bazel Outdated
)

bool_flag(
name = "error_on_forceful_shutdown",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: enforce_graceful_shutdown is a bit easier for my brain to parse, if you're inclined

@dzbarsky dzbarsky left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codex please review this

Comment thread runner/runner.go Outdated
func (r *Runner) StopAll() (map[string]*os.ProcessState, error) {
tasks := allTasks(r.serviceInstances, func(ctx context.Context, service *ServiceInstance) error {
if service.Type == "group" {
if service.Type == "group" || service.Type == "task" {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this orphan a long-running task when StopAll is called? I'm guessing you did this because tasks don't have the default-signal logic which got tripped up because you're passing nil?

Comment thread runner/runner.go Outdated
for _, label := range updateActions.toStopLabels {
serviceInstance := r.serviceInstances[label]
if serviceInstance.Type == "group" {
if serviceInstance.Type == "group" || serviceInstance.Type == "task" {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, if we have a task that needs to be restarted it will now end up running twice?

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread runner/service_instance.go Outdated
Comment on lines +214 to +215
if s.cmd.ProcessState.ExitCode() == 0 {
return nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Badge Guard Stop against nil ProcessState

The new Stop implementation calls s.cmd.ProcessState.ExitCode() before verifying that the process has actually finished. While the service is still running cmd.ProcessState is nil, so invoking ExitCode() dereferences a nil pointer and will panic the runner whenever a live service is stopped (via StopAll or the /v0/kill API). This prevents any graceful shutdown. The code should check that ProcessState is non‑nil before reading its exit code or avoid accessing it until after the process has exited.

Useful? React with 👍 / 👎.

Comment thread tests/shutdown/BUILD.bazel Outdated
string_flag(
name = "2seconds",
build_setting_default = "2s",
visibility = ["//visibility:public"],

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: these probably don't need to be public?

Comment thread private/itest.bzl Outdated
doc = "The signal to send to the service when it needs to be shut down. Valid values are: SIGTERM and SIGKILL. SIGTERM is necessary to have proper coverage of services which needs to be gracefully terminated",
values = ["SIGTERM", "SIGKILL"],
),
"shutdown_timeout": attr.label(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of doing it like so:

"shutdown_timeout": attr.string(),
"_default_shutdown_timeout": attr.label(default = "//:shutdown_timeout")

...
shutdown_timeout = ctx.attr.shutdown_timeout or ctx.attr.shutdown_timeout[BuildSettingInfo].value

that way the user doesn't need to create those 10seconds targets, wdyt?

(same thing for the flag below this one)

@darkrift darkrift force-pushed the allow_specifying_signal_for_shutdown branch from 87e8e4c to 7257143 Compare October 25, 2025 20:21
Comment thread runner/service_instance.go
@darkrift darkrift force-pushed the allow_specifying_signal_for_shutdown branch 3 times, most recently from 27a4dc6 to c6476f4 Compare October 28, 2025 23:38
@dzbarsky

Copy link
Copy Markdown
Member

@codex please review

@darkrift darkrift force-pushed the allow_specifying_signal_for_shutdown branch from 6e83838 to d23385f Compare October 28, 2025 23:48
@dzbarsky dzbarsky force-pushed the allow_specifying_signal_for_shutdown branch from d23385f to 6a53984 Compare October 28, 2025 23:52
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Can't wait for the next one!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@dzbarsky dzbarsky merged commit 510c68c into hermeticbuild:master Oct 29, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants