Attempt to gracefully shutdown services first, then hard kill#48
Conversation
0039ac3 to
598aba8
Compare
dzbarsky
left a comment
There was a problem hiding this comment.
thank you for sending this! left a bunch of feedback
| } | ||
|
|
||
| func (s *ServiceInstance) Stop(sig syscall.Signal) error { | ||
| func (s *ServiceInstance) Stop(sig *syscall.Signal) error { |
There was a problem hiding this comment.
isn't Signal an int newtype? Why do we need a pointer here?
There was a problem hiding this comment.
oh, it's to check for presence/absence. Can we push the default behavior into the caller though? Doing it here is a bit weird, it's forcing the Ptr change in svcctl even though that spot really wants to send a specific signal and not deal with the default handling
There was a problem hiding this comment.
It checks for presence or absence but the only place we allow to "override" this flag is in the svcctl http handler : https://github.com/dzbarsky/rules_itest/blob/b44b7a11c88d58c238457461f8e91d62ebfaa3c3/svcctl/svcctl.go#L87-L100
which enforces a default if the signal is not defined. Propagating the nil here allows for the service to define what signal it should use rather than enforcing SIGKILL from an external point of view.
The other places were forcing SIGKILL on shutdown, no matter what.
There was a problem hiding this comment.
In the same test, it sometimes make sense for some services to actually be graceful while others not. Such case includes:
- Resources reservation/cleanup (for instance, docker start/shutdown be it remote or local, k8s tests to delete temporary namespace cleanly at end of test, etc)
- Coverage
There was a problem hiding this comment.
yeah I just meant there are 2 callers of Stop, one passes nil and relies on the lookup, the other one passes an explicit signal. I was thinking it might make more sense to remove the nil case, lift the default-lookup into the caller that was passing nil, and always pass an explicit signal.
i.e. it moves the concept of "default stop behavior" closer to the runner code that is managing the topology.
not a huge deal though, maybe there's something I'm missing that makes it more annoying than i'm thinking. You've definitely thought about this more deeply and more recently than I have :)
4a960a1 to
87e8e4c
Compare
| ) | ||
|
|
||
| bool_flag( | ||
| name = "error_on_forceful_shutdown", |
There was a problem hiding this comment.
nit: enforce_graceful_shutdown is a bit easier for my brain to parse, if you're inclined
| func (r *Runner) StopAll() (map[string]*os.ProcessState, error) { | ||
| tasks := allTasks(r.serviceInstances, func(ctx context.Context, service *ServiceInstance) error { | ||
| if service.Type == "group" { | ||
| if service.Type == "group" || service.Type == "task" { |
There was a problem hiding this comment.
Wouldn't this orphan a long-running task when StopAll is called? I'm guessing you did this because tasks don't have the default-signal logic which got tripped up because you're passing nil?
| for _, label := range updateActions.toStopLabels { | ||
| serviceInstance := r.serviceInstances[label] | ||
| if serviceInstance.Type == "group" { | ||
| if serviceInstance.Type == "group" || serviceInstance.Type == "task" { |
There was a problem hiding this comment.
same here, if we have a task that needs to be restarted it will now end up running twice?
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if s.cmd.ProcessState.ExitCode() == 0 { | ||
| return nil |
There was a problem hiding this comment.
Guard Stop against nil ProcessState
The new Stop implementation calls s.cmd.ProcessState.ExitCode() before verifying that the process has actually finished. While the service is still running cmd.ProcessState is nil, so invoking ExitCode() dereferences a nil pointer and will panic the runner whenever a live service is stopped (via StopAll or the /v0/kill API). This prevents any graceful shutdown. The code should check that ProcessState is non‑nil before reading its exit code or avoid accessing it until after the process has exited.
Useful? React with 👍 / 👎.
| string_flag( | ||
| name = "2seconds", | ||
| build_setting_default = "2s", | ||
| visibility = ["//visibility:public"], |
There was a problem hiding this comment.
nit: these probably don't need to be public?
| doc = "The signal to send to the service when it needs to be shut down. Valid values are: SIGTERM and SIGKILL. SIGTERM is necessary to have proper coverage of services which needs to be gracefully terminated", | ||
| values = ["SIGTERM", "SIGKILL"], | ||
| ), | ||
| "shutdown_timeout": attr.label( |
There was a problem hiding this comment.
I was thinking of doing it like so:
"shutdown_timeout": attr.string(),
"_default_shutdown_timeout": attr.label(default = "//:shutdown_timeout")
...
shutdown_timeout = ctx.attr.shutdown_timeout or ctx.attr.shutdown_timeout[BuildSettingInfo].value
that way the user doesn't need to create those 10seconds targets, wdyt?
(same thing for the flag below this one)
87e8e4c to
7257143
Compare
…se the timeout to be graceful
27a4dc6 to
c6476f4
Compare
|
@codex please review |
6e83838 to
d23385f
Compare
d23385f to
6a53984
Compare
|
Codex Review: Didn't find any major issues. Can't wait for the next one! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
This PR changes the services shutdown behavior to a SIGTERM and allows for them to gracefully stop for 1 second before sending a SIGKILL (default value can be revisited).
This allows for a service to properly clean it's own resources and allow for new type of tests to be done (making sure all your services dependencies can graceful shutdown).
It is also mandatory to be able to SIGTERM to emit coverage on services, notably for bazel-contrib/rules_go#4461