ING-1399: Fix flakiness in graceful shutdown tests#355

Merged
Westwooo merged 3 commits into master from ING-1399-occasional-EOFs on Apr 1, 2026

@Westwooo Westwooo commented Mar 10, 2026

Contributor

This PR changes how we test graceful shutdown. We now ensure that:

  1. Requests received by the HTTP and gRPC servers before shutdown begins are allowed to complete
  2. Requests received before shutdown that take longer than the shutdown timeout are forcibly cancelled

To support configurable server-side delays, we use the barrier hooks for gRPC. This meant exposing the hooks manager through the startup callback: because the server is shutting down, we can't use an RPC to unblock the hook, so it must be done in-process. No such middleware existed for Data API, so it has been added and put behind the Debug flag.

The original plan was to use slow-running queries. However, the time a query takes to execute is non-deterministic, so that approach would just introduce a different type of flakiness.

@Westwooo Westwooo force-pushed the ING-1399-occasional-EOFs branch 8 times, most recently from 634e895 to 1b0734b Compare March 12, 2026 08:56
@Westwooo Westwooo changed the title ING-1399: Check that test is not flaky in GHA ING-1399: Fix flakiness in graceful shutdown tests Mar 12, 2026
@Westwooo Westwooo requested a review from Copilot March 12, 2026 10:08

This comment was marked as outdated.

@Westwooo Westwooo force-pushed the ING-1399-occasional-EOFs branch 7 times, most recently from 3a81055 to 85426a2 Compare March 13, 2026 12:47
Comment thread gateway/system/system.go Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

gateway/gateway.go:444

  • ctx, cancel := context.WithCancel(ctx) is created before system.NewSystem(...). If NewSystem returns an error, cancel is never called, which can leak any goroutines tied to that derived context. Consider calling cancel() on the error path (or deferring it until after NewSystem succeeds).
		ctx, cancel := context.WithCancel(ctx)

		config.Logger.Info("initializing protostellar system")
		gatewaySys, err := system.NewSystem(&system.SystemOptions{
			Logger:      config.Logger.Named("gateway-system"),
			DataImpl:    dataImpl,
			DapiImpl:    dapiImpl,
			Metrics:     metrics.GetSnMetrics(),
			RateLimiter: rateLimiter,
			GrpcTlsConfig: &tls.Config{
				ClientCAs:  config.ClientCaCert,
				ClientAuth: tls.VerifyClientCertIfGiven,
				GetCertificate: func(chi *tls.ClientHelloInfo) (*tls.Certificate, error) {
					return g.atomicGrpcCert.Load(), nil
				},
			},
			DapiTlsConfig: &tls.Config{
				GetCertificate: func(chi *tls.ClientHelloInfo) (*tls.Certificate, error) {
					return g.atomicDapiCert.Load(), nil
				},
				ClientCAs:  config.ClientCaCert,
				ClientAuth: tls.VerifyClientCertIfGiven,
			},
			ShutdownTimeout: config.ShutdownTimeout,
			AlphaEndpoints:  config.AlphaEndpoints,
			Debug:           config.Debug,
			Cancel:          cancel,
		})
		if err != nil {
			config.Logger.Error("error creating legacy proxy")
			return err
		}

gateway/test/dapi_graceful_shutdown_test.go:129

  • The response body is never closed in the request loop. Not closing resp.Body can leak connections/resources and can also affect keep-alive behavior, which is relevant for a shutdown test. Consider reading/discarding the body (if needed) and calling resp.Body.Close() each iteration.

				return
			}

			respCloseChan <- resp.Close
			time.Sleep(time.Millisecond * 10)

Comment thread gateway/system/system.go Outdated
Comment thread gateway/test/dapi_graceful_shutdown_test.go Outdated
Comment thread gateway/test/dapi_graceful_shutdown_test.go Outdated
Comment thread gateway/test/dapi_graceful_shutdown_test.go Outdated
@Westwooo Westwooo force-pushed the ING-1399-occasional-EOFs branch 6 times, most recently from b1bf2f4 to 59073c5 Compare March 19, 2026 21:02
@Westwooo Westwooo changed the base branch from master to ING-1429-release-resources March 19, 2026 21:03
@Westwooo Westwooo force-pushed the ING-1399-occasional-EOFs branch 3 times, most recently from f0e6b22 to 4a49b6c Compare March 25, 2026 11:48
@Westwooo Westwooo requested a review from Copilot March 26, 2026 07:26

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.


Comment thread .github/workflows/test.yml Outdated
Comment thread gateway/test/graceful_shutdown_test.go Outdated
Comment thread gateway/hooks/httpinterceptor.go Outdated
Comment thread gateway/test/graceful_shutdown_test.go Outdated
Comment thread gateway/test/graceful_shutdown_test.go Outdated
@Westwooo Westwooo force-pushed the ING-1399-occasional-EOFs branch 2 times, most recently from f526391 to 7d5604f Compare March 26, 2026 09:51
ING-1399: Use hooks to delay grpc request server side
@Westwooo Westwooo force-pushed the ING-1399-occasional-EOFs branch from 7d5604f to 80e2fb1 Compare March 26, 2026 09:59
Comment thread gateway/hooks/httpinterceptor.go

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.


Comment thread gateway/test/graceful_shutdown_test.go
Comment thread gateway/test/graceful_shutdown_test.go
Comment thread gateway/test/graceful_shutdown_test.go Outdated
Comment thread gateway/test/graceful_shutdown_test.go Outdated
Comment thread gateway/hooks/runstate.go Outdated
Comment thread gateway/hooks/hookscontext.go Outdated
Comment thread gateway/test/graceful_shutdown_test.go Outdated
Comment thread gateway/hooks/httpinterceptor.go Outdated
Comment thread gateway/hooks/reqwatchers.go
Comment thread gateway/hooks/runstate.go

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.


Comment thread gateway/test/graceful_shutdown_test.go
Comment thread gateway/hooks/runstate.go
Comment thread gateway/hooks/runstate.go Outdated
Logger *zap.Logger
ExecResult interface{}
ExecError error
executed bool

Member

Why did this need to be added? I think the intention was that hooks were required to have 'execute' as part of them?

Contributor Author

It was originally added because we call ServeHTTP in runACtion_Execute and don't want to call it again in the Run() function. Since ServeHTTP doesn't return anything, we couldn't do things the same way as gRPC, where we set fields for the resp and err and use them to decide whether we need to execute the request later.

In the latest commit I have added an interceptor writer that allows us to access the http response inside the actions like we do for grpc and we can check if the request has been executed based on that.

Comment thread gateway/hooks/runstate.go Outdated
HooksContext *HooksContext
Handler grpc.UnaryHandler
HTTPHandler http.Handler
HTTPWriter http.ResponseWriter

Member

Why does the HTTPWriter need to be included here? Shouldn't it flow through the result the same way we do in GRPC, with the exception that we write it out differently at the end?

Contributor Author

It was included because we actually served the request and wrote the response inside of run state. I've changed it so that we use an interceptor to get the HTTP response/error, which we now propagate back up to HandleHTTPRequest, and we write the response there.

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.


Comment thread gateway/hooks/runstate.go Outdated
Comment thread gateway/hooks/runstate.go
@Westwooo Westwooo force-pushed the ING-1399-occasional-EOFs branch from 9f2821d to 2fcf4a2 Compare April 1, 2026 14:38
@Westwooo Westwooo merged commit d2d11de into master Apr 1, 2026
29 of 30 checks passed
@Westwooo Westwooo deleted the ING-1399-occasional-EOFs branch April 1, 2026 20:01
3 participants