Skip to content

[Feature] Add timeout for apiserver grpc server#3427

Merged
kevin85421 merged 17 commits intoray-project:masterfrom
machichima:3344-apiserver-timeout-grpc-server
May 6, 2025
Merged

[Feature] Add timeout for apiserver grpc server#3427
kevin85421 merged 17 commits intoray-project:masterfrom
machichima:3344-apiserver-timeout-grpc-server

Conversation

@machichima
Copy link
Collaborator

@machichima machichima commented Apr 19, 2025

  • Add time out grpc server interceptor
  • Add unit test to ensure the timeout interceptor works well
  • Enable setting the timeout value through flag -grpc_timeout (e.g. -grpc_timeout 30s), default to 60 seconds

Why are these changes needed?

Currently, there's no timeout setting in gRPC server side. This PR added the timeout to ensure resource access doesn't get overloaded in all cases.

Related issue number

Part of #3344

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

-e
Signed-off-by: machichima <nary12321@gmail.com>
@machichima
Copy link
Collaborator Author

Hi @dentiny ,

I noticed that there's a PR related to this which works on http timeout (#3350), do we need to have a same timeout for http and grpc server or is it fine to have them different?

Thanks

@dentiny
Copy link
Contributor

dentiny commented Apr 20, 2025

I noticed that there's a PR related to this which works on http timeout (#3350), do we need to have a same timeout for http and grpc server or is it fine to have them different?

There're three timeout related issues, as far as I am aware of:

  • grpc service timeout, which limits the processing time for each request (i.e. query kubernetes apiserver)
  • http server timeout, which is used to limit http request read and response write timeout
  • client side timeout

@machichima
Copy link
Collaborator Author

There're three timeout related issues, as far as I am aware of:

Got it, I think this PR is working on grpc service timeout, while PR #3350 works on limiting http request?

As I saw the comment in PR #3350 that they want to decide a default timeout, I am thinking if the default value here should be the same as what they set?

@dentiny
Copy link
Contributor

dentiny commented Apr 20, 2025

Got it, I think this PR is working on grpc service timeout, while PR #3350 works on limiting http request?

yes

As I saw the comment in PR #3350 that they want to decide a default timeout, I am thinking if the default value here should be the same as what they set?

No, I don't think they need to be the same

@bhks
Copy link

bhks commented Apr 23, 2025

Got it, I think this PR is working on grpc service timeout, while PR #3350 works on limiting http request?

yes

As I saw the comment in PR #3350 that they want to decide a default timeout, I am thinking if the default value here should be the same as what they set?

No, I don't think they need to be the same

60 seconds should be a good number ?

…server-timeout-grpc-server

-e
Signed-off-by: machichima <nary12321@gmail.com>
-e
Signed-off-by: machichima <nary12321@gmail.com>
-e
Signed-off-by: machichima <nary12321@gmail.com>
-e
Signed-off-by: machichima <nary12321@gmail.com>
@machichima machichima marked this pull request as ready for review April 23, 2025 14:54
@machichima
Copy link
Collaborator Author

60 seconds should be a good number ?

Yes I think its good for default value. I also add an env variable to let user set the timeout value.

@machichima
Copy link
Collaborator Author

@dentiny PTAL

case <-ctx.Done():
// Raise error if time out
if ctx.Err() == context.DeadlineExceeded {
return nil, fmt.Errorf("grpc server timed out")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we name the grpc server with KubeRay API server ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Just changed

_ = flagSet.Set("log_file", *logFile)
}

grpcTimeout := 60 * time.Second // Default timeout
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we following mechanisms to define constants or we are adding to each files where we are using it ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quickly glancing over the code, we have constants.go for other components (i.e. operator)
https://github.com/ray-project/kuberay/blob/ebb5ba441b0a7f888c17aa5c2d33943084a9a2d9/ray-operator/controllers/ray/utils/constant.go

I usually do this in two ways:

  • either place it to constant file, just as what we did for kuberay operator
    • the benefit of which is we group all constants in one place, rather than scattered around the codebase
  • or define a util function getGrpcServerTimeoutOrDefault and have default timeout besides
    • the benefit of which is it's easy to locate all timeout related functions and features

Our codebase seems to prefer (1).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I found that in apiserver, they put constants in config.go, I'll add it here

const (
// Label keys
RayClusterNameLabelKey = "ray.io/cluster-name"
RayClusterUserLabelKey = "ray.io/user"
RayClusterVersionLabelKey = "ray.io/version"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

}

grpcTimeout := 60 * time.Second // Default timeout
if timeoutStr := os.Getenv("GRPC_SERVER_TIMEOUT"); timeoutStr != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw why do we use env var instead of flags? I think flags are strictly better in a few ways:

  • program checks env variables; for example, bazel uses env to decide whether we could reuse cache
  • impose security issue, because env is shared among all processes which could be accessed everywhere

I almost only use env variables when:

  • across language boundary
  • across process boundary, if no other easier way

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the guidance!

The reason why I put it in environment variable instead of flag is because I search through the code base and find they put this (which I think is a bit similar to timeout?) in the environment variable, so I just simply follow what it does

requeueAfterSeconds, err := strconv.Atoi(os.Getenv(utils.RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV))

I agree to your points, if there's no other places that need this value, I think I'll just move it to flag instead

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to flag

grpcTimeout = timeout
klog.Infof("gRPC servier timeout set to %v", grpcTimeout)
} else {
klog.Warningf("Invalid GRPC_SERVER_TIMEOUT value: %v, using default timeout (60 seconds)", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use %d to print out default value, in case we change in the future

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just added

-e
Signed-off-by: machichima <nary12321@gmail.com>
-e
Signed-off-by: machichima <nary12321@gmail.com>
-e
Signed-off-by: machichima <nary12321@gmail.com>
-e
Signed-off-by: machichima <nary12321@gmail.com>
…server-timeout-grpc-server

-e
Signed-off-by: machichima <nary12321@gmail.com>
-e
Signed-off-by: machichima <nary12321@gmail.com>
-e
Signed-off-by: machichima <nary12321@gmail.com>
-e
Signed-off-by: machichima <nary12321@gmail.com>
@machichima machichima requested a review from dentiny April 26, 2025 08:12
-e
Signed-off-by: machichima <nary12321@gmail.com>
@dentiny
Copy link
Contributor

dentiny commented Apr 28, 2025

let me know when it's ready for review, feel free to ping me any time :)

-e
Signed-off-by: machichima <nary12321@gmail.com>
Comment on lines +29 to +43
select {
case <-time.After(delay):
return "test_response", h.returnErr
case <-ctx.Done():
var grpcCode codes.Code
switch ctx.Err() {
case context.Canceled:
grpcCode = codes.Canceled
case context.DeadlineExceeded:
grpcCode = codes.DeadlineExceeded
default:
grpcCode = codes.Unknown
}
return nil, status.Error(grpcCode, ctx.Err().Error())
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this to mimic the grpc IO handler for testing

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is automatically updated when running make test

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is automatically updated when running make test

}

grpcTimeout := 60 * time.Second // Default timeout
if timeoutStr := os.Getenv("GRPC_SERVER_TIMEOUT"); timeoutStr != "" {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to flag

@machichima machichima requested a review from dentiny April 29, 2025 01:45
@machichima
Copy link
Collaborator Author

Sorry I just found that I didn't submit the review as comment

@machichima
Copy link
Collaborator Author

@dentiny PTAL, Thanks!

Copy link
Contributor

@dentiny dentiny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you for the help!
And sorry about the delay

// ConfigMapClient indicates an expected call of ConfigMapClient
// ConfigMapClient indicates an expected call of ConfigMapClient.
func (mr *MockKubernetesClientInterfaceMockRecorder) ConfigMapClient(namespace interface{}) *gomock.Call {
mr.mock.ctrl.T.Helper()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the mock call setup fails (e.g., wrong argument types), Go's testing output will show the line in your 
actual test code where the error originated, rather than pointing you to this ConfigMapClient() method in 
the mock recorder.

TIL

info,
func(ctx context.Context, req interface{}) (interface{}, error) {
return tt.handler.Handle(ctx, req)
return tt.handler.Handle(ctx, req, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 0 /*delay*/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Thanks!

info,
func(receivedCtx context.Context, req interface{}) (interface{}, error) {
return handler.Handle(receivedCtx, req)
return handler.Handle(receivedCtx, req, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add a comment besides constants

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Thanks!

machichima added 2 commits May 2, 2025 23:39
…server-timeout-grpc-server

-e
Signed-off-by: machichima <nary12321@gmail.com>
-e
Signed-off-by: machichima <nary12321@gmail.com>
@kevin85421 kevin85421 merged commit f45155b into ray-project:master May 6, 2025
24 checks passed
@dentiny
Copy link
Contributor

dentiny commented May 7, 2025

Enable setting the timeout value through env variable GRPC_SERVER_TIMEOUT

PR description needs to be updated :)

laurafitzgerald pushed a commit to laurafitzgerald/kuberay that referenced this pull request May 7, 2025
@machichima
Copy link
Collaborator Author

PR description needs to be updated :)

Thanks! Just updated!

laurafitzgerald pushed a commit to laurafitzgerald/kuberay that referenced this pull request May 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants