
[FLINK-34524] Scale down JM deployment to 0 before deletion #791

Merged
merged 1 commit into apache:main on Mar 11, 2024

Conversation

@gyfora gyfora commented Mar 7, 2024

What is the purpose of the change

We recently improved the JM deployment deletion mechanism; however, it seems that task manager pod deletion can sometimes get stuck for a couple of minutes in native mode if we simply try to delete everything at once.

It speeds up the process and leads to a cleaner shutdown if we scale down the JM deployment to 0 (shutting down the JM pods first) and then perform the deletion.
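A minimal sketch of the intended sequence, assuming the fabric8 KubernetesClient the operator already uses (names, wait condition, and timeout below are illustrative, not the actual operator code):

```java
import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.client.KubernetesClient;

class JmScaleDownSketch {

    // Scale the JM deployment to 0 so the JM pods shut down first, then delete
    // the deployment itself. The TM pods are only touched afterwards, so the JM
    // does not observe TM failures/loss during its own shutdown.
    static void scaleDownThenDelete(KubernetesClient client, String namespace, String clusterId) {
        var jmDeployment =
                client.apps().deployments().inNamespace(namespace).withName(clusterId);

        // Patch replicas to 0 and wait until the JM pods are gone.
        jmDeployment.scale(0);
        jmDeployment.waitUntilCondition(
                d -> d == null
                        || d.getStatus() == null
                        || d.getStatus().getReplicas() == null
                        || d.getStatus().getReplicas() == 0,
                5,
                TimeUnit.MINUTES);

        // Only now remove the (already empty) JM deployment.
        jmDeployment.delete();
    }
}
```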

Verifying this change

  • Manually verified in local Kube env
  • E2Es
  • Unit tests

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: yes

Documentation

  • Does this pull request introduce a new feature? no

@gyfora gyfora requested review from mxm, gaborgsomogyi and morhidi March 7, 2024 08:59

gyfora commented Mar 7, 2024

cc @mateczagany , I would love to hear your thoughts on this as well.

I still need to finish up some tests, but any feedback is welcome.

@mxm mxm left a comment

I wonder, what is the difference between scaling the deployment to zero vs removing it? I would think both issue a SIGINT to the container. If that is the case, there should be no difference in shutdown behavior.

gyfora commented Mar 7, 2024

> I wonder, what is the difference between scaling the deployment to zero vs removing it? I would think both issue a SIGINT to the container. If that is the case, there should be no difference in shutdown behavior.

With respect to the JM there is no difference; however, deleting the deployment deletes the TM pods at the same time, so the JM sees the TM failures/loss in many cases, leading to a restart followed by SIGTERM. The idea here is to delete the JM before touching the TMs.

mxm commented Mar 7, 2024

Ah, got it! That makes sense.

```java
flinkStandaloneService.deleteClusterDeployment(
        flinkDeployment.getMetadata(),
        flinkDeployment.getStatus(),
        new Configuration(),
        false);

assertEquals(2, mockServer.getRequestCount() - requestsBeforeDelete);
```

Shouldn't this result in one request?

For the JM deletion, instead of 1 DELETE request, this will cost 3 requests:

  • Patch resource
  • waitUntilCondition gets the resource and, if the condition is not fulfilled, starts a websocket (so this might be 2 requests)
  • Delete resource

And we still have the TM deletion request below the new logic, so this will be 4 requests in total instead of 2.

@mateczagany mateczagany left a comment

Added some comments, but I really like this change; I think it makes a lot of sense to let the JMs shut down the TMs themselves in native mode.

But I wonder, how much difference will it make in standalone mode? Is it worth adding it there as well?

```java
protected void scaleJmToZero(
        EditReplacePatchable<Deployment> jmDeployment, String namespace, String clusterId) {
    LOG.info("Scaling down JM deployment to 0 before deletion");
```

Since we also have a possible timeout here, maybe it's worth adding the duration of the timeout to the log message.
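For example (a small sketch, not the actual operator code; the timeout value and where it comes from are assumptions):

```java
import java.time.Duration;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ScaleDownLoggingSketch {
    private static final Logger LOG = LoggerFactory.getLogger(ScaleDownLoggingSketch.class);

    // 'timeout' is hypothetical here; in the operator it would presumably be the
    // configured cluster shutdown timeout.
    static void logScaleDown(Duration timeout) {
        LOG.info(
                "Scaling down JM deployment to 0 before deletion ({} seconds timeout)",
                timeout.toSeconds());
    }
}
```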

gyfora commented Mar 8, 2024

> Added some comments, but I really like this change; I think it makes a lot of sense to let the JMs shut down the TMs themselves in native mode.
>
> But I wonder, how much difference will it make in standalone mode? Is it worth adding it there as well?

The logic can be the same in standalone as well; right now we are issuing the delete requests for both the TM and JM deployments directly one after the other. There is a good chance that the TM shuts down and sends an error signal to the JM before the JM terminates.

In standalone the alternative would be simply to wait for the JM deployment to be deleted before deleting the task managers.

I am working on some cleanup / improvements in this area to make this nicer.
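A minimal sketch of the standalone alternative mentioned above (fabric8 calls; the resource handles and timeout are illustrative, not the actual operator code):

```java
import java.util.Objects;
import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.api.model.apps.Deployment;
import io.fabric8.kubernetes.client.dsl.RollableScalableResource;

class StandaloneDeleteSketch {

    // Delete the JM deployment first and block until it is actually gone, so the
    // TMs cannot send error signals to a still-running JM; only then delete the
    // TM deployment.
    static void deleteStandaloneCluster(
            RollableScalableResource<Deployment> jmDeployment,
            RollableScalableResource<Deployment> tmDeployment) {
        jmDeployment.delete();
        jmDeployment.waitUntilCondition(Objects::isNull, 5, TimeUnit.MINUTES);
        tmDeployment.delete();
    }
}
```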

gyfora commented Mar 8, 2024

@mateczagany I actually reworked the deletion flow and moved some code around to remove some duplicated logic and the unnecessary decoupling of deletion vs waiting.

This leads to much cleaner logic and avoids using scale-to-zero for standalone mode as well, while keeping the benefits of this approach.

(I still need to update some now-obsolete tests, will do that today.)

@gyfora gyfora requested a review from mateczagany March 8, 2024 18:23

@mateczagany mateczagany left a comment

Just one small comment, but overall this seems like a great improvement compared to the last solution.

gyfora commented Mar 10, 2024

The new and improved / consistent logging after the rework, @mateczagany @mxm:

```
[INFO ][default/basic-checkpoint-ha-example] >>> Event  | Info    | CLEANUP         | Cleaning up FlinkDeployment
[INFO ][default/basic-checkpoint-ha-example] Cleaning up autoscaling meta data
[INFO ][default/basic-checkpoint-ha-example] Job is running, cancelling job.
[INFO ][default/basic-checkpoint-ha-example] Job successfully cancelled.
[INFO ][default/basic-checkpoint-ha-example] Deleting cluster with Foreground propagation
[INFO ][default/basic-checkpoint-ha-example] Scaling JobManager Deployment to zero with 300 seconds timeout...
[INFO ][default/basic-checkpoint-ha-example] Completed Scaling JobManager Deployment to zero
[INFO ][default/basic-checkpoint-ha-example] Deleting JobManager Deployment with 298 seconds timeout...
[INFO ][default/basic-checkpoint-ha-example] Completed Deleting JobManager Deployment
[INFO ][default/basic-checkpoint-ha-example] Deleting Kubernetes HA metadata
```

mxm commented Mar 11, 2024

@gyfora Nice! LGTM

@gyfora gyfora merged commit ede1a61 into apache:main Mar 11, 2024
131 checks passed