
[FLINK-34524] Scale down JM deployment to 0 before deletion #791

Merged
merged 1 commit into apache:main on Mar 11, 2024

Conversation

@gyfora gyfora commented Mar 7, 2024

What is the purpose of the change

We recently improved the JM deployment deletion mechanism; however, it seems that task manager pod deletion can sometimes get stuck for a couple of minutes in native mode if we simply try to delete everything at once.

It speeds up the process and leads to a cleaner shutdown if we scale down the JM deployment to 0 (shutting down the JM pods first) and then perform the deletion.
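A minimal sketch of the intended sequence, assuming the fabric8 KubernetesClient the operator already uses (names, wait condition, and timeout below are illustrative, not the actual operator code):

```java
import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.client.KubernetesClient;

class JmScaleDownSketch {

    // Scale the JM deployment to 0 so the JM pods shut down first, then delete
    // the deployment itself. The TM pods are only touched afterwards, so the JM
    // does not observe TM failures/loss during its own shutdown.
    static void scaleDownThenDelete(KubernetesClient client, String namespace, String clusterId) {
        var jmDeployment =
                client.apps().deployments().inNamespace(namespace).withName(clusterId);

        // Patch replicas to 0 and wait until the JM pods are gone.
        jmDeployment.scale(0);
        jmDeployment.waitUntilCondition(
                d -> d == null
                        || d.getStatus() == null
                        || d.getStatus().getReplicas() == null
                        || d.getStatus().getReplicas() == 0,
                5,
                TimeUnit.MINUTES);

        // Only now remove the (already empty) JM deployment.
        jmDeployment.delete();
    }
}
```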

Verifying this change

  • Manually verified in local Kube env
  • E2Es
  • Unit tests

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: yes

Documentation

  • Does this pull request introduce a new feature? no

@gyfora gyfora requested review from mxm, gaborgsomogyi and morhidi March 7, 2024 08:59

gyfora commented Mar 7, 2024

cc @mateczagany , I would love to hear your thoughts on this as well.

I still need to finish up some tests, but any feedback is welcome.

@mxm mxm left a comment

I wonder, what is the difference between scaling the deployment to zero vs removing it? I would think both issue a SIGINT to the container. If that is the case, there should be no difference in shutdown behavior.

gyfora commented Mar 7, 2024

> I wonder, what is the difference between scaling the deployment to zero vs removing it? I would think both issue a SIGINT to the container. If that is the case, there should be no difference in shutdown behavior.

With respect to the JM there is no difference; however, deleting the deployment deletes the TM pods at the same time, so the JM sees the TM failures/loss in many cases, leading to a restart followed by SIGTERM. The idea here is to delete the JM before touching the TMs.

mxm commented Mar 7, 2024

Ah, got it! That makes sense.

```java
flinkStandaloneService.deleteClusterDeployment(
        flinkDeployment.getMetadata(),
        flinkDeployment.getStatus(),
        new Configuration(),
        false);

assertEquals(2, mockServer.getRequestCount() - requestsBeforeDelete);
```

Shouldn't this result in one request?

For the JM deletion, instead of 1 DELETE request, this will cost 3 requests:

  • Patch resource
  • waitUntilCondition gets the resource and, if the condition is not fulfilled, starts a websocket (so this might be 2 requests)
  • Delete resource

And we still have the TM deletion request below the new logic, so this will be 4 requests in total instead of 2.

@mateczagany mateczagany left a comment

Added some comments, but I really like this change; I think it makes a lot of sense to let the JMs shut down the TMs themselves in native mode.

But I wonder, how much difference will it make in standalone mode? Is it worth adding it there as well?

```java
protected void scaleJmToZero(
        EditReplacePatchable<Deployment> jmDeployment, String namespace, String clusterId) {
    LOG.info("Scaling down JM deployment to 0 before deletion");
```

Since we also have a possible timeout here, maybe it's worth adding the duration of the timeout to the log message.
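For example (a small sketch, not the actual operator code; the timeout value and where it comes from are assumptions):

```java
import java.time.Duration;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ScaleDownLoggingSketch {
    private static final Logger LOG = LoggerFactory.getLogger(ScaleDownLoggingSketch.class);

    // 'timeout' is hypothetical here; in the operator it would presumably be the
    // configured cluster shutdown timeout.
    static void logScaleDown(Duration timeout) {
        LOG.info(
                "Scaling down JM deployment to 0 before deletion ({} seconds timeout)",
                timeout.toSeconds());
    }
}
```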

gyfora commented Mar 8, 2024

> Added some comments, but I really like this change; I think it makes a lot of sense to let the JMs shut down the TMs themselves in native mode.
>
> But I wonder, how much difference will it make in standalone mode? Is it worth adding it there as well?

The logic can be the same in standalone as well; right now we are issuing the delete requests for both the TM and JM deployments directly one after the other. There is a good chance that the TM shuts down and sends an error signal to the JM before the JM terminates.

In standalone the alternative would be simply to wait for the JM deployment to be deleted before deleting the task managers.

I am working on some cleanup / improvements in this area to make this nicer.
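A minimal sketch of the standalone alternative mentioned above (fabric8 calls; the resource handles and timeout are illustrative, not the actual operator code):

```java
import java.util.Objects;
import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.api.model.apps.Deployment;
import io.fabric8.kubernetes.client.dsl.RollableScalableResource;

class StandaloneDeleteSketch {

    // Delete the JM deployment first and block until it is actually gone, so the
    // TMs cannot send error signals to a still-running JM; only then delete the
    // TM deployment.
    static void deleteStandaloneCluster(
            RollableScalableResource<Deployment> jmDeployment,
            RollableScalableResource<Deployment> tmDeployment) {
        jmDeployment.delete();
        jmDeployment.waitUntilCondition(Objects::isNull, 5, TimeUnit.MINUTES);
        tmDeployment.delete();
    }
}
```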

gyfora commented Mar 8, 2024

@mateczagany I actually reworked the deletion flow and moved some code around to remove some duplicated logic and the unnecessary decoupling of deletion vs waiting.

This leads to much cleaner logic and avoids using scale-to-zero for standalone mode as well, while keeping the benefits of this approach.

(I still need to update some now-obsolete tests, will do that today.)

@gyfora gyfora requested a review from mateczagany March 8, 2024 18:23

@mateczagany mateczagany left a comment

Just one small comment, but overall this seems like a great improvement compared to the last solution.

gyfora commented Mar 10, 2024

The new and improved / consistent logging after the rework, @mateczagany @mxm:

```
[INFO ][default/basic-checkpoint-ha-example] >>> Event  | Info    | CLEANUP         | Cleaning up FlinkDeployment
[INFO ][default/basic-checkpoint-ha-example] Cleaning up autoscaling meta data
[INFO ][default/basic-checkpoint-ha-example] Job is running, cancelling job.
[INFO ][default/basic-checkpoint-ha-example] Job successfully cancelled.
[INFO ][default/basic-checkpoint-ha-example] Deleting cluster with Foreground propagation
[INFO ][default/basic-checkpoint-ha-example] Scaling JobManager Deployment to zero with 300 seconds timeout...
[INFO ][default/basic-checkpoint-ha-example] Completed Scaling JobManager Deployment to zero
[INFO ][default/basic-checkpoint-ha-example] Deleting JobManager Deployment with 298 seconds timeout...
[INFO ][default/basic-checkpoint-ha-example] Completed Deleting JobManager Deployment
[INFO ][default/basic-checkpoint-ha-example] Deleting Kubernetes HA metadata
```

mxm commented Mar 11, 2024

@gyfora Nice! LGTM

@gyfora gyfora merged commit ede1a61 into apache:main Mar 11, 2024
131 checks passed