YARN-11924. Add zkManager.exists(path) check to ZKConfigurationStore:…#8222
ferdelyi wants to merge 1 commit into apache:trunk
💔 -1 overall
This message was automatically generated.
Hean-Chhinling
left a comment
Big thanks to you, @ferdelyi, for working on this.
This is a huge help with the intermittent RM startup failures.
The code overall is really good.
I just have some improvement ideas and some questions.
.../apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/ZKConfigurationStore.java
...che/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestZKConfigurationStore.java
...hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
ferdelyi
left a comment
Thank you for your review!
ferdelyi
left a comment
Thank you for being so active on this PR.
…getZkData() and retry mechanism
If the 'yarn resourcemanager -format-conf-store' command is issued while one of the RMs is starting, that RM may fail. This occurs because the /confstore/CONF_STORE path may not yet exist. Alternatively, if the confstore is in the process of being written, the getZkData method returns a null value, causing the crash. To prevent this, a retry mechanism was added before giving up.
Hean-Chhinling
left a comment
Big thanks, @ferdelyi, for this patch. It helps solve a lot of YARN HA-related issues.
Thank you so much.
This PR LGTM!!
The parametrised unit-test
This is because the client request is invalid. Maybe because of this JSON file here
Then I tested running the unit-test without this PR's changes. It still fails with status code 400 for path
Thus these unit-test failures at
…getZkData() and retry mechanism
If a "yarn resourcemanager -format-state-store" command is issued while one of the RMs is starting and in the INIT state (because of YARN-11551), there is a period during which the /confstore/CONF_STORE path does not exist. The getZkData method then returns a null value, causing the RM to fail. To prevent this, add a check and retry mechanism before giving up.
Description of PR
This addresses a rare race condition in which "yarn resourcemanager -format-state-store" is issued while an RM is in the INITING state, after it has initialized the confstore and right before it reads it. The change avoids a null pointer exception.
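For readers who want to see the shape of the fix, here is a minimal sketch of an exists-check plus bounded retry. It is not the code from this PR: the `ZkReader` interface, the `readWithRetry` helper, and the attempt count and sleep interval are all hypothetical names and values chosen for illustration, and the real change lives in ZKConfigurationStore.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch only; names and parameters are assumptions, not the PR's code. */
final class RetryingZkReadSketch {

  /** Hypothetical stand-in for the exists/read calls made against ZooKeeper. */
  interface ZkReader {
    boolean exists(String path) throws Exception;
    byte[] getData(String path) throws Exception;
  }

  /**
   * Checks that the path exists and holds non-null data, retrying a bounded
   * number of times before giving up with an IOException.
   */
  static byte[] readWithRetry(ZkReader zk, String path, int maxAttempts, long sleepMillis)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (zk.exists(path)) {
        byte[] data = zk.getData(path);
        if (data != null) {
          return data; // node is present and readable
        }
      }
      if (attempt < maxAttempts) {
        Thread.sleep(sleepMillis); // wait for the node to be (re)created
      }
    }
    throw new IOException("No data at " + path + " after " + maxAttempts + " attempts");
  }

  public static void main(String[] args) throws Exception {
    AtomicInteger calls = new AtomicInteger();
    // Fake reader: the path "appears" on the third exists() call.
    ZkReader flaky = new ZkReader() {
      @Override public boolean exists(String path) { return calls.incrementAndGet() >= 3; }
      @Override public byte[] getData(String path) { return new byte[] {42}; }
    };
    byte[] data = readWithRetry(flaky, "/confstore/CONF_STORE", 5, 10L);
    System.out.println("read " + data.length + " byte(s) after " + calls.get() + " exists() calls");
  }
}
```

Failing with an explicit exception after the last attempt, rather than returning null, keeps the null from propagating to callers, which is the failure mode described above.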
How was this patch tested?
Manually, by introducing locks with a sleep at the confstore format step in the RM, so that while one RM is formatting the statestore, the other RM is at the getZkData method, trying to read the confstore in the INIT state.
Also with added unit tests.
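The timing window can also be imitated without a cluster. The sketch below is an assumption-laden simulation, not one of the unit tests added in this PR: an in-memory map stands in for ZooKeeper, a background thread plays the RM that recreates the node after a format, and `FormatRaceSimulation`, `pollForData`, and `runOnce` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative simulation only; an in-memory map stands in for ZooKeeper. */
final class FormatRaceSimulation {
  static final String CONF_STORE_PATH = "/confstore/CONF_STORE";

  /** Polls until the path holds data or the attempts run out; returns null on timeout. */
  static byte[] pollForData(Map<String, byte[]> store, String path,
      int maxAttempts, long sleepMillis) throws InterruptedException {
    for (int i = 0; i < maxAttempts; i++) {
      byte[] data = store.get(path);
      if (data != null) {
        return data;
      }
      Thread.sleep(sleepMillis);
    }
    return null;
  }

  /** Runs one simulated race and returns what the reading side saw. */
  static byte[] runOnce(long recreateDelayMillis) throws InterruptedException {
    // The map starts empty: the "format" has just removed the node.
    Map<String, byte[]> store = new ConcurrentHashMap<>();
    Thread formatter = new Thread(() -> {
      try {
        Thread.sleep(recreateDelayMillis); // window in which the path is missing
        store.put(CONF_STORE_PATH, "conf".getBytes(StandardCharsets.UTF_8));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    formatter.start();
    // The reading side retries instead of failing on the first miss.
    byte[] seen = pollForData(store, CONF_STORE_PATH, 200, 10L);
    formatter.join();
    return seen;
  }

  public static void main(String[] args) throws InterruptedException {
    byte[] seen = runOnce(50L);
    System.out.println(seen == null ? "timed out" : new String(seen, StandardCharsets.UTF_8));
  }
}
```

A single unretried read at the start of `runOnce` would see nothing, which corresponds to the null value that the manual test with sleeps was designed to provoke.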