-
Notifications
You must be signed in to change notification settings - Fork 275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[client] Add retry when get one available tablet serverNode fails #426
base: main
Are you sure you want to change the base?
Conversation
f116150
to
c2287b3
Compare
AdminReadOnlyGateway gateway = | ||
GatewayClientProxy.createGatewayProxy( | ||
() -> | ||
getOneAvailableTabletServerNode( | ||
metadataUpdater.getCluster()), | ||
rpcClient, | ||
AdminReadOnlyGateway.class); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary change. Pls revert this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will revert this line.
@@ -52,6 +55,9 @@ | |||
|
|||
/** Utils for metadata for client. */ | |||
public class MetadataUtils { | |||
private static final Logger LOG = LoggerFactory.getLogger(MetadataUtils.class); | |||
|
|||
private static final int MAX_RETRY_TIMES = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this retry time be configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will make it as system configuration
@gkatzioura also added a retry mechanism here UPDATE: There is a retry method, but its only available in the test utils |
@polyzos Good idea, I will try to make a common util method of retry logic. WDYT? @SteNicholas please left some comments, thank you |
List<ServerNode> aliveTabletServers = null; | ||
for (int retryTimes = 0; retryTimes <= MAX_RETRY_TIMES; retryTimes++) { | ||
aliveTabletServers = cluster.getAliveTabletServerList(); | ||
if (aliveTabletServers.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When aliveTabletServers
is empty in cluster
, the retry doesn't work since the cluster wil stil be empty for aliveTabletServers, we need send metadata request to update
cluster`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK,I will adjust it,thank you.
@zhaozijun109, @polyzos, IMO, the retry mechanism maybe different from different behavior, therefore it may not need to introduce a common util method of retry logic. |
@SteNicholas Thank for your advice, I will re-adjust based on all the comments above |
@SteNicholas Indeed, but there might be re-use of some implementations.. my thoughts were that it might be best to have them in a centralized place, to avoid introducing multiple retry implementations across the codebase, because I already saw 3 different cases popping up in different places. Whatever you think works best |
c2287b3
to
fc5a64f
Compare
@luoyuxia @SteNicholas Could you please help me review again when you have free time? Thank you. |
@@ -876,6 +876,13 @@ public class ConfigOptions { | |||
"Enable metrics for client. When metrics is enabled, the client " | |||
+ "will collect metrics and report by the JMX metrics reporter."); | |||
|
|||
public static final ConfigOption<Integer> CLIENT_GET_TABLET_SERVER_NODE_MAX_RETRY_TIMES = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add this option in configuration.md
.
@zhaozijun109, could you create new issue for this pull request? BTW, could you also add some description for this pull request? |
@SteNicholas Thank you for your review, I will create a new issue. |
fc5a64f
to
330d19b
Compare
@luoyuxia @SteNicholas The new issue is #480, could you please help me to review again when you have free time, thank you. |
Thank you for your contribution, @zhaozijun109! You're absolutely right—the current client metadata implementation has significant issues with retrying, timeout handling, and error management. While this PR effectively addresses the specific case of To tackle this comprehensively, I've created Issue #483 to propose a broader refactoring of the client metadata implementation. The goal is to address all recent bugs and provide a more robust, general solution. |
@wuchong Thank you, I will continue to study the principle of fluss and try my best to contribute it. Another, sounds good that it's very good for refactoring of the client metadata implementation, many methods will benefit from this. |
Purpose
Linked issue: #480
Currently, fluss will directly throw an exception when get available tablet server node fails, no matter what the reason once tablet server list is empty at the first attampet get, it will throw an exception, this is too unfriendly, so we can add retry logic to become strong.
Tests
FlussFailServerTableITCase.testRetryGetTabletServerNodes()
TestingMetadataUpdater
API and Format
getOneAvailableTabletServerNode()
Documentation
Add retry times to avoid throwing an exception directly in getOneAvailableTabletServerNode(), and we need send metadata request to update cluster then try to get one available tablet server node again when available tablet server node list is empty in cluster.