
[client] Add retry when get one available tablet serverNode fails #426

Open
wants to merge 3 commits into base: main

Conversation


@zhaozijun109 zhaozijun109 commented Feb 18, 2025

Purpose

Linked issue: #480
Currently, Fluss throws an exception immediately when getting an available tablet server node fails: if the tablet server list is empty on the first attempt, it throws regardless of the reason. This is unfriendly to callers, so we can add retry logic to make the client more robust.

Tests

FlussFailServerTableITCase.testRetryGetTabletServerNodes()
TestingMetadataUpdater

API and Format

getOneAvailableTabletServerNode()

Documentation

Add retries to avoid throwing an exception directly in getOneAvailableTabletServerNode(). When the list of available tablet server nodes in the cluster is empty, we need to send a metadata request to refresh the cluster view and then try to get one available tablet server node again.

AdminReadOnlyGateway gateway =
        GatewayClientProxy.createGatewayProxy(
                () ->
                        getOneAvailableTabletServerNode(
                                metadataUpdater.getCluster()),
                rpcClient,
                AdminReadOnlyGateway.class);
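The retry-with-refresh behavior described above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual code: `ServerNode` and `MetadataUpdater` here are simplified stand-ins for the real Fluss classes, and the "succeeds on the third refresh" behavior only simulates a cluster that becomes reachable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RetrySketch {
    static final int MAX_RETRY_TIMES = 5;

    // Simplified stand-in for the real Fluss ServerNode.
    static class ServerNode {
        final int id;
        ServerNode(int id) { this.id = id; }
    }

    // Simplified stand-in for the real metadata updater.
    static class MetadataUpdater {
        private int refreshes = 0;
        final List<ServerNode> aliveTabletServers = new ArrayList<>();

        // Simulate a metadata request that succeeds on the third refresh.
        void updateMetadata() {
            refreshes++;
            if (refreshes >= 3) {
                aliveTabletServers.add(new ServerNode(1));
            }
        }
    }

    static ServerNode getOneAvailableTabletServerNode(MetadataUpdater updater) {
        for (int retryTimes = 0; retryTimes <= MAX_RETRY_TIMES; retryTimes++) {
            List<ServerNode> alive = updater.aliveTabletServers;
            if (!alive.isEmpty()) {
                // Pick a random node to spread load across servers.
                return alive.get(new Random().nextInt(alive.size()));
            }
            // The cluster view is empty: request fresh metadata before the
            // next attempt instead of failing immediately.
            updater.updateMetadata();
        }
        throw new IllegalStateException(
                "No available tablet server after " + MAX_RETRY_TIMES + " retries");
    }

    public static void main(String[] args) {
        ServerNode node = getOneAvailableTabletServerNode(new MetadataUpdater());
        System.out.println("picked node " + node.id);
    }
}
```

The key point the reviewers converge on below is the refresh step: without `updateMetadata()` inside the loop, retrying re-reads the same stale, empty list.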

Contributor

Unnecessary change. Please revert this line.

Author

Ok, I will revert this line.

@@ -52,6 +55,9 @@

/** Utils for metadata for client. */
public class MetadataUtils {
private static final Logger LOG = LoggerFactory.getLogger(MetadataUtils.class);

private static final int MAX_RETRY_TIMES = 5;
Contributor

Could this retry time be configurable?

Author

Yes, I will make it a system configuration option.

@polyzos
Collaborator

polyzos commented Feb 19, 2025

@gkatzioura also added a retry mechanism here
to fix some tests. I assume retry logic will be needed in other places as well, so it may make sense to introduce a class to handle that or use a library. WDYT? cc @wuchong

UPDATE: There is a retry method, but it's only available in the test utils.

@zhaozijun109
Author

@polyzos Good idea, I will try to extract a common utility method for the retry logic. WDYT? @SteNicholas please leave some comments, thank you.

List<ServerNode> aliveTabletServers = null;
for (int retryTimes = 0; retryTimes <= MAX_RETRY_TIMES; retryTimes++) {
    aliveTabletServers = cluster.getAliveTabletServerList();
    if (aliveTabletServers.isEmpty()) {
Collaborator

When aliveTabletServers is empty in the cluster, the retry doesn't work, since aliveTabletServers will still be empty on the next iteration; we need to send a metadata request to update cluster.

Author

OK, I will adjust it, thank you.

@SteNicholas
Contributor

SteNicholas commented Feb 21, 2025

@zhaozijun109, @polyzos, IMO, the retry mechanism may differ between behaviors, so there may be no need to introduce a common utility method for retry logic.

@zhaozijun109
Author

@SteNicholas Thanks for your advice, I will readjust based on all the comments above.

@polyzos
Collaborator

polyzos commented Feb 21, 2025

@SteNicholas Indeed, but some implementations might be reused. My thought was that it might be best to keep them in a centralized place, to avoid introducing multiple retry implementations across the codebase; I have already seen three different cases popping up in different places. Whatever you think works best.
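A centralized helper of the kind suggested here might look like the following. This is purely illustrative: `RetryUtils` and its signature are not part of the Fluss codebase, and a real version would likely need backoff policies and checked-exception handling.

```java
import java.util.function.Supplier;

public class RetryUtils {
    /**
     * Runs {@code action} up to {@code maxAttempts} times, sleeping
     * {@code delayMs} between attempts, and rethrows the last failure.
     */
    public static <T> T retry(Supplier<T> action, int maxAttempts, long delayMs) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delayMs);
                    } catch (InterruptedException ie) {
                        // Restore the interrupt flag and stop retrying.
                        Thread.currentThread().interrupt();
                        throw new RuntimeException("interrupted while retrying", ie);
                    }
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Fails twice, then succeeds on the third attempt.
        int result = retry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("not ready");
            }
            return 42;
        }, 5, 1L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

A helper like this fits the simple "try N times" cases; as @SteNicholas notes below, behaviors that need a refresh step between attempts (like the metadata update in this PR) may not map cleanly onto one shared utility.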

@zhaozijun109
Author

@luoyuxia @SteNicholas Could you please review again when you have free time? Thank you.

@@ -876,6 +876,13 @@ public class ConfigOptions {
"Enable metrics for client. When metrics is enabled, the client "
+ "will collect metrics and report by the JMX metrics reporter.");

public static final ConfigOption<Integer> CLIENT_GET_TABLET_SERVER_NODE_MAX_RETRY_TIMES =
Contributor

@SteNicholas SteNicholas Feb 24, 2025

Add this option in configuration.md.
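For reference, an option of this shape could be declared roughly as follows. The builder class below is a minimal stub used only to make the sketch self-contained; the key string and description are hypothetical, and the real Fluss `ConfigOptions` API (which follows a Flink-style builder) may differ. Only the constant name and the default of 5 come from the diff above.

```java
// Minimal stub illustrating the shape of the new config option; the real
// Fluss ConfigOption builder API may differ.
public class ConfigOptionSketch {
    static class ConfigOption<T> {
        final String key;
        final T defaultValue;
        final String description;

        ConfigOption(String key, T defaultValue, String description) {
            this.key = key;
            this.defaultValue = defaultValue;
            this.description = description;
        }
    }

    // Hypothetical key and description; the default of 5 mirrors the
    // previous hard-coded MAX_RETRY_TIMES.
    public static final ConfigOption<Integer>
            CLIENT_GET_TABLET_SERVER_NODE_MAX_RETRY_TIMES =
                    new ConfigOption<>(
                            "client.get-tablet-server-node.max-retry-times",
                            5,
                            "Max retries before getOneAvailableTabletServerNode fails.");

    public static void main(String[] args) {
        ConfigOption<Integer> opt = CLIENT_GET_TABLET_SERVER_NODE_MAX_RETRY_TIMES;
        System.out.println(opt.key + " = " + opt.defaultValue);
    }
}
```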

@SteNicholas
Contributor

@zhaozijun109, could you create a new issue for this pull request? BTW, could you also add a description for this pull request?

@zhaozijun109
Author

@SteNicholas Thank you for your review, I will create a new issue.

@zhaozijun109
Author

@luoyuxia @SteNicholas The new issue is #480. Could you please review again when you have free time? Thank you.

@wuchong wuchong linked an issue Feb 25, 2025 that may be closed by this pull request
2 tasks
@wuchong
Member

wuchong commented Feb 25, 2025

Thank you for your contribution, @zhaozijun109! You're absolutely right—the current client metadata implementation has significant issues with retrying, timeout handling, and error management. While this PR effectively addresses the specific case of getOneAvailableTabletServerNode, there are many other methods in the client metadata module that could benefit from similar improvements.

To tackle this comprehensively, I've created Issue #483 to propose a broader refactoring of the client metadata implementation. The goal is to address all recent bugs and provide a more robust, general solution.

@zhaozijun109
Author

@wuchong Thank you, I will continue to study the principles of Fluss and try my best to contribute. Also, refactoring the client metadata implementation sounds great; many methods will benefit from it.

Development

Successfully merging this pull request may close these issues.

Support retry mechanism when get available tablet server nodes failed
5 participants