feat: support connection lifetime for single client by terut · Pull Request #727 · redis/rueidis

terut · 2025-01-28T12:22:19Z

Background

Recently I noticed that requests become unbalanced after a replica failover on Memorystore for Redis (GCP) if connections are kept open. So I am considering a connection lifetime that forces a reconnect to the Redis endpoint, because existing connections are not rerouted when a node is reintroduced.

Here is the documentation about the architecture and connection balance management.

https://cloud.google.com/memorystore/docs/redis/about-read-replicas#architecture
https://cloud.google.com/memorystore/docs/redis/about-read-replicas#connection_balance_management

Ref: #725

Solution

Support a connection lifetime for the single client so that it reconnects to the fixed read endpoint.

terut · 2025-01-28T12:29:04Z

@rueian Here is the draft. Could you check it against the additional points you raised in the discussion? There are no additional tests yet.

Use p.lifeTm = time.AfterFunc(...) instead, because we need more fine-grained control over the timer.
Stop p.lifeTm early if there is a network error or the connection is closed manually.
For dedicated and blocking usages, we should stop p.lifeTm after acquiring the connection from the dpool or spool and reset it when putting the connection back.

terut · 2025-01-28T12:33:16Z

 	}
 	atomic.AddInt32(&p.waits, -1)
 	atomic.AddInt32(&p.blcksig, -1)
+	p.StopTimer()


Oops. I will fix it...

this should be removed.

rueian · 2025-01-28T16:46:59Z

 		}
 	}
 	p.cond.L.Unlock()
+	v.StopTimer()


If the timer is not stopped successfully, we need to acquire another connection.

Ah, that's right. Thanks!

Fixed 390e19b

rueian · 2025-01-28T16:48:15Z

 	r2ps            bool // identify this pipe is used for resp2 pubsub or not
 	noNoDelay       bool
+	lftm            time.Duration // lifetime
+	lftmMu          sync.Mutex    // guards lifetime timer


Do we really need the mutex and the bool flag?

I reviewed it again; we don't need the bool flag.
I thought that Timer.Reset and Timer.Stop needed a mutex on Go <= 1.22. Maybe I've got it wrong.

The source looks like it is thread-safe https://cs.opensource.google/go/go/+/refs/tags/go1.22.0:src/runtime/time.go;l=314.

Thanks, you're right. It looks thread-safe. I will remove it.

I misread the sentence "This cannot be done concurrent to other receives from the Timer's channel or other calls to the Timer's Stop method." in https://pkg.go.dev/time@go1.21.13#Timer.Stop . Sorry.

And we are using the AfterFunc timer which has no channel associated.

That's right.

Fixed f950c1e

terut · 2025-02-08T14:19:46Z

What remains is the retry implementation on the single client.
I feel that it may be enough to use the ConnLifetime option together with the retry handler enabled. What do you think about forcing a retry on errConnExpired? @rueian

rueian · 2025-02-08T16:33:06Z

 	}
 	atomic.AddInt32(&p.waits, -1)
 	atomic.AddInt32(&p.blcksig, -1)
+	p.StopTimer()


this should be removed.

rueian · 2025-02-08T16:58:38Z

What remains is the retry implementation on the single client. I feel that it may be enough to use the ConnLifetime option together with the retry handler enabled. What do you think about forcing a retry on errConnExpired? @rueian

I think we should go with your original proposal and leave the retry handler out of it.

retry:
	resp = c.conn.Do(ctx, cmd)
	if resp.Error() == errConnExpired {
		goto retry
	}
	if c.retry && cmd.IsReadOnly() && c.isRetryable(resp.Error(), ctx) {
		...

Because whenever an errConnExpired occurs, we know the connection is closed by ourselves, so it should be safe to retry immediately.
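
A self-contained sketch of that idea, under the assumption that the retried call picks up a fresh connection. It uses a for loop instead of the goto in the snippet above; flakyConn and doWithExpiryRetry are hypothetical names, and the errConnExpired message text is made up for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errConnExpired mirrors the sentinel discussed in this thread; the message is illustrative.
var errConnExpired = errors.New("connection is expired")

// flakyConn is a hypothetical connection that expires a fixed number of times.
type flakyConn struct {
	expiredLeft int
	calls       int
}

// Do returns errConnExpired until expiredLeft reaches zero, then succeeds.
func (c *flakyConn) Do(cmd string) (string, error) {
	c.calls++
	if c.expiredLeft > 0 {
		c.expiredLeft--
		return "", errConnExpired
	}
	return "OK:" + cmd, nil
}

// doWithExpiryRetry retries immediately on errConnExpired, regardless of any
// user-facing retry setting, because the client closed the connection itself.
func doWithExpiryRetry(c *flakyConn, cmd string) (string, error) {
	for {
		resp, err := c.Do(cmd)
		if errors.Is(err, errConnExpired) {
			continue
		}
		return resp, err
	}
}

func main() {
	c := &flakyConn{expiredLeft: 2}
	resp, err := doWithExpiryRetry(c, "GET k")
	fmt.Println(resp, err, c.calls)
}
```

With this shape the caller never sees errConnExpired, which matches the concern raised in the next comment about not exposing it when retry is disabled.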

terut · 2025-02-09T00:30:41Z

@rueian Thanks. Indeed, we know where the error comes from, and it's also not good to expose errConnExpired to callers when retry is disabled. The retry logic is almost done; I just need to add tests for it.

Co-authored-by: Rueian <rueiancsie@gmail.com>

rueian · 2025-02-10T07:38:13Z

 	resps = c.conn.DoMulti(ctx, multi...).s
+	if c.hasConnLftm {
+		for _, resp := range resps {
+			if resp.Error() == errConnExpired {


Is it possible that errConnExpired happens in the middle of DoMulti? I am not sure, but if it is possible then we should not retry the preceding requests that did not receive the error.

Ah, I think it's unlikely. All responses should carry the same error once p.state has changed.

I will change that like the following.

if resps[0].Error() == errConnExpired { goto retry }

Fixed c0c3657

ok, could you leave a comment in the code to explain why it won't happen?

While checking the behavior of the connection lifetime under concurrent load, I found the error "read tcp [::1]:35190->[::1]:6379: use of closed network connection" coming through singleClient.DoMulti. I think the error is probably caused by pipe.Close(), but the investigation isn't going well. Could you give me some advice on that? @rueian

@rueian Thanks! As you said, it looks like that _backgroundRead returns that error.

I will change those lines to return errConnExpired when the connection has expired. After that change, errConnExpired can happen in the middle of DoMulti, so we should probably check the error of every response.
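
One possible way to handle an expiry in the middle of a batch is sketched below: scan every response, keep the ones that completed, and resend only from the first command that failed with errConnExpired. This is an illustration under stated assumptions, not necessarily what the merged change does; result, multiDoer, and doMultiWithExpiryRetry are hypothetical names, and the sketch assumes the doer returns exactly one result per command.

```go
package main

import (
	"errors"
	"fmt"
)

// errConnExpired mirrors the sentinel discussed in this thread; the message is illustrative.
var errConnExpired = errors.New("connection is expired")

// result is a hypothetical stand-in for one response of a DoMulti call.
type result struct {
	val string
	err error
}

// multiDoer is a hypothetical function type standing in for conn.DoMulti.
// It is assumed to return one result per command.
type multiDoer func(cmds []string) []result

// doMultiWithExpiryRetry keeps responses that completed and resends only the
// commands from the first errConnExpired onward.
func doMultiWithExpiryRetry(do multiDoer, cmds []string) []result {
	out := make([]result, len(cmds))
	start := 0
	for start < len(cmds) {
		resps := do(cmds[start:])
		next := len(cmds)
		for i, r := range resps {
			if errors.Is(r.err, errConnExpired) {
				next = start + i // resume from the first expired command
				break
			}
			out[start+i] = r
		}
		start = next
	}
	return out
}

func main() {
	calls := 0
	// Fake doer: on the first call, everything after the first command expires.
	do := func(cmds []string) []result {
		calls++
		resps := make([]result, len(cmds))
		for i, c := range cmds {
			if calls == 1 && i >= 1 {
				resps[i] = result{err: errConnExpired}
				continue
			}
			resps[i] = result{val: "ok:" + c}
		}
		return resps
	}
	out := doMultiWithExpiryRetry(do, []string{"a", "b", "c"})
	fmt.Println(out, calls)
}
```

Resending only the tail avoids executing the already-answered commands a second time, which is the concern raised earlier in this review thread.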

Hi @terut, any update?

Sorry for the late reply. I had no time to spare... I think I can work on this problem starting this week. In any case, I will merge in the latest connection pool updates. @rueian

I have a feeling that this feature probably only works for read replicas, right? It seems that write commands don't work with this approach. Sometimes the incremented value is 10001, 10002, and so on when the loop runs 10000 times.

Sorry for the late reply. I had no time to spare... I think I can work on this problem starting this week.

No worries.

Sometimes the incremented value is 10001, 10002, and so on when the loop runs 10000 times.

What do you mean by this? I think this can be a general feature for those who want a limited lifetime on each connection for any reason.

When I simply incremented a key 10000 times with the connection lifetime option enabled, the value of the key ended up above 10000. But my implementation is not yet correct with respect to the error handling in _backgroundRead, so I will run the count again after implementing that correctly.

Signed-off-by: Rueian <rueiancsie@gmail.com>

rueian · 2025-04-23T04:48:12Z

Hi @terut, would you mind adding the retry logic to the cluster client and sentinel client in a follow-up PR?

terut · 2025-04-23T05:02:53Z

@rueian Okay, I'll take care of it. Sorry for keeping you waiting so long.

rueian · 2025-04-23T05:21:05Z

Thanks @terut!

terut · 2025-04-23T05:21:50Z

@rueian Thank you for all your great help over such a long time 😄

rueian · 2025-04-23T05:45:02Z

You’re welcome! I know some users really need this feature, so it’s great that we have it. However, I hope we can get your follow-up PR adding retries to the cluster and sentinel clients soon, because the next release cycle is next week. If that PR is not merged by then, we probably can't include this new feature in the next release.

* feat: add connection lifetime option to single client
* Remove mutex and timer flag for connection lifetime timer
* Retry wire accquition when failed to stop connection lifetime timer
* Add timer test to pipe
* Add test for reseting timer and stopping timer when using pool
* Remove p.StopTimer() from p.Close()

Co-authored-by: Rueian <rueiancsie@gmail.com>

* Forced to retry when errConnExpired
* Remove hasConnLftm and check resps[0] to retry for multi cmds
* Recover connection lifetime error in the middle of calls
* Fix the handling of connection lifetime error of DoMultiCache
* perf: apply fieldaligments

Signed-off-by: Rueian <rueiancsie@gmail.com>

---------

Signed-off-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>

feat: add connection lifetime option to single client

f18e613

terut marked this pull request as draft January 28, 2025 12:22

rueian reviewed Jan 28, 2025

terut added 3 commits February 8, 2025 18:40

Remove mutex and timer flag for connection lifetime timer

f950c1e

Retry wire accquition when failed to stop connection lifetime timer

390e19b

Add timer test to pipe

88c8d7e

terut force-pushed the feat/conn-lifetime branch from 31ffe91 to 88c8d7e Compare February 8, 2025 11:10

Add test for reseting timer and stopping timer when using pool

14349d4

rueian reviewed Feb 8, 2025

terut and others added 2 commits February 10, 2025 14:55

Remove p.StopTimer() from p.Close()

e91d316

Co-authored-by: Rueian <rueiancsie@gmail.com>

Forced to retry when errConnExpired

9fba892

rueian reviewed Feb 10, 2025

terut added 3 commits February 10, 2025 20:34

Remove hasConnLftm and check resps[0] to retry for multi cmds

c0c3657

Merge branch 'main' into feat/conn-lifetime

31de820

Recover connection lifetime error in the middle of calls

be0ab03

terut marked this pull request as ready for review April 23, 2025 01:31

Fix the handling of connection lifetime error of DoMultiCache

7c6be89

terut requested a review from rueian April 23, 2025 03:18

perf: apply fieldaligments

462c44e

Signed-off-by: Rueian <rueiancsie@gmail.com>

rueian merged commit 488b577 into redis:main Apr 23, 2025

terut deleted the feat/conn-lifetime branch April 23, 2025 05:50

terut mentioned this pull request Apr 23, 2025

Add retry logic of connection lifetime to cluster client and sentinel client #833

Merged

terut mentioned this pull request May 10, 2025

Check transaction block when using connection lifetime #837

Merged
