-
Notifications
You must be signed in to change notification settings - Fork 345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: SendAsync callback was not invoked when producer is in reconnecting #1333
Draft
gunli
wants to merge
5
commits into
apache:master
Choose a base branch
from
gunli:fix-1332
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if my concern is correct.
If use a new thread here, we need to carefully handle follow case:
reconnecting
status to prevent concurrency issues.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even with the check below, we can't be 100% sure of using the old connection, because it's inherently concurrent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I'm thinking that before a message enters the
pendingQueue
, even if the user is using thesendAsync
method, we should block the user's call.This way, on the user side, the timeout should be calculated from when the sendAsync method call succeeds; otherwise, it should block at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with that. Actually, I think that it is quite strange to pass a buffer to a connection's channel when sending a message, as this causes the buffer to be read and written by two different goroutines, leading to a data race. What's worse is that if a buffer is sent into a connection's channel and the connection is closed at the same time, the buffer ends up in an new uncertain or pending state, we need to pay more attention to handling this situation again, currently, this situation is not handled when the connection is closed by network or by server notification pb.BaseCommand_CLOSE_PRODUCER/pb.BaseCommand_CLOSE_CONSUMER, may be we need a new PR to handle this.
In my opinion, it would be better to keep the message in the pending queue and use timeout events and server ack events to determine whether a message has timed out or succeeded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have thought about that, but that will be a breaking change. If we change it that way, SendAsync will sometimes become SendSync and the config
MaxPendingMessages
will become meaningless.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is that we are already sending data in the goroutine of
partitionProducer.internalSend()
, we don't have to create a new goroutine again inconnection.run()
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have done the same things in go client, the question is that
partitionProducer.runEventsLoop()
is block in the reconnecting case, which prevent the data from entering thependingQueue
, andfailTimeoutMessages()
can't failed it, becausefailTimeoutMessages()
only fail the ones inpendingQueue
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the
connection.State
is sufficient and timely enough to control this, because the reconnection state of a producer relies onconnection.State
and the notification from aconnection
. If the connection is closed, we should stop sending. The problem is that the connection's implementation still needs to be refactored, as I mentioned above: make the sending logic synchronous and close the connection when receivingBaseCommand_CLOSE_PRODUCER
orBaseCommand_CLOSE_CONSUMER
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A connection has multiple producers and consumers, please consider one case the connection is active, the producer is inactive, you should not send the message to the broker.
https://github.com/apache/pulsar/blob/v4.0.2/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ProducerImpl.java#L2366-L2382, this is java implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into the java code, but I still can't understand it as the JAVA implemention encapsulate too much details and I know little about JAVA.
You said that
please consider one case the connection is active, the producer is inactive
, as I know, only the connection's close event can trigger a produce from active to inactive.Regardless, the current implemention of the conneciton in pulsar-go-client is problematic,
c.internalSendRequest(req)
andc.internalWriteData(data)/c.sendPing()
can happen at the same time as the are run in 2 goroutines, they will callc.cnx.Write()
finally, which will cause write/data conflict, IMO, this can be considered as a BUG, but we can fix it in another PR.And, the conneciton use
incomingRequestsCh
with capacity of 10 (incomingRequestsCh: make(chan *request, 10)
) to receive and hold user's data, this can be considered as anotherpendingQueue
, now we have 2pendingQueue
s, one in the producer, think aboute a case: a user's data/buffer is sent intoproducer.pendingQueue
and then sent intoconnection.incomingRequestsCh
, butc.cnx.Write()
is not called yet, now there can be a problem: If the timeout config is short, now, the reqeust is treated astimeout
byproducer.failTimeoutMessages()
, then the buffer is put back into the pool, butconnection.incomingRequestsCh
still keeps a reference of the buffer, if the buffer is realloced by the pool, now, if the connection run toc.cnx.Write()
, there will be data race. This can be considered another BUG,t oo, a new PR is needed again.