fix: numa container hangs occasionally when map panics #3179
vaibhavtiwari33 wants to merge 18 commits into main
Conversation
vigith left a comment
This will not work; let's fix the root problem, which is that we're breaking out of the rx loop while rx is still active.
```rust
while let Some(_) = resp_stream.next().await {
    // consume the rest of the stream
}
break;
```
Replace the original loop with the `while let Some(_)`?
That will work, but then we lose the original gRPC error that was received from the stream.
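One possible way to keep both properties (a sketch only, reusing the `resp_stream`, `sender_map`, and helper names from this thread; not code from the PR): remember the first error, drain the rest of the stream, and broadcast once at the end:

```rust
// Sketch: drain the stream on error, but keep the original gRPC
// status so it can still be broadcast to the waiting senders.
let mut first_err = None;
while let Some(resp) = resp_stream.next().await {
    match resp {
        Ok(message) => Self::process_unary_response(&sender_map, message).await,
        Err(e) => {
            first_err = Some(e);
            // consume the rest of the stream before exiting the task
            while resp_stream.next().await.is_some() {}
            break;
        }
    }
}
if let Some(e) = first_err {
    Self::broadcast_error(&sender_map, e);
}
```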
One reason why we shouldn't do this:

```rust
while let Some(resp) = resp_stream.next().await {
    match resp {
        Ok(message) => Self::process_unary_response(&sender_map, message).await,
        Err(e) => {
            Self::broadcast_error(&sender_map, e);
        }
    };
}
```

would be to avoid processing a message that has already been removed and for which an error has already been broadcast, in the case where we've encountered an error earlier in the stream. We'd be spending time acquiring the lock on the senders map just to remove an entry that might not be there.
I'll add a boolean state check behind the response sender map mutex so that we don't have to worry about this and can use your semantics.

The above-mentioned scenario would still play out with those semantics.
@vigith from the previous discussion we know that the problem lies with the following check:

```rust
// only insert if we are able to send the message to the server
if let Err(e) = self.read_tx.send(message.into()).await {
    ...
}
// insert the sender into the map
self.senders
    .lock()
    .expect("failed to acquire poisoned lock")
    .insert(key.clone(), (msg_info, respond_to));
```

i.e., we're assuming that if the send to the server succeeds, it's safe to insert the sender into the map. The problem here is that, per https://docs.rs/tokio/latest/tokio/sync/mpsc/struct.Sender.html#method.send, the send can succeed, but that doesn't guarantee that the receiver will ever process the message. So, we should have an atomic state on the senders map to ensure we can add to it.
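A self-contained toy illustration of that tokio behavior (hypothetical code, not from this PR): both sends return `Ok`, yet the second message is never processed because the receiving loop exits first:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<&str>(8);

    // Both sends succeed: the channel is open and has buffer capacity.
    tx.send("error").await.unwrap();
    tx.send("never processed").await.unwrap();

    // The receiver breaks out of its loop on the first "error" message,
    // mirroring the rx task exiting when the stream yields an error.
    while let Some(msg) = rx.recv().await {
        if msg == "error" {
            break;
        }
        println!("processed {msg}");
    }

    // "never processed" was accepted by send(), but no one will ever
    // handle it: Ok from send() is not a processing guarantee.
}
```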
Yes. How about we add to the
This approach has 2 issues:
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #3179      +/-   ##
==========================================
- Coverage   80.08%   80.08%   -0.01%
==========================================
  Files         297      297
  Lines       67717    67765      +48
==========================================
+ Hits        54232    54270      +38
- Misses      12935    12945      +10
  Partials      550      550
```

☔ View full report in Codecov by Sentry.
What this PR does / why we need it
During long-running tests of the bypass feature for monovertex, we noticed that with a panicking mapper the numa container would sometimes get stuck waiting for all the messages to get nacked.
The bug isn't related to bypassing, as it was reproducible with the feature disabled.
Hypothesis

Messages are getting added to the `ResponseSenderMap` after the receiver tokio task has exited.

Confirmation
Add a log when the senders map has been emptied and errors are broadcast:
Add a log when messages are inserted into the senders map:
The following logs were observed in all the stuck numa containers:
Thus, messages were being inserted into the senders map AFTER the receiver had stopped monitoring the senders map.
Such a message then sticks around forever: no one responds to its oneshot receiver, so the source keeps waiting for the ack/nack of a message that never gets dropped.
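A toy reproduction of the hang (hypothetical code with simplified types): a oneshot sender parked in a map is never used but also never dropped, so the awaiting side never resolves:

```rust
use std::collections::HashMap;
use tokio::sync::oneshot;
use tokio::time::{timeout, Duration};

#[tokio::main]
async fn main() {
    let mut senders: HashMap<&str, oneshot::Sender<&str>> = HashMap::new();

    let (tx, rx) = oneshot::channel();
    // Inserted after the broadcaster has already exited: nothing will
    // ever call tx.send(), and the map keeps tx alive so rx never even
    // fails with RecvError.
    senders.insert("stuck-message", tx);

    // Bound the wait with a timeout just to demonstrate it never
    // completes; the real code awaits the response unconditionally.
    match timeout(Duration::from_secs(1), rx).await {
        Ok(resp) => println!("got response: {resp:?}"),
        Err(_) => println!("timed out: the ack/nack never arrives"),
    }
}
```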
Fix

- Add a drop key into the senders map after broadcasting errors (see the sketch after this list).
- When inserting into the senders map, if the drop key exists, skip adding the message and send an error to the message receiver.
- Wait for the receiver to close out before finishing the task.
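A minimal sketch of such a guard, modeled with the boolean-state-behind-the-mutex idea from the review discussion (hypothetical names like `SenderMap` / `try_insert` and simplified response types; not the PR's actual code):

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use tokio::sync::oneshot;

type Response = Result<String, String>; // simplified stand-in

struct SenderMap {
    inner: Mutex<Inner>,
}

struct Inner {
    closed: bool, // flipped when errors are broadcast and the receiver exits
    senders: HashMap<String, oneshot::Sender<Response>>,
}

impl SenderMap {
    // Insert only while the receiver is still draining the map;
    // otherwise fail the message immediately instead of leaking it.
    fn try_insert(&self, key: String, respond_to: oneshot::Sender<Response>) {
        let mut guard = self.inner.lock().expect("failed to acquire poisoned lock");
        if guard.closed {
            let _ = respond_to.send(Err("receiver task has exited".into()));
        } else {
            guard.senders.insert(key, respond_to);
        }
    }

    // Called when the stream errors out: broadcast to every waiter and
    // mark the map closed in the same critical section, so no insert
    // can slip in between the broadcast and the flag flip.
    fn close_with_error(&self, err: &str) {
        let mut guard = self.inner.lock().expect("failed to acquire poisoned lock");
        guard.closed = true;
        for (_, tx) in guard.senders.drain() {
            let _ = tx.send(Err(err.to_string()));
        }
    }
}
```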
Testing
How was this tested (unit/integration/manual)? Include commands, links, or screenshots if applicable.
Currently this was tested for the unary use case by running multiple replicas of a panicking UDF. Although I've only tested for about ~~an hour~~ 12 hours so far, I'm yet to encounter a stuck numa container, whereas previously I'd get at least one such scenario within 20-30 mins.

monovertex spec:
The pods are restarting consistently:
Special notes for reviewers
Anything notable for review (risk, rollout, follow-ups).