Fix race condition in streaming output detection #524
Conversation
dc56b41 to be941e7
gkumbhat left a comment
Is it possible to add a test for this particular scenario? Maybe an integration test here: https://github.com/foundation-model-stack/fms-guardrails-orchestrator/blob/main/tests/chat_completions_streaming.rs
```rust
let chat_completions = {
    let mut attempts = 0;
    loop {
        if let Some(entry) = completion_state.completions.get(&choice_index) {
            break entry;
        }
        if attempts > 1000 {
            return Err(Error::Other(format!(
                "completion entry for choice_index {} not ready after 1000 yields",
                choice_index
            )));
        }
        attempts += 1;
        tokio::task::yield_now().await;
```
I am wondering if this creates a bit of a busy-waiting scenario here, and what if these 1000 iterations run out sooner in some cases where the generation server is for some reason extra slow 🤔
I guess one solution is to make this logic event-driven, where this await gets notified once a new element arrives.
For this implementation, we should move the 1000 to the top of the file as a constant.
Thanks for all the valid concerns! I:
- moved the limit to a constant
- replaced the yield-count with a timeout, using `tokio::time::timeout()`, in an attempt to deal with scenarios where there may be a slow generation server
If you prefer the previous solution, just let me know and I can revert back :)
```diff
 /// Builds a response with output detections.
-fn output_detection_response(
+async fn output_detection_response(
```
does this need to be async ? or can we do the await operation inside tokio runtime ?
Good question! IIUC, `yield_now()` only works in an async context, so I think it should remain this way; please let me know what you think!
FYI,
You are correct @declark1. IIUC, the issue is around entry existence. The detector task tries to
… approach Signed-off-by: m-misiura <mmisiura@redhat.com>
b2ae9b8 to 1f4d637
gkumbhat left a comment
Thanks for adding helpful comments and addressing all the review suggestions.
Description
When load testing, the following was observed:
There appears to be a race condition in streaming chat completions with output detection that causes consistent panics. When detectors respond faster than the LLM stream inserts completions, an `.unwrap()` on a missing `HashMap` entry panics. Our in-cluster detectors respond in ~2-8ms, which triggers this consistently. Two concurrent tasks access `completion_state.completions` without synchronization:
- the LLM stream task (`process_chat_completion_stream`) inserts completions from the LLM
- the detection task (`process_detection_batch_stream`) reads completions when detectors respond

When detectors respond faster than the LLM stream inserts, `.unwrap()` on a missing entry causes a panic. On the trusty fork, the following issue was raised: trustyai-explainability#12, but it is presumably better to handle this upstream.
This PR attempts to address this by replacing `.unwrap()` with a `yield_now()` loop that waits for entries to become available.
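For illustration, the failure mode reduces to an unchecked keyed lookup; a minimal sketch with simplified names, not the orchestrator's actual code:

```rust
use std::collections::HashMap;

// The detection task assumed the entry always exists by the time a
// detector responds; `.unwrap()` panics when the LLM stream task has
// not inserted it yet.
fn read_completion(completions: &HashMap<u32, String>, choice_index: u32) -> String {
    completions.get(&choice_index).unwrap().clone()
}
```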