Fix race condition in streaming output detection #524
Conversation
dc56b41 to be941e7
gkumbhat left a comment
Is it possible to add a test for this particular scenario? Maybe an integration test here: https://github.com/foundation-model-stack/fms-guardrails-orchestrator/blob/main/tests/chat_completions_streaming.rs
```rust
let chat_completions = {
    let mut attempts = 0;
    loop {
        if let Some(entry) = completion_state.completions.get(&choice_index) {
            break entry;
        }
        if attempts > 1000 {
            return Err(Error::Other(format!(
                "completion entry for choice_index {} not ready after 1000 yields",
                choice_index
            )));
        }
        attempts += 1;
        tokio::task::yield_now().await;
```
I am wondering if this creates a bit of a busy-waiting scenario here, and what if these 1000 iterations run out sooner in some cases where the generation server is for some reason extra slow 🤔
I guess one solution is to make this logic event-driven, where this await gets notified once a new element arrives.
For this implementation, we should move the 1000 to the top of the file as a constant.
Thanks for all the valid concerns! I:
- moved the limit to a constant
- replaced the yield-count with a timeout, using `tokio::time::timeout()`, in an attempt to deal with scenarios where there may be a slow generation server
If you prefer the previous solution, just let me know and I can revert back :)
```diff
 /// Builds a response with output detections.
-fn output_detection_response(
+async fn output_detection_response(
```
does this need to be async ? or can we do the await operation inside tokio runtime ?
Good question! IIUC, `yield_now()` only works in an async context, so I think it should remain this way; please let me know what you think!
FYI,
You are correct @declark1. IIUC, the issue is around entry existence. The detector task tries to
… approach Signed-off-by: m-misiura <mmisiura@redhat.com>
b2ae9b8 to 1f4d637
gkumbhat left a comment
Thanks for adding helpful comments and addressing all the review suggestions.
Description
When load testing, the following was observed:
There appears to be a race condition in streaming chat completions with output detection that causes consistent panics. When detectors respond faster than the LLM stream inserts completions, an `.unwrap()` on a missing `HashMap` entry panics. Our in-cluster detectors respond in ~2-8ms, which triggers this consistently. Two concurrent tasks access `completion_state.completions` without synchronization:
- the LLM stream task (`process_chat_completion_stream`) inserts completions from the LLM
- the detection task (`process_detection_batch_stream`) reads completions when detectors respond

When detectors respond faster than the LLM stream inserts, `.unwrap()` on a missing entry causes a panic. On the trusty fork, the following issue was raised: trustyai-explainability#12, but it is presumably better to handle this upstream.
This PR attempts to address this by replacing `.unwrap()` with a `yield_now()` loop that waits for entries to become available.
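For illustration, the failure mode reduces to an unchecked keyed lookup; a minimal sketch with simplified names, not the orchestrator's actual code:

```rust
use std::collections::HashMap;

// The detection task assumed the entry always exists by the time a
// detector responds; `.unwrap()` panics when the LLM stream task has
// not inserted it yet.
fn read_completion(completions: &HashMap<u32, String>, choice_index: u32) -> String {
    completions.get(&choice_index).unwrap().clone()
}
```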