Swap Orchestrator if no first segment received before timeout #3398

Draft · wants to merge 1 commit into master
Conversation

@leszko (Contributor) commented Feb 18, 2025

Introduce a timeout for the first segment: if the Gateway does not receive the first segment within the specified timeout, it goes back to the selection logic and selects a different Orchestrator.

We would then have two timeout values defined as flags (see the sketch after the TODO list):

  • firstSegmentTimeout => Timeout to swap the Orchestrator if the first segment was not received
  • aiProcessingRetryTimeout => Timeout for the whole Gateway processing; if the Gateway cannot process the request from the Orchestrator within that time, it fails the whole processing (currently set to 45s in staging/prod)

TODO:

  • Move the hardcoded 20s into the firstSegmentTimeout flag
  • Decide on the values of firstSegmentTimeout and aiProcessingRetryTimeout
  • Use a context to cancel all trickle and ffmpeg processing when swapping the O
  • Add a log line and a metric for malfunctioning Os
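
A rough sketch of how the flow could look. Everything here is illustrative: firstSegmentTimeout would come from the proposed flag, and retrySelection is a hypothetical stand-in for going back to the Gateway's selection logic.

```go
package server

import (
	"context"
	"time"
)

// awaitFirstSegment waits for the first segment or gives up on this
// Orchestrator. Sketch only; not the PR's actual implementation.
func awaitFirstSegment(ctx context.Context, firstSegmentReceived <-chan struct{},
	firstSegmentTimeout time.Duration, cancelCtx context.CancelFunc,
	retrySelection func() error) error {
	select {
	case <-firstSegmentReceived:
		return nil // segment arrived in time, keep this Orchestrator
	case <-time.After(firstSegmentTimeout):
		cancelCtx()             // tear down trickle/ffmpeg work tied to this O
		return retrySelection() // hypothetical: select and retry with a different O
	case <-ctx.Done():
		return ctx.Err() // overall processing timeout or caller cancellation
	}
}
```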

@github-actions bot added the labels go (Pull requests that update Go code) and AI (Issues and PRs related to the AI-video branch) on Feb 18, 2025

codecov bot commented Feb 18, 2025

Codecov Report

Attention: Patch coverage is 0% with 10 lines in your changes missing coverage. Please review.

Project coverage is 32.13699%. Comparing base (39db9b6) to head (cf7a1b1).

Files with missing lines   Patch %    Lines
server/ai_process.go       0.00000%   10 Missing ⚠️

@@                 Coverage Diff                 @@
##              master       #3398         +/-   ##
===================================================
- Coverage   32.14408%   32.13699%   -0.00709%     
===================================================
  Files            147         147                 
  Lines          40754       40763          +9     
===================================================
  Hits           13100       13100                 
- Misses         26880       26889          +9     
  Partials         774         774                 
Files with missing lines   Coverage Δ
server/ai_process.go       0.58774% <0.00000%> (-0.00448%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


return resp, nil
select {
case <-firstSegmentReceived:
	cancelCtx()
Member commented:

Did you mean to cancel on the other case here, the timeout? IIUC this will cancel all trickle subs for the stream.
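
A sketch of what this suggestion would look like, with cancellation moved to the timeout branch so a healthy stream keeps its subscriptions:

```go
select {
case <-firstSegmentReceived:
	// First segment arrived: leave the context alive so the
	// stream's trickle subscriptions keep running.
	return resp, nil
case <-time.After(20 * time.Second):
	// Swap path: cancellation only tears down the stalled Orchestrator's work.
	cancelCtx()
	// ...go back to selection and retry with another O...
}
```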

@@ -1098,10 +1101,17 @@ func submitLiveVideoToVideo(ctx context.Context, params aiRequestParams, sess *A
		monitor.AIFirstSegmentDelay(delayMs, sess.OrchestratorInfo)
	}
	clog.V(common.VERBOSE).Infof(ctx, "First Segment delay=%dms streamID=%s", delayMs, params.liveParams.streamID)
	firstSegmentReceived <- struct{}{}
Member commented:

Might need a select with default in case the timeout has already fired, or a buffered chan.
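
A sketch of the select-with-default variant, which drops the signal instead of blocking the goroutine when nobody is left to receive it:

```go
select {
case firstSegmentReceived <- struct{}{}:
	// The waiter was still there; signal delivered.
default:
	// The timeout already fired and the waiter is gone; drop the signal.
}
```

The buffered alternative (make(chan struct{}, 1)) achieves the same with a plain send, since the single send completes even with no receiver.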

case <-firstSegmentReceived:
	cancelCtx()
	return resp, nil
case <-time.After(20 * time.Second):
Member commented:

I'm torn about keeping this timeout so short. 20s does not mean the O is bad: that would only ever hold if the runner was warm when the stream started, which isn't always the case until we implement something in the selection algorithm to only route to Os that already have the pipeline loaded (and are not, for example, just deploying or restarting the container, which is the 60s+ we've seen). Another problem could be loading any workflow but the default one. I'm not sure how fast comfystream would load new nodes and models, but it feels like it could easily take more than 20s.

So 20s may be too short for a threshold that actually blocks Os from being used again, but it could be OK if it's just the point at which we time out and restart the process with another O on the gateway side, without flagging the O as malfunctioning. I believe this would be the latter though, with a "malfunctioning flag", right? In that case I think it'd be better to have a higher threshold here. WDYT?
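
Whatever value is chosen, moving the hardcoded 20s behind the proposed firstSegmentTimeout flag (per the TODO) would keep the threshold tunable. A hypothetical sketch using the standard flag package; go-livepeer's actual flag plumbing may differ:

```go
import (
	"flag"
	"time"
)

// Hypothetical wiring: flag name from the PR description, default from
// the currently hardcoded value.
var firstSegmentTimeout = flag.Duration("firstSegmentTimeout", 20*time.Second,
	"Swap the Orchestrator if the first segment is not received within this duration")
```

The timeout branch would then read case <-time.After(*firstSegmentTimeout):.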

@@ -1090,6 +1090,9 @@ func submitLiveVideoToVideo(ctx context.Context, params aiRequestParams, sess *A
	}
	clog.V(common.VERBOSE).Infof(ctx, "pub %s sub %s control %s events %s", pub, sub, control, events)

	firstSegmentReceived := make(chan struct{})
Collaborator commented:

Maybe make this at least size 1 to minimize the chance of late writes blocking after a timeout
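
The suggested one-line change:

```go
// Capacity 1: the producer's single send completes even if the timeout
// already fired and the receiver is gone.
firstSegmentReceived := make(chan struct{}, 1)
```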
