Swap Orchestrator if no first segment received before timeout #3398

Draft · wants to merge 1 commit into master
Conversation

@leszko (Contributor) commented Feb 18, 2025

Introduce a timeout for the first segment: if the Gateway does not receive the first segment within the specified timeout, it goes back to the selection logic and selects a different Orchestrator.

We would then have two timeout values defined as flags (see the sketch after the TODO list):

  • firstSegmentTimeout => Timeout to swap the Orchestrator if the first segment was not received
  • aiProcessingRetryTimeout => Timeout for the whole Gateway processing; if the Gateway cannot process the request from the Orchestrator within that time, it fails the whole processing (currently set to 45s in staging/prod)

TODO:

  • Move the hardcoded 20s into the firstSegmentTimeout flag
  • Decide on the values of firstSegmentTimeout and aiProcessingRetryTimeout
  • Use a context to cancel all trickle and ffmpeg processing when swapping the O
  • Add a log line and a metric for malfunctioning Os
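
A rough sketch of how the flow could look. Everything here is illustrative: firstSegmentTimeout would come from the proposed flag, and retrySelection is a hypothetical stand-in for going back to the Gateway's selection logic.

```go
package server

import (
	"context"
	"time"
)

// awaitFirstSegment waits for the first segment or gives up on this
// Orchestrator. Sketch only; not the PR's actual implementation.
func awaitFirstSegment(ctx context.Context, firstSegmentReceived <-chan struct{},
	firstSegmentTimeout time.Duration, cancelCtx context.CancelFunc,
	retrySelection func() error) error {
	select {
	case <-firstSegmentReceived:
		return nil // segment arrived in time, keep this Orchestrator
	case <-time.After(firstSegmentTimeout):
		cancelCtx()             // tear down trickle/ffmpeg work tied to this O
		return retrySelection() // hypothetical: select and retry with a different O
	case <-ctx.Done():
		return ctx.Err() // overall processing timeout or caller cancellation
	}
}
```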

@github-actions bot added the labels go (Pull requests that update Go code) and AI (Issues and PRs related to the AI-video branch) on Feb 18, 2025

codecov bot commented Feb 18, 2025

Codecov Report

Attention: Patch coverage is 0% with 10 lines in your changes missing coverage. Please review.

Project coverage is 32.13699%. Comparing base (39db9b6) to head (cf7a1b1).

Files with missing lines   Patch %    Lines
server/ai_process.go       0.00000%   10 Missing ⚠️

@@                 Coverage Diff                 @@
##              master       #3398         +/-   ##
===================================================
- Coverage   32.14408%   32.13699%   -0.00709%     
===================================================
  Files            147         147                 
  Lines          40754       40763          +9     
===================================================
  Hits           13100       13100                 
- Misses         26880       26889          +9     
  Partials         774         774                 
Files with missing lines   Coverage Δ
server/ai_process.go       0.58774% <0.00000%> (-0.00448%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


return resp, nil
select {
case <-firstSegmentReceived:
	cancelCtx()
Member commented:

Did you mean to cancel on the other case here, the timeout? IIUC this will cancel all trickle subs for the stream.
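
A sketch of what this suggestion would look like, with cancellation moved to the timeout branch so a healthy stream keeps its subscriptions:

```go
select {
case <-firstSegmentReceived:
	// First segment arrived: leave the context alive so the
	// stream's trickle subscriptions keep running.
	return resp, nil
case <-time.After(20 * time.Second):
	// Swap path: cancellation only tears down the stalled Orchestrator's work.
	cancelCtx()
	// ...go back to selection and retry with another O...
}
```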

@@ -1098,10 +1101,17 @@ func submitLiveVideoToVideo(ctx context.Context, params aiRequestParams, sess *A
		monitor.AIFirstSegmentDelay(delayMs, sess.OrchestratorInfo)
	}
	clog.V(common.VERBOSE).Infof(ctx, "First Segment delay=%dms streamID=%s", delayMs, params.liveParams.streamID)
	firstSegmentReceived <- struct{}{}
Member commented:

Might need a select with default in case the timeout has already fired, or a buffered chan.
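
A sketch of the select-with-default variant, which drops the signal instead of blocking the goroutine when nobody is left to receive it:

```go
select {
case firstSegmentReceived <- struct{}{}:
	// The waiter was still there; signal delivered.
default:
	// The timeout already fired and the waiter is gone; drop the signal.
}
```

The buffered alternative (make(chan struct{}, 1)) achieves the same with a plain send, since the single send completes even with no receiver.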

case <-firstSegmentReceived:
	cancelCtx()
	return resp, nil
case <-time.After(20 * time.Second):
Member commented:

I'm torn about keeping this timeout so short. 20s does not mean the O is bad: that would only ever hold if the runner was warm when the stream started, which isn't always the case until we implement something in the selection algorithm to only route to Os that already have the pipeline loaded (and are not, for example, just deploying or restarting the container, which is the 60s+ we've seen). Another problem could be loading any workflow but the default one. I'm not sure how fast comfystream would load new nodes and models, but it feels like it could easily take more than 20s.

So 20s may be too short for a threshold that actually blocks Os from being used again, but it could be OK if it's just the point at which we time out and restart the process with another O on the gateway side, without flagging the O as malfunctioning. I believe this would be the latter though, with a "malfunctioning flag", right? In that case I think it'd be better to have a higher threshold here. WDYT?
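
Whatever value is chosen, moving the hardcoded 20s behind the proposed firstSegmentTimeout flag (per the TODO) would keep the threshold tunable. A hypothetical sketch using the standard flag package; go-livepeer's actual flag plumbing may differ:

```go
import (
	"flag"
	"time"
)

// Hypothetical wiring: flag name from the PR description, default from
// the currently hardcoded value.
var firstSegmentTimeout = flag.Duration("firstSegmentTimeout", 20*time.Second,
	"Swap the Orchestrator if the first segment is not received within this duration")
```

The timeout branch would then read case <-time.After(*firstSegmentTimeout):.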

@@ -1090,6 +1090,9 @@ func submitLiveVideoToVideo(ctx context.Context, params aiRequestParams, sess *A
	}
	clog.V(common.VERBOSE).Infof(ctx, "pub %s sub %s control %s events %s", pub, sub, control, events)

	firstSegmentReceived := make(chan struct{})
Collaborator commented:

Maybe make this at least size 1 to minimize the chance of late writes blocking after a timeout
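
The suggested one-line change:

```go
// Capacity 1: the producer's single send completes even if the timeout
// already fired and the receiver is gone.
firstSegmentReceived := make(chan struct{}, 1)
```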
