Swap Orchestrator if no first segment received before timeout #3398
base: master
Changes from all commits
@@ -1090,6 +1090,9 @@
 	}
 	clog.V(common.VERBOSE).Infof(ctx, "pub %s sub %s control %s events %s", pub, sub, control, events)

+	firstSegmentReceived := make(chan struct{})
+	ctx, cancelCtx := context.WithCancel(ctx)
+
 	startControlPublish(control, params)
 	startTricklePublish(ctx, pub, params, sess)
 	startTrickleSubscribe(ctx, sub, params, func() {
@@ -1098,10 +1101,17 @@
 			monitor.AIFirstSegmentDelay(delayMs, sess.OrchestratorInfo)
 		}
 		clog.V(common.VERBOSE).Infof(ctx, "First Segment delay=%dms streamID=%s", delayMs, params.liveParams.streamID)
+		firstSegmentReceived <- struct{}{}
Review comment on `firstSegmentReceived <- struct{}{}`: Might need a select with a default in case the timeout has already fired, or a buffered chan.
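A minimal sketch of what that suggestion could look like, outside the actual codebase (the surrounding setup below is simulated; only the send/receive pattern matters): a capacity-1 channel plus a select with a default makes the callback's send safe even when the waiter has already timed out and gone away.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Capacity 1: a single send succeeds even if nobody is waiting anymore.
	firstSegmentReceived := make(chan struct{}, 1)

	// Stand-in for the trickle-subscribe callback, firing after the timeout.
	go func() {
		time.Sleep(50 * time.Millisecond)
		select {
		case firstSegmentReceived <- struct{}{}:
			// Signal delivered (or parked in the buffer).
		default:
			// Buffer full and nobody listening: drop the signal instead of
			// blocking this goroutine forever.
		}
	}()

	select {
	case <-firstSegmentReceived:
		fmt.Println("first segment received")
	case <-time.After(20 * time.Millisecond): // shortened timeout to force the race
		fmt.Println("timed out waiting for first segment")
	}

	time.Sleep(100 * time.Millisecond) // the late send above cannot block or panic
}
```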
 	})
 	startEventsSubscribe(ctx, events, params, sess)
-	return resp, nil
+	select {
+	case <-firstSegmentReceived:
+		cancelCtx()
Review comment on `cancelCtx()`: Did you mean to cancel on the other case?
+		return resp, nil
+	case <-time.After(20 * time.Second):
Review comment on `case <-time.After(20 * time.Second):`: I'm torn about keeping this timeout so short. 20s does not mean the O is bad, since staying under it is only really possible if the runner was warm when the stream started, which isn't always the case until we implement something in the selection algorithm to only route to Os that already have the pipeline loaded (and are not, for example, just deploying or restarting the container, which is the 60s+ we've seen). Another problem could be loading any workflow other than the default one; not sure how fast that is. So 20s may be too short for a threshold that actually blocks Os from being used again, but it could be OK if it's just a point where we time out and start the process with another O on the gateway side, without flagging the O as malfunctioning. I believe this would be the latter, though with a "malfunctioning" flag, right? In that case I think it'd be better to have a higher threshold here. WDYT?
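Taking this together with the cancel question above, a rough sketch of one way it could land, assuming the intent is for the stream to keep running once the first segment arrives (the `waitForFirstSegment` helper and the 60-second constant are made up for illustration, not part of the PR): leave the derived context alone on success, cancel it on the timeout path before retrying with another orchestrator, and name the threshold so it is easy to raise.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical named threshold standing in for the hardcoded 20s; the review
// argues for something higher if missing it also flags the O as malfunctioning.
const firstSegmentTimeout = 60 * time.Second

// waitForFirstSegment mirrors the shape of the new select in the diff, but
// with the cancel moved to the timeout branch: success leaves ctx running,
// timeout tears down everything tied to it before another O is tried.
func waitForFirstSegment(cancelCtx context.CancelFunc, firstSegmentReceived <-chan struct{}) error {
	select {
	case <-firstSegmentReceived:
		return nil
	case <-time.After(firstSegmentTimeout):
		cancelCtx()
		return errors.New("timeout waiting for first segment")
	}
}

func main() {
	_, cancelCtx := context.WithCancel(context.Background())
	firstSegmentReceived := make(chan struct{}, 1)
	firstSegmentReceived <- struct{}{} // pretend the segment already arrived
	fmt.Println(waitForFirstSegment(cancelCtx, firstSegmentReceived)) // prints <nil>
}
```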
+		return nil, errors.New("timeout waiting for first segment")
+	}
 }

 // extractMid extracts the mid (manifest ID) from the publish URL
Review comment on `firstSegmentReceived := make(chan struct{})`: Maybe make this at least size 1 to minimize the chance of late writes blocking after a timeout.
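If only the buffering change were applied (without the select/default shown earlier), it would presumably come down to the channel allocation; an illustrative one-liner, not the actual patch:

```go
// Capacity 1 lets the callback's single late send complete without a reader,
// so it cannot block the subscribe goroutine after the gateway has timed out.
firstSegmentReceived := make(chan struct{}, 1)
```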