Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Swap Orchestrator if no first segment recevied before timeout #3398

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 11 additions & 1 deletion server/ai_process.go
Original file line number Diff line number Diff line change
Expand Up @@ -1090,6 +1090,9 @@
}
clog.V(common.VERBOSE).Infof(ctx, "pub %s sub %s control %s events %s", pub, sub, control, events)

firstSegmentReceived := make(chan struct{})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe make this at least size 1 to minimize the chance of late writes blocking after a timeout

ctx, cancelCtx := context.WithCancel(ctx)

Check warning on line 1095 in server/ai_process.go

View check run for this annotation

Codecov / codecov/patch

server/ai_process.go#L1093-L1095

Added lines #L1093 - L1095 were not covered by tests
startControlPublish(control, params)
startTricklePublish(ctx, pub, params, sess)
startTrickleSubscribe(ctx, sub, params, func() {
Expand All @@ -1098,10 +1101,17 @@
monitor.AIFirstSegmentDelay(delayMs, sess.OrchestratorInfo)
}
clog.V(common.VERBOSE).Infof(ctx, "First Segment delay=%dms streamID=%s", delayMs, params.liveParams.streamID)
firstSegmentReceived <- struct{}{}

Check warning on line 1104 in server/ai_process.go

View check run for this annotation

Codecov / codecov/patch

server/ai_process.go#L1104

Added line #L1104 was not covered by tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might need a select with default in case timeout has already fired, or a buffered chan


})
startEventsSubscribe(ctx, events, params, sess)
return resp, nil
select {
case <-firstSegmentReceived:
cancelCtx()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to cancel on the other case here, of timeout? IIUC this will cancel all trickle subs from the stream

return resp, nil
case <-time.After(20 * time.Second):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm torn between keeping this timeout so short. 20s does not mean the O is bad, since that will only ever be possible if the runner was warm when the stream started, which isn't always the case until we implement sth on the selection algorithm for that, to only route to Os that do have the pipeline already loaded (and are not just deploying or restarting the container for example, which is the 60s+ we've seen). Another problem could be if we are loading any workflow but the default one. Not sure how fast comfystream would load new nodes and models, but it feels like it could easily take more than 20s.

So maybe 20s would be too short for a threshold that actually blocks Os from being used again, but could be ok if it's just a time that we timeout and start the process with another O on the gateway side, but not flagging the O as malfunctioning. I believe this would be the latter tho, with a "malfunctioning flag", right? In that case I think it'd be better to have a higher threshold here WDYT?

return nil, errors.New("timeout waiting for first segment")

Check warning on line 1113 in server/ai_process.go

View check run for this annotation

Codecov / codecov/patch

server/ai_process.go#L1108-L1113

Added lines #L1108 - L1113 were not covered by tests
}
}

// extractMid extracts the mid (manifest ID) from the publish URL
Expand Down
Loading