Internal designs
- Changes the job status to `cancelling`.
- Tries to remove the job from the priority queue.
  - If the removal is successful and actually removed the requested entry (meaning the job isn't currently being processed), the API server updates the job status to `cancelled` and completes the protocol.
  - If the removal did not remove an entry from the priority queue (meaning the job is being processed), the API server sends a `cancel` event on the job-specific event channel and completes the protocol.
- When starting a job, the processor checks the status of the job:
  - If the status is `cancelling`: update the job status to `cancelled` and terminate the job processing. In this case the API server will send a `cancel` event that no processor will listen for. This is OK because events have a TTL.
  - If the status is `cancelled`: terminate the job processing.
  - If the status is anything else: continue processing the job.
- While processing the job, the processor listens for events on the job-specific event channel.
- Upon receiving a `cancel` event, the processor:
  - Stops the inference.
  - Cleans up resources (including closing the event channel).
  - Updates the job status to `cancelled`.
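
The processor side of the protocol above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: the in-memory `store` dict, the `queue.Queue` standing in for the job-specific event channel, and the names `run_job` and `steps` are all hypothetical.

```python
import queue


def run_job(store, job_id, events, steps):
    """Run `steps` units of work for a job, honouring cancellation.

    store:  dict mapping job_id -> status (hypothetical stand-in for the job store)
    events: queue.Queue standing in for the job-specific event channel
    Returns True if every step ran, False if the job was cancelled.
    """
    status = store[job_id]
    # Start-up check: the job may have been cancelled while it was queued.
    if status == "cancelling":
        store[job_id] = "cancelled"
        return False
    if status == "cancelled":
        return False

    for _ in range(steps):
        # Poll the event channel between units of work.
        try:
            event = events.get_nowait()
        except queue.Empty:
            event = None
        if event == "cancel":
            # Stop the inference, release resources (nothing to release in
            # this sketch), then record the final status.
            store[job_id] = "cancelled"
            return False
        # ... one unit of inference work would run here ...
    return True
```

Polling between units of work keeps the sketch single-threaded. A real processor would more likely block on the channel from a separate task so that a cancel event can interrupt long-running inference.
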
- The processor starts a re-enqueue procedure.
- It checks and finds no `cancel` event and proceeds to re-enqueue.
- Now the API server sends a `cancel` event.
- Eventually another processor will resume the job. When starting, it will see that the job is in `cancelling` status and will update the status to `cancelled`.
- The event has a TTL.
=> It looks like this edge case is covered by the protocol above.
- Worker receives a `cancel` event at the exact moment it is `finalizing` the batch of requests.
- There is a risk that the worker might inadvertently overwrite the `cancelling` status with `completed` during its final update.
=> Implement a status check (e.g., WHERE status = 'in_progress') to ensure the status only transitions from in_progress to completed, preventing it from overwriting a cancelling or cancelled state.
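
A guarded transition of this kind might look like the sketch below. It assumes a SQL-backed job store purely for illustration: the `jobs` table, its columns, the `complete_job` helper, and the use of SQLite are hypothetical and are not taken from the design.

```python
import sqlite3


def complete_job(conn, job_id):
    """Mark a job completed only if it is still in_progress.

    Returns True if the row was updated, False if some other status
    (for example cancelling or cancelled) was already in place.
    """
    cur = conn.execute(
        "UPDATE jobs SET status = 'completed' "
        "WHERE id = ? AND status = 'in_progress'",
        (job_id,),
    )
    conn.commit()
    # rowcount is the number of rows the UPDATE actually modified.
    return cur.rowcount == 1


# Hypothetical in-memory store with one running job and one being cancelled.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [("a", "in_progress"), ("b", "cancelling")],
)
print(complete_job(conn, "a"))  # True
print(complete_job(conn, "b"))  # False
```

Because the status check and the write happen in one statement, a concurrent cancellation cannot slip in between them. The worker can use the return value to decide whether to run its cancellation cleanup instead.
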
- A user sends multiple cancellation requests for the same `job_id` in rapid succession.
- If not handled, the API server might trigger multiple cancel events, potentially causing the processor to attempt resource cleanup or channel closure on an already closed resource (leading to panics).
=> The API server must verify the current job status before acting.
Cancellation is only permitted if the status is in a non-terminal, active state (pending, validating, in_progress, or finalizing).
If the job is already in a cancelling or cancelled state, the server should ignore the request or return an error indicating that the cancellation is already in progress.
If the job has reached a terminal state (completed, failed), the server should inform the user that the job is no longer running and cannot be cancelled.
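
Combined with the queue-removal step of the protocol, the status verification could be sketched as below. Everything here is an assumption for illustration: `request_cancel`, `CancelResult`, the dict used as a job store, the list used as a priority queue, and the list collecting events are hypothetical stand-ins.

```python
from enum import Enum

ACTIVE = {"pending", "validating", "in_progress", "finalizing"}
TERMINAL = {"completed", "failed"}


class CancelResult(Enum):
    CANCELLED = "cancelled"                    # removed from the queue before it ran
    CANCEL_REQUESTED = "cancel_requested"      # cancel event sent to the processor
    ALREADY_CANCELLING = "already_cancelling"  # duplicate request, nothing done
    NOT_CANCELLABLE = "not_cancellable"        # job already completed or failed


def request_cancel(store, pending_queue, events, job_id):
    """Handle one cancellation request on the API server side.

    store:         dict job_id -> status (hypothetical job store)
    pending_queue: list of queued job ids (hypothetical priority queue)
    events:        list collecting (job_id, event) pairs (hypothetical channels)
    """
    status = store[job_id]
    if status in ("cancelling", "cancelled"):
        return CancelResult.ALREADY_CANCELLING
    if status in TERMINAL:
        return CancelResult.NOT_CANCELLABLE
    if status not in ACTIVE:
        raise ValueError(f"unknown job status: {status}")

    store[job_id] = "cancelling"
    if job_id in pending_queue:
        # The job has not been picked up yet, so cancel it directly.
        pending_queue.remove(job_id)
        store[job_id] = "cancelled"
        return CancelResult.CANCELLED
    # The job is being processed: notify the processor exactly once.
    events.append((job_id, "cancel"))
    return CancelResult.CANCEL_REQUESTED
```

In this sketch the read of the status and the write of `cancelling` are two separate steps. With concurrent requests, a real store would need to make that transition atomic (for example with a conditional update) so that only one request wins.
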