Internal designs

Jooyeon Mok edited this page Jan 28, 2026 · 3 revisions

Protocol for cancellation of a batch job

API server

  1. Changes the job status to cancelling.
  2. Tries to remove the job from the priority queue.
  3. If the removal actually removed the requested entry (meaning the job isn't currently being processed), the API server updates the job status to cancelled and completes the protocol.
  4. If the removal did not remove an entry from the priority queue (meaning the job is being processed), the API server sends a cancel event on the job-specific event channel and completes the protocol.
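
The API server steps above can be sketched as follows. This is a minimal in-memory illustration, not the actual implementation: InMemoryBackend, remove_from_queue, publish, and cancel_job are hypothetical names, and the real job store, priority queue, and event channel are assumed to be external services.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryBackend:
    """Hypothetical stand-in for the job store, priority queue, and event channels."""
    status: dict = field(default_factory=dict)  # job_id -> status string
    queue: list = field(default_factory=list)   # job_ids still waiting in the queue
    events: dict = field(default_factory=dict)  # job_id -> events sent on its channel

    def remove_from_queue(self, job_id: str) -> bool:
        """Return True only if an entry was actually removed."""
        try:
            self.queue.remove(job_id)
            return True
        except ValueError:
            return False

    def publish(self, job_id: str, event: str) -> None:
        """Append an event to the job-specific channel (TTL handling omitted)."""
        self.events.setdefault(job_id, []).append(event)


def cancel_job(backend: InMemoryBackend, job_id: str) -> str:
    """Run the API server side of the cancellation protocol; return the new status."""
    # Step 1: mark the job as cancelling.
    backend.status[job_id] = "cancelling"
    # Step 2: try to remove the job from the priority queue.
    if backend.remove_from_queue(job_id):
        # Step 3: the job was still queued, so no processor owns it.
        backend.status[job_id] = "cancelled"
    else:
        # Step 4: a processor has the job; notify it on the job-specific channel.
        backend.publish(job_id, "cancel")
    return backend.status[job_id]
```

In this sketch a queued job ends up cancelled immediately, while a job that is already being processed stays in cancelling until the processor reacts to the event.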

Processor

  1. When starting a job, check the status of the job:
    1. If the status is cancelling: update the job status to cancelled and terminate the job processing. In this case the API server will send a cancel event that no processor will listen for. This is OK because events have a TTL.
    2. If the status is cancelled: terminate the job processing.
    3. If the status is anything else: continue processing the job.
  2. While processing the job, the processor listens for events on the job-specific event channel.
  3. Upon receiving a cancel event, the processor:
    1. Stops the inference.
    2. Cleans up resources (including closing the event channel).
    3. Updates the job status to cancelled.
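
The processor steps above can be sketched as follows. This is a simplified illustration under several assumptions: process_job is a hypothetical name, the status store is a plain dict, the event channel is modelled as a queue.Queue that is polled between work items (a real processor would more likely listen concurrently), and resource cleanup is reduced to a comment.

```python
import queue


def process_job(status: dict, job_id: str, events: queue.Queue, work_items: list) -> list:
    """Process a job while honouring cancellation; return the items that were processed."""
    # Step 1: check the job status before doing any work.
    current = status.get(job_id)
    if current == "cancelling":
        status[job_id] = "cancelled"
        return []
    if current == "cancelled":
        return []

    status[job_id] = "in_progress"  # assumption: the processor sets this on start
    done = []
    for item in work_items:
        # Step 2: look for an event on the job-specific channel.
        try:
            event = events.get_nowait()
        except queue.Empty:
            event = None
        if event == "cancel":
            # Step 3: stop the inference, clean up resources, mark the job cancelled.
            # (Cleanup, including closing the event channel, is omitted here.)
            status[job_id] = "cancelled"
            return done
        done.append(item)  # placeholder for running inference on one request

    # Only move to completed from in_progress, never over a cancellation state.
    if status[job_id] == "in_progress":
        status[job_id] = "completed"
    return done
```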

Edge cases

Case 1

  1. The processor starts a re-enqueue procedure.
  2. It checks and finds no cancel event and proceeds to re-enqueue.
  3. Now the API server sends a cancel event.
  4. Eventually another processor will resume the job. When starting, it will see that the job is in cancelling status and will update the status to cancelled.
  5. The cancel event that nobody consumed expires, because events have a TTL.

=> It appears that this edge case is covered by the protocol above.

Case 2: Race between Success and Cancellation

  1. The worker receives a cancel event at the exact moment it is finalizing the batch of requests.
  2. There is a risk that the worker might inadvertently overwrite the cancelling status with completed during its final update.

=> Implement a status check (e.g., WHERE status = 'in_progress') to ensure the status only transitions from in_progress to completed, preventing it from overwriting a cancelling or cancelled state.
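
A minimal sketch of such a guarded update, using SQLite purely for illustration. The jobs table, its id and status columns, and the make_demo_db and mark_completed helpers are assumptions; the actual storage layer may differ.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a hypothetical in-memory jobs table with two sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO jobs (id, status) VALUES (?, ?)",
        [("job-a", "in_progress"), ("job-b", "cancelling")],
    )
    conn.commit()
    return conn


def mark_completed(conn: sqlite3.Connection, job_id: str) -> bool:
    """Move a job to completed only if it is still in_progress.

    Returns True if the row was updated, False if the status had already
    changed (for example to cancelling or cancelled).
    """
    cur = conn.execute(
        "UPDATE jobs SET status = 'completed' WHERE id = ? AND status = 'in_progress'",
        (job_id,),
    )
    conn.commit()
    return cur.rowcount == 1
```

Because the check and the write happen in a single UPDATE statement, a cancellation that lands just before the final update leaves the row untouched, and the worker can tell from the False return value that it should follow the cancellation path instead.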

Case 3: User sends multiple cancellation requests

  1. A user sends multiple cancellation requests for the same job_id in rapid succession.
  2. If not handled, the API server might trigger multiple cancel events, potentially causing the processor to attempt resource cleanup or channel closure on an already-closed resource (leading to panics).

=> The API server must verify the current job status before acting. Cancellation is only permitted if the status is in a non-terminal, active state (pending, validating, in_progress, or finalizing). If the job is already in a cancelling or cancelled state, the server should ignore the request or return an error indicating that the cancellation is already in progress. If the job has reached a terminal state (completed, failed), the server should inform the user that the job is no longer running and cannot be cancelled.
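
The status check described above could look roughly like this. The status names come from the text; the function name check_cancel_request and the returned labels are hypothetical.

```python
# Status groups as listed in the text above.
CANCELLABLE = frozenset({"pending", "validating", "in_progress", "finalizing"})
ALREADY_CANCELLING = frozenset({"cancelling", "cancelled"})
TERMINAL = frozenset({"completed", "failed"})


def check_cancel_request(status: str) -> str:
    """Decide how the API server should answer a cancellation request.

    Returns one of three hypothetical labels:
      "proceed"            - run the cancellation protocol
      "already_cancelling" - ignore the request or report cancellation in progress
      "not_cancellable"    - the job finished and cannot be cancelled
    """
    if status in CANCELLABLE:
        return "proceed"
    if status in ALREADY_CANCELLING:
        return "already_cancelling"
    if status in TERMINAL:
        return "not_cancellable"
    raise ValueError(f"unknown job status: {status!r}")
```

Two requests arriving at the same moment could both pass a plain read-then-act check, so the check and the transition to cancelling would likely need to happen as one atomic step in the job store.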