
TOCTOU Bug while scaling down workers #855

Open
@BitTheByte

Description

The currently implemented deletion logic can be summarized as follows (a rough sketch of these steps follows the reference link below):

  1. The operator asks the scheduler to retire n workers.
  2. The scheduler retires them (their processes exit) and returns the names of the retired workers to the operator.
  3. The operator deletes the corresponding worker Deployments sequentially.

Ref: https://github.com/dask/dask-kubernetes/blob/main/dask_kubernetes/operator/controller/controller.py#L600-L611
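
Below is a minimal sketch of these three steps, not the actual controller code: it assumes an async `distributed.Client` pointed at the cluster's scheduler, the official `kubernetes` Python client, and a hypothetical `deployment_name_for()` helper that maps a retired worker to its Deployment name.

```python
from distributed import Client
from kubernetes import client as k8s, config as k8s_config


def deployment_name_for(worker: str) -> str:
    """Hypothetical mapping from a retired worker to its Deployment name."""
    ...


async def scale_down(scheduler_address: str, namespace: str, n: int) -> None:
    k8s_config.load_kube_config()
    apps = k8s.AppsV1Api()

    async with Client(scheduler_address, asynchronous=True) as dask_client:
        # Steps 1-2: ask the scheduler which workers to retire, then retire
        # them; their processes exit and the scheduler reports who was retired.
        to_close = await dask_client.scheduler.workers_to_close(n=n)
        retired = await dask_client.retire_workers(
            workers=to_close, close_workers=True
        )

    # Step 3: delete the matching worker Deployments one by one. The TOCTOU
    # window sits between steps 2 and 3: until a Deployment is deleted,
    # Kubernetes can bring its worker back up, and that worker may rejoin
    # the cluster.
    for worker in retired:
        apps.delete_namespaced_deployment(
            name=deployment_name_for(worker), namespace=namespace
        )
```

Because the deletions happen sequentially, workers later in the list stay exposed to that window for longer.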

However, between steps 2 and 3 Kubernetes may restart the worker deployment, so a new pod is created and joins the cluster for some time before the operator deletes the Deployment, effectively interrupting the worker mid-run.
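
One way to observe this window (illustrative only; the namespace and label selector below are assumptions about how worker pods are labelled, not confirmed values) is to list the worker pods right after retirement but before their Deployments are deleted: a retired worker's pod can already be back up, whether its container restarted in place or a replacement pod was created.

```python
from kubernetes import client as k8s, config as k8s_config

k8s_config.load_kube_config()
core = k8s.CoreV1Api()

pods = core.list_namespaced_pod(
    namespace="default",                         # assumed namespace
    label_selector="dask.org/component=worker",  # assumed label
)
for pod in pods.items:
    restarts = sum(cs.restart_count for cs in (pod.status.container_statuses or []))
    # A Running pod with a fresh restart (or a brand-new replacement pod) for a
    # worker that was just retired means it rejoined the cluster before its
    # Deployment was removed.
    print(pod.metadata.name, pod.status.phase, restarts)
```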
