Commit fdf1c56

fix: planner cleanup on job complete
Whenever a job is stopped, fails, or completes, the planner still keeps a record of the job's current pipelines. When the same job (or a new job with the same id) is restarted, the planner can therefore wrongly decide to scale_in whenever the queue level would allow a scale_in. To fix this, the planner now cleans up its pipeline data whenever the job context performs its cleanup.
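
To see the failure mode concretely, here is a minimal, self-contained sketch. Only pipeline_data and remove_pipeline_data mirror the real Planner; the PipelineData stub, the decide helper, and the queue_level threshold are hypothetical stand-ins for illustration.

class PipelineData:
    """Stand-in stub for the real PipelineData record."""


class Planner:
    def __init__(self) -> None:
        # keyed by job id, as in the real planner
        self.pipeline_data: dict[str, list[PipelineData]] = {}

    def remove_pipeline_data(self, job_id: str) -> None:
        """Remove pipeline data for job id."""
        if job_id in self.pipeline_data:
            del self.pipeline_data[job_id]

    def decide(self, job_id: str, queue_level: int) -> str:
        # Hypothetical decision rule: with a low queue level and more
        # than one recorded pipeline, scaling in looks justified.
        pipelines = self.pipeline_data.get(job_id, [])
        if queue_level < 10 and len(pipelines) > 1:
            return "scale_in"
        return "no_op"


planner = Planner()
planner.pipeline_data["job-1"] = [PipelineData(), PipelineData()]

# The job stops, but without cleanup its record survives. On restart
# with the same job id, the stale entries trigger a wrong scale_in:
print(planner.decide("job-1", queue_level=0))  # scale_in (wrong)

# With this commit, job_context.cleanup() drops the record first:
planner.remove_pipeline_data("job-1")
print(planner.decide("job-1", queue_level=0))  # no_op
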
1 parent: b275071

File tree

2 files changed: +6 −0 lines changed

infscale/controller/job_context.py
infscale/controller/planner.py

infscale/controller/job_context.py

Lines changed: 1 addition & 0 deletions
@@ -1270,6 +1270,7 @@ def cleanup(self) -> None:
         self._new_cfg = None
         self._flow_graph_patched = False
         self._worlds_conflict_count = {}
+        self.ctrl.planner.remove_pipeline_data(self.job_id)
 
     def _release_gpu_resources(self, agent_data: AgentMetaData) -> None:
         resources = self.ctrl.agent_contexts[agent_data.id].resources
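
Hooking the call into cleanup() keeps the reset in one place: the same method already clears the context's own per-run state (_new_cfg, _flow_graph_patched, _worlds_conflict_count), so the planner's record for the job id is dropped at exactly the moment the rest of the job's state is discarded.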

infscale/controller/planner.py

Lines changed: 5 additions & 0 deletions
@@ -110,6 +110,11 @@ def __init__(self, path: str, autoscale: bool) -> None:
 
         self.pipeline_data: dict[str, list[PipelineData]] = {}
 
+    def remove_pipeline_data(self, job_id: str) -> None:
+        """Remove pipeline data for job id."""
+        if job_id in self.pipeline_data:
+            del self.pipeline_data[job_id]
+
     def update_pipeline_data(self, wids_to_remove: set[str], job_id: str) -> None:
         """Update pipeline data based on worker ids."""
         if job_id not in self.pipeline_data:

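The membership check makes remove_pipeline_data idempotent: calling it for a job id the planner never tracked, or one already removed, is a silent no-op rather than a KeyError. An equivalent one-line form would be dict.pop with a default (a stylistic alternative, not what the commit uses):

self.pipeline_data.pop(job_id, None)  # remove the entry if present, ignore otherwise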