
Commit a3b266c

Temporary solution for llm-d#299 (llm-d#300)

While not particularly elegant, capturing the boolean returned by `wait_for_job` and ensuring `standup.sh` does not progress until it is `True` will at least prevent the `profile` and `decode` pods from starting while models are still being downloaded.

Signed-off-by: maugustosilva <maugusto.silva@gmail.com>
1 parent 3351411 commit a3b266c

File tree

2 files changed: +5 −1 lines changed

setup/functions.py

Lines changed: 1 addition & 0 deletions

@@ -461,6 +461,7 @@ async def wait_for_job(job_name, namespace, timeout=7200, dry_run: bool = False)
             return False
     except Exception as e:
         announce(f"Error occured while waiting for job {job_name} : {e}")
+        return False
     finally:
         await api_client.close()
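The hunk above makes the error path explicit: an exception while polling must surface as `False`, rather than falling out of the `except` block and implicitly returning `None`. A minimal runnable sketch of that pattern, using a hypothetical `wait_for_job_sketch` with a simulated failure flag in place of the real Kubernetes polling:

```python
import asyncio

async def wait_for_job_sketch(job_name: str, fail: bool) -> bool:
    """Toy stand-in for wait_for_job: returns True on success, False on error."""
    try:
        if fail:
            raise RuntimeError("simulated API error")
        return True
    except Exception as e:
        print(f"Error occurred while waiting for job {job_name} : {e}")
        return False  # the line this commit adds: report failure explicitly
    finally:
        pass  # the real function closes its Kubernetes api_client here

print(asyncio.run(wait_for_job_sketch("download-model", fail=True)))   # False
print(asyncio.run(wait_for_job_sketch("download-model", fail=False)))  # True
```

Without the added `return False`, the failing call would yield `None`, which the caller's `while not job_successful` loop would treat the same as `False` here, but which hides the error from any caller checking the value strictly.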

setup/steps/04_ensure_model_namespace_prepared.py

Lines changed: 4 additions & 1 deletion
@@ -127,12 +127,15 @@ def main():
         verbose=ev["control_verbose"]
     )

-    asyncio.run(wait_for_job(
+    job_successful = False
+    while not job_successful :
+        job_successful= asyncio.run(wait_for_job(
             job_name="download-model",
             namespace=ev["vllm_common_namespace"],
             timeout=ev["vllm_common_pvc_download_timeout"],
             dry_run=ev["control_dry_run"]
         ))
+        time.sleep(10)

     if is_openshift(api) and ev["user_is_admin"] :
         # vllm workloads may need to run as a specific non-root UID , the default SA needs anyuid
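The standup-side change is a blocking retry loop: keep re-running the wait until it reports success. A self-contained sketch, where `wait_for_job_stub` is a stand-in for the real `wait_for_job` (here it succeeds on the third attempt) and a short sleep replaces the commit's `time.sleep(10)` between attempts:

```python
import asyncio
import time

attempts = 0

async def wait_for_job_stub() -> bool:
    """Stand-in for wait_for_job: fails twice, then succeeds."""
    global attempts
    attempts += 1
    return attempts >= 3

job_successful = False
while not job_successful:
    job_successful = asyncio.run(wait_for_job_stub())
    time.sleep(0.01)  # the commit sleeps 10s between attempts

print(attempts)  # 3
```

As in the commit, the sleep sits inside the loop body after the call, so one final sleep happens even on the successful attempt; that is a cosmetic cost the "temporary solution" accepts.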
