Self-annotation on launcher pods to signal hosted instance changes #391
waltforme merged 11 commits into llm-d-incubation:main
Conversation
Pull request overview
Adds a launcher-pod sidecar that periodically polls the launcher’s /v2/vllm/instances endpoint and publishes a signature of the hosted vLLM instance state onto the enclosing Pod via an annotation, enabling the dual-pods controller to observe instance changes through Pod watch events (stage 1 of #375).
Changes:
- Inject a `vllm-instance-notifier` sidecar into launcher pods during pod construction.
- Add `launcher_pod_notifier.py` to compute a stable signature of launcher instance state and patch it onto the pod annotation.
- Update launcher Dockerfiles and e2e RBAC so the notifier script is present in images and can patch pod annotations in e2e.
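The poll-and-publish behavior described above can be sketched roughly as follows. This is a simplified illustration, not the actual `launcher_pod_notifier.py`: the annotation key, the poll interval, and the injectable `fetch_state`/`patch_annotation` callables (standing in for a GET on `/v2/vllm/instances` and a PATCH on the enclosing Pod) are assumptions for illustration.

```python
import hashlib
import json
import time
from typing import Any, Callable, Optional

# Hypothetical annotation key; the real key is defined by the PR, not here.
ANNOTATION_KEY = "llm-d.ai/instance-signature"


def signature_of(payload: dict[str, Any]) -> str:
    """Hash a canonical JSON rendering of the launcher state."""
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def notify_loop(
    fetch_state: Callable[[], dict[str, Any]],     # stand-in for GET /v2/vllm/instances
    patch_annotation: Callable[[str, str], None],  # stand-in for PATCH on the Pod
    poll_interval: float = 2.0,
    max_iterations: Optional[int] = None,          # None = run forever, as a sidecar would
) -> None:
    """Poll the launcher and publish the signature only when it changes."""
    last_signature: Optional[str] = None
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        sig = signature_of(fetch_state())
        if sig != last_signature:
            # Publishing only on change keeps Pod watch events sparse.
            patch_annotation(ANNOTATION_KEY, sig)
            last_signature = sig
        if max_iterations is None or iterations < max_iterations:
            time.sleep(poll_interval)
```

The publish-on-change check is what lets the dual-pods controller treat a Pod watch event as "something about the hosted instances changed" rather than "the notifier ticked".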
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/e2e/run-launcher-based.sh | Grants the launcher test ServiceAccount pods/get + pods/patch so the notifier can self-annotate in e2e. |
| pkg/controller/utils/pod-helper.go | Unconditionally injects the notifier sidecar into launcher pods. |
| inference_server/launcher/launcher_pod_notifier.py | Implements polling + signature computation + pod annotation patching logic. |
| dockerfiles/Dockerfile.launcher.cpu | Copies the notifier script into the launcher CPU image. |
| dockerfiles/Dockerfile.launcher.benchmark | Copies the notifier script into the launcher benchmark image. |
```python
import hashlib
import json
from typing import Any


def canonicalize_launcher_state(payload: dict[str, Any]) -> dict[str, Any]:
    instances = payload.get("instances", [])
    canonical_instances: list[dict[str, str]] = []
    for instance in instances:
        if not isinstance(instance, dict):
            raise ValueError(f"unexpected instance entry: {instance!r}")
        instance_id = str(instance.get("instance_id", ""))
        status = str(instance.get("status", ""))
        canonical_instances.append({"instance_id": instance_id, "status": status})
    canonical_instances.sort(key=lambda item: (item["instance_id"], item["status"]))
    return {
        "total_instances": int(payload.get("total_instances", len(canonical_instances))),
        "running_instances": int(payload.get("running_instances", 0)),
        "instances": canonical_instances,
    }


def compute_signature(payload: dict[str, Any]) -> str:
    canonical = canonicalize_launcher_state(payload)
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```
This new notifier introduces non-trivial logic (canonicalization, signature computation, publish-on-change behavior) but no accompanying unit tests. Given there is already an inference_server/launcher/tests/test_launcher.py suite, adding targeted tests for canonicalize_launcher_state/compute_signature (ordering stability, signature change on status change, invalid payload handling) would help prevent regressions.
I think that additional test cases are indeed warranted. I would not complain if they are added in a follow-on PR.
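A sketch of what the suggested tests might look like. The payload shapes follow the diff above; the test names and pytest-style layout are assumptions, and the two functions are copied inline here only so the sketch is self-contained (a real suite would import them from `launcher_pod_notifier`).

```python
import hashlib
import json
from typing import Any


# Copied from the diff under review so this sketch is self-contained.
def canonicalize_launcher_state(payload: dict[str, Any]) -> dict[str, Any]:
    instances = payload.get("instances", [])
    canonical_instances: list[dict[str, str]] = []
    for instance in instances:
        if not isinstance(instance, dict):
            raise ValueError(f"unexpected instance entry: {instance!r}")
        canonical_instances.append(
            {"instance_id": str(instance.get("instance_id", "")),
             "status": str(instance.get("status", ""))}
        )
    canonical_instances.sort(key=lambda item: (item["instance_id"], item["status"]))
    return {
        "total_instances": int(payload.get("total_instances", len(canonical_instances))),
        "running_instances": int(payload.get("running_instances", 0)),
        "instances": canonical_instances,
    }


def compute_signature(payload: dict[str, Any]) -> str:
    canonical = canonicalize_launcher_state(payload)
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def test_ordering_is_stable():
    # The signature must not depend on the order the launcher lists instances.
    a = {"instances": [{"instance_id": "i1", "status": "Running"},
                       {"instance_id": "i2", "status": "Pending"}]}
    b = {"instances": list(reversed(a["instances"]))}
    assert compute_signature(a) == compute_signature(b)


def test_signature_changes_on_status_change():
    before = {"instances": [{"instance_id": "i1", "status": "Pending"}]}
    after = {"instances": [{"instance_id": "i1", "status": "Running"}]}
    assert compute_signature(before) != compute_signature(after)


def test_invalid_payload_is_rejected():
    try:
        canonicalize_launcher_state({"instances": ["not-a-dict"]})
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a non-dict instance entry")
```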
```go
const (
	controllerQueuePerItemRetryMaxDelay = 20 * time.Second
)
```
Why is this needed?
Would a larger value be adequate?
Now that more events from the launchers are reported, the retry interval grows faster. The cap is needed to reduce the slack between "vLLM instance ready" and the controller's next retry.
I think that we need an even more aggressive solution to that problem. This latency is on the critical path that we want to minimize. OK if we address this in a later PR.
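The effect under discussion can be illustrated with a capped exponential backoff. Only the 20 s cap appears in the diff; the base delay and function name below are illustrative assumptions, not the controller's actual rate limiter.

```python
def retry_delay(retries: int, base_delay: float = 0.005, max_delay: float = 20.0) -> float:
    """Per-item requeue delay: exponential in the retry count, capped at max_delay.

    base_delay is an illustrative assumption; max_delay mirrors the 20 s
    controllerQueuePerItemRetryMaxDelay constant in the diff above.
    """
    return min(base_delay * (2 ** retries), max_delay)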
```go
	},
}

if idx >= 0 {
```
I would rather define this case as a user error, to be reflected in .status.errors of the LauncherConfig.
Make that change later, in another PR?
MikeSpreitzer left a comment:
I left some independent comments.
/ok-to-test

🚀 E2E tests triggered by /ok-to-test
Force-push to 4d313ac was a rebase onto main.
/ok-to-test

🚀 E2E tests triggered by /ok-to-test
As demonstrated in https://github.com/llm-d-incubation/llm-d-fast-model-actuation/actions/runs/23663224910/job/68939317606#step:18:860, the pod log dumping job step needs enhancing so that the logs of both containers in the launcher Pod are dumped.

Enhanced.
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
Force-push to 3f4b6cb was due to a rebase onto main.
/ok-to-test

🚀 E2E tests triggered by /ok-to-test
What is causing this log noise? https://github.com/llm-d-incubation/llm-d-fast-model-actuation/actions/runs/23764698262/job/69242210800#step:18:1625
It would be better if this error message indicated which URL was involved. https://github.com/llm-d-incubation/llm-d-fast-model-actuation/actions/runs/23764698262/job/69242210800#step:18:1632
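One way to carry the URL into such an error message, sketched with the standard library. The function name and shape are hypothetical, not the notifier's actual code; the point is only that wrapping the failure names the endpoint.

```python
import urllib.error
import urllib.request


def fetch_launcher_state(url: str, timeout: float = 5.0) -> bytes:
    """Fetch the launcher's instance listing, naming the URL in any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, OSError) as exc:
        # Including the URL makes a CI log line like the one above self-explanatory.
        raise RuntimeError(f"failed to fetch launcher state from {url}: {exc}") from exc
```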
/ok-to-test

🚀 E2E tests triggered by /ok-to-test
Turns out that the
/ok-to-test

🚀 E2E tests triggered by /ok-to-test
This reverts commit 4d87e7e.
Looks like the noisy message has been there since before this PR, e.g.
Re https://github.com/llm-d-incubation/llm-d-fast-model-actuation/actions/runs/23776211419/job/69278559364?pr=391#step:9:21 - IMHO that linter is more picky than we need.
It always is, IMHO.
/ok-to-test

🚀 E2E tests triggered by /ok-to-test
MikeSpreitzer left a comment:
This is good enough to merge.
This PR starts to address stage 1 of #375.