[FEATURE] - Fetch stdout in launcher #242

diegocastanibm merged 14 commits into llm-d-incubation:main
Conversation
Please, can anyone add /copilot or /claude?
Pull request overview
Adds per-instance stdout/stderr log capture to the vLLM launcher so operators can fetch recent instance logs via a new REST endpoint.
Changes:
- Capture child process stdout/stderr into a bounded multiprocessing queue per instance.
- Add `GET /v2/vllm/instances/{instance_id}/logs` endpoint with `max_lines` query param.
- Extend tests and documentation to cover the new log retrieval functionality.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| inference_server/launcher/launcher.py | Implements queue-backed stdout/stderr capture, per-instance log retrieval, and a new logs API endpoint. |
| inference_server/launcher/tests/test_launcher.py | Adds unit/API tests for instance log retrieval and QueueWriter behavior. |
| docs/launcher.md | Documents the new logs endpoint and provides usage examples and log management notes. |
```python
self.output_queue = multiprocessing.Queue(maxsize=MAX_QUEUE_SIZE)
self.process = multiprocessing.Process(
    target=vllm_kickoff, args=(self.config, self.output_queue)
)
self.process.start()
```
output_queue is created on start() but never closed when the instance is stopped/removed. This can leak file descriptors/feeder threads over time if instances are created/deleted repeatedly. Consider closing the queue in stop() (e.g., close() + join_thread()), and setting self.output_queue = None after cleanup.
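A minimal sketch of the suggested cleanup, assuming the class and attribute names from the snippet above (the `Instance` class and its constructor here are illustrative, not the PR's actual code):

```python
import multiprocessing

class Instance:
    """Illustrative stand-in for the launcher's per-instance object."""

    def __init__(self, maxsize=1000):
        self.output_queue = multiprocessing.Queue(maxsize=maxsize)

    def stop(self):
        # Release the queue's pipe file descriptors and wait for its
        # feeder thread to exit, so repeated create/delete cycles do
        # not leak resources.
        if self.output_queue is not None:
            self.output_queue.close()
            self.output_queue.join_thread()
            self.output_queue = None
```

The key point is that `close()` alone does not wait for the background feeder thread; `join_thread()` (callable only after `close()`) does.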
```python
except Exception as e:
    logger.error(f"Failed to get logs for instance {instance_id}: {e}")
    raise HTTPException(status_code=500, detail=str(e))
```
The 500 response exposes detail=str(e) to API callers. That can leak internal exception messages/paths and makes the API contract unstable. Prefer returning a generic message (or an error code) and log the full exception server-side (ideally with stack trace via logger.exception).
Suggested change:

```diff
-except Exception as e:
-    logger.error(f"Failed to get logs for instance {instance_id}: {e}")
-    raise HTTPException(status_code=500, detail=str(e))
+except Exception:
+    logger.exception("Failed to get logs for instance %s", instance_id)
+    raise HTTPException(
+        status_code=500,
+        detail="Failed to retrieve logs for the requested instance",
+    )
```
See also #170, in which the chunks are identified by byte count rather than line count. I think that it would be better for the launcher to also identify chunks by byte count.
New queue based on bytes |
docs/launcher.md (Outdated)
```diff
-### 5. Testing
+### 5. Log Management
+
+The launcher captures stdout/stderr from each vLLM instance in memory using a byte-limited queue:
```
I am not a fan of this choice.
Of course the dual-pods controller will be updated to include periodic relaying of whatever additional log content has shown up.
Having the launcher keep the logs in memory forces a trade-off between
- memory usage by the launcher
- frequency of the polling by the dual-pods controller
... and the consequences of the trade-off depend on the logging rate, which is not under the control of any FMA software.
The design here introduces a requirement that the dual-pods controller act with a certain (again, unknown to the FMA software) frequency. That is very unusual in the Kubernetes milieu, which mainly only follows an eventual consistency paradigm.
All designs have trade-offs. We can discuss what design you would prefer in a meeting.
OK. I think that the obvious alternative is storing a log in a file.
This PR is getting extremely big. If I change the approach, it will be even bigger, with way more comments. I suggest addressing this concern in a different PR.
OK with me if we address this in a later PR.
docs/launcher.md (Outdated)
```markdown
- `start_byte` (optional): Byte position to start reading from (default: 0, minimum: 0). Use this to continue reading from where you left off.
- `max_bytes` (optional): Maximum bytes of log data to retrieve from start_byte (default: 1048576 (1 MB), range: 1024-10485760 (10 MB))
```
I used to know about this: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Range
Using a standard header is better than using bespoke query parameters.
Can be changed in a later PR.
Using the standard Range header would be cleaner, but better in a follow-up PR.
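A minimal sketch of how a standard `Range: bytes=start-end` header could map onto the existing `start_byte`/`max_bytes` semantics. The function name and default are assumptions; a real implementation would also need suffix ranges (`bytes=-N`) and 416 handling:

```python
import re

def parse_range(header, default_max=1_048_576):
    """Map 'bytes=start-end' (or 'bytes=start-') to (start_byte, max_bytes).

    Minimal sketch: supports only a single range spec.
    """
    m = re.fullmatch(r"bytes=(\d+)-(\d*)", header)
    if not m:
        raise ValueError("unsupported Range header")
    start = int(m.group(1))
    end = m.group(2)
    # An explicit end is inclusive, per RFC 9110; open-ended falls back
    # to the server's default response cap.
    max_bytes = (int(end) - start + 1) if end else default_max
    return start, max_bytes
```

For example, `Range: bytes=2097152-3145727` corresponds to the `start_byte=2097152&max_bytes=1048576` query in the docs.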
I was hoping that the log could be treated as simply a sequence of bytes. If the response to the get-log operation is of content-type JSON, then no: the response is JSON, and the JSON string datatype holds unicode characters. This introduces some complexities.
In my opinion this would be simpler if a log were simply a byte sequence. Then nothing in our code would need to worry about unicode encoding difficulties. This goes all the way to the response to a successful get-log operation being simply a sequence of bytes.
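A concrete illustration of the character-vs-byte mismatch being discussed (the log line itself is made up):

```python
# A hypothetical log line containing non-ASCII characters.
log_chunk = "métriques: latence=5ms\n"

# JSON strings count unicode characters; byte offsets count UTF-8 bytes.
chars = len(log_chunk)                  # 23 characters
nbytes = len(log_chunk.encode("utf-8"))  # 24 bytes ('é' is 2 bytes in UTF-8)
```

As soon as any log line contains a multi-byte character, character offsets and byte offsets diverge, which is why a raw byte-sequence response avoids the problem entirely.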
Unsigned commits detected! Please sign your commits. For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.
```shell
curl "http://localhost:8001/v2/vllm/instances/abc123.../log?start_byte=2097152&max_bytes=1048576"
```

**How `start_byte` Works:**
No, this omits complexities due to the difference between characters and bytes. A JSON string is a sequence of unicode characters. Byte 0xF0, for example, is not a unicode character.
Good catch. The docs say "use start_byte + len(log) as the next start_byte", but since the response is JSON, len(log) gives the number of unicode characters, not bytes. This is another instance of the character-vs-bytes problem you raised. It'll be resolved cleanly when we switch the response to application/octet-stream in the follow-up PR. For now I'll add a note in the docs clarifying that the client must encode the string back to UTF-8 to compute the correct byte length for the next start_byte.
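A sketch of the client-side computation described here, assuming a response body shaped like `{"log": "..."}` (the field name and function are assumptions for illustration):

```python
import json

def next_start_byte(start_byte, response_body):
    """Compute the next start_byte from a JSON log response.

    len() on the decoded string counts unicode characters, so the client
    must re-encode to UTF-8 to get the number of bytes actually consumed.
    """
    log_text = json.loads(response_body)["log"]
    return start_byte + len(log_text.encode("utf-8"))
```

Using `start_byte + len(log)` directly would under-count whenever the log contains multi-byte characters, silently re-reading or skipping bytes on the next request.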
Signed-off-by: Diego-Castan <[email protected]>
…rmat Signed-off-by: Diego-Castan <[email protected]>
- Remove redundant fields from log endpoint response (total_bytes, next_byte, instance_id, start_byte); clients can derive these from the request and the response log content
- Return 416 instead of 500 when start_byte is beyond available content, with a LogRangeNotAvailable exception
- Rewrite get_logs_from_queue to be truly byte-oriented: concatenate all messages into a flat byte stream and slice, instead of message-boundary-based skipping
- Update docs and tests to match the simplified API

Co-Authored-By: Claude Opus 4.6 <[email protected]>
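The byte-oriented slicing described in this commit can be sketched as follows (names and the drained-messages input are hypothetical; the real `get_logs_from_queue` reads its chunks from a multiprocessing queue):

```python
def get_logs_from_queue(messages, start_byte, max_bytes):
    """Byte-oriented slicing sketch.

    `messages` is a list of byte chunks already drained from the queue.
    Concatenating and slicing by absolute offsets means results never
    depend on where individual message boundaries fall.
    """
    stream = b"".join(messages)
    if start_byte > len(stream):
        # In the API this maps to HTTP 416 (Range Not Satisfiable).
        raise ValueError("start_byte beyond available content")
    return stream[start_byte:start_byte + max_bytes]
```

A slice request can therefore start or end in the middle of a captured message, which is exactly the property the previous message-boundary-based skipping lacked.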
Force-pushed from 8e453ba to 434d147
I'll do it in a different PR. The plan for the follow-up PR is to change the GET .../log response from application/json to application/octet-stream. I'll open a separate PR for this as I said; it is already difficult to handle all the changes in this PR.
The follow-up issue is here:
```python
MAX_LOG_RESPONSE_BYTES,
description="Maximum bytes of log data to retrieve",
ge=1024,
le=10 * 1024 * 1024,
```
Why is this 10 x MAX_LOG_RESPONSE_BYTES?
MAX_LOG_RESPONSE_BYTES is the default value (what's returned if the caller doesn't specify max_bytes). le is the maximum allowed value, enforced by FastAPI/Pydantic validation; defining it as 10x the default is just a reasonable upper bound.
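The default-vs-bound distinction can be illustrated without FastAPI; this sketch mirrors what the `ge`/`le` constraints enforce (the function name is hypothetical, and the 1 MB default is taken from the docs snippet above):

```python
MAX_LOG_RESPONSE_BYTES = 1024 * 1024  # 1 MB, the documented default

def validate_max_bytes(value=None):
    """Mirror the query-param validation: default when omitted, bounds enforced."""
    if value is None:
        return MAX_LOG_RESPONSE_BYTES  # the default, not the cap
    if not (1024 <= value <= 10 * MAX_LOG_RESPONSE_BYTES):
        # FastAPI would reject this with a 422 validation error.
        raise ValueError("max_bytes out of range")
    return value
```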
MikeSpreitzer
left a comment
This still needs work. It can be addressed in follow-on PRs.
Description:
#158
https://llm-d.slack.com/archives/C09TNPEFJUD/p1770675171058539
#231 (comment)
Test:
`test_launcher.py` has been modified to accommodate the new feature. All the tests have passed.

Launcher E2E test: