[FEATURE] - Fetch stdout in launcher #242

Merged
diegocastanibm merged 14 commits into llm-d-incubation:main from
diegocastanibm:fetch-stdout
Feb 19, 2026
Conversation

@diegocastanibm
Collaborator

@diegocastanibm diegocastanibm commented Feb 10, 2026

Description:
#158
https://llm-d.slack.com/archives/C09TNPEFJUD/p1770675171058539
#231 (comment)

Test:
test_launcher.py has been modified to accommodate the new feature. All the tests have passed.

Launcher E2E test:

Screenshot 2026-02-16 at 5 17 36 PM

@diegocastanibm diegocastanibm marked this pull request as ready for review February 10, 2026 19:55
@diegocastanibm
Collaborator Author

diegocastanibm commented Feb 10, 2026

Could someone please add /copilot or /claude?

Contributor

Copilot AI left a comment


Pull request overview

Adds per-instance stdout/stderr log capture to the vLLM launcher so operators can fetch recent instance logs via a new REST endpoint.

Changes:

  • Capture child process stdout/stderr into a bounded multiprocessing queue per instance.
  • Add GET /v2/vllm/instances/{instance_id}/logs endpoint with max_lines query param.
  • Extend tests and documentation to cover the new log retrieval functionality.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File Description
inference_server/launcher/launcher.py Implements queue-backed stdout/stderr capture, per-instance log retrieval, and a new logs API endpoint.
inference_server/launcher/tests/test_launcher.py Adds unit/API tests for instance log retrieval and QueueWriter behavior.
docs/launcher.md Documents the new logs endpoint and provides usage examples and log management notes.

Comment on lines +91 to 95
self.output_queue = multiprocessing.Queue(maxsize=MAX_QUEUE_SIZE)
self.process = multiprocessing.Process(
    target=vllm_kickoff, args=(self.config, self.output_queue)
)
self.process.start()

Copilot AI Feb 10, 2026


output_queue is created on start() but never closed when the instance is stopped/removed. This can leak file descriptors/feeder threads over time if instances are created/deleted repeatedly. Consider closing the queue in stop() (e.g., close() + join_thread()), and setting self.output_queue = None after cleanup.
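The cleanup Copilot suggests could look like the sketch below. `Instance`, `vllm_kickoff`, and `MAX_QUEUE_SIZE` mirror the snippet quoted above, but the `stop()` body is an assumed illustration, not the merged implementation:

```python
import multiprocessing

MAX_QUEUE_SIZE = 100  # placeholder; the real constant lives in launcher.py


class Instance:
    def __init__(self, config):
        self.config = config
        self.process = None
        self.output_queue = None

    def start(self, target):
        self.output_queue = multiprocessing.Queue(maxsize=MAX_QUEUE_SIZE)
        self.process = multiprocessing.Process(
            target=target, args=(self.config, self.output_queue)
        )
        self.process.start()

    def stop(self):
        if self.process is not None:
            self.process.terminate()
            self.process.join()
            self.process = None
        if self.output_queue is not None:
            self.output_queue.close()        # stop accepting puts, flush the pipe
            self.output_queue.join_thread()  # wait for the feeder thread to exit
            self.output_queue = None
```

Closing the queue and joining its feeder thread releases the underlying pipe file descriptors, which is what prevents the leak on repeated create/delete cycles.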

Comment on lines +394 to +396
except Exception as e:
    logger.error(f"Failed to get logs for instance {instance_id}: {e}")
    raise HTTPException(status_code=500, detail=str(e))

Copilot AI Feb 10, 2026


The 500 response exposes detail=str(e) to API callers. That can leak internal exception messages/paths and makes the API contract unstable. Prefer returning a generic message (or an error code) and log the full exception server-side (ideally with stack trace via logger.exception).

Suggested change

except Exception as e:
    logger.error(f"Failed to get logs for instance {instance_id}: {e}")
    raise HTTPException(status_code=500, detail=str(e))

becomes

except Exception:
    logger.exception("Failed to get logs for instance %s", instance_id)
    raise HTTPException(
        status_code=500,
        detail="Failed to retrieve logs for the requested instance",
    )

@MikeSpreitzer
Collaborator

See and cite #157 and, in particular, #158.

@MikeSpreitzer
Collaborator

See also #170, in which the chunks are identified by byte count rather than line count. I think that it would be better for the launcher to also identify chunks by byte count.

@diegocastanibm
Collaborator Author

See also #170, in which the chunks are identified by byte count rather than line count. I think that it would be better for the launcher to also identify chunks by byte count.

New queue based on bytes

docs/launcher.md Outdated
### 5. Testing
### 5. Log Management

The launcher captures stdout/stderr from each vLLM instance in memory using a byte-limited queue:
Collaborator


I am not a fan of this choice.

Of course the dual-pods controller will be updated to include periodic relaying of whatever additional log content has shown up.

Having the launcher keep the logs in memory forces a trade-off between

  • memory usage by the launcher
  • frequency of the polling by the dual-pods controller

... and the consequences of the trade-off depend on the logging rate, which is not under the control of any FMA software.

The design here introduces a requirement that the dual-pods controller act with a certain (again, unknown to the FMA software) frequency. That is very unusual in the Kubernetes milieu, which mainly only follows an eventual consistency paradigm.

Collaborator Author


All designs have trade-offs. We can discuss what design you would prefer in a meeting.

Collaborator


OK. I think that the obvious alternative is storing a log in a file.

Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is getting extremely big. If I change the approach, it will get even bigger, with many more comments. I suggest addressing this concern in a different PR.

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK with me if we address this in a later PR.

docs/launcher.md Outdated
Comment on lines +331 to +332
- `start_byte` (optional): Byte position to start reading from (default: 0, minimum: 0). Use this to continue reading from where you left off.
- `max_bytes` (optional): Maximum bytes of log data to retrieve from start_byte (default: 1048576 (1 MB), range: 1024-10485760 (10 MB))
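A hypothetical client-side helper against these documented parameters (the endpoint path follows this PR; the base URL and instance id are placeholders):

```python
import urllib.parse


def log_chunk_url(base_url, instance_id, start_byte, max_bytes=1048576):
    # Build the request for the next chunk; the caller advances start_byte
    # by the byte length of each chunk it receives.
    query = urllib.parse.urlencode({"start_byte": start_byte, "max_bytes": max_bytes})
    return f"{base_url}/v2/vllm/instances/{instance_id}/log?{query}"
```

For example, `log_chunk_url("http://localhost:8001", "abc123", 0)` builds the first request, and subsequent calls pass the previous `start_byte` plus the size of the chunk received.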
Collaborator


I used to know about this: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Range

Using a standard header is better than using bespoke query parameters.
Can be changed in a later PR.
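For that later PR, the equivalent request could be phrased with the standard header. This is illustrative only; per the HTTP range spec, the end offset is inclusive:

```python
def range_header(start_byte, max_bytes):
    # Range: bytes=first-last, where last is the inclusive end offset.
    return {"Range": f"bytes={start_byte}-{start_byte + max_bytes - 1}"}
```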

Collaborator Author


Using the standard Range header would be cleaner, but that is better done in a follow-up PR.

@MikeSpreitzer
Collaborator

MikeSpreitzer commented Feb 17, 2026

I was hoping that the log could be treated as simply a sequence of bytes.

If the response to the get-log operation is of content-type JSON then no, it cannot be: the response is a JSON document, and the JSON string datatype holds unicode characters. This introduces some complexities.

  1. What if vllm outputs some bytes that are not valid unicode?
  2. What if vllm outputs just part of the utf-8 encoding of a unicode character and then stops for a while?
  3. Is start_byte really a byte index, or is it a character index? If it is a byte index then this imposes a burden on the client to translate the received string of unicode characters into a string of bytes (and there needs to be an explicit agreement on which encoding is used) in order to calculate the byte length of the response. If it is a character index then the name is wrong.

In my opinion this would be simpler if a log were simply a byte sequence. Then nothing in our code would need to worry about unicode encoding difficulties. This goes all the way to the response to a successful get-log operation being simply a sequence of bytes.
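The first two concerns above can be seen concretely in Python: a multi-byte UTF-8 character split across reads is rejected by strict decoding, and character counts diverge from byte counts. A small illustration:

```python
chunk = "é".encode("utf-8")  # b'\xc3\xa9': one character, two bytes
partial = chunk[:1]          # a read that stops mid-character

try:
    partial.decode("utf-8")  # strict decoding rejects the fragment
except UnicodeDecodeError:
    # lenient decoding substitutes U+FFFD, silently corrupting the log
    decoded = partial.decode("utf-8", errors="replace")

assert len("é") == 1 and len(chunk) == 2  # character count != byte count
```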

@github-actions

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

curl "http://localhost:8001/v2/vllm/instances/abc123.../log?start_byte=2097152&max_bytes=1048576"
```

**How `start_byte` Works:**
Collaborator


No, this omits complexities due to the difference between characters and bytes. A JSON string is a sequence of unicode characters. Byte 0xF0, for example, is not a unicode character.

Collaborator Author


Good catch. The docs say "use start_byte + len(log) as the next start_byte", but since the response is JSON, len(log) gives the number of unicode characters, not bytes. This is another instance of the character-vs-bytes problem you raised. It'll be resolved cleanly when we switch the response to application/octet-stream in the follow-up PR. For now I'll add a note in the docs clarifying that the client must encode the string back to UTF-8 to compute the correct byte length for the next start_byte.
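The clarification promised above amounts to one line of client-side arithmetic (a hypothetical snippet, assuming the `{"log": "..."}` JSON response shape from this PR):

```python
log = "héllo"      # 5 characters, but 6 UTF-8 bytes
start_byte = 100   # offset used for this request

# Advance by the UTF-8 byte length of the returned string;
# len(log) alone counts characters, not bytes.
next_start_byte = start_byte + len(log.encode("utf-8"))  # 106, not 105
```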

diegocastanibm and others added 13 commits February 18, 2026 14:04
Signed-off-by: Diego-Castan <[email protected]>
Signed-off-by: Diego-Castan <[email protected]>
Signed-off-by: Diego-Castan <[email protected]>
Signed-off-by: Diego-Castan <[email protected]>
- Remove redundant fields from log endpoint response (total_bytes,
  next_byte, instance_id, start_byte) — clients can derive these
  from the request and response log content
- Return 416 instead of 500 when start_byte is beyond available
  content, with LogRangeNotAvailable exception
- Rewrite get_logs_from_queue to be truly byte-oriented: concatenate
  all messages into a flat byte stream and slice, instead of
  message-boundary-based skipping
- Update docs and tests to match simplified API

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Diego-Castan <[email protected]>
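The "truly byte-oriented" rewrite described in the commit message can be sketched as follows. `get_logs_from_queue` and `LogRangeNotAvailable` are named in the commit, but this body is an assumed illustration, not the merged code:

```python
class LogRangeNotAvailable(Exception):
    """Raised (mapped to HTTP 416) when start_byte is beyond the buffered log."""


def get_logs_from_queue(messages, start_byte, max_bytes):
    # Concatenate the queued messages into one flat byte stream and slice
    # it, instead of skipping whole messages on message boundaries.
    stream = b"".join(messages)
    if start_byte > len(stream):
        raise LogRangeNotAvailable(
            f"start_byte {start_byte} exceeds buffered {len(stream)} bytes"
        )
    return stream[start_byte : start_byte + max_bytes]
```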
@diegocastanibm
Collaborator Author

I was hoping that the log could be treated as simply a sequence of bytes.

If the response to the get-log operation is of content-type JSON then no, it cannot be: the response is a JSON document, and the JSON string datatype holds unicode characters. This introduces some complexities.

1. What if vllm outputs some bytes that are not valid unicode?

2. What if vllm outputs just part of the utf-8 encoding of a unicode character and then stops for a while?

3. Is `start_byte` **really** a byte index, or is it a character index? If it is a byte index then this imposes a burden on the client to translate the received string of unicode characters into a string of bytes (and there needs to be an explicit agreement on which encoding is used) in order to calculate the byte length of the response. If it is a character index then the name is wrong.

In my opinion this would be simpler if a log were simply a byte sequence. Then nothing in our code would need to worry about unicode encoding difficulties. This goes all the way to the response to a successful get-log operation being simply a sequence of bytes.

I'll do it in a different PR. The plan for the follow-up is to change the GET .../log response from application/json with {"log": "..."} to application/octet-stream, returning the raw bytes directly. This would also require updating launcherclient.go (maybe you or @waltforme can do it) and the requester's log relay endpoint to handle raw bytes instead of JSON.

I'll open a separate PR for this, as I said. It is already difficult to handle all the changes in this PR.

Signed-off-by: Diego-Castan <[email protected]>
@diegocastanibm
Collaborator Author

The follow-up issue is here:
#265

MAX_LOG_RESPONSE_BYTES,
description="Maximum bytes of log data to retrieve",
ge=1024,
le=10 * 1024 * 1024,
Collaborator


Why is this 10 × MAX_LOG_RESPONSE_BYTES?

Collaborator Author


MAX_LOG_RESPONSE_BYTES is the default value (what's returned if the caller doesn't specify max_bytes). le is the maximum value the endpoint accepts, enforced by FastAPI/Pydantic validation; 10x the default was chosen as a reasonable upper bound.
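A dependency-free sketch of the validation semantics just described (names follow the diff above; the range check is what FastAPI/Pydantic's ge/le constraints enforce):

```python
MAX_LOG_RESPONSE_BYTES = 1024 * 1024  # 1 MB: default when max_bytes is omitted


def validate_max_bytes(max_bytes=None):
    # Mirrors Query(MAX_LOG_RESPONSE_BYTES, ge=1024, le=10 * 1024 * 1024).
    if max_bytes is None:
        return MAX_LOG_RESPONSE_BYTES
    if not 1024 <= max_bytes <= 10 * 1024 * 1024:
        raise ValueError("max_bytes must be between 1024 and 10485760")
    return max_bytes
```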

Collaborator

@MikeSpreitzer MikeSpreitzer left a comment


This still needs work. It can be addressed in follow-on PRs.

@diegocastanibm diegocastanibm merged commit c47c25b into llm-d-incubation:main Feb 19, 2026
36 checks passed


Development

Successfully merging this pull request may close these issues.

Extend launcher to return log chunks from launched vLLM instances

3 participants