[FEATURE] - Fetch stdout in launcher #242

Merged
diegocastanibm merged 14 commits into llm-d-incubation:main from
diegocastanibm:fetch-stdout
Feb 19, 2026
Conversation

@diegocastanibm
Collaborator

@diegocastanibm diegocastanibm commented Feb 10, 2026

Description:
#158
https://llm-d.slack.com/archives/C09TNPEFJUD/p1770675171058539
#231 (comment)

Test:
test_launcher.py has been modified to accommodate the new feature. All the tests have passed.

Launcher E2E test:

Screenshot 2026-02-16 at 5 17 36 PM

@diegocastanibm diegocastanibm marked this pull request as ready for review February 10, 2026 19:55
@diegocastanibm
Collaborator Author

diegocastanibm commented Feb 10, 2026

Could someone please add /copilot or /claude?

Contributor

Copilot AI left a comment


Pull request overview

Adds per-instance stdout/stderr log capture to the vLLM launcher so operators can fetch recent instance logs via a new REST endpoint.

Changes:

  • Capture child process stdout/stderr into a bounded multiprocessing queue per instance.
  • Add GET /v2/vllm/instances/{instance_id}/logs endpoint with max_lines query param.
  • Extend tests and documentation to cover the new log retrieval functionality.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File Description
inference_server/launcher/launcher.py Implements queue-backed stdout/stderr capture, per-instance log retrieval, and a new logs API endpoint.
inference_server/launcher/tests/test_launcher.py Adds unit/API tests for instance log retrieval and QueueWriter behavior.
docs/launcher.md Documents the new logs endpoint and provides usage examples and log management notes.

Comment on lines +91 to 95
self.output_queue = multiprocessing.Queue(maxsize=MAX_QUEUE_SIZE)
self.process = multiprocessing.Process(
    target=vllm_kickoff, args=(self.config, self.output_queue)
)
self.process.start()

Copilot AI Feb 10, 2026


output_queue is created on start() but never closed when the instance is stopped/removed. This can leak file descriptors/feeder threads over time if instances are created/deleted repeatedly. Consider closing the queue in stop() (e.g., close() + join_thread()), and setting self.output_queue = None after cleanup.
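The cleanup Copilot suggests could look like the sketch below. `Instance`, `vllm_kickoff`, and `MAX_QUEUE_SIZE` mirror the snippet quoted above, but the `stop()` body is an assumed illustration, not the merged implementation:

```python
import multiprocessing

MAX_QUEUE_SIZE = 100  # placeholder; the real constant lives in launcher.py


class Instance:
    def __init__(self, config):
        self.config = config
        self.process = None
        self.output_queue = None

    def start(self, target):
        self.output_queue = multiprocessing.Queue(maxsize=MAX_QUEUE_SIZE)
        self.process = multiprocessing.Process(
            target=target, args=(self.config, self.output_queue)
        )
        self.process.start()

    def stop(self):
        if self.process is not None:
            self.process.terminate()
            self.process.join()
            self.process = None
        if self.output_queue is not None:
            self.output_queue.close()        # stop accepting puts, flush the pipe
            self.output_queue.join_thread()  # wait for the feeder thread to exit
            self.output_queue = None
```

Closing the queue and joining its feeder thread releases the underlying pipe file descriptors, which is what prevents the leak on repeated create/delete cycles.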

Comment on lines +394 to +396
except Exception as e:
    logger.error(f"Failed to get logs for instance {instance_id}: {e}")
    raise HTTPException(status_code=500, detail=str(e))

Copilot AI Feb 10, 2026


The 500 response exposes detail=str(e) to API callers. That can leak internal exception messages/paths and makes the API contract unstable. Prefer returning a generic message (or an error code) and log the full exception server-side (ideally with stack trace via logger.exception).

Suggested change

except Exception as e:
    logger.error(f"Failed to get logs for instance {instance_id}: {e}")
    raise HTTPException(status_code=500, detail=str(e))

becomes

except Exception:
    logger.exception("Failed to get logs for instance %s", instance_id)
    raise HTTPException(
        status_code=500,
        detail="Failed to retrieve logs for the requested instance",
    )

@MikeSpreitzer
Collaborator

See and cite #157 and, in particular, #158.

@MikeSpreitzer
Collaborator

See also #170, in which the chunks are identified by byte count rather than line count. I think that it would be better for the launcher to also identify chunks by byte count.

@diegocastanibm
Collaborator Author

See also #170, in which the chunks are identified by byte count rather than line count. I think that it would be better for the launcher to also identify chunks by byte count.

New queue based on bytes

docs/launcher.md Outdated
### 5. Testing
### 5. Log Management

The launcher captures stdout/stderr from each vLLM instance in memory using a byte-limited queue:
Collaborator


I am not a fan of this choice.

Of course the dual-pods controller will be updated to include periodic relaying of whatever additional log content has shown up.

Having the launcher keep the logs in memory forces a trade-off between

  • memory usage by the launcher
  • frequency of the polling by the dual-pods controller

... and the consequences of the trade-off depend on the logging rate, which is not under the control of any FMA software.

The design here introduces a requirement that the dual-pods controller act with a certain (again, unknown to the FMA software) frequency. That is very unusual in the Kubernetes milieu, which mainly only follows an eventual consistency paradigm.

Collaborator Author


All designs have trade-offs. We can discuss what design you would prefer in a meeting.

Collaborator


OK. I think that the obvious alternative is storing a log in a file.

Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is getting extremely big. If I change the approach, it will get even bigger, with many more comments. I suggest addressing this concern in a different PR.

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK with me if we address this in a later PR.

docs/launcher.md Outdated
Comment on lines +331 to +332
- `start_byte` (optional): Byte position to start reading from (default: 0, minimum: 0). Use this to continue reading from where you left off.
- `max_bytes` (optional): Maximum bytes of log data to retrieve from start_byte (default: 1048576 (1 MB), range: 1024-10485760 (10 MB))
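A hypothetical client-side helper against these documented parameters (the endpoint path follows this PR; the base URL and instance id are placeholders):

```python
import urllib.parse


def log_chunk_url(base_url, instance_id, start_byte, max_bytes=1048576):
    # Build the request for the next chunk; the caller advances start_byte
    # by the byte length of each chunk it receives.
    query = urllib.parse.urlencode({"start_byte": start_byte, "max_bytes": max_bytes})
    return f"{base_url}/v2/vllm/instances/{instance_id}/log?{query}"
```

For example, `log_chunk_url("http://localhost:8001", "abc123", 0)` builds the first request, and subsequent calls pass the previous `start_byte` plus the size of the chunk received.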
Collaborator


I used to know about this: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Range

Using a standard header is better than using bespoke query parameters.
Can be changed in a later PR.
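For that later PR, the equivalent request could be phrased with the standard header. This is illustrative only; per the HTTP range spec, the end offset is inclusive:

```python
def range_header(start_byte, max_bytes):
    # Range: bytes=first-last, where last is the inclusive end offset.
    return {"Range": f"bytes={start_byte}-{start_byte + max_bytes - 1}"}
```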

Collaborator Author


Using the standard Range header would be cleaner, but that is better done in a follow-up PR.

@MikeSpreitzer
Collaborator

MikeSpreitzer commented Feb 17, 2026

I was hoping that the log could be treated as simply a sequence of bytes.

If the response to the get-log operation is of content-type JSON then no, it cannot be: the response is a JSON document, and the JSON string datatype holds unicode characters. This introduces some complexities.

  1. What if vllm outputs some bytes that are not valid unicode?
  2. What if vllm outputs just part of the utf-8 encoding of a unicode character and then stops for a while?
  3. Is start_byte really a byte index, or is it a character index? If it is a byte index then this imposes a burden on the client to translate the received string of unicode characters into a string of bytes (and there needs to be an explicit agreement on which encoding is used) in order to calculate the byte length of the response. If it is a character index then the name is wrong.

In my opinion this would be simpler if a log were simply a byte sequence. Then nothing in our code would need to worry about unicode encoding difficulties. This goes all the way to the response to a successful get-log operation being simply a sequence of bytes.
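The first two concerns above can be seen concretely in Python: a multi-byte UTF-8 character split across reads is rejected by strict decoding, and character counts diverge from byte counts. A small illustration:

```python
chunk = "é".encode("utf-8")  # b'\xc3\xa9': one character, two bytes
partial = chunk[:1]          # a read that stops mid-character

try:
    partial.decode("utf-8")  # strict decoding rejects the fragment
except UnicodeDecodeError:
    # lenient decoding substitutes U+FFFD, silently corrupting the log
    decoded = partial.decode("utf-8", errors="replace")

assert len("é") == 1 and len(chunk) == 2  # character count != byte count
```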

@github-actions

Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits, please see GitHub Documentation.

curl "http://localhost:8001/v2/vllm/instances/abc123.../log?start_byte=2097152&max_bytes=1048576"
```

**How `start_byte` Works:**
Collaborator


No, this omits complexities due to the difference between characters and bytes. A JSON string is a sequence of unicode characters. Byte 0xF0, for example, is not a unicode character.

Collaborator Author


Good catch. The docs say "use start_byte + len(log) as the next start_byte", but since the response is JSON, len(log) gives the number of unicode characters, not bytes. This is another instance of the character-vs-bytes problem you raised. It'll be resolved cleanly when we switch the response to application/octet-stream in the follow-up PR. For now I'll add a note in the docs clarifying that the client must encode the string back to UTF-8 to compute the correct byte length for the next start_byte.
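The clarification promised above amounts to one line of client-side arithmetic (a hypothetical snippet, assuming the `{"log": "..."}` JSON response shape from this PR):

```python
log = "héllo"      # 5 characters, but 6 UTF-8 bytes
start_byte = 100   # offset used for this request

# Advance by the UTF-8 byte length of the returned string;
# len(log) alone counts characters, not bytes.
next_start_byte = start_byte + len(log.encode("utf-8"))  # 106, not 105
```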

diegocastanibm and others added 13 commits February 18, 2026 14:04
Signed-off-by: Diego-Castan <[email protected]>
Signed-off-by: Diego-Castan <[email protected]>
Signed-off-by: Diego-Castan <[email protected]>
Signed-off-by: Diego-Castan <[email protected]>
- Remove redundant fields from log endpoint response (total_bytes,
  next_byte, instance_id, start_byte) — clients can derive these
  from the request and response log content
- Return 416 instead of 500 when start_byte is beyond available
  content, with LogRangeNotAvailable exception
- Rewrite get_logs_from_queue to be truly byte-oriented: concatenate
  all messages into a flat byte stream and slice, instead of
  message-boundary-based skipping
- Update docs and tests to match simplified API

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Diego-Castan <[email protected]>
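The "truly byte-oriented" rewrite described in the commit message can be sketched as follows. `get_logs_from_queue` and `LogRangeNotAvailable` are named in the commit, but this body is an assumed illustration, not the merged code:

```python
class LogRangeNotAvailable(Exception):
    """Raised (mapped to HTTP 416) when start_byte is beyond the buffered log."""


def get_logs_from_queue(messages, start_byte, max_bytes):
    # Concatenate the queued messages into one flat byte stream and slice
    # it, instead of skipping whole messages on message boundaries.
    stream = b"".join(messages)
    if start_byte > len(stream):
        raise LogRangeNotAvailable(
            f"start_byte {start_byte} exceeds buffered {len(stream)} bytes"
        )
    return stream[start_byte : start_byte + max_bytes]
```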
@diegocastanibm
Collaborator Author

I was hoping that the log could be treated as simply a sequence of bytes.

If the response to the get-log operation is of content-type JSON then no, it cannot be: the response is a JSON document, and the JSON string datatype holds unicode characters. This introduces some complexities.

1. What if vllm outputs some bytes that are not valid unicode?

2. What if vllm outputs just part of the utf-8 encoding of a unicode character and then stops for a while?

3. Is `start_byte` **really** a byte index, or is it a character index? If it is a byte index then this imposes a burden on the client to translate the received string of unicode characters into a string of bytes (and there needs to be an explicit agreement on which encoding is used) in order to calculate the byte length of the response. If it is a character index then the name is wrong.

In my opinion this would be simpler if a log were simply a byte sequence. Then nothing in our code would need to worry about unicode encoding difficulties. This goes all the way to the response to a successful get-log operation being simply a sequence of bytes.

I'll do it in a different PR. The plan for the follow-up is to change the GET .../log response from application/json with {"log": "..."} to application/octet-stream, returning the raw bytes directly. This would also require updating launcherclient.go (maybe you or @waltforme can do it) and the requester's log relay endpoint to handle raw bytes instead of JSON.

I'll open a separate PR for this, as I said. It is already difficult to handle all the changes in this PR.

Signed-off-by: Diego-Castan <[email protected]>
@diegocastanibm
Collaborator Author

The follow-up issue is here:
#265

MAX_LOG_RESPONSE_BYTES,
description="Maximum bytes of log data to retrieve",
ge=1024,
le=10 * 1024 * 1024,
Collaborator


Why is this 10 × MAX_LOG_RESPONSE_BYTES?

Collaborator Author


MAX_LOG_RESPONSE_BYTES is the default value (what's returned if the caller doesn't specify max_bytes). le is the maximum value the endpoint accepts, enforced by FastAPI/Pydantic validation; 10x the default was chosen as a reasonable upper bound.
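A dependency-free sketch of the validation semantics just described (names follow the diff above; the range check is what FastAPI/Pydantic's ge/le constraints enforce):

```python
MAX_LOG_RESPONSE_BYTES = 1024 * 1024  # 1 MB: default when max_bytes is omitted


def validate_max_bytes(max_bytes=None):
    # Mirrors Query(MAX_LOG_RESPONSE_BYTES, ge=1024, le=10 * 1024 * 1024).
    if max_bytes is None:
        return MAX_LOG_RESPONSE_BYTES
    if not 1024 <= max_bytes <= 10 * 1024 * 1024:
        raise ValueError("max_bytes must be between 1024 and 10485760")
    return max_bytes
```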

Collaborator

@MikeSpreitzer MikeSpreitzer left a comment


This still needs work. It can be addressed in follow-on PRs.

@diegocastanibm diegocastanibm merged commit c47c25b into llm-d-incubation:main Feb 19, 2026
36 checks passed


Development

Successfully merging this pull request may close these issues.

Extend launcher to return log chunks from launched vLLM instances

3 participants