* add Queue-based stdout capture
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
* Prevents memory overload from excessive logs
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
* Fix and add tests for launcher logs
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
* Update documentation
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
* Mike comment - Queue with bytes instead of lines
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
* Documentation
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
* Fix issue with queue
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
* start_byte implementation
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
* converted log retrieval from line-based array to byte-based string format
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
* Address PR review: simplify log API response and make byte-oriented
- Remove redundant fields from log endpoint response (total_bytes,
next_byte, instance_id, start_byte) — clients can derive these
from the request and response log content
- Return 416 instead of 500 when start_byte is beyond available
content, with LogRangeNotAvailable exception
- Rewrite get_logs_from_queue to be truly byte-oriented: concatenate
all messages into a flat byte stream and slice, instead of
message-boundary-based skipping
- Update docs and tests to match simplified API
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix terminal signal problem for vLLM on CPU
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
* Fix launcher error; avoid EngineCore exiting
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
* Mike's comments
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
* More Mike's comments
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
---------
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
@@ -314,6 +316,40 @@ Stop and delete a specific vLLM instance.

---

#### Get Instance Logs

**GET** `/v2/vllm/instances/{instance_id}/log`

Retrieve stdout/stderr logs from a specific vLLM instance, starting from a given byte position.

**Path Parameters:**

- `instance_id`: ID of the instance

**Query Parameters:**

- `start_byte` (optional): Byte position to start reading from (default: 0, minimum: 0). The largest valid value is the total number of bytes captured so far (i.e., the length of the log). Use this to continue reading from where you left off.
- `max_bytes` (optional): Maximum number of bytes of log data to retrieve from `start_byte` (default: 1048576, i.e. 1 MB; range: 1024–10485760, i.e. 1 KB–10 MB)

**Response (200 OK):**

```json
{
  "log": "INFO: Started server process\nINFO: Waiting for application startup\nINFO: Application startup complete\n"
}
```

**Response Fields:**

- `log`: Log content as a single string. Since `start_byte` is a byte offset but the JSON response contains unicode characters, the client must encode the string back to UTF-8 to compute the correct byte length for the next `start_byte` (e.g., `start_byte + len(log.encode("utf-8"))` in Python). Using the character length directly will produce incorrect offsets if the log contains multi-byte characters.
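To make the offset rule concrete, here is a small sketch; the log string is a hard-coded stand-in for a response containing a multi-byte character, not output from the real API:

```python
# Stand-in for the "log" field of a response; "é" is 2 bytes in UTF-8.
log = "INFO: démarrage du serveur\n"

char_len = len(log)                  # 27 characters -- NOT a valid offset delta
byte_len = len(log.encode("utf-8"))  # 28 bytes -- the correct offset delta

start_byte = 0
next_start = start_byte + byte_len   # use this as start_byte in the next request
print(char_len, byte_len, next_start)  # 27 28 28
```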

**Error Responses:**

- `404 Not Found`: Instance not found
- `416 Range Not Satisfiable`: The requested `start_byte` is beyond available log content. The response includes a `Content-Range: bytes */N` header (per [RFC 9110 §15.5.17](https://www.rfc-editor.org/rfc/rfc9110#status.416)) and a JSON body: `{"available_bytes": N}`, where `N` is the total number of bytes captured so far (i.e., the largest valid value of `start_byte`).
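A client can recover from a 416 by resetting its offset from the documented `available_bytes` field. A minimal sketch, using a plain dict as a stand-in for the parsed JSON body (no real HTTP call is made, and the helper name is illustrative):

```python
def resync_start_byte(status: int, body: dict, start_byte: int) -> int:
    """Return a valid start_byte after a response; clamp on 416."""
    if status == 416:
        # available_bytes is the total captured so far, i.e. the largest
        # valid start_byte; resuming there yields an empty read until
        # more log bytes arrive.
        return body["available_bytes"]
    return start_byte

# Stand-in for a 416 whose JSON body reports 4096 bytes captured so far.
print(resync_start_byte(416, {"available_bytes": 4096}, 10_000))  # 4096
print(resync_start_byte(200, {}, 512))                            # 512
```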

---

#### Delete All Instances

**DELETE** `/v2/vllm/instances`
@@ -485,6 +521,42 @@ curl -X POST http://localhost:8001/v2/vllm/instances \

```bash
curl http://localhost:8001/v2/vllm/instances
```

### Example 5: Retrieve Instance Logs

The log is treated as a flat byte stream. The `start_byte` parameter specifies the exact byte offset to begin reading from, and `max_bytes` limits how many bytes are returned:

```bash
# Get up to 1 MB of logs from the beginning (default)
curl "http://localhost:8001/v2/vllm/instances/{instance_id}/log"

# Continue from byte 1024, returning at most 4096 bytes
curl "http://localhost:8001/v2/vllm/instances/{instance_id}/log?start_byte=1024&max_bytes=4096"
```

To stream logs, use `start_byte + len(log.encode("utf-8"))` as the `start_byte` for the next request, since `start_byte` is a byte offset and the JSON response contains unicode characters.

## Configuration

### vLLM Options
@@ -557,6 +629,7 @@ Represents a single vLLM instance with its process and configuration.

- `start()`: Start the vLLM process
- `stop(timeout=10)`: Stop the vLLM process gracefully (or force kill after timeout)
- `get_status()`: Get detailed status information
- `get_logs(start_byte=0, max_bytes=1048576)`: Retrieve logs from the instance starting from a byte position

- `stop_all_instances(timeout=10)`: Stop all running instances
- `get_instance_status(instance_id)`: Get status of a specific instance
- `get_all_instances_status()`: Get status of all instances
- `get_instance_logs(instance_id, start_byte=0, max_bytes=1048576)`: Retrieve logs from a specific instance starting from a byte position

## Best Practices

@@ -622,7 +696,40 @@ Be mindful of system resources:

- **CPU**: vLLM uses CPU for pre/post-processing
- **Disk**: Models are cached in the container's filesystem

### 5. Log Management

The launcher captures stdout/stderr from each vLLM instance using a multiprocessing queue that feeds into a persistent in-memory byte buffer (`_log_buffer`) per instance:

- **Architecture**: A `QueueWriter` in the child process sends messages to a bounded queue (`MAX_QUEUE_SIZE = 5000` messages, a Python constant defined in `launcher.py`). On read, the launcher drains the queue into the instance's `_log_buffer`, which accumulates all log bytes.
- **Byte-Based Retrieval**: The `start_byte` parameter is a byte offset into the accumulated buffer. The `max_bytes` parameter limits how many bytes are returned per request.
- **Queue Overflow**: When the queue is full, new messages are dropped (non-blocking put). Messages already drained into the buffer are preserved.
- **Non-blocking**: Log capture doesn't slow down the vLLM process.
- **Streaming Support**: Use the `start_byte` parameter to stream logs efficiently without re-reading.

**Best Practices:**

- **Streaming Logs**: Use `start_byte` to stream logs efficiently. Since `start_byte` is a byte offset but the JSON response contains unicode characters, compute the byte length for the next request by encoding the returned string back to UTF-8 (e.g., `start_byte + len(log.encode("utf-8"))` in Python).