|
1 | | -TODO: need rewrite |
| 1 | +# Disaggregated Serving with Ray Orchestrator |
| 2 | +TensorRT-LLM supports an experimental [Ray orchestrator](../README.md) as an alternative to MPI. |
2 | 3 |
|
3 | | -# TensorRT-LLM OpenAPI Client Examples |
| 4 | +Running disaggregated serving with Ray follows [the same workflow as in MPI](/examples/disaggregated/README.md), except that `orchestrator_type="ray"` must be set on the `LLM` class, and `CUDA_VISIBLE_DEVICES` can be omitted since Ray handles GPU placement. |
4 | 5 |
|
5 | | -This directory contains simple client examples using the `requests` library to interact with TensorRT-LLM OpenAPI servers. |
6 | 6 |
|
7 | | -## Files |
| 7 | +## Quick Start |
| 8 | +The script below is a shorthand that launches a single-GPU context server, a single-GPU generation server, and the disaggregated server within a single Ray cluster. Please see [this page](/examples/disaggregated/README.md) for details on adjusting parallel settings. |
8 | 9 |
|
9 | | -- **`disagg_serving_test.py`** - Comprehensive client with multiple test cases |
10 | | -- **`simple_client_example.py`** - Minimal example showing core usage patterns |
11 | | - |
12 | | -## Prerequisites |
13 | | - |
14 | | -1. Start a TensorRT-LLM server manually using one of these methods: |
15 | | - |
16 | | -### Option A: Using trtllm-serve (Recommended) |
17 | | -```bash |
18 | | -trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" |
19 | | -``` |
20 | | - |
21 | | -### Option B: Using the FastAPI server example |
22 | | -```bash |
23 | | -cd ../apps |
24 | | -python fastapi_server.py TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
25 | | -``` |
26 | | - |
27 | | -### Option C: Using any OpenAI-compatible server |
28 | | -Make sure it has a `/v1/chat/completions` endpoint that accepts: |
29 | | -```json |
30 | | -{ |
31 | | - "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", |
32 | | - "messages": [{"role": "user", "content": "Your prompt here"}], |
33 | | - "max_tokens": 100, |
34 | | - "temperature": 0.8, |
35 | | - "stream": false |
36 | | -} |
37 | | -``` |
38 | | - |
39 | | -## Usage |
40 | | - |
41 | | -### Simple Client (Minimal Example) |
42 | 10 | ```bash |
43 | | -python simple_client_example.py |
| 11 | +# requires a total of two GPUs |
| 12 | +bash -e disagg_serving_local.sh |
44 | 13 | ``` |
45 | 14 |
|
46 | | -### Comprehensive Client |
| 15 | +Once the disaggregated server is ready, you can send requests to it using curl: |
47 | 16 | ```bash |
48 | | -# Default server URL (http://localhost:8000) |
49 | | -python disagg_serving_test.py |
50 | | - |
51 | | -# Custom server URL |
52 | | -python disagg_serving_test.py --server-url http://localhost:8080 |
53 | | -``` |
54 | | - |
55 | | -## Example Output |
56 | | - |
57 | | -``` |
58 | | -🤖 Testing TensorRT-LLM server at: http://localhost:8000 |
59 | | -================================================== |
60 | | -1. Health check... |
61 | | - ✅ Server healthy: {'status': 'healthy'} |
62 | | -
|
63 | | -2. Testing text generation... |
64 | | -
|
65 | | -🎯 Test 1: 'Hello, my name is' |
66 | | - Generated: 'John and I am a software engineer...' |
67 | | -
|
68 | | -🎯 Test 2: 'The capital of France is' |
69 | | - Generated: 'Paris, the city of lights...' |
70 | | -
|
71 | | -3. Testing streaming generation... |
72 | | -🎯 Streaming: 'Write a short story about a robot:' |
73 | | -📡 Response: Once upon a time, there was a robot named... |
74 | | -
|
75 | | -✅ All tests completed! |
76 | | -``` |
77 | | - |
78 | | -## API Endpoints Used |
79 | | - |
80 | | -The clients expect these endpoints: |
81 | | - |
82 | | -- `GET /health` - Health check (fallback: simple chat completion request) |
83 | | -- `POST /v1/chat/completions` - OpenAI-compatible chat completions |
84 | | -- Streaming support via Server-Sent Events (SSE) |
85 | | - |
86 | | -## Key Features Demonstrated |
87 | | - |
88 | | -1. **Basic text generation** with requests.post() |
89 | | -2. **Streaming response** handling with SSE |
90 | | -3. **Error handling** for connection issues |
91 | | -4. **Session management** for efficient connections |
92 | | -5. **OpenAI-compatible format** with messages array |
93 | | - |
94 | | -## Example Code |
95 | | - |
96 | | -The minimal example shows exactly what you need: |
97 | | - |
98 | | -```python |
99 | | -import requests |
100 | | - |
101 | | -# Basic generation (OpenAI format) |
102 | | -payload = { |
103 | | - "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", |
104 | | - "messages": [{"role": "user", "content": "Hello, my name is"}], |
105 | | - "max_tokens": 50, |
106 | | - "temperature": 0.8 |
107 | | -} |
108 | | -response = requests.post("http://localhost:8000/v1/chat/completions", json=payload) |
109 | | -result = response.json() |
110 | | -text = result["choices"][0]["message"]["content"] |
111 | | - |
112 | | -# Streaming generation |
113 | | -payload = { |
114 | | - "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", |
115 | | - "messages": [{"role": "user", "content": "The future of AI is"}], |
116 | | - "stream": True |
117 | | -} |
118 | | -response = requests.post("http://localhost:8000/v1/chat/completions", json=payload, stream=True) |
119 | | -``` |
120 | | - |
121 | | -## Customization |
122 | | - |
123 | | -You can modify the scripts to: |
124 | | -- Change server URLs |
125 | | -- Adjust generation parameters (temperature, max_tokens, etc.) |
126 | | -- Add new test prompts |
127 | | -- Handle different response formats |
| 17 | +curl http://localhost:8000/v1/completions \ |
| 18 | + -H "Content-Type: application/json" \ |
| 19 | + -d '{ |
| 20 | + "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", |
| 21 | + "prompt": "NVIDIA is a great company because", |
| 22 | + "max_tokens": 16, |
| 23 | + "temperature": 0 |
| 24 | + }' -w "\n" |
| 25 | +``` |
| 26 | + |
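The curl request above can also be issued from Python. The following is a minimal sketch using only the standard library, not part of the example scripts: the helper names `build_request` and `complete` are hypothetical, the URL and model are copied from the curl command, and the response is assumed to follow the OpenAI-style completions layout (`choices[0].text`).

```python
import json
import urllib.request

# Values copied from the curl example; adjust them for your deployment.
DISAGG_URL = "http://localhost:8000/v1/completions"
MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"


def build_request(prompt, max_tokens=16, temperature=0, url=DISAGG_URL):
    """Build the same POST request as the curl command, without sending it."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, timeout=60.0, **kwargs):
    """Send the request to the disaggregated server and return the generated text."""
    request = build_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: OpenAI-style completions response with choices[0].text.
    return body["choices"][0]["text"]
```

With the disaggregated server running, `print(complete("NVIDIA is a great company because"))` should print the generated continuation. Building the request separately from sending it makes the payload easy to inspect without a live server.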
| 27 | +## Disclaimer |
| 28 | +The code is experimental and subject to change. Currently, there are no guarantees regarding functionality, performance, or stability. |