Commit 694f631

revert to use deep copy, update all readmes
1 parent 225c598 commit 694f631

10 files changed

Lines changed: 76 additions & 143 deletions

examples/llm-api/ray/README.md

Lines changed: 38 additions & 0 deletions
@@ -3,3 +3,41 @@
 # TensorRT-LLM with Ray orchestrator

 <div align="left">
+
+This folder contains examples for an experimental **Ray orchestrator** that supports on-demand LLM instance spin-up and flexible GPU placement across single- and multi-node inference. It’s a first step toward making TensorRT-LLM a better fit for reinforcement learning from human feedback (RLHF) workflows. For RLHF, [Ray](https://docs.ray.io/en/latest/index.html) can dynamically spawn and reconnect distributed inference actors, each with its own parallelism strategy, whereas MPI fixes the world size and placement up front.
+
+This feature is a prototype and under active development. MPI remains the default and will continue to be supported in the long term.
+
+
+## Quick Start
+
+Run a simple `TP=2` example with a Hugging Face model:
+
+```shell
+cd examples/llm-api/ray
+python llm_inference_distributed_ray.py
+```
+
+This example matches the one in `/examples/llm-api`; the only change is `orchestrator_type="ray"` on `LLM()`. Other examples can be adapted the same way by setting this argument.
+
+
+## Features
+### Available
+- Generate text asynchronously (refer to [llm_inference_async_ray.py](llm_inference_async_ray.py))
+- Multi-node inference (refer to [multi-node README](./multi_nodes/README.md))
+- Disaggregated serving (refer to [disagg README](./disaggregated/README.md))
+
+**Initial testing has focused on LLaMA and DeepSeek variants. Please open an issue if you encounter problems with other models so we can prioritize support.**
+
+### Upcoming
+- Performance optimization
+- Integration with RLHF frameworks, such as [NVIDIA NeMo-RL](https://github.com/NVIDIA-NeMo/RL) and [Verl](https://github.com/volcengine/verl).
+
+## Architecture
+This feature introduces new classes such as [RayExecutor](/tensorrt_llm/executor/ray_executor.py) and [RayGPUWorker](/tensorrt_llm/executor/ray_gpu_worker.py) for Ray actor lifecycle management and distributed inference. In Ray mode, collective ops run on [torch.distributed](https://docs.pytorch.org/tutorials/beginner/dist_overview.html) without MPI. We welcome contributions to improve and extend this support.
+
+![Ray orchestrator architecture](/docs/source/media/ray_orchestrator_architecture.jpg)
+
+
+## Disclaimer
+The code is experimental and subject to change. Currently, there are no guarantees regarding functionality, performance, or stability.
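The Quick Start in the README above says the Ray example differs from the MPI one only by `orchestrator_type="ray"` on `LLM()`. A minimal sketch of that difference follows. The helper names `llm_kwargs` and `run_tp2_example`, the model name, and the prompt are illustrative assumptions rather than part of this commit, and `run_tp2_example` is only expected to work on a TensorRT-LLM build with Ray support and two GPUs.

```python
from typing import Any, Dict, Optional


def llm_kwargs(model: str, tp_size: int,
               orchestrator: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for LLM(); hypothetical helper for illustration."""
    kwargs: Dict[str, Any] = {"model": model, "tensor_parallel_size": tp_size}
    if orchestrator is not None:
        # The README names this argument; leaving it out keeps the MPI default.
        kwargs["orchestrator_type"] = orchestrator
    return kwargs


def run_tp2_example() -> None:
    """Assumed usage; needs TensorRT-LLM with Ray support and two GPUs."""
    from tensorrt_llm import LLM, SamplingParams  # imported lazily on purpose

    llm = LLM(**llm_kwargs("TinyLlama/TinyLlama-1.1B-Chat-v1.0", 2, "ray"))
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
    for output in outputs:
        print(output.outputs[0].text)


# Only the Ray variant carries the extra argument.
mpi_kwargs = llm_kwargs("TinyLlama/TinyLlama-1.1B-Chat-v1.0", 2)
ray_kwargs = llm_kwargs("TinyLlama/TinyLlama-1.1B-Chat-v1.0", 2, "ray")
```

Keeping the argument in one helper makes it easy to run the same script under either orchestrator when comparing behavior.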
Lines changed: 20 additions & 119 deletions
@@ -1,127 +1,28 @@
-TODO: need rewrite
+# Disaggregated Serving with Ray orchestrator
+TensorRT-LLM supports an experimental [Ray orchestrator](../README.md) as an alternative to MPI.

-# TensorRT-LLM OpenAPI Client Examples
+Running disaggregated serving with Ray follows [the same workflow as in MPI](/examples/disaggregated/README.md), except that `orchestrator_type="ray"` must be set on the `LLM` class, and `CUDA_VISIBLE_DEVICES` can be omitted since Ray handles GPU placement.

-This directory contains simple client examples using the `requests` library to interact with TensorRT-LLM OpenAPI servers.

-## Files
+## Quick Start
+The script below launches a single-GPU context server, a single-GPU generation server, and the disaggregated server within a single Ray cluster. Please see [this page](/examples/disaggregated/README.md) for details on adjusting parallel settings.

-- **`disagg_serving_test.py`** - Comprehensive client with multiple test cases
-- **`simple_client_example.py`** - Minimal example showing core usage patterns
-
-## Prerequisites
-
-1. Start a TensorRT-LLM server manually using one of these methods:
-
-### Option A: Using trtllm-serve (Recommended)
-```bash
-trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
-```
-
-### Option B: Using the FastAPI server example
-```bash
-cd ../apps
-python fastapi_server.py TinyLlama/TinyLlama-1.1B-Chat-v1.0
-```
-
-### Option C: Using any OpenAI-compatible server
-Make sure it has a `/v1/chat/completions` endpoint that accepts:
-```json
-{
-  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
-  "messages": [{"role": "user", "content": "Your prompt here"}],
-  "max_tokens": 100,
-  "temperature": 0.8,
-  "stream": false
-}
-```
-
-## Usage
-
-### Simple Client (Minimal Example)
 ```bash
-python simple_client_example.py
+# requires a total of two GPUs
+bash -e disagg_serving_local.sh
 ```

-### Comprehensive Client
+Once the disaggregated server is ready, you can send it requests with curl:
 ```bash
-# Default server URL (http://localhost:8000)
-python disagg_serving_test.py
-
-# Custom server URL
-python disagg_serving_test.py --server-url http://localhost:8080
-```
-
-## Example Output
-
-```
-🤖 Testing TensorRT-LLM server at: http://localhost:8000
-==================================================
-1. Health check...
-✅ Server healthy: {'status': 'healthy'}
-
-2. Testing text generation...
-
-🎯 Test 1: 'Hello, my name is'
-Generated: 'John and I am a software engineer...'
-
-🎯 Test 2: 'The capital of France is'
-Generated: 'Paris, the city of lights...'
-
-3. Testing streaming generation...
-🎯 Streaming: 'Write a short story about a robot:'
-📡 Response: Once upon a time, there was a robot named...
-
-✅ All tests completed!
-```
-
-## API Endpoints Used
-
-The clients expect these endpoints:
-
-- `GET /health` - Health check (fallback: simple chat completion request)
-- `POST /v1/chat/completions` - OpenAI-compatible chat completions
-- Streaming support via Server-Sent Events (SSE)
-
-## Key Features Demonstrated
-
-1. **Basic text generation** with requests.post()
-2. **Streaming response** handling with SSE
-3. **Error handling** for connection issues
-4. **Session management** for efficient connections
-5. **OpenAI-compatible format** with messages array
-
-## Example Code
-
-The minimal example shows exactly what you need:
-
-```python
-import requests
-
-# Basic generation (OpenAI format)
-payload = {
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
-    "messages": [{"role": "user", "content": "Hello, my name is"}],
-    "max_tokens": 50,
-    "temperature": 0.8
-}
-response = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
-result = response.json()
-text = result["choices"][0]["message"]["content"]
-
-# Streaming generation
-payload = {
-    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
-    "messages": [{"role": "user", "content": "The future of AI is"}],
-    "stream": True
-}
-response = requests.post("http://localhost:8000/v1/chat/completions", json=payload, stream=True)
-```
-
-## Customization
-
-You can modify the scripts to:
-- Change server URLs
-- Adjust generation parameters (temperature, max_tokens, etc.)
-- Add new test prompts
-- Handle different response formats
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+        "prompt": "NVIDIA is a great company because",
+        "max_tokens": 16,
+        "temperature": 0
+    }' -w "\n"
+```
+
+## Disclaimer
+The code is experimental and subject to change. Currently, there are no guarantees regarding functionality, performance, or stability.

examples/llm-api/ray/multi_nodes/README.md

Lines changed: 8 additions & 3 deletions
@@ -1,9 +1,10 @@
 # Multi-node inference with Ray orchestrator
-TensorRT-LLM supports [Ray](https://docs.ray.io/en/latest/index.html) as an orchestrator with [PyTorch Distributed](https://docs.pytorch.org/tutorials/beginner/dist_overview.html) as an alternative to MPI. This feature is currently experimental and under active development.
+TensorRT-LLM supports an experimental [Ray orchestrator](../README.md) as an alternative to MPI. The following example shows how to start a Ray cluster for multi-node inference.

-**Prerequisite:** a container image with TensorRT-LLM preinstalled (or suitable for installing it). The examples use Slurm and [Enroot](https://github.com/NVIDIA/enroot); if you use a different setup, adapt the container options and launch commands to your multi-node environment.

-## Run multi-node inference with Ray
+## Quick Start
+
+**Prerequisite:** a container image with TensorRT-LLM preinstalled (or suitable for installing it). The examples use Slurm and [Enroot](https://github.com/NVIDIA/enroot); if you use a different setup, adapt the following scripts and commands to your multi-node environment.

 1. Allocate nodes and open a shell on the head node:

@@ -32,5 +33,9 @@ TensorRT-LLM supports [Ray](https://docs.ray.io/en/latest/index.html) as an orch

 # Under your work directory:
 >> pip install -e . # if needed
+# You can edit this script to use a model and parallel settings suited to multi-node inference (e.g., TP8 or TP4PP4)
 >> python examples/ray/llm_inference_async_ray.py
 ```
+
+## Disclaimer
+The code is experimental and subject to change. Currently, there are no guarantees regarding functionality, performance, or stability.
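The multi-node README mentions parallel settings such as TP8 or TP4PP4. As a rough sizing aid, the sketch below turns such a shorthand into a GPU and node count, using the fact that the number of ranks is the tensor-parallel size times the pipeline-parallel size. The function names are hypothetical, only TP and PP are considered, and the default of 8 GPUs per node is an assumption about the cluster, not something this commit states.

```python
import math
import re


def parse_layout(layout: str) -> dict:
    """Parse a shorthand such as "TP8" or "TP4PP4" into parallel sizes."""
    sizes = {"TP": 1, "PP": 1}  # a missing dimension defaults to 1
    for kind, value in re.findall(r"(TP|PP)(\d+)", layout.upper()):
        sizes[kind] = int(value)
    return sizes


def gpus_needed(layout: str) -> int:
    """Ranks required: tensor-parallel size times pipeline-parallel size."""
    sizes = parse_layout(layout)
    return sizes["TP"] * sizes["PP"]


def nodes_needed(layout: str, gpus_per_node: int = 8) -> int:
    """Nodes to allocate; gpus_per_node=8 is an assumed cluster shape."""
    return math.ceil(gpus_needed(layout) / gpus_per_node)
```

Under these assumptions, TP8 fits on one 8-GPU node, while TP4PP4 needs 16 GPUs and therefore two such nodes.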

tensorrt_llm/_torch/model_config.py

Lines changed: 0 additions & 16 deletions
@@ -1,6 +1,4 @@
 import contextlib
-import copy
-import dataclasses
 import json
 import os
 from dataclasses import dataclass, field
@@ -565,17 +563,3 @@ def get_num_attention_layers(self):
             return self.pretrained_config.hybrid_override_pattern.count("*")
         else:
             return self.pretrained_config.num_hidden_layers
-
-    def clone(self) -> "ModelConfig[TConfig]":
-        """
-        Create a clone of the config.
-        """
-        shallow_fields = ["mapping"]
-
-        clone = dataclasses.replace(self,
-                                    **{field: None
-                                       for field in shallow_fields})
-        clone = copy.deepcopy(clone)
-        for field in shallow_fields:
-            setattr(clone, field, getattr(self, field))
-        return clone
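The removed `clone()` deep-copied every field except `mapping`, which it kept as a shared reference to the original object, whereas `copy.deepcopy` duplicates `mapping` as well. The sketch below reproduces both behaviors on toy stand-ins. `ToyMapping` and `ToyConfig` are hypothetical classes written for this illustration, not the real `ModelConfig`, and the field values are made up.

```python
import copy
import dataclasses
from dataclasses import dataclass, field


@dataclass
class ToyMapping:
    tp_size: int = 1


@dataclass
class ToyConfig:
    mapping: ToyMapping = field(default_factory=ToyMapping)
    quant_config_dict: dict = field(default_factory=dict)

    def clone(self) -> "ToyConfig":
        """Mirror of the removed method: deep copy except for `mapping`."""
        shallow_fields = ["mapping"]
        # Blank the shallow fields so deepcopy does not duplicate them.
        clone = dataclasses.replace(self, **{name: None for name in shallow_fields})
        clone = copy.deepcopy(clone)
        # Re-attach the original objects by reference.
        for name in shallow_fields:
            setattr(clone, name, getattr(self, name))
        return clone


original = ToyConfig(quant_config_dict={"layer.0": "fp8"})
cloned = original.clone()
deep = copy.deepcopy(original)

shares_mapping_after_clone = cloned.mapping is original.mapping
shares_mapping_after_deepcopy = deep.mapping is original.mapping

# Mutating the deep copy leaves the original untouched.
deep.quant_config_dict["layer.0"] = "nvfp4"
```

In this toy setup the clone shares `mapping` with the original and the deep copy does not, while both keep their own `quant_config_dict`. Whether sharing `mapping` matters for the real `ModelConfig` depends on how callers use it, which this diff does not show.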

tensorrt_llm/_torch/models/modeling_deepseekv3.py

Lines changed: 2 additions & 1 deletion
@@ -25,6 +25,7 @@
 # SOFTWARE.
 # --------------------------------------------------

+import copy
 import math
 import os
 import warnings
@@ -1163,7 +1164,7 @@ class DeepseekV3ForCausalLM(SpecDecOneEngineForCausalLM[DeepseekV3Model,
     def __init__(self, model_config: ModelConfig[PretrainedConfig]):
         # Rename some keys of quant_config_dict to support legacy checkpoints
         if model_config.quant_config_dict is not None:
-            model_config = model_config.clone()
+            model_config = copy.deepcopy(model_config)
             quant_config_dict = {}
             for key, val in model_config.quant_config_dict.items():
                 key_split = key.split(".")

tensorrt_llm/_torch/models/modeling_gemma3vl.py

Lines changed: 2 additions & 1 deletion
@@ -1,3 +1,4 @@
+import copy
 import dataclasses
 import os
 from typing import List, Optional, Tuple
@@ -164,7 +165,7 @@ def __init__(self, model_config: ModelConfig[Gemma3Config]):
                                                dtype=torch.int32,
                                                device=self._device)

-        model_config_cp = model_config.clone()
+        model_config_cp = copy.deepcopy(model_config)
         self.model_config = model_config_cp

         llm_model_config = self.get_sub_model_config(model_config_cp,

tensorrt_llm/_torch/models/modeling_hyperclovax.py

Lines changed: 2 additions & 1 deletion
@@ -1,3 +1,4 @@
+import copy
 import math
 import os
 from functools import partial
@@ -990,7 +991,7 @@ def __init__(self, model_config: ModelConfig):
             return
         if not DISAGG:
             self.mm_encoder = HCXVisionModel(model_config)
-        llm_model_config = model_config.clone()
+        llm_model_config = copy.deepcopy(model_config)
         llm_model_config.pretrained_config = PretrainedConfig.from_dict(
             llm_model_config.pretrained_config.language_config)
         self.llm = AutoModelForCausalLM.from_config(llm_model_config)

tensorrt_llm/_torch/models/modeling_llava_next.py

Lines changed: 2 additions & 1 deletion
@@ -1,3 +1,4 @@
+import copy
 import os
 from typing import Dict, List, Optional, Tuple, Union

@@ -424,7 +425,7 @@ def __init__(self, model_config: ModelConfig[PretrainedConfig], *args,
         if not DISAGG:
             self.mm_encoder = LlavaNextVisionModel(model_config)

-        llm_model_config = model_config.clone()
+        llm_model_config = copy.deepcopy(model_config)
         llm_model_config.pretrained_config = model_config.pretrained_config.text_config

         # TODO Remove these when MistralConfig is natively supported

tensorrt_llm/_torch/models/modeling_qwen2vl.py

Lines changed: 2 additions & 1 deletion
@@ -1,3 +1,4 @@
+import copy
 import os
 from typing import Any, Dict, List, Optional, Tuple, Union

@@ -475,7 +476,7 @@ def __init__(
         if hasattr(self, "llm"):
             return

-        llm_model_config = model_config.clone()
+        llm_model_config = copy.deepcopy(model_config)
         llm_model_config.pretrained_config.architectures = ["Qwen2ForCausalLM"]
         self.llm = AutoModelForCausalLM.from_config(llm_model_config)
         self.vocab_size = config.vocab_size
