Commit b1653d7

Fix the parameter to tensor conversion in TRTLLM FastAPI implementation (#98)
* Fix the parameter to tensor conversion in TRTLLM FastAPI implementation
* Fix format
1 parent 246a237 commit b1653d7
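
The substance of the fix is the order in which the list wrapping and the numpy dtype conversion are applied: `[[numpy.int32(v)]]` is a plain Python list that merely contains a numpy scalar, whereas `numpy.int32([[v]])` is a 2-D numpy array with an explicit dtype and shape `(1, 1)`, i.e. an actual tensor for the request. A minimal sketch of the difference (the value is made up for illustration):

```
import numpy

max_tokens = 512  # illustrative value, not from the commit

# Old form: a nested Python list that happens to hold a numpy scalar.
old_style = [[numpy.int32(max_tokens)]]
print(type(old_style))                   # <class 'list'>

# New form: a 2-D numpy array with an explicit dtype and shape (1, 1).
new_style = numpy.int32([[max_tokens]])
print(type(new_style))                   # <class 'numpy.ndarray'>
print(new_style.dtype, new_style.shape)  # int32 (1, 1)
```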

File tree

2 files changed: 12 additions & 12 deletions


Triton_Inference_Server_Python_API/examples/fastapi/README.md

Lines changed: 6 additions & 6 deletions
@@ -26,15 +26,15 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 -->

-# Triton Inference Server Open AI Compatible Server
+# Triton Inference Server Open AI Compatible Server

 Using the Triton In-Process Python API you can integrate triton server
 based models into any Python framework including FastAPI with an
 OpenAI compatible interface.

 This directory contains a FastAPI based Triton Inference Server
 supporting `llama-3-8b-instruct` with both the vLLM and TRT-LLM
-backends.
+backends.

 The front end application was generated using a trimmed version of the
 OpenAI OpenAPI [specification](api-spec/openai_trimmed.yml) and the
@@ -118,7 +118,7 @@ curl -X 'POST' \
 "stream": false,
 "stop": "string",
 "frequency_penalty": 0.0
-}' | jq .
+}' | jq .
 ```

 #### Chat Completions `/v1/chat/completions`
@@ -165,7 +165,7 @@ curl -s http://localhost:8000/v1/models | jq .
 curl -s http://localhost:8000/v1/models/llama-3-8b-instruct | jq .
 ```

-## Comparison to vllm
+## Comparison to vllm

 The vLLM container can also be used to run the vLLM FastAPI Server

@@ -185,7 +185,7 @@ Note: the following command requires the 24.05 pre-release version of genai-perf
 Preliminary results show performance is on par with vLLM with concurrency 2

 ```
-genai-perf -m meta-llama/Meta-Llama-3-8B-Instruct --endpoint v1/chat/completions --endpoint-type chat --service-kind openai -u http://localhost:8000 --num-prompts 100 --synthetic-input-tokens-mean 1024 --synthetic-input-tokens-stddev 50 --concurrency 2 --measurement-interval 40000 --extra-inputs max_tokens:512 --extra-input ignore_eos:true -- -v --max-threads=256
+genai-perf -m meta-llama/Meta-Llama-3-8B-Instruct --endpoint v1/chat/completions --endpoint-type chat --service-kind openai -u http://localhost:8000 --num-prompts 100 --synthetic-input-tokens-mean 1024 --synthetic-input-tokens-stddev 50 --concurrency 2 --measurement-interval 40000 --extra-inputs max_tokens:512 --extra-input ignore_eos:true -- -v --max-threads=256
 erval 40000 --extra-inputs max_tokens:512 --extra-input ignore_eos:true -- -v --max-threads=256
 ```

@@ -195,5 +195,5 @@ erval 40000 --extra-inputs max_tokens:512 --extra-input ignore_eos:true -- -v --
 * Max tokens is not processed by trt-llm backend correctly
 * Usage information is not populated
 * `finish_reason` is currently always set to `stop`
-* Limited performance testing has been done
+* Limited performance testing has been done
 * Using genai-perf to test streaming requires changes to genai-perf SSE handling

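For reference, the OpenAI-compatible endpoints shown in the README can also be exercised from Python rather than curl. A minimal sketch, assuming the FastAPI server from this example is running on `localhost:8000` and serving `llama-3-8b-instruct` as in the README; the `requests` dependency, the message content, and the `max_tokens` value are illustrative assumptions, and the response shape follows the standard OpenAI chat completions schema:

```
import requests  # assumed third-party dependency, not part of this example

# Chat completion against the OpenAI-compatible FastAPI front end.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "llama-3-8b-instruct",
        "messages": [{"role": "user", "content": "Hello, what is Triton?"}],
        "max_tokens": 64,   # hypothetical value
        "stream": False,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```
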
Triton_Inference_Server_Python_API/examples/fastapi/fastapi-codegen/openai-tritonserver.py

Lines changed: 6 additions & 6 deletions
@@ -165,21 +165,21 @@ def create_trtllm_inference_request(
     inputs["text_input"] = [[prompt]]
     inputs["stream"] = [[request.stream]]
     if request.max_tokens:
-        inputs["max_tokens"] = [[numpy.int32(request.max_tokens)]]
+        inputs["max_tokens"] = numpy.int32([[request.max_tokens]])
     if request.stop:
         if isinstance(request.stop, str):
             request.stop = [request.stop]
         inputs["stop_words"] = [request.stop]
     if request.top_p:
-        inputs["top_p"] = [[numpy.float32(request.top_p)]]
+        inputs["top_p"] = numpy.float32([[request.top_p]])
     if request.frequency_penalty:
-        inputs["frequency_penalty"] = [[numpy.float32(request.frequency_penalty)]]
+        inputs["frequency_penalty"] = numpy.float32([[request.frequency_penalty]])
     if request.presence_penalty:
-        inputs["presence_penalty":] = [[numpy.int32(request.presence_penalty)]]
+        inputs["presence_penalty":] = numpy.int32([[request.presence_penalty]])
     if request.seed:
-        inputs["random_seed"] = [[numpy.uint64(request.seed)]]
+        inputs["random_seed"] = numpy.uint64([[request.seed]])
     if request.temperature:
-        inputs["temperature"] = [[numpy.float32(request.temperature)]]
+        inputs["temperature"] = numpy.float32([[request.temperature]])

     return model.create_request(inputs=inputs)

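Taken together, the corrected lines all follow one pattern: wrap the scalar in a nested list and let the numpy dtype constructor produce a shape `(1, 1)` tensor keyed by the backend input name. A minimal self-contained sketch of that pattern; the tensor names and dtypes are taken from the diff, the helper name is hypothetical, the stray slice colon in `inputs["presence_penalty":]` is assumed to be a typo and replaced with a plain string key, and the real code passes the resulting dict to `model.create_request`:

```
import numpy


def build_trtllm_sampling_inputs(
    max_tokens=None,
    top_p=None,
    frequency_penalty=None,
    presence_penalty=None,
    seed=None,
    temperature=None,
):
    """Sketch of the parameter-to-tensor conversion used by the fixed helper.

    Each optional sampling parameter becomes a shape (1, 1) numpy array with
    the dtype used in the diff, keyed by its input tensor name.
    """
    inputs = {}
    if max_tokens is not None:
        inputs["max_tokens"] = numpy.int32([[max_tokens]])
    if top_p is not None:
        inputs["top_p"] = numpy.float32([[top_p]])
    if frequency_penalty is not None:
        inputs["frequency_penalty"] = numpy.float32([[frequency_penalty]])
    if presence_penalty is not None:
        # The diff writes the key as inputs["presence_penalty":]; the stray
        # slice colon looks like a typo, so a plain string key is used here.
        inputs["presence_penalty"] = numpy.int32([[presence_penalty]])
    if seed is not None:
        inputs["random_seed"] = numpy.uint64([[seed]])
    if temperature is not None:
        inputs["temperature"] = numpy.float32([[temperature]])
    return inputs


# Example: every value becomes a (1, 1) ndarray of the declared dtype.
tensors = build_trtllm_sampling_inputs(max_tokens=128, temperature=0.7)
for name, value in tensors.items():
    print(name, value.dtype, value.shape)
```
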