# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Triton Inference Server OpenAI Compatible Server

Using the Triton In-Process Python API, you can integrate Triton Server
based models into any Python framework, including FastAPI, with an
OpenAI compatible interface.
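
As a rough illustration of that pattern (not the implementation in this directory), the sketch below embeds an in-process `tritonserver` instance inside a FastAPI route. The model repository path, tensor names (`text_input`/`text_output`), and the `to_string_array()` conversion are assumptions that depend on the model configuration and backend.

```
# Minimal sketch: an OpenAI-style completions route on top of the Triton
# in-process Python API. Paths, tensor names, and response handling are
# illustrative assumptions, not the implementation in this directory.
import tritonserver
from fastapi import FastAPI
from pydantic import BaseModel


class CompletionRequest(BaseModel):
    # Small subset of the OpenAI completions schema, for illustration only.
    model: str
    prompt: str


app = FastAPI()

# Start an in-process Triton server pointed at a local model repository
# (the repository path is an assumed example).
server = tritonserver.Server(model_repository="/workspace/llm_models").start(
    wait_until_ready=True
)


@app.post("/v1/completions")
def completions(request: CompletionRequest):
    model = server.model(request.model)
    # Tensor names follow common vLLM/TRT-LLM backend conventions (assumption).
    responses = model.infer(inputs={"text_input": [request.prompt]})
    text = "".join(
        response.outputs["text_output"].to_string_array()[0]
        for response in responses
    )
    return {"object": "text_completion", "choices": [{"text": text}]}
```

Any ASGI server can then serve the app as usual (for example `uvicorn my_app:app`, assuming the sketch lives in `my_app.py`).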

This directory contains a FastAPI-based Triton Inference Server
supporting `llama-3-8b-instruct` with both the vLLM and TRT-LLM
backends.

The front-end application was generated using a trimmed version of the
OpenAI OpenAPI [specification](api-spec/openai_trimmed.yml) and the
@@ -118,7 +118,7 @@ curl -X 'POST' \
"stream": false,
119
119
"stop": "string",
120
120
"frequency_penalty": 0.0
121
- }' | jq .
121
+ }' | jq .
122
122
```
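
Because the interface follows the OpenAI specification, the same kind of request can also be issued with the official `openai` Python client. Below is a minimal sketch; the base URL, placeholder API key, and prompt are assumptions mirroring the curl example above.

```
# Sketch: call the local /v1/completions endpoint with the openai client.
from openai import OpenAI

# The API key is a placeholder; the local server is assumed not to validate it.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

completion = client.completions.create(
    model="llama-3-8b-instruct",
    prompt="Machine learning is",
    max_tokens=32,
    stream=False,
)
print(completion.choices[0].text)
```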

#### Chat Completions `/v1/chat/completions`
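
The chat endpoint accepts the standard OpenAI message format. A minimal sketch, with the same assumed client settings as above:

```
# Sketch: call the local /v1/chat/completions endpoint with the openai client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

chat = client.chat.completions.create(
    model="llama-3-8b-instruct",
    messages=[{"role": "user", "content": "What is Triton Inference Server?"}],
    max_tokens=64,
)
print(chat.choices[0].message.content)
```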
@@ -165,7 +165,7 @@ curl -s http://localhost:8000/v1/models | jq .
curl -s http://localhost:8000/v1/models/llama-3-8b-instruct | jq .
```

## Comparison to vLLM

The vLLM container can also be used to run the vLLM FastAPI Server
@@ -185,7 +185,7 @@ Note: the following command requires the 24.05 pre-release version of genai-perf
Preliminary results show performance on par with vLLM at a concurrency of 2.

```
genai-perf -m meta-llama/Meta-Llama-3-8B-Instruct --endpoint v1/chat/completions --endpoint-type chat --service-kind openai -u http://localhost:8000 --num-prompts 100 --synthetic-input-tokens-mean 1024 --synthetic-input-tokens-stddev 50 --concurrency 2 --measurement-interval 40000 --extra-inputs max_tokens:512 --extra-input ignore_eos:true -- -v --max-threads=256
```
@@ -195,5 +195,5 @@ erval 40000 --extra-inputs max_tokens:512 --extra-input ignore_eos:true -- -v --
* Max tokens is not processed correctly by the TRT-LLM backend
* Usage information is not populated
* `finish_reason` is currently always set to `stop`
* Limited performance testing has been done
* Using genai-perf to test streaming requires changes to genai-perf SSE handling