
Commit 4f166c5

fix: Update trtllm-tutorial with latest changes on their branch and added gen-ai tutorial (#106)
1 parent eacf310 commit 4f166c5


Popular_Models_Guide/Llama2/trtllm_guide.md

Lines changed: 44 additions & 2 deletions
@@ -264,22 +264,24 @@ steps. The following script does a minimal configuration to run tritonserver,
 but if you want optimal performance or custom parameters, read details in
 [documentation](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md)
 and [perf_best_practices](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md):
-
+Note: `TRITON_BACKEND` has two possible options: `tensorrtllm` and `python`. If `TRITON_BACKEND=python`, the python backend will deploy [`model.py`](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py).
 ```bash
 # preprocessing
 TOKENIZER_DIR=/Llama-2-7b-hf/
 TOKENIZER_TYPE=auto
+ENGINE_DIR=/engines
 DECOUPLED_MODE=false
 MODEL_FOLDER=/opt/tritonserver/inflight_batcher_llm
 MAX_BATCH_SIZE=4
 INSTANCE_COUNT=1
 MAX_QUEUE_DELAY_MS=10000
+TRITON_BACKEND=tensorrtllm
 FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py
 python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
 python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}
 python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT}
 python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
-python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching
+python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching
 ```

 3. Launch Tritonserver
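If you want to try the `python` option mentioned in the note added above, only the `tensorrt_llm` model config needs to be refilled with the backend value swapped. The following is a minimal sketch, not part of the commit; it reuses the same variables and the same `fill_template.py` invocation shown in the diff above, changing only `TRITON_BACKEND`:

```bash
# Illustrative variant (not in the tutorial): deploy the tensorrt_llm model through
# the python backend, which serves model.py instead of the C++ tensorrtllm backend.
# All paths and parameters are the ones already defined in the snippet above.
TRITON_BACKEND=python
MODEL_FOLDER=/opt/tritonserver/inflight_batcher_llm
ENGINE_DIR=/engines
MAX_BATCH_SIZE=4
DECOUPLED_MODE=false
MAX_QUEUE_DELAY_MS=10000
FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching
```

Since the backend choice lives in `config.pbtxt`, refill it before launching tritonserver in the next step so the change takes effect when the model loads.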
@@ -336,6 +338,46 @@ curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What
 > {"context_logits":0.0,...,"text_output":"What is ML?\nML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation."}
 > ```

+### Evaluating performance with Gen-AI Perf
+Gen-AI Perf is a command line tool for measuring the throughput and latency of generative AI models as served through an inference server.
+You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/src/c%2B%2B/perf_analyzer/genai-perf/README.html).
+
+To use Gen-AI Perf, run the following command in the same Triton docker container:
+```bash
+genai-perf \
+  -m ensemble \
+  --service-kind triton \
+  --backend tensorrtllm \
+  --num-prompts 100 \
+  --random-seed 123 \
+  --synthetic-input-tokens-mean 200 \
+  --synthetic-input-tokens-stddev 0 \
+  --output-tokens-mean 100 \
+  --output-tokens-stddev 0 \
+  --output-tokens-mean-deterministic \
+  --tokenizer hf-internal-testing/llama-tokenizer \
+  --concurrency 1 \
+  --measurement-interval 4000 \
+  --profile-export-file my_profile_export.json \
+  --url localhost:8001
+```
+You should expect an output that looks like this:
+```
+                                  LLM Metrics
+┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
+┃ Statistic              ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p75 ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
+│ Request latency (ms)   │ 1,630.23 │ 1,616.37 │ 1,644.65 │ 1,644.05 │ 1,638.70 │ 1,635.64 │
+│ Output sequence length │   300.00 │   300.00 │   300.00 │   300.00 │   300.00 │   300.00 │
+│ Input sequence length  │   200.00 │   200.00 │   200.00 │   200.00 │   200.00 │   200.00 │
+└────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
+Output token throughput (per sec): 184.02
+Request throughput (per sec): 0.61
+2024-08-08 19:45 [INFO] genai_perf.export_data.json_exporter:56 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency1/profile_export_genai_perf.json
+2024-08-08 19:45 [INFO] genai_perf.export_data.csv_exporter:69 - Generating artifacts/ensemble-triton-tensorrtllm-concurrency1/profile_export_genai_perf.csv
+```
+
+
 ## References

 For more examples feel free to refer to [End to end workflow to run llama.](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md)
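To see how these numbers move with load, the same measurement can be repeated at several concurrency levels. The loop below is a minimal sketch, not part of the commit; it reuses only the flags from the genai-perf command in the diff above, and the export file names are arbitrary:

```bash
# Illustrative sweep (not in the tutorial): rerun the same genai-perf measurement
# at increasing concurrency, keeping the workload settings fixed so runs are comparable.
for c in 1 2 4 8; do
  genai-perf \
    -m ensemble \
    --service-kind triton \
    --backend tensorrtllm \
    --num-prompts 100 \
    --random-seed 123 \
    --synthetic-input-tokens-mean 200 \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean 100 \
    --output-tokens-stddev 0 \
    --output-tokens-mean-deterministic \
    --tokenizer hf-internal-testing/llama-tokenizer \
    --concurrency ${c} \
    --measurement-interval 4000 \
    --profile-export-file profile_export_concurrency_${c}.json \
    --url localhost:8001
done

# Per-run metrics are written under artifacts/, as in the log lines shown above.
ls artifacts/
```

Only `--concurrency` and the export file name change between runs, so differences in request latency and token throughput can be attributed to the load level rather than the workload.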
