diff --git a/AI_Agents_Guide/Constrained_Decoding/README.md b/AI_Agents_Guide/Constrained_Decoding/README.md
index 365ccd1e..28e07417 100644
--- a/AI_Agents_Guide/Constrained_Decoding/README.md
+++ b/AI_Agents_Guide/Constrained_Decoding/README.md
@@ -31,4 +31,673 @@ This tutorial focuses on constrained decoding, an important technique for
 ensuring that large language models (LLMs) generate outputs that adhere to
 strict formatting requirements—requirements that may be challenging or
-expensive to achieve solely through fine-tuning.
\ No newline at end of file
+expensive to achieve solely through fine-tuning.
+
+## Table of Contents
+
+- [Introduction to Constrained Decoding](#introduction-to-constrained-decoding)
+- [Prerequisite: Hermes-2-Pro-Llama-3-8B](#prerequisite-hermes-2-pro-llama-3-8b)
+- [Structured Generation via Prompt Engineering](#structured-generation-via-prompt-engineering)
+  * [Example 1](#example-1)
+  * [Example 2](#example-2)
+- [Enforcing Output Format via External Libraries](#enforcing-output-format-via-external-libraries)
+  * [Pre-requisite: Common set-up](#pre-requisite-common-set-up)
+    + [Logits Post-Processor](#logits-post-processor)
+    + [Tokenizer](#tokenizer)
+    + [Repository set up](#repository-set-up)
+  * [LM Format Enforcer](#lm-format-enforcer)
+  * [Outlines](#outlines)
+
+## Introduction to Constrained Decoding
+
+Constrained decoding is a powerful technique used in natural language processing
+and various AI applications to guide and control the output of a model.
+By imposing specific constraints, this method ensures that generated outputs
+adhere to predefined criteria, such as length, format, or content restrictions.
+This capability is essential in contexts where compliance with rules
+is non-negotiable, such as producing valid code snippets, structured data,
+or grammatically correct sentences.
+
+Thanks to recent advancements, some models are already fine-tuned to incorporate
+these constraints inherently. These models are designed
+to seamlessly integrate constraints during the generation process, reducing
+the need for extensive post-processing. By doing so, they enhance the efficiency
+and accuracy of tasks that require strict adherence to predefined rules.
+This built-in capability makes them particularly valuable in applications
+like automated content creation, data validation, and real-time language
+translation, where precision and reliability are paramount.
+
+This tutorial is based on [Hermes-2-Pro-Llama-3-8B](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B),
+which already supports JSON Structured Outputs. Extensive instructions on
+deploying the Hermes-2-Pro-Llama-3-8B model with Triton Inference Server and
+the TensorRT-LLM backend can be found in [this](../../Popular_Models_Guide/Hermes-2-Pro-Llama-3-8B/README.md)
+tutorial. In such cases, the structure and quality of the produced output can be
+controlled through prompt engineering. To explore this path, please refer to the
+[Structured Generation via Prompt Engineering](#structured-generation-via-prompt-engineering)
+section of this tutorial.
+
+For scenarios where models are not inherently fine-tuned for
+constrained decoding, or when more precise control over the output is desired,
+dedicated libraries like
+[*LM Format Enforcer*](https://github.com/noamgat/lm-format-enforcer?tab=readme-ov-file)
+and [*Outlines*](https://github.com/outlines-dev/outlines?tab=readme-ov-file)
+offer robust solutions.
+These libraries provide tools to enforce specific constraints on model outputs,
+allowing developers to tailor the generation process to meet precise requirements.
+By leveraging such libraries, users can achieve greater control over the output,
+ensuring it aligns perfectly with the desired criteria, whether that involves
+maintaining a certain format, adhering to content guidelines, or ensuring
+grammatical correctness. In this tutorial, we'll show how to use
+*LM Format Enforcer* and *Outlines* in your workflow.
+
+## Prerequisite: Hermes-2-Pro-Llama-3-8B
+
+Before proceeding, please make sure that you've successfully deployed the
+[Hermes-2-Pro-Llama-3-8B](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B)
+model with Triton Inference Server and the TensorRT-LLM backend
+following [these steps](../../Popular_Models_Guide/Hermes-2-Pro-Llama-3-8B/README.md).
+
+## Structured Generation via Prompt Engineering
+
+First, let's start the Triton SDK container:
+```bash
+# Using the SDK container as an example
+docker run --rm -it --net host --shm-size=2g \
+    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
+    -v /path/to/tutorials:/tutorials \
+    -v /path/to/Hermes-2-Pro-Llama-3-8B/repo:/Hermes-2-Pro-Llama-3-8B \
+    nvcr.io/nvidia/tritonserver:-py3-sdk
+```
+
+The provided client script uses the `pydantic` library, which we do not ship with
+the SDK container. Make sure to install it before proceeding:
+
+```bash
+pip install pydantic
+```
+
+### Example 1
+
+For a fine-tuned model, we can enable JSON mode by simply composing a system prompt:
+
+```
+You are a helpful assistant that answers in JSON.
+```
+Please refer to [`client.py`](./artifacts/client.py) for the full `prompt`
+composition logic.
+
+```bash
+python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Give me information about Harry Potter and the Order of Phoenix" -o 200 --use-system-prompt
+```
+You should expect the following response:
+
+```
+...
+assistant
+{
+  "title": "Harry Potter and the Order of Phoenix",
+  "book_number": 5,
+  "author": "J.K. Rowling",
+  "series": "Harry Potter",
+  "publication_date": "June 21, 2003",
+  "page_count": 766,
+  "publisher": "Arthur A. Levine Books",
+  "genre": [
+    "Fantasy",
+    "Adventure",
+    "Young Adult"
+  ],
+  "awards": [
+    {
+      "award_name": "British Book Award",
+      "category": "Children's Book of the Year",
+      "year": 2004
+    }
+  ],
+  "plot_summary": "Harry Potter and the Order of Phoenix is the fifth book in the Harry Potter series. In this installment, Harry returns to Hogwarts School of Witchcraft and Wizardry for his fifth year. The Ministry of Magic is in denial about the return of Lord Voldemort, and Harry finds himself battling against the
+
+```
+
+### Example 2
+
+Optionally, we can also restrict the output to a specific schema. For example,
+in [`client.py`](./artifacts/client.py) we use the `pydantic` library to define
+the following answer format:
+
+```python
+from pydantic import BaseModel
+
+class AnswerFormat(BaseModel):
+    title: str
+    year: int
+    director: str
+    producer: str
+    plot: str
+
+...
+
+prompt += "Here's the json schema you must adhere to:\n\n{schema}\n".format(
+    schema=AnswerFormat.model_json_schema())
+
+```
+Let's try it out:
+
+```bash
+python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Give me information about Harry Potter and the Order of Phoenix" -o 200 --use-system-prompt --use-schema
+```
+You should expect the following response:
+
+```
+ ...
+assistant
+{
+  "title": "Harry Potter and the Order of Phoenix",
+  "year": 2007,
+  "director": "David Yates",
+  "producer": "David Heyman",
+  "plot": "Harry Potter and his friends must protect Hogwarts from a threat when the Ministry of Magic is taken over by Lord Voldemort's followers."
+}
+
+```
+
+## Enforcing Output Format via External Libraries
+
+In this section of the tutorial, we'll show how to impose constraints on LLMs
+that are not inherently fine-tuned for constrained decoding. We'll use
+[*LM Format Enforcer*](https://github.com/noamgat/lm-format-enforcer?tab=readme-ov-file)
+and [*Outlines*](https://github.com/outlines-dev/outlines?tab=readme-ov-file),
+both of which offer robust solutions.
+
+The reference implementation for both libraries is provided in the
+[`utils.py`](./artifacts/utils.py) script, which also defines the output
+format `AnswerFormat`:
+
+```python
+class WandFormat(BaseModel):
+    wood: str
+    core: str
+    length: float
+
+class AnswerFormat(BaseModel):
+    name: str
+    house: str
+    blood_status: str
+    occupation: str
+    alive: str
+    wand: WandFormat
+```
+
+### Pre-requisite: Common set-up
+
+Make sure you've successfully deployed the Hermes-2-Pro-Llama-3-8B model
+with Triton Inference Server and the TensorRT-LLM backend following
+[these steps](../../Popular_Models_Guide/Hermes-2-Pro-Llama-3-8B/README.md).
+> [!IMPORTANT]
+> Make sure that the `tutorials` folder is mounted to `/tutorials` when you
+> start the docker container.
+
+Upon successful setup, you should have the `/opt/tritonserver/inflight_batcher_llm`
+folder and be able to run a couple of inference requests (e.g. those provided in
+[example 1](#example-1) or [example 2](#example-2)).
+
+We'll make some adjustments to the model files, so if you have a running server,
+you can stop it via:
+```bash
+pkill tritonserver
+```
+
+#### Logits Post-Processor
+
+Both libraries limit the set of allowed tokens at every generation step.
+In TensorRT-LLM, a user can define a custom
+[logits post-processor](https://nvidia.github.io/TensorRT-LLM/advanced/batch-manager.html#logits-post-processor-optional)
+to mask out logits that should never be used in the current generation step.
+
+For TensorRT-LLM models deployed via the `python` backend (i.e. when
+[`triton_backend`](https://github.com/triton-inference-server/tensorrtllm_backend/blob/8aaf89bcf723dad112839fd36cbbe09e2e439c63/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt#L28C10-L28C29)
+is set to `python` in `tensorrt_llm/config.pbtxt`, Triton's Python backend uses
+[`model.py`](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py)
+to serve your TensorRT-LLM model), the custom logits post-processor should be
+specified during the model's initialization as part of the
+[Executor's](https://nvidia.github.io/TensorRT-LLM/executor.html#executor-api)
+configuration
+([`logits_post_processor_map`](https://github.com/NVIDIA/TensorRT-LLM/blob/32ed92e4491baf2d54682a21d247e1948cca996e/tensorrt_llm/hlapi/llm_utils.py#L205)).
+Below is a sample for reference.
+
+```diff
+...
+
++ executor_config.logits_post_processor_map = {
++    "": custom_logits_processor
++ }
+self.executor = trtllm.Executor(model_path=...,
+                                model_type=...,
+                                executor_config=executor_config)
+...
+```
+
+Additionally, if you want to enable the logits post-processor for each request
+individually, you can do so via an additional `input` parameter; a minimal
+sketch of the callback interface is shown below, followed by the concrete
+changes made in this tutorial.
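+
+For reference, the callback that TensorRT-LLM invokes has roughly the following
+shape. This is a condensed sketch based on the reference implementations in
+[`utils.py`](./artifacts/utils.py); in the real implementations the set of
+allowed token IDs is recomputed from `ids` at every step by the respective
+library, whereas here it is passed in once purely for illustration:
+
+```python
+from typing import List
+
+import torch
+
+
+def make_logits_processor(allowed_token_ids: List[int]):
+    """Build a callback that masks every token except `allowed_token_ids`."""
+
+    def custom_logits_processor(
+        req_id: int, logits: torch.Tensor, ids: List[List[int]], stream_ptr: int
+    ) -> None:
+        # Block every token by default ...
+        mask = torch.full_like(logits, fill_value=float("-inf"))
+        # ... then re-enable the tokens that are allowed at this step.
+        mask[:, :, allowed_token_ids] = 0
+        # Apply the mask on the CUDA stream TensorRT-LLM uses for this request.
+        with torch.cuda.stream(torch.cuda.ExternalStream(stream_ptr)):
+            logits += mask
+
+    return custom_logits_processor
+```
+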
+For example, in this tutorial we will add `logits_post_processor_name` in
+`inflight_batcher_llm/tensorrt_llm/config.pbtxt`:
+```diff
+input [
+    {
+        name: "input_ids"
+        data_type: TYPE_INT32
+        dims: [ -1 ]
+        allow_ragged_batch: true
+    },
+    ...
+    {
+        name: "lora_config"
+        data_type: TYPE_INT32
+        dims: [ -1, 3 ]
+        optional: true
+        allow_ragged_batch: true
+-    }
++    },
++    {
++        name: "logits_post_processor_name"
++        data_type: TYPE_STRING
++        dims: [ -1 ]
++        optional: true
++    }
+]
+...
+```
+and process it in the `execute` function in
+`inflight_batcher_llm/tensorrt_llm/1/model.py`:
+
+```diff
+def execute(self, requests):
+    """`execute` must be implemented in every Python model. `execute`
+    function receives a list of pb_utils.InferenceRequest as the only
+    argument. This function is called when an inference is requested
+    for this model.
+    Parameters
+    ----------
+    requests : list
+      A list of pb_utils.InferenceRequest
+    Returns
+    -------
+    list
+      A list of pb_utils.InferenceResponse. The length of this list must
+      be the same as `requests`
+    """
+    ...
+
+    for request in requests:
+        response_sender = request.get_response_sender()
+        if get_input_scalar_by_name(request, 'stop'):
+            self.handle_stop_request(request.request_id(), response_sender)
+        else:
+            try:
+                converted = convert_request(request,
+                                            self.exclude_input_from_output,
+                                            self.decoupled)
++               logits_post_processor_name = get_input_tensor_by_name(request, 'logits_post_processor_name')
++               if logits_post_processor_name is not None:
++                   converted.logits_post_processor_name = logits_post_processor_name.item().decode('utf-8')
+            except Exception as e:
+                ...
+```
+In this tutorial, we're deploying the Hermes-2-Pro-Llama-3-8B model as part of an
+ensemble. This means that a request is processed by the `ensemble` model first
+and then routed through the `preprocessing`, `tensorrt_llm`, and finally
+`postprocessing` models. This sequence, as well as the input and output
+mappings, is defined in `inflight_batcher_llm/ensemble/config.pbtxt`. Thus, we
+also need to update `inflight_batcher_llm/ensemble/config.pbtxt`, so that the
+`ensemble` model properly passes the additional input parameter to the
+`tensorrt_llm` model:
+
+```diff
+input [
+  {
+    name: "text_input"
+    data_type: TYPE_STRING
+    dims: [ -1 ]
+  },
+  ...
+  {
+    name: "embedding_bias_weights"
+    data_type: TYPE_FP32
+    dims: [ -1 ]
+    optional: true
+-  }
++  },
++  {
++    name: "logits_post_processor_name"
++    data_type: TYPE_STRING
++    dims: [ -1 ]
++    optional: true
++  }
+]
+output [
+  ...
+]
+ensemble_scheduling {
+  step [
+    {
+      model_name: "preprocessing"
+      model_version: -1
+      ...
+    },
+    {
+      model_name: "tensorrt_llm"
+      model_version: -1
+      input_map {
+        key: "input_ids"
+        value: "_INPUT_ID"
+      }
+      ...
+      input_map {
+        key: "bad_words_list"
+        value: "_BAD_WORDS_IDS"
+      }
++      input_map {
++        key: "logits_post_processor_name"
++        value: "logits_post_processor_name"
++      }
+      output_map {
+        key: "output_ids"
+        value: "_TOKENS_BATCH"
+      }
+      ...
+    }
+    ...
+```
+
+If you are following along with this tutorial, make sure the same changes are
+incorporated into the corresponding files of the
+`/opt/tritonserver/inflight_batcher_llm` repository.
+
+#### Tokenizer
+
+Both [*LM Format Enforcer*](https://github.com/noamgat/lm-format-enforcer?tab=readme-ov-file)
+and [*Outlines*](https://github.com/outlines-dev/outlines?tab=readme-ov-file)
+require access to the tokenizer at initialization time.
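+
+For context, this is roughly what each library does with that tokenizer at
+start-up, condensed from the reference implementation in
+[`utils.py`](./artifacts/utils.py) (the sketch assumes `utils.py` is importable
+and that the tokenizer lives under `/Hermes-2-Pro-Llama-3-8B`):
+
+```python
+import json
+
+from lmformatenforcer import JsonSchemaParser, TokenEnforcer
+from lmformatenforcer.integrations.trtllm import build_trtlmm_tokenizer_data
+from outlines.fsm.guide import RegexGuide
+from outlines.fsm.json_schema import build_regex_from_schema
+from outlines.integrations.utils import adapt_tokenizer
+from transformers import AutoTokenizer
+
+from utils import AnswerFormat  # the pydantic model shown above
+
+tokenizer = AutoTokenizer.from_pretrained("/Hermes-2-Pro-Llama-3-8B")
+schema = AnswerFormat.model_json_schema()
+
+# LM Format Enforcer: a TokenEnforcer filters token IDs with a JSON-schema parser.
+token_enforcer = TokenEnforcer(
+    build_trtlmm_tokenizer_data(tokenizer), JsonSchemaParser(schema)
+)
+
+# Outlines: the schema is compiled to a regex, then to a token-level FSM guide.
+guide = RegexGuide(build_regex_from_schema(json.dumps(schema)), adapt_tokenizer(tokenizer))
+```
+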
+In this tutorial, we'll expose the tokenizer via an
+`inflight_batcher_llm/tensorrt_llm/config.pbtxt` parameter:
+
+```txt
+parameters: {
+  key: "tokenizer_dir"
+  value: {
+    string_value: "/Hermes-2-Pro-Llama-3-8B"
+  }
+}
+```
+Simply append it to the end of `inflight_batcher_llm/tensorrt_llm/config.pbtxt`.
+
+#### Repository set up
+
+We've provided a sample implementation for *LM Format Enforcer* and *Outlines*
+in [`artifacts/utils.py`](./artifacts/utils.py). Make sure you've copied it into
+`/opt/tritonserver/inflight_batcher_llm/tensorrt_llm/1/lib` via
+
+```bash
+mkdir -p inflight_batcher_llm/tensorrt_llm/1/lib
+cp /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/utils.py inflight_batcher_llm/tensorrt_llm/1/lib/
+```
+Finally, let's install all required libraries:
+
+```bash
+pip install pydantic lm-format-enforcer outlines setuptools
+```
+
+### LM Format Enforcer
+
+To use LM Format Enforcer, make sure
+`inflight_batcher_llm/tensorrt_llm/1/model.py` contains the following changes:
+
+```diff
+...
+import tensorrt_llm.bindings.executor as trtllm
+
++ from lib.utils import LMFELogitsProcessor, AnswerFormat
+
+...
+
+class TritonPythonModel:
+    """Your Python model must use the same class name. Every Python model
+    that is created must have "TritonPythonModel" as the class name.
+    """
+    ...
+
+    def get_executor_config(self, model_config):
++       tokenizer_dir = model_config['parameters']['tokenizer_dir']['string_value']
++       logits_lmfe_processor = LMFELogitsProcessor(tokenizer_dir, AnswerFormat.model_json_schema())
+        kwargs = {
+            "max_beam_width":
+            get_parameter(model_config, "max_beam_width", int),
+            "scheduler_config":
+            self.get_scheduler_config(model_config),
+            "kv_cache_config":
+            self.get_kv_cache_config(model_config),
+            "enable_chunked_context":
+            get_parameter(model_config, "enable_chunked_context", bool),
+            "normalize_log_probs":
+            get_parameter(model_config, "normalize_log_probs", bool),
+            "batching_type":
+            convert_batching_type(get_parameter(model_config,
+                                                "gpt_model_type")),
+            "parallel_config":
+            self.get_parallel_config(model_config),
+            "peft_cache_config":
+            self.get_peft_cache_config(model_config),
+            "decoding_config":
+            self.get_decoding_config(model_config),
++           "logits_post_processor_map": {
++               LMFELogitsProcessor.PROCESSOR_NAME: logits_lmfe_processor
++           }
+        }
+        kwargs = {k: v for k, v in kwargs.items() if v is not None}
+        return trtllm.ExecutorConfig(**kwargs)
+...
+```
+
+#### Send an inference request
+
+First, let's start the Triton SDK container:
+```bash
+# Using the SDK container as an example
+docker run --rm -it --net host --shm-size=2g \
+    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
+    -v /path/to/tutorials:/tutorials \
+    -v /path/to/Hermes-2-Pro-Llama-3-8B/repo:/Hermes-2-Pro-Llama-3-8B \
+    nvcr.io/nvidia/tritonserver:-py3-sdk
+```
+
+The provided client script uses the `pydantic` library, which we do not ship with
+the SDK container. Make sure to install it before proceeding:
+
+```bash
+pip install pydantic
+```
+
+##### Option 1. Use provided [client script](./artifacts/client.py)
+
+Let's first send a standard request, without enforcing the JSON answer format:
+```bash
+python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Who is Harry Potter?" -o 100
+```
+
+You should expect the following response:
+
+```bash
+Who is Harry Potter? Harry Potter is a fictional character in a series of fantasy novels written by British author J.K. Rowling.
+The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic and subjugate all wizards and
+```
+
+Now, let's specify `logits_post_processor_name` in our request:
+
+```bash
+python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Who is Harry Potter?" -o 100 --logits-post-processor-name "lmfe"
+```
+
+This time, the expected response looks like:
+```bash
+Who is Harry Potter?
+  {
+    "name": "Harry Potter",
+    "occupation": "Wizard",
+    "house": "Gryffindor",
+    "wand": {
+      "wood": "Holly",
+      "core": "Phoenix feather",
+      "length": 11
+    },
+    "blood_status": "Pure-blood",
+    "alive": "Yes"
+  }
+```
+As we can see, the schema defined in [`utils.py`](./artifacts/utils.py) is
+respected. Note that LM Format Enforcer lets the LLM control the order of the
+generated fields, so re-ordering of fields is allowed.
+
+##### Option 2. Use [generate endpoint](https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0#query-the-server-with-the-triton-generate-endpoint).
+
+Let's first send a standard request, without enforcing the JSON answer format:
+```bash
+curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "Who is Harry Potter?", "max_tokens": 100, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
+```
+
+You should expect the following response:
+
+```bash
+{"context_logits":0.0,...,"text_output":"Who is Harry Potter? Harry Potter is a fictional character in a series of fantasy novels written by British author J.K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic and subjugate all wizards and"}
+```
+
+Now, let's specify `logits_post_processor_name` in our request:
+
+```bash
+curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "Who is Harry Potter?", "max_tokens": 100, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "logits_post_processor_name": "lmfe"}'
+```
+
+This time, the expected response looks like:
+```bash
+{"context_logits":0.0,...,"text_output":"Who is Harry Potter? \t\t\t\n\t\t{\n\t\t\t\"name\": \"Harry Potter\",\n\t\t\t\"occupation\": \"Wizard\",\n\t\t\t\"house\": \"Gryffindor\",\n\t\t\t\"wand\": {\n\t\t\t\t\"wood\": \"Holly\",\n\t\t\t\t\"core\": \"Phoenix feather\",\n\t\t\t\t\"length\": 11\n\t\t\t},\n\t\t\t\"blood_status\": \"Pure-blood\",\n\t\t\t\"alive\": \"Yes\"\n\t\t}\n\n\t\t\n\n\n\n\t\t\n"}
+```
+
+### Outlines
+
+To use Outlines, make sure
+`inflight_batcher_llm/tensorrt_llm/1/model.py` contains the following changes:
+
+```diff
+...
+import tensorrt_llm.bindings.executor as trtllm
+
++ from lib.utils import OutlinesLogitsProcessor, AnswerFormat
+
+...
+
+class TritonPythonModel:
+    """Your Python model must use the same class name. Every Python model
+    that is created must have "TritonPythonModel" as the class name.
+    """
+    ...
+
+    def get_executor_config(self, model_config):
++       tokenizer_dir = model_config['parameters']['tokenizer_dir']['string_value']
++       logits_outlines_processor = OutlinesLogitsProcessor(tokenizer_dir, AnswerFormat.model_json_schema())
+        kwargs = {
+            "max_beam_width":
+            get_parameter(model_config, "max_beam_width", int),
+            "scheduler_config":
+            self.get_scheduler_config(model_config),
+            "kv_cache_config":
+            self.get_kv_cache_config(model_config),
+            "enable_chunked_context":
+            get_parameter(model_config, "enable_chunked_context", bool),
+            "normalize_log_probs":
+            get_parameter(model_config, "normalize_log_probs", bool),
+            "batching_type":
+            convert_batching_type(get_parameter(model_config,
+                                                "gpt_model_type")),
+            "parallel_config":
+            self.get_parallel_config(model_config),
+            "peft_cache_config":
+            self.get_peft_cache_config(model_config),
+            "decoding_config":
+            self.get_decoding_config(model_config),
++           "logits_post_processor_map": {
++               OutlinesLogitsProcessor.PROCESSOR_NAME: logits_outlines_processor
++           }
+        }
+        kwargs = {k: v for k, v in kwargs.items() if v is not None}
+        return trtllm.ExecutorConfig(**kwargs)
+...
+```
+
+#### Send an inference request
+
+First, let's start the Triton SDK container:
+```bash
+# Using the SDK container as an example
+docker run --rm -it --net host --shm-size=2g \
+    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
+    -v /path/to/tutorials:/tutorials \
+    -v /path/to/Hermes-2-Pro-Llama-3-8B/repo:/Hermes-2-Pro-Llama-3-8B \
+    nvcr.io/nvidia/tritonserver:-py3-sdk
+```
+
+The provided client script uses the `pydantic` library, which we do not ship with
+the SDK container. Make sure to install it before proceeding:
+
+```bash
+pip install pydantic
+```
+
+##### Option 1. Use provided [client script](./artifacts/client.py)
+
+Let's first send a standard request, without enforcing the JSON answer format:
+```bash
+python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Who is Harry Potter?" -o 100
+```
+
+You should expect the following response:
+
+```bash
+Who is Harry Potter? Harry Potter is a fictional character in a series of fantasy novels written by British author J.K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic and subjugate all wizards and
+```
+
+Now, let's specify `logits_post_processor_name` in our request:
+
+```bash
+python3 /tutorials/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py --prompt "Who is Harry Potter?" -o 100 --logits-post-processor-name "outlines"
+```
+
+This time, the expected response looks like:
+```bash
+Who is Harry Potter?{ "name": "Harry Potter","house": "Gryffindor","blood_status": "Pure-blood","occupation": "Wizards","alive": "No","wand": {"wood": "Holly","core": "Phoenix feather","length": 11 }}
+```
+As we can see, the schema defined in [`utils.py`](./artifacts/utils.py) is
+respected. Note that Outlines drives generation with a regex built from the
+schema, so the fields appear in the fixed order defined by `AnswerFormat`
+(compare with the LM Format Enforcer output above, where re-ordering of fields
+was allowed).
+
+##### Option 2. Use [generate endpoint](https://github.com/triton-inference-server/tensorrtllm_backend/tree/release/0.5.0#query-the-server-with-the-triton-generate-endpoint).
+ +Let's first send a standard request, without enforcing the JSON answer format: +```bash +curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "Who is Harry Potter?", "max_tokens": 100, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}' +``` + +You should expect the following response: + +```bash +{"context_logits":0.0,...,"text_output":"Who is Harry Potter? Harry Potter is a fictional character in a series of fantasy novels written by British author J.K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic and subjugate all wizards and"} +``` + +Now, let's specify `logits_post_processor_name` in our request: + +```bash +curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "Who is Harry Potter?", "max_tokens": 100, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "logits_post_processor_name": "outlines"}' +``` + +This time, the expected response looks like: +```bash +{"context_logits":0.0,...,"text_output":"Who is Harry Potter?{ \"name\": \"Harry Potter\",\"house\": \"Gryffindor\",\"blood_status\": \"Pure-blood\",\"occupation\": \"Wizards\",\"alive\": \"No\",\"wand\": {\"wood\": \"Holly\",\"core\": \"Phoenix feather\",\"length\": 11 }}"} +``` \ No newline at end of file diff --git a/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py b/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py new file mode 100755 index 00000000..f9f2a6e8 --- /dev/null +++ b/AI_Agents_Guide/Constrained_Decoding/artifacts/client.py @@ -0,0 +1,278 @@ +#!/usr/bin/python +# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
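+
+"""Example gRPC client for the constrained decoding tutorial.
+
+Sends a prompt to a Triton + TensorRT-LLM deployment of Hermes-2-Pro-Llama-3-8B
+and prints the generated text. Optional flags add the JSON system prompt
+(--use-system-prompt), append a JSON schema to the prompt (--use-schema),
+or select a server-side logits post-processor
+(--logits-post-processor-name, e.g. "lmfe" or "outlines").
+
+Example:
+    python3 client.py --prompt "Who is Harry Potter?" -o 100 --use-system-prompt
+"""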
+ +import argparse +import sys + +import client_utils +import numpy as np +import tritonclient.grpc as grpcclient +from pydantic import BaseModel + + +class AnswerFormat(BaseModel): + title: str + year: int + director: str + producer: str + plot: str + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "-v", + "--verbose", + action="store_true", + required=False, + default=False, + help="Enable verbose output", + ) + parser.add_argument( + "-u", "--url", type=str, required=False, help="Inference server URL." + ) + + parser.add_argument("-p", "--prompt", type=str, required=True, help="Input prompt.") + + parser.add_argument( + "--model-name", + type=str, + required=False, + default="ensemble", + choices=["ensemble", "tensorrt_llm_bls"], + help="Name of the Triton model to send request to", + ) + + parser.add_argument( + "-S", + "--streaming", + action="store_true", + required=False, + default=False, + help="Enable streaming mode. Default is False.", + ) + + parser.add_argument( + "-b", + "--beam-width", + required=False, + type=int, + default=1, + help="Beam width value", + ) + + parser.add_argument( + "--temperature", + type=float, + required=False, + default=1.0, + help="temperature value", + ) + + parser.add_argument( + "--repetition-penalty", + type=float, + required=False, + default=None, + help="The repetition penalty value", + ) + + parser.add_argument( + "--presence-penalty", + type=float, + required=False, + default=None, + help="The presence penalty value", + ) + + parser.add_argument( + "--frequency-penalty", + type=float, + required=False, + default=None, + help="The frequency penalty value", + ) + + parser.add_argument( + "-o", + "--output-len", + type=int, + default=100, + required=False, + help="Specify output length", + ) + + parser.add_argument( + "--request-id", + type=str, + default="", + required=False, + help="The request_id for the stop request", + ) + + parser.add_argument("--stop-words", nargs="+", default=[], help="The stop words") + + parser.add_argument("--bad-words", nargs="+", default=[], help="The bad words") + + parser.add_argument( + "--embedding-bias-words", nargs="+", default=[], help="The biased words" + ) + + parser.add_argument( + "--embedding-bias-weights", + nargs="+", + default=[], + help="The biased words weights", + ) + + parser.add_argument( + "--overwrite-output-text", + action="store_true", + required=False, + default=False, + help="In streaming mode, overwrite previously received output text instead of appending to it", + ) + + parser.add_argument( + "--return-context-logits", + action="store_true", + required=False, + default=False, + help="Return context logits, the engine must be built with gather_context_logits or gather_all_token_logits", + ) + + parser.add_argument( + "--return-generation-logits", + action="store_true", + required=False, + default=False, + help="Return generation logits, the engine must be built with gather_ generation_logits or gather_all_token_logits", + ) + + parser.add_argument( + "--end-id", type=int, required=False, help="The token id for end token." + ) + + parser.add_argument( + "--pad-id", type=int, required=False, help="The token id for pad token." 
+ ) + + parser.add_argument( + "--use-system-prompt", + action="store_true", + required=False, + default=False, + help="Enhance text input with system prompt.", + ) + + parser.add_argument( + "--use-schema", + action="store_true", + required=False, + default=False, + help="Use client-defined JSON schema.", + ) + + parser.add_argument( + "--logits-post-processor-name", + type=str, + required=False, + default=None, + help="Logits Post-Processor to use for output generation.", + ) + + FLAGS = parser.parse_args() + if FLAGS.url is None: + FLAGS.url = "localhost:8001" + + embedding_bias_words = ( + FLAGS.embedding_bias_words if FLAGS.embedding_bias_words else None + ) + embedding_bias_weights = ( + FLAGS.embedding_bias_weights if FLAGS.embedding_bias_weights else None + ) + + try: + client = grpcclient.InferenceServerClient(url=FLAGS.url) + except Exception as e: + print("client creation failed: " + str(e)) + sys.exit(1) + + return_context_logits_data = None + if FLAGS.return_context_logits: + return_context_logits_data = np.array( + [[FLAGS.return_context_logits]], dtype=bool + ) + + return_generation_logits_data = None + if FLAGS.return_generation_logits: + return_generation_logits_data = np.array( + [[FLAGS.return_generation_logits]], dtype=bool + ) + + prompt = FLAGS.prompt + + if FLAGS.use_system_prompt: + prompt = ( + "<|im_start|>system\n You are a helpful assistant that answers in JSON." + ) + + if FLAGS.use_schema: + prompt += "Here's the json schema you must adhere to:\n\n{schema}\n".format( + schema=AnswerFormat.model_json_schema() + ) + + prompt += "<|im_end|>\n<|im_start|>user\n{user_prompt}<|im_end|>\n<|im_start|>assistant\n".format( + user_prompt=FLAGS.prompt + ) + + output_text = client_utils.run_inference( + client, + prompt, + FLAGS.output_len, + FLAGS.request_id, + FLAGS.repetition_penalty, + FLAGS.presence_penalty, + FLAGS.frequency_penalty, + FLAGS.temperature, + FLAGS.stop_words, + FLAGS.bad_words, + embedding_bias_words, + embedding_bias_weights, + FLAGS.model_name, + FLAGS.streaming, + FLAGS.beam_width, + FLAGS.overwrite_output_text, + return_context_logits_data, + return_generation_logits_data, + FLAGS.end_id, + FLAGS.pad_id, + FLAGS.verbose, + logits_post_processor_name=FLAGS.logits_post_processor_name, + ) + + print(output_text) diff --git a/AI_Agents_Guide/Constrained_Decoding/artifacts/client_utils.py b/AI_Agents_Guide/Constrained_Decoding/artifacts/client_utils.py new file mode 100755 index 00000000..1890f3c8 --- /dev/null +++ b/AI_Agents_Guide/Constrained_Decoding/artifacts/client_utils.py @@ -0,0 +1,225 @@ +#!/usr/bin/python + +# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. 
+# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import queue +from functools import partial + +import numpy as np +import tritonclient.grpc as grpcclient +from tritonclient.utils import InferenceServerException, np_to_triton_dtype + + +def prepare_tensor(name, input): + t = grpcclient.InferInput(name, input.shape, np_to_triton_dtype(input.dtype)) + t.set_data_from_numpy(input) + return t + + +class UserData: + def __init__(self): + self._completed_requests = queue.Queue() + + +def callback(user_data, result, error): + if error: + user_data._completed_requests.put(error) + else: + user_data._completed_requests.put(result) + + +def run_inference( + triton_client, + prompt, + output_len, + request_id, + repetition_penalty, + presence_penalty, + frequency_penalty, + temperature, + stop_words, + bad_words, + embedding_bias_words, + embedding_bias_weights, + model_name, + streaming, + beam_width, + overwrite_output_text, + return_context_logits_data, + return_generation_logits_data, + end_id, + pad_id, + verbose, + num_draft_tokens=0, + use_draft_logits=None, + logits_post_processor_name=None, +): + input0 = [[prompt]] + input0_data = np.array(input0).astype(object) + output0_len = np.ones_like(input0).astype(np.int32) * output_len + streaming_data = np.array([[streaming]], dtype=bool) + beam_width_data = np.array([[beam_width]], dtype=np.int32) + temperature_data = np.array([[temperature]], dtype=np.float32) + + inputs = [ + prepare_tensor("text_input", input0_data), + prepare_tensor("max_tokens", output0_len), + prepare_tensor("stream", streaming_data), + prepare_tensor("beam_width", beam_width_data), + prepare_tensor("temperature", temperature_data), + ] + + if num_draft_tokens > 0: + inputs.append( + prepare_tensor( + "num_draft_tokens", np.array([[num_draft_tokens]], dtype=np.int32) + ) + ) + if use_draft_logits is not None: + inputs.append( + prepare_tensor( + "use_draft_logits", np.array([[use_draft_logits]], dtype=bool) + ) + ) + + if bad_words: + bad_words_list = np.array([bad_words], dtype=object) + inputs += [prepare_tensor("bad_words", bad_words_list)] + + if stop_words: + stop_words_list = np.array([stop_words], dtype=object) + inputs += [prepare_tensor("stop_words", stop_words_list)] + + if repetition_penalty is not None: + repetition_penalty = [[repetition_penalty]] + repetition_penalty_data = np.array(repetition_penalty, dtype=np.float32) + inputs += [prepare_tensor("repetition_penalty", repetition_penalty_data)] + + if presence_penalty is not None: + presence_penalty = [[presence_penalty]] + presence_penalty_data = np.array(presence_penalty, dtype=np.float32) + inputs += [prepare_tensor("presence_penalty", presence_penalty_data)] + + if frequency_penalty is not None: + frequency_penalty = [[frequency_penalty]] + frequency_penalty_data = 
np.array(frequency_penalty, dtype=np.float32) + inputs += [prepare_tensor("frequency_penalty", frequency_penalty_data)] + + if return_context_logits_data is not None: + inputs += [ + prepare_tensor("return_context_logits", return_context_logits_data), + ] + + if return_generation_logits_data is not None: + inputs += [ + prepare_tensor("return_generation_logits", return_generation_logits_data), + ] + + if (embedding_bias_words is not None and embedding_bias_weights is None) or ( + embedding_bias_words is None and embedding_bias_weights is not None + ): + assert 0, "Both embedding bias words and weights must be specified" + + if embedding_bias_words is not None and embedding_bias_weights is not None: + assert len(embedding_bias_words) == len( + embedding_bias_weights + ), "Embedding bias weights and words must have same length" + embedding_bias_words_data = np.array([embedding_bias_words], dtype=object) + embedding_bias_weights_data = np.array( + [embedding_bias_weights], dtype=np.float32 + ) + inputs.append(prepare_tensor("embedding_bias_words", embedding_bias_words_data)) + inputs.append( + prepare_tensor("embedding_bias_weights", embedding_bias_weights_data) + ) + if end_id is not None: + end_id_data = np.array([[end_id]], dtype=np.int32) + inputs += [prepare_tensor("end_id", end_id_data)] + + if pad_id is not None: + pad_id_data = np.array([[pad_id]], dtype=np.int32) + inputs += [prepare_tensor("pad_id", pad_id_data)] + + if logits_post_processor_name is not None: + logits_post_processor_name_data = np.array( + [[logits_post_processor_name]], dtype=object + ) + inputs += [ + prepare_tensor( + "logits_post_processor_name", logits_post_processor_name_data + ) + ] + + user_data = UserData() + # Establish stream + triton_client.start_stream(callback=partial(callback, user_data)) + # Send request + triton_client.async_stream_infer(model_name, inputs, request_id=request_id) + + # Wait for server to close the stream + triton_client.stop_stream() + + # Parse the responses + output_text = "" + while True: + try: + result = user_data._completed_requests.get(block=False) + except Exception: + break + + if type(result) == InferenceServerException: + print("Received an error from server:") + print(result) + else: + output = result.as_numpy("text_output") + if streaming and beam_width == 1: + new_output = output[0].decode("utf-8") + if overwrite_output_text: + output_text = new_output + else: + output_text += new_output + else: + output_text = output[0].decode("utf-8") + if verbose: + print(output, flush=True) + + if return_context_logits_data is not None: + context_logits = result.as_numpy("context_logits") + if verbose: + print(f"context_logits.shape: {context_logits.shape}") + print(f"context_logits: {context_logits}") + if return_generation_logits_data is not None: + generation_logits = result.as_numpy("generation_logits") + if verbose: + print(f"generation_logits.shape: {generation_logits.shape}") + print(f"generation_logits: {generation_logits}") + + if streaming and beam_width == 1: + if verbose: + print(output_text) + + return output_text diff --git a/AI_Agents_Guide/Constrained_Decoding/artifacts/utils.py b/AI_Agents_Guide/Constrained_Decoding/artifacts/utils.py new file mode 100644 index 00000000..70ef4237 --- /dev/null +++ b/AI_Agents_Guide/Constrained_Decoding/artifacts/utils.py @@ -0,0 +1,187 @@ +# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of NVIDIA CORPORATION nor the names of its +# contributors may be used to endorse or promote products derived +# from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY +# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR +# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY +# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import json +from collections import defaultdict +from typing import DefaultDict, Dict, List + +import torch +from lmformatenforcer import JsonSchemaParser, TokenEnforcer +from lmformatenforcer.integrations.trtllm import build_trtlmm_tokenizer_data +from outlines.fsm.guide import RegexGuide +from outlines.fsm.json_schema import build_regex_from_schema +from outlines.integrations.utils import adapt_tokenizer +from pydantic import BaseModel +from transformers import AutoTokenizer + + +class WandFormat(BaseModel): + """Represents the format of a wand description. + + Attributes: + wood (str): The type of wood used in the wand. + core (str): The core material of the wand. + length (float): The length of the wand. + """ + + wood: str + core: str + length: float + + +class AnswerFormat(BaseModel): + """Represents the output format, which LLM should follow. + + Attributes: + name (str): The name of the person. + house (str): The house affiliation of the person (e.g., Gryffindor). + blood_status (str): The blood status (e.g., pure-blood). + occupation (str): The occupation of the person. + alive (str): Whether the person is alive. + wand (WandFormat): The wand information. + """ + + name: str + house: str + blood_status: str + occupation: str + alive: str + wand: WandFormat + + +class LMFELogitsProcessor: + """ + The class implementing logits post-processor via LM Format Enforcer. + """ + + PROCESSOR_NAME = "lmfe" + + def __init__(self, tokenizer_dir, schema): + tokenizer = AutoTokenizer.from_pretrained( + tokenizer_dir, legacy=False, padding_side="left", trust_remote_code=True + ) + self.eos_token = tokenizer.eos_token_id + tokenizer_data = build_trtlmm_tokenizer_data(tokenizer) + # TokenEnforcer provides a token filtering mechanism, + # given a tokenizer and a CharacterLevelParser. 
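+        # Given the token IDs generated so far, get_allowed_tokens() returns
+        # the token IDs that may legally come next under the JSON schema.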
+ # ref: https://github.com/noamgat/lm-format-enforcer/blob/fe6cbf107218839624e3ab39b47115bf7f64dd6e/lmformatenforcer/tokenenforcer.py#L32 + self.token_enforcer = TokenEnforcer(tokenizer_data, JsonSchemaParser(schema)) + + def get_allowed_tokens(self, ids): + def _trim(ids): + return [x for x in ids if x != self.eos_token] + + allowed = self.token_enforcer.get_allowed_tokens(_trim(ids[0])) + return allowed + + def __call__( + self, + req_id: int, + logits: torch.Tensor, + ids: List[List[int]], + stream_ptr: int, + ): + # Create a mask with negative infinity to block all tokens initially. + mask = torch.full_like(logits, fill_value=float("-inf"), device=logits.device) + allowed = self.get_allowed_tokens(ids) + # Update the mask to zero for allowed tokens, + # allowing them to be selected. + mask[:, :, allowed] = 0 + with torch.cuda.stream(torch.cuda.ExternalStream(stream_ptr)): + logits += mask + + +class OutlinesLogitsProcessor: + """ + The class implementing logits post-processor via Outlines. + """ + + PROCESSOR_NAME = "outlines" + + def __init__(self, tokenizer_dir, schema): + tokenizer = AutoTokenizer.from_pretrained( + tokenizer_dir, legacy=False, padding_side="left", trust_remote_code=True + ) + tokenizer = adapt_tokenizer(tokenizer) + regex_string = build_regex_from_schema(json.dumps(schema)) + self.fsm = RegexGuide(regex_string, tokenizer) + self._fsm_state: DefaultDict[int, int] = defaultdict(int) + self.mask_cache: Dict[int, torch.Tensor] = {} + # By default, TensorRT-LLM includes request query into the output. + # Outlines should only look at generated outputs, thus we'll keep + # track of the request's input prefix. + self._prefix = [-1] + + def __call__( + self, + req_id: int, + logits: torch.Tensor, + ids: List[List[int]], + stream_ptr: int, + ): + seq_id = None + # If the prefix token IDs have changed we assume that we are dealing + # with a new sample and reset the FSM state + if ( + ids[0][: len(self._prefix)] != self._prefix + # handling edge case, when the new request is identical to already + # processed + or len(ids[0][len(self._prefix) :]) == 0 + ): + self._fsm_state = defaultdict(int) + self._prefix = ids[0] + seq_id = hash(tuple([])) + + else: + # Remove the prefix token IDs from the input token IDs, + # because the FSM should only be applied to the generated tokens + ids = ids[0][len(self._prefix) :] + last_token = ids[-1] + last_seq_id = hash(tuple(ids[:-1])) + seq_id = hash(tuple(ids)) + self._fsm_state[seq_id] = self.fsm.get_next_state( + state=self._fsm_state[last_seq_id], token_id=last_token + ) + + state_id = self._fsm_state[seq_id] + if state_id not in self.mask_cache: + allowed_tokens = self.fsm.get_next_instruction( + state=self._fsm_state[seq_id] + ).tokens + # Create a mask with negative infinity to block all + # tokens initially. + mask = torch.full_like( + logits, fill_value=float("-inf"), device=logits.device + ) + # Update the mask to zero for allowed tokens, + # allowing them to be selected. 
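+            # The mask depends only on the FSM state, so it is cached per
+            # state and reused whenever the same state is reached again.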
+ mask[:, :, allowed_tokens] = 0 + self.mask_cache[state_id] = mask + else: + mask = self.mask_cache[state_id] + + with torch.cuda.stream(torch.cuda.ExternalStream(stream_ptr)): + logits += mask diff --git a/Popular_Models_Guide/Hermes-2-Pro-Llama-3-8B/README.md b/Popular_Models_Guide/Hermes-2-Pro-Llama-3-8B/README.md index 1c3625b4..29b007bb 100644 --- a/Popular_Models_Guide/Hermes-2-Pro-Llama-3-8B/README.md +++ b/Popular_Models_Guide/Hermes-2-Pro-Llama-3-8B/README.md @@ -163,7 +163,7 @@ MODEL_FOLDER=/opt/tritonserver/inflight_batcher_llm MAX_BATCH_SIZE=4 INSTANCE_COUNT=1 MAX_QUEUE_DELAY_MS=10000 -TRTLLM_BACKEND=tensorrtllm +TRTLLM_BACKEND=python FILL_TEMPLATE_SCRIPT=/tensorrtllm_backend/tools/fill_template.py python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT} python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}