From ae18647242b9033393d6725439e196ba2f1d0247 Mon Sep 17 00:00:00 2001 From: Fahad Alghanim Date: Fri, 30 Jan 2026 23:17:20 -0700 Subject: [PATCH 1/3] docs: add AWS (EC2/SageMaker) deployment + benchmarking guide - Add an AWS deployment tutorial for EC2 + SageMaker - Fix SageMaker example indentation and link to the new guide - Add the new guide to the docs toctree --- docs/source/_toctree.yml | 2 + docs/source/basic_tutorials/deploy_aws.md | 105 ++++++++++++++++++++++ docs/source/reference/api_reference.md | 42 +++++---- 3 files changed, 130 insertions(+), 19 deletions(-) create mode 100644 docs/source/basic_tutorials/deploy_aws.md diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml index 4c6f0151d9b..16afe376b1a 100644 --- a/docs/source/_toctree.yml +++ b/docs/source/_toctree.yml @@ -36,6 +36,8 @@ title: Serving Private & Gated Models - local: basic_tutorials/using_cli title: Using TGI CLI + - local: basic_tutorials/deploy_aws + title: Deploying on AWS (EC2 and SageMaker) - local: basic_tutorials/non_core_models title: Non-core Model Serving - local: basic_tutorials/safety diff --git a/docs/source/basic_tutorials/deploy_aws.md b/docs/source/basic_tutorials/deploy_aws.md new file mode 100644 index 00000000000..9c3fdfa8a6d --- /dev/null +++ b/docs/source/basic_tutorials/deploy_aws.md @@ -0,0 +1,105 @@ +# Deploying TGI on AWS (EC2 and SageMaker) + +This guide shows how to deploy **Text Generation Inference (TGI)** on AWS and how to benchmark it in a way that is useful for capacity planning. + +## Deploy on EC2 (Docker) + +For most setups, the simplest path is to run the official container on an EC2 GPU instance. + +1. **Launch an EC2 GPU instance** (for example `g5.*` for NVIDIA GPUs). +2. **Install Docker + NVIDIA Container Toolkit** (see [Using TGI with Nvidia GPUs](../installation_nvidia) and NVIDIA’s installation docs). +3. 
**Run TGI**: + +```bash +model=HuggingFaceH4/zephyr-7b-beta +volume=$PWD/data + +docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \ + ghcr.io/huggingface/text-generation-inference:3.3.5 \ + --model-id "$model" +``` + +4. **Smoke test**: + +```bash +curl 127.0.0.1:8080/generate \ + -X POST \ + -H 'Content-Type: application/json' \ + -d '{"inputs":"Hello","parameters":{"max_new_tokens":16}}' +``` + +## Deploy on SageMaker (real-time endpoint) + +TGI includes a SageMaker compatibility route (`POST /invocations`) and a SageMaker entrypoint (`sagemaker-entrypoint.sh`) that maps SageMaker environment variables to TGI launcher settings. + +If you are using Hugging Face’s SageMaker integration (recommended), you typically only need to set the model environment variables: + +- **`HF_MODEL_ID`**: model id on the Hub (required) +- **`HF_MODEL_REVISION`**: optional revision +- **`SM_NUM_GPUS`**: number of GPUs (SageMaker sets this) +- **`HF_MODEL_QUANTIZE`**: optional quantization +- **`HF_MODEL_TRUST_REMOTE_CODE`**: optional trust remote code flag + +For a minimal example using the Hugging Face SageMaker SDK and the official TGI image URI: + +```python +import json +import boto3 +import sagemaker +from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri + +try: + role = sagemaker.get_execution_role() +except ValueError: + iam = boto3.client("iam") + role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"] + +hub = { + "HF_MODEL_ID": "HuggingFaceH4/zephyr-7b-beta", + # SageMaker expects SM_NUM_GPUS to be a JSON-encoded int + "SM_NUM_GPUS": json.dumps(1), +} + +huggingface_model = HuggingFaceModel( + image_uri=get_huggingface_llm_image_uri("huggingface", version="3.3.5"), + env=hub, + role=role, +) + +predictor = huggingface_model.deploy( + initial_instance_count=1, + instance_type="ml.g5.2xlarge", + container_startup_health_check_timeout=300, +) + +predictor.predict( + { + "messages": [ + {"role": "system", "content": "You 
are a helpful assistant."}, + {"role": "user", "content": "What is deep learning?"}, + ] + } +) +``` + +## Benchmarking (what to measure, and how) + +For meaningful benchmarks, measure both: + +- **Client-visible latency** (end-to-end): p50/p95, time-to-first-token (TTFT), tokens/sec +- **Server-side performance metrics** (to attribute bottlenecks): see [metrics](../reference/metrics) + +### End-to-end HTTP benchmark (recommended for EC2/SageMaker) + +Use a load generator from *outside* the instance/endpoint VPC when possible (to include network overhead), and run a warmup phase before measuring. + +Example approach: + +1. Warm up with a small number of requests. +2. Run a fixed-duration load test at a target concurrency. +3. Record p50/p95 latency, error rate, and generated tokens/sec. + +### Microbenchmark (model server only) + +TGI also provides `text-generation-benchmark` (see the [benchmarking tool README](https://github.com/huggingface/text-generation-inference/tree/main/benchmark#readme)). This tool connects directly to the model server over a Unix socket and bypasses the router, so it’s useful for low-level profiling and batch-size sweeps, but it is **not** an end-to-end benchmark for SageMaker/HTTP. + diff --git a/docs/source/reference/api_reference.md b/docs/source/reference/api_reference.md index 7d21eca70a8..e7f30881cf0 100644 --- a/docs/source/reference/api_reference.md +++ b/docs/source/reference/api_reference.md @@ -141,7 +141,9 @@ TGI can be deployed on various cloud providers for scalable and robust text gene ## Amazon SageMaker -Amazon Sagemaker natively supports the message API: +Amazon SageMaker natively supports the Messages API. + +For a fuller deployment + benchmarking guide (including EC2), see [Deploying on AWS (EC2 and SageMaker)](../basic_tutorials/deploy_aws). 
```python import json @@ -150,36 +152,38 @@ import boto3 from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri try: - role = sagemaker.get_execution_role() + role = sagemaker.get_execution_role() except ValueError: - iam = boto3.client('iam') - role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn'] + iam = boto3.client("iam") + role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"] # Hub Model configuration. https://huggingface.co/models hub = { - 'HF_MODEL_ID':'HuggingFaceH4/zephyr-7b-beta', - 'SM_NUM_GPUS': json.dumps(1), + "HF_MODEL_ID": "HuggingFaceH4/zephyr-7b-beta", + "SM_NUM_GPUS": json.dumps(1), } # create Hugging Face Model Class huggingface_model = HuggingFaceModel( - image_uri=get_huggingface_llm_image_uri("huggingface",version="3.3.5"), - env=hub, - role=role, + image_uri=get_huggingface_llm_image_uri("huggingface", version="3.3.5"), + env=hub, + role=role, ) # deploy model to SageMaker Inference predictor = huggingface_model.deploy( - initial_instance_count=1, - instance_type="ml.g5.2xlarge", - container_startup_health_check_timeout=300, - ) + initial_instance_count=1, + instance_type="ml.g5.2xlarge", + container_startup_health_check_timeout=300, +) # send request -predictor.predict({ -"messages": [ - {"role": "system", "content": "You are a helpful assistant." 
}, - {"role": "user", "content": "What is deep learning?"} - ] -}) +predictor.predict( + { + "messages": [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "What is deep learning?"}, + ] + } +) ``` From c9e21b7a716ee6e425c53684a233d89bb64b2f65 Mon Sep 17 00:00:00 2001 From: Fahad Alghanim Date: Tue, 10 Mar 2026 20:48:42 -0700 Subject: [PATCH 2/3] docs: address AWS deployment review feedback --- docs/source/basic_tutorials/deploy_aws.md | 13 ++++++++----- docs/source/reference/api_reference.md | 4 ++-- 2 files changed, 10 insertions(+), 7 deletions(-) diff --git a/docs/source/basic_tutorials/deploy_aws.md b/docs/source/basic_tutorials/deploy_aws.md index 9c3fdfa8a6d..9d7c2e2be94 100644 --- a/docs/source/basic_tutorials/deploy_aws.md +++ b/docs/source/basic_tutorials/deploy_aws.md @@ -19,18 +19,20 @@ docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \ --model-id "$model" ``` -4. **Smoke test**: +4. **Send Chat Completions API request**: ```bash -curl 127.0.0.1:8080/generate \ +curl 127.0.0.1:8080/v1/chat/completions \ -X POST \ -H 'Content-Type: application/json' \ - -d '{"inputs":"Hello","parameters":{"max_new_tokens":16}}' + -d '{"model":"tgi","messages":[{"role":"user","content":"What is Deep Learning?"}]}' ``` ## Deploy on SageMaker (real-time endpoint) -TGI includes a SageMaker compatibility route (`POST /invocations`) and a SageMaker entrypoint (`sagemaker-entrypoint.sh`) that maps SageMaker environment variables to TGI launcher settings. +TGI includes a SageMaker compatibility route (`POST /invocations`) and a SageMaker entrypoint (`sagemaker-entrypoint.sh`) that maps SageMaker environment variables to TGI launcher settings. The `/invocations` route forwards requests to `/v1/chat/completions` underneath. + +> **Warning:** For this flow, use the AWS SageMaker SDK `< 3.0`. For example: `pip install "sagemaker<3"`. 
If you are using Hugging Face’s SageMaker integration (recommended), you typically only need to set the model environment variables: @@ -93,6 +95,8 @@ For meaningful benchmarks, measure both: Use a load generator from *outside* the instance/endpoint VPC when possible (to include network overhead), and run a warmup phase before measuring. +You can use [inference-benchmarker](https://github.com/huggingface/inference-benchmarker) for end-to-end HTTP benchmarking. + Example approach: 1. Warm up with a small number of requests. @@ -102,4 +106,3 @@ Example approach: ### Microbenchmark (model server only) TGI also provides `text-generation-benchmark` (see the [benchmarking tool README](https://github.com/huggingface/text-generation-inference/tree/main/benchmark#readme)). This tool connects directly to the model server over a Unix socket and bypasses the router, so it’s useful for low-level profiling and batch-size sweeps, but it is **not** an end-to-end benchmark for SageMaker/HTTP. - diff --git a/docs/source/reference/api_reference.md b/docs/source/reference/api_reference.md index e7f30881cf0..598353348a0 100644 --- a/docs/source/reference/api_reference.md +++ b/docs/source/reference/api_reference.md @@ -141,9 +141,9 @@ TGI can be deployed on various cloud providers for scalable and robust text gene ## Amazon SageMaker -Amazon SageMaker natively supports the Messages API. +Amazon SageMaker natively supports the Chat Completions API. -For a fuller deployment + benchmarking guide (including EC2), see [Deploying on AWS (EC2 and SageMaker)](../basic_tutorials/deploy_aws). +For a complete deployment + benchmarking guide (including EC2), see [Deploying on AWS (EC2 and SageMaker)](../basic_tutorials/deploy_aws). 
 ```python
 import json

From e3ec8f382a62e0507fc10e636d5c03eae688954e Mon Sep 17 00:00:00 2001
From: Alvaro Bartolome <36760800+alvarobartt@users.noreply.github.com>
Date: Sat, 21 Mar 2026 20:33:56 +0900
Subject: [PATCH 3/3] Apply suggestion from code review

---
 docs/source/basic_tutorials/deploy_aws.md | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/docs/source/basic_tutorials/deploy_aws.md b/docs/source/basic_tutorials/deploy_aws.md
index 9d7c2e2be94..eac4a8ffe12 100644
--- a/docs/source/basic_tutorials/deploy_aws.md
+++ b/docs/source/basic_tutorials/deploy_aws.md
@@ -30,6 +30,11 @@

 ## Deploy on SageMaker (real-time endpoint)

+```bash
+pip install "sagemaker<3.0.0" --upgrade --quiet
+```
+
+> [!WARNING]
+> [SageMaker Python SDK v3 has recently been released](https://github.com/aws/sagemaker-python-sdk), so unless specified otherwise, all the documentation and tutorials still use the [SageMaker Python SDK v2](https://github.com/aws/sagemaker-python-sdk/tree/master-v2). We are actively working on updating all the tutorials and examples, but in the meantime make sure to install the SageMaker SDK with `pip install "sagemaker<3.0.0"`.
+
 TGI includes a SageMaker compatibility route (`POST /invocations`) and a SageMaker entrypoint (`sagemaker-entrypoint.sh`) that maps SageMaker environment variables to TGI launcher settings. The `/invocations` route forwards requests to `/v1/chat/completions` underneath.
-
-> **Warning:** For this flow, use the AWS SageMaker SDK `< 3.0`. For example: `pip install "sagemaker<3"`.
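
The fixed-duration benchmark loop described in the new guide (warm up, run at a target concurrency, record p50/p95 latency and tokens/sec) can be sketched as below. This is a minimal illustration, not part of the patches above: the endpoint URL, concurrency, duration, and the reliance on an OpenAI-style `usage.completion_tokens` field in the response are all assumptions to adapt to your deployment.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def run_load_test(send_request, concurrency=4, duration_s=10.0, warmup=3):
    """Warm up, then drive `send_request` (a zero-arg callable returning
    (latency_seconds, generated_tokens)) at a fixed concurrency for a fixed
    duration, and report p50/p95 latency plus aggregate tokens/sec."""
    for _ in range(warmup):  # warmup requests are not measured
        send_request()

    deadline = time.perf_counter() + duration_s

    def worker():
        samples = []
        while time.perf_counter() < deadline:
            samples.append(send_request())
        return samples

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker) for _ in range(concurrency)]
        samples = [s for f in futures for s in f.result()]
    elapsed = time.perf_counter() - start

    latencies = sorted(lat for lat, _ in samples)

    def pct(q):
        # nearest-rank percentile over the sorted latency list
        return latencies[min(len(latencies) - 1, int(q * len(latencies)))]

    return {
        "requests": len(samples),
        "p50_s": pct(0.50),
        "p95_s": pct(0.95),
        "tokens_per_s": sum(n for _, n in samples) / elapsed,
    }


def tgi_chat_request():
    """One Chat Completions request against a local TGI container.
    The URL and max_tokens value are illustrative assumptions."""
    body = json.dumps({
        "model": "tgi",
        "messages": [{"role": "user", "content": "What is deep learning?"}],
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        "http://127.0.0.1:8080/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    t0 = time.perf_counter()
    out = json.load(urllib.request.urlopen(req, timeout=60))
    return time.perf_counter() - t0, out["usage"]["completion_tokens"]


# Example (requires a running endpoint):
# print(run_load_test(tgi_chat_request, concurrency=8, duration_s=60.0))
```

The same harness carries over to a SageMaker endpoint by swapping `send_request` for a callable that times `predictor.predict(...)`; for production-grade runs, prefer the inference-benchmarker tool referenced in the guide.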