From ae18647242b9033393d6725439e196ba2f1d0247 Mon Sep 17 00:00:00 2001 From: Fahad Alghanim Date: Fri, 30 Jan 2026 23:17:20 -0700 Subject: [PATCH 1/3] docs: add AWS (EC2/SageMaker) deployment + benchmarking guide - Add an AWS deployment tutorial for EC2 + SageMaker - Fix SageMaker example indentation and link to the new guide - Add the new guide to the docs toctree --- docs/source/_toctree.yml | 2 + docs/source/basic_tutorials/deploy_aws.md | 105 ++++++++++++++++++++++ docs/source/reference/api_reference.md | 42 +++++---- 3 files changed, 130 insertions(+), 19 deletions(-) create mode 100644 docs/source/basic_tutorials/deploy_aws.md diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml index 4c6f0151d9b..16afe376b1a 100644 --- a/docs/source/_toctree.yml +++ b/docs/source/_toctree.yml @@ -36,6 +36,8 @@ title: Serving Private & Gated Models - local: basic_tutorials/using_cli title: Using TGI CLI + - local: basic_tutorials/deploy_aws + title: Deploying on AWS (EC2 and SageMaker) - local: basic_tutorials/non_core_models title: Non-core Model Serving - local: basic_tutorials/safety diff --git a/docs/source/basic_tutorials/deploy_aws.md b/docs/source/basic_tutorials/deploy_aws.md new file mode 100644 index 00000000000..9c3fdfa8a6d --- /dev/null +++ b/docs/source/basic_tutorials/deploy_aws.md @@ -0,0 +1,105 @@ +# Deploying TGI on AWS (EC2 and SageMaker) + +This guide shows how to deploy **Text Generation Inference (TGI)** on AWS and how to benchmark it in a way that is useful for capacity planning. + +## Deploy on EC2 (Docker) + +For most setups, the simplest path is to run the official container on an EC2 GPU instance. + +1. **Launch an EC2 GPU instance** (for example `g5.*` for NVIDIA GPUs). +2. **Install Docker + NVIDIA Container Toolkit** (see [Using TGI with Nvidia GPUs](../installation_nvidia) and NVIDIA’s installation docs). +3. 
**Run TGI**: + +```bash +model=HuggingFaceH4/zephyr-7b-beta +volume=$PWD/data + +docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \ + ghcr.io/huggingface/text-generation-inference:3.3.5 \ + --model-id "$model" +``` + +4. **Smoke test**: + +```bash +curl 127.0.0.1:8080/generate \ + -X POST \ + -H 'Content-Type: application/json' \ + -d '{"inputs":"Hello","parameters":{"max_new_tokens":16}}' +``` + +## Deploy on SageMaker (real-time endpoint) + +TGI includes a SageMaker compatibility route (`POST /invocations`) and a SageMaker entrypoint (`sagemaker-entrypoint.sh`) that maps SageMaker environment variables to TGI launcher settings. + +If you are using Hugging Face’s SageMaker integration (recommended), you typically only need to set the model environment variables: + +- **`HF_MODEL_ID`**: model id on the Hub (required) +- **`HF_MODEL_REVISION`**: optional revision +- **`SM_NUM_GPUS`**: number of GPUs (SageMaker sets this) +- **`HF_MODEL_QUANTIZE`**: optional quantization +- **`HF_MODEL_TRUST_REMOTE_CODE`**: optional trust remote code flag + +For a minimal example using the Hugging Face SageMaker SDK and the official TGI image URI: + +```python +import json +import boto3 +import sagemaker +from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri + +try: + role = sagemaker.get_execution_role() +except ValueError: + iam = boto3.client("iam") + role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"] + +hub = { + "HF_MODEL_ID": "HuggingFaceH4/zephyr-7b-beta", + # SageMaker expects SM_NUM_GPUS to be a JSON-encoded int + "SM_NUM_GPUS": json.dumps(1), +} + +huggingface_model = HuggingFaceModel( + image_uri=get_huggingface_llm_image_uri("huggingface", version="3.3.5"), + env=hub, + role=role, +) + +predictor = huggingface_model.deploy( + initial_instance_count=1, + instance_type="ml.g5.2xlarge", + container_startup_health_check_timeout=300, +) + +predictor.predict( + { + "messages": [ + {"role": "system", "content": "You 
are a helpful assistant."}, + {"role": "user", "content": "What is deep learning?"}, + ] + } +) +``` + +## Benchmarking (what to measure, and how) + +For meaningful benchmarks, measure both: + +- **Client-visible latency** (end-to-end): p50/p95, time-to-first-token (TTFT), tokens/sec +- **Server-side performance metrics** (to attribute bottlenecks): see [metrics](../reference/metrics) + +### End-to-end HTTP benchmark (recommended for EC2/SageMaker) + +Use a load generator from *outside* the instance/endpoint VPC when possible (to include network overhead), and run a warmup phase before measuring. + +Example approach: + +1. Warm up with a small number of requests. +2. Run a fixed-duration load test at a target concurrency. +3. Record p50/p95 latency, error rate, and generated tokens/sec. + +### Microbenchmark (model server only) + +TGI also provides `text-generation-benchmark` (see the [benchmarking tool README](https://github.com/huggingface/text-generation-inference/tree/main/benchmark#readme)). This tool connects directly to the model server over a Unix socket and bypasses the router, so it’s useful for low-level profiling and batch-size sweeps, but it is **not** an end-to-end benchmark for SageMaker/HTTP. + diff --git a/docs/source/reference/api_reference.md b/docs/source/reference/api_reference.md index 7d21eca70a8..e7f30881cf0 100644 --- a/docs/source/reference/api_reference.md +++ b/docs/source/reference/api_reference.md @@ -141,7 +141,9 @@ TGI can be deployed on various cloud providers for scalable and robust text gene ## Amazon SageMaker -Amazon Sagemaker natively supports the message API: +Amazon SageMaker natively supports the Messages API. + +For a fuller deployment + benchmarking guide (including EC2), see [Deploying on AWS (EC2 and SageMaker)](../basic_tutorials/deploy_aws). 
```python import json @@ -150,36 +152,38 @@ import boto3 from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri try: - role = sagemaker.get_execution_role() + role = sagemaker.get_execution_role() except ValueError: - iam = boto3.client('iam') - role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn'] + iam = boto3.client("iam") + role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"] # Hub Model configuration. https://huggingface.co/models hub = { - 'HF_MODEL_ID':'HuggingFaceH4/zephyr-7b-beta', - 'SM_NUM_GPUS': json.dumps(1), + "HF_MODEL_ID": "HuggingFaceH4/zephyr-7b-beta", + "SM_NUM_GPUS": json.dumps(1), } # create Hugging Face Model Class huggingface_model = HuggingFaceModel( - image_uri=get_huggingface_llm_image_uri("huggingface",version="3.3.5"), - env=hub, - role=role, + image_uri=get_huggingface_llm_image_uri("huggingface", version="3.3.5"), + env=hub, + role=role, ) # deploy model to SageMaker Inference predictor = huggingface_model.deploy( - initial_instance_count=1, - instance_type="ml.g5.2xlarge", - container_startup_health_check_timeout=300, - ) + initial_instance_count=1, + instance_type="ml.g5.2xlarge", + container_startup_health_check_timeout=300, +) # send request -predictor.predict({ -"messages": [ - {"role": "system", "content": "You are a helpful assistant." 
}, - {"role": "user", "content": "What is deep learning?"} - ] -}) +predictor.predict( + { + "messages": [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "What is deep learning?"}, + ] + } +) ``` From c9e21b7a716ee6e425c53684a233d89bb64b2f65 Mon Sep 17 00:00:00 2001 From: Fahad Alghanim Date: Tue, 10 Mar 2026 20:48:42 -0700 Subject: [PATCH 2/3] docs: address AWS deployment review feedback --- docs/source/basic_tutorials/deploy_aws.md | 13 ++++++++----- docs/source/reference/api_reference.md | 4 ++-- 2 files changed, 10 insertions(+), 7 deletions(-) diff --git a/docs/source/basic_tutorials/deploy_aws.md b/docs/source/basic_tutorials/deploy_aws.md index 9c3fdfa8a6d..9d7c2e2be94 100644 --- a/docs/source/basic_tutorials/deploy_aws.md +++ b/docs/source/basic_tutorials/deploy_aws.md @@ -19,18 +19,20 @@ docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \ --model-id "$model" ``` -4. **Smoke test**: +4. **Send Chat Completions API request**: ```bash -curl 127.0.0.1:8080/generate \ +curl 127.0.0.1:8080/v1/chat/completions \ -X POST \ -H 'Content-Type: application/json' \ - -d '{"inputs":"Hello","parameters":{"max_new_tokens":16}}' + -d '{"model":"tgi","messages":[{"role":"user","content":"What is Deep Learning?"}]}' ``` ## Deploy on SageMaker (real-time endpoint) -TGI includes a SageMaker compatibility route (`POST /invocations`) and a SageMaker entrypoint (`sagemaker-entrypoint.sh`) that maps SageMaker environment variables to TGI launcher settings. +TGI includes a SageMaker compatibility route (`POST /invocations`) and a SageMaker entrypoint (`sagemaker-entrypoint.sh`) that maps SageMaker environment variables to TGI launcher settings. The `/invocations` route forwards requests to `/v1/chat/completions` underneath. + +> **Warning:** For this flow, use the AWS SageMaker SDK `< 3.0`. For example: `pip install "sagemaker<3"`. 
If you are using Hugging Face’s SageMaker integration (recommended), you typically only need to set the model environment variables: @@ -93,6 +95,8 @@ For meaningful benchmarks, measure both: Use a load generator from *outside* the instance/endpoint VPC when possible (to include network overhead), and run a warmup phase before measuring. +You can use [inference-benchmarker](https://github.com/huggingface/inference-benchmarker) for end-to-end HTTP benchmarking. + Example approach: 1. Warm up with a small number of requests. @@ -102,4 +106,3 @@ Example approach: ### Microbenchmark (model server only) TGI also provides `text-generation-benchmark` (see the [benchmarking tool README](https://github.com/huggingface/text-generation-inference/tree/main/benchmark#readme)). This tool connects directly to the model server over a Unix socket and bypasses the router, so it’s useful for low-level profiling and batch-size sweeps, but it is **not** an end-to-end benchmark for SageMaker/HTTP. - diff --git a/docs/source/reference/api_reference.md b/docs/source/reference/api_reference.md index e7f30881cf0..598353348a0 100644 --- a/docs/source/reference/api_reference.md +++ b/docs/source/reference/api_reference.md @@ -141,9 +141,9 @@ TGI can be deployed on various cloud providers for scalable and robust text gene ## Amazon SageMaker -Amazon SageMaker natively supports the Messages API. +Amazon SageMaker natively supports the Chat Completions API. -For a fuller deployment + benchmarking guide (including EC2), see [Deploying on AWS (EC2 and SageMaker)](../basic_tutorials/deploy_aws). +For a complete deployment + benchmarking guide (including EC2), see [Deploying on AWS (EC2 and SageMaker)](../basic_tutorials/deploy_aws). 
 ```python
 import json

From e3ec8f382a62e0507fc10e636d5c03eae688954e Mon Sep 17 00:00:00 2001
From: Alvaro Bartolome <36760800+alvarobartt@users.noreply.github.com>
Date: Sat, 21 Mar 2026 20:33:56 +0900
Subject: [PATCH 3/3] Apply suggestion from code review

---
 docs/source/basic_tutorials/deploy_aws.md | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/docs/source/basic_tutorials/deploy_aws.md b/docs/source/basic_tutorials/deploy_aws.md
index 9d7c2e2be94..eac4a8ffe12 100644
--- a/docs/source/basic_tutorials/deploy_aws.md
+++ b/docs/source/basic_tutorials/deploy_aws.md
@@ -30,6 +30,11 @@

 ## Deploy on SageMaker (real-time endpoint)

+```bash
+pip install "sagemaker<3.0.0" --upgrade --quiet
+```
+
+> [!WARNING]
+> [SageMaker Python SDK v3 has recently been released](https://github.com/aws/sagemaker-python-sdk), so unless specified otherwise, all the documentation and tutorials still use the [SageMaker Python SDK v2](https://github.com/aws/sagemaker-python-sdk/tree/master-v2). We are actively working on updating all the tutorials and examples, but in the meantime make sure to install the SageMaker SDK with `pip install "sagemaker<3.0.0"`.
+
 TGI includes a SageMaker compatibility route (`POST /invocations`) and a SageMaker entrypoint (`sagemaker-entrypoint.sh`) that maps SageMaker environment variables to TGI launcher settings. The `/invocations` route forwards requests to `/v1/chat/completions` underneath.
-
-> **Warning:** For this flow, use the AWS SageMaker SDK `< 3.0`. For example: `pip install "sagemaker<3"`.
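
The fixed-duration benchmark loop described in the new guide (warm up, run at a target concurrency, record p50/p95 latency and tokens/sec) can be sketched as below. This is a minimal illustration, not part of the patches above: the endpoint URL, concurrency, duration, and the reliance on an OpenAI-style `usage.completion_tokens` field in the response are all assumptions to adapt to your deployment.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def run_load_test(send_request, concurrency=4, duration_s=10.0, warmup=3):
    """Warm up, then drive `send_request` (a zero-arg callable returning
    (latency_seconds, generated_tokens)) at a fixed concurrency for a fixed
    duration, and report p50/p95 latency plus aggregate tokens/sec."""
    for _ in range(warmup):  # warmup requests are not measured
        send_request()

    deadline = time.perf_counter() + duration_s

    def worker():
        samples = []
        while time.perf_counter() < deadline:
            samples.append(send_request())
        return samples

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker) for _ in range(concurrency)]
        samples = [s for f in futures for s in f.result()]
    elapsed = time.perf_counter() - start

    latencies = sorted(lat for lat, _ in samples)

    def pct(q):
        # nearest-rank percentile over the sorted latency list
        return latencies[min(len(latencies) - 1, int(q * len(latencies)))]

    return {
        "requests": len(samples),
        "p50_s": pct(0.50),
        "p95_s": pct(0.95),
        "tokens_per_s": sum(n for _, n in samples) / elapsed,
    }


def tgi_chat_request():
    """One Chat Completions request against a local TGI container.
    The URL and max_tokens value are illustrative assumptions."""
    body = json.dumps({
        "model": "tgi",
        "messages": [{"role": "user", "content": "What is deep learning?"}],
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        "http://127.0.0.1:8080/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    t0 = time.perf_counter()
    out = json.load(urllib.request.urlopen(req, timeout=60))
    return time.perf_counter() - t0, out["usage"]["completion_tokens"]


# Example (requires a running endpoint):
# print(run_load_test(tgi_chat_request, concurrency=8, duration_s=60.0))
```

The same harness carries over to a SageMaker endpoint by swapping `send_request` for a callable that times `predictor.predict(...)`; for production-grade runs, prefer the inference-benchmarker tool referenced in the guide.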