2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -36,6 +36,8 @@
     title: Serving Private & Gated Models
   - local: basic_tutorials/using_cli
     title: Using TGI CLI
+  - local: basic_tutorials/deploy_aws
+    title: Deploying on AWS (EC2 and SageMaker)
   - local: basic_tutorials/non_core_models
     title: Non-core Model Serving
   - local: basic_tutorials/safety
114 changes: 114 additions & 0 deletions docs/source/basic_tutorials/deploy_aws.md
@@ -0,0 +1,114 @@
# Deploying TGI on AWS (EC2 and SageMaker)

This guide shows how to deploy **Text Generation Inference (TGI)** on AWS and how to benchmark it in a way that is useful for capacity planning.

## Deploy on EC2 (Docker)

For most setups, the simplest path is to run the official container on an EC2 GPU instance.

1. **Launch an EC2 GPU instance** (for example `g5.*` for NVIDIA GPUs).
2. **Install Docker + NVIDIA Container Toolkit** (see [Using TGI with Nvidia GPUs](../installation_nvidia) and NVIDIA’s installation docs).
3. **Run TGI**:

```bash
model=HuggingFaceH4/zephyr-7b-beta
# mount a host volume so model weights persist across container restarts
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v "$volume:/data" \
    ghcr.io/huggingface/text-generation-inference:3.3.5 \
    --model-id "$model"
```

4. **Send a Chat Completions API request**:

```bash
curl 127.0.0.1:8080/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{"model":"tgi","messages":[{"role":"user","content":"What is Deep Learning?"}]}'
```
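
Model download and sharding can take several minutes on first start. A minimal readiness check, assuming the port mapping above (the TGI router exposes `GET /health`):

```bash
# Poll the health route until the model is loaded and the server accepts traffic
until curl -sf 127.0.0.1:8080/health > /dev/null; do
  echo "waiting for TGI..."
  sleep 5
done
echo "TGI is ready"
```

The same Chat Completions route also supports streaming: add `"stream": true` to the request body to receive tokens as server-sent events.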

## Deploy on SageMaker (real-time endpoint)

```bash
pip install "sagemaker<3.0.0" --upgrade --quiet
```

> [!WARNING]
> [SageMaker Python SDK v3 has been recently released](https://github.com/aws/sagemaker-python-sdk), but unless specified otherwise, all the documentation and tutorials still use the [SageMaker Python SDK v2](https://github.com/aws/sagemaker-python-sdk/tree/master-v2). We are actively updating the tutorials and examples; in the meantime, install the SDK as `pip install "sagemaker<3.0.0"`, as shown above.

TGI includes a SageMaker compatibility route (`POST /invocations`) and a SageMaker entrypoint (`sagemaker-entrypoint.sh`) that maps SageMaker environment variables to TGI launcher settings. The `/invocations` route forwards requests to `/v1/chat/completions` underneath.

If you are using Hugging Face’s SageMaker integration (recommended), you typically only need to set the model environment variables:

- **`HF_MODEL_ID`**: model id on the Hub (required)
- **`HF_MODEL_REVISION`**: optional revision
- **`SM_NUM_GPUS`**: number of GPUs to shard across (set it to the instance's GPU count, as in the example below)
- **`HF_MODEL_QUANTIZE`**: optional quantization
- **`HF_MODEL_TRUST_REMOTE_CODE`**: optional trust remote code flag
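
As an illustration of that mapping, here is a simplified sketch of what `sagemaker-entrypoint.sh` does; the actual script in the TGI repository is authoritative:

```bash
#!/bin/bash
# Sketch: translate SageMaker-style environment variables into TGI launcher settings
if [[ -z "${HF_MODEL_ID}" ]]; then
  echo "HF_MODEL_ID must be set" >&2
  exit 1
fi
export MODEL_ID="${HF_MODEL_ID}"

[[ -n "${HF_MODEL_REVISION}" ]] && export REVISION="${HF_MODEL_REVISION}"
[[ -n "${SM_NUM_GPUS}" ]] && export NUM_SHARD="${SM_NUM_GPUS}"
[[ -n "${HF_MODEL_QUANTIZE}" ]] && export QUANTIZE="${HF_MODEL_QUANTIZE}"
[[ -n "${HF_MODEL_TRUST_REMOTE_CODE}" ]] && export TRUST_REMOTE_CODE="${HF_MODEL_TRUST_REMOTE_CODE}"

# SageMaker routes endpoint traffic to port 8080 inside the container
text-generation-launcher --port 8080
```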

For a minimal example using the Hugging Face SageMaker SDK and the official TGI image URI:

```python
import json
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
role = sagemaker.get_execution_role()
except ValueError:
iam = boto3.client("iam")
role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

hub = {
"HF_MODEL_ID": "HuggingFaceH4/zephyr-7b-beta",
# SageMaker expects SM_NUM_GPUS to be a JSON-encoded int
"SM_NUM_GPUS": json.dumps(1),
}

huggingface_model = HuggingFaceModel(
image_uri=get_huggingface_llm_image_uri("huggingface", version="3.3.5"),
env=hub,
role=role,
)

predictor = huggingface_model.deploy(
initial_instance_count=1,
instance_type="ml.g5.2xlarge",
container_startup_health_check_timeout=300,
)

predictor.predict(
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is deep learning?"},
]
}
)
```
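
Under the hood, `predictor.predict` calls the endpoint's `/invocations` route. You can reach the same route without the SageMaker Python SDK; a minimal sketch using `boto3`'s SageMaker runtime client, reusing the predictor created above:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,  # or the endpoint name from the AWS console
    ContentType="application/json",
    Body=json.dumps(
        {"messages": [{"role": "user", "content": "What is deep learning?"}]}
    ),
)
print(json.loads(response["Body"].read()))
```

When you are done experimenting, delete the resources so the endpoint stops billing:

```python
predictor.delete_model()
predictor.delete_endpoint()
```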

## Benchmarking (what to measure, and how)

For meaningful benchmarks, measure both:

- **Client-visible latency** (end-to-end): p50/p95, time-to-first-token (TTFT), tokens/sec
- **Server-side performance metrics** (to attribute bottlenecks): see [metrics](../reference/metrics)
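
TGI serves these metrics in Prometheus format on the `/metrics` route of the same port as the API; a quick spot-check (exact metric names may vary across TGI versions):

```bash
# List server-side request metrics exposed by the TGI router
curl -s 127.0.0.1:8080/metrics | grep "tgi_request" | head
```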

### End-to-end HTTP benchmark (recommended for EC2/SageMaker)

Use a load generator from *outside* the instance/endpoint VPC when possible (to include network overhead), and run a warmup phase before measuring.

You can use [inference-benchmarker](https://github.com/huggingface/inference-benchmarker) for end-to-end HTTP benchmarking.

Example approach (a minimal sketch follows the list):

1. Warm up with a small number of requests.
2. Run a fixed-duration load test at a target concurrency.
3. Record p50/p95 latency, error rate, and generated tokens/sec.
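
As a rough illustration, a minimal Python sketch of this loop against the EC2 deployment above. The URL, payload, concurrency, and duration are placeholder assumptions; for SageMaker, replace the HTTP call with `invoke_endpoint`.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:8080/v1/chat/completions"  # assumed EC2 deployment above
PAYLOAD = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "What is Deep Learning?"}],
    "max_tokens": 128,
}
CONCURRENCY = 8   # target concurrency (placeholder)
DURATION_S = 60   # fixed test duration (placeholder)


def one_request():
    """Send one request; return (latency_s, ok, completion_tokens)."""
    start = time.perf_counter()
    resp = requests.post(URL, json=PAYLOAD, timeout=120)
    latency = time.perf_counter() - start
    ok = resp.status_code == 200
    tokens = resp.json().get("usage", {}).get("completion_tokens", 0) if ok else 0
    return latency, ok, tokens


def worker(deadline):
    results = []
    while time.perf_counter() < deadline:
        results.append(one_request())
    return results


# 1. Warm up with a small number of requests.
for _ in range(5):
    one_request()

# 2. Run a fixed-duration load test at the target concurrency.
deadline = time.perf_counter() + DURATION_S
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    futures = [pool.submit(worker, deadline) for _ in range(CONCURRENCY)]
    results = [r for f in futures for r in f.result()]

# 3. Record p50/p95 latency, error rate, and generated tokens/sec.
latencies = sorted(lat for lat, _, _ in results)
errors = sum(1 for _, ok, _ in results if not ok)
total_tokens = sum(t for _, _, t in results)
print(f"requests={len(results)} error_rate={errors / len(results):.2%}")
print(f"p50={latencies[len(latencies) // 2]:.2f}s "
      f"p95={latencies[int(len(latencies) * 0.95)]:.2f}s")
print(f"tokens/sec={total_tokens / DURATION_S:.1f}")
```

This measures full-response latency only; measuring TTFT requires a streaming request (`"stream": true`) and timing the first event, which tools like inference-benchmarker handle for you.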

### Microbenchmark (model server only)

TGI also provides `text-generation-benchmark` (see the [benchmarking tool README](https://github.com/huggingface/text-generation-inference/tree/main/benchmark#readme)). This tool connects directly to the model server over a Unix socket and bypasses the router, so it’s useful for low-level profiling and batch-size sweeps, but it is **not** an end-to-end benchmark for SageMaker/HTTP.
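
For example, with the EC2 container from above still running (invocation per the benchmark README; flags may differ across versions, so check `text-generation-benchmark --help` inside your image):

```bash
# Run inside the TGI container; the tool talks to the shard over a Unix socket
docker exec -it <container-id> \
  text-generation-benchmark --tokenizer-name HuggingFaceH4/zephyr-7b-beta
```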
42 changes: 23 additions & 19 deletions docs/source/reference/api_reference.md
@@ -141,7 +141,9 @@ TGI can be deployed on various cloud providers for scalable and robust text gene
 
 ## Amazon SageMaker
 
-Amazon Sagemaker natively supports the message API:
+Amazon SageMaker natively supports the Chat Completions API.
+
+For a complete deployment + benchmarking guide (including EC2), see [Deploying on AWS (EC2 and SageMaker)](../basic_tutorials/deploy_aws).
 
 ```python
 import json
@@ -150,36 +152,38 @@ import boto3
 from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
 
 try:
-  role = sagemaker.get_execution_role()
+    role = sagemaker.get_execution_role()
 except ValueError:
-  iam = boto3.client('iam')
-  role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
+    iam = boto3.client("iam")
+    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]
 
 # Hub Model configuration. https://huggingface.co/models
 hub = {
-  'HF_MODEL_ID':'HuggingFaceH4/zephyr-7b-beta',
-  'SM_NUM_GPUS': json.dumps(1),
+    "HF_MODEL_ID": "HuggingFaceH4/zephyr-7b-beta",
+    "SM_NUM_GPUS": json.dumps(1),
 }
 
 # create Hugging Face Model Class
 huggingface_model = HuggingFaceModel(
-  image_uri=get_huggingface_llm_image_uri("huggingface",version="3.3.5"),
-  env=hub,
-  role=role,
+    image_uri=get_huggingface_llm_image_uri("huggingface", version="3.3.5"),
+    env=hub,
+    role=role,
 )
 
 # deploy model to SageMaker Inference
 predictor = huggingface_model.deploy(
-  initial_instance_count=1,
-  instance_type="ml.g5.2xlarge",
-  container_startup_health_check_timeout=300,
-  )
+    initial_instance_count=1,
+    instance_type="ml.g5.2xlarge",
+    container_startup_health_check_timeout=300,
+)
 
 # send request
-predictor.predict({
-  "messages": [
-    {"role": "system", "content": "You are a helpful assistant." },
-    {"role": "user", "content": "What is deep learning?"}
-  ]
-})
+predictor.predict(
+    {
+        "messages": [
+            {"role": "system", "content": "You are a helpful assistant."},
+            {"role": "user", "content": "What is deep learning?"},
+        ]
+    }
+)
 ```