This repository was archived by the owner on Mar 21, 2026. It is now read-only.
docs: add AWS (EC2/SageMaker) deployment + benchmarking guide #3352
Merged: alvarobartt merged 3 commits into huggingface:main from KOKOSde:docs/aws-deployment-benchmarks on Mar 21, 2026 (+139 −19)
# Deploying TGI on AWS (EC2 and SageMaker)

This guide shows how to deploy **Text Generation Inference (TGI)** on AWS and how to benchmark it in a way that is useful for capacity planning.
## Deploy on EC2 (Docker)

For most setups, the simplest path is to run the official container on an EC2 GPU instance.

1. **Launch an EC2 GPU instance** (for example `g5.*` for NVIDIA GPUs).
2. **Install Docker + NVIDIA Container Toolkit** (see [Using TGI with Nvidia GPUs](../installation_nvidia) and NVIDIA's installation docs).
3. **Run TGI**:
```bash
model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.5 \
    --model-id "$model"
```
4. **Send a Chat Completions API request**:
```bash
curl 127.0.0.1:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"model":"tgi","messages":[{"role":"user","content":"What is Deep Learning?"}]}'
```
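If you prefer Python over `curl`, the same request can be built with the standard library alone. This is a minimal sketch (the `chat_request` helper is hypothetical, not part of TGI) that assumes the server from the previous step is listening on `127.0.0.1:8080`:

```python
import json
from urllib import request

def chat_request(prompt: str, base_url: str = "http://127.0.0.1:8080") -> request.Request:
    """Build a Chat Completions request for a locally running TGI server."""
    body = json.dumps(
        {"model": "tgi", "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the server running, send the request and read the reply:
# with request.urlopen(chat_request("What is Deep Learning?")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```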
## Deploy on SageMaker (real-time endpoint)
```bash
pip install "sagemaker<3.0.0" --upgrade --quiet
```

> [!WARNING]
> [SageMaker Python SDK v3 has recently been released](https://github.com/aws/sagemaker-python-sdk), so unless specified otherwise, all the documentation and tutorials still use the [SageMaker Python SDK v2](https://github.com/aws/sagemaker-python-sdk/tree/master-v2). We are actively working on updating all the tutorials and examples, but in the meantime make sure to install the SageMaker SDK as `pip install "sagemaker<3.0.0"`.

TGI includes a SageMaker compatibility route (`POST /invocations`) and a SageMaker entrypoint (`sagemaker-entrypoint.sh`) that maps SageMaker environment variables to TGI launcher settings. The `/invocations` route forwards requests to `/v1/chat/completions` under the hood.
If you are using Hugging Face's SageMaker integration (recommended), you typically only need to set the model environment variables:

- **`HF_MODEL_ID`**: model ID on the Hub (required)
- **`HF_MODEL_REVISION`**: optional revision
- **`SM_NUM_GPUS`**: number of GPUs (SageMaker sets this)
- **`HF_MODEL_QUANTIZE`**: optional quantization
- **`HF_MODEL_TRUST_REMOTE_CODE`**: optional trust-remote-code flag
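The kind of translation the entrypoint performs can be pictured with a rough sketch. This is illustrative only: `launcher_args` is a hypothetical helper, and the real mapping lives in `sagemaker-entrypoint.sh`; the launcher flags shown (`--model-id`, `--revision`, `--num-shard`, `--quantize`, `--trust-remote-code`) are standard `text-generation-launcher` options.

```python
def launcher_args(env: dict) -> list:
    """Sketch of mapping SageMaker env vars to text-generation-launcher flags."""
    args = ["text-generation-launcher"]
    if "HF_MODEL_ID" in env:
        args += ["--model-id", env["HF_MODEL_ID"]]
    if "HF_MODEL_REVISION" in env:
        args += ["--revision", env["HF_MODEL_REVISION"]]
    if "SM_NUM_GPUS" in env:
        # SageMaker sets SM_NUM_GPUS; TGI shards the model across GPUs
        args += ["--num-shard", env["SM_NUM_GPUS"]]
    if "HF_MODEL_QUANTIZE" in env:
        args += ["--quantize", env["HF_MODEL_QUANTIZE"]]
    if env.get("HF_MODEL_TRUST_REMOTE_CODE", "").lower() == "true":
        args.append("--trust-remote-code")
    return args
```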
For a minimal example using the Hugging Face SageMaker SDK and the official TGI image URI:
```python
import json

import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

hub = {
    "HF_MODEL_ID": "HuggingFaceH4/zephyr-7b-beta",
    # SageMaker expects SM_NUM_GPUS to be a JSON-encoded int
    "SM_NUM_GPUS": json.dumps(1),
}

huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="3.3.5"),
    env=hub,
    role=role,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)

predictor.predict(
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is deep learning?"},
        ]
    }
)
```
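Because `/invocations` forwards to `/v1/chat/completions`, the response follows the Chat Completions schema. A small sketch of pulling the assistant's text out of such a response (`assistant_text` is a hypothetical helper, and the sample payload below is illustrative):

```python
def assistant_text(response: dict) -> str:
    """Extract the assistant message from a Chat Completions-style response."""
    return response["choices"][0]["message"]["content"]

# A response shaped like a Chat Completions payload:
sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Deep learning is..."}}
    ]
}
print(assistant_text(sample))
```

When you are done experimenting, remember to tear the endpoint down (for example with `predictor.delete_endpoint()`) so you are not billed for an idle instance.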
## Benchmarking (what to measure, and how)

For meaningful benchmarks, measure both:

- **Client-visible latency** (end-to-end): p50/p95, time-to-first-token (TTFT), tokens/sec
- **Server-side performance metrics** (to attribute bottlenecks): see [metrics](../reference/metrics)
### End-to-end HTTP benchmark (recommended for EC2/SageMaker)

Use a load generator from *outside* the instance/endpoint VPC when possible (to include network overhead), and run a warmup phase before measuring.

You can use [inference-benchmarker](https://github.com/huggingface/inference-benchmarker) for end-to-end HTTP benchmarking.
Example approach:

1. Warm up with a small number of requests.
2. Run a fixed-duration load test at a target concurrency.
3. Record p50/p95 latency, error rate, and generated tokens/sec.
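The reporting in step 3 can be sketched with a few lines of standard-library Python, assuming you have already collected one end-to-end latency measurement per request (the `percentile` helper and the sample numbers are illustrative, not part of any benchmarking tool):

```python
import statistics

def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile; good enough for benchmark reporting."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1))))
    return ordered[k]

# One end-to-end latency (seconds) per completed request:
latencies = [0.81, 0.95, 1.02, 0.88, 2.10, 0.90, 0.93, 1.40, 0.85, 0.99]

print(f"p50:  {percentile(latencies, 50):.2f}s")
print(f"p95:  {percentile(latencies, 95):.2f}s")
print(f"mean: {statistics.mean(latencies):.2f}s")
```

Report p50 and p95 together rather than the mean alone: a single slow request (like the 2.10 s outlier above) barely moves p50 but dominates p95, which is usually what capacity planning cares about.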
### Microbenchmark (model server only)

TGI also provides `text-generation-benchmark` (see the [benchmarking tool README](https://github.com/huggingface/text-generation-inference/tree/main/benchmark#readme)). This tool connects directly to the model server over a Unix socket and bypasses the router, so it's useful for low-level profiling and batch-size sweeps, but it is **not** an end-to-end benchmark for SageMaker/HTTP.