Conversation

@aasthavar commented Apr 30, 2025

What does this PR do?

Fixes #53!

Adds a new blueprint at blueprints/inference/trtllm-nvidia-triton-server-gpu for deploying and serving LLMs using TensorRT-LLM with NVIDIA Triton Inference Server, demonstrated with a 1B-parameter LLaMA model for ultra-low-latency inference.

Includes:

  • Scripts for building and pushing Triton + TensorRT-LLM images
  • GPU inference profiling
  • Troubleshooting and system checks
  • Autoscaling validation
  • Observability via Prometheus and Grafana

Motivation

No existing blueprint demonstrates optimized LLM inference with TensorRT-LLM + Triton. This fills that gap with a performant, scalable GPU-based solution.

More

  • Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
  • Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
  • Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • E2E test successfully completed before merge?

Additional Notes

  • Repo maintainer requested a re-submission due to upstream changes requiring recent PR contributors to re-fork and re-apply their changes.
  • pre-commit results: pre-commit-results (screenshot)
  • deployment status: ai-on-eks-successful-deployment (screenshot)

@omrishiv (Contributor) left a comment

This is incredibly thorough, thank you so much for the contribution! I have a few small requests/questions. @vara-bonthu, would you also mind taking a look, as this infrastructure is the only one I haven't consolidated and you may be a little more familiar with it?

blueprints/inference/trtllm-nvidia-triton-server-gpu/triton_model_files/*
blueprints/inference/trtllm-nvidia-triton-server-gpu/benchmark-grpc/results.txt
blueprints/inference/trtllm-nvidia-triton-server-gpu/benchmark-http/results/*
blueprints/inference/trtllm-nvidia-triton-server-gpu/.ecr_repo_uri

As .ecr_repo_uri and .eks_region seem to be a common pattern, can we gitignore them with

**/.ecr_repo_uri
**/.eks_region

COPY start.sh /start.sh
RUN chmod +x /start.sh

ENTRYPOINT ["/bin/bash", "/start.sh"]
(no newline at end of file)

newline

echo -e "\nBuilding llama finetuning trn1 docker image" \
&& docker build . --no-cache -t $ECR_REPO_URI:latest \
&& docker push $ECR_REPO_URI:latest \
&& echo -e "\nImage successfully pushed to ECR"
(no newline at end of file)

newline

print(f"* {region}: {region_long_name}")
for instance_type in instance_types:
print(f" - {instance_type}")
print("\n") No newline at end of file

newline

@@ -0,0 +1,3 @@
#!/bin/bash

python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/triton_model_files
(no newline at end of file)

newline

instanceStorePolicy: RAID0
nodePool:
  labels:
    - type: karpenter

can we align these to the current labels:

          - instanceType: g6e-gpu-karpenter
          - type: karpenter
          - accelerator: nvidia
          - gpuType: l40s
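
For reference, a minimal sketch of how that block could look once aligned (nesting assumed from the values snippet above, not taken from the PR):

  nodePool:
    labels:
      - instanceType: g6e-gpu-karpenter
      - type: karpenter
      - accelerator: nvidia
      - gpuType: l40s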

values: ["amd64"]
- key: "karpenter.sh/capacity-type"
operator: In
values: ["on-demand"]

please add spot
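
A minimal sketch of that requirement with spot included (using the standard karpenter.sh/capacity-type key; indentation assumed):

    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]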

# --------------------------------------------------------------------------------------------------------- #
# NOTE: this is a reminder to modify "aws s3 sync command" within provisioner "local-exec", before deploying
# --------------------------------------------------------------------------------------------------------- #
# module "triton_server_trtllm" {

why is all of this commented out?

bucket_name = module.s3_bucket[count.index].s3_bucket_id
}

provisioner "local-exec" {

is this possible to toggle based on which one you're using?
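
One possible shape for such a toggle, sketched with an illustrative variable and resource name (not from the PR; the sync path and bucket reference are placeholders):

  variable "enable_s3_model_sync" {
    description = "Run the aws s3 sync local-exec step; set to false when using the other model source"
    type        = bool
    default     = true
  }

  resource "null_resource" "s3_model_sync" {
    count = var.enable_s3_model_sync ? 1 : 0

    provisioner "local-exec" {
      # Reuse the existing aws s3 sync command here; only the count gate is new.
      command = "aws s3 sync ./triton_model_files s3://${module.s3_bucket[0].s3_bucket_id}/"
    }
  }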

@@ -0,0 +1,658 @@
---
title: NVIDIA Triton Server with TensorRT LLM
sidebar_position: 2

please remove the fixed positioning

Successfully merging this pull request may close these issues:

[Blueprint Request] Add support for TensorRT-LLM with NVIDIA Triton Server