feat: Add blueprint TensorRT-LLM + Triton #65
base: main
Conversation
This is incredibly thorough, thank you so much for the contribution! I have a few small requests/questions. @vara-bonthu would you also mind taking a look as this infrastructure is the only one I haven't consolidated and you may be a little more familiar with it.
blueprints/inference/trtllm-nvidia-triton-server-gpu/triton_model_files/*
blueprints/inference/trtllm-nvidia-triton-server-gpu/benchmark-grpc/results.txt
blueprints/inference/trtllm-nvidia-triton-server-gpu/benchmark-http/results/*
blueprints/inference/trtllm-nvidia-triton-server-gpu/.ecr_repo_uri
As .ecr_repo_uri and .eks_region seem to be a common pattern, can we gitignore them with the following patterns?
**/.ecr_repo_uri
**/.eks_region
COPY start.sh /start.sh
RUN chmod +x /start.sh

ENTRYPOINT ["/bin/bash", "/start.sh"]
No newline at end of file
nit: please add a trailing newline at the end of the file
echo -e "\nBuilding llama finetuning trn1 docker image" \
&& docker build . --no-cache -t $ECR_REPO_URI:latest \
&& docker push $ECR_REPO_URI:latest \
&& echo -e "\nImage successfully pushed to ECR"
No newline at end of file
nit: please add a trailing newline at the end of the file
print(f"* {region}: {region_long_name}") | ||
for instance_type in instance_types: | ||
print(f" - {instance_type}") | ||
print("\n") No newline at end of file |
nit: please add a trailing newline at the end of the file
@@ -0,0 +1,3 @@
#!/bin/bash

python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/triton_model_files
No newline at end of file
nit: please add a trailing newline at the end of the file
instanceStorePolicy: RAID0
nodePool:
  labels:
    - type: karpenter
can we align these to the current labels (see the sketch after this list):
- instanceType: g6e-gpu-karpenter
- type: karpenter
- accelerator: nvidia
- gpuType: l40s
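For reference, a minimal sketch of how the aligned labels might look in the nodePool values block. The structure mirrors the diff above, and the label keys and values are the ones listed in this comment, so names like g6e-gpu-karpenter and l40s should be confirmed against the other blueprints rather than taken as given:

```yaml
nodePool:
  labels:
    - instanceType: g6e-gpu-karpenter
    - type: karpenter
    - accelerator: nvidia
    - gpuType: l40s
```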
values: ["amd64"] | ||
- key: "karpenter.sh/capacity-type" | ||
operator: In | ||
values: ["on-demand"] |
please add spot
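A small sketch of what this could look like, assuming the same requirements block as in the diff above:

```yaml
- key: "karpenter.sh/capacity-type"
  operator: In
  values: ["spot", "on-demand"]  # allow both spot and on-demand capacity
```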
# --------------------------------------------------------------------------------------------------------- #
# NOTE: this is a reminder to modify "aws s3 sync command" within provisioner "local-exec", before deploying
# --------------------------------------------------------------------------------------------------------- #
# module "triton_server_trtllm" {
why is all of this commented out?
bucket_name = module.s3_bucket[count.index].s3_bucket_id
}

provisioner "local-exec" {
Is it possible to toggle this based on which one you're using?
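One possible way to make this toggleable is to gate the module (and its S3 sync) behind a boolean variable. A rough sketch follows; the variable name, module name, and source path are hypothetical and not from this PR:

```hcl
# Hypothetical toggle; the variable/module names and path are illustrative only.
variable "enable_triton_trtllm" {
  description = "Deploy the TensorRT-LLM Triton server blueprint (and run its S3 sync)"
  type        = bool
  default     = false
}

module "triton_server_trtllm" {
  count  = var.enable_triton_trtllm ? 1 : 0
  source = "./triton-server-trtllm" # placeholder path
  # ... existing inputs unchanged ...
}
```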
@@ -0,0 +1,658 @@
---
title: NVIDIA Triton Server with TensorRT LLM
sidebar_position: 2
please remove the fixed positioning
What does this PR do?
Fixes #53!
Adds a new blueprint at blueprints/inference/trtllm-nvidia-triton-server-gpu for deploying and serving LLMs using TensorRT-LLM with NVIDIA Triton Inference Server, demonstrated with a 1B-parameter LLaMA model for ultra-low latency. Includes:
Motivation
No existing blueprint demonstrates optimized LLM inference with TensorRT-LLM + Triton. This fills that gap with a performant, scalable GPU-based solution.
More
- Added a section under website/docs or website/blog for this feature
- Ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators
Additional Notes