System Info
Running TGI 3.3.6 on an NVIDIA GeForce RTX 5060 Ti (compute capability 12.0 / sm_120) causes TGI to crash during warmup with:
CUDA Error: no kernel image is available for execution on the device
/usr/src/flash-attention/csrc/layer_norm/ln_fwd_kernels.cuh:236
It crashes immediately because FlashAttention 1.0.9, which is bundled inside the TGI Docker image, does not include kernels compiled for sm_120. This appears to be the root cause.
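As a quick sanity check (run inside the container), the torch build itself should know about sm_120, which points at the separately compiled flash-attn extension rather than torch; this sketch uses only standard torch calls:

import torch

# The card's compute capability and the SM architectures this torch build
# ships kernels for. torch 2.7.0+cu128 should list sm_120, so torch itself
# is not the component missing Blackwell kernels.
print(torch.cuda.get_device_capability(0))  # expected (12, 0)
print(torch.cuda.get_arch_list())           # should include 'sm_120'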
Environment
Hardware
GPU: NVIDIA GeForce RTX 5060 Ti
Compute capability: 12.0
VRAM: 16 GB
Driver: 581.80
CUDA (system): 13.0 (from nvidia-smi)
Inside the TGI 3.3.6 container
torch == 2.7.0+cu128
torch.version.cuda == "12.8"
flash_attn == 1.0.9
triton == 3.3.0
mamba_ssm == 1.1.2
text-generation-server == 2.0.5.dev0 (from the internal server component)
FlashAttention 1.0.9 is confirmed inside the container by:
import flash_attn
print(flash_attn.__version__) # 1.0.9
print(flash_attn.__file__)
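For completeness, one can also inspect which SM architectures are baked into the compiled extension that raises the error. A sketch; the dropout_layer_norm module name is an assumption about how flash-attn 1.0.9 names its fused layer-norm extension, while the site-packages path matches the warnings in the log below:

import glob
import subprocess

# List the cubins embedded in the fused layer-norm shared object; if the
# diagnosis is right, sm_120 is absent, matching the "no kernel image" error.
pattern = "/usr/src/.venv/lib/python3.11/site-packages/dropout_layer_norm*.so"
for so in glob.glob(pattern):
    print(so)
    out = subprocess.run(["cuobjdump", "--list-elf", so],
                         capture_output=True, text=True)
    print(out.stdout)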
nvidia-smi
Tue Dec 9 11:01:48 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.07 Driver Version: 581.80 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5060 Ti On | 00000000:01:00.0 On | N/A |
| 0% 39C P8 7W / 180W | 1781MiB / 16311MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 25 G /Xwayland N/A |
| 0 N/A N/A 50 G /Xwayland N/A |
+-----------------------------------------------------------------------------------------+
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
Just run the Docker image with Llama 3.2:
export HF_TOKEN="..."
docker run --rm -it \
--gpus all \
--shm-size 1g \
-p 8080:80 \
-e HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
-e RUST_LOG=debug \
-v /mnt/hf-cache:/data \
--name llama-3.2-1b-tgi \
ghcr.io/huggingface/text-generation-inference:3.3.6 \
--model-id meta-llama/Llama-3.2-1B-Instruct \
--max-input-length 4096 \
--max-total-tokens 4224 \
--cuda-memory-fraction 0.9
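For reference, a successful warmup would leave the server answering on the mapped port; on this GPU the probe below never gets a response because the webserver crashes first. A minimal sketch against TGI's standard /generate route, assuming requests is installed on the host:

import requests

# Smoke test against the container's mapped port; with the RTX 5060 Ti
# this fails because the server exits during warmup.
resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "Hello", "parameters": {"max_new_tokens": 16}},
    timeout=60,
)
print(resp.status_code, resp.json())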
The full stack trace:
2025-12-09T10:04:16.474404Z INFO text_generation_launcher: Args {
model_id: "meta-llama/Llama-3.2-1B-Instruct",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: None,
speculate: None,
dtype: None,
kv_cache_dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: Some(
4096,
),
max_total_tokens: Some(
4224,
),
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "da2c85ccd2e1",
port: 80,
prometheus_port: 9000,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: true,
cuda_memory_fraction: 0.9,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
payload_limit: 2000000,
enable_prefill_logprobs: false,
graceful_termination_timeout: 90,
}
2025-12-09T10:04:18.510804Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2025-12-09T10:04:18.599056Z WARN text_generation_launcher: Unkown compute for card nvidia-geforce-rtx-5060-ti
2025-12-09T10:04:18.639000Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4096
2025-12-09T10:04:18.639066Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-12-09T10:04:18.639261Z INFO download: text_generation_launcher: Starting check and download process for meta-llama/Llama-3.2-1B-Instruct
2025-12-09T10:04:24.229594Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-12-09T10:04:24.963645Z INFO download: text_generation_launcher: Successfully downloaded weights for meta-llama/Llama-3.2-1B-Instruct
2025-12-09T10:04:24.963959Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-12-09T10:04:29.917437Z INFO text_generation_launcher: Using prefix caching = True
2025-12-09T10:04:29.917489Z INFO text_generation_launcher: Using Attention = flashinfer
2025-12-09T10:04:34.991699Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-12-09T10:04:45.011112Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-12-09T10:04:55.030039Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-12-09T10:05:04.721763Z INFO text_generation_launcher: Using prefill chunking = True
2025-12-09T10:05:05.041495Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-12-09T10:05:05.123703Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2025-12-09T10:05:05.141684Z INFO shard-manager: text_generation_launcher: Shard ready in 40.172224949s rank=0
2025-12-09T10:05:05.214578Z INFO text_generation_launcher: Starting Webserver
2025-12-09T10:05:05.276269Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2025-12-09T10:05:05.299117Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2025-12-09T10:05:07.527851Z ERROR warmup{max_input_length=Some(4096) max_prefill_tokens=4096 max_total_tokens=Some(4224) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: transport error
Error: Backend(Warmup(Generation("transport error")))
2025-12-09T10:05:07.566220Z ERROR text_generation_launcher: Webserver Crashed
2025-12-09T10:05:07.566264Z INFO text_generation_launcher: Shutting down shards
2025-12-09T10:05:07.655671Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2025-12-09 10:04:26.943 | INFO | text_generation_server.utils.import_utils:<module>:76 - Detected system cuda
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@custom_bwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@custom_bwd
CUDA Error: no kernel image is available for execution on the device /usr/src/flash-attention/csrc/layer_norm/ln_fwd_kernels.cuh 236 rank=0
Error: WebserverFailed
Additional test inside the container
The same CUDA error can be triggered directly with the flash-attn fused RMSNorm:
import torch
from flash_attn.ops.rms_norm import rms_norm

# Fails with "CUDA error: no kernel image is available for execution on
# the device", the same error seen during TGI warmup.
b = torch.rand(32).cuda()     # RMSNorm weight
a = torch.rand(2, 32).cuda()  # input
rms_norm(a, b, 1e-6)
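A possible workaround sketch (untested): rebuild the fused layer-norm extension for sm_120 from the source tree already present in the image. Whether flash-attn 1.0.9 compiles for Blackwell under CUDA 12.8 at all is an assumption, as is the setup script honoring TORCH_CUDA_ARCH_LIST:

import os
import subprocess

# Target compute capability 12.0 explicitly when rebuilding the extension
# that raises the error (path taken from the crash message above).
env = dict(os.environ, TORCH_CUDA_ARCH_LIST="12.0")
subprocess.run(
    ["pip", "install", "--no-build-isolation", "."],
    cwd="/usr/src/flash-attention/csrc/layer_norm",
    env=env,
    check=True,
)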
Expected behavior
The model loads and serving works. Come on, it's almost a year after the release of Blackwell GPUs; sm_120 should be supported by now.