
FlashAttention CUDA "no kernel image" crash on RTX 5060 Ti #3342

@pauli31

Description


System Info

Running TGI 3.3.6 on a new NVIDIA GeForce RTX 5060 Ti GPU (compute capability 12.0 / sm_120) causes TGI to crash during warmup with:

CUDA Error: no kernel image is available for execution on the device
/usr/src/flash-attention/csrc/layer_norm/ln_fwd_kernels.cuh:236

It crashes immediately because FlashAttention 1.0.9, which is bundled inside the TGI Docker image, does not include kernels compiled for sm_120. This appears to be the root cause.
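For reference, here is a quick diagnostic sketch (not TGI code) to compare the device's compute capability with the CUDA architectures the PyTorch wheel in the container was built for. Note that flash-attention is compiled separately, so its own arch list can still lack sm_120 even if PyTorch's does not:

import torch

# Compute capability reported by the driver for the active GPU.
major, minor = torch.cuda.get_device_capability(0)
print(f"device: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")

# Architectures the installed PyTorch wheel ships kernels for; the
# flash-attention extension keeps a separate, independent arch list.
print("torch arch list:", torch.cuda.get_arch_list())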

Environment
Hardware

GPU: NVIDIA GeForce RTX 5060 Ti

Compute capability: 12.0

VRAM: 16 GB

Driver: 581.80

CUDA (system): 13.0 (from nvidia-smi)

Inside the TGI 3.3.6 container

torch == 2.7.0+cu128
torch.version.cuda == "12.8"
flash_attn == 1.0.9
triton == 3.3.0
mamba_ssm == 1.1.2
text-generation-server == 2.0.5.dev0  (from the internal server component)

FlashAttention 1.0.9 is confirmed by:

import flash_attn
print(flash_attn.__version__)       # 1.0.9
print(flash_attn.__file__)

nvidia-smi
Tue Dec  9 11:01:48 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.07             Driver Version: 581.80         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5060 Ti     On  |   00000000:01:00.0  On |                  N/A |
|  0%   39C    P8              7W /  180W |    1781MiB /  16311MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A              25      G   /Xwayland                             N/A      |
|    0   N/A  N/A              50      G   /Xwayland                             N/A      |
+-----------------------------------------------------------------------------------------+

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Just run the Docker image with Llama 3.2:

export HF_TOKEN="..."

docker run --rm -it \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
  -e RUST_LOG=debug \
  -v /mnt/hf-cache:/data \
  --name llama-3.2-1b-tgi \
  ghcr.io/huggingface/text-generation-inference:3.3.6 \
    --model-id meta-llama/Llama-3.2-1B-Instruct \
    --max-input-length 4096 \
    --max-total-tokens 4224 \
    --cuda-memory-fraction 0.9 

The full stack trace

2025-12-09T10:04:16.474404Z  INFO text_generation_launcher: Args {
    model_id: "meta-llama/Llama-3.2-1B-Instruct",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    kv_cache_dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: Some(
        4096,
    ),
    max_total_tokens: Some(
        4224,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "da2c85ccd2e1",
    port: 80,
    prometheus_port: 9000,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: true,
    cuda_memory_fraction: 0.9,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: On,
    payload_limit: 2000000,
    enable_prefill_logprobs: false,
    graceful_termination_timeout: 90,
}
2025-12-09T10:04:18.510804Z  INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2025-12-09T10:04:18.599056Z  WARN text_generation_launcher: Unkown compute for card nvidia-geforce-rtx-5060-ti
2025-12-09T10:04:18.639000Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4096
2025-12-09T10:04:18.639066Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-12-09T10:04:18.639261Z  INFO download: text_generation_launcher: Starting check and download process for meta-llama/Llama-3.2-1B-Instruct
2025-12-09T10:04:24.229594Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-12-09T10:04:24.963645Z  INFO download: text_generation_launcher: Successfully downloaded weights for meta-llama/Llama-3.2-1B-Instruct
2025-12-09T10:04:24.963959Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-12-09T10:04:29.917437Z  INFO text_generation_launcher: Using prefix caching = True
2025-12-09T10:04:29.917489Z  INFO text_generation_launcher: Using Attention = flashinfer
2025-12-09T10:04:34.991699Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-12-09T10:04:45.011112Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-12-09T10:04:55.030039Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-12-09T10:05:04.721763Z  INFO text_generation_launcher: Using prefill chunking = True
2025-12-09T10:05:05.041495Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-12-09T10:05:05.123703Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2025-12-09T10:05:05.141684Z  INFO shard-manager: text_generation_launcher: Shard ready in 40.172224949s rank=0
2025-12-09T10:05:05.214578Z  INFO text_generation_launcher: Starting Webserver
2025-12-09T10:05:05.276269Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2025-12-09T10:05:05.299117Z  INFO text_generation_launcher: Using optimized Triton indexing kernels.
2025-12-09T10:05:07.527851Z ERROR warmup{max_input_length=Some(4096) max_prefill_tokens=4096 max_total_tokens=Some(4224) max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: transport error
Error: Backend(Warmup(Generation("transport error")))
2025-12-09T10:05:07.566220Z ERROR text_generation_launcher: Webserver Crashed
2025-12-09T10:05:07.566264Z  INFO text_generation_launcher: Shutting down shards
2025-12-09T10:05:07.655671Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

2025-12-09 10:04:26.943 | INFO     | text_generation_server.utils.import_utils:<module>:76 - Detected system cuda
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
CUDA Error: no kernel image is available for execution on the device /usr/src/flash-attention/csrc/layer_norm/ln_fwd_kernels.cuh 236 rank=0
Error: WebserverFailed

Additional test inside the Docker container

from flash_attn.ops.rms_norm import rms_norm
import torch

# Minimal repro: this exercises flash-attention's fused RMSNorm kernel
# (built from csrc/layer_norm), the same code path that fails during warmup.
b = torch.rand(32).cuda()
a = torch.rand(2, 32).cuda()
rms_norm(a, b, 1e-6)
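To help triage, here is a hedged sketch to locate the compiled extension behind the failing kernel. It assumes flash-attention 1.x builds csrc/layer_norm into a module named dropout_layer_norm; if the name differs in this image, adjust it:

import importlib.util

# Locate the compiled .so that contains the fused layer-norm kernels.
spec = importlib.util.find_spec("dropout_layer_norm")
print(spec.origin if spec else "extension not found")
# The printed .so can then be inspected for its embedded SM architectures,
# e.g. with `cuobjdump --list-elf <path>`; sm_120 will be absent if the
# build predates Blackwell support.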

Expected behavior

The model should load and work. Come on, it's been almost a year since the Blackwell GPUs were released; they should be supported by now.
