Eval bug: llama-server infinite token gen on long prompt eval with dual GPU (CUDA)

### Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
  Device 1: Quadro RTX 8000, compute capability 7.5, VMM: yes
  Device 2: Quadro RTX 8000, compute capability 7.5, VMM: yes
version: 7256 (2e1c9cd81)
built with MSVC 19.38.33145.0 for x64


### Operating systems

Windows

### GGML backends

CUDA

### Hardware

2x Quadro RTX 8000

### Models

Zai GLM 4.5 Air Q5
https://huggingface.co/ddh0/GLM-4.5-Air-GGUF/blob/main/GLM-4.5-Air-Q8_0-FFN-Q5_K-Q5_K-Q8_0-v2.gguf

### Problem description & steps to reproduce

Running a long prompt in llama-server results in infinite token generation (Single character repeated), running the build before the FA refactor does not have issues. 
Currently have llama-server running, split model between both GPU's, fully in VRAM. 
Tested a long prompt (>14K tokens) in the default UI. 

### First Bad Commit

https://github.com/ggml-org/llama.cpp/pull/17505
CUDA: generalized (mma) FA, add Volta support #17505 

### Relevant log output

```shell
Pretty much repeats the following (token generated is random per last character in input prompt):


data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"?"}}],"created":1765154291,"id":"chatcmpl-akU6wugj4COtpEO8Of4GvIehBmEVBzG6","model":"GLM-4.5-Air-Q8_0-FFN-Q5_K-Q5_K-Q8_0-v2.gguf","system_fingerprint":"b7256-2e1c9cd81","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":19259,"prompt_ms":22598.212,"prompt_per_token_ms":1.173384495560517,"prompt_per_second":852.2355662474537,"predicted_n":1,"predicted_ms":0.001,"predicted_per_token_ms":0.001,"predicted_per_second":1000000.0}}


srv  update_chat_: Parsing chat message: ??
Parsing input with format Content-only: ??
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 2, n_remaining = -1, next token:    30 '?'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 8
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 9, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 32768, n_tokens = 19261, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
srv   operator (): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"?"}}],"created":1765154291,"id":"chatcmpl-akU6wugj4COtpEO8Of4GvIehBmEVBzG6","model":"GLM-4.5-Air-Q8_0-FFN-Q5_K-Q5_K-Q8_0-v2.gguf","system_fingerprint":"b7256-2e1c9cd81","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":19259,"prompt_ms":22598.212,"prompt_per_token_ms":1.173384495560517,"prompt_per_second":852.2355662474537,"predicted_n":2,"predicted_ms":41.783,"predicted_per_token_ms":20.8915,"predicted_per_second":47.86635713089055}}
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Eval bug: llama-server infinite token gen on long prompt eval with dual GPU (CUDA) #17852

Name and Version

Operating systems

GGML backends

Hardware

Models

Problem description & steps to reproduce

First Bad Commit

Relevant log output

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Eval bug: llama-server infinite token gen on long prompt eval with dual GPU (CUDA) #17852

Description

Name and Version

Operating systems

GGML backends

Hardware

Models

Problem description & steps to reproduce

First Bad Commit

Relevant log output

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions