Skip to content

Eval bug: llama-server infinite token gen on long prompt eval with dual GPU (CUDA) #17852

@albertnsoliz

Description

@albertnsoliz

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
Device 1: Quadro RTX 8000, compute capability 7.5, VMM: yes
Device 2: Quadro RTX 8000, compute capability 7.5, VMM: yes
version: 7256 (2e1c9cd)
built with MSVC 19.38.33145.0 for x64

Operating systems

Windows

GGML backends

CUDA

Hardware

2x Quadro RTX 8000

Models

Zai GLM 4.5 Air Q5
https://huggingface.co/ddh0/GLM-4.5-Air-GGUF/blob/main/GLM-4.5-Air-Q8_0-FFN-Q5_K-Q5_K-Q8_0-v2.gguf

Problem description & steps to reproduce

Running a long prompt in llama-server results in infinite token generation (Single character repeated), running the build before the FA refactor does not have issues.
Currently have llama-server running, split model between both GPU's, fully in VRAM.
Tested a long prompt (>14K tokens) in the default UI.

First Bad Commit

#17505
CUDA: generalized (mma) FA, add Volta support #17505

Relevant log output

Pretty much repeats the following (token generated is random per last character in input prompt):


data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"?"}}],"created":1765154291,"id":"chatcmpl-akU6wugj4COtpEO8Of4GvIehBmEVBzG6","model":"GLM-4.5-Air-Q8_0-FFN-Q5_K-Q5_K-Q8_0-v2.gguf","system_fingerprint":"b7256-2e1c9cd81","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":19259,"prompt_ms":22598.212,"prompt_per_token_ms":1.173384495560517,"prompt_per_second":852.2355662474537,"predicted_n":1,"predicted_ms":0.001,"predicted_per_token_ms":0.001,"predicted_per_second":1000000.0}}


srv  update_chat_: Parsing chat message: ??
Parsing input with format Content-only: ??
res          send: sending result for task id = 0
res          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 2, n_remaining = -1, next token:    30 '?'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 8
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 9, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 32768, n_tokens = 19261, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
srv   operator (): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"?"}}],"created":1765154291,"id":"chatcmpl-akU6wugj4COtpEO8Of4GvIehBmEVBzG6","model":"GLM-4.5-Air-Q8_0-FFN-Q5_K-Q5_K-Q8_0-v2.gguf","system_fingerprint":"b7256-2e1c9cd81","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":19259,"prompt_ms":22598.212,"prompt_per_token_ms":1.173384495560517,"prompt_per_second":852.2355662474537,"predicted_n":2,"predicted_ms":41.783,"predicted_per_token_ms":20.8915,"predicted_per_second":47.86635713089055}}

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions