Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
Device 1: Quadro RTX 8000, compute capability 7.5, VMM: yes
Device 2: Quadro RTX 8000, compute capability 7.5, VMM: yes
version: 7256 (2e1c9cd)
built with MSVC 19.38.33145.0 for x64
Operating systems
Windows
GGML backends
CUDA
Hardware
2x Quadro RTX 8000
Models
Zai GLM 4.5 Air Q5
https://huggingface.co/ddh0/GLM-4.5-Air-GGUF/blob/main/GLM-4.5-Air-Q8_0-FFN-Q5_K-Q5_K-Q8_0-v2.gguf
Problem description & steps to reproduce
Running a long prompt in llama-server results in endless token generation (a single character repeated). The build before the FA refactor does not have this issue.
llama-server is currently running with the model split between both GPUs, fully in VRAM.
Tested a long prompt (>14K tokens) in the default UI.
First Bad Commit
#17505
CUDA: generalized (mma) FA, add Volta support #17505
Relevant log output
The output essentially repeats the following; the repeated token varies depending on the last character of the input prompt:
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"?"}}],"created":1765154291,"id":"chatcmpl-akU6wugj4COtpEO8Of4GvIehBmEVBzG6","model":"GLM-4.5-Air-Q8_0-FFN-Q5_K-Q5_K-Q8_0-v2.gguf","system_fingerprint":"b7256-2e1c9cd81","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":19259,"prompt_ms":22598.212,"prompt_per_token_ms":1.173384495560517,"prompt_per_second":852.2355662474537,"predicted_n":1,"predicted_ms":0.001,"predicted_per_token_ms":0.001,"predicted_per_second":1000000.0}}
srv update_chat_: Parsing chat message: ??
Parsing input with format Content-only: ??
res send: sending result for task id = 0
res send: task id = 0 pushed to result queue
slot process_toke: id 0 | task 0 | n_decoded = 2, n_remaining = -1, next token: 30 '?'
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 8
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 9, front = 0
slot update_slots: id 0 | task 0 | slot decode token, n_ctx = 32768, n_tokens = 19261, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
clear_adapter_lora: call
set_embeddings: value = 0
srv operator (): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"?"}}],"created":1765154291,"id":"chatcmpl-akU6wugj4COtpEO8Of4GvIehBmEVBzG6","model":"GLM-4.5-Air-Q8_0-FFN-Q5_K-Q5_K-Q8_0-v2.gguf","system_fingerprint":"b7256-2e1c9cd81","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":19259,"prompt_ms":22598.212,"prompt_per_token_ms":1.173384495560517,"prompt_per_second":852.2355662474537,"predicted_n":2,"predicted_ms":41.783,"predicted_per_token_ms":20.8915,"predicted_per_second":47.86635713089055}}
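For anyone trying to confirm the same failure on their own setup, here is a minimal sketch of a check for the "single character repeated" pattern in the streamed response. It assumes the client-side stream consists of `data: {...}` lines shaped like the chunks in the log above; the helper names `delta_contents` and `is_degenerate` and the `min_run` threshold are hypothetical and not part of llama.cpp.

```python
import json


def delta_contents(lines):
    """Yield the delta content from each `data: {...}` line of a stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip server log lines and blank separators
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore truncated or non-JSON payloads
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


def is_degenerate(contents, min_run=32):
    """Return True once the same content repeats min_run times in a row."""
    run, last = 0, None
    for c in contents:
        run = run + 1 if c == last else 1
        last = c
        if run >= min_run:
            return True
    return False
```

Feeding the lines of a saved stream into `is_degenerate(delta_contents(lines))` should return `True` for output like the above and `False` for a normal completion; the threshold of 32 is an arbitrary assumption and may need tuning.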