
Eval bug: #23972

@wul7chaos

Name and Version

.\llama-cli.exe --version
version: 9456 (5aba536)
built with MSVC 19.51.36246.0 for Windows AMD64

Operating systems

Windows

GGML backends

CUDA

Hardware

Ryzen 7 9700X + RTX 3060 Ti 8 GB + 32 GB DDR5

Models

LFM2.5-8B-A1B-Uncensored-Gaston-Q8_0.gguf

Problem description & steps to reproduce

When llama-server loads LFM2.5-8B-A1B-Uncensored-Gaston-Q8_0.gguf, it crashes while handling messages sent from the WebUI. The crash also occurs with --ubatch-size 128.

There are two separate bugs:

1. On an unmodified build, the first message crashes immediately at mmq.cuh:4135: mmq_x_best stays 0 because every mmq_x candidate is skipped due to the 48KB shared memory limit, which leads to GGML_ABORT.
2. With bug 1 bypassed locally, the first conversation works normally, but the second message crashes at mmid.cu:133: after the context checkpoint is restored, n_tokens * sizeof(mm_ids_helper_store) > 48KB, so GGML_ASSERT(nbytes_shared <= smpbo) fails.

The log below shows the second crash.

Command: llama-server -m LFM2.5-8B-A1B-Uncensored-Gaston-Q8_0.gguf -ngl 99 -c 32768 --ubatch-size 128
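
To make the two failure conditions above easier to discuss, here is a minimal C++ sketch of the patterns as I understand them. It is not the code from mmq.cuh or mmid.cu: pick_tile_width, fits_in_shared, bytes_per_column and kSharedLimitBytes are hypothetical names, the cost model is invented for illustration, and the element size is left as a parameter because the report does not state sizeof(mm_ids_helper_store).

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Assumption: the 48KB per-block shared memory figure quoted in this report.
constexpr std::size_t kSharedLimitBytes = 48 * 1024;

// Bug 1 pattern (hypothetical stand-in, not the real mmq.cuh logic):
// choose the largest candidate tile width whose shared memory need fits.
// If every candidate is skipped, the result stays 0, and a caller that
// treats 0 as impossible would abort.
int pick_tile_width(std::initializer_list<int> candidates,
                    std::size_t bytes_per_column,
                    std::size_t limit = kSharedLimitBytes) {
    assert(bytes_per_column > 0);
    int best = 0;
    for (int x : candidates) {
        const std::size_t need = static_cast<std::size_t>(x) * bytes_per_column;
        if (need > limit) {
            continue;  // candidate does not fit in shared memory, skip it
        }
        if (x > best) {
            best = x;
        }
    }
    return best;  // 0 means no candidate fit
}

// Bug 2 pattern: a per-token buffer must fit in shared memory, i.e.
// n_tokens * elem_size <= limit. elem_size stands in for
// sizeof(mm_ids_helper_store), whose value is an assumption here.
bool fits_in_shared(std::size_t n_tokens,
                    std::size_t elem_size,
                    std::size_t limit = kSharedLimitBytes) {
    return n_tokens * elem_size <= limit;
}
```

With an assumed 4-byte element, fits_in_shared(128, 4) is true and fits_in_shared(32768, 4) is false, so a per-ubatch token count would pass this check while a count on the order of the context size would not. Which value n_tokens actually takes after the checkpoint restore is not something I have confirmed.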

First Bad Commit

No response

Relevant log output

========================================
 LFM2.5-8B HTTP Server
 Model: .\models\LFW\LFM2.5-8B-A1B-Uncensored-Gaston-Q8_0.gguf
 URL:   http://127.0.0.1:8080
========================================

0.00.026.888 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.026.895 I device_info:
0.00.095.467 I   - CUDA0   : NVIDIA GeForce RTX 3060 Ti (8191 MiB, 7140 MiB free)
0.00.095.481 I   - CPU     : AMD Ryzen 7 9700X 8-Core Processor              (31861 MiB, 18725 MiB free)
0.00.095.578 I system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.095.580 I srv  llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.095.820 I srv          init: using 15 threads for HTTP server
0.00.096.118 I srv         start: binding port with default address family
0.00.099.330 I srv  llama_server: loading model
0.00.099.340 I srv    load_model: loading model '.\models\LFW\LFM2.5-8B-A1B-Uncensored-Gaston-Q8_0.gguf'
0.00.099.445 I common_init_result: fitting params to device memory ...
0.00.099.445 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.971.591 W common_fit_params: failed to fit params to free device memory: n_gpu_layers already set by user to 99, abort
0.05.158.624 W llama_context: n_ctx_seq (32768) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
0.05.330.707 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.05.541.791 I srv    load_model: initializing slots, n_slots = 4
0.05.718.991 W srv    load_model: speculative decoding will use checkpoints
0.05.719.087 W common_speculative_init: no implementations specified for speculative decoding
0.05.719.092 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 32768
0.05.719.101 I slot   load_model: id  1 | task -1 | new slot, n_ctx = 32768
0.05.719.102 I slot   load_model: id  2 | task -1 | new slot, n_ctx = 32768
0.05.719.103 I slot   load_model: id  3 | task -1 | new slot, n_ctx = 32768
0.05.719.174 I srv    load_model: prompt cache is enabled, size limit: 8192 MiB
0.05.719.174 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
0.05.719.177 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.05.719.178 I srv    load_model: context checkpoints enabled, max = 32, min spacing = 256
0.05.719.245 I srv          init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.05.736.473 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
0.05.738.354 I srv          init: init: chat template, thinking = 1
0.05.738.423 I srv  llama_server: model loaded
0.05.738.427 I srv  llama_server: server is listening on http://127.0.0.1:8080
0.05.738.439 I srv  update_slots: all slots are idle
0.16.667.027 I srv  params_from_: Chat format: peg-native
0.16.667.761 I slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
0.16.667.766 I srv  get_availabl: updating prompt cache
0.16.667.776 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.16.667.783 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 32768 tokens, 8589934592 est)
0.16.667.784 I srv  get_availabl: prompt cache update took 0.02 ms
0.16.667.960 I slot launch_slot_: id  3 | task 0 | processing task, is_child = 0
0.16.915.112 I slot create_check: id  3 | task 0 | created context checkpoint 1 of 32 (pos_min = 6, pos_max = 6, n_tokens = 7, size = 0.282 MiB)
0.19.129.178 I slot print_timing: id  3 | task 0 | prompt eval time =     300.81 ms /    11 tokens (   27.35 ms per token,    36.57 tokens per second)
0.19.129.187 I slot print_timing: id  3 | task 0 |        eval time =    2160.33 ms /    84 tokens (   25.72 ms per token,    38.88 tokens per second)
0.19.129.188 I slot print_timing: id  3 | task 0 |       total time =    2461.14 ms /    95 tokens
0.19.129.190 I slot print_timing: id  3 | task 0 |    graphs reused =         83
0.19.129.276 I slot      release: id  3 | task 0 | stop processing: n_tokens = 94, truncated = 0
0.19.129.309 I srv  update_slots: all slots are idle
0.30.454.570 I srv  params_from_: Chat format: peg-native
0.30.455.192 I slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.303 (> 0.100 thold), f_keep = 0.106
0.30.455.199 I srv  get_availabl: updating prompt cache
0.30.455.641 W srv   prompt_save:  - saving prompt with length 94, total state size = 1.384 MiB (draft: 0.000 MiB)
0.30.460.099 I srv          load:  - looking for better prompt, base f_keep = 0.106, sim = 0.303
0.30.460.113 I srv        update:  - cache state: 1 prompts, 1.666 MiB (limits: 8192.000 MiB, 32768 tokens, 462167 est)
0.30.460.116 I srv        update:    - prompt 000001D744D0D910:      94 tokens, checkpoints:  1,     1.666 MiB
0.30.460.118 I srv  get_availabl: prompt cache update took 4.92 ms
0.30.460.515 I slot launch_slot_: id  3 | task 86 | processing task, is_child = 0
0.30.460.567 I slot update_slots: id  3 | task 86 | Checking checkpoint with [6, 6] against 9...
0.30.500.704 W slot update_slots: id  3 | task 86 | restored context checkpoint (pos_min = 6, pos_max = 6, n_tokens = 7, n_past = 7, size = 0.282 MiB)
*\llama.cpp\ggml\src\ggml-cuda\mmid.cu:133: GGML_ASSERT(nbytes_shared <= smpbo) failed

Done. Close this window to stop the server.
Press any key to continue . . .
Terminate batch job (Y/N)?
^C
