========================================
LFM2.5-8B HTTP Server
Model: .\models\LFW\LFM2.5-8B-A1B-Uncensored-Gaston-Q8_0.gguf
URL: http://127.0.0.1:8080
========================================
0.00.026.888 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.026.895 I device_info:
0.00.095.467 I - CUDA0 : NVIDIA GeForce RTX 3060 Ti (8191 MiB, 7140 MiB free)
0.00.095.481 I - CPU : AMD Ryzen 7 9700X 8-Core Processor (31861 MiB, 18725 MiB free)
0.00.095.578 I system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.095.580 I srv llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.095.820 I srv init: using 15 threads for HTTP server
0.00.096.118 I srv start: binding port with default address family
0.00.099.330 I srv llama_server: loading model
0.00.099.340 I srv load_model: loading model '.\models\LFW\LFM2.5-8B-A1B-Uncensored-Gaston-Q8_0.gguf'
0.00.099.445 I common_init_result: fitting params to device memory ...
0.00.099.445 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.971.591 W common_fit_params: failed to fit params to free device memory: n_gpu_layers already set by user to 99, abort
0.05.158.624 W llama_context: n_ctx_seq (32768) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
0.05.330.707 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.05.541.791 I srv load_model: initializing slots, n_slots = 4
0.05.718.991 W srv load_model: speculative decoding will use checkpoints
0.05.719.087 W common_speculative_init: no implementations specified for speculative decoding
0.05.719.092 I slot load_model: id 0 | task -1 | new slot, n_ctx = 32768
0.05.719.101 I slot load_model: id 1 | task -1 | new slot, n_ctx = 32768
0.05.719.102 I slot load_model: id 2 | task -1 | new slot, n_ctx = 32768
0.05.719.103 I slot load_model: id 3 | task -1 | new slot, n_ctx = 32768
0.05.719.174 I srv load_model: prompt cache is enabled, size limit: 8192 MiB
0.05.719.174 I srv load_model: use `--cache-ram 0` to disable the prompt cache
0.05.719.177 I srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.05.719.178 I srv load_model: context checkpoints enabled, max = 32, min spacing = 256
0.05.719.245 I srv init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.05.736.473 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
0.05.738.354 I srv init: init: chat template, thinking = 1
0.05.738.423 I srv llama_server: model loaded
0.05.738.427 I srv llama_server: server is listening on http://127.0.0.1:8080
0.05.738.439 I srv update_slots: all slots are idle
0.16.667.027 I srv params_from_: Chat format: peg-native
0.16.667.761 I slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
0.16.667.766 I srv get_availabl: updating prompt cache
0.16.667.776 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.16.667.783 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 32768 tokens, 8589934592 est)
0.16.667.784 I srv get_availabl: prompt cache update took 0.02 ms
0.16.667.960 I slot launch_slot_: id 3 | task 0 | processing task, is_child = 0
0.16.915.112 I slot create_check: id 3 | task 0 | created context checkpoint 1 of 32 (pos_min = 6, pos_max = 6, n_tokens = 7, size = 0.282 MiB)
0.19.129.178 I slot print_timing: id 3 | task 0 | prompt eval time = 300.81 ms / 11 tokens ( 27.35 ms per token, 36.57 tokens per second)
0.19.129.187 I slot print_timing: id 3 | task 0 | eval time = 2160.33 ms / 84 tokens ( 25.72 ms per token, 38.88 tokens per second)
0.19.129.188 I slot print_timing: id 3 | task 0 | total time = 2461.14 ms / 95 tokens
0.19.129.190 I slot print_timing: id 3 | task 0 | graphs reused = 83
0.19.129.276 I slot release: id 3 | task 0 | stop processing: n_tokens = 94, truncated = 0
0.19.129.309 I srv update_slots: all slots are idle
0.30.454.570 I srv params_from_: Chat format: peg-native
0.30.455.192 I slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 0.303 (> 0.100 thold), f_keep = 0.106
0.30.455.199 I srv get_availabl: updating prompt cache
0.30.455.641 W srv prompt_save: - saving prompt with length 94, total state size = 1.384 MiB (draft: 0.000 MiB)
0.30.460.099 I srv load: - looking for better prompt, base f_keep = 0.106, sim = 0.303
0.30.460.113 I srv update: - cache state: 1 prompts, 1.666 MiB (limits: 8192.000 MiB, 32768 tokens, 462167 est)
0.30.460.116 I srv update: - prompt 000001D744D0D910: 94 tokens, checkpoints: 1, 1.666 MiB
0.30.460.118 I srv get_availabl: prompt cache update took 4.92 ms
0.30.460.515 I slot launch_slot_: id 3 | task 86 | processing task, is_child = 0
0.30.460.567 I slot update_slots: id 3 | task 86 | Checking checkpoint with [6, 6] against 9...
0.30.500.704 W slot update_slots: id 3 | task 86 | restored context checkpoint (pos_min = 6, pos_max = 6, n_tokens = 7, n_past = 7, size = 0.282 MiB)
*\llama.cpp\ggml\src\ggml-cuda\mmid.cu:133: GGML_ASSERT(nbytes_shared <= smpbo) failed
Done. Close this window to stop the server.
Press any key to continue . . .
Terminate batch job (Y/N)?
^C
Name and Version
.\llama-cli.exe --version
version: 9456 (5aba536)
built with MSVC 19.51.36246.0 for Windows AMD64
Operating systems
Windows
GGML backends
CUDA
Hardware
Ryzen 7 9700X + RTX 3060 Ti 8 GB + 32 GB DDR5
Models
LFM2.5-8B-A1B-Uncensored-Gaston-Q8_0.gguf
Problem description & steps to reproduce
When llama-server loads LFM2.5-8B-A1B-Uncensored-Gaston-Q8_0.gguf, the first message in the WebUI is answered normally, but the second message crashes the server. The crash still occurs with --ubatch-size 128.
Specifically, there are two bugs:
1. The first message crashes immediately at mmq.cuh:4135: mmq_x_best = 0 triggers GGML_ABORT, because every mmq_x candidate is skipped under the 48 KB shared memory limit.
2. After bypassing bug 1 locally, the second message crashes at mmid.cu:133: once the context checkpoint is restored, n_tokens * sizeof(mm_ids_helper_store) exceeds 48 KB and GGML_ASSERT(nbytes_shared <= smpbo) fails.
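The size check described for the second bug can be sketched as plain arithmetic. This is only an illustration, not the llama.cpp code: `ENTRY_BYTES` is a hypothetical stand-in for sizeof(mm_ids_helper_store), whose real value is not given in this report, and the limit is the 48 KB figure quoted above rather than a value read from the device.

```python
# Sketch of the comparison behind GGML_ASSERT(nbytes_shared <= smpbo).
# ENTRY_BYTES is an ASSUMED placeholder for sizeof(mm_ids_helper_store);
# the real size is not stated in this report.
ENTRY_BYTES = 4
SHARED_LIMIT_BYTES = 48 * 1024  # the 48 KB limit quoted in this report


def shared_bytes_needed(n_tokens: int, entry_bytes: int = ENTRY_BYTES) -> int:
    """Shared memory, in bytes, requested for n_tokens entries."""
    return n_tokens * entry_bytes


def fits(n_tokens: int, limit: int = SHARED_LIMIT_BYTES,
         entry_bytes: int = ENTRY_BYTES) -> bool:
    """True when the request stays within the limit, i.e. the check would pass."""
    return shared_bytes_needed(n_tokens, entry_bytes) <= limit


def max_tokens(limit: int = SHARED_LIMIT_BYTES,
               entry_bytes: int = ENTRY_BYTES) -> int:
    """Largest n_tokens that still satisfies the check."""
    return limit // entry_bytes


if __name__ == "__main__":
    # 128 is the --ubatch-size used here, 32768 is the configured context size.
    for n in (128, 32768):
        print(n, shared_bytes_needed(n), fits(n))
```

If the assumed entry size is right, a batch of 128 tokens is far below the limit, so the interesting question is which n_tokens value reaches the kernel after the checkpoint is restored.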
Command: llama-server -m LFM2.5-8B-A1B-Uncensored-Gaston-Q8_0.gguf -ngl 99 -c 32768 --ubatch-size 128
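For anyone who wants to reproduce this without the WebUI, here is a minimal sketch that sends two consecutive chat turns to the server's OpenAI-compatible `/v1/chat/completions` endpoint. The prompts and the `max_tokens` value are arbitrary choices, and I have not confirmed that this script takes the same checkpoint-restore path as the WebUI, so treat it as a starting point only.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # address the server reports in the log


def build_request(messages):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    body = json.dumps({"messages": messages, "max_tokens": 128}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(messages, timeout=120):
    """Send one chat turn and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(messages), timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # First turn: expected to succeed, as in the WebUI.
    history = [{"role": "user", "content": "Hello"}]
    first = chat(history)
    print("first reply:", first)

    # Second turn: the request that crashes the server in the WebUI.
    history.append({"role": "assistant", "content": first})
    history.append({"role": "user", "content": "How are you?"})
    print("second reply:", chat(history))
```

If the server process dies during the second call, the script should fail with a connection error instead of printing a second reply.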
First Bad Commit
No response
Relevant log output
Logs