When I run llama-server with more than one GPU, starting from b7256, processing long prompts triggers a CUDA error (the last line of the log below).
If I limit execution to a single GPU by setting CUDA_VISIBLE_DEVICES=0, the error does not occur.
c:\temp\context_test>c:\temp\context_test\llamacpp\b7256-error\llama-server.exe -m e:\neuro\LLM-server\models\gpt-oss-120b-Derestricted.MXFP4_MOE.gguf --temp 0.7 --top-p 0.95 --min-p 0.01 --top-k 200 -c 0 -ts 25,12 --n-cpu-moe 10 -ub 2048 -b 2048 --cpu-range 0-7 --cpu-strict 1 --prio 2 --threads 8 --no-mmap --jinja -fa on --n-predict -1 --keep -1 --host 0.0.0.0 --port 5000 --no-webui
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from c:\temp\context_test\llamacpp\b7256-error\ggml-cuda.dll
load_backend: loaded RPC backend from c:\temp\context_test\llamacpp\b7256-error\ggml-rpc.dll
load_backend: loaded CPU backend from c:\temp\context_test\llamacpp\b7256-error\ggml-cpu-alderlake.dll
main: setting n_parallel = 4 and kv_unified = true (add -kvu to disable this)
build: 7256 (2e1c9cd81) with clang version 19.1.5 for x86_64-pc-windows-msvc
system info: n_threads = 8, n_threads_batch = 8, total_threads = 32
system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: using 31 threads for HTTP server
Web UI is disabled
start: binding port with default address family
main: loading model
srv load_model: loading model 'e:\neuro\LLM-server\models\gpt-oss-120b-Derestricted.MXFP4_MOE.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 30841 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:07:00.0) - 23304 MiB free
llama_model_loader: loaded meta data with 47 key-value pairs and 687 tensors from e:\neuro\LLM-server\models\gpt-oss-120b-Derestricted.MXFP4_MOE.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gpt-oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gpt Oss 120b Derestricted
llama_model_loader: - kv 3: general.finetune str = Derestricted
llama_model_loader: - kv 4: general.basename str = gpt-oss
llama_model_loader: - kv 5: general.size_label str = 120B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Gpt Oss 120b
llama_model_loader: - kv 9: general.base_model.0.organization str = Openai
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/openai/gpt-oss...
llama_model_loader: - kv 11: general.tags arr[str,7] = ["abliterated", "derestricted", "gpt-...
llama_model_loader: - kv 12: gpt-oss.block_count u32 = 36
llama_model_loader: - kv 13: gpt-oss.context_length u32 = 131072
llama_model_loader: - kv 14: gpt-oss.embedding_length u32 = 2880
llama_model_loader: - kv 15: gpt-oss.feed_forward_length u32 = 2880
llama_model_loader: - kv 16: gpt-oss.attention.head_count u32 = 64
llama_model_loader: - kv 17: gpt-oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: gpt-oss.rope.freq_base f32 = 150000.000000
llama_model_loader: - kv 19: gpt-oss.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: gpt-oss.expert_count u32 = 128
llama_model_loader: - kv 21: gpt-oss.expert_used_count u32 = 4
llama_model_loader: - kv 22: gpt-oss.attention.key_length u32 = 64
llama_model_loader: - kv 23: gpt-oss.attention.value_length u32 = 64
llama_model_loader: - kv 24: gpt-oss.attention.sliding_window u32 = 128
llama_model_loader: - kv 25: gpt-oss.expert_feed_forward_length u32 = 2880
llama_model_loader: - kv 26: gpt-oss.rope.scaling.type str = yarn
llama_model_loader: - kv 27: gpt-oss.rope.scaling.factor f32 = 32.000000
llama_model_loader: - kv 28: gpt-oss.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 29: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 30: tokenizer.ggml.pre str = gpt-4o
llama_model_loader: - kv 31: tokenizer.ggml.tokens arr[str,201088] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 32: tokenizer.ggml.token_type arr[i32,201088] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 33: tokenizer.ggml.merges arr[str,446189] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 34: tokenizer.ggml.bos_token_id u32 = 199998
llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 200002
llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 199999
llama_model_loader: - kv 37: tokenizer.chat_template str = {#-\n In addition to the normal input...
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - kv 39: general.file_type u32 = 38
llama_model_loader: - kv 40: general.url str = https://huggingface.co/mradermacher/g...
llama_model_loader: - kv 41: mradermacher.quantize_version str = 2
llama_model_loader: - kv 42: mradermacher.quantized_by str = mradermacher
llama_model_loader: - kv 43: mradermacher.quantized_at str = 2025-11-29T09:32:25+01:00
llama_model_loader: - kv 44: mradermacher.quantized_on str = nico1
llama_model_loader: - kv 45: general.source.url str = https://huggingface.co/ArliAI/gpt-oss...
llama_model_loader: - kv 46: mradermacher.convert_type str = hf
llama_model_loader: - type f32: 433 tensors
llama_model_loader: - type q8_0: 146 tensors
llama_model_loader: - type mxfp4: 108 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 59.02 GiB (4.34 BPW)
load: printing all EOG tokens:
load: - 199999 ('<|endoftext|>')
load: - 200002 ('<|return|>')
load: - 200007 ('<|end|>')
load: - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch = gpt-oss
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2880
print_info: n_embd_inp = 2880
print_info: n_layer = 36
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 128
print_info: is_swa_any = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 2880
print_info: n_expert = 128
print_info: n_expert_used = 4
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = yarn
print_info: freq_base_train = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = 120B
print_info: model params = 116.83 B
print_info: general.name = Gpt Oss 120b Derestricted
print_info: n_ff_exp = 2880
print_info: vocab type = BPE
print_info: n_vocab = 201088
print_info: n_merges = 446189
print_info: BOS token = 199998 '<|startoftext|>'
print_info: EOS token = 200002 '<|return|>'
print_info: EOT token = 199999 '<|endoftext|>'
print_info: PAD token = 199999 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 199999 '<|endoftext|>'
print_info: EOG token = 200002 '<|return|>'
print_info: EOG token = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CUDA0 model buffer size = 24977.23 MiB
load_tensors: CUDA1 model buffer size = 18695.54 MiB
load_tensors: CUDA_Host model buffer size = 16765.73 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 131072
llama_context: n_ctx_seq = 131072
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = true
llama_context: freq_base = 150000.0
llama_context: freq_scale = 0.03125
llama_context: CUDA_Host output buffer size = 3.07 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 131072 cells
llama_kv_cache: CUDA0 KV buffer size = 3072.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 1536.00 MiB
llama_kv_cache: size = 4608.00 MiB (131072 cells, 18 layers, 4/1 seqs), K (f16): 2304.00 MiB, V (f16): 2304.00 MiB
llama_kv_cache_iswa: creating SWA KV cache, size = 2560 cells
llama_kv_cache: CUDA0 KV buffer size = 65.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 25.00 MiB
llama_kv_cache: size = 90.00 MiB ( 2560 cells, 18 layers, 4/1 seqs), K (f16): 45.00 MiB, V (f16): 45.00 MiB
llama_context: CUDA0 compute buffer size = 1848.93 MiB
llama_context: CUDA1 compute buffer size = 1593.50 MiB
llama_context: CUDA_Host compute buffer size = 1066.59 MiB
llama_context: graph nodes = 2024
llama_context: graph splits = 63 (with bs=2048), 23 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 4
slot init: id 0 | task -1 | new slot, n_ctx = 131072
slot init: id 1 | task -1 | new slot, n_ctx = 131072
slot init: id 2 | task -1 | new slot, n_ctx = 131072
slot init: id 3 | task -1 | new slot, n_ctx = 131072
srv init: prompt cache is enabled, size limit: 8192 MiB
srv init: use `--cache-ram 0` to disable the prompt cache
srv init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv init: thinking = 0
init: chat template, chat_template: {#-
In addition to the normal inputs of `messages` and `tools`, this template also accepts the
following kwargs:
- "builtin_tools": A list, can contain "browser" and/or "python".
- "model_identity": A string that optionally describes the model identity.
- "reasoning_effort": A string that describes the reasoning effort, defaults to "medium".
#}
{#- Tool Definition Rendering ============================================== #}
{%- macro render_typescript_type(param_spec, required_params, is_nullable=false) -%}
{%- if param_spec.type == "array" -%}
{%- if param_spec['items'] -%}
{%- if param_spec['items']['type'] == "string" -%}
{{- "string[]" }}
{%- elif param_spec['items']['type'] == "number" -%}
{{- "number[]" }}
{%- elif param_spec['items']['type'] == "integer" -%}
{{- "number[]" }}
{%- elif param_spec['items']['type'] == "boolean" -%}
{{- "boolean[]" }}
{%- else -%}
{%- set inner_type = render_typescript_type(param_spec['items'], required_params) -%}
{%- if inner_type == "object | object" or inner_type|length > 50 -%}
{{- "any[]" }}
{%- else -%}
{{- inner_type + "[]" }}
{%- endif -%}
{%- endif -%}
{%- if param_spec.nullable -%}
{{- " | null" }}
{%- endif -%}
{%- else -%}
{{- "any[]" }}
{%- if param_spec.nullable -%}
{{- " | null" }}
{%- endif -%}
{%- endif -%}
{%- elif param_spec.type is defined and param_spec.type is iterable and param_spec.type is not string and param_spec.type is not mapping and param_spec.type[0] is defined -%}
{#- Handle array of types like ["object", "object"] from Union[dict, list] #}
{%- if param_spec.type | length > 1 -%}
{{- param_spec.type | join(" | ") }}
{%- else -%}
{{- param_spec.type[0] }}
{%- endif -%}
{%- elif param_spec.oneOf -%}
{#- Handle oneOf schemas - check for complex unions and fallback to any #}
{%- set has_object_variants = false -%}
{%- for variant in param_spec.oneOf -%}
{%- if variant.type == "object" -%}
{%- set has_object_variants = true -%}
{%- endif -%}
{%- endfor -%}
{%- if has_object_variants and param_spec.oneOf|length > 1 -%}
{{- "any" }}
{%- else -%}
{%- for variant in param_spec.oneOf -%}
{{- render_typescript_type(variant, required_params) -}}
{%- if variant.description %}
{{- "// " + variant.description }}
{%- endif -%}
{%- if variant.default is defined %}
{{ "// default: " + variant.default|tojson }}
{%- endif -%}
{%- if not loop.last %}
{{- " | " }}
{% endif -%}
{%- endfor -%}
{%- endif -%}
{%- elif param_spec.type == "string" -%}
{%- if param_spec.enum -%}
{{- '"' + param_spec.enum|join('" | "') + '"' -}}
{%- else -%}
{{- "string" }}
{%- if param_spec.nullable %}
{{- " | null" }}
{%- endif -%}
{%- endif -%}
{%- elif param_spec.type == "number" -%}
{{- "number" }}
{%- elif param_spec.type == "integer" -%}
{{- "number" }}
{%- elif param_spec.type == "boolean" -%}
{{- "boolean" }}
{%- elif param_spec.type == "object" -%}
{%- if param_spec.properties -%}
{{- "{\n" }}
{%- for prop_name, prop_spec in param_spec.properties.items() -%}
{{- prop_name -}}
{%- if prop_name not in (param_spec.required or []) -%}
{{- "?" }}
{%- endif -%}
{{- ": " }}
{{ render_typescript_type(prop_spec, param_spec.required or []) }}
{%- if not loop.last -%}
{{-", " }}
{%- endif -%}
{%- endfor -%}
{{- "}" }}
{%- else -%}
{{- "object" }}
{%- endif -%}
{%- else -%}
{{- "any" }}
{%- endif -%}
{%- endmacro -%}
{%- macro render_tool_namespace(namespace_name, tools) -%}
{{- "## " + namespace_name + "\n\n" }}
{{- "namespace " + namespace_name + " {\n\n" }}
{%- for tool in tools %}
{%- set tool = tool.function %}
{{- "// " + tool.description + "\n" }}
{{- "type "+ tool.name + " = " }}
{%- if tool.parameters and tool.parameters.properties %}
{{- "(_: {\n" }}
{%- for param_name, param_spec in tool.parameters.properties.items() %}
{%- if param_spec.description %}
{{- "// " + param_spec.description + "\n" }}
{%- endif %}
{{- param_name }}
{%- if param_name not in (tool.parameters.required or []) -%}
{{- "?" }}
{%- endif -%}
{{- ": " }}
{{- render_typescript_type(param_spec, tool.parameters.required or []) }}
{%- if param_spec.default is defined -%}
{%- if param_spec.enum %}
{{- ", // default: " + param_spec.default }}
{%- elif param_spec.oneOf %}
{{- "// default: " + param_spec.default }}
{%- else %}
{{- ", // default: " + param_spec.default|tojson }}
{%- endif -%}
{%- endif -%}
{%- if not loop.last %}
{{- ",\n" }}
{%- else %}
{{- ",\n" }}
{%- endif -%}
{%- endfor %}
{{- "}) => any;\n\n" }}
{%- else -%}
{{- "() => any;\n\n" }}
{%- endif -%}
{%- endfor %}
{{- "} // namespace " + namespace_name }}
{%- endmacro -%}
{%- macro render_builtin_tools(browser_tool, python_tool) -%}
{%- if browser_tool %}
{{- "## browser\n\n" }}
{{- "// Tool for browsing.\n" }}
{{- "// The `cursor` appears in brackets before each browsing display: `[{cursor}]`.\n" }}
{{- "// Cite information from the tool using the following format:\n" }}
{{- "// `【{cursor}†L{line_start}(-L{line_end})?】`, for example: `【6†L9-L11】` or `【8†L3】`.\n" }}
{{- "// Do not quote more than 10 words directly from the tool output.\n" }}
{{- "// sources=web (default: web)\n" }}
{{- "namespace browser {\n\n" }}
{{- "// Searches for information related to `query` and displays `topn` results.\n" }}
{{- "type search = (_: {\n" }}
{{- "query: string,\n" }}
{{- "topn?: number, // default: 10\n" }}
{{- "source?: string,\n" }}
{{- "}) => any;\n\n" }}
{{- "// Opens the link `id` from the page indicated by `cursor` starting at line number `loc`, showing `num_lines` lines.\n" }}
{{- "// Valid link ids are displayed with the formatting: `【{id}†.*】`.\n" }}
{{- "// If `cursor` is not provided, the most recent page is implied.\n" }}
{{- "// If `id` is a string, it is treated as a fully qualified URL associated with `source`.\n" }}
{{- "// If `loc` is not provided, the viewport will be positioned at the beginning of the document or centered on the most relevant passage, if available.\n" }}
{{- "// Use this function without `id` to scroll to a new location of an opened page.\n" }}
{{- "type open = (_: {\n" }}
{{- "id?: number | string, // default: -1\n" }}
{{- "cursor?: number, // default: -1\n" }}
{{- "loc?: number, // default: -1\n" }}
{{- "num_lines?: number, // default: -1\n" }}
{{- "view_source?: boolean, // default: false\n" }}
{{- "source?: string,\n" }}
{{- "}) => any;\n\n" }}
{{- "// Finds exact matches of `pattern` in the current page, or the page given by `cursor`.\n" }}
{{- "type find = (_: {\n" }}
{{- "pattern: string,\n" }}
{{- "cursor?: number, // default: -1\n" }}
{{- "}) => any;\n\n" }}
{{- "} // namespace browser\n\n" }}
{%- endif -%}
{%- if python_tool %}
{{- "## python\n\n" }}
{{- "Use this tool to execute Python code in your chain of thought. The code will not be shown to the user. This tool should be used for internal reasoning, but not for code that is intended to be visible to the user (e.g. when creating plots, tables, or files).\n\n" }}
{{- "When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 120.0 seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is UNKNOWN. Depends on the cluster.\n\n" }}
{%- endif -%}
{%- endmacro -%}
{#- System Message Construction ============================================ #}
{%- macro build_system_message() -%}
{%- if model_identity is not defined %}
{%- set model_identity = "You are ChatGPT, a large language model trained by OpenAI." %}
{%- endif %}
{{- model_identity + "\n" }}
{{- "Knowledge cutoff: 2024-06\n" }}
{{- "Current date: " + strftime_now("%Y-%m-%d") + "\n\n" }}
{%- if reasoning_effort is not defined %}
{%- set reasoning_effort = "medium" %}
{%- endif %}
{{- "Reasoning: " + reasoning_effort + "\n\n" }}
{%- if builtin_tools %}
{{- "# Tools\n\n" }}
{%- set available_builtin_tools = namespace(browser=false, python=false) %}
{%- for tool in builtin_tools %}
{%- if tool == "browser" %}
{%- set available_builtin_tools.browser = true %}
{%- elif tool == "python" %}
{%- set available_builtin_tools.python = true %}
{%- endif %}
{%- endfor %}
{{- render_builtin_tools(available_builtin_tools.browser, available_builtin_tools.python) }}
{%- endif -%}
{{- "# Valid channels: analysis, commentary, final. Channel must be included for every message." }}
{%- if tools -%}
{{- "\nCalls to these tools must go to the commentary channel: 'functions'." }}
{%- endif -%}
{%- endmacro -%}
{#- Main Template Logic ================================================= #}
{#- Set defaults #}
{#- Render system message #}
{{- "<|start|>system<|message|>" }}
{{- build_system_message() }}
{{- "<|end|>" }}
{#- Extract developer message #}
{%- if messages[0].role == "developer" or messages[0].role == "system" %}
{%- set developer_message = messages[0].content %}
{%- set loop_messages = messages[1:] %}
{%- else %}
{%- set developer_message = "" %}
{%- set loop_messages = messages %}
{%- endif %}
{#- Render developer message #}
{%- if developer_message or tools %}
{{- "<|start|>developer<|message|>" }}
{%- if developer_message %}
{{- "# Instructions\n\n" }}
{{- developer_message }}
{{- "\n\n" }}
{%- endif %}
{%- if tools -%}
{{- "# Tools\n\n" }}
{{- render_tool_namespace("functions", tools) }}
{%- endif -%}
{{- "<|end|>" }}
{%- endif %}
{#- Render messages #}
{%- set last_tool_call = namespace(name=none) %}
{%- for message in loop_messages -%}
{#- At this point only assistant/user/tool messages should remain #}
{%- if message.role == 'assistant' -%}
{#- Checks to ensure the messages are being passed in the format we expect #}
{%- if "content" in message %}
{%- if false %}
{{- raise_exception("You have passed a message containing <|channel|> tags in the content field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }}
{%- endif %}
{%- endif %}
{%- if "thinking" in message %}
{%- if "<|channel|>analysis<|message|>" in message.thinking or "<|channel|>final<|message|>" in message.thinking %}
{{- raise_exception("You have passed a message containing <|channel|> tags in the thinking field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }}
{%- endif %}
{%- endif %}
{%- if "tool_calls" in message %}
{#- We need very careful handling here - we want to drop the tool call analysis message if the model #}
{#- has output a later <|final|> message, but otherwise we want to retain it. This is the only case #}
{#- when we render CoT/analysis messages in inference. #}
{%- set future_final_message = namespace(found=false) %}
{%- for future_message in loop_messages[loop.index:] %}
{%- if future_message.role == 'assistant' and "tool_calls" not in future_message %}
{%- set future_final_message.found = true %}
{%- endif %}
{%- endfor %}
{#- We assume max 1 tool call per message, and so we infer the tool call name #}
{#- in "tool" messages from the most recent assistant tool call name #}
{%- set tool_call = message.tool_calls[0] %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{%- if message.content and message.thinking %}
{{- raise_exception("Cannot pass both content and thinking in an assistant message with tool calls! Put the analysis message in one or the other, but not both.") }}
{%- elif message.content and not future_final_message.found %}
{{- "<|start|>assistant<|channel|>analysis<|message|>" + message.content + "<|end|>" }}
{%- elif message.thinking and not future_final_message.found %}
{{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
{%- endif %}
{{- "<|start|>assistant to=" }}
{{- "functions." + tool_call.name + "<|channel|>commentary " }}
{{- (tool_call.content_type if tool_call.content_type is defined else "json") + "<|message|>" }}
{{- tool_call.arguments|tojson }}
{{- "<|call|>" }}
{%- set last_tool_call.name = tool_call.name %}
{%- elif loop.last and not add_generation_prompt %}
{#- Only render the CoT if the final turn is an assistant turn and add_generation_prompt is false #}
{#- This is a situation that should only occur in training, never in inference. #}
{%- if "thinking" in message %}
{{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
{%- endif %}
{#- <|return|> indicates the end of generation, but <|end|> does not #}
{#- <|return|> should never be an input to the model, but we include it as the final token #}
{#- when training, so the model learns to emit it. #}
{{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|return|>" }}
{%- else %}
{#- CoT is dropped during all previous turns, so we never render it for inference #}
{{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }}
{%- set last_tool_call.name = none %}
{%- endif %}
{%- elif message.role == 'tool' -%}
{%- if last_tool_call.name is none %}
{{- raise_exception("Message has tool role, but there was no previous assistant message with a tool call!") }}
{%- endif %}
{{- "<|start|>functions." + last_tool_call.name }}
{{- " to=assistant<|channel|>commentary<|message|>" + message.content|tojson + "<|end|>" }}
{%- elif message.role == 'user' -%}
{{- "<|start|>user<|message|>" + message.content + "<|end|>" }}
{%- endif -%}
{%- endfor -%}
{#- Generation prompt #}
{%- if add_generation_prompt -%}
<|start|>assistant
{%- endif -%}, example_format: '<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-12-06
Reasoning: medium
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions
You are a helpful assistant
<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|channel|>final<|message|>Hi there<|end|><|start|>user<|message|>How are you?<|end|><|start|>assistant'
main: model loaded
main: server is listening on http://0.0.0.0:5000
main: starting the main loop...
srv update_slots: all slots are idle
srv params_from_: Chat format: GPT-OSS
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 0 | processing task
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = -1, task.n_tokens = 90072
slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.022737
slot update_slots: id 3 | task 0 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.045475
slot update_slots: id 3 | task 0 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.068212
slot update_slots: id 3 | task 0 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.090949
slot update_slots: id 3 | task 0 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 10240, batch.n_tokens = 2048, progress = 0.113687
slot update_slots: id 3 | task 0 | n_tokens = 10240, memory_seq_rm [10240, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 12288, batch.n_tokens = 2048, progress = 0.136424
slot update_slots: id 3 | task 0 | n_tokens = 12288, memory_seq_rm [12288, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 14336, batch.n_tokens = 2048, progress = 0.159162
slot update_slots: id 3 | task 0 | n_tokens = 14336, memory_seq_rm [14336, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 16384, batch.n_tokens = 2048, progress = 0.181899
slot update_slots: id 3 | task 0 | n_tokens = 16384, memory_seq_rm [16384, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 18432, batch.n_tokens = 2048, progress = 0.204636
slot update_slots: id 3 | task 0 | n_tokens = 18432, memory_seq_rm [18432, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 20480, batch.n_tokens = 2048, progress = 0.227374
slot update_slots: id 3 | task 0 | n_tokens = 20480, memory_seq_rm [20480, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 22528, batch.n_tokens = 2048, progress = 0.250111
slot update_slots: id 3 | task 0 | n_tokens = 22528, memory_seq_rm [22528, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 24576, batch.n_tokens = 2048, progress = 0.272848
slot update_slots: id 3 | task 0 | n_tokens = 24576, memory_seq_rm [24576, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 26624, batch.n_tokens = 2048, progress = 0.295586
slot update_slots: id 3 | task 0 | n_tokens = 26624, memory_seq_rm [26624, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 28672, batch.n_tokens = 2048, progress = 0.318323
slot update_slots: id 3 | task 0 | n_tokens = 28672, memory_seq_rm [28672, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 30720, batch.n_tokens = 2048, progress = 0.341060
slot update_slots: id 3 | task 0 | n_tokens = 30720, memory_seq_rm [30720, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 32768, batch.n_tokens = 2048, progress = 0.363798
slot update_slots: id 3 | task 0 | n_tokens = 32768, memory_seq_rm [32768, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 34816, batch.n_tokens = 2048, progress = 0.386535
slot update_slots: id 3 | task 0 | n_tokens = 34816, memory_seq_rm [34816, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 36864, batch.n_tokens = 2048, progress = 0.409273
slot update_slots: id 3 | task 0 | n_tokens = 36864, memory_seq_rm [36864, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 38912, batch.n_tokens = 2048, progress = 0.432010
slot update_slots: id 3 | task 0 | n_tokens = 38912, memory_seq_rm [38912, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 40960, batch.n_tokens = 2048, progress = 0.454747
slot update_slots: id 3 | task 0 | n_tokens = 40960, memory_seq_rm [40960, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 43008, batch.n_tokens = 2048, progress = 0.477485
slot update_slots: id 3 | task 0 | n_tokens = 43008, memory_seq_rm [43008, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 45056, batch.n_tokens = 2048, progress = 0.500222
slot update_slots: id 3 | task 0 | n_tokens = 45056, memory_seq_rm [45056, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 47104, batch.n_tokens = 2048, progress = 0.522959
slot update_slots: id 3 | task 0 | n_tokens = 47104, memory_seq_rm [47104, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 49152, batch.n_tokens = 2048, progress = 0.545697
slot update_slots: id 3 | task 0 | n_tokens = 49152, memory_seq_rm [49152, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 51200, batch.n_tokens = 2048, progress = 0.568434
slot update_slots: id 3 | task 0 | n_tokens = 51200, memory_seq_rm [51200, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 53248, batch.n_tokens = 2048, progress = 0.591172
slot update_slots: id 3 | task 0 | n_tokens = 53248, memory_seq_rm [53248, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 55296, batch.n_tokens = 2048, progress = 0.613909
slot update_slots: id 3 | task 0 | n_tokens = 55296, memory_seq_rm [55296, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 57344, batch.n_tokens = 2048, progress = 0.636646
slot update_slots: id 3 | task 0 | n_tokens = 57344, memory_seq_rm [57344, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 59392, batch.n_tokens = 2048, progress = 0.659384
slot update_slots: id 3 | task 0 | n_tokens = 59392, memory_seq_rm [59392, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 61440, batch.n_tokens = 2048, progress = 0.682121
slot update_slots: id 3 | task 0 | n_tokens = 61440, memory_seq_rm [61440, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 63488, batch.n_tokens = 2048, progress = 0.704858
slot update_slots: id 3 | task 0 | n_tokens = 63488, memory_seq_rm [63488, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 65536, batch.n_tokens = 2048, progress = 0.727596
slot update_slots: id 3 | task 0 | n_tokens = 65536, memory_seq_rm [65536, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 67584, batch.n_tokens = 2048, progress = 0.750333
slot update_slots: id 3 | task 0 | n_tokens = 67584, memory_seq_rm [67584, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 69632, batch.n_tokens = 2048, progress = 0.773070
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:90: CUDA error
Name and Version
c:\temp\context_test\llamacpp\b7256-error>llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from c:\temp\context_test\llamacpp\b7256-error\ggml-cuda.dll
load_backend: loaded RPC backend from c:\temp\context_test\llamacpp\b7256-error\ggml-rpc.dll
load_backend: loaded CPU backend from c:\temp\context_test\llamacpp\b7256-error\ggml-cpu-alderlake.dll
version: 7256 (2e1c9cd)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
Operating systems
Windows
GGML backends
CUDA
Hardware
GPU: RTX 5090 + RTX 3090
CPU: i9-13900F
Models
https://huggingface.co/mradermacher/gpt-oss-120b-Derestricted-i1-GGUF/blob/main/gpt-oss-120b-Derestricted.i1-MXFP4_MOE.gguf.part1of2
or
https://huggingface.co/unsloth/gpt-oss-120b-GGUF/blob/main/gpt-oss-120b-F16.gguf
Problem description & steps to reproduce
When I run llama-server with more than one GPU, starting from b7256, processing long prompts triggers the following error:
In newer versions, for example b7278, the error changes to:
If I limit the execution to a single GPU by setting CUDA_VISIBLE_DEVICES=0, the error does not occur.
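For anyone repeating the single-GPU comparison from a script, here is a minimal sketch, assuming the server is launched from Python. The helper names single_gpu_env and launch_server are illustrative and not part of the original report.

```python
import os
import subprocess


def single_gpu_env(device_index="0", base=None):
    """Return a copy of the environment that exposes only one CUDA device."""
    env = dict(os.environ if base is None else base)
    # CUDA_VISIBLE_DEVICES restricts which GPUs the CUDA runtime enumerates.
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return env


def launch_server(args, device_index="0"):
    """Start a process (e.g. llama-server) with only one GPU visible.

    Hypothetical helper: args is the full command line as a list of strings.
    """
    return subprocess.Popen(args, env=single_gpu_env(device_index))
```

Passing the same argument list as the failing run to launch_server should give the single-GPU comparison described above, with only device 0 visible to the process.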
Short Python test:
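The original test script is not included here. What follows is only a minimal sketch of such a test, assuming the server from the command above is listening on port 5000 and using llama-server's OpenAI-compatible /v1/chat/completions endpoint. The filler text, the word count, and the helper names are illustrative assumptions, not the reporter's code.

```python
import json
import urllib.request


def build_long_prompt(n_words):
    """Build a filler prompt of n_words words (the token count will differ)."""
    return " ".join(f"word{i}" for i in range(n_words))


def build_request(base_url, prompt, max_tokens=16):
    """Prepare a POST to the OpenAI-compatible chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_test(base_url="http://127.0.0.1:5000", n_words=60000):
    """Send one long prompt and return the parsed JSON response.

    If the server aborts during prompt processing, urlopen raises instead.
    """
    req = build_request(base_url, build_long_prompt(n_words))
    with urllib.request.urlopen(req, timeout=3600) as resp:
        return json.loads(resp.read())
```

Calling run_test() against the dual-GPU server is intended to exercise the same long prompt-processing path as the 90072-token request in the log. Adjust n_words until the prompt is comparably long.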
On b7255 the error does not occur, even with both GPUs in use.
First Bad Commit
#17505
Relevant log output