Commit 1ab61a0

localai-bot and mudler authored
feat: generic chat_template_kwargs (model config + per-request metadata) (#10359)
* feat(config): add chat_template_kwargs model field + resolver

  Adds the ChatTemplateKwargs model-config map and RequestMetadata carrier, plus ResolveChatTemplateKwargs which layers the config map under coerced request metadata. Foundation for generic jinja chat-template kwargs (issue #10329).

  Assisted-by: Claude:claude-opus-4-8
  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend): forward resolved chat_template_kwargs blob to backends

  gRPCPredictOpts now merges per-request client metadata over the server-derived enable_thinking/reasoning_effort (reaching all backends via the standalone keys) and serialises the resolved chat_template_kwargs map into a JSON blob for llama.cpp, written last so a client cannot clobber it. Issue #10329.

  Assisted-by: Claude:claude-opus-4-8
  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(http): wire request metadata to config.RequestMetadata

  The OpenAI request metadata field was parsed but unused; stamp it onto the per-request ModelConfig so gRPCPredictOpts forwards it as chat_template_kwargs overrides. Issue #10329.

  Assisted-by: Claude:claude-opus-4-8
  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): generic chat_template_kwargs merge (drop per-key blocks)

  Replace the per-key enable_thinking/reasoning_effort handling in both the streaming and non-streaming chat paths with a single block that parses the chat_template_kwargs JSON blob resolved by the Go layer and merges every key into body_json. New jinja template levers (e.g. preserve_thinking) now need no C++ change. Issue #10329.

  Assisted-by: Claude:claude-opus-4-8
  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: document custom chat_template_kwargs (model + per-request)

  Issue #10329.

  Assisted-by: Claude:claude-opus-4-8
  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(backend): pin reasoning_effort as a string in the chat_template_kwargs blob

  Issue #10329.

  Assisted-by: Claude:claude-opus-4-8
  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(http): e2e guard pinning chat_template_kwargs forwarded to gRPC

  Adds an ECHO_PREDICT_METADATA marker to the mock-backend that echoes the received PredictOptions.Metadata, and an app_test.go spec that drives a real /v1/chat/completions request (model chat_template_kwargs + per-request metadata override) and asserts the exact metadata + chat_template_kwargs blob the REST layer forwards to gRPC. Locks the REST->gRPC contract against regressions. Issue #10329.

  Assisted-by: Claude:claude-opus-4-8
  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(config): grandfather chat_template_kwargs in registry coverage

  chat_template_kwargs is a free-form map[string]any (like engine_args, already on the list), not a scalar the config UI registry can surface, so it is exempt from the registry-entry requirement. Fixes the TestAllFieldsHaveRegistryEntries failure introduced by the new field. Issue #10329.

  Assisted-by: Claude:claude-opus-4-8
  Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
1 parent f440340 commit 1ab61a0

11 files changed

Lines changed: 396 additions & 34 deletions

backend/cpp/llama-cpp/grpc-server.cpp

Lines changed: 37 additions & 34 deletions
@@ -1922,25 +1922,27 @@ class BackendServiceImpl final : public backend::Backend::Service {
         body_json["min_p"] = data["min_p"];
     }

-    // Pass enable_thinking via chat_template_kwargs (where oaicompat_chat_params_parse reads it)
+    // Forward the chat_template_kwargs the Go layer resolved (model config
+    // chat_template_kwargs + per-request metadata: enable_thinking,
+    // reasoning_effort, preserve_thinking, ...). One generic merge replaces
+    // the previous per-key handling - new template levers need no C++ change.
+    // oaicompat_chat_params_parse reads these from body_json.
     const auto& metadata = request->metadata();
-    auto et_it = metadata.find("enable_thinking");
-    if (et_it != metadata.end()) {
-        if (!body_json.contains("chat_template_kwargs")) {
-            body_json["chat_template_kwargs"] = json::object();
-        }
-        body_json["chat_template_kwargs"]["enable_thinking"] = (et_it->second == "true");
-    }
-
-    // Pass reasoning_effort via chat_template_kwargs too: the lever
-    // jinja templates like gpt-oss (Harmony) / LFM2.5 read, distinct
-    // from enable_thinking which those templates ignore.
-    auto re_it = metadata.find("reasoning_effort");
-    if (re_it != metadata.end() && !re_it->second.empty()) {
-        if (!body_json.contains("chat_template_kwargs")) {
-            body_json["chat_template_kwargs"] = json::object();
+    auto ctk_it = metadata.find("chat_template_kwargs");
+    if (ctk_it != metadata.end() && !ctk_it->second.empty()) {
+        try {
+            json ctk = json::parse(ctk_it->second);
+            if (ctk.is_object()) {
+                if (!body_json.contains("chat_template_kwargs")) {
+                    body_json["chat_template_kwargs"] = json::object();
+                }
+                for (auto& el : ctk.items()) {
+                    body_json["chat_template_kwargs"][el.key()] = el.value();
+                }
+            }
+        } catch (const std::exception & e) {
+            SRV_WRN("failed to parse chat_template_kwargs metadata: %s\n", e.what());
         }
-        body_json["chat_template_kwargs"]["reasoning_effort"] = re_it->second;
     }

     // Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
@@ -2756,25 +2758,26 @@ class BackendServiceImpl final : public backend::Backend::Service {
         body_json["min_p"] = data["min_p"];
     }

-    // Pass enable_thinking via chat_template_kwargs (where oaicompat_chat_params_parse reads it)
+    // Forward the chat_template_kwargs the Go layer resolved (model config
+    // chat_template_kwargs + per-request metadata: enable_thinking,
+    // reasoning_effort, preserve_thinking, ...). One generic merge replaces
+    // the previous per-key handling - new template levers need no C++ change.
     const auto& predict_metadata = request->metadata();
-    auto predict_et_it = predict_metadata.find("enable_thinking");
-    if (predict_et_it != predict_metadata.end()) {
-        if (!body_json.contains("chat_template_kwargs")) {
-            body_json["chat_template_kwargs"] = json::object();
-        }
-        body_json["chat_template_kwargs"]["enable_thinking"] = (predict_et_it->second == "true");
-    }
-
-    // Pass reasoning_effort via chat_template_kwargs too: the lever
-    // jinja templates like gpt-oss (Harmony) / LFM2.5 read, distinct
-    // from enable_thinking which those templates ignore.
-    auto predict_re_it = predict_metadata.find("reasoning_effort");
-    if (predict_re_it != predict_metadata.end() && !predict_re_it->second.empty()) {
-        if (!body_json.contains("chat_template_kwargs")) {
-            body_json["chat_template_kwargs"] = json::object();
+    auto predict_ctk_it = predict_metadata.find("chat_template_kwargs");
+    if (predict_ctk_it != predict_metadata.end() && !predict_ctk_it->second.empty()) {
+        try {
+            json ctk = json::parse(predict_ctk_it->second);
+            if (ctk.is_object()) {
+                if (!body_json.contains("chat_template_kwargs")) {
+                    body_json["chat_template_kwargs"] = json::object();
+                }
+                for (auto& el : ctk.items()) {
+                    body_json["chat_template_kwargs"][el.key()] = el.value();
+                }
+            }
+        } catch (const std::exception & e) {
+            SRV_WRN("failed to parse chat_template_kwargs metadata: %s\n", e.what());
         }
-        body_json["chat_template_kwargs"]["reasoning_effort"] = predict_re_it->second;
     }

     // Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
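
To make the merge semantics of the two C++ blocks easier to follow, here is a minimal Go sketch of the same idea: parse the JSON blob, ignore it unless it is an object, and copy every key over whatever chat_template_kwargs the body already holds. The helper name mergeChatTemplateKwargs and the plain map standing in for body_json are assumptions made for illustration; unlike the C++ code, which logs a warning and continues, this sketch returns the parse error to the caller.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mergeChatTemplateKwargs is a hypothetical helper, not part of the commit.
// It merges a JSON-object blob into body["chat_template_kwargs"], with the
// blob's keys overwriting any existing keys of the same name.
func mergeChatTemplateKwargs(body map[string]any, blob string) error {
	if blob == "" {
		return nil // nothing to forward
	}
	var parsed any
	if err := json.Unmarshal([]byte(blob), &parsed); err != nil {
		return fmt.Errorf("failed to parse chat_template_kwargs metadata: %w", err)
	}
	kwargs, ok := parsed.(map[string]any)
	if !ok {
		return nil // valid JSON but not an object: ignored
	}
	existing, ok := body["chat_template_kwargs"].(map[string]any)
	if !ok {
		existing = map[string]any{}
		body["chat_template_kwargs"] = existing
	}
	for k, v := range kwargs {
		existing[k] = v
	}
	return nil
}

func main() {
	body := map[string]any{
		"chat_template_kwargs": map[string]any{"enable_thinking": true},
	}
	blob := `{"enable_thinking":false,"reasoning_effort":"low"}`
	if err := mergeChatTemplateKwargs(body, blob); err != nil {
		fmt.Println("warning:", err)
	}
	out, _ := json.Marshal(body)
	fmt.Println(string(out))
}
```

Because the merge is key-agnostic, a new template lever only has to appear in the blob; nothing in the merge itself names enable_thinking or reasoning_effort.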

core/backend/options.go

Lines changed: 19 additions & 0 deletions
@@ -368,6 +368,25 @@ func gRPCPredictOpts(c config.ModelConfig, modelPath string) *pb.PredictOptions
 	if c.ReasoningEffort != "" {
 		metadata["reasoning_effort"] = c.ReasoningEffort
 	}
+	// Client request metadata overrides the server-derived reasoning levers and
+	// reaches every backend through these standalone string keys (Python backends
+	// read them directly). The reserved blob key is server-owned and skipped.
+	for k, v := range c.RequestMetadata {
+		if k == "chat_template_kwargs" {
+			continue
+		}
+		metadata[k] = v
+	}
+	// Build the generic chat_template_kwargs blob (model config map + coerced
+	// metadata) for llama.cpp and write it LAST so a client cannot clobber it.
+	if blob := c.ResolveChatTemplateKwargs(metadata); len(blob) > 0 {
+		b, err := json.Marshal(blob)
+		if err != nil {
+			xlog.Warn("failed to marshal chat_template_kwargs", "error", err)
+		} else {
+			metadata["chat_template_kwargs"] = string(b)
+		}
+	}
 	pbOpts.Metadata = metadata

 	// Logprobs and TopLogprobs are set by the caller if provided
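
The ordering in this hunk is what makes the blob server-owned, so a standalone sketch of the three layers may help. The function buildMetadata and its parameters are hypothetical stand-ins (the real code works on config.ModelConfig and calls ResolveChatTemplateKwargs); only the ordering and the reserved-key skip are taken from the diff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildMetadata is a hypothetical, simplified stand-in for the metadata part
// of gRPCPredictOpts. Layering: server-derived levers, then client request
// metadata (minus the reserved key), then the JSON blob written last.
func buildMetadata(serverLevers, requestMetadata map[string]string, configKwargs map[string]any) map[string]string {
	const reserved = "chat_template_kwargs"

	metadata := map[string]string{}
	for k, v := range serverLevers {
		metadata[k] = v
	}
	for k, v := range requestMetadata {
		if k == reserved {
			continue // the blob key is server-owned
		}
		metadata[k] = v
	}

	// Model config map is the base; coerced metadata overrides it.
	blob := map[string]any{}
	for k, v := range configKwargs {
		blob[k] = v
	}
	for k, v := range metadata {
		switch v {
		case "true":
			blob[k] = true
		case "false":
			blob[k] = false
		default:
			blob[k] = v
		}
	}
	if len(blob) > 0 {
		if b, err := json.Marshal(blob); err == nil {
			metadata[reserved] = string(b)
		}
	}
	return metadata
}

func main() {
	md := buildMetadata(
		map[string]string{"enable_thinking": "false"}, // server default
		map[string]string{
			"enable_thinking":      "true",                        // client override
			"chat_template_kwargs": `{"preserve_thinking":false}`, // skipped
		},
		map[string]any{"preserve_thinking": true}, // model config
	)
	fmt.Println(md["enable_thinking"])
	fmt.Println(md["chat_template_kwargs"])
}
```

In this sketch the client's attempt to set chat_template_kwargs directly has no effect, while its enable_thinking override shows up both as the standalone string key and, coerced to a bool, inside the blob.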

core/backend/options_internal_test.go

Lines changed: 64 additions & 0 deletions
@@ -161,3 +161,67 @@ var _ = Describe("grpcModelOpts NBatch", func() {
 		Expect(opts.ContextSize).To(BeEquivalentTo(4096), "n_batch must match the effective n_ctx the backend receives")
 	})
 })
+
+// Guards the generic chat_template_kwargs forwarding: the model config map plus any
+// per-request metadata overrides are merged, coerced, and serialised into the
+// backend metadata blob that llama.cpp reads. Client metadata also overrides the
+// server-derived standalone enable_thinking key (cross-backend consistency).
+var _ = Describe("gRPCPredictOpts chat_template_kwargs metadata", func() {
+	baseCfg := func() config.ModelConfig {
+		cfg := config.ModelConfig{}
+		cfg.SetDefaults()
+		return cfg
+	}
+
+	It("serialises the config map into the chat_template_kwargs blob", func() {
+		cfg := baseCfg()
+		cfg.ChatTemplateKwargs = map[string]any{"preserve_thinking": true}
+		opts := gRPCPredictOpts(cfg, "/tmp/models")
+		Expect(opts.Metadata).To(HaveKey("chat_template_kwargs"))
+		var blob map[string]any
+		Expect(json.Unmarshal([]byte(opts.Metadata["chat_template_kwargs"]), &blob)).To(Succeed())
+		Expect(blob).To(HaveKeyWithValue("preserve_thinking", true))
+	})
+
+	It("serialises reasoning_effort into the blob as a JSON string", func() {
+		cfg := baseCfg()
+		cfg.ReasoningEffort = "high"
+		opts := gRPCPredictOpts(cfg, "/tmp/models")
+		Expect(opts.Metadata).To(HaveKey("chat_template_kwargs"))
+		var blob map[string]any
+		Expect(json.Unmarshal([]byte(opts.Metadata["chat_template_kwargs"]), &blob)).To(Succeed())
+		// reasoning_effort must remain a string in the blob (jinja templates that
+		// key on the level read a string), unlike enable_thinking which is a bool.
+		Expect(blob["reasoning_effort"]).To(BeAssignableToTypeOf(""))
+		Expect(blob).To(HaveKeyWithValue("reasoning_effort", "high"))
+	})
+
+	It("lets client request metadata override the server-derived enable_thinking key", func() {
+		cfg := baseCfg()
+		disable := true
+		cfg.ReasoningConfig = reasoning.Config{DisableReasoning: &disable} // server: enable_thinking=false
+		cfg.RequestMetadata = map[string]string{"enable_thinking": "true"} // client overrides
+		opts := gRPCPredictOpts(cfg, "/tmp/models")
+		// standalone key (Python backends) reflects the client override
+		Expect(opts.Metadata).To(HaveKeyWithValue("enable_thinking", "true"))
+		// blob (llama.cpp) reflects it too, as a real bool
+		var blob map[string]any
+		Expect(json.Unmarshal([]byte(opts.Metadata["chat_template_kwargs"]), &blob)).To(Succeed())
+		Expect(blob).To(HaveKeyWithValue("enable_thinking", true))
+	})
+
+	It("does not let a client clobber the blob via a chat_template_kwargs metadata key", func() {
+		cfg := baseCfg()
+		cfg.ChatTemplateKwargs = map[string]any{"preserve_thinking": true}
+		cfg.RequestMetadata = map[string]string{"chat_template_kwargs": "{\"preserve_thinking\": false}"}
+		opts := gRPCPredictOpts(cfg, "/tmp/models")
+		var blob map[string]any
+		Expect(json.Unmarshal([]byte(opts.Metadata["chat_template_kwargs"]), &blob)).To(Succeed())
+		Expect(blob).To(HaveKeyWithValue("preserve_thinking", true))
+	})
+
+	It("omits the blob when there is nothing to forward", func() {
+		opts := gRPCPredictOpts(baseCfg(), "/tmp/models")
+		Expect(opts.Metadata).ToNot(HaveKey("chat_template_kwargs"))
+	})
+})
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+package config_test
+
+import (
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+
+	"github.com/mudler/LocalAI/core/config"
+)
+
+// ResolveChatTemplateKwargs layers the model config map (base) under the coerced
+// backend metadata (server reasoning levers + client request overrides).
+var _ = Describe("ModelConfig.ResolveChatTemplateKwargs", func() {
+	It("returns nil when nothing is set", func() {
+		c := &config.ModelConfig{}
+		Expect(c.ResolveChatTemplateKwargs(nil)).To(BeNil())
+	})
+
+	It("returns the config map when no metadata is present", func() {
+		c := &config.ModelConfig{ChatTemplateKwargs: map[string]any{"preserve_thinking": true}}
+		Expect(c.ResolveChatTemplateKwargs(nil)).To(HaveKeyWithValue("preserve_thinking", true))
+	})
+
+	It("lets metadata override the config map", func() {
+		c := &config.ModelConfig{ChatTemplateKwargs: map[string]any{"enable_thinking": true}}
+		got := c.ResolveChatTemplateKwargs(map[string]string{"enable_thinking": "false"})
+		Expect(got).To(HaveKeyWithValue("enable_thinking", false))
+	})
+
+	It("coerces true/false to bool and leaves other strings as-is", func() {
+		c := &config.ModelConfig{}
+		got := c.ResolveChatTemplateKwargs(map[string]string{
+			"enable_thinking":  "true",
+			"reasoning_effort": "high",
+		})
+		Expect(got).To(HaveKeyWithValue("enable_thinking", true))
+		Expect(got).To(HaveKeyWithValue("reasoning_effort", "high"))
+	})
+
+	It("skips the reserved chat_template_kwargs metadata key but keeps siblings", func() {
+		c := &config.ModelConfig{}
+		got := c.ResolveChatTemplateKwargs(map[string]string{
+			"chat_template_kwargs": "{\"x\":1}",
+			"preserve_thinking":    "true",
+		})
+		Expect(got).ToNot(HaveKey("chat_template_kwargs"))
+		Expect(got).To(HaveKeyWithValue("preserve_thinking", true))
+	})
+})

core/config/meta/registry_coverage_test.go

Lines changed: 1 addition & 0 deletions
@@ -112,6 +112,7 @@ var grandfatheredUnregistered = []string{
 	"agent.max_attempts",
 	"agent.max_iterations",
 	"cfg_scale",
+	"chat_template_kwargs",
 	"concurrency_groups",
 	"cutstrings",
 	"debug",

core/config/model_config.go

Lines changed: 51 additions & 0 deletions
@@ -70,6 +70,19 @@ type ModelConfig struct {
 	// (Harmony) or LFM2.5 — honor it; "none" also toggles enable_thinking off.
 	ReasoningEffort string `yaml:"reasoning_effort,omitempty" json:"reasoning_effort,omitempty"`

+	// ChatTemplateKwargs are arbitrary key/values forwarded to the backend's jinja
+	// chat template via chat_template_kwargs (e.g. preserve_thinking: true). The
+	// server-derived reasoning levers (enable_thinking / reasoning_effort) and any
+	// per-request metadata overrides layer on top. See gRPCPredictOpts.
+	ChatTemplateKwargs map[string]any `yaml:"chat_template_kwargs,omitempty" json:"chat_template_kwargs,omitempty"`
+
+	// RequestMetadata holds the raw client request `metadata` map for the current
+	// request. The request middleware stamps it; gRPCPredictOpts merges it into the
+	// backend gRPC metadata (overriding the server-derived enable_thinking /
+	// reasoning_effort) and folds it, coerced, into the chat_template_kwargs blob.
+	// Never persisted to YAML.
+	RequestMetadata map[string]string `yaml:"-" json:"-"`
+
 	FeatureFlag FeatureFlag `yaml:"feature_flags,omitempty" json:"feature_flags,omitempty"` // Feature Flag registry. We move fast, and features may break on a per model/backend basis. Registry for (usually temporary) flags that indicate aborting something early.
 	// LLM configs (GPT4ALL, Llama.cpp, ...)
 	LLMConfig `yaml:",inline" json:",inline"`
@@ -551,6 +564,44 @@ func (c *ModelConfig) ApplyReasoningEffort(requestEffort string) {
 	}
 }

+// coerceChatTemplateKwarg coerces a request-metadata string value for use as a
+// jinja chat_template_kwarg. "true"/"false" become real booleans (so a jinja
+// `{% if preserve_thinking %}` reads false correctly, since any non-empty string
+// is truthy); everything else stays a string. Numeric/typed per-request values are
+// out of scope - set those in the model YAML chat_template_kwargs (YAML keeps the type).
+func coerceChatTemplateKwarg(v string) any {
+	switch v {
+	case "true":
+		return true
+	case "false":
+		return false
+	default:
+		return v
+	}
+}
+
+// ResolveChatTemplateKwargs builds the final chat_template_kwargs map forwarded to
+// the backend, layered: the model config map (base) < the coerced backend metadata
+// (server reasoning levers + client request overrides). `meta` is the already-merged
+// backend metadata string map. The reserved "chat_template_kwargs" key is skipped so
+// a client cannot smuggle a nested blob. Returns nil when there is nothing to forward.
+func (c *ModelConfig) ResolveChatTemplateKwargs(meta map[string]string) map[string]any {
+	out := map[string]any{}
+	for k, v := range c.ChatTemplateKwargs {
+		out[k] = v
+	}
+	for k, v := range meta {
+		if k == "chat_template_kwargs" {
+			continue
+		}
+		out[k] = coerceChatTemplateKwarg(v)
+	}
+	if len(out) == 0 {
+		return nil
+	}
+	return out
+}
+
 // @Description PipelineStreaming toggles incremental delivery per realtime stage.
 type PipelineStreaming struct {
 	LLM *bool `yaml:"llm,omitempty" json:"llm,omitempty"`

core/http/app_test.go

Lines changed: 65 additions & 0 deletions
@@ -735,6 +735,18 @@ parameters:
 `
 	Expect(os.WriteFile(filepath.Join(modelDir, "mock-model.yaml"), []byte(mockModelYAML), 0644)).To(Succeed())

+	// A second model carrying chat_template_kwargs so the REST->gRPC
+	// metadata-forwarding spec below can assert the model-YAML kwarg is
+	// merged with the per-request override.
+	mockCTKModelYAML := `name: mock-ctk-model
+backend: mock-backend
+parameters:
+  model: mock-model.bin
+chat_template_kwargs:
+  preserve_thinking: true
+`
+	Expect(os.WriteFile(filepath.Join(modelDir, "mock-ctk-model.yaml"), []byte(mockCTKModelYAML), 0644)).To(Succeed())
+
 	systemState, err := system.GetSystemState(
 		system.WithBackendPath(backendDir),
 		system.WithModelPath(modelDir),
@@ -809,6 +821,59 @@ parameters:
 		Expect(string(dat)).To(ContainSubstring("mock-backend"))
 	})

+	It("forwards chat_template_kwargs and reasoning levers to gRPC PredictOptions.Metadata", func() {
+		// True HTTP->gRPC contract guard: drive a real /v1/chat/completions
+		// request and assert the exact metadata the REST layer forwarded to
+		// the backend. The mock-backend echoes PredictOptions.Metadata as JSON
+		// when it sees the ECHO_PREDICT_METADATA marker in the prompt, so this
+		// pins the request->gRPC mapping (model-YAML chat_template_kwargs +
+		// per-request metadata override + type coercion + standalone keys)
+		// without adding a new RPC. The marker rides in the user content and
+		// must survive into the backend prompt; if a future default chat
+		// template drops raw user content, move the marker to /v1/completions.
+		reqBody := map[string]any{
+			"model": "mock-ctk-model",
+			"messages": []map[string]any{
+				{"role": "user", "content": "ECHO_PREDICT_METADATA"},
+			},
+			// per-request override: overrides the standalone enable_thinking key
+			// and exercises coercion ("false" -> bool, "low" -> string) in the blob
+			"metadata": map[string]string{
+				"enable_thinking":  "false",
+				"reasoning_effort": "low",
+			},
+		}
+
+		var chatResp struct {
+			Choices []struct {
+				Message struct {
+					Content string `json:"content"`
+				} `json:"message"`
+			} `json:"choices"`
+		}
+		err := postRequestResponseJSON("http://127.0.0.1:9090/v1/chat/completions", &reqBody, &chatResp)
+		Expect(err).ToNot(HaveOccurred())
+		Expect(chatResp.Choices).ToNot(BeEmpty())
+
+		// The assistant content is the JSON snapshot of PredictOptions.Metadata.
+		var meta map[string]string
+		Expect(json.Unmarshal([]byte(chatResp.Choices[0].Message.Content), &meta)).To(Succeed(), "echoed metadata: %s", chatResp.Choices[0].Message.Content)
+
+		// Standalone keys reflect the per-request override (consumed by Python
+		// backends; consistent across backends).
+		Expect(meta).To(HaveKeyWithValue("enable_thinking", "false"))
+		Expect(meta).To(HaveKeyWithValue("reasoning_effort", "low"))
+
+		// The chat_template_kwargs blob (consumed by llama.cpp) merges the
+		// model-YAML kwarg with the coerced request metadata override.
+		Expect(meta).To(HaveKey("chat_template_kwargs"))
+		var ctk map[string]any
+		Expect(json.Unmarshal([]byte(meta["chat_template_kwargs"]), &ctk)).To(Succeed(), "chat_template_kwargs blob: %s", meta["chat_template_kwargs"])
+		Expect(ctk).To(HaveKeyWithValue("preserve_thinking", true))  // bool from model YAML
+		Expect(ctk).To(HaveKeyWithValue("enable_thinking", false))   // coerced "false" -> bool
+		Expect(ctk).To(HaveKeyWithValue("reasoning_effort", "low")) // non-bool stays string
+	})
+
 	// Agent Jobs: HTTP API for task/job scheduling. The underlying AgentPool
 	// service is exercised in core/services/agentpool/agent_jobs_test.go;
 	// these specs cover the /api/agent/* HTTP plumbing on top.

core/http/middleware/request.go

Lines changed: 7 additions & 0 deletions
@@ -318,6 +318,13 @@ func mergeOpenAIRequestAndModelConfig(config *config.ModelConfig, input *schema.
 	// (an operator's explicit disable wins over a request asking to think).
 	config.ApplyReasoningEffort(input.ReasoningEffort)

+	// Forward the client's request metadata so chat-template kwargs set per-request
+	// (enable_thinking, reasoning_effort, preserve_thinking, ...) reach the backend
+	// and override the model's reasoning-config defaults. See gRPCPredictOpts.
+	if len(input.Metadata) > 0 {
+		config.RequestMetadata = input.Metadata
+	}
+
 	// Collapse the modern max_completion_tokens alias into the
 	// legacy Maxtokens field so downstream code reads exactly one.
 	// MaxCompletionTokens wins on conflict — it's the canonical
