Commit c11577d

QVAC-17939: release(llamacpp-llm): v0.17.1 — back-port per-request grammar / json_schema (#1788)
* feat[api]: per-request GBNF grammar in llm-llamacpp generationParams (QVAC-17939)

Adds a new optional `grammar` field on the addon's per-request `generationParams`. When set, the sampler is re-initialized with the provided GBNF for the duration of that single `runJob` call and then restored to the prior (load-time) state, matching the existing save/restore pattern used for `temp` / `top_p` / `seed` / etc.

This is the per-request equivalent of the load-time `--grammar` config key. It is the addon-level building block needed to support per-request structured output in the SDK without requiring callers to reload the model just to switch (or disable) a grammar.

Changes:

- `GenerationParams` (C++): new `std::optional<std::string> grammar`, included in `hasOverrides()`.
- `AddonJs::runJob` `parseText`: read the optional `grammar` string from per-request `generationParams`.
- `TextLlmContext::applyGenerationParams`: copy the override into `params_.sampling.grammar` and re-init the sampler. The existing `common_params_sampling savedSampling` snapshot already covers the grammar field, so the restore lambda automatically reverts.
- `index.d.ts`: expose `grammar?: string` on `GenerationParams`.
- `test/integration/grammar.test.js`: covers (1) a request grammar constraining output, (2) a follow-up request without `grammar` reverting to unconstrained generation, and (3) a per-request grammar overriding a load-time `grammar`. Wired into both ios `lightB` and android `groupA` mobile groups.

`MtmdLlmContext` keeps the no-op default `applyGenerationParams` — multimodal + grammar is out of scope for this change.

Tested locally: `bare-make build && bare-make install` succeeds, `tsc -p tsconfig.dts.json` and `standard` lint are clean.

* feat[api]: per-request json_schema in llm-llamacpp generationParams (QVAC-17939)

Adds an ergonomic `json_schema` field on per-request `generationParams` alongside the GBNF `grammar` field introduced in the previous commit.
Mirrors what `--json-schema` does at load time: the addon parses the JSON Schema and converts it to GBNF natively via llama.cpp's `json_schema_to_grammar()` (already linked through `llama::common`), so callers never have to ship a JSON-Schema-to-GBNF converter on the JS side.

```ts
await model.run(prompt, {
  generationParams: {
    json_schema: {
      type: 'object',
      properties: { name: { type: 'string' }, age: { type: 'integer' } },
      required: ['name', 'age']
    }
  }
})
```

Changes:

- `vcpkg.json` / `vcpkg-configuration.json`: pull `nlohmann-json` from the upstream microsoft/vcpkg registry (header-only). Required to call `json_schema_to_grammar(nlohmann::ordered_json, ...)`, whose signature lives in `llama/common/json-schema-to-grammar.h` but only ships a forward decl by default.
- `CMakeLists.txt`: `find_package(nlohmann_json CONFIG REQUIRED)` and link `nlohmann_json::nlohmann_json` into the addon.
- `GenerationParams` (C++): new `std::optional<std::string> json_schema` alongside `grammar`, included in `hasOverrides()`.
- `AddonJs::runJob` `parseText`: read the optional `json_schema` string and reject requests that set both `grammar` and `json_schema`.
- `TextLlmContext::applyGenerationParams`: when `json_schema` is set, parse with `nlohmann::ordered_json::parse` and convert via `json_schema_to_grammar()`, then assign to `params_.sampling.grammar` and re-init the sampler. Parse / conversion errors surface as `InvalidArgument` StatusError. The existing save/restore lambda already covers `params_.sampling.grammar`, so the prior grammar is reverted automatically after the request.
- `index.d.ts`: expose `json_schema?: string | Record<string, unknown>` with mutual-exclusion docs against `grammar`.
- `index.js`: `normalizeGenerationParams()` accepts a plain object (the common ergonomic shape) and JSON-stringifies it before handing to the binding; also enforces the grammar/json_schema mutual exclusion at the JS boundary so callers get a clearer TypeError.
- `test/integration/grammar.test.js`: three new tests covering (1) `json_schema` (object form) constraining output to schema-valid JSON, (2) `json_schema` (string form) accepted equivalently, and (3) passing both `grammar` and `json_schema` in one request throws.

Tested locally: `bare-make generate && bare-make build && bare-make install` succeeds (vcpkg fetched `nlohmann-json` 3.12.0), `tsc -p tsconfig.dts.json` and `standard` lint clean, mobile integration test config still validates.

* fix[api]: address review feedback on per-request grammar/json_schema

Addresses review comments on PR #1787:

- **Gianni**: wire grammar/json_schema through to `MtmdLlmContext` so multimodal models get the same per-request hook as text-only ones. `MtmdLlmContext::applyGenerationParams` had near-identical body to `TextLlmContext::applyGenerationParams`, so factor the grammar/json_schema/sampling-overrides logic into a small free function in the new `GenerationParamsApply.{hpp,cpp}` and call it from both.
- **Jesús (doc)**: fix the misleading "Empty string disables" comment on `GenerationParams.grammar` in `index.d.ts`. Empty string and `undefined` both fall through to the load-time grammar — clarify that explicitly.
- **Jesús (safety)**: handle `common_sampler_init` returning nullptr. This happens on invalid GBNF (and therefore also on a `json_schema` whose conversion produces a grammar the sampler rejects). Both contexts now check the result, restore the saved sampling block, re-init with the known-good params, and throw `InvalidArgument`. Without this guard the addon would carry a null `smpl_` into the next sample call and crash.

The `cli_tool` target picks up the new `GenerationParamsApply.cpp` source and `nlohmann_json::nlohmann_json` link dependency so it stays buildable.

Tested locally: `bare-make build && bare-make install` clean, `tsc -p tsconfig.dts.json` and `standard` lint clean, mobile integration test config still valid.
* release(llamacpp-llm): v0.17.1 — back-port per-request grammar / json_schema (QVAC-17939)

Cherry-picks the per-request `grammar` / `json_schema` change from PR #1787 onto the 0.17.0 release line so SDK consumers still pinned to `@qvac/llm-llamacpp@0.17.x` can pick up structured-output support without having to migrate to 0.18.x first.

Bumps `package.json` to `0.17.1`, adds the matching changelog block, and refreshes the mobile integration auto-runner (the new `grammar.test.js` is now wired in for both ios `lightB` and android `groupA` mobile groups). Same source changes as #1787 plus its review fixes, no new code.

Targets the `release-llamacpp-llm-0.17.1` branch on tetherto/qvac (branched from the `llamacpp-llm-v0.17.0` tag); merging triggers the GPR publish for `@tetherto/llm-llamacpp@0.17.1`.

* fix[api]: apply generationParams overrides atomically (QVAC-17939)

Build the new sampling block + sampler against local copies and only commit them onto the live `params_` / `smpl_` once the json_schema parse/convert and `common_sampler_init()` have both succeeded. Without this, an invalid `json_schema` paired with another override (`temp`, `seed`, …) would write the numeric overrides into `params_` and then throw before `applyGenerationParams()` could return its restore lambda, leaving those mutations to leak into subsequent requests.

The helper now takes the two fields it actually mutates by reference (`common_params_sampling&`, `int& nPredict`) so callers can pass copies trivially. Behaviour for happy-path requests is unchanged.

Per gianni-cor review on #1787.

* fix[notask]: address CI failures on per-request grammar PR (QVAC-17939)

- test/unit/CMakeLists.txt: include `GenerationParamsApply.cpp` in the `addon-test` source list and link `nlohmann_json::nlohmann_json` so the new helper resolves at link time.
Without these the cpp-tests target failed with `undefined symbol: applyGenerationOverridesToSampling(...)` after the helper was extracted from the inline `Text/Mtmd LlmContext` paths.
- LlmContext.hpp: collapse the trailing `grammar || json_schema;` onto one line in `GenerationParams::hasOverrides()` to satisfy clang-format-19's wrap rules (cpp-lint).
- test/integration/grammar.test.js: pass the `model.run()` promise directly to `t.exception(...)` instead of wrapping it in an inner `async () => { await ... }`. Bare's runtime aborts on unhandled rejections; the IIFE form created a small window where `model.run()`'s rejection landed before brittle's catch handler attached, producing `Uncaught (in promise)` and exit code 134. Direct-promise form matches the existing pattern in `finetuning.test.js`.

* fix[notask]: use t.exception.all for native-error rejection (QVAC-17939)

`normalizeGenerationParams` throws a `TypeError` when both `grammar` and `json_schema` are set, and brittle's plain `t.exception` deliberately re-raises native error subclasses (TypeError, ReferenceError, RangeError, etc.) on the basis that those "tend to be unintentional". The result is the rejection escapes brittle's catch, trips Bare's unhandled-rejection guard, and the test runner aborts with exit 134 — across every integration platform plus the on-device mobile e2e (where the WDIO crash monitor reports it as a background crash).

The earlier IIFE-→-direct-promise change didn't help because this isn't a microtask-timing race; it's intentional brittle policy. `t.exception.all` is the documented escape hatch (per brittle's README) for asserting on a native-error rejection.

* fix[notask]: tolerate Vulkan teardown SIGSEGV on ai-run-linux-gpu

Mirrors commit dbad904 in integration-test-qvac-lib-infer-vla.yml. The linux-x64 integration matrix runs on the self-hosted ai-run-linux-gpu (Tesla T4 + Vulkan) runner.
After every test in the suite passes, the bare process crashes with SIGSEGV (exit 139) ~1s into static-destructor teardown — inside ggml-vulkan's destructor chain interacting with the NVIDIA Vulkan ICD. Same upstream issue already worked around for the VLA addon.

Wrap the integration test invocation so exit 139 is tolerated IFF the captured TAP output shows the run completed cleanly (the '# ok' end marker AND a '# tests = N/N pass' summary). Any other non-zero exit, or a missing TAP pass marker, still fails the job.

This is purely a CI workaround; no addon code changes.

* fix[notask]: extract per-request override helper + warn on grammar/json_schema clash (QVAC-17939)

Per jesusmb1995 review on #1788:

1. The full applyGenerationParams body in TextLlmContext.cpp was a verbatim copy of the body in MtmdLlmContext.cpp (~50 lines each: local-copy of sampling/n_predict, helper call, sampler init + null-check, snapshot, commit, restore lambda). Hoist into a free function `applyGenerationParamsToContext(common_params&, CommonSamplerPtr&, llama_model*, const GenerationParams&)` in GenerationParamsApply.cpp that returns the restore lambda. Both contexts collapse to a single forwarding line.
2. The clang/JSON helper already had a "schema wins" precedence branch for the (theoretically unreachable) case where both `grammar` and `json_schema` are set, but no log. The JS and AddonJs paths both reject that combination, so reaching it means a direct C++ caller (unit tests or `cli_tool`) bypassed the boundary. Add a `LOG_WRN` stating which field is being applied so the issue is visible when debugging from C++.

No behaviour change for normal callers; the lambda's capture mode switches from `[this, ...]` (method context) to `[&params, &smpl, model, ...]` (free function), with identical lifetime guarantees — the owning context outlives any single request.

Locally clang-format-19 clean, tsc + standard clean, all addon TUs compile.
* Revert "fix[notask]: tolerate Vulkan teardown SIGSEGV on ai-run-linux-gpu"

This reverts commit eeff742.
1 parent 60f70ef commit c11577d

17 files changed

Lines changed: 542 additions & 67 deletions

packages/qvac-lib-infer-llamacpp-llm/CHANGELOG.md

Lines changed: 41 additions & 0 deletions
@@ -1,5 +1,46 @@
 # Changelog
 
+## [0.17.1] - 2026-04-28
+
+This patch release adds per-request structured-output support to the LLM addon: callers can now constrain a single completion to either a JSON Schema or a raw GBNF grammar without reloading the model. Back-port of the same change being prepared for `main` in [#1787](https://github.com/tetherto/qvac/pull/1787); shipped here on top of `0.17.0` so it can be consumed by SDK lines that have not yet migrated to `0.18.x`.
+
+### Added
+
+#### Per-request `json_schema` and `grammar` in `generationParams`
+
+`RunOptions.generationParams` accepts two new optional fields:
+
+- **`json_schema`** — JSON Schema applied to a single `run()` call. Accepts either a JSON Schema object literal or a pre-stringified JSON Schema. Internally converted to GBNF via llama.cpp's `json_schema_to_grammar()`, the same converter used by the load-time `--json-schema` config key.
+- **`grammar`** — raw GBNF string applied to a single `run()` call. Useful for non-JSON outputs (regex-like DSLs, CSV, custom syntaxes). Mirrors the load-time `--grammar` config key.
+
+The two are mutually exclusive — passing both throws a `TypeError` at the JS boundary. When either is set, the sampler is re-initialized for that request and the prior (typically load-time) grammar is restored automatically afterwards. Both `TextLlmContext` and `MtmdLlmContext` are wired through, so multimodal models get the same per-request hook as text-only ones.
+
+```js
+// JSON Schema (recommended for structured output)
+await model.run(prompt, {
+  generationParams: {
+    json_schema: {
+      type: 'object',
+      properties: { name: { type: 'string' }, age: { type: 'integer' } },
+      required: ['name', 'age']
+    }
+  }
+})
+
+// GBNF (non-JSON outputs)
+await model.run(prompt, {
+  generationParams: {
+    grammar: 'root ::= ("yes" | "no")'
+  }
+})
+```
+
+A new `nlohmann-json` vcpkg dependency is pulled in (header-only) so the addon can call `json_schema_to_grammar()` directly without shipping a JSON-Schema-to-GBNF converter on the JS side. Bad GBNF / unparseable JSON Schema surfaces as `InvalidArgument` with the saved sampler restored, so a malformed per-request schema cannot leave the model in a broken state.
+
+## Pull Requests
+
+- [#1787](https://github.com/tetherto/qvac/pull/1787) - feat[api]: per-request grammar / json_schema in llm-llamacpp generationParams (forward-port to `main`)
+
 ## [0.17.0] - 2026-04-21
 
 ### Changed
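
The JS-boundary behaviour described in the changelog entry (object or string `json_schema`, mutual exclusion with `grammar`) can be sketched roughly as follows. This is a minimal illustration, not the addon's actual `index.js`: the function name `normalizeGenerationParamsSketch` and the error message are assumptions, and treating an empty string as "not set" is borrowed from the native-side check rather than confirmed for the JS wrapper.

```javascript
// Hypothetical sketch of JS-boundary normalization; all names are illustrative.
function normalizeGenerationParamsSketch (generationParams = {}) {
  const out = { ...generationParams }

  // Assumption: an empty string counts as "not set".
  const hasGrammar = typeof out.grammar === 'string' && out.grammar.length > 0
  const hasSchema =
    out.json_schema !== undefined &&
    out.json_schema !== null &&
    out.json_schema !== ''

  if (hasGrammar && hasSchema) {
    throw new TypeError(
      'generationParams.grammar and generationParams.json_schema are mutually exclusive'
    )
  }

  // The native binding reads a string, so a plain object is stringified here.
  if (hasSchema && typeof out.json_schema === 'object') {
    out.json_schema = JSON.stringify(out.json_schema)
  }
  return out
}

const normalized = normalizeGenerationParamsSketch({
  temp: 0.2,
  json_schema: { type: 'object', required: ['name'] }
})
console.log(typeof normalized.json_schema) // string
```

Rejecting the combination before the binding is called means the caller sees a plain `TypeError` rather than a native status error.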

packages/qvac-lib-infer-llamacpp-llm/CMakeLists.txt

Lines changed: 10 additions & 0 deletions
@@ -33,6 +33,12 @@ configure_file(${VCPKG_INSTALLED_PATH}/share/qvac-lint-cpp/.clang-tidy
 find_path(PICOJSON_INCLUDE_DIRS "picojson/picojson.h")
 find_path(QVAC_LIB_INFERENCE_ADDON_CPP_INCLUDE_DIRS "qvac-lib-inference-addon-cpp/JsInterface.hpp")
 find_package(llama CONFIG REQUIRED)
+# Required to call llama.cpp's `json_schema_to_grammar()` for per-request
+# JSON-Schema → GBNF conversion. The function signature lives in libcommon
+# (linked via `llama::common`) but takes a `nlohmann::ordered_json`, so we
+# need the full nlohmann headers, not just the forward decl shipped with
+# `llama/common/json-schema-to-grammar.h`.
+find_package(nlohmann_json CONFIG REQUIRED)
 
 if(WIN32)
   add_definitions( -DNOMINMAX -DWIN32_MEAN_AND_LEAN -DNOGDI )
@@ -65,6 +71,7 @@ endif()
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/AsyncWeightsLoader.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/CacheManager.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/ContextSlider.cpp
+  ${PROJECT_SOURCE_DIR}/addon/src/model-interface/GenerationParamsApply.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/LlamaLazyInitializeBackend.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/LlamaModel.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/LlamaFinetuningHelpers.cpp
@@ -98,6 +105,7 @@ endif()
   llama::llama
   llama::common
   llama::mtmd
+  nlohmann_json::nlohmann_json
 )
 
 
@@ -113,6 +121,7 @@ if(BUILD_CLI)
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/AsyncWeightsLoader.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/CacheManager.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/ContextSlider.cpp
+  ${PROJECT_SOURCE_DIR}/addon/src/model-interface/GenerationParamsApply.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/LlamaLazyInitializeBackend.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/LlamaModel.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/MtmdLlmContext.cpp
@@ -142,6 +151,7 @@ if(BUILD_CLI)
   llama::llama
   llama::common
   llama::mtmd
+  nlohmann_json::nlohmann_json
 )
 find_path(QVAC_LIB_INFERENCE_ADDON_CPP_INCLUDE_DIRS "qvac-lib-inference-addon-cpp/JsInterface.hpp")
 target_include_directories(cli_tool PRIVATE ${QVAC_LIB_INFERENCE_ADDON_CPP_INCLUDE_DIRS})

packages/qvac-lib-infer-llamacpp-llm/addon/src/addon/AddonJs.hpp

Lines changed: 21 additions & 0 deletions
@@ -339,6 +339,27 @@ inline js_value_t* runJob(js_env_t* env, js_callback_info_t* info) try {
     readNum("frequency_penalty", ov.frequency_penalty);
     readNum("presence_penalty", ov.presence_penalty);
     readNum("repeat_penalty", ov.repeat_penalty);
+
+    auto grammarStr =
+        configObj->getOptionalPropertyAs<js::String, std::string>(
+            env, "grammar");
+    if (grammarStr.has_value() && !grammarStr->empty()) {
+      ov.grammar = std::move(*grammarStr);
+    }
+
+    auto jsonSchemaStr =
+        configObj->getOptionalPropertyAs<js::String, std::string>(
+            env, "json_schema");
+    if (jsonSchemaStr.has_value() && !jsonSchemaStr->empty()) {
+      ov.json_schema = std::move(*jsonSchemaStr);
+    }
+
+    if (ov.grammar && ov.json_schema) {
+      throw StatusError(
+          general_error::InvalidArgument,
+          "generationParams.grammar and generationParams.json_schema are "
+          "mutually exclusive");
+    }
   }
 
   prompt.cacheKey =
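
The hunk above drops empty strings before the mutual-exclusion check, which is why an empty `grammar` falls through to the load-time grammar instead of disabling it. A small JavaScript model of that decision, written only to make the three outcomes explicit (the helper name `resolveConstraintSource` and its return labels are hypothetical and do not exist in the addon):

```javascript
// Hypothetical helper: reports which constraint source a request would use.
function resolveConstraintSource (generationParams = {}) {
  // Empty strings are discarded, mirroring the `!empty()` guards in the hunk.
  const grammar = generationParams.grammar ? generationParams.grammar : undefined
  const jsonSchema = generationParams.json_schema ? generationParams.json_schema : undefined

  if (grammar !== undefined && jsonSchema !== undefined) {
    throw new Error(
      'generationParams.grammar and generationParams.json_schema are mutually exclusive'
    )
  }
  if (jsonSchema !== undefined) return 'json_schema'
  if (grammar !== undefined) return 'grammar'
  return 'load-time' // nothing set for this request: keep the load-time grammar
}

console.log(resolveConstraintSource({ grammar: '' })) // load-time
console.log(resolveConstraintSource({ grammar: 'root ::= "ok"' })) // grammar
console.log(resolveConstraintSource({ json_schema: '{"type":"object"}' })) // json_schema
```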

packages/qvac-lib-infer-llamacpp-llm/addon/src/model-interface/GenerationParamsApply.cpp

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
+#include "GenerationParamsApply.hpp"
+
+#include <exception>
+#include <string>
+#include <utility>
+
+#include <nlohmann/json.hpp>
+#include <qvac-lib-inference-addon-cpp/Errors.hpp>
+
+#include "addon/LlmErrors.hpp"
+#include "common/json-schema-to-grammar.h"
+#include "common/log.h"
+
+void applyGenerationOverridesToSampling(
+    common_params_sampling& sampling, int& nPredict,
+    const GenerationParams& overrides) {
+  auto setIf = [](const auto& src, auto& dst) {
+    if (src) {
+      dst = *src;
+    }
+  };
+
+  setIf(overrides.temp, sampling.temp);
+  setIf(overrides.top_p, sampling.top_p);
+  setIf(overrides.top_k, sampling.top_k);
+  setIf(overrides.n_predict, nPredict);
+  setIf(overrides.seed, sampling.seed);
+  setIf(overrides.frequency_penalty, sampling.penalty_freq);
+  setIf(overrides.presence_penalty, sampling.penalty_present);
+  setIf(overrides.repeat_penalty, sampling.penalty_repeat);
+
+  // `json_schema` and `grammar` are mutually exclusive at the JS boundary
+  // and in `AddonJs::runJob::parseText`, so reaching this branch with both
+  // set means a caller bypassed those checks (most likely the C++ unit
+  // tests or `cli_tool` driving the helper directly). Log a warning so
+  // the issue surfaces in stderr/log output and pick `json_schema`, which
+  // is the higher-level surface.
+  if (overrides.json_schema && overrides.grammar) {
+    LOG_WRN(
+        "%s: both generationParams.grammar and generationParams.json_schema "
+        "were provided; ignoring `grammar` and applying `json_schema` "
+        "(the JS and AddonJs paths reject this combination — this branch "
+        "exists only for direct C++ callers).\n",
+        __func__);
+  }
+
+  if (overrides.json_schema) {
+    try {
+      auto parsed = nlohmann::ordered_json::parse(*overrides.json_schema);
+      sampling.grammar = json_schema_to_grammar(parsed);
+    } catch (const std::exception& ex) {
+      throw qvac_errors::StatusError(
+          ADDON_ID,
+          qvac_errors::general_error::toString(
+              qvac_errors::general_error::InvalidArgument),
+          std::string("invalid generationParams.json_schema: ") + ex.what());
+    }
+  } else if (overrides.grammar) {
+    sampling.grammar = *overrides.grammar;
+  }
+}
+
+std::function<void()> applyGenerationParamsToContext(
+    common_params& params, CommonSamplerPtr& smpl, llama_model* model,
+    const GenerationParams& overrides) {
+  if (!overrides.hasOverrides()) {
+    return []() {};
+  }
+
+  // Apply overrides to *local copies* first. Only commit them onto the
+  // live `params` and `smpl` after both the json_schema parse/convert and
+  // `common_sampler_init` have succeeded — otherwise a partially applied
+  // override (e.g. temp/seed already written, then json_schema throws)
+  // would leak into subsequent requests because no restore lambda gets
+  // returned to the caller's `ScopeGuard`.
+  common_params_sampling nextSampling = params.sampling;
+  int nextPredict = params.n_predict;
+
+  // May throw `InvalidArgument` for malformed `json_schema`. `params`
+  // and `smpl` remain untouched in that case.
+  applyGenerationOverridesToSampling(nextSampling, nextPredict, overrides);
+
+  // `common_sampler_init` returns nullptr on bad inputs (most commonly an
+  // invalid GBNF grammar — `json_schema` is converted to GBNF above and
+  // can in principle produce a grammar that the sampler rejects). Build
+  // the new sampler before touching live state so a failure here also
+  // leaves `params` / `smpl` intact.
+  CommonSamplerPtr nextSmpl(common_sampler_init(model, nextSampling));
+  if (!nextSmpl) {
+    throw qvac_errors::StatusError(
+        ADDON_ID,
+        qvac_errors::general_error::toString(
+            qvac_errors::general_error::InvalidArgument),
+        "failed to initialise sampler with per-request generationParams "
+        "(invalid grammar or json_schema?)");
+  }
+
+  // Snapshot the live values before committing so the restore lambda can
+  // roll the request's mutations back at the end of the call.
+  common_params_sampling savedSampling = params.sampling;
+  int savedPredict = params.n_predict;
+
+  params.sampling = std::move(nextSampling);
+  params.n_predict = nextPredict;
+  smpl = std::move(nextSmpl);
+
+  bool restored = false;
+  return [&params,
+          &smpl,
+          model,
+          savedSampling = std::move(savedSampling),
+          savedPredict,
+          restored]() mutable {
+    if (restored)
+      return;
+    restored = true;
+    params.sampling = savedSampling;
+    params.n_predict = savedPredict;
+    smpl.reset(common_sampler_init(model, params.sampling));
+  };
+}
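
The commit-then-restore flow in `applyGenerationParamsToContext` can be modelled in a few lines of JavaScript. This is a sketch of the pattern only, under loud assumptions: `buildSampler` is a stand-in for `common_sampler_init`, the `'INVALID'` marker is an invented way to trigger a null sampler, and `n_predict` is left out for brevity.

```javascript
// Stand-in for `common_sampler_init`: returns null for a grammar it rejects.
// The 'INVALID' marker is purely illustrative.
function buildSampler (sampling) {
  if (sampling.grammar === 'INVALID') return null
  return { grammar: sampling.grammar, temp: sampling.temp }
}

function applyOverrides (ctx, overrides) {
  if (Object.values(overrides).every((v) => v === undefined)) return () => {}

  // 1. Work on a copy so a failure cannot touch live state.
  const nextSampling = { ...ctx.sampling }
  for (const [key, value] of Object.entries(overrides)) {
    if (value !== undefined) nextSampling[key] = value
  }

  // 2. Build the new sampler before committing anything.
  const nextSampler = buildSampler(nextSampling)
  if (nextSampler === null) {
    throw new Error('InvalidArgument: failed to initialise sampler')
  }

  // 3. Snapshot, commit, and hand back an idempotent restore function.
  const savedSampling = ctx.sampling
  ctx.sampling = nextSampling
  ctx.sampler = nextSampler
  let restored = false
  return () => {
    if (restored) return
    restored = true
    ctx.sampling = savedSampling
    ctx.sampler = buildSampler(savedSampling)
  }
}

const ctx = { sampling: { temp: 0.8, grammar: '' } }
ctx.sampler = buildSampler(ctx.sampling)

try {
  applyOverrides(ctx, { temp: 0.1, grammar: 'INVALID' })
} catch (err) {
  console.log(ctx.sampling.temp) // 0.8, the failed request left live state alone
}

const restore = applyOverrides(ctx, { grammar: 'root ::= "yes"' })
console.log(ctx.sampling.grammar) // root ::= "yes"
restore()
console.log(ctx.sampling.grammar === '') // true
```

Because nothing is written to `ctx` until step 3, the failing call needs no cleanup, which is the property the commit message calls applying overrides atomically.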

packages/qvac-lib-infer-llamacpp-llm/addon/src/model-interface/GenerationParamsApply.hpp

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+#pragma once
+
+#include <functional>
+
+#include "LlmContext.hpp"
+#include "common/common.h"
+
+// Apply per-request `generationParams` overrides onto a sampling block
+// + `n_predict` value in place. Operates on the two mutable fields the
+// helper actually needs so callers can pass *copies* and only commit
+// them to live state once the whole call (including json_schema parse
+// and `common_sampler_init`) has succeeded — avoiding partial mutation
+// of the live `common_params` if this throws.
+//
+// If `overrides.json_schema` is set, parses the JSON Schema and converts
+// it to GBNF via llama.cpp's `json_schema_to_grammar()`, mirroring what
+// the `--json-schema` load-time flag does. If `overrides.grammar` is set,
+// the GBNF is used verbatim. The two are mutually exclusive (validated at
+// the JS boundary and again in `AddonJs::runJob::parseText`); if both are
+// present here a `LOG_WRN` is emitted and `json_schema` wins — the JS and
+// AddonJs paths reject this combination, so reaching it means a direct
+// C++ caller (unit tests / `cli_tool`) bypassed those checks.
+//
+// Throws `qvac_errors::StatusError(InvalidArgument)` when `json_schema`
+// fails to parse or convert. Caller is responsible for re-initialising
+// the sampler after this call so the new sampling block takes effect.
+void applyGenerationOverridesToSampling(
+    common_params_sampling& sampling, int& nPredict,
+    const GenerationParams& overrides);
+
+// Apply per-request `generationParams` overrides onto a context's live
+// `params` + `smpl` and return a restore lambda the caller can install
+// into a `ScopeGuard` to roll the mutation back at end-of-request.
+//
+// Implements the atomic-commit pattern: overrides are applied to local
+// copies of `params.sampling` and `params.n_predict`, the new sampler is
+// built against those copies, and only after both the json_schema parse
+// and `common_sampler_init` have succeeded are the live `params` / `smpl`
+// updated. Any throw or null-sampler failure leaves live state untouched.
+//
+// Returns a no-op lambda when `overrides.hasOverrides()` is false. The
+// returned lambda re-initialises `smpl` from the saved sampling block, so
+// it MUST be invoked before the owning context is destroyed (i.e. via a
+// guard scoped to the request).
+//
+// Throws `qvac_errors::StatusError(InvalidArgument)` for malformed
+// `json_schema` or when the resulting GBNF is rejected by the sampler.
+//
+// `params`, `smpl`, and `model` are captured by reference inside the
+// returned lambda; callers must guarantee they outlive the lambda. Both
+// `TextLlmContext::applyGenerationParams` and
+// `MtmdLlmContext::applyGenerationParams` satisfy this — the context
+// owns the fields and outlives any single request.
+std::function<void()> applyGenerationParamsToContext(
+    common_params& params, CommonSamplerPtr& smpl, llama_model* model,
+    const GenerationParams& overrides);
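
The header asks callers to run the returned restore function through a request-scoped guard. In JavaScript terms the closest equivalent is `try` / `finally`; the sketch below shows why that guarantees rollback even when generation fails. The `runWithOverrides` wrapper and the `state` object are hypothetical, and the sketch is synchronous for brevity.

```javascript
// Hypothetical request wrapper; `apply` returns a restore function.
function runWithOverrides (apply, generate) {
  const restore = apply() // may throw; nothing has been committed in that case
  try {
    return generate()
  } finally {
    restore() // runs on success and on failure, like a scope guard
  }
}

const state = { grammar: 'load-time' }
const applyYesNo = () => {
  const saved = state.grammar
  state.grammar = 'root ::= ("yes" | "no")'
  return () => { state.grammar = saved }
}

try {
  runWithOverrides(applyYesNo, () => {
    throw new Error('generation failed')
  })
} catch (err) {
  console.log(state.grammar) // load-time
}
```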

packages/qvac-lib-infer-llamacpp-llm/addon/src/model-interface/LlmContext.hpp

Lines changed: 12 additions & 1 deletion
@@ -3,6 +3,7 @@
 #include <algorithm>
 #include <functional>
 #include <optional>
+#include <string>
 
 #include "addon/LlmErrors.hpp"
 #include "common/chat.h"
@@ -20,10 +21,20 @@ struct GenerationParams {
   std::optional<float> presence_penalty;
   std::optional<float> repeat_penalty;
   std::optional<uint32_t> seed;
+  // GBNF grammar applied per request to constrain sampling. When set, the
+  // sampler is re-initialized with this grammar for the duration of the
+  // request and the prior grammar is restored afterwards. Mirrors the
+  // load-time `--grammar` flag but scoped to a single completion call.
+  std::optional<std::string> grammar;
+  // JSON-Schema applied per request. Converted to GBNF via llama.cpp's
+  // `json_schema_to_grammar()` and applied identically to `grammar`.
+  // Mutually exclusive with `grammar` — the JS wrapper rejects requests
+  // that set both. Mirrors the load-time `--json-schema` flag.
+  std::optional<std::string> json_schema;
 
   [[nodiscard]] bool hasOverrides() const {
     return n_predict || temp || top_p || top_k || frequency_penalty ||
-           presence_penalty || repeat_penalty || seed;
+           presence_penalty || repeat_penalty || seed || grammar || json_schema;
   }
 };
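
`hasOverrides()` tests whether each `std::optional` is engaged, not whether its value is truthy, so an explicit `temp` of `0` still counts as an override. A JavaScript rendering of the same check, offered only as an illustration (the `OVERRIDE_KEYS` list and this `hasOverrides` function are assumptions that mirror the struct fields shown in the hunk):

```javascript
// Field names taken from the GenerationParams struct in the hunk above.
const OVERRIDE_KEYS = [
  'n_predict', 'temp', 'top_p', 'top_k', 'frequency_penalty',
  'presence_penalty', 'repeat_penalty', 'seed', 'grammar', 'json_schema'
]

// Compare against undefined rather than relying on truthiness, so that
// 0 and other falsy values still register as explicit overrides.
function hasOverrides (params = {}) {
  return OVERRIDE_KEYS.some((key) => params[key] !== undefined)
}

console.log(hasOverrides({})) // false
console.log(hasOverrides({ temp: 0 })) // true
console.log(hasOverrides({ json_schema: '{}' })) // true
```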

packages/qvac-lib-infer-llamacpp-llm/addon/src/model-interface/MtmdLlmContext.cpp

Lines changed: 2 additions & 32 deletions
@@ -9,6 +9,7 @@
 #include <qvac-lib-inference-addon-cpp/Errors.hpp>
 
 #include "ContextSlider.hpp"
+#include "GenerationParamsApply.hpp"
 #include "addon/LlmErrors.hpp"
 #include "qvac-lib-inference-addon-cpp/Logger.hpp"
 #include "utils/ChatTemplateUtils.hpp"
@@ -480,38 +481,7 @@ bool MtmdLlmContext::generateResponse(
 
 std::function<void()>
 MtmdLlmContext::applyGenerationParams(const GenerationParams& overrides) {
-  if (!overrides.hasOverrides()) {
-    return []() {};
-  }
-
-  common_params_sampling savedSampling = params_.sampling;
-  int savedPredict = params_.n_predict;
-
-  auto setIf = [](const auto& src, auto& dst) {
-    if (src) {
-      dst = *src;
-    }
-  };
-  setIf(overrides.temp, params_.sampling.temp);
-  setIf(overrides.top_p, params_.sampling.top_p);
-  setIf(overrides.top_k, params_.sampling.top_k);
-  setIf(overrides.n_predict, params_.n_predict);
-  setIf(overrides.seed, params_.sampling.seed);
-  setIf(overrides.frequency_penalty, params_.sampling.penalty_freq);
-  setIf(overrides.presence_penalty, params_.sampling.penalty_present);
-  setIf(overrides.repeat_penalty, params_.sampling.penalty_repeat);
-
-  smpl_.reset(common_sampler_init(model_, params_.sampling));
-
-  bool restored = false;
-  return [this, savedSampling, savedPredict, restored]() mutable {
-    if (restored)
-      return;
-    restored = true;
-    params_.sampling = savedSampling;
-    params_.n_predict = savedPredict;
-    smpl_.reset(common_sampler_init(model_, params_.sampling));
-  };
+  return applyGenerationParamsToContext(params_, smpl_, model_, overrides);
 }
 
 void MtmdLlmContext::stop() { stopGeneration_.store(true); }
