
Commit 3717809

lalalune and claude committed
feat(local-inference): catalog opts-in for DFlash kernel + AWQ Q4 entry
Marks all three DFlash entries (qwen3.5-4b, qwen3.5-9b, qwen3.6-27b) with runtime.optimizations.requiresKernel: ["dflash"] so the dispatcher routes them to llama-server even when ELIZA_LOCAL_BACKEND=node-llama-cpp is set, because the in-process binding cannot satisfy the kernel requirement.

Adds one AWQ-derived GGUF entry: Qwen3 Coder 30B A3B (MoE, AWQ→Q4_K_M from straino/Qwen3-Coder-30B-A3B-Instruct-AWQ-4bit-Q4_K_M-GGUF, HEAD verified). The entry declares moeOffload: "cpu" so MoE expert tensors default to CPU memory and the active path stays on the GPU.

GPTQ-derived GGUF entries are deliberately omitted: the only repos that ship them today are low-confidence re-quants (RichardErkhov, namtran, casualjim). bartowski and TheBloke do not publish first-party GPTQ GGUFs. Operators can still install ad-hoc GGUFs through the HF search path; we will revisit when a first-party publisher ships GPTQ GGUFs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 42d645a commit 3717809
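
The routing rule described in the commit message can be sketched as follows. This is a minimal illustration, not the project's dispatcher: `resolveBackend`, `canSatisfy`, `isBackend`, `RuntimeHints`, and the `KERNELS_BY_BACKEND` table are hypothetical names, and the fallback order is an assumption. Only the field names (`preferredBackend`, `optimizations.requiresKernel`, `flashAttention`), the two backend ids, and the statement that the in-process binding cannot provide `dflash` come from the commit.

```typescript
// The two backend ids named in the commit message.
type Backend = "llama-server" | "node-llama-cpp";

// Hypothetical subset of a catalog entry's `runtime` block.
interface RuntimeHints {
  preferredBackend?: Backend;
  optimizations?: {
    requiresKernel?: string[];
    flashAttention?: boolean;
  };
}

// Assumption: which kernels each backend can provide. The commit only states
// that the in-process binding (node-llama-cpp) cannot satisfy "dflash".
const KERNELS_BY_BACKEND: Record<Backend, ReadonlySet<string>> = {
  "llama-server": new Set<string>(["dflash"]),
  "node-llama-cpp": new Set<string>(),
};

// Type guard so an arbitrary env string can be used as a Backend.
function isBackend(value: string | undefined): value is Backend {
  return value === "llama-server" || value === "node-llama-cpp";
}

// A backend qualifies only if it provides every required kernel.
function canSatisfy(backend: Backend, runtime: RuntimeHints): boolean {
  const required = runtime.optimizations?.requiresKernel ?? [];
  return required.every((kernel) => KERNELS_BY_BACKEND[backend].has(kernel));
}

// Try the operator override first, then the catalog preference, then defaults
// (assumed order). The first backend that satisfies the kernel requirement wins.
function resolveBackend(runtime: RuntimeHints, envOverride?: string): Backend {
  const candidates: Backend[] = [];
  if (isBackend(envOverride)) candidates.push(envOverride);
  if (runtime.preferredBackend) candidates.push(runtime.preferredBackend);
  candidates.push("node-llama-cpp", "llama-server");

  const chosen = candidates.find((backend) => canSatisfy(backend, runtime));
  if (chosen === undefined) {
    throw new Error("no backend satisfies the required kernels");
  }
  return chosen;
}
```

Under these assumptions, a DFlash entry resolves to `llama-server` even when the caller passes the value of `ELIZA_LOCAL_BACKEND` as `"node-llama-cpp"`, while an entry with no kernel requirement honors the override.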

1 file changed

Lines changed: 49 additions & 0 deletions

packages/app-core/src/services/local-inference/catalog.ts

@@ -79,6 +79,10 @@ export const MODEL_CATALOG: CatalogModel[] = [
     companionModelIds: ["qwen3.5-4b-dflash-drafter-q4"],
     runtime: {
       preferredBackend: "llama-server",
+      optimizations: {
+        requiresKernel: ["dflash"],
+        flashAttention: true,
+      },
       dflash: {
         drafterModelId: "qwen3.5-4b-dflash-drafter-q4",
         specType: "dflash",
@@ -139,6 +143,10 @@ export const MODEL_CATALOG: CatalogModel[] = [
     companionModelIds: ["qwen3.5-9b-dflash-drafter-q4"],
     runtime: {
       preferredBackend: "llama-server",
+      optimizations: {
+        requiresKernel: ["dflash"],
+        flashAttention: true,
+      },
       dflash: {
         drafterModelId: "qwen3.5-9b-dflash-drafter-q4",
         specType: "dflash",
@@ -233,6 +241,43 @@ export const MODEL_CATALOG: CatalogModel[] = [
   },

   // ─── large (8-20 GB) ────────────────────────────────────────────────
+  // ─── AWQ-derived GGUFs (mid) ────────────────────────────────────────
+  // AWQ-quantized GGUFs are GGUFs where AWQ scales were applied prior to
+  // K-quant conversion. They load via the standard llama.cpp/llama-server
+  // path (no special kernel) but tend to outperform pure K-quants on
+  // long-context recall and code reasoning at the same bit-width. We
+  // route them through the in-process binding by default and let the
+  // dispatcher promote them to llama-server when the operator opts into
+  // continuous batching or MoE expert offload.
+  //
+  // GPTQ-derived GGUFs exist on HF (e.g. RichardErkhov re-quants) but the
+  // quality of those repos is mixed and bartowski/TheBloke do not ship
+  // first-party GPTQ GGUFs. We deliberately skip GPTQ entries until a
+  // first-party publisher ships them or we add a per-quant verification
+  // step. Operators can still install ad-hoc GGUFs via the HF search.
+  {
+    id: "qwen3-coder-30b-awq-q4",
+    displayName: "Qwen3 Coder 30B Instruct (AWQ→Q4_K_M)",
+    hfRepo: "straino/Qwen3-Coder-30B-A3B-Instruct-AWQ-4bit-Q4_K_M-GGUF",
+    ggufFile: "qwen3-coder-30b-a3b-instruct-awq-4bit-q4_k_m.gguf",
+    params: "32B",
+    quant: "AWQ→Q4_K_M",
+    sizeGb: 18.5,
+    minRamGb: 36,
+    category: "code",
+    bucket: "large",
+    runtime: {
+      optimizations: {
+        // Qwen3 Coder is MoE (A3B = 3B active over 30B total). MoE expert
+        // offload to CPU keeps VRAM down on workstation GPUs while the
+        // active 3B path stays on the accelerator.
+        moeOffload: "cpu",
+        flashAttention: true,
+      },
+    },
+    blurb:
+      "AWQ scales applied before Q4_K_M conversion. Sharper code recall than the bartowski K-quants at the same bit-width; MoE expert offload defaults to CPU so 24GB VRAM workstations can run the active path comfortably.",
+  },
   {
     id: "deepseek-coder-v2-lite",
     displayName: "DeepSeek Coder V2 Lite 16B",
@@ -299,6 +344,10 @@ export const MODEL_CATALOG: CatalogModel[] = [
     companionModelIds: ["qwen3.6-27b-dflash-drafter-q8"],
     runtime: {
       preferredBackend: "llama-server",
+      optimizations: {
+        requiresKernel: ["dflash"],
+        flashAttention: true,
+      },
       dflash: {
         drafterModelId: "qwen3.6-27b-dflash-drafter-q8",
         specType: "dflash",
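
The diff does not show how `moeOffload: "cpu"` is consumed downstream. As a rough sketch of one possibility, the flag could be translated into llama-server's tensor override option, which pins tensors whose names match a pattern to a given buffer type. The `Optimizations` interface, the `buildServerArgs` helper, the model path, and the choice of pattern are all assumptions made for illustration; `flashAttention` is left out of the sketch on purpose.

```typescript
// Hypothetical mirror of the `optimizations` block added in this commit.
interface Optimizations {
  moeOffload?: "cpu"; // only the "cpu" value appears in the diff
  flashAttention?: boolean; // carried along but not translated in this sketch
}

// Assumed helper: build a llama-server argument list from catalog hints.
function buildServerArgs(modelPath: string, opts: Optimizations): string[] {
  // Ask for as many layers on the GPU as will fit.
  const args: string[] = ["--model", modelPath, "--n-gpu-layers", "999"];

  if (opts.moeOffload === "cpu") {
    // MoE expert tensors in GGUF files carry names such as
    // blk.N.ffn_up_exps.weight; this pattern keeps them in CPU memory while
    // attention and shared weights stay on the accelerator.
    args.push("--override-tensor", "\\.ffn_.*_exps\\.=CPU");
  }
  return args;
}

// Example: the new AWQ entry's hints applied to an illustrative local path.
const awqArgs = buildServerArgs(
  "/models/qwen3-coder-30b-a3b-instruct-awq-4bit-q4_k_m.gguf",
  { moeOffload: "cpu", flashAttention: true },
);
```

Keeping the translation in a single pure function makes it straightforward to unit-test the mapping from catalog hints to process arguments without launching a server.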
