This benchmark validates whether the app should move from regex-led filtering to a local-LLM-led filtering workflow.
Important boundary:
- The benchmark exists to choose a default model recommendation and a default deep-filter strategy.
- The product architecture should still expose provider and model selection.
- The benchmark runner currently uses
ollamaas its first implemented provider, but that does not mean the app is intended to stay single-provider forever. - To avoid mixed-localhost ambiguity, canonical comparison runs should be generated from one environment where the benchmark runner, Node runtime, and Ollama service share the same network context.
Compare these variants under the same runtime and scoring rules:
regexregex + qwen3.5:2bregex + qwen3.5:4bregex + gemma4:e4bpure + qwen3.5:4bpure + qwen3.5:4b + priorspure + qwen3.5:4b + priors-v2pure + qwen3.5:4b + priors-v3pure + qwen3.5:4b + priors-v4
The benchmark is designed to answer four questions:
- Does any LLM-backed workflow materially outperform the current regex baseline on contextual privacy?
- Does
qwen3.5:4bjustify its extra cost compared withqwen3.5:2b? - At the same class, should the default model be
qwen3.5:4borgemma4:e4b? - Should deep filtering ship as a broad
regex + LLMmerge, or as a pure-LLM extraction path with stronger prompt guardrails?
dataset/Benchmark corpus definition and labeled samples.schemas/JSON schema contracts for benchmark samples, model findings, and scored results.prompts/The fixed extraction prompts used by the LLM-backed variants.variants.jsonThe compared benchmark variants and fixed runtime settings.results/Generated benchmark reports and run artifacts.
- All LLM variants use Ollama.
- All LLM variants must use the same chunking rules and JSON output schema.
- Thinking-capable models must be benchmarked with
think: falseunless a variant explicitly overrides it, because the product workflow needs direct structured extraction instead of long reasoning traces. - The model must never rewrite the full input text.
- The benchmark scores structured findings plus locally executed replacement behavior.
- Final default-model and default-strategy choices follow the gates in PLAN-20260418-local-llm-filter-benchmark-spec.md.
Regex baseline only:
npm run bench:local-llm-filter -- --variant regexAll variants:
npm run bench:local-llm-filterSelected variants:
npm run bench:local-llm-filter -- --variant "pure+qwen3.5:4b" --variant "pure+qwen3.5:4b+priors-v2"V1 uses a provider abstraction with ollama as the first implementation.
The benchmark runner talks to the provider over HTTP, so the provider can run on Windows or WSL as long as the configured base_url is reachable from the benchmark process.
This is a benchmark-runner choice, not a product lock-in. The app-level local-LLM configuration is still designed around provider selection.
Expected default endpoint:
http://127.0.0.1:11434
Before running any LLM-backed variant, make sure:
- Ollama is installed.
- The Ollama service is running.
- The compared models are pulled.
Typical commands:
ollama pull qwen3.5:2b
ollama pull qwen3.5:4b
ollama pull gemma4:e4b
ollama serveIf the provider is unreachable, the runner fails fast with a provider healthcheck error instead of silently producing invalid benchmark data.
This directory contains the benchmark contract, labeled starter corpus, prompt variants, and generated reports for Sprint 1.
The harness reads dataset/samples.jsonl, executes each variant through a provider abstraction, and writes normalized reports into results/.
Current starter-set reference runs:
benchmarks/local-llm-filter/results/full-benchmark-20260419-think-false.summary.mdbenchmarks/local-llm-filter/results/full-benchmark-20260419-22samples.summary.mdbenchmarks/local-llm-filter/results/full-benchmark-20260419-33samples.summary.mdbenchmarks/local-llm-filter/results/full-benchmark-20260419-48samples.summary.mdbenchmarks/local-llm-filter/results/full-benchmark-20260419-60samples.summary.mdbenchmarks/local-llm-filter/results/pure-llm-priors-v2-20260419.summary.mdbenchmarks/local-llm-filter/results/pure-qwen-fallback-priors-v2-20260419-rerun.summary.mdbenchmarks/local-llm-filter/results/pure-qwen-priors-v2-v3-20260419.summary.mdbenchmarks/local-llm-filter/results/pure-qwen-priors-v3-v4-20260419.summary.mdbenchmarks/local-llm-filter/results/pure-qwen-priors-v2-v4-postprocess-20260419.summary.md
Current pause-point leader under the production-like Ollama request shape:
pure + qwen3.5:4b + priors-v4 + post-processing
The current repo starter set contains 60 samples. This is a reasonable pause point for model-selection work: the next step is no longer "blindly add more samples", but to implement the chosen deep-filter strategy and then resume expansion toward the 140-sample decision target if the product needs a stronger evidence bar before release.
Implementation note:
- The benchmark runner now resolves model findings from either
anchor_textortext, matching the shipped Rust resolver contract. This matters for smaller models such asqwen3.5:2b, which sometimes omitanchor_textbut still return usable exact spans intext. - The benchmark runner also applies a small deterministic post-processing layer before scoring: unsupported labels are dropped, and nested
IP_ADDRESS/PORT/SENSITIVE_VALUEspans are removed when aDATABASE_URLalready covers the same DSN. The shipped Rust backend now mirrors this behavior.