feat: adding speculative decoding and draft model support by thewulf7 · Pull Request #8115 · janhq/jan

thewulf7 · 2026-05-04T11:07:06Z

Describe Your Changes

Adds full end-to-end support for llama.cpp speculative decoding, covering both draft-model-based and model-free (n-gram) modes.

How to test

Install two compatible llamacpp models (same architecture/tokenizer, e.g. Llama 3.2 3B as draft for Llama 3.1 8B)
Open model settings for the larger model → "Draft Model" dropdown should list the smaller model
Select the draft model → the server restarts with --model-draft and --draft-max 16
Alternatively, set "Speculative type" to ngram-mod (no draft model needed) — useful for reasoning models

- Introduced a new command `score_hub_model` to evaluate model performance. - Created a scoring module with detailed scoring logic based on model characteristics. - Implemented caching for scoring results to optimize performance. - Added permissions for allowing and denying access to the `score_hub_model` command. - Updated the schema to include new permissions and scoring request/response types. - Enhanced the web application with components to display model scores and breakdowns. - Integrated model scoring into the model detail view and hub content. - Added tests for the new scoring functionality and ensured proper error handling.

…parameters - Implemented `scoreHubModel` function in guest JS to invoke model scoring. - Extended `HubModelScoreRequest` and `HubModelScoreResult` types to include new parameters: `use_case`, `capabilities`, `release_date`, `tools`, `num_mmproj`, and `pinned`. - Updated permissions to include `allow-score-hub-model`. - Enhanced scoring logic in Rust to derive model specifications and analyze performance based on new parameters. - Added utility functions for parameter parsing and model capability normalization. - Updated frontend components to display new model attributes and handle loading states. - Added tests to validate new functionality and ensure correct behavior with extended model data.

…ails

…odel scores - Updated the resolution hashes for multiple extensions in yarn.lock to ensure consistency. - Removed unnecessary fields from HubModelScoreResult in types.ts. - Simplified the scoring logic in scoring.rs by eliminating hardware fingerprinting and cache key generation. - Enhanced the DefaultModelsService to read and write model scores to a persistent cache, improving performance and reliability. - Added error handling and logging for cache operations in default.ts.

- Added new fields to HubModelScoreRequest for runtime, quantization, and total size. - Updated ModelScoreSummary component to display fit levels and run modes with translations. - Enhanced tests for ModelScoreSummary to cover new features and translations. - Modified HubModelDetailContent to handle MLX models and display scores correctly. - Improved the DefaultModelsService to support MLX scoring and aggregate safetensors size. - Updated types for models to include new properties related to MLX and scoring. - Added localization for new score summary keys in hub.json.

…tates

…ce HubModelDetailContent with detailed score breakdown

…mapping

…component

… HubModelDetailContent with score summary display

- Removed redundant model score handling logic from HubModelDetailContent and HubContent components. - Replaced manual score fetching and caching with useModelScore hook for better state management. - Updated DefaultModelsService to utilize useModelScore for fetching cached and live model scores. - Cleaned up unused functions and types related to model scoring. - Improved test cases to align with the new model score handling approach.

…e retrieval

…scoring interfaces and implementations

…available responses

…omponents

…lay in HubContent

…nce score display

- Updated constants to use DEFAULT_MODEL_QUANTIZATIONS instead of PREFERRED_DOWNLOAD_QUANTIZATIONS across various components. - Simplified the providerModels structure by consolidating model capabilities into single arrays for better readability. - Enhanced the useModelScore hook to deduplicate concurrent score fetch requests for the same model, improving performance and reducing unnecessary API calls. - Added tests to ensure the deduplication logic works as expected. - Cleaned up the ModelInfoHoverCard and HubModelDetailContent components for better readability and maintainability.

…translations for fit score

…r draft model and model-free modes

…/thewulf7/jan into thewulf7/loading-multiple-models

…s and model settings

…ing tests

tokamak-pm · 2026-05-05T02:04:33Z

PR Review: feat: adding speculative decoding and draft model support

Overview

This PR adds end-to-end support for llama.cpp speculative decoding across 15 files (+1383/-7 lines). It covers:

Rust backend (args.rs): New LlamacppConfig fields and CLI argument generation for --model-draft, --spec-type, --draft-max, --draft-min
TypeScript guest-js (index.ts, types.ts): Config normalization and type definitions
Extension (llamacpp-extension/src/index.ts): Draft model path resolution from model YAML
UI (ModelSetting.tsx, useModelProvider.ts, predefined.ts): Settings controls, migration (v13->v14), and dynamic dropdown population
Extension settings (settings.json): spec_type dropdown
CLI (jan-cli.rs): Config struct initialization
Docs: Comprehensive documentation in two .mdx files
Tests: Rust unit tests, TS normalization tests, draft model resolution tests, UI component tests, hook tests

Correctness & Risks

1. draft_model_id value "none" is not cleared before reaching Rust config (Medium Risk)

When the user selects "None" in the Draft Model dropdown, the value is "none" (a string). In index.ts, the resolution logic is:

const draftModelId = cfg.draft_model_id ?? (overrideSettings as any)?.draft_model_id ?? undefined
if (draftModelId) { ... }

The string "none" is truthy, so the code will attempt to resolve a model named "none" via read_yaml, which will fail. The catch block logs a warning and continues, so this is not fatal, but it means every model load with the default "none" setting will produce a spurious warning log. Consider explicitly filtering out "none":

const draftModelId = cfg.draft_model_id ?? (overrideSettings as any)?.draft_model_id ?? undefined
if (draftModelId && draftModelId !== 'none') { ... }

2. spec_type setting defined in settings.json but migration does not add it (Low Risk)

The migration (version <= 13) adds draft_model_id, draft_max, and draft_min to existing models, but does not add spec_type. The spec_type is defined in settings.json (extension-level setting) rather than as a per-model setting via modelSettings, so it should flow through overrideSettings when the extension settings are applied. However, the asymmetry is worth noting -- spec_type is configured at the extension level while draft_model_id/draft_max/draft_min are per-model. If a user wants different speculative types for different models, this is currently not possible.

3. Type mismatch: i32 in Rust vs u32 for draft_max/draft_min

In args.rs, draft_max and draft_min are u32, while most other integer fields (e.g., keep, cache_reuse) are i32. The JS normalization uses asI32 which returns a signed integer. If asI32 returns a negative value (e.g., from invalid user input), the serde deserialization into u32 would fail. Consider using i32 for consistency or adding validation.

4. draftModelCandidates recomputed on every render (Low Risk, Performance)

In ModelSetting.tsx, draftModelCandidates is computed outside of useMemo, and getControllerProps depends on it via useCallback. Since draftModelCandidates is recreated on every render (it creates new array/objects), the useCallback dependency will always be stale, effectively negating the memoization. Wrap draftModelCandidates in useMemo:

const draftModelCandidates = useMemo(() => providers
  .filter(...)
  .flatMap(...)
  .filter(...)
  .map(...), [providers, model.id])

Missing Tests

No test validates the behavior when draft_model_id is "none" -- specifically that the model loads without trying to resolve a non-existent model. The draftModelResolution.test.ts tests call extension.loadModel(modelConfig), but loadModel is inherited from LocalOAIEngine and takes a Model object, not a raw config. The tests rely on catching thrown errors and only asserting side effects, which makes them fragile.
No Rust test verifies that draft_max / draft_min values passed as u32 survive serde round-trip from JavaScript's signed integers.
No integration-level test verifies the full flow from UI setting change through to server restart with correct CLI args.

Architecture Concerns

Mixed granularity: spec_type is an extension-level setting (from settings.json), while draft_model_id, draft_max, draft_min are per-model settings (from predefined.ts + migration). This inconsistency could confuse users who expect all speculative decoding settings to live together. If a user runs multiple models, they cannot have different spec_type per model.
(overrideSettings as any)?.draft_model_id: The as any cast in index.ts bypasses type safety. Consider extending the LlamacppConfig type to include draft_model_id as an optional field.

Style Issues

Minor: trailing whitespace on spec_type: string in types.ts (line with trailing space after string).
Blank line at line 592 in useModelProvider.ts (empty line before the migration block).
Double space in model.displayName ?? model.id in ModelSetting.tsx.

Positive Aspects

Excellent documentation additions in both .mdx files covering usage, requirements, and performance tips.
Good test coverage across Rust, TypeScript normalization, and React components.
Graceful error handling -- draft model resolution failure is non-fatal with a warning log.
The add_speculative_decoding_args Rust method is cleanly separated and thoroughly tested with 14 dedicated test cases.
Migration versioning is properly incremented (13 -> 14).

Recommendation: improve needed

…tion

…/thewulf7/jan into thewulf7/loading-multiple-models

thewulf7 · 2026-05-06T13:25:17Z

@tokamak-pm review again please

tokamak-pm · 2026-05-06T13:25:22Z

🤖 Tokamak Bot — the GitHub Bot is disabled for the project that owns this repository. A project owner can enable it under Project Settings → GitHub Bot.

Model scoring system

Co-authored-by: thewulf7 <1582808+thewulf7@users.noreply.github.com>

tokamak-pm · 2026-06-03T08:41:00Z

Follow-up review (new activity detected)

Since our last review (2026-05-05), the PR author requested a re-review on 2026-05-06 but the bot was disabled. There have been no new commits since our initial review. The diff remains at ~1383 lines across 15 files.

Status of previously raised concerns

All four concerns from our initial review remain unaddressed in the current code:

draft_model_id value "none" not filtered -- The string "none" is still truthy and will trigger a spurious read_yaml lookup and warning log on every model load with the default setting. Still needs an explicit && draftModelId !== 'none' guard.
spec_type granularity mismatch -- spec_type is still an extension-level setting while draft_model_id/draft_max/draft_min are per-model, so users cannot use different speculative types across models.
i32 vs u32 type mismatch -- draft_max and draft_min remain u32 in Rust while JS normalization uses asI32. Negative input values would fail serde deserialization.
draftModelCandidates not memoized -- Still computed outside useMemo, negating the useCallback memoization.

New observations

The PR includes a large number of unrelated commits in its history (model scoring system, hub enhancements, etc.) dating back to March 2026. These appear to be part of a fork that was not rebased against main before opening the PR. This makes the actual diff for speculative decoding harder to isolate and increases merge conflict risk. A rebase onto current main would be beneficial.

Recommendation: fix needed

tokamak-pm · 2026-06-06T02:06:57Z

Comprehensive Review (40 files, +3739/-697)

This PR has grown substantially since our earlier reviews. It now contains two major features:

Speculative decoding support (draft model + n-gram modes)
Hub model scoring system (new llmfit-core integration, scoring UI in hub, quantization utilities)

These are architecturally independent features. Bundling them in one PR increases review complexity, merge conflict risk, and rollback difficulty. That said, here is a review of the full changeset.

Previously Raised Issues

Our prior reviews (May 5 and June 3) identified four concerns. Here is the current status:

#	Issue	Status
1	`draft_model_id` value `"none"` not filtered (truthy string triggers spurious `read_yaml` warning)	Partially addressed -- `draftModelCandidates` is now memoized (concern #4 resolved), but the `"none"` guard in the extension index.ts is still not visible in the diff. If the resolution logic in `llamacpp-extension/src/index.ts` for `draft_model_id` still treats `"none"` as truthy, this remains a bug.
2	`spec_type` extension-level vs per-model granularity mismatch	Unchanged -- still extension-level
3	`i32` vs `u32` type mismatch for `draft_max`/`draft_min`	Fixed -- New `asU32()` function in `guest-js/index.ts` clamps values to `[0, I32_MAX]`, preventing negative values from reaching Rust
4	`draftModelCandidates` not memoized	Fixed -- now wrapped in `useMemo` with `[model.id, providers]` dependencies

New Feature: Hub Model Scoring System

Architecture (Positive)

Clean separation: Rust scoring.rs (718 lines) handles all scoring logic via llmfit-core, exposed through a Tauri command
TypeScript layer adds useModelScore Zustand store with localStorage persistence and deduplication via pendingScoreRequests map
modelQuantization.ts provides reusable quantization inference and variant selection
Good test coverage for scoring.rs (8 unit tests), modelQuantization.test.ts, score-utils.test.ts, useModelScores.test.ts

Concerns

1. New llmfit-core dependency pulls in ureq (HTTP client) -- acknowledged but not mitigated (Medium Risk)

The Cargo.toml comment explains that ureq is a transitive dependency used only for optional model-card fetching not exercised here. However, shipping an HTTP client in a desktop app increases attack surface. Consider:

Using default-features = false on llmfit-core if supported
Filing an upstream issue for a no-network feature flag

2. ModelScoreStatus type mismatch between Rust and TypeScript (Medium Risk)

Rust ModelScoreStatus has 3 variants: Ready, Unavailable, Error. TypeScript ModelScoreStatus in web-app/src/services/models/types.ts has 4: 'loading' | 'ready' | 'unavailable' | 'error'. The 'loading' state is client-only and never comes from Rust, which is fine conceptually, but the duplicate type definitions in guest-js/types.ts (3 variants) vs services/models/types.ts (4 variants) create a maintenance burden. Consider having one canonical type or at least a comment explaining the divergence.

3. estimated_tps type inconsistency (Low Risk)

Rust HubModelScoreResult.estimated_tps is Option<f32> (nullable). The guest-js types.ts declares estimated_tps: number | null. But the web-app ModelScore.estimated_tps is number (non-nullable, defaulted to 0). The useModelScore store does estimated_tps: scoreResult.estimated_tps ?? 0, which papers over the null. This works but the silent null -> 0 conversion could hide scoring failures from the UI.

4. Score cache never expires (Low Risk)

The useModelScore store persists scores in localStorage with an updated_at field, but nothing ever checks that timestamp to invalidate stale scores. If the user upgrades their hardware, cached scores will be wrong until they manually clear storage or call reset(). Consider adding a TTL or hardware fingerprint check.

5. selectBestGgufVariant prefers Q8_0 over smaller quants unconditionally (Design)

The preference order in GGUF_QUANT_PREFERENCE_ORDER ranks Q8_0 first (highest quality). This means scoring always evaluates the largest variant, which may return "Too Tight" for users with limited hardware. If the purpose is to show whether a model can run, scoring the smallest viable variant might be more useful. This is a product decision rather than a bug.

6. getHubModelScore casts engine with as AIEngine & { getHubModelScore?: ... } (Low Risk)

Similar to the as any pattern flagged in our initial review. This bypasses type safety. Consider adding getHubModelScore to the AIEngine interface or creating a proper subtype.

Speculative Decoding Changes (Updated)

7. reasoning setting ordering logic is fragile (Medium Risk)

In ModelSetting.tsx, the new .reduce() block inserts the reasoning setting at position 0:

.reduce<[string, unknown][]>((acc, entry) => {
  const reasoning = Object.entries(model.settings || {}).find(
    ([k]) => k === 'reasoning'
  )
  if (reasoning && acc.length === 0) acc.push(reasoning)
  acc.push(entry)
  return acc
}, [])

This re-looks-up reasoning on every iteration but only inserts it once (when acc.length === 0). The first .reduce() block already filters OUT reasoning with if (entry[0] === 'reasoning') return acc. So the second .reduce() manually re-inserts it at position 0. This is an unnecessarily convoluted way to move one setting to the top. A simpler approach would be:

const entries = Object.entries(model.settings || {}).filter(...);
const reasoning = entries.find(([k]) => k === 'reasoning');
const sorted = reasoning ? [reasoning, ...entries.filter(([k]) => k !== 'reasoning')] : entries;

8. fit_ctx / fitEnabled UI block was removed (Regression?)

The diff removes the entire fitEnabled/fitCtxSetting block and the handleEngineSettingChange function. If fit_ctx was a working feature before this PR, its UI is now gone. The if (fitEnabled && entry[0] === 'ctx_len') return acc line is also removed, meaning ctx_len will now always show even when fit is enabled. Verify this is intentional.

Missing Tests

No test for getHubModelScore in DefaultModelsService -- especially the MLX branch and the fallback paths
No test for stale cache behavior in useModelScore
No test for the "none" draft model ID flow through the extension layer
scoring.rs tests are pure unit tests; no test verifies the Tauri command wiring (score_hub_model)

Style / Minor

web-app/src/services/models/types.ts has both created_at and createdAt on CatalogModel -- pick one naming convention
scoring.rs line "*.expert_used_count", "*.expert_used_count" has a duplicate entry in the infer_metadata_u32 patterns array
The lddtree = "0.5" dependency was added to Cargo.toml but is never imported or used in the diff -- is it a leftover from another branch?

Scope Concern

This PR combines speculative decoding with a full hub model scoring system (718-line Rust module, new Zustand store, hub UI overhaul, new llmfit-core dependency). These are independent features that would benefit from separate PRs for cleaner review, easier rollback, and reduced merge conflict risk. The 54 commits in the history (including model scoring system, hub enhancements from March/April) confirm this is a multi-feature branch.

Recommendation: fix needed

Key items requiring attention before merge:

Confirm draft_model_id === "none" is properly guarded in the extension loading path
Verify the fit_ctx UI removal is intentional (not a regression)
Remove the unused lddtree dependency if it is not needed
Fix the duplicate "*.expert_used_count" pattern in scoring.rs
Consider splitting this into two PRs (speculative decoding + hub scoring) given the scope

thewulf7 added 30 commits March 19, 2026 21:21

test: add unit tests for ModelScoreBadge and ModelScorePanel components

5d9f3e8

feat: enhance model scoring with additional metrics and breakdown det…

ec2e4e0

…ails

Merge branch 'main' into model_scoring_system

875ef42

feat: enhance HubModelDetailContent with score handling and loading s…

30b149c

…tates

feat: export utility functions for fit level and run mode keys, enhan…

00c9dd1

…ce HubModelDetailContent with detailed score breakdown

feat: refactor ModelScoreSummary tests and remove unused quant score …

cb98c88

…mapping

refactor: clean up import statements in $modelId.tsx

44a700e

refactor: remove unused icon imports in $modelId.tsx

761c28f

feat: enhance type definitions and improve type safety in HubContent …

ee85c8d

…component

feat: implement Badge component for consistent UI styling and enhance…

4adcfc5

… HubModelDetailContent with score summary display

refactor: remove unused score caching logic and streamline model scor…

1fbb2db

…e retrieval

refactor: remove default_quant_model_id and related logic from model …

4fe59f2

…scoring interfaces and implementations

feat: enhance model scoring logic to infer quantization and handle un…

6fb2855

…available responses

feat: introduce preferred download quantizations and update related c…

0405ed4

…omponents

refactor: remove ModelScoreSummary component and integrate score disp…

de53a2c

…lay in HubContent

Merge branch 'main' into model_scoring_system

5c7f57e

refactor: remove ModelScoreSummary test file and related mocks

266947b

fix: improve score handling in HubContent component

191f685

refactor: streamline model rendering in HubContent component and enha…

9aaa4ce

…nce score display

Merge branch 'main' into model_scoring_system

427483f

Merge branch 'main' into model_scoring_system

52bbaa0

refactor: migrate score utilities to a new location and update imports

ddf2a83

Merge branch 'main' into model_scoring_system

084d595

Merge branch 'main' into model_scoring_system

2b25177

thewulf7 added 12 commits April 27, 2026 11:15

Merge branch 'main' into model_scoring_system

8ae24ec

fix: update resolution hashes for core package dependencies in yarn.lock

19958e1

fix: handle nullable estimated_tps in HubModelScoreResult and update …

660f7cd

…translations for fit score

fix: optimize dependency tracking in HubContent for improved performance

88acfeb

fix: update storage mechanism in useModelScore to use localStorage

4d4a22d

Merge branch 'janhq:main' into model_scoring_system

c26c158

feat(llamacpp): add speculative decoding configuration options

90914b6

Adding UI elements for speculative decoding

c860dd1

getControllerProps method for more unified approach for dynamic config

b5148d5

feat(llamacpp): implement draft model settings for speculative decoding

4fbcc76

Merge branch 'main' into model_scoring_system

1466924

refactor(useModelProvider): remove unnecessary console log for version

8fb2658

github-project-automation Bot added this to Jan May 4, 2026

thewulf7 added 5 commits May 4, 2026 13:07

Merge branch 'main' into thewulf7/loading-multiple-models

935f449

feat(llamacpp): add speculative decoding settings and explanations fo…

c6d0e57

…r draft model and model-free modes

Merge branch 'thewulf7/loading-multiple-models' of https://github.com…

a0e4191

…/thewulf7/jan into thewulf7/loading-multiple-models

feat(tests): add comprehensive tests for speculative decoding feature…

cd201bd

…s and model settings

refactor(tests): streamline modelSettings import in speculative decod…

7cc124e

…ing tests

thewulf7 added 4 commits May 5, 2026 09:10

Merge branch 'main' into thewulf7/loading-multiple-models

805f53d

feat(llamacpp): enhance draft model handling and configuration valida…

8490c62

…tion

Merge branch 'thewulf7/loading-multiple-models' of https://github.com…

cdd3527

…/thewulf7/jan into thewulf7/loading-multiple-models

Merge branch 'main' into model_scoring_system

670840b

thewulf7 and others added 3 commits May 14, 2026 08:32

Merge branch 'main' into model_scoring_system

00ec5d1

Merge pull request #1 from thewulf7/model_scoring_system

2a5029e

Model scoring system

Merge branch 'main' into thewulf7/loading-multiple-models

d0a3e54

Co-authored-by: thewulf7 <1582808+thewulf7@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: adding speculative decoding and draft model support#8115

feat: adding speculative decoding and draft model support#8115
thewulf7 wants to merge 54 commits into
janhq:mainfrom
thewulf7:thewulf7/loading-multiple-models

thewulf7 commented May 4, 2026 •

edited

Loading

Uh oh!

tokamak-pm Bot commented May 5, 2026

Uh oh!

thewulf7 commented May 6, 2026

Uh oh!

tokamak-pm Bot commented May 6, 2026

Uh oh!

tokamak-pm Bot commented Jun 3, 2026

Uh oh!

tokamak-pm Bot commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

thewulf7 commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe Your Changes

How to test

Uh oh!

tokamak-pm Bot commented May 5, 2026

PR Review: feat: adding speculative decoding and draft model support

Overview

Correctness & Risks

Missing Tests

Architecture Concerns

Style Issues

Positive Aspects

Uh oh!

thewulf7 commented May 6, 2026

Uh oh!

tokamak-pm Bot commented May 6, 2026

Uh oh!

tokamak-pm Bot commented Jun 3, 2026

Status of previously raised concerns

New observations

Uh oh!

tokamak-pm Bot commented Jun 6, 2026

Comprehensive Review (40 files, +3739/-697)

Previously Raised Issues

New Feature: Hub Model Scoring System

Architecture (Positive)

Concerns

Speculative Decoding Changes (Updated)

Missing Tests

Style / Minor

Scope Concern

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

thewulf7 commented May 4, 2026 •

edited

Loading