Skip to content

feat: adding speculative decoding and draft model support#8115

Open
thewulf7 wants to merge 54 commits into
janhq:mainfrom
thewulf7:thewulf7/loading-multiple-models
Open

feat: adding speculative decoding and draft model support#8115
thewulf7 wants to merge 54 commits into
janhq:mainfrom
thewulf7:thewulf7/loading-multiple-models

Conversation

@thewulf7
Copy link
Copy Markdown
Contributor

@thewulf7 thewulf7 commented May 4, 2026

Describe Your Changes

Adds full end-to-end support for llama.cpp speculative decoding, covering both draft-model-based and model-free (n-gram) modes.

image image

How to test

  1. Install two compatible llamacpp models (same architecture/tokenizer, e.g. Llama 3.2 3B as draft for Llama 3.1 8B)
  2. Open model settings for the larger model → "Draft Model" dropdown should list the smaller model
  3. Select the draft model → the server restarts with --model-draft and --draft-max 16
  4. Alternatively, set "Speculative type" to ngram-mod (no draft model needed) — useful for reasoning models

thewulf7 added 30 commits March 19, 2026 21:21
- Introduced a new command `score_hub_model` to evaluate model performance.
- Created a scoring module with detailed scoring logic based on model characteristics.
- Implemented caching for scoring results to optimize performance.
- Added permissions for allowing and denying access to the `score_hub_model` command.
- Updated the schema to include new permissions and scoring request/response types.
- Enhanced the web application with components to display model scores and breakdowns.
- Integrated model scoring into the model detail view and hub content.
- Added tests for the new scoring functionality and ensured proper error handling.
…parameters

- Implemented `scoreHubModel` function in guest JS to invoke model scoring.
- Extended `HubModelScoreRequest` and `HubModelScoreResult` types to include new parameters: `use_case`, `capabilities`, `release_date`, `tools`, `num_mmproj`, and `pinned`.
- Updated permissions to include `allow-score-hub-model`.
- Enhanced scoring logic in Rust to derive model specifications and analyze performance based on new parameters.
- Added utility functions for parameter parsing and model capability normalization.
- Updated frontend components to display new model attributes and handle loading states.
- Added tests to validate new functionality and ensure correct behavior with extended model data.
…odel scores

- Updated the resolution hashes for multiple extensions in yarn.lock to ensure consistency.
- Removed unnecessary fields from HubModelScoreResult in types.ts.
- Simplified the scoring logic in scoring.rs by eliminating hardware fingerprinting and cache key generation.
- Enhanced the DefaultModelsService to read and write model scores to a persistent cache, improving performance and reliability.
- Added error handling and logging for cache operations in default.ts.
- Added new fields to HubModelScoreRequest for runtime, quantization, and total size.
- Updated ModelScoreSummary component to display fit levels and run modes with translations.
- Enhanced tests for ModelScoreSummary to cover new features and translations.
- Modified HubModelDetailContent to handle MLX models and display scores correctly.
- Improved the DefaultModelsService to support MLX scoring and aggregate safetensors size.
- Updated types for models to include new properties related to MLX and scoring.
- Added localization for new score summary keys in hub.json.
…ce HubModelDetailContent with detailed score breakdown
… HubModelDetailContent with score summary display
- Removed redundant model score handling logic from HubModelDetailContent and HubContent components.
- Replaced manual score fetching and caching with useModelScore hook for better state management.
- Updated DefaultModelsService to utilize useModelScore for fetching cached and live model scores.
- Cleaned up unused functions and types related to model scoring.
- Improved test cases to align with the new model score handling approach.
- Updated constants to use DEFAULT_MODEL_QUANTIZATIONS instead of PREFERRED_DOWNLOAD_QUANTIZATIONS across various components.
- Simplified the providerModels structure by consolidating model capabilities into single arrays for better readability.
- Enhanced the useModelScore hook to deduplicate concurrent score fetch requests for the same model, improving performance and reducing unnecessary API calls.
- Added tests to ensure the deduplication logic works as expected.
- Cleaned up the ModelInfoHoverCard and HubModelDetailContent components for better readability and maintainability.
@tokamak-pm
Copy link
Copy Markdown

tokamak-pm Bot commented May 5, 2026

PR Review: feat: adding speculative decoding and draft model support

Overview

This PR adds end-to-end support for llama.cpp speculative decoding across 15 files (+1383/-7 lines). It covers:

  1. Rust backend (args.rs): New LlamacppConfig fields and CLI argument generation for --model-draft, --spec-type, --draft-max, --draft-min
  2. TypeScript guest-js (index.ts, types.ts): Config normalization and type definitions
  3. Extension (llamacpp-extension/src/index.ts): Draft model path resolution from model YAML
  4. UI (ModelSetting.tsx, useModelProvider.ts, predefined.ts): Settings controls, migration (v13->v14), and dynamic dropdown population
  5. Extension settings (settings.json): spec_type dropdown
  6. CLI (jan-cli.rs): Config struct initialization
  7. Docs: Comprehensive documentation in two .mdx files
  8. Tests: Rust unit tests, TS normalization tests, draft model resolution tests, UI component tests, hook tests

Correctness & Risks

1. draft_model_id value "none" is not cleared before reaching Rust config (Medium Risk)

When the user selects "None" in the Draft Model dropdown, the value is "none" (a string). In index.ts, the resolution logic is:

const draftModelId = cfg.draft_model_id ?? (overrideSettings as any)?.draft_model_id ?? undefined
if (draftModelId) { ... }

The string "none" is truthy, so the code will attempt to resolve a model named "none" via read_yaml, which will fail. The catch block logs a warning and continues, so this is not fatal, but it means every model load with the default "none" setting will produce a spurious warning log. Consider explicitly filtering out "none":

const draftModelId = cfg.draft_model_id ?? (overrideSettings as any)?.draft_model_id ?? undefined
if (draftModelId && draftModelId !== 'none') { ... }

2. spec_type setting defined in settings.json but migration does not add it (Low Risk)

The migration (version <= 13) adds draft_model_id, draft_max, and draft_min to existing models, but does not add spec_type. The spec_type is defined in settings.json (extension-level setting) rather than as a per-model setting via modelSettings, so it should flow through overrideSettings when the extension settings are applied. However, the asymmetry is worth noting -- spec_type is configured at the extension level while draft_model_id/draft_max/draft_min are per-model. If a user wants different speculative types for different models, this is currently not possible.

3. Type mismatch: i32 in Rust vs u32 for draft_max/draft_min

In args.rs, draft_max and draft_min are u32, while most other integer fields (e.g., keep, cache_reuse) are i32. The JS normalization uses asI32 which returns a signed integer. If asI32 returns a negative value (e.g., from invalid user input), the serde deserialization into u32 would fail. Consider using i32 for consistency or adding validation.

4. draftModelCandidates recomputed on every render (Low Risk, Performance)

In ModelSetting.tsx, draftModelCandidates is computed outside of useMemo, and getControllerProps depends on it via useCallback. Since draftModelCandidates is recreated on every render (it creates new array/objects), the useCallback dependency will always be stale, effectively negating the memoization. Wrap draftModelCandidates in useMemo:

const draftModelCandidates = useMemo(() => providers
  .filter(...)
  .flatMap(...)
  .filter(...)
  .map(...), [providers, model.id])

Missing Tests

  • No test validates the behavior when draft_model_id is "none" -- specifically that the model loads without trying to resolve a non-existent model. The draftModelResolution.test.ts tests call extension.loadModel(modelConfig), but loadModel is inherited from LocalOAIEngine and takes a Model object, not a raw config. The tests rely on catching thrown errors and only asserting side effects, which makes them fragile.
  • No Rust test verifies that draft_max / draft_min values passed as u32 survive serde round-trip from JavaScript's signed integers.
  • No integration-level test verifies the full flow from UI setting change through to server restart with correct CLI args.

Architecture Concerns

  • Mixed granularity: spec_type is an extension-level setting (from settings.json), while draft_model_id, draft_max, draft_min are per-model settings (from predefined.ts + migration). This inconsistency could confuse users who expect all speculative decoding settings to live together. If a user runs multiple models, they cannot have different spec_type per model.
  • (overrideSettings as any)?.draft_model_id: The as any cast in index.ts bypasses type safety. Consider extending the LlamacppConfig type to include draft_model_id as an optional field.

Style Issues

  • Minor: trailing whitespace on spec_type: string in types.ts (line with trailing space after string).
  • Blank line at line 592 in useModelProvider.ts (empty line before the migration block).
  • Double space in model.displayName ?? model.id in ModelSetting.tsx.

Positive Aspects

  • Excellent documentation additions in both .mdx files covering usage, requirements, and performance tips.
  • Good test coverage across Rust, TypeScript normalization, and React components.
  • Graceful error handling -- draft model resolution failure is non-fatal with a warning log.
  • The add_speculative_decoding_args Rust method is cleanly separated and thoroughly tested with 14 dedicated test cases.
  • Migration versioning is properly incremented (13 -> 14).

Recommendation: improve needed

@thewulf7
Copy link
Copy Markdown
Contributor Author

thewulf7 commented May 6, 2026

@tokamak-pm review again please

@tokamak-pm
Copy link
Copy Markdown

tokamak-pm Bot commented May 6, 2026

🤖 Tokamak Bot — the GitHub Bot is disabled for the project that owns this repository. A project owner can enable it under Project Settings → GitHub Bot.

@tokamak-pm
Copy link
Copy Markdown

tokamak-pm Bot commented Jun 3, 2026

Follow-up review (new activity detected)

Since our last review (2026-05-05), the PR author requested a re-review on 2026-05-06 but the bot was disabled. There have been no new commits since our initial review. The diff remains at ~1383 lines across 15 files.

Status of previously raised concerns

All four concerns from our initial review remain unaddressed in the current code:

  1. draft_model_id value "none" not filtered -- The string "none" is still truthy and will trigger a spurious read_yaml lookup and warning log on every model load with the default setting. Still needs an explicit && draftModelId !== 'none' guard.

  2. spec_type granularity mismatch -- spec_type is still an extension-level setting while draft_model_id/draft_max/draft_min are per-model, so users cannot use different speculative types across models.

  3. i32 vs u32 type mismatch -- draft_max and draft_min remain u32 in Rust while JS normalization uses asI32. Negative input values would fail serde deserialization.

  4. draftModelCandidates not memoized -- Still computed outside useMemo, negating the useCallback memoization.

New observations

The PR includes a large number of unrelated commits in its history (model scoring system, hub enhancements, etc.) dating back to March 2026. These appear to be part of a fork that was not rebased against main before opening the PR. This makes the actual diff for speculative decoding harder to isolate and increases merge conflict risk. A rebase onto current main would be beneficial.

Recommendation: fix needed

@tokamak-pm
Copy link
Copy Markdown

tokamak-pm Bot commented Jun 6, 2026

Comprehensive Review (40 files, +3739/-697)

This PR has grown substantially since our earlier reviews. It now contains two major features:

  1. Speculative decoding support (draft model + n-gram modes)
  2. Hub model scoring system (new llmfit-core integration, scoring UI in hub, quantization utilities)

These are architecturally independent features. Bundling them in one PR increases review complexity, merge conflict risk, and rollback difficulty. That said, here is a review of the full changeset.


Previously Raised Issues

Our prior reviews (May 5 and June 3) identified four concerns. Here is the current status:

# Issue Status
1 draft_model_id value "none" not filtered (truthy string triggers spurious read_yaml warning) Partially addressed -- draftModelCandidates is now memoized (concern #4 resolved), but the "none" guard in the extension index.ts is still not visible in the diff. If the resolution logic in llamacpp-extension/src/index.ts for draft_model_id still treats "none" as truthy, this remains a bug.
2 spec_type extension-level vs per-model granularity mismatch Unchanged -- still extension-level
3 i32 vs u32 type mismatch for draft_max/draft_min Fixed -- New asU32() function in guest-js/index.ts clamps values to [0, I32_MAX], preventing negative values from reaching Rust
4 draftModelCandidates not memoized Fixed -- now wrapped in useMemo with [model.id, providers] dependencies

New Feature: Hub Model Scoring System

Architecture (Positive)

  • Clean separation: Rust scoring.rs (718 lines) handles all scoring logic via llmfit-core, exposed through a Tauri command
  • TypeScript layer adds useModelScore Zustand store with localStorage persistence and deduplication via pendingScoreRequests map
  • modelQuantization.ts provides reusable quantization inference and variant selection
  • Good test coverage for scoring.rs (8 unit tests), modelQuantization.test.ts, score-utils.test.ts, useModelScores.test.ts

Concerns

1. New llmfit-core dependency pulls in ureq (HTTP client) -- acknowledged but not mitigated (Medium Risk)

The Cargo.toml comment explains that ureq is a transitive dependency used only for optional model-card fetching not exercised here. However, shipping an HTTP client in a desktop app increases attack surface. Consider:

  • Using default-features = false on llmfit-core if supported
  • Filing an upstream issue for a no-network feature flag

2. ModelScoreStatus type mismatch between Rust and TypeScript (Medium Risk)

Rust ModelScoreStatus has 3 variants: Ready, Unavailable, Error. TypeScript ModelScoreStatus in web-app/src/services/models/types.ts has 4: 'loading' | 'ready' | 'unavailable' | 'error'. The 'loading' state is client-only and never comes from Rust, which is fine conceptually, but the duplicate type definitions in guest-js/types.ts (3 variants) vs services/models/types.ts (4 variants) create a maintenance burden. Consider having one canonical type or at least a comment explaining the divergence.

3. estimated_tps type inconsistency (Low Risk)

Rust HubModelScoreResult.estimated_tps is Option<f32> (nullable). The guest-js types.ts declares estimated_tps: number | null. But the web-app ModelScore.estimated_tps is number (non-nullable, defaulted to 0). The useModelScore store does estimated_tps: scoreResult.estimated_tps ?? 0, which papers over the null. This works but the silent null -> 0 conversion could hide scoring failures from the UI.

4. Score cache never expires (Low Risk)

The useModelScore store persists scores in localStorage with an updated_at field, but nothing ever checks that timestamp to invalidate stale scores. If the user upgrades their hardware, cached scores will be wrong until they manually clear storage or call reset(). Consider adding a TTL or hardware fingerprint check.

5. selectBestGgufVariant prefers Q8_0 over smaller quants unconditionally (Design)

The preference order in GGUF_QUANT_PREFERENCE_ORDER ranks Q8_0 first (highest quality). This means scoring always evaluates the largest variant, which may return "Too Tight" for users with limited hardware. If the purpose is to show whether a model can run, scoring the smallest viable variant might be more useful. This is a product decision rather than a bug.

6. getHubModelScore casts engine with as AIEngine & { getHubModelScore?: ... } (Low Risk)

Similar to the as any pattern flagged in our initial review. This bypasses type safety. Consider adding getHubModelScore to the AIEngine interface or creating a proper subtype.


Speculative Decoding Changes (Updated)

7. reasoning setting ordering logic is fragile (Medium Risk)

In ModelSetting.tsx, the new .reduce() block inserts the reasoning setting at position 0:

.reduce<[string, unknown][]>((acc, entry) => {
  const reasoning = Object.entries(model.settings || {}).find(
    ([k]) => k === 'reasoning'
  )
  if (reasoning && acc.length === 0) acc.push(reasoning)
  acc.push(entry)
  return acc
}, [])

This re-looks-up reasoning on every iteration but only inserts it once (when acc.length === 0). The first .reduce() block already filters OUT reasoning with if (entry[0] === 'reasoning') return acc. So the second .reduce() manually re-inserts it at position 0. This is an unnecessarily convoluted way to move one setting to the top. A simpler approach would be:

const entries = Object.entries(model.settings || {}).filter(...);
const reasoning = entries.find(([k]) => k === 'reasoning');
const sorted = reasoning ? [reasoning, ...entries.filter(([k]) => k !== 'reasoning')] : entries;

8. fit_ctx / fitEnabled UI block was removed (Regression?)

The diff removes the entire fitEnabled/fitCtxSetting block and the handleEngineSettingChange function. If fit_ctx was a working feature before this PR, its UI is now gone. The if (fitEnabled && entry[0] === 'ctx_len') return acc line is also removed, meaning ctx_len will now always show even when fit is enabled. Verify this is intentional.


Missing Tests

  • No test for getHubModelScore in DefaultModelsService -- especially the MLX branch and the fallback paths
  • No test for stale cache behavior in useModelScore
  • No test for the "none" draft model ID flow through the extension layer
  • scoring.rs tests are pure unit tests; no test verifies the Tauri command wiring (score_hub_model)

Style / Minor

  • web-app/src/services/models/types.ts has both created_at and createdAt on CatalogModel -- pick one naming convention
  • scoring.rs line "*.expert_used_count", "*.expert_used_count" has a duplicate entry in the infer_metadata_u32 patterns array
  • The lddtree = "0.5" dependency was added to Cargo.toml but is never imported or used in the diff -- is it a leftover from another branch?

Scope Concern

This PR combines speculative decoding with a full hub model scoring system (718-line Rust module, new Zustand store, hub UI overhaul, new llmfit-core dependency). These are independent features that would benefit from separate PRs for cleaner review, easier rollback, and reduced merge conflict risk. The 54 commits in the history (including model scoring system, hub enhancements from March/April) confirm this is a multi-feature branch.


Recommendation: fix needed

Key items requiring attention before merge:

  1. Confirm draft_model_id === "none" is properly guarded in the extension loading path
  2. Verify the fit_ctx UI removal is intentional (not a regression)
  3. Remove the unused lddtree dependency if it is not needed
  4. Fix the duplicate "*.expert_used_count" pattern in scoring.rs
  5. Consider splitting this into two PRs (speculative decoding + hub scoring) given the scope

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants