feat: adding speculative decoding and draft model support #8115
thewulf7 wants to merge 54 commits into
- Introduced a new command `score_hub_model` to evaluate model performance.
- Created a scoring module with detailed scoring logic based on model characteristics.
- Implemented caching for scoring results to optimize performance.
- Added permissions for allowing and denying access to the `score_hub_model` command.
- Updated the schema to include new permissions and scoring request/response types.
- Enhanced the web application with components to display model scores and breakdowns.
- Integrated model scoring into the model detail view and hub content.
- Added tests for the new scoring functionality and ensured proper error handling.
…parameters
- Implemented a `scoreHubModel` function in guest JS to invoke model scoring.
- Extended the `HubModelScoreRequest` and `HubModelScoreResult` types with new parameters: `use_case`, `capabilities`, `release_date`, `tools`, `num_mmproj`, and `pinned`.
- Updated permissions to include `allow-score-hub-model`.
- Enhanced the scoring logic in Rust to derive model specifications and analyze performance based on the new parameters.
- Added utility functions for parameter parsing and model capability normalization.
- Updated frontend components to display new model attributes and handle loading states.
- Added tests to validate the new functionality and ensure correct behavior with extended model data.
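The commit mentions utility functions for parameter parsing and capability normalization in the Rust scoring code. As a rough TypeScript illustration of what such helpers typically do, the sketch below parses labels like `7B` or `1.5B` and canonicalizes capability tags. `parseParamCount` and `normalizeCapabilities` are hypothetical names, and the accepted formats are assumptions rather than what `scoring.rs` actually handles.

```typescript
// Hypothetical helpers; the PR's real versions live in the Rust scoring module.

// Multipliers for the single-letter suffixes commonly used in model names.
const SUFFIX_MULTIPLIER: Record<string, number> = {
  K: 1e3,
  M: 1e6,
  B: 1e9,
  T: 1e12,
}

// Parse a parameter-count label such as "7B", "1.5B" or "350M" into a raw
// count. Returns null for anything that does not match that pattern.
function parseParamCount(label: string): number | null {
  const match = /^\s*(\d+(?:\.\d+)?)\s*([KMBT])\s*$/i.exec(label)
  if (!match) return null
  const multiplier = SUFFIX_MULTIPLIER[match[2].toUpperCase()]
  return Math.round(Number(match[1]) * multiplier)
}

// Canonicalize capability tags: trim, lowercase, drop empties and duplicates,
// and sort so that two equivalent lists compare equal.
function normalizeCapabilities(caps: readonly string[]): string[] {
  const seen = new Set<string>()
  for (const raw of caps) {
    const cap = raw.trim().toLowerCase()
    if (cap) seen.add(cap)
  }
  return Array.from(seen).sort()
}
```

For example, `parseParamCount("1.5B")` returns `1500000000`, and `normalizeCapabilities(["Vision", " vision ", "Tools"])` returns `["tools", "vision"]`.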
…odel scores
- Updated the resolution hashes for multiple extensions in yarn.lock to ensure consistency.
- Removed unnecessary fields from HubModelScoreResult in types.ts.
- Simplified the scoring logic in scoring.rs by eliminating hardware fingerprinting and cache key generation.
- Enhanced the DefaultModelsService to read and write model scores to a persistent cache, improving performance and reliability.
- Added error handling and logging for cache operations in default.ts.
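Because the persistent score cache written by `DefaultModelsService` has no expiry (a point the later review raises as "Score cache never expires"), one option is to store a timestamp with each entry and treat old entries as misses. The sketch below is a minimal in-memory version of that idea; `TtlScoreCache`, `ScoreCacheEntry` and the one-week default TTL are assumptions, and the persistence layer is reduced to a `toJSON` snapshot.

```typescript
// Illustrative types; not the PR's actual cache format.
interface ScoreCacheEntry<T> {
  value: T
  storedAt: number // epoch milliseconds when the entry was written
}

// Assumed default lifetime for a cached score.
const ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000

class TtlScoreCache<T> {
  private readonly entries = new Map<string, ScoreCacheEntry<T>>()
  private readonly ttlMs: number
  private readonly now: () => number

  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number = ONE_WEEK_MS, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs
    this.now = now
  }

  get(key: string): T | undefined {
    const entry = this.entries.get(key)
    if (!entry) return undefined
    // Stale entries count as misses and are evicted so the model is re-scored.
    if (this.now() - entry.storedAt > this.ttlMs) {
      this.entries.delete(key)
      return undefined
    }
    return entry.value
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, storedAt: this.now() })
  }

  // Plain-object snapshot suitable for writing to a persistent store.
  toJSON(): Record<string, ScoreCacheEntry<T>> {
    return Object.fromEntries(this.entries)
  }
}
```

Keeping `storedAt` inside each persisted entry means expiry survives restarts without a separate index.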
- Added new fields to HubModelScoreRequest for runtime, quantization, and total size.
- Updated the ModelScoreSummary component to display fit levels and run modes with translations.
- Enhanced tests for ModelScoreSummary to cover new features and translations.
- Modified HubModelDetailContent to handle MLX models and display scores correctly.
- Improved the DefaultModelsService to support MLX scoring and aggregate safetensors size.
- Updated types for models to include new properties related to MLX and scoring.
- Added localization for new score summary keys in hub.json.
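MLX checkpoints are usually split into several `.safetensors` shards, so a size estimate has to add them up. The following is a small sketch of that aggregation, assuming a hypothetical `RepoFile` listing with an optional byte `size`; the service's real data shape may differ.

```typescript
// Hypothetical shape of one file in a Hub repository listing.
interface RepoFile {
  path: string
  size?: number // bytes; some listings may omit it
}

// Sum the sizes of all safetensors shards (e.g. model-00001-of-00002.safetensors).
// Files without a reported size contribute 0 rather than NaN.
function aggregateSafetensorsSize(files: readonly RepoFile[]): number {
  return files
    .filter((f) => f.path.toLowerCase().endsWith('.safetensors'))
    .reduce((total, f) => total + (f.size ?? 0), 0)
}
```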
…ce HubModelDetailContent with detailed score breakdown
… HubModelDetailContent with score summary display
- Removed redundant model score handling logic from the HubModelDetailContent and HubContent components.
- Replaced manual score fetching and caching with the useModelScore hook for better state management.
- Updated DefaultModelsService to utilize useModelScore for fetching cached and live model scores.
- Cleaned up unused functions and types related to model scoring.
- Improved test cases to align with the new model score handling approach.
…scoring interfaces and implementations
…available responses
…lay in HubContent
…nce score display
- Updated constants to use DEFAULT_MODEL_QUANTIZATIONS instead of PREFERRED_DOWNLOAD_QUANTIZATIONS across various components.
- Simplified the providerModels structure by consolidating model capabilities into single arrays for better readability.
- Enhanced the useModelScore hook to deduplicate concurrent score fetch requests for the same model, improving performance and reducing unnecessary API calls.
- Added tests to ensure the deduplication logic works as expected.
- Cleaned up the ModelInfoHoverCard and HubModelDetailContent components for better readability and maintainability.
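The deduplication described for `useModelScore` can be implemented by sharing one pending promise per model id. This sketch wraps an arbitrary fetcher; `dedupeByModel` and `ScoreFetcher` are illustrative names, not the hook's actual internals.

```typescript
// Any async function that scores a model by id (e.g. a call into the backend).
type ScoreFetcher<T> = (modelId: string) => Promise<T>

function dedupeByModel<T>(fetchScore: ScoreFetcher<T>): ScoreFetcher<T> {
  // One pending request per model id.
  const inFlight = new Map<string, Promise<T>>()
  return (modelId) => {
    const pending = inFlight.get(modelId)
    if (pending) return pending // reuse the request that is already running
    const request = fetchScore(modelId).finally(() => {
      // Forget the promise once it settles so later calls fetch fresh data.
      inFlight.delete(modelId)
    })
    inFlight.set(modelId, request)
    return request
  }
}
```

Removing the entry in `finally` means a failed request is not remembered, so the next call retries instead of replaying the error.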
…translations for fit score
…r draft model and model-free modes
…/thewulf7/jan into thewulf7/loading-multiple-models
…s and model settings
PR Review: feat: adding speculative decoding and draft model support
Overview
This PR adds end-to-end support for llama.cpp speculative decoding across 15 files (+1383/-7 lines). It covers:
Correctness & Risks
1. When the user selects "None" in the Draft Model dropdown, the value is the string 'none':
const draftModelId = cfg.draft_model_id ?? (overrideSettings as any)?.draft_model_id ?? undefined
if (draftModelId) { ... }
The string 'none' is truthy, so this check still passes. Suggested fix:
const draftModelId = cfg.draft_model_id ?? (overrideSettings as any)?.draft_model_id ?? undefined
if (draftModelId && draftModelId !== 'none') { ... }
2. The migration (version <= 13) adds
3. Type mismatch: In
4. In
const draftModelCandidates = useMemo(() => providers
  .filter(...)
  .flatMap(...)
  .filter(...)
  .map(...), [providers, model.id])
Missing Tests
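A regression test for the "None" case in issue 1 could target a small extracted helper. The sketch below assumes the lookup is pulled into a hypothetical `resolveDraftModelId` function and that both config objects expose an optional `draft_model_id` string; it keeps the original precedence (model config before override settings).

```typescript
// Hypothetical shape shared by the model config and the override settings.
interface DraftConfig {
  draft_model_id?: string
}

// Value assumed to be stored when the user picks "None" in the dropdown.
const NO_DRAFT_MODEL = 'none'

// Returns the draft model id to load, or undefined when no draft model
// should be used.
function resolveDraftModelId(
  cfg: DraftConfig,
  overrideSettings?: DraftConfig
): string | undefined {
  const id = cfg.draft_model_id ?? overrideSettings?.draft_model_id
  // Empty strings and the "None" sentinel both mean "no draft model".
  if (!id || id === NO_DRAFT_MODEL) return undefined
  return id
}
```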
Architecture Concerns
Style Issues
Positive Aspects
Recommendation: improvement needed
…/thewulf7/jan into thewulf7/loading-multiple-models
@tokamak-pm review again please
🤖 Tokamak Bot: the GitHub Bot is disabled for the project that owns this repository. A project owner can enable it under Project Settings → GitHub Bot.
Model scoring system
Co-authored-by: thewulf7 <1582808+thewulf7@users.noreply.github.com>
Follow-up review (new activity detected)
Since our last review (2026-05-05), the PR author requested a re-review on 2026-05-06, but the bot was disabled. There have been no new commits since our initial review. The diff remains at ~1383 lines across 15 files.
Status of previously raised concerns
All four concerns from our initial review remain unaddressed in the current code:
New observations
The PR includes a large number of unrelated commits in its history (model scoring system, hub enhancements, etc.) dating back to March 2026. These appear to be part of a fork that was not rebased against
Recommendation: fix needed
Comprehensive Review (40 files, +3739/-697)
This PR has grown substantially since our earlier reviews. It now contains two major features:
These are architecturally independent features. Bundling them in one PR increases review complexity, merge conflict risk, and rollback difficulty. That said, here is a review of the full changeset.
Previously Raised Issues
Our prior reviews (May 5 and June 3) identified four concerns. Here is the current status:
New Feature: Hub Model Scoring System
Architecture (Positive)
Concerns
1. New The
2. Rust
3. Rust
4. Score cache never expires (Low Risk) The
5. The preference order in
6. Similar to the
Speculative Decoding Changes (Updated)
7. In
.reduce<[string, unknown][]>((acc, entry) => {
  const reasoning = Object.entries(model.settings || {}).find(
    ([k]) => k === 'reasoning'
  )
  if (reasoning && acc.length === 0) acc.push(reasoning)
  acc.push(entry)
  return acc
}, [])
This re-looks-up the reasoning entry on every iteration of the reduce. A single lookup before reordering is simpler:
const entries = Object.entries(model.settings || {}).filter(...)
const reasoning = entries.find(([k]) => k === 'reasoning')
const sorted = reasoning ? [reasoning, ...entries.filter(([k]) => k !== 'reasoning')] : entries
8. The diff removes the entire
Missing Tests
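For item 7, a test could pin down the expected ordering: `reasoning` first, everything else in its original order, each entry exactly once. Note that if the elided filter keeps the `reasoning` entry, the reduce as written pushes it twice, once as the leading element and again when the iteration reaches it. The sketch below follows the suggested single-lookup rewrite; `reasoningFirst` and the example settings are hypothetical.

```typescript
type SettingEntry = [string, unknown]

// Move the `reasoning` entry (if present) to the front, keeping every other
// entry in its original order. The lookup happens once, not per iteration.
function reasoningFirst(entries: readonly SettingEntry[]): SettingEntry[] {
  const reasoning = entries.find(([key]) => key === 'reasoning')
  const rest = entries.filter(([key]) => key !== 'reasoning')
  return reasoning ? [reasoning, ...rest] : rest
}

// Example settings; the real keys come from model.settings.
const exampleSettings: Record<string, unknown> = {
  ctx_len: { value: 4096 },
  temperature: { value: 0.7 },
  reasoning: { value: true },
}

const ordered = reasoningFirst(Object.entries(exampleSettings))
```

Here `ordered` lists the keys as `reasoning`, `ctx_len`, `temperature`, with no duplicates.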
Style / Minor
Scope Concern
This PR combines speculative decoding with a full hub model scoring system (718-line Rust module, new Zustand store, hub UI overhaul, new
Recommendation: fix needed
Key items requiring attention before merge:
Describe Your Changes
Adds full end-to-end support for llama.cpp speculative decoding, covering both draft-model-based and model-free (n-gram) modes.
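For the draft-model mode, the settings ultimately have to become llama-server command-line flags. The sketch below maps a hypothetical `SpeculativeConfig` onto `--model-draft`, `--draft-max` and `--draft-min`, which are llama.cpp server options; how Jan actually builds its argument list is not shown here, and the model-free (n-gram) mode is left out rather than guessing at its flag names.

```typescript
// Hypothetical config shape; only the draft-model mode is sketched.
type SpeculativeConfig =
  | { mode: 'off' }
  | {
      mode: 'draft'
      draftModelPath: string // small model with a vocabulary compatible with the target
      draftMax?: number // max tokens drafted per step
      draftMin?: number // min tokens drafted per step
    }

function speculativeArgs(cfg: SpeculativeConfig): string[] {
  if (cfg.mode === 'off') return []
  const args = ['--model-draft', cfg.draftModelPath]
  // Forward only the knobs the user set; llama-server uses its own defaults otherwise.
  if (cfg.draftMax !== undefined) args.push('--draft-max', String(cfg.draftMax))
  if (cfg.draftMin !== undefined) args.push('--draft-min', String(cfg.draftMin))
  return args
}
```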
How to test