elizaOS
diff --git a/evidence/speculative/rerun-20260520/true-dflash-2b-200step-q4-metal-n1-512.json b/evidence/speculative/rerun-20260520/true-dflash-2b-200step-q4-metal-n1-512.json
Lines changed: 199 additions & 0 deletions
@@ -0,0 +1,199 @@
+{
+  "reportSchema": "eliza.speculative-benchmark.v1",
+  "generatedAt": "2026-05-20T18:06:45.856Z",
+  "verifier": "plugins/plugin-local-inference/native/verify/dflash_drafter_runtime_smoke.mjs",
+  "speculator": "dflash",
+  "tier": "2b",
+  "targetModel": "/Users/shawwalters/.eliza/local-inference/models/eliza-1-2b.bundle/text/eliza-1-2b-64k.gguf",
+  "drafterModel": "evidence/speculative/rerun-20260520/true-dflash-2b-200step-q4_k_m.local-stamped.gguf",
+  "specBinary": ".tmp/llama-mtp-build/bin/llama-cli",
+  "benchTokens": 512,
+  "available": true,
+  "status": "pass",
+  "failure": null,
+  "backend": "unknown",
+  "binary": {
+    "path": ".tmp/llama-mtp-build/bin/llama-cli",
+    "exists": true,
+    "sha256": "ff9e4ed6bf303fe54095b5757fd66d04799ea28d3548db3f5dc5d6cae50d52f9"
+  },
+  "drafted": 255,
+  "accepted": 255,
+  "acceptanceRate": 1,
+  "speedup": 0.9285714285714287,
+  "withDrafter": {
+    "available": true,
+    "binary": ".tmp/llama-mtp-build/bin/llama-cli",
+    "withDrafter": true,
+    "args": [
+      "-m",
+      "/Users/shawwalters/.eliza/local-inference/models/eliza-1-2b.bundle/text/eliza-1-2b-64k.gguf",
+      "-md",
+      "evidence/speculative/rerun-20260520/true-dflash-2b-200step-q4_k_m.local-stamped.gguf",
+      "-p",
+      "Write a short paragraph about speculative decoding.",
+      "-n",
+      "512",
+      "-c",
+      "2048",
+      "-ngl",
+      "99",
+      "-ngld",
+      "99",
+      "--single-turn",
+      "--simple-io",
+      "--no-display-prompt",
+      "--log-disable",
+      "--spec-draft-n-min",
+      "1",
+      "--spec-draft-n-max",
+      "1",
+      "--spec-draft-p-min",
+      "0.1",
+      "--spec-type",
+      "dflash"
+    ],
+    "skippedCliFlags": [
+      {
+        "flag": "--ctx-size-draft",
+        "alternatives": [
+          "--spec-draft-ctx-size",
+          "-cd"
+        ],
+        "values": [
+          "2048"
+        ],
+        "reason": "not advertised by binary --help"
+      }
+    ],
+    "status": 0,
+    "signal": null,
+    "error": null,
+    "wallMs": 206177,
+    "contextTokens": 2048,
+    "draftMin": 1,
+    "draftMax": 1,
+    "tokensRequested": 512,
+    "drafted": 255,
+    "accepted": 255,
+    "acceptanceRate": 1,
+    "tokensPerSecond": 2.6,
+    "generation": {
+      "encoded": null,
+      "decoded": null,
+      "tokensPerSecond": null
+    },
+    "timings": {
+      "targetPromptEval": null,
+      "targetEval": null,
+      "draftPromptEval": null,
+      "draftEval": null
+    },
+    "vocabIncompatibleWarning": false,
+    "draftingActive": true,
+    "dflashFailure": null,
+    "tokenizerCompatible": true,
+    "outputTail": "  /read <file>        add a text file\n  /glob <pattern>     add text files using globbing pattern\n\n\n> Write a short paragraph about speculative decoding.\n\n[Start thinking]\nThinking Process:\n\n1.  **Analyze the Request:**\n    *   **Topic:** Speculative decoding.\n    *   **Format:** Short paragraph.\n    *   **Goal:** Explain the concept concisely.\n\n2.  **Define Speculative Decoding:**\n    *   What is it? A technique used in AI/ML inference to speed up computation.\n    *   How does it work? Instead of running a complex decoding algorithm (e.g., search, decoding) from scratch for every token, it uses a fast \"guess\" (speculation) based on a simplified model or prior knowledge.\n    *   What happens next? If the guess is wrong, it's corrected (decoding correction).\n    *   Why is it good? Speedup, efficiency.\n\n3.  **Drafting - Attempt 1 (Mental Outline):**\n    Speculative decoding is a technique to speed up AI inference. It uses a fast model to guess the next token. If it's wrong, it fixes it. This saves time compared to pure decoding.\n\n4.  **Refining - Attempt 2 (Adding detail):**\n    Speculative decoding is a technique used to accelerate machine learning model inference by predicting the output before performing full decoding. It typically uses a lightweight, fast model to make an educated guess about the next token to generate a prediction. If the guess is incorrect, it corrects the model in a second pass, allowing for a speedup.\n\n5.  **Polishing - Attempt 3 (Conciseness and flow):**\n    Speculative decoding is a technique used to accelerate machine learning model inference by predicting the output before performing full decoding. Instead of running a complex decoding algorithm from scratch for every token, it uses a fast, lightweight model to make an educated guess or \"speculate\" on the next token. 
If that initial guess is incorrect, the system can quickly correct the error in a second pass, significantly reducing computation time and improving performance.\n\n6.  **Review against constraints:**\n    *   Short paragraph? Yes.\n    *   About speculative decoding? Yes.\n\n7.  **Final Polish:** Make it punchier.\n    Speculative decoding is an inference optimization technique designed to accelerate machine learning models by predicting outputs before performing full decoding. Instead of running a computationally expensive decoding process from scratch for every token, it uses a lightweight, fast model to generate an educated guess based on prior knowledge. If the initial\n\n[ Prompt: 50.8 t/s | Generation: 2.6 t/s ]\n[ Draft: 255 accepted / 255 generated | Acceptance: 1.0000 ]\n\nExiting..."
+  },
+  "withoutDrafter": {
+    "available": true,
+    "binary": ".tmp/llama-mtp-build/bin/llama-cli",
+    "withDrafter": false,
+    "args": [
+      "-m",
+      "/Users/shawwalters/.eliza/local-inference/models/eliza-1-2b.bundle/text/eliza-1-2b-64k.gguf",
+      "-md",
+      "evidence/speculative/rerun-20260520/true-dflash-2b-200step-q4_k_m.local-stamped.gguf",
+      "-p",
+      "Write a short paragraph about speculative decoding.",
+      "-n",
+      "512",
+      "-c",
+      "2048",
+      "-ngl",
+      "99",
+      "-ngld",
+      "99",
+      "--single-turn",
+      "--simple-io",
+      "--no-display-prompt",
+      "--log-disable",
+      "--spec-draft-n-min",
+      "0",
+      "--spec-draft-n-max",
+      "0",
+      "--spec-draft-p-min",
+      "1",
+      "--spec-type",
+      "dflash"
+    ],
+    "skippedCliFlags": [
+      {
+        "flag": "--ctx-size-draft",
+        "alternatives": [
+          "--spec-draft-ctx-size",
+          "-cd"
+        ],
+        "values": [
+          "2048"
+        ],
+        "reason": "not advertised by binary --help"
+      }
+    ],
+    "status": 0,
+    "signal": null,
+    "error": null,
+    "wallMs": 186843,
+    "contextTokens": 2048,
+    "draftMin": 0,
+    "draftMax": 0,
+    "tokensRequested": 512,
+    "drafted": 18,
+    "accepted": 18,
+    "acceptanceRate": 1,
+    "tokensPerSecond": 2.8,
+    "generation": {
+      "encoded": null,
+      "decoded": null,
+      "tokensPerSecond": null
+    },
+    "timings": {
+      "targetPromptEval": null,
+      "targetEval": null,
+      "draftPromptEval": null,
+      "draftEval": null
+    },
+    "vocabIncompatibleWarning": false,
+    "draftingActive": true,
+    "dflashFailure": null,
+    "tokenizerCompatible": true,
+    "outputTail": "\navailable commands:\n  /exit or Ctrl+C     stop or exit\n  /regen              regenerate the last response\n  /clear              clear the chat history\n  /read <file>        add a text file\n  /glob <pattern>     add text files using globbing pattern\n\n\n> Write a short paragraph about speculative decoding.\n\n[Start thinking]\nThinking Process:\n\n1.  **Analyze the Request:**\n    *   **Topic:** Speculative decoding (a concept in machine learning/computing).\n    *   **Format:** Short paragraph.\n    *   **Goal:** Explain or describe the concept clearly and concisely.\n\n2.  **Define Speculative Decoding:**\n    *   What is it? It's a technique in NLP/NLP model inference.\n    *   How does it work? Instead of running the full model (which is slow/expensive), it uses a faster, lighter model (speculative model) to generate a likely sequence of tokens. If the speculative model predicts the right sequence, it's a \"hit\" (success); otherwise, a \"miss\" (failure).\n    *   What happens on a hit? The original model uses that sequence to continue generating the rest of the text, potentially skipping the computation of the speculative model.\n    *   What happens on a miss? The original model re-runs the speculative model to re-generate the sequence or a fallback path.\n    *   Why use it? To speed up inference (latency).\n    *   Key idea: \"Speculative\" means betting on a correct path based on the speculative model.\n\n3.  **Drafting - Attempt 1 (Mental):**\n    Speculative decoding is a fast way to generate text. It uses a smaller model to guess the next tokens. If it guesses right, the original big model skips those tokens. This makes inference faster because it saves computation. It's like a cache hit.\n\n4.  **Refining - Attempt 2 (Adding technical nuance):**\n    Speculative decoding is an optimization technique for large language models (LLMs). 
Instead of processing every token one by one, it employs a lightweight, faster model to predict a likely sequence of tokens. If the prediction is correct, the heavy model skips processing those tokens, significantly reducing latency. This is known as a \"speculative\" step where a small model is used in place of the full model.\n\n5.  **Polishing - Attempt 3 (Concise and smooth):**\n    Speculative decoding is an optimization technique for language models that improves inference speed by predicting a likely sequence of tokens using a lightweight, faster model before the full model executes. If the prediction is correct, the original model skips processing those tokens, effectively reducing latency. This approach allows for real-time text generation by leveraging\n\n[ Prompt: 77.0 t/s | Generation: 2.8 t/s ]\n[ Draft: 18 accepted / 18 generated | Acceptance: 1.0000 ]\n\nExiting..."
+  },
+  "summary": {
+    "tokensPerSecondWithDrafter": 2.6,
+    "tokensPerSecondBaseline": 2.8,
+    "generationTokensPerSecondWithDrafter": null,
+    "generationTokensPerSecondBaseline": null,
+    "targetEvalTokensPerSecondWithDrafter": null,
+    "draftEvalTokensPerSecondWithDrafter": null,
+    "dflashDraftedTokens": 255,
+    "dflashAcceptedTokens": 255,
+    "dflashAcceptanceRate": 1,
+    "dflashSpeedup": 0.9285714285714287,
+    "dflashDraftingActive": true,
+    "dflashFailure": null,
+    "tokenizerCompatible": true,
+    "speculator": "dflash",
+    "tier": "2b",
+    "backend": "unknown",
+    "binarySha256": "ff9e4ed6bf303fe54095b5757fd66d04799ea28d3548db3f5dc5d6cae50d52f9",
+    "drafted": 255,
+    "accepted": 255,
+    "acceptanceRate": 1,
+    "speedup": 0.9285714285714287,
+    "status": "pass",
+    "failure": null
+  },
+  "draftingActive": true,
+  "dflashFailure": null
+}
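
Reviewer note: the report's derived fields can be cross-checked from its raw counters. The sketch below assumes `acceptanceRate` is `accepted / drafted` and `speedup` is the ratio of the two `tokensPerSecond` values; the reported numbers are consistent with both, but the verifier's actual formulas are not shown in this diff. `RunStats`, `acceptanceRate`, and `speedup` are hypothetical names introduced here for illustration, not part of the verifier.

```typescript
// Hypothetical shape: only the report fields this sketch reads.
interface RunStats {
  drafted: number;
  accepted: number;
  tokensPerSecond: number;
}

// Assumed formula: accepted / drafted, or null when nothing was drafted.
function acceptanceRate(run: RunStats): number | null {
  return run.drafted > 0 ? run.accepted / run.drafted : null;
}

// Assumed formula: throughput with the drafter divided by baseline throughput.
// A value below 1 means the drafter run was slower than the baseline.
function speedup(withDrafter: RunStats, baseline: RunStats): number | null {
  return baseline.tokensPerSecond > 0
    ? withDrafter.tokensPerSecond / baseline.tokensPerSecond
    : null;
}

// Values copied from the "withDrafter" and "withoutDrafter" blocks of the report.
const withDrafter: RunStats = { drafted: 255, accepted: 255, tokensPerSecond: 2.6 };
const withoutDrafter: RunStats = { drafted: 18, accepted: 18, tokensPerSecond: 2.8 };

const recomputedAcceptance = acceptanceRate(withDrafter);
const recomputedSpeedup = speedup(withDrafter, withoutDrafter);

// Compare against "acceptanceRate" and "speedup" in the report's summary.
console.log({ recomputedAcceptance, recomputedSpeedup });
```

Under these assumptions the recomputed speedup is about 0.93, in line with `dflashSpeedup`. The two `wallMs` values (186843 and 206177) give a ratio of about 0.906, so the reported figure appears to come from the throughput numbers rather than from wall time.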