Commit 1174d34

leifericf and claude committed
docs(site): update benchmark results from 2026-03-28 run

Full mean rose from 37.9% to 42.7% (+4.7pp) after ask-agent improvements. Spread widened from +19.7pp to +27.5pp. Updated all referenced stats and linked to the new report.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent 441732b commit 1174d34

1 file changed

Lines changed: 17 additions & 17 deletions

docs/index.html
@@ -45,7 +45,7 @@
 </div>
 <h1>Ask your codebase precise questions.<br>Get grounded answers.</h1>
 <p class="hero-sub">
-Noumenon builds a <a href="https://www.datomic.com">Datomic</a> knowledge graph from your repository so agents query structured facts instead of dumping raw files into context windows. In benchmarks across 9 repos and 8 languages, graph-augmented answers scored <strong>2&times; higher</strong> on average.
+Noumenon builds a <a href="https://www.datomic.com">Datomic</a> knowledge graph from your repository so agents query structured facts instead of dumping raw files into context windows. In benchmarks across 9 repos and 8 languages, graph-augmented answers scored <strong>2.8&times; higher</strong> on average.
 </p>
 <div class="hero-actions">
 <a href="#get-started" class="btn btn-primary">Get Started</a>
@@ -85,7 +85,7 @@ <h2 class="section-title">Why structured knowledge</h2>
 <div class="problem-grid">
 <div class="card">
 <h3>Context windows don't scale</h3>
-<p>A Datalog query returns exactly the entities a question needs. In benchmarks, graph context improved LLM accuracy by <strong>+19.7 percentage points</strong> on average.</p>
+<p>A Datalog query returns exactly the entities a question needs. In benchmarks, graph context improved LLM accuracy by <strong>+27.5 percentage points</strong> on average.</p>
 </div>
 <div class="card">
 <h3>Answers you can't verify</h3>
@@ -139,7 +139,7 @@ <h3>Recursive LLM Querying</h3>
 </div>
 <div class="principle">
 <span class="principle-label text-purple">Measurable</span>
-<p>Built-in A/B benchmarks. 9 repos, 8 languages: 18.2% (raw) to 37.9% (graph-augmented).</p>
+<p>Built-in A/B benchmarks. 9 repos, 8 languages: 15.2% (raw) to 42.7% (graph-augmented).</p>
 </div>
 <div class="principle">
 <span class="principle-label text-muted">Local</span>
@@ -209,36 +209,36 @@ <h2 class="section-title">Measured on real codebases</h2>
 </tr>
 </thead>
 <tbody>
-<tr><td>flask</td><td>Python</td><td>12.5%</td><td>41.2%</td><td class="text-green"><strong>+28.8pp</strong></td></tr>
-<tr><td>fzf</td><td>Go</td><td>13.8%</td><td>42.5%</td><td class="text-green"><strong>+28.8pp</strong></td></tr>
-<tr><td>express</td><td>JavaScript</td><td>18.8%</td><td>45.0%</td><td class="text-green"><strong>+26.2pp</strong></td></tr>
-<tr><td>fresh</td><td>TypeScript</td><td>12.5%</td><td>35.0%</td><td class="text-green"><strong>+22.5pp</strong></td></tr>
-<tr><td>guava</td><td>Java</td><td>2.5%</td><td>23.8%</td><td class="text-green"><strong>+21.3pp</strong></td></tr>
-<tr><td>ripgrep</td><td>Rust</td><td>12.5%</td><td>30.0%</td><td class="text-green"><strong>+17.5pp</strong></td></tr>
-<tr><td>redis</td><td>C</td><td>11.3%</td><td>26.3%</td><td class="text-green"><strong>+15.0pp</strong></td></tr>
-<tr><td>ring</td><td>Clojure</td><td>51.2%</td><td>60.0%</td><td class="text-green"><strong>+8.8pp</strong></td></tr>
-<tr><td>noumenon</td><td>Clojure</td><td>28.8%</td><td>37.5%</td><td class="text-green"><strong>+8.8pp</strong></td></tr>
+<tr><td>fresh</td><td>TypeScript</td><td>0.0%</td><td>41.3%</td><td class="text-green"><strong>+41.3pp</strong></td></tr>
+<tr><td>fzf</td><td>Go</td><td>2.5%</td><td>38.8%</td><td class="text-green"><strong>+36.3pp</strong></td></tr>
+<tr><td>ripgrep</td><td>Rust</td><td>2.5%</td><td>37.5%</td><td class="text-green"><strong>+35.0pp</strong></td></tr>
+<tr><td>flask</td><td>Python</td><td>10.0%</td><td>41.3%</td><td class="text-green"><strong>+31.3pp</strong></td></tr>
+<tr><td>redis</td><td>C</td><td>2.5%</td><td>26.3%</td><td class="text-green"><strong>+23.8pp</strong></td></tr>
+<tr><td>express</td><td>JavaScript</td><td>31.3%</td><td>53.8%</td><td class="text-green"><strong>+22.5pp</strong></td></tr>
+<tr><td>noumenon</td><td>Clojure</td><td>23.8%</td><td>45.0%</td><td class="text-green"><strong>+21.3pp</strong></td></tr>
+<tr><td>guava</td><td>Java</td><td>7.5%</td><td>26.3%</td><td class="text-green"><strong>+18.8pp</strong></td></tr>
+<tr><td>ring</td><td>Clojure</td><td>56.3%</td><td>73.8%</td><td class="text-green"><strong>+17.5pp</strong></td></tr>
 </tbody>
 <tfoot>
-<tr><td><strong>Average</strong></td><td></td><td><strong>18.2%</strong></td><td><strong>37.9%</strong></td><td class="text-green"><strong>+19.7pp</strong></td></tr>
+<tr><td><strong>Average</strong></td><td></td><td><strong>15.2%</strong></td><td><strong>42.7%</strong></td><td class="text-green"><strong>+27.5pp</strong></td></tr>
 </tfoot>
 </table>
 <div class="benchmark-notes">
 <div class="card">
 <h3>Biggest gains on unfamiliar repos</h3>
-<p>Flask, fzf, and Express saw +26&ndash;29pp &mdash; the graph fills in what the LLM lacks from training data.</p>
+<p>Fresh scored 0% without the graph and 41% with it &mdash; the knowledge graph provides what the LLM simply doesn't know.</p>
 </div>
 <div class="card">
-<h3>Factual lookups improved most</h3>
-<p>Single-hop accuracy (e.g. &ldquo;which files import X?&rdquo;) jumped from 29.5% to 65.9% on Ring &mdash; +36pp.</p>
+<h3>Best overall: Ring at 73.8%</h3>
+<p>Ring scored the highest absolute Full score. Deterministic accuracy hit 75.0%, with LLM-judged questions close behind at 72.2%.</p>
 </div>
 <div class="card">
 <h3>8 languages, zero failures</h3>
 <p>Clojure, Python, JavaScript, TypeScript, C, Go, Rust, and Java all completed the full pipeline successfully.</p>
 </div>
 </div>
 <p class="section-sub" style="margin-top: 1.5rem;">
-<a href="https://github.com/leifericf/noumenon/blob/main/reports/digest-run-2026-03-27.md">Read the full benchmark report &rarr;</a>
+Last updated: 2026-03-28. <a href="https://github.com/leifericf/noumenon/blob/main/reports/digest-run-2026-03-28.md">Read the full benchmark report &rarr;</a>
 </p>
 </div>
 </section>
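
Review note: the headline figures this commit updates (15.2% raw, 42.7% graph-augmented, +27.5pp, 2.8&times;) can be cross-checked against the added table rows above. The Python sketch below is an illustrative check only, not part of the commit or of Noumenon's benchmark tooling. The names `ROWS` and `summarize` are hypothetical, the scores are copied from the new rows, and the sketch assumes the footer averages are unweighted means over the nine repos.

```python
# Hypothetical cross-check of the updated benchmark summary.
# Each tuple is (repo, raw %, graph-augmented %), copied from the added rows.
ROWS = [
    ("fresh", 0.0, 41.3),
    ("fzf", 2.5, 38.8),
    ("ripgrep", 2.5, 37.5),
    ("flask", 10.0, 41.3),
    ("redis", 2.5, 26.3),
    ("express", 31.3, 53.8),
    ("noumenon", 23.8, 45.0),
    ("guava", 7.5, 26.3),
    ("ring", 56.3, 73.8),
]


def summarize(rows):
    """Return unweighted means, their difference in pp, and their ratio."""
    n = len(rows)
    mean_raw = sum(raw for _, raw, _ in rows) / n
    mean_full = sum(full for _, _, full in rows) / n
    return {
        "mean_raw": round(mean_raw, 1),
        "mean_full": round(mean_full, 1),
        "delta_pp": round(mean_full - mean_raw, 1),
        "ratio": round(mean_full / mean_raw, 1),
    }


if __name__ == "__main__":
    for key, value in summarize(ROWS).items():
        print(f"{key}: {value}")
```

Under that assumption, the rounded results agree with the footer row and with the hero copy: the ratio of the two means rounds to 2.8, which is where the "2.8&times; higher" claim comes from.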
