A local-first llama.cpp knowledge assistant with instant import + QA (built on llama.cpp inference) #24354

zhuxingwan · 2026-06-09T11:00:12Z

Hi everyone,

I’ve been building a local-first knowledge assistant around the llama.cpp ecosystem and wanted to share it with the community for feedback.

The focus is not code completion or agent-style coding, but a repo-level knowledge layer over the llama.cpp project itself.

🧠 What it does

It allows you to import the entire llama.cpp documentation ecosystem, including:

Markdown docs
CSV files (ops/backend support matrices, benchmarks, configs)
TXT logs and design notes
HTML documentation exports
(Code is included but treated as secondary context)

After import, you can immediately:

Ask questions about the repository
Perform semantic + full-text search
Edit and update notes in real time
Get traceable answers grounded in source chunks
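
To make "traceable answers" concrete, here is a minimal sketch of one way to assemble a prompt so the model can cite the chunks it used. This is an illustration under my own assumptions, not a description of the product's internals: the function name `build_grounded_prompt`, the chunk dictionary layout, the instruction wording, and the example file names are all hypothetical.

```python
def build_grounded_prompt(question, chunks):
    """Build a prompt in which every retrieved chunk carries a citation number.

    chunks: list of dicts with 'source' (where the chunk came from) and
    'text' (the chunk body). Both keys are assumptions made for this sketch.
    """
    lines = [
        "Answer using only the numbered sources below.",
        "Cite sources as [n] after each claim.",
        "",
    ]
    # Number the chunks from 1 so an answer such as "... [2]" maps back to a source.
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"[{i}] ({chunk['source']}) {chunk['text']}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


# Hypothetical chunks; the paths and contents are placeholders, not real data.
example_chunks = [
    {"source": "docs/example-build.md", "text": "Placeholder text about building."},
    {"source": "docs/example-ops.csv", "text": "Placeholder row from a support matrix."},
]
prompt = build_grounded_prompt("How do I build the project?", example_chunks)
```

Because each chunk keeps its number and source label in the prompt, a citation like `[2]` in the answer can be resolved back to the chunk that produced it.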

⚙️ Key detail: fully local inference (llama.cpp backend)

The LLM inference backend itself is also powered by llama.cpp.

So the full stack is completely local:

Retrieval: FTS + embedding hybrid search
Context assembly: retrieved chunks packed into a ~5,000-token window
Inference: llama.cpp-based local LLM
No external API dependency
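
For readers unfamiliar with hybrid retrieval, the sketch below shows one common way to merge a full-text ranking with an embedding ranking: reciprocal rank fusion (RRF). RRF is an assumption chosen for illustration; the fusion method actually used may differ. The function name, the chunk ids, and the constant `k=60` (a widely used default) are all illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into a single ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Chunks found by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists from the two retrievers (ids are placeholders).
fts_hits = ["build.md#3", "ops.csv#12", "server.md#1"]
embedding_hits = ["ops.csv#12", "backend.md#7", "build.md#3"]

fused = reciprocal_rank_fusion([fts_hits, embedding_hits])
```

RRF only needs ranks, not raw scores, so it avoids having to normalize FTS relevance scores against embedding similarities.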

📦 Ready-to-use knowledge base

I’ve also exported a full prebuilt knowledge base for llama.cpp:

llamacpp.nrs

It includes structured indexing over the repo documentation (docs + CSV + logs), and can be directly imported into the system.

🚀 Tooling

The system is available as a product called:

NoteRich.com

It supports:

One-click import of the llama.cpp knowledge base
Instant Q&A over the repo
Full-text search + semantic search
Inline note editing + updating
Traceable answers with source references

So you can literally:

import → ask questions → edit notes → continue exploring

without any indexing setup or preprocessing pipeline.

🔍 What I found interesting

Even with a relatively small local model and a ~5K-token context window, answer quality is largely determined by:

retrieval quality (FTS + embedding hybrid)
context packing strategy

rather than model size itself.

The LLM mostly acts as a reasoning + explanation layer, while the system handles most of the information selection.
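
As a rough illustration of what a context packing strategy can look like, here is a minimal greedy packer: take chunks in descending relevance order and keep each one that still fits in the token budget. This is a sketch under stated assumptions, not the actual strategy: `pack_context` and `approx_tokens` are hypothetical names, and whitespace word counting is a crude stand-in for the model's real tokenizer.

```python
def approx_tokens(text):
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def pack_context(chunks, budget=5000):
    """Greedily pack the highest-scoring chunks into a token budget.

    chunks: list of (score, source, text) tuples.
    Returns (packed, used), where packed is a list of (source, text)
    and used is the approximate token count consumed.
    """
    packed, used = [], 0
    for score, source, text in sorted(chunks, key=lambda c: -c[0]):
        cost = approx_tokens(text)
        if used + cost > budget:
            # Too large for the remaining space; a smaller chunk may still fit.
            continue
        packed.append((source, text))
        used += cost
    return packed, used


# Hypothetical chunks: (relevance score, source label, text).
chunks = [
    (0.9, "a", "w " * 3000),
    (0.8, "b", "w " * 2500),
    (0.5, "c", "w " * 1500),
]
packed, used = pack_context(chunks, budget=5000)
```

In this example chunk "b" is skipped because it would overflow the budget, while the lower-ranked but smaller chunk "c" still fits. That is the kind of trade-off that matters more as the window gets smaller.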

🤔 Questions for the community

I’m curious:

Has anyone explored similar “repo knowledge layer” systems on top of llama.cpp or similar ML repos?
Do you think this overlaps with coding agents (like Codex-style tools), or is it a fundamentally different layer focused on documentation + structured repo knowledge?
Are the CSV / benchmark / ops-matrix parts of ML repos something you actively use in your workflows, or are they usually underexposed in current tooling?

Would love feedback from people working on retrieval, tooling, or llama.cpp internals.

Thanks for reading!
