187 changes: 187 additions & 0 deletions FAQ.md
@@ -0,0 +1,187 @@
# Frequently Asked Questions

## General

### What does Distill do?

Distill is a post-retrieval processing layer for RAG pipelines. When you fetch chunks from a vector database, 30-40% are typically redundant - same information phrased differently. Distill clusters semantically similar chunks, picks the best representative from each cluster, compresses verbose content, and re-ranks for diversity. Total overhead is ~12ms. No LLM calls.

### Why not just fetch fewer results from the vector DB?

Fetching fewer results risks missing relevant information. The better approach is to over-fetch (retrieve 20-50 results) and then intelligently deduplicate. This casts a wide net for recall, then optimizes for precision and diversity.

### Is this just removing exact duplicates?

No. Exact dedup is trivial (hash comparison). Distill does _semantic_ dedup - it identifies chunks that convey the same information in different words. Two paragraphs explaining "how JWT auth works" with different wording will be clustered together, and only the best one is kept.

### Why not use an LLM for compression?

LLMs are non-deterministic. The same input can produce different compressed outputs across runs. Distill uses deterministic algorithms (cosine distance, agglomerative clustering, MMR) so the same input always produces the same output. It's also ~40x faster (~12ms vs ~500ms) and ~100x cheaper per call.

---

## Algorithms

### Why agglomerative clustering instead of K-Means?

K-Means requires specifying K upfront and assumes spherical clusters. Agglomerative clustering adapts to the data - it stops merging when the distance between the closest clusters exceeds the threshold. If your 20 chunks have 8 natural groups, you get 8 clusters. If they have 15, you get 15. There is no K to tune; the distance threshold is the only parameter.
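
The stopping rule can be sketched in a few lines of Python. This is an illustration only, not Distill's implementation: the function name `agglomerative` and the use of average linkage are assumptions made for the example.

```python
def agglomerative(dist, threshold=0.15):
    """Merge clusters until the closest pair is farther apart than `threshold`.

    `dist` is a symmetric N x N matrix of pairwise distances.
    Average linkage is an assumption made for this sketch.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Mean distance between every member of a and every member of b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        pairs = [
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        d, x, y = min(pairs)
        if d > threshold:
            break  # nothing left that is close enough to merge
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

The number of clusters falls out of the data and the threshold: a stricter threshold leaves more clusters, a looser one leaves fewer.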

### What does the threshold of 0.15 mean?

Cosine distance of 0.15 means cosine similarity of 0.85. Two chunks with 85%+ similarity are considered "saying the same thing." For code, use 0.10 (stricter - code is more precise). For prose, use 0.20 (looser - natural language has more variation).
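
As a rough illustration of the threshold check, the sketch below computes cosine distance in plain Python. The helper names `cosine_distance` and `is_redundant` are hypothetical, and treating the threshold as inclusive is an assumption.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def is_redundant(a, b, threshold=0.15):
    """Treat two chunks as duplicates when their distance is within the threshold."""
    return cosine_distance(a, b) <= threshold
```

Lowering `threshold` to 0.10 makes the check stricter; raising it to 0.20 makes it looser.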

### How does MMR (Maximal Marginal Relevance) work?

MMR greedily selects chunks that balance relevance and diversity:

```
MMR(chunk) = λ × relevance - (1-λ) × max_similarity(chunk, already_selected)
```

- `λ = 1.0` - pure relevance (top-K by score)
- `λ = 0.5` - balanced (default)
- `λ = 0.0` - pure diversity (maximize distance from selected chunks)
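
A minimal sketch of the greedy loop, assuming relevance scores and a pairwise similarity matrix are already available. The function name `mmr_select` and its parameters are illustrative, not Distill's API.

```python
def mmr_select(relevance, similarity, k, lam=0.5):
    """Greedy MMR selection.

    relevance[i] is the relevance score of chunk i;
    similarity[i][j] is the pairwise similarity between chunks i and j.
    Returns the indices of the selected chunks in selection order.
    """
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:
        def score(i):
            # Penalty is the similarity to the closest already-selected chunk.
            penalty = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * penalty
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With relevance `[0.9, 0.85, 0.5]`, where chunks 0 and 1 are near-duplicates (similarity 0.95) and chunk 2 is unrelated (similarity 0.1), `lam=0.5` and `k=2` selects `[0, 2]`, while `lam=1.0` selects `[0, 1]`.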

### What's the time complexity?

Distance matrix computation is O(N² × D) where N = number of chunks and D = embedding dimension. The merge loop is O(N³) worst case. For typical RAG inputs (N=20-50, D=1536), the full pipeline completes in ~12ms. For larger inputs (N=1000+), the K-Means path with parallel workers is available.

### How does compression work without an LLM?

Three rule-based strategies, chainable via a pipeline:

1. **Extractive** - Scores sentences by position, length, and keyword signals. Keeps the top sentences within a token budget.
2. **Placeholder** - Detects JSON, XML, and tables. Replaces them with structural summaries (e.g., `[JSON object with 12 keys: id, name, ...]`).
3. **Pruner** - Removes filler phrases ("as mentioned earlier", "basically", "it is important to note that") and intensifiers.

No API calls needed.
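
To give a feel for the rule-based approach, here is a small Python sketch of a pruner and a JSON placeholder. It is a simplified illustration, not the code in `pkg/compress`: the function names, the filler list, and the `max_keys` parameter are assumptions.

```python
import json
import re

# Filler phrases taken from the examples above; a real list would be longer.
FILLERS = ["as mentioned earlier", "it is important to note that", "basically"]

def prune(text):
    """Remove filler phrases, then tidy up the whitespace left behind."""
    for phrase in FILLERS:
        pattern = r"\b" + re.escape(phrase) + r"\b,?\s*"
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip()

def json_placeholder(text, max_keys=3):
    """Replace a JSON object with a one-line summary; return other text unchanged."""
    try:
        obj = json.loads(text)
    except ValueError:
        return text  # not JSON, leave as-is
    if not isinstance(obj, dict):
        return text
    keys = list(obj)
    shown = ", ".join(keys[:max_keys]) + (", ..." if len(keys) > max_keys else "")
    return f"[JSON object with {len(keys)} keys: {shown}]"
```

Both functions are pure string transformations, so the same input always yields the same output.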

---

## Integration

### How does Distill work with LangChain?

Three integration paths, from simplest to deepest:

**1. MCP (works today):** Distill ships an MCP server (`distill mcp`). LangChain supports MCP via [`langchain-mcp-adapters`](https://github.com/langchain-ai/langchain-mcp-adapters). Distill's tools (`deduplicate_chunks`, `retrieve_deduplicated`, `analyze_redundancy`) become LangChain tools automatically.

```python
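import asyncio

from langchain.agents import create_agent
from langchain_mcp_adapters.client import MultiServerMCPClient

async def main():
    # Launch `distill mcp` as a subprocess and talk to it over stdio.
    client = MultiServerMCPClient({
        "distill": {
            "command": "distill",
            "args": ["mcp"],
            "transport": "stdio",
        }
    })
    tools = await client.get_tools()  # Distill's MCP tools as LangChain tools
    agent = create_agent("openai:gpt-4.1", tools)
    return agent

asyncio.run(main())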
```

**2. HTTP API (works today):** Call `POST /v1/dedupe` as a post-processing step on retrieval results.

```python
import httpx

def deduplicate(docs, threshold=0.15):
    # Send each document's text to Distill, keyed by its position in `docs`.
    chunks = [{"id": str(i), "text": doc.page_content} for i, doc in enumerate(docs)]
    resp = httpx.post("https://distill-api-4u92.onrender.com/v1/dedupe", json={
        "chunks": chunks, "threshold": threshold
    })
    resp.raise_for_status()
    # Keep only the documents whose IDs survived deduplication.
    kept = {c["id"] for c in resp.json()["chunks"]}
    return [doc for i, doc in enumerate(docs) if str(i) in kept]

raw_docs = retriever.invoke("query") # Over-fetch 20 results
clean_docs = deduplicate(raw_docs) # -> ~8 unique results
```

**3. Python SDK (planned - [#5](https://github.com/Siddhant-K-code/distill/issues/5)):** A `DistillRetriever` that wraps any LangChain retriever with automatic dedup.

### Does it work with LlamaIndex, CrewAI, AutoGen, etc.?

Yes. The HTTP API is framework-agnostic. MCP works with any MCP-compatible client. The planned Python SDK ([#5](https://github.com/Siddhant-K-code/distill/issues/5)) will include a LlamaIndex `NodePostprocessor`.

### How is this different from LangChain's built-in MMR retriever?

LangChain's `search_type="mmr"` applies MMR at the vector DB level - a single re-ranking step. Distill runs a multi-stage pipeline: cache lookup, agglomerative clustering (groups similar chunks), representative selection (picks the best from each group), compression (reduces token count), then MMR (diversity re-ranking). The clustering step is the key difference - it understands group structure, not just pairwise similarity.

### Can I use Distill with local models (Ollama, vLLM)?

The dedup pipeline itself doesn't call any LLM - it's pure math (cosine distance, clustering). The only external dependency is for embedding generation when you send text without pre-computed embeddings. Multi-provider embedding support (Ollama, Azure, Cohere, HuggingFace) is planned in [#33](https://github.com/Siddhant-K-code/distill/issues/33).

---

## Performance & Cost

### What's the latency overhead?

~12ms total for the pipeline: distance matrix ~2ms, clustering ~6ms, selection <1ms, MMR ~3ms. Embedding generation adds more if needed (depends on OpenAI API latency, typically 100-300ms for a batch). If embeddings are pre-computed, it's just the 12ms.

### What's the cost?

If chunks already have embeddings (from your vector DB): **$0**. If text-only chunks are sent, Distill uses `text-embedding-3-small` at $0.02 per 1M tokens. A typical 20-chunk request with ~100 tokens each = 2,000 tokens = $0.00004.
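
The arithmetic can be reproduced for other request sizes with a short sketch. The helper name `embedding_cost` is hypothetical, and the price is the figure quoted above.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, text-embedding-3-small price quoted above

def embedding_cost(num_chunks, tokens_per_chunk):
    """Estimate the embedding cost in USD for one text-only request."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# 20 chunks of ~100 tokens each: 2,000 tokens.
cost = embedding_cost(20, 100)
```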

### Does it scale to thousands of chunks?

The agglomerative path computes an O(N² × D) distance matrix and then runs a merge loop that is O(N³) in the worst case. For N=50, this is trivial (~2ms). For N=1,000, it's still fast (~100ms). For N=10,000+, the K-Means path (`pkg/dedup/`) with parallel workers is available. A batch API is planned in [#11](https://github.com/Siddhant-K-code/distill/issues/11).

### What if chunks don't have embeddings?

If you send text-only chunks to the API, Distill calls OpenAI's `text-embedding-3-small` to generate embeddings on the fly. Set `OPENAI_API_KEY` to enable this. If you send chunks with pre-computed embeddings (e.g., from your vector DB retrieval), no OpenAI call is needed.

---

## Deployment

### How do I self-host Distill?

Three options:

```bash
# Binary
distill api --port 8080

# Docker
docker run -p 8080:8080 -e OPENAI_API_KEY=xxx ghcr.io/siddhant-k-code/distill

# Build from source
go build -o distill . && ./distill api
```

### How do I protect my self-hosted instance?

Set `DISTILL_API_KEYS` with comma-separated API keys. Clients must include `Authorization: Bearer <key>` in requests.

```bash
export DISTILL_API_KEYS="key1,key2,key3"
distill api --port 8080
```
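
On the client side, a request to a protected instance might be built as sketched below, using only the standard library. The helper name `build_dedupe_request` is hypothetical; the payload shape mirrors the `POST /v1/dedupe` example in the Integration section.

```python
import json
import urllib.request

def build_dedupe_request(base_url, api_key, chunks, threshold=0.15):
    """Build (but do not send) an authenticated POST /v1/dedupe request."""
    body = json.dumps({"chunks": chunks, "threshold": threshold}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/dedupe",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # one of the keys in DISTILL_API_KEYS
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To send it: urllib.request.urlopen(build_dedupe_request(...))
```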

### What observability is available?

- **Prometheus metrics** at `/metrics` - request counts, latency histograms, chunk reduction ratios, cluster counts
- **OpenTelemetry tracing** - per-stage spans (embedding, clustering, selection, MMR) with W3C Trace Context propagation
- **Grafana dashboard** - pre-built template in `grafana/`

---

## Context & Positioning

### Why should I use this instead of just increasing my context window?

Larger context windows don't solve redundancy. If you stuff 50 chunks into a 128K window and 20 say the same thing, the model still processes all of them. This wastes tokens, increases latency, and can confuse the model. Distill ensures the model sees unique, diverse chunks instead of overlapping ones.

### Is Distill open source?

Yes, AGPL-3.0. The full pipeline, CLI, API server, MCP server, and all algorithms are open source. Commercial licensing is available for closed-source usage - contact siddhantkhare2694@gmail.com.

### What's on the roadmap?

Three pillars:

1. **Context Memory** - Persistent deduplicated memory across agent sessions with hierarchical decay ([#29](https://github.com/Siddhant-K-code/distill/issues/29), [#31](https://github.com/Siddhant-K-code/distill/issues/31))
2. **Code Intelligence** - Dependency graphs, co-change patterns, blast radius analysis ([#30](https://github.com/Siddhant-K-code/distill/issues/30), [#32](https://github.com/Siddhant-K-code/distill/issues/32))
3. **Platform** - Python SDK, multi-provider embeddings, batch API ([#5](https://github.com/Siddhant-K-code/distill/issues/5), [#33](https://github.com/Siddhant-K-code/distill/issues/33), [#11](https://github.com/Siddhant-K-code/distill/issues/11))
68 changes: 59 additions & 9 deletions README.md
@@ -27,9 +27,9 @@ LLM outputs are unreliable because context is polluted. "Garbage in, garbage out

30-40% of context assembled from multiple sources is semantically redundant. Same information from docs, code, memory, and tools competing for attention. This leads to:

- **Non-deterministic outputs** Same workflow, different results
- **Confused reasoning** Signal diluted by repetition
- **Production failures** Works in demos, breaks at scale
- **Non-deterministic outputs** - Same workflow, different results
- **Confused reasoning** - Signal diluted by repetition
- **Production failures** - Works in demos, breaks at scale

You can't fix unreliable outputs with better prompts. You need to fix the context that goes in.

@@ -457,19 +457,19 @@ W3C Trace Context propagation is enabled by default for cross-service tracing.

Reduces token count while preserving meaning. Three strategies:

- **Extractive** Scores sentences by position, keyword density, and length; keeps the most salient spans
- **Placeholder** Replaces verbose JSON, XML, and table outputs with compact structural summaries
- **Pruner** Strips filler phrases, redundant qualifiers, and boilerplate patterns
- **Extractive** - Scores sentences by position, keyword density, and length; keeps the most salient spans
- **Placeholder** - Replaces verbose JSON, XML, and table outputs with compact structural summaries
- **Pruner** - Strips filler phrases, redundant qualifiers, and boilerplate patterns

Strategies can be chained via `compress.Pipeline`. Configure with target reduction ratio (e.g., 0.3 = keep 30% of original).

### Cache (`pkg/cache`)

KV cache for repeated context patterns (system prompts, tool definitions, boilerplate). Sub-millisecond retrieval for cache hits.

- **MemoryCache** In-memory LRU with TTL, configurable size limits (entries and bytes), background cleanup
- **PatternDetector** Identifies cacheable content: system prompts, tool/function definitions, code blocks
- **RedisCache** Interface for distributed deployments (requires external Redis)
- **MemoryCache** - In-memory LRU with TTL, configurable size limits (entries and bytes), background cleanup
- **PatternDetector** - Identifies cacheable content: system prompts, tool/function definitions, code blocks
- **RedisCache** - Interface for distributed deployments (requires external Redis)

## Architecture

@@ -574,6 +574,55 @@ Works with your existing AI stack:
- **AI Assistants:** Claude Desktop, Cursor (via MCP)
- **Observability:** Prometheus, Grafana, OpenTelemetry (Jaeger, Tempo)

## FAQ

<details>
<summary>Is this just removing exact duplicates?</summary>
<p>No. Exact dedup is trivial (hash comparison). Distill does <em>semantic</em> dedup - it identifies chunks that convey the same information in different words. Two paragraphs explaining "how JWT auth works" with different wording will be clustered together, and only the best one is kept.</p>
</details>

<details>
<summary>Why agglomerative clustering instead of K-Means?</summary>
<p>K-Means requires specifying K upfront and assumes spherical clusters. Agglomerative clustering adapts to the data - it stops merging when the distance between the closest clusters exceeds the threshold. If your 20 chunks have 8 natural groups, you get 8 clusters. If they have 15, you get 15. There is no K to tune; the distance threshold is the only parameter.</p>
</details>

<details>
<summary>What does the threshold of 0.15 mean?</summary>
<p>Cosine distance of 0.15 means cosine similarity of 0.85. Two chunks with 85%+ similarity are considered "saying the same thing." For code, use 0.10 (stricter). For prose, use 0.20 (looser).</p>
</details>

<details>
<summary>Why cosine distance and not Euclidean?</summary>
<p>OpenAI embeddings (and most embedding models) are normalized to unit length. For unit vectors, cosine distance and Euclidean distance are monotonically related, but cosine is more interpretable: 0 = identical direction, 1 = orthogonal, 2 = opposite. The threshold of 0.15 means "chunks whose embeddings point within ~32 degrees of each other" (arccos 0.85 ≈ 31.8°).</p>
</details>

<details>
<summary>How does compression work without an LLM?</summary>
<p>Three rule-based strategies: (1) Extractive - scores sentences by position, length, and keyword signals, keeps the top ones. (2) Placeholder - detects JSON/XML/tables and replaces with structural summaries. (3) Pruner - removes filler phrases and intensifiers. No API calls needed.</p>
</details>

<details>
<summary>How does Distill work with LangChain?</summary>
<p>Three paths: (1) MCP - <code>distill mcp</code> exposes tools that become LangChain tools via <a href="https://github.com/langchain-ai/langchain-mcp-adapters">langchain-mcp-adapters</a>. (2) HTTP API - call <code>POST /v1/dedupe</code> as a post-processing step on retrieval results. (3) Python SDK (planned - <a href="https://github.com/Siddhant-K-code/distill/issues/5">#5</a>) - a <code>DistillRetriever</code> that wraps any LangChain retriever.</p>
</details>

<details>
<summary>How is this different from LangChain's built-in MMR?</summary>
<p>LangChain's <code>search_type="mmr"</code> is a single re-ranking step at the vector DB level. Distill runs a multi-stage pipeline: cache, agglomerative clustering, representative selection, compression, then MMR. The clustering step understands group structure, not just pairwise similarity.</p>
</details>

<details>
<summary>What's the time complexity?</summary>
<p>Distance matrix is O(N² x D) where N = chunks and D = embedding dimension. The merge loop is O(N³) worst case. For typical RAG inputs (N=20-50, D=1536), the full pipeline completes in ~12ms.</p>
</details>

<details>
<summary>Why not just increase the context window?</summary>
<p>Larger context windows don't solve redundancy. If you stuff 50 chunks into a 128K window and 20 say the same thing, the model still processes all of them. This wastes tokens, increases latency, and can confuse the model. Distill ensures the model sees unique, diverse chunks instead of overlapping ones.</p>
</details>

See [FAQ.md](FAQ.md) for the full list.

## Contributing

Contributions welcome! Check the [open issues](https://github.com/Siddhant-K-code/distill/issues) for things to work on.
@@ -595,6 +644,7 @@ For commercial licensing, contact: siddhantkhare2694@gmail.com

- [Website](https://distill.siddhantkhare.com)
- [Playground](https://distill.siddhantkhare.com/playground)
- [FAQ](FAQ.md)
- [Blog Post](https://dev.to/siddhantkcode/the-engineering-guide-to-context-window-efficiency-202b)
- [MCP Configuration](mcp/README.md)
- [Book a Demo](https://meet.siddhantkhare.com)
2 changes: 1 addition & 1 deletion pkg/compress/placeholder.go
@@ -224,7 +224,7 @@ func (p *PlaceholderCompressor) tryCompressXML(text string) (string, bool) {
break
}
if count > 1 {
summary.WriteString(fmt.Sprintf("%s(×%d)", elem, count))
fmt.Fprintf(&summary, "%s(×%d)", elem, count)
} else {
summary.WriteString(elem)
}