
Conversation

@reakaleek (Member) commented Dec 22, 2025

Adds LLM-powered enrichment to documentation during indexing, generating metadata optimized for semantic search and RAG.

What it does

Each document gets enriched with:

  • ai_rag_optimized_summary - Dense technical summary for vector search
  • ai_short_summary - 5-10 word tooltip
  • ai_search_query - Keywords a dev would search
  • ai_questions - Questions this doc answers
  • ai_use_cases - Simple tasks like "bulk index documents"
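
As a rough illustration, the enriched payload could be modeled like this. This is a sketch only; the record and property names are illustrative and not taken from this PR, only the field semantics come from the list above.

// Illustrative sketch of the enrichment payload; the actual types/names in the PR may differ.
public sealed record AiEnrichedFields
{
    public string AiRagOptimizedSummary { get; init; } = "";                              // dense technical summary for vector search
    public string AiShortSummary { get; init; } = "";                                     // 5-10 word tooltip
    public string AiSearchQuery { get; init; } = "";                                      // keywords a developer would search
    public IReadOnlyList<string> AiQuestions { get; init; } = Array.Empty<string>();      // questions this doc answers
    public IReadOnlyList<string> AiUseCases { get; init; } = Array.Empty<string>();       // simple tasks like "bulk index documents"
}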

How enrichment works

Hybrid approach:

  • Cache hit: Enrich processor applies fields at index time (no LLM call)
  • Cache miss (new or stale content): Call LLM, store in cache, apply fields inline
  • Backfill: Documents skipped by hash-based upsert get AI fields via _update_by_query
  • Cache index: docs-ai-enriched-fields-cache stores LLM responses with enrichment_key and prompt_hash
  • Pre-loaded at startup: Only entries with the current prompt hash are loaded as valid

Per-document flow (a code sketch follows):

Document → Generate enrichment_key(title + body) → Check cache (valid entries only)
  → Hit:  Use enrich processor
  → Miss: Call LLM → Store in cache → Apply inline
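
A minimal sketch of that per-document decision in C#, assuming hypothetical helper names (GenerateEnrichmentKey, _validEntries, CallLlmAsync, StoreInCacheAsync, ApplyAiFields are illustrative, not the PR's actual API):

var enrichmentKey = GenerateEnrichmentKey(doc.Title, doc.Body);

if (_validEntries.Contains(enrichmentKey))
{
    // Cache hit: nothing to do inline; the enrich processor applies the cached
    // fields from docs-ai-enriched-fields-cache at index time.
}
else
{
    // Cache miss (new or stale content): call the LLM, persist the result
    // to the cache index, and apply the fields inline.
    var fields = await CallLlmAsync(doc, ct);
    await StoreInCacheAsync(enrichmentKey, fields, ct);
    ApplyAiFields(doc, fields);
}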

Enrichment key construction

The enrichment key is a content-only SHA-256 hash (no prompt hash):

input = title + stripped_body
normalized = regex_replace(input, "[^a-zA-Z0-9]", "").ToLowerInvariant()
enrichment_key = SHA256(normalized)

Why content-only keys?

  • Same content always maps to the same cache entry
  • Allows stale entries (old prompt) to be overwritten rather than orphaned
  • Prompt versioning handled separately via prompt_hash field in cache entries

Why aggressive normalization?

  • Strips all whitespace, punctuation, markdown syntax
  • Only keeps alphanumeric characters, lowercased
  • Result: Formatting changes don't invalidate the cache, only content changes trigger re-enrichment

Example:

"# Hello, World!"  →  "helloworld"  →  SHA256  →  "a591a6d40..."
"Hello World"      →  "helloworld"  →  SHA256  →  "a591a6d40..."  (same!)

Prompt versioning (automatic cache invalidation)

Each cache entry stores the prompt_hash it was generated with. On startup:

  1. Load all cache entries with their prompt_hash values
  2. Compare each entry's prompt_hash to current prompt hash
  3. Matching: Entry is valid → cache hit
  4. Mismatched: Entry is stale → treated as non-existent → regenerated

Startup:
Load 5000 entries: 4200 valid (current prompt), 800 stale (will be refreshed)

Document processing:
if enrichment_key in valid_entries → cache hit
else → cache miss (new OR stale) → generate + overwrite

Result:

  • Prompt changes trigger automatic, gradual re-enrichment
  • No manual version bumping needed
  • Stale entries are overwritten (not orphaned) since keys are content-based
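
A sketch of the startup validity filter under those rules (variable names and the ComputePromptHash helper are illustrative, not from this PR):

// Only entries generated with the current prompt hash count as cache hits;
// everything else is treated as missing and will be regenerated and overwritten.
var currentPromptHash = ComputePromptHash(promptText);
var validEntries = allCacheEntries
    .Where(e => e.PromptHash == currentPromptHash)
    .Select(e => e.EnrichmentKey)
    .ToHashSet();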

Gradual rollout

To avoid a burst of LLM calls on every deployment, enrichment is rate-limited:

Setting                               Value
Enrichments per run (new + stale)     100
Concurrent LLM calls                  10
Retry on 429                          Exponential backoff (5 retries)

Timeline to full enrichment:

Documents    Per run    Runs needed    At hourly deploys
~12,000      100        120            ~5 days

Cache hits don't count against the limit, so once a document is cached it is enriched instantly on every subsequent run.
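
A sketch of how the rate limiting could be wired up, assuming hypothetical names (Doc, CallLlmAndCacheAsync, and TooManyRequestsException are illustrative; the limits mirror the table above):

private readonly SemaphoreSlim _llmConcurrency = new(10);   // at most 10 concurrent LLM calls
private int _enrichmentCount;

private async Task<bool> TryEnrichAsync(Doc doc, CancellationToken ct)
{
    // Per-run budget for new + stale enrichments; cache hits never reach this path.
    if (Interlocked.Increment(ref _enrichmentCount) > 100)
        return false; // over budget; the document is picked up on a later run

    await _llmConcurrency.WaitAsync(ct);
    try
    {
        for (var attempt = 0; attempt < 5; attempt++)
        {
            try { return await CallLlmAndCacheAsync(doc, ct); }
            catch (TooManyRequestsException)
            {
                // Exponential backoff on 429: 1s, 2s, 4s, 8s, 16s.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)), ct);
            }
        }
        return false;
    }
    finally
    {
        _llmConcurrency.Release();
    }
}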

Usage

docs-builder assembler index --enable-ai-enrichment

@reakaleek reakaleek self-assigned this Dec 22, 2025
@reakaleek reakaleek marked this pull request as ready for review December 22, 2025 09:06
@reakaleek reakaleek requested a review from a team as a code owner December 22, 2025 09:06
@reakaleek reakaleek requested a review from Mpdreamz December 22, 2025 09:06
/// <summary>
/// Initializes the cache, including any index bootstrapping and preloading.
/// </summary>
Task InitializeAsync(CancellationToken ct);
Member:

I think we could get rid of client side caches by using enrichment indices and enrich processors:

https://www.elastic.co/docs/manage-data/ingest/transform-enrich/example-enrich-data-based-on-exact-values

That way whenever we index data into semantic-*-* indices they will be enriched with the ai fields.

That would scale better than loading all enrichments into memory, and it keeps data locality in Elasticsearch.
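
For reference, a rough sketch of what that could look like with Elasticsearch's enrich APIs (the policy, pipeline, and target field names below are illustrative, not from this PR):

PUT /_enrich/policy/docs-ai-enrichment-policy
{
  "match": {
    "indices": "docs-ai-enriched-fields-cache",
    "match_field": "enrichment_key",
    "enrich_fields": ["ai_rag_optimized_summary", "ai_short_summary", "ai_search_query", "ai_questions", "ai_use_cases"]
  }
}

POST /_enrich/policy/docs-ai-enrichment-policy/_execute

PUT /_ingest/pipeline/docs-ai-enrichment
{
  "processors": [
    { "enrich": { "policy_name": "docs-ai-enrichment-policy", "field": "enrichment_key", "target_field": "ai" } }
  ]
}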

Member Author:

Thank you for the hint! Will try it now


/// Version number for cache entries. Bump to trigger gradual re-enrichment.
/// Using int allows future range queries (e.g., re-enrich all entries below version 5).
/// </summary>
public int PromptVersion { get; init; } = 1;
Member:

Could this be PromptHash that automatically breaks when we adjust our prompt?

That way our indices would self-heal even if we forget to bump PromptVersion.

Member Author:

I was thinking of this.

My thinking was to separate this, so we can add small prompt changes without necessarily invalidating all the cache entries.

But thinking of it now, it makes sense to just use the hash.
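
A minimal sketch of deriving the hash from the prompt text itself (names are illustrative): any edit to the prompt changes the hash, so stale cache entries are detected automatically without a manual version bump.

var promptHash = Convert.ToHexString(
    SHA256.HashData(Encoding.UTF8.GetBytes(promptText))).ToLowerInvariant();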

@Mpdreamz (Member) left a comment:

A few more nitpicks :)

It might make sense to have a short-lived EnrichmentChannel to buffer those 200 enrichments and send them with _bulk, especially if we decide to up that number.

Or build an enrichment-only index mode that we can run as an offline/one-off process for many hours.

var pipeline = EnrichPolicyManager.PipelineName;
var url = $"/{indexAlias}/_update_by_query?pipeline={pipeline}&timeout=10m";

var response = await WithRetryAsync(
Member:

This should be a background task, and we need to poll the task status, since this is potentially very long-running (1 minute locally).
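
For reference, a sketch of the submit-and-poll flow (wait_for_completion=false and the _tasks API are standard Elasticsearch; the task id shown here is made up):

POST /{indexAlias}/_update_by_query?pipeline={pipeline}&wait_for_completion=false
→ { "task": "oTUltX4IQMOUUVeiohTt8A:12345" }

GET /_tasks/oTUltX4IQMOUUVeiohTt8A:12345
→ { "completed": false, ... }   // poll until "completed" is true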


private async ValueTask BackfillMissingAiFieldsAsync(Cancel ctx)
{
// Why backfill is needed:
// The exporter uses hash-based upsert - unchanged documents are skipped during indexing.
Member:

I think if we use the EnrichmentKey in the AssignMetadata content hash we will break the content hash and force an update, rendering this backfill unnecessary?

Member Author:

This case is already covered... the EnrichmentKey is basically a more aggressive content hash.

The problem is when the content doesn't change but we gradually add AI fields over time.

I'm not sure yet.. but maybe we can remove the backfill when all documents already have AI fields.

Member Author:

This is why we are targeting documents that don't have AI fields yet.
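
For illustration, the backfill query could look roughly like this (the field used to detect missing AI fields is an assumption; the URL shape mirrors the snippet quoted above):

POST /{indexAlias}/_update_by_query?pipeline={pipeline}&timeout=10m
{
  "query": {
    "bool": {
      "must_not": { "exists": { "field": "ai_rag_optimized_summary" } }
    }
  }
}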


/// Document structure for the enrichment cache index.
/// Fields are stored directly for use with the enrich processor.
/// </summary>
public sealed record CacheIndexEntry
Member:

Let's add Url, super handy when debugging the data in Discover.


public async Task InitializeAsync(CancellationToken ct)
{
await EnsureIndexExistsAsync(ct);
await LoadExistingHashesAsync(ct);
Member:

I don't think we should load all hashes into memory just to check if we can enrich 100-200 documents at a time; the LLM client can do a DocExists() call.

Member Author:

I think I had this before. But it felt like I got more 429s, and the execution was slower because it would do a lookup for every document.

But maybe I'm misunderstanding this comment.

Member:

Cool cool, I don't think we should block the PR on this. Will think about an alternative approach that does not require reading all hashes into memory separately.

The sooner we get this running to gather enrichments the better :) although maybe let's not deploy this till Monday 😸


// Check if we've hit the limit for enrichments
var current = Interlocked.Increment(ref _enrichmentCount);
if (current > _enrichmentOptions.MaxNewEnrichmentsPerRun)
Member:

Move this check above the Exists() check (especially if it does IO to perform the existence check).

Member Author:

The order is intentional. Exists() is an in-memory dictionary lookup (as of now). Cache hits don't call the LLM, so they shouldn't count against the limit. If we checked the limit first, we'd block documents that already have cached enrichments. The limit caps LLM calls, not total enrichments.
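
A small sketch of that ordering (the cache lookup name is illustrative; MaxNewEnrichmentsPerRun comes from the quoted PR code):

// In-memory lookup first: cached documents never touch the per-run LLM budget.
if (_cache.Exists(enrichmentKey))
    return; // cache hit, the enrich processor applies the fields

var current = Interlocked.Increment(ref _enrichmentCount);
if (current > _enrichmentOptions.MaxNewEnrichmentsPerRun)
    return; // per-run budget for LLM calls exhausted

// ... call the LLM ...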
