
Conversation

@reakaleek (Member) commented Dec 22, 2025

Adds LLM-powered enrichment to documentation during indexing, generating metadata optimized for semantic search and RAG.

What it does

Each document gets enriched with:

  • ai_rag_optimized_summary - Dense technical summary for vector search
  • ai_short_summary - 5-10 word tooltip
  • ai_search_query - Keywords a dev would search
  • ai_questions - Questions this doc answers
  • ai_use_cases - Simple tasks like "bulk index documents"
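
As a rough illustration, the enriched payload could be modeled like this. This is a sketch only; the record and property names are illustrative and not taken from this PR, only the field semantics come from the list above.

// Illustrative sketch of the enrichment payload; the actual types/names in the PR may differ.
public sealed record AiEnrichedFields
{
    public string AiRagOptimizedSummary { get; init; } = "";                              // dense technical summary for vector search
    public string AiShortSummary { get; init; } = "";                                     // 5-10 word tooltip
    public string AiSearchQuery { get; init; } = "";                                      // keywords a developer would search
    public IReadOnlyList<string> AiQuestions { get; init; } = Array.Empty<string>();      // questions this doc answers
    public IReadOnlyList<string> AiUseCases { get; init; } = Array.Empty<string>();       // simple tasks like "bulk index documents"
}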

How enrichment works

Hybrid approach:

  • Cache hit: Enrich processor applies fields at index time (no LLM call)
  • Cache miss (new or stale content): Call LLM, store in cache, apply fields inline
  • Backfill: Documents skipped by hash-based upsert get AI fields via _update_by_query
  • Cache index: docs-ai-enriched-fields-cache stores LLM responses with enrichment_key and prompt_hash
  • Pre-loaded at startup: Only entries with the current prompt hash are loaded as valid

Per-document flow (a code sketch follows):

Document → Generate enrichment_key(title + body) → Check cache (valid entries only)
  → Hit:  Use enrich processor
  → Miss: Call LLM → Store in cache → Apply inline
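
A minimal sketch of that per-document decision in C#, assuming hypothetical helper names (GenerateEnrichmentKey, _validEntries, CallLlmAsync, StoreInCacheAsync, ApplyAiFields are illustrative, not the PR's actual API):

var enrichmentKey = GenerateEnrichmentKey(doc.Title, doc.Body);

if (_validEntries.Contains(enrichmentKey))
{
    // Cache hit: nothing to do inline; the enrich processor applies the cached
    // fields from docs-ai-enriched-fields-cache at index time.
}
else
{
    // Cache miss (new or stale content): call the LLM, persist the result
    // to the cache index, and apply the fields inline.
    var fields = await CallLlmAsync(doc, ct);
    await StoreInCacheAsync(enrichmentKey, fields, ct);
    ApplyAiFields(doc, fields);
}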

Enrichment key construction

The enrichment key is a content-only SHA-256 hash (no prompt hash):

input = title + stripped_body
normalized = regex_replace(input, "[^a-zA-Z0-9]", "").ToLowerInvariant()
enrichment_key = SHA256(normalized)

Why content-only keys?

  • Same content always maps to the same cache entry
  • Allows stale entries (old prompt) to be overwritten rather than orphaned
  • Prompt versioning handled separately via prompt_hash field in cache entries

Why aggressive normalization?

  • Strips all whitespace, punctuation, markdown syntax
  • Only keeps alphanumeric characters, lowercased
  • Result: Formatting changes don't invalidate the cache, only content changes trigger re-enrichment

Example:

"# Hello, World!"  →  "helloworld"  →  SHA256  →  "a591a6d40..."
"Hello World"      →  "helloworld"  →  SHA256  →  "a591a6d40..."  (same!)

Prompt versioning (automatic cache invalidation)

Each cache entry stores the prompt_hash it was generated with. On startup:

  1. Load all cache entries with their prompt_hash values
  2. Compare each entry's prompt_hash to current prompt hash
  3. Matching: Entry is valid → cache hit
  4. Mismatched: Entry is stale → treated as non-existent → regenerated

Startup:
Load 5000 entries: 4200 valid (current prompt), 800 stale (will be refreshed)

Document processing:
if enrichment_key in valid_entries → cache hit
else → cache miss (new OR stale) → generate + overwrite

Result:

  • Prompt changes trigger automatic, gradual re-enrichment
  • No manual version bumping needed
  • Stale entries are overwritten (not orphaned) since keys are content-based
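
A sketch of the startup validity filter under those rules (variable names and the ComputePromptHash helper are illustrative, not from this PR):

// Only entries generated with the current prompt hash count as cache hits;
// everything else is treated as missing and will be regenerated and overwritten.
var currentPromptHash = ComputePromptHash(promptText);
var validEntries = allCacheEntries
    .Where(e => e.PromptHash == currentPromptHash)
    .Select(e => e.EnrichmentKey)
    .ToHashSet();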

Gradual rollout

To avoid a burst of LLM calls on every deployment, enrichment is rate-limited:

Setting                               Value
Enrichments per run (new + stale)     100
Concurrent LLM calls                  10
Retry on 429                          Exponential backoff (5 retries)

Timeline to full enrichment:

Documents    Per run    Runs needed    At hourly deploys
~12,000      100        120            ~5 days

Cache hits don't count against the limit, so once a document is cached it is enriched instantly on every subsequent run.
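
A sketch of how the rate limiting could be wired up, assuming hypothetical names (Doc, CallLlmAndCacheAsync, and TooManyRequestsException are illustrative; the limits mirror the table above):

private readonly SemaphoreSlim _llmConcurrency = new(10);   // at most 10 concurrent LLM calls
private int _enrichmentCount;

private async Task<bool> TryEnrichAsync(Doc doc, CancellationToken ct)
{
    // Per-run budget for new + stale enrichments; cache hits never reach this path.
    if (Interlocked.Increment(ref _enrichmentCount) > 100)
        return false; // over budget; the document is picked up on a later run

    await _llmConcurrency.WaitAsync(ct);
    try
    {
        for (var attempt = 0; attempt < 5; attempt++)
        {
            try { return await CallLlmAndCacheAsync(doc, ct); }
            catch (TooManyRequestsException)
            {
                // Exponential backoff on 429: 1s, 2s, 4s, 8s, 16s.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)), ct);
            }
        }
        return false;
    }
    finally
    {
        _llmConcurrency.Release();
    }
}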

Usage

docs-builder assembler index --enable-ai-enrichment

@reakaleek reakaleek self-assigned this Dec 22, 2025
@reakaleek reakaleek marked this pull request as ready for review December 22, 2025 09:06
@reakaleek reakaleek requested a review from a team as a code owner December 22, 2025 09:06
@reakaleek reakaleek requested a review from Mpdreamz December 22, 2025 09:06
/// <summary>
/// Initializes the cache, including any index bootstrapping and preloading.
/// </summary>
Task InitializeAsync(CancellationToken ct);
Member:

I think we could get rid of client side caches by using enrichment indices and enrich processors:

https://www.elastic.co/docs/manage-data/ingest/transform-enrich/example-enrich-data-based-on-exact-values

That way whenever we index data into semantic-*-* indices they will be enriched with the ai fields.

That would scale better than loading all enrichments into memory, and it keeps data locality in Elasticsearch.
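
For reference, a rough sketch of what that could look like with Elasticsearch's enrich APIs (the policy, pipeline, and target field names below are illustrative, not from this PR):

PUT /_enrich/policy/docs-ai-enrichment-policy
{
  "match": {
    "indices": "docs-ai-enriched-fields-cache",
    "match_field": "enrichment_key",
    "enrich_fields": ["ai_rag_optimized_summary", "ai_short_summary", "ai_search_query", "ai_questions", "ai_use_cases"]
  }
}

POST /_enrich/policy/docs-ai-enrichment-policy/_execute

PUT /_ingest/pipeline/docs-ai-enrichment
{
  "processors": [
    { "enrich": { "policy_name": "docs-ai-enrichment-policy", "field": "enrichment_key", "target_field": "ai" } }
  ]
}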

Member Author:

Thank you for the hint! Will try it now


/// Version number for cache entries. Bump to trigger gradual re-enrichment.
/// Using int allows future range queries (e.g., re-enrich all entries below version 5).
/// </summary>
public int PromptVersion { get; init; } = 1;
Member:

Could this be PromptHash that automatically breaks when we adjust our prompt?

That way our indices would self-heal even if we forget to bump PromptVersion.

Member Author:

I was thinking of this.

My thinking was to separate this, so we can add small prompt changes without necessarily invalidating all the cache entries.

But thinking of it now, it makes sense to just use the hash.
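
A minimal sketch of deriving the hash from the prompt text itself (names are illustrative): any edit to the prompt changes the hash, so stale cache entries are detected automatically without a manual version bump.

var promptHash = Convert.ToHexString(
    SHA256.HashData(Encoding.UTF8.GetBytes(promptText))).ToLowerInvariant();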

@Mpdreamz (Member) left a comment:

A few more nitpicks :)

It might make sense to have a short-lived EnrichmentChannel to buffer those 200 enrichments and send them with _bulk, especially if we decide to up that number.

Or build an enrichment-only index mode that we can run as an offline/one-off process for many hours.

var pipeline = EnrichPolicyManager.PipelineName;
var url = $"/{indexAlias}/_update_by_query?pipeline={pipeline}&timeout=10m";

var response = await WithRetryAsync(
Member:

This should be a background task, and we need to poll the task status, since this is potentially very long-running (1 minute locally).
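
For reference, a sketch of the submit-and-poll flow (wait_for_completion=false and the _tasks API are standard Elasticsearch; the task id shown here is made up):

POST /{indexAlias}/_update_by_query?pipeline={pipeline}&wait_for_completion=false
→ { "task": "oTUltX4IQMOUUVeiohTt8A:12345" }

GET /_tasks/oTUltX4IQMOUUVeiohTt8A:12345
→ { "completed": false, ... }   // poll until "completed" is true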


private async ValueTask BackfillMissingAiFieldsAsync(Cancel ctx)
{
// Why backfill is needed:
// The exporter uses hash-based upsert - unchanged documents are skipped during indexing.
Member:

I think if we use the EnrichmentKey in the AssignMetadata content hash we will break the content hash and force an update, rendering this backfill unnecessary?

Member Author:

This case is already covered... the EnrichmentKey is basically a more aggressive content hash.

The problem is when the content doesn't change but we gradually add AI fields over time.

I'm not sure yet.. but maybe we can remove the backfill when all documents already have AI fields.

Member Author:

This is why we are targeting documents that don't have AI fields yet.
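
For illustration, the backfill query could look roughly like this (the field used to detect missing AI fields is an assumption; the URL shape mirrors the snippet quoted above):

POST /{indexAlias}/_update_by_query?pipeline={pipeline}&timeout=10m
{
  "query": {
    "bool": {
      "must_not": { "exists": { "field": "ai_rag_optimized_summary" } }
    }
  }
}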


/// Document structure for the enrichment cache index.
/// Fields are stored directly for use with the enrich processor.
/// </summary>
public sealed record CacheIndexEntry
Member:

Let's add Url, super handy when debugging the data in Discover.


public async Task InitializeAsync(CancellationToken ct)
{
await EnsureIndexExistsAsync(ct);
await LoadExistingHashesAsync(ct);
Member:

I don't think we should load all hashes into memory just to check if we can enrich 100-200 documents at a time; the LLM client can do a DocExists() call.

Member Author:

I think I had this before. But it felt like I got more 429s, and the execution was slower because it would do a lookup for every document.

But maybe I'm misunderstanding this comment.

Member:

Cool cool, I don't think we should block the PR on this. Will think about an alternative approach that does not require reading all hashes into memory separately.

The sooner we get this running to gather enrichments the better :) although maybe let's not deploy this till Monday 😸


// Check if we've hit the limit for enrichments
var current = Interlocked.Increment(ref _enrichmentCount);
if (current > _enrichmentOptions.MaxNewEnrichmentsPerRun)
Member:

Move this check above the Exists() check (especially if it does IO to perform the existence check).

Member Author:

The order is intentional. Exists() is an in-memory dictionary lookup (as of now). Cache hits don't call the LLM, so they shouldn't count against the limit. If we checked the limit first, we'd block documents that already have cached enrichments. The limit caps LLM calls, not total enrichments.
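
A small sketch of that ordering (the cache lookup name is illustrative; MaxNewEnrichmentsPerRun comes from the quoted PR code):

// In-memory lookup first: cached documents never touch the per-run LLM budget.
if (_cache.Exists(enrichmentKey))
    return; // cache hit, the enrich processor applies the fields

var current = Interlocked.Increment(ref _enrichmentCount);
if (current > _enrichmentOptions.MaxNewEnrichmentsPerRun)
    return; // per-run budget for LLM calls exhausted

// ... call the LLM ...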
