Add AI-enriched fields #2396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Conversation
src/Elastic.Markdown/Exporters/Elasticsearch/Enrichment/ElasticsearchEnrichmentCache.cs
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
/// <summary>
/// Initializes the cache, including any index bootstrapping and preloading.
/// </summary>
Task InitializeAsync(CancellationToken ct);
I think we could get rid of the client-side caches by using enrichment indices and enrich processors.
That way, whenever we index data into the `semantic-*-*` indices, it gets enriched with the AI fields.
That would scale better than loading all enrichments into memory, and it keeps data locality in Elasticsearch.
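A rough sketch of what that enrich-policy setup could look like, in Kibana Console style (the policy, pipeline, and target field names here are illustrative assumptions, not the PR's actual naming):

```
PUT /_enrich/policy/ai-fields-policy
{
  "match": {
    "indices": "docs-ai-enriched-fields-cache",
    "match_field": "enrichment_key",
    "enrich_fields": [
      "ai_rag_optimized_summary", "ai_short_summary",
      "ai_search_query", "ai_questions", "ai_use_cases"
    ]
  }
}

POST /_enrich/policy/ai-fields-policy/_execute

PUT /_ingest/pipeline/ai-fields-enrich
{
  "processors": [
    {
      "enrich": {
        "policy_name": "ai-fields-policy",
        "field": "enrichment_key",
        "target_field": "ai"
      }
    }
  ]
}
```

With a pipeline like this set as the indices' default pipeline, documents would pick up cached AI fields at index time.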
Thank you for the hint! Will try it now
/// Version number for cache entries. Bump to trigger gradual re-enrichment.
/// Using int allows future range queries (e.g., re-enrich all entries below version 5).
/// </summary>
public int PromptVersion { get; init; } = 1;
Could this be a `PromptHash` that automatically invalidates entries when we adjust our prompt?
That way our indices would self-heal even if we forget to bump `PromptVersion`.
I was thinking of this.
My thinking was to keep them separate, so we can make small prompt changes without necessarily invalidating all the cache entries.
But thinking about it now, it makes sense to just use the hash.
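What a `PromptHash`-based invalidation could look like, sketched in Python rather than the PR's C# (names are adapted): the hash is derived from the prompt text itself, so any prompt edit automatically invalidates old cache entries.

```python
import hashlib

def prompt_hash(prompt_template: str) -> str:
    """Stable fingerprint of the prompt; changes whenever the prompt text changes."""
    return hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()[:16]

def is_entry_current(entry_prompt_hash: str, current_prompt: str) -> bool:
    # Entries generated under an older prompt no longer match and get re-enriched.
    return entry_prompt_hash == prompt_hash(current_prompt)
```

This removes the "forgot to bump `PromptVersion`" failure mode, at the cost of invalidating the cache even for trivial prompt tweaks.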
Mpdreamz left a comment:
A few more nitpicks :)
It might make sense to have a short-lived EnrichmentChannel to buffer those 200 enrichments and `_bulk` them up, especially if we decide to raise that number.
Or build an enrichment-only index mode that we can run as an offline/one-off process for many hours.
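The buffering idea could be sketched like this (Python rather than C#; `flush_fn` stands in for the real `_bulk` call, and the default of 200 mirrors the batch size mentioned above, both assumptions):

```python
from typing import Callable

class EnrichmentBuffer:
    """Buffers enrichment docs and flushes them in batches, e.g. via _bulk."""

    def __init__(self, flush_fn: Callable[[list], None], batch_size: int = 200):
        self._flush_fn = flush_fn
        self._batch_size = batch_size
        self._pending: list = []

    def add(self, doc: dict) -> None:
        # Accumulate until a full batch is ready, then flush in one request.
        self._pending.append(doc)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        # Final call at end of run drains any partial batch.
        if self._pending:
            self._flush_fn(self._pending)
            self._pending = []
```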
var pipeline = EnrichPolicyManager.PipelineName;
var url = $"/{indexAlias}/_update_by_query?pipeline={pipeline}&timeout=10m";

var response = await WithRetryAsync(
This should be a background task, and we need to poll the task status, since this is potentially very long-running (about a minute locally).
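The background-task pattern Elasticsearch supports for this is to submit `_update_by_query` with `wait_for_completion=false`, which returns a task id immediately, and then poll the Tasks API (index and pipeline names here are illustrative):

```
POST /semantic-docs/_update_by_query?pipeline=ai-fields-enrich&wait_for_completion=false

GET /_tasks/<task_id_from_response>
```

Polling `GET /_tasks/<task_id>` until the response's `completed` flag is `true` avoids holding a single HTTP request open for the whole run.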
private async ValueTask BackfillMissingAiFieldsAsync(Cancel ctx)
{
    // Why backfill is needed:
    // The exporter uses hash-based upsert - unchanged documents are skipped during indexing.
I think if we use the EnrichmentKey in the AssignMetadata content hash, we will break the content hash and force an update, rendering this backfill unnecessary?
This case is already covered: the EnrichmentKey is basically a more aggressive content hash.
The problem is when the content doesn't change and we gradually add AI fields over time.
I'm not sure yet, but maybe we can remove the backfill once all documents already have AI fields.
This is why we target elements that don't have AI fields yet.
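Targeting only documents that still lack AI fields can be expressed with a `must_not`/`exists` clause in the `_update_by_query` body (the field name follows the description in this PR; the index and pipeline names are illustrative):

```
POST /semantic-docs/_update_by_query?pipeline=ai-fields-enrich&timeout=10m
{
  "query": {
    "bool": {
      "must_not": { "exists": { "field": "ai_rag_optimized_summary" } }
    }
  }
}
```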
/// Document structure for the enrichment cache index.
/// Fields are stored directly for use with the enrich processor.
/// </summary>
public sealed record CacheIndexEntry
Let's add Url, super handy when debugging the data in Discover.
public async Task InitializeAsync(CancellationToken ct)
{
    await EnsureIndexExistsAsync(ct);
    await LoadExistingHashesAsync(ct);
I don't think we should load all hashes into memory just to check whether we can enrich 100-200 documents at a time; the LLM client can do a `DocExists()` call.
I think I had this before, but it felt like I got more 429s, and the execution was slower because it did a lookup for every document.
But maybe I'm misunderstanding this comment.
Cool cool, I don't think we should block the PR on this. I will think separately about an alternative approach that does not require reading all hashes into memory.
The sooner we get this running to gather enrichments the better :) although maybe let's not deploy this until Monday 😸
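One possible shape for that alternative, as a hedged sketch: instead of preloading every hash or issuing one lookup per document, resolve cache hits per batch with a single round-trip (for example an `_mget` or `terms` query against the cache index, represented here by the `fetch_existing_keys` callable, which is an assumption):

```python
from typing import Callable

def split_hits_and_misses(
    keys: list[str],
    fetch_existing_keys: Callable[[list[str]], set[str]],
) -> tuple[list[str], list[str]]:
    """One round-trip per batch of 100-200 keys: cheaper than one lookup per
    document (more 429s) and lighter than preloading every hash into memory."""
    existing = fetch_existing_keys(keys)
    hits = [k for k in keys if k in existing]
    misses = [k for k in keys if k not in existing]
    return hits, misses
```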
// Check if we've hit the limit for enrichments
var current = Interlocked.Increment(ref _enrichmentCount);
if (current > _enrichmentOptions.MaxNewEnrichmentsPerRun)
Move this check above the `Exists()` check (especially if it does IO to perform the existence check).
The order is intentional. Exists() is an in-memory dictionary lookup (as of now). Cache hits don't call the LLM, so they shouldn't count against the limit. If we checked the limit first, we'd block documents that already have cached enrichments. The limit caps LLM calls, not total enrichments.
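The ordering argument in miniature, as a Python sketch of the C# logic (names are adapted; `itertools.count` stands in for `Interlocked.Increment`):

```python
import itertools

class EnrichmentGate:
    """Cache hits bypass the limit; only new LLM calls count against it."""

    def __init__(self, cached_keys: set[str], max_new_enrichments: int):
        self._cached = cached_keys
        self._max = max_new_enrichments
        self._counter = itertools.count(1)

    def should_enrich(self, key: str) -> str:
        if key in self._cached:              # cheap in-memory lookup first
            return "cache-hit"
        if next(self._counter) > self._max:  # limit caps LLM calls only
            return "skipped"
        return "enrich"
```

Checking the limit first would wrongly block documents whose enrichments are already cached.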
Adds LLM-powered enrichment to documentation during indexing, generating metadata optimized for semantic search and RAG.
What it does
Each document gets enriched with:
- `ai_rag_optimized_summary` - Dense technical summary for vector search
- `ai_short_summary` - 5-10 word tooltip
- `ai_search_query` - Keywords a dev would search
- `ai_questions` - Questions this doc answers
- `ai_use_cases` - Simple tasks like "bulk index documents"

How enrichment works
Hybrid approach:
- Existing documents are enriched in place via `_update_by_query`
- A cache index, `docs-ai-enriched-fields-cache`, stores LLM responses with `enrichment_key` and `prompt_hash`

Enrichment key construction
The enrichment key is a content-only SHA-256 hash (no prompt hash):
Why content-only keys?
- Prompt changes are handled separately, via the `prompt_hash` field in cache entries

Why aggressive normalization?
Example:
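A sketch of such a content-only key (the exact normalization rules here, whitespace collapsing and lowercasing, are assumptions based on the "aggressive normalization" described above):

```python
import hashlib
import re

def enrichment_key(title: str, body: str) -> str:
    """Content-only key: whitespace-insensitive, case-insensitive SHA-256."""
    normalized = re.sub(r"\s+", " ", f"{title}\n{body}").strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Cosmetic edits (reflowed whitespace, casing changes) then produce the same key and keep their cached enrichment.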
Prompt versioning (automatic cache invalidation)
Each cache entry stores the `prompt_hash` it was generated with. On startup:
- Load entries along with their `prompt_hash` values
- Compare each stored `prompt_hash` to the current prompt hash; entries that don't match are treated as stale

Startup:
Load 5000 entries: 4200 valid (current prompt), 800 stale (will be refreshed)
Document processing:
if enrichment_key in valid_entries → cache hit
else → cache miss (new OR stale) → generate + overwrite
Result: stale entries are regenerated gradually, while entries matching the current prompt keep serving cache hits.
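The startup and lookup flow above, sketched in Python (the entry shape with `enrichment_key` and `prompt_hash` fields follows the cache-index description; the details are assumptions):

```python
def partition_entries(entries: list[dict], current_prompt_hash: str) -> tuple[set, set]:
    """Split cached enrichment keys into valid (current prompt) and stale."""
    valid = {e["enrichment_key"] for e in entries
             if e["prompt_hash"] == current_prompt_hash}
    stale = {e["enrichment_key"] for e in entries} - valid
    return valid, stale

def process(key: str, valid: set) -> str:
    # A cache miss covers both brand-new keys and stale (old-prompt) entries.
    return "cache-hit" if key in valid else "generate+overwrite"
```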
Gradual rollout
To avoid deployment explosions, enrichment is rate-limited: each run generates at most `MaxNewEnrichmentsPerRun` new LLM enrichments.
Timeline to full enrichment:
Cache hits don't count against the limit, so once cached, all documents enrich instantly.
Usage