feat(kb): user metadata on ingestion + retrieval#12921
feat(kb): user metadata on ingestion + retrieval#12921keval718 wants to merge 8 commits intofeat/kb-v1-db-connectorsfrom
Conversation
…ault Rename "Configure Sources" heading to "Ingest Content" and the dropdown trigger from "Add Sources" to "Add Files". Drop the "Hide Configuration" footer toggle so the ingest section is always visible. Chunk size, overlap, and separator inputs render disabled until at least one source is added. Match the Add Files focus-visible ring to border-input so the Radix-managed focus return no longer flashes a black outline.
Flag the Google Generative AI and IBM WatsonX embedding model entries in the unified model catalog as deprecated. They are not natively supported by Knowledge Base ingestion, so users were attempting invalid configurations. The KB upload picker and Model Providers modal now fetch with include_deprecated=true and render a "Deprecated" badge so the models stay visible but are clearly flagged as unsupported.
Only mark text-embedding-004 and embedding-001 as deprecated; the gemini-embedding-001 model is still served by the Google Generative AI v1beta endpoint, so leave it visible without the badge.
Surface the per-chunk user-metadata pipeline that already lives on the KB
infra branch. Adds the API + UI to collect tags at ingest time, persist
them on the ingestion_run row, filter the chunks browser by them, and
narrow KnowledgeBaseComponent retrieval through them.
Backend
- POST /{kb}/ingest gains optional metadata + per_file_metadata multipart
fields. POST /{kb}/ingest/folder gains the same fields on its JSON body.
- New parse_user_metadata / parse_per_file_metadata helpers enforce ≤16
keys, lowercase ^[a-z0-9_]{1,32}$ keys, ≤256-char values, ≤16-item
string arrays, and reject the reserved chunk-internal keys.
- FileUploadSource and FolderSource merge per-file overrides into each
IngestionItem.source_metadata; perform_ingestion now lets per-item
values win over run-level on key collision.
- ingestion_run grows a user_metadata JSON column (alembic
16a290ab1332, EXPAND, server_default '{}').
- GET /{kb}/chunks accepts repeating meta_<key>=<value> params; values
AND across keys, OR within each key. Sidesteps the global comma-split
query middleware that breaks JSON-blob query params.
- KnowledgeBaseComponent gains a metadata_filter input that post-filters
the similarity_search results client-side until per-backend
translators land.
Frontend
- New MetadataEditor renders inside the KB upload modal under Chunking
Settings for run-level tags, and inside FilesPanel as an expandable
per-file override.
- Chunks browser gets ChunksMetadataFilter (popover + chips) wired into
useGetKnowledgeBaseChunks via the new metadata_filter param map.
Tests
- 29 validator tests for the rule set, 4 ingestion-source tests for
per-file propagation + describe counts, integration tests for the
chunks endpoint AND/OR filter, 9 retrieval-side helper tests for
parse + match, 8 frontend MetadataEditor tests.
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
✅ Migration Validation Passed All migrations follow the Expand-Contract pattern correctly. |
Surface run-level tags + per-file override count in the Step 2 Review summary so users can confirm what is about to ship onto each chunk before clicking Create. Empty metadata stays invisible to keep the summary tight on the common no-tag path.
…to feat/kb-ingest-metadata # Conflicts: # src/frontend/src/modals/knowledgeBaseUploadModal/__tests__/KnowledgeBaseUploadModal.test.tsx # src/frontend/src/modals/knowledgeBaseUploadModal/hooks/useKnowledgeBaseForm.ts
Summary
Lands user-supplied metadata across the Knowledge Base ingestion + retrieval surface, plus a set of UX cleanups on the upload modal and the embedding-model picker. Closes the long-standing gap where users had no way to attach tags or custom fields to ingested content, and fixes the smaller papercuts (deprecated embeddings cluttering the picker, the Hide Configuration toggle hiding fields users actually want to see, the dropdown trigger flashing a black focus ring after Radix returned focus).
The metadata pipeline is wired end-to-end through:
/ingestand JSON/ingest/folderAPI routes,FileUploadSource,FolderSource),source_metadatablob eachIngestionItemalready produces onfeat/kb-v1-db-connectors,user_metadatacolumn oningestion_runfor run-history rendering,/{kb}/chunksbrowser endpoint (filter chips),KnowledgeBaseComponentretrieval node (canvas-side filter input).Built on top of #12802 (
feat/kb-v1-db-connectors) so the metadata work stays stacked on the registry/backends infrastructure overhaul rather than fanning out intorelease-1.10.0and racing the same files.Closes the "Allow users to provide metadata on ingested content" ticket.
What Users Can Do Now
source_metadata.confidential=truefile inside an otherwise public batch).metadata_filter={"tag": "invoice"}on theKnowledgeBaseComponent's advanced section. Single value or array of values per key; missing-or-malformed JSON falls through to the unfiltered path so a typo on the canvas never breaks a flow run.user_metadatafield onIngestionRunInfo).UX Cleanups Bundled In
The following items were originally separate fixes on
fix/kb-ingest-content-namingand have been merged into this PR so the upload-modal surface ships in one consistent shape:focus-visible:ring-ringflashed a black 1px outline on click → close. Trigger now usesfocus-visible:ring-inputso the keyboard-focus indicator matches the surrounding border colour.google_generative_ai_constants.py,watsonx_constants.py) — they are not natively supported by KB ingestion and were silently leading users into broken configurations.gemini-embedding-001is left active because Google's v1beta endpoint still serves it.include_deprecated=trueand render a Deprecated badge next to the affected models — visible but clearly flagged, instead of silently dropped.Architecture Highlights
metadata(run-level, applies to every chunk in the run) andper_file_metadata(filename-keyed map of overrides). Validated up-front byparse_user_metadata/parse_per_file_metadata; no raw user values reach the ingestion pipeline without going through the validator.KB_METADATA_*constants inlangflow.utils.kb_constants(max 16 keys, lowercase^[a-z0-9_]{1,32}$, ≤256-char string values, ≤16-item string arrays, reserved-key block). FrontendMetadataEditormirrors the same rules so invalid input is flagged inline before the request lands.perform_ingestionwas previously merging in the wrong order; now the per-item dict overrides the run-level dict so per-file overrides actually win.FileUploadSourcecarriessource_nameplus the matchingper_file_metadata[name];FolderSourcecarriesrelative_path+modified_atplus a relative-path-first lookup that falls back to bare basename. Twoinvoice.pdffiles in different subfolders can therefore carry different tags.ingestion_run.user_metadatais a new JSON column (alembic16a290ab1332, EXPAND,server_default '{}') so the run-history UI can render the tags applied to a batch without scanning per-chunksource_metadata. Legacy rows backfill with{}on next write.meta_<key>=<value>query params instead of a JSON-blobmetadata_filter. The globalflatten_query_string_listsmiddleware inlangflow.mainsplits every query value on,, which corrupts a JSON-object payload like{"tag":"invoice","year":2026}. Repeated key=value params side-step that without invasive middleware changes; a doc-comment on the route flags this for future readers.KnowledgeBaseComponentboth AND across keys, OR within values for a single key, and treat array-valued metadata as a set for overlap testing. Lists serialize through the JSONsource_metadatatag so behaviour is consistent across vector stores whose own metadata APIs only accept primitives.KnowledgeBaseComponentretrievestop_k * 4results when a filter is active, then narrows in Python. Backend-native filter translators (Chromawhere, OpenSearch bool/term, Mongo$or/$in, Astra Data API filter, PG JSONB) are deferred to a follow-up PR because they require flattening user metadata into top-level chunk-metadata keys; the JSON-blob shape today is portable but not query-pushable.Validation Rules (Enforced Server-Side, Mirrored Client-Side)
^[a-z0-9_]{1,32}$).source,file_name,chunk_index,total_chunks,ingested_at,job_id,source_type,source_metadata. These are written by the ingestion pipeline itself; user values would silently overwrite ingestion-internal tags and break the chunks browser source-type / file-name / job-id filters.UX
+ Add field. Inline validation surfaces invalid keys / duplicates on blur. The ingest section above it is now always-visible; chunk size/overlap/separator inputs disable themselves until a file is added.MetadataEditor. A smallN tagsbadge replaces the toggle when a file has any overrides so the count is glanceable when the row is collapsed.key: valueplus a+N file overridescounter. Empty metadata leaves the row hidden so the no-tag path stays as terse as before.Filter by metadatabutton next to the Source filter dropdown opens a popover with key/value inputs; submitting adds a chip (year: 2020 ✕) below the toolbar. Clicking the chip's X removes it. Inline validation on the popover mirrors the backend rules.KnowledgeBaseComponent— new advancedMetadata Filtertext input accepting a JSON object. Documented in the field'sinfotooltip with examples.include_deprecated=trueto feed this UI.Out Of Scope (this PR)
where/ OpenSearch bool query / Mongo / Astra / Postgres. The infrastructure is in place (validator + JSON-blob storage + post-filter behaviour locked down with tests), but each backend adapter needs a translator that operates on flatteneduser_meta__<key>chunk fields. Done as a follow-up so this PR stays reviewable.ingestion_run.user_metadatacolumn is structured to support an edit endpoint when the requirement firms up.Test Plan
uv run --frozen pytest src/backend/tests/unit/api/utils/test_kb_metadata.py— 29 passed (parse + validate, rule rejections, reserved-key block, per-file inner-validate inheritance).uv run --frozen pytest src/backend/tests/unit/base/knowledge_bases/ingestion_sources/test_ingestion_sources.py— 25 passed (file-upload propagation, folder relative-path > basename precedence, describe metadata-counts).uv run --frozen pytest src/backend/tests/unit/test_knowledge_bases_api.py::TestKnowledgeBaseAPI— 17 passed (single-key match, AND across keys, OR within key, array-stored values, malformed query rejection).uv run --frozen pytest src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py::TestMetadataFilterHelpers— 9 passed (parse + AND/OR match + missing-metadata + empty-filter passthrough).npx jest src/modals/knowledgeBaseUploadModal src/modals/modelProviderModal src/pages/MainPage/pages/knowledgePage— 115+ passed across the merged surfaces (existing suites + new MetadataEditor coverage + deprecated-badge rendering + Ingest Content rename + Hide Configuration removal).make format_backend/make format_frontendclean.year=2020→ KB showsReadywith 9344 chunks. Filteryear=2020returns all chunks; filteryear=9999returns 0; run-historyuser_metadatashows{"year": "2020"}.confidential=truenarrows to only the overridden file's chunks.KnowledgeBaseComponentwithmetadata_filter={"year": "2020"}returns only matching chunks; malformed JSON falls through to unfiltered without raising.gemini-embedding-001stays unbadged.Notes for Reviewers
feat/kb-v1-db-connectors, notrelease-1.10.0. The diff againstrelease-1.10.0would include every file Eric's PR already touches and obscure what's actually new here.meta_*query params, not a JSON blob —langflow.main.flatten_query_string_listssplits every query value on,, which mangles a JSON object payload ({"tag":"invoice","year":2026}→ first comma splits the value into two strings, FastAPI picks one). Repeatedmeta_<key>=<value>params side-step the middleware without invasive changes. Doc-comment on the route flags this so it doesn't get silently "fixed" by a future reader.combined = item_metadata; combined.update(run_metadata)) silently dropped per-file overrides whenever a run-level key existed. The new order is what the architecture comment inkb_helpers.pyalready described; the code just hadn't matched the comment.perform_ingestionand the ingestion sources trust whatever they receive. The single gate isparse_user_metadata/parse_per_file_metadatain the route handler; downstream code does not re-validate. Tests lock the rule set in place so the API can't drift without these tests breaking first.source,file_name,chunk_index,total_chunks,ingested_at,job_id,source_type,source_metadata. These are written by the ingestion pipeline; allowing user overrides would silently break the existingsource_type/file_name/job_idfilters in the chunks browser. If we ever need to expose more chunk-internal keys to users, removing them fromKB_METADATA_RESERVED_KEYSis the single point of change.KnowledgeBaseComponent.metadata_filterparses JSON lazily — malformed input logs a warning and falls through to the unfiltered retrieval path. A flow run never breaks because of a typo on the canvas; this is a deliberate trade-off vs. surfacing a hard error on the node.top_k * 4retrieval window when a filter is active is a heuristic to avoid post-filter starvation. Once backend-native push-down lands (follow-up PR) this multiplier goes away andsimilarity_searchdoes the narrowing natively. Flag if you'd rather see the multiplier configurable instead of hard-coded.16a290ab1332isEXPANDonly — addsuser_metadata JSON NOT NULL DEFAULT '{}'toingestion_run. Idempotent (column_existsguard). Downgrade drops the column. No CONTRACT migration is needed because no column is being removed.MetadataEditorsilently drops invalid keys at submit time — known UX gap. The editor flags invalid keys on blur, butmetadataPairsToFormValuefilters byKEY_PATTERNregardless, so a user who typesYearand never blurs out of the field will see their tag disappear at create. Listed in Follow-Ups; not blocking for the first ship since the inline-blur validation does cover the common path.gemini-embedding-001stays active even thoughtext-embedding-004andembedding-001are deprecated. The catalog filter operates on thedeprecatedflag inside each model entry, so re-enabling a model later is a single boolean flip ingoogle_generative_ai_constants.py/watsonx_constants.py.IngestionItem.source_metadataand the existingMETADATA_KEY_SOURCE_METADATAchunk tag from feat(kb): Knowledge Bases — Infrastructure Overhaul #12802. No new vector-store backend interfaces; no new ingestion-source ABC methods.