feat(kb): user metadata on ingestion + retrieval by keval718 · Pull Request #12921 · langflow-ai/langflow

keval718 · 2026-04-28T21:40:10Z

Summary

Lands user-supplied metadata across the Knowledge Base ingestion + retrieval surface, plus a set of UX cleanups on the upload modal and the embedding-model picker. Closes the long-standing gap where users had no way to attach tags or custom fields to ingested content, and fixes the smaller papercuts (deprecated embeddings cluttering the picker, the Hide Configuration toggle hiding fields users actually want to see, the dropdown trigger flashing a black focus ring after Radix returned focus).

The metadata pipeline is wired end-to-end through:

the upload modal (run-level + per-file editors, Step 2 review chips),
the multipart /ingest and JSON /ingest/folder API routes,
both built-in ingestion sources (FileUploadSource, FolderSource),
the per-chunk source_metadata blob each IngestionItem already produces on feat/kb-v1-db-connectors,
a new user_metadata column on ingestion_run for run-history rendering,
the /{kb}/chunks browser endpoint (filter chips),
the KnowledgeBaseComponent retrieval node (canvas-side filter input).

Built on top of #12802 (feat/kb-v1-db-connectors) so the metadata work stays stacked on the registry/backends infrastructure overhaul rather than fanning out into release-1.10.0 and racing the same files.

Closes the "Allow users to provide metadata on ingested content" ticket.

What Users Can Do Now

Tag a whole batch with a JSON object of key/value pairs at upload time. Tags propagate to every chunk produced by the run and surface on the run-history row + each chunk's source_metadata.
Override on a single file by expanding its row in the Sources panel and adding key/value pairs there. Per-file values win over run-level values on key collision (e.g. one confidential=true file inside an otherwise public batch).
Filter the chunks browser by user-tagged metadata via "Filter by metadata" → key/value popover → chip. Multiple chips AND across keys; repeating the same key OR-s its values.
Filter retrieval in flows by setting metadata_filter={"tag": "invoice"} on the KnowledgeBaseComponent's advanced section. Single value or array of values per key; missing-or-malformed JSON falls through to the unfiltered path so a typo on the canvas never breaks a flow run.
See the tags applied to each batch on the run-history endpoint (user_metadata field on IngestionRunInfo).

UX Cleanups Bundled In

The following items were originally separate fixes on fix/kb-ingest-content-naming and have been merged into this PR so the upload-modal surface ships in one consistent shape:

Renamed sections in the KB upload modal: Configure Sources → Ingest Content; Add Sources dropdown trigger → Add Files. Reflects what the section actually does and matches the rest of the IA.
Removed the Hide Configuration footer toggle. The ingest section is now open by default; the chunk size / overlap / separator inputs render disabled until at least one source is added so the layout stays stable while showing the user what's coming next.
Fixed the Add Files dropdown black-ring flash. Radix returns focus to the trigger after the menu closes; the default focus-visible:ring-ring flashed a black 1px outline on click → close. Trigger now uses focus-visible:ring-input so the keyboard-focus indicator matches the surrounding border colour.
Marked Google + IBM embedding models as deprecated in the unified model catalog (google_generative_ai_constants.py, watsonx_constants.py) — they are not natively supported by KB ingestion and were silently leading users into broken configurations. gemini-embedding-001 is left active because Google's v1beta endpoint still serves it.
Surfaced the deprecation in the picker UI. The KB upload modal and the Model Providers settings modal now fetch with include_deprecated=true and render a Deprecated badge next to the affected models — visible but clearly flagged, instead of silently dropped.

Architecture Highlights

Two metadata channels at the API boundary — metadata (run-level, applies to every chunk in the run) and per_file_metadata (filename-keyed map of overrides). Validated up-front by parse_user_metadata / parse_per_file_metadata; no raw user values reach the ingestion pipeline without going through the validator.
Single source of truth for the validation contract — KB_METADATA_* constants in langflow.utils.kb_constants (max 16 keys, lowercase ^[a-z0-9_]{1,32}$, ≤256-char string values, ≤16-item string arrays, reserved-key block). Frontend MetadataEditor mirrors the same rules so invalid input is flagged inline before the request lands.
Per-item beats run-level on collision — perform_ingestion was previously merging in the wrong order; now the per-item dict overrides the run-level dict so per-file overrides actually win.
Sources merge their own provenance + user overrides — FileUploadSource carries source_name plus the matching per_file_metadata[name]; FolderSource carries relative_path + modified_at plus a relative-path-first lookup that falls back to bare basename. Two invoice.pdf files in different subfolders can therefore carry different tags.
ingestion_run.user_metadata is a new JSON column (alembic 16a290ab1332, EXPAND, server_default '{}') so the run-history UI can render the tags applied to a batch without scanning per-chunk source_metadata. Legacy rows backfill with {} on next write.
Chunks endpoint uses repeated meta_<key>=<value> query params instead of a JSON-blob metadata_filter. The global flatten_query_string_lists middleware in langflow.main splits every query value on ,, which corrupts a JSON-object payload like {"tag":"invoice","year":2026}. Repeated key=value params side-step that without invasive middleware changes; a doc-comment on the route flags this for future readers.
Match semantics across all surfaces are identical — chunks endpoint and KnowledgeBaseComponent both AND across keys, OR within values for a single key, and treat array-valued metadata as a set for overlap testing. Lists serialize through the JSON source_metadata tag so behaviour is consistent across vector stores whose own metadata APIs only accept primitives.
Retrieval-side post-filter, not push-down — KnowledgeBaseComponent retrieves top_k * 4 results when a filter is active, then narrows in Python. Backend-native filter translators (Chroma where, OpenSearch bool/term, Mongo $or/$in, Astra Data API filter, PG JSONB) are deferred to a follow-up PR because they require flattening user metadata into top-level chunk-metadata keys; the JSON-blob shape today is portable but not query-pushable.

Validation Rules (Enforced Server-Side, Mirrored Client-Side)

Max 16 keys per dict (run-level and per-file inner dicts).
Keys: 1-32 lowercase alphanumeric or underscore characters (^[a-z0-9_]{1,32}$).
Values: string ≤256 chars, integer, float, bool, or string-array ≤16 items each ≤256 chars. Nested objects rejected.
Reserved keys (rejected with 422): source, file_name, chunk_index, total_chunks, ingested_at, job_id, source_type, source_metadata. These are written by the ingestion pipeline itself; user values would silently overwrite ingestion-internal tags and break the chunks browser source-type / file-name / job-id filters.
Per-file map: max 16 file keys; each filename must be a non-empty string.

UX

Knowledge Base upload modal — new Metadata subsection inside Chunking Settings. Tooltip-equipped Custom Fields editor renders as a list of key/value rows with + Add field. Inline validation surfaces invalid keys / duplicates on blur. The ingest section above it is now always-visible; chunk size/overlap/separator inputs disable themselves until a file is added.
Sources side panel — every staged file now has an inline tag-icon toggle that expands a per-file MetadataEditor. A small N tags badge replaces the toggle when a file has any overrides so the count is glanceable when the row is collapsed.
Step 2 review summary — when run-level tags or per-file overrides are present, a Metadata row renders chips for each key: value plus a +N file overrides counter. Empty metadata leaves the row hidden so the no-tag path stays as terse as before.
Chunks browser — Filter by metadata button next to the Source filter dropdown opens a popover with key/value inputs; submitting adds a chip (year: 2020 ✕) below the toolbar. Clicking the chip's X removes it. Inline validation on the popover mirrors the backend rules.
KnowledgeBaseComponent — new advanced Metadata Filter text input accepting a JSON object. Documented in the field's info tooltip with examples.
Embedding-model picker — Google + IBM embedding models stay visible but render with a grey Deprecated badge so users know they will fail in KB ingestion before they pick one. The KB upload modal and the Model Providers settings page both fetch with include_deprecated=true to feed this UI.

Out Of Scope (this PR)

Backend-native filter push-down for Chroma where / OpenSearch bool query / Mongo / Astra / Postgres. The infrastructure is in place (validator + JSON-blob storage + post-filter behaviour locked down with tests), but each backend adapter needs a translator that operates on flattened user_meta__<key> chunk fields. Done as a follow-up so this PR stays reviewable.
Editing metadata after ingestion without re-ingesting. Out of scope today; the ingestion_run.user_metadata column is structured to support an edit endpoint when the requirement firms up.
Frontend tag autocompletion based on previously-seen keys/values. The popover today is free-form; a TODO surfaces this once we have data on which keys users repeat.

Test Plan

Notes for Reviewers

Stacked on feat(kb): Knowledge Bases — Infrastructure Overhaul #12802 — review against feat/kb-v1-db-connectors, not release-1.10.0. The diff against release-1.10.0 would include every file Eric's PR already touches and obscure what's actually new here.
Why repeated meta_* query params, not a JSON blob — langflow.main.flatten_query_string_lists splits every query value on ,, which mangles a JSON object payload ({"tag":"invoice","year":2026} → first comma splits the value into two strings, FastAPI picks one). Repeated meta_<key>=<value> params side-step the middleware without invasive changes. Doc-comment on the route flags this so it doesn't get silently "fixed" by a future reader.
Per-item beats run-level merge order is a fix, not a refactor — the previous merge order (combined = item_metadata; combined.update(run_metadata)) silently dropped per-file overrides whenever a run-level key existed. The new order is what the architecture comment in kb_helpers.py already described; the code just hadn't matched the comment.
Validation lives at the API boundary only — perform_ingestion and the ingestion sources trust whatever they receive. The single gate is parse_user_metadata / parse_per_file_metadata in the route handler; downstream code does not re-validate. Tests lock the rule set in place so the API can't drift without these tests breaking first.
Reserved-key list — source, file_name, chunk_index, total_chunks, ingested_at, job_id, source_type, source_metadata. These are written by the ingestion pipeline; allowing user overrides would silently break the existing source_type / file_name / job_id filters in the chunks browser. If we ever need to expose more chunk-internal keys to users, removing them from KB_METADATA_RESERVED_KEYS is the single point of change.
KnowledgeBaseComponent.metadata_filter parses JSON lazily — malformed input logs a warning and falls through to the unfiltered retrieval path. A flow run never breaks because of a typo on the canvas; this is a deliberate trade-off vs. surfacing a hard error on the node.
The top_k * 4 retrieval window when a filter is active is a heuristic to avoid post-filter starvation. Once backend-native push-down lands (follow-up PR) this multiplier goes away and similarity_search does the narrowing natively. Flag if you'd rather see the multiplier configurable instead of hard-coded.
Alembic migration 16a290ab1332 is EXPAND only — adds user_metadata JSON NOT NULL DEFAULT '{}' to ingestion_run. Idempotent (column_exists guard). Downgrade drops the column. No CONTRACT migration is needed because no column is being removed.
Frontend MetadataEditor silently drops invalid keys at submit time — known UX gap. The editor flags invalid keys on blur, but metadataPairsToFormValue filters by KEY_PATTERN regardless, so a user who types Year and never blurs out of the field will see their tag disappear at create. Listed in Follow-Ups; not blocking for the first ship since the inline-blur validation does cover the common path.
Deprecated-embedding flagging is intentionally per-model, not per-provider — gemini-embedding-001 stays active even though text-embedding-004 and embedding-001 are deprecated. The catalog filter operates on the deprecated flag inside each model entry, so re-enabling a model later is a single boolean flip in google_generative_ai_constants.py / watsonx_constants.py.
No new model registry plumbing — the metadata feature reuses IngestionItem.source_metadata and the existing METADATA_KEY_SOURCE_METADATA chunk tag from feat(kb): Knowledge Bases — Infrastructure Overhaul #12802. No new vector-store backend interfaces; no new ingestion-source ABC methods.

…ault Rename "Configure Sources" heading to "Ingest Content" and the dropdown trigger from "Add Sources" to "Add Files". Drop the "Hide Configuration" footer toggle so the ingest section is always visible. Chunk size, overlap, and separator inputs render disabled until at least one source is added. Match the Add Files focus-visible ring to border-input so the Radix-managed focus return no longer flashes a black outline.

Flag the Google Generative AI and IBM WatsonX embedding model entries in the unified model catalog as deprecated. They are not natively supported by Knowledge Base ingestion, so users were attempting invalid configurations. The KB upload picker and Model Providers modal now fetch with include_deprecated=true and render a "Deprecated" badge so the models stay visible but are clearly flagged as unsupported.

Only mark text-embedding-004 and embedding-001 as deprecated; the gemini-embedding-001 model is still served by the Google Generative AI v1beta endpoint, so leave it visible without the badge.

Surface the per-chunk user-metadata pipeline that already lives on the KB infra branch. Adds the API + UI to collect tags at ingest time, persist them on the ingestion_run row, filter the chunks browser by them, and narrow KnowledgeBaseComponent retrieval through them. Backend - POST /{kb}/ingest gains optional metadata + per_file_metadata multipart fields. POST /{kb}/ingest/folder gains the same fields on its JSON body. - New parse_user_metadata / parse_per_file_metadata helpers enforce ≤16 keys, lowercase ^[a-z0-9_]{1,32}$ keys, ≤256-char values, ≤16-item string arrays, and reject the reserved chunk-internal keys. - FileUploadSource and FolderSource merge per-file overrides into each IngestionItem.source_metadata; perform_ingestion now lets per-item values win over run-level on key collision. - ingestion_run grows a user_metadata JSON column (alembic 16a290ab1332, EXPAND, server_default '{}'). - GET /{kb}/chunks accepts repeating meta_<key>=<value> params; values AND across keys, OR within each key. Sidesteps the global comma-split query middleware that breaks JSON-blob query params. - KnowledgeBaseComponent gains a metadata_filter input that post-filters the similarity_search results client-side until per-backend translators land. Frontend - New MetadataEditor renders inside the KB upload modal under Chunking Settings for run-level tags, and inside FilesPanel as an expandable per-file override. - Chunks browser gets ChunksMetadataFilter (popover + chips) wired into useGetKnowledgeBaseChunks via the new metadata_filter param map. Tests - 29 validator tests for the rule set, 4 ingestion-source tests for per-file propagation + describe counts, integration tests for the chunks endpoint AND/OR filter, 9 retrieval-side helper tests for parse + match, 8 frontend MetadataEditor tests.

coderabbitai · 2026-04-28T21:40:17Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4ace9a47-ee99-4c12-addb-9de0c5e567a8

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/kb-ingest-metadata

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-04-28T21:41:19Z

✅ Migration Validation Passed

All migrations follow the Expand-Contract pattern correctly.

Surface run-level tags + per-file override count in the Step 2 Review summary so users can confirm what is about to ship onto each chunk before clicking Create. Empty metadata stays invisible to keep the summary tight on the common no-tag path.

…to feat/kb-ingest-metadata # Conflicts: # src/frontend/src/modals/knowledgeBaseUploadModal/__tests__/KnowledgeBaseUploadModal.test.tsx # src/frontend/src/modals/knowledgeBaseUploadModal/hooks/useKnowledgeBaseForm.ts

keval718 added 4 commits April 28, 2026 13:45

fix(models): keep gemini-embedding-001 active

07764ff

Only mark text-embedding-004 and embedding-001 as deprecated; the gemini-embedding-001 model is still served by the Google Generative AI v1beta endpoint, so leave it visible without the badge.

github-actions Bot added the enhancement New feature or request label Apr 28, 2026

keval718 marked this pull request as draft April 28, 2026 21:41

[autofix.ci] apply automated fixes

47cb396

github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Apr 28, 2026

[autofix.ci] apply automated fixes

8f7ce68

github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Apr 28, 2026

github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Apr 30, 2026

keval718 marked this pull request as ready for review April 30, 2026 03:54

github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Apr 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(kb): user metadata on ingestion + retrieval#12921

feat(kb): user metadata on ingestion + retrieval#12921
keval718 wants to merge 8 commits intofeat/kb-v1-db-connectorsfrom
feat/kb-ingest-metadata

keval718 commented Apr 28, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Apr 28, 2026 •

edited

Loading

Review skipped

Uh oh!

github-actions Bot commented Apr 28, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

keval718 commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What Users Can Do Now

UX Cleanups Bundled In

Architecture Highlights

Validation Rules (Enforced Server-Side, Mirrored Client-Side)

UX

Out Of Scope (this PR)

Test Plan

Notes for Reviewers

Uh oh!

coderabbitai Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

github-actions Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

keval718 commented Apr 28, 2026 •

edited

Loading

coderabbitai Bot commented Apr 28, 2026 •

edited

Loading

github-actions Bot commented Apr 28, 2026 •

edited

Loading