Skip to content

feat(kb): user metadata on ingestion + retrieval#12921

Open
keval718 wants to merge 8 commits intofeat/kb-v1-db-connectorsfrom
feat/kb-ingest-metadata
Open

feat(kb): user metadata on ingestion + retrieval#12921
keval718 wants to merge 8 commits intofeat/kb-v1-db-connectorsfrom
feat/kb-ingest-metadata

Conversation

@keval718
Copy link
Copy Markdown
Collaborator

@keval718 keval718 commented Apr 28, 2026

Summary

Lands user-supplied metadata across the Knowledge Base ingestion + retrieval surface, plus a set of UX cleanups on the upload modal and the embedding-model picker. Closes the long-standing gap where users had no way to attach tags or custom fields to ingested content, and fixes the smaller papercuts (deprecated embeddings cluttering the picker, the Hide Configuration toggle hiding fields users actually want to see, the dropdown trigger flashing a black focus ring after Radix returned focus).

The metadata pipeline is wired end-to-end through:

  • the upload modal (run-level + per-file editors, Step 2 review chips),
  • the multipart /ingest and JSON /ingest/folder API routes,
  • both built-in ingestion sources (FileUploadSource, FolderSource),
  • the per-chunk source_metadata blob each IngestionItem already produces on feat/kb-v1-db-connectors,
  • a new user_metadata column on ingestion_run for run-history rendering,
  • the /{kb}/chunks browser endpoint (filter chips),
  • the KnowledgeBaseComponent retrieval node (canvas-side filter input).

Built on top of #12802 (feat/kb-v1-db-connectors) so the metadata work stays stacked on the registry/backends infrastructure overhaul rather than fanning out into release-1.10.0 and racing the same files.

Closes the "Allow users to provide metadata on ingested content" ticket.

What Users Can Do Now

  • Tag a whole batch with a JSON object of key/value pairs at upload time. Tags propagate to every chunk produced by the run and surface on the run-history row + each chunk's source_metadata.
  • Override on a single file by expanding its row in the Sources panel and adding key/value pairs there. Per-file values win over run-level values on key collision (e.g. one confidential=true file inside an otherwise public batch).
  • Filter the chunks browser by user-tagged metadata via "Filter by metadata" → key/value popover → chip. Multiple chips AND across keys; repeating the same key OR-s its values.
  • Filter retrieval in flows by setting metadata_filter={"tag": "invoice"} on the KnowledgeBaseComponent's advanced section. Single value or array of values per key; missing-or-malformed JSON falls through to the unfiltered path so a typo on the canvas never breaks a flow run.
  • See the tags applied to each batch on the run-history endpoint (user_metadata field on IngestionRunInfo).

UX Cleanups Bundled In

The following items were originally separate fixes on fix/kb-ingest-content-naming and have been merged into this PR so the upload-modal surface ships in one consistent shape:

  • Renamed sections in the KB upload modal: Configure SourcesIngest Content; Add Sources dropdown trigger → Add Files. Reflects what the section actually does and matches the rest of the IA.
  • Removed the Hide Configuration footer toggle. The ingest section is now open by default; the chunk size / overlap / separator inputs render disabled until at least one source is added so the layout stays stable while showing the user what's coming next.
  • Fixed the Add Files dropdown black-ring flash. Radix returns focus to the trigger after the menu closes; the default focus-visible:ring-ring flashed a black 1px outline on click → close. Trigger now uses focus-visible:ring-input so the keyboard-focus indicator matches the surrounding border colour.
  • Marked Google + IBM embedding models as deprecated in the unified model catalog (google_generative_ai_constants.py, watsonx_constants.py) — they are not natively supported by KB ingestion and were silently leading users into broken configurations. gemini-embedding-001 is left active because Google's v1beta endpoint still serves it.
  • Surfaced the deprecation in the picker UI. The KB upload modal and the Model Providers settings modal now fetch with include_deprecated=true and render a Deprecated badge next to the affected models — visible but clearly flagged, instead of silently dropped.

Architecture Highlights

  • Two metadata channels at the API boundarymetadata (run-level, applies to every chunk in the run) and per_file_metadata (filename-keyed map of overrides). Validated up-front by parse_user_metadata / parse_per_file_metadata; no raw user values reach the ingestion pipeline without going through the validator.
  • Single source of truth for the validation contractKB_METADATA_* constants in langflow.utils.kb_constants (max 16 keys, lowercase ^[a-z0-9_]{1,32}$, ≤256-char string values, ≤16-item string arrays, reserved-key block). Frontend MetadataEditor mirrors the same rules so invalid input is flagged inline before the request lands.
  • Per-item beats run-level on collisionperform_ingestion was previously merging in the wrong order; now the per-item dict overrides the run-level dict so per-file overrides actually win.
  • Sources merge their own provenance + user overridesFileUploadSource carries source_name plus the matching per_file_metadata[name]; FolderSource carries relative_path + modified_at plus a relative-path-first lookup that falls back to bare basename. Two invoice.pdf files in different subfolders can therefore carry different tags.
  • ingestion_run.user_metadata is a new JSON column (alembic 16a290ab1332, EXPAND, server_default '{}') so the run-history UI can render the tags applied to a batch without scanning per-chunk source_metadata. Legacy rows backfill with {} on next write.
  • Chunks endpoint uses repeated meta_<key>=<value> query params instead of a JSON-blob metadata_filter. The global flatten_query_string_lists middleware in langflow.main splits every query value on ,, which corrupts a JSON-object payload like {"tag":"invoice","year":2026}. Repeated key=value params side-step that without invasive middleware changes; a doc-comment on the route flags this for future readers.
  • Match semantics across all surfaces are identical — chunks endpoint and KnowledgeBaseComponent both AND across keys, OR within values for a single key, and treat array-valued metadata as a set for overlap testing. Lists serialize through the JSON source_metadata tag so behaviour is consistent across vector stores whose own metadata APIs only accept primitives.
  • Retrieval-side post-filter, not push-downKnowledgeBaseComponent retrieves top_k * 4 results when a filter is active, then narrows in Python. Backend-native filter translators (Chroma where, OpenSearch bool/term, Mongo $or/$in, Astra Data API filter, PG JSONB) are deferred to a follow-up PR because they require flattening user metadata into top-level chunk-metadata keys; the JSON-blob shape today is portable but not query-pushable.

Validation Rules (Enforced Server-Side, Mirrored Client-Side)

  • Max 16 keys per dict (run-level and per-file inner dicts).
  • Keys: 1-32 lowercase alphanumeric or underscore characters (^[a-z0-9_]{1,32}$).
  • Values: string ≤256 chars, integer, float, bool, or string-array ≤16 items each ≤256 chars. Nested objects rejected.
  • Reserved keys (rejected with 422): source, file_name, chunk_index, total_chunks, ingested_at, job_id, source_type, source_metadata. These are written by the ingestion pipeline itself; user values would silently overwrite ingestion-internal tags and break the chunks browser source-type / file-name / job-id filters.
  • Per-file map: max 16 file keys; each filename must be a non-empty string.

UX

  • Knowledge Base upload modal — new Metadata subsection inside Chunking Settings. Tooltip-equipped Custom Fields editor renders as a list of key/value rows with + Add field. Inline validation surfaces invalid keys / duplicates on blur. The ingest section above it is now always-visible; chunk size/overlap/separator inputs disable themselves until a file is added.
  • Sources side panel — every staged file now has an inline tag-icon toggle that expands a per-file MetadataEditor. A small N tags badge replaces the toggle when a file has any overrides so the count is glanceable when the row is collapsed.
  • Step 2 review summary — when run-level tags or per-file overrides are present, a Metadata row renders chips for each key: value plus a +N file overrides counter. Empty metadata leaves the row hidden so the no-tag path stays as terse as before.
  • Chunks browserFilter by metadata button next to the Source filter dropdown opens a popover with key/value inputs; submitting adds a chip (year: 2020 ✕) below the toolbar. Clicking the chip's X removes it. Inline validation on the popover mirrors the backend rules.
  • KnowledgeBaseComponent — new advanced Metadata Filter text input accepting a JSON object. Documented in the field's info tooltip with examples.
  • Embedding-model picker — Google + IBM embedding models stay visible but render with a grey Deprecated badge so users know they will fail in KB ingestion before they pick one. The KB upload modal and the Model Providers settings page both fetch with include_deprecated=true to feed this UI.

Out Of Scope (this PR)

  • Backend-native filter push-down for Chroma where / OpenSearch bool query / Mongo / Astra / Postgres. The infrastructure is in place (validator + JSON-blob storage + post-filter behaviour locked down with tests), but each backend adapter needs a translator that operates on flattened user_meta__<key> chunk fields. Done as a follow-up so this PR stays reviewable.
  • Editing metadata after ingestion without re-ingesting. Out of scope today; the ingestion_run.user_metadata column is structured to support an edit endpoint when the requirement firms up.
  • Frontend tag autocompletion based on previously-seen keys/values. The popover today is free-form; a TODO surfaces this once we have data on which keys users repeat.

Test Plan

  • Backend unit suite (validators): uv run --frozen pytest src/backend/tests/unit/api/utils/test_kb_metadata.py — 29 passed (parse + validate, rule rejections, reserved-key block, per-file inner-validate inheritance).
  • Backend unit suite (sources): uv run --frozen pytest src/backend/tests/unit/base/knowledge_bases/ingestion_sources/test_ingestion_sources.py — 25 passed (file-upload propagation, folder relative-path > basename precedence, describe metadata-counts).
  • Backend integration (chunks endpoint): uv run --frozen pytest src/backend/tests/unit/test_knowledge_bases_api.py::TestKnowledgeBaseAPI — 17 passed (single-key match, AND across keys, OR within key, array-stored values, malformed query rejection).
  • Backend unit (retrieval helpers): uv run --frozen pytest src/backend/tests/unit/components/files_and_knowledge/test_retrieval.py::TestMetadataFilterHelpers — 9 passed (parse + AND/OR match + missing-metadata + empty-filter passthrough).
  • Frontend unit (modal + chunks + provider modal): npx jest src/modals/knowledgeBaseUploadModal src/modals/modelProviderModal src/pages/MainPage/pages/knowledgePage — 115+ passed across the merged surfaces (existing suites + new MetadataEditor coverage + deprecated-badge rendering + Ingest Content rename + Hide Configuration removal).
  • make format_backend / make format_frontend clean.
  • Manual smoke: ingest a 23 MB PDF with run-level year=2020 → KB shows Ready with 9344 chunks. Filter year=2020 returns all chunks; filter year=9999 returns 0; run-history user_metadata shows {"year": "2020"}.
  • Manual smoke: per-file override on a 2-file batch → chunks browser filter confidential=true narrows to only the overridden file's chunks.
  • Manual smoke: KnowledgeBaseComponent with metadata_filter={"year": "2020"} returns only matching chunks; malformed JSON falls through to unfiltered without raising.
  • Manual smoke: confirm Google + IBM embeddings render with a grey Deprecated badge in both the KB modal picker and the Model Providers settings modal; gemini-embedding-001 stays unbadged.

Notes for Reviewers

  • Stacked on feat(kb): Knowledge Bases — Infrastructure Overhaul #12802 — review against feat/kb-v1-db-connectors, not release-1.10.0. The diff against release-1.10.0 would include every file Eric's PR already touches and obscure what's actually new here.
  • Why repeated meta_* query params, not a JSON bloblangflow.main.flatten_query_string_lists splits every query value on ,, which mangles a JSON object payload ({"tag":"invoice","year":2026} → first comma splits the value into two strings, FastAPI picks one). Repeated meta_<key>=<value> params side-step the middleware without invasive changes. Doc-comment on the route flags this so it doesn't get silently "fixed" by a future reader.
  • Per-item beats run-level merge order is a fix, not a refactor — the previous merge order (combined = item_metadata; combined.update(run_metadata)) silently dropped per-file overrides whenever a run-level key existed. The new order is what the architecture comment in kb_helpers.py already described; the code just hadn't matched the comment.
  • Validation lives at the API boundary onlyperform_ingestion and the ingestion sources trust whatever they receive. The single gate is parse_user_metadata / parse_per_file_metadata in the route handler; downstream code does not re-validate. Tests lock the rule set in place so the API can't drift without these tests breaking first.
  • Reserved-key listsource, file_name, chunk_index, total_chunks, ingested_at, job_id, source_type, source_metadata. These are written by the ingestion pipeline; allowing user overrides would silently break the existing source_type / file_name / job_id filters in the chunks browser. If we ever need to expose more chunk-internal keys to users, removing them from KB_METADATA_RESERVED_KEYS is the single point of change.
  • KnowledgeBaseComponent.metadata_filter parses JSON lazily — malformed input logs a warning and falls through to the unfiltered retrieval path. A flow run never breaks because of a typo on the canvas; this is a deliberate trade-off vs. surfacing a hard error on the node.
  • The top_k * 4 retrieval window when a filter is active is a heuristic to avoid post-filter starvation. Once backend-native push-down lands (follow-up PR) this multiplier goes away and similarity_search does the narrowing natively. Flag if you'd rather see the multiplier configurable instead of hard-coded.
  • Alembic migration 16a290ab1332 is EXPAND only — adds user_metadata JSON NOT NULL DEFAULT '{}' to ingestion_run. Idempotent (column_exists guard). Downgrade drops the column. No CONTRACT migration is needed because no column is being removed.
  • Frontend MetadataEditor silently drops invalid keys at submit time — known UX gap. The editor flags invalid keys on blur, but metadataPairsToFormValue filters by KEY_PATTERN regardless, so a user who types Year and never blurs out of the field will see their tag disappear at create. Listed in Follow-Ups; not blocking for the first ship since the inline-blur validation does cover the common path.
  • Deprecated-embedding flagging is intentionally per-model, not per-providergemini-embedding-001 stays active even though text-embedding-004 and embedding-001 are deprecated. The catalog filter operates on the deprecated flag inside each model entry, so re-enabling a model later is a single boolean flip in google_generative_ai_constants.py / watsonx_constants.py.
  • No new model registry plumbing — the metadata feature reuses IngestionItem.source_metadata and the existing METADATA_KEY_SOURCE_METADATA chunk tag from feat(kb): Knowledge Bases — Infrastructure Overhaul #12802. No new vector-store backend interfaces; no new ingestion-source ABC methods.

…ault

Rename "Configure Sources" heading to "Ingest Content" and the dropdown
trigger from "Add Sources" to "Add Files". Drop the "Hide Configuration"
footer toggle so the ingest section is always visible. Chunk size,
overlap, and separator inputs render disabled until at least one source
is added. Match the Add Files focus-visible ring to border-input so the
Radix-managed focus return no longer flashes a black outline.
Flag the Google Generative AI and IBM WatsonX embedding model entries in
the unified model catalog as deprecated. They are not natively supported
by Knowledge Base ingestion, so users were attempting invalid
configurations. The KB upload picker and Model Providers modal now fetch
with include_deprecated=true and render a "Deprecated" badge so the
models stay visible but are clearly flagged as unsupported.
Only mark text-embedding-004 and embedding-001 as deprecated; the
gemini-embedding-001 model is still served by the Google Generative AI
v1beta endpoint, so leave it visible without the badge.
Surface the per-chunk user-metadata pipeline that already lives on the KB
infra branch. Adds the API + UI to collect tags at ingest time, persist
them on the ingestion_run row, filter the chunks browser by them, and
narrow KnowledgeBaseComponent retrieval through them.

Backend
- POST /{kb}/ingest gains optional metadata + per_file_metadata multipart
  fields. POST /{kb}/ingest/folder gains the same fields on its JSON body.
- New parse_user_metadata / parse_per_file_metadata helpers enforce ≤16
  keys, lowercase ^[a-z0-9_]{1,32}$ keys, ≤256-char values, ≤16-item
  string arrays, and reject the reserved chunk-internal keys.
- FileUploadSource and FolderSource merge per-file overrides into each
  IngestionItem.source_metadata; perform_ingestion now lets per-item
  values win over run-level on key collision.
- ingestion_run grows a user_metadata JSON column (alembic
  16a290ab1332, EXPAND, server_default '{}').
- GET /{kb}/chunks accepts repeating meta_<key>=<value> params; values
  AND across keys, OR within each key. Sidesteps the global comma-split
  query middleware that breaks JSON-blob query params.
- KnowledgeBaseComponent gains a metadata_filter input that post-filters
  the similarity_search results client-side until per-backend
  translators land.

Frontend
- New MetadataEditor renders inside the KB upload modal under Chunking
  Settings for run-level tags, and inside FilesPanel as an expandable
  per-file override.
- Chunks browser gets ChunksMetadataFilter (popover + chips) wired into
  useGetKnowledgeBaseChunks via the new metadata_filter param map.

Tests
- 29 validator tests for the rule set, 4 ingestion-source tests for
  per-file propagation + describe counts, integration tests for the
  chunks endpoint AND/OR filter, 9 retrieval-side helper tests for
  parse + match, 8 frontend MetadataEditor tests.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 28, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4ace9a47-ee99-4c12-addb-9de0c5e567a8

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/kb-ingest-metadata

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added the enhancement New feature or request label Apr 28, 2026
@keval718 keval718 marked this pull request as draft April 28, 2026 21:41
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 28, 2026

Migration Validation Passed

All migrations follow the Expand-Contract pattern correctly.

@github-actions github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Apr 28, 2026
Surface run-level tags + per-file override count in the Step 2 Review
summary so users can confirm what is about to ship onto each chunk
before clicking Create. Empty metadata stays invisible to keep the
summary tight on the common no-tag path.
@github-actions github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Apr 28, 2026
@github-actions github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Apr 28, 2026
…to feat/kb-ingest-metadata

# Conflicts:
#	src/frontend/src/modals/knowledgeBaseUploadModal/__tests__/KnowledgeBaseUploadModal.test.tsx
#	src/frontend/src/modals/knowledgeBaseUploadModal/hooks/useKnowledgeBaseForm.ts
@github-actions github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Apr 30, 2026
@keval718 keval718 marked this pull request as ready for review April 30, 2026 03:54
@github-actions github-actions Bot added enhancement New feature or request and removed enhancement New feature or request labels Apr 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant