Feature/table parser column roles by ahmadintisar · Pull Request #13710 · infiniflow/ragflow

ahmadintisar · 2026-03-19T22:51:02Z

What problem does this PR solve?

The table file parser (CSV/Excel) currently treats all columns identically — every column is both vectorized (embedded in chunk text) and stored as filterable metadata. There's no way for users to control which columns should be searchable by semantic meaning versus which should only be filterable attributes.

For example, when ingesting a news articles CSV with columns like title, content, country, category, source, etc., the embedding includes metadata fields like country: Brazil and source: Reuters in the chunk text, which dilutes the semantic quality of the embedding without adding retrieval value.

The RDBMS connector (MySQL/PostgreSQL) already supports content_columns / metadata_columns, but this capability was missing for file-based table ingestion.

This PR adds column-level control (vectorize / metadata / both) for the table file parser, following RAGFlow's existing patterns.

Backward compatible: Datasets without table_column_roles or with table_column_mode: auto behave exactly as before (all columns = both).
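To make the three roles concrete, here is a minimal sketch of how one row could be split. It is not the code from this PR: `split_row`, the `"; "` separator, and the `col: value` text layout are assumptions chosen for illustration.

```python
from typing import Any, Dict, Optional, Tuple

DEFAULT_ROLE = "both"  # unlisted columns keep the pre-existing behavior


def split_row(row: Dict[str, Any],
              roles: Optional[Dict[str, str]]) -> Tuple[str, Dict[str, Any]]:
    """Split one table row into (chunk text, stored fields) by column role.

    Hypothetical helper: the real chunk() in rag/app/table.py may differ.
    """
    roles = roles or {}
    text_parts = []
    stored = {}
    for col, value in row.items():
        role = roles.get(col, DEFAULT_ROLE)
        if role in ("vectorize", "both"):
            # Goes into the text that gets embedded.
            text_parts.append(f"{col}: {value}")
        if role in ("metadata", "both"):
            # Kept as a filterable field only.
            stored[col] = value
    return "; ".join(text_parts), stored
```

With `roles = {"title": "vectorize", "country": "metadata"}`, a row with `title`, `content`, and `country` would embed `title` and `content`, and store `content` and `country`.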

Type of change

New Feature (non-breaking change which adds functionality)

… chunk construction
- Read table_column_roles from parser_config in chunk()
- Split row processing: vectorize/both → chunk text, metadata/both → chunk_data/typed fields
- Build field_map only from stored columns (metadata + both)
- Backward compatible: missing config defaults all columns to 'both'

…h, and frontend column role selector
- Add TableColumnRole type and table_column_roles/table_column_names to ParserConfig (validation_utils.py)
- Pass table_column_names from table.py to parser_config on first parse
- Build column role selector UI in table.tsx with empty state and re-parse tip
- Add form schema fields and i18n keys for table column roles

…guration
- Add table_column_mode to ParserConfig (auto default, manual enables column roles)
- Gate table_column_roles reading behind manual mode check in table.py
- Add mode selector UI in table.tsx, column role table only shown in manual mode
- Preserve manual selections when switching back to auto

…config at parse time
- task_executor now overlays table_column_mode/roles/names from KB parser_config onto document parser_config for table parser tasks
- Root cause: dataset settings saved to KB but task used doc-level config
- Add debug logging with [TABLE_PARSER_DEBUG] and [TASK_EXECUTOR_DEBUG] prefixes

…lumns + fix ES field resolution
- Add aggregate_table_manual_doc_metadata() to collect unique values from metadata/both columns
- Write aggregated metadata via DocMetadataService.update_document_metadata after chunk insertion
- Add ES typed field probe fallback (_probe_es_typed_key_for_column) for when field_map is empty (field_map is written during chunk() but task snapshot predates it)
- Probe tries _tks, _dt, _long, _flt, _kwd, bare name in order
- field_map used as primary when available, probe as fallback
- Only triggers for table parser with table_column_mode=manual
- Auto mode unchanged: no doc metadata written
- Debug logging with [TABLE_META_DEBUG] prefix throughout
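The lookup order in this commit message can be sketched as follows. This is an illustration only: `resolve_typed_key` is a hypothetical name (the PR's helper is `_probe_es_typed_key_for_column`, whose signature is not shown here), and the assumption that `field_map` maps a typed key to its column name is mine.

```python
from typing import Any, Dict, Optional

# Probe order taken from the commit message; "" stands for the bare column name.
PROBE_SUFFIXES = ("_tks", "_dt", "_long", "_flt", "_kwd", "")


def resolve_typed_key(column: str,
                      chunk: Dict[str, Any],
                      field_map: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Return the stored key for `column`, or None if nothing matches."""
    # Primary source: field_map (assumed shape: typed key -> column name).
    if field_map:
        for typed_key, col_name in field_map.items():
            if col_name == column:
                return typed_key
    # Fallback: probe the chunk for each known suffix, in order.
    for suffix in PROBE_SUFFIXES:
        candidate = f"{column}{suffix}"
        if candidate in chunk:
            return candidate
    return None
```

The probe only sees keys derived from the column name itself, which is why `field_map` stays the primary source when the stored key does not start with the column name.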

ahmadintisar · 2026-03-19T22:58:11Z

Here's a quick demonstration.

I tested by uploading a CSV file with the table parser.

There is now an option in the configuration to select auto or manual column mode. By default it is set to auto.

First, let's demonstrate auto mode, which is the default. Each row is a document and the whole content is vectorized while metadata is empty.

Chunk Results:

Now I will change the configuration mode to manual and select which columns I want vectorized and which ones should be included in the metadata:

Changes will take effect after reparsing the same file.

And now the chunk results include the columns which were selected as vectorized in the configuration.

Also now we can visualize the metadata in the manual column mode.

I hope this demonstration helps!

ahmadintisar · 2026-03-19T23:05:26Z

I have also added debug logs. Once the feature is reviewed, it's safe to remove them!

codecov · 2026-03-20T05:19:34Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.52%. Comparing base (af40be6) to head (00275c3).

Additional details and impacted files

@@           Coverage Diff           @@
##             main   #13710   +/-   ##
=======================================
  Coverage   96.52%   96.52%           
=======================================
  Files          10       10           
  Lines         690      690           
  Branches      108      108           
=======================================
  Hits          666      666           
  Misses          8        8           
  Partials       16       16

ahmadintisar · 2026-03-21T18:14:53Z

@Magicbook1108 Can you pls review the PR?

Copilot

Pull request overview

Adds dataset-level column role controls to the CSV/Excel (“table”) parser so users can choose which columns are embedded into chunk text vs stored as filterable metadata, aligning file-based table ingestion with existing RDBMS connector patterns.

Changes:

Introduces table_column_mode (auto/manual), table_column_roles, and table_column_names in dataset parser config (API + frontend schema/UI).
Updates the table chunker to respect column roles: vectorize-only columns go to chunk text; metadata/both columns are stored for filtering and retrieval.
Adds post-index aggregation to write table-derived metadata into DocMetadataService when manual mode is enabled.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| web/src/pages/dataset/dataset-setting/form-schema.ts | Extends frontend validation schema for table column mode/roles/names. |
| web/src/pages/dataset/dataset-setting/configuration/table.tsx | Adds UI controls (auto/manual + per-column role selector) for table ingestion. |
| web/src/locales/en.ts | Adds English i18n strings for the new table column role UI. |
| api/utils/validation_utils.py | Extends backend request validation model (`ParserConfig`) with table column mode/roles/names. |
| rag/app/table.py | Applies column roles during chunk creation; updates KB parser_config with `field_map` + `table_column_names`. |
| rag/svr/task_executor.py | Merges KB table parser config into chunk tasks and aggregates table metadata to document-level metadata post-index. |

…ields for DocMetadataService
- Add {col}_raw field for metadata/both text columns in table.py (ES path only)
- Aggregation in task_executor prefers _raw over _tks for human-readable metadata
- Legacy _tks fallback joins token lists into strings instead of repr()
- ES search behavior unchanged: _tks fields still store tokenized values

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…hot is empty for table metadata aggregation

…dd chunk-key probe fallback for pinyin/suffixed fields

…s before single KB parser_config update

ahmadintisar · 2026-03-24T21:43:33Z

@Magicbook1108 @yingfeng

All review feedback has been addressed:

Tokenized metadata values — Added _raw fields alongside _tks for metadata/both text columns on ES. Aggregation prefers _raw for human-readable values; falls back to _tks with list-to-string conversion for old chunks.

Consistent default roles — meta_cols is now built from table_column_names with roles.get(col, 'both') so unlisted columns default to both, matching the table parser's behavior.

Pinyin/non-ASCII field_map resolution — When field_map is empty on the task snapshot (first parse), aggregation now reloads the latest field_map from DB via KnowledgebaseService.get_by_id. Probe kept as last-resort fallback.

Unit tests — 27 tests added covering ES probe helpers, field_map vs probe priority, _raw preference, metadata aggregation (manual/auto mode, partial roles, dedup, tks fallback, KB reload, Infinity path). Helpers extracted to rag/utils/table_es_metadata.py for clean imports.
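The first two points can be sketched together. This is a simplified illustration, not the PR's `aggregate_table_manual_doc_metadata()`: the function name, the flat chunk dicts, and the `{col}_raw` / `{col}_tks` key layout without `field_map` resolution are assumptions.

```python
from typing import Any, Dict, List


def aggregate_metadata(chunks: List[Dict[str, Any]],
                       column_names: List[str],
                       roles: Dict[str, str]) -> Dict[str, List[Any]]:
    """Collect unique values per metadata column across all chunks."""
    # Unlisted columns default to "both", matching the parser's behavior.
    meta_cols = [c for c in column_names
                 if roles.get(c, "both") in ("metadata", "both")]
    result: Dict[str, List[Any]] = {}
    for col in meta_cols:
        seen: List[Any] = []
        for chunk in chunks:
            if f"{col}_raw" in chunk:
                # Preferred: the human-readable original value.
                value = chunk[f"{col}_raw"]
            elif f"{col}_tks" in chunk:
                # Legacy fallback: join token lists instead of repr().
                value = chunk[f"{col}_tks"]
                if isinstance(value, (list, tuple)):
                    value = " ".join(str(tok) for tok in value)
            else:
                continue
            if value not in seen:  # dedup while keeping first-seen order
                seen.append(value)
        if seen:
            result[col] = seen
    return result
```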

All tests passing locally. Ready for re-review.

ahmadintisar · 2026-03-25T00:35:56Z

@Magicbook1108 @yingfeng

Please restart the CI.

ahmadintisar · 2026-03-26T08:16:37Z

@Magicbook1108 @yingfeng Could you please complete the review?

ahmadintisar · 2026-03-31T00:42:31Z

@6ba3i @Magicbook1108 @yingfeng

Can you please review the PR? :)

6ba3i

Sorry for the late response. Because this PR affects the parser/config path, I took a bit more time to read through it carefully.

The feature direction makes sense, but I do not think it is ready to merge yet.

I am currently hitting a blocker when trying to validate the main flow locally. When modifying manual column mode, I get Field: <parser_config.ext> - Extra inputs are not permitted, so I cannot confirm the feature works end to end.

My main concerns are:

ParserConfig currently replaces ext with the new table-specific fields, which looks risky and seems directly related to the validation error above. This should probably be an additive schema change, not one that invalidates an existing payload shape.
The implementation goes fairly deep for what looks like a configuration feature. It now touches parser chunking, task-time config merging, and post-index metadata aggregation. That makes the behavior harder to reason about and harder for me to confidently validate from outside the parser area.
I only see helper-level unit tests added here. I do not see coverage for the full save -> reparse -> retrieval/metadata flow, which is exactly where I hit the regression.
Since this affects parser behavior, I would be more comfortable with a clearer contract for where table_column_mode / table_column_roles live, how they propagate, and how backward compatibility is preserved for existing parser configs.

Because this touches parser behavior, I would prefer to hold off on merging until the validation issue is fixed and the full flow is covered.

…ct resolution
- ext: Annotated[dict, Field(default={})] is an upstream field on ParserConfig
- Was accidentally removed during rebase conflict resolution
- Causes 'Extra inputs are not permitted' when frontend sends ext in parser_config

ahmadintisar · 2026-04-01T09:42:25Z

@6ba3i
Thanks for the detailed review and for catching the ext issue.

1. parser_config.ext validation error — Fixed

The upstream ext: Annotated[dict, Field(default={})] field on ParserConfig was accidentally removed during a rebase conflict resolution. It's now restored in the latest commit. This was not an intentional schema change — the three new table fields (table_column_mode, table_column_roles, table_column_names) are purely additive with None defaults. Should be unblocked now.

2. Implementation depth

Understood the concern. The layers involved are:

table.py — chunk construction (core feature, gated behind table_column_mode == "manual")
task_executor.py — config merge needed because task snapshots predate KB config saves; metadata aggregation for DocMetadataService visibility
table_es_metadata.py — extracted helpers to keep task_executor clean

All new code paths are gated behind parser_id == "table" + table_column_mode == "manual", so they're fully isolated from other parsers. Happy to refactor further if you'd prefer a different separation.

3. End-to-end test coverage

Will add integration tests covering the full flow: save config → reparse → verify chunk content (vectorize-only columns in text, metadata-only in chunk_data) → verify DocMetadataService field count. Current 27 tests cover helpers and aggregation; the E2E tests will cover config propagation.

4. Config contract

Will add a doc comment block in table_es_metadata.py clarifying:

table_column_mode / table_column_roles / table_column_names live on KB-level parser_config
Frontend saves via dataset settings PUT
merge_table_parser_config_from_kb() overlays them onto doc-level task config at parse time
If absent or auto → zero behavior change

Let me know if you'd like any of these handled differently.

ahmadintisar · 2026-04-01T10:10:54Z

I think adding the config contract to the codebase isn't clean, so I'll post it here instead.

Config contract for table column roles:

Three keys on KB-level parser_config:

| Key | Type | Description |
| --- | --- | --- |
| `table_column_mode` | `"auto"` \| `"manual"` \| `None` | `auto`/absent = all columns both (unchanged RAGFlow behavior). `manual` = apply `table_column_roles`. |
| `table_column_roles` | `dict[str, "vectorize" \| "metadata" \| "both"]` \| `None` | Per-column role. Only used when mode is `manual`. Unlisted columns default to `both`. |
| `table_column_names` | `list[str]` \| `None` | Set by backend after first parse. Used by frontend for the column role selector UI. |

Lifecycle:

Frontend saves table_column_mode and table_column_roles via dataset settings PUT → stored on Knowledgebase.parser_config
table_column_names is written by rag/app/table.py after first parse via KnowledgebaseService.update_parser_config
At parse time, merge_table_parser_config_from_kb() overlays these KB-level keys onto the document-level task config (task snapshots predate KB config saves)
rag/app/table.py reads the merged config and applies column roles to chunk construction
aggregate_table_manual_doc_metadata() writes metadata-role columns to DocMetadataService after chunk indexing

Backward compatibility: If table_column_mode is absent or "auto", zero behavior change across all paths.
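As a rough sketch of this contract, the overlay and the mode gate could look like the following. The names `overlay_table_config` and `effective_roles` are hypothetical; the actual `merge_table_parser_config_from_kb()` signature is not shown in this thread, so the arguments here are assumptions.

```python
from typing import Any, Dict

TABLE_KEYS = ("table_column_mode", "table_column_roles", "table_column_names")


def overlay_table_config(doc_config: Dict[str, Any],
                         kb_config: Dict[str, Any],
                         parser_id: str) -> Dict[str, Any]:
    """Copy the three table keys from KB-level config onto the task config."""
    merged = dict(doc_config)  # never mutate the task snapshot
    if parser_id != "table":
        return merged  # other parsers are untouched
    for key in TABLE_KEYS:
        if key in kb_config:
            merged[key] = kb_config[key]
    return merged


def effective_roles(parser_config: Dict[str, Any]) -> Dict[str, str]:
    """Resolve the role of every known column under the contract above."""
    names = parser_config.get("table_column_names") or []
    if parser_config.get("table_column_mode") != "manual":
        # auto or absent: stored roles are ignored, every column is "both"
        return {name: "both" for name in names}
    roles = parser_config.get("table_column_roles") or {}
    return {name: roles.get(name, "both") for name in names}
```

Because roles are ignored outside manual mode rather than deleted, switching back to auto keeps the saved selections intact.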

…oles
- Test auto mode, manual vectorize/metadata/both, partial roles defaulting to both
- Test ES _raw fields alongside _tks for metadata text columns
- Test KnowledgebaseService.update_parser_config called with table_column_names
- Test chunk count matches CSV row count
- Mocked: KnowledgebaseService, tokenizer, DOC_ENGINE settings
- No source files modified

…e in CI
- Mock deepdoc.vision.ocr, deepdoc.parser.figure_parser, rag.app.picture in sys.modules before importing rag.app.table
- ONNX model files don't exist in CI environment
- Test logic unchanged, only import guards added
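The import-guard technique from this commit can be sketched with the standard library alone. This is an assumption-laden illustration: `load_with_stubs` and `hypothetical_onnx_runtime` are made-up names, and the PR's tests may register the stubs differently (for example without restoring `sys.modules` afterwards).

```python
import importlib
import sys
from typing import Iterable
from unittest.mock import MagicMock, patch


def load_with_stubs(module_name: str, stub_names: Iterable[str]):
    """Import `module_name` while the named heavy modules are replaced by mocks.

    Any `import <stub_name>` executed during the import resolves to a
    MagicMock, so code that would load model files at import time never runs.
    """
    stubs = {name: MagicMock(name=name) for name in stub_names}
    # patch.dict restores sys.modules to its previous contents on exit.
    with patch.dict(sys.modules, stubs):
        return importlib.import_module(module_name)
```

One caveat of scoping the stubs this way: a module first imported inside the `with` block is also dropped from `sys.modules` on exit, although the returned module object stays usable.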

ahmadintisar · 2026-04-01T11:01:48Z

CI is green. 35 tests passing:

27 helper/aggregation unit tests (test/unit_test/rag/svr/) — ES probe, field_map resolution, _raw derivation, value conversion, merge config, metadata aggregation (manual/auto, partial roles, dedup, KB reload, Infinity path)
8 chunk-level integration tests (test/unit_test/rag/app/) — calls real chunk() with synthetic CSV, mocked KnowledgebaseService and tokenizer. Covers auto mode, manual vectorize/metadata/both, partial roles defaulting to both, ES _raw fields, update_parser_config payload, chunk count.

Full save → reparse → retrieve E2E requires a running stack per test/README.md conventions and is out of scope for unit tests.

Ahmad Intisar added 6 commits March 18, 2026 22:44

merge: resolve validation_utils.py conflict with upstream main

148f2af

dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. 🌈 python Pull requests that update Python code 💞 feature Feature request, pull request that fullfill a new feature. 🧰 typescript Pull requests that update Typescript code labels Mar 19, 2026

yingfeng requested a review from Magicbook1108 March 20, 2026 01:31

yingfeng added the ci Continue Integration label Mar 20, 2026

yingfeng marked this pull request as draft March 20, 2026 01:32

yingfeng marked this pull request as ready for review March 20, 2026 01:32

yingfeng requested a review from Copilot March 24, 2026 11:06

Copilot started reviewing on behalf of yingfeng March 24, 2026 11:07

Copilot AI reviewed Mar 24, 2026

ahmadintisar marked this pull request as draft March 24, 2026 16:06

Ahmad Intisar and others added 7 commits March 24, 2026 19:10

Update rag/svr/task_executor.py

2a69c36

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix(task_executor): reload KB parser_config field_map when task snaps…

d4dc5f6

…hot is empty for table metadata aggregation

fix(task_executor): reload KB field_map for table metadata ES keys; a…

66b7673

…dd chunk-key probe fallback for pinyin/suffixed fields

Downgraded logs from logging.info to logging.debug

fb67de1

fix(table): merge field_map and table_column_names across Excel sheet…

1fda78e

…s before single KB parser_config update

Helper methods, and unit tests covered

c371058

ahmadintisar marked this pull request as ready for review March 24, 2026 21:43

dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Mar 24, 2026

JinHai-CN and others added 2 commits March 25, 2026 18:19

Merge branch 'main' into feature/table-parser-column-roles

f0c2733

Merge branch 'main' into feature/table-parser-column-roles

8c4f7a6

Magicbook1108 requested a review from 6ba3i March 26, 2026 09:50

6ba3i suggested changes Apr 1, 2026

dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Apr 1, 2026

ahmadintisar requested a review from 6ba3i April 1, 2026 11:02

Merge branch 'infiniflow:main' into feature/table-parser-column-roles

00275c3

5 participants
