Feature/table parser column roles #13710

Open
ahmadintisar wants to merge 19 commits into infiniflow:main from Attili-sys:feature/table-parser-column-roles

Conversation

@ahmadintisar
Contributor

@ahmadintisar ahmadintisar commented Mar 19, 2026

What problem does this PR solve?

The table file parser (CSV/Excel) currently treats all columns identically — every column is both vectorized (embedded in chunk text) and stored as filterable metadata. There's no way for users to control which columns should be searchable by semantic meaning versus which should only be filterable attributes.

For example, when ingesting a news articles CSV with columns like title, content, country, category, source, etc., the embedding includes metadata fields like country: Brazil and source: Reuters in the chunk text, which dilutes the semantic quality of the embedding without adding retrieval value.

The RDBMS connector (MySQL/PostgreSQL) already supports content_columns / metadata_columns, but this capability was missing for file-based table ingestion.

This PR adds column-level control (vectorize / metadata / both) for the table file parser, following RAGFlow's existing patterns.

Backward compatible: Datasets without table_column_roles or with table_column_mode: auto behave exactly as before (all columns = both).
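
As a rough illustration of the role semantics described above, here is a minimal sketch rather than the code from this PR. The function name `split_row` and the `"col: value"` text format joined with `"; "` are assumptions made for the example; only the three role names and the "default to both" rule come from the description.

```python
from __future__ import annotations

from typing import Literal

Role = Literal["vectorize", "metadata", "both"]


def split_row(
    row: dict[str, str],
    mode: str | None,
    roles: dict[str, Role] | None,
) -> tuple[str, dict[str, str]]:
    """Return (chunk_text, stored_fields) for one table row (hypothetical helper)."""
    manual = mode == "manual"
    text_parts: list[str] = []
    stored: dict[str, str] = {}
    for col, value in row.items():
        # Outside manual mode, and for columns without an explicit role,
        # every column is treated as "both".
        role = (roles or {}).get(col, "both") if manual else "both"
        if role in ("vectorize", "both"):
            text_parts.append(f"{col}: {value}")  # assumed text format
        if role in ("metadata", "both"):
            stored[col] = value
    return "; ".join(text_parts), stored


row = {"title": "Rates hold", "country": "Brazil", "source": "Reuters"}
roles = {"title": "vectorize", "country": "metadata"}

text, stored = split_row(row, "manual", roles)
auto_text, auto_stored = split_row(row, "auto", roles)
```

In manual mode `country` stays out of the embedded text while the unlisted `source` column still lands in both places; in auto mode the roles are ignored and every column is both embedded and stored.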

Type of change

  • New Feature (non-breaking change which adds functionality)

Ahmad Intisar added 6 commits March 18, 2026 22:44
… chunk construction

- Read table_column_roles from parser_config in chunk()
- Split row processing: vectorize/both → chunk text, metadata/both → chunk_data/typed fields
- Build field_map only from stored columns (metadata + both)
- Backward compatible: missing config defaults all columns to 'both'
…h, and frontend column role selector

- Add TableColumnRole type and table_column_roles/table_column_names to ParserConfig (validation_utils.py)
- Pass table_column_names from table.py to parser_config on first parse
- Build column role selector UI in table.tsx with empty state and re-parse tip
- Add form schema fields and i18n keys for table column roles
…guration

- Add table_column_mode to ParserConfig (auto default, manual enables column roles)
- Gate table_column_roles reading behind manual mode check in table.py
- Add mode selector UI in table.tsx, column role table only shown in manual mode
- Preserve manual selections when switching back to auto
…config at parse time

- task_executor now overlays table_column_mode/roles/names from KB parser_config
  onto document parser_config for table parser tasks
- Root cause: dataset settings saved to KB but task used doc-level config
- Add debug logging with [TABLE_PARSER_DEBUG] and [TASK_EXECUTOR_DEBUG] prefixes
…lumns + fix ES field resolution

- Add aggregate_table_manual_doc_metadata() to collect unique values from metadata/both columns
- Write aggregated metadata via DocMetadataService.update_document_metadata after chunk insertion
- Add ES typed field probe fallback (_probe_es_typed_key_for_column) for when field_map is empty
  (field_map is written during chunk() but task snapshot predates it)
- Probe tries _tks, _dt, _long, _flt, _kwd, bare name in order
- field_map used as primary when available, probe as fallback
- Only triggers for table parser with table_column_mode=manual
- Auto mode unchanged: no doc metadata written
- Debug logging with [TABLE_META_DEBUG] prefix throughout
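
The field_map-first, probe-second lookup in this commit could be sketched roughly as follows. This is not the PR's `_probe_es_typed_key_for_column`; the name `resolve_typed_key` is hypothetical, and the shape of `field_map` (typed key mapped to column name) is an assumption made for the example. Only the suffix order is taken from the commit message.

```python
from __future__ import annotations

# Probe order from the commit message; "" stands for the bare column name.
PROBE_SUFFIXES = ("_tks", "_dt", "_long", "_flt", "_kwd", "")


def resolve_typed_key(
    column: str,
    chunk: dict[str, object],
    field_map: dict[str, str] | None = None,
) -> str | None:
    """Find the stored key for a column (hypothetical helper)."""
    # Primary source: field_map, assumed here to map typed key -> column name.
    for typed_key, name in (field_map or {}).items():
        if name == column:
            return typed_key
    # Fallback: probe the chunk for the first suffixed key that exists.
    for suffix in PROBE_SUFFIXES:
        candidate = f"{column}{suffix}"
        if candidate in chunk:
            return candidate
    return None


chunk = {"country_kwd": "Brazil", "views_long": 10}
probed = resolve_typed_key("country", chunk)
mapped = resolve_typed_key("country", chunk, {"country_tks": "country"})
missing = resolve_typed_key("author", chunk)
```

When `field_map` is empty, as on a first parse where the task snapshot predates it, the probe finds `country_kwd`; once a map is available it takes priority over whatever the chunk happens to contain.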
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. 🌈 python Pull requests that update Python code 💞 feature Feature request, pull request that fullfill a new feature. 🧰 typescript Pull requests that update Typescript code labels Mar 19, 2026
@ahmadintisar
Contributor Author

Here's a quick demonstration.

I tested by uploading a CSV file with the table parser.
Screenshot 2026-03-20 at 3 52 17 AM

The configuration now has an option to select auto or manual column mode. It defaults to Auto.
Screenshot 2026-03-20 at 3 53 15 AM

First, let's demonstrate Auto mode, which is the default. Each row is a document, the whole content is vectorized, and the metadata is empty.

Screenshot 2026-03-20 at 3 54 16 AM

Chunk Results:
Screenshot 2026-03-20 at 3 54 42 AM

Now I will change the configuration mode to manual and select which columns I want vectorized and which ones should be included in the metadata:
Screenshot 2026-03-20 at 3 55 32 AM

Changes take effect after reparsing the same file.
Screenshot 2026-03-20 at 3 56 19 AM

Now the chunk results include the columns that were selected as vectorized in the configuration.

Screenshot 2026-03-20 at 3 57 04 AM

The metadata is now also visible in manual column mode.

Screenshot 2026-03-20 at 3 57 45 AM

I hope this demonstration helps!

@ahmadintisar
Contributor Author

ahmadintisar commented Mar 19, 2026

I have also added debug logs. Once the feature is reviewed, it's safe to remove them!

@yingfeng yingfeng requested a review from Magicbook1108 March 20, 2026 01:31
@yingfeng yingfeng added the ci Continue Integration label Mar 20, 2026
@yingfeng yingfeng marked this pull request as draft March 20, 2026 01:32
@yingfeng yingfeng marked this pull request as ready for review March 20, 2026 01:32
@codecov

codecov bot commented Mar 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.52%. Comparing base (af40be6) to head (00275c3).

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #13710   +/-   ##
=======================================
  Coverage   96.52%   96.52%           
=======================================
  Files          10       10           
  Lines         690      690           
  Branches      108      108           
=======================================
  Hits          666      666           
  Misses          8        8           
  Partials       16       16           


@ahmadintisar
Contributor Author

@Magicbook1108 Can you pls review the PR?

Contributor

Copilot AI left a comment


Pull request overview

Adds dataset-level column role controls to the CSV/Excel (“table”) parser so users can choose which columns are embedded into chunk text vs stored as filterable metadata, aligning file-based table ingestion with existing RDBMS connector patterns.

Changes:

  • Introduces table_column_mode (auto/manual), table_column_roles, and table_column_names in dataset parser config (API + frontend schema/UI).
  • Updates the table chunker to respect column roles: vectorize-only columns go to chunk text; metadata/both columns are stored for filtering and retrieval.
  • Adds post-index aggregation to write table-derived metadata into DocMetadataService when manual mode is enabled.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| web/src/pages/dataset/dataset-setting/form-schema.ts | Extends frontend validation schema for table column mode/roles/names. |
| web/src/pages/dataset/dataset-setting/configuration/table.tsx | Adds UI controls (auto/manual + per-column role selector) for table ingestion. |
| web/src/locales/en.ts | Adds English i18n strings for the new table column role UI. |
| api/utils/validation_utils.py | Extends backend request validation model (ParserConfig) with table column mode/roles/names. |
| rag/app/table.py | Applies column roles during chunk creation; updates KB parser_config with field_map + table_column_names. |
| rag/svr/task_executor.py | Merges KB table parser config into chunk tasks and aggregates table metadata to document-level metadata post-index. |

@ahmadintisar ahmadintisar marked this pull request as draft March 24, 2026 16:06
Ahmad Intisar and others added 7 commits March 24, 2026 19:10
…ields for DocMetadataService

- Add {col}_raw field for metadata/both text columns in table.py (ES path only)
- Aggregation in task_executor prefers _raw over _tks for human-readable metadata
- Legacy _tks fallback joins token lists into strings instead of repr()
- ES search behavior unchanged: _tks fields still store tokenized values
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…hot is empty for table metadata aggregation
…dd chunk-key probe fallback for pinyin/suffixed fields
@ahmadintisar
Contributor Author

@Magicbook1108 @yingfeng

All review feedback has been addressed:

Tokenized metadata values: Added _raw fields alongside _tks for metadata/both text columns on ES. Aggregation prefers _raw for human-readable values and falls back to _tks with list-to-string conversion for old chunks.

Consistent default roles: meta_cols is now built from table_column_names with roles.get(col, 'both'), so unlisted columns default to both, matching the table parser's behavior.

Pinyin/non-ASCII field_map resolution: When field_map is empty on the task snapshot (first parse), aggregation now reloads the latest field_map from the DB via KnowledgebaseService.get_by_id. The probe is kept as a last-resort fallback.

Unit tests: 27 tests added covering ES probe helpers, field_map vs probe priority, _raw preference, metadata aggregation (manual/auto mode, partial roles, dedup, tks fallback, KB reload, Infinity path). Helpers extracted to rag/utils/table_es_metadata.py for clean imports.

All tests passing locally. Ready for re-review.
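
The aggregation rules listed above (default role `both`, `_raw` preferred over `_tks`, token lists joined into strings, duplicates dropped) might look roughly like this. It is a simplified sketch, not the PR's `aggregate_table_manual_doc_metadata()`: the function name `aggregate_metadata`, the flat chunk dictionaries, and joining tokens with a single space are all assumptions for illustration.

```python
from __future__ import annotations


def aggregate_metadata(
    chunks: list[dict[str, object]],
    column_names: list[str],
    roles: dict[str, str],
) -> dict[str, list[str]]:
    """Collect unique values per metadata column (hypothetical helper)."""
    # Unlisted columns default to "both", so they count as metadata columns.
    meta_cols = [
        c for c in column_names if roles.get(c, "both") in ("metadata", "both")
    ]
    result: dict[str, list[str]] = {c: [] for c in meta_cols}
    for chunk in chunks:
        for col in meta_cols:
            value = chunk.get(f"{col}_raw")  # human-readable value, if present
            if value is None:
                value = chunk.get(f"{col}_tks")  # legacy tokenized fallback
                if isinstance(value, list):
                    value = " ".join(str(tok) for tok in value)
            if value is not None and str(value) not in result[col]:
                result[col].append(str(value))
    return result


chunks = [
    {"country_raw": "Brazil", "country_tks": "brazil"},
    {"country_tks": ["south", "korea"]},  # older chunk without _raw
    {"country_raw": "Brazil"},  # duplicate value, dropped
]
aggregated = aggregate_metadata(chunks, ["title", "country"], {"title": "vectorize"})
```

Here `title` is excluded because its role is `vectorize`, while `country` has no explicit role and is therefore aggregated.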

@ahmadintisar ahmadintisar marked this pull request as ready for review March 24, 2026 21:43
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Mar 24, 2026
@ahmadintisar
Contributor Author

@Magicbook1108 @yingfeng

Please restart the CI.

@ahmadintisar
Contributor Author

@Magicbook1108 @yingfeng Could you please complete the review?

@Magicbook1108 Magicbook1108 requested a review from 6ba3i March 26, 2026 09:50
@ahmadintisar
Contributor Author

@6ba3i @Magicbook1108 @yingfeng

Can you please review the PR? :)

Contributor

@6ba3i 6ba3i left a comment


Sorry for the late response. Because this PR affects the parser/config path, I took a bit more time to read through it carefully.

The feature direction makes sense, but I do not think it is ready to merge yet.

I am currently hitting a blocker when trying to validate the main flow locally. When modifying manual column mode, I get Field: <parser_config.ext> - Extra inputs are not permitted, so I cannot confirm the feature works end to end.

Image

My main concerns are:

  • ParserConfig currently replaces ext with the new table-specific fields, which looks risky and seems directly related to the validation error above. This should probably be an additive schema change, not one that invalidates an existing payload shape.
  • The implementation goes fairly deep for what looks like a configuration feature. It now touches parser chunking, task-time config merging, and post-index metadata aggregation. That makes the behavior harder to reason about and harder for me to confidently validate from outside the parser area.
  • I only see helper-level unit tests added here. I do not see coverage for the full save -> reparse -> retrieval/metadata flow, which is exactly where I hit the regression.
  • Since this affects parser behavior, I would be more comfortable with a clearer contract for where table_column_mode / table_column_roles live, how they propagate, and how backward compatibility is preserved for existing parser configs.

Because this touches parser behavior, I would prefer to hold off on merging until the validation issue is fixed and the full flow is covered.

…ct resolution

- ext: Annotated[dict, Field(default={})] is an upstream field on ParserConfig
- Was accidentally removed during rebase conflict resolution
- Causes 'Extra inputs are not permitted' when frontend sends ext in parser_config
@ahmadintisar
Contributor Author

@6ba3i
Thanks for the detailed review and for catching the ext issue.

1. parser_config.ext validation error — Fixed

The upstream ext: Annotated[dict, Field(default={})] field on ParserConfig was accidentally removed during a rebase conflict resolution. It's now restored in the latest commit. This was not an intentional schema change — the three new table fields (table_column_mode, table_column_roles, table_column_names) are purely additive with None defaults. Should be unblocked now.

2. Implementation depth

Understood the concern. The layers involved are:

  • table.py — chunk construction (core feature, gated behind table_column_mode == "manual")
  • task_executor.py — config merge needed because task snapshots predate KB config saves; metadata aggregation for DocMetadataService visibility
  • table_es_metadata.py — extracted helpers to keep task_executor clean

All new code paths are gated behind parser_id == "table" + table_column_mode == "manual", so they're fully isolated from other parsers. Happy to refactor further if you'd prefer a different separation.

3. End-to-end test coverage

Will add integration tests covering the full flow: save config → reparse → verify chunk content (vectorize-only columns in text, metadata-only in chunk_data) → verify DocMetadataService field count. Current 27 tests cover helpers and aggregation; the E2E tests will cover config propagation.

4. Config contract

Will add a doc comment block in table_es_metadata.py clarifying:

  • table_column_mode / table_column_roles / table_column_names live on KB-level parser_config
  • Frontend saves via dataset settings PUT
  • merge_table_parser_config_from_kb() overlays them onto doc-level task config at parse time
  • If absent or auto → zero behavior change

Let me know if you'd like any of these handled differently.

@ahmadintisar
Contributor Author

I think adding the config contract to the codebase isn't clean, so I'll drop it here instead.

Config contract for table column roles:

Three keys on KB-level parser_config:

| Key | Type | Description |
| --- | --- | --- |
| table_column_mode | `"auto" \| "manual" \| None` | auto/absent = all columns both (unchanged RAGFlow behavior). manual = apply table_column_roles. |
| table_column_roles | `dict[str, "vectorize" \| "metadata" \| "both"] \| None` | Per-column role. Only used when mode is manual. Unlisted columns default to both. |
| table_column_names | `list[str] \| None` | Set by backend after first parse. Used by frontend for the column role selector UI. |

Lifecycle:

  1. Frontend saves table_column_mode and table_column_roles via dataset settings PUT → stored on Knowledgebase.parser_config
  2. table_column_names is written by rag/app/table.py after first parse via KnowledgebaseService.update_parser_config
  3. At parse time, merge_table_parser_config_from_kb() overlays these KB-level keys onto the document-level task config (task snapshots predate KB config saves)
  4. rag/app/table.py reads the merged config and applies column roles to chunk construction
  5. aggregate_table_manual_doc_metadata() writes metadata-role columns to DocMetadataService after chunk indexing

Backward compatibility: If table_column_mode is absent or "auto", zero behavior change across all paths.
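
Step 3 of the lifecycle above, overlaying KB-level keys onto the document-level task config, could be sketched as below. This is an illustration under assumptions, not the PR's `merge_table_parser_config_from_kb()`: the name `overlay_table_config`, the plain-dict arguments, and the choice to leave the task config untouched when a KB key is absent are all hypothetical.

```python
from __future__ import annotations

# The three KB-level keys named in the contract.
TABLE_KEYS = ("table_column_mode", "table_column_roles", "table_column_names")


def overlay_table_config(
    task_config: dict[str, object],
    kb_config: dict[str, object],
    parser_id: str,
) -> dict[str, object]:
    """Return a copy of task_config with KB table keys applied (hypothetical)."""
    merged = dict(task_config)
    if parser_id != "table":
        return merged  # other parsers are left untouched
    for key in TABLE_KEYS:
        # Assumption: an absent or None KB value does not override the task.
        if kb_config.get(key) is not None:
            merged[key] = kb_config[key]
    return merged


task_config = {"chunk_token_num": 512}  # stale snapshot without table keys
kb_config = {
    "table_column_mode": "manual",
    "table_column_roles": {"country": "metadata"},
}
merged = overlay_table_config(task_config, kb_config, "table")
untouched = overlay_table_config(task_config, kb_config, "naive")
```

Returning a copy keeps the original task snapshot unchanged, which makes it easy to compare the two configs in a test.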

…oles

- Test auto mode, manual vectorize/metadata/both, partial roles defaulting to both
- Test ES _raw fields alongside _tks for metadata text columns
- Test KnowledgebaseService.update_parser_config called with table_column_names
- Test chunk count matches CSV row count
- Mocked: KnowledgebaseService, tokenizer, DOC_ENGINE settings
- No source files modified
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Apr 1, 2026
…e in CI

- Mock deepdoc.vision.ocr, deepdoc.parser.figure_parser, rag.app.picture
  in sys.modules before importing rag.app.table
- ONNX model files don't exist in CI environment
- Test logic unchanged, only import guards added
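
The import-guard technique in this commit, registering stand-ins in `sys.modules` before the module under test is imported, might look like this in isolation. The package name `hypothetical_heavy_pkg` and the helper `stub_modules` are made up for the example; the real tests stub the modules listed in the commit message.

```python
import sys
from unittest.mock import MagicMock


def stub_modules(*names: str) -> None:
    """Register MagicMock stand-ins for each dotted module name and its parents."""
    for name in names:
        parts = name.split(".")
        for i in range(1, len(parts) + 1):
            # setdefault avoids clobbering a module that is really installed.
            sys.modules.setdefault(".".join(parts[:i]), MagicMock())


# Must run before any import that would pull in the heavy dependency.
stub_modules("hypothetical_heavy_pkg.vision.ocr")

# The import now resolves from sys.modules instead of loading model files.
from hypothetical_heavy_pkg.vision.ocr import OCR  # noqa: E402

ocr = OCR()  # calling a MagicMock returns another MagicMock
```

Stubbing the parent packages as well keeps both `import a.b.c` and `from a.b.c import X` working, since the plain `import` form also resolves the top-level package.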
@ahmadintisar
Contributor Author

CI is green. 35 tests passing:

  • 27 helper/aggregation unit tests (test/unit_test/rag/svr/) — ES probe, field_map resolution, _raw derivation, value conversion, merge config, metadata aggregation (manual/auto, partial roles, dedup, KB reload, Infinity path)
  • 8 chunk-level integration tests (test/unit_test/rag/app/) — calls real chunk() with synthetic CSV, mocked KnowledgebaseService and tokenizer. Covers auto mode, manual vectorize/metadata/both, partial roles defaulting to both, ES _raw fields, update_parser_config payload, chunk count.

Full save → reparse → retrieve E2E requires a running stack per test/README.md conventions and is out of scope for unit tests.

@ahmadintisar ahmadintisar requested a review from 6ba3i April 1, 2026 11:02
Labels

ci Continue Integration 💞 feature Feature request, pull request that fullfill a new feature. 🌈 python Pull requests that update Python code size:XXL This PR changes 1000+ lines, ignoring generated files. 🧰 typescript Pull requests that update Typescript code


5 participants