Feature/table parser column roles #13710
ahmadintisar wants to merge 19 commits into infiniflow:main
Conversation
… chunk construction
- Read table_column_roles from parser_config in chunk()
- Split row processing: vectorize/both → chunk text, metadata/both → chunk_data/typed fields
- Build field_map only from stored columns (metadata + both)
- Backward compatible: missing config defaults all columns to 'both'
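The role split described in this commit can be sketched in a few lines. The function below is illustrative, not the actual `rag/app/table.py` code; only the role names and the `'both'` default come from the PR:

```python
# Hypothetical sketch of role-based row splitting; names are illustrative.
DEFAULT_ROLE = "both"  # missing config => every column is vectorized AND stored


def split_row(row: dict, roles: dict):
    """Return (chunk_text, stored_fields) for one table row."""
    text_parts, stored = [], {}
    for col, value in row.items():
        role = roles.get(col, DEFAULT_ROLE)
        if role in ("vectorize", "both"):   # goes into the embedded chunk text
            text_parts.append(f"{col}: {value}")
        if role in ("metadata", "both"):    # stored as a filterable typed field
            stored[col] = value
    return "; ".join(text_parts), stored


text, data = split_row(
    {"title": "Flood update", "country": "Brazil", "source": "Reuters"},
    {"title": "vectorize", "country": "metadata", "source": "metadata"},
)
# text -> "title: Flood update"
# data -> {"country": "Brazil", "source": "Reuters"}
```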
…h, and frontend column role selector
- Add TableColumnRole type and table_column_roles/table_column_names to ParserConfig (validation_utils.py)
- Pass table_column_names from table.py to parser_config on first parse
- Build column role selector UI in table.tsx with empty state and re-parse tip
- Add form schema fields and i18n keys for table column roles
…guration
- Add table_column_mode to ParserConfig (auto default, manual enables column roles)
- Gate table_column_roles reading behind manual mode check in table.py
- Add mode selector UI in table.tsx, column role table only shown in manual mode
- Preserve manual selections when switching back to auto
…config at parse time
- task_executor now overlays table_column_mode/roles/names from KB parser_config onto document parser_config for table parser tasks
- Root cause: dataset settings saved to KB but task used doc-level config
- Add debug logging with [TABLE_PARSER_DEBUG] and [TASK_EXECUTOR_DEBUG] prefixes
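The KB-to-document config overlay this commit describes might look roughly like this. The merge helper is hypothetical; only the three key names come from the PR:

```python
# Illustrative sketch of overlaying KB-level table settings onto the
# document-level parser_config at task time. The key names are from this
# PR; the merge helper itself is hypothetical.
TABLE_KEYS = ("table_column_mode", "table_column_roles", "table_column_names")


def overlay_table_config(doc_cfg: dict, kb_cfg: dict) -> dict:
    merged = dict(doc_cfg)
    for key in TABLE_KEYS:
        if key in kb_cfg:  # KB (dataset) settings win for the table keys
            merged[key] = kb_cfg[key]
    return merged


doc = {"chunk_token_num": 128}
kb = {"table_column_mode": "manual", "table_column_roles": {"country": "metadata"}}
merged = overlay_table_config(doc, kb)
```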
…lumns + fix ES field resolution
- Add aggregate_table_manual_doc_metadata() to collect unique values from metadata/both columns
- Write aggregated metadata via DocMetadataService.update_document_metadata after chunk insertion
- Add ES typed field probe fallback (_probe_es_typed_key_for_column) for when field_map is empty (field_map is written during chunk() but task snapshot predates it)
- Probe tries _tks, _dt, _long, _flt, _kwd, bare name in order
- field_map used as primary when available, probe as fallback
- Only triggers for table parser with table_column_mode=manual
- Auto mode unchanged: no doc metadata written
- Debug logging with [TABLE_META_DEBUG] prefix throughout
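The probe order listed above (`_tks`, `_dt`, `_long`, `_flt`, `_kwd`, then the bare name) can be sketched as follows. The real helper is `_probe_es_typed_key_for_column`; this simplified stand-in only illustrates the lookup order against a single indexed chunk document:

```python
# Hypothetical sketch of the typed-field probe fallback: try each known
# Elasticsearch suffix in order, then the bare column name.
PROBE_SUFFIXES = ("_tks", "_dt", "_long", "_flt", "_kwd", "")


def probe_typed_key(column, chunk_doc):
    """Return the first key present in an indexed chunk for this column."""
    for suffix in PROBE_SUFFIXES:
        key = column + suffix
        if key in chunk_doc:
            return key
    return None


# e.g. a numeric column indexed as a long field:
probe_typed_key("year", {"year_long": 2024})  # -> "year_long"
```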
I have also added debug logs; once the feature is reviewed, it's safe to remove them!
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #13710 +/- ##
=======================================
Coverage 96.52% 96.52%
=======================================
Files 10 10
Lines 690 690
Branches 108 108
=======================================
Hits 666 666
Misses 8 8
Partials 16 16
@Magicbook1108 Can you please review the PR?
Pull request overview
Adds dataset-level column role controls to the CSV/Excel (“table”) parser so users can choose which columns are embedded into chunk text vs stored as filterable metadata, aligning file-based table ingestion with existing RDBMS connector patterns.
Changes:
- Introduces `table_column_mode` (auto/manual), `table_column_roles`, and `table_column_names` in dataset parser config (API + frontend schema/UI).
- Updates the table chunker to respect column roles: vectorize-only columns go to chunk text; metadata/both columns are stored for filtering and retrieval.
- Adds post-index aggregation to write table-derived metadata into `DocMetadataService` when manual mode is enabled.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| web/src/pages/dataset/dataset-setting/form-schema.ts | Extends frontend validation schema for table column mode/roles/names. |
| web/src/pages/dataset/dataset-setting/configuration/table.tsx | Adds UI controls (auto/manual + per-column role selector) for table ingestion. |
| web/src/locales/en.ts | Adds English i18n strings for the new table column role UI. |
| api/utils/validation_utils.py | Extends backend request validation model (ParserConfig) with table column mode/roles/names. |
| rag/app/table.py | Applies column roles during chunk creation; updates KB parser_config with field_map + table_column_names. |
| rag/svr/task_executor.py | Merges KB table parser config into chunk tasks and aggregates table metadata to document-level metadata post-index. |
…ields for DocMetadataService
- Add {col}_raw field for metadata/both text columns in table.py (ES path only)
- Aggregation in task_executor prefers _raw over _tks for human-readable metadata
- Legacy _tks fallback joins token lists into strings instead of repr()
- ES search behavior unchanged: _tks fields still store tokenized values
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
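The `_raw`-over-`_tks` preference from this commit can be sketched like so; the helper name and document shape are illustrative, only the suffix semantics come from the commit message:

```python
# Sketch of preferring the human-readable *_raw value over the tokenized
# *_tks value when aggregating metadata (helper name is illustrative).
def readable_value(column, chunk_doc):
    raw = chunk_doc.get(column + "_raw")
    if raw is not None:
        return raw                    # untokenized original text
    tks = chunk_doc.get(column + "_tks")
    if isinstance(tks, list):         # legacy fallback: join tokens, no repr()
        return " ".join(tks)
    return tks


readable_value("title", {"title_raw": "Flood update",
                         "title_tks": ["flood", "update"]})  # -> "Flood update"
readable_value("title", {"title_tks": ["flood", "update"]})  # -> "flood update"
```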
…hot is empty for table metadata aggregation
…dd chunk-key probe fallback for pinyin/suffixed fields
…s before single KB parser_config update
All review feedback has been addressed:
- Tokenized metadata values — Added `{col}_raw` fields so aggregation prefers human-readable values over `_tks`.
- Consistent default roles — columns without an explicit role default to 'both'.
- Pinyin/non-ASCII field_map resolution — When the field_map snapshot is empty, a chunk-key probe fallback resolves pinyin/suffixed fields.
- Unit tests — 27 tests added covering ES probe helpers, field_map vs probe priority, and metadata aggregation.

All tests passing locally. Ready for re-review.
Please restart the CI. |
@Magicbook1108 @yingfeng Could you please complete the review?
@6ba3i @Magicbook1108 @yingfeng Can you please review the PR? :)
6ba3i left a comment
Sorry for the late response. Because this PR affects the parser/config path, I took a bit more time to read through it carefully.
The feature direction makes sense, but I do not think it is ready to merge yet.
I am currently hitting a blocker when trying to validate the main flow locally. When switching to manual column mode, I get `Field: <parser_config.ext> - Extra inputs are not permitted`, so I cannot confirm the feature works end to end.
My main concerns are:
- `ParserConfig` currently replaces `ext` with the new table-specific fields, which looks risky and seems directly related to the validation error above. This should probably be an additive schema change, not one that invalidates an existing payload shape.
- The implementation goes fairly deep for what looks like a configuration feature. It now touches parser chunking, task-time config merging, and post-index metadata aggregation. That makes the behavior harder to reason about and harder for me to confidently validate from outside the parser area.
- I only see helper-level unit tests added here. I do not see coverage for the full save -> reparse -> retrieval/metadata flow, which is exactly where I hit the regression.
- Since this affects parser behavior, I would be more comfortable with a clearer contract for where `table_column_mode`/`table_column_roles` live, how they propagate, and how backward compatibility is preserved for existing parser configs.
Because this touches parser behavior, I would prefer to hold off on merging until the validation issue is fixed and the full flow is covered.
…ct resolution
- `ext: Annotated[dict, Field(default={})]` is an upstream field on `ParserConfig`
- Was accidentally removed during rebase conflict resolution
- Causes 'Extra inputs are not permitted' when the frontend sends `ext` in `parser_config`
@6ba3i

1. The upstream `ext` field: restored; it was accidentally dropped during rebase conflict resolution (see the fix commit above).

2. Implementation depth: Understood the concern. The layers involved are parser chunking (table.py), task-time config merging (task_executor.py), and post-index metadata aggregation. All new code paths are gated behind table_column_mode: manual, so auto mode is unchanged.

3. End-to-end test coverage: Will add integration tests covering the full flow: save config → reparse → verify chunk content (vectorize-only columns in text, metadata-only in chunk_data) → verify DocMetadataService field count. Current 27 tests cover helpers and aggregation; the E2E tests will cover config propagation.

4. Config contract: Will add a doc comment block describing where the keys live, how they propagate, and the backward-compatibility guarantees.
Let me know if you'd like any of these handled differently.
I think adding the config contract to the codebase isn't clean, so I'll drop it here instead.

Config contract for table column roles: three keys on the KB-level `parser_config` — `table_column_mode`, `table_column_roles`, and `table_column_names`.

Lifecycle: the frontend saves the keys to the KB `parser_config`; at parse time `task_executor` overlays them onto the document-level `parser_config`; `table.py` applies the roles during `chunk()` and writes back `field_map` + `table_column_names`; after indexing, metadata/both columns are aggregated into `DocMetadataService`.

Backward compatibility: If `table_column_roles` is missing or `table_column_mode` is `auto`, all columns default to `both` and existing datasets behave exactly as before.
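Under that contract, a KB-level `parser_config` might look like this (the column names and role assignments are example data, not required keys):

```python
# Illustrative KB-level parser_config for the table parser; column names
# and role values are invented example data.
parser_config = {
    "table_column_mode": "manual",  # "auto" (default) or "manual"
    "table_column_names": ["title", "content", "country", "source"],
    "table_column_roles": {
        "title": "both",         # embedded in chunk text AND stored
        "content": "vectorize",  # chunk text only
        "country": "metadata",   # filterable field only
        # "source" omitted -> defaults to "both"
    },
}
```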
…oles
- Test auto mode, manual vectorize/metadata/both, partial roles defaulting to both
- Test ES _raw fields alongside _tks for metadata text columns
- Test KnowledgebaseService.update_parser_config called with table_column_names
- Test chunk count matches CSV row count
- Mocked: KnowledgebaseService, tokenizer, DOC_ENGINE settings
- No source files modified
…e in CI
- Mock deepdoc.vision.ocr, deepdoc.parser.figure_parser, rag.app.picture in sys.modules before importing rag.app.table
- ONNX model files don't exist in CI environment
- Test logic unchanged, only import guards added
CI is green. 35 tests passing.

Full save → reparse → retrieve E2E requires a running stack.








What problem does this PR solve?
The table file parser (CSV/Excel) currently treats all columns identically — every column is both vectorized (embedded in chunk text) and stored as filterable metadata. There's no way for users to control which columns should be searchable by semantic meaning versus which should only be filterable attributes.
For example, when ingesting a news articles CSV with columns like `title`, `content`, `country`, `category`, `source`, etc., the embedding includes metadata fields like `country: Brazil` and `source: Reuters` in the chunk text, which dilutes the semantic quality of the embedding without adding retrieval value.
The RDBMS connector (MySQL/PostgreSQL) already supports `content_columns` / `metadata_columns`, but this capability was missing for file-based table ingestion.
This PR adds column-level control (vectorize / metadata / both) for the table file parser, following RAGFlow's existing patterns.
Backward compatible: Datasets without `table_column_roles` or with `table_column_mode: auto` behave exactly as before (all columns = `both`).
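For the news-CSV example in the description, the difference in what gets embedded can be illustrated like this (row values invented for illustration; the expressions are a sketch, not the parser's actual code):

```python
row = {"title": "Port strike ends",
       "content": "Dock workers returned to work on Monday",
       "country": "Brazil", "source": "Reuters"}

# Before (auto mode): every column lands in the embedded text.
before = "; ".join(f"{k}: {v}" for k, v in row.items())

# After (manual mode): only semantic columns are embedded; the rest
# become filterable metadata on the chunk.
roles = {"title": "vectorize", "content": "vectorize",
         "country": "metadata", "source": "metadata"}
after = "; ".join(f"{k}: {v}" for k, v in row.items()
                  if roles.get(k, "both") in ("vectorize", "both"))
metadata = {k: v for k, v in row.items()
            if roles.get(k, "both") in ("metadata", "both")}
```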
Type of change