[Feature Request]: Optimize Table Structure in Document Parsing for Better Token Efficiency #11490

@TeslaZY

Description

Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (Language Policy).
  • Non-English title submissions will be closed directly (Language Policy).
  • Please do not modify this template :) and fill in all the required fields.

Is your feature request related to a problem?

Describe the feature you'd like

Feature Request: Optimize Table Structure in Document Parsing for Better Token Efficiency

Currently, parsed documents containing tabular data (e.g., reports) are chunked and stored as raw HTML table elements. While this approach faithfully preserves complex table structures (e.g., merged cells, nested headers), it results in low data density: most tokens are spent on HTML markup rather than on actual content. This introduces significant noise, especially for simple two-dimensional tables that could be represented more efficiently in denser formats like CSV or Markdown tables.
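To make the density gap concrete, here is a rough illustration. It assumes character count as a crude proxy for token count (real tokenizers will differ), and the sample table and helper names (`to_html`, `to_markdown`, `to_csv`) are made up for this sketch:

```python
import csv
import io

# Hypothetical sample data, not taken from any real document.
rows = [
    ["Region", "Q1", "Q2"],
    ["North", "120", "135"],
    ["South", "98", "110"],
]


def to_html(rows):
    """Serialize rows as a minimal HTML table (first row as header)."""
    head, *body = rows
    parts = ["<table>", "<tr>" + "".join(f"<th>{c}</th>" for c in head) + "</tr>"]
    for r in body:
        parts.append("<tr>" + "".join(f"<td>{c}</td>" for c in r) + "</tr>")
    parts.append("</table>")
    return "".join(parts)


def to_markdown(rows):
    """Serialize rows as a Markdown pipe table (first row as header)."""
    head, *body = rows
    lines = ["| " + " | ".join(head) + " |", "|" + "|".join("---" for _ in head) + "|"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)


def to_csv(rows):
    """Serialize rows as CSV using the standard library writer."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


html_text, md_text, csv_text = to_html(rows), to_markdown(rows), to_csv(rows)
payload = sum(len(c) for r in rows for c in r)  # characters of actual cell content
for name, text in [("html", html_text), ("markdown", md_text), ("csv", csv_text)]:
    print(f"{name:8s} chars={len(text):4d} content_ratio={payload / len(text):.2f}")
```

Even with no attributes or styling in the HTML, the markup outweighs the cell content; real parser output with classes and inline styles would likely widen the gap.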

Proposal:

Introduce intelligent table serialization during document parsing:

  • For simple 2D tables (no merged cells, consistent structure): convert to a compact format such as CSV or Markdown to improve token efficiency and reduce noise.
  • For complex tables (merged cells, irregular layouts): retain HTML representation or explore alternative structured representations that balance fidelity and density.

This would allow the system to adapt its output format based on table complexity, optimizing downstream processing (e.g., LLM consumption, embedding, retrieval) without sacrificing the ability to handle sophisticated layouts.
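A minimal sketch of what this could look like, using only the Python standard library. The names `TableScanner`, `is_simple`, and `serialize_table` are hypothetical and not part of the existing codebase. The sketch assumes the input is a single HTML table, treats the first row as the header, and calls a table "simple" when it has no rowspan/colspan, no nested table, and the same number of cells in every row:

```python
from html.parser import HTMLParser


class TableScanner(HTMLParser):
    """Collect cell text row by row and note features that make a table complex."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self.has_spans = False  # any rowspan/colspan other than 1
        self.nested = False     # a table inside the table
        self._depth = 0
        self._cell = None       # text fragments of the cell being read

    def _flush(self):
        # Close the cell being read (also covers markup that omits </td>).
        if self._cell is not None:
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
            self.nested = self.nested or self._depth > 1
        elif self._depth != 1:
            return  # ignore structure outside the outermost table
        elif tag == "tr":
            self._flush()
            self.rows.append([])
        elif tag in ("td", "th"):
            self._flush()
            if not self.rows:
                self.rows.append([])
            self._cell = []
            for name, value in attrs:
                if name in ("rowspan", "colspan") and (value or "1").strip() != "1":
                    self.has_spans = True

    def handle_endtag(self, tag):
        if tag == "table":
            if self._depth == 1:
                self._flush()
            self._depth -= 1
        elif tag in ("td", "th", "tr") and self._depth == 1:
            self._flush()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def is_simple(scan):
    """Heuristic: regular grid, no merged cells, no nested tables."""
    widths = {len(row) for row in scan.rows}
    return (bool(scan.rows) and len(widths) == 1 and 0 not in widths
            and not scan.has_spans and not scan.nested)


def serialize_table(html):
    """Return Markdown for simple tables, the original HTML otherwise."""
    scan = TableScanner()
    scan.feed(html)
    scan.close()
    if not is_simple(scan):
        return html  # keep full fidelity for complex layouts

    def line(cells):
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    head, *body = scan.rows
    return "\n".join([line(head), line(["---"] * len(head))] + [line(r) for r in body])


print(serialize_table(
    "<table><tr><th>Region</th><th>Q1</th></tr>"
    "<tr><td>North</td><td>120</td></tr></table>"
))
```

Falling back to the original HTML whenever the heuristic is unsure keeps the change conservative: a misclassified table costs tokens but never loses structure.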

We’d appreciate community input on strategies for detecting table complexity and alternative compact representations for non-trivial tables.
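As one possible starting point for that discussion, merged cells can be unrolled by repeating the merged value in every grid position it covers, which turns the table into a regular grid that CSV or Markdown can hold. This is only a sketch: `unroll_spans` is a hypothetical helper, the input format (each cell as a `(text, rowspan, colspan)` tuple) is an assumption, and the approach deliberately gives up the information that cells were merged.

```python
def unroll_spans(rows):
    """Expand merged cells into a dense grid.

    rows: list of rows; each cell is a (text, rowspan, colspan) tuple.
    Returns a list of equally long lists of strings.
    """
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already occupied by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the value into every position the cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions that no cell covers are filled with an empty string.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Hypothetical table with a two-level header: "Region" spans two rows,
# "Sales" spans two columns.
example = [
    [("Region", 2, 1), ("Sales", 1, 2)],
    [("Q1", 1, 1), ("Q2", 1, 1)],
    [("North", 1, 1), ("120", 1, 1), ("135", 1, 1)],
]
for row in unroll_spans(example):
    print(",".join(row))
```

Whether the repeated values help or confuse downstream retrieval is an open question; a variant could join stacked header rows into a single header such as "Sales / Q1" instead.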

Describe implementation you've considered

No response

Documentation, adoption, use case

Additional information

No response

Metadata

    Labels

    💞 feature: Feature request, pull request that fulfills a new feature.
