[MEDI] Design feedback: Built-in chunkers don't propagate element metadata to chunks #7465

@luisquintanilla

Summary

All four public IngestionChunker<string> implementations drop element metadata when creating chunks. The metadata API exists at every layer (IngestionDocumentElement.Metadata, IngestionChunk.Metadata), and VectorStoreWriter correctly persists all chunk metadata, but the chunkers are the gap.

The issue

When an IngestionDocumentReader sets metadata on elements (e.g., bounding box coordinates, element type labels, page numbers beyond what's in the section), that metadata is lost during chunking. ElementsChunker.Process() reads element.GetMarkdown() for text content but never reads element.Metadata. The chunks it creates at lines 199 and 208 only receive (text, document, context).

Affected chunkers

| Chunker | Uses ElementsChunker? | Metadata dropped? |
| --- | --- | --- |
| SectionChunker | Yes | Yes |
| HeaderChunker | Yes | Yes |
| SemanticSimilarityChunker | Yes | Yes |
| DocumentTokenChunker | No (creates chunks directly) | Yes (same pattern) |

Reproduction

// Reader sets element metadata
paragraph.Metadata["element_type"] = "table";

// After chunking, the metadata is gone
await foreach (var chunk in chunker.ProcessAsync(document))
{
    chunk.Metadata.ContainsKey("element_type"); // false
}

Why this matters

Readers that do layout analysis (ONNX models, Azure Document Intelligence, etc.) detect element types like table, picture, section_header, caption, formula. This structural information is useful for:

  • Type-aware enrichment: table chunks benefit from different summarization prompts than body text
  • Filtered search: "find all table chunks" without scanning content
  • Hybrid search metadata: element types and other reader-produced metadata can participate in keyword matching

The metadata infrastructure is already there. IngestionDocumentElement.Metadata is a Dictionary<string, object?>, IngestionChunk.Metadata is a Dictionary<string, object>, and VectorStoreWriter (lines 88-92) writes all chunk metadata to the vector store. The only gap is that ElementsChunker (and DocumentTokenChunker, which creates chunks directly) does not copy element metadata to the chunks it produces.

Potential fix

In ElementsChunker.Process(), when accumulating elements into a chunk, also accumulate their metadata. When committing a chunk, merge the accumulated metadata into chunk.Metadata. For keys that appear in multiple elements within the same chunk, a "first wins" or "most common" strategy would work.

Rough shape (~15 lines):

// Track metadata for the current chunk's elements
var accumulatedMetadata = new Dictionary<string, object>();

// When adding an element to the current chunk:
if (element.HasMetadata)
{
    foreach (var kvp in element.Metadata)
    {
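        // Chunk metadata values are non-nullable, so skip nulls.
        // TryAdd keeps the first value seen for a key ("first wins").
        if (kvp.Value is not null)
            accumulatedMetadata.TryAdd(kvp.Key, kvp.Value);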
    }
}

// When committing a chunk:
var chunk = new IngestionChunk<string>(_currentChunk.ToString(), document, context);
foreach (var kvp in accumulatedMetadata)
{
    chunk.Metadata[kvp.Key] = kvp.Value;
}
accumulatedMetadata.Clear();

DocumentTokenChunker has a similar pattern and would need the same treatment.
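
The rough shape above implements "first wins". For comparison, here is a minimal sketch of the "most common" alternative. It works on plain dictionaries rather than the real element types, and `MetadataMerge` and `MostCommon` are hypothetical names used only for illustration, not part of the library.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the ingestion library.
public static class MetadataMerge
{
    // For each key, keep the value that appears in the most elements.
    // Ties go to the value seen first. Null values are skipped because
    // chunk metadata values are non-nullable.
    public static Dictionary<string, object> MostCommon(
        IEnumerable<IReadOnlyDictionary<string, object?>> elementMetadata)
    {
        // key -> (value, count) pairs in first-seen order
        var counts = new Dictionary<string, List<(object Value, int Count)>>();

        foreach (var metadata in elementMetadata)
        {
            foreach (var kvp in metadata)
            {
                if (kvp.Value is null) continue;

                if (!counts.TryGetValue(kvp.Key, out var seen))
                {
                    seen = new List<(object Value, int Count)>();
                    counts[kvp.Key] = seen;
                }

                int index = seen.FindIndex(e => object.Equals(e.Value, kvp.Value));
                if (index < 0)
                    seen.Add((kvp.Value, 1));
                else
                    seen[index] = (seen[index].Value, seen[index].Count + 1);
            }
        }

        var merged = new Dictionary<string, object>();
        foreach (var entry in counts)
        {
            // Strict '>' keeps the earliest value when counts are tied.
            var best = entry.Value[0];
            foreach (var candidate in entry.Value)
            {
                if (candidate.Count > best.Count) best = candidate;
            }
            merged[entry.Key] = best.Value;
        }
        return merged;
    }
}
```

"Most common" costs an extra pass and some bookkeeping, but it avoids labeling a chunk as a table just because a single table row happened to come first in a chunk that is mostly body text.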

Workaround

We wrote a MetadataAwareSectionChunker that wraps SectionChunker and post-processes chunks to resolve metadata via content matching. It builds an index of element text to metadata from the document, then matches each chunk's content back to source elements.

Available at: https://github.com/luisquintanilla/PdfPig/blob/feature/intelligent-pdf-ingestion/src/UglyToad.PdfPig.DataIngestion/MetadataAwareSectionChunker.cs

This works but is a workaround. The content-matching approach is less precise than having the chunker propagate metadata directly, especially when chunks span multiple elements or when element text is modified during chunking (e.g., table row splitting).
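
To make the imprecision concrete, here is a minimal sketch of the content-matching idea. It is not the code from the linked MetadataAwareSectionChunker. It assumes plain substring matching over (text, metadata) pairs, and `ContentMatching` and `ResolveMetadata` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of content matching; not the linked implementation.
public static class ContentMatching
{
    // Collects metadata from every element whose text appears verbatim
    // in the chunk content. The first value seen for a key wins.
    public static Dictionary<string, object> ResolveMetadata(
        string chunkContent,
        IEnumerable<(string Text, IReadOnlyDictionary<string, object?> Metadata)> elements)
    {
        var resolved = new Dictionary<string, object>();

        foreach (var (text, metadata) in elements)
        {
            // An element whose text was altered during chunking
            // (e.g. a split table row) no longer matches and is missed.
            if (string.IsNullOrWhiteSpace(text) ||
                !chunkContent.Contains(text, StringComparison.Ordinal))
            {
                continue;
            }

            foreach (var kvp in metadata)
            {
                if (kvp.Value is not null)
                    resolved.TryAdd(kvp.Key, kvp.Value);
            }
        }

        return resolved;
    }
}
```

In this sketch a chunk that contains both a paragraph and a table gets whichever element_type is matched first, and any element whose text changed during chunking contributes nothing, which is the loss of precision described above.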
