## Summary
All four public `IngestionChunker<string>` implementations drop element metadata when creating chunks. The metadata API exists at every layer (`IngestionDocumentElement.Metadata`, `IngestionChunk.Metadata`), and `VectorStoreWriter` correctly persists all chunk metadata, but the chunkers are the gap.

## The issue
When an `IngestionDocumentReader` sets metadata on elements (e.g., bounding box coordinates, element type labels, page numbers beyond what's in the section), that metadata is lost during chunking. `ElementsChunker.Process()` reads `element.GetMarkdown()` for text content but never reads `element.Metadata`. The chunks it creates at lines 199 and 208 receive only `(text, document, context)`.
## Affected chunkers

| Chunker | Uses ElementsChunker? | Metadata dropped? |
| --- | --- | --- |
| `SectionChunker` | Yes | Yes |
| `HeaderChunker` | Yes | Yes |
| `SemanticSimilarityChunker` | Yes | Yes |
| `DocumentTokenChunker` | No (creates chunks directly) | Yes (same pattern) |
## Reproduction

```csharp
// Reader sets element metadata
paragraph.Metadata["element_type"] = "table";

// After chunking, the metadata is gone
await foreach (var chunk in chunker.ProcessAsync(document))
{
    Console.WriteLine(chunk.Metadata.ContainsKey("element_type")); // false
}
```
## Why this matters
Readers that do layout analysis (ONNX models, Azure Document Intelligence, etc.) detect element types such as `table`, `picture`, `section_header`, `caption`, and `formula`. This structural information is useful for:
- Type-aware enrichment: table chunks benefit from different summarization prompts than body text
- Filtered search: "find all table chunks" without scanning content
- Hybrid search metadata: element types and other reader-produced metadata can participate in keyword matching

The metadata infrastructure is already there. `IngestionDocumentElement.Metadata` is a `Dictionary<string, object?>`, `IngestionChunk.Metadata` is a `Dictionary<string, object>`, and `VectorStoreWriter` (lines 88-92) writes all chunk metadata to the vector store. The only gap is `ElementsChunker` not copying element metadata to the chunks it produces.
## Potential fix
In `ElementsChunker.Process()`, when accumulating elements into a chunk, also accumulate their metadata. When committing a chunk, merge the accumulated metadata into `chunk.Metadata`. For keys that appear in multiple elements within the same chunk, a "first wins" or "most common" strategy would work.
Rough shape (~15 lines):
```csharp
// Track metadata for the current chunk's elements
var accumulatedMetadata = new Dictionary<string, object>();

// When adding an element to the current chunk:
if (element.HasMetadata)
{
    foreach (var kvp in element.Metadata)
    {
        if (kvp.Value is not null)
            accumulatedMetadata.TryAdd(kvp.Key, kvp.Value);
    }
}

// When committing a chunk:
var chunk = new IngestionChunk<string>(_currentChunk.ToString(), document, context);
foreach (var kvp in accumulatedMetadata)
{
    chunk.Metadata[kvp.Key] = kvp.Value;
}
accumulatedMetadata.Clear();
```
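`TryAdd` gives "first wins". If "most common wins" were preferred instead, a small LINQ helper could tally values per key; `MergeMostCommon` below is a hypothetical sketch, not part of the library:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical "most common wins" merge: for each key, keep the value
// that occurs most often across the chunk's elements (ties arbitrary).
static Dictionary<string, object> MergeMostCommon(
    IEnumerable<IReadOnlyDictionary<string, object?>> elementMetadata) =>
    elementMetadata
        .SelectMany(m => m)                       // flatten all key/value pairs
        .Where(kvp => kvp.Value is not null)
        .GroupBy(kvp => kvp.Key)                  // group by metadata key
        .ToDictionary(
            g => g.Key,
            g => g.GroupBy(kvp => kvp.Value!)     // count each distinct value
                  .OrderByDescending(vg => vg.Count())
                  .First().Key);
```

For element-type labels this matters: a chunk that is mostly `table` rows plus one `caption` element would be labeled `table` under this strategy, whereas "first wins" depends on element order.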
`DocumentTokenChunker` has a similar pattern and would need the same treatment.

## Workaround
We wrote a `MetadataAwareSectionChunker` that wraps `SectionChunker` and post-processes chunks to resolve metadata via content matching. It builds an index of element text to metadata from the document, then matches each chunk's content back to source elements.
Available at: https://github.com/luisquintanilla/PdfPig/blob/feature/intelligent-pdf-ingestion/src/UglyToad.PdfPig.DataIngestion/MetadataAwareSectionChunker.cs
This works but is a workaround. The content-matching approach is less precise than having the chunker propagate metadata directly, especially when chunks span multiple elements or when element text is modified during chunking (e.g., table row splitting).
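For readers of this issue who don't want to follow the link, the content-matching idea reduces to roughly the following sketch. This is an illustration of the approach only (the tuple shapes and `AttachMetadataByContent` name are invented here, not the actual implementation):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: index element text -> metadata, then attach
// metadata to any chunk whose content contains that element's text.
static void AttachMetadataByContent(
    IEnumerable<(string Text, IReadOnlyDictionary<string, object?> Metadata)> elements,
    IEnumerable<(string Content, IDictionary<string, object> Metadata)> chunks)
{
    // Only elements that actually carry metadata are worth indexing.
    var index = elements.Where(e => e.Metadata.Count > 0).ToList();

    foreach (var chunk in chunks)
    {
        foreach (var element in index)
        {
            if (!chunk.Content.Contains(element.Text, StringComparison.Ordinal))
                continue;

            foreach (var kvp in element.Metadata)
            {
                if (kvp.Value is not null)
                    chunk.Metadata.TryAdd(kvp.Key, kvp.Value); // first wins
            }
        }
    }
}
```

The `Contains` check is exactly where the imprecision comes from: it fails whenever chunking rewrites the element text, which is why direct propagation inside the chunker is the better fix.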