## Summary
All four public `IngestionChunker<string>` implementations drop element metadata when creating chunks. The metadata API exists at every layer (`IngestionDocumentElement.Metadata`, `IngestionChunk.Metadata`), and `VectorStoreWriter` correctly persists all chunk metadata, but the chunkers are the gap.

## The issue
When an `IngestionDocumentReader` sets metadata on elements (e.g., bounding box coordinates, element type labels, page numbers beyond what's in the section), that metadata is lost during chunking. `ElementsChunker.Process()` reads `element.GetMarkdown()` for text content but never reads `element.Metadata`. The chunks it creates at lines 199 and 208 receive only `(text, document, context)`.
## Affected chunkers

| Chunker | Uses ElementsChunker? | Metadata dropped? |
| --- | --- | --- |
| `SectionChunker` | Yes | Yes |
| `HeaderChunker` | Yes | Yes |
| `SemanticSimilarityChunker` | Yes | Yes |
| `DocumentTokenChunker` | No (creates chunks directly) | Yes (same pattern) |
## Reproduction

```csharp
// Reader sets element metadata
paragraph.Metadata["element_type"] = "table";

// After chunking, the metadata is gone
await foreach (var chunk in chunker.ProcessAsync(document))
{
    Console.WriteLine(chunk.Metadata.ContainsKey("element_type")); // false
}
```
## Why this matters
Readers that do layout analysis (ONNX models, Azure Document Intelligence, etc.) detect element types such as `table`, `picture`, `section_header`, `caption`, and `formula`. This structural information is useful for:
- Type-aware enrichment: table chunks benefit from different summarization prompts than body text
- Filtered search: "find all table chunks" without scanning content
- Hybrid search metadata: element types and other reader-produced metadata can participate in keyword matching

The metadata infrastructure is already there. `IngestionDocumentElement.Metadata` is a `Dictionary<string, object?>`, `IngestionChunk.Metadata` is a `Dictionary<string, object>`, and `VectorStoreWriter` (lines 88-92) writes all chunk metadata to the vector store. The only gap is `ElementsChunker` not copying element metadata to the chunks it produces.
## Potential fix
In `ElementsChunker.Process()`, when accumulating elements into a chunk, also accumulate their metadata. When committing a chunk, merge the accumulated metadata into `chunk.Metadata`. For keys that appear in multiple elements within the same chunk, a "first wins" or "most common" strategy would work.
Rough shape (~15 lines):
```csharp
// Track metadata for the current chunk's elements
var accumulatedMetadata = new Dictionary<string, object>();

// When adding an element to the current chunk:
if (element.HasMetadata)
{
    foreach (var kvp in element.Metadata)
    {
        if (kvp.Value is not null)
            accumulatedMetadata.TryAdd(kvp.Key, kvp.Value);
    }
}

// When committing a chunk:
var chunk = new IngestionChunk<string>(_currentChunk.ToString(), document, context);
foreach (var kvp in accumulatedMetadata)
{
    chunk.Metadata[kvp.Key] = kvp.Value;
}
accumulatedMetadata.Clear();
```
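`TryAdd` gives "first wins". If "most common wins" were preferred instead, a small LINQ helper could tally values per key; `MergeMostCommon` below is a hypothetical sketch, not part of the library:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical "most common wins" merge: for each key, keep the value
// that occurs most often across the chunk's elements (ties arbitrary).
static Dictionary<string, object> MergeMostCommon(
    IEnumerable<IReadOnlyDictionary<string, object?>> elementMetadata) =>
    elementMetadata
        .SelectMany(m => m)                       // flatten all key/value pairs
        .Where(kvp => kvp.Value is not null)
        .GroupBy(kvp => kvp.Key)                  // group by metadata key
        .ToDictionary(
            g => g.Key,
            g => g.GroupBy(kvp => kvp.Value!)     // count each distinct value
                  .OrderByDescending(vg => vg.Count())
                  .First().Key);
```

For element-type labels this matters: a chunk that is mostly `table` rows plus one `caption` element would be labeled `table` under this strategy, whereas "first wins" depends on element order.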
`DocumentTokenChunker` has a similar pattern and would need the same treatment.

## Workaround
We wrote a `MetadataAwareSectionChunker` that wraps `SectionChunker` and post-processes chunks to resolve metadata via content matching. It builds an index of element text to metadata from the document, then matches each chunk's content back to source elements.
Available at: https://github.com/luisquintanilla/PdfPig/blob/feature/intelligent-pdf-ingestion/src/UglyToad.PdfPig.DataIngestion/MetadataAwareSectionChunker.cs
This works but is a workaround. The content-matching approach is less precise than having the chunker propagate metadata directly, especially when chunks span multiple elements or when element text is modified during chunking (e.g., table row splitting).
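For readers of this issue who don't want to follow the link, the content-matching idea reduces to roughly the following sketch. This is an illustration of the approach only (the tuple shapes and `AttachMetadataByContent` name are invented here, not the actual implementation):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: index element text -> metadata, then attach
// metadata to any chunk whose content contains that element's text.
static void AttachMetadataByContent(
    IEnumerable<(string Text, IReadOnlyDictionary<string, object?> Metadata)> elements,
    IEnumerable<(string Content, IDictionary<string, object> Metadata)> chunks)
{
    // Only elements that actually carry metadata are worth indexing.
    var index = elements.Where(e => e.Metadata.Count > 0).ToList();

    foreach (var chunk in chunks)
    {
        foreach (var element in index)
        {
            if (!chunk.Content.Contains(element.Text, StringComparison.Ordinal))
                continue;

            foreach (var kvp in element.Metadata)
            {
                if (kvp.Value is not null)
                    chunk.Metadata.TryAdd(kvp.Key, kvp.Value); // first wins
            }
        }
    }
}
```

The `Contains` check is exactly where the imprecision comes from: it fails whenever chunking rewrites the element text, which is why direct propagation inside the chunker is the better fix.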