Description
I'm experiencing a significant performance degradation when materializing a list of `Document` objects compared to their JSON (dictionary) representation. Specifically, processing 200 documents takes roughly 20x longer when a step returns `List[Document]` rather than `List[Dict]`.
Code
```python
from typing import Annotated, Dict, List

from langchain_core.documents import Document
from zenml import get_step_context, step


@step()
def chunk_docs(docs: List[Document]) -> Annotated[List[Document], "chunked_docs"]:
    print(f"Received {len(docs)} documents. Returning documents without changes.")
    get_step_context().add_output_metadata(
        output_name="chunked_docs",
        metadata={"num_chunks": len(docs)},
    )
    return docs


@step()
def chunk_docs_dict(docs: List[Document]) -> Annotated[List[Dict], "chunked_docs"]:
    print(f"Received {len(docs)} documents. Returning documents without changes.")
    get_step_context().add_output_metadata(
        output_name="chunked_docs",
        metadata={"num_chunks": len(docs)},
    )
    return [{"page_content": doc.page_content, "metadata": doc.metadata} for doc in docs]


if __name__ == "__main__":
    import time

    num_docs = 200
    docs = [
        Document(
            page_content=f"This is the content of document {i}." * 50,
            metadata={"doc_id": i},
        )
        for i in range(num_docs)
    ]

    # Time the same data once returned as Document objects, once as dicts.
    start = time.time()
    chunked_docs = chunk_docs(docs)
    print(f"Time taken to chunk {num_docs} docs as Documents: {time.time() - start}")

    start = time.time()
    chunked_docs = chunk_docs_dict(docs)
    print(f"Time taken to chunk {num_docs} docs as dicts: {time.time() - start}")
```
Output
```
Running single step pipeline to execute step chunk_docs
...
Received 200 documents. Returning documents without changes.
Step chunk_docs has finished in 33.667s.
Pipeline run has finished in 33.725s.
Time taken to chunk 200 docs as Documents: 36.364107847213745

Running single step pipeline to execute step chunk_docs_dict
...
Received 200 documents. Returning documents without changes.
Step chunk_docs_dict has finished in 0.440s.
Pipeline run has finished in 0.487s.
Time taken to chunk 200 docs as dicts: 1.629422664642334
```
Expected Behavior
I expected both steps to have similar performance, since both functions process essentially the same data. Converting each `Document` to a dictionary first (as in `chunk_docs_dict`) turns out to be much faster.
Actual Behavior
- Using `Document` objects: ~36.36 seconds for 200 documents.
- Using dictionary conversion: ~1.63 seconds for 200 documents.
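One way to see where those ~36 seconds actually go is to profile the slow variant with the standard library (a diagnostic sketch to run in place of the plain timing above, not part of the original repro):

```python
import cProfile

# Sort by cumulative time to see whether the cost sits in the
# materializer itself, the registry lookups, or the file I/O.
cProfile.run("chunk_docs(docs)", sort="cumtime")
```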
Environment
- langchain 0.3.19
- zenml 0.74.0
- python 3.11.11
Discussion
Since both steps return a `list`, the `BuiltInContainerMaterializer` is used by default, bypassing the materializer defined in the LangChain integration. For `List[Dict]`, the materializer relies on its `_is_serializable` check and can store the whole list in one go. For `List[Document]`, however, each item triggers a lookup in the materializer registry, which for `Document` selects the `PydanticMaterializer`.
For reference, see the following code sections:
- BuiltInContainerMaterializer implementation
- LangChain Document Materializer
- PydanticMaterializer implementation
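Based on reading those implementations, the dispatch seems to work roughly like the following simplified sketch (not the actual ZenML source; `find_materializer` is a stand-in for the registry lookup, not a real API):

```python
import json
import os
from typing import Any, Callable, List

def _is_serializable(obj: Any) -> bool:
    # Probe whether the whole object survives a plain json round trip.
    try:
        json.dumps(obj)
        return True
    except TypeError:
        return False

def save_list(data: List[Any], uri: str, find_materializer: Callable) -> None:
    os.makedirs(uri, exist_ok=True)
    if _is_serializable(data):
        # List[Dict] path: one json.dump for the entire list.
        with open(os.path.join(uri, "data.json"), "w") as f:
            json.dump(data, f)
    else:
        # List[Document] path: one registry lookup plus one full
        # materializer save (with its own subdirectory) per element.
        for i, item in enumerate(data):
            materializer_cls = find_materializer(type(item))
            materializer_cls(os.path.join(uri, str(i))).save(item)
```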
However, beyond the creation of a separate file for each item in the list, I don't understand why the difference in performance is this massive.
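To check whether per-item file creation alone can account for the gap, here is a standalone timing script (no ZenML involved; the file layout is illustrative only):

```python
import json
import os
import tempfile
import time

from langchain_core.documents import Document

docs = [
    Document(page_content=f"This is the content of document {i}." * 50,
             metadata={"doc_id": i})
    for i in range(200)
]

with tempfile.TemporaryDirectory() as tmp:
    # Single-file path, mimicking what List[Dict] gets.
    start = time.time()
    payload = [{"page_content": d.page_content, "metadata": d.metadata}
               for d in docs]
    with open(os.path.join(tmp, "list.json"), "w") as f:
        json.dump(payload, f)
    print(f"one file for the whole list: {time.time() - start:.3f}s")

    # One directory and file per document, mimicking the per-item path
    # but without any registry lookups or artifact-store round trips.
    start = time.time()
    for i, d in enumerate(docs):
        item_dir = os.path.join(tmp, str(i))
        os.makedirs(item_dir)
        with open(os.path.join(item_dir, "data.json"), "w") as f:
            f.write(d.model_dump_json())
    print(f"one file per document: {time.time() - start:.3f}s")
```

If both timings come out orders of magnitude below the ~33 s step duration, the file writes themselves are presumably not the bottleneck, and the remaining suspects are the per-item registry lookups and the artifact/metadata bookkeeping performed around each save.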