
Performance Degradation When Materializing LangChain's Document Objects #3371

Open
@SasCezar

Description

I'm experiencing a significant performance degradation when materializing a list of Document objects compared to using their JSON (dictionary) representation. Specifically, a step that returns 200 documents takes roughly 20x longer when its return type is List[Document] than when it is a list of dictionaries (List[Dict]).

Code

from typing import Annotated, List, Dict
from langchain_core.documents import Document
from zenml import step, get_step_context

@step()
def chunk_docs(docs: List[Document]) -> Annotated[List[Document], "chunked_docs"]:
    print(f"Received {len(docs)} documents. Returning documents without changes.")
    get_step_context().add_output_metadata(
        output_name="chunked_docs",
        metadata={"num_chunks": len(docs)}
    )
    return docs

@step()
def chunk_docs_dict(docs: List[Document]) -> Annotated[List[Dict], "chunked_docs"]:
    print(f"Received {len(docs)} documents. Returning documents without changes.")
    get_step_context().add_output_metadata(
        output_name="chunked_docs",
        metadata={"num_chunks": len(docs)}
    )
    dicts = [{"page_content": doc.page_content, "metadata": doc.metadata} for doc in docs]
    return dicts

if __name__ == "__main__":
    num_docs = 200
    docs = [
        Document(
            page_content=f"This is the content of document {i}." * 50,
            metadata={"doc_id": i}
        )
        for i in range(num_docs)
    ]

    # Time the step with Document outputs vs. dict outputs
    import time

    start = time.time()
    chunked_docs = chunk_docs(docs)
    print(f"Time taken to chunk {num_docs} docs without langchain: {time.time() - start}")

    start = time.time()
    chunked_docs = chunk_docs_dict(docs)
    print(f"Time taken to chunk {num_docs} docs with langchain: {time.time() - start}")

Output

Running single step pipeline to execute step chunk_docs
...
Received 200 documents. Returning documents without changes.
Step chunk_docs has finished in 33.667s.
Pipeline run has finished in 33.725s.
Time taken to chunk 200 docs as Document objects: 36.364107847213745

Running single step pipeline to execute step chunk_docs_dict
...
Received 200 documents. Returning documents without changes.
Step chunk_docs_dict has finished in 0.440s.
Pipeline run has finished in 0.487s.
Time taken to chunk 200 docs as dicts: 1.629422664642334

Expected Behavior

I expected both steps to have similar performance, since both functions process essentially the same data. Yet returning the dictionary representation (as in chunk_docs_dict) is much faster, even though it performs an extra conversion in Python.
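
To confirm the conversion itself is not the bottleneck, it can be timed in isolation (a standalone check outside any pipeline; numbers will vary by machine):

import time
from langchain_core.documents import Document

docs = [
    Document(
        page_content=f"This is the content of document {i}." * 50,
        metadata={"doc_id": i},
    )
    for i in range(200)
]

# Pure-Python Document -> dict conversion, no ZenML involved.
start = time.time()
as_dicts = [{"page_content": d.page_content, "metadata": d.metadata} for d in docs]
print(f"Converted {len(as_dicts)} docs in {time.time() - start:.6f}s")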

Actual Behavior

Using Document objects: ~36.36 seconds for 200 documents.
Using dictionary conversion: ~1.63 seconds for 200 documents.

Environment:

langchain 0.3.19
zenml 0.74.0
python 3.11.11

Discussion

Since both steps return a list, the BuiltInContainerMaterializer is used by default, bypassing the materializer defined in the LangChain integration. For List[Dict], the container materializer serializes items directly after its _is_serializable check. For List[Document], however, each item triggers a lookup in the materializer registry, which for Document resolves to the PydanticMaterializer.

For reference, see the BuiltInContainerMaterializer and the materializer registry code in the ZenML source.

However, beyond the creation of a separate file for each item in the list, I don't understand why the performance difference is this large.
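
To gauge how much of the gap is raw file I/O, here is a standalone timing sketch (no ZenML involved; the file layout only approximates what the container materializer produces):

import json
import os
import tempfile
import time

docs = [
    {"page_content": f"This is the content of document {i}." * 50,
     "metadata": {"doc_id": i}}
    for i in range(200)
]

# One small JSON file per document, mimicking per-item materialization.
with tempfile.TemporaryDirectory() as tmp:
    start = time.time()
    for i, d in enumerate(docs):
        with open(os.path.join(tmp, f"doc_{i}.json"), "w") as f:
            json.dump(d, f)
    print(f"200 separate files: {time.time() - start:.4f}s")

# A single combined file, mimicking the List[Dict] path.
with tempfile.TemporaryDirectory() as tmp:
    start = time.time()
    with open(os.path.join(tmp, "docs.json"), "w") as f:
        json.dump(docs, f)
    print(f"one combined file: {time.time() - start:.4f}s")

If local file writes alone do not reproduce a 20x gap, the overhead presumably lives in the per-item materializer dispatch and artifact bookkeeping rather than in the I/O itself.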

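A possible workaround until this is resolved: route the whole list through a single custom materializer so only one artifact is written. This is an untested sketch against ZenML's documented BaseMaterializer API; DocumentListMaterializer, chunk_docs_single_artifact, and the docs.json file name are my own inventions.

import json
import os
from typing import Any, List, Type

from langchain_core.documents import Document
from zenml import step
from zenml.enums import ArtifactType
from zenml.io import fileio
from zenml.materializers.base_materializer import BaseMaterializer

class DocumentListMaterializer(BaseMaterializer):
    # List[Document] erases to plain list at runtime. Note: associating
    # with list may shadow the builtin list materializer more broadly, so
    # passing it explicitly via output_materializers keeps the intent clear.
    ASSOCIATED_TYPES = (list,)
    ASSOCIATED_ARTIFACT_TYPE = ArtifactType.DATA

    def save(self, data: List[Document]) -> None:
        # One JSON file for the whole list instead of one artifact per item.
        # langchain_core 0.3 Documents are pydantic v2 models, so
        # model_dump() applies.
        with fileio.open(os.path.join(self.uri, "docs.json"), "w") as f:
            json.dump([d.model_dump() for d in data], f)

    def load(self, data_type: Type[Any]) -> List[Document]:
        with fileio.open(os.path.join(self.uri, "docs.json"), "r") as f:
            return [Document(**d) for d in json.load(f)]

@step(output_materializers=DocumentListMaterializer)
def chunk_docs_single_artifact(docs: List[Document]) -> List[Document]:
    return docs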