Description
I'm experiencing a significant performance degradation when materializing a list of `Document` objects compared to their JSON (dictionary) representation. Specifically, processing 200 documents takes roughly 20x longer when a step returns `List[Document]` rather than `List[Dict]`.
Code
```python
from typing import Annotated, Dict, List

from langchain_core.documents import Document
from zenml import get_step_context, step


@step()
def chunk_docs(docs: List[Document]) -> Annotated[List[Document], "chunked_docs"]:
    print(f"Received {len(docs)} documents. Returning documents without changes.")
    get_step_context().add_output_metadata(
        output_name="chunked_docs",
        metadata={"num_chunks": len(docs)},
    )
    return docs


@step()
def chunk_docs_dict(docs: List[Document]) -> Annotated[List[Dict], "chunked_docs"]:
    print(f"Received {len(docs)} documents. Returning documents without changes.")
    get_step_context().add_output_metadata(
        output_name="chunked_docs",
        metadata={"num_chunks": len(docs)},
    )
    return [{"page_content": doc.page_content, "metadata": doc.metadata} for doc in docs]


if __name__ == "__main__":
    import time

    num_docs = 200
    docs = [
        Document(
            page_content=f"This is the content of document {i}." * 50,
            metadata={"doc_id": i},
        )
        for i in range(num_docs)
    ]

    # Time the same data once returned as Document objects, once as dicts.
    start = time.time()
    chunked_docs = chunk_docs(docs)
    print(f"Time taken to chunk {num_docs} docs as Documents: {time.time() - start}")

    start = time.time()
    chunked_docs = chunk_docs_dict(docs)
    print(f"Time taken to chunk {num_docs} docs as dicts: {time.time() - start}")
```
Output
```
Running single step pipeline to execute step chunk_docs
...
Received 200 documents. Returning documents without changes.
Step chunk_docs has finished in 33.667s.
Pipeline run has finished in 33.725s.
Time taken to chunk 200 docs as Documents: 36.364107847213745

Running single step pipeline to execute step chunk_docs_dict
...
Received 200 documents. Returning documents without changes.
Step chunk_docs_dict has finished in 0.440s.
Pipeline run has finished in 0.487s.
Time taken to chunk 200 docs as dicts: 1.629422664642334
```
Expected Behavior
I expected both steps to have similar performance, since both functions process essentially the same data. Converting each `Document` to a dictionary first (as in `chunk_docs_dict`) turns out to be much faster.
Actual Behavior
- Using `Document` objects: ~36.36 seconds for 200 documents.
- Using dictionary conversion: ~1.63 seconds for 200 documents.
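One way to see where those ~36 seconds actually go is to profile the slow variant with the standard library (a diagnostic sketch to run in place of the plain timing above, not part of the original repro):

```python
import cProfile

# Sort by cumulative time to see whether the cost sits in the
# materializer itself, the registry lookups, or the file I/O.
cProfile.run("chunk_docs(docs)", sort="cumtime")
```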
Environment
- langchain 0.3.19
- zenml 0.74.0
- python 3.11.11
Discussion
Since both steps return a `list`, the `BuiltInContainerMaterializer` is used by default, bypassing the materializer defined in the LangChain integration. For `List[Dict]`, the materializer relies on its `_is_serializable` check and can store the whole list in one go. For `List[Document]`, however, each item triggers a lookup in the materializer registry, which for `Document` selects the `PydanticMaterializer`.
For reference, see the following code sections:
- BuiltInContainerMaterializer implementation
- LangChain Document Materializer
- PydanticMaterializer implementation
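Based on reading those implementations, the dispatch seems to work roughly like the following simplified sketch (not the actual ZenML source; `find_materializer` is a stand-in for the registry lookup, not a real API):

```python
import json
import os
from typing import Any, Callable, List

def _is_serializable(obj: Any) -> bool:
    # Probe whether the whole object survives a plain json round trip.
    try:
        json.dumps(obj)
        return True
    except TypeError:
        return False

def save_list(data: List[Any], uri: str, find_materializer: Callable) -> None:
    os.makedirs(uri, exist_ok=True)
    if _is_serializable(data):
        # List[Dict] path: one json.dump for the entire list.
        with open(os.path.join(uri, "data.json"), "w") as f:
            json.dump(data, f)
    else:
        # List[Document] path: one registry lookup plus one full
        # materializer save (with its own subdirectory) per element.
        for i, item in enumerate(data):
            materializer_cls = find_materializer(type(item))
            materializer_cls(os.path.join(uri, str(i))).save(item)
```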
However, beyond the creation of a separate file for each item in the list, I don't understand why the difference in performance is this massive.
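To check whether per-item file creation alone can account for the gap, here is a standalone timing script (no ZenML involved; the file layout is illustrative only):

```python
import json
import os
import tempfile
import time

from langchain_core.documents import Document

docs = [
    Document(page_content=f"This is the content of document {i}." * 50,
             metadata={"doc_id": i})
    for i in range(200)
]

with tempfile.TemporaryDirectory() as tmp:
    # Single-file path, mimicking what List[Dict] gets.
    start = time.time()
    payload = [{"page_content": d.page_content, "metadata": d.metadata}
               for d in docs]
    with open(os.path.join(tmp, "list.json"), "w") as f:
        json.dump(payload, f)
    print(f"one file for the whole list: {time.time() - start:.3f}s")

    # One directory and file per document, mimicking the per-item path
    # but without any registry lookups or artifact-store round trips.
    start = time.time()
    for i, d in enumerate(docs):
        item_dir = os.path.join(tmp, str(i))
        os.makedirs(item_dir)
        with open(os.path.join(item_dir, "data.json"), "w") as f:
            f.write(d.model_dump_json())
    print(f"one file per document: {time.time() - start:.3f}s")
```

If both timings come out orders of magnitude below the ~33 s step duration, the file writes themselves are presumably not the bottleneck, and the remaining suspects are the per-item registry lookups and the artifact/metadata bookkeeping performed around each save.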