Description
Bug Description
I am currently working on a RAG system whose query engine uses both VectorIndexRetriever and BM25Retriever. The latter requires nodes, which I try to generate from files that I gather with MinioReader, using DoclingReader as the file_extractor.
When I run load_data() natively, it takes about 4 minutes to extract about 170 .docx and .xlsx files from the minio container. But when it is launched from inside the Docker container application, the process takes more than an hour.
What could be the cause of this? The Docker container has no constraints on resources, and I even added a GPU to the deploy section of the docker-compose file.
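One way to check whether the container really sees the same CPU resources as the host is to print them from both environments and compare. The snippet below is only a diagnostic sketch; the helper name `visible_resources` is made up for illustration, and the two environment variables are just common thread-pool settings that may or may not matter here. Note that CPU quotas set through cgroups are not reflected by either value, so matching numbers do not rule out throttling.

```python
import os


def visible_resources() -> dict:
    """Collect the CPU and thread settings this process can see."""
    info = {"cpu_count": os.cpu_count()}
    # sched_getaffinity is Linux-only; it reflects CPU pinning,
    # which os.cpu_count() does not.
    if hasattr(os, "sched_getaffinity"):
        info["usable_cpus"] = len(os.sched_getaffinity(0))
    # Numeric libraries often size their thread pools from these variables.
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        info[var] = os.environ.get(var)
    return info


if __name__ == "__main__":
    for key, value in visible_resources().items():
        print(f"{key}: {value}")
```

Running this once natively and once inside the main container should show whether the two environments differ in core count or thread settings.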
Version
llama-index-readers-docling = 0.4.1
Steps to Reproduce
Build a Docker app with 2 containers, minio and main, put some files inside minio, then try extracting them with the function below.
For comparison, run the same code outside Docker, for example from a Jupyter notebook.
The documents are loaded from minio via the following function:
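# Import paths assume the docling, llama-index-readers-docling and
# llama-index-readers-minio packages are installed.
from docling.document_converter import DocumentConverter
from llama_index.readers.docling import DoclingReader
from llama_index.readers.minio import MinioReader


def get_documents_minio():
    # One Docling-backed reader is shared by every supported extension.
    reader = DoclingReader(
        export_type=DoclingReader.ExportType.MARKDOWN,
        doc_converter=DocumentConverter(),
    )
    file_extr = {ext: reader for ext in ['.pdf', '.vdx', '.docx', '.xlsx']}
    minio_reader = MinioReader(
        bucket='example',
        minio_endpoint='127.0.0.1:9000',
        minio_access_key="minioadmin",
        minio_secret_key="minioadmin",
        file_extractor=file_extr,
    )
    documents = minio_reader.load_data()
    return documents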
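To make the native versus Docker comparison concrete, the call could be wrapped in a small timer and run in both environments. This is a minimal sketch; `timed` is a hypothetical helper, not part of any of the libraries above.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return its result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, `documents, seconds = timed(get_documents_minio)` followed by printing `seconds` gives one number per environment.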
Relevant Logs/Tracebacks
I'm not sure how relevant this is, but the Docker logs just show the reader working on each document for a long time, anywhere from a few seconds to multiple minutes.
2025-10-01 15:27:35 2025-10-01 12:27:35,230 - INFO - deleted item in tree at stack: (597,) => #/texts/644
2025-10-01 15:27:35 2025-10-01 12:27:35,247 - INFO - deleted item in tree at stack: (597,) => #/texts/644
2025-10-01 15:27:35 2025-10-01 12:27:35,263 - INFO - deleted item in tree at stack: (597,) => #/texts/644
2025-10-01 15:27:35 2025-10-01 12:27:35,280 - INFO - deleted item in tree at stack: (597,) => #/texts/644
2025-10-01 15:27:35 2025-10-01 12:27:35,298 - INFO - deleted item in tree at stack: (597,) => #/texts/644
2025-10-01 15:27:35 2025-10-01 12:27:35,324 - INFO - deleted item in tree at stack: (597,) => #/texts/646
2025-10-01 15:27:35 2025-10-01 12:27:35,341 - INFO - deleted item in tree at stack: (597,) => #/texts/646
2025-10-01 15:27:35 2025-10-01 12:27:35,358 - INFO - deleted item in tree at stack: (597,) => #/texts/646
2025-10-01 15:27:35 2025-10-01 12:27:35,375 - INFO - deleted item in tree at stack: (597,) => #/texts/646
2025-10-01 15:27:35 2025-10-01 12:27:35,391 - INFO - deleted item in tree at stack: (597,) => #/texts/646
2025-10-01 15:27:35 2025-10-01 12:27:35,416 - INFO - deleted item in tree at stack: (597,) => #/texts/648
2025-10-01 15:27:35 2025-10-01 12:27:35,432 - INFO - deleted item in tree at stack: (597,) => #/texts/648
2025-10-01 15:27:35 2025-10-01 12:27:35,448 - INFO - deleted item in tree at stack: (597,) => #/texts/648
2025-10-01 15:27:35 2025-10-01 12:27:35,464 - INFO - deleted item in tree at stack: (597,) => #/texts/648
2025-10-01 15:27:35 2025-10-01 12:27:35,481 - INFO - deleted item in tree at stack: (597,) => #/texts/648
2025-10-01 15:27:35 2025-10-01 12:27:35,507 - INFO - deleted item in tree at stack: (597,) => #/texts/650
2025-10-01 15:27:35 2025-10-01 12:27:35,523 - INFO - deleted item in tree at stack: (597,) => #/texts/650
2025-10-01 15:27:35 2025-10-01 12:27:35,540 - INFO - deleted item in tree at stack: (597,) => #/texts/650
2025-10-01 15:27:35 2025-10-01 12:27:35,564 - INFO - deleted item in tree at stack: (604,) => #/texts/656
2025-10-01 15:27:35 2025-10-01 12:27:35,580 - INFO - deleted item in tree at stack: (604,) => #/texts/656
2025-10-01 15:27:35 2025-10-01 12:27:35,598 - INFO - deleted item in tree at stack: (604,) => #/texts/656
2025-10-01 15:27:35 2025-10-01 12:27:35,616 - INFO - deleted item in tree at stack: (604,) => #/texts/656
2025-10-01 15:27:35 2025-10-01 12:27:35,633 - INFO - deleted item in tree at stack: (604,) => #/texts/656
2025-10-01 15:27:35 2025-10-01 12:27:35,649 - INFO - deleted item in tree at stack: (604,) => #/texts/656
2025-10-01 15:27:35 2025-10-01 12:27:35,666 - INFO - deleted item in tree at stack: (604,) => #/texts/656
2025-10-01 15:27:35 2025-10-01 12:27:35,682 - INFO - deleted item in tree at stack: (604,) => #/texts/656
2025-10-01 15:27:35 2025-10-01 12:27:35,699 - INFO - deleted item in tree at stack: (604,) => #/texts/656
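The spacing between these log lines can be measured directly, which helps show how much time each "deleted item" step takes. The sketch below parses the second timestamp on each line (the first one is presumably added by the Docker log viewer) and computes the gap between consecutive entries. The function name `gaps_ms` is made up for illustration, and the sample uses the first three lines from the log above.

```python
import re
from datetime import datetime, timedelta

# Matches the application timestamp, which is the one followed by " - INFO".
STAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - INFO")


def gaps_ms(lines):
    """Return the gaps in milliseconds between consecutive log timestamps."""
    stamps = []
    for line in lines:
        match = STAMP.search(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f"))
    return [(b - a) / timedelta(milliseconds=1) for a, b in zip(stamps, stamps[1:])]


sample = [
    "2025-10-01 15:27:35 2025-10-01 12:27:35,230 - INFO - deleted item in tree at stack: (597,) => #/texts/644",
    "2025-10-01 15:27:35 2025-10-01 12:27:35,247 - INFO - deleted item in tree at stack: (597,) => #/texts/644",
    "2025-10-01 15:27:35 2025-10-01 12:27:35,263 - INFO - deleted item in tree at stack: (597,) => #/texts/644",
]

if __name__ == "__main__":
    print(gaps_ms(sample))
```

In the excerpt above the entries are roughly 16 to 17 ms apart. Running the same analysis on logs from the native run would show whether this step is slower only in Docker.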