Description
When running extract_and_build() with documents of varying sizes, the ExtractionPipeline distributes whole documents to ProcessPoolExecutor workers. The SentenceSplitter runs inside each worker, so chunking happens after distribution. As a result, each worker receives one (or a few) whole documents, and a single large document creates massive load imbalance.
Observed behavior
Batch of 16 documents with EXTRACTION_NUM_WORKERS=16:
Worker 1: 77 nodes → done in ~2 min, then IDLE
Worker 2: 90 nodes → done in ~2.5 min, then IDLE
Worker 3: 107 nodes
Worker 4: 130 nodes
Worker 5: 155 nodes
Worker 6: 320 nodes
Worker 7: 644 nodes
Worker 8: 987 nodes → done in ~25 min ← entire batch waits for this
...
The batch took ~25 min, dominated by one large document. Workers that finished early sat idle for 20+ minutes.
The node_batcher splits work evenly by node count, but at that point each document is still a single node (chunking hasn't happened yet). The SentenceSplitter then runs inside each worker, producing wildly different chunk counts depending on document size (a 15KB document → 77 chunks, a 200KB document → 987 chunks).
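A quick back-of-the-envelope using the chunk counts above shows what the imbalance costs. The per-chunk extraction time is inferred from the 25-minute worker, so treat the numbers as rough:

```python
# Observed per-worker chunk counts from this batch (one document each).
chunks = [77, 90, 107, 130, 155, 320, 644, 987]

# Rough seconds per chunk, inferred from the worker that took ~25 min
# to process 987 chunks.
seconds_per_chunk = 25 * 60 / 987

# Makespan with one whole document per worker: the slowest doc wins.
per_document = max(chunks) * seconds_per_chunk

# Ideal makespan if chunks were spread evenly across these 8 workers.
balanced_ideal = sum(chunks) * seconds_per_chunk / len(chunks)

print(f"per-document makespan: {per_document / 60:.1f} min")   # 25.0 min
print(f"balanced ideal:        {balanced_ideal / 60:.1f} min")
```

Even perfect balancing across only these 8 workers would cut the batch from 25 minutes to roughly 8.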
Environment
- graphrag-toolkit-lexical-graph==3.16.2
- ECS Fargate (16 vCPU, 32GB)
- EXTRACTION_NUM_WORKERS=16
- Documents ranging from 15KB to 200KB+
It would be great to see improvements in how work is distributed across workers so that large documents don't bottleneck the entire batch. Happy to test any changes on our end.