Description
When running extract_and_build() with documents of varying sizes, the ExtractionPipeline distributes whole documents to ProcessPoolExecutor workers. The SentenceSplitter runs inside each worker, so chunking happens after distribution. As a result, each worker receives one (or a few) whole documents, and a single large document creates massive load imbalance.
Observed behavior
Batch of 16 documents with EXTRACTION_NUM_WORKERS=16:
Worker 1: 77 nodes → done in ~2 min, then IDLE
Worker 2: 90 nodes → done in ~2.5 min, then IDLE
Worker 3: 107 nodes
Worker 4: 130 nodes
Worker 5: 155 nodes
Worker 6: 320 nodes
Worker 7: 644 nodes
Worker 8: 987 nodes → done in ~25 min ← entire batch waits for this
...
The batch took ~25 min, dominated by one large document. Workers that finished early sat idle for 20+ minutes.
The node_batcher splits work evenly by node count, but at that point each document is still a single node (chunking hasn't happened yet). The SentenceSplitter then runs inside each worker, producing wildly different chunk counts depending on document size (a 15KB document → 77 chunks, a 200KB document → 987 chunks).
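A quick back-of-the-envelope using the chunk counts above shows what the imbalance costs. The per-chunk extraction time is inferred from the 25-minute worker, so treat the numbers as rough:

```python
# Observed per-worker chunk counts from this batch (one document each).
chunks = [77, 90, 107, 130, 155, 320, 644, 987]

# Rough seconds per chunk, inferred from the worker that took ~25 min
# to process 987 chunks.
seconds_per_chunk = 25 * 60 / 987

# Makespan with one whole document per worker: the slowest doc wins.
per_document = max(chunks) * seconds_per_chunk

# Ideal makespan if chunks were spread evenly across these 8 workers.
balanced_ideal = sum(chunks) * seconds_per_chunk / len(chunks)

print(f"per-document makespan: {per_document / 60:.1f} min")   # 25.0 min
print(f"balanced ideal:        {balanced_ideal / 60:.1f} min")
```

Even perfect balancing across only these 8 workers would cut the batch from 25 minutes to roughly 8.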
Environment
- graphrag-toolkit-lexical-graph==3.16.2
- ECS Fargate (16 vCPU, 32GB)
- EXTRACTION_NUM_WORKERS=16
- Documents ranging from 15KB to 200KB+
It would be great to see improvements in how work is distributed across workers so that large documents don't bottleneck the entire batch. Happy to test any changes on our end.