[OPTIMIZATION] Uneven worker utilization in ExtractionPipeline — large documents bottleneck the entire batch #132

@voidwisp

Description

When running extract_and_build() with documents of varying sizes, the ExtractionPipeline distributes whole documents to ProcessPoolExecutor workers. The SentenceSplitter runs inside each worker, so chunking happens after distribution. This means each worker gets one (or a few) documents, and large documents create massive load imbalance.

Observed behavior

Batch of 16 documents with EXTRACTION_NUM_WORKERS=16:

Worker 1: 77 nodes → done in ~2 min, then IDLE
Worker 2: 90 nodes → done in ~2.5 min, then IDLE
Worker 3: 107 nodes
Worker 4: 130 nodes
Worker 5: 155 nodes
Worker 6: 320 nodes
Worker 7: 644 nodes
Worker 8: 987 nodes → done in ~25 min ← entire batch waits for this
...

The batch took ~25 min, dominated by one large document. Workers that finished early sat idle for 20+ minutes.
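A back-of-the-envelope calculation, assuming extraction time is roughly proportional to chunk count, shows how much the document-level distribution costs here:

```python
# Chunk counts per worker, taken from the batch above.
chunk_counts = [77, 90, 107, 130, 155, 320, 644, 987]

# Document-level distribution: one document per worker, so the batch
# ends when the largest document finishes.
makespan_doc_level = max(chunk_counts)

# Chunk-level distribution across the same 8 workers would end near
# the mean load instead of the max.
makespan_chunk_level = sum(chunk_counts) / len(chunk_counts)

print(makespan_doc_level, makespan_chunk_level)  # → 987 313.75
```

That is roughly a 3x gap, which lines up with the observed ~25 min batch: with balanced chunks it would land somewhere around 8 min.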

The node_batcher splits evenly by node count, but at this point each document is still a single node (pre-chunking). The SentenceSplitter runs inside each worker after distribution, producing wildly different chunk counts depending on document size (15KB → 77 chunks, 200KB → 987 chunks).
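A minimal sketch of the chunk-first alternative: split in the parent process, then hand out chunks rather than documents. The helper names here (split_document, process_chunk) are hypothetical stand-ins for the toolkit's SentenceSplitter and per-node extraction, not the library's actual API:

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import chain

def split_document(doc: str) -> list[str]:
    # Hypothetical stand-in for SentenceSplitter: fixed-size chunks.
    return [doc[i:i + 100] for i in range(0, len(doc), 100)]

def process_chunk(chunk: str) -> int:
    # Hypothetical stand-in for the per-chunk extraction work.
    return len(chunk)

def extract_balanced(documents: list[str], num_workers: int) -> list[int]:
    # 1. Chunk all documents up front (cheap relative to extraction).
    chunks = list(chain.from_iterable(split_document(d) for d in documents))
    # 2. Distribute chunks, not documents. executor.map with a small
    # chunksize hands out work as workers free up, so one 987-chunk
    # document no longer pins a single worker while the rest sit idle.
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        return list(executor.map(process_chunk, chunks, chunksize=8))
```

The trade-off is that chunking moves into the parent process; for sentence splitting that is usually negligible next to LLM-based extraction, and it lets the batcher balance on real chunk counts instead of one-node-per-document.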

Environment

  • graphrag-toolkit-lexical-graph==3.16.2
  • ECS Fargate (16 vCPU, 32GB)
  • EXTRACTION_NUM_WORKERS=16
  • Documents ranging from 15KB to 200KB+

It would be great if chunking could happen before distribution, so the node_batcher can balance on actual chunk counts and large documents don't bottleneck the entire batch. Happy to test any changes on our end.
