Description
When running ExtractionPipeline in a non-TTY environment (e.g., ECS Fargate → CloudWatch Logs, Docker, CI/CD), there is essentially zero visibility into extraction progress between batch start and batch end.
The "Extracting propositions" and "Extracting topics" phases use tqdm_asyncio.gather() (via llama_index's run_jobs()), which writes progress bars to stderr using \r carriage returns. In a TTY this works well: you see a live, updating bar. But log collectors (CloudWatch, Docker log drivers, journald) split output on \n newlines, so \r-based overwrites are either:
- Collapsed into a single enormous line (all updates concatenated)
- Silently dropped
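The collapse is easy to reproduce without tqdm at all. A minimal sketch of what a line-oriented collector sees (the progress text here is illustrative):

```python
import io

# Simulate a \r-based progress bar writing successive updates to one stream.
buf = io.StringIO()
for i in range(1, 4):
    buf.write(f"\rExtracting propositions: {i}/3")
buf.write("\n")

# A line-oriented collector (CloudWatch, Docker json-file driver) splits on
# \n only, so every intermediate update lands on one enormous line.
lines = buf.getvalue().split("\n")
print(len([l for l in lines if l]))  # 1 "line" as the collector sees it
print(lines[0].count("\r"))          # 3 concatenated \r updates
```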
In our prod run processing 410 documents (batch 1/13, 16 workers with up to 2,716 nodes each), we got 5 total proposition log lines and 6 topic log lines across the entire multi-hour extraction. The only way to confirm the task was still working was checking Bedrock CloudWatch metrics for ongoing API invocations.
Meanwhile, all per-node extraction logging (Extracting propositions for node {node_id}) is at DEBUG level, which is too noisy to enable in production: DEBUG would also pull in boto, opensearch, and urllib noise, and even with the ModuleFilter it's too verbose.
What would help:
- Periodic INFO-level progress logging during extraction — e.g., every N nodes or every M seconds, emit a logger.info(f"Extracting propositions: {completed}/{total} nodes ({pct:.0f}%)") line. This would give usable progress in any log environment without depending on tqdm's TTY behavior.
- tqdm non-TTY detection — when sys.stderr.isatty() is False, either disable tqdm entirely (relying on the INFO logs above) or configure it with mininterval=30, file=sys.stdout so it emits periodic readable lines instead of carriage-return overwrites.
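The first suggestion could be sketched as a small asyncio helper (names like gather_with_progress are hypothetical, not graphrag-toolkit API; note that asyncio.as_completed yields results out of completion order, so an ordering-preserving variant would be needed if run_jobs relies on order):

```python
import asyncio
import logging
import time

logger = logging.getLogger(__name__)

async def gather_with_progress(coros, label, every_n=25, every_s=30.0):
    """Hypothetical helper: await a batch of coroutines and emit INFO-level
    progress every `every_n` completions or `every_s` seconds, so non-TTY
    log collectors (CloudWatch, journald) see real newline-terminated lines."""
    total = len(coros)
    completed = 0
    last_log = time.monotonic()
    results = []
    # as_completed yields in completion order, not submission order.
    for fut in asyncio.as_completed(coros):
        results.append(await fut)
        completed += 1
        now = time.monotonic()
        if completed % every_n == 0 or now - last_log >= every_s or completed == total:
            pct = 100 * completed / total
            logger.info(f"{label}: {completed}/{total} nodes ({pct:.0f}%)")
            last_log = now
    return results
```

This produces a bounded number of INFO lines per batch regardless of node count, which is exactly what a multi-hour Fargate run needs.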
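For the tqdm side, a rough sketch of the detection logic. The wrapper name is illustrative, not graphrag-toolkit API; tqdm's disable parameter is real (and disable=None already tells tqdm to auto-disable on non-TTY files), while the mininterval/file combination is the fallback proposed above:

```python
import sys
from tqdm import tqdm

def progress(iterable, desc):
    """Illustrative wrapper (not graphrag-toolkit API): keep the live bar
    in a TTY, otherwise throttle updates so whatever does get flushed is
    infrequent and readable in a line-oriented log collector."""
    if sys.stderr.isatty():
        return tqdm(iterable, desc=desc)
    # In a non-TTY environment: at most one update every 30 s, written to
    # stdout with a plain ascii bar. (Passing disable=True here instead
    # would silence tqdm entirely, deferring to the INFO logs.)
    return tqdm(iterable, desc=desc, mininterval=30, file=sys.stdout, ascii=True)

total = sum(progress(range(5), "demo"))
```

Either branch avoids the pathological case of thousands of \r overwrites being concatenated into one CloudWatch event.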
Environment:
- graphrag-toolkit v3.16.2 (lexical-graph)
- Running on ECS Fargate, logs via CloudWatch awslogs driver
- llama_index run_jobs() with show_progress=True