Description
When running ExtractionPipeline in a non-TTY environment (e.g., ECS Fargate → CloudWatch Logs, Docker, CI/CD), there is essentially zero visibility into extraction progress between batch start and batch end.
The "Extracting propositions" and "Extracting topics" phases use tqdm_asyncio.gather() (via llama_index's run_jobs()), which writes progress bars to stderr using \r carriage returns. In a TTY this works well: you see a live, updating bar. But log collectors (CloudWatch, Docker log drivers, journald) split output on \n newlines, so \r-based overwrites are either:
- Collapsed into a single enormous line (all updates concatenated)
- Silently dropped
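The collapse is easy to reproduce without tqdm at all. A minimal sketch of what a line-oriented collector sees (the progress text here is illustrative):

```python
import io

# Simulate a \r-based progress bar writing successive updates to one stream.
buf = io.StringIO()
for i in range(1, 4):
    buf.write(f"\rExtracting propositions: {i}/3")
buf.write("\n")

# A line-oriented collector (CloudWatch, Docker json-file driver) splits on
# \n only, so every intermediate update lands on one enormous line.
lines = buf.getvalue().split("\n")
print(len([l for l in lines if l]))  # 1 "line" as the collector sees it
print(lines[0].count("\r"))          # 3 concatenated \r updates
```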
In our prod run processing 410 documents (batch 1/13, 16 workers with up to 2,716 nodes each), we got 5 total proposition log lines and 6 topic log lines across the entire multi-hour extraction. The only way to confirm the task was still working was checking Bedrock CloudWatch metrics for ongoing API invocations.
Meanwhile, all per-node extraction logging (Extracting propositions for node {node_id}) is at DEBUG level, which is too noisy to enable in production: DEBUG would also pull in boto, opensearch, and urllib noise, and even with the ModuleFilter it's too verbose.
What would help:
- Periodic INFO-level progress logging during extraction — e.g., every N nodes or every M seconds, emit a logger.info(f"Extracting propositions: {completed}/{total} nodes ({pct:.0f}%)") line. This would give usable progress in any log environment without depending on tqdm's TTY behavior.
- tqdm non-TTY detection — when sys.stderr.isatty() is False, either disable tqdm entirely (relying on the INFO logs above) or configure it with mininterval=30, file=sys.stdout so it emits periodic readable lines instead of carriage-return overwrites.
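The first suggestion could be sketched as a small asyncio helper (names like gather_with_progress are hypothetical, not graphrag-toolkit API; note that asyncio.as_completed yields results out of completion order, so an ordering-preserving variant would be needed if run_jobs relies on order):

```python
import asyncio
import logging
import time

logger = logging.getLogger(__name__)

async def gather_with_progress(coros, label, every_n=25, every_s=30.0):
    """Hypothetical helper: await a batch of coroutines and emit INFO-level
    progress every `every_n` completions or `every_s` seconds, so non-TTY
    log collectors (CloudWatch, journald) see real newline-terminated lines."""
    total = len(coros)
    completed = 0
    last_log = time.monotonic()
    results = []
    # as_completed yields in completion order, not submission order.
    for fut in asyncio.as_completed(coros):
        results.append(await fut)
        completed += 1
        now = time.monotonic()
        if completed % every_n == 0 or now - last_log >= every_s or completed == total:
            pct = 100 * completed / total
            logger.info(f"{label}: {completed}/{total} nodes ({pct:.0f}%)")
            last_log = now
    return results
```

This produces a bounded number of INFO lines per batch regardless of node count, which is exactly what a multi-hour Fargate run needs.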
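For the tqdm side, a rough sketch of the detection logic. The wrapper name is illustrative, not graphrag-toolkit API; tqdm's disable parameter is real (and disable=None already tells tqdm to auto-disable on non-TTY files), while the mininterval/file combination is the fallback proposed above:

```python
import sys
from tqdm import tqdm

def progress(iterable, desc):
    """Illustrative wrapper (not graphrag-toolkit API): keep the live bar
    in a TTY, otherwise throttle updates so whatever does get flushed is
    infrequent and readable in a line-oriented log collector."""
    if sys.stderr.isatty():
        return tqdm(iterable, desc=desc)
    # In a non-TTY environment: at most one update every 30 s, written to
    # stdout with a plain ascii bar. (Passing disable=True here instead
    # would silence tqdm entirely, deferring to the INFO logs.)
    return tqdm(iterable, desc=desc, mininterval=30, file=sys.stdout, ascii=True)

total = sum(progress(range(5), "demo"))
```

Either branch avoids the pathological case of thousands of \r overwrites being concatenated into one CloudWatch event.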
Environment:
- graphrag-toolkit v3.16.2 (lexical-graph)
- Running on ECS Fargate, logs via CloudWatch awslogs driver
- llama_index run_jobs() with show_progress=True