[FEATURE] Improve extraction progress visibility in non-TTY environments (CloudWatch, CI/CD, Docker logs) #129

@voidwisp

Description

When running ExtractionPipeline in a non-TTY environment (e.g., ECS Fargate → CloudWatch Logs, Docker, CI/CD), there is essentially zero visibility into extraction progress between batch start and batch end.

The "Extracting propositions" and "Extracting topics" phases use tqdm_asyncio.gather() (via llama_index's run_jobs()), which writes progress bars to stderr using \r carriage returns. In a TTY this works well: you see a live, updating bar. But log collectors (CloudWatch, Docker log drivers, journald) split output on \n newlines, so \r-based overwrites are either:

  • Collapsed into a single enormous line (all updates concatenated)
  • Silently dropped
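
To illustrate the first failure mode, here is a minimal stdlib-only sketch. It is a simplification, not tqdm's actual rendering or any collector's actual parser: simulate_progress_bar and split_like_log_collector are hypothetical helpers that mimic "\r before every update, one \n at the end" and "split on \n only".

```python
import io
from typing import List


def simulate_progress_bar(total: int) -> str:
    """Emit one carriage-return-prefixed update per item, then a single newline."""
    buf = io.StringIO()
    for done in range(1, total + 1):
        # A TTY would redraw the same line; a pipe just receives the raw bytes.
        buf.write(f"\rExtracting propositions: {done}/{total}")
    buf.write("\n")
    return buf.getvalue()


def split_like_log_collector(stream: str) -> List[str]:
    """Split on newlines only, which is the assumption made here about log collectors."""
    return [line for line in stream.split("\n") if line]


lines = split_like_log_collector(simulate_progress_bar(3))
print(len(lines))         # 1
print(repr(lines[0]))     # all three updates concatenated into one log line
```

With thousands of nodes per worker, that single line grows with every update, which matches the "single enormous line" symptom.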

In our prod run processing 410 documents (batch 1/13, 16 workers with up to 2,716 nodes each), we got 5 total proposition log lines and 6 topic log lines across the entire multi-hour extraction. The only way to confirm the task was still working was checking Bedrock CloudWatch metrics for ongoing API invocations.

Meanwhile, all per-node extraction logging (Extracting propositions for node {node_id}) is at DEBUG level, which is too noisy to enable in production (it would include boto, opensearch, urllib noise — and even with the ModuleFilter, it's too verbose).

What would help:

  1. Periodic INFO-level progress logging during extraction: e.g., every N nodes or every M seconds, emit a logger.info(f"Extracting propositions: {completed}/{total} nodes ({pct:.0f}%)"). This would give usable progress in any log environment without depending on tqdm's TTY behavior.
  2. tqdm non-TTY detection — when sys.stderr.isatty() is False, either disable tqdm entirely (relying on the INFO logs above) or configure it with mininterval=30, file=sys.stdout so it emits periodic readable lines instead of carriage-return overwrites.
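
A rough sketch of what both suggestions could look like together, using only asyncio and logging. The names gather_with_progress, every_n, every_secs and use_progress_bar are hypothetical and not part of graphrag-toolkit or llama_index. Unlike run_jobs(), this sketch does not cap concurrency with a worker limit; it only shows the logging pattern.

```python
import asyncio
import logging
import sys
import time
from typing import Any, Awaitable, List, Sequence

logger = logging.getLogger(__name__)


def use_progress_bar() -> bool:
    """Only show a tqdm-style bar when stderr is an interactive terminal."""
    return sys.stderr.isatty()


async def gather_with_progress(
    jobs: Sequence[Awaitable[Any]],
    desc: str,
    every_n: int = 100,
    every_secs: float = 30.0,
) -> List[Any]:
    """Await all jobs, logging at INFO every N completions or every M seconds."""
    total = len(jobs)
    results: List[Any] = [None] * total
    completed = 0
    last_log = time.monotonic()

    async def _run(index: int, job: Awaitable[Any]) -> None:
        nonlocal completed, last_log
        results[index] = await job  # keep results in submission order
        completed += 1
        now = time.monotonic()
        due = completed % every_n == 0 or completed == total
        if due or now - last_log >= every_secs:
            last_log = now
            pct = 100.0 * completed / total
            # One newline-terminated record per update, so log collectors keep it.
            logger.info("%s: %d/%d nodes (%.0f%%)", desc, completed, total, pct)

    await asyncio.gather(*(_run(i, job) for i, job in enumerate(jobs)))
    return results
```

A caller could then branch on use_progress_bar(): keep the existing show_progress=True path in a TTY, and otherwise pass show_progress=False and rely on the INFO lines, e.g. `await gather_with_progress(jobs, "Extracting propositions", every_n=500)`.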

Environment:

  • graphrag-toolkit v3.16.2 (lexical-graph)
  • Running on ECS Fargate, logs via CloudWatch awslogs driver
  • llama_index run_jobs() with show_progress=True
