Feat/multi node rank collective report #458

Open
mvstrauss wants to merge 3 commits into main from feat/multi-node-rank-collective-report

@mvstrauss
Summary

  • Add --trace_glob and --rank_regex support to generate_multi_rank_collective_report_pytorch.py so it can handle TensorBoard-style and multi-node trace filenames in nested directories.
  • Add optional node-aware reporting via --gpus_per_node, producing node-span (intra_node vs inter_node) summary sheets.
  • Add minimal progress/status prints (with elapsed time) during trace resolution, loading, and report generation to avoid long “silent” runs.

Details

  • New input mode: --trace_glob (recursive glob) with rank extraction using --rank_regex.
  • Node-span summary sheets (when --gpus_per_node is set):
    • nccl_summary_long_node_span
    • nccl_summary_implicit_node_span
  • Default output path is derived from the resolved trace file paths (works for glob/pattern modes).
  • Documentation updated in docs/generate_multi_rank_collective_report_pytorch.md.
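
To illustrate what the node-span sheets distinguish, here is a minimal sketch of classifying a collective's participating ranks as intra_node or inter_node. It assumes ranks are assigned to nodes contiguously (node = rank // gpus_per_node), which this PR description does not confirm; `node_of` and `node_span` are hypothetical helper names, not functions from TraceLens.

```python
from typing import Iterable


def node_of(rank: int, gpus_per_node: int) -> int:
    """Map a global rank to a node index.

    Assumption: ranks are laid out contiguously, so ranks 0..7 sit on
    node 0, ranks 8..15 on node 1, and so on.
    """
    return rank // gpus_per_node


def node_span(ranks: Iterable[int], gpus_per_node: int) -> str:
    """Label a group of ranks as 'intra_node' or 'inter_node'."""
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be a positive integer")
    nodes = {node_of(rank, gpus_per_node) for rank in ranks}
    if not nodes:
        raise ValueError("ranks must not be empty")
    # One distinct node means every participant shares a host.
    return "intra_node" if len(nodes) == 1 else "inter_node"


if __name__ == "__main__":
    print(node_span([0, 3, 7], gpus_per_node=8))
    print(node_span([7, 8], gpus_per_node=8))
```

With 8 GPUs per node, a group containing ranks 7 and 8 crosses a node boundary under this layout, while a group drawn only from ranks 0 to 7 does not.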

Test plan

Usage example for summarizing profiles from a 2-node experiment with 8 GPUs per node (world size 16):

python -m TraceLens.Reporting.generate_multi_rank_collective_report_pytorch \
  --trace_glob "/path/to/your/run/tensorboard/**/**.pt.trace.json.gz" \
  --rank_regex "rank\\[(?P<rank>\\d+)\\]" \
  --world_size 16 \
  --gpus_per_node 8 \
  --detailed_analysis \
  --output_xlsx_path "/path/to/output/nccl_analysis_report.xlsx"
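
For reviewers who want to sanity-check the pattern, the sketch below shows how a recursive glob plus the rank regex from the command above could resolve trace files to ranks. Inside double quotes the shell collapses each `\\` to `\`, so the script receives `rank\[(?P<rank>\d+)\]`. The helpers `extract_rank` and `resolve_traces` and the sample filename are hypothetical; the actual resolution logic in the PR may differ.

```python
import glob
import re
from typing import Dict

# The regex as the script receives it after shell unescaping.
RANK_REGEX = r"rank\[(?P<rank>\d+)\]"


def extract_rank(path: str, rank_regex: str = RANK_REGEX) -> int:
    """Return the integer captured by the named group 'rank' in path."""
    match = re.search(rank_regex, path)
    if match is None:
        raise ValueError(f"no rank found in {path!r}")
    return int(match.group("rank"))


def resolve_traces(trace_glob: str, rank_regex: str = RANK_REGEX) -> Dict[int, str]:
    """Expand a recursive glob and map each rank to its trace file."""
    rank_to_path: Dict[int, str] = {}
    for path in sorted(glob.glob(trace_glob, recursive=True)):
        rank = extract_rank(path, rank_regex)
        if rank in rank_to_path:
            raise ValueError(f"rank {rank} matched more than one file")
        rank_to_path[rank] = path
    return rank_to_path


if __name__ == "__main__":
    # Hypothetical tensorboard-style filename, for illustration only.
    sample = "/run/tensorboard/node1/host_rank[12].1700000000.pt.trace.json.gz"
    print(extract_rank(sample))
```

Rejecting duplicate ranks is a design choice in this sketch: with nested directories it is easy for a stale trace from an earlier run to match the same glob.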
