Release v0.5.0

Latest

briancoutinho released this 28 May 16:45

726a5b5

Summary

Added

Added support for AMD GPUs.
Updated pyproject.toml to work around missing stub packages for yaml.
Added trace format validator.
Added multiple trace filter classes and demos.
Added enhanced trace call stack graph implementation.
Added memory timeline view.
Added support for trace parser customization.
Added support for H100 traces.
Added NCCL collective fields to parser config.
Queue length analysis: added a feature to compute the time blocked on a stream that has hit its maximum queue length.
Added kernel_backend to parser config for Triton / torch.compile() support.
Added analysis features for GPU user annotation attribution at the trace and kernel level.
Added support for parsing all trace event args.
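
As an illustration of the queue length item above, here is a minimal sketch of how time spent at the maximum queue length could be computed from a sampled queue-length series. This is not the library's implementation: the function name time_blocked_at_max_queue, the (timestamp, queue_length) sample layout, and the limit of 4 are all hypothetical and chosen only for the example.

```python
from typing import List, Tuple


def time_blocked_at_max_queue(
    samples: List[Tuple[int, int]], max_queue_length: int
) -> int:
    """Sum the time a stream's queue sits at (or above) the maximum length.

    `samples` is a time-sorted list of (timestamp, queue_length) pairs for
    one stream (hypothetical layout). The queue length is treated as a step
    function: each sample's value holds until the next sample.
    """
    blocked = 0
    for (ts, qlen), (next_ts, _) in zip(samples, samples[1:]):
        if qlen >= max_queue_length:
            blocked += next_ts - ts
    return blocked


# Hypothetical samples in microseconds; the limit of 4 is illustrative only.
samples = [(0, 1), (10, 3), (25, 4), (40, 4), (55, 2), (70, 4), (80, 0)]
print(time_blocked_at_max_queue(samples, max_queue_length=4))  # 40
```

The queue is full during 25-55 and 70-80, which gives 30 + 10 = 40 time units.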

New Feature: Critical Path Analysis

Added lightweight critical path analysis feature.
Critical path analysis features: event attribution and summary().
Critical path analysis fixes: fixed async memcpy handling and added GPU-to-CPU event-based synchronization.
Added save and restore feature for critical path graph.
Fixed a bug in critical path analysis when listing the edges on the critical path.
Updated critical path analysis with edge attribution.
Improvement: allow filtering of flow events in the overlaid trace.
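
The release notes do not describe the algorithm behind this feature. For readers unfamiliar with the term, a critical path is commonly framed as the longest weighted path through a dependency graph (a DAG). The sketch below shows that general idea using only the Python standard library (graphlib requires Python 3.9+). The function critical_path, the node names, and the edge weights are hypothetical, and this is not the library's API or implementation.

```python
from collections import defaultdict
from graphlib import TopologicalSorter
from typing import Dict, List, Optional, Tuple


def critical_path(edges: List[Tuple[str, str, int]]) -> Tuple[int, List[str]]:
    """Return (total weight, node list) of the longest path in a DAG.

    `edges` holds (src, dst, weight) triples; the weight stands for the
    time attributed to that dependency.
    """
    preds: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    nodes = set()
    for src, dst, weight in edges:
        preds[dst].append((src, weight))
        nodes.update((src, dst))

    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {n: [p for p, _ in preds[n]] for n in nodes}
    dist: Dict[str, int] = {}
    parent: Dict[str, Optional[str]] = {}
    for node in TopologicalSorter(graph).static_order():
        dist[node], parent[node] = 0, None
        for pred, weight in preds[node]:
            if dist[pred] + weight > dist[node]:
                dist[node], parent[node] = dist[pred] + weight, pred

    # Walk back from the node with the largest accumulated weight.
    end = max(dist, key=dist.get)
    path, cursor = [], end
    while cursor is not None:
        path.append(cursor)
        cursor = parent[cursor]
    return dist[end], path[::-1]


# Hypothetical dependency graph: a CPU op launches a GPU kernel, and both
# the kernel and other CPU work must finish before a synchronization point.
edges = [
    ("cpu_op_start", "kernel_launch", 5),
    ("kernel_launch", "gpu_kernel", 2),
    ("gpu_kernel", "sync", 30),
    ("cpu_op_start", "cpu_other", 10),
    ("cpu_other", "sync", 8),
]
total, path = critical_path(edges)
print(total, path)
```

In this example the GPU branch (5 + 2 + 30 = 37) outweighs the CPU branch (10 + 8 = 18), so the GPU branch is the critical path.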

Changed

Changed the test data path in unit tests from a relative path to a real path to support running tests within IDEs.
Added a workaround for overlapping events when using ns resolution traces (pytorch/pytorch#122425).
Better handling of CUDA sync events with stream = -1.
Fixed the ijson metadata parser for some corner cases.
Added an option for ns rounding and covered ijson loading with it.
Updated the Trace() API to accept a list of files and automatically determine ranks.

Fixed

Fixed issue #65 to handle floating point counter values in cupti_counter_analysis.

Full Changelog: v0.2.0...v0.5.0
