Summary
Added
- Added support for AMD GPUs.
- Update pyproject.toml to work around missing stub packages for yaml.
- Add trace format validator.
- Added multiple trace filter classes and demos.
- Added enhanced trace call stack graph implementation.
- Added memory timeline view.
- Added support for trace parser customization.
- Added support for H100 traces.
- Add NCCL collective fields to parser config.
- Queue length analysis: Add feature to compute time blocked on a stream hitting max queue length.
- Add kernel_backend to parser config for Triton / torch.compile() support.
- Add analysis features for GPU user annotation attribution at trace and kernel level.
- Add support to parse all trace event args.
New Feature: Critical Path Analysis
- Added lightweight critical path analysis feature.
- Critical path analysis features: event attribution and summary()
- Critical path analysis fixes: fixing async memcpy and adding GPU to CPU event based synchronization.
- Added save and restore feature for critical path graph.
- Fixed a bug in critical path analysis when listing the edges on the critical path.
- Updated critical path analysis with edge attribution.
- Improvement: allow filtering of flow events in the overlaid trace.
Changed
- Change test data path in unit tests from relative path to real path to support running tests within IDEs.
- Add a workaround for overlapping events when using ns resolution traces (pytorch/pytorch#122425).
- Better handling of CUDA sync events with stream = -1.
- Fix ijson metadata parser for some corner cases.
- Add an option for ns rounding and cover ijson loading with it.
- Updated Trace() API to accept a list of files and automatically infer ranks.
Fixed
- Fixed issue #65 to handle floating point counter values in cupti_counter_analysis.
Full Changelog: v0.2.0...v0.5.0