Switching the repo to public
What's Changed
- Alternative subtract_intervals function optimized for scalability by @tjkemp in #89
- Add MIT license to all meaningful files by @gabeweisz in #92
- Bugfix/conv output shape by @RElbers in #95
- dim eff analysis by @ajassani in #94
- Stagingv0.4 by @ajassani in #96
- pass arch and detail to all op by @ajassani in #97
- Add additional Jax data analyses by @gabeweisz in #98
- Add in GEMM shape analysis from XLA file by @gabeweisz in #100
- move convert import up by @gabeweisz in #101
- only include linked kernels in gpu timeline by @ajassani in #110
- Feat/event replay by @ajassani in #102
- include nn module in kernel view and refine by @ajassani in #116
- add aten::bmm perf model by @ajassani in #117
- Feat/topk ops by @ajassani in #120
- Feat/nn module parent by @ajassani in #121
- feat: TraceLens UI by @mpashkovskii in #99
- Fix get_breakdown_df_multigpu to filter out CPU events by @gabeweisz in #122
- include args in perf metrics table by @ajassani in #130
- short term fix for te linear gemms by @ajassani in #126
- add perf model for aten baddbmm by @ajassani in #131
- annotate gpu events by stream index by @ajassani in #135
- Reorg examples by @ajassani in #136
- Feat/unique args by @ajassani in #137
- feat: transformer engine ver 1 GEMM ops te_gemm_ts added by @olehtika in #125
- parse trans from args and add te gemm name to list in examples by @ajassani in #140
- Add GEMM performance model support for Jax by @gabeweisz in #139
- Integrate gemmologist for modeling gemm efficiencies. by @araina-amd in #133
- use correct dtype in cmd by @ajassani in #141
- Jax support for gemmologist integration by @gabeweisz in #143
- fix: tex_ts::te_gemm_ts missing dtype by @olehtika in #142
- allow causal attention by @ajassani in #145
- support gqa by @ajassani in #147
- tev2 native support by @ajassani in #148
- native cat 2 op names by @ajassani in #149
- megatron lm custom flow by @ajassani in #152
- fix flash true in fusedattnfunc by @ajassani in #153
- Fix/unique args kernel names by @ajassani in #154
- Fix/megatron gemm dtype by @ajassani in #156
- Fix/te dtype by @ajassani in #157
- Feat/qkv stride by @ajassani in #159
- torch_op_mapping: fix typo, add clamp_max by @lauri9 in #162
- Add script to filter trace to range of given user annotation by @lauri9 in #163
- Feat/custom report by @ajassani in #166
- Feat/nn flops by @ajassani in #169
- Feat/replay from report by @ajassani in #168
- Feat/aten sdp eff atten by @ajassani in #172
- Test/event replay by @ajassani in #173
- Fix/event replay by @ajassani in #174
- Fix/torch import by @ajassani in #175
- Test/event replay improve by @ajassani in #176
- Feat/evt replay example improve by @ajassani in #177
- add conditional strenum as suggested by @jakki-amd by @gabeweisz in #161
- Fix/lib deps by @ajassani in #178
- continue on fn call error by @ajassani in #179
- explicit openpyxl import error by @ajassani in #180
- sort false to avoid error by @ajassani in #181
- Add roofline_analyzer 0.1.1 by @tykow in #182
- regression test for perf report by @ajassani in #183
- fix is tensor type by @ajassani in #188
- Feat/perf report compare by @ajassani in #193
- NCCLAnalyzer: add missing allgather_into_tensor_coalesced collective name by @lauri9 in #194
- Enabling batch gemm through gemmologist. by @araina-amd in #189
- NcclAnalyser: add gzip support by @lauri9 in #196
- add flash_attn::_flash_attn_forward by @lauri9 in #197
- feat: extend performance report for multiple GPU ranks and trace files by @olehtika in #186
- Revert "feat: extend performance report for multiple GPU ranks and tr… by @olehtika in #198
- fix: fallback for StrEnum ImportError for Python 3.10 by @olehtika in #205
- feat: use util.DataLoader for all data loading by @olehtika in #208
- fix: JaxAnalysis FP8 GEMM kernel missing issue by @olehtika in #203
- SDPA/Flash attention changes for the perf model by @araina-amd in #191
- Partition on K and V instead of Q in SDPA backward pass by @araina-amd in #211
- option to print cpu op dispatch args by @ajassani in #213
- add aten::_scaled_dot_product_flash_attention to perf model by @ajassani in #217
- Feat/call stack analysis by @ajassani in #220
- dropout not needed for perf metrics by @ajassani in #221
- warn on dtype mismatch in gemms by @ajassani in #223
- feat: jax analysis reporting command line tool by @olehtika in #210
- Fix/jax analyses do not parse file header by @jujaykka in #218
- Fix/processing performance fixes by @jujaykka in #219
- minimize installation dependencies by @ajassani in #224
- Add GitHub Actions Workflows by @spandoesai in #222
- torch op categorization by @ajassani in #228
- fix error on dtype mismatch by @ajassani in #231
- fix edge cases for perf report by @ajassani in #235
- quick fix for execute by @ajassani in #236
- add openpyxl as requirement by @stephen-youn in #230
- overcome openpyxl dependency by @ajassani in #238
- gemms conceptual by @ajassani in #134
- notebook for autocast exploration by @ajassani in #240
- Commandline interface for roofline and code refactoring for perf model by @araina-amd in #229
- Added Perf Model for aiter::flash_attn by @spandoesai in #242
- Docs/trace2tree by @ajassani in #243
- fix perf report comparison script by @ajassani in #244
- TraceMap: interactive HTML dashboards for analyzing vLLM workloads (added to custom_workflows) by @hyukjlee in #246
- warn when compute perf metrics fails by @ajassani in #248
- Check if m, n and k are not none in gemm. by @araina-amd in #247
- add feature for extension by @ajassani in #249
- kernel detail update by @ajassani in #245
- Added the TraceDiff API to TraceLens by @spandoesai in #241
- add support for grouped gemm by @ajassani in #254
- allow different d_h_qk and d_h_v by @ajassani in #255
- fix sdpa param extract by @ajassani in #256
- some fixes for double dtype by @ajassani in #259
- Fix/event replay double by @ajassani in #260
- counts in repro by @ajassani in #261
- Refact/perf report by @ajassani in #264
- refact compare perf report by @ajassani in #265
- Feat/jax trace2 tree by @olehtika in #257
- FIX: Update te_gemms.py to be compatible with TreePerfAnalyzer. by @wenchenvincent in #266
- include collective analysis in report by @ajassani in #268
- Feat/jax trace2 tree add gpu kernel operation category by @olehtika in #267
- [NcclAnalyser]: Add option to warn instead of raise on metadata consistency failure by @lauri9 in #271
- include start time stamp for df_kernels by @ajassani in #273
- TraceDiff Graph View Quality of Life Changes by @spandoesai in #269
- profiling guide by @ajassani in #274
- add docs for profile analysis by @ajassani in #275
- use fp8 arg for perf model by @ajassani in #277
- Added New Perf Model for aiter::fmha_v3 (fwd and bwd) by @spandoesai in #278
- add script for collective analysis by @ajassani in #283
- Added Perf Model for FAv3 (fwd only) by @spandoesai in #284
- Add support for flash_attn variable sequence length by @wenchenvincent in #272
- Add logic for vLLM Llama-3.3-70B FP4 by @poznano-amd in #270
- add graph launch gpu events correctly to tree by @ajassani in #292
- fix to include unlinked runtime events in kernel launchers by @ajassani in #293
- Feat/jax tree perf perf model by @gphuang in #286
- Fixes for JAX. by @araina-amd in #282
- fix missing dictionary check by @gabeweisz in #323
- Switch installed jax report from jax_analysis_report to jax_report by @gabeweisz in #327
- 325 creating one excel file with multiple tabs rather than one file per report by @gphuang in #338
- add contributing md by @ajassani in #356
- Update CONTRIBUTING.md by @ajassani in #360
- update readme by @ajassani in #362
- Feat/jax perf analysis keyword filter by @olehtika in #332
- 308 jax tree perf analyzer parse gemm metadata value error invalid operand list by @gphuang in #347
- Add trtllm::cublas_scaled_mm to torch_op_mapping by @poznano-amd in #368
- include only non empty roofline sheets in perf report by @ajassani in #369
- include unlinked kernels by @ajassani in #370
- docs(contributing): add linting and branch naming rules by @spandoesai in #371
- New and Improved Compare Report Regression Testing by @spandoesai in #372
- add logging levels by @gphuang in #367
- add feature to optionally separate idle time into micro and macro by @ajassani in #375
- Added PR and Issue Templates by @spandoesai in #376
- docs(test-traces-disclaimer): add disclaimer for open-source test traces clarifying non-official perf usage by @ajassani in #379
- chore(license): update .py files to use AMD copyright header by @ajassani in #381
- Chore/license/update headers by @ajassani in #383
- Update issue templates by @ajassani in #387
- fix(tests): resolve non-determinism in perf report comparison by @ajassani in #385
- test(ci): add copyright header validation to regression tests by @ajassani in #384
- ci: added automation test for linting check using the black formatter by @spandoesai in #388
- feat: JAX XLA op analysis including GPU kernel launch latency and operation total input bytes calculation by @olehtika in #392
- Test/339 add unit test with minimal jax conv by @gphuang in #380
New Contributors
- @tjkemp made their first contribution in #89
- @mpashkovskii made their first contribution in #99
- @olehtika made their first contribution in #125
- @araina-amd made their first contribution in #133
- @lauri9 made their first contribution in #162
- @tykow made their first contribution in #182
- @jujaykka made their first contribution in #218
- @spandoesai made their first contribution in #222
- @stephen-youn made their first contribution in #230
- @hyukjlee made their first contribution in #246
- @wenchenvincent made their first contribution in #266
- @poznano-amd made their first contribution in #270
- @gphuang made their first contribution in #286
Full Changelog: v0.3.0...v0.4.0