[Refactor][Benchmark] Migrate microbench hardware characterization into benchmarks/hardware/ #639
lcy-seso wants to merge 1 commit into tile-ai:main
Pull request overview
Migrates the hardware characterization microbenchmark suite into benchmarks/hardware/ and adds a CSV→YAML profile generator to produce GPU peak/calibration data (e.g., H200) for downstream performance modeling.
Changes:
- Add `benchmarks/hardware/` microbench suite (memory/compute/system) with shared utils and checked-in CSV results.
- Add `scripts/generate_gpu_profile.py` plus pytest coverage to generate validated GPU profile YAMLs from benchmark CSVs.
- Add an H200 profile YAML (`tileops/perf/profiles/h200.yaml`) with measured HBM bandwidth and FP16/BF16 Tensor Core throughput.
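As a rough illustration of the CSV-to-profile step described above, the sketch below reads benchmark rows, takes the best measurement, and derives a calibration factor. It is not the code in `scripts/generate_gpu_profile.py`: the function name, the column names, the sample rows, and the definition of the calibration factor as measured peak divided by theoretical peak are all assumptions. The 4800 GB/s theoretical figure is likewise assumed (it is not stated in this PR); only the 4255 GB/s measurement comes from the PR description.

```python
import csv
import io


def peak_and_calibration(csv_text, column, theoretical_peak):
    """Return (measured_peak, calibration_factor) for one metric column.

    Hypothetical helper: the calibration factor is assumed here to be
    measured_peak / theoretical_peak.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    if not values:
        raise ValueError("CSV contains no data rows")
    measured = max(values)  # best observed value across all configurations
    return measured, measured / theoretical_peak


# Hypothetical CSV layout; the real schema lives in benchmarks/hardware/utils/output.py.
SAMPLE = """size_mb,bandwidth_gbps
256,4100.0
1024,4255.0
4096,4230.5
"""

# 4800.0 GB/s is an assumed theoretical HBM peak, not a value taken from the PR.
measured, factor = peak_and_calibration(SAMPLE, "bandwidth_gbps", 4800.0)
print(f"measured={measured:.1f} GB/s, calibration={factor:.3f}")
```

A downstream model could then multiply a theoretical peak by such a factor to estimate what is achievable in practice.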
Reviewed changes
Copilot reviewed 33 out of 34 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tileops/perf/__init__.py | Introduces perf module package marker/docstring. |
| tileops/perf/profiles/__init__.py | Introduces profiles subpackage marker/docstring. |
| tileops/perf/profiles/h200.yaml | Adds generated H200 measured/theoretical peaks + calibration factors. |
| scripts/generate_gpu_profile.py | New CLI/script to parse benchmark CSVs and emit GPU profile YAML. |
| tests/hardware/__init__.py | Adds hardware test package marker. |
| tests/hardware/test_hardware_structure.py | Smoke test asserting benchmark directory structure + key files exist. |
| tests/hardware/test_generate_gpu_profile.py | Tests CSV parsing/profile generation + validates committed H200 profile. |
| benchmarks/hardware/__init__.py | Adds top-level hardware benchmark package docs. |
| benchmarks/hardware/run_all.py | Convenience runner to execute core Python benchmarks in a fixed order. |
| benchmarks/hardware/utils/__init__.py | Re-exports shared utilities for benchmarks. |
| benchmarks/hardware/utils/bench.py | Adds do_bench wrapper + bandwidth/TFLOPS calculators. |
| benchmarks/hardware/utils/env.py | Adds env capture and theoretical peak tables for supported GPUs. |
| benchmarks/hardware/utils/output.py | Adds CSV schema/constants and a CSV writer targeting benchmarks/hardware/results/. |
| benchmarks/hardware/results/hbm_peak.csv | Checks in HBM peak measurement CSV used for offline profile generation. |
| benchmarks/hardware/results/gemm_throughput.csv | Checks in GEMM throughput CSV including BF16 rows. |
| benchmarks/hardware/memory/__init__.py | Memory benchmarks package marker/docstring. |
| benchmarks/hardware/memory/hbm_bandwidth.py | Python HBM bandwidth benchmark emitting bandwidth.csv. |
| benchmarks/hardware/memory/l2_bandwidth.py | L2 “working set” reuse benchmark emitting l2_cache.csv. |
| benchmarks/hardware/memory/shared_bandwidth.py | TileLang shared-memory bandwidth benchmark emitting shared_memory.csv. |
| benchmarks/hardware/memory/latency.py | Latency suite including pointer-chase integration and CSV output. |
| benchmarks/hardware/memory/pointer_chase.cu | CUDA pointer-chase kernel used by latency benchmark. |
| benchmarks/hardware/memory/hbm_saturation.cu | Standalone CUDA “cookbook” for HBM saturation exploration. |
| benchmarks/hardware/compute/__init__.py | Compute benchmarks package marker/docstring. |
| benchmarks/hardware/compute/gemm_throughput.py | GEMM TFLOPS benchmark for cuBLAS/TileLang (FP16 + BF16). |
| benchmarks/hardware/system/__init__.py | System benchmark package marker with nvcc build notes. |
| benchmarks/hardware/system/stream_sync.py | Python stream/event synchronization overhead experiments. |
| benchmarks/hardware/system/sync_overhead.cu | CUDA sync primitive overhead microbench (reference). |
| benchmarks/hardware/system/atomic_overhead.cu | CUDA atomic contention microbench (reference). |
| benchmarks/hardware/system/bank_conflict.cu | CUDA shared-memory bank conflict microbench (reference). |
| benchmarks/hardware/system/bank_conflict_v2.cu | Updated bank conflict microbench variant (reference). |
| benchmarks/hardware/system/async_copy.cu | CUDA cp.async pipeline microbench (reference). |
| benchmarks/hardware/system/warp_spec.cu | CUDA warp-specialization microbench (reference). |
| benchmarks/hardware/system/occupancy_latency.cu | CUDA occupancy vs throughput microbench (reference). |
| benchmarks/hardware/system/register_spill.cu | CUDA register pressure/spill microbench (reference). |
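The bandwidth and TFLOPS calculators listed for `benchmarks/hardware/utils/bench.py` are not shown on this page. A minimal sketch of the standard formulas such helpers usually implement is given below; the function names and signatures are hypothetical, and the GEMM cost model assumes the conventional 2·M·N·K floating-point operations per matrix multiply.

```python
def gemm_tflops(m, n, k, elapsed_ms):
    """Throughput of an (m x k) @ (k x n) GEMM in TFLOPS.

    Assumes 2*m*n*k FLOPs (one multiply and one add per inner-product term).
    """
    flops = 2.0 * m * n * k
    return flops / (elapsed_ms * 1e-3) / 1e12


def bandwidth_gbps(bytes_moved, elapsed_ms):
    """Achieved bandwidth in GB/s (decimal gigabytes) for a timed transfer."""
    return bytes_moved / (elapsed_ms * 1e-3) / 1e9


# Illustrative inputs only; these are not measurements from the PR.
print(gemm_tflops(8192, 8192, 8192, elapsed_ms=3.0))
print(bandwidth_gbps(4 * 1024**3, elapsed_ms=1.0))
```

Whether `bytes_moved` counts reads, writes, or both changes the reported figure, so a real benchmark should state which convention it uses.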
Code Review
This pull request is a significant and valuable addition, migrating the hardware microbenchmarks into the main repository with a well-organized structure. The benchmarks are comprehensive, covering memory, compute, and system-level characteristics. The inclusion of a script to generate GPU profiles from the results is a great feature for performance modeling.
My review focuses on a couple of areas to improve robustness and maintainability:
- The `CSVWriter` utility can be made more robust by implementing it as a context manager to prevent resource leaks.
- There is a duplicated constant in `scripts/generate_gpu_profile.py` that should be imported to follow the DRY principle.
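The context-manager suggestion can be sketched as follows. `ManagedCSVWriter` is a hypothetical stand-in, not the PR's `CSVWriter`: its constructor arguments and `write_row` method are assumptions chosen for illustration, and only the `__enter__`/`__exit__` pattern is the point.

```python
import csv
import os
import tempfile


class ManagedCSVWriter:
    """Hypothetical CSV writer that closes its file even if the body raises."""

    def __init__(self, path, fieldnames):
        self.path = path
        self.fieldnames = fieldnames
        self._file = None
        self._writer = None

    def __enter__(self):
        # newline="" lets the csv module control line endings itself.
        self._file = open(self.path, "w", newline="")
        self._writer = csv.DictWriter(self._file, fieldnames=self.fieldnames)
        self._writer.writeheader()
        return self

    def write_row(self, row):
        self._writer.writerow(row)

    def __exit__(self, exc_type, exc, tb):
        self._file.close()
        return False  # do not swallow exceptions from the with-body


# Usage: the file handle is released when the block exits, normally or not.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with ManagedCSVWriter(path, ["benchmark", "value"]) as writer:
    writer.write_row({"benchmark": "hbm_read", "value": 4255.0})
```

Returning `False` from `__exit__` keeps benchmark failures visible to the caller while still guaranteeing the close.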
Overall, this is an excellent contribution that provides a solid foundation for hardware characterization.
…to benchmarks/hardware/

Migrate the hardware characterization benchmark suite from tilelang-microbench into benchmarks/hardware/ with memory/, compute/, and system/ subdirectories. Add scripts/generate_gpu_profile.py to produce gpu_profile.yaml from benchmark results, include BF16 GEMM alongside FP16, and generate an H200 profile with calibrated HBM bandwidth and Tensor Core TFLOPS values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
12e5a95 to 5b188b2
Closes #630
Summary
- Migrate the hardware characterization benchmark suite into `benchmarks/hardware/` with `memory/`, `compute/`, and `system/` subdirectories, plus shared `utils/` and `results/` directories
- Add `scripts/generate_gpu_profile.py`, which reads CSV benchmark results and produces calibrated `gpu_profile.yaml` files for the roofline model
- Add `tileops/perf/profiles/h200.yaml` with measured HBM bandwidth (4255 GB/s) and Tensor Core TFLOPS (fp16: 370.3, bf16: 364.0) with calibration factors

Test plan
- `pytest tests/hardware/test_hardware_structure.py`: validates directory structure and file presence (AC-1)
- `pytest tests/hardware/test_generate_gpu_profile.py`: validates profile generation, schema compliance, and calibration factors (AC-2, AC-4)
- BF16 GEMM results are present in `gemm_throughput.csv` and consumed by `parse_gemm_results` (AC-3)

Additional context
- CUDA microbenchmarks (`.cu` files) require `nvcc` with `-arch=sm_90` for H200; they are included for reference but not compiled during CI

🤖 Generated with Claude Code
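To show how the measured peaks in the summary would feed a roofline model, here is a minimal sketch using the textbook formulation: attainable throughput is the smaller of the compute peak and bandwidth times arithmetic intensity. The function names are hypothetical and this is not the tileops implementation; the two constants are the H200 figures quoted in the PR description.

```python
def attainable_tflops(intensity, peak_tflops, bandwidth_gbps):
    """Roofline bound in TFLOPS for a kernel with the given FLOP/byte intensity."""
    # GB/s * FLOP/byte = GFLOP/s; divide by 1e3 to express it in TFLOP/s.
    memory_bound = bandwidth_gbps * intensity / 1e3
    return min(peak_tflops, memory_bound)


def ridge_point(peak_tflops, bandwidth_gbps):
    """Arithmetic intensity (FLOP/byte) at which a kernel turns compute-bound."""
    return peak_tflops * 1e3 / bandwidth_gbps


# Measured H200 values quoted in the PR description.
HBM_GBPS = 4255.0
FP16_TFLOPS = 370.3

print(ridge_point(FP16_TFLOPS, HBM_GBPS))
print(attainable_tflops(10.0, FP16_TFLOPS, HBM_GBPS))   # memory-bound regime
print(attainable_tflops(200.0, FP16_TFLOPS, HBM_GBPS))  # compute-bound regime
```

With these measured peaks the ridge point sits near 87 FLOP/byte, so kernels below that intensity are limited by HBM bandwidth rather than by Tensor Core throughput.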