
[Refactor][Benchmark] Migrate microbench hardware characterization into benchmarks/hardware/#639

Open
lcy-seso wants to merge 1 commit into tile-ai:main from lcy-seso:refactor/benchmark/issue-630

Conversation

@lcy-seso (Collaborator)

Closes #630

Summary

  • Migrate the hardware characterization benchmark suite from tilelang-microbench into benchmarks/hardware/ with memory/, compute/, and system/ subdirectories, plus shared utils/ and results/ directories
  • Add scripts/generate_gpu_profile.py that reads CSV benchmark results and produces calibrated gpu_profile.yaml files for the roofline model
  • Include BF16 GEMM throughput benchmark alongside existing FP16 measurement
  • Generate tileops/perf/profiles/h200.yaml with measured HBM bandwidth (4255 GB/s) and Tensor Core TFLOPS (fp16: 370.3, bf16: 364.0) with calibration factors
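The CSV-to-YAML flow described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual code of scripts/generate_gpu_profile.py: the metric names, CSV columns, theoretical-peak table, and YAML keys are all assumptions; only the calibration idea (measured value divided by theoretical peak) comes from the PR description.

```python
import csv
import io

# Illustrative theoretical peaks (datasheet-style numbers, assumed here,
# not taken from the repository).
THEORETICAL = {"hbm_gbs": 4800.0, "fp16_tflops": 989.0}

def parse_results(csv_text):
    """Return {metric: measured_value} from rows shaped like 'metric,value'."""
    measured = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        measured[row["metric"]] = float(row["value"])
    return measured

def to_profile_yaml(measured):
    """Emit a minimal YAML profile with calibration = measured / theoretical."""
    lines = []
    for metric, value in sorted(measured.items()):
        peak = THEORETICAL[metric]
        lines.append(f"{metric}:")
        lines.append(f"  measured: {value}")
        lines.append(f"  theoretical: {peak}")
        lines.append(f"  calibration: {value / peak:.4f}")
    return "\n".join(lines)

csv_text = "metric,value\nhbm_gbs,4255\nfp16_tflops,370.3\n"
print(to_profile_yaml(parse_results(csv_text)))
```

The point of the calibration factor is that a roofline model can scale a datasheet peak down to what the hardware actually delivers on this machine.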

Test plan

  • pytest tests/hardware/test_hardware_structure.py — validates directory structure and file presence (AC-1)
  • pytest tests/hardware/test_generate_gpu_profile.py — validates profile generation, schema compliance, and calibration factors (AC-2, AC-4)
  • BF16 GEMM rows present in gemm_throughput.csv and consumed by parse_gemm_results (AC-3)
  • Full smoke suite: 867 tests passed, 0 failed
  • Pre-commit lint passes cleanly

Additional context

  • The original tilelang-microbench repo remains as-is (archive) — this is a copy, not a move
  • CUDA benchmarks (.cu files) require nvcc with -arch=sm_90 for H200; they are included for reference but not compiled during CI
  • CSV result files are checked in to enable offline profile generation without GPU access

🤖 Generated with Claude Code

@lcy-seso added the labels refactor (Code restructuring without behavior change) and all-ai-powered (Produced entirely by automated contributors) on Mar 23, 2026
@lcy-seso (Collaborator, Author)

/gemini review

@gemini-code-assist (Contributor)

Warning: Gemini is experiencing higher than usual traffic and was unable to create the summary. Please try again in a few hours by commenting /gemini summary.

Copilot AI left a comment

Pull request overview

Migrates the hardware characterization microbenchmark suite into benchmarks/hardware/ and adds a CSV→YAML profile generator to produce GPU peak/calibration data (e.g., H200) for downstream performance modeling.

Changes:

  • Add benchmarks/hardware/ microbench suite (memory/compute/system) with shared utils and checked-in CSV results.
  • Add scripts/generate_gpu_profile.py plus pytest coverage to generate validated GPU profile YAMLs from benchmark CSVs.
  • Add an H200 profile YAML (tileops/perf/profiles/h200.yaml) with measured HBM bandwidth and FP16/BF16 Tensor Core throughput.

Reviewed changes

Copilot reviewed 33 out of 34 changed files in this pull request and generated 4 comments.

Summary per file:

  • tileops/perf/__init__.py: Introduces perf module package marker/docstring.
  • tileops/perf/profiles/__init__.py: Introduces profiles subpackage marker/docstring.
  • tileops/perf/profiles/h200.yaml: Adds generated H200 measured/theoretical peaks plus calibration factors.
  • scripts/generate_gpu_profile.py: New CLI/script to parse benchmark CSVs and emit GPU profile YAML.
  • tests/hardware/__init__.py: Adds hardware test package marker.
  • tests/hardware/test_hardware_structure.py: Smoke test asserting benchmark directory structure and key files exist.
  • tests/hardware/test_generate_gpu_profile.py: Tests CSV parsing/profile generation and validates the committed H200 profile.
  • benchmarks/hardware/__init__.py: Adds top-level hardware benchmark package docs.
  • benchmarks/hardware/run_all.py: Convenience runner to execute core Python benchmarks in a fixed order.
  • benchmarks/hardware/utils/__init__.py: Re-exports shared utilities for benchmarks.
  • benchmarks/hardware/utils/bench.py: Adds do_bench wrapper plus bandwidth/TFLOPS calculators.
  • benchmarks/hardware/utils/env.py: Adds env capture and theoretical peak tables for supported GPUs.
  • benchmarks/hardware/utils/output.py: Adds CSV schema/constants and a CSV writer targeting benchmarks/hardware/results/.
  • benchmarks/hardware/results/hbm_peak.csv: Checks in HBM peak measurement CSV used for offline profile generation.
  • benchmarks/hardware/results/gemm_throughput.csv: Checks in GEMM throughput CSV including BF16 rows.
  • benchmarks/hardware/memory/__init__.py: Memory benchmarks package marker/docstring.
  • benchmarks/hardware/memory/hbm_bandwidth.py: Python HBM bandwidth benchmark emitting bandwidth.csv.
  • benchmarks/hardware/memory/l2_bandwidth.py: L2 “working set” reuse benchmark emitting l2_cache.csv.
  • benchmarks/hardware/memory/shared_bandwidth.py: TileLang shared-memory bandwidth benchmark emitting shared_memory.csv.
  • benchmarks/hardware/memory/latency.py: Latency suite including pointer-chase integration and CSV output.
  • benchmarks/hardware/memory/pointer_chase.cu: CUDA pointer-chase kernel used by the latency benchmark.
  • benchmarks/hardware/memory/hbm_saturation.cu: Standalone CUDA “cookbook” for HBM saturation exploration.
  • benchmarks/hardware/compute/__init__.py: Compute benchmarks package marker/docstring.
  • benchmarks/hardware/compute/gemm_throughput.py: GEMM TFLOPS benchmark for cuBLAS/TileLang (FP16 + BF16).
  • benchmarks/hardware/system/__init__.py: System benchmark package marker with nvcc build notes.
  • benchmarks/hardware/system/stream_sync.py: Python stream/event synchronization overhead experiments.
  • benchmarks/hardware/system/sync_overhead.cu: CUDA sync primitive overhead microbench (reference).
  • benchmarks/hardware/system/atomic_overhead.cu: CUDA atomic contention microbench (reference).
  • benchmarks/hardware/system/bank_conflict.cu: CUDA shared-memory bank conflict microbench (reference).
  • benchmarks/hardware/system/bank_conflict_v2.cu: Updated bank conflict microbench variant (reference).
  • benchmarks/hardware/system/async_copy.cu: CUDA cp.async pipeline microbench (reference).
  • benchmarks/hardware/system/warp_spec.cu: CUDA warp-specialization microbench (reference).
  • benchmarks/hardware/system/occupancy_latency.cu: CUDA occupancy vs. throughput microbench (reference).
  • benchmarks/hardware/system/register_spill.cu: CUDA register pressure/spill microbench (reference).
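The bandwidth and TFLOPS calculators mentioned for benchmarks/hardware/utils/bench.py amount to two standard formulas. The function names and signatures below are assumptions, not the file's actual API; only the arithmetic is standard.

```python
# Sketch of effective-bandwidth and GEMM-throughput helpers, assuming
# time is reported in milliseconds (as do_bench-style timers usually do).

def bandwidth_gbs(bytes_moved: int, time_ms: float) -> float:
    """Effective bandwidth in GB/s: bytes moved / (ms * 1e6)."""
    return bytes_moved / (time_ms * 1e6)

def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """GEMM throughput in TFLOPS: 2*M*N*K FLOPs over the elapsed time."""
    return (2 * m * n * k) / (time_ms * 1e9)

# Example: a 1 GiB copy (read + write = 2 GiB moved) finishing in 0.5 ms
print(bandwidth_gbs(2 * 1024**3, 0.5))       # ~4295 GB/s, near the measured HBM peak
print(gemm_tflops(8192, 8192, 8192, 2.97))   # ~370 TFLOPS
```

Note the read+write convention for copies: a memcpy-style kernel moves each byte twice (one load, one store), so the byte count is doubled before dividing by time.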

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request is a significant and valuable addition, migrating the hardware microbenchmarks into the main repository with a well-organized structure. The benchmarks are comprehensive, covering memory, compute, and system-level characteristics. The inclusion of a script to generate GPU profiles from the results is a great feature for performance modeling.

My review focuses on a couple of areas to improve robustness and maintainability:

  1. The CSVWriter utility can be made more robust by implementing it as a context manager to prevent resource leaks.
  2. There is a duplicated constant in scripts/generate_gpu_profile.py that should be imported to follow the DRY principle.

Overall, this is an excellent contribution that provides a solid foundation for hardware characterization.
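The reviewer's first suggestion can be sketched as follows. This is a hypothetical shape for the utility, assuming the class is named CSVWriter and takes a path plus field names; the actual code in benchmarks/hardware/utils/output.py may differ.

```python
import csv

# Context-manager sketch: the file handle is opened in __enter__ and is
# guaranteed to be closed in __exit__, even if a benchmark raises mid-run.
class CSVWriter:
    def __init__(self, path, fieldnames):
        self.path = path
        self.fieldnames = fieldnames
        self._fh = None
        self._writer = None

    def __enter__(self):
        self._fh = open(self.path, "w", newline="")
        self._writer = csv.DictWriter(self._fh, fieldnames=self.fieldnames)
        self._writer.writeheader()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._fh.close()   # always runs, so no leaked handle
        return False       # never swallow exceptions from the with-body

    def write_row(self, **fields):
        self._writer.writerow(fields)

# Usage: rows are flushed and the file closed when the block exits.
with CSVWriter("hbm_peak.csv", ["metric", "value"]) as w:
    w.write_row(metric="hbm_gbs", value=4255)
```

Compared with an explicit open/close pair, the with-statement makes the failure path (a crashing benchmark) leave the CSV in a closed, consistent state.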

@lcy-seso lcy-seso marked this pull request as ready for review March 23, 2026 13:15
@lcy-seso lcy-seso requested a review from a team March 23, 2026 13:15
…to benchmarks/hardware/

Migrate the hardware characterization benchmark suite from tilelang-microbench
into benchmarks/hardware/ with memory/, compute/, and system/ subdirectories.
Add scripts/generate_gpu_profile.py to produce gpu_profile.yaml from benchmark
results, include BF16 GEMM alongside FP16, and generate an H200 profile with
calibrated HBM bandwidth and Tensor Core TFLOPS values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lcy-seso lcy-seso force-pushed the refactor/benchmark/issue-630 branch from 12e5a95 to 5b188b2 Compare March 24, 2026 07:35

Labels

all-ai-powered (Produced entirely by automated contributors), refactor (Code restructuring without behavior change)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Refactor][Benchmark] Migrate microbench hardware characterization into benchmarks/hardware/

2 participants