Description
For torchbench benchmarks with the dynamo backend, the aarch64 Linux nightly wheel is about 2x slower than a wheel I built with the pytorch/builder/build_aarch64_wheel.py script from the same PyTorch commit.
The difference seems to come from the https://github.com/pytorch/builder/blob/main/aarch64_linux/aarch64_ci_build.sh script used for the nightly builds. I suspect the issue is with libomp.
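One way to test the libomp suspicion would be to compare which OpenMP runtime each wheel actually loads. The snippet below is only a sketch, assuming Linux (it reads /proc/self/maps) and assuming the runtime's file name starts with libgomp, libomp, or libiomp. The helper names find_openmp_libs and report_loaded_openmp are made up for illustration and are not part of PyTorch.

```python
import re

# Assumed file-name prefixes for common OpenMP runtimes:
# GNU (libgomp), LLVM (libomp), Intel (libiomp5).
_OMP_PATTERN = re.compile(r"/lib(gomp|omp|iomp5?)[^/]*\.so[^/]*$")


def find_openmp_libs(maps_text):
    """Return sorted unique OpenMP library paths found in /proc/<pid>/maps text."""
    libs = set()
    for line in maps_text.splitlines():
        # Columns: address, perms, offset, dev, inode, pathname (optional).
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if _OMP_PATTERN.search(path):
                libs.add(path)
    return sorted(libs)


def report_loaded_openmp():
    """Print the torch version, its parallel settings, and the loaded OpenMP libraries."""
    import torch  # importing torch loads whichever OpenMP runtime the wheel links

    print(torch.__version__)
    print(torch.__config__.parallel_info())
    with open("/proc/self/maps") as f:
        for path in find_openmp_libs(f.read()):
            print(path)
```

Calling report_loaded_openmp() once in the environment with the nightly wheel and once in the environment with the locally built wheel should show whether the two load different OpenMP libraries or report different thread settings.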
How to reproduce?
git clone https://github.com/pytorch/benchmark.git
cd benchmark
# apply this PR: https://github.com/pytorch/benchmark/pull/2187
# set OMP_NUM_THREADS=16 because I'm using a c7g.4xlarge instance
OMP_NUM_THREADS=16 python3 run_benchmark.py cpu --model hf_DistilBert --test eval --torchdynamo inductor --freeze_prepack_weights --metrics="latencies,cpu_peak_mem"