This repository contains the code for the reproduction of the paper "RoMeo: Mitigating Dual-dimensional Outliers with Rotated Mixed Precision Quantization" at PPoPP'26.
The reproduction includes Tables 1 and 2 and Figures 7, 8, 9, and 10 from the submitted version of the paper.
To reproduce this work, a GPU server with NVIDIA 4090 and H100 GPUs is required.
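Before starting, it can help to confirm that both GPU types are visible on the machine. The following is a minimal sketch and not part of the repository: the function name `check_required_gpus` is hypothetical, and it assumes the device names reported by `nvidia-smi` contain the substrings `4090` and `H100`.

```shell
# Hypothetical helper (not part of RoMeo-AE): given a newline-separated list
# of GPU names, report which of the required models are missing.
# Assumption: the driver reports names containing "4090" and "H100".
check_required_gpus() {
    names="$1"
    missing=0
    for model in 4090 H100; do
        case "$names" in
            *"$model"*) ;;                       # model found, nothing to do
            *) echo "missing GPU: $model"; missing=1 ;;
        esac
    done
    return "$missing"
}

# Query the driver only if nvidia-smi is installed on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_names="$(nvidia-smi --query-gpu=name --format=csv,noheader)"
    if check_required_gpus "$gpu_names"; then
        echo "both GPU types found"
    fi
fi
```

If the two GPU types are hosted on separate machines, run the check on each one and expect the other model to be reported as missing.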
AE reviewers: please check the comments on the HotCRP website for instructions on how to access the provided GPU servers.
To avoid environment and network issues, we strongly recommend that reviewers use our provided environment.
AE reviewers should skip this setup step and use the provided environment.
Download this repository and its submodules:
git clone --recursive https://github.com/zqh-wz/RoMeo-AE.git
cd RoMeo-AE/
Download the calibration dataset for SmoothQuant:
wget https://hf-mirror.com/datasets/mit-han-lab/pile-val-backup/resolve/main/val.jsonl.zst
zstd -d --rm val.jsonl.zst
Then, apply the necessary patches to the submodules:
cd third_party/cutlass
git apply ../patches/cutlass.patch
cd ../../
cd third_party/fast-hadamard-transform
git apply ../patches/fast-hadamard-transform.patch
cd ../../
cd third_party/QuaRot
git apply ../patches/QuaRot.patch
cd ../../
We manage Python virtual environments with uv.
bash ./scripts/create_env.sh .venv
source ./scripts/activate_env.sh .venv
bash ./scripts/install.sh
deactivate
Since the reproductions involve long-running tasks, we strongly recommend running these experiments in tmux so that your sessions remain active even if your SSH connection is interrupted.
# Create a new tmux session with a unique name
tmux new-session -s RoMeo-Reviewer-A
# Inside tmux, run your experiments as usual
bash ./scripts/reproduce.sh tab1 | tee tab1.log
# To detach from the tmux session without killing the process:
# Press Ctrl+b, then d (detach)
# To reattach to the session later:
tmux attach -t RoMeo-Reviewer-A
# To list all tmux sessions:
tmux list-sessions
For more tmux commands and advanced usage, please refer to the tmux manual (man tmux) or online documentation.
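As an alternative to attaching interactively, an experiment can be started in a detached session in a single command. The sketch below is an illustration only: `run_in_tmux` and the `DRY_RUN` variable are hypothetical names, not part of the repository. It relies on `tmux new-session -d -s <name> <shell-command>`, which creates the session without attaching to it.

```shell
# Hypothetical helper: start one reproduction target in a detached tmux
# session. With DRY_RUN=1 it only prints the tmux command it would run.
run_in_tmux() {
    session="$1"
    target="$2"
    cmd="bash ./scripts/reproduce.sh $target | tee $target.log"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'tmux new-session -d -s %s "%s"\n' "$session" "$cmd"
    else
        # -d: do not attach; the command runs inside the new session.
        tmux new-session -d -s "$session" "$cmd"
    fi
}
```

For example, `DRY_RUN=1 run_in_tmux RoMeo-Reviewer-A tab1` prints the command without starting anything. A session started this way closes when the command finishes, so the output should be read from the log file written by `tee`.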
Estimated runtime: ~90 minutes.
bash ./scripts/reproduce.sh tab1 | tee tab1.log
The result summary will be generated at reproduce/tab1/perplexity_summary.log.
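Because the command is piped through `tee`, the shell reports the exit status of `tee` rather than that of the reproduction script, so a failed run can look successful. The following is a minimal sketch, assuming bash (it uses the bash-specific `PIPESTATUS` array); the wrapper name `run_logged` is hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command, mirror its output to a log file,
# and return the command's own exit status instead of tee's.
run_logged() {
    local log="$1"
    shift
    "$@" 2>&1 | tee "$log"
    # PIPESTATUS[0] is the exit status of the first command in the pipeline.
    return "${PIPESTATUS[0]}"
}
```

A possible use is `run_logged tab1.log bash ./scripts/reproduce.sh tab1 || echo "tab1 failed"`, which also captures standard error in the log.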
Estimated runtime: ~45 minutes.
bash ./scripts/reproduce.sh tab2 | tee tab2.log
The result summary will be generated at reproduce/tab2/zero_shot_summary.log.
Note: Because the full zero-shot evaluation takes a long time, this script runs only a partial evaluation (a subset of models and benchmarks) for quick verification.
To run the full evaluation, modify reproduce/tab2/run.sh and reproduce/tab2/run_acc_allbench.sh.
Estimated runtime: ~25 minutes.
bash ./scripts/reproduce.sh fig7 | tee fig7.log
The result figure will be generated at reproduce/fig7/layer_latency.pdf.
Estimated runtime: ~25 minutes.
bash ./scripts/reproduce.sh fig8 | tee fig8.log
The result figure will be generated at reproduce/fig8/bench_kernels_results.pdf.
Figure 9. Layer-level latency breakdown for Qwen3-8B across different batch sizes with progressive optimizations.
Estimated runtime: ~5 minutes.
bash ./scripts/reproduce.sh fig9 | tee fig9.log
The result figure will be generated at reproduce/fig9/plot_breakdown.pdf.
Estimated runtime: ~50 minutes.
bash ./scripts/reproduce.sh fig10 | tee fig10.log
The result figure will be generated at reproduce/fig10/percent-ppl.pdf.
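To plan a full pass over all six targets, the estimated runtimes listed above can be summed and the commands run in sequence. This is a sketch under assumptions, not a script shipped with the repository: `TARGETS`, `total_minutes`, `run_all`, and `DRY_RUN` are hypothetical names, and it assumes every target can be run back to back on the same machine, which may not hold if the 4090 and H100 experiments are hosted separately.

```shell
# Targets paired with their estimated runtimes in minutes, as listed above.
TARGETS="tab1:90 tab2:45 fig7:25 fig8:25 fig9:5 fig10:50"

# Sum the per-target estimates.
total_minutes() {
    total=0
    for entry in $TARGETS; do
        total=$((total + ${entry#*:}))   # text after the colon is the minutes
    done
    echo "$total"
}

# Run each target in order; with DRY_RUN=1 only print the commands.
run_all() {
    for entry in $TARGETS; do
        target="${entry%%:*}"            # text before the colon is the target
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "bash ./scripts/reproduce.sh $target | tee $target.log"
        else
            bash ./scripts/reproduce.sh "$target" | tee "$target.log"
        fi
    done
}

echo "Estimated total: $(total_minutes) minutes"
```

With the estimates above, the total comes to 240 minutes, or about four hours. Running `DRY_RUN=1 run_all` first shows the six commands without executing them.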