RoMeo-AE

A. Abstract

This repository contains the code for reproducing the results of the paper "RoMeo: Mitigating Dual-dimensional Outliers with Rotated Mixed Precision Quantization" at PPoPP'26.

The reproduction includes Tables 1 and 2 and Figures 7, 8, 9, and 10 from the submitted version of the paper.

B. Prepare Hardware Environment

Reproducing this work requires a GPU server with NVIDIA RTX 4090 and H100 GPUs.
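If you use your own machine, a quick check of the visible GPU models can save time before starting. The following is a minimal sketch, not part of the repository: `check_required_gpus` is a hypothetical helper that scans the GPU names reported by `nvidia-smi` for the substrings "4090" and "H100". It assumes both GPU types are visible on one machine; if they sit on separate servers, one "missing" line per machine is expected.

```shell
# Hypothetical helper: read GPU names from stdin (one per line) and report
# whether each required model appears. Returns non-zero if any is missing.
check_required_gpus() {
    gpu_names=$(cat)
    rc=0
    for model in "4090" "H100"; do
        if printf '%s\n' "$gpu_names" | grep -q "$model"; then
            echo "found: $model"
        else
            echo "missing: $model"
            rc=1
        fi
    done
    return $rc
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name --format=csv,noheader | check_required_gpus \
        || echo "Not all required GPU models are visible on this machine."
fi
```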

AE reviewers: please check the comments on the HotCRP site for instructions on how to access the provided GPU servers.

To avoid environment and network issues, we strongly recommend that reviewers use the provided environment.

C. Prepare Software Environment

For AE reviewers, please skip this step and use the provided environment.
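If you do set up the environment yourself, the commands in this section rely on git, wget, zstd, and uv being on your PATH. The sketch below is an optional preflight check; `missing_tools` is a hypothetical helper, not a script shipped with this repository.

```shell
# Hypothetical helper: print the name of every tool given as an argument
# that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools used by the commands in sections C1 and C2.
missing=$(missing_tools git wget zstd uv)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All required tools found."
fi
```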

C1. Prepare codebase

Download this repository and its submodules:

git clone --recursive https://github.com/zqh-wz/RoMeo-AE.git
cd RoMeo-AE/

Download the calibration dataset for SmoothQuant:

wget https://hf-mirror.com/datasets/mit-han-lab/pile-val-backup/resolve/main/val.jsonl.zst
zstd -d --rm val.jsonl.zst

Then, apply the necessary patches to the submodules:

cd third_party/cutlass
git apply ../patches/cutlass.patch
cd ../../

cd third_party/fast-hadamard-transform
git apply ../patches/fast-hadamard-transform.patch
cd ../../

cd third_party/QuaRot
git apply ../patches/QuaRot.patch
cd ../../
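The three patch steps above follow the same pattern, so they can be wrapped in a loop that first verifies each patch applies cleanly. This is a minimal sketch under stated assumptions: `apply_patch_checked` is a hypothetical helper, and the loop assumes the patch files are named after their submodules, as in the commands above. It uses `git apply --check` to test a patch without modifying files, and `git apply --reverse --check` to detect a patch that was already applied.

```shell
# Hypothetical helper: apply a patch inside a directory only if it applies cleanly.
#   $1: directory to patch
#   $2: patch file (absolute path, or relative to $1)
apply_patch_checked() {
    (
        cd "$1" || exit 1
        if git apply --check "$2" 2>/dev/null; then
            git apply "$2" && echo "applied: $2"
        elif git apply --reverse --check "$2" 2>/dev/null; then
            # The reverse patch applies, so the patch is already in place.
            echo "already applied: $2"
        else
            echo "cannot apply: $2" >&2
            exit 1
        fi
    )
}

# Run from the repository root; skipped when the patches directory is absent.
if [ -d third_party/patches ]; then
    for name in cutlass fast-hadamard-transform QuaRot; do
        apply_patch_checked "third_party/$name" "../patches/$name.patch" || break
    done
fi
```

Because an already-applied patch is reported rather than treated as an error, the loop can be re-run safely after a partial failure.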

C2. Installation

We manage Python virtual environments with uv.

bash ./scripts/create_env.sh .venv
source ./scripts/activate_env.sh .venv
bash ./scripts/install.sh
deactivate

D. Reproduce Experimental Results

Because the experiments are long-running, we strongly recommend running them inside tmux so that they keep running even if your SSH connection is interrupted.

Using tmux for Long-running Experiments

# Create a new tmux session with a unique name
tmux new-session -s RoMeo-Reviewer-A

# Inside tmux, run your experiments as usual
bash ./scripts/reproduce.sh tab1 | tee tab1.log

# To detach from tmux session without killing the process:
# Press Ctrl+b, then d (detach)

# To reattach to the session later:
tmux attach -t RoMeo-Reviewer-A

# To list all tmux sessions:
tmux list-sessions

For more tmux commands and advanced usage, please refer to the tmux manual (man tmux) or online documentation.
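To plan a tmux session, it helps to know the total time for all experiments. The sketch below sums the estimated runtimes listed for each target in this README and defines `run_all`, a hypothetical helper (not part of the repository) that runs the targets one after another. It assumes all targets can run sequentially on the same machine, which may not hold if some experiments need a specific GPU.

```shell
# Targets accepted by scripts/reproduce.sh, as listed in this README.
targets="tab1 tab2 fig7 fig8 fig9 fig10"

# Estimated runtime in minutes for one target, copied from this README.
estimate_minutes() {
    case "$1" in
        tab1)      echo 90 ;;
        tab2)      echo 45 ;;
        fig7|fig8) echo 25 ;;
        fig9)      echo 5 ;;
        fig10)     echo 50 ;;
        *)         return 1 ;;
    esac
}

total=0
for t in $targets; do
    total=$((total + $(estimate_minutes "$t")))
done
echo "Estimated total runtime: ~${total} minutes"

# Hypothetical helper: run every target in order, logging each to <target>.log.
run_all() {
    for t in $targets; do
        echo "=== $t ==="
        bash ./scripts/reproduce.sh "$t" | tee "$t.log"
    done
}

# Uncomment to start the full run (inside tmux):
# run_all
```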

Table 1: Comparison of measured perplexity on WikiText2 dataset.

Estimated runtime: ~90 minutes.

bash ./scripts/reproduce.sh tab1 | tee tab1.log

The result summary will be generated at reproduce/tab1/perplexity_summary.log.

Table 2: Comparison of zero-shot accuracy on four downstream tasks.

Estimated runtime: ~45 minutes.

bash ./scripts/reproduce.sh tab2 | tee tab2.log

The result summary will be generated at reproduce/tab2/zero_shot_summary.log.

Note: Because the full zero-shot evaluation takes a long time, this script runs only a partial evaluation (a subset of the models and benchmarks) for quick verification.

To run the full evaluation, modify reproduce/tab2/run.sh and reproduce/tab2/run_acc_allbench.sh.

Figure 7: Normalized layer-level latency on Qwen3 models of different input batch sizes.

Estimated runtime: ~25 minutes.

bash ./scripts/reproduce.sh fig7 | tee fig7.log

The result figure will be generated at reproduce/fig7/layer_latency.pdf.

Figure 8: Normalized kernel performance on various matrix shapes.

Estimated runtime: ~25 minutes.

bash ./scripts/reproduce.sh fig8 | tee fig8.log

The result figure will be generated at reproduce/fig8/bench_kernels_results.pdf.

Figure 9: Layer-level latency breakdown for Qwen3-8B across different batch sizes with progressive optimizations.

Estimated runtime: ~5 minutes.

bash ./scripts/reproduce.sh fig9 | tee fig9.log

The result figure will be generated at reproduce/fig9/plot_breakdown.pdf.

Figure 10: Scaling the percentage of outliers.

Estimated runtime: ~50 minutes.

bash ./scripts/reproduce.sh fig10 | tee fig10.log

The result figure will be generated at reproduce/fig10/percent-ppl.pdf.
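After running the experiments, you can confirm that every output named in this README was produced. The sketch below is an optional check, assuming it is run from the repository root; `check_outputs` is a hypothetical helper, and the paths are the ones listed in the sections above.

```shell
# Output files named in this README, one per reproduced table or figure.
expected_outputs="reproduce/tab1/perplexity_summary.log
reproduce/tab2/zero_shot_summary.log
reproduce/fig7/layer_latency.pdf
reproduce/fig8/bench_kernels_results.pdf
reproduce/fig9/plot_breakdown.pdf
reproduce/fig10/percent-ppl.pdf"

# Hypothetical helper: report which outputs exist and are non-empty under
# the given root directory (default: current directory).
check_outputs() {
    root="${1:-.}"
    missing_count=0
    for f in $expected_outputs; do
        if [ -s "$root/$f" ]; then
            echo "ok       $f"
        else
            echo "MISSING  $f"
            missing_count=$((missing_count + 1))
        fi
    done
    # Succeed only when nothing is missing.
    [ "$missing_count" -eq 0 ]
}

check_outputs . || echo "Some outputs are missing; check the corresponding .log files."
```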

