If you cloned without submodules, run:
git submodule update --init --recursive
cd applied-ai/kernels/cuda/inference/hadamard_transform
python setup.py install

To install the conda environment needed to run the scripts in this repository, run:
conda env create -f quant.yaml
conda activate quant

To quantize a Llama model, you need three components: the checkpoint, the params.json file, and the tokenizer config. They should be in the original format and can be downloaded from Hugging Face here (for Llama-3-8B):
https://huggingface.co/meta-llama/Meta-Llama-3-8B/tree/main/original
All scripts are run with the command torchrun --nproc_per_node GPUS -m SCRIPT -- ARGS. The number of GPUs depends on the model and equals the number of checkpoint files.
Most scripts provide useful documentation in their help output, which can be obtained by running python -m SCRIPT_NAME --help.
For Llama-3-70B, please run the script parallel.llama3_script to convert the checkpoint into the expected format.
Computing Hessians for LDLQ. Run the script parallel.hessian. The script creates a folder of files, one per layer, each containing that layer's Hessian. If running the whole script makes the GPU run out of memory, you can pass the --L and --R arguments to specify the half-interval of Hessian indices to compute.
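As context for what these per-layer files contain: for LDLQ-style methods, the Hessian proxy of a linear layer is typically the second moment of that layer's calibration inputs, H = (1/N) Σ x xᵀ. A minimal NumPy sketch of that accumulation (the function name and batching are illustrative, not the repo's API — parallel.hessian does this per layer on GPU):

```python
import numpy as np

def layer_hessian(activation_batches):
    """Accumulate H = (1/N) * sum_x x x^T over calibration inputs to one layer.

    activation_batches: iterable of (num_tokens, d_in) arrays.
    Conceptual sketch only; not the repo's implementation.
    """
    H = None
    n = 0
    for X in activation_batches:
        X = np.asarray(X, dtype=np.float64)
        H = X.T @ X if H is None else H + X.T @ X  # running sum of outer products
        n += X.shape[0]
    return H / n  # average over all calibration tokens
```

Accumulating batch by batch keeps only one d_in x d_in matrix in memory, which is the same reason the script lets you restrict the range of Hessians computed per run.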
Running the dynamic programming for optimizing betas. The dynamic programming script will be added in the future. For now, you can run the perplexity script with the default beta values.
Computing QA-LDLQ statistics. The QA-LDLQ statistics are computed by the parallel.err script and saved to a YAML file.
Quantizing the model. The parallel.w_quant script takes the path to the Hessians and, optionally, the path to the QA-LDLQ statistics, and quantizes the model with a given
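For intuition on how the Hessians are used: in LDLQ, each weight column is rounded after adding feedback from the rounding error of previously quantized columns, with feedback coefficients taken from the LDL decomposition of H. A simplified round-to-nearest-integer sketch (conceptual only — not the repo's implementation, and QA-LDLQ additionally uses the saved statistics):

```python
import numpy as np

def ldlq(W, H):
    """Quantize rows of W (shape m x n) to integers under proxy loss
    tr((W - Q) H (W - Q)^T), where H (n x n) is the layer Hessian.

    Simplified sketch: integer grid, no scaling or clipping.
    """
    Lc = np.linalg.cholesky(H)   # H = Lc Lc^T, Lc lower triangular
    Lt = Lc / np.diag(Lc)        # unit lower triangular: H = Lt D Lt^T
    U = Lt.T                     # unit upper triangular feedback matrix
    m, n = W.shape
    Q = np.zeros_like(W, dtype=float)
    for k in range(n):
        # feedback from already-quantized columns j < k, weighted by U[j, k]
        feedback = (W[:, :k] - Q[:, :k]) @ U[:k, k]
        Q[:, k] = np.round(W[:, k] + feedback)
    return Q
```

With an identity Hessian the feedback vanishes and LDLQ reduces to plain nearest rounding; correlated inputs are where the error feedback pays off.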
Evaluation. The perplexity on wikitext2 can be evaluated with the parallel.ppl script.
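For reference, the metric itself is the exponential of the average per-token negative log-likelihood over the evaluation set. A minimal sketch (not the repo's parallel.ppl pipeline, which also handles model sharding and data loading):

```python
import numpy as np

def perplexity(per_token_nll):
    # per_token_nll: negative log-likelihoods (natural log), one per token
    return float(np.exp(np.mean(per_token_nll)))
```

A model that assigns probability 1/2 to every token would score a perplexity of exactly 2 under this definition.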
Zero-shot benchmark evaluation. We will add the code for zero-shot evaluations in the future.
Please let us know if you run into any issues when using this code, or would like to see any features implemented.