Simple and fast pure CUDA inference for 4-bit AWQ quantized models
Based on llama2.c
Make sure nvcc is in the PATH. Download and extract the Nvidia HPC SDK (for NCCL): https://developer.nvidia.com/hpc-sdk-downloads
git clone https://github.com/ankan-ban/llama_cu_awq
cd llama_cu_awq
mkdir build
cmake -DCMAKE_BUILD_TYPE=Debug -B build -DNCCL_ROOT_DIR=/home/scratch.gaugarg_sw/hpc_sdk/Linux_x86_64/2025/comm_libs/nccl
(adjust NCCL_ROOT_DIR to point at your NCCL installation)
cmake --build build
cd ..
The simplest way is to download a pre-converted model from Huggingface, but you can also run all the conversion steps yourself (see below).
You can use one of these models:
- 7B: Llama-2-7b-chat-hf-w4-g128-awq
- 13B: Llama-2-13b-chat-hf-w4-g128-awq
- 33B: Vicuna-33b-v1.3-w4-g128-awq
Here are the commands for the 7B model:
wget https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq/resolve/main/pytorch_model.bin
wget https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq/resolve/main/config.json
pip install numpy torch
python3 convert_awq_to_bin.py pytorch_model.bin output
./weight_packer config.json output llama2-7b-awq-q4.bin 1
./llama2_q4 llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs"
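The "q4" in the file name refers to 4-bit weight storage: eight quantized values fit in one 32-bit word. The sketch below illustrates that storage density only; the exact bit order used by weight_packer is an internal detail and may differ.

```python
# Minimal sketch: how eight 4-bit weights fit in one 32-bit word.
# The nibble order here (lowest bits first) is illustrative, not
# necessarily the layout weight_packer emits.

def pack_int4(vals):
    """Pack eight values in [0, 15] into one 32-bit integer."""
    assert len(vals) == 8 and all(0 <= v <= 15 for v in vals)
    word = 0
    for i, v in enumerate(vals):
        word |= v << (4 * i)
    return word

def unpack_int4(word):
    """Recover the eight 4-bit values from a packed word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

vals = [3, 15, 0, 7, 9, 1, 12, 4]
assert unpack_int4(pack_int4(vals)) == vals
```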
And here are the commands for the 13B model:
wget https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-13b-chat-hf-w4-g128-awq/resolve/main/config.json
wget https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-13b-chat-hf-w4-g128-awq/resolve/main/pytorch_model-00001-of-00003.bin
wget https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-13b-chat-hf-w4-g128-awq/resolve/main/pytorch_model-00002-of-00003.bin
wget https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-13b-chat-hf-w4-g128-awq/resolve/main/pytorch_model-00003-of-00003.bin
pip install numpy torch
python3 convert_awq_to_bin.py pytorch_model-00001-of-00003.bin output
python3 convert_awq_to_bin.py pytorch_model-00002-of-00003.bin output
python3 convert_awq_to_bin.py pytorch_model-00003-of-00003.bin output
./weight_packer config.json output llama2-13b-awq-q4.bin 1
./llama2_q4 llama2-13b-awq-q4.bin -n 256 -i "write an essay about GPUs"
And for the 33B model:
wget https://huggingface.co/abhinavkulkarni/lmsys-vicuna-33b-v1.3-w4-g128-awq/resolve/main/config.json
wget https://huggingface.co/abhinavkulkarni/lmsys-vicuna-33b-v1.3-w4-g128-awq/resolve/main/pytorch_model-00001-of-00006.bin
wget https://huggingface.co/abhinavkulkarni/lmsys-vicuna-33b-v1.3-w4-g128-awq/resolve/main/pytorch_model-00002-of-00006.bin
wget https://huggingface.co/abhinavkulkarni/lmsys-vicuna-33b-v1.3-w4-g128-awq/resolve/main/pytorch_model-00003-of-00006.bin
wget https://huggingface.co/abhinavkulkarni/lmsys-vicuna-33b-v1.3-w4-g128-awq/resolve/main/pytorch_model-00004-of-00006.bin
wget https://huggingface.co/abhinavkulkarni/lmsys-vicuna-33b-v1.3-w4-g128-awq/resolve/main/pytorch_model-00005-of-00006.bin
wget https://huggingface.co/abhinavkulkarni/lmsys-vicuna-33b-v1.3-w4-g128-awq/resolve/main/pytorch_model-00006-of-00006.bin
pip install numpy torch
python3 convert_awq_to_bin.py pytorch_model-00001-of-00006.bin output
python3 convert_awq_to_bin.py pytorch_model-00002-of-00006.bin output
python3 convert_awq_to_bin.py pytorch_model-00003-of-00006.bin output
python3 convert_awq_to_bin.py pytorch_model-00004-of-00006.bin output
python3 convert_awq_to_bin.py pytorch_model-00005-of-00006.bin output
python3 convert_awq_to_bin.py pytorch_model-00006-of-00006.bin output
./weight_packer config.json output vicuna-33b-awq-q4.bin 1
./llama2_q4 vicuna-33b-awq-q4.bin -n 256 -i "write an essay about GPUs"
Note: the last argument of weight_packer indicates whether the AWQ weights use the old packing format (which needs repacking). The latest AWQ repo on GitHub generates weights in the new packing format, but the weights at https://huggingface.co/abhinavkulkarni/ still use the old format, so we set the parameter to 1 above.
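The repacking flag exists because older llm-awq checkpoints store the eight nibbles of each 32-bit word in an interleaved order rather than sequentially. The sketch below is illustrative only: the order [0, 2, 4, 6, 1, 3, 5, 7] is an assumption about the old layout (commonly reported for llm-awq), not taken from this repo's source; the real layouts live in weight_packer.cpp.

```python
# Hypothetical illustration of nibble reordering between packing formats.
# ORDER is an assumed interleave for the "old" layout and may not match
# the actual formats handled by weight_packer.cpp.

ORDER = [0, 2, 4, 6, 1, 3, 5, 7]  # assumed old-format nibble positions

def to_nibbles(word):
    return [(word >> (4 * i)) & 0xF for i in range(8)]

def from_nibbles(nibs):
    word = 0
    for i, v in enumerate(nibs):
        word |= (v & 0xF) << (4 * i)
    return word

def repack(word):
    """Rearrange nibbles from the assumed old order to sequential order."""
    nibs = to_nibbles(word)
    return from_nibbles([nibs[ORDER[i]] for i in range(8)])
```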
- First generate AWQ int-4 quantized weights following the steps in llm-awq
- Convert the AWQ weights into individual weight binary files using convert_awq_to_bin.py
- Convert/repack the weight binary files using the weight_packer.cpp utility.
- Run the inference (llama2_q4.cu) pointing to the final weight file.
Note: the AWQ scripts don't run on Windows. Use Linux or WSL.
Example:
python -m awq.entry --model_path /path-to-model/Llama-2-7b-chat-hf --w_bit 4 --q_group_size 128 --run_awq --dump_awq awq_cache/llama2-7b-chat-metadata.pt
python -m awq.entry --model_path /path-to-model/Llama-2-7b-chat-hf --w_bit 4 --q_group_size 128 --load_awq awq_cache/llama2-7b-chat-metadata.pt --q_backend real --dump_quant awq_weights/llama2-7b-awq.pt
pip install numpy torch
python3 convert_awq_to_bin.py awq_weights/llama2-7b-awq.pt output
./weight_packer config.json output llama2-7b-awq-q4.bin 0
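The w4-g128 setting above means each group of 128 weights shares quantization parameters; at inference time the kernel reconstructs a weight roughly as (code - zero) * scale. The sketch below is a simplified view under that assumption (real checkpoints store scales and zeros per output channel as well):

```python
# Sketch of group-wise int4 dequantization as used by w4-g128 schemes:
# each group of 128 consecutive weights shares one scale and one
# zero-point. Simplified relative to the actual checkpoint layout.

GROUP = 128

def dequantize(q, scales, zeros, group=GROUP):
    """q: list of int4 codes (0..15); scales/zeros: one entry per group."""
    out = []
    for i, code in enumerate(q):
        g = i // group
        out.append((code - zeros[g]) * scales[g])
    return out

q = [7] * 128 + [2] * 128      # two groups of quantized codes
scales = [0.5, 0.25]
zeros = [8, 1]
w = dequantize(q, scales, zeros)   # group 0 -> -0.5, group 1 -> 0.25
```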
We get ~200 tokens per second on an RTX 4090 for 7B-parameter models:
llama2_q4.exe llama2-7b-awq-q4.bin -n 256 -i "write an essay about GPUs"
Model params:-
dim: 4096
hidden_dim: 11008
n_heads: 32
n_kv_heads: 32
n_layers: 32
seq_len: 2048
vocab_size: 32000
loaded weights
<s>
write an essay about GPUs
Introduction:
GPU (Graphics Processing Unit) is a specialized electronic circuit designed to accelerate the manipulation of graphical data. It is a key component of a computer's hardware that is used to improve the performance of graphics-intensive applications such as video games, computer-aided design (CAD) software, and scientific simulations. In this essay, we will explore the history of GPUs, their architecture, and their impact on the computer industry.
History of GPUs:
The concept of a GPU can be traced back to the 1960s when computer graphics were still in their infancy. At that time, computer graphics were primarily used for scientific visualization and were not yet a major component of mainstream computing. However, as computer graphics became more popular in the 1980s and 1990s, the need for specialized hardware to handle the increasingly complex graphics tasks became apparent. In the early 1990s, the first GPUs were developed, which were designed to offload the computationally intensive graphics tasks from the CPU (Central Processing Unit) to the GPU.
Architecture
achieved tok/s: 200.787402. Tokens: 255, seconds: 1.27
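The config printed above lets you sanity-check the model size with back-of-the-envelope arithmetic. This sketch ignores RMSNorm weights and the per-group scales/zeros that AWQ storage adds on top of the packed 4-bit weights, so the size estimate is a lower bound:

```python
# Rough parameter count from the config printed above (Llama-2 7B).
# Omits norm weights and AWQ's per-group scale/zero overhead.

dim, hidden_dim, n_layers, vocab = 4096, 11008, 32, 32000

attn = 4 * dim * dim         # q, k, v, o projections (n_kv_heads == n_heads)
ffn = 3 * dim * hidden_dim   # gate, up, down projections
embed = 2 * vocab * dim      # token embedding + output head

params = n_layers * (attn + ffn) + embed
print(f"{params / 1e9:.2f}B params, ~{params * 0.5 / 2**30:.2f} GiB at 4 bits")
```

The ~6.7B total explains why the 4-bit model file comfortably fits on a single consumer GPU.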
The application supports two multi-GPU split strategies: column-wise and hybrid.
- The column-wise split strategy adds a total of 4 AllGather operations in each decoder layer: before and after O-proj, and before and after FFN-down. This strategy splits all linear layers along columns (the N dimension).
- The hybrid split strategy adds a total of 2 AllReduce operations in each decoder layer: after O-proj and after FFN-down. This strategy splits qkv/FFN-up-gate along columns and O-proj/FFN-down along rows (the K dimension).
The strategy can be controlled with the -x command line option:
- -x "none": run on a single GPU
- -x "column": column-wise split strategy
- -x "hybrid": hybrid split strategy
Larger models show the best scaling. In my experiments, the hybrid strategy performs better than the column-wise split.
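Why both strategies give the same answer follows from matmul algebra: splitting W by columns partitions the output (gathered back with AllGather), while splitting W by rows produces partial sums (combined with AllReduce). A toy pure-Python demonstration, standing in for the per-GPU GEMMs and NCCL collectives:

```python
# Toy demonstration of the two split strategies for y = x @ W.
# Column split: each "GPU" holds a slice of W's columns; outputs are
#   concatenated (the AllGather in the column-wise strategy).
# Row split: each "GPU" holds a slice of W's rows (K dimension) plus the
#   matching slice of x; partial outputs are summed (the AllReduce).

def matmul(x, W):                       # x: length-K vector, W: K x N
    K, N = len(W), len(W[0])
    return [sum(x[k] * W[k][n] for k in range(K)) for n in range(N)]

x = [1.0, 2.0, 3.0, 4.0]
W = [[1, 2], [3, 4], [5, 6], [7, 8]]

full = matmul(x, W)

# Column split across 2 "GPUs": left/right column halves, concat outputs.
Wl = [[row[0]] for row in W]
Wr = [[row[1]] for row in W]
col_split = matmul(x, Wl) + matmul(x, Wr)      # AllGather == concatenation

# Row split: top/bottom halves of W with matching halves of x, sum outputs.
top = matmul(x[:2], W[:2])
bot = matmul(x[2:], W[2:])
row_split = [a + b for a, b in zip(top, bot)]  # AllReduce == elementwise sum

assert col_split == full and row_split == full
```

This also shows why the hybrid strategy needs only half the collectives: a column-split layer feeding a row-split layer can skip the gather in between.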
MIT