
💡 Samsung Computer Engineering Challenge 💡


Llama Ranch: High Throughput LLaMA-30B Inference Framework

Due to the nature of language models, input sequences of various lengths may arrive at inference time. When multiple sequences are batched to improve inference throughput, the batch is processed at a single length determined by its longest sequence, and the remaining sequences are padded up to that length.

Depending on the combination of sequences that make up a batch, much of this padding is unnecessary work. When the input sequence lengths of the dataset are arranged in descending order and inference is performed sequentially, the cost of padding for batch processing increases monotonically. In addition, the runtime gain obtained by increasing the batch size diminishes as the batch grows larger.

Given these facts, the goal is to determine batch sizes that minimize the computational overhead of zero-padding.
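As a rough illustration of why this matters, the short sketch below counts padded tokens for a fixed batch size with and without length sorting. The sequence lengths and batch size are made-up values for the example; only the counting logic follows the reasoning above.

# Hypothetical sequence lengths; in practice these come from the tokenized dataset.
lengths = [14, 512, 37, 498, 60, 505, 25, 490]
batch_size = 4

def padded_tokens(lens, bs):
    # Each batch is padded to its longest sequence, so the wasted work is
    # the gap between that maximum and every other sequence in the batch.
    total = 0
    for i in range(0, len(lens), bs):
        batch = lens[i:i + bs]
        total += sum(max(batch) - l for l in batch)
    return total

print(padded_tokens(lengths, batch_size))                        # unsorted: 1927 padded tokens
print(padded_tokens(sorted(lengths, reverse=True), batch_size))  # length-sorted: 147 padded tokens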

Features

  • Organizing datasets based on sequence length to minimize the need for excessive padding
  • Determining batch sizes that fully utilize available GPU memory and maximize throughput (a sketch combining both ideas follows below)
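The following is a minimal sketch of how these two ideas could fit together, assuming a hypothetical linear per-token memory cost; it illustrates the approach rather than the framework's actual implementation.

# Minimal sketch: sort by length, then grow each batch greedily while a
# hypothetical memory model says it still fits within the GPU memory budget.
def plan_batches(lengths, mem_budget, mem_per_token, max_batch=64):
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    batches, cur = [], []
    for idx in order:
        # In descending order, the first element of the current batch is its longest.
        longest = lengths[cur[0]] if cur else lengths[idx]
        # Adding a sequence pads the whole batch to the longest member.
        projected = (len(cur) + 1) * longest * mem_per_token
        if cur and (projected > mem_budget or len(cur) >= max_batch):
            batches.append(cur)
            cur = []
        cur.append(idx)
    if cur:
        batches.append(cur)
    return batches  # lists of dataset indices, each forming one batch

Because the sequences are visited in descending length order, long sequences naturally fall into small batches while short sequences pack into large ones, which matches the diminishing return of large batch sizes noted above.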

Quick Start

Building the image

With the given Dockerfile, build your testbed image. This image is based on junyeolyu/torch:2.0.1, so pulling the base image may take a while.

$ docker build -t cechallenge .

Running the testbed

Assuming the model repository is available at /path/to/model, use the following command to run the container for evaluation.

$ docker run --rm --gpus all --ipc=host --shm-size=1g --ulimit memlock=-1 --ulimit stack=134217728 -v /path/to/model:/model -it cechallenge bash

The entry point is /workspace.

We need to initialize the submodules.

cd /workspace/LlamaRanch
git submodule update --init

FasterTransformer Evaluation

First, we need to convert the Hugging Face model into the FasterTransformer format.

cd /workspace/LlamaRanch/src/FasterTransformer
sudo mkdir -p models && sudo chmod -R 777 ./*
python ./examples/cpp/llama/huggingface_llama_convert.py -saved_dir=./models/llama -in_file=$MODEL_PATH -infer_gpu_num=4 -trained_gpu_num=4 -weight_data_type=fp16 -model_name=llama

We need to build the library before evaluation. -DSM should be set to 70 for the Tesla V100 (compute capability 7.0).

cd /workspace/LlamaRanch/src/FasterTransformer
mkdir build && cd build
git submodule init && git submodule update
# These packages are already installed during the image building
pip3 install fire jax jaxlib transformers datasets sentencepiece numpysocket

CUDAFLAGS="-include stdio.h" cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON -D PYTHON_PATH=/usr/bin/python3 ..
make -j$(nproc)

Then, you can run the evaluation script. Before running it, change the checkpoint path, tokenizer path, and library path in the script.

cd /workspace/LlamaRanch/src/FasterTransformer/examples/pytorch/llama
FMHA_ENABLE=ON ./exec_evaluation.sh

For more details, see FasterTransformer: Setup.

FasterTransformer Evaluation (1st round)

We need to build the library before evaluation. -DSM should be set to 70 for the Tesla V100 (compute capability 7.0).

cd /workspace/LlamaRanch/src/FasterTransformer
git switch 1st-round
mkdir -p build && cd build
git submodule init && git submodule update
# These packages are already installed during the image building
pip3 install fire jax jaxlib transformers datasets sentencepiece numpysocket

CUDAFLAGS="-include stdio.h" cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON -D PYTHON_PATH=/usr/bin/python3 ..
make -j$(nproc)

Then, you can run the evaluation script.

cd /workspace/LlamaRanch/src/FasterTransformer/examples/pytorch/llama
mpirun -n 4 --allow-run-as-root python llama_example.py --output_len 1 --pipeline_para_size 4 --ckpt_path /model/$MODEL_PATH --tokenizer_path /model/$HF_TOKENIZER_PATH --lib_path /workspace/LlamaRanch/src/FasterTransformer/build/lib/libth_transformer.so

Meta Evaluation (1st Round)

The provided example.py can be run on a single or multiple GPUs with torchrun and will output completions for two pre-defined prompts.

In this repository, a 4-GPU inference setting is considered.

cd /workspace/LlamaRanch/src/Meta
# Install this repository, if needed
pip install -e .
torchrun --nproc_per_node 4 example.py --ckpt_dir /model/$TARGET_FOLDER --tokenizer_path /model/$TARGET_FOLDER/tokenizer.model
  • example.py will produce Meta 4_bins
  • example_opt.py will produce Meta Greedy

To test the vanilla version, switch from the main branch to the vanilla branch.

cd /workspace/LlamaRanch/src/Meta
git switch vanilla
pip install -e .
torchrun --nproc_per_node 4 example.py --ckpt_dir /model/$TARGET_FOLDER --tokenizer_path /model/$TARGET_FOLDER/tokenizer.model
  • In this branch, example.py generates Meta Vanilla

About

Samsung Computer Engineering Challenge 2023
