
💡 Samsung Computer Engineering Challenge 💡


Llama Ranch: High Throughput LLaMA-30B Inference Framework

Due to the nature of language models, input sequences of various lengths may arrive at inference time. When multiple sequences are batched to improve inference throughput, the batch is processed at a single length determined by its longest sequence, and the remaining sequences are padded up to that length.

Depending on the combination of sequences that make up a batch, much of this padding is unnecessary work. When the input sequence lengths of the dataset are arranged in descending order and inference is performed sequentially, the cost of padding for batch processing increases monotonically. In addition, the runtime gain obtained by increasing the batch size diminishes as the batch grows larger.

Given these facts, the goal is to determine batch sizes that minimize the computational overhead of zero-padding.
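As a rough illustration of why this matters, the short sketch below counts padded tokens for a fixed batch size with and without length sorting. The sequence lengths and batch size are made-up values for the example; only the counting logic follows the reasoning above.

# Hypothetical sequence lengths; in practice these come from the tokenized dataset.
lengths = [14, 512, 37, 498, 60, 505, 25, 490]
batch_size = 4

def padded_tokens(lens, bs):
    # Each batch is padded to its longest sequence, so the wasted work is
    # the gap between that maximum and every other sequence in the batch.
    total = 0
    for i in range(0, len(lens), bs):
        batch = lens[i:i + bs]
        total += sum(max(batch) - l for l in batch)
    return total

print(padded_tokens(lengths, batch_size))                        # unsorted: 1927 padded tokens
print(padded_tokens(sorted(lengths, reverse=True), batch_size))  # length-sorted: 147 padded tokens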

Features

  • Organizing datasets based on sequence length to minimize the need for excessive padding
  • Determining batch sizes that fully utilize available GPU memory and maximize throughput (a sketch combining both ideas follows below)
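The following is a minimal sketch of how these two ideas could fit together, assuming a hypothetical linear per-token memory cost; it illustrates the approach rather than the framework's actual implementation.

# Minimal sketch: sort by length, then grow each batch greedily while a
# hypothetical memory model says it still fits within the GPU memory budget.
def plan_batches(lengths, mem_budget, mem_per_token, max_batch=64):
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    batches, cur = [], []
    for idx in order:
        # In descending order, the first element of the current batch is its longest.
        longest = lengths[cur[0]] if cur else lengths[idx]
        # Adding a sequence pads the whole batch to the longest member.
        projected = (len(cur) + 1) * longest * mem_per_token
        if cur and (projected > mem_budget or len(cur) >= max_batch):
            batches.append(cur)
            cur = []
        cur.append(idx)
    if cur:
        batches.append(cur)
    return batches  # lists of dataset indices, each forming one batch

Because the sequences are visited in descending length order, long sequences naturally fall into small batches while short sequences pack into large ones, which matches the diminishing return of large batch sizes noted above.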

Quick Start

Building the image

With the given Dockerfile, build your testbed image. This image is based on junyeolyu/torch:2.0.1, so pulling the base image may take a while.

$ docker build -t cechallenge .

Running the testbed

Assuming the model repository is available at /path/to/model, use the following command to run the container for evaluation.

$ docker run --rm --gpus all --ipc=host --shm-size=1g --ulimit memlock=-1 --ulimit stack=134217728 -v /path/to/model:/model -it cechallenge bash

The entry point is /workspace.

We need to initialize the submodules.

cd /workspace/LlamaRanch
git submodule update --init

FasterTransformer Evaluation

First, we need to convert the Hugging Face model into the FasterTransformer format.

cd /workspace/LlamaRanch/src/FasterTransformer
sudo mkdir -p models && sudo chmod -R 777 ./*
python ./examples/cpp/llama/huggingface_llama_convert.py -saved_dir=./models/llama -in_file=$MODEL_PATH -infer_gpu_num=4 -trained_gpu_num=4 -weight_data_type=fp16 -model_name=llama

We need to build the library before evaluation. -DSM should be set to 70 for the Tesla V100 (compute capability 7.0).

cd /workspace/LlamaRanch/src/FasterTransformer
mkdir build && cd build
git submodule init && git submodule update
# These packages are already installed during the image building
pip3 install fire jax jaxlib transformers datasets sentencepiece numpysocket

CUDAFLAGS="-include stdio.h" cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON -D PYTHON_PATH=/usr/bin/python3 ..
make -j$(nproc)

Then, you can run the evaluation script. Before running it, change the checkpoint path, tokenizer path, and library path in the script.

cd /workspace/LlamaRanch/src/FasterTransformer/examples/pytorch/llama
FMHA_ENABLE=ON ./exec_evaluation.sh

For more details, see FasterTransformer: Setup.

FasterTransformer Evaluation (1st round)

We need to build the library before evaluation. -DSM should be set to 70 for the Tesla V100 (compute capability 7.0).

cd /workspace/LlamaRanch/src/FasterTransformer
git switch 1st-round
mkdir -p build && cd build
git submodule init && git submodule update
# These packages are already installed during the image building
pip3 install fire jax jaxlib transformers datasets sentencepiece numpysocket

CUDAFLAGS="-include stdio.h" cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON -D PYTHON_PATH=/usr/bin/python3 ..
make -j$(nproc)

Then, you can run the evaluation script.

cd /workspace/LlamaRanch/src/FasterTransformer/examples/pytorch/llama
mpirun -n 4 --allow-run-as-root python llama_example.py --output_len 1 --pipeline_para_size 4 --ckpt_path /model/$MODEL_PATH --tokenizer_path /model/$HF_TOKENIZER_PATH --lib_path /workspace/LlamaRanch/src/FasterTransformer/build/lib/libth_transformer.so

Meta Evaluation (1st Round)

The provided example.py can be run on a single or multiple GPUs with torchrun and will output completions for two pre-defined prompts.

In this repository, a 4-GPU inference setting is considered.

cd /workspace/LlamaRanch/src/Meta
# Install this repository, if needed
pip install -e .
torchrun --nproc_per_node 4 example.py --ckpt_dir /model/$TARGET_FOLDER --tokenizer_path /model/$TARGET_FOLDER/tokenizer.model
  • example.py will produce Meta 4_bins
  • example_opt.py will produce Meta Greedy

To test the vanilla version, switch from the main branch to the vanilla branch.

cd /workspace/LlamaRanch/src/Meta
git switch vanilla
pip install -e .
torchrun --nproc_per_node 4 example.py --ckpt_dir /model/$TARGET_FOLDER --tokenizer_path /model/$TARGET_FOLDER/tokenizer.model
  • In this branch, example.py generates Meta Vanilla

About

Samsung Computer Engineering Challenge 2023
