I created two COS VM nodes using "gcloud alpha compute" with a shared network.
I built a Docker image (which contains tritonserver 25.01 and tensorrtllm_backend 0.17.0).
Then on each node (node0 and node1, each with 8 GPUs), I pulled the Docker image, and inside the container on node0 I ran the Triton preparation:
export ROOT_DIR=/opt/tritonserver
cd $ROOT_DIR/tensorrtllm_backend/tensorrt_llm/examples/llama/
python3 convert_checkpoint.py --model_dir $ROOT_DIR/Llama-2-7b-hf/ --output_dir $ROOT_DIR/Llama-2-7b-hf/Llama7b_cp_fp16_tp4 --dtype float16 --tp_size 8 --pp_size 2
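(For reference, with --tp_size 8 and --pp_size 2 the converted checkpoint should correspond to a world size of 16; a rough sanity check, assuming the usual convert_checkpoint output layout of one rank*.safetensors file per rank plus a config.json:)
ls $ROOT_DIR/Llama-2-7b-hf/Llama7b_cp_fp16_tp4
# roughly expected: config.json rank0.safetensors ... rank15.safetensors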
mkdir $ROOT_DIR/engines
trtllm-build --checkpoint_dir $ROOT_DIR/Llama-2-7b-hf/Llama7b_cp_fp16_tp4 --output_dir $ROOT_DIR/engines/080/llama/7B/8-gpu/ --gpt_attention_plugin float16 --context_fmha enable --gemm_plugin float16 --max_batch_size 64 --max_input_len 1024 --max_seq_len 2048 --max_num_tokens 4096 --paged_kv_cache enable --workers 8
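(Similarly, the built engine directory should hold one engine per rank; a rough check, assuming trtllm-build writes rank*.engine files plus a config.json:)
ls $ROOT_DIR/engines/080/llama/7B/8-gpu/
# roughly expected: config.json rank0.engine ... rank15.engine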
mkdir $ROOT_DIR/tensorrtllm_backend/repo/
cd $ROOT_DIR/tensorrtllm_backend
cp -R $ROOT_DIR/tensorrtllm_backend/all_models/inflight_batcher_llm $ROOT_DIR/tensorrtllm_backend/repo
python3 tools/fill_template.py --in_place \
repo/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_type:llama,tokenizer_dir:$ROOT_DIR/Llama-2-7b-hf,preprocessing_instance_count:8,triton_max_batch_size:64
python3 tools/fill_template.py --in_place \
repo/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_type:llama,tokenizer_dir:$ROOT_DIR/Llama-2-7b-hf,postprocessing_instance_count:8,triton_max_batch_size:64
python3 tools/fill_template.py --in_place repo/inflight_batcher_llm/tensorrt_llm/config.pbtxt decoupled_mode:false,max_tokens_in_paged_kv_cache:409600,batch_scheduler_policy:max_utilization,kv_cache_free_gpu_mem_fraction:0.8,max_num_sequences:64,triton_max_batch_size:64,batching_strategy:inflight_fused_batching,engine_dir:$ROOT_DIR/engines/080/llama/7B/8-gpu/,max_beam_width:1,exclude_input_in_output:true,enable_kv_cache_reuse:False,max_queue_delay_microseconds:1000,triton_backend:tensorrtllm,encoder_input_features_data_type:TYPE_FP16,logits_datatype:TYPE_FP32
python3 tools/fill_template.py -i repo/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:false,bls_instance_count:8,logits_datatype:TYPE_FP32
python3 tools/fill_template.py -i repo/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64,logits_datatype:TYPE_FP32
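(To confirm the substitutions actually landed, a quick check, assuming fill_template rewrites the ${...} placeholders in place:)
grep -n 'engines/080' repo/inflight_batcher_llm/tensorrt_llm/config.pbtxt   # engine_dir substitution present
grep -n 'max_batch_size' repo/inflight_batcher_llm/tensorrt_llm/config.pbtxt   # should show 64, not a ${...} placeholder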
Then on node0 (inside the Docker container), I ran:
mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 --allow-run-as-root -np 16 --hostfile /hostfiles --map-by ppr:1:node -x CUDA_VISIBLE_DEVICES -x WORLD_SIZE=16 /opt/tritonserver/bin/tritonserver --model-repository=$ROOT_DIR/tensorrtllm_backend/repo/inflight_batcher_llm --grpc-port=8001 --http-port=8000 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix$OMPI_COMM_WORLD_RANK\_ --model-control-mode=explicit --load-model=tensorrt_llm 1> triton_output.log 2> triton_error.log
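(For reference, the file passed via --hostfile /hostfiles follows the standard OpenMPI hostfile format; a minimal sketch, with placeholder hostnames rather than my actual node names:)
# /hostfiles
node0 slots=8
node1 slots=8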
But the launch produces no output, and the GPUs on both nodes are not used (nothing shows up in nvidia-smi).
I think I might be missing some environment setup, but I have no clue what is needed and I'm quite new to Triton. Any suggestions are appreciated! Thanks in advance.
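(A basic sanity check I can think of, sketched with the same hostfile and MPI flags as above, would be to launch plain hostname over mpirun and confirm both nodes respond before involving tritonserver:)
mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 --allow-run-as-root -np 2 --hostfile /hostfiles --map-by ppr:1:node hostname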