Sourced from the Qwen2.5-Coder repository, with updated dependencies for better reproducibility.
This folder contains the code and scripts to evaluate the Qwen2.5-Coder series models on the EvalPlus benchmark, which includes the HumanEval(+) and MBPP(+) datasets. These datasets are designed to test code generation capabilities under varied conditions.
Please refer to EvalPlus for detailed setup instructions. Install the required packages using:
pip install evalplus --upgrade
pip install -r requirements.txt
We use 8x A100 GPUs for this benchmark. The following script runs inference and evaluation:
bash test.sh {path_to_your_local_model_checkpoint} {tensor_parallel_size} {output_dir}
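As an illustration, the invocation above could be filled in as follows. This is a minimal sketch: the checkpoint path, the tensor-parallel size of 4, and the output directory name are all placeholder assumptions, not values taken from this repository. Choose a tensor-parallel size that matches the GPUs you have available and is compatible with your model.

```shell
# Hypothetical values for illustration only; substitute your own.
MODEL_DIR="/path/to/local/checkpoint"   # placeholder path to a local model checkpoint
TP_SIZE=4                               # assumed tensor-parallel size (should not exceed visible GPUs)
OUTPUT_DIR="./evalplus_results"         # assumed output directory name

# Create the output directory up front; harmless if the script also creates it.
mkdir -p "$OUTPUT_DIR"

# Run the evaluation only if test.sh is present in the current folder.
if [ -f test.sh ]; then
  bash test.sh "$MODEL_DIR" "$TP_SIZE" "$OUTPUT_DIR"
else
  echo "test.sh not found; run this from the folder that contains it" >&2
fi
```

Quoting the variables keeps the call safe if a path contains spaces.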