This repository hosts the official code and data artifact for the paper "NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness". The work is a comprehensive evaluation of code language models on real-world code editing scenarios and metrics beyond functional correctness.
- Datasets: Runtime Efficiency, Maintainability, Latency, Resource Utilization, Security, HumanEvalClassify
- Generations: Generated code for the NoFunEval examples under the various model configurations reported in the paper. The graphs and tables in the paper can be reproduced by running the evaluation scripts on these generations; the generation scripts can also be used to produce outputs from a new model.
- Evaluation scripts: Scripts that take NoFunEval examples as input and produce Score@k numbers for the metrics reported in the paper: DiffBleu, Average SpeedUp, CodeQL, and CodeQL-DiffBleu.
- Runtime Efficiency: Improving algorithmic time complexity of code for faster run-time.
- Maintainability: Improving code readability and style, and following the best programming practices.
- Latency: Optimizing code for faster response times in Android applications.
- Resource Utilization: Optimizing code for lower resource utilization on edge devices.
- Security: Fixing specific security vulnerabilities in the input code.
- HumanEvalClassify: Distinguishing buggy from correct snippets of code.
Create a virtual environment.
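A minimal sketch, assuming Python 3's built-in `venv` module (any environment manager works):

```bash
python3 -m venv .venv
source .venv/bin/activate
```

Then run the setup script: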
```bash
bash setup.sh
```
To generate edits for a NoFunEdit subset:

```bash
python3 src/nofunedit_generation.py \
  --data_subset <subset from nofunedit, e.g. latency> \
  --model_path <model name from HF, e.g. WizardLM/WizardCoder-15B-V1.0> \
  --temperature <temperature for model generation, e.g. 0> \
  --max_new_tokens <maximum number of new tokens to generate, e.g. 5192> \
  --prompt <type of prompt to use from our dataset, e.g. base_prompt> \
  --num_samples <number of samples to generate, e.g. 1> \
  --precision <floating-point format, e.g. fp16> \
  --batch_size <number of examples to send to the LLM engine at once, e.g. 1>
```
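For example, a greedy single-sample run on the latency subset (a sketch using the example values above; adjust to your setup):

```bash
python3 src/nofunedit_generation.py \
  --data_subset latency \
  --model_path WizardLM/WizardCoder-15B-V1.0 \
  --temperature 0 \
  --max_new_tokens 5192 \
  --prompt base_prompt \
  --num_samples 1 \
  --precision fp16 \
  --batch_size 1
```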
To generate classification outputs:

```bash
python3 src/classification_generation.py \
  --data_subset <subset from non_func or humanevalclassify, e.g. latency> \
  --model <model name from HF, e.g. WizardLM/WizardCoder-15B-V1.0> \
  --temperature <temperature for model generation, e.g. 0> \
  --max_new_tokens <maximum number of new tokens to generate, e.g. 5192> \
  --prompt <type of prompt to use from our dataset, e.g. base_prompt> \
  --precision <floating-point format, e.g. fp16> \
  --batch_size <number of examples to send to the LLM engine at once, e.g. 1>
```
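For example, a greedy run on HumanEvalClassify (the parameter table below notes that classification used a 4-token budget):

```bash
python3 src/classification_generation.py \
  --data_subset humanevalclassify \
  --model WizardLM/WizardCoder-15B-V1.0 \
  --temperature 0 \
  --max_new_tokens 4 \
  --prompt base_prompt \
  --precision fp16 \
  --batch_size 1
```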
To evaluate generations:

```bash
python3 src/evaluation.py \
  --data_subset <subset from nofunedit, e.g. latency> \
  --model_path <model name from HF, e.g. WizardLM/WizardCoder-15B-V1.0> \
  --prompt <type of prompt to use from our dataset, e.g. base_prompt> \
  --num_samples <number of samples generated, e.g. 1> \
  --score_k <K values for score@k, e.g. 1,5,10,20> \
  --metric <eval metric to be used, e.g. diffbleu>
```
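For example, scoring the greedy latency generations above with DiffBleu (`score_k` must not exceed `num_samples`):

```bash
python3 src/evaluation.py \
  --data_subset latency \
  --model_path WizardLM/WizardCoder-15B-V1.0 \
  --prompt base_prompt \
  --num_samples 1 \
  --score_k 1 \
  --metric diffbleu
```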
An example evaluation script is also provided:

```bash
bash evaluation_example_script.sh
```
Parameter reference for the commands above:

| Parameter | Description |
|---|---|
| `data_subset` | The subset of data to use. Options: `latency`, `resource_util`, `maintainability`, `security`, `runtime_efficiency` for NoFunEdit; additionally `humanevalclassify` for classification. |
| `model_path` | The path of the model on HF. Example: `WizardLM/WizardCoder-15B-V1.0`. |
| `prompt` | Prompt to use. Options: `base_prompt`, `one_shot`, `chain_of_thought`, `coding_concepts`. |
| `num_samples` | Number of samples to generate. Example: `1` (we used 1 for greedy decoding and 20 at higher temperature). [NoFunEdit generation only] |
| `max_new_tokens` | Budget for new-token generation. Example: `1200` (NoFunEdit: we used 1200 for `runtime_efficiency` and `security` for all prompts except CoT, where we used 1500; for the other subsets we used 5192 or the maximum possible limit. Classification: we used 4 for all generations). |
| `temperature` | Temperature for model generation. Example: `0` (we used 0 for greedy decoding and 0.8 when drawing more samples). |
| `score_k` | Comma-separated K values for Score@K. Example: `1,5,10,20` (no value should exceed `num_samples`). [Eval only] |
| `metric` | Metric to use for evaluation. Options: `diffbleu`, `codeql`, `codeql-diffbleu` (run after the first two metrics have been computed), `classification`, `runtime`. [Eval only] |
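As noted in the table, `codeql-diffbleu` is computed after `diffbleu` and `codeql`. A sketch of that sequence on the security subset, reusing the example values above:

```bash
# Run the two base metrics first, then the combined one.
for metric in diffbleu codeql codeql-diffbleu; do
  python3 src/evaluation.py \
    --data_subset security \
    --model_path WizardLM/WizardCoder-15B-V1.0 \
    --prompt base_prompt \
    --num_samples 1 \
    --score_k 1 \
    --metric "$metric"
done
```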
Additional engine parameters for the generation scripts:

| Parameter | Description |
|---|---|
| `batch_size` | Batch size. Default: `1`. |
| `precision` | Floating-point format. Default: `fp16`. |
| `tensor_parallel_size` | Degree of tensor parallelism (number of GPUs to shard the model across). Default: `1`. |
| `swap_space` | CPU memory per GPU (GiB) to use as swap space. Default: `4`. |
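These flags are appended to a generation command as needed; for example, a sketch that shards the model across two GPUs (assuming two GPUs are available):

```bash
python3 src/nofunedit_generation.py \
  --data_subset latency \
  --model_path WizardLM/WizardCoder-15B-V1.0 \
  --temperature 0 \
  --max_new_tokens 5192 \
  --prompt base_prompt \
  --num_samples 1 \
  --precision fp16 \
  --batch_size 1 \
  --tensor_parallel_size 2 \
  --swap_space 4
```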
Our evaluation code for runtime efficiency has been derived from the PIE codebase.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.