Python -VV
Are there relevant papers, and which metrics are used to evaluate models on this dataset? For example, is pass@1 the evaluation metric for MultiPL-E?
Pip Freeze
Reproduction Steps
Expected Behavior
Additional Context
No response
Suggested Solutions
No response