This repo demonstrates how to replicate the results of the Large Language Monkeys paper using a different model, Ministral 8B, and a different dataset, HumanEval.
It runs both the code generation model and the sandboxed code evaluation on Modal, massively in parallel: on Modal's free tier, that's code generation across 10 H100 GPUs at a throughput of 5-10k tok/s per GPU and code evaluation across over 100 Sandboxes.
For more on using Modal Sandboxes, see our product launch post.
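The evaluation side drops each generated completion, together with its tests, into a Modal Sandbox. For a feel of the Sandbox API, here is a minimal sketch; the app name, the snippet being executed, and the timeout are illustrative, not taken from this repo:

```python
import modal

# look up (or create) an app to attach Sandboxes to -- the name is illustrative
app = modal.App.lookup("sandbox-sketch", create_if_missing=True)

# run an untrusted snippet in an isolated Sandbox
sb = modal.Sandbox.create(
    "python", "-c", "print(sum(range(10)))",  # stand-in for a generated solution plus its tests
    app=app,
    timeout=60,  # kill runaway executions after a minute
)
sb.wait()                # block until the entrypoint exits
print(sb.returncode)     # 0 on success, nonzero on failure
print(sb.stdout.read())  # captured output
```

Running over 100 of these concurrently is what the evaluation step below does.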
```bash
pip install modal # that's it :)
modal setup # if you're new to Modal
```
```bash
# test
modal run le_inference.py
# deploy
modal deploy le_inference.py
```
```bash
# test
modal run le_client.py --dry-run --n 1 --subsample 1
# test and save results
modal run le_client.py --no-dry-run --n 1 --subsample 1
# run full dataset, 1000 attempts per problem
modal run le_client.py --no-dry-run --n 1000 --subsample 100 # percent
```
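Those flags come from a Modal local entrypoint: `modal run` turns the entrypoint's keyword arguments into CLI options, with booleans exposed as `--flag/--no-flag` pairs. A hypothetical sketch of that interface (the argument handling inside `le_client.py` is assumed, not copied):

```python
import modal

app = modal.App("le-client-sketch")  # illustrative name

@app.local_entrypoint()
def main(dry_run: bool = True, n: int = 1, subsample: int = 1):
    # `modal run le_client.py --no-dry-run --n 1000 --subsample 100`
    # lands here as dry_run=False, n=1000, subsample=100 (percent of problems kept)
    mode = "dry run" if dry_run else "live run"
    print(f"{mode}: {n} attempts per problem on {subsample}% of problems")
```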
```bash
# run concurrently or afterwards
modal run les_evals.py
```
```bash
modal launch jupyter --volume mistral-humaneval --mount analysis
# run the notebook in `mount/`
```
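The natural metric for this setup is pass@k: the probability that at least one of k sampled completions solves a problem. With n attempts per problem, of which c pass, the standard unbiased estimator introduced with HumanEval is 1 - C(n-c, k) / C(n, k); a small sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. a problem with 1000 attempts and 37 passes, estimated at k=100
print(pass_at_k(1000, 37, 100))
```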
The `le_quant` and `le_quant_wrapper` scripts demonstrate language model quantization with llm-compressor, run on Modal.
We ran those already to generate the model used by default in the example, so you don't need to run them, but they are included for completeness.
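For a rough idea of what that quantization involves, here is a minimal llm-compressor sketch; the FP8 dynamic scheme, the model ID, and the save path are assumptions about what those scripts do, and import paths vary across llm-compressor versions:

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # newer releases: `from llmcompressor import oneshot`
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Ministral-8B-Instruct-2410"  # assumed base model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# data-free FP8 dynamic quantization of every Linear layer except the output head
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```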