This repo demonstrates how to replicate the results of the Large Language Monkeys paper using a different model, Ministral 8B, and a different dataset, HumanEval.
It runs both the code generation model and the sandboxed code evaluation on Modal, massively in parallel: on Modal's free tier, that's code generation across 10 H100 GPUs at a throughput of 5-10k tok/s per GPU and code evaluation across over 100 Sandboxes.
For more on using Modal Sandboxes, see our product launch post.
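The evaluation side drops each generated completion, together with its tests, into a Modal Sandbox. For a feel of the Sandbox API, here is a minimal sketch; the app name, the snippet being executed, and the timeout are illustrative, not taken from this repo:

```python
import modal

# look up (or create) an app to attach Sandboxes to -- the name is illustrative
app = modal.App.lookup("sandbox-sketch", create_if_missing=True)

# run an untrusted snippet in an isolated Sandbox
sb = modal.Sandbox.create(
    "python", "-c", "print(sum(range(10)))",  # stand-in for a generated solution plus its tests
    app=app,
    timeout=60,  # kill runaway executions after a minute
)
sb.wait()                # block until the entrypoint exits
print(sb.returncode)     # 0 on success, nonzero on failure
print(sb.stdout.read())  # captured output
```

Running over 100 of these concurrently is what the evaluation step below does.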
```bash
pip install modal # that's it :)
modal setup # if you're new to Modal
```
```bash
# test
modal run le_inference.py
# deploy
modal deploy le_inference.py
```
```bash
# test
modal run le_client.py --dry-run --n 1 --subsample 1
# test and save results
modal run le_client.py --no-dry-run --n 1 --subsample 1
# run full dataset, 1000 attempts per problem
modal run le_client.py --no-dry-run --n 1000 --subsample 100 # percent
```
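Those flags come from a Modal local entrypoint: `modal run` turns the entrypoint's keyword arguments into CLI options, with booleans exposed as `--flag/--no-flag` pairs. A hypothetical sketch of that interface (the argument handling inside `le_client.py` is assumed, not copied):

```python
import modal

app = modal.App("le-client-sketch")  # illustrative name

@app.local_entrypoint()
def main(dry_run: bool = True, n: int = 1, subsample: int = 1):
    # `modal run le_client.py --no-dry-run --n 1000 --subsample 100`
    # lands here as dry_run=False, n=1000, subsample=100 (percent of problems kept)
    mode = "dry run" if dry_run else "live run"
    print(f"{mode}: {n} attempts per problem on {subsample}% of problems")
```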
```bash
# run concurrently or afterwards
modal run les_evals.py
```
```bash
modal launch jupyter --volume mistral-humaneval --mount analysis
# run the notebook in `mount/`
```
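The natural metric for this setup is pass@k: the probability that at least one of k sampled completions solves a problem. With n attempts per problem, of which c pass, the standard unbiased estimator introduced with HumanEval is 1 - C(n-c, k) / C(n, k); a small sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. a problem with 1000 attempts and 37 passes, estimated at k=100
print(pass_at_k(1000, 37, 100))
```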
The `le_quant` and `le_quant_wrapper` scripts demonstrate language model quantization with llm-compressor, run on Modal.
We ran those already to generate the model used by default in the example, so you don't need to run them, but they are included for completeness.
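For a rough idea of what that quantization involves, here is a minimal llm-compressor sketch; the FP8 dynamic scheme, the model ID, and the save path are assumptions about what those scripts do, and import paths vary across llm-compressor versions:

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # newer releases: `from llmcompressor import oneshot`
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Ministral-8B-Instruct-2410"  # assumed base model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# data-free FP8 dynamic quantization of every Linear layer except the output head
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```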