This repository was create for Mistral Hackathon 2026 in London.
Single-GPU RL fine-tuning (GRPO) for Mistral using the deduplicated dataset datasets/unique_prompts_balanced.json by default (or any JSON dataset path).
Dataset summary and quality notes: DATASET_DETAILS.md
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements-rl.txtOptional (faster attention if it builds in your environment):
pip install --no-build-isolation flash-attnbash scripts/run_grpo_single_h200.shThe launcher auto-loads .env if present (or ENV_FILE=path/to/file).
Priority order: explicit shell env > .env values > script defaults.
Example .env:
MODEL_NAME=mistralai/Ministral-3-3B-Instruct-2512
DATA_DIR=datasets/unique_prompts_balanced.json
OUTPUT_DIR=outputs/mistral-grpo-exp1
RUN_NAME=ministral-grpo-exp1
SEED=42
WANDB_PROJECT=mistral-rl
WANDB_ENTITY=geo-politis-n-a
WANDB_GROUP=ministral-grpo
WANDB_JOB_TYPE=train
WANDB_TAGS=grpo,ministral,single-h200,exp1
WANDB_API_KEY=your_api_key
USE_4BIT=0Default base model: mistralai/Ministral-3-3B-Instruct-2512
Default W&B project: mistral-rl
Launcher preflight: installs wandb if missing and validates W&B authentication before training starts.
Tracking conventions: fixed seed, explicit step logging, run naming, tags/group/job_type.
If you want non-interactive auth:
export WANDB_API_KEY=your_api_key
bash scripts/run_grpo_single_h200.shOr login once manually:
python3 -m wandb loginMODEL_NAME=mistralai/Ministral-3-3B-Instruct-2512 \
DATA_DIR=datasets/unique_prompts_balanced.json \
OUTPUT_DIR=outputs/mistral-grpo-exp1 \
RUN_NAME=ministral-grpo-exp1 \
SEED=42 \
WANDB_PROJECT=mistral-rl \
WANDB_ENTITY=your_team_or_user \
WANDB_GROUP=ministral-grpo \
WANDB_JOB_TYPE=train \
WANDB_TAGS=grpo,ministral,single-h200,exp1 \
USE_4BIT=0 \
bash scripts/run_grpo_single_h200.shSet USE_4BIT=1 only for base models that support bitsandbytes 4-bit loading.
mistralai/Ministral-3-3B-Instruct-2512 should stay at USE_4BIT=0.
python3 scripts/train_grpo_mistral.py \
--model-name mistralai/Ministral-3-3B-Instruct-2512 \
--data-dir datasets/unique_prompts_balanced.json \
--output-dir outputs/mistral-grpo \
--report-to none \
--bf16python3 scripts/train_grpo_mistral.py \
--model-name mistralai/Ministral-3-3B-Instruct-2512 \
--data-dir datasets/unique_prompts_balanced.json \
--output-dir outputs/mistral-grpo \
--run-name ministral-grpo-single-h200 \
--seed 42 \
--report-to wandb \
--wandb-project mistral-rl \
--wandb-entity your_team_or_user \
--wandb-group ministral-grpo \
--wandb-job-type train \
--wandb-tags grpo,ministral,single-h200 \
--bf16 \
--per-device-batch-size 1 \
--gradient-accumulation-steps 16 \
--num-generations 4scripts/infer.py runs dataset-level validation for your trained adapter and prints:
- malicious refusal rate
- benign helpfulness rate
- balanced score
Example:
python3 scripts/infer.py \
--base-model mistralai/Ministral-3-3B-Instruct-2512 \
--adapter-path outputs/mistral-grpo \
--data-dir datasets/unique_prompts_balanced.json \
--eval-split 0.02 \
--max-samples 200 \
--save-predictions outputs/mistral-grpo/validation.jsonInstall vLLM (separate from requirements-rl.txt):
pip install vllmValidate base model only:
python3 scripts/validate_vllm.py \
--model mistralai/Ministral-3-3B-Instruct-2512 \
--data-dir datasets/unique_prompts_balanced.json \
--eval-split 0.02 \
--max-samples 200 \
--save-predictions outputs/mistral-grpo/validation-vllm-base.jsonValidate fine-tuned LoRA adapter:
python3 scripts/validate_vllm.py \
--model mistralai/Ministral-3-3B-Instruct-2512 \
--adapter-path outputs/mistral-grpo \
--data-dir datasets/unique_prompts_balanced.json \
--eval-split 0.02 \
--max-samples 200 \
--save-predictions outputs/mistral-grpo/validation-vllm-lora.jsonValidate against a remote vLLM server (OpenAI-compatible API):
export VLLM_API_KEY=your_api_key_if_needed
python3 scripts/validate_vllm_remote.py \
--base-url http://your-vllm-server:8000 \
--endpoint /v1/chat/completions \
--model mistralai/Ministral-3-3B-Instruct-2512 \
--data-dir datasets/unique_prompts_balanced.json \
--eval-split 0.02 \
--max-samples 200 \
--save-predictions outputs/mistral-grpo/validation-vllm-remote.jsonSet token once:
pip install huggingface_hub
export HF_TOKEN=your_hf_tokenUpload GRPO adapter/output folder:
python3 scripts/upload_to_hf.py \
--local-path outputs/mistral-grpo \
--repo-id your-user/ministral-grpo-lora \
--repo-type model \
--private \
--commit-message \"Upload GRPO adapter checkpoint\"Upload SFT output folder:
python3 scripts/upload_to_hf.py \
--local-path outputs/mistral-sft \
--repo-id your-user/ministral-sft-lora \
--repo-type model \
--private \
--commit-message \"Upload SFT adapter checkpoint\"Upload as public repo:
python3 scripts/upload_to_hf.py \
--local-path outputs/Ministral-3-3B-Instruct-sec \
--repo-id llmtrace/Ministral-3-3B-Instruct-sec \
--repo-type model \
--publicThe reward function is label-conditioned:
maliciousprompts: reward refusals, penalize unsafe/helpful attack content.benignprompts: reward normal helpful answers, penalize unnecessary refusals.
This is implemented in scripts/train_grpo_mistral.py and can be adjusted easily.