mistral-RL-scripts

This repository was create for Mistral Hackathon 2026 in London.

Single-GPU RL fine-tuning (GRPO) for Mistral using the deduplicated dataset datasets/unique_prompts_balanced.json by default (or any JSON dataset path).

Dataset summary and quality notes: DATASET_DETAILS.md

1) Install

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements-rl.txt

Optional (faster attention if it builds in your environment):

pip install --no-build-isolation flash-attn

2) Run on one H200

bash scripts/run_grpo_single_h200.sh

The launcher auto-loads .env if present (or ENV_FILE=path/to/file). Priority order: explicit shell env > .env values > script defaults.

Example .env:

MODEL_NAME=mistralai/Ministral-3-3B-Instruct-2512
DATA_DIR=datasets/unique_prompts_balanced.json
OUTPUT_DIR=outputs/mistral-grpo-exp1
RUN_NAME=ministral-grpo-exp1
SEED=42
WANDB_PROJECT=mistral-rl
WANDB_ENTITY=geo-politis-n-a
WANDB_GROUP=ministral-grpo
WANDB_JOB_TYPE=train
WANDB_TAGS=grpo,ministral,single-h200,exp1
WANDB_API_KEY=your_api_key
USE_4BIT=0

Default base model: mistralai/Ministral-3-3B-Instruct-2512 Default W&B project: mistral-rl Launcher preflight: installs wandb if missing and validates W&B authentication before training starts. Tracking conventions: fixed seed, explicit step logging, run naming, tags/group/job_type.

If you want non-interactive auth:

export WANDB_API_KEY=your_api_key
bash scripts/run_grpo_single_h200.sh

Or login once manually:

python3 -m wandb login

3) Override defaults

MODEL_NAME=mistralai/Ministral-3-3B-Instruct-2512 \
DATA_DIR=datasets/unique_prompts_balanced.json \
OUTPUT_DIR=outputs/mistral-grpo-exp1 \
RUN_NAME=ministral-grpo-exp1 \
SEED=42 \
WANDB_PROJECT=mistral-rl \
WANDB_ENTITY=your_team_or_user \
WANDB_GROUP=ministral-grpo \
WANDB_JOB_TYPE=train \
WANDB_TAGS=grpo,ministral,single-h200,exp1 \
USE_4BIT=0 \
bash scripts/run_grpo_single_h200.sh

Set USE_4BIT=1 only for base models that support bitsandbytes 4-bit loading. mistralai/Ministral-3-3B-Instruct-2512 should stay at USE_4BIT=0.

4) Run without W&B

python3 scripts/train_grpo_mistral.py \
  --model-name mistralai/Ministral-3-3B-Instruct-2512 \
  --data-dir datasets/unique_prompts_balanced.json \
  --output-dir outputs/mistral-grpo \
  --report-to none \
  --bf16

5) Direct script usage

python3 scripts/train_grpo_mistral.py \
  --model-name mistralai/Ministral-3-3B-Instruct-2512 \
  --data-dir datasets/unique_prompts_balanced.json \
  --output-dir outputs/mistral-grpo \
  --run-name ministral-grpo-single-h200 \
  --seed 42 \
  --report-to wandb \
  --wandb-project mistral-rl \
  --wandb-entity your_team_or_user \
  --wandb-group ministral-grpo \
  --wandb-job-type train \
  --wandb-tags grpo,ministral,single-h200 \
  --bf16 \
  --per-device-batch-size 1 \
  --gradient-accumulation-steps 16 \
  --num-generations 4

6) Validation (infer.py)

scripts/infer.py runs dataset-level validation for your trained adapter and prints:

malicious refusal rate
benign helpfulness rate
balanced score

Example:

python3 scripts/infer.py \
  --base-model mistralai/Ministral-3-3B-Instruct-2512 \
  --adapter-path outputs/mistral-grpo \
  --data-dir datasets/unique_prompts_balanced.json \
  --eval-split 0.02 \
  --max-samples 200 \
  --save-predictions outputs/mistral-grpo/validation.json

7) Validation with vLLM

Install vLLM (separate from requirements-rl.txt):

pip install vllm

Validate base model only:

python3 scripts/validate_vllm.py \
  --model mistralai/Ministral-3-3B-Instruct-2512 \
  --data-dir datasets/unique_prompts_balanced.json \
  --eval-split 0.02 \
  --max-samples 200 \
  --save-predictions outputs/mistral-grpo/validation-vllm-base.json

Validate fine-tuned LoRA adapter:

python3 scripts/validate_vllm.py \
  --model mistralai/Ministral-3-3B-Instruct-2512 \
  --adapter-path outputs/mistral-grpo \
  --data-dir datasets/unique_prompts_balanced.json \
  --eval-split 0.02 \
  --max-samples 200 \
  --save-predictions outputs/mistral-grpo/validation-vllm-lora.json

Validate against a remote vLLM server (OpenAI-compatible API):

export VLLM_API_KEY=your_api_key_if_needed
python3 scripts/validate_vllm_remote.py \
  --base-url http://your-vllm-server:8000 \
  --endpoint /v1/chat/completions \
  --model mistralai/Ministral-3-3B-Instruct-2512 \
  --data-dir datasets/unique_prompts_balanced.json \
  --eval-split 0.02 \
  --max-samples 200 \
  --save-predictions outputs/mistral-grpo/validation-vllm-remote.json

8) Upload to Hugging Face Hub

Set token once:

pip install huggingface_hub
export HF_TOKEN=your_hf_token

Upload GRPO adapter/output folder:

python3 scripts/upload_to_hf.py \
  --local-path outputs/mistral-grpo \
  --repo-id your-user/ministral-grpo-lora \
  --repo-type model \
  --private \
  --commit-message \"Upload GRPO adapter checkpoint\"

Upload SFT output folder:

python3 scripts/upload_to_hf.py \
  --local-path outputs/mistral-sft \
  --repo-id your-user/ministral-sft-lora \
  --repo-type model \
  --private \
  --commit-message \"Upload SFT adapter checkpoint\"

Upload as public repo:

python3 scripts/upload_to_hf.py \
  --local-path outputs/Ministral-3-3B-Instruct-sec \
  --repo-id llmtrace/Ministral-3-3B-Instruct-sec \
  --repo-type model \
  --public

Reward behavior

The reward function is label-conditioned:

malicious prompts: reward refusals, penalize unsafe/helpful attack content.
benign prompts: reward normal helpful answers, penalize unnecessary refusals.

This is implemented in scripts/train_grpo_mistral.py and can be adjusted easily.

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
datasets		datasets
datasets_v2		datasets_v2
deployment		deployment
docs/images		docs/images
outputs/mistral-grpo		outputs/mistral-grpo
scripts		scripts
.gitignore		.gitignore
DATASET_DETAILS.md		DATASET_DETAILS.md
HACKATHON_SUBMISSION.md		HACKATHON_SUBMISSION.md
README.md		README.md
TRAINING_REPORT_REVIEW.md		TRAINING_REPORT_REVIEW.md
WEAVE_GAP_REPORT.md		WEAVE_GAP_REPORT.md
requirements-rl.txt		requirements-rl.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mistral-RL-scripts

1) Install

2) Run on one H200

3) Override defaults

4) Run without W&B

5) Direct script usage

6) Validation (infer.py)

7) Validation with vLLM

8) Upload to Hugging Face Hub

Reward behavior

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

mistral-RL-scripts

1) Install

2) Run on one H200

3) Override defaults

4) Run without W&B

5) Direct script usage

6) Validation (infer.py)

7) Validation with vLLM

8) Upload to Hugging Face Hub

Reward behavior

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages