This recipe provides a complete guide to fine-tuning a base LLM model to generate correct SQL queries from natural language questions, optimizing specifically for execution correctness (i.e., the generated SQL returns the correct result when executed).
- Base Model: google/gemma-4-E2B
- Dataset: philschmid/gretel-synthetic-text-to-sql
The goal is to demonstrate how to use the OpenRL infrastructure to run training locally on a single machine with multiple GPUs. This provides a baseline and understanding before scaling to a distributed Kubernetes (K8s) cluster in later guides.
What the core script does: The core training script texttosql_sft_grpo.py orchestrates the training loop. It performs the following actions:
- Calls our OpenRL server (gateway) to request samples from vLLM.
- Executes the generated SQL queries in a local SQLite database to compute rewards.
- Sends these rewards back to the server to update the LoRA adapter weights via the trainer.
Sequence Diagram
This sequence diagram illustrates the interaction between the client script, the gateway/trainer, and the vLLM sampler during the RL training phase, noting the key functions called.
sequenceDiagram
participant Client as Client Script<br/>(texttosql_sft_grpo.py)
participant Gateway as Gateway / Trainer<br/>(GPU 1)
participant Sampler as vLLM Sampler<br/>(GPU 0)
Note over Client,Sampler: 1. Rollout Phase
Client->>Gateway: 1.1 Save current weights for sampler<br/>(save_weights_and_get_sampling_client_async)
Gateway-->>Client: Acknowledged
Client->>Gateway: 1.2 Request N samples for prompt<br/>(sample_async)
Gateway->>Sampler: 1.3 Forward generation request (API)
Sampler-->>Gateway: 1.4 Return generated SQL completions
Gateway-->>Client: Return samples to client
Note over Client,Sampler: 2. Reward Phase
Client->>Client: 2.1 Execute SQL in local SQLite<br/>(build_rollout -> score_eval_prediction)
Client->>Client: 2.2 Calculate rewards & advantages
Note over Client,Sampler: 3. Update Phase
Client->>Gateway: 3.1 Send data for Forward/Backward (Grads)<br/>(forward_backward_async)
Gateway->>Gateway: Compute loss & gradients on GPU 1
Gateway-->>Client: Acknowledged
Client->>Gateway: 3.2 Send request to step optimizer<br/>(optim_step_async)
Gateway->>Gateway: Update weights on GPU 1
Gateway-->>Client: Acknowledged
Before running the training, you need to set up the environment and deploy OpenRL. You can choose to run it locally on a VM with multiple GPUs or on a GKE cluster.
- For Local Setup (recommended for baseline), follow the Local Setup Guide.
- For GKE Setup (recommended for scaling), follow the GKE Setup Guide.
After completing the setup and ensuring the gateway and vLLM sampler are running, proceed to the training section below.
Open a third terminal session to run the training script.
You can copy and paste these into your training terminal before proceeding.
# OpenRL Gateway URL
export TINKER_BASE_URL=http://127.0.0.1:9003
# Dummy API key for local gateway
export TINKER_API_KEY=tml-dummy
# Recommended to avoid Hugging Face rate limits
# export HF_TOKEN="your_huggingface_token"Set the MODE environment variable to control how the script runs.
Supported modes with known-good configurations:
full: Runs both SFT (5steps on100examples, learning rate5e-5) and RL (80steps,8prompts x8samples per step, learning rate5e-6).rl_only: Skips SFT and runs only RL (80steps,8prompts x8samples per step, learning rate5e-6) from the base model.
# Option 1: Full SFT + RL (Default)
export MODE="full"
# Option 2: RL Only (from scratch)
# export MODE="rl_only"
# Run training
(cd examples/text-to-sql && \
uv run python texttosql_sft_grpo.py gemma4_e2b_rl_recipe phase=$MODE)After training, you can plot the metrics. Run this from the repository root. We use the $MODE variable to point to the correct artifact directory.
(cd examples/text-to-sql && \
uv run python -m utils.plot \
artifacts/texttosql_sft_grpo_gemma4_e2b_rl_recipe_$MODE/metrics.jsonl)The plotter renders several curves to help you understand training progress:
- Execution Match: The percentage of generated SQL queries that returned the correct result when executed. This is the primary metric for success.
- RL Reward EWMA: The Exponentially Weighted Moving Average (smoothed average) of the RL reward.
- Compile Rate EWMA: The smoothed rate at which generated SQL queries were successfully compiled by the database.
- SFT Loss: The loss during the Supervised Fine-Tuning phase. It should decrease.
Here are the actual plots from known-good runs for each mode. Expand the sections below to see the plots and their interpretation.
RL Only Results
- In this mode, training starts directly with RL without a prior SFT phase.
- You should see the Execution Match curve start low (around 8%) and steadily increase as the model learns to generate correct SQL based on execution rewards.
- The Compile Rate should also increase, showing the model learns valid SQL syntax.
- The RL Reward should show an upward trend.
Full SFT + RL Results
- In this mode, the model first undergoes SFT for 5 steps, which typically raises the baseline execution accuracy from ~8% to ~12%.
- After SFT, the RL phase begins. You should see a significant jump in Execution Match during the RL phase, reaching around 40% in known-good runs.
- The SFT Loss curve (usually shown in a separate plot or as part of the logs) should show a decrease during the first 5 steps.
- This plot demonstrates the combined effect of format learning (SFT) and correctness optimization (RL).
If you want to experiment further, you can override default configurations by appending them to the command line:
(cd examples/text-to-sql && \
uv run python texttosql_sft_grpo.py gemma4_e2b_rl_recipe sft.steps=10 rl.steps=100)Common overrides:
sft.steps=5sft.learning_rate=5e-5rl.steps=80rl.learning_rate=5e-6rl.prompts_per_step=8rl.samples_per_prompt=8
- CUDA Out of Memory (OOM): If vLLM fails with OOM during startup, ensure you have enough VRAM and set a higher value for
VLLM_GPU_MEMORY_UTILIZATION(e.g., 0.9). - System Hangs during model loading: Ensure you have at least 32GB of system RAM. Loading large models can overwhelm smaller RAM configurations.

