Infinite-RL is a reward functions toolbox for LLM Reinforcement Learning. It provides modular reward functions (coding, math, language detection, length and repetition penalties), utilities for evaluating model responses, and optional dataset generation for synthetic RLHF samples via the Gemini API.
Clone the repository:

```bash
git clone https://github.com/hon9kon9ize/infinite-rl.git
cd infinite-rl
```

Install dependencies and language runtimes:

```bash
pip install .
```

Or install directly from GitHub:

```bash
pip install git+https://github.com/hon9kon9ize/infinite-rl.git
```

The installation process will automatically attempt to install the required language runtimes:
- macOS: Uses Homebrew to install Node.js and ts-node
- Linux: Uses apt-get to install Node.js and ts-node
- Windows: Provides links for manual installation; installs ts-node via npm if available
Set up your Gemini API key:
```bash
export GEMINI_API_KEY=your_api_key_here
```
(Optional) Activate the Python virtual environment before using the CLI:
```bash
source .venv/bin/activate
```
Runtimes (WASM)

- The JS and MicroPython runtimes are built by `build_src/build_wasm.sh`.
- A GitHub Actions workflow (`.github/workflows/build_and_release_runtimes.yml`) runs the build and uploads `universal_js.wasm` and `micropython.wasm` to a GitHub Release.
- During installation, `setup.py` will try to download these runtimes automatically from the latest release (or use the `RUNTIME_RELEASE_TAG` environment variable to pin a release). If you prefer to build locally, run `./build_src/build_wasm.sh` and the generated files will be placed in `infinite_rl/runtimes/`.
You can generate a synthetic dataset using the provided script. The generator is designed to be idempotent and resumable: if a `dataset.csv` already exists in the output directory, the script will calculate the delta needed to reach your target `--num_samples` while maintaining the requested task distribution.
```bash
python scripts/generate.py --num_samples 100 --out_dir ./my_dataset --threads 4
```

Arguments:

- `--num_samples`: Target total number of samples for the dataset (default: 10).
- `--model_name`: Gemini model to use (default: `gemini-2.0-flash-exp`).
- `--out_dir`: Directory to save the `dataset.csv` (default: `data`).
- `--save_every`: Save progress to CSV every N successful samples (default: 1).
- `--threads`: Number of parallel generation threads (default: 1).
- `--max_retries`: Maximum consecutive failed attempts per task type before stopping (default: 5).
- `--timeout`: Timeout (in seconds) for reward function execution (default: 5).
- `--task_dist`: Task distribution as comma-separated floats `[coding, math]` (default: `0.5,0.5`).
- `--debug`: Enable verbose logging and save raw LLM responses to `data/debug_prompts`.
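As an illustration of the resume behaviour described above, the sketch below shows how the remaining per-task quota could be derived from an existing `dataset.csv`. It is not the generator's actual code, and the `Task` column name is an assumption made for the example.

```python
import csv
import os
from collections import Counter

def remaining_quota(out_dir, num_samples, task_dist):
    """Illustrative: how many new samples each task type still needs."""
    existing = Counter()
    path = os.path.join(out_dir, "dataset.csv")
    if os.path.exists(path):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                existing[row.get("Task", "unknown")] += 1  # column name assumed

    # Per-task targets follow the requested --task_dist split of --num_samples.
    targets = {task: round(num_samples * frac) for task, frac in task_dist.items()}
    return {task: max(0, targets[task] - existing[task]) for task in task_dist}

print(remaining_quota("./my_dataset", num_samples=100, task_dist={"coding": 0.5, "math": 0.5}))
```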
Example for generating only math tasks:
```bash
python scripts/generate.py --num_samples 10 --task_dist 0,1 --out_dir ./math_only
```

Infinite RL includes a comprehensive testing suite and verification tools to ensure the generator and reward functions are working correctly.
Use pytest to run the unit tests for reward functions and the parser:
```bash
# Run all tests
python -m pytest tests -v

# Run specific reward function tests
python -m pytest tests/test_reward_functions.py -v
```

You can also run the built-in examples to verify that all task types are correctly parsed and evaluated:

```bash
python -m infinite_rl.run_examples
```

The coding reward function evaluates LLM-generated code across multiple programming languages with test case validation.
Supported Languages:
- Python
- JavaScript
Features:
- Code execution and validation
- Test case evaluation
- Output comparison with similarity scoring
- Detailed error reporting
Example:
```python
from infinite_rl import get_reward_functions

# Initialize with custom timeout
reward_fns = get_reward_functions(timeout=10)
coding_fn = reward_fns["coding"]
coding_fn.set_language("python")

# Evaluate with expected output
result = coding_fn.compute_reward(
    model_output="<answer>\n```python\nprint(2 + 2)\n```\n</answer>",
    expected_output="4"
)
print(f"Score: {result.correctness_score}")
```

The math reward function evaluates mathematical problem-solving using symbolic computation.
Example:
```python
from infinite_rl import get_reward_functions

reward_fns = get_reward_functions()
math_fn = reward_fns["math"]

result = math_fn.compute_reward(
    model_output="<answer>x^2 + 2x + 1</answer>",
    expected_output="(x+1)^2"
)
print(f"Correctness: {result.correctness_score}")
```

Test the executor with different programming languages:
```python
from infinite_rl import RewardExecutor

executor = RewardExecutor(timeout=5)

# Test Python
stdout, stderr = executor.run_single("print('Hello, World!')", "python")
print(f"Python: {stdout}")  # Output: Hello, World!

# Test JavaScript
stdout, stderr = executor.run_single("console.log('Hello, World!')", "javascript")
print(f"JavaScript: {stdout}")  # Output: Hello, World!

# Test Qwen3 Embed (similarity)
# Provide either a (document, query) tuple or a string with a separator (e.g. 'document|||query')
stdout, stderr = executor.run_single(("This is a passage", "query text"), "qwen3")
print(f"Qwen3 similarity: {stdout}")  # Output: a float string like '0.8234' (if runtime present)
```

Install and test in Colab with this notebook:
```python
# Install the package
!pip install git+https://github.com/hon9kon9ize/infinite-rl.git

# Import and test
from infinite_rl import RewardExecutor, get_reward_functions

# Test executor
executor = RewardExecutor(timeout=5)
stdout, stderr = executor.run_single("print(2 + 2)", "python")
print(f"Executor test - Output: {stdout}, Error: {stderr}")

# Test coding reward function
reward_fns = get_reward_functions(timeout=5)
coding_fn = reward_fns["coding"]
coding_fn.set_language("python")
result = coding_fn.compute_reward(
    model_output="<answer>\n```python\nprint(2 + 2)\n```\n</answer>",
    expected_output="4"
)
print(f"Reward Result: {result}")
print(f"Format Score: {result.format_score}")
print(f"Correctness Score: {result.correctness_score}")
```

To run all unit tests, install development dependencies and use pytest:
```bash
pip install -r requirements_dev.txt
pytest
```

Project structure:

```
infinite_rl/
├── executor.py            # Multi-language code executor
├── generator.py           # LLM orchestration and resume logic
├── parser.py              # Robust tag extraction and markdown parsing
├── prompts.py             # Task-specific system instructions
└── reward_functions/
    ├── reward_function.py # Base reward function class
    ├── coding.py          # Coding (Python, JS) evaluator
    └── math.py            # Symbolic Math evaluator
```
Generated output files:

- `data/dataset.csv`: The primary output containing successful samples (Prompt, Answer, Response, Scores).
- `data/failed_dataset.csv`: Detailed log of failed attempts and rectification errors for troubleshooting.
- `data/debug_prompts/`: Raw system and user prompts sent to the LLM (enabled via `--debug`).
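To sanity-check a generated dataset you can inspect it directly with the standard library; the snippet below assumes the column names listed above (exact casing may differ in your copy):

```python
import csv

with open("data/dataset.csv", newline="") as f:
    reader = csv.DictReader(f)
    print(reader.fieldnames)        # expected: Prompt, Answer, Response, Scores (names assumed)
    for i, row in enumerate(reader):
        print(row["Prompt"][:80])   # peek at the first few prompts
        if i == 2:
            break
```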
All task types are designed for RLHF (Reinforcement Learning from Human Feedback) readiness. Every sample follows a strict three-part structure:

- Prompt: The instruction.
- Answer: The ground-truth reference.
- Response: A detailed step-by-step reasoning trace (Chain-of-Thought) in which the final solution is always wrapped in `<answer>` tags.
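For illustration, a single sample might look like the following (a hypothetical example, not taken from the dataset):

```python
sample = {
    "Prompt": "Write a Python one-liner that prints the sum of 2 and 2.",
    "Answer": "4",
    "Response": (
        "First, note that 2 + 2 evaluates to 4, so printing that expression suffices.\n"
        "<answer>\n```python\nprint(2 + 2)\n```\n</answer>"
    ),
}
```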
We use a specialized `ExampleParser` with fuzzy logic to extract answers even when the LLM slightly deviates from markdown standards (e.g., malformed tags or missing headers).
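As a rough sketch of what such fuzzy extraction involves (illustrative only, not the actual `ExampleParser` implementation):

```python
import re

def extract_answer(text):
    """Pull the final answer out of a response, tolerating slightly malformed tags."""
    # Accept <answer>, < Answer >, a missing closing tag, etc.
    match = re.search(r"<\s*answer\s*>(.*?)(?:<\s*/\s*answer\s*>|$)", text, re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None

print(extract_answer("Reasoning...\n<answer>\n42\n</answer>"))  # -> 42
print(extract_answer("Reasoning...\n< Answer > 42"))            # -> 42 (malformed but recovered)
```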
Handles execution of code in multiple languages with timeout protection and error handling. Located in `infinite_rl/executor.py`.
Each task type has a specialized reward function that:

- Initializes necessary components (e.g., loading embedding or ML models)
- Executes/evaluates generated content extracted from `<answer>` tags
- Computes a reward score (0-1) combining format and correctness
- Returns detailed evaluation metrics

All reward functions inherit from the `RewardFunction` base class and are accessible via `get_reward_functions()`.
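For example, a small driver that routes samples to the matching reward function might look like this. The structure of the sample records is assumed for illustration, but the calls mirror the examples above:

```python
from infinite_rl import get_reward_functions

reward_fns = get_reward_functions(timeout=5)

# Hypothetical mini-batch; the field names here are illustrative, not the dataset schema.
samples = [
    {"task": "coding", "language": "python",
     "response": "<answer>\n```python\nprint(2 + 2)\n```\n</answer>", "expected": "4"},
    {"task": "math",
     "response": "<answer>x^2 + 2x + 1</answer>", "expected": "(x+1)^2"},
]

for sample in samples:
    fn = reward_fns[sample["task"]]
    if sample["task"] == "coding":
        fn.set_language(sample["language"])
    result = fn.compute_reward(model_output=sample["response"], expected_output=sample["expected"])
    print(sample["task"], result.format_score, result.correctness_score)
```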
A utility to discourage verbosity when the answer is correct and to discourage laziness (encourage effort) when the answer is incorrect. Instead of a linear penalty, it uses a cosine curve to create a "sweet spot" for response length.
- Purpose: Prevent overly long correct answers and encourage longer attempts for incorrect answers.
- Math (short): For a normalized x in [0, 1], the functions used are (sketched below):
  - Correct answers (decay after target): R = (cos(pi * x) + 1) / 2 (maps 1 -> 0 over the range)
  - Incorrect answers (encourage effort): R = (1 - cos(pi * x)) / 2 (maps 0 -> 1 over the range)
- Implementation: See `infinite_rl/reward_functions/length.py`, function `cosine_length_reward(length, min_len=1, max_len=1000, target_len=None, correct=True)`.
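The two curves can be written down directly. This standalone sketch reproduces the formulas above; it is not the packaged implementation, so the normalization details are an assumption:

```python
import math

def sketch_length_reward(length, min_len=1, max_len=1000, target_len=None, correct=True):
    """Illustrative cosine length reward following the formulas above."""
    length = max(min_len, min(length, max_len))  # clamp to [min_len, max_len]
    if correct:
        target = target_len if target_len is not None else min_len
        if length <= target:
            return 1.0                                   # full credit up to the target length
        x = (length - target) / (max_len - target)       # normalize the decay region to [0, 1]
        return (math.cos(math.pi * x) + 1) / 2           # 1 -> 0 as length grows
    x = (length - min_len) / (max_len - min_len)         # normalize the full range to [0, 1]
    return (1 - math.cos(math.pi * x)) / 2               # 0 -> 1 to encourage effort

print(sketch_length_reward(350, target_len=200, correct=True))  # partial credit past the target
print(sketch_length_reward(350, correct=False))                 # grows with length
```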
Usage example:
```python
from infinite_rl.reward_functions.length import cosine_length_reward

length = 350
len_reward = cosine_length_reward(
    length=length,
    min_len=1,
    max_len=1000,
    target_len=200,  # for correct answers, lengths <= 200 get full credit
    correct=True,
)

# Combine with a base correctness score (example):
base_correctness_score = 0.9  # e.g. correctness score from a reward function
final_score = base_correctness_score * len_reward
```

Notes:
- For `correct=True`, lengths <= `target_len` receive full reward (1.0); beyond that the reward decays smoothly to 0 at `max_len`.
- For `correct=False`, the reward increases smoothly with length to encourage longer reasoning attempts.
- The function clamps `length` to `[min_len, max_len]` and validates bounds.
We penalize repeated n-grams to discourage degenerate or looping responses. The penalty is a normalized negative value computed as:
```python
from infinite_rl.reward_functions.repetition import ngram_repetition_reward

text = "the cat sat on the mat the cat sat on the mat"  # example response text
penalty = ngram_repetition_reward(text, n=3, weight=-0.1)
```

Behavior:
- Uses simple tokenization (lowercasing and punctuation removal) and counts duplicated n-grams.
- Returns a negative penalty (<= 0) proportional to the fraction of duplicated n-grams in the response; 0 if no duplicates.
- `weight` controls the maximum magnitude (default: -0.1).
Notes:
- Combine this penalty with the base correctness score (e.g., `final_score = max(0.0, base_correctness + penalty)`); see the sketch below.
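As an illustrative reimplementation of the behavior described above (not the packaged code), the penalty can be sketched as the duplicated-n-gram fraction scaled by `weight`:

```python
import re
from collections import Counter

def sketch_ngram_penalty(text, n=3, weight=-0.1):
    """Return a penalty <= 0 proportional to the fraction of duplicated n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # simple tokenization: lowercase, strip punctuation
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    duplicated = sum(c - 1 for c in counts.values() if c > 1)  # occurrences beyond the first
    return weight * (duplicated / len(ngrams))                 # 0 if no duplicates, bounded by |weight|

penalty = sketch_ngram_penalty("the cat sat on the mat the cat sat on the mat")
base_correctness = 0.9  # e.g. from a reward function
final_score = max(0.0, base_correctness + penalty)
print(penalty, final_score)
```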