This repository provides a self-contained, Colab-friendly benchmark for comparing different reasoning strategies:
- CoT (Chain-of-Thought)
- CoT + Self-consistency
- RAG × CoT (mocked)
- ReAct (Tool-Augmented, mocked)
- FSM Controller (LangGraph-like controller, mocked)
It does not call any external LLM API: everything is implemented with small, deterministic toy logic, so the benchmark runs for free on Colab or any laptop.
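For intuition, the CoT + Self-consistency strategy listed above boils down to sampling several reasoning chains and majority-voting their final answers. Below is a minimal sketch of that idea in plain Python; the `sample_chain` callable is hypothetical and not part of this repository's API:

```python
# Minimal self-consistency sketch; `sample_chain` is a stand-in for any
# (possibly stochastic) chain-of-thought generator, not a class from reasoning_core/.
from collections import Counter
from typing import Callable, List, Tuple

def self_consistency_vote(
    sample_chain: Callable[[int], Tuple[List[str], str]],
    n_samples: int = 5,
) -> str:
    """Sample several reasoning chains and return the majority-voted final answer."""
    answers = []
    for seed in range(n_samples):
        _steps, answer = sample_chain(seed)  # each call returns (reasoning steps, final answer)
        answers.append(answer)
    return Counter(answers).most_common(1)[0][0]
```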
Run the benchmark with:

```bash
python scripts/run_benchmark.py
```

You should see a table similar to:

```
Model   | Outcome(EM) | Process(StepAcc) | Robustness(SC/Para) | Efficiency(Tokens/Steps)
--------+-------------+------------------+---------------------+-------------------------
CoT     | ...
...
```
The exact numbers will differ from the paper-style example, but the evaluation pipeline and comparison structure match what you would use in a NeurIPS/ICLR paper.
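The Outcome(EM) column is exact-match accuracy. A minimal version of such a metric could look like the sketch below; the function names and the whitespace/lowercase normalization are assumptions, not necessarily what evaluation/ implements:

```python
# Illustrative exact-match metric; the real implementation lives in evaluation/
# and may normalize answers differently.
from typing import List

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the normalized reference, else 0.0."""
    def normalize(s: str) -> str:
        return " ".join(s.strip().lower().split())
    return float(normalize(prediction) == normalize(reference))

def em_score(predictions: List[str], references: List[str]) -> float:
    """Average exact match over a dataset."""
    assert len(predictions) == len(references) and references
    return sum(exact_match(p, r) for p, r in zip(predictions, references)) / len(references)
```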
The repository is organized as follows:

- `reasoning_core/` – core reasoning models (CoT, RAG×CoT, ReAct, FSM)
- `evaluation/` – metric computation and table/report generation
- `data/` – small JSONL sample datasets
- `scripts/` – CLI entrypoints (`run_benchmark.py`)
This is intended as a template: you can replace the mocked models with real LLM wrappers and keep all evaluation code intact.
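For example, a real LLM wrapper might only need to mirror the interface of the mocked models. The sketch below assumes a `run(question)`-style interface and a generic client with a `complete(prompt)` method; neither name is taken from this repository, so adapt them to whatever reasoning_core/ actually defines:

```python
# Sketch of swapping a mocked model for a real LLM wrapper.
# The ReasoningTrace shape and run(question) signature are assumptions about
# the mocked models' interface, not the repository's confirmed API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReasoningTrace:
    steps: List[str] = field(default_factory=list)
    answer: str = ""

class RealCoTReasoner:
    """Chain-of-thought reasoner backed by a real LLM client (hypothetical)."""

    def __init__(self, client, model_name: str = "your-model"):
        self.client = client            # any object exposing complete(prompt) -> str
        self.model_name = model_name

    def run(self, question: str) -> ReasoningTrace:
        prompt = f"Answer step by step.\nQuestion: {question}\nReasoning:"
        raw = self.client.complete(prompt)   # hypothetical client call
        steps = [line for line in raw.splitlines() if line.strip()]
        answer = steps[-1] if steps else ""
        return ReasoningTrace(steps=steps, answer=answer)
```

Keeping the wrapper's return type aligned with what the mocked models produce is what lets the evaluation code stay untouched.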
The default experiment is defined in configs/experiment_default.yaml:
```yaml
dataset:
  path: data/tasks/sample_tasks.jsonl

models:
  - name: CoT
    type: cot
  - name: CoT+SC
    type: cot_sc
  - name: RAG×CoT
    type: rag_cot
  - name: ReAct
    type: react
  - name: FSM
    type: fsm

random_seed: 42
```

You can create new configs pointing to different datasets or subsets of models without touching the Python code.
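For reference, a config like this could be consumed from Python roughly as follows; this is a sketch assuming PyYAML, not necessarily how scripts/run_benchmark.py actually parses the file:

```python
# Illustrative config loader; the real benchmark may read the YAML differently.
import random
import yaml  # PyYAML

def load_experiment(path: str = "configs/experiment_default.yaml") -> dict:
    """Load the experiment config and apply its random seed."""
    with open(path, "r", encoding="utf-8") as f:
        cfg = yaml.safe_load(f)
    random.seed(cfg.get("random_seed", 0))
    return cfg

cfg = load_experiment()
print("dataset:", cfg["dataset"]["path"])
print("models :", ", ".join(m["name"] for m in cfg["models"]))
```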
A minimal pytest smoke test is included under tests/:
```bash
pytest -q
```

This verifies that all models run end-to-end and that metrics stay in valid ranges.
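The real test file may look different, but a smoke test of this kind typically has the shape sketched below; the `build_model` and `evaluate_model` imports are assumptions about the package layout, not confirmed names:

```python
# Hypothetical shape of the smoke test; adjust the imports to the real package layout.
def test_models_run_and_metrics_are_valid():
    from reasoning_core import build_model   # assumed factory name
    from evaluation import evaluate_model    # assumed entrypoint name

    for model_type in ["cot", "cot_sc", "rag_cot", "react", "fsm"]:
        model = build_model(model_type)
        metrics = evaluate_model(model, "data/tasks/sample_tasks.jsonl")
        # Accuracy-style metrics should be proportions; efficiency counts non-negative.
        for name, value in metrics.items():
            assert value >= 0, f"{name} should be non-negative, got {value}"
```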