Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
210 changes: 210 additions & 0 deletions examples/Eliminating_math_hallucinations_with_tool_use.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,210 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Eliminating Mathematical Hallucinations with Deterministic Tool Use\n",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Badge Add registry entry for the new notebook

This commit introduces a new cookbook notebook but does not update registry.yaml, so the publication pipeline will not index/render this page on cookbook.openai.com. Per this repo’s metadata workflow, new content must be added to the registry in the same change to avoid shipping an effectively hidden example.

Useful? React with 👍 / 👎.

"\n",
"LLMs predict tokens — they don't compute. When asked \"What is 347 × 893?\", the model guesses which digits are most likely, not which are mathematically correct. This guide shows how to eliminate mathematical hallucinations entirely by routing computation to a deterministic engine.\n",
"\n",
"## Key Finding\n",
"\n",
"We benchmarked 94 math problems across models of different sizes:\n",
"\n",
"| System | Accuracy | Speed | Size |\n",
"|--------|----------|-------|------|\n",
"| Small model (3B) | 55% | 200ms | 1.8 GB |\n",
"| Medium model (7B) | 77% | 300ms | 4.4 GB |\n",
"| Large model (32B) | 93% | 2,600ms | 18.5 GB |\n",
"| **SymPy Tool** | **100%** | **1.9ms** | **0 GPU** |\n",
"\n",
"Scaling parameters from 3B to 32B (10x) only moves accuracy from 55% to 93%. It never reaches 100% because the failure is **architectural** — you can't guarantee mathematical correctness by sampling from a probability distribution. Tool use solves this.\n",
"\n",
"## The Pattern: Separate Perception from Execution\n",
"\n",
"1. The LLM handles **perception** — understanding what's being asked\n",
"2. A deterministic tool handles **execution** — computing the answer\n",
"3. The LLM handles **presentation** — formatting the result\n",
"\n",
"This is the same principle behind code interpreter, but applied specifically to mathematical computation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install dependencies\n",
"!pip install openai sympy -q"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import sympy\n",
"from openai import OpenAI\n",
"\n",
"client = OpenAI()\n",
"\n",
"# Define the math computation tool\n",
"math_tool = {\n",
" \"type\": \"function\",\n",
" \"function\": {\n",
" \"name\": \"compute_math\",\n",
" \"description\": \"Compute an exact mathematical result using SymPy symbolic math. Use this for ANY calculation — arithmetic, algebra, trigonometry, statistics, finance, etc. Never compute math yourself; always use this tool.\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"expression\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"A valid SymPy expression to evaluate. Examples: '347 * 893', 'sin(pi/6)', 'factorial(10)', '450000 * (0.065/12 * (1+0.065/12)**360) / ((1+0.065/12)**360 - 1)'\"\n",
" }\n",
" },\n",
" \"required\": [\"expression\"]\n",
" }\n",
" }\n",
"}\n",
"\n",
"def compute_math(expression: str) -> dict:\n",
" \"\"\"Evaluate a math expression using SymPy. Zero hallucination.\"\"\"\n",
" try:\n",
" # SymPy namespace for evaluation\n",
" ns = {\n",
" 'sin': sympy.sin, 'cos': sympy.cos, 'tan': sympy.tan,\n",
" 'asin': sympy.asin, 'acos': sympy.acos, 'atan': sympy.atan,\n",
" 'sqrt': sympy.sqrt, 'log': sympy.log, 'exp': sympy.exp,\n",
" 'pi': sympy.pi, 'e': sympy.E,\n",
" 'factorial': sympy.factorial, 'binomial': sympy.binomial,\n",
" 'gcd': sympy.gcd, 'lcm': sympy.lcm,\n",
" 'Rational': sympy.Rational, 'N': sympy.N,\n",
" }\n",
" result = sympy.sympify(expression, locals=ns)\n",
" numeric = float(sympy.N(result))\n",
" return {\"result\": numeric, \"exact\": str(result), \"verified\": True}\n",
" except Exception as e:\n",
" return {\"error\": str(e), \"verified\": False}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Demo: Problems LLMs Get Wrong\n",
"\n",
"Let's test problems that LLMs commonly hallucinate on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_problems = [\n",
" (\"What is 347 * 893?\", 309871),\n",
" (\"What is the monthly payment on a $450,000 mortgage at 6.5% for 30 years?\", 2844.31),\n",
" (\"What is 2 + 3 * 4?\", 14), # Order of operations\n",
" (\"What is 0.1 + 0.2?\", 0.3), # Float trap\n",
" (\"What is 15 factorial?\", 1307674368000),\n",
" (\"What is the sine of 30 degrees?\", 0.5),\n",
"]\n",
"\n",
"# First: ask the model WITHOUT tools (raw token prediction)\n",
"print(\"=\" * 60)\n",
"print(\"WITHOUT TOOLS (token prediction)\")\n",
"print(\"=\" * 60)\n",
"\n",
"for problem, expected in test_problems:\n",
" response = client.chat.completions.create(\n",
" model=\"gpt-4o-mini\",\n",
" messages=[{\"role\": \"user\", \"content\": f\"{problem} Answer with ONLY the number.\"}],\n",
" temperature=0,\n",
" )\n",
" answer = response.choices[0].message.content.strip()\n",
" print(f\" {problem[:45]:45s} => {answer:>20s} (expected: {expected})\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Now: ask the model WITH the math tool\n",
"print(\"=\" * 60)\n",
"print(\"WITH SYMPY TOOL (deterministic computation)\")\n",
"print(\"=\" * 60)\n",
"\n",
"for problem, expected in test_problems:\n",
" messages = [{\"role\": \"user\", \"content\": problem}]\n",
" \n",
" response = client.chat.completions.create(\n",
" model=\"gpt-4o-mini\",\n",
" messages=messages,\n",
" tools=[math_tool],\n",
" tool_choice=\"auto\",\n",
" temperature=0,\n",
Comment on lines +150 to +152
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Force math tool invocation in deterministic path

The “WITH SYMPY TOOL (deterministic computation)” call uses tool_choice="auto", which still allows the model to skip the tool and answer from token prediction (your own else branch already handles [NO TOOL USED]). That means this path cannot guarantee deterministic/no-hallucination behavior and can regress to incorrect answers for some prompts; use a required/specific function tool choice for this section.

Useful? React with 👍 / 👎.

" )\n",
" \n",
" msg = response.choices[0].message\n",
" \n",
" if msg.tool_calls:\n",
" # Model chose to use the tool — execute it\n",
" call = msg.tool_calls[0]\n",
" args = json.loads(call.function.arguments)\n",
" result = compute_math(args[\"expression\"])\n",
" \n",
" answer = result.get(\"result\", \"ERROR\")\n",
" tol = max(abs(expected) * 0.001, 0.01) if expected != 0 else 0.01\n",
" correct = isinstance(answer, (int, float)) and abs(answer - expected) <= tol\n",
" status = \"CORRECT\" if correct else \"WRONG\"\n",
" print(f\" {problem[:45]:45s} => {str(answer):>20s} [{status}]\")\n",
" else:\n",
" print(f\" {problem[:45]:45s} => {msg.content[:20]:>20s} [NO TOOL USED]\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Why This Works\n",
"\n",
"Token prediction **cannot guarantee mathematical correctness**. The model samples from a probability distribution over possible next tokens. For `347 × 893`, the correct answer (`309871`) and a plausible-looking wrong answer (`309,650`) have similar token probabilities.\n",
"\n",
"SymPy performs **symbolic computation** — the same math engine behind tools like Wolfram Alpha. Given the same input, it always produces the same output. There's no randomness, no approximation, no hallucination.\n",
"\n",
"The model's job becomes **perception** (understanding the question) and **presentation** (formatting the answer), not computation. This separation of concerns eliminates an entire class of failure.\n",
"\n",
"## Scaling This Pattern\n",
"\n",
"The [Math Swarm](https://github.com/michaelwinczuk/math-swarm) project extends this pattern to 12 categories (arithmetic, finance, trigonometry, combinatorics, statistics, healthcare) with 1,079 verified test problems at 100% accuracy, including 15 clinical healthcare formulas with guideline-based decision support.\n",
"\n",
"The same principle applies beyond math:\n",
"- **Facts** → Knowledge graph lookup instead of memorized recall\n",
"- **Code** → Execute and verify instead of predict\n",
"- **Dates** → `datetime` library instead of token prediction\n",
"\n",
"Anywhere an LLM computes, a deterministic tool can replace the computation with a guarantee."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

This file was deleted.

Loading