-
Notifications
You must be signed in to change notification settings - Fork 12.3k
Add: Eliminating mathematical hallucinations with deterministic tool use #2599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,210 @@ | ||
| { | ||
| "cells": [ | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# Eliminating Mathematical Hallucinations with Deterministic Tool Use\n", | ||
| "\n", | ||
| "LLMs predict tokens — they don't compute. When asked \"What is 347 × 893?\", the model guesses which digits are most likely, not which are mathematically correct. This guide shows how to eliminate mathematical hallucinations entirely by routing computation to a deterministic engine.\n", | ||
| "\n", | ||
| "## Key Finding\n", | ||
| "\n", | ||
| "We benchmarked 94 math problems across models of different sizes:\n", | ||
| "\n", | ||
| "| System | Accuracy | Speed | Size |\n", | ||
| "|--------|----------|-------|------|\n", | ||
| "| Small model (3B) | 55% | 200ms | 1.8 GB |\n", | ||
| "| Medium model (7B) | 77% | 300ms | 4.4 GB |\n", | ||
| "| Large model (32B) | 93% | 2,600ms | 18.5 GB |\n", | ||
| "| **SymPy Tool** | **100%** | **1.9ms** | **0 GPU** |\n", | ||
| "\n", | ||
| "Scaling parameters from 3B to 32B (10x) only moves accuracy from 55% to 93%. It never reaches 100% because the failure is **architectural** — you can't guarantee mathematical correctness by sampling from a probability distribution. Tool use solves this.\n", | ||
| "\n", | ||
| "## The Pattern: Separate Perception from Execution\n", | ||
| "\n", | ||
| "1. The LLM handles **perception** — understanding what's being asked\n", | ||
| "2. A deterministic tool handles **execution** — computing the answer\n", | ||
| "3. The LLM handles **presentation** — formatting the result\n", | ||
| "\n", | ||
| "This is the same principle behind code interpreter, but applied specifically to mathematical computation." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Install dependencies\n", | ||
| "!pip install openai sympy -q" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "import json\n", | ||
| "import sympy\n", | ||
| "from openai import OpenAI\n", | ||
| "\n", | ||
| "client = OpenAI()\n", | ||
| "\n", | ||
| "# Define the math computation tool\n", | ||
| "math_tool = {\n", | ||
| " \"type\": \"function\",\n", | ||
| " \"function\": {\n", | ||
| " \"name\": \"compute_math\",\n", | ||
| " \"description\": \"Compute an exact mathematical result using SymPy symbolic math. Use this for ANY calculation — arithmetic, algebra, trigonometry, statistics, finance, etc. Never compute math yourself; always use this tool.\",\n", | ||
| " \"parameters\": {\n", | ||
| " \"type\": \"object\",\n", | ||
| " \"properties\": {\n", | ||
| " \"expression\": {\n", | ||
| " \"type\": \"string\",\n", | ||
| " \"description\": \"A valid SymPy expression to evaluate. Examples: '347 * 893', 'sin(pi/6)', 'factorial(10)', '450000 * (0.065/12 * (1+0.065/12)**360) / ((1+0.065/12)**360 - 1)'\"\n", | ||
| " }\n", | ||
| " },\n", | ||
| " \"required\": [\"expression\"]\n", | ||
| " }\n", | ||
| " }\n", | ||
| "}\n", | ||
| "\n", | ||
| "def compute_math(expression: str) -> dict:\n", | ||
| " \"\"\"Evaluate a math expression using SymPy. Zero hallucination.\"\"\"\n", | ||
| " try:\n", | ||
| " # SymPy namespace for evaluation\n", | ||
| " ns = {\n", | ||
| " 'sin': sympy.sin, 'cos': sympy.cos, 'tan': sympy.tan,\n", | ||
| " 'asin': sympy.asin, 'acos': sympy.acos, 'atan': sympy.atan,\n", | ||
| " 'sqrt': sympy.sqrt, 'log': sympy.log, 'exp': sympy.exp,\n", | ||
| " 'pi': sympy.pi, 'e': sympy.E,\n", | ||
| " 'factorial': sympy.factorial, 'binomial': sympy.binomial,\n", | ||
| " 'gcd': sympy.gcd, 'lcm': sympy.lcm,\n", | ||
| " 'Rational': sympy.Rational, 'N': sympy.N,\n", | ||
| " }\n", | ||
| " result = sympy.sympify(expression, locals=ns)\n", | ||
| " numeric = float(sympy.N(result))\n", | ||
| " return {\"result\": numeric, \"exact\": str(result), \"verified\": True}\n", | ||
| " except Exception as e:\n", | ||
| " return {\"error\": str(e), \"verified\": False}" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Demo: Problems LLMs Get Wrong\n", | ||
| "\n", | ||
| "Let's test problems that LLMs commonly hallucinate on." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "test_problems = [\n", | ||
| " (\"What is 347 * 893?\", 309871),\n", | ||
| " (\"What is the monthly payment on a $450,000 mortgage at 6.5% for 30 years?\", 2844.31),\n", | ||
| " (\"What is 2 + 3 * 4?\", 14), # Order of operations\n", | ||
| " (\"What is 0.1 + 0.2?\", 0.3), # Float trap\n", | ||
| " (\"What is 15 factorial?\", 1307674368000),\n", | ||
| " (\"What is the sine of 30 degrees?\", 0.5),\n", | ||
| "]\n", | ||
| "\n", | ||
| "# First: ask the model WITHOUT tools (raw token prediction)\n", | ||
| "print(\"=\" * 60)\n", | ||
| "print(\"WITHOUT TOOLS (token prediction)\")\n", | ||
| "print(\"=\" * 60)\n", | ||
| "\n", | ||
| "for problem, expected in test_problems:\n", | ||
| " response = client.chat.completions.create(\n", | ||
| " model=\"gpt-4o-mini\",\n", | ||
| " messages=[{\"role\": \"user\", \"content\": f\"{problem} Answer with ONLY the number.\"}],\n", | ||
| " temperature=0,\n", | ||
| " )\n", | ||
| " answer = response.choices[0].message.content.strip()\n", | ||
| " print(f\" {problem[:45]:45s} => {answer:>20s} (expected: {expected})\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Now: ask the model WITH the math tool\n", | ||
| "print(\"=\" * 60)\n", | ||
| "print(\"WITH SYMPY TOOL (deterministic computation)\")\n", | ||
| "print(\"=\" * 60)\n", | ||
| "\n", | ||
| "for problem, expected in test_problems:\n", | ||
| " messages = [{\"role\": \"user\", \"content\": problem}]\n", | ||
| " \n", | ||
| " response = client.chat.completions.create(\n", | ||
| " model=\"gpt-4o-mini\",\n", | ||
| " messages=messages,\n", | ||
| " tools=[math_tool],\n", | ||
| " tool_choice=\"auto\",\n", | ||
| " temperature=0,\n", | ||
|
Comment on lines
+150
to
+152
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The “WITH SYMPY TOOL (deterministic computation)” call uses Useful? React with 👍 / 👎. |
||
| " )\n", | ||
| " \n", | ||
| " msg = response.choices[0].message\n", | ||
| " \n", | ||
| " if msg.tool_calls:\n", | ||
| " # Model chose to use the tool — execute it\n", | ||
| " call = msg.tool_calls[0]\n", | ||
| " args = json.loads(call.function.arguments)\n", | ||
| " result = compute_math(args[\"expression\"])\n", | ||
| " \n", | ||
| " answer = result.get(\"result\", \"ERROR\")\n", | ||
| " tol = max(abs(expected) * 0.001, 0.01) if expected != 0 else 0.01\n", | ||
| " correct = isinstance(answer, (int, float)) and abs(answer - expected) <= tol\n", | ||
| " status = \"CORRECT\" if correct else \"WRONG\"\n", | ||
| " print(f\" {problem[:45]:45s} => {str(answer):>20s} [{status}]\")\n", | ||
| " else:\n", | ||
| " print(f\" {problem[:45]:45s} => {msg.content[:20]:>20s} [NO TOOL USED]\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Why This Works\n", | ||
| "\n", | ||
| "Token prediction **cannot guarantee mathematical correctness**. The model samples from a probability distribution over possible next tokens. For `347 × 893`, the correct answer (`309871`) and a plausible-looking wrong answer (`309,650`) have similar token probabilities.\n", | ||
| "\n", | ||
| "SymPy performs **symbolic computation** — the same math engine behind tools like Wolfram Alpha. Given the same input, it always produces the same output. There's no randomness, no approximation, no hallucination.\n", | ||
| "\n", | ||
| "The model's job becomes **perception** (understanding the question) and **presentation** (formatting the answer), not computation. This separation of concerns eliminates an entire class of failure.\n", | ||
| "\n", | ||
| "## Scaling This Pattern\n", | ||
| "\n", | ||
| "The [Math Swarm](https://github.com/michaelwinczuk/math-swarm) project extends this pattern to 12 categories (arithmetic, finance, trigonometry, combinatorics, statistics, healthcare) with 1,079 verified test problems at 100% accuracy, including 15 clinical healthcare formulas with guideline-based decision support.\n", | ||
| "\n", | ||
| "The same principle applies beyond math:\n", | ||
| "- **Facts** → Knowledge graph lookup instead of memorized recall\n", | ||
| "- **Code** → Execute and verify instead of predict\n", | ||
| "- **Dates** → `datetime` library instead of token prediction\n", | ||
| "\n", | ||
| "Anywhere an LLM computes, a deterministic tool can replace the computation with a guarantee." | ||
| ] | ||
| } | ||
| ], | ||
| "metadata": { | ||
| "kernelspec": { | ||
| "display_name": "Python 3", | ||
| "language": "python", | ||
| "name": "python3" | ||
| }, | ||
| "language_info": { | ||
| "name": "python", | ||
| "version": "3.10.0" | ||
| } | ||
| }, | ||
| "nbformat": 4, | ||
| "nbformat_minor": 4 | ||
| } | ||
This file was deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This commit introduces a new cookbook notebook but does not update
registry.yaml, so the publication pipeline will not index/render this page on cookbook.openai.com. Per this repo’s metadata workflow, new content must be added to the registry in the same change to avoid shipping an effectively hidden example.Useful? React with 👍 / 👎.