A CLI tool that downloads arXiv papers, extracts mathematical formulas, and generates AI-powered explanations.
Install uv and the package:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv pip install -e .
```
Copy the example environment file and add your OpenAI API key:
```bash
cp .env.example .env
# Edit .env and add your key:
# OPENAI_API_KEY=your_key_here
```
Get your API key from: https://platform.openai.com/api-keys
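If you call the package from your own Python scripts rather than the CLI, a minimal sketch for loading the key (this assumes the `python-dotenv` package, which may or may not be a dependency of this project):

```python
# Minimal sketch: load OPENAI_API_KEY from .env (assumes python-dotenv is available).
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
```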
The process command downloads papers from arXiv, extracts formulas, and generates explanations using LLMs.
Process a single paper:
```bash
mai process https://arxiv.org/abs/1706.03762
```
Process multiple papers:
```bash
mai process https://arxiv.org/abs/1706.03762 https://arxiv.org/abs/1508.06576
```
You can also use just the paper ID:
```bash
mai process 1706.03762
```

| Option | Short | Default | Description |
|---|---|---|---|
| `--model` | `-m` | `gpt-5` | LLM model to use for explanations (e.g., `gpt-5`, `gpt-4o`, `gpt-4o-mini`) |
| `--context-words` | `-c` | `300` | Number of words of context extracted around each formula |
| `--max-workers` | `-w` | `10` | Number of concurrent API calls for formula explanation (higher = faster, but may hit rate limits) |
| `--max-formulas` | `-f` | `50` | Maximum number of formulas to explain, prioritizing the longest formulas first |
| `--add-error` | | `False` | Inject errors into formulas before explanation, for testing/evaluation |
| `--error-rate` | | `0.5` | Probability (0.0-1.0) of injecting an error into each formula when `--add-error` is enabled |
| `--keep-temp` | `-k` | `False` | Keep temporary extracted TeX files after processing (the original PDF/tar.gz are always kept) |
| `--output-dir` | `-o` | `./output` | Directory to save output files |
Use GPT-4o with more context:
```bash
mai process 1706.03762 --model gpt-4o --context-words 500
```
Process with a custom output directory and keep temp files:
```bash
mai process 1706.03762 --output-dir ./my_papers --keep-temp
```
Batch-process multiple papers:
```bash
mai process 1706.03762 1508.06576 2010.11929 --model gpt-5
```
Use more workers for faster processing:
```bash
mai process 1706.03762 --max-workers 20
```
Explain more formulas from a paper:
```bash
mai process 1706.03762 --max-formulas 100
```
Inject errors for testing (50% error rate):
```bash
mai process 1706.03762 --add-error
```
Inject errors with a custom rate (30% chance):
```bash
mai process 1706.03762 --add-error --error-rate 0.3
```
Complete workflow example (process multiple papers to a custom directory):
```bash
# Process multiple papers with error injection to a custom dataset directory
mai process 2210.10000 2210.11111 2210.12222 --add-error --error-rate 0.5 -f 20 -o ./dataset/set1

# Then benchmark all papers in that directory
mai benchmark --all --output-dir ./dataset/set1 --model openai/gpt-4o
```
Notes:
- Higher `--max-workers` values speed up processing but may hit API rate limits. Start with the default (10) and increase if needed.
- `--max-formulas` prioritizes longer formulas first (more complex). Use it to control costs and focus on the most important formulas.
- `--add-error` is useful for testing an LLM's ability to detect mathematical errors or for creating evaluation datasets.
For each paper, the `process` command performs:
- Download - Fetches the PDF and LaTeX source from arXiv
- Consolidate - Merges multi-file LaTeX projects into a single `.tex` file
- Extract - Identifies and labels all mathematical formulas (filters out nested/overlapping formulas, keeping the outermost ones)
- Prioritize - Sorts formulas by length (longest first) and selects the top N formulas (controlled by `--max-formulas`); see the sketch after this list
- Inject Errors (optional, if `--add-error` is enabled) - Randomly modifies selected formulas with common mathematical errors
- Explain - Generates AI-powered explanations using the LLM (with concurrent processing for speed)
The explanation step:
- Prioritizes longer formulas (typically more complex and important)
- Replaces formula labels in the context with the original LaTeX before sending it to the LLM
- When `--add-error` is used, explains the ERROR-INJECTED formulas (not the originals)
- Uses concurrent API calls (controlled by `--max-workers`) to process multiple formulas simultaneously (sketched below)
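A minimal sketch of the concurrent explanation step; `explain_formula` here stands in for the real per-formula LLM call and is a placeholder, not this project's API:

```python
# Hypothetical sketch: explain formulas concurrently with a thread pool.
from concurrent.futures import ThreadPoolExecutor

def explain_formula(formula: dict) -> dict:
    # Placeholder for the actual LLM request made per formula.
    return {"label": formula["label"], "high_level_explanation": "..."}

def explain_all(formulas: list[dict], max_workers: int = 10) -> list[dict]:
    """Run up to `max_workers` explanation requests at the same time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(explain_formula, formulas))
```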
When --add-error is enabled, the tool can inject common mathematical errors into formulas before explanation. This is useful for:
- Testing LLM's ability to detect formula errors
- Creating evaluation datasets for mathematical reasoning
- Studying how errors affect formula understanding
Error types injected:
- Sign flipping (`+` ↔ `-`)
- Exponent order changes (e.g., `E(X)^2` ↔ `E(X^2)`)
- Operator swaps (`+` → `×`, `×` → `/`)
- Index changes (`x_i` → `x_j`)
- Inequality flips (`<` ↔ `>`, `≤` ↔ `≥`)
- Transpose errors (add/remove `^T`)
- Fraction inversions (`\frac{a}{b}` ↔ `\frac{b}{a}`)
- Sum/product swaps (`\sum` ↔ `\prod`)
- Missing parentheses (e.g., `(a+b)c` → `a+bc`)
- Function swaps (`sin` ↔ `cos`, `log` ↔ `ln`, `max` ↔ `min`)
Each formula has a probability (controlled by --error-rate) of receiving one random error.
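A minimal sketch of the per-formula injection logic; `apply_random_error` is a stand-in for the real error catalogue listed above:

```python
# Hypothetical sketch: with probability error_rate, inject one random error per formula.
import random

def apply_random_error(formula: str) -> tuple[str, str]:
    # Stand-in: a real implementation would choose among the error types above.
    return formula.replace("+", "-", 1), "sign_flip"

def inject_errors(formulas: list[dict], error_rate: float = 0.5, seed: int | None = None):
    rng = random.Random(seed)  # fixing the seed makes runs reproducible
    error_log = []
    for f in formulas:
        if rng.random() < error_rate:
            modified, error_type = apply_random_error(f["formula"])
            error_log.append({
                "label": f["label"],
                "original_formula": f["formula"],
                "modified_formula": modified,
                "error_type": error_type,
            })
            f["formula"] = modified
    return formulas, error_log
```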
For each processed paper, the following files are created in ./output/{paper_id}/:
Without `--add-error`:
```
output/
└── 1706.03762/
    ├── original/
    │   ├── 1706.03762.pdf              # Downloaded PDF
    │   └── 1706.03762.tar.gz           # Downloaded LaTeX source archive
    ├── 1706.03762_consolidated.tex     # Consolidated LaTeX file
    ├── 1706.03762_formulas.json        # Extracted formulas with labels
    ├── 1706.03762_labeled.tex          # TeX with formulas replaced by labels
    └── 1706.03762_explained.json       # AI-generated explanations
```
With `--add-error`:
```
output/
└── 1706.03762/
    ├── original/
    │   ├── 1706.03762.pdf                      # Downloaded PDF
    │   └── 1706.03762.tar.gz                   # Downloaded LaTeX source archive
    ├── 1706.03762_consolidated.tex             # Consolidated LaTeX file
    ├── 1706.03762_formulas.json                # Original extracted formulas (unchanged)
    ├── 1706.03762_formulas_with_errors.json    # Modified formulas with injected errors
    ├── 1706.03762_error_log.json               # Documentation of all errors injected
    ├── 1706.03762_labeled.tex                  # TeX with formulas replaced by labels
    └── 1706.03762_explained.json               # AI-generated explanations (of ERROR-INJECTED formulas)
```
- `original/{paper_id}.pdf` - Downloaded PDF from arXiv
- `original/{paper_id}.tar.gz` - Downloaded LaTeX source archive from arXiv
- `{paper_id}_consolidated.tex` - Single LaTeX file with all `\input{}` and `\include{}` resolved
- `{paper_id}_formulas.json` - JSON mapping of formula labels to metadata (formula text, type, line number, position)
- `{paper_id}_formulas_with_errors.json` - (Only with `--add-error`) Modified formulas with injected errors
- `{paper_id}_error_log.json` - (Only with `--add-error`) Documentation of which formulas were modified and what errors were injected
- `{paper_id}_labeled.tex` - LaTeX file with formulas replaced by labels like `<<FORMULA_0001>>`
- `{paper_id}_explained.json` - JSON with high-level explanations and notation definitions for each formula
Note: Notation keys in the `notations` dictionary use the exact LaTeX format from the formula. For example:
- Use `"f^*_{\\mathrm{NL}}"`, not `"f*_NL"`
- Use `"\\mathbf{Q}"`, not `"Q"` (if the formula has `\mathbf{Q}`)
- Use `"d_{model}"`, not `"d_model"` (preserves subscript braces)

Example `{paper_id}_explained.json`:
```json
{
"formulas": [
{
"label": "<<FORMULA_0009>>",
"formula": "\\mathrm{Attention}(Q, K, V) = \\mathrm{softmax}(\\frac{QK^T}{\\sqrt{d_k}})V",
"formula_type": "equation",
"is_formula": true,
"high_level_explanation": "This is the scaled dot-product attention mechanism...",
"notations": {
"Q": "Query matrix",
"K": "Key matrix",
"V": "Value matrix",
"d_k": "Dimension of key vectors (or 'NOT MENTIONED' if not defined in context)"
},
"model_used": "gpt-5",
"timestamp": "2025-10-27T02:00:45.110697"
}
],
"metadata": {
"model": "gpt-5",
"context_words": 300,
"total_analyzed": 59,
"formulas_explained": 42,
"notations_skipped": 15,
"failed": 2
},
"skipped_notations": [...],
"failed": [...]
}
```
Example `{paper_id}_error_log.json`:
```json
{
"metadata": {
"error_rate": 0.5,
"random_seed": null,
"total_formulas_processed": 50,
"formulas_modified": 23,
"formulas_unmodified": 27,
"timestamp": "2025-10-27T18:30:00.123456"
},
"errors": [
{
"label": "<<FORMULA_0009>>",
"original_formula": "\\mathrm{softmax}(\\frac{QK^T}{\\sqrt{d_k}})V",
"modified_formula": "\\mathrm{softmax}(\\frac{QK^T}{\\sqrt{d_k}})-V",
"error_type": "sign_flip",
"error_description": "Changed implicit '+' to '-' before V",
"line_number": 145,
"formula_type": "equation"
},
{
"label": "<<FORMULA_0021>>",
"original_formula": "\\mathrm{FFN}(x)=\\max(0, xW_1 + b_1) W_2 + b_2",
"modified_formula": "\\mathrm{FFN}(x)=\\min(0, xW_1 + b_1) W_2 + b_2",
"error_type": "function_swap",
"error_description": "Changed 'max' to 'min'",
"line_number": 198,
"formula_type": "equation"
}
],
"unmodified": [
"<<FORMULA_0001>>",
"<<FORMULA_0003>>",
...
]
}
```
The error log allows you to:
- Compare original vs. modified formulas
- Identify which formulas were changed and which errors were injected
- Evaluate an LLM's ability to detect specific error types (see the sketch below)
- Reproduce experiments with the same random seed
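For example, a minimal sketch of scoring detection results against the error log (the shape of the detections mapping is an assumption; only the `error_log.json` fields shown above come from this README):

```python
# Hypothetical sketch: recall of injected errors, given a {label: detected?} mapping.
import json

def detection_recall(error_log_path: str, detections: dict[str, bool]) -> float:
    with open(error_log_path) as fh:
        log = json.load(fh)
    injected = [entry["label"] for entry in log["errors"]]
    hits = sum(1 for label in injected if detections.get(label, False))
    return hits / len(injected) if injected else 0.0
```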
The mai benchmark command evaluates an LLM's ability to detect mathematical errors in formulas. This is particularly useful when combined with the --add-error flag to test error detection capabilities.
Single Paper Benchmarking:
```bash
# Benchmark a processed paper (default: OpenAI GPT-5)
mai benchmark output/1706.03762
# Use different OpenAI models
mai benchmark output/1706.03762 --model openai/gpt-4o
mai benchmark output/1706.03762 --model openai/gpt-4o-mini
# Use OpenRouter models (requires OPENROUTER_API_KEY)
mai benchmark output/1706.03762 --model openrouter/anthropic/claude-3.5-sonnet
mai benchmark output/1706.03762 --model openrouter/google/gemini-pro
# Use more workers for speed
mai benchmark output/1706.03762 --max-workers 20
```
Batch Benchmarking (All Papers):
```bash
# Benchmark ALL papers in default output directory
mai benchmark --all
# Benchmark papers in custom directory (e.g., dataset/set1)
mai benchmark --all --output-dir ./dataset/set1
# Use different model for batch benchmark
mai benchmark --all --model openai/gpt-4o
# Combine options - benchmark custom dataset with specific model
mai benchmark --all --output-dir ./dataset/set1 --model openrouter/anthropic/claude-3.5-sonnet --max-workers 20
```
Note: The `--output-dir` option works for both `process` and `benchmark --all`, allowing you to organize datasets in custom directories.
Model Format: `provider/model` (see the routing sketch below)
- OpenAI: `openai/gpt-5`, `openai/gpt-4o`, `openai/gpt-4o-mini`
- OpenRouter: `openrouter/anthropic/claude-3.5-sonnet`, `openrouter/google/gemini-pro`, etc.
- Backward compatibility: `gpt-5` defaults to `openai/gpt-5`
Environment Variables:
- OpenAI models: `OPENAI_API_KEY` (same as the `process` command)
- OpenRouter models: `OPENROUTER_API_KEY` (get your key at https://openrouter.ai/keys)
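A minimal sketch of how a `provider/model` string might be routed to the matching API key; the function below is illustrative, not the tool's actual resolver:

```python
# Hypothetical sketch: map a "provider/model" spec to the API key it needs.
import os

def resolve_model(spec: str) -> tuple[str, str, str]:
    """Return (provider, model, env_var) for a spec like 'openai/gpt-4o'."""
    if "/" not in spec:          # backward compatibility: bare "gpt-5"
        spec = f"openai/{spec}"
    provider, model = spec.split("/", 1)
    env_var = "OPENROUTER_API_KEY" if provider == "openrouter" else "OPENAI_API_KEY"
    if not os.environ.get(env_var):
        raise RuntimeError(f"{env_var} is not set")
    return provider, model, env_var

# resolve_model("openrouter/anthropic/claude-3.5-sonnet")
# -> ("openrouter", "anthropic/claude-3.5-sonnet", "OPENROUTER_API_KEY")
```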
Single Paper Mode:
- Loads formulas from `{paper_id}_explained.json`
- Extracts context (300 words before/after) from `consolidated_labeled.tex` for each formula (see the sketch below)
- Asks the LLM to detect whether each formula contains a mathematical error
- Saves results to `benchmarks/{model_name}/error_detection.json`
- Calculates metrics (if `error_log.json` exists) and saves them to `benchmarks/{model_name}/benchmark_report.json`
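A minimal sketch of the context-extraction idea (a hypothetical helper; it only reflects the "N words before/after a formula label" behavior described above):

```python
# Hypothetical sketch: grab up to N words of context on each side of a formula label.
def extract_context(labeled_tex: str, label: str, context_words: int = 300) -> str:
    before, sep, after = labeled_tex.partition(label)
    if not sep:
        return ""  # label not found in the TeX source
    left = before.split()[-context_words:]
    right = after.split()[:context_words]
    return " ".join(left + [label] + right)
```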
Batch Mode (--all):
- Scans the output directory for all processed papers (papers with a `_explained.json`)
- Runs the benchmark on each paper sequentially (individual results are saved in each paper's directory)
- Aggregates metrics across all papers (mean, std, min, max; see the sketch below)
- Saves the aggregate report to `output/aggregate_benchmarks/{model_name}/`
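A minimal sketch of the aggregation step over per-paper accuracies (the helper is illustrative; only the mean/std/min/max behavior comes from the description above):

```python
# Hypothetical sketch: aggregate per-paper accuracy into mean/std/min/max.
from statistics import mean, stdev

def aggregate_accuracy(per_paper: dict[str, float]) -> dict[str, float]:
    values = list(per_paper.values())
    return {
        "mean_accuracy": mean(values),
        "std_accuracy": stdev(values) if len(values) > 1 else 0.0,
        "min_accuracy": min(values),
        "max_accuracy": max(values),
    }

print(aggregate_accuracy({"1706.03762": 0.78, "2010.11929": 0.87, "1508.06576": 0.82}))
```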
Single Paper:
```
output/{paper_id}/
└── benchmarks/
    └── {model_name}/
        ├── error_detection.json     # Detection results (without raw responses)
        ├── raw_responses.json       # All model raw outputs
        ├── parsing_failures.log     # Detailed log of parsing failures
        ├── benchmark_report.json    # Metrics report
        └── summary.txt              # Human-readable summary
```
Batch Benchmark:
```
output/
├── {paper_id_1}/
│   └── benchmarks/{model_name}/...  # Individual paper results
├── {paper_id_2}/
│   └── benchmarks/{model_name}/...  # Individual paper results
└── aggregate_benchmarks/
    └── {model_name}/
        ├── aggregate_report.json    # Combined metrics across all papers
        ├── per_paper_summary.json   # Individual paper metrics
        └── aggregate_summary.txt    # Human-readable aggregate summary
```
Single Paper Metrics (when ground truth is available from --add-error):
Binary Classification:
- Accuracy: Overall correctness of error detection
- Precision: Of detected errors, how many were actually errors
- Recall: Of actual errors, how many were detected
- F1 Score: Harmonic mean of precision and recall
Confusion Matrix (these counts determine the metrics above; see the sketch after this list):
- TP (True Positive): Correctly detected errors
- FP (False Positive): False alarms (detected error where none exists)
- TN (True Negative): Correctly identified correct formulas
- FN (False Negative): Missed errors
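As a reference, a minimal sketch of how these metrics follow from the confusion-matrix counts:

```python
# Standard binary-classification metrics from confusion-matrix counts.
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1_score": f1}
```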
Error Type Matching:
- Accuracy of identifying the specific error type (sign_flip, operator_swap, etc.)
Instruction Following:
- Perfect JSON Rate: Percentage of responses that followed JSON format perfectly
- Fallback Rate: Percentage requiring fallback parsing strategies
- Failure Rate: Percentage of complete parsing failures
Per-Error-Type Performance:
- Detection recall for each error type separately
- Identifies which types of errors are easiest/hardest to detect
Batch Aggregate Metrics (across all papers):
Mean & Standard Deviation:
- Average performance metrics across all papers with std deviation
- Shows consistency of model performance
Accuracy Range:
- Minimum and maximum accuracy observed across papers
- Helps identify if model performs consistently
Aggregated Per-Error-Type Recall:
- Combined detection rates for each error type across all papers
- Larger sample size for more reliable error type analysis
Per-Paper Summary:
- Individual accuracy and F1 scores for each paper
- Quick overview of which papers were harder/easier
Single Paper Workflow:
```bash
# Step 1: Process paper with error injection
mai process 1706.03762 --add-error --error-rate 0.5
# Step 2: Benchmark error detection
mai benchmark output/1706.03762
# Step 3: View results
cat output/1706.03762/benchmarks/openai_gpt-5/summary.txt
cat output/1706.03762/benchmarks/openai_gpt-5/benchmark_report.json
```
Batch Benchmarking Workflow:
```bash
# Step 1: Process multiple papers with error injection
mai process 1706.03762 2010.11929 1508.06576 --add-error --error-rate 0.5
# Step 2: Benchmark all papers at once
mai benchmark --all --model openai/gpt-4o
# Step 3: View aggregate results
cat output/aggregate_benchmarks/openai_gpt-4o/aggregate_summary.txt
# Optional: Compare different models
mai benchmark --all --model openrouter/anthropic/claude-3.5-sonnet
cat output/aggregate_benchmarks/openrouter_anthropic_claude-3.5-sonnet/aggregate_summary.txt
```
Custom Dataset Workflow (Organized Directory Structure):
```bash
# Step 1: Create dataset in custom directory with specific settings
mai process 2210.10000 2210.11111 2210.12222 2210.13333 \
--add-error --error-rate 0.5 -f 20 -o ./dataset/set1
# Step 2: Benchmark the entire dataset
mai benchmark --all --output-dir ./dataset/set1 --model openai/gpt-4o
# Step 3: View results
cat dataset/set1/aggregate_benchmarks/openai_gpt-4o/aggregate_summary.txt
# Optional: Test with different model on same dataset
mai benchmark --all --output-dir ./dataset/set1 --model openai/gpt-4o-mini
```
Single Paper Report:
```json
{
"binary_classification": {
"accuracy": 0.84,
"precision": 0.857,
"recall": 0.783,
"f1_score": 0.818,
"true_positives": 5,
"false_positives": 0,
"true_negatives": 2,
"false_negatives": 12
},
"error_type_matching": {
"type_accuracy": 0.714,
"correct_type_identified": 10,
"total_errors_detected": 14
},
"instruction_following": {
"perfect_json_rate": 0.895,
"fallback_rate": 0.105,
"failure_rate": 0.0
},
"per_error_type_performance": {
"sign_flip": {"detected": 4, "total": 5, "recall": 0.8},
"operator_swap": {"detected": 3, "total": 4, "recall": 0.75},
"exponent_order": {"detected": 2, "total": 3, "recall": 0.667}
}
}
```
Aggregate Report (Batch):
```json
{
"model": "openai/gpt-4o",
"metadata": {
"total_papers_in_directory": 3,
"successful_benchmarks": 3,
"failed_benchmarks": 0,
"no_ground_truth": 0
},
"aggregate_metrics": {
"binary_classification": {
"mean_accuracy": 0.823,
"std_accuracy": 0.045,
"mean_precision": 0.841,
"mean_recall": 0.795,
"mean_f1_score": 0.817,
"min_accuracy": 0.78,
"max_accuracy": 0.87
},
"instruction_following": {
"mean_perfect_json_rate": 0.913,
"mean_fallback_rate": 0.087,
"mean_failure_rate": 0.0
}
},
"per_error_type_performance": {
"sign_flip": {"detected": 12, "total": 15, "recall": 0.8},
"operator_swap": {"detected": 9, "total": 12, "recall": 0.75}
},
"paper_ids": ["1706.03762", "2010.11929", "1508.06576"]
}
```
Use cases:
- Evaluate different LLM models' error detection capabilities
- Test if errors are being explained correctly
- Identify which error types are most challenging to detect
- Create datasets for mathematical reasoning research
- Compare model performance across multiple papers for robust evaluation
- Analyze consistency of model performance (via std deviation metrics)
View all available commands:
```bash
mai --help
```
Get help for a specific command:
```bash
mai process --help
```
Run tests:
```bash
uv run python test_explainer.py
```

License: MIT