Beyond Accuracy and Complexity: The Effective Information Criterion for Structurally Stable Symbolic Regression

This repository contains the official implementation for the paper: Beyond Accuracy and Complexity: The Effective Information Criterion for Structurally Stable Symbolic Regression

Symbolic regression (SR) aims to discover interpretable mathematical formulas from data. We propose the Effective Information Criterion (EIC), a new measure that detects unreasonable structures in formulas by analyzing their numerical stability. EIC improves both performance and sample-efficiency across search-based and generative SR methods, and aligns well with human expert preference for interpretability.


[2026.03.30 NEWS!] During rebuttal, we conducted a rigorous audit of our PySR evaluation and discovered two critical issues: (1) a bug in our automated script failed to properly call PySR under noisy conditions, and (2) PySR defaults to multi-processing across all CPU cores, skewing duration comparisons. We have uploaded the corrected PySR results to data/srbench/pysr_results.csv.gz. The corrected evaluation reveals that while PySR discovers more compact formulas, its accuracy plummets under noise (dropping to $R^2=0.0722$ at 0.1 noise), whereas EIC-MCTS maintains robust accuracy ($R^2=0.5012$). When evaluated fairly on a single core, PySR's search time becomes comparable to EIC-GP. See the updated comparison table and analysis in the paper revision.
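To make the "0.1 noise" setting concrete, here is a minimal sketch of how target noise at a given level might be injected and how $R^2$ is then measured. It assumes the common benchmark convention that the noise is Gaussian with standard deviation equal to the noise level times the root-mean-square of the target; the paper's exact protocol may differ. The helper names `add_target_noise` and `r2` are hypothetical and are not part of this repository.

```python
import numpy as np

def add_target_noise(y, level, rng):
    """Return y plus Gaussian noise with std = level * RMS(y) (assumed convention)."""
    y = np.asarray(y, dtype=float)
    scale = level * np.sqrt(np.mean(np.square(y)))
    return y + rng.normal(0.0, scale, size=y.shape)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(1000, 3))
# Toy target, same form as the example formula used later in this README.
y_clean = x[:, 0] + x[:, 1] * np.sin(x[:, 2])
y_noisy = add_target_noise(y_clean, level=0.1, rng=rng)

# Even a perfect formula cannot reach R^2 = 1 against noisy targets.
r2_of_true_formula = r2(y_noisy, y_clean)
```

A formula that overfits the noisy training targets will typically score well below this ceiling on clean test data, which is the kind of degradation the corrected comparison measures.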

[2026.03.30 NEWS!] We have added experiments comparing EIC with Order of Nonlinearity (ON), an alternative complexity metric. Results show that replacing EIC with ON fails to improve search performance as EIC does—likely because widespread nonlinear functions (sin, tan, sqrt) in SRbench problems evade ON's polynomial-fit assumptions. EIC-GP achieves $R^2>0.99$ rate of 0.7288 with complexity 32.47, while ON-GP drops to 0.7117 with complexity 30.28. Similar trends hold for MCTS variants. Full experimental setup and results are detailed in data/srbench/on_results.md.
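The "$R^2>0.99$ rate" quoted above can be read as the fraction of runs whose test $R^2$ exceeds 0.99. A minimal sketch of that aggregation, assuming one $R^2$ value per run (the function name `solution_rate` and the example scores are hypothetical):

```python
import numpy as np

def solution_rate(r2_scores, threshold=0.99):
    """Fraction of runs whose R^2 is strictly above the threshold.

    NaN scores (e.g. failed runs) compare as False and so count as misses.
    """
    scores = np.asarray(r2_scores, dtype=float)
    return float(np.mean(scores > threshold))

# Illustrative values only; these are not results from the paper.
example_r2 = [0.999, 0.95, 0.9931, 0.40]
rate = solution_rate(example_r2)
```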

Install

We recommend using conda to set up the environment:

conda create -p ./venv python=3.12 -y
conda activate ./venv
pip install numpy matplotlib seaborn pandas scikit-learn numexpr sympytorch pyyaml tqdm setproctitle dotenv requests torch torchvision uvicorn fastapi pmlb

Experiments

EIC for Evaluating Existing SR Methods

You can calculate the EIC values of formulas discovered by existing SR methods on SRbench by:

# White-box results
python eic_for_evaluate.py --name whitebox_results --data_path ./data/srbench/whitebox_results_full.csv.gz
# Black-box results
python eic_for_evaluate.py --name blackbox_results --data_path ./data/srbench/blackbox_results_full.csv.gz

You can skip the step above, since we have already provided the EIC values in ./data/srbench/whitebox_results_full.csv.gz and ./data/srbench/blackbox_results_full.csv.gz.

You can therefore use eic_for_evaluate.ipynb directly to plot the EIC distributions of formulas discovered by existing SR methods.
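If you prefer a quick summary outside the notebook, the sketch below shows one way to aggregate EIC values per method with pandas. The column names `algorithm` and `eic` are assumptions made for illustration; check the actual header of the CSV files before using it. The toy DataFrame stands in for the real data.

```python
import pandas as pd

def summarize_eic(df, group_col="algorithm", eic_col="eic"):
    """Per-method count, median, and mean of EIC, sorted by median."""
    return (
        df.groupby(group_col)[eic_col]
        .agg(["count", "median", "mean"])
        .sort_values("median")
    )

# Toy data standing in for the real results (values are illustrative only).
toy = pd.DataFrame(
    {
        "algorithm": ["A", "A", "B", "B", "B"],
        "eic": [1.0, 3.0, 2.0, 4.0, 6.0],
    }
)
summary = summarize_eic(toy)
print(summary)

# With the real file (column names must be adjusted to the actual header):
# df = pd.read_csv("./data/srbench/whitebox_results_full.csv.gz")
# print(summarize_eic(df, group_col="...", eic_col="..."))
```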

EIC for Search-based SR Methods

EIC-MCTS

You can run Monte-Carlo Tree Search (MCTS) with and without EIC by:

conda activate ./venv
python eic_for_heuristic.py --method eic-mcts --name mcts_wo_eic --function "f=x1+x2*sin(x3)" --save_dir logs/eic-mcts/
python eic_for_heuristic.py --method eic-mcts --name mcts_with_eic --alpha 0.01 --function "f=x1+x2*sin(x3)" --save_dir logs/eic-mcts/

The results will be saved in ./logs/eic-mcts/.
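To compare several EIC weights in one go, a small driver can assemble the commands above programmatically. This is only a sketch: it uses the flags shown in this README, assumes that omitting --alpha corresponds to the run without EIC (as in the first command above), and the alpha values other than 0.01 and the generated run names are hypothetical.

```python
import shlex

def build_command(method, name, function, save_dir, alpha=None):
    """Assemble an eic_for_heuristic.py invocation as an argument list."""
    cmd = ["python", "eic_for_heuristic.py", "--method", method, "--name", name]
    if alpha is not None:
        # Assumption: runs without --alpha are the no-EIC baseline.
        cmd += ["--alpha", str(alpha)]
    cmd += ["--function", function, "--save_dir", save_dir]
    return cmd

function = "f=x1+x2*sin(x3)"
save_dir = "logs/eic-mcts/"

commands = [build_command("eic-mcts", "mcts_wo_eic", function, save_dir)]
for alpha in (0.001, 0.01, 0.1):  # hypothetical sweep values
    name = f"mcts_with_eic_alpha_{alpha}"
    commands.append(build_command("eic-mcts", name, function, save_dir, alpha=alpha))

for cmd in commands:
    # Print shell-quoted commands; nothing is executed here.
    print(shlex.join(cmd))
```

To actually launch the runs, pass each list to `subprocess.run(cmd, check=True)` from the repository root with the conda environment activated.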

You can also run it on SRbench problems:

# Feynman Whitebox Problems
python eic_for_heuristic.py --method eic-mcts --name mcts_wo_eic_on_feynman_I_34_1 --function "feynman_I_34_1" --save_dir logs/eic-mcts/
# Strogatz Whitebox Problems
python eic_for_heuristic.py --method eic-mcts --name mcts_wo_eic_on_strogatz_vdp1 --function "strogatz_vdp1" --save_dir logs/eic-mcts/
# Blackbox Problems
python eic_for_heuristic.py --method eic-mcts --name mcts_wo_eic_on_first_principles_kepler --function "first_principles_kepler" --save_dir logs/eic-mcts/

If PMLB raises a ValueError saying "Dataset not found in PMLB", make sure the PMLB datasets have been downloaded to ./data/pmlb/datasets/*/*.tsv.gz.
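Before launching a long run, it can help to check that the dataset file is actually present. The sketch below assumes the layout follows the pattern above with one folder per dataset containing a file of the same name (`<name>/<name>.tsv.gz`); the helper `find_local_dataset` is hypothetical and not part of this repository. The demo runs against a temporary directory so it does not depend on your local files.

```python
import tempfile
from pathlib import Path

def find_local_dataset(name, root="./data/pmlb/datasets"):
    """Return the path to <root>/<name>/<name>.tsv.gz if it exists, else None."""
    path = Path(root) / name / f"{name}.tsv.gz"
    return path if path.is_file() else None

# Self-contained demo using a throwaway directory with one fake dataset file.
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp) / "strogatz_vdp1"
    dataset_dir.mkdir()
    (dataset_dir / "strogatz_vdp1.tsv.gz").touch()

    found = find_local_dataset("strogatz_vdp1", root=tmp)
    missing = find_local_dataset("feynman_I_34_1", root=tmp)
    found_name = found.name if found is not None else None

print("found:", found_name, "| missing:", missing)
```

In practice, call `find_local_dataset("strogatz_vdp1")` with the default root from the repository directory; a `None` result means the file still needs to be downloaded.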

EIC-GP

You can also run Genetic Programming (GP) with and without EIC by:

conda activate ./venv
python eic_for_heuristic.py --method eic-gp --name gp_wo_eic --function "f=x1+x2*sin(x3)" --save_dir logs/eic-gp/
python eic_for_heuristic.py --method eic-gp --name gp_with_eic --alpha 0.01 --function "f=x1+x2*sin(x3)" --save_dir logs/eic-gp/

The results will be saved in ./logs/eic-gp/.

EIC for Pre-trained Generative SR Methods

EIC for E2ESR

You can train the End-to-End Symbolic Regression (E2ESR) model with randomly generated, EIC-filtered, and PhyE2E-generated formulas by:

export CUDA_VISIBLE_DEVICES=0
conda activate ./venv
python eic_for_generative.py --method e2esr --exp_name e2esr_wo_eic --dump_path logs/e2esr --eval_size 32 --n_steps_per_epoch 3000 --lr 1e-5 --batch_size 64 --max_trials 5 --num_workers 0
python eic_for_generative.py --method e2esr --exp_name e2esr_with_eic --eic_threshold 2.0 --dump_path logs/e2esr --eval_size 32 --n_steps_per_epoch 3000 --lr 1e-5 --batch_size 64 --max_trials 5 --num_workers 0
python eic_for_generative.py --method e2esr --exp_name e2esr_with_phye2e --use_phye2e_data --dump_path logs/e2esr --eval_size 32 --n_steps_per_epoch 3000 --lr 1e-5 --batch_size 64 --max_trials 5 --num_workers 0

The results will be saved in ./logs/e2esr/.

EIC for SNIP

You can also train the Symbolic-Numeric Integrated Pre-training model (SNIP) with randomly generated and EIC-filtered formulas by:

export CUDA_VISIBLE_DEVICES=0
conda activate ./venv
python eic_for_generative.py --method snip --exp_name snip_wo_eic --dump_path logs/snip --snip_loss 1.0 --eval_size 32 --n_steps_per_epoch 3000 --lr 1e-5 --batch_size 64 --max_trials 5 --num_workers 0
python eic_for_generative.py --method snip --exp_name snip_with_eic --eic_threshold 2.0 --dump_path logs/snip --snip_loss 1.0 --eval_size 32 --n_steps_per_epoch 3000 --lr 1e-5 --batch_size 64 --max_trials 5 --num_workers 0

The results will be saved in ./logs/snip/.

EIC for SR4MDL

You can train the Symbolic Regression for Minimal Description Length model (SR4MDL) with randomly generated and EIC-filtered formulas by:

conda activate ./venv
python eic_for_generative.py --method sr4mdl --name sr4mdl_wo_eic --save_dir logs/sr4mdl/ --lr 2e-4 --dropout 0.5 --batch_size 128 --test_per_step 200 --device cuda:0
python eic_for_generative.py --method sr4mdl --name sr4mdl_with_eic --eic_threshold 2.0 --save_dir logs/sr4mdl/ --lr 2e-4 --dropout 0.5 --batch_size 128 --test_per_step 200 --device cuda:0

The results will be saved in ./logs/sr4mdl/.

EIC in Alignment with Human Experts

You can launch a local web interface for expert evaluation at http://localhost:23333 using:

python eic_for_interpretability.py

The results will be saved in ./logs/expert_rating/. We also provide the original rating results in ./data/expert_rating/human_expert_results.csv.