Machine learning system for extracting and predicting copolymerization reactivity ratios from scientific literature.
This project combines automated literature data extraction with machine learning to predict copolymerization reactivity patterns. The system extracts reactivity ratio data (r₁, r₂) from scientific papers and trains models to predict the r-product (r₁ · r₂) class, which indicates the copolymerization behavior:
- Class 0: r₁·r₂ < 1 → Alternating copolymerization
- Class 1: 1 ≤ r₁·r₂ ≤ 25 → Random to weak block formation
- Class 2: r₁·r₂ > 25 → Strong block formation
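A minimal sketch of this class assignment in Python, with the thresholds taken directly from the definitions above:

```python
def r_product_class(r1: float, r2: float) -> int:
    """Map reactivity ratios to the r-product class defined above."""
    r_product = r1 * r2
    if r_product < 1:
        return 0  # alternating copolymerization
    elif r_product <= 25:
        return 1  # random to weak block formation
    return 2      # strong block formation

# Example: a pair with r1*r2 < 1 falls into the alternating class
print(r_product_class(0.5, 0.4))  # -> 0
```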
- ✅ Automated literature data extraction
- ✅ ML-based reactivity prediction (78.6% accuracy)
- ✅ REST API for predictions
- ✅ Comprehensive analysis tools
```bash
# Clone repository
git clone https://github.com/lamalab-org/copolymer-reactivity
cd copolymer-reactivity

# Install package
pip install -e .

# Or install from GitHub directly
pip install git+https://github.com/lamalab-org/copolymer-reactivity
```

The fastest way to use the trained model:
```bash
cd copol_prediction/api
pip install -r requirements.txt
./start.sh
```

Then open http://localhost:8000/docs for interactive API documentation.
See copol_prediction/api/README.md for complete API documentation.
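If you prefer calling the API from Python rather than curl, here is a minimal client sketch. The `{"features": {...}}` payload shape follows the curl example further down; the response handling is an assumption:

```python
import requests

# Feature values here are illustrative; the full 15-key dictionary
# is documented in the usage example below.
payload = {"features": {"temperature": 60.0}}  # ...plus the other 14 features

resp = requests.post("http://localhost:8000/predict", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())
```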
```python
from copolpredictor.inference import CopolymerPredictor

# Load trained model
predictor = CopolymerPredictor("copol_prediction/artifacts/model_bundle")

# Make prediction
features = {...}  # 15 molecular and reaction features
result = predictor.predict_with_confidence(features)
print(f"Predicted class: {result['predictions'][0]}")
print(f"Confidence: {result['confidence'][0]:.2%}")
```

```
├── copol_prediction/        # ML prediction pipeline
│   ├── api/                 # REST API (FastAPI)
│   ├── analysis/            # Model analysis tools
│   ├── utils/               # Utility functions
│   ├── artifacts/           # Trained models & data splits
│   └── README.md            # Prediction pipeline docs
│
├── data_extraction/         # Literature data extraction
│   ├── obtain_data.py       # Main extraction script
│   ├── output/              # Extracted data
│   └── README.md            # Extraction docs
│
├── experiments/             # Experiments & filter sweeps
│   ├── sweep_filters.py     # Test filter combinations
│   ├── baseline/            # Baseline models
│   ├── fingerprint/         # Fingerprint-based models
│   └── README.md            # Experiments docs
│
├── src/                     # Core libraries
│   ├── copolextractor/      # Data extraction library
│   └── copolpredictor/      # ML prediction library
├── tests/                   # Unit tests
├── dump/                    # Legacy code (archived)
└── pyproject.toml           # Package configuration
```
Location: copol_prediction/api/
REST API for making predictions:
```bash
cd copol_prediction/api
./start.sh
# Open http://localhost:8000/docs
```

Features:
- FastAPI with automatic validation
- Batch predictions
- Docker deployment ready
📖 Full documentation: copol_prediction/api/README.md
Location: copol_prediction/
Complete machine learning pipeline for training and evaluating models:
```bash
cd copol_prediction

# Train production model (~20 min)
python train_final_model.py

# Run analysis
cd analysis && python analyze_model.py --all

# Test filter combinations (~3 hours)
cd ../../experiments && python sweep_filters.py
```

Key Scripts:
- `train_final_model.py` - Train production model with automatic analysis
- `monomer_feature_calculation.py` - Calculate molecular features
- `analysis/analyze_model.py` - Generate analysis plots
- `sweep_filters.py` - Test all filter combinations
📖 Full documentation: copol_prediction/README.md
Location: data_extraction/
Automated extraction of copolymerization data from scientific literature:
```bash
cd data_extraction
python obtain_data.py
```

Features:
- CrossRef API integration
- LLM-based data extraction
- Automatic monomer name resolution
- Confidence scoring
📖 Full documentation: data_extraction/README.md
Location: src/
Two main Python packages:
`copolextractor` - Library for extracting copolymerization data from literature.
```python
from copolextractor import crossref_search, utils

# Search for papers
papers = crossref_search.search_papers("copolymerization")

# Resolve monomer names
smiles = utils.name_to_smiles("styrene")
```

`copolpredictor` - Library for ML-based reactivity prediction.
```python
from copolpredictor.inference import CopolymerPredictor
from copolpredictor import data_processing, model_training

# Inference
predictor = CopolymerPredictor("path/to/model")
result = predictor.predict(features)

# Training
model = model_training.train_final_model(X, y, params)
```

Modules:
- `data_processing.py` - Data loading & preprocessing
- `data_augmentation.py` - Gaussian augmentation
- `model_training.py` - Model training & hyperparameter optimization
- `evaluation.py` - Metrics & evaluation
- `inference.py` - Production inference
- `calibration.py` - Probability calibration
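The Gaussian augmentation in `data_augmentation.py` amounts to appending noise-perturbed copies of the training rows. A minimal sketch of the idea; the function name, noise scale, and column handling below are assumptions, not the library's actual API:

```python
import numpy as np
import pandas as pd

def gaussian_augment(X: pd.DataFrame, y: pd.Series, n_copies: int = 2,
                     noise_scale: float = 0.05, seed: int = 0):
    """Append noisy copies of each sample; labels stay unchanged."""
    rng = np.random.default_rng(seed)
    per_column_std = X.std(axis=0).to_numpy()
    parts_X, parts_y = [X], [y]
    for _ in range(n_copies):
        # Noise scaled per feature, so wide and narrow columns are treated alike
        noise = rng.normal(0.0, noise_scale * per_column_std, size=X.shape)
        parts_X.append(X + noise)
        parts_y.append(y)
    return (pd.concat(parts_X, ignore_index=True),
            pd.concat(parts_y, ignore_index=True))
```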
Current production model (trained 2025-11-14):
| Metric | Value |
|---|---|
| Holdout Accuracy | 78.6% |
| Holdout F1 (weighted) | 79.0% |
| Holdout F1 (macro) | 67.8% |
| CV Score | 84.6% |
| Features | 15 |
| Classes | 3 |
Model location: copol_prediction/artifacts/model_bundle/
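The gap between weighted and macro F1 reflects class imbalance: macro-averaging weights each class equally, so weaker minority classes pull it down, while weighted averaging favors the majority class. A toy illustration with scikit-learn (the labels are made up, not the real holdout set):

```python
from sklearn.metrics import f1_score

# Imbalanced toy labels: class 0 dominates
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 2]

print(f1_score(y_true, y_pred, average="weighted"))  # ~0.79, dominated by class 0
print(f1_score(y_true, y_pred, average="macro"))     # ~0.77, every class counts equally
```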
The model uses 15 molecular and reaction features:
Molecular descriptors (6):
- `fukui_radical_max_1/2` - Fukui radical indices
- `delta_HOMO_LUMO_AA/AB/BB/BA` - Orbital interactions

Reaction conditions (9):
- `temperature` - Reaction temperature (°C)
- `solvent_logP`, `solvent_TPSA`, `solvent_HBD`, `solvent_FractionCSP3` - Solvent properties
- `polytype_emb_1/2` - Polymerization type embeddings
- `method_emb_1/2` - Method embeddings
```python
from copolpredictor.inference import CopolymerPredictor

# Initialize predictor
predictor = CopolymerPredictor("copol_prediction/artifacts/model_bundle")

# Make prediction
features = {
    "fukui_radical_max_1": 0.15,
    "fukui_radical_max_2": 0.18,
    "delta_HOMO_LUMO_AA": -5.2,
    "delta_HOMO_LUMO_AB": -4.8,
    "delta_HOMO_LUMO_BB": -5.5,
    "delta_HOMO_LUMO_BA": -4.9,
    "temperature": 60.0,
    "polytype_emb_1": 0.23,
    "polytype_emb_2": -0.15,
    "method_emb_1": 0.45,
    "method_emb_2": -0.32,
    "solvent_logP": 2.1,
    "solvent_TPSA": 20.5,
    "solvent_HBD": 0.0,
    "solvent_FractionCSP3": 0.67,
}

result = predictor.predict_with_confidence(features)
print(f"Class: {result['predictions'][0]}")
print(f"Confidence: {result['confidence'][0]:.2%}")
```

```bash
# Start API
cd copol_prediction/api && ./start.sh

# Make prediction
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"features": {...}}'
```
```bash
# Train new model
cd copol_prediction
python train_final_model.py

# Extract data from literature
cd ../data_extraction
python obtain_data.py

# Run experiments
cd ../experiments
python sweep_filters.py
```

Each major component has its own detailed README:
| Component | Documentation |
|---|---|
| Prediction API | copol_prediction/api/README.md |
| ML Pipeline | copol_prediction/README.md |
| Data Extraction | data_extraction/README.md |
| Experiments | experiments/README.md |
```bash
cd copol_prediction/api
docker-compose up -d
```

The API will be available at http://localhost:8000.
```bash
# Install dependencies
pip install -r copol_prediction/api/requirements.txt

# Start with Gunicorn (4 workers)
cd copol_prediction/api
gunicorn app:app \
  --workers 4 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000
```

The system has extracted and processed data from ~400 scientific papers:
- Location: `data_extraction/output/copol_database/`
- Format: JSON files (one per paper)
- Contents: Monomer pairs, reactivity ratios, reaction conditions

- Location: `copol_prediction/output/processed_data.csv`
- Samples: ~1,100 copolymerization reactions
- Features: Molecular descriptors + reaction conditions
- Labels: r-product class (0, 1, or 2)

Centralized train/test split:

- Location: `copol_prediction/artifacts/data_splits/`
- Split: ~80% train / ~20% test
- Method: Group-based (by `reaction_id`) to prevent data leakage
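Group-based splitting keeps all rows from one reaction on the same side of the split, so the test set never contains near-duplicates of training rows. A sketch of how such a split can be produced with scikit-learn; the saved splits in `artifacts/data_splits/` are authoritative, and the column names here are taken from the description above:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("copol_prediction/output/processed_data.csv")

# All rows sharing a reaction_id stay together in train or test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["reaction_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]
```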
The experiments/ directory contains baseline comparisons:
- Baseline models: Simple feature sets
- Fingerprint models: Morgan fingerprints
- Filter sweeps: Testing 16 combinations of preprocessing filters
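For context, a Morgan fingerprint for a monomer can be generated with RDKit along these lines; the SMILES, radius, and bit count below are illustrative, not necessarily the settings used in the experiments:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Styrene as an example monomer
mol = Chem.MolFromSmiles("C=Cc1ccccc1")

# radius=2, 2048 bits are common defaults for Morgan fingerprints
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
print(fp.GetNumOnBits())
```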
Run all experiments:
```bash
cd experiments
./run_all.sh
```

Comprehensive analysis tools in copol_prediction/analysis/:
```bash
cd copol_prediction
python analysis/analyze_model.py --all --compare-holdout
```

Generated plots:
- Confusion matrices
- Confidence distributions
- Feature importance
- Calibration curves
- Error analysis
- Confidence filtering
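If you want to recompute a calibration curve outside `analyze_model.py`, scikit-learn's `calibration_curve` is the standard building block. A sketch with toy binary data (the project's own plots come from the analysis script):

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Toy probabilities; labels drawn so the model is well calibrated by construction
rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 500)
y_true = (rng.uniform(0, 1, 500) < y_prob).astype(int)

prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
print(np.round(prob_true, 2))  # observed accuracy per bin
print(np.round(prob_pred, 2))  # mean predicted confidence per bin
```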
```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in editable mode with dev dependencies
pip install -e .
pip install pytest black isort flake8

# Install pre-commit hooks (optional)
pip install pre-commit
pre-commit install
```

- `src/copolextractor/` - Data extraction library
- `src/copolpredictor/` - ML prediction library
- `copol_prediction/` - Training & analysis scripts
- `data_extraction/` - Extraction scripts
- `experiments/` - Baseline experiments
- `tests/` - Unit tests
- Add feature calculation in `src/copolpredictor/data_processing.py`
- Update feature list in `prediction_utils.py`
- Retrain model with `copol_prediction/train_final_model.py`
- Update API documentation
If you use this code in your research, please cite:
```bibtex
@software{wilhelmi2025copolymer,
  author = {Schilling-Wilhelmi, Mara and Jablonka, Kevin M.},
  title = {Copolymerization Reactivity Prediction},
  year = {2025},
  url = {https://github.com/lamalab-org/copolymer-reactivity}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
If you encounter any problems or have suggestions, please open an issue.
Mara Schilling-Wilhelmi - [email protected]
Project Link: https://github.com/lamalab-org/copolymer-reactivity
- API Documentation - REST API usage
- ML Pipeline - Model training & evaluation
- Data Extraction - Literature data extraction
- Experiments - Baseline comparisons
- Interactive API: http://localhost:8000/docs (when running)
- Model Performance: `copol_prediction/output/analysis/`
- Extracted Data: `data_extraction/output/copol_database/`
- Trained Model: `copol_prediction/artifacts/model_bundle/`