Extract kinetics data from PDFs using LLMs.
git clone https://github.com/ChemBioHTP/EnzyExtract
cd EnzyExtract
pip install -e .
Next, create a .env file in the project root and add your OpenAI API key:
OPENAI_API_KEY=...
A preliminary tutorial is available on Google Colab or downloadable here.
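
As a quick sanity check for the .env step, the sketch below parses the file with the standard library and reports whether OPENAI_API_KEY is set. It is not part of EnzyExtract: the helper names `read_env_file` and `has_openai_key` are hypothetical, and how the package itself loads the key is not covered here.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines into a dict (no shell expansion)."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_openai_key(path=".env"):
    """Return True if the file exists and defines a non-empty OPENAI_API_KEY."""
    p = Path(path)
    return p.is_file() and bool(read_env_file(p).get("OPENAI_API_KEY"))


if __name__ == "__main__":
    print("OPENAI_API_KEY found" if has_openai_key() else "OPENAI_API_KEY missing")
```

Run it from the project root; it only inspects the file and never prints the key itself.
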
For explanations of each header, see docs/enzyextract_headers.md. Reading it before using the data is recommended, since some headers may not behave the way their names suggest.
| Dataset | kcat count | Path | Description |
|---|---|---|---|
| Original, crude | 242,115 | data/export/TheData_kcat.parquet | As originally described |
| Deduplicated, crude | 218,095 | data/export/TheData_unpruned.parquet | With deduplication |
| EnzyExtractDB_176463 | 176,463 | EnzyExtractDB/EnzyExtractDB_176463.parquet | Deduplicated and filtered |
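
To make the relationship between the three releases concrete, here is a small sketch that computes how many kcat entries each stage removes, using only the counts and paths from the table. The `load_dataset` helper is an assumption for illustration: it relies on pandas with a parquet engine such as pyarrow being installed, and on the paths being relative to the repository root.

```python
from pathlib import Path

# kcat counts and paths copied from the table above.
DATASETS = {
    "original": (242_115, "data/export/TheData_kcat.parquet"),
    "deduplicated": (218_095, "data/export/TheData_unpruned.parquet"),
    "filtered": (176_463, "EnzyExtractDB/EnzyExtractDB_176463.parquet"),
}


def kcat_removed(before, after):
    """Return (absolute, fractional) drop in kcat count between two releases."""
    n_before = DATASETS[before][0]
    n_after = DATASETS[after][0]
    dropped = n_before - n_after
    return dropped, dropped / n_before


def load_dataset(name, repo_root="."):
    """Load one release as a DataFrame (assumes pandas and a parquet engine)."""
    import pandas as pd  # imported lazily so the arithmetic works without pandas

    return pd.read_parquet(Path(repo_root) / DATASETS[name][1])


if __name__ == "__main__":
    for before, after in [("original", "deduplicated"), ("deduplicated", "filtered")]:
        dropped, frac = kcat_removed(before, after)
        print(f"{before} -> {after}: {dropped:,} removed ({frac:.1%})")
```

By these counts, deduplication removes 24,020 entries (about 9.9%) and filtering removes a further 41,632 (about 19.1% of the deduplicated set).
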
The data is released under CC BY-NC 4.0. (The non-commercial restriction is due to the Elsevier and Wiley TDM APIs.)
The code is released under the MIT license.
See experiments/example/pipeline/ex_step*.py for example scripts. The scripts should be run sequentially, though file paths may need to be adjusted.
Steps:
- ex_step0_run_preprocessing.py
  - Handles the preprocessing steps (ResNet, table extraction).
  - Creates a `.enzy` folder for simplified file management.
- ex_step1_run_tableboth.py
  - Given PDFs and preprocessed data, feeds them to LLMs using the Batch API.
  - File locations should be automatically saved to `.enzy/llm_log.tsv`.
- ex_step1b_run_pdf_binaries.py
  - Alternative to `step0` and `step1`: feeds PDF binaries directly to Claude.
- ex_step2_download.py
  - Small script to retrieve batches from the Batch APIs.
- ex_step3_llm_to_df.py
  - Converts the LLM output to parquet files.
- ex_step5_generate_identifiers.py
  - Optional: attaches sequence identifiers (EC number, UniProt ID, PDB ID, SMILES, PubChem ID) to the data from `step3`.
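
The ordering of the steps above can be sketched as a small driver. This is an illustration rather than a script shipped with the repository: `plan` and `run` are hypothetical helpers, the scripts are assumed to run without command-line arguments once their internal file paths are adjusted, and the alternative route is assumed to rejoin the default one at `step2`.

```python
import subprocess
import sys
from pathlib import Path

PIPELINE_DIR = Path("experiments/example/pipeline")

# Default route: preprocess first, then send the results to the Batch API.
DEFAULT_ROUTE = [
    "ex_step0_run_preprocessing.py",
    "ex_step1_run_tableboth.py",
    "ex_step2_download.py",
    "ex_step3_llm_to_df.py",
    "ex_step5_generate_identifiers.py",  # optional step
]

# Alternative route: step1b replaces step0 and step1.
# Assumption: it continues with step2 onward like the default route.
PDF_BINARY_ROUTE = ["ex_step1b_run_pdf_binaries.py"] + DEFAULT_ROUTE[2:]


def plan(route, with_identifiers=True):
    """Return the script paths for a route, optionally skipping step5."""
    scripts = [s for s in route if with_identifiers or "step5" not in s]
    return [PIPELINE_DIR / s for s in scripts]


def run(route, with_identifiers=True, dry_run=True):
    """Print the plan, and execute each script in order unless dry_run is set."""
    for script in plan(route, with_identifiers):
        print("would run" if dry_run else "running", script)
        if not dry_run:
            # check=True stops the sequence at the first failing script.
            subprocess.run([sys.executable, str(script)], check=True)


if __name__ == "__main__":
    run(DEFAULT_ROUTE, dry_run=True)
```

Batch jobs complete asynchronously, so in practice there is a wait between submitting in `step1` and retrieving in `step2`; the sketch does not model that pause, which is one reason to keep `dry_run=True` and launch the steps by hand.
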
See experiments/example/evaluation/ex_eval*.py.
- ex_eval1_compare_dfs.py
  - Evaluate and benchmark LLM data against a trusted dataset.
- ex_eval2_plot_dfs.py
  - Plot the data from `ex_eval1`.
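
For intuition about what benchmarking against a trusted dataset involves, here is a minimal, generic sketch. It is not the method used by ex_eval1_compare_dfs.py: the function name `compare_kcat`, the record keys, and the 5% relative tolerance are all assumptions chosen for illustration.

```python
import math


def compare_kcat(extracted, trusted, rel_tol=0.05):
    """Compare two mappings of record key -> kcat value (same units).

    Returns counts of shared keys, values agreeing within rel_tol,
    trusted records missing from the extraction, and extra extracted records.
    """
    shared = extracted.keys() & trusted.keys()
    agree = sum(
        math.isclose(extracted[k], trusted[k], rel_tol=rel_tol) for k in shared
    )
    return {
        "shared": len(shared),
        "agree": agree,
        "missing": len(trusted.keys() - extracted.keys()),
        "extra": len(extracted.keys() - trusted.keys()),
    }


if __name__ == "__main__":
    # Hypothetical records keyed by (paper, substrate).
    extracted = {("p1", "ATP"): 10.0, ("p1", "GTP"): 3.0, ("p2", "ATP"): 7.0}
    trusted = {("p1", "ATP"): 10.2, ("p1", "GTP"): 30.0, ("p3", "ATP"): 1.0}
    print(compare_kcat(extracted, trusted))
```

A real comparison also has to settle how records are matched (enzyme, substrate, conditions) and how units are normalized, which this sketch deliberately leaves out.
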
The enzyme accession pipeline is a work in progress; see docs/pipeline/enzyme_accessions.md.
For docs on getting substrate IDs, see docs/pipeline/substrate_accessions.md.
If you find EnzyExtract useful, please cite it as follows:
@article{wei2025finding,
title={Finding the dark matter: Large language model-based enzyme kinetic data extractor and its validation},
author={Wei, Galen and Ran, Xinchun and Al-Abssi, Runeem and Yang, Zhongyue},
journal={Protein Science},
volume={34},
number={9},
pages={e70251},
year={2025},
publisher={Wiley Online Library}
}