EnzyExtract

Extract kinetics data from PDFs using LLMs.

Installation

git clone https://github.com/ChemBioHTP/EnzyExtract
cd EnzyExtract
pip install -e .

Then create a .env file in the project root and add your OpenAI API key:

OPENAI_API_KEY=...
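Before launching any LLM step, it can help to confirm that the key is actually visible. The following is a minimal sketch using only the standard library; `read_env_file` and `api_key_is_configured` are hypothetical helpers written for this illustration and are not part of EnzyExtract, which may load the .env file by other means.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def api_key_is_configured(env_path=".env"):
    """Return True if OPENAI_API_KEY is set in the environment or in env_path."""
    if os.environ.get("OPENAI_API_KEY"):
        return True
    path = Path(env_path)
    return path.is_file() and bool(read_env_file(path).get("OPENAI_API_KEY"))
```

Calling `api_key_is_configured()` from the project root returns `False` when neither the environment nor the .env file provides a key.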

Quickstart

A preliminary tutorial is available on Google Colab or downloadable here.

Database

For an explanation of each header, see docs/enzyextract_headers.md. Reading it before using the data is recommended, because some headers may not mean what you expect.

Initial Release

Dataset	kcat count	Location	Description
Original, crude	242,115	data/export/TheData_kcat.parquet	As originally described
Deduplicated, crude	218,095	data/export/TheData_unpruned.parquet	With deduplication
EnzyExtractDB_176463	176,463	EnzyExtractDB/EnzyExtractDB_176463.parquet	Deduplicated and filtered
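The counts in the table imply how many kcat entries each processing stage removes: deduplication drops 24,020 entries and filtering drops a further 41,632. The short sketch below reproduces that arithmetic from the published counts only; it does not read the parquet files.

```python
# kcat counts copied from the release table above.
counts = {
    "Original, crude": 242_115,
    "Deduplicated, crude": 218_095,
    "EnzyExtractDB_176463": 176_463,
}

names = list(counts)
removed_per_stage = {}
for before, after in zip(names, names[1:]):
    removed = counts[before] - counts[after]
    share = removed / counts[before]  # fraction of the previous stage that was dropped
    removed_per_stage[after] = removed
    print(f"{before} -> {after}: removed {removed:,} ({share:.1%})")
```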

The data is released under CC BY-NC 4.0. (The NonCommercial restriction is due to the Elsevier and Wiley TDM APIs.)

The code is released under the MIT license.

Usage

See experiments/example/pipeline/ex_step*.py for example scripts. The scripts should be run sequentially, though file paths may need to be adjusted.

Steps:

ex_step0_run_preprocessing.py
- Handles the preprocessing steps (ResNet, table extraction).
- Creates a .enzy folder for simplified file management.
ex_step1_run_tableboth.py
- Feeds the PDFs and preprocessed data to LLMs using the Batch API.
- File locations should be automatically saved to .enzy/llm_log.tsv.
ex_step1b_run_pdf_binaries.py
- Alternative to step0 and step1: feeds PDF binaries directly to Claude.
ex_step2_download.py
- Small script to retrieve batches from the Batch APIs.
ex_step3_llm_to_df.py
- Converts the LLM output to parquet files.
ex_step5_generate_identifiers.py
- Optional: attaches enzyme and substrate identifiers (EC number, UniProt ID, PDB ID, SMILES, PubChem ID) to the data from step3.
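The sequential order above can be sketched as a small driver. This is an illustration only: `build_commands` and `run_pipeline` are hypothetical helpers, not part of EnzyExtract, and the sketch assumes each script runs without arguments once its file paths have been adjusted. Because step1 submits work through a Batch API, step2 presumably has to wait until the batches have finished, so running every step back to back may not work in practice.

```python
import subprocess
import sys
from pathlib import Path

# Directory and script names are taken from the Usage section above.
PIPELINE_DIR = Path("experiments/example/pipeline")
STEPS = [
    "ex_step0_run_preprocessing.py",
    "ex_step1_run_tableboth.py",
    "ex_step2_download.py",
    "ex_step3_llm_to_df.py",
    "ex_step5_generate_identifiers.py",  # optional final step
]


def build_commands(pipeline_dir=PIPELINE_DIR, include_optional=True):
    """Return one [python, script] command per step, in pipeline order."""
    steps = STEPS if include_optional else STEPS[:-1]
    return [[sys.executable, str(Path(pipeline_dir) / name)] for name in steps]


def run_pipeline(commands):
    """Run each command in turn and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(build_commands())` from the repository root would then execute the steps in order. The alternative ex_step1b_run_pdf_binaries.py is left out on purpose, since it replaces step0 and step1 rather than following them.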

Evaluation

See experiments/example/evaluation/ex_eval*.py for example scripts.

ex_eval1_compare_dfs.py
- Evaluate and benchmark LLM data against a trusted dataset.
ex_eval2_plot_dfs.py
- Plot the data from ex_eval1.
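To give a feel for what benchmarking against a trusted dataset can involve, here is a minimal sketch that scores extracted values against trusted ones with a relative tolerance. It is not the logic of ex_eval1_compare_dfs.py: the function names, the greedy matching strategy, and the 1% tolerance are all assumptions made for illustration.

```python
import math


def match_values(extracted, trusted, rel_tol=0.01):
    """Greedily pair each trusted value with an unused extracted value within rel_tol."""
    unused = list(extracted)
    matched = 0
    for t in trusted:
        for i, e in enumerate(unused):
            if math.isclose(e, t, rel_tol=rel_tol):
                matched += 1
                del unused[i]  # each extracted value may be matched only once
                break
    return matched


def precision_recall(extracted, trusted, rel_tol=0.01):
    """Precision = matched / extracted, recall = matched / trusted."""
    matched = match_values(extracted, trusted, rel_tol)
    precision = matched / len(extracted) if extracted else 0.0
    recall = matched / len(trusted) if trusted else 0.0
    return precision, recall
```

In this sketch, values would typically be compared per paper and per parameter (for example, all kcat values reported in one publication), so that a coincidental numeric match across unrelated papers does not count.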

Accessions

The enzyme accession pipeline is a work in progress; see docs/pipeline/enzyme_accessions.md.

For documentation on obtaining substrate IDs, see docs/pipeline/substrate_accessions.md.

Architecture

Citation

If you find EnzyExtract useful, please cite it as below:

@article{wei2025finding,
  title={Finding the dark matter: Large language model-based enzyme kinetic data extractor and its validation},
  author={Wei, Galen and Ran, Xinchun and Al-Abssi, Runeem and Yang, Zhongyue},
  journal={Protein Science},
  volume={34},
  number={9},
  pages={e70251},
  year={2025},
  publisher={Wiley Online Library}
}
