Commit a348c11

Merge pull request #2 from aspuru-guzik-group/main-python3.9
merging main python3.9 into main
2 parents b5aca4e + 35893e7 commit a348c11

16 files changed

Lines changed: 430 additions & 263 deletions

Prompts/get_data_prompt.txt

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+You will receive several images. Your goal is to extract and structure detailed information on different reaction conditions, ensuring that all specified modifications are applied correctly. Identify the standard conditions provided explicitly either in the first footnote or in the reaction diagram. ONLY use information found in these two sources.
+
+Your task is to generate a JSON object structured as follows:
+
+Optimization Runs Dictionary:
+This is a dictionary of dictionaries, where each entry represents an optimization run. For each run, begin by using the standard conditions. You MUST modify the conditions where specific changes are indicated in each entry, REPLACING with the correct conditions. Each run should contain the following key-value pairs:
+
+"Entry": Entry number for the run.
+"Anode": Anode material (positive end). Use abbreviations if available. NOTE: the anode and cathode may be separated by delimiters such as |, /, //, or ||, with or without space. the anode usually appears before the delimiter. Pay particular attention to SEPARATE before including as anode or cathode. There MUST be NO delimiters in the final output. If no polarity indications or delimiters are used, assume the material is used for both the anode and cathode.
+"Cathode": Cathode material (negative end). Use abbreviations if available. Cathode usually appears after the delimiter.
+"Electrolytes": Include ALL NON-SOLVENT chemicals, such as electrolytes, additives, bases, acids, mediators etc, separated by commas if there are multiple. You MUST use chemical (quantity) format, where quantity can refer to amounts, equivalents (eq., equiv.), and concentrations, whichever are present. you MUST assume quantitative values and UNITS from standard conditions when the chemical changes but explicit quantitative values are not provided.
+"Solvents": Specify ALL SOLVENTS and COSOLVENTS used. You MUST use chemical (quantity with UNITS) format, where quantity can refer to volumes, ratios etc. You must assume quantitative values and units from standard conditions when explicit quantitative values are not provided.
+"Footnote": A string representing all superscript notations associated with the run, separated by commas if there are multiple. Superscript notations may appear in any columns. Use empty string if no superscript.
+Footnotes Dictionary:
+This dictionary stores footnotes, where each superscript notation is a key, and its full explanation is the value. For missing explanation, put N.R.
+
+Important Rules:
+For any runs missing specific details, assume the values from the standard conditions unless otherwise specified.
+For any missing information and empty cells, use "N.R." (Not Reported).
+Each material or compound should only appear once in each dictionary.
+Please provide a complete list of all entries, even if the list is long.
+Internal standard MUST NOT be included.
+Assume changes SUBSTITUTE the standard condition unless otherwise specified that it is an addition.
+MAKE SURE ALL CHEMICALS are included.
+If any COLUMN INFORMATION is not included in previous fields, you MUST add to 'OTHERS'.
+
+
+
+
+
+
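
Reviewer sketch (not part of the commit): the electrode-delimiter rule and the "start from standard conditions, then substitute" rule in the prompt above can be illustrated with a small Python example. The helper names `split_electrodes` and `build_run`, and the sample chemicals, are hypothetical and only serve to show the intended behaviour; the repository may post-process model output differently.

```python
import re

# Delimiters named in the prompt: "||", "//", "|", "/" (with or without spaces).
# Longer delimiters are listed first so "||" is not read as two "|".
_DELIMITER = re.compile(r"\s*(?:\|\||//|\||/)\s*")

# Per-run keys listed in the prompt.
RUN_KEYS = ("Entry", "Anode", "Cathode", "Electrolytes", "Solvents", "Footnote")


def split_electrodes(text: str) -> tuple:
    """Return (anode, cathode) with no delimiter left in either value."""
    parts = [p for p in _DELIMITER.split(text.strip()) if p]
    if not parts:
        return ("N.R.", "N.R.")          # empty cell -> Not Reported
    if len(parts) == 1:
        return (parts[0], parts[0])      # no delimiter -> same material for both
    return (parts[0], parts[1])          # anode usually precedes the delimiter


def build_run(standard: dict, changes: dict) -> dict:
    """Start from the standard conditions and SUBSTITUTE any listed changes."""
    merged = {**standard, **changes}
    # Missing fields become "N.R."; a missing footnote becomes an empty string.
    return {k: merged.get(k, "" if k == "Footnote" else "N.R.") for k in RUN_KEYS}


if __name__ == "__main__":
    anode, cathode = split_electrodes("C(+) | Pt(-)")
    standard = {"Anode": anode, "Cathode": cathode,
                "Electrolytes": "LiClO4 (0.1 M)", "Solvents": "MeCN (5 mL)"}
    print(build_run(standard, {"Entry": "2", "Cathode": "Ni"}))
```

The merge is a plain substitution, which mirrors the rule that changes replace the standard condition unless stated as an addition; handling additions would need extra logic not shown here.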

README.md

Lines changed: 83 additions & 68 deletions
@@ -2,110 +2,125 @@

 <img src="./Assets/MERMaid-overview.jpg" alt="Overview" width="600">

-## Note:
-* MERMaid is an end-to-end knowledge ingestion pipeline to automatically convert disparate information conveyed through figures, schemes, and tables across various PDFs into a coherent and machine-actionable knowledge graph. It integrates three seqeuntial modules:
-* VisualHeist for table and figure segmentation from PDFs
-* DataRaider for multimodal analysis to extract relevant information as structured reaction schema
-* KGWizard for automated knowledge graph construction
-* You can run MERMaid directly or use each module as a standalone tool for its specific functionality.
-* MERMaid is integrated with the OpenAI provider at present. We will extend MERMaid to support other providers and open-source VLMs in future updates.
-* NOTE: VisualHeist works best on systems with high RAM. For optimal performance, ensure that your system has sufficient memory, as running out of memory may cause the process to be terminated prematurely.
+### Table of Contents
+1. [Overview](#overview)
+2. [Installation](#1-installation)
+3. [Usage](#2-usage)

-## 1. Installation
+## Overview
+MERMaid is an end-to-end knowledge ingestion pipeline to automatically convert disparate information conveyed through figures, schemes, and tables across various PDFs into a coherent and machine-actionable knowledge graph. It integrates three sequential modules:
+- **VisualHeist** for table and figure segmentation from PDFs
+- **DataRaider** for multimodal analysis to extract relevant information as structured reaction schema
+- **KGWizard** for automated knowledge graph construction

-### 1.1 Create a new virtual environment. The recommended python version is 3.9.
+You can run MERMaid directly or use VisualHeist and DataRaider as standalone tools for their specific functionality.

-```
+MERMaid is integrated with the OpenAI provider at present. We will extend MERMaid to support other providers and open-source VLMs in future updates.
+
+VisualHeist works best on systems with **high RAM**. For optimal performance, ensure that your system has sufficient memory, as running out of memory may cause the process to be terminated prematurely.
+
+Further usage details on KGWizard can be found in the [KGWizard README file](https://github.com/aspuru-guzik-group/MERMaid/blob/main/src/kgwizard/README.org).
+
+---
+
+## 1. Installation
+
+### 1.1 Create a new virtual environment
+The recommended Python version is **3.9**.
+
+#### Using Conda:
+```sh
 conda create -n mermaid-env python=3.9
 conda activate mermaid-env
 ```
-OR
-```
+#### Using venv:
+```sh
 python3.9 -m venv mermaid-env
 source mermaid-env/bin/activate
 ```

-### 1.2 Install RxnScribe using the following steps for optical chemical structure recognition:
-```
+### 1.2 Install RxnScribe for Optical Chemical Structure Recognition
+```sh
 git clone https://github.com/thomas0809/RxnScribe.git
 cd RxnScribe
 pip install -r requirements.txt
 python setup.py install
 cd ..
 ```
-Please note that you may get a compatibility warning stating that `MolScribe version 1.1.1 not being compatible with Torch versions >2.0`. You can safely ignore this warning.
-
-### 1.3 Install MERMaid using the following steps:
+> ⚠️ You may see a compatibility warning about `MolScribe version 1.1.1 not being compatible with Torch versions >2.0`. This can be safely ignored.

-#### Option 1 (quick installation)
-Directly install the package.
-```
-pip install git+https://github.com/aspuru-guzik-group/MERMaid/git
-```
-
-#### Option 2 (for development purposes)
-Download the entire repository and install the requirements.
-```
+### 1.3 Install MERMaid
+Download the repository and install dependencies:
+```sh
 git clone https://github.com/aspuru-guzik-group/MERMaid/
 cd MERMaid
 pip install -e .
 ```
-For full MERMaid pipeline:
-```
+For the **full MERMaid pipeline**:
+```sh
 pip install MERMaid[full]
 ```
-
-For individual modules only:
-```
+For **individual modules**:
+```sh
 pip install MERMaid[visualheist]
 pip install MERMaid[dataraider]
 pip install MERMaid[kgwizard]
 ```

-## 2. Usage
-### 2.1 Setting up your plug-and-play configuration file
-* Indicate your configuration settings in `startup.json`:
-* "pdf_dir": "/path/to/directory/storing/pdfs"
-* "image_dir": "/path/to/directory/to/store/extracted/images"
-* "json_dir": "/path/to/directory/to/store/json/output"
-* "graph_dir": "/path/to/directory/to/store/graph/files"
-* "api_key": "your_api_key_here"
-* "model_size": "model_size_here" ("base" OR "large")
-* "keys": ["key1", "key2"] (the in-built reaction parameter keys can be found in `Prompts/inbuilt_keyvaluepairs.txt`)
-* "new_keys": define your custom keys here
-
-* For post-processing extracted JSON reaction dictionaries:
-* you can add your own common chemical names by modifying the `COMMON_NAMES` constant in `dataraider/postprocess.py`
-* you can also add your own key names that you want to be cleaned by modifying the `KEYS` constant in `dataraider/postprocess.py`
-
-* We have included an additional `filter_prompt` in `Prompts/` folder to identify only images that are relevant to your task of interest. You are highly encouraged to specify your own task and keys.
-
-### 2.2 Running the end-to-end MERMaid pipeline
-The main command to launch and run MERMaid is:
+---
+
+## 2. Usage
+
+### 2.1 Setting Up Your Configuration File
+Define settings in `startup.json`:
+```json
+{
+"pdf_dir": "/path/to/directory/storing/pdfs",
+"image_dir": "/path/to/directory/to/store/extracted/images",
+"json_dir": "/path/to/directory/to/store/json/output",
+"graph_dir": "/path/to/directory/to/store/graph/files",
+"model_size": "base",
+"keys": ["key1", "key2"],
+"new_keys": [],
+"graph_name": "your_graph_name",
+"schema": "your_schema_name"
+}
 ```
-mermaid
+- The in-built reaction parameter keys are in `Prompts/inbuilt_keyvaluepairs.txt`.
+- For post-processing extracted JSON reaction dictionaries:
+- Modify `COMMON_NAMES` in `dataraider/postprocess.py` to add custom chemical names.
+- Modify `KEYS` in `dataraider/postprocess.py` to clean specific key names.
+- Customize `filter_prompt` in `Prompts/` to filter relevant images.
+
+### 2.2 Setting Up API Key
+The environment variable **`OPENAI_API_KEY`** is required for **DataRaider** and **KGWizard**.
+
+```sh
+export OPENAI_API_KEY="your-openai-api-key"
 ```
-All intermediate files from each module will be saved in the `Results` folder of your root directory by default.

-### 2.3 Running individual modules
-#### 2.3.1 VisualHeist for image segmentation from scientific PDF documents
-The main command to launch and run VisualHeist is:
+---
+
+### 2.3 Running the Full MERMaid Pipeline
+Run the full pipeline with:
+```sh
+mermaid
 ```
+Intermediate files will be saved in the `Results/` directory.
+
+### 2.4 Running Individual Modules
+
+#### 2.4.1 VisualHeist – Image Segmentation from PDFs
+```sh
 visualheist
 ```

-#### 2.3.2 DataRaider for image-to-data conversion into JSON dictionaries
-The main command to launch and run DataRaider is:
-```
+#### 2.4.2 DataRaider – Image-to-Data Conversion
+```sh
 dataraider
 ```
-A sample output json dictionary can be found in `Assets` folder.
+*A sample output JSON is available in the `Assets` folder.*

-#### 2.3.3 KGWizard for data-to-knowledge graph translation
-The main command to launch and run KGWizard is:
-```
+#### 2.4.3 KGWizard – Data-to-Knowledge Graph Translation
+```sh
 kgwizard
-```
-
-## 3. Data Visualization
-<Coming Soon!>
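
Reviewer sketch (not part of the commit): the new README describes `startup.json` as a flat JSON object and no longer stores the API key in it. A minimal way to load and sanity-check such a file is shown below. The function name `load_startup_config` is hypothetical, and treating the four directory keys as required is an assumption; the allowed `model_size` values ("base" or "large") come from the previous README text.

```python
import json
from pathlib import Path

# Assumption: the four directory settings are mandatory.
REQUIRED_DIRS = ("pdf_dir", "image_dir", "json_dir", "graph_dir")
# The earlier README listed "base" OR "large" as the valid model sizes.
MODEL_SIZES = ("base", "large")


def load_startup_config(path):
    """Read startup.json and check the fields described in the README."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))

    missing = [key for key in REQUIRED_DIRS if key not in config]
    if missing:
        raise KeyError(f"startup.json is missing: {', '.join(missing)}")

    config.setdefault("model_size", "base")
    if config["model_size"] not in MODEL_SIZES:
        raise ValueError(f"model_size must be one of {MODEL_SIZES}")

    # Optional lists default to empty so callers can iterate safely.
    config.setdefault("keys", [])
    config.setdefault("new_keys", [])
    return config
```

Validating the file up front surfaces a misspelled key before any PDF is processed, which is cheaper than failing midway through the pipeline.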

pyproject.toml

Lines changed: 9 additions & 3 deletions
@@ -30,8 +30,15 @@ dataraider = [
 "requests", "opencv-python", "numpy", "regex",
 "pubchempy", "openai", "huggingface_hub"
 ]
-kgwizard = ["gremlinpython", "openai"]
-full = ["visualheist", "dataraider", "kgwizard"]
+kgwizard = ["gremlinpython", "openai", "numpy"]
+full = [
+"pdf2image", "Pillow", "transformers", "safetensors",
+"torch==2.3.0", "torchvision==0.18.0", "pytorch-lightning==2.3.0",
+"huggingface_hub", "einops", "timm",
+"requests", "opencv-python", "numpy", "regex",
+"pubchempy", "openai", "huggingface_hub",
+"gremlinpython"
+]

 [project.scripts]
 dataraider = "scripts.run_dataraider:main"
@@ -40,7 +47,6 @@ kgwizard = "src.kgwizard.__main__:main"
 mermaid = "scripts.run_mermaid:main"

 [tool.setuptools]
-# Tell setuptools to look in BOTH src/ and the current directory (where "scripts" is).
 packages = { find = { where = [".", "src"] } }

 [tool.setuptools.package-data]

scripts/run_dataraider.py

File mode changed: 100644 → 100755
Lines changed: 4 additions & 31 deletions
@@ -48,7 +48,7 @@ def main():
 parser.add_argument("--json_dir", type=str, help="Directory to save processed JSON data", default=None)
 parser.add_argument("--keys", type=str, nargs='+', help="List of keys to extract", default=None)
 parser.add_argument("--new_keys", type=str, nargs='+', help="List of new keys for data extraction", default=None)
-parser.add_argument("--api_key", type=str, help="API key", default=None)
+# parser.add_argument("--api_key", type=str, help="API key", default=None)

 args = parser.parse_args()

@@ -68,10 +68,10 @@ def main():
 json_dir = config.get('default_json_dir')
 keys = config.get('keys', ["Entry", "Catalyst", "Ligand", "Cathode", "Solvents"])
 new_keys = config.get('new_keys', None)
-api_key = config.get('api_key', None)
+# api_key = config.get('api_key', None)
+api_key = os.environ.get("OPENAI_API_KEY")

 info = DataRaiderInfo(api_key=api_key, device="cpu", ckpt_path=ckpt_path)
-print(f'keys are {keys}')
 # Construct the initial reaction data extraction prompt
 print("Constructing your custom reaction data extraction prompt")
 construct_initial_prompt(prompt_dir, keys, new_keys)
@@ -86,31 +86,4 @@ def main():


 if __name__ == "__main__":
-main()
-
-
-# # Load the configuration from the file
-# config = load_config('./mermaid/startup.json')
-
-# # Use the default configuration in the function call if unspecified
-# image_dir = config.get('image_dir', config.get('default_image_dir'))
-# prompt_dir = config.get('prompt_dir', "./Prompts")
-# get_data_prompt = "get_data_prompt"
-# update_dict_prompt = "update_dict_prompt"
-# output_dir = config.get('json_dir', config.get('default_json_dir'))
-# keys = config.get('keys', ["Entry", "Catalyst", "Ligand", "Cathode", "Solvents"])
-# new_keys = config.get('new_keys', None)
-# api_key = config.get('api_key', None)
-
-# # Use the loaded configuration in the function call
-# info = DataRaiderInfo(api_key=api_key, device="cpu", ckpt_path=ckpt_path)
-# # processor = RxnOptDataProcessor(ckpt_path=ckpt_path, device='cpu', api_key=api_key)
-# print("Constructing your custom reaction data extraction prompt")
-# construct_initial_prompt(keys, new_keys)
-# # processor.construct_initial_prompt(keys, new_keys)
-# print('######################## Starting up DataRaider ############################')
-# batch_process_images(info, image_dir, prompt_dir, get_data_prompt, update_dict_prompt, output_dir)
-# # processor.batch_process_images(image_dir, prompt_dir, get_data_prompt, update_dict_prompt, output_dir)
-# print('Clearing temporary files and custom prompts')
-# clear_temp_files(prompt_dir, image_dir)
-# # processor.clear_temp_files(prompt_dir, image_dir)
+main()
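
Reviewer sketch (not part of the commit): with this change `api_key` comes from `os.environ.get("OPENAI_API_KEY")`, which returns `None` when the variable is unset, so a missing key would only surface later inside `DataRaiderInfo` or at the first API call. A fail-fast check along the following lines could make the error clearer. The helper name `require_api_key` is hypothetical and does not exist in the repository.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to exercise in tests without touching the real environment.
    """
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it before running DataRaider or KGWizard"
        )
    return value
```

In `main()` this would replace the bare `os.environ.get(...)` call, for example `api_key = require_api_key()`, so the script stops before any prompt construction or image processing starts.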
