|
2 | 2 |
|
3 | 3 | <img src="./Assets/MERMaid-overview.jpg" alt="Overview" width="600"> |
4 | 4 |
|
5 | | -## Note: |
6 | | -* MERMaid is an end-to-end knowledge ingestion pipeline to automatically convert disparate information conveyed through figures, schemes, and tables across various PDFs into a coherent and machine-actionable knowledge graph. It integrates three seqeuntial modules: |
7 | | - * VisualHeist for table and figure segmentation from PDFs |
8 | | - * DataRaider for multimodal analysis to extract relevant information as structured reaction schema |
9 | | - * KGWizard for automated knowledge graph construction |
10 | | -* You can run MERMaid directly or use each module as a standalone tool for its specific functionality. |
11 | | -* MERMaid is integrated with the OpenAI provider at present. We will extend MERMaid to support other providers and open-source VLMs in future updates. |
12 | | -* NOTE: VisualHeist works best on systems with high RAM. For optimal performance, ensure that your system has sufficient memory, as running out of memory may cause the process to be terminated prematurely. |
| 5 | +### Table of Contents |
| 6 | +1. [Overview](#overview) |
| 7 | +2. [Installation](#1-installation) |
| 8 | +3. [Usage](#2-usage) |
13 | 9 |
|
14 | | -## 1. Installation |
| 10 | +## Overview |
| 11 | +MERMaid is an end-to-end knowledge ingestion pipeline to automatically convert disparate information conveyed through figures, schemes, and tables across various PDFs into a coherent and machine-actionable knowledge graph. It integrates three sequential modules: |
| 12 | +- **VisualHeist** for table and figure segmentation from PDFs |
| 13 | +- **DataRaider** for multimodal analysis to extract relevant information as structured reaction schema |
| 14 | +- **KGWizard** for automated knowledge graph construction |
15 | 15 |
|
16 | | -### 1.1 Create a new virtual environment. The recommended python version is 3.9. |
| 16 | +You can run MERMaid directly or use VisualHeist and DataRaider as standalone tools for their specific functionality. |
17 | 17 |
|
18 | | -``` |
| 18 | +MERMaid is integrated with the OpenAI provider at present. We will extend MERMaid to support other providers and open-source VLMs in future updates. |
| 19 | + |
| 20 | +VisualHeist works best on systems with **high RAM**. For optimal performance, ensure that your system has sufficient memory, as running out of memory may cause the process to be terminated prematurely. |
| 21 | + |
| 22 | +Further usage details on KGWizard can be found in the [KGWizard README file](https://github.com/aspuru-guzik-group/MERMaid/blob/main/src/kgwizard/README.org). |
| 23 | + |
| 24 | +--- |
| 25 | + |
| 26 | +## 1. Installation |
| 27 | + |
| 28 | +### 1.1 Create a new virtual environment |
| 29 | +The recommended Python version is **3.9**. |
| 30 | + |
| 31 | +#### Using Conda: |
| 32 | +```sh |
19 | 33 | conda create -n mermaid-env python=3.9 |
20 | 34 | conda activate mermaid-env |
21 | 35 | ``` |
22 | | -OR |
23 | | -``` |
| 36 | +#### Using venv: |
| 37 | +```sh |
24 | 38 | python3.9 -m venv mermaid-env |
25 | 39 | source mermaid-env/bin/activate |
26 | 40 | ``` |
27 | 41 |
|
28 | | -### 1.2 Install RxnScribe using the following steps for optical chemical structure recognition: |
29 | | -``` |
| 42 | +### 1.2 Install RxnScribe for Optical Chemical Structure Recognition |
| 43 | +```sh |
30 | 44 | git clone https://github.com/thomas0809/RxnScribe.git |
31 | 45 | cd RxnScribe |
32 | 46 | pip install -r requirements.txt |
33 | 47 | python setup.py install |
34 | 48 | cd .. |
35 | 49 | ``` |
36 | | -Please note that you may get a compatibility warning stating that `MolScribe version 1.1.1 not being compatible with Torch versions >2.0`. You can safely ignore this warning. |
37 | | - |
38 | | -### 1.3 Install MERMaid using the following steps: |
| 50 | +> ⚠️ You may see a compatibility warning about `MolScribe version 1.1.1 not being compatible with Torch versions >2.0`. This can be safely ignored. |
39 | 51 |
|
40 | | -#### Option 1 (quick installation) |
41 | | -Directly install the package. |
42 | | -``` |
43 | | -pip install git+https://github.com/aspuru-guzik-group/MERMaid/git |
44 | | -``` |
45 | | - |
46 | | -#### Option 2 (for development purposes) |
47 | | -Download the entire repository and install the requirements. |
48 | | -``` |
| 52 | +### 1.3 Install MERMaid |
| 53 | +Download the repository and install dependencies: |
| 54 | +```sh |
49 | 55 | git clone https://github.com/aspuru-guzik-group/MERMaid/ |
50 | 56 | cd MERMaid |
51 | 57 | pip install -e . |
52 | 58 | ``` |
53 | | -For full MERMaid pipeline: |
54 | | -``` |
| 59 | +For the **full MERMaid pipeline**: |
| 60 | +```sh |
55 | 61 | pip install MERMaid[full] |
56 | 62 | ``` |
57 | | - |
58 | | -For individual modules only: |
59 | | -``` |
| 63 | +For **individual modules**: |
| 64 | +```sh |
60 | 65 | pip install MERMaid[visualheist] |
61 | 66 | pip install MERMaid[dataraider] |
62 | 67 | pip install MERMaid[kgwizard] |
63 | 68 | ``` |
64 | 69 |
|
65 | | -## 2. Usage |
66 | | -### 2.1 Setting up your plug-and-play configuration file |
67 | | -* Indicate your configuration settings in `startup.json`: |
68 | | - * "pdf_dir": "/path/to/directory/storing/pdfs" |
69 | | - * "image_dir": "/path/to/directory/to/store/extracted/images" |
70 | | - * "json_dir": "/path/to/directory/to/store/json/output" |
71 | | - * "graph_dir": "/path/to/directory/to/store/graph/files" |
72 | | - * "api_key": "your_api_key_here" |
73 | | - * "model_size": "model_size_here" ("base" OR "large") |
74 | | - * "keys": ["key1", "key2"] (the in-built reaction parameter keys can be found in `Prompts/inbuilt_keyvaluepairs.txt`) |
75 | | - * "new_keys": define your custom keys here |
76 | | - |
77 | | -* For post-processing extracted JSON reaction dictionaries: |
78 | | - * you can add your own common chemical names by modifying the `COMMON_NAMES` constant in `dataraider/postprocess.py` |
79 | | - * you can also add your own key names that you want to be cleaned by modifying the `KEYS` constant in `dataraider/postprocess.py` |
80 | | - |
81 | | -* We have included an additional `filter_prompt` in `Prompts/` folder to identify only images that are relevant to your task of interest. You are highly encouraged to specify your own task and keys. |
82 | | - |
83 | | -### 2.2 Running the end-to-end MERMaid pipeline |
84 | | -The main command to launch and run MERMaid is: |
| 70 | +--- |
| 71 | + |
| 72 | +## 2. Usage |
| 73 | + |
| 74 | +### 2.1 Setting Up Your Configuration File |
| 75 | +Define settings in `startup.json`: |
| 76 | +```json |
| 77 | +{ |
| 78 | + "pdf_dir": "/path/to/directory/storing/pdfs", |
| 79 | + "image_dir": "/path/to/directory/to/store/extracted/images", |
| 80 | + "json_dir": "/path/to/directory/to/store/json/output", |
| 81 | + "graph_dir": "/path/to/directory/to/store/graph/files", |
| 82 | + "model_size": "base", |
| 83 | + "keys": ["key1", "key2"], |
| 84 | + "new_keys": [], |
| 85 | + "graph_name": "your_graph_name", |
| 86 | + "schema": "your_schema_name" |
| 87 | +} |
85 | 88 | ``` |
86 | | -mermaid |
| 89 | +- The in-built reaction parameter keys are in `Prompts/inbuilt_keyvaluepairs.txt`. |
| 90 | +- For post-processing extracted JSON reaction dictionaries: |
| 91 | + - Modify `COMMON_NAMES` in `dataraider/postprocess.py` to add custom chemical names. |
| 92 | + - Modify `KEYS` in `dataraider/postprocess.py` to clean specific key names. |
| 93 | +- Customize `filter_prompt` in `Prompts/` to filter relevant images. |
| 94 | + |
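As a minimal sketch (not part of MERMaid itself), the `startup.json` layout above could be sanity-checked before a run. The helper name `check_startup_config` is hypothetical. The required-key list is taken from the example fields, and whether MERMaid treats all of them as mandatory is an assumption. The `"base"`/`"large"` restriction on `model_size` follows the earlier wording of this README.

```python
import json
from pathlib import Path

# Keys taken from the example startup.json shown in this README; whether
# MERMaid itself treats all of them as mandatory is an assumption.
REQUIRED_KEYS = ("pdf_dir", "image_dir", "json_dir", "graph_dir", "model_size", "keys")
ALLOWED_MODEL_SIZES = ("base", "large")


def check_startup_config(path):
    """Return a list of human-readable problems found in a startup.json file."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))

    # Report every expected key that is absent from the file.
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]

    # model_size, when present, should be one of the documented values.
    size = config.get("model_size")
    if size is not None and size not in ALLOWED_MODEL_SIZES:
        problems.append(f"model_size should be one of {ALLOWED_MODEL_SIZES}, got {size!r}")

    # keys, when present, should be a JSON list.
    if not isinstance(config.get("keys", []), list):
        problems.append("keys should be a JSON list of strings")

    return problems
```

Calling `check_startup_config("startup.json")` returns an empty list when nothing looks wrong, so a wrapper script could print the problems and exit before launching the pipeline.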
| 95 | +### 2.2 Setting Up API Key |
| 96 | +The environment variable **`OPENAI_API_KEY`** is required for **DataRaider** and **KGWizard**. |
| 97 | + |
| 98 | +```sh |
| 99 | +export OPENAI_API_KEY="your-openai-api-key" |
87 | 100 | ``` |
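The following sketch shows one way a wrapper script might confirm that the key is set before calling `dataraider` or `kgwizard`. The function name `missing_api_key` is hypothetical and is not part of MERMaid. Only the variable name `OPENAI_API_KEY` comes from this README.

```python
import os


def missing_api_key(environ=os.environ):
    """Return True when OPENAI_API_KEY is unset or blank in the given mapping."""
    return not environ.get("OPENAI_API_KEY", "").strip()
```

A script could call `missing_api_key()` and stop with a clear message when it returns `True`, rather than failing partway through a run.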
88 | | -All intermediate files from each module will be saved in the `Results` folder of your root directory by default. |
89 | 101 |
|
90 | | -### 2.3 Running individual modules |
91 | | -#### 2.3.1 VisualHeist for image segmentation from scientific PDF documents |
92 | | -The main command to launch and run VisualHeist is: |
| 102 | +--- |
| 103 | + |
| 104 | +### 2.3 Running the Full MERMaid Pipeline |
| 105 | +Run the full pipeline with: |
| 106 | +```sh |
| 107 | +mermaid |
93 | 108 | ``` |
| 109 | +Intermediate files will be saved in the `Results/` directory. |
| 110 | + |
| 111 | +### 2.4 Running Individual Modules |
| 112 | + |
| 113 | +#### 2.4.1 VisualHeist – Image Segmentation from PDFs |
| 114 | +```sh |
94 | 115 | visualheist |
95 | 116 | ``` |
96 | 117 |
|
97 | | -#### 2.3.2 DataRaider for image-to-data conversion into JSON dictionaries |
98 | | -The main command to launch and run DataRaider is: |
99 | | -``` |
| 118 | +#### 2.4.2 DataRaider – Image-to-Data Conversion |
| 119 | +```sh |
100 | 120 | dataraider |
101 | 121 | ``` |
102 | | -A sample output json dictionary can be found in `Assets` folder. |
| 122 | +*A sample output JSON is available in the `Assets` folder.* |
103 | 123 |
|
104 | | -#### 2.3.3 KGWizard for data-to-knowledge graph translation |
105 | | -The main command to launch and run KGWizard is: |
106 | | -``` |
| 124 | +#### 2.4.3 KGWizard – Data-to-Knowledge Graph Translation |
| 125 | +```sh |
107 | 126 | kgwizard |
108 | | -``` |
109 | | - |
110 | | -## 3. Data Visualization |
111 | | -<Coming Soon!> |
|