diff --git a/Prompts/get_data_prompt.txt b/Prompts/get_data_prompt.txt
Lines changed: 31 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 83 additions & 68 deletions
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 9 additions & 3 deletions
diff --git a/scripts/run_dataraider.py b/scripts/run_dataraider.py
old mode 100644
new mode 100755
Lines changed: 4 additions & 31 deletions
@@ -0,0 +1,31 @@
+You will receive several images. Your goal is to extract and structure detailed information on different reaction conditions, ensuring that all specified modifications are applied correctly. Identify the standard conditions provided explicitly either in the first footnote or in the reaction diagram. ONLY use information found in these two sources.  
+
+Your task is to generate a JSON object structured as follows:
+
+Optimization Runs Dictionary: 
+This is a dictionary of dictionaries, where each entry represents an optimization run. For each run, begin by using the standard conditions. You MUST modify the conditions where specific changes are indicated in each entry, REPLACING with the correct conditions. Each run should contain the following key-value pairs:
+
+"Entry": Entry number for the run.
+"Anode": Anode material (positive end). Use abbreviations if available. NOTE: The anode and cathode may be separated by delimiters such as |, /, //, or ||, with or without spaces. The anode usually appears before the delimiter. Pay particular attention to SEPARATING the two materials before assigning them as anode or cathode. There MUST be NO delimiters in the final output. If no polarity indications or delimiters are used, assume the material is used for both the anode and cathode.
+"Cathode": Cathode material (negative end). Use abbreviations if available. Cathode usually appears after the delimiter.
+"Electrolytes": Include ALL NON-SOLVENT chemicals, such as electrolytes, additives, bases, acids, mediators, etc., separated by commas if there are multiple. You MUST use the chemical (quantity) format, where quantity can refer to amounts, equivalents (eq., equiv.), and concentrations, whichever are present. You MUST assume quantitative values and UNITS from the standard conditions when the chemical changes but explicit quantitative values are not provided.
+"Solvents": Specify ALL SOLVENTS and COSOLVENTS used. You MUST use chemical (quantity with UNITS) format, where quantity can refer to volumes, ratios etc. You must assume quantitative values and units from standard conditions when explicit quantitative values are not provided.
+"Footnote": A string representing all superscript notations associated with the run, separated by commas if there are multiple. Superscript notations may appear in any columns. Use empty string if no superscript.
+Footnotes Dictionary: 
+This dictionary stores footnotes, where each superscript notation is a key and its full explanation is the value. For a missing explanation, use "N.R.".
+
+Important Rules: 
+For any runs missing specific details, assume the values from the standard conditions unless otherwise specified. 
+For any missing information and empty cells, use "N.R." (Not Reported).
+Each material or compound should only appear once in each dictionary.
+Please provide a complete list of all entries, even if the list is long. 
+Internal standard MUST NOT be included. 
+Assume changes SUBSTITUTE the standard condition unless otherwise specified that it is an addition. 
+MAKE SURE ALL CHEMICALS are included. 
+If any COLUMN INFORMATION is not covered by the previous fields, you MUST add it to 'OTHERS'. 
+
+
+
+
+
+
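The rules in this prompt lend themselves to a mechanical check on the model's response. The following is a minimal sketch, not part of this change: the top-level key names `"Optimization Runs"` and `"Footnotes"`, the function `validate_extraction`, and the sample values are all assumptions made for illustration, since the prompt does not fix the exact top-level key names.

```python
import re

# Keys the prompt asks for in every optimization run.
REQUIRED_KEYS = ("Entry", "Anode", "Cathode", "Electrolytes", "Solvents", "Footnote")
# The prompt forbids electrode delimiters (|, /, //, ||) in the final output.
DELIMITERS = re.compile(r"[|/]")


def validate_extraction(result, runs_key="Optimization Runs", footnotes_key="Footnotes"):
    """Return a list of problems found in a parsed response (hypothetical helper)."""
    problems = []
    runs = result.get(runs_key, {})
    footnotes = result.get(footnotes_key, {})
    for run_id, run in runs.items():
        for key in REQUIRED_KEYS:
            if key not in run:
                problems.append(f"run {run_id}: missing key {key!r}")
        for electrode in ("Anode", "Cathode"):
            if DELIMITERS.search(str(run.get(electrode, ""))):
                problems.append(f"run {run_id}: delimiter left in {electrode}")
        # "Footnote" is a comma-separated string of superscript marks.
        marks = [m.strip() for m in str(run.get("Footnote", "")).split(",") if m.strip()]
        for mark in marks:
            if mark not in footnotes:
                problems.append(f"run {run_id}: footnote {mark!r} has no entry")
    return problems


# Made-up sample response, for illustration only.
sample = {
    "Optimization Runs": {
        "1": {"Entry": "1", "Anode": "C", "Cathode": "Pt",
              "Electrolytes": "salt (0.1 M)", "Solvents": "MeCN (5 mL)", "Footnote": "a"},
        "2": {"Entry": "2", "Anode": "C | Pt", "Cathode": "Pt",
              "Electrolytes": "salt (0.1 M)", "Solvents": "MeCN (5 mL)", "Footnote": "b"},
    },
    "Footnotes": {"a": "N.R."},
}
problems = validate_extraction(sample)
for line in problems:
    print(line)
```

A check like this could run on each response before post-processing, so that a leftover delimiter or a dangling footnote mark is reported instead of silently stored.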
@@ -2,110 +2,125 @@
 
 <img src="./Assets/MERMaid-overview.jpg" alt="Overview" width="600">
 
-## Note: 
-* MERMaid is an end-to-end knowledge ingestion pipeline to automatically convert disparate information conveyed through figures, schemes, and tables across various PDFs into a coherent and machine-actionable knowledge graph. It integrates three seqeuntial modules: 
-    * VisualHeist for table and figure segmentation from PDFs 
-    * DataRaider for multimodal analysis to extract relevant information as structured reaction schema
-    * KGWizard for automated knowledge graph construction
-* You can run MERMaid directly or use each module as a standalone tool for its specific functionality.
-* MERMaid is integrated with the OpenAI provider at present. We will extend MERMaid to support other providers and open-source VLMs in future updates.
-* NOTE: VisualHeist works best on systems with high RAM. For optimal performance, ensure that your system has sufficient memory, as running out of memory may cause the process to be terminated prematurely.
+### Table of Contents  
+1. [Overview](#overview)  
+2. [Installation](#1-installation)  
+3. [Usage](#2-usage)  
 
-## 1. Installation 
+## Overview  
+MERMaid is an end-to-end knowledge ingestion pipeline to automatically convert disparate information conveyed through figures, schemes, and tables across various PDFs into a coherent and machine-actionable knowledge graph. It integrates three sequential modules:  
+- **VisualHeist** for table and figure segmentation from PDFs  
+- **DataRaider** for multimodal analysis to extract relevant information as structured reaction schema  
+- **KGWizard** for automated knowledge graph construction  
 
-### 1.1 Create a new virtual environment. The recommended python version is 3.9.
+You can run MERMaid directly or use VisualHeist and DataRaider as standalone tools for their specific functionality.  
 
-```
+MERMaid is integrated with the OpenAI provider at present. We will extend MERMaid to support other providers and open-source VLMs in future updates.  
+
+VisualHeist works best on systems with **high RAM**. For optimal performance, ensure that your system has sufficient memory, as running out of memory may cause the process to be terminated prematurely.  
+
+Further usage details on KGWizard can be found in the [KGWizard README file](https://github.com/aspuru-guzik-group/MERMaid/blob/main/src/kgwizard/README.org).  
+
+---
+
+## 1. Installation  
+
+### 1.1 Create a new virtual environment  
+The recommended Python version is **3.9**.  
+
+#### Using Conda:
+```sh
 conda create -n mermaid-env python=3.9
 conda activate mermaid-env
 ```
-OR 
-```
+#### Using venv:
+```sh
 python3.9 -m venv mermaid-env
 source mermaid-env/bin/activate
 ```
 
-### 1.2 Install RxnScribe using the following steps for optical chemical structure recognition:
-```
+### 1.2 Install RxnScribe for Optical Chemical Structure Recognition  
+```sh
 git clone https://github.com/thomas0809/RxnScribe.git
 cd RxnScribe
 pip install -r requirements.txt
 python setup.py install
 cd ..
 ```
-Please note that you may get a compatibility warning stating that `MolScribe version 1.1.1 not being compatible with Torch versions >2.0`. You can safely ignore this warning.
-
-### 1.3 Install MERMaid using the following steps: 
+> ⚠️ You may see a compatibility warning about `MolScribe version 1.1.1 not being compatible with Torch versions >2.0`. This can be safely ignored.  
 
-#### Option 1 (quick installation) 
-Directly install the package. 
-```
-pip install git+https://github.com/aspuru-guzik-group/MERMaid/git
-```
-
-#### Option 2 (for development purposes)
-Download the entire repository and install the requirements.
-```
+### 1.3 Install MERMaid  
+Download the repository and install dependencies:  
+```sh
 git clone https://github.com/aspuru-guzik-group/MERMaid/
 cd MERMaid
 pip install -e .
 ```
-For full MERMaid pipeline: 
-```
+For the **full MERMaid pipeline**:  
+```sh
 pip install MERMaid[full]
 ```
-
-For individual modules only: 
-```
+For **individual modules**:  
+```sh
 pip install MERMaid[visualheist]
 pip install MERMaid[dataraider]
 pip install MERMaid[kgwizard]
 ```
 
-## 2. Usage 
-### 2.1 Setting up your plug-and-play configuration file 
-* Indicate your configuration settings in `startup.json`: 
-    * "pdf_dir": "/path/to/directory/storing/pdfs"
-    * "image_dir": "/path/to/directory/to/store/extracted/images"
-    * "json_dir": "/path/to/directory/to/store/json/output"
-    * "graph_dir": "/path/to/directory/to/store/graph/files"
-    * "api_key": "your_api_key_here"
-    * "model_size": "model_size_here" ("base" OR "large")
-    * "keys": ["key1", "key2"] (the in-built reaction parameter keys can be found in `Prompts/inbuilt_keyvaluepairs.txt`) 
-    * "new_keys": define your custom keys here 
-
-* For post-processing extracted JSON reaction dictionaries: 
-    * you can add your own common chemical names by modifying the `COMMON_NAMES` constant in `dataraider/postprocess.py`
-    * you can also add your own key names that you want to be cleaned by modifying the `KEYS` constant in `dataraider/postprocess.py`
-
-* We have included an additional `filter_prompt` in `Prompts/` folder to identify only images that are relevant to your task of interest. You are highly encouraged to specify your own task and keys. 
-
-### 2.2 Running the end-to-end MERMaid pipeline 
-The main command to launch and run MERMaid is: 
+---
+
+## 2. Usage  
+
+### 2.1 Setting Up Your Configuration File  
+Define settings in `startup.json`:  
+```json
+{
+  "pdf_dir": "/path/to/directory/storing/pdfs",
+  "image_dir": "/path/to/directory/to/store/extracted/images",
+  "json_dir": "/path/to/directory/to/store/json/output",
+  "graph_dir": "/path/to/directory/to/store/graph/files",
+  "model_size": "base",
+  "keys": ["key1", "key2"],
+  "new_keys": [],
+  "graph_name": "your_graph_name",
+  "schema": "your_schema_name"
+}
 ```
-mermaid
+- The in-built reaction parameter keys are in `Prompts/inbuilt_keyvaluepairs.txt`.  
+- For post-processing extracted JSON reaction dictionaries:  
+  - Modify `COMMON_NAMES` in `dataraider/postprocess.py` to add custom chemical names.  
+  - Modify `KEYS` in `dataraider/postprocess.py` to clean specific key names.  
+- Customize `filter_prompt` in `Prompts/` to filter relevant images.  
+
+### 2.2 Setting Up API Key  
+The environment variable **`OPENAI_API_KEY`** is required for **DataRaider** and **KGWizard**.  
+
+```sh
+export OPENAI_API_KEY="your-openai-api-key"
 ```
-All intermediate files from each module will be saved in the `Results` folder of your root directory by default.
 
-### 2.3 Running individual modules 
-#### 2.3.1 VisualHeist for image segmentation from scientific PDF documents 
-The main command to launch and run VisualHeist is: 
+---
+
+### 2.3 Running the Full MERMaid Pipeline  
+Run the full pipeline with:  
+```sh
+mermaid
 ```
+Intermediate files will be saved in the `Results/` directory.  
+
+### 2.4 Running Individual Modules  
+
+#### 2.4.1 VisualHeist – Image Segmentation from PDFs  
+```sh
 visualheist
 ```
 
-#### 2.3.2 DataRaider for image-to-data conversion into JSON dictionaries 
-The main command to launch and run DataRaider is: 
-```
+#### 2.4.2 DataRaider – Image-to-Data Conversion  
+```sh
 dataraider
 ```
-A sample output json dictionary can be found in `Assets` folder. 
+*A sample output JSON is available in the `Assets` folder.*  
 
-#### 2.3.3 KGWizard for data-to-knowledge graph translation 
-The main command to launch and run KGWizard is: 
-```
+#### 2.4.3 KGWizard – Data-to-Knowledge Graph Translation  
+```sh
 kgwizard
-```
-
-## 3. Data Visualization 
-<Coming Soon!>
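Because the revised README drops `api_key` from `startup.json` and lists new settings such as `graph_name` and `schema`, an existing configuration file may need updating. The sketch below shows one way to check a loaded configuration against the settings named in the README. The helper names `find_config_issues` and `load_startup` are hypothetical and not part of MERMaid.

```python
import json
from pathlib import Path

# Settings listed in the README's startup.json example.
EXPECTED_SETTINGS = (
    "pdf_dir", "image_dir", "json_dir", "graph_dir",
    "model_size", "keys", "new_keys", "graph_name", "schema",
)
PATH_SETTINGS = ("pdf_dir", "image_dir", "json_dir", "graph_dir")


def find_config_issues(config):
    """Return a list of issues for an already-parsed startup.json dictionary."""
    issues = [f"missing setting: {name}" for name in EXPECTED_SETTINGS if name not in config]
    for name in PATH_SETTINGS:
        value = config.get(name, "")
        # The README template uses "/path/to/..." placeholders.
        if isinstance(value, str) and value.startswith("/path/to/"):
            issues.append(f"{name} still holds the placeholder path")
    if "api_key" in config:
        issues.append("api_key found in the file; export OPENAI_API_KEY instead")
    return issues


def load_startup(path):
    """Read startup.json from disk and return the parsed dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Example with an old-style configuration that still carries an API key.
old_style = {"pdf_dir": "/path/to/directory/storing/pdfs", "api_key": "placeholder"}
for issue in find_config_issues(old_style):
    print(issue)
```

Keeping the check separate from file loading makes it easy to apply to a dictionary built in tests or assembled from command-line overrides.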
 
@@ -30,8 +30,15 @@ dataraider = [
     "requests", "opencv-python", "numpy", "regex",
     "pubchempy", "openai", "huggingface_hub"
 ]
-kgwizard = ["gremlinpython", "openai"]
-full = ["visualheist", "dataraider", "kgwizard"]
+kgwizard = ["gremlinpython", "openai", "numpy"]
+full = [
+    "pdf2image", "Pillow", "transformers", "safetensors",
+    "torch==2.3.0", "torchvision==0.18.0", "pytorch-lightning==2.3.0",
+    "huggingface_hub", "einops", "timm",
+    "requests", "opencv-python", "numpy", "regex",
+    "pubchempy", "openai",
+    "gremlinpython"
+]
 
 [project.scripts]
 dataraider   = "scripts.run_dataraider:main"
@@ -40,7 +47,6 @@ kgwizard     = "src.kgwizard.__main__:main"
 mermaid      = "scripts.run_mermaid:main"
 
 [tool.setuptools]
-# Tell setuptools to look in BOTH src/ and the current directory (where "scripts" is).
 packages = { find = { where = [".", "src"] } }
 
 [tool.setuptools.package-data]
 
@@ -48,7 +48,7 @@ def main():
     parser.add_argument("--json_dir", type=str, help="Directory to save processed JSON data", default=None)
     parser.add_argument("--keys", type=str, nargs='+', help="List of keys to extract", default=None)
     parser.add_argument("--new_keys", type=str, nargs='+', help="List of new keys for data extraction", default=None)
-    parser.add_argument("--api_key", type=str, help="API key", default=None)
+    # parser.add_argument("--api_key", type=str, help="API key", default=None)
 
     args = parser.parse_args()
 
@@ -68,10 +68,10 @@ def main():
         json_dir = config.get('default_json_dir')
     keys = config.get('keys', ["Entry", "Catalyst", "Ligand", "Cathode", "Solvents"])
     new_keys = config.get('new_keys', None)
-    api_key = config.get('api_key', None)
+    # api_key = config.get('api_key', None)
+    api_key = os.environ.get("OPENAI_API_KEY")
 
     info = DataRaiderInfo(api_key=api_key, device="cpu", ckpt_path=ckpt_path)
-    print(f'keys are {keys}')
     # Construct the initial reaction data extraction prompt
     print("Constructing your custom reaction data extraction prompt")
     construct_initial_prompt(prompt_dir, keys, new_keys)
@@ -86,31 +86,4 @@ def main():
 
 
 if __name__ == "__main__":
-    main()
-
-
-    # # Load the configuration from the file
-    # config = load_config('./mermaid/startup.json')
-
-    # # Use the default configuration in the function call if unspecified 
-    # image_dir = config.get('image_dir', config.get('default_image_dir'))
-    # prompt_dir = config.get('prompt_dir', "./Prompts")
-    # get_data_prompt = "get_data_prompt"
-    # update_dict_prompt = "update_dict_prompt"
-    # output_dir = config.get('json_dir', config.get('default_json_dir'))
-    # keys = config.get('keys', ["Entry", "Catalyst", "Ligand", "Cathode", "Solvents"])
-    # new_keys = config.get('new_keys', None)
-    # api_key = config.get('api_key', None)
-
-    # # Use the loaded configuration in the function call
-    # info = DataRaiderInfo(api_key=api_key, device="cpu", ckpt_path=ckpt_path)
-    # # processor = RxnOptDataProcessor(ckpt_path=ckpt_path, device='cpu', api_key=api_key)
-    # print("Constructing your custom reaction data extraction prompt")
-    # construct_initial_prompt(keys, new_keys)
-    # # processor.construct_initial_prompt(keys, new_keys)
-    # print('######################## Starting up DataRaider ############################')
-    # batch_process_images(info, image_dir, prompt_dir, get_data_prompt, update_dict_prompt, output_dir)
-    # # processor.batch_process_images(image_dir, prompt_dir, get_data_prompt, update_dict_prompt, output_dir)
-    # print('Clearing temporary files and custom prompts')
-    # clear_temp_files(prompt_dir, image_dir)
-    # # processor.clear_temp_files(prompt_dir, image_dir)
+    main()
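With this change, `os.environ.get("OPENAI_API_KEY")` yields `None` when the variable is unset, and that value is passed on to `DataRaiderInfo`. A fail-fast lookup is one possible refinement. The sketch below is an assumption about how that could look, not code from this PR, and `require_api_key` is a hypothetical helper name.

```python
import os


def require_api_key(environ=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error."""
    # Accept an explicit mapping so the function can be exercised without
    # touching the real process environment.
    environ = os.environ if environ is None else environ
    key = environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running DataRaider")
    return key


# Example: a mapping that stands in for the process environment.
print(require_api_key({"OPENAI_API_KEY": "example-key"}))
```

Raising early keeps the failure close to its cause, rather than surfacing later as an authentication error from the provider.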