Evaluation of Large Language Models for Nanomaterials Data Extraction

Abstract

Large language models can automate knowledge extraction, but generic systems struggle with the complexities of nanomaterials research. We introduce MINION, a specialized framework for extracting information from full-text papers in this field. Combining vision and natural language processing with a dedicated knowledge base, MINION converts papers into structured records. It effectively links figures to text, resolves complex chemical entities, and captures critical experimental details often overlooked by general models. A qualitative evaluation on the CHEMX dataset shows that MINION achieves significantly more accurate and comprehensive extractions than general-purpose models, underscoring the need for domain-specific architectures in scientific information mining.

Project Overview

MINION is a multi-agent system for extracting structured data from nanomaterials research articles. The system includes:

  • Vision agent: Extracts and analyzes figures, tables, and images from PDFs (YOLO + GPT-4o).
  • Main LLM agent: Extracts parameters from article text (LLM inference, structured output).
  • NER agent: Extracts entities and parameters using a supervised fine-tuned (SFT) NER model.

The project supports the full pipeline: data preparation, training, inference, and model comparison.

Repository Structure

  • graph_processing/ — Vision agent: PDF processing, figure/table extraction, YOLO & OpenAI integration.
  • structured_output/ — Main LLM agent: structured extraction from text.
  • ner_data_prep/ — NER agent: data preparation, training, and inference (SFT).
  • data/ — Results, benchmarks, and auxiliary files.

NER Agent: Data Preparation, Training, and Inference

Data Preparation

  1. Collect and split data by domain:
    python ner_data_prep/scripts/01_collect_and_split.py ner_data_prep/data ner_data_prep/datasplits
  2. Clean and validate data:
    python ner_data_prep/scripts/02_clean_jsonl.py --in ner_data_prep/datasplits/train.jsonl --out ner_data_prep/datasplits/clean_train.jsonl --max_tokens 12000
    # Repeat for val/test
  3. Convert to dialogue format:
    python ner_data_prep/scripts/03.py ner_data_prep/datasplits/clean_train.jsonl ner_data_prep/datasplits/clean_train_converted.jsonl
    # Repeat for val/test
  4. Create a HuggingFace Dataset (a driver sketch chaining all four steps follows this list):
    python ner_data_prep/scripts/04.py ner_data_prep/datasplits/clean_train_converted.jsonl ner_data_prep/datasplits/clean_val_converted.jsonl ner_data_prep/datasplits/clean_test_converted.jsonl ner_data_prep/hf_dataset
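
The four numbered steps above can be chained into a single driver. The sketch below is a hypothetical convenience wrapper (not part of the repository) that simply shells out to the documented scripts with the same paths and arguments:

    # run_ner_data_prep.py: hypothetical driver chaining the four documented steps
    import subprocess
    import sys

    BASE = "ner_data_prep"
    SPLITS = ["train", "val", "test"]

    def run(cmd):
        # Echo the command and abort the pipeline on the first failure.
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Collect and split data by domain.
    run([sys.executable, f"{BASE}/scripts/01_collect_and_split.py",
         f"{BASE}/data", f"{BASE}/datasplits"])

    # 2-3. Clean/validate each split, then convert it to dialogue format.
    for split in SPLITS:
        run([sys.executable, f"{BASE}/scripts/02_clean_jsonl.py",
             "--in", f"{BASE}/datasplits/{split}.jsonl",
             "--out", f"{BASE}/datasplits/clean_{split}.jsonl",
             "--max_tokens", "12000"])
        run([sys.executable, f"{BASE}/scripts/03.py",
             f"{BASE}/datasplits/clean_{split}.jsonl",
             f"{BASE}/datasplits/clean_{split}_converted.jsonl"])

    # 4. Build the HuggingFace Dataset from the three converted splits.
    run([sys.executable, f"{BASE}/scripts/04.py",
         *[f"{BASE}/datasplits/clean_{s}_converted.jsonl" for s in SPLITS],
         f"{BASE}/hf_dataset"])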

Training

For training, use effective_llm_alignment: configure your training YAML and specify the path to the HF Dataset built above.
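
Before launching training, it can be useful to sanity-check the dataset built in step 4. A minimal sketch, assuming script 04 writes the dataset to disk at the path shown above (if it pushes to the Hub instead, use load_dataset with the repository name):

    # Quick sanity check of the prepared dataset before pointing the training YAML at it.
    from datasets import load_from_disk

    ds = load_from_disk("ner_data_prep/hf_dataset")  # path produced in step 4
    print(ds)              # available splits and row counts
    print(ds["train"][0])  # one dialogue-format example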

Inference

To run inference with trained NER models:

python ner_data_prep/scripts/infer_ner.py

The script runs both model variants (Qwen1.5-7B and Llama-3.1-8B) on all splits of the zjkarina/nanoMINER_test dataset and saves results to separate .jsonl files.
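
The evaluation data itself can be inspected independently of the script. A minimal sketch, assuming the datasets library is installed; the printed column names depend on the actual dataset schema:

    # Load the public test dataset used by infer_ner.py and inspect its splits.
    from datasets import load_dataset

    ds = load_dataset("zjkarina/nanoMINER_test")
    for split_name, split in ds.items():
        print(split_name, len(split), split.column_names)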

Vision Agent (graph_processing/)

  • Extracts images from PDFs, detects figures/tables (YOLO), and analyzes them with GPT-4o; a sketch of this flow follows below.
  • See graph_processing/README.md for details.
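
The sketch below illustrates the general flow only, under stated assumptions: yolo_figures.pt is a placeholder for trained detection weights, paper.pdf is an arbitrary input, and OPENAI_API_KEY is set in the environment. It is not the repository's actual implementation; see graph_processing/README.md for the real entry points.

    # Sketch: render PDF pages, detect figure/table regions with YOLO, describe them with GPT-4o.
    import base64
    import fitz                    # PyMuPDF, used for PDF rendering
    from ultralytics import YOLO   # figure/table detector
    from openai import OpenAI

    detector = YOLO("yolo_figures.pt")  # placeholder path to detection weights
    client = OpenAI()                   # reads OPENAI_API_KEY from the environment

    doc = fitz.open("paper.pdf")
    for page_index, page in enumerate(doc):
        # Render the page to an image for detection.
        pix = page.get_pixmap(dpi=200)
        image_path = f"page_{page_index}.png"
        pix.save(image_path)

        # Detect candidate figure/table regions on the rendered page.
        results = detector(image_path)
        if len(results[0].boxes) == 0:
            continue

        # Ask GPT-4o to describe the detected graphical content.
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe the figures and tables on this page."},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        print(page_index, response.choices[0].message.content)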

Main LLM Agent (structured_output/)

  • Extracts nanomaterial parameters from article text.
  • Structured output; supports various models (Llama, Qwen, Mistral, etc.).
  • Examples and instructions in structured_output/; a generic sketch follows below.
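
The sketch below shows the general idea of schema-constrained extraction with an OpenAI-compatible API; the NanomaterialRecord fields are hypothetical and do not reflect the repository's actual schema, which is defined in structured_output/. It assumes a recent openai SDK with the beta structured-output parse helper.

    # Sketch: schema-constrained extraction of nanomaterial parameters from article text.
    from openai import OpenAI
    from pydantic import BaseModel

    class NanomaterialRecord(BaseModel):
        # Hypothetical fields for illustration only.
        material: str
        synthesis_method: str
        particle_size_nm: float | None

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    article_text = "Gold nanoparticles (~15 nm) were prepared by citrate reduction ..."

    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract nanomaterial parameters from the article."},
            {"role": "user", "content": article_text},
        ],
        response_format=NanomaterialRecord,
    )
    print(completion.choices[0].message.parsed)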

Requirements

  • Python 3.10+
  • torch, transformers, datasets, ultralytics, fitz (PyMuPDF), pillow, rich, typer, tiktoken, and more.
  • For NER agent: see ner_data_prep/requirements.txt
  • For vision agent: see graph_processing/README.md

License

  • License: MIT
