LLM_Material_Property_Benchmark: A Python toolkit for evaluating Large Language Models (LLMs) in materials science workflows
This software is a Python toolkit for evaluating Large Language Models (LLMs) in materials science workflows. It specifically tests whether chemical element names are tokenized uniquely and probes a model's ability to predict material properties. For full details, please refer to the accompanying paper.
This benchmark evaluates two key aspects of LLMs for materials science applications:
- Token Uniqueness Analysis: Examines how chemical element names are tokenized and visualizes the distribution across the periodic table
- Property Prediction Assessment: Tests the model's ability to predict material properties (e.g., melting temperature) with accuracy evaluation
Key features:
- Periodic Table Visualization: Color-coded periodic table showing tokenization patterns and prediction accuracy
- Token Analysis: Statistical analysis of element name tokenization across different LLMs
- Property Prediction: Automated testing of material property predictions with fuzzy matching
- Multiple Model Support: Compatible with Hugging Face transformers models
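To illustrate the token uniqueness idea, here is a minimal sketch using a toy greedy longest-match tokenizer as a stand-in for a real LLM tokenizer (the `VOCAB` and the `tokenize` helper are illustrative assumptions, not the toolkit's actual code): an element name that maps to a single token is counted as "uniquely" tokenized.

```python
# Toy subword vocabulary standing in for a real LLM tokenizer's vocabulary.
VOCAB = {"iron", "gold", "ti", "tan", "ium", "hydro", "gen", "car", "bon"}

def tokenize(word: str) -> list[str]:
    """Greedy longest-match subword tokenization (BPE-like stand-in)."""
    word = word.lower()
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

# Count tokens per element name; single-token names are "uniquely" tokenized.
counts = {name: len(tokenize(name))
          for name in ["Iron", "Gold", "Titanium", "Hydrogen", "Carbon"]}
unique = [name for name, n in counts.items() if n == 1]
# e.g. "Titanium" splits into ["ti", "tan", "ium"], while "Iron" is one token
```

In the actual benchmark the token counts come from the loaded Hugging Face tokenizer and are plotted over the periodic table.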
- Download the code file: save the complete code as `LLM_Material_Property_Benchmark.py`
- Install dependencies:

```shell
pip install torch transformers matplotlib pandas mendeleev fuzzywuzzy
```

- Run the complete benchmark:

```shell
python llm_materials_benchmark.py
```

The script will automatically:
- Load the Gemma-2-2b model (you can change the `ckpt` variable at the top)
- Analyze token uniqueness for all chemical elements
- Generate periodic table visualizations
- Test property prediction capabilities
- Display results and statistics
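Swapping in a different model only requires editing the checkpoint variable near the top of the script. A sketch (the variable name `ckpt` is taken from the description above; the commented alternatives are illustrative Hugging Face model IDs, not tested defaults):

```python
# Hugging Face checkpoint ID used by the benchmark script.
ckpt = "google/gemma-2-2b"          # default
# ckpt = "meta-llama/Llama-3.2-1B"  # any causal LM on the Hub should work
# ckpt = "Qwen/Qwen2.5-1.5B"
```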
Supported models:
- Google Gemma models (default example is Gemma-2-2b)
- Meta Llama models
- Microsoft Phi models
- Alibaba Qwen models
- Any Hugging Face causal language model
Configuration parameters:
- `property`: Material property to test (default: `"melting temperature"`)
- `threshold`: Accuracy threshold for fuzzy matching of numerical values (default: 95%)
- `n_new_tokens`: Maximum tokens to generate for the answer (default: 10)
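The toolkit relies on FuzzyWuzzy for the threshold check. As a rough standard-library sketch of the same idea (FuzzyWuzzy's `fuzz.ratio` is based on `difflib.SequenceMatcher`; the `fuzzy_match` helper below is a hypothetical illustration, not the toolkit's API):

```python
from difflib import SequenceMatcher

def fuzzy_match(predicted: str, truth: str, threshold: int = 95) -> bool:
    """Return True if the similarity score (in percent) meets the threshold."""
    score = round(100 * SequenceMatcher(None, predicted, truth).ratio())
    return score >= threshold

fuzzy_match("1538 C", "1538 C")  # identical answers pass
fuzzy_match("1500 C", "1538 C")  # differing digits fall below a 95% threshold
```

A high threshold like 95% effectively demands a near-exact numerical answer while tolerating minor formatting noise in the generated text.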
- Token Uniqueness: Colors represent number of tokens per element name
- Prediction Accuracy:
- Green: Correct predictions
- Red: Incorrect predictions
- Blue: Correct handling of missing data
- Orange: Hallucinations (predictions for missing ground truth)
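A minimal sketch of how the four legend colors could be assigned per element (the function name and signature are assumptions for illustration, not the toolkit's actual API):

```python
def classify_prediction(prediction, ground_truth, matched: bool) -> str:
    """Map one element's prediction outcome to its legend color."""
    if ground_truth is None:          # no tabulated value for this element
        if prediction is None:
            return "blue"             # model correctly reports missing data
        return "orange"               # hallucination: answer without ground truth
    return "green" if matched else "red"  # correct vs incorrect prediction

classify_prediction("1538 C", "1538 C", matched=True)  # "green"
classify_prediction(None, None, matched=False)         # "blue"
```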
Besides the usual contribution workflow of forking the repository, creating a branch, pushing it, and opening a pull request, you can also contact me directly for scientific discussions about the next steps. My goal is to create a materials science benchmark that assesses LLMs on Processing-Structure-Property-Performance chain reasoning. I look forward to your contributions!
This project is licensed under the MIT License - see the LICENSE file for details.
The code is archived on Zenodo. If you use this benchmark in your research, please cite:
```bibtex
@software{LLM_Material_Property_Benchmark,
  title={LLM_Material_Property_Benchmark: A Python toolkit for evaluating Large Language Models (LLMs) in materials science workflows},
  author={Adrian Ehrenhofer},
  year={2025},
  url={https://github.com/ehrenhofer-group/LLM_Material_Property_Benchmark}
}
```

The accompanying paper is:
```bibtex
@misc{Ehrenhofer2025LLM,
  title={What do Large Language Models know about materials?},
  author={Adrian Ehrenhofer and Thomas Wallmersperger and Gianaurelio Cuniberti},
  year={2025},
  eprint={2507.14586},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2507.14586}
}
```

- Built using the Mendeleev package for chemical data
- Utilizes HuggingFace Transformers for model loading
- Fuzzy string matching powered by FuzzyWuzzy
- Matplotlib for plotting