LLM_Material_Property_Benchmark: A Python toolkit for evaluating Large Language Models (LLMs) in materials science workflows
This software is a Python toolkit for evaluating Large Language Models (LLMs) in materials science workflows. It specifically tests whether chemical element names are tokenized uniquely and probes a model's ability to predict material properties. For full details, please refer to the accompanying paper.
This benchmark evaluates two key aspects of LLMs for materials science applications:
- Token Uniqueness Analysis: Examines how chemical element names are tokenized and visualizes the distribution across the periodic table
- Property Prediction Assessment: Tests the model's ability to predict material properties (e.g., melting temperature) with accuracy evaluation
Key features:
- Periodic Table Visualization: Color-coded periodic table showing tokenization patterns and prediction accuracy
- Token Analysis: Statistical analysis of element name tokenization across different LLMs
- Property Prediction: Automated testing of material property predictions with fuzzy matching
- Multiple Model Support: Compatible with Hugging Face transformers models
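To illustrate the token uniqueness idea, here is a minimal sketch using a toy greedy longest-match tokenizer as a stand-in for a real LLM tokenizer (the `VOCAB` and the `tokenize` helper are illustrative assumptions, not the toolkit's actual code): an element name that maps to a single token is counted as "uniquely" tokenized.

```python
# Toy subword vocabulary standing in for a real LLM tokenizer's vocabulary.
VOCAB = {"iron", "gold", "ti", "tan", "ium", "hydro", "gen", "car", "bon"}

def tokenize(word: str) -> list[str]:
    """Greedy longest-match subword tokenization (BPE-like stand-in)."""
    word = word.lower()
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

# Count tokens per element name; single-token names are "uniquely" tokenized.
counts = {name: len(tokenize(name))
          for name in ["Iron", "Gold", "Titanium", "Hydrogen", "Carbon"]}
unique = [name for name, n in counts.items() if n == 1]
# e.g. "Titanium" splits into ["ti", "tan", "ium"], while "Iron" is one token
```

In the actual benchmark the token counts come from the loaded Hugging Face tokenizer and are plotted over the periodic table.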
- Download the code file: save the complete code as `LLM_Material_Property_Benchmark.py`
- Install dependencies:

```shell
pip install torch transformers matplotlib pandas mendeleev fuzzywuzzy
```

- Run the complete benchmark:

```shell
python llm_materials_benchmark.py
```

The script will automatically:
- Load the Gemma-2-2b model (you can change the `ckpt` variable at the top)
- Analyze token uniqueness for all chemical elements
- Generate periodic table visualizations
- Test property prediction capabilities
- Display results and statistics
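Swapping in a different model only requires editing the checkpoint variable near the top of the script. A sketch (the variable name `ckpt` is taken from the description above; the commented alternatives are illustrative Hugging Face model IDs, not tested defaults):

```python
# Hugging Face checkpoint ID used by the benchmark script.
ckpt = "google/gemma-2-2b"          # default
# ckpt = "meta-llama/Llama-3.2-1B"  # any causal LM on the Hub should work
# ckpt = "Qwen/Qwen2.5-1.5B"
```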
Supported models:
- Google Gemma models (default example is Gemma-2-2b)
- Meta Llama models
- Microsoft Phi models
- Alibaba Qwen models
- Any Hugging Face causal language model
Configuration parameters:
- `property`: Material property to test (default: `"melting temperature"`)
- `threshold`: Accuracy threshold for fuzzy matching of numerical values (default: 95%)
- `n_new_tokens`: Maximum tokens to generate for the answer (default: 10)
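The toolkit relies on FuzzyWuzzy for the threshold check. As a rough standard-library sketch of the same idea (FuzzyWuzzy's `fuzz.ratio` is based on `difflib.SequenceMatcher`; the `fuzzy_match` helper below is a hypothetical illustration, not the toolkit's API):

```python
from difflib import SequenceMatcher

def fuzzy_match(predicted: str, truth: str, threshold: int = 95) -> bool:
    """Return True if the similarity score (in percent) meets the threshold."""
    score = round(100 * SequenceMatcher(None, predicted, truth).ratio())
    return score >= threshold

fuzzy_match("1538 C", "1538 C")  # identical answers pass
fuzzy_match("1500 C", "1538 C")  # differing digits fall below a 95% threshold
```

A high threshold like 95% effectively demands a near-exact numerical answer while tolerating minor formatting noise in the generated text.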
- Token Uniqueness: Colors represent number of tokens per element name
- Prediction Accuracy:
- Green: Correct predictions
- Red: Incorrect predictions
- Blue: Correct handling of missing data
- Orange: Hallucinations (predictions for missing ground truth)
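A minimal sketch of how the four legend colors could be assigned per element (the function name and signature are assumptions for illustration, not the toolkit's actual API):

```python
def classify_prediction(prediction, ground_truth, matched: bool) -> str:
    """Map one element's prediction outcome to its legend color."""
    if ground_truth is None:          # no tabulated value for this element
        if prediction is None:
            return "blue"             # model correctly reports missing data
        return "orange"               # hallucination: answer without ground truth
    return "green" if matched else "red"  # correct vs incorrect prediction

classify_prediction("1538 C", "1538 C", matched=True)  # "green"
classify_prediction(None, None, matched=False)         # "blue"
```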
Besides the usual contribution workflow of forking the repository, creating a branch, pushing it, and opening a pull request, you can also contact me directly for scientific discussions about the next steps. My goal is to create a materials science benchmark that assesses LLMs on Processing-Structure-Property-Performance chain reasoning. I look forward to your contributions!
This project is licensed under the MIT License - see the LICENSE file for details.
The code is archived on Zenodo. If you use this benchmark in your research, please cite:
```bibtex
@software{LLM_Material_Property_Benchmark,
  title={LLM_Material_Property_Benchmark: A Python toolkit for evaluating Large Language Models (LLMs) in materials science workflows},
  author={Adrian Ehrenhofer},
  year={2025},
  url={https://github.com/ehrenhofer-group/LLM_Material_Property_Benchmark}
}
```

The accompanying paper is:
```bibtex
@misc{Ehrenhofer2025LLM,
  title={What do Large Language Models know about materials?},
  author={Adrian Ehrenhofer and Thomas Wallmersperger and Gianaurelio Cuniberti},
  year={2025},
  eprint={2507.14586},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2507.14586}
}
```

- Built using the Mendeleev package for chemical data
- Utilizes HuggingFace Transformers for model loading
- Fuzzy string matching powered by FuzzyWuzzy
- Matplotlib for plotting