LLM_Material_Property_Benchmark: A Python toolkit for evaluating Large Language Models (LLMs) in materials science workflows

This software is a Python toolkit for evaluating Large Language Models (LLMs) in materials science workflows; it specifically tests the tokenization uniqueness of chemical element names and the models' predictive capabilities for material properties. For full details, please refer to the accompanying paper on arXiv (see Citation below).

Overview

This benchmark evaluates two key aspects of LLMs for materials science applications:

  1. Token Uniqueness Analysis: Examines how chemical element names are tokenized and visualizes the distribution across the periodic table
  2. Property Prediction Assessment: Tests the model's ability to predict material properties (e.g., melting temperature) with accuracy evaluation
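The token-uniqueness idea can be illustrated with a short sketch. Note that `toy_tokenize` and `TOY_VOCAB` below are stand-ins for a real LLM tokenizer (e.g. one loaded via Hugging Face `transformers`), not the toolkit's actual code:

```python
# Sketch of token-uniqueness counting (illustrative only).
# TOY_VOCAB stands in for a real subword vocabulary; a real run would use
# a transformers tokenizer instead of toy_tokenize.
TOY_VOCAB = {"hydro", "gen", "iron", "carbon", "ox", "ygen"}

def toy_tokenize(word):
    """Greedy longest-match split of `word` against TOY_VOCAB."""
    tokens, i = [], 0
    w = word.lower()
    while i < len(w):
        for j in range(len(w), i, -1):
            if w[i:j] in TOY_VOCAB:
                tokens.append(w[i:j])
                i = j
                break
        else:
            tokens.append(w[i])  # fall back to a single character
            i += 1
    return tokens

elements = ["Hydrogen", "Iron", "Carbon", "Oxygen"]
# Elements whose name maps to a single token have a "unique" representation
counts = {e: len(toy_tokenize(e)) for e in elements}
print(counts)
```

With a real tokenizer, the same per-element token counts are what gets color-coded onto the periodic table.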

Features

  • Periodic Table Visualization: Color-coded periodic table showing tokenization patterns and prediction accuracy
  • Token Analysis: Statistical analysis of element name tokenization across different LLMs
  • Property Prediction: Automated testing of material property predictions with fuzzy matching
  • Multiple Model Support: Compatible with Hugging Face transformers models
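The toolkit uses the fuzzywuzzy package for the fuzzy matching mentioned above; the same similarity-ratio idea can be sketched with the standard library's `difflib` (the function below is illustrative, not the toolkit's API):

```python
# Sketch of threshold-based fuzzy matching between a predicted value string
# and the ground truth, using difflib instead of fuzzywuzzy.
from difflib import SequenceMatcher

def fuzzy_match(predicted, truth, threshold=95):
    """Return True if the two strings are at least `threshold` percent similar."""
    ratio = 100 * SequenceMatcher(None, predicted.lower(), truth.lower()).ratio()
    return ratio >= threshold

print(fuzzy_match("1538 °C", "1538 °C"))               # exact match passes
print(fuzzy_match("1535 °C", "1538 °C", threshold=95)) # one digit off fails at 95%
```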

Installation and Usage

Quick Start

  1. Download the code file: Save the complete code as LLM_Material_Property_Benchmark.py

  2. Install dependencies:

pip install torch transformers matplotlib pandas mendeleev fuzzywuzzy
  3. Run the complete benchmark:

python LLM_Material_Property_Benchmark.py

The script will automatically:

  • Load the Gemma-2-2b model (you can change the ckpt variable at the top)
  • Analyze token uniqueness for all chemical elements
  • Generate periodic table visualizations
  • Test property prediction capabilities
  • Display results and statistics

Configuration

Supported Models

  • Google Gemma models (default example is Gemma-2-2b)
  • Meta Llama models
  • Microsoft Phi models
  • Alibaba Qwen models
  • Any Hugging Face causal language model
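Concretely, switching models means editing the `ckpt` variable mentioned in Quick Start. The Hugging Face model IDs below are examples, not an exhaustive list:

```python
# Model selection via the `ckpt` variable at the top of the script.
# These repository IDs are examples of the supported model families.
ckpt = "google/gemma-2-2b"          # default
# ckpt = "meta-llama/Llama-3.2-1B"  # Meta Llama
# ckpt = "microsoft/phi-2"          # Microsoft Phi
# ckpt = "Qwen/Qwen2-1.5B"          # Alibaba Qwen
```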

Customizable Parameters

  • property: Material property to test (default: "melting temperature")
  • threshold: Accuracy threshold for fuzzy matching of numerical values (default: 95%)
  • n_new_tokens: Maximum tokens to generate for the answer (default: 10)
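The parameters above might be set as follows; the variable names come from the README, but how they are wired into the script (and the exact prompt wording) is illustrative:

```python
# Sketch of the benchmark's tunable parameters.
property = "melting temperature"  # material property queried per element
threshold = 95                    # fuzzy-match similarity threshold in percent
n_new_tokens = 10                 # max tokens generated per answer

# An illustrative per-element query built from the property name:
prompt = f"What is the {property} of Iron?"
print(prompt)
```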

Output Interpretation

Color Coding

  • Token Uniqueness: Colors represent the number of tokens per element name
  • Prediction Accuracy:
    • Green: Correct predictions
    • Red: Incorrect predictions
    • Blue: Correct handling of missing data
    • Orange: Hallucinations (predictions for missing ground truth)
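The four-way classification behind this color scheme can be sketched as a small function (the name and signature are illustrative, not the toolkit's API):

```python
# Sketch of the prediction-outcome classification used for coloring.
def outcome_color(has_ground_truth, model_answered, correct=False):
    """Map a per-element benchmark outcome to its plot color."""
    if has_ground_truth:
        return "green" if correct else "red"
    # No ground truth available for this element:
    # answering anyway is a hallucination; declining is correct handling.
    return "orange" if model_answered else "blue"

print(outcome_color(True, True, correct=True))  # correct prediction
print(outcome_color(False, True))               # hallucinated a missing value
```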

Contributing

Besides the usual contribution workflow (fork the repository, create a branch, push it, and open a pull request), you can also contact me directly for scientific discussions about next steps. My goal is to create a materials science benchmark for assessing LLMs on Processing-Structure-Property-Performance chain reasoning. I look forward to your contributions!

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

The code is archived on Zenodo. If you use this benchmark in your research, please cite:

@software{LLM_Material_Property_Benchmark,
  title={LLM_Material_Property_Benchmark: A Python toolkit for evaluating Large Language Models (LLMs) in materials science workflows},
  author={Adrian Ehrenhofer},
  year={2025},
  url={https://github.com/ehrenhofer-group/LLM_Material_Property_Benchmark}
}

The accompanying paper is:

@misc{Ehrenhofer2025LLM,
      title={What do Large Language Models know about materials?}, 
      author={Adrian Ehrenhofer and Thomas Wallmersperger and Gianaurelio Cuniberti},
      year={2025},
      eprint={2507.14586},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2507.14586}, 
}

Acknowledgments
