Repository for code backing the article "LLM-based feature generation from text for interpretable machine learning".


LLM Extraction of Interpretable Features from Text

This repository contains the code and supplementary materials for the scientific article "LLM Extraction of Interpretable Features from Text." The aim of this project is to demonstrate how large language models (LLMs) can be used to extract interpretable features from textual data. We further demonstrate the use of these interpretable features with action rules.

Table of Contents

- Introduction
- Installation
- Experiments
- License

Introduction

Existing text representations, such as embeddings and bag-of-words, are not suitable for rule learning due to their high dimensionality and absent or questionable feature-level interpretability. This article explores whether large language models (LLMs) could address this by extracting a small number of interpretable features from text. Additionally, we introduce an LLM-based feature discovery mechanism that autonomously identifies key attributes directly from the data, thereby reducing reliance on manual feature selection.

We demonstrate this process primarily on two datasets (CORD-19 and M17+) containing several thousand scientific articles with a target proxy for research impact. An evaluation based on statistically significant correlation with research impact showed that Llama 2-generated features are semantically meaningful, and we consequently used these features for text classification: predicting the binary citation rate for the CORD-19 dataset and an ordinal 5-class expert-awarded grade for the M17+ dataset.

Machine-learning models trained on the LLM-generated features achieved predictive performance similar to SciBERT, the state-of-the-art embedding model for scientific text, while using only 62 features (as opposed to 768 in SciBERT embeddings) that are fully interpretable, capturing qualities such as methodological rigour, novelty, or grammatical correctness. We then applied action rule mining to derive a small number of well-interpretable rules, and we believe this technique can also benefit other white-box methods beyond rule learning.

In addition, to verify the universality of our approach, we applied it to three further datasets: BANKING77, Hate Speech, and Food Hazard (without action rule mining).

Installation

To get started, clone this repository and install the necessary dependencies:

git clone https://github.com/vojtech-balek/llm-features.git
cd llm-features
pip install -r requirements.txt

Experiments

Data

Data are stored in the data folder.

Feature Generation

Feature generation based on user-defined features. The corresponding notebook is feature_extraction-CORD19-M17plus.ipynb.
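The step above asks an LLM to score each text on a fixed list of user-defined features. A minimal, hypothetical sketch of that idea (prompt wording, feature names, and the JSON reply format are illustrative assumptions, not the notebook's exact code):

```python
import json

# Hypothetical user-defined features (illustrative, not the paper's full list).
FEATURES = ["methodological rigour", "novelty", "grammatical correctness"]

def build_extraction_prompt(text: str, features: list[str]) -> str:
    """Ask the model to rate each feature on a 1-5 scale and reply as JSON."""
    feature_list = ", ".join(f'"{f}"' for f in features)
    return (
        "Rate the following article on these features "
        f"({feature_list}) using an integer scale from 1 to 5. "
        "Reply with a single JSON object mapping feature name to score.\n\n"
        f"Article:\n{text}"
    )

def parse_feature_reply(reply: str, features: list[str]) -> dict:
    """Convert the model's JSON reply into a feature vector (missing keys -> None)."""
    scores = json.loads(reply)
    return {f: scores.get(f) for f in features}

# Example of a reply a model might return:
reply = '{"methodological rigour": 4, "novelty": 3, "grammatical correctness": 5}'
vector = parse_feature_reply(reply, FEATURES)
```

The parsed vector can then be appended as one row of the tabular feature matrix used downstream.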

LLM-based feature generation with automatic feature discovery

LLM-based feature discovery with GPT-4o and automated feature generation with GPT-4o-mini. The corresponding notebook with the feature discovery prompt is llm-chatgpt.ipynb.
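In the discovery step, the model itself proposes candidate features from sample documents, rather than receiving a hand-crafted list. A hedged sketch of how such a discovery prompt could be assembled (wording and parameters are assumptions, not the notebook's actual prompt):

```python
# Illustrative sketch of the feature-discovery step: show the model a few
# sample documents and ask it to propose candidate interpretable features.
def build_discovery_prompt(samples: list, n_features: int = 10) -> str:
    """Build a prompt asking the LLM to propose interpretable features."""
    joined = "\n---\n".join(samples)
    return (
        f"Below are {len(samples)} sample documents from a corpus. "
        f"Propose {n_features} interpretable features (short names plus "
        "one-line definitions) that could be scored for each document "
        "and might predict its impact.\n\n"
        f"{joined}"
    )

prompt = build_discovery_prompt(
    ["Article abstract about vaccine efficacy...",
     "Article abstract about protein folding..."]
)
```

The returned feature definitions can then be fed into the automated generation step above.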

Feature Analysis

Analysis of the features generated for the datasets, including a formal test of the relationship between the target and the generated features.
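Since the LLM-generated features are ordinal (e.g. 1-5 scores), one natural formal test of their relationship with the target is a rank correlation. A small illustrative sketch, assuming Spearman's rho on synthetic data (not the repository's exact analysis code):

```python
# Check whether an ordinal LLM-scored feature correlates with the target
# using Spearman's rank correlation. Data below is synthetic.
from scipy.stats import spearmanr

feature = [1, 2, 2, 3, 4, 4, 5, 5]   # e.g. an LLM-scored "novelty" feature (1-5)
target  = [0, 0, 1, 0, 1, 1, 1, 1]   # e.g. a binary citation-rate class

rho, p_value = spearmanr(feature, target)
```

A feature would be kept as semantically meaningful if its correlation with the target is statistically significant on the real data.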

Model Evaluation

Evaluate the performance of the models on the extracted user-defined features.

Evaluate the performance of the models on the extracted LLM-generated features.
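The evaluation step trains classifiers on the small interpretable feature matrix and measures predictive performance. A minimal sketch of that workflow on synthetic data (the model choice and split here are illustrative assumptions, not the repository's exact setup):

```python
# Train a white-box classifier on a small matrix of ordinal LLM-scored
# features and measure held-out accuracy. All data below is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(200, 5)).astype(float)  # 5 features scored 1-5
y = (X[:, 0] + X[:, 1] > 6).astype(int)              # synthetic binary target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
```

On the real datasets, the analogous comparison is between models trained on the 62 LLM-generated features and models trained on 768-dimensional SciBERT embeddings.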

Action Rules

Mining of action rules for the CORD-19 and M17+ datasets is in action-CORD19.ipynb and action-M17Plus.ipynb.
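An action rule pairs two classification rules that share stable conditions but differ in a flexible attribute, suggesting a change that moves an instance from an undesired class to a desired one. A toy illustration of that idea in plain Python (the attribute names and rule structure are invented for illustration, not the mining algorithm the notebooks use):

```python
# Toy action rule: with the stable condition unchanged, raising "novelty"
# from low to high is associated with moving from low to high impact.
action_rule = {
    "stable": {"field": "biomedicine"},          # attributes we cannot change
    "flexible": {"novelty": ("low", "high")},    # recommended change: low -> high
    "effect": ("low_impact", "high_impact"),     # expected class transition
}

def apply_action(instance: dict, rule: dict) -> dict:
    """Return a copy of the instance with the rule's flexible changes applied."""
    changed = dict(instance)
    for attr, (before, after) in rule["flexible"].items():
        if changed.get(attr) == before:
            changed[attr] = after
    return changed

article = {"field": "biomedicine", "novelty": "low"}
recommended = apply_action(article, action_rule)
```

Because the underlying features are interpretable, the resulting rules read as actionable advice (e.g. "increase methodological rigour") rather than opaque coordinate changes.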

License

MIT
