LabelFusion Logo

LabelFusion

Learning to Fuse LLMs and Transformer Classifiers for Robust Text Classification

A Python package for advanced text classification that combines Large Language Models (LLMs) with traditional transformer-based classifiers through a learned fusion approach. LabelFusion uses a trainable neural network to intelligently combine predictions from ML models (like RoBERTa) and LLMs (OpenAI, Gemini, DeepSeek) to achieve superior accuracy with data efficiency.

License: MIT · Python 3.8+

Key Innovation: AutoFusion

The simplest way to get state-of-the-art text classification:

from textclassify.ensemble.auto_fusion import AutoFusionClassifier

# One configuration, automatic ML+LLM fusion!
config = {
    'llm_provider': 'deepseek',  # or 'openai', 'gemini'
    'label_columns': ['positive', 'negative', 'neutral']
}

classifier = AutoFusionClassifier(config)
classifier.fit(your_dataframe)  # Trains ML backbone + LLM + fusion layer
predictions = classifier.predict(test_texts)

What makes it special?

  • Superior Performance: 92.4% accuracy on AG News and 92.3% on Reuters-21578, ahead of each individual model
  • Data Efficient: reaches 92.2% on AG News with only 20% of the training data
  • Learned Fusion: Neural network learns optimal combination of ML embeddings + LLM scores
  • Cost-Aware: Intelligent caching and efficient resource usage
  • One-Line Setup: No complex configuration needed

Features

Fusion Ensemble (Core Innovation)

  • AutoFusionClassifier: One-line interface for ML+LLM fusion
  • FusionMLP: Trainable neural network that combines predictions (see the sketch after this list)
  • Smart Training: Different learning rates for ML backbone vs fusion layer
  • Calibration: Temperature scaling and isotonic regression for better probability estimates
  • Production-Ready: Includes caching, results management, and cost monitoring
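
The fusion layer itself is conceptually small. Below is a minimal PyTorch sketch of the idea (illustrative only; the class name, dimensions, and layer sizes are assumptions, not the package's actual implementation):

import torch
import torch.nn as nn

class FusionMLPSketch(nn.Module):
    """Illustrative fusion head: concatenates ML features with LLM per-class scores."""
    def __init__(self, ml_dim, llm_dim, num_labels, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ml_dim + llm_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, ml_features, llm_scores):
        # ml_features: (batch, ml_dim) from the RoBERTa backbone
        # llm_scores:  (batch, llm_dim) per-class scores from the LLM
        fused = torch.cat([ml_features, llm_scores], dim=-1)
        return self.net(fused)  # logits over labels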

Supported Models

  • LLM Providers: OpenAI GPT, Google Gemini, DeepSeek
  • ML Models: RoBERTa-based classifiers with fine-tuning
  • Traditional Ensembles: Voting, weighted, and class-specific routing

Classification Support

  • Multi-class: Single label per text (mutually exclusive)
  • Multi-label: Multiple labels per text (e.g., the 28 emotion labels of the GoEmotions dataset)

Production Features

  • LLM Response Caching: Automatic disk-based caching to reduce API costs
  • Results Management: Track experiments, metrics, and predictions
  • Batch Processing: Efficient processing of large datasets
  • Async Support: Asynchronous LLM API calls for better throughput (pattern sketched below)
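
The async pattern is the standard asyncio one. Here is a minimal sketch of concurrent, rate-limited LLM calls (illustrative only; classify_one stands in for whatever per-text API call your provider client exposes and is not part of the package API):

import asyncio

async def classify_one(text):
    # Placeholder for a single async LLM API call (e.g. via aiohttp or an async SDK).
    await asyncio.sleep(0)  # simulate network I/O
    return {"text": text, "label": "positive"}

async def classify_batch(texts, concurrency=8):
    # Limit in-flight requests so large datasets don't overwhelm the API.
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(text):
        async with semaphore:
            return await classify_one(text)

    return await asyncio.gather(*(bounded(t) for t in texts))

# results = asyncio.run(classify_batch(["great product", "awful service"]))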

Performance Benchmarks

AG News Topic Classification (4-class)

Evaluated on AG News dataset with 5,000 test samples:

| Training Data | Model   | Accuracy | F1-Score | Precision | Recall |
|---------------|---------|----------|----------|-----------|--------|
| 20% (800)     | Fusion  | 92.2%    | 0.922    | 0.923     | 0.922  |
| 20% (800)     | RoBERTa | 89.8%    | 0.899    | 0.902     | 0.898  |
| 20% (800)     | OpenAI  | 85.1%    | 0.847    | 0.863     | 0.846  |
| 40% (1,600)   | Fusion  | 92.2%    | 0.922    | 0.924     | 0.922  |
| 40% (1,600)   | RoBERTa | 91.0%    | 0.911    | 0.913     | 0.910  |
| 40% (1,600)   | OpenAI  | 83.9%    | 0.835    | 0.847     | 0.834  |
| 100% (4,000)  | Fusion  | 92.4%    | 0.924    | 0.926     | 0.924  |
| 100% (4,000)  | RoBERTa | 92.2%    | 0.922    | 0.923     | 0.922  |
| 100% (4,000)  | OpenAI  | 85.3%    | 0.849    | 0.868     | 0.847  |

Reuters-21578 Topic Classification (10-class)

Evaluated on Reuters-21578 single-label 10-class subset:

| Training Data | Model   | Accuracy | F1-Score | Precision | Recall |
|---------------|---------|----------|----------|-----------|--------|
| 20% (1,168)   | Fusion  | 72.0%    | 0.752    | 0.769     | 0.745  |
| 20% (1,168)   | RoBERTa | 67.3%    | 0.534    | 0.465     | 0.643  |
| 20% (1,168)   | OpenAI  | 88.6%    | 0.928    | 0.951     | 0.923  |
| 40% (2,336)   | Fusion  | 83.6%    | 0.886    | 0.893     | 0.889  |
| 40% (2,336)   | RoBERTa | 82.0%    | 0.836    | 0.858     | 0.850  |
| 40% (2,336)   | OpenAI  | 87.9%    | 0.931    | 0.952     | 0.917  |
| 100% (5,842)  | Fusion  | 92.3%    | 0.960    | 0.967     | 0.961  |
| 100% (5,842)  | RoBERTa | 89.0%    | 0.946    | 0.932     | 0.966  |
| 100% (5,842)  | OpenAI  | 88.9%    | 0.939    | 0.963     | 0.927  |

Key Findings:

  • Fusion outperforms both individual models on AG News at every training fraction and delivers the best Reuters results with the full training set
  • Superior data efficiency: reaches 92.2% on AG News with only 20% of the training data
  • Combines LLM reasoning with ML efficiency for robust classification
  • Demonstrates strong performance on both balanced (AG News) and imbalanced (Reuters) datasets

Installation

# Install from source
git clone https://github.com/DataandAIReseach/LabelFusion.git
cd LabelFusion
pip install -e .

Dependencies

Core Requirements:

pip install pandas python-dotenv openai aiohttp google-generativeai

For ML Models (RoBERTa):

pip install transformers torch scikit-learn

For Development:

pip install -e ".[dev]"

Quick Start

1. AutoFusion - Simplest Way (Recommended)

from textclassify.ensemble.auto_fusion import AutoFusionClassifier
import pandas as pd

# Your training data
df = pd.DataFrame({
    'text': [
        "I love this product!",
        "Terrible experience, very disappointed",
        "It's okay, nothing special"
    ],
    'positive': [1, 0, 0],
    'negative': [0, 1, 0],
    'neutral': [0, 0, 1]
})

# Simple configuration
config = {
    'llm_provider': 'deepseek',  # Choose: 'deepseek', 'openai', or 'gemini'
    'label_columns': ['positive', 'negative', 'neutral']
}

# Train fusion model (ML + LLM + learned combination)
classifier = AutoFusionClassifier(config)
classifier.fit(df)

# Make predictions
test_texts = ["This is amazing!", "Not good at all"]
result = classifier.predict(test_texts)
print(result.predictions)  # ['positive', 'negative']

2. Multi-Label Classification

# Multi-label example (e.g., movie genres)
config = {
    'llm_provider': 'deepseek',
    'label_columns': ['action', 'comedy', 'drama', 'horror', 'romance'],
    'multi_label': True  # Enable multi-label mode
}

classifier = AutoFusionClassifier(config)
classifier.fit(movie_dataframe)

result = classifier.predict(["A funny action movie with romance"])
print(result.predictions[0])  # ['action', 'comedy', 'romance']

3. Using Individual LLM Classifiers

from textclassify import DeepSeekClassifier, OpenAIClassifier, GeminiClassifier
from textclassify.config import Config
from textclassify.core.types import ModelType

# Configure LLM
config = Config()
config.model_type = ModelType.LLM
config.parameters = {
    'model': 'deepseek-chat',
    'temperature': 1,
    'max_tokens': 150
}

# Create classifier
classifier = DeepSeekClassifier(
    config=config,
    text_column='text',
    label_columns=['positive', 'negative', 'neutral']
)

# Make predictions
result = classifier.predict(train_df=train_df, test_df=test_df)

4. RoBERTa Classifier (Traditional ML)

from textclassify.ml import RoBERTaClassifier
from textclassify.core.types import ModelConfig, ModelType

config = ModelConfig(
    model_name='roberta-base',
    model_type=ModelType.TRADITIONAL_ML,
    parameters={
        'max_length': 256,
        'learning_rate': 2e-5,
        'num_epochs': 3,
        'batch_size': 16
    }
)

classifier = RoBERTaClassifier(
    config=config,
    text_column='text',
    label_columns=['positive', 'negative', 'neutral'],
    multi_label=False
)

classifier.fit(train_df)
result = classifier.predict(test_texts)

Advanced Fusion Usage

Manual Fusion Configuration

For advanced users who want full control:

from textclassify.ensemble.fusion import FusionEnsemble
from textclassify.ml.roberta_classifier import RoBERTaClassifier
from textclassify.llm.deepseek_classifier import DeepSeekClassifier
from textclassify.config import Config
from textclassify.core.types import ModelConfig, ModelType, ClassificationType  # import path for ClassificationType assumed

labels = ['positive', 'negative', 'neutral']  # your label column names

# Create ML model
ml_config = ModelConfig(
    model_name='roberta-base',
    model_type=ModelType.TRADITIONAL_ML
)
ml_model = RoBERTaClassifier(config=ml_config, label_columns=labels)

# Create LLM model
llm_config = Config()
llm_model = DeepSeekClassifier(config=llm_config, label_columns=labels)

# Create fusion ensemble
fusion = FusionEnsemble(
    ml_classifier=ml_model,
    llm_classifiers=[llm_model],
    label_columns=labels,
    classification_type=ClassificationType.MULTI_CLASS
)

# Train fusion layer
fusion.fit(
    train_texts=train_df['text'].tolist(),
    train_labels=train_df[labels].values.tolist(),
    val_texts=val_df['text'].tolist(),
    val_labels=val_df[labels].values.tolist()
)

# Predict
result = fusion.predict(test_texts, test_labels)

Command-Line Training

# Create config file
python train_fusion.py --create-config fusion_config.yaml

# Edit fusion_config.yaml with your settings, then train
python train_fusion.py --config fusion_config.yaml

# Evaluate on test data
python train_fusion.py --config fusion_config.yaml --evaluate --test-data path/to/test.csv

Traditional Ensemble Methods

Voting Ensemble

from textclassify import VotingEnsemble, EnsembleConfig
from textclassify import OpenAIClassifier, GeminiClassifier


# Create individual classifiers
openai_clf = OpenAIClassifier(openai_config)
gemini_clf = GeminiClassifier(gemini_config)

ensemble = VotingEnsemble(ensemble_config)
ensemble.add_model(openai_clf, "openai")
ensemble.add_model(gemini_clf, "gemini")

ensemble.fit(training_data)
result = ensemble.predict(texts)
# Supports: 'majority', 'plurality' voting strategies
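
For reference, the difference between majority and plurality voting over per-model predictions can be sketched in a few lines of plain Python (this illustrates the strategies, not the package's internal code):

from collections import Counter

def plurality_vote(predictions):
    # Pick the label predicted by the most models; ties broken arbitrarily.
    return Counter(predictions).most_common(1)[0][0]

def majority_vote(predictions):
    # Require a strict majority; return None (abstain) otherwise.
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else None

votes = ["positive", "positive", "neutral"]
print(plurality_vote(votes))  # 'positive'
print(majority_vote(votes))   # 'positive' (2 of 3 is a strict majority)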

Weighted Ensemble

from textclassify import WeightedEnsemble

ensemble_config = EnsembleConfig(
    models=[model1_config, model2_config],
    ensemble_method="weighted",
    weights=[0.7, 0.3]  # Custom weights based on validation performance
)
ensemble = WeightedEnsemble(ensemble_config)
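
Conceptually, a weighted ensemble averages each model's class probabilities using the given weights. A minimal sketch of that combination step, with made-up numbers (not the package's internal code):

import numpy as np

# Per-model class probabilities for one text: rows = models, columns = classes.
model_probs = np.array([
    [0.70, 0.20, 0.10],   # model 1
    [0.40, 0.50, 0.10],   # model 2
])
weights = np.array([0.7, 0.3])  # e.g. chosen from validation performance

combined = weights @ model_probs          # weighted average per class
predicted_class = int(np.argmax(combined))
print(combined, predicted_class)          # [0.61 0.29 0.1 ] 0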

Class Routing Ensemble

from textclassify import ClassRoutingEnsemble

routing_rules = {
    "technical": "model1",
    "creative": "model2"
}

ensemble_config = EnsembleConfig(
    models=[model1_config, model2_config],
    ensemble_method="routing",
    routing_rules=routing_rules
)
ensemble = ClassRoutingEnsemble(ensemble_config)

Supported Models

LLM Providers

| Provider | Models                            | API Key Required |
|----------|-----------------------------------|------------------|
| OpenAI   | gpt-3.5-turbo, gpt-4, gpt-4-turbo | Yes              |
| Gemini   | gemini-1.5-flash, gemini-1.5-pro  | Yes              |
| DeepSeek | deepseek-chat, deepseek-coder     | Yes              |

ML Models

| Model   | Description                    | Dependencies                       |
|---------|--------------------------------|------------------------------------|
| RoBERTa | Fine-tunable transformer model | transformers, torch, scikit-learn  |

Production Features

LLM Response Caching

# Caching is automatic! Reduces API costs dramatically
classifier = DeepSeekClassifier(
    config=config,
    auto_use_cache=True,  # Enable automatic cache usage
    cache_dir="cache"      # Cache directory
)
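
Under the hood, disk caches of this kind typically key responses on a hash of the prompt and model settings, so identical requests are never re-sent. A minimal sketch of the mechanism (illustrative only, not the package's actual cache format):

import hashlib
import json
from pathlib import Path

def cached_llm_call(prompt, model, call_fn, cache_dir="cache"):
    # Key the cache on the exact prompt + model so repeat requests hit disk, not the API.
    key = hashlib.sha256(f"{model}:{prompt}".encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    response = call_fn(prompt)  # the real (paid) API call
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(response))
    return response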

Results Management

# Automatic experiment tracking
classifier = AutoFusionClassifier(
    config,
    output_dir="outputs",
    experiment_name="my_experiment",
    auto_save_results=True  # Saves predictions, metrics, and config
)

API Reference

Core Classes

  • AutoFusionClassifier - One-line ML+LLM fusion interface (⭐ recommended)
  • FusionEnsemble - Advanced fusion ensemble with manual control
  • FusionMLP - Trainable neural network for fusion
  • BaseClassifier - Abstract base class for all classifiers
  • ClassificationResult - Container for prediction results
  • ModelConfig - Configuration for individual models
  • EnsembleConfig - Configuration for ensemble methods

LLM Classifiers

  • OpenAIClassifier - OpenAI GPT models
  • GeminiClassifier - Google Gemini models
  • DeepSeekClassifier - DeepSeek models

ML Classifiers

  • RoBERTaClassifier - RoBERTa-based classifier with fine-tuning

Traditional Ensemble Methods

  • VotingEnsemble - Voting-based ensemble
  • WeightedEnsemble - Weighted ensemble
  • ClassRoutingEnsemble - Class-specific routing

Configuration Management

API Key Management

from textclassify.config import APIKeyManager

# Set up API keys
api_manager = APIKeyManager()
api_manager.set_key("openai", "your-openai-key")
api_manager.set_key("gemini", "your-gemini-key")

# Or use environment variables (recommended)
# export OPENAI_API_KEY="your-key"
# export GEMINI_API_KEY="your-key"
# export DEEPSEEK_API_KEY="your-key"

Configuration Files

from textclassify.config import Config

# Load configuration
config = Config()
config.load('config.yaml')

# Or create from scratch
config.set('llm.default_provider', 'deepseek')
config.set('general.batch_size', 32)
config.save('my_config.yaml')

Evaluation and Metrics

from textclassify.utils import evaluate_predictions

# Evaluate model performance
result = classifier.predict(test_texts, test_labels)

if result.metadata and 'metrics' in result.metadata:
    metrics = result.metadata['metrics']
    print(f"Accuracy: {metrics['accuracy']:.3f}")
    print(f"F1 Score: {metrics['f1']:.3f}")
    print(f"Precision: {metrics['precision']:.3f}")
    print(f"Recall: {metrics['recall']:.3f}")
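
Metrics can also be computed directly with scikit-learn (already a dependency of the ML models) from the returned predictions. A small sketch with toy labels, assuming result.predictions is a list of label strings aligned with your gold labels:

from sklearn.metrics import accuracy_score, f1_score

# Toy example; in practice use your gold labels and result.predictions.
y_true = ["positive", "negative", "neutral", "positive"]
y_pred = ["positive", "negative", "positive", "positive"]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))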

Examples and Documentation

Example Scripts

The package includes comprehensive examples in the examples/ and textclassify/examples/ directories:

  • test_autofusion_kaggle.py - AutoFusion on real dataset
  • test_kaggle_data.py - Testing with Kaggle ecommerce data
  • multi_class_example.py - Multi-class classification examples
  • multi_label_example.py - Multi-label classification examples
  • ensemble_example.py - Advanced ensemble methods

Evaluation Scripts

Comprehensive evaluation scripts in tests/evaluation/:

  • eval_ag_news.py - AG News topic classification benchmarks
  • eval_goemotions.py - GoEmotions multi-label emotion classification

Documentation Files

  • FUSION_README.md - Detailed fusion ensemble documentation
  • PACKAGE_OVERVIEW.md - Complete package architecture overview
  • AUTO_CACHE_FEATURE.md - LLM caching system documentation
  • paper_labelfusion.md - Academic paper describing the fusion methodology

How It Works

Fusion Architecture

  1. ML Backbone (RoBERTa): Generates logits from input text
  2. LLM Component: Produces per-class scores via prompting (cached for efficiency)
  3. Calibration: Both ML and LLM signals are calibrated for better probability estimates
  4. FusionMLP: Small neural network concatenates and learns to combine the signals
  5. Training: The ML backbone uses a small learning rate while the fusion MLP uses a higher rate for fast adaptation (see the sketch below)
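
Step 5 corresponds to standard PyTorch optimizer parameter groups. A hedged sketch, using stand-in modules for the backbone and fusion head (the exact learning rates and dimensions are placeholders, not the package's settings):

import torch
from torch import nn

ml_backbone = nn.Linear(768, 768)   # stand-in for the RoBERTa encoder
fusion_mlp = nn.Linear(768 + 3, 3)  # stand-in for the fusion head

optimizer = torch.optim.AdamW([
    {"params": ml_backbone.parameters(), "lr": 2e-5},  # small LR: gently fine-tune the backbone
    {"params": fusion_mlp.parameters(), "lr": 1e-3},   # larger LR: let the fusion head adapt quickly
])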

Why Fusion Works

  • Complementary Strengths: LLMs provide robust reasoning, ML provides efficiency
  • Data Efficiency: LLM knowledge compensates for limited training data
  • Learned Combination: Neural network optimizes the fusion for your specific task
  • Cost-Effective: Caching and smart fusion reduce LLM API costs

Use Cases

  • Customer Feedback Analysis: Multi-label sentiment with nuanced categories
  • Content Moderation: Balance accuracy with real-time processing requirements
  • Scientific Literature Classification: Handle domain shift and new terminology
  • Low-Resource Scenarios: Achieve high accuracy with limited training data
  • Multi-Domain Classification: Leverage complementary model strengths

Contributing

We welcome contributions! Please see our development setup below.

Development Setup

git clone https://github.com/DataandAIReseach/LabelFusion.git
cd LabelFusion
pip install -e ".[dev]"

Running Tests

# Unit tests
pytest tests/

# Integration tests
pytest tests/integration/

# Evaluation benchmarks
python tests/evaluation/eval_ag_news.py
python tests/evaluation/eval_goemotions.py

Citation

If you use LabelFusion in your research, please cite:

@software{labelfusion2025,
  title={LabelFusion: Learning to Fuse LLMs and Transformer Classifiers for Robust Text Classification},
  author={Weisser, Christoph and Contributors},
  year={2025},
  url={https://github.com/DataandAIReseach/LabelFusion}
}

See paper_labelfusion.md for the full research paper.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support and Links

  • 📖 Documentation: See FUSION_README.md and PACKAGE_OVERVIEW.md
  • 🐛 Issues: GitHub Issues
  • Paper: paper_labelfusion.md
  • Examples: Check examples/ and textclassify/examples/ directories

Changelog

See CHANGELOG.md for version history and updates.


LabelFusion - Superior text classification through learned ML+LLM fusion
