
🧬 BSG CyLlama

Biomedical Summary Generation through Cyclical Llama

A lightweight approach to corpus-level scientific summarization that cycles and averages SBERT document embeddings into soft prompts for a LoRA fine-tuned Llama.



🎯 Overview

BSG CyLlama generates high-quality scientific summaries by:

  1. Embedding documents with SBERT (thenlper/gte-large)
  2. Projecting embeddings to soft prompts via a trained projection network
  3. Generating summaries with a LoRA fine-tuned Llama-3.2-1B-Instruct

The model produces three outputs for each document:

  • Abstract: 150-300 word detailed summary
  • Short Summary: 50-100 word concise summary
  • Title: 5-10 word informative title

Best Performance: 81.50% semantic similarity on a held-out validation set.


🚀 Quick Start

Installation

git clone https://github.com/jimnoneill/bsg_cyllama.git
cd bsg_cyllama
pip install -e .

Generate Summaries

from sentence_transformers import SentenceTransformer
from bsg_cyllama.utils import load_trained_model
from bsg_cyllama.training.generation import generate_summary, parse_generated_output

# Load models
sbert = SentenceTransformer("thenlper/gte-large").to("cuda")
model, prompt_gen, tokenizer = load_trained_model("./trained_model")

# Your scientific document
document = """
Clustered regularly interspaced short palindromic repeats (CRISPR) 
and CRISPR-associated protein 9 (Cas9) have revolutionized genome 
editing capabilities...
"""

# Generate embedding and summary
embedding = sbert.encode([document], convert_to_tensor=True, device="cuda")
output = generate_summary(embedding, model, prompt_gen, tokenizer)
abstract, short_summary, title = parse_generated_output(output)

print(f"Title: {title}")
print(f"Summary: {short_summary}")
print(f"Abstract: {abstract}")

Training

python scripts/train.py \
    --training-data /path/to/training_targets.tsv \
    --source-data /path/to/source_documents.tsv \
    --output-dir ./output \
    --epochs 20

📁 Repository Structure

bsg_cyllama/
├── src/bsg_cyllama/          # Main package
│   ├── config.py             # Configuration dataclass
│   ├── models/               # Model components
│   │   ├── prompt_generator.py   # SBERT → soft prompt projection
│   │   └── evaluator.py          # Semantic evaluation metrics
│   ├── data/                 # Data processing
│   │   ├── dataset.py            # PyTorch Dataset
│   │   ├── preprocessing.py      # Text cleaning utilities
│   │   └── embedding_cache.py    # Disk-based embedding cache
│   ├── training/             # Training loop
│   │   ├── trainer.py            # Main training orchestrator
│   │   └── generation.py         # Inference utilities
│   └── utils/                # Helpers
│       └── helpers.py            # Environment setup, model loading
├── scripts/                  # Entry points
│   ├── train.py              # Training script
│   ├── generate.py           # Inference script
│   └── evaluate.py           # Evaluation script
├── configs/                  # YAML configurations
│   └── default.yaml          # Default training config
├── requirements.txt          # Dependencies
└── pyproject.toml           # Package metadata

🧠 Architecture

Sbert2Prompt Projection Network

SBERT Embedding (1024d)
        ↓
   Linear Layer
        ↓
      GELU
        ↓
    Dropout
        ↓
   Linear Layer
        ↓
    Reshape
        ↓
Soft Prompts (24 × 2048d)
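The diagram above can be read as a two-layer MLP that maps one pooled 1024-d SBERT vector to 24 soft prompt vectors of 2048 dimensions each. A hypothetical PyTorch reimplementation (the hidden width, dropout rate, and class internals are assumptions; only the input/output shapes come from the diagram):

```python
import torch
import torch.nn as nn


class Sbert2Prompt(nn.Module):
    """Project a pooled SBERT embedding to a sequence of soft prompts.

    Illustrative reconstruction of the diagram: Linear -> GELU ->
    Dropout -> Linear -> Reshape. Layer sizes other than the 1024-d
    input and 24 x 2048 output are assumed.
    """

    def __init__(self, sbert_dim=1024, llama_dim=2048, prompt_len=24,
                 hidden_dim=4096, dropout=0.1):
        super().__init__()
        self.prompt_len = prompt_len
        self.llama_dim = llama_dim
        self.net = nn.Sequential(
            nn.Linear(sbert_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, prompt_len * llama_dim),
        )

    def forward(self, emb):
        # emb: (batch, sbert_dim) -> (batch, prompt_len, llama_dim)
        flat = self.net(emb)
        return flat.view(-1, self.prompt_len, self.llama_dim)


prompts = Sbert2Prompt()(torch.randn(3, 1024))
print(prompts.shape)  # torch.Size([3, 24, 2048])
```

The output tensor is prepended to the token embeddings of the instruction, so the frozen base model conditions on the document embedding without ever seeing the document's tokens.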

Training Pipeline

  1. Data Preparation: Load TSV files, validate text, compute SBERT embeddings
  2. Embedding Caching: Store embeddings on disk for fast resume
  3. Forward Pass: Project embedding → concatenate with instruction → compute loss
  4. Adaptive LR: Three phases (breakthrough → fine-tune → convergence)
  5. Evaluation: Periodic semantic similarity checks with early stopping
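The three-phase adaptive LR in step 4 might be sketched as a simple epoch-gated schedule. The phase boundaries and decay factors below are assumptions; the repository's trainer may instead switch phases based on validation similarity:

```python
def adaptive_lr(epoch, base_lr=8e-5):
    """Illustrative three-phase schedule: breakthrough -> fine-tune -> convergence.

    Boundaries (5, 15) and decay factors (0.3, 0.1) are hypothetical;
    only the three-phase shape and the 8e-5 base LR come from the README.
    """
    if epoch < 5:
        return base_lr          # breakthrough: full LR to escape the initial plateau
    elif epoch < 15:
        return base_lr * 0.3    # fine-tune: reduced LR for stable adaptation
    else:
        return base_lr * 0.1    # convergence: small LR while early stopping monitors eval
```

A schedule like this pairs naturally with step 5: the convergence phase keeps updates small while periodic semantic-similarity checks decide when to stop.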

Key Hyperparameters

| Parameter     | Value | Description                                    |
|---------------|-------|------------------------------------------------|
| LoRA Rank     | 128   | Adapter rank for efficient fine-tuning         |
| LoRA Alpha    | 256   | Scaling factor                                 |
| Learning Rate | 8e-5  | Initial learning rate                          |
| Batch Size    | 2 × 8 | Effective batch size with gradient accumulation |
| Prompt Length | 24    | Number of soft prompt tokens                   |
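The LoRA settings from the table translate directly into a PEFT configuration. A sketch using the `peft` library; `lora_dropout` and `target_modules` are assumptions, since the README does not list which Llama projection layers are adapted:

```python
from peft import LoraConfig  # pip install peft

# Mirrors the hyperparameter table above. target_modules and
# lora_dropout are assumed, not taken from the repository.
lora_config = LoraConfig(
    r=128,               # LoRA rank
    lora_alpha=256,      # scaling factor (alpha / r = 2.0)
    lora_dropout=0.05,   # assumed; not listed in the table
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# The adapters would then be attached with peft.get_peft_model(base_model, lora_config).
```

With rank 128 on a 1B-parameter base model, only the adapter weights and the projection network train, which is what keeps the pipeline runnable locally.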

📊 Results

| Metric              | Score                    |
|---------------------|--------------------------|
| Semantic Similarity | 81.50%                   |
| Word Overlap        | ~45%                     |
| Training Epochs     | 22 (with early stopping) |

📜 License

This project uses the Llama 3.2 Community License.

The training code is provided for research and educational purposes.


🙏 Acknowledgments

  • Meta AI for Llama 3.2
  • Sentence Transformers team for SBERT
  • HuggingFace for transformers and PEFT libraries

📧 Contact

Jamey O'Neill
