
🧬 BSG CyLlama

Biomedical Summary Generation through Cyclical Llama

A lightweight approach to corpus-level scientific summarization that cycles and averages SBERT document embeddings into soft prompts for a LoRA fine-tuned Llama.



🎯 Overview

BSG CyLlama generates high-quality scientific summaries by:

  1. Embedding documents with SBERT (thenlper/gte-large)
  2. Projecting embeddings to soft prompts via a trained projection network
  3. Generating summaries with a LoRA fine-tuned Llama-3.2-1B-Instruct

The model produces three outputs for each document:

  • Abstract: 150-300 word detailed summary
  • Short Summary: 50-100 word concise summary
  • Title: 5-10 word informative title

Best Performance: 81.50% semantic similarity on a held-out validation set.


🚀 Quick Start

Installation

git clone https://github.com/jimnoneill/bsg_cyllama.git
cd bsg_cyllama
pip install -e .

Generate Summaries

from sentence_transformers import SentenceTransformer
from bsg_cyllama.utils import load_trained_model
from bsg_cyllama.training.generation import generate_summary, parse_generated_output

# Load models
sbert = SentenceTransformer("thenlper/gte-large").to("cuda")
model, prompt_gen, tokenizer = load_trained_model("./trained_model")

# Your scientific document
document = """
Clustered regularly interspaced short palindromic repeats (CRISPR) 
and CRISPR-associated protein 9 (Cas9) have revolutionized genome 
editing capabilities...
"""

# Generate embedding and summary
embedding = sbert.encode([document], convert_to_tensor=True, device="cuda")
output = generate_summary(embedding, model, prompt_gen, tokenizer)
abstract, short_summary, title = parse_generated_output(output)

print(f"Title: {title}")
print(f"Summary: {short_summary}")
print(f"Abstract: {abstract}")

Training

python scripts/train.py \
    --training-data /path/to/training_targets.tsv \
    --source-data /path/to/source_documents.tsv \
    --output-dir ./output \
    --epochs 20

📁 Repository Structure

bsg_cyllama/
├── src/bsg_cyllama/          # Main package
│   ├── config.py             # Configuration dataclass
│   ├── models/               # Model components
│   │   ├── prompt_generator.py   # SBERT → soft prompt projection
│   │   └── evaluator.py          # Semantic evaluation metrics
│   ├── data/                 # Data processing
│   │   ├── dataset.py            # PyTorch Dataset
│   │   ├── preprocessing.py      # Text cleaning utilities
│   │   └── embedding_cache.py    # Disk-based embedding cache
│   ├── training/             # Training loop
│   │   ├── trainer.py            # Main training orchestrator
│   │   └── generation.py         # Inference utilities
│   └── utils/                # Helpers
│       └── helpers.py            # Environment setup, model loading
├── scripts/                  # Entry points
│   ├── train.py              # Training script
│   ├── generate.py           # Inference script
│   └── evaluate.py           # Evaluation script
├── configs/                  # YAML configurations
│   └── default.yaml          # Default training config
├── requirements.txt          # Dependencies
└── pyproject.toml           # Package metadata

🧠 Architecture

Sbert2Prompt Projection Network

SBERT Embedding (1024d)
        ↓
   Linear Layer
        ↓
      GELU
        ↓
    Dropout
        ↓
   Linear Layer
        ↓
    Reshape
        ↓
Soft Prompts (24 × 2048d)
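The diagram above can be read as a two-layer MLP that maps one pooled 1024-d SBERT vector to 24 soft prompt vectors of 2048 dimensions each. A hypothetical PyTorch reimplementation (the hidden width, dropout rate, and class internals are assumptions; only the input/output shapes come from the diagram):

```python
import torch
import torch.nn as nn


class Sbert2Prompt(nn.Module):
    """Project a pooled SBERT embedding to a sequence of soft prompts.

    Illustrative reconstruction of the diagram: Linear -> GELU ->
    Dropout -> Linear -> Reshape. Layer sizes other than the 1024-d
    input and 24 x 2048 output are assumed.
    """

    def __init__(self, sbert_dim=1024, llama_dim=2048, prompt_len=24,
                 hidden_dim=4096, dropout=0.1):
        super().__init__()
        self.prompt_len = prompt_len
        self.llama_dim = llama_dim
        self.net = nn.Sequential(
            nn.Linear(sbert_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, prompt_len * llama_dim),
        )

    def forward(self, emb):
        # emb: (batch, sbert_dim) -> (batch, prompt_len, llama_dim)
        flat = self.net(emb)
        return flat.view(-1, self.prompt_len, self.llama_dim)


prompts = Sbert2Prompt()(torch.randn(3, 1024))
print(prompts.shape)  # torch.Size([3, 24, 2048])
```

The output tensor is prepended to the token embeddings of the instruction, so the frozen base model conditions on the document embedding without ever seeing the document's tokens.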

Training Pipeline

  1. Data Preparation: Load TSV files, validate text, compute SBERT embeddings
  2. Embedding Caching: Store embeddings on disk for fast resume
  3. Forward Pass: Project embedding → concatenate with instruction → compute loss
  4. Adaptive LR: Three phases (breakthrough → fine-tune → convergence)
  5. Evaluation: Periodic semantic similarity checks with early stopping
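The three-phase adaptive LR in step 4 might be sketched as a simple epoch-gated schedule. The phase boundaries and decay factors below are assumptions; the repository's trainer may instead switch phases based on validation similarity:

```python
def adaptive_lr(epoch, base_lr=8e-5):
    """Illustrative three-phase schedule: breakthrough -> fine-tune -> convergence.

    Boundaries (5, 15) and decay factors (0.3, 0.1) are hypothetical;
    only the three-phase shape and the 8e-5 base LR come from the README.
    """
    if epoch < 5:
        return base_lr          # breakthrough: full LR to escape the initial plateau
    elif epoch < 15:
        return base_lr * 0.3    # fine-tune: reduced LR for stable adaptation
    else:
        return base_lr * 0.1    # convergence: small LR while early stopping monitors eval
```

A schedule like this pairs naturally with step 5: the convergence phase keeps updates small while periodic semantic-similarity checks decide when to stop.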

Key Hyperparameters

| Parameter     | Value | Description                                    |
|---------------|-------|------------------------------------------------|
| LoRA Rank     | 128   | Adapter rank for efficient fine-tuning         |
| LoRA Alpha    | 256   | Scaling factor                                 |
| Learning Rate | 8e-5  | Initial learning rate                          |
| Batch Size    | 2 × 8 | Effective batch size with gradient accumulation |
| Prompt Length | 24    | Number of soft prompt tokens                   |
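The LoRA settings from the table translate directly into a PEFT configuration. A sketch using the `peft` library; `lora_dropout` and `target_modules` are assumptions, since the README does not list which Llama projection layers are adapted:

```python
from peft import LoraConfig  # pip install peft

# Mirrors the hyperparameter table above. target_modules and
# lora_dropout are assumed, not taken from the repository.
lora_config = LoraConfig(
    r=128,               # LoRA rank
    lora_alpha=256,      # scaling factor (alpha / r = 2.0)
    lora_dropout=0.05,   # assumed; not listed in the table
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# The adapters would then be attached with peft.get_peft_model(base_model, lora_config).
```

With rank 128 on a 1B-parameter base model, only the adapter weights and the projection network train, which is what keeps the pipeline runnable locally.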

📊 Results

| Metric              | Score                    |
|---------------------|--------------------------|
| Semantic Similarity | 81.50%                   |
| Word Overlap        | ~45%                     |
| Training Epochs     | 22 (with early stopping) |

📜 License

This project uses the Llama 3.2 Community License.

The training code is provided for research and educational purposes.


🙏 Acknowledgments

  • Meta AI for Llama 3.2
  • Sentence Transformers team for SBERT
  • HuggingFace for transformers and PEFT libraries

📧 Contact

Jamey O'Neill
