Biomedical Summary Generation through Cyclical Llama
A novel approach to corpus-level scientific summarization that combines SBERT embeddings with soft-prompt-based Llama fine-tuning.
BSG CyLlama generates high-quality scientific summaries by:
- Embedding documents with SBERT (`thenlper/gte-large`)
- Projecting embeddings to soft prompts via a trained projection network
- Generating summaries with a LoRA fine-tuned Llama-3.2-1B-Instruct
The model produces three outputs for each document:
- Abstract: 150-300 word detailed summary
- Short Summary: 50-100 word concise summary
- Title: 5-10 word informative title
Best Performance: 81.50% semantic similarity on the held-out validation set.
```bash
git clone https://github.com/jimnoneill/bsg_cyllama.git
cd bsg_cyllama
pip install -e .
```

```python
from sentence_transformers import SentenceTransformer
from bsg_cyllama.utils import load_trained_model
from bsg_cyllama.training.generation import generate_summary, parse_generated_output

# Load models
sbert = SentenceTransformer("thenlper/gte-large").to("cuda")
model, prompt_gen, tokenizer = load_trained_model("./trained_model")

# Your scientific document
document = """
Clustered regularly interspaced short palindromic repeats (CRISPR)
and CRISPR-associated protein 9 (Cas9) have revolutionized genome
editing capabilities...
"""

# Generate embedding and summary
embedding = sbert.encode([document], convert_to_tensor=True, device="cuda")
output = generate_summary(embedding, model, prompt_gen, tokenizer)
abstract, short_summary, title = parse_generated_output(output)

print(f"Title: {title}")
print(f"Summary: {short_summary}")
print(f"Abstract: {abstract}")
```

```bash
python scripts/train.py \
    --training-data /path/to/training_targets.tsv \
    --source-data /path/to/source_documents.tsv \
    --output-dir ./output \
    --epochs 20
```

```
bsg_cyllama/
├── src/bsg_cyllama/              # Main package
│   ├── config.py                 # Configuration dataclass
│   ├── models/                   # Model components
│   │   ├── prompt_generator.py   # SBERT → soft prompt projection
│   │   └── evaluator.py          # Semantic evaluation metrics
│   ├── data/                     # Data processing
│   │   ├── dataset.py            # PyTorch Dataset
│   │   ├── preprocessing.py      # Text cleaning utilities
│   │   └── embedding_cache.py    # Disk-based embedding cache
│   ├── training/                 # Training loop
│   │   ├── trainer.py            # Main training orchestrator
│   │   └── generation.py         # Inference utilities
│   └── utils/                    # Helpers
│       └── helpers.py            # Environment setup, model loading
├── scripts/                      # Entry points
│   ├── train.py                  # Training script
│   ├── generate.py               # Inference script
│   └── evaluate.py               # Evaluation script
├── configs/                      # YAML configurations
│   └── default.yaml              # Default training config
├── requirements.txt              # Dependencies
└── pyproject.toml                # Package metadata
```
```
SBERT Embedding (1024d)
        ↓
  Linear Layer
        ↓
      GELU
        ↓
    Dropout
        ↓
  Linear Layer
        ↓
     Reshape
        ↓
Soft Prompts (24 × 2048d)
```
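The projection network above can be sketched in PyTorch as follows. The input and output shapes match the diagram (1024-d SBERT embedding in, 24 soft prompts of 2048 dims out, 2048 being Llama-3.2-1B's hidden size); the hidden width and dropout rate are illustrative assumptions, not the trained values:

```python
import torch
import torch.nn as nn

class SoftPromptProjector(nn.Module):
    """Sketch of the SBERT → soft prompt projection (see prompt_generator.py for the real one)."""

    def __init__(self, embed_dim=1024, prompt_len=24, llama_dim=2048,
                 hidden_dim=2048, dropout=0.1):  # hidden_dim/dropout are assumptions
        super().__init__()
        self.prompt_len = prompt_len
        self.llama_dim = llama_dim
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, prompt_len * llama_dim),
        )

    def forward(self, x):
        # x: (batch, 1024) → (batch, 24, 2048), ready to prepend to Llama's input embeddings
        return self.net(x).view(-1, self.prompt_len, self.llama_dim)
```

The reshape at the end is what turns a single document vector into a sequence of virtual tokens the frozen language model can attend to.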
- Data Preparation: Load TSV files, validate text, compute SBERT embeddings
- Embedding Caching: Store embeddings on disk for fast resume
- Forward Pass: Project embedding → concatenate with instruction → compute loss
- Adaptive LR: Three phases (breakthrough → fine-tune → convergence)
- Evaluation: Periodic semantic similarity checks with early stopping
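The three-phase adaptive learning rate can be illustrated with a minimal schedule; the phase boundaries and decay factors below are assumptions for the sketch, not the values used in training:

```python
def phased_lr(epoch, base_lr=8e-5, breakthrough_end=5, finetune_end=15):
    """Illustrative three-phase schedule (breakthrough → fine-tune → convergence).

    Epoch boundaries and decay factors are hypothetical; the trainer's actual
    schedule may switch phases on loss plateaus rather than fixed epochs.
    """
    if epoch < breakthrough_end:
        return base_lr          # breakthrough: full rate for rapid early progress
    elif epoch < finetune_end:
        return base_lr * 0.5    # fine-tune: halved rate to refine representations
    else:
        return base_lr * 0.1    # convergence: small rate while early stopping watches
```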
| Parameter | Value | Description |
|---|---|---|
| LoRA Rank | 128 | Adapter rank for efficient fine-tuning |
| LoRA Alpha | 256 | Scaling factor |
| Learning Rate | 8e-5 | Initial learning rate |
| Batch Size | 2 × 8 | Effective batch size with gradient accumulation |
| Prompt Length | 24 | Number of soft prompt tokens |
| Metric | Score |
|---|---|
| Semantic Similarity | 81.50% |
| Word Overlap | ~45% |
| Training Epochs | 22 (with early stopping) |
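Semantic similarity here refers to cosine similarity between SBERT embeddings of generated and reference summaries; a minimal sketch of the metric (the repo's `evaluator.py` may differ in detail):

```python
import torch
import torch.nn.functional as F

def semantic_similarity(pred_emb, ref_emb):
    """Mean cosine similarity between prediction and reference embeddings, as a percentage.

    Both inputs are (batch, dim) embedding tensors, e.g. from sbert.encode(...).
    """
    return F.cosine_similarity(pred_emb, ref_emb, dim=-1).mean().item() * 100.0
```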
- Model: jimnoneill/BSG_CyLlama
- Dataset: jimnoneill/BSG_CyLlama-training
This project uses the Llama 3.2 Community License.
The training code is provided for research and educational purposes.
- Meta AI for Llama 3.2
- Sentence Transformers team for SBERT
- HuggingFace for transformers and PEFT libraries
Jamey O'Neill
- GitHub: @jimnoneill
- HuggingFace: jimnoneill