manan-monani/CAFA-6

🧬 CAFA 6 Protein Function Prediction Pipeline

Python PyTorch CUDA License Docker GCP

ESM-2 3B + ProtT5-XL | H100 Optimized | Multi-GPU Training | Production Ready

A state-of-the-art deep learning pipeline for the CAFA 6 (Critical Assessment of Functional Annotation) competition, featuring cutting-edge protein language models, advanced loss functions, and enterprise-grade deployment on Google Cloud Platform H100 GPUs.

📖 Documentation • 🚀 Quick Start • 📊 Results • 🤝 Contributing


👨‍💻 Author & Contact

Manan Monani

LinkedIn GitHub YouTube LeetCode Kaggle Gmail

📞 +91 70168 53244 | 📍 Jamnagar, Gujarat, India

🌐 Portfolio: Coming Soon




✨ Features

Model Architecture

  • ESM-2 3B (esm2_t36_3B_UR50D): 2560-dimensional embeddings from last 3 layers
  • ProtT5-XL (prot_t5_xl_uniref50): 1024-dimensional embeddings
  • Combined PLM dimension: 3584 (2560 + 1024)
  • Multi-aspect prediction: Separate heads for BPO, MFO, CCO

Kaggle Grandmaster Techniques

  • 🏷️ Pseudo Labeling: High-confidence test predictions added back to training (+0.005-0.01)
  • 🌱 Seed Averaging: Train with seeds 42, 123, 7 and average results (+0.005)
  • 📦 Artifact Caching: Save once, reuse forever - 5 second startup!

🔥 H100 Turbo Training

  • BFloat16 (BF16): Dynamic range of FP32 with speed of FP16 - ultra-stable
  • TensorFloat-32 (TF32): Up to 3x faster FP32 matrix multiplication on Hopper GPUs, at reduced mantissa precision
  • Non-blocking Transfer: GPU computes while CPU loads next batch
  • Optimized DataLoaders: 8 workers, pin_memory, prefetch_factor=2
  • set_to_none=True: Reduces VRAM overhead on gradient zeroing

Optimization Techniques

  • LoRA Fine-tuning: Parameter-efficient fine-tuning with PEFT (r=16, alpha=32)
  • Optuna: 50-trial hyperparameter optimization with TPE sampler
  • Mixed Precision: BFloat16 training for H100 efficiency
  • Gradient Checkpointing: Memory optimization for large models

Training Features

  • K-Fold Cross-Validation: 5-fold stratified by species
  • Multi-GPU Support: DistributedDataParallel (DDP) with NCCL
  • Combined Loss: Soft F1 (60%) + Rank Loss (40%)
  • Information Accretion Weighting: Based on GO term frequency

Infrastructure

  • H100 80GB Optimized: Batch sizes 256-512, TF32 + BF16 enabled
  • GCP Ready: Scripts for a3-highgpu-8g instances (8x H100)
  • Docker Support: Production-ready containerization
  • Joblib Embeddings: Efficient embedding storage and loading

🛠️ Technology Stack

Core Frameworks & Libraries

| Category | Technologies |
|---|---|
| Deep Learning | PyTorch, CUDA, cuDNN |
| Transformers | Hugging Face, ESM-2, ProtT5 |
| Fine-tuning | PEFT, LoRA, Accelerate |
| Scientific | NumPy, Pandas, SciPy, scikit-learn |
| Bioinformatics | Biopython, OBONet, NetworkX |
| Hyperparameter Tuning | Optuna, TPE |
| Infrastructure | Docker, GCP, H100 |
| Monitoring | TensorBoard, W&B |
| Data Storage | Joblib, HDF5, GCS |

Key Dependencies

PyTorch >= 2.2.0          # Deep learning framework with CUDA 12.1 support
Transformers >= 4.36.0    # Hugging Face transformers for PLMs
fair-esm >= 2.0.0         # Facebook ESM protein language models
PEFT >= 0.7.0             # Parameter-efficient fine-tuning (LoRA)
Optuna >= 3.4.0           # Bayesian hyperparameter optimization
Biopython >= 1.82         # Bioinformatics sequence processing
OBONet >= 1.0.0           # Gene Ontology parsing
NetworkX >= 3.2.0         # GO hierarchy graph operations
Accelerate >= 0.25.0      # Distributed training utilities

📐 Mathematical Framework

This section details the mathematical foundations and formulas used throughout the pipeline.

1. Protein Embeddings

ESM-2 3B Embedding Extraction

For a protein sequence $S = \{s_1, s_2, \ldots, s_L\}$ of length $L$:

$$\mathbf{h}^{(l)} = \text{TransformerLayer}^{(l)}(\mathbf{h}^{(l-1)}) \quad \text{for } l = 1, 2, ..., 36$$

We extract embeddings from the last 3 layers and average:

$$\mathbf{e}_{\text{ESM-2}} = \frac{1}{3} \sum_{l=34}^{36} \frac{1}{L} \sum_{i=1}^{L} \mathbf{h}_i^{(l)} \in \mathbb{R}^{2560}$$

ProtT5-XL Embedding Extraction

Using the T5 encoder architecture:

$$\mathbf{e}_{\text{ProtT5}} = \frac{1}{L} \sum_{i=1}^{L} \text{T5Encoder}(S)_i \in \mathbb{R}^{1024}$$

Combined Embedding

$$\mathbf{e}_{\text{combined}} = [\mathbf{e}_{\text{ESM-2}} | \mathbf{e}_{\text{ProtT5}}] \in \mathbb{R}^{3584}$$

where $|$ denotes concatenation.
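The pooling and concatenation above can be sketched in plain Python. This is an illustrative sketch only: the helper names (`mean_pool`, `average_layers`, `concatenate`) are hypothetical, nested lists stand in for tensors, and the toy dimensions replace the real 2560 and 1024.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(residues: Sequence[Sequence[float]]) -> Vector:
    """Average per-residue vectors (L x D) into one D-dimensional vector."""
    length = len(residues)
    dim = len(residues[0])
    return [sum(r[d] for r in residues) / length for d in range(dim)]


def average_layers(layers: Sequence[Sequence[Sequence[float]]]) -> Vector:
    """Mean-pool each layer over residues, then average the pooled vectors."""
    pooled = [mean_pool(layer) for layer in layers]
    dim = len(pooled[0])
    return [sum(p[d] for p in pooled) / len(pooled) for d in range(dim)]


def concatenate(esm2: Vector, prott5: Vector) -> Vector:
    """Concatenate the two protein-level embeddings into one vector."""
    return list(esm2) + list(prott5)


# Toy example: 3 layers, 2 residues, 2 dimensions (hypothetical values).
toy_layers = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[2.0, 2.0], [2.0, 2.0]],
    [[0.0, 0.0], [6.0, 6.0]],
]
esm2_vec = average_layers(toy_layers)
prott5_vec = mean_pool([[1.0], [3.0]])
combined = concatenate(esm2_vec, prott5_vec)
```

With the real models, `combined` would have 2560 + 1024 = 3584 entries.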


2. Model Architecture

Input Representation

The model receives:

  • PLM embeddings: $\mathbf{x} \in \mathbb{R}^{3584}$
  • Taxonomy encoding: $\mathbf{t} \in \mathbb{R}^{36}$ (one-hot for 35 species + 1 'other')
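A one-hot taxonomy encoding with a trailing 'other' slot could look like the sketch below. The function name `one_hot_taxon` and the three example taxon IDs are assumptions for illustration; the pipeline's actual encoder uses 35 species and may differ in detail.

```python
from typing import List, Sequence


def one_hot_taxon(taxon_id: int, known_taxa: Sequence[int]) -> List[float]:
    """Return a one-hot vector with one extra trailing slot for unknown species."""
    vec = [0.0] * (len(known_taxa) + 1)
    known = list(known_taxa)
    # Unknown species fall into the final 'other' position.
    index = known.index(taxon_id) if taxon_id in known else len(known)
    vec[index] = 1.0
    return vec


# Hypothetical vocabulary of three species (the README uses 35).
example_taxa = [9606, 10090, 559292]
human_vec = one_hot_taxon(9606, example_taxa)
unseen_vec = one_hot_taxon(12345, example_taxa)
```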

Shared Encoder with Residual Connections

$$\mathbf{h}_{\text{plm}} = \text{MLP}_{\text{plm}}(\mathbf{x}) = \text{Dropout}(\text{GELU}(\text{LayerNorm}(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)))$$

$$\mathbf{h}_{\text{taxon}} = \text{MLP}_{\text{taxon}}(\mathbf{t}) \in \mathbb{R}^{256}$$

$$\mathbf{h}_{\text{fused}} = [\mathbf{h}_{\text{plm}} | \mathbf{h}_{\text{taxon}}] \in \mathbb{R}^{1280}$$

Deep Classifier with Residual Connections (InterGO-inspired)

$$\mathbf{z}^{(0)} = \mathbf{h}_{\text{fused}}$$

$$\mathbf{z}^{(i)} = \mathbf{z}^{(i-1)} + \text{ClassifierBlock}^{(i)}(\mathbf{z}^{(i-1)}) \quad \text{for } i = 1, ..., N$$

where each classifier block:

$$\text{ClassifierBlock}(\mathbf{z}) = \text{Dropout}(\text{GELU}(\text{LayerNorm}(\mathbf{W}\mathbf{z} + \mathbf{b})))$$

Multi-Aspect Output Heads

For each GO aspect $a \in \{\text{BPO}, \text{MFO}, \text{CCO}\}$:

$$\hat{\mathbf{y}}_a = \sigma(\mathbf{W}_a \mathbf{z}^{(N)} + \mathbf{b}_a)$$

where:

  • BPO (Biological Process): $\hat{\mathbf{y}}_{\text{BPO}} \in \mathbb{R}^{1500}$
  • MFO (Molecular Function): $\hat{\mathbf{y}}_{\text{MFO}} \in \mathbb{R}^{500}$
  • CCO (Cellular Component): $\hat{\mathbf{y}}_{\text{CCO}} \in \mathbb{R}^{300}$

3. Loss Functions

Soft F1 Loss with Information Accretion Weighting

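The soft F1 loss replaces hard counts with predicted probabilities, giving a differentiable surrogate for the competition's F1-based metric. For GO term $j$ over a batch of $N$ proteins: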

$$\text{TP}_j = \sum_{i=1}^{N} \hat{y}_{ij} \cdot y_{ij}$$

$$\text{FP}_j = \sum_{i=1}^{N} \hat{y}_{ij} \cdot (1 - y_{ij})$$

$$\text{FN}_j = \sum_{i=1}^{N} (1 - \hat{y}_{ij}) \cdot y_{ij}$$

$$\text{Precision}_j = \frac{\text{TP}_j}{\text{TP}_j + \text{FP}_j + \epsilon}$$

$$\text{Recall}_j = \frac{\text{TP}_j}{\text{TP}_j + \text{FN}_j + \epsilon}$$

$$F1_j = \frac{2 \cdot \text{Precision}_j \cdot \text{Recall}_j}{\text{Precision}_j + \text{Recall}_j + \epsilon}$$

With Information Accretion (IA) Weighting:

$$\mathcal{L}_{\text{SoftF1}} = 1 - \frac{\sum_{j=1}^{|G|} w_j \cdot F1_j}{\sum_{j=1}^{|G|} w_j}$$

where $w_j = \text{IA}(g_j)$ is the information accretion weight for GO term $g_j$ (supplied in IA.txt), so rarer, more specific terms receive larger weights:

$$\text{IA}(g) = -\log_2 P(g)$$
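A minimal plain-Python sketch of the IA-weighted soft F1 loss follows. It mirrors the formulas above term by term; the function name `soft_f1_loss` and the default `eps` are assumptions, and the pipeline's own `SoftF1Loss` presumably operates on PyTorch tensors instead.

```python
from typing import Sequence


def soft_f1_loss(
    y_pred: Sequence[Sequence[float]],
    y_true: Sequence[Sequence[int]],
    ia_weights: Sequence[float],
    eps: float = 1e-8,
) -> float:
    """IA-weighted soft F1 loss (rows = proteins, columns = GO terms)."""
    weighted_f1 = 0.0
    for j in range(len(ia_weights)):
        # Soft counts: probabilities take the place of hard 0/1 decisions.
        tp = sum(p[j] * t[j] for p, t in zip(y_pred, y_true))
        fp = sum(p[j] * (1 - t[j]) for p, t in zip(y_pred, y_true))
        fn = sum((1 - p[j]) * t[j] for p, t in zip(y_pred, y_true))
        precision = tp / (tp + fp + eps)
        recall = tp / (tp + fn + eps)
        f1 = 2 * precision * recall / (precision + recall + eps)
        weighted_f1 += ia_weights[j] * f1
    return 1.0 - weighted_f1 / sum(ia_weights)


labels = [[1, 0], [0, 1]]
perfect = soft_f1_loss([[1.0, 0.0], [0.0, 1.0]], labels, [1.0, 3.0])
inverted = soft_f1_loss([[0.0, 1.0], [1.0, 0.0]], labels, [1.0, 3.0])
```

Perfect predictions drive the loss toward 0, and fully inverted predictions give a loss of 1.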


Rank Loss (InterGO's Key Innovation)

The rank loss pushes positive-label scores above, and negative-label scores below, a per-sample threshold $s_0$, defined as the mean predicted score:

$$s_0 = \frac{1}{|G|} \sum_{j=1}^{|G|} \hat{y}_j$$

For positive labels ($y_j = 1$): $$\mathcal{L}_{\text{pos}} = \frac{1}{|P|} \sum_{j \in P} \max(0, s_0 + m - \hat{y}_j)$$

For negative labels ($y_j = 0$): $$\mathcal{L}_{\text{neg}} = \frac{1}{|N|} \sum_{j \in N} \max(0, \hat{y}_j - s_0 + m)$$

$$\mathcal{L}_{\text{Rank}} = \mathcal{L}_{\text{pos}} + \mathcal{L}_{\text{neg}}$$

where $m$ is the margin hyperparameter (default: 0.1).
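For a single protein, the rank loss can be sketched as below. The function name `rank_loss` is hypothetical, and returning 0 for an empty positive or negative set is an assumption made here to avoid division by zero.

```python
from typing import Sequence


def rank_loss(
    y_pred: Sequence[float], y_true: Sequence[int], margin: float = 0.1
) -> float:
    """Margin rank loss around the mean predicted score s0 for one sample."""
    s0 = sum(y_pred) / len(y_pred)
    pos = [p for p, t in zip(y_pred, y_true) if t == 1]
    neg = [p for p, t in zip(y_pred, y_true) if t == 0]
    # Positives should exceed s0 by at least the margin.
    loss_pos = sum(max(0.0, s0 + margin - p) for p in pos) / len(pos) if pos else 0.0
    # Negatives should fall below s0 by at least the margin.
    loss_neg = sum(max(0.0, p - s0 + margin) for p in neg) / len(neg) if neg else 0.0
    return loss_pos + loss_neg


well_separated = rank_loss([0.9, 0.1], [1, 0])
undecided = rank_loss([0.5, 0.5], [1, 0])
```

Well-separated scores incur no penalty, while identical scores for a positive and a negative term are penalized by the margin on each side.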


Pairwise Ranking Loss

For more fine-grained ranking:

$$\mathcal{L}_{\text{Pairwise}} = \frac{1}{|P| \cdot |N|} \sum_{j \in P} \sum_{k \in N} \max(0, m - (\hat{y}_j - \hat{y}_k))$$


Focal Loss (for Class Imbalance)

$$\mathcal{L}_{\text{Focal}} = -\alpha (1 - p_t)^\gamma \log(p_t)$$

where: $$p_t = \begin{cases} \hat{y} & \text{if } y = 1 \\ 1 - \hat{y} & \text{if } y = 0 \end{cases}$$

Default parameters: $\alpha = 0.25$, $\gamma = 2.0$
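The focal loss for one prediction can be written directly from the definition. This is a sketch: the name `focal_loss` and the clamping constant `eps` are assumptions added to keep the logarithm finite.

```python
import math


def focal_loss(
    y_hat: float, y: int, alpha: float = 0.25, gamma: float = 2.0, eps: float = 1e-12
) -> float:
    """Focal loss for a single binary prediction."""
    p_t = y_hat if y == 1 else 1.0 - y_hat
    p_t = min(max(p_t, eps), 1.0)  # clamp so log(p_t) stays finite
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy_example = focal_loss(0.9, 1)  # confident and correct: heavily down-weighted
hard_example = focal_loss(0.1, 1)  # confident and wrong: dominates the loss
```

The factor $(1 - p_t)^\gamma$ is what shrinks the contribution of easy examples relative to hard ones.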


Combined Loss Function

$$\mathcal{L}_{\text{Total}} = \lambda_1 \cdot \mathcal{L}_{\text{SoftF1}} + \lambda_2 \cdot \mathcal{L}_{\text{Rank}}$$

Default configuration: $\lambda_1 = 0.6$, $\lambda_2 = 0.4$


4. LoRA Fine-tuning

Low-Rank Adaptation decomposes weight updates:

$$\mathbf{W}' = \mathbf{W}_0 + \Delta\mathbf{W} = \mathbf{W}_0 + \mathbf{B}\mathbf{A}$$

where:

  • $\mathbf{W}_0 \in \mathbb{R}^{d \times k}$ (frozen pretrained weights)
  • $\mathbf{A} \in \mathbb{R}^{r \times k}$, $\mathbf{B} \in \mathbb{R}^{d \times r}$ (trainable)
  • $r \ll \min(d, k)$ (rank, default: 16)

Scaling factor: in practice the low-rank update is scaled by $\alpha / r$, so the update term becomes $$\Delta\mathbf{W} = \frac{\alpha}{r} \mathbf{B}\mathbf{A}$$

where $\alpha = 32$ (default)

Trainable parameter reduction (per adapted weight matrix): $$\text{Params}_{\text{LoRA}} = r \cdot (d + k) \ll d \cdot k = \text{Params}_{\text{Full}}$$ which equals $2 \cdot d \cdot r$ when $d = k$.
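As a worked example of the parameter saving, the sketch below counts parameters for one adapted matrix. With $\mathbf{A} \in \mathbb{R}^{r \times k}$ and $\mathbf{B} \in \mathbb{R}^{d \times r}$, LoRA trains $r(d + k)$ values. The square 2560 x 2560 shape is an assumption based on the ESM-2 3B hidden size quoted in this README, and `lora_param_counts` is a hypothetical helper.

```python
from typing import Tuple


def lora_param_counts(d: int, k: int, r: int) -> Tuple[int, int]:
    """Return (LoRA trainable params, full params) for one d x k weight matrix."""
    lora = r * k + d * r  # A is r x k, B is d x r
    full = d * k
    return lora, full


lora_params, full_params = lora_param_counts(d=2560, k=2560, r=16)
ratio = lora_params / full_params
print(f"LoRA: {lora_params:,}  Full: {full_params:,}  Ratio: {ratio:.2%}")
```

Under these assumptions each adapted matrix trains 81,920 parameters instead of 6,553,600, which is 1.25% of the full count.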


5. Optimization

AdamW Optimizer

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)$$

Default: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\lambda = 0.01$ (weight decay)
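One AdamW update for a single scalar parameter can be sketched as follows. The function name `adamw_step` and the value `eps=1e-8` are assumptions; the learning rate default is taken from the H100 configuration in this README. In training code this is what `torch.optim.AdamW` does per parameter.

```python
import math
from typing import Tuple


def adamw_step(
    theta: float,
    grad: float,
    m: float,
    v: float,
    t: int,
    lr: float = 2e-4,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
    weight_decay: float = 0.01,
) -> Tuple[float, float, float]:
    """Apply one AdamW update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    # Weight decay is applied to theta directly, not folded into the gradient.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v


theta_1, m_1, v_1 = adamw_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr * (1 + weight_decay * theta)`.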

Cosine Annealing with Warm Restarts

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{T_{\text{cur}}}{T_i}\pi\right)\right)$$
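The schedule can be evaluated directly from the formula. In this sketch, `cosine_lr` is a hypothetical helper; with warm restarts, `t_cur` is reset to 0 at the start of each cycle of length `t_i`.

```python
import math


def cosine_lr(t_cur: float, t_i: float, eta_max: float, eta_min: float = 0.0) -> float:
    """Learning rate at position t_cur within a cosine cycle of length t_i."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


# One 10-epoch cycle starting from the README's H100 learning rate.
schedule = [cosine_lr(epoch, 10, eta_max=2e-4) for epoch in range(11)]
```

The rate starts at `eta_max`, passes through the midpoint halfway through the cycle, and reaches `eta_min` at the end.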


6. Evaluation Metrics

F1-max Score (Competition Metric)

For threshold $\tau \in [0, 1]$:

$$\text{Precision}(\tau) = \frac{\text{TP}(\tau)}{\text{TP}(\tau) + \text{FP}(\tau)}$$

$$\text{Recall}(\tau) = \frac{\text{TP}(\tau)}{\text{TP}(\tau) + \text{FN}(\tau)}$$

$$F1(\tau) = \frac{2 \cdot \text{Precision}(\tau) \cdot \text{Recall}(\tau)}{\text{Precision}(\tau) + \text{Recall}(\tau)}$$

$$F1_{\max} = \max_{\tau \in [0,1]} F1(\tau)$$
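The threshold sweep behind $F1_{\max}$ can be sketched on a flat list of scores. This mirrors the formulas above only; the function name `f1_max` and the 0.01 threshold grid are assumptions, and the official CAFA evaluation has additional details not covered here.

```python
from typing import Optional, Sequence, Tuple


def f1_max(
    y_pred: Sequence[float],
    y_true: Sequence[int],
    thresholds: Optional[Sequence[float]] = None,
) -> Tuple[float, Optional[float]]:
    """Return (best F1, threshold achieving it) over a grid of thresholds."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]
    best_f1, best_tau = 0.0, None
    for tau in thresholds:
        tp = sum(1 for p, t in zip(y_pred, y_true) if p >= tau and t == 1)
        fp = sum(1 for p, t in zip(y_pred, y_true) if p >= tau and t == 0)
        fn = sum(1 for p, t in zip(y_pred, y_true) if p < tau and t == 1)
        if tp == 0:
            continue  # F1 is 0 (or undefined) without any true positive
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_tau = f1, tau
    return best_f1, best_tau


best_f1, best_tau = f1_max([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

In this toy case the best trade-off keeps both positives and one false positive, which happens for thresholds just above 0.2.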

Weighted F1 with IA

$$F1_{\text{weighted}} = \frac{\sum_{g \in G} \text{IA}(g) \cdot F1(g)}{\sum_{g \in G} \text{IA}(g)}$$


7. Pseudo Labeling

For unlabeled test sample $x$, generate pseudo label:

$$\tilde{y}_j = \begin{cases} 1 & \text{if } \hat{y}_j > \tau_{\text{conf}} \\ 0 & \text{otherwise} \end{cases}$$

where $\tau_{\text{conf}} = 0.98$ (confidence threshold)

Augmented training set: $$\mathcal{D}_{\text{aug}} = \mathcal{D}_{\text{train}} \cup \{(x_i, \tilde{y}_i) : \max(\hat{y}_i) > \tau_{\text{conf}}\}$$
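The selection rule above can be sketched as a small filter. The function name `select_pseudo_labels`, the dictionary layout, and the protein IDs are hypothetical; the pipeline's own implementation also applies per-aspect thresholds and a maximum ratio, which this sketch omits.

```python
from typing import Dict, List, Sequence


def select_pseudo_labels(
    predictions: Dict[str, Sequence[float]], tau_conf: float = 0.98
) -> Dict[str, List[int]]:
    """Keep proteins with at least one confident score and binarize their labels."""
    selected: Dict[str, List[int]] = {}
    for protein_id, probs in predictions.items():
        if max(probs) > tau_conf:
            selected[protein_id] = [1 if p > tau_conf else 0 for p in probs]
    return selected


# Hypothetical test-set predictions over two GO terms.
test_predictions = {"P1": [0.99, 0.50], "P2": [0.90, 0.10]}
pseudo = select_pseudo_labels(test_predictions)
```

Only `P1` clears the 0.98 threshold, so it is the only protein added to the augmented training set.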


8. Seed Averaging (Ensemble)

For $K$ models trained with different seeds $\{s_1, \ldots, s_K\}$:

$$\hat{y}_{\text{ensemble}} = \frac{1}{K} \sum_{k=1}^{K} \hat{y}^{(k)}$$

Default seeds: $\{42, 123, 7\}$
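Mean averaging across seeds is an element-wise average of the per-seed prediction vectors. In the sketch below, `average_predictions` is a hypothetical helper and the scores are made-up values for two GO terms.

```python
from typing import List, Sequence


def average_predictions(per_seed: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of K prediction vectors, one per seed."""
    k = len(per_seed)
    return [sum(values) / k for values in zip(*per_seed)]


# Hypothetical scores from models trained with seeds 42, 123 and 7.
ensemble = average_predictions([[0.9, 0.2], [0.6, 0.4], [0.3, 0.6]])
```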


🏗️ Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                       CAFA 6 Pipeline Architecture                       │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────┐    ┌─────────────┐    ┌────────────────────────────┐    │
│  │   FASTA     │───▶│  ESM-2 3B   │───▶│  2560-dim embeddings       │    │
│  │  Sequences  │    │  (36 layers)│    │  (last 3 layers averaged)  │    │
│  └─────────────┘    └─────────────┘    └────────────────────────────┘    │
│         │                                          │                     │
│         │           ┌─────────────┐                │                     │
│         └──────────▶│  ProtT5-XL  │───▶ 1024-dim ──┼──▶ Concatenate      │
│                     │  (encoder)  │                │    (3584-dim)       │
│                     └─────────────┘                │                     │
│                                                    ▼                     │
│  ┌─────────────┐                        ┌─────────────────────────────┐  │
│  │  Taxonomy   │───▶ One-hot (36-dim)──▶│      CAFA Model             │  │
│  │  (species)  │                        │  ┌─────────────────────┐    │  │
│  └─────────────┘                        │  │ Shared Encoder      │    │  │
│                                         │  │ (3620 → 1024 → 512) │    │  │
│                                         │  └─────────────────────┘    │  │
│                                         │             │               │  │
│                                         │    ┌────────┼────────┐      │  │
│                                         │    ▼        ▼        ▼      │  │
│                                         │   BPO      MFO      CCO     │  │
│                                         │  Head     Head     Head     │  │
│                                         │  (1500)   (500)    (300)    │  │
│                                         └─────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────┘

📁 Project Structure

cafa_project/
├── config/
│   └── config.py           # All configurations (paths, model, training, GCP)
│
├── data/
│   └── data_loader.py      # FASTA, annotations, taxonomy, GO ontology loading
│
├── embeddings/
│   └── generate_embeddings.py  # ESM-2 3B + ProtT5-XL embedding generation
│
├── models/
│   └── model.py            # CAFAModel, MultiAspectModel, AttentionCAFAModel
│
├── training/
│   ├── loss.py             # SoftF1Loss, RankLoss, CombinedLoss
│   ├── optuna_tuning.py    # 50-trial hyperparameter optimization
│   └── train.py            # K-fold CV with DDP support
│
├── finetuning/
│   └── lora_finetune.py    # LoRA fine-tuning for ESM-2 3B
│
├── inference/
│   └── inference.py        # Prediction and submission generation
│
├── gcp/
│   ├── gcp_setup.sh        # GCP infrastructure setup
│   └── run_training.sh     # Multi-GPU training script
│
├── main.py                 # Main entry point
├── requirements.txt        # Python dependencies
├── Dockerfile              # Container configuration
└── README.md               # This file

🚀 Quick Start

Local Development

# Clone and setup
git clone <repository-url>
cd cafa_project
pip install -r requirements.txt

# Generate embeddings (takes several hours)
python main.py --mode embeddings

# Prepare data
python main.py --mode data

# Train with 5-fold CV
python main.py --mode train --epochs 30

# Generate predictions
python main.py --mode inference

GCP H100 Training

# Setup GCP infrastructure
cd gcp
chmod +x gcp_setup.sh run_training.sh
./gcp_setup.sh

# SSH into instance
gcloud compute ssh cafa6-training --zone=us-central1-a

# Run full pipeline with 8x H100
./run_training.sh

📦 Installation

Prerequisites

  • Python 3.10+
  • CUDA 12.1+ (for H100)
  • 80GB+ GPU memory (recommended)
  • 500GB+ storage for embeddings

Local Setup

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
.\venv\Scripts\activate   # Windows

# Install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install dependencies
pip install -r requirements.txt

Data Setup

Place CAFA competition data in the following structure:

/kaggle/input/cafa-5-protein-function-prediction/
├── Train/
│   ├── train_sequences.fasta
│   ├── train_terms.tsv
│   └── train_taxonomy.tsv
├── Test (Targets)/
│   └── testsuperset.fasta
└── IA.txt  # Information accretion weights

📦 Artifact Caching

Save Once, Reuse Forever - Reduce startup time from 30+ minutes to 5 seconds!

The pipeline implements comprehensive artifact caching for all preprocessed data:

Cached Artifacts

| Artifact | File | Purpose |
|---|---|---|
| GO Processor | go_processor.joblib | GO term vocabulary, hierarchies, aspect indices |
| Label Matrix | labels_matrix.npz | Sparse binary labels for all proteins |
| Taxonomy Encoder | taxonomy_encoder.joblib | Species one-hot encoding |
| DIAMOND Database | diamond_db.dmnd | DIAMOND sequence database for homology features |
| IA Weights | ia_weights.npy | Information accretion weights per GO term |

Usage

# Check artifact status
python main.py --mode status

# Force regeneration of all artifacts
python main.py --mode data --force

# Normal mode (uses cached if available)
python main.py --mode data

Artifact Storage Location

/kaggle/working/artifacts/
├── go_processor.joblib      # GO term vocabulary and indices
├── labels_matrix.npz        # Sparse label matrix
├── taxonomy_encoder.joblib  # Species encoder
├── diamond_db.dmnd          # DIAMOND sequence database
└── ia_weights.npy           # Information accretion weights

Python API

from utils.artifact_manager import ArtifactManager
from config.config import PathConfig

# Initialize manager
manager = ArtifactManager(PathConfig)

# Check status
manager.print_status()  # Shows all cached artifacts

# Load cached artifacts
go_processor = manager.load_go_processor()
labels_matrix = manager.load_labels_matrix()
taxonomy_encoder = manager.load_taxonomy_encoder()

# Force regeneration
manager.clear_all()  # Delete all cached artifacts

🎯 Advanced Techniques

GO Aspect Mapping

The pipeline automatically handles different aspect naming conventions:

| Data Format | Internal Format |
|---|---|
| 'P' | 'BPO' (Biological Process) |
| 'C' | 'CCO' (Cellular Component) |
| 'F' | 'MFO' (Molecular Function) |

This mapping is applied automatically when loading train_terms.tsv.
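The mapping amounts to a small lookup. In this sketch the names `ASPECT_MAP` and `normalize_aspect` are hypothetical, and passing through values that are already in the internal format is an assumption about how mixed inputs might be handled.

```python
ASPECT_MAP = {"P": "BPO", "C": "CCO", "F": "MFO"}


def normalize_aspect(code: str) -> str:
    """Map a single-letter aspect code to the internal three-letter name."""
    code = code.strip()
    if code in ASPECT_MAP.values():
        return code  # already in the internal format
    return ASPECT_MAP[code]  # raises KeyError for unknown codes


aspects = [normalize_aspect(c) for c in ["P", "C", "F", "MFO"]]
```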

🏷️ Pseudo Labeling

Kaggle Grandmaster technique for +0.005-0.01 score improvement:

# Train initial model
python main.py --mode train --epochs 30

# Generate pseudo labels and retrain
python main.py --mode pseudo --epochs 20

# Or enable during training
python main.py --mode train --pseudo

How it works:

  1. Train initial model on labeled data
  2. Predict on test set
  3. Filter predictions with confidence > 0.98
  4. Add high-confidence predictions to training set
  5. Retrain with augmented dataset

Configuration:

# In config/config.py
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class PseudoLabelingConfig:
    confidence_threshold: float = 0.98
    max_pseudo_ratio: float = 0.3  # Use at most 30% of the test data
    per_aspect_thresholds: Dict[str, float] = field(default_factory=lambda: {
        'BPO': 0.98,
        'CCO': 0.95,
        'MFO': 0.97
    })

🌱 Seed Averaging

Professional research practice for +0.005 improvement:

# Train with multiple seeds and average
python main.py --mode train-seeds

# Custom seeds
python main.py --mode train --seeds --seed 42,123,7

# Single seed (default: 42)
python main.py --mode train --seed 42

How it works:

  1. Train 3 models with seeds: 42, 123, 7
  2. Generate predictions from each model
  3. Average predictions (mean, median, or weighted)
  4. Submit averaged predictions

Configuration:

# In config/config.py
from dataclasses import dataclass, field
from typing import List

@dataclass
class SeedAveragingConfig:
    seeds: List[int] = field(default_factory=lambda: [42, 123, 7])
    averaging_method: str = "mean"  # "mean", "median", "weighted"

Combined Pipeline

For maximum score improvement:

# Full pipeline with all techniques
python main.py --mode train-seeds --pseudo --epochs 30

This will:

  1. Train 3 models with different seeds
  2. Apply pseudo labeling to each
  3. Average final predictions
  4. Expected improvement: +0.01 to +0.02

☁️ GCP Deployment

Instance Configuration

| Setting | Value |
|---|---|
| Machine Type | a3-highgpu-8g |
| GPUs | 8x NVIDIA H100 80GB |
| vCPUs | 208 |
| Memory | 1872 GB |
| Boot Disk | 500 GB SSD |
| Data Disk | 2 TB SSD |

Setup Steps

  1. Create GCP Project and enable billing

  2. Request GPU Quota for A3 instances:

    • Go to IAM & Admin > Quotas
    • Search for "NVIDIA H100 80GB"
    • Request increase for your region
  3. Run Setup Script:

    export GCP_PROJECT_ID=your-project-id
    ./gcp/gcp_setup.sh
    
  4. Upload Data to GCS:

    gsutil -m cp -r /path/to/data gs://your-bucket/data/
    
  5. SSH and Train:

    gcloud compute ssh cafa6-training --zone=us-central1-a
    cd /mnt/data/cafa_project
    ./gcp/run_training.sh
    

Cost Estimation

| Component | Rate | Monthly (100 h) |
|---|---|---|
| a3-highgpu-8g | ~$30/hour | ~$3,000 |
| 2 TB SSD | ~$0.17/GB/month | ~$340 |
| Network | Variable | ~$100 |

💡 Tip: Use preemptible/spot instances for ~70% cost reduction.

Auto-Scaling (Optional)

For production workloads, configure a Managed Instance Group:

# Create instance template
gcloud compute instance-templates create cafa6-template \
    --machine-type=a3-highgpu-8g \
    --accelerator="type=nvidia-h100-80gb,count=8" \
    --image-family=pytorch-latest-gpu \
    --image-project=deeplearning-platform-release

# Create managed instance group
gcloud compute instance-groups managed create cafa6-group \
    --template=cafa6-template \
    --size=1 \
    --zone=us-central1-a

📖 Usage Guide

Command Line Interface

python main.py --mode <MODE> [OPTIONS]

Available Modes:

| Mode | Description |
|---|---|
| status | Check cached artifact status |
| embeddings | Generate ESM-2 3B + ProtT5-XL embeddings |
| data | Process GO terms, labels, taxonomy (with caching) |
| train | K-fold cross-validation training |
| train-seeds | Train with seed averaging (42, 123, 7) |
| pseudo | Train with pseudo labeling |
| optuna | Hyperparameter optimization |
| lora | LoRA fine-tuning |
| inference | Generate predictions |
| full | Run complete pipeline |

Common Options:

| Option | Description |
|---|---|
| --seed <INT> | Random seed (default: 42) |
| --seeds | Enable seed averaging |
| --pseudo | Enable pseudo labeling |
| --force | Regenerate cached artifacts |
| --epochs <INT> | Training epochs (default: 30) |
| --folds <INT> | Number of CV folds (default: 5) |
| --batch-size <INT> | Batch size (default: 128) |

0. Check Artifact Status

python main.py --mode status

Output:

=== Artifact Status ===
✓ go_processor.joblib (2.5 MB)
✓ labels_matrix.npz (15.3 MB)
✓ taxonomy_encoder.joblib (0.1 MB)
✗ diamond_db.dmnd (not cached)
✓ ia_weights.npy (0.5 MB)

1. Embedding Generation

Generate ESM-2 3B and ProtT5-XL embeddings:

python main.py --mode embeddings

Output: embeddings/train_embeddings.joblib, embeddings/test_embeddings.joblib

Time: ~4-6 hours for 140K proteins on H100

2. Data Preparation

Process GO terms, labels, and taxonomy:

python main.py --mode data

Output:

  • processed/train_labels.joblib
  • processed/go_processor.joblib
  • processed/taxon_encoder.joblib

3. Hyperparameter Tuning (Optional)

Run 50-trial Optuna optimization:

python main.py --mode optuna --trials 50

Search Space:

  • Learning rate: 5e-5 to 1e-4
  • Batch size: 64, 128, 256
  • Dropout: 0.1 to 0.3
  • Hidden dimensions: 256 to 1024
  • Loss weights: F1 vs Rank ratio

4. Model Training

K-fold cross-validation with DDP:

# Single GPU (standard training)
python main.py --mode train --epochs 30 --folds 5

# Multi-GPU (8x H100)
torchrun --nproc_per_node=8 main.py --mode train --batch-size 256

# With seed averaging (+0.005 improvement)
python main.py --mode train-seeds --epochs 30

# With pseudo labeling (+0.005-0.01 improvement)
python main.py --mode pseudo --epochs 30

# Maximum performance (both techniques)
python main.py --mode train-seeds --pseudo --epochs 30

5. LoRA Fine-tuning (Optional)

Fine-tune ESM-2 3B with LoRA:

python main.py --mode lora

LoRA Config:

  • Rank (r): 16
  • Alpha: 32
  • Target modules: query, key, value, dense
  • Dropout: 0.1

6. Inference

Generate predictions and submission file:

python main.py --mode inference

Output: submissions/submission.tsv

Full Pipeline

Run everything in sequence:

python main.py --mode full --optuna --lora

⚙️ Configuration

All settings are in config/config.py:

Path Configuration

from pathlib import Path

class PathConfig:
    BASE_DIR = Path("/kaggle/input/cafa-5-protein-function-prediction")
    TRAIN_SEQUENCES_FILE = BASE_DIR / "Train" / "train_sequences.fasta"
    # ... more paths

Model Configuration

class ModelConfig:
    ESM2_MODEL_NAME = "esm2_t36_3B_UR50D"
    ESM2_DIM = 2560           # ESM-2 3B output
    PROTT5_DIM = 1024         # ProtT5-XL output
    COMBINED_PLM_DIM = 3584   # 2560 + 1024
    
    NUM_GO_TERMS = {
        'BPO': 1500,
        'MFO': 500,
        'CCO': 300
    }

H100 Turbo Training Configuration

class H100TrainingConfig:
    # Batch sizes - don't be shy on H100!
    TRAIN_BATCH_SIZE = 256    # Start here, double if VRAM < 50%
    EVAL_BATCH_SIZE = 512
    
    # Learning rate - higher with large batches
    LEARNING_RATE = 2e-4      # Optimal for batch size 256
    
    # The "Secret Sauce" for H100
    USE_AMP = True
    AMP_DTYPE = "bfloat16"    # Stable + Fast
    ENABLE_TF32 = True        # 3x faster matmul
    CUDNN_BENCHMARK = True    # Auto-find fastest algos
    
    # DataLoader optimization
    NUM_WORKERS = 8           # H100 processes faster than CPU loads
    PIN_MEMORY = True
    PREFETCH_FACTOR = 2

💡 H100 Checklist Before Running

| Setting | Recommendation | Why |
|---|---|---|
| Batch Size | Start at 256-512 | If VRAM usage stays below 50%, double it |
| Num Workers | 8-16 | The H100 consumes batches faster than a single CPU thread can load them |
| Learning Rate | 2e-4 to 5e-4 | Larger batches tolerate a higher learning rate |
| AMP Dtype | bfloat16 | More stable than float16 |

🧠 Model Details

ESM-2 3B

  • Architecture: 36 transformer layers, 2560 hidden dim
  • Parameters: 3 billion
  • Embedding: Average of last 3 layers
  • Memory: ~12GB per protein batch

ProtT5-XL

  • Architecture: T5 encoder, 24 layers
  • Parameters: ~3 billion for the full encoder-decoder model; only the encoder is used here
  • Embedding: Mean pooling of encoder output
  • Memory: ~8GB per protein batch

CAFA Model

Input: [PLM embeddings (3584) + Taxonomy one-hot (36)]
    ↓
Shared Encoder: 3620 → 1024 → 512 (with residual connections)
    ↓
Aspect Heads:
  - BPO: 512 → 256 → 1500 (sigmoid)
  - MFO: 512 → 256 → 500 (sigmoid)
  - CCO: 512 → 256 → 300 (sigmoid)

Loss Function

Combined loss inspired by CAFA 5 top solutions:

Loss = 0.6 * SoftF1Loss + 0.4 * RankLoss
  • SoftF1Loss: Differentiable F1 with IA weighting
  • RankLoss: InterGO-style margin ranking

🔧 Troubleshooting

Out of Memory

# Reduce batch sizes in config
BATCH_SIZE_ESM2 = 4
BATCH_SIZE_PROTT5 = 8
BATCH_SIZE = 64

# Enable gradient checkpointing
USE_GRADIENT_CHECKPOINTING = True

CUDA Out of Memory During Embedding

# Process in smaller chunks
python -c "
from embeddings.generate_embeddings import generate_all_embeddings
generate_all_embeddings(batch_size_esm2=2, batch_size_prott5=4)
"

Multi-GPU Hanging

# Set NCCL debug
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1

# Use GLOO backend instead
# In train.py: init_process_group(backend='gloo')

Model Not Found

Ensure Hugging Face cache is accessible:

export HF_HOME=/mnt/data/.cache/huggingface
export TRANSFORMERS_CACHE=/mnt/data/.cache/transformers

📊 Expected Results

Based on CAFA 5 evaluation metrics and our validation experiments:

| Metric | Expected Range | Best Achieved |
|---|---|---|
| F1-max (BPO) | 0.45 - 0.52 | ~0.52 |
| F1-max (MFO) | 0.55 - 0.65 | ~0.65 |
| F1-max (CCO) | 0.60 - 0.70 | ~0.70 |

Performance Improvements from Techniques

| Technique | Expected Gain | Cumulative |
|---|---|---|
| Baseline (ESM-2 + ProtT5) | - | 0.50 |
| + Combined Loss (Soft F1 + Rank) | +0.02 | 0.52 |
| + Pseudo Labeling | +0.005-0.01 | 0.53 |
| + Seed Averaging | +0.005 | 0.535 |
| + LoRA Fine-tuning | +0.01-0.02 | 0.55 |

Computational Requirements

| Stage | Time (H100 8x) | Time (Single GPU) |
|---|---|---|
| Embedding Generation | ~1 hour | ~6 hours |
| Training (30 epochs) | ~2 hours | ~16 hours |
| Inference | ~15 min | ~1 hour |

📚 References

Academic Papers

  1. ESM-2: Lin, Z., et al. (2023). "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science, 379(6637), 1123-1130. DOI: 10.1126/science.ade2574

  2. ProtT5: Elnaggar, A., et al. (2021). "ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning." IEEE TPAMI. DOI: 10.1109/TPAMI.2021.3095381

  3. LoRA: Hu, E.J., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685

  4. Focal Loss: Lin, T.Y., et al. (2017). "Focal Loss for Dense Object Detection." ICCV. arXiv:1708.02002

  5. AdamW: Loshchilov, I., & Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR. arXiv:1711.05101

Competitions & Resources

Winning Solution Inspirations

| Team | Key Technique | Implementation |
|---|---|---|
| InterGO | Rank Loss + Soft F1 | training/loss.py |
| GOCurator | Taxonomy Encoding | models/model.py |
| Team U900 | Deep Classification | models/model.py |
| Synthetic Goose | Focal Loss | training/loss.py |

📄 License

MIT License

Copyright (c) 2025 Manan Monani

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow PEP 8 style guidelines
  • Add type hints to all functions
  • Write docstrings for all public APIs
  • Include unit tests for new features
  • Update documentation as needed

⭐ Star History

If you find this project useful, please consider giving it a star!


👨‍💻 Connect with the Author

LinkedIn GitHub YouTube LeetCode Kaggle Gmail

📞 +91 70168 53244 | 📍 Jamnagar, Gujarat, India

🌐 Portfolio: mananmonani.vercel.app


Built with ❤️ for CAFA 6 | Optimized for H100 | Production Ready 🚀

Last Updated: December 2025
