A Bidirectional DNA Language Model for Chicken Genome
PoultryCaduceus is a DNA foundation model pre-trained specifically on the chicken (Gallus gallus) genome, built on the Caduceus architecture.
- 🧬 Chicken-specific: Pre-trained on the GRCg6a reference genome (~1.1 Gb)
- 🔄 Bidirectional: Mamba-based bidirectional sequence modeling
- ⚡ RC Equivariance: Built-in reverse complement equivariance
- 📏 Long-range: Supports a 65,536 bp context
| Parameter | Value |
|---|---|
| Base Model | caduceus-ph (4-layer) |
| Hidden Dim | 256 |
| Vocab Size | 16 |
| Sequence Length | 65,536 bp |
| Training Steps | 10,000 |
| Hardware | 4x H200 |
```bash
git clone https://github.com/chengzhimin/PoultryCaduceus.git
cd PoultryCaduceus
source setup_env.sh
```

```python
from transformers import AutoModelForMaskedLM

# Load from HuggingFace
model = AutoModelForMaskedLM.from_pretrained(
    "jamie0315/PoultryCaduceus",
    subfolder="checkpoint-10000",
    trust_remote_code=True,
)

# Or load from a local checkpoint
model = AutoModelForMaskedLM.from_pretrained(
    "./checkpoint-10000",
    trust_remote_code=True,
)
```

```python
import torch

# DNA vocabulary (token ids used by the model)
DNA_VOCAB = {'A': 7, 'C': 8, 'G': 9, 'T': 10, 'N': 5, '[MASK]': 4}

# Encode a sequence (unknown characters fall back to N)
sequence = "ATGCGATCGATCGATCG"
input_ids = torch.tensor([[DNA_VOCAB.get(c, 5) for c in sequence]])

# Get embeddings
model.eval()
with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)
embeddings = outputs.hidden_states[-1]  # (batch, seq_len, 256)
```
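Since the checkpoint loads as a masked LM, you can also query per-base predictions directly. Below is a minimal sketch that masks one position and reads off the most likely base; it reuses `model` and `DNA_VOCAB` from above, and `ID_TO_BASE` plus the assumption that the output logits index the same 16-token vocabulary are ours, not documented by the repo.

```python
import torch

# Map model token ids back to bases (assumed to mirror DNA_VOCAB above)
ID_TO_BASE = {7: 'A', 8: 'C', 9: 'G', 10: 'T'}

sequence = "ATGCGATCGATCGATCG"
input_ids = torch.tensor([[DNA_VOCAB.get(c, 5) for c in sequence]])

# Mask one position and run the masked-LM head
masked = input_ids.clone()
masked[0, 8] = DNA_VOCAB['[MASK]']

model.eval()
with torch.no_grad():
    logits = model(masked).logits  # (batch, seq_len, vocab_size=16)

# Restrict to A/C/G/T and take the highest-probability base at the mask
base_ids = torch.tensor(list(ID_TO_BASE.keys()))
probs = logits[0, 8, base_ids].softmax(dim=-1)
print(ID_TO_BASE[base_ids[probs.argmax()].item()], probs.tolist())
```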
```bash
# Create conda environment
conda create -n caduceus_env python=3.10
conda activate caduceus_env

# Install dependencies
pip install torch transformers h5py biopython pyyaml tensorboard

# Install Caduceus (requires CUDA)
pip install caduceus-dna
```

Download the pre-trained Caduceus base model:

```bash
git lfs install
git clone https://huggingface.co/kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-4 ./caduceus-ph-model
```
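Before training, it is worth a quick smoke test that the base checkpoint and the CUDA-dependent Caduceus code actually load. A minimal check (the parameter-count printout is just illustrative):

```python
from transformers import AutoModelForMaskedLM

# Smoke test: load the downloaded base model and count its parameters
base = AutoModelForMaskedLM.from_pretrained("./caduceus-ph-model", trust_remote_code=True)
print(f"{sum(p.numel() for p in base.parameters()):,} parameters")
```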
Run the data preparation notebook on Google Colab (for servers without internet access):

```bash
# Run on Colab:
#   notebooks/data_preparation.ipynb

# Download the generated data file:
#   chicken_pretrain_data_GRCg6a.tar.gz

# Upload it to the server and extract
tar -xzf chicken_pretrain_data_GRCg6a.tar.gz
```

Data directory structure (from HuggingFace):
```
PoultryCaduceus/
├── checkpoint-10000/                 # Model weights
│   ├── config.json
│   └── model.safetensors
└── chicken_pretrain_data_GRCg6a/     # Pre-training data
    ├── train_65k.h5                  # Training set (~58,000 sequences)
    └── val_65k.h5                    # Validation set (~1,200 sequences)
```
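To sanity-check the extracted archive, you can list the HDF5 contents. The internal dataset names are not documented in this README, so the sketch below simply prints whatever is present:

```python
import h5py

# List every dataset in the training file along with its shape
with h5py.File("chicken_pretrain_data_GRCg6a/train_65k.h5", "r") as f:
    f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))
```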
```bash
# Single GPU
python scripts/train_chicken_caduceus_v8.py --config configs/chicken_caduceus_10k.yaml

# Multi-GPU (4x H200)
torchrun --nproc_per_node=4 scripts/train_chicken_caduceus_v8.py \
    --config configs/chicken_caduceus_10k.yaml
```

Checkpoints are written to:

```
outputs/chicken_caduceus_10k/
├── checkpoint-1000/
├── checkpoint-2000/
├── ...
└── checkpoint-10000/        # Final model
    ├── config.json
    ├── model.safetensors
    └── training_state.pt
```
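Since tensorboard is among the installed dependencies, training curves can presumably be monitored from the output directory (assuming the training script writes event files there):

```bash
tensorboard --logdir outputs/chicken_caduceus_10k
```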
Repository layout:

```
PoultryCaduceus/
├── README.md
├── LICENSE
├── setup_env.sh                       # Environment setup
├── configs/
│   └── chicken_caduceus_10k.yaml      # Training config
├── scripts/
│   ├── chicken_dataset.py             # Dataset class
│   └── train_chicken_caduceus_v8.py   # Training script
└── notebooks/
    └── data_preparation.ipynb         # Data preparation (Colab)
```
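For orientation, here is a minimal sketch of what an HDF5-backed dataset with reverse-complement augmentation (the role `scripts/chicken_dataset.py` plays) might look like. The class name, the `sequences` key, and the complement map are our assumptions, not the repository's actual code:

```python
import h5py
import torch
from torch.utils.data import Dataset

# A <-> T, C <-> G under the token ids from DNA_VOCAB
COMPLEMENT = {7: 10, 10: 7, 8: 9, 9: 8}

class ChickenH5Dataset(Dataset):
    """Serves pre-tokenized 65,536 bp windows from an HDF5 file."""

    def __init__(self, path, key="sequences", rc_aug=True):
        self.path, self.key, self.rc_aug = path, key, rc_aug
        with h5py.File(path, "r") as f:
            self.length = f[key].shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Open per access so the dataset stays safe with DataLoader workers
        with h5py.File(self.path, "r") as f:
            seq = torch.as_tensor(f[self.key][idx], dtype=torch.long)
        if self.rc_aug and torch.rand(()) < 0.5:
            # Reverse, then complement each token id
            seq = torch.flip(seq, dims=(0,)).apply_(lambda t: COMPLEMENT.get(t, t))
        return {"input_ids": seq}
```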
```yaml
# chicken_caduceus_10k.yaml
model:
  pretrained_model: ./caduceus-ph-model   # Base model path

data:
  train_path: chicken_pretrain_data_GRCg6a/train_65k.h5
  val_path: chicken_pretrain_data_GRCg6a/val_65k.h5
  seq_length: 65536        # Sequence length
  batch_size: 6            # Batch size per GPU
  mlm_probability: 0.15    # Mask ratio
  rc_aug: true             # Reverse complement augmentation

training:
  max_steps: 10000         # Training steps
  warmup_steps: 500
  gradient_accumulation_steps: 2
  bf16: true               # Mixed precision

optimizer:
  lr: 2e-4
  weight_decay: 0.01
```

With 4 GPUs, a per-GPU batch size of 6, and gradient accumulation of 2, the effective batch is 4 × 6 × 2 = 48 sequences (~3.1 Mbp) per optimizer step.

- MPRA Prediction: Predict regulatory sequence activity (see the fine-tuning sketch after this list)
- eQTL Analysis: Identify expression quantitative trait loci
- GWAS Fine-mapping: Prioritize causal variants
- Regulatory Element Annotation: Identify enhancers, promoters, etc.
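As an illustration of the downstream recipe, here is a minimal fine-tuning sketch for MPRA-style activity regression: mean-pool the last hidden states and regress a scalar. The `ActivityRegressor` class is our construction, not part of the repository:

```python
import torch
import torch.nn as nn

class ActivityRegressor(nn.Module):
    """Mean-pools PoultryCaduceus embeddings and predicts one activity score."""

    def __init__(self, backbone, hidden_dim=256):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, input_ids):
        out = self.backbone(input_ids, output_hidden_states=True)
        pooled = out.hidden_states[-1].mean(dim=1)   # (batch, 256)
        return self.head(pooled).squeeze(-1)         # (batch,)

# Reuses `model` and `input_ids` from the quick-start above
regressor = ActivityRegressor(model)
activity = regressor(input_ids)
```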
MIT License
- 🤗 HuggingFace: jamie0315/PoultryCaduceus