A combined State Embedding (SE) and State Transition (ST) model for cross-cell-type perturbation prediction in single-cell genomics.
- Cross-cell-type generalization: Better prediction accuracy across different cell types
- Cell-type-agnostic modeling: Uses universal state embeddings for robust predictions
- Pre-trained SE integration: Leverages pre-trained State Embedding models
- Easy installation: Install via pip from GitHub
- Complete pipeline: Training, inference, and evaluation utilities
Install via pip or uv:

```bash
# Install directly from GitHub
pip install git+https://github.com/maggie26375/se-st-combined@main

# Or using uv (faster)
uv add git+https://github.com/maggie26375/se-st-combined@main
```

For development, install from a local clone:

```bash
# Clone the repository
git clone https://github.com/maggie26375/se-st-combined.git
cd se-st-combined
# Install in development mode
pip install -e .
# Or using uv
uv pip install -e .
```

Basic usage:

```python
from se_st_combined.models.se_st_combined import SE_ST_CombinedModel
from se_st_combined.utils.se_st_utils import load_se_st_model, predict_perturbation_effects
# Load the model
model = load_se_st_model(
    model_dir="path/to/model",
    checkpoint_path="path/to/checkpoint.ckpt",
    se_model_path="path/to/se/model",
    se_checkpoint_path="path/to/se/checkpoint.ckpt",
)
# Make predictions
predictions = predict_perturbation_effects(
    model=model,
    ctrl_expressions=ctrl_expressions,
    pert_embeddings=pert_embeddings,
)
```
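The snippet above assumes `ctrl_expressions` and `pert_embeddings` already exist in memory. The sketch below shows one way they might be prepared; the `.h5ad` path, the `target_gene` column, the control label, and the layout of the ESM2 features file are assumptions for illustration only, not part of this package's documented API.

```python
import numpy as np
import scanpy as sc
import torch

# Illustrative sketch only: adapt paths, column names, and formats to your data.

# Control expression: unperturbed cells from an AnnData file (hypothetical path/column).
adata = sc.read_h5ad("data/my_dataset.h5ad")
ctrl = adata[adata.obs["target_gene"] == "non-targeting"]
X = ctrl.X.toarray() if hasattr(ctrl.X, "toarray") else np.asarray(ctrl.X)
ctrl_expressions = torch.tensor(X, dtype=torch.float32)

# Perturbation embeddings: the training command below references data/ESM2_pert_features.pt.
# Assumption: the file stores a dict mapping perturbation names to embedding tensors.
pert_features = torch.load("data/ESM2_pert_features.pt")
pert_embeddings = torch.stack([pert_features[g] for g in ["TP53", "MYC"]])  # hypothetical genes
```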
See `examples/se_st_virtual_cell_challenge.py` for a complete training example. Training uses the `se-st-train` command with configuration overrides, similar to the original STATE training:

```bash
uv run se-st-train \
  data.kwargs.toml_config_path="data/starter.toml" \
  data.kwargs.perturbation_features_file="data/ESM2_pert_features.pt" \
  training.max_steps=40000 \
  model=se_st_combined \
  model.kwargs.se_model_path="SE-600M" \
  model.kwargs.se_checkpoint_path="SE-600M/se600m_epoch15.ckpt" \
  output_dir="results" \
  name="se_st_experiment"
```

The SE+ST Combined Model consists of two main components:
**SE Encoder (State Embedding)**
- Converts raw gene expression to universal state embeddings
- Uses a pre-trained SE model (e.g., SE-600M)
- Provides cell-type-agnostic representations

**ST Predictor (State Transition)**
- Predicts perturbation effects in state embedding space
- Uses a transformer backbone (GPT-2 or Llama)
- Learns set-to-set functions for perturbation modeling
```
Raw Expression → SE Encoder → State Embeddings
                                    ↓
Perturbation Embeddings → ST Predictor → Predicted Expression
```
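For intuition, the data flow above corresponds roughly to the following sketch. The class and attribute names (`SEThenST`, `se_encoder`, `st_predictor`) are illustrative placeholders, not the actual implementation in `se_st_combined/models/se_st_combined.py`.

```python
import torch
import torch.nn as nn


class SEThenST(nn.Module):
    """Illustrative two-stage wiring: SE encoder -> ST predictor (names are hypothetical)."""

    def __init__(self, se_encoder: nn.Module, st_predictor: nn.Module, freeze_se: bool = True):
        super().__init__()
        self.se_encoder = se_encoder      # pre-trained SE model (e.g., SE-600M)
        self.st_predictor = st_predictor  # transformer-based ST model (GPT-2/Llama backbone)
        if freeze_se:
            # Mirrors the `freeze_se_model: true` config option: keep SE weights fixed.
            for p in self.se_encoder.parameters():
                p.requires_grad = False

    def forward(self, ctrl_expressions: torch.Tensor, pert_embeddings: torch.Tensor) -> torch.Tensor:
        # 1) Raw expression -> universal, cell-type-agnostic state embeddings
        state_embeddings = self.se_encoder(ctrl_expressions)
        # 2) State embeddings + perturbation embeddings -> predicted post-perturbation expression
        return self.st_predictor(state_embeddings, pert_embeddings)
```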
Expected improvements over the baseline StateTransition model:
- Cross-cell-type accuracy: 10-20% improvement
- Generalization: Better performance on unseen cell types
- Training stability: More stable convergence
- Robustness: More consistent predictions across cell types
The model can be configured via YAML files:
```yaml
# se_st_combined.yaml
name: se_st_combined
kwargs:
  se_model_path: "SE-600M"
  se_checkpoint_path: "SE-600M/se600m_epoch15.ckpt"
  freeze_se_model: true
  st_hidden_dim: 672
  st_cell_set_len: 128
  transformer_backbone_key: llama
```
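For illustration, the sketch below reads this YAML and passes the `kwargs` block to the model constructor. Whether every key maps one-to-one onto `SE_ST_CombinedModel`'s constructor is an assumption; in normal use the `se-st-train` command resolves the configuration for you.

```python
import yaml

from se_st_combined.models.se_st_combined import SE_ST_CombinedModel

# Load the model config shipped with the package (path taken from the project layout below).
with open("se_st_combined/configs/se_st_combined.yaml") as f:
    cfg = yaml.safe_load(f)

# Assumption: the `kwargs` section maps directly onto the model constructor.
model = SE_ST_CombinedModel(**cfg["kwargs"])
print(cfg["name"], "->", type(model).__name__)
```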
Project layout:

```
se-st-combined/
├── se_st_combined/
│   ├── models/
│   │   ├── se_st_combined.py       # Main SE+ST model
│   │   ├── base.py                 # Base perturbation model
│   │   ├── state_transition.py     # ST model component
│   │   └── utils.py                # Model utilities
│   ├── utils/
│   │   ├── se_st_utils.py          # Training/inference utilities
│   │   └── se_inference.py         # SE model inference
│   ├── configs/
│   │   └── se_st_combined.yaml     # Model configuration
│   └── data/                       # Data utilities
├── examples/
│   └── se_st_virtual_cell_challenge.py  # Complete training example
├── setup.py                        # Package setup
├── requirements.txt                # Dependencies
└── README.md                       # This file
```
See examples/se_st_virtual_cell_challenge.py for a complete example of training and evaluating the SE+ST model on the Virtual Cell Challenge dataset.
To evaluate cross-cell-type performance:

```python
from se_st_combined.utils.se_st_utils import evaluate_cross_cell_type_performance

# Evaluate performance across different cell types
results = evaluate_cross_cell_type_performance(
    model=model,
    test_data=test_data,
    cell_types=["k562", "hepg2", "rpe1", "jurkat"],
)
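
# Inspect per-cell-type metrics. Assumption: `results` maps cell type -> metrics;
# the exact return structure is defined in se_st_utils.
for cell_type, metrics in results.items():
    print(cell_type, metrics)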
# Compare with the baseline StateTransition model
# (baseline_results is obtained by evaluating the baseline in the same way)
comparison = compare_with_baseline(results, baseline_results)
```

This model is based on the STATE (State Transition and Embedding) framework for single-cell perturbation prediction. The key innovation is combining:
- State Embedding models for universal cell representations
- State Transition models for perturbation effect prediction
- Cross-cell-type generalization through shared embedding space
- Python >= 3.8
- PyTorch >= 1.12.0
- Lightning >= 2.0.0
- Scanpy >= 1.9.0
- And more (see requirements.txt)
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Based on the STATE framework from Arc Institute
- Uses pre-trained SE models for cell embeddings
- Inspired by the Virtual Cell Challenge
If you encounter any issues or have questions, please:
- Check the examples in the `examples/` directory
- Review the configuration options
- Open an issue on GitHub