
Protein Secondary Structure Prediction

Personal project exploring the application of deep learning to protein secondary structure prediction. I'm learning by doing, and taking courses along the way for background knowledge.

The main objective is to predict the secondary structure for each amino acid in a protein—classifying it as an α-helix, β-strand, or coil—based on its primary sequence and evolutionary information.

Getting Started

  1. Install dependencies:

    pip install -r requirements.txt
  2. Download the dataset:

    python scripts/download_data.py
  3. Train a model:

    python -m src.cli --model DilatedResNetCNN

Usage Examples

    # Train different models
    python -m src.cli --model SimpleCNN
    python -m src.cli --model AdvancedCNN
    python -m src.cli --model BiLSTM
    python -m src.cli --model DilatedResNetCNN

    # Use custom configuration
    python -m src.cli --model DilatedResNetCNN --config custom_config.yaml
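A custom config is a YAML file like those in configs/model_configs/. The exact schema lives there; the snippet below is only an illustrative sketch of loading one with PyYAML, and the key names shown in the comments are hypothetical.

    # Illustrative only: load a YAML config into a dict with PyYAML.
    # The real schema is defined by the files in configs/model_configs/.
    import yaml

    with open("custom_config.yaml") as f:
        config = yaml.safe_load(f)

    # A config might contain entries such as (hypothetical key names):
    #   learning_rate: 0.001
    #   batch_size: 64
    #   epochs: 50
    print(config)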

The Data

I'm using cullpdb+profile_5926_filtered.npy.gz because it provides a non-redundant dataset filtered to avoid overlap with the CB513 benchmark set. This ensures that models trained and validated on this data can be fairly evaluated on CB513 without data leakage.

The dataset contains:

  • 5,365 proteins with fixed-length sequences (700 residues)
  • 57 features per residue: one-hot encoded amino acids, secondary structure labels, solvent accessibility, and sequence profiles
  • 3-class labels: α-helix (H), β-strand (E), coil (C)
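Loading the file is straightforward with NumPy. The sketch below assumes the commonly cited CullPDB+profile feature layout (one-hot residues first, then structure labels, terminal flags, solvent accessibility, and profiles); the slice boundaries are the conventional ones for this dataset and should be checked against scripts/prepare_data.py.

    # Sketch: load the gzipped .npy and reshape to (proteins, residues, features).
    # Slice boundaries follow the conventional CullPDB+profile layout; verify
    # them against scripts/prepare_data.py before relying on them.
    import gzip
    import numpy as np

    with gzip.open("data/raw/cullpdb+profile_5926_filtered.npy.gz", "rb") as f:
        data = np.load(f)

    data = data.reshape(-1, 700, 57)   # (proteins, 700 residues, 57 features)
    amino_acids = data[:, :, 0:22]     # one-hot amino acid identities
    ss_labels   = data[:, :, 22:31]    # structure labels (8 states + padding),
                                       # collapsed to 3 classes (H/E/C) downstream
    solvent_acc = data[:, :, 33:35]    # solvent accessibility
    profiles    = data[:, :, 35:57]    # sequence profile (PSSM-style)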

Model Architectures

I'm experimenting with different neural network architectures to see how they perform on this task:

1. SimpleCNN

A foundational 1D convolutional neural network, serving as a baseline.
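For intuition, a baseline like this can be sketched in a few lines. The snippet below assumes a PyTorch implementation with illustrative layer sizes; the actual model in src/models/ may differ.

    # Sketch of a baseline 1D CNN for per-residue classification (PyTorch).
    import torch.nn as nn

    class SimpleCNNSketch(nn.Module):
        def __init__(self, in_features=57, num_classes=3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(in_features, 64, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.Conv1d(64, num_classes, kernel_size=5, padding=2),
            )

        def forward(self, x):
            # x: (batch, 700, 57); Conv1d expects (batch, channels, length)
            return self.net(x.transpose(1, 2)).transpose(1, 2)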

2. AdvancedCNN

A deeper 1D CNN with more layers and larger kernels than the simple version, designed to capture more complex local patterns. Includes batch normalization.

3. BiLSTM

A bidirectional Long Short-Term Memory (LSTM) network. This RNN architecture processes protein sequences in both forward and reverse directions to capture long-range dependencies between amino acids.
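Sketched in the same assumed PyTorch style (hidden size and classification head are illustrative, not the repository's settings):

    # Sketch of a bidirectional LSTM tagger for per-residue classification.
    import torch.nn as nn

    class BiLSTMSketch(nn.Module):
        def __init__(self, in_features=57, hidden=128, num_classes=3):
            super().__init__()
            self.lstm = nn.LSTM(in_features, hidden,
                                batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden, num_classes)  # fwd + bwd states

        def forward(self, x):          # x: (batch, 700, 57)
            out, _ = self.lstm(x)      # (batch, 700, 2 * hidden)
            return self.head(out)      # per-residue class scores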

4. DilatedResNetCNN

This is the most complex model, combining the strengths of dilated convolutions and residual connections. Dilations expand the model's receptive field to see broader sequence context, while residual connections help train deeper networks effectively.
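The core building block can be sketched as follows (again assuming PyTorch, with illustrative channel counts). Stacking blocks with increasing dilations grows the receptive field quickly: a single kernel-3 convolution at each of dilations 1, 2, 4, and 8 already covers 31 residues.

    # Sketch of a dilated residual block for 1D sequences.
    import torch.nn as nn

    class DilatedResBlock(nn.Module):
        def __init__(self, channels, dilation):
            super().__init__()
            # padding = dilation keeps the sequence length unchanged
            self.body = nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
                nn.Conv1d(channels, channels, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.BatchNorm1d(channels),
            )
            self.relu = nn.ReLU()

        def forward(self, x):                   # x: (batch, channels, length)
            return self.relu(self.body(x) + x)  # residual (skip) connection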

Model performance

| Model | Best Validation Accuracy | Training Notes |
|---|---|---|
| SimpleCNN | 68.08% | Showed signs of underfitting, with performance plateauing early; this suggests a more complex model is needed. |
| AdvancedCNN | 69.69% | Trained more stably than the baseline, but the gain in validation accuracy was modest, highlighting the trade-off between model complexity and performance on this dataset. |
| BiLSTM | 67.86% | Showed excellent generalization with minimal overfitting, supporting the value of capturing long-range dependencies in sequence data. |
| DilatedResNetCNN | 70.61% | Achieved the highest accuracy with a stable training process, demonstrating strong generalization without significant overfitting. |

Training-history plots for each model are saved in results/figures/.

Project structure

protein_ss_prediction/
├── configs/                # Configuration files for models and training
│   └── model_configs/
├── data/                   # Data directory
│   ├── raw/                # Raw data downloaded from source
│   └── processed/          # Processed data ready for model consumption
├── logs/                   # Log files for training
├── models/                 # Trained models
├── notebooks/              # Jupyter notebooks for exploration and analysis
├── results/
│   ├── figures/            # Plots for accuracy and loss curves
│   └── logs/               # CSV logs of run configs vs. accuracy metrics
├── scripts/                # Standalone scripts for the pipeline
│   ├── download_data.py    # Downloads the raw dataset
│   └── prepare_data.py     # Preprocesses raw data
├── src/                    # Source code for the project
│   ├── data/               # Data loading and processing classes
│   ├── models/             # Model architecture definitions
│   ├── utils/              # Utility functions
│   └── cli.py              # Command-line interface
├── tests/                  # Unit tests
├── CONTRIBUTING.md         # Contributing guidelines
├── LICENSE                 # MIT License
├── README.md               # This file
├── requirements.txt        # Python dependencies
└── setup.py               # Package setup

Tracking

Automated logging

To compare different runs, I set up a system that automatically records the configuration and final performance of each training session into a CSV file in the results/logs directory. This captures the model's name, the key hyperparameters used, and the best validation accuracy achieved.
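A minimal version of that idea looks like the sketch below; the file name and column names here are hypothetical, and the real schema is whatever the CSVs in results/logs/ contain.

    # Sketch: append one row per training run to a CSV (illustrative schema).
    import csv
    from pathlib import Path

    def log_run(model_name, hyperparams, best_val_acc,
                path="results/logs/training_runs.csv"):  # hypothetical file name
        row = {"model": model_name, **hyperparams, "best_val_acc": best_val_acc}
        file = Path(path)
        file.parent.mkdir(parents=True, exist_ok=True)
        new_file = not file.exists()
        with file.open("a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=row.keys())
            if new_file:
                writer.writeheader()
            writer.writerow(row)

    log_run("DilatedResNetCNN", {"lr": 1e-3, "batch_size": 64}, 0.7061)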

Centralized logs

For monitoring and debugging, I'm using Python's built-in logging module. This streams detailed information to the console while training and saves a complete record to a file in the logs/ directory.
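The standard pattern for that dual console/file setup is one logger with two handlers; the path and format below are illustrative, not the repository's exact configuration.

    # Sketch: stream logs to the console and mirror them to a file.
    import logging
    import os

    def setup_logging(logfile="logs/train.log"):  # illustrative path
        os.makedirs(os.path.dirname(logfile), exist_ok=True)
        logger = logging.getLogger("protein_ss")
        logger.setLevel(logging.INFO)
        fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        for handler in (logging.StreamHandler(), logging.FileHandler(logfile)):
            handler.setFormatter(fmt)
            logger.addHandler(handler)
        return logger

    logger = setup_logging()
    logger.info("Starting training run")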

Automated plotting

To see how the models are performing, I wrote a plotting utility that automatically generates and saves charts of the training and validation metrics for each epoch. This provides a quick visual way to check for things like overfitting.
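In outline, such a utility takes per-epoch metrics and writes a figure to results/figures/; the history keys below are assumptions, not the repository's exact names.

    # Sketch: save accuracy and loss curves for one training run (matplotlib).
    from pathlib import Path
    import matplotlib.pyplot as plt

    def plot_history(history, model_name):
        # history: dict of per-epoch lists, e.g. {"train_acc": [...], ...}
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
        for split in ("train", "val"):
            ax1.plot(history[f"{split}_acc"], label=split)
            ax2.plot(history[f"{split}_loss"], label=split)
        ax1.set(title="Accuracy", xlabel="epoch")
        ax2.set(title="Loss", xlabel="epoch")
        ax1.legend(); ax2.legend()
        out = Path("results/figures") / f"{model_name}_history.png"
        out.parent.mkdir(parents=True, exist_ok=True)
        fig.savefig(out, dpi=150)
        plt.close(fig)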

Key findings

The DilatedResNetCNN was the top-performing model. Its success is likely due to its advanced architecture:

  • Dilated convolutions allowed the network to learn relationships between distant amino acids, which is critical for structure prediction
  • Residual connections enabled stable training of a deeper, more powerful network capable of learning a richer hierarchy of features

While the BiLSTM also performed well by capturing long-range dependencies, the DilatedResNetCNN's ability to efficiently learn hierarchical spatial patterns across the sequence proved to be the most effective strategy for this task.

Future Work

  • Final Evaluation: Evaluate the best-performing model (DilatedResNetCNN) on the CB513 test set for a final, unbiased performance metric.
  • Error Analysis: Perform a detailed error analysis, including a confusion matrix, to identify which secondary structure classes (or sequence contexts) are most difficult to predict.
  • Hyperparameter Tuning: Experiment with hyperparameter tuning for the DilatedResNetCNN model to potentially improve performance further.
