A personal project exploring the application of deep learning to protein secondary structure prediction. I'm learning by doing, and also taking courses for background knowledge.
The main objective is to predict the secondary structure for each amino acid in a protein—classifying it as an α-helix, β-strand, or coil—based on its primary sequence and evolutionary information.
- Install dependencies: `pip install -r requirements.txt`
- Download the dataset: `python scripts/download_data.py`
- Train a model: `python -m src.cli --model DilatedResNetCNN`

```bash
# Train different models
python -m src.cli --model SimpleCNN
python -m src.cli --model AdvancedCNN
python -m src.cli --model BiLSTM
python -m src.cli --model DilatedResNetCNN

# Use custom configuration
python -m src.cli --model DilatedResNetCNN --config custom_config.yaml
```

I'm using cullpdb+profile_5926_filtered.npy.gz because it provides a non-redundant dataset filtered to avoid overlap with the CB513 benchmark set. This ensures that models trained and validated on this data can be fairly evaluated on CB513 without data leakage.
The dataset contains:
- 5,365 proteins with fixed-length sequences (700 residues)
- 57 features per residue: one-hot encoded amino acids, secondary structure labels, solvent accessibility, and sequence profiles
- 3-class labels: α-helix (H), β-strand (E), coil (C)
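
To make this layout concrete, here's a rough sketch of how the raw array can be loaded and sliced. The file path and feature index ranges are assumptions based on the dataset's published description; the authoritative preprocessing lives in scripts/prepare_data.py.

```python
import gzip
import numpy as np

# Load the flat array and reshape to (proteins, residues, features per residue).
with gzip.open("data/raw/cullpdb+profile_5926_filtered.npy.gz", "rb") as f:
    data = np.load(f)

data = data.reshape(-1, 700, 57)

# Feature slices per the dataset's published layout (verify against prepare_data.py):
amino_acids = data[:, :, 0:22]    # one-hot amino acids (20 AAs + X + NoSeq)
ss_labels   = data[:, :, 22:31]   # 8-state secondary structure labels (+ NoSeq)
solvent_acc = data[:, :, 33:35]   # relative and absolute solvent accessibility
profiles    = data[:, :, 35:57]   # PSSM sequence profiles
```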
I'm experimenting with different neural network architectures to see how they perform on this task:
- SimpleCNN: A foundational 1D convolutional neural network, serving as a baseline.
- AdvancedCNN: A deeper 1D CNN with more layers and larger kernels than the simple version, designed to capture more complex local patterns. Includes batch normalization.
- BiLSTM: A bidirectional Long Short-Term Memory (LSTM) network. This RNN architecture processes protein sequences in both forward and reverse directions to capture long-range dependencies between amino acids.
- DilatedResNetCNN: The most complex model, combining the strengths of dilated convolutions and residual connections. Dilations expand the receptive field so the model sees broader sequence context, while residual connections help train deeper networks effectively.
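
As an illustration of the core building block behind the DilatedResNetCNN, here is a generic sketch of a dilated residual block (assuming PyTorch; this is not the exact code in src/models/):

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Illustrative residual block with a dilated 1D convolution."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 2):
        super().__init__()
        # Padding chosen so the sequence length is preserved.
        padding = (kernel_size - 1) // 2 * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=padding, dilation=dilation)
        self.bn = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, channels, seq_len); residual add keeps gradients flowing.
        return self.relu(x + self.bn(self.conv(x)))
```

Stacking such blocks with increasing dilation (1, 2, 4, 8, ...) grows the receptive field rapidly while keeping the parameter count modest.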
```
protein_ss_prediction/
├── configs/ # Configuration files for models and training
│ └── model_configs/
├── data/ # Data directory
│ ├── raw/ # Raw data downloaded from source
│ └── processed/ # Processed data ready for model consumption
├── logs/ # Log files for training
├── models/ # Trained models
├── notebooks/ # Jupyter notebooks for exploration and analysis
├── results/
│ ├── figures/ # Plots for accuracy and loss curves
│ └── logs/ # CSV logs of run configurations and accuracy metrics
├── scripts/ # Standalone scripts for the pipeline
│ ├── download_data.py # Downloads the raw dataset
│ └── prepare_data.py # Preprocesses raw data
├── src/ # Source code for the project
│ ├── data/ # Data loading and processing classes
│ ├── models/ # Model architecture definitions
│ ├── utils/ # Utility functions
│ └── cli.py # Command-line interface
├── tests/ # Unit tests
├── CONTRIBUTING.md # Contributing guidelines
├── LICENSE # MIT License
├── README.md # This file
├── requirements.txt # Python dependencies
└── setup.py # Package setup
```
To compare different runs, I set up a system that automatically records the configuration and final performance of each training session into a CSV file in the results/logs directory. This captures the model's name, the key hyperparameters used, and the best validation accuracy achieved.
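
The idea behind that utility, in rough form (the field names here are placeholders, not the project's exact schema):

```python
import csv
from pathlib import Path

def log_run(model_name: str, hyperparams: dict, best_val_acc: float,
            log_path: str = "results/logs/experiments.csv") -> None:
    """Append one training run's configuration and result to a CSV file."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    row = {"model": model_name, **hyperparams, "best_val_accuracy": best_val_acc}
    write_header = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```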
For monitoring and debugging, I'm using Python's built-in logging module. This streams detailed information to the console during training and saves a complete record to a file in the logs/ directory.
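
A minimal sketch of that kind of setup (the project's actual configuration may differ):

```python
import logging
from pathlib import Path

def setup_logging(log_file: str = "logs/training.log") -> logging.Logger:
    """Stream messages to the console and mirror them to a log file."""
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("protein_ss")
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    for handler in (logging.StreamHandler(), logging.FileHandler(log_file)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```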
To see how the models are performing, I wrote a plotting utility that automatically generates and saves charts of the training and validation metrics for each epoch. This provides a quick visual way to check for things like overfitting.
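
In rough form, the plotting utility does something like this (the function name and signature are illustrative, not the project's exact API):

```python
import matplotlib.pyplot as plt

def plot_history(train_acc, val_acc, out_path="results/figures/accuracy.png"):
    """Plot per-epoch training and validation accuracy and save the figure."""
    epochs = range(1, len(train_acc) + 1)
    plt.figure()
    plt.plot(epochs, train_acc, label="train accuracy")
    plt.plot(epochs, val_acc, label="validation accuracy")
    plt.xlabel("epoch")
    plt.ylabel("accuracy")
    plt.legend()
    plt.savefig(out_path, bbox_inches="tight")
    plt.close()
```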
The DilatedResNetCNN was the top-performing model. Its success is likely due to its advanced architecture:
- Dilated convolutions allowed the network to learn relationships between distant amino acids, which is critical for structure prediction
- Residual connections enabled stable training of a deeper, more powerful network capable of learning a richer hierarchy of features
While the BiLSTM also performed well by capturing long-range dependencies, the DilatedResNetCNN's ability to efficiently learn hierarchical spatial patterns across the sequence proved to be the most effective strategy for this task.
- Final Evaluation: Evaluate the best-performing model (DilatedResNetCNN) on the CB513 test set for a final, unbiased performance metric.
- Error Analysis: Perform a detailed error analysis, including a confusion matrix, to identify which secondary structure classes (or sequence contexts) are most difficult to predict (see the sketch below).
- Hyperparameter Tuning: Experiment with hyperparameter tuning for the DilatedResNetCNN model to potentially improve performance further.
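
For the planned error analysis, a per-residue confusion matrix could be computed along these lines (a sketch assuming scikit-learn and an H/E/C encoding of 0/1/2, which is an assumption rather than the project's fixed mapping):

```python
from sklearn.metrics import confusion_matrix

def residue_confusion_matrix(y_true, y_pred):
    """Print and return a 3x3 confusion matrix over H/E/C classes.

    Assumes predictions and labels are flattened 1-D arrays with padding
    positions already removed, encoded as 0=H, 1=E, 2=C.
    """
    classes = ["H", "E", "C"]
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
    print("      " + "  ".join(f"{c:>5}" for c in classes))
    for name, row in zip(classes, cm):
        print(f"{name:>5} " + "  ".join(f"{v:>5d}" for v in row))
    return cm
```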



