This repository contains the code for our research project on predicting cancer progression and survival rates using evolutionary cancer trees combined with machine learning algorithms. The approach uses multi-regional sequencing data to model how tumours evolve, with the aim of improving prognostic accuracy.
Cancer is a fundamentally genetic disease characterized by complex clonal evolution, where different cancer cell populations evolve over time. Understanding these evolutionary dynamics is critical for developing targeted treatments and improving prognostic accuracy. In this project, we employ evolutionary cancer trees, constructed from multi-regional sequencing data, to model the evolutionary relationships among cancer clones.
By integrating these evolutionary models with machine learning algorithms such as linear regression, random forests, support vector machines, and genetic algorithms, we aim to enhance the prediction of survival rates among cancer patients. Our study is grounded in the TRACERx lung cancer dataset, providing a rich and clinically relevant foundation for predictive analysis.
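To make the idea of feeding an evolutionary tree into a model more concrete, the sketch below turns a clone tree into a few numeric features. It is only an illustration under stated assumptions: the `{clone: parent}` representation, the `tree_features` function, and the example tree are all hypothetical and are not taken from this repository or from the TRACERx data.

```python
from collections import defaultdict


def tree_features(parent):
    """Summarise a clone tree given as {clone: parent_clone}; the root maps to None."""
    children = defaultdict(list)
    root = None
    for clone, par in parent.items():
        if par is None:
            root = clone
        else:
            children[par].append(clone)

    # Depth of each clone = number of edges from the root (iterative traversal).
    depth = {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children[node]:
            depth[child] = depth[node] + 1
            stack.append(child)

    # A leaf is a clone with no descendants.
    leaves = [clone for clone in parent if not children[clone]]
    return {
        "n_clones": len(parent),
        "n_leaves": len(leaves),
        "max_depth": max(depth.values()),
        "max_branching": max((len(v) for v in children.values()), default=0),
    }


# Hypothetical tree: a trunk clone A with two subclones, one of which branches again.
example_tree = {"A": None, "B": "A", "C": "A", "D": "B", "E": "B"}
features = tree_features(example_tree)
print(features)
```

Summary statistics like these are one simple way to give a tabular model a fixed-length view of a tree; the features actually used in the project may differ.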
ml-cancer-research/
├── data/ # Dataset directory
├── models/ # Saved model files
├── checkpoints/ # Training checkpoints
├── graphs/ # Generated visualizations
├── dissertation/ # Research documentation
├── structuring_project/ # Main project code
│ ├── preprocessing.py # Data processing pipeline
│ ├── train_models.py # Model training scripts
│ ├── evaluation.py # Model evaluation tools
│ ├── utils.py # Utility functions
│ └── experiments.ipynb # Initial Experiments
├── NN and XGboost.csv # Model comparison data
├── requirements.txt # Project dependencies
└── LICENSE # License information
- Clone the repository:
git clone https://github.com/rafipatel/MLCancerResearch.git
cd MLCancerResearch
- Create a virtual environment with venv or conda (recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt

To train the models, run:
python structuring_project/train_models.py

The project uses lung cancer datasets stored in the data/ directory. Key components:
- Data is placed in data/
- The data preprocessing pipeline is defined in preprocessing.py
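The README names the preprocessing pipeline without showing it. As a rough illustration of the kind of cleaning and normalisation such a pipeline usually involves, here is a minimal standard-library sketch. The function names `clean_rows` and `standardise_columns` and the toy data are assumptions made for this example; they are not the contents of preprocessing.py.

```python
from statistics import mean, pstdev


def clean_rows(rows):
    """Drop any row that contains a missing value (represented here as None)."""
    return [row for row in rows if all(value is not None for value in row)]


def standardise_columns(rows):
    """Z-score each column: subtract its mean and divide by its standard deviation."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        # A constant column has zero spread, so it is mapped to 0.0.
        [(value - mu) / sd if sd > 0 else 0.0 for value, (mu, sd) in zip(row, stats)]
        for row in rows
    ]


# Toy feature matrix with one incomplete row (hypothetical values).
raw = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [5.0, 20.0]]
cleaned = clean_rows(raw)
scaled = standardise_columns(cleaned)
print(scaled)
```

In a real train/test setup, the means and standard deviations would normally be computed on the training split only and then reused for the test split, so that no information leaks from the held-out data.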
The project implements several machine learning models:
- Linear Regression
- Lasso Regression
- Ridge Regression
- Neural Networks
- XGBoost
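The three linear models above differ only in how they penalise large coefficients. The sketch below shows this for the simplest possible case: a single centred feature with no intercept, where each estimator has a closed form. It is a teaching example under those assumptions, not the training code of this project, and `fit_one_feature` is a hypothetical helper. Neural networks and XGBoost are not covered here.

```python
def soft_threshold(value, lam):
    """Shrink value towards zero by lam, clipping to exactly zero inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def fit_one_feature(x, y, lam=0.0, penalty="none"):
    """Fit y ~ w * x for one centred feature and return the coefficient w.

    Objectives minimised:
      none : 0.5 * sum((y - w*x)^2)
      ridge: 0.5 * sum((y - w*x)^2) + 0.5 * lam * w^2
      lasso: 0.5 * sum((y - w*x)^2) + lam * |w|
    """
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    if penalty == "ridge":
        return sxy / (sxx + lam)
    if penalty == "lasso":
        return soft_threshold(sxy, lam) / sxx
    return sxy / sxx


# Toy data lying close to y = 2x (hypothetical values).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.0, 2.1, 3.9]

w_ols = fit_one_feature(x, y)
w_ridge = fit_one_feature(x, y, lam=10.0, penalty="ridge")
w_lasso = fit_one_feature(x, y, lam=5.0, penalty="lasso")
print(w_ols, w_ridge, w_lasso)
```

Ridge shrinks the coefficient proportionally, while lasso subtracts a fixed amount and can set it to exactly zero, which is why lasso is often used when feature selection matters.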
Model artifacts are saved in:
- models/: Model architectures
- checkpoints/: Training checkpoints for model recovery and selection
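As an illustration of how checkpoints can support both recovery and model selection, the sketch below writes one small file per epoch and then picks the one with the lowest validation loss. The JSON format, the file naming, and the functions `save_checkpoint` and `load_best_checkpoint` are assumptions made for this example; the repository may store checkpoints differently.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(directory, epoch, val_loss, weights):
    """Write one checkpoint file per epoch (hypothetical JSON layout)."""
    path = Path(directory) / f"epoch_{epoch:03d}.json"
    payload = {"epoch": epoch, "val_loss": val_loss, "weights": weights}
    path.write_text(json.dumps(payload))
    return path


def load_best_checkpoint(directory):
    """Return the checkpoint with the lowest validation loss."""
    checkpoints = [
        json.loads(path.read_text()) for path in Path(directory).glob("epoch_*.json")
    ]
    if not checkpoints:
        raise FileNotFoundError(f"no checkpoints found in {directory}")
    return min(checkpoints, key=lambda ckpt: ckpt["val_loss"])


# Simulate three epochs with made-up validation losses in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for epoch, loss in enumerate([0.90, 0.42, 0.57], start=1):
        save_checkpoint(tmp, epoch, loss, weights=[0.1 * epoch])
    best = load_best_checkpoint(tmp)

print(best["epoch"], best["val_loss"])
```

Selecting by validation loss rather than by the last epoch guards against keeping a model that has started to overfit.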
preprocessing.py (data processing pipeline):
- Data cleaning and normalization
- Feature engineering
- Data transformation pipelines
train_models.py (model training scripts):
- Model architecture definitions
- Training loop implementation
- Hyperparameter configuration
- Checkpoint management
evaluation.py (model evaluation tools):
- Performance metric calculations
- Model comparison tools
- Visualization generation
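The metric calculations and model comparison listed above could look roughly like the following sketch. It assumes survival is treated as a continuous regression target, which the choice of regression models suggests but the README does not state. The functions `regression_metrics` and `rank_models`, the model names, and all numbers are hypothetical.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R^2 for one set of predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot,
    }


def rank_models(y_true, predictions):
    """Score every model and sort from lowest to highest RMSE."""
    scores = {
        name: regression_metrics(y_true, y_pred)
        for name, y_pred in predictions.items()
    }
    return sorted(scores.items(), key=lambda item: item[1]["rmse"])


# Made-up targets and predictions from two hypothetical models.
y_true = [10.0, 20.0, 30.0, 40.0]
predictions = {
    "model_a": [12.0, 18.0, 33.0, 39.0],
    "model_b": [15.0, 15.0, 35.0, 35.0],
}
ranking = rank_models(y_true, predictions)
for name, metrics in ranking:
    print(name, metrics)
```

Reporting several metrics side by side is useful because RMSE penalises large errors more heavily than MAE, so two models can swap places depending on which metric is used.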
utils.py (utility functions):
- Data loading/saving utilities
- Common helper functions
- Configuration management
- Detailed project documentation is available in the dissertation/ directory
- Technical implementation details are in MLCancerResearch_final.zip
- Additional research context: "The evolution of lung cancer TracerX.pdf"
This project is licensed under the terms given in the LICENSE file; see that file for details.