ML Cancer Research Project

Overview

This repository contains the code for our research project on predicting cancer progression and survival rates by combining evolutionary cancer trees with machine learning. The approach uses multi-regional sequencing data to model cancer dynamics and improve prognostic accuracy.

Cancer is a fundamentally genetic disease characterized by complex clonal evolution, where different cancer cell populations evolve over time. Understanding these evolutionary dynamics is critical for developing targeted treatments and improving prognostic accuracy. In this project, we employ evolutionary cancer trees, constructed from multi-regional sequencing data, to model the evolutionary relationships among cancer clones.

By integrating these evolutionary models with machine learning algorithms such as linear regression, random forests, support vector machines, and genetic algorithms, we aim to enhance the prediction of survival rates among cancer patients. Our study is grounded in the TRACERx lung cancer dataset, providing a rich and clinically relevant foundation for predictive analysis.

Project Structure

ml-cancer-research/
├── data/                   # Dataset directory
├── models/                 # Saved model files
├── checkpoints/            # Training checkpoints
├── graphs/                 # Generated visualizations
├── dissertation/           # Research documentation
├── structuring_project/    # Main project code
│   ├── preprocessing.py    # Data processing pipeline
│   ├── train_models.py     # Model training scripts
│   ├── evaluation.py       # Model evaluation tools
│   ├── utils.py            # Utility functions
│   └── experiments.ipynb   # Initial Experiments
├── NN and XGboost.csv      # Model comparison data
├── requirements.txt        # Project dependencies
└── LICENSE                 # License information

Installation

Clone the repository:

git clone https://github.com/rafipatel/MLCancerResearch.git
cd MLCancerResearch

Create a virtual environment with Python's venv module or with conda (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Usage

Data Preprocessing, Model Training and Model Evaluation

python structuring_project/train_models.py

Data

The project uses lung cancer datasets. Key components:

Dataset files are placed in the data/ directory
The data preprocessing pipeline is defined in preprocessing.py

Models

The project implements several machine learning models:

Linear Regression
Lasso Regression
Ridge Regression
Neural Networks
XGBoost

Model artifacts are saved in:

models/: Saved model files
checkpoints/: Training checkpoints for model recovery and selection
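
To illustrate how the three regression variants listed above differ, here is a minimal sketch for a single feature, written in plain Python. It is not taken from this repository: the function name `fit_one_feature`, the `lam` parameter, and the toy data are assumptions made for illustration, and the project's own models may be built with a library instead. Neural networks and XGBoost are not covered by this sketch.

```python
def fit_one_feature(x, y, penalty="none", lam=0.0):
    """Fit y ~ intercept + slope * x for one feature (illustrative only).

    Minimises 0.5 * sum((y - intercept - slope * x) ** 2) plus
      "ridge": 0.5 * lam * slope ** 2
      "lasso": lam * abs(slope)
    The intercept is never penalised.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Centred sums of squares and cross-products.
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

    if penalty == "none":        # ordinary least squares
        slope = sxy / sxx
    elif penalty == "ridge":     # shrinks the slope smoothly towards zero
        slope = sxy / (sxx + lam)
    elif penalty == "lasso":     # soft-thresholding: can set the slope to exactly zero
        shrunk = max(abs(sxy) - lam, 0.0)
        slope = (shrunk if sxy >= 0 else -shrunk) / sxx
    else:
        raise ValueError(f"unknown penalty: {penalty!r}")

    intercept = y_mean - slope * x_mean
    return slope, intercept


# Toy data (hypothetical): y is exactly 2 * x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

ols_slope, ols_intercept = fit_one_feature(x, y)
ridge_slope, _ = fit_one_feature(x, y, penalty="ridge", lam=5.0)
lasso_slope, _ = fit_one_feature(x, y, penalty="lasso", lam=10.0)
```

With a large enough penalty the lasso slope becomes exactly zero, while the ridge slope only shrinks. This is why lasso is often used for feature selection and ridge for stabilising correlated features.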

Scripts

preprocessing.py

Data cleaning and normalization
Feature engineering
Data transformation pipelines
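
The cleaning and normalization steps above might look roughly like the following sketch, which fills missing values with the column mean and then standardizes the column. This is an assumption about a typical pipeline, not the code in preprocessing.py; the function name `impute_and_standardize` and the example feature are hypothetical.

```python
from statistics import mean, pstdev


def impute_and_standardize(column):
    """Fill missing values (None) with the column mean, then z-score the column."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    fill = mean(observed)
    filled = [fill if v is None else v for v in column]

    mu = mean(filled)
    sigma = pstdev(filled)  # population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in filled]
    return [(v - mu) / sigma for v in filled]


# Hypothetical feature: number of clones per tumour, with one missing entry.
clone_counts = [2, 4, None, 6]
scaled = impute_and_standardize(clone_counts)
```

After this step the column has zero mean and unit variance, which keeps penalized regression and neural network training from being dominated by features on large scales.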

train_models.py

Model architecture definitions
Training loop implementation
Hyperparameter configuration
Checkpoint management
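
As a rough illustration of how hyperparameter configuration and checkpoint management can fit together, the sketch below tries several penalty strengths for a one-feature ridge model and writes a JSON checkpoint whenever the validation error improves. Everything here is hypothetical: the function names, the JSON layout, the file name, and the toy data are assumptions for illustration and do not describe what train_models.py actually does.

```python
import json
import tempfile
from pathlib import Path


def ridge_slope(x, y, lam):
    """Closed-form one-feature ridge slope for a model through the origin."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def search_and_checkpoint(train, valid, lam_grid, checkpoint_path):
    """Try each penalty strength; checkpoint whenever validation MSE improves."""
    best = None
    for lam in lam_grid:
        slope = ridge_slope(train[0], train[1], lam)
        val_mse = sum((yv - slope * xv) ** 2 for xv, yv in zip(*valid)) / len(valid[0])
        if best is None or val_mse < best["val_mse"]:
            best = {"lam": lam, "slope": slope, "val_mse": val_mse}
            # Persist the best configuration so training can be resumed or audited.
            Path(checkpoint_path).write_text(json.dumps(best, indent=2))
    return best


def load_checkpoint(checkpoint_path):
    """Read a checkpoint written by search_and_checkpoint."""
    return json.loads(Path(checkpoint_path).read_text())


# Toy data (hypothetical): (features, targets) pairs for training and validation.
train = ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
valid = ([4.0], [8.0])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "best_params.json"  # hypothetical file name
    best = search_and_checkpoint(train, valid, [0.0, 1.0, 10.0], path)
    restored = load_checkpoint(path)
```

Writing the checkpoint only on improvement means the file always holds the best configuration seen so far, even if the search is interrupted.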

evaluation.py

Performance metric calculations
Model comparison tools
Visualization generation
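
The metric calculations and model comparison listed above could be sketched as follows. The choice of MAE, RMSE, and R² is an assumption about which regression metrics are relevant; the function names, model names, and numbers are hypothetical and are not results from this project.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for paired true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": sqrt(ss_res / n),
        # R^2 is undefined when the targets are constant.
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


def rank_models(y_true, predictions):
    """Sort (model name, metrics) pairs by RMSE, lowest first."""
    scores = {name: regression_metrics(y_true, pred) for name, pred in predictions.items()}
    return sorted(scores.items(), key=lambda item: item[1]["rmse"])


# Made-up targets and predictions, for illustration only.
y_true = [10, 20, 30, 40]
predictions = {
    "model_a": [12, 18, 33, 39],
    "model_b": [20, 20, 20, 20],
}
ranking = rank_models(y_true, predictions)
```

Here `ranking[0]` is the model with the lowest RMSE. Reporting R² alongside the error metrics shows how much of the variance in the target each model explains.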

utils.py

Data loading/saving utilities
Common helper functions
Configuration management

Documentation

Detailed project documentation is available in the dissertation/ directory
Technical implementation details are in MLCancerResearch_final.zip
Additional research context: "The evolution of lung cancer TracerX.pdf"

License

This project is licensed under the terms given in the LICENSE file.
