Investigating the molecular basis of accelerated epigenetic aging
| Name | Tasks |
|---|---|
| Jason | Download of GDC-PANCAN data; initial clock scripts and plot fixes; clinical correlation analysis; Snakemake, parallel processing, and DVC implementation |
| Jacob | Boxplots and scatterplots for the different epigenetic clocks; fixed errors caused by some projects having multiple platforms when running the clocks; other error fixes |
| William | Classify samples into accelerated and decelerated aging; download corresponding RNA-seq data; differential expression analysis; pathway analysis |
This project uses Snakemake to efficiently process epigenetic clock data for multiple TCGA projects. Snakemake tracks which outputs have been generated and only reprocesses missing or out-of-date files.
Run `library.R` to install all required R packages.

Run `uv sync` to install all required Python libraries.

For data management, we use DVC to store and share our raw and processed data; please contact Jason Ho for the server address. (The raw data alone accounts for 1XGB and takes at least 3 hours to download.)
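For a new collaborator, the data setup looks roughly like this (the remote name and server address below are placeholders, not the real values; ask Jason Ho for the address):

```shell
# "storage" and the ssh URL are placeholders for the actual DVC remote
dvc remote add -d storage ssh://<server-address>/dvc-store
dvc pull    # fetches the tracked raw/processed data; raw data alone takes 3+ hours
```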
For the workflow, we use Snakemake to ensure reproducible results.
Run the entire workflow:

```bash
uv run snakemake --cores N   # N: number of cores
```

Process specific projects only:

```bash
uv run snakemake results/clock/gdc_pan/TCGA-BRCA_scatterplots.png --cores 1
```

Dry run to see what would be executed:

```bash
uv run snakemake -n
```

The workflow consists of the following steps:

- Process individual projects (`scripts/clock/process_project.R`): calculates epigenetic clocks and generates plots for each TCGA project independently
- Combine pan-cancer analysis (`scripts/clock/combine_pancan.R`): aggregates all project predictions and generates pan-cancer-wide plots
- Generate PDF report (`scripts/clock/generate_pdf.R`): combines all plots into a single PDF document
- Clinical correlation analysis (`scripts/clinical_correlation.R`): performs clinical correlation analysis
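The step structure above maps naturally onto Snakemake rules. A minimal sketch of what such a Snakefile could look like (rule names, wildcard names, and the two-project list are illustrative assumptions, not copied from the project's actual Snakefile):

```python
# Hypothetical Snakefile sketch; the real workflow covers all 24 TCGA projects
PROJECTS = ["TCGA-BRCA", "TCGA-LUAD"]

rule all:
    input:
        "results/clock/gdc_pan/gdc_pancan_methylclock.pdf"

# One independent job per project, so Snakemake can run them in parallel
rule process_project:
    output:
        "results/clock/gdc_pan/{project}_predictions.rds",
        "results/clock/gdc_pan/{project}_scatterplots.png",
    script:
        "scripts/clock/process_project.R"

# Aggregates every project's predictions into pan-cancer plots
rule combine_pancan:
    input:
        expand("results/clock/gdc_pan/{project}_predictions.rds",
               project=PROJECTS)
    output:
        "results/clock/gdc_pan/gdc_pancan_scatterplots.png"
    script:
        "scripts/clock/combine_pancan.R"

rule generate_pdf:
    input:
        "results/clock/gdc_pan/gdc_pancan_scatterplots.png"
    output:
        "results/clock/gdc_pan/gdc_pancan_methylclock.pdf"
    script:
        "scripts/clock/generate_pdf.R"
```

Because each rule declares its inputs and outputs, Snakemake rebuilds only the targets that are missing or older than their inputs.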
The workflow produces the following outputs:

- `results/clock/gdc_pan/{PROJECT}_scatterplots.png`: scatter plots for each project
- `results/clock/gdc_pan/{PROJECT}_residuals_boxplots.png`: residual boxplots for each project
- `results/clock/gdc_pan/{PROJECT}_predictions.rds`: prediction data for each project
- `results/clock/gdc_pan/gdc_pancan_scatterplots.png`: combined pan-cancer scatter plots
- `results/clock/gdc_pan/gdc_pancan_residuals_boxplots.png`: combined pan-cancer residual boxplots
- `results/clock/gdc_pan/gdc_pancan_methylclock.pdf`: final PDF report with all plots
- `results/clinical/*`: heat maps for clinical correlation
Some `*.rds` files were saved to prevent excess API usage and avoid recomputing variables.
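The project caches intermediate R objects as `.rds` files; the same compute-once, load-thereafter pattern can be sketched in Python (the `cached` helper, `compute_fn`, and the cache path are illustrative, not repo code):

```python
import os
import pickle

def cached(path, compute_fn):
    """Load a pickled result if the cache file exists;
    otherwise compute it, save it, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# First call runs the expensive step and writes the cache;
# any later call just reloads the saved result.
squares = cached("squares.pkl", lambda: [i * i for i in range(10)])
```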
This project implements several big data strategies to efficiently handle large-scale genomic data from TCGA (The Cancer Genome Atlas):
- DVC (Data Version Control): large raw data files (`.dvc` files) are tracked separately from the git repository, enabling efficient versioning and sharing of multi-gigabyte methylation datasets without bloating the repository
- Selective Processing: only normal tissue samples are extracted from the full TCGA-PANCAN dataset, reducing computational overhead
- Cached Intermediate Results: key data structures (`.rds` files) are saved to prevent redundant API calls and expensive recomputations
- Snakemake Workflow: orchestrates parallel execution across 24 TCGA projects, automatically managing dependencies and utilizing multiple CPU cores
- Multi-core Downloads: the `download_gdc_pancan.R` script uses R's `parallel::mclapply()` to download data from multiple projects concurrently, significantly reducing wall-clock time
- Project-level Parallelism: each TCGA project is processed independently, allowing Snakemake to distribute work across available cores without waiting for sequential completion
- Adaptive Core Usage: scripts automatically detect available CPU cores and use half of them to balance performance with system stability
- Incremental Execution: Snakemake tracks which outputs exist and only reprocesses missing or outdated files, enabling efficient iterative development
- Chunk-based Downloads: GDC downloads use `files.per.chunk = 20` to prevent timeout issues when fetching large batches of methylation files
- Memory-efficient Processing: data is processed project by project rather than loading the entire pan-cancer dataset into memory at once
- Error Recovery: parallel processing includes robust error handling so that remaining projects continue even if individual projects fail
- Binary Serialization: R's `.rds` format provides fast, compressed storage for intermediate results
- Lazy Evaluation: phenotype and methylation data are only loaded when needed for specific analyses
- Distributed I/O: multiple projects write to separate output files simultaneously, avoiding I/O bottlenecks
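The multi-core download, adaptive half-core usage, and per-project error recovery described above can be sketched as a Python analogue of the R `parallel::mclapply()` pattern (the `fetch_project` function and the project list are placeholders, not repo code):

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_project(project):
    """Stand-in for a per-project GDC download; raises on failure."""
    if project == "TCGA-BAD":
        raise RuntimeError(f"download failed for {project}")
    return f"{project}: ok"

projects = ["TCGA-BRCA", "TCGA-LUAD", "TCGA-BAD"]
# Adaptive core usage: take half the available cores, as the scripts do
workers = max(1, (os.cpu_count() or 2) // 2)

results, failures = {}, {}
with ThreadPoolExecutor(max_workers=workers) as pool:
    futures = {pool.submit(fetch_project, p): p for p in projects}
    for fut in as_completed(futures):
        proj = futures[fut]
        try:
            results[proj] = fut.result()   # error recovery: one failed
        except Exception as exc:           # project does not abort the rest
            failures[proj] = str(exc)
```

Threads suit I/O-bound downloads; the R script's `mclapply()` uses processes instead, but the collect-errors-and-continue structure is the same.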