Investigating the molecular basis of accelerated epigenetic aging
| Name | Tasks |
|---|---|
| Jason | Download of GDC-PANCAN data; initial clock scripts and plot fixes; clinical correlation analysis; Snakemake, parallel processing, and DVC implementation |
| Jacob | Boxplots and scatterplots for the different epigenetic clocks; fixed errors caused by some projects having multiple platforms when running the clocks; other error fixes |
| William | Classify samples into accelerated and decelerated aging; download corresponding RNA-seq data; differential expression analysis; pathway analysis |
This project uses Snakemake to efficiently process epigenetic clock data for multiple TCGA projects. Snakemake tracks which outputs have been generated and only reprocesses missing or out-of-date files.
Run `library.R` to install all required R packages.

Run `uv sync` to install all required Python libraries.

For data management, we use DVC to store and share our raw and processed data; please contact Jason Ho for the server address. (The raw data alone accounts for 1XGB and takes at least 3 hours to download.)
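For a new collaborator, the data setup looks roughly like this (the remote name and server address below are placeholders, not the real values; ask Jason Ho for the address):

```shell
# "storage" and the ssh URL are placeholders for the actual DVC remote
dvc remote add -d storage ssh://<server-address>/dvc-store
dvc pull    # fetches the tracked raw/processed data; raw data alone takes 3+ hours
```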
For the workflow, we use Snakemake to ensure reproducible results.
Run the entire workflow:

```bash
uv run snakemake --cores N   # N: number of cores
```

Process specific projects only:

```bash
uv run snakemake results/clock/gdc_pan/TCGA-BRCA_scatterplots.png --cores 1
```

Dry run to see what would be executed:

```bash
uv run snakemake -n
```

The workflow consists of the following steps:

- Process individual projects (`scripts/clock/process_project.R`): calculates epigenetic clocks and generates plots for each TCGA project independently
- Combine pan-cancer analysis (`scripts/clock/combine_pancan.R`): aggregates all project predictions and generates pan-cancer-wide plots
- Generate PDF report (`scripts/clock/generate_pdf.R`): combines all plots into a single PDF document
- Clinical correlation analysis (`scripts/clinical_correlation.R`): performs clinical correlation analysis
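The step structure above maps naturally onto Snakemake rules. A minimal sketch of what such a Snakefile could look like (rule names, wildcard names, and the two-project list are illustrative assumptions, not copied from the project's actual Snakefile):

```python
# Hypothetical Snakefile sketch; the real workflow covers all 24 TCGA projects
PROJECTS = ["TCGA-BRCA", "TCGA-LUAD"]

rule all:
    input:
        "results/clock/gdc_pan/gdc_pancan_methylclock.pdf"

# One independent job per project, so Snakemake can run them in parallel
rule process_project:
    output:
        "results/clock/gdc_pan/{project}_predictions.rds",
        "results/clock/gdc_pan/{project}_scatterplots.png",
    script:
        "scripts/clock/process_project.R"

# Aggregates every project's predictions into pan-cancer plots
rule combine_pancan:
    input:
        expand("results/clock/gdc_pan/{project}_predictions.rds",
               project=PROJECTS)
    output:
        "results/clock/gdc_pan/gdc_pancan_scatterplots.png"
    script:
        "scripts/clock/combine_pancan.R"

rule generate_pdf:
    input:
        "results/clock/gdc_pan/gdc_pancan_scatterplots.png"
    output:
        "results/clock/gdc_pan/gdc_pancan_methylclock.pdf"
    script:
        "scripts/clock/generate_pdf.R"
```

Because each rule declares its inputs and outputs, Snakemake rebuilds only the targets that are missing or older than their inputs.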
The workflow produces the following outputs:

- `results/clock/gdc_pan/{PROJECT}_scatterplots.png`: scatter plots for each project
- `results/clock/gdc_pan/{PROJECT}_residuals_boxplots.png`: residual boxplots for each project
- `results/clock/gdc_pan/{PROJECT}_predictions.rds`: prediction data for each project
- `results/clock/gdc_pan/gdc_pancan_scatterplots.png`: combined pan-cancer scatter plots
- `results/clock/gdc_pan/gdc_pancan_residuals_boxplots.png`: combined pan-cancer residual boxplots
- `results/clock/gdc_pan/gdc_pancan_methylclock.pdf`: final PDF report with all plots
- `results/clinical/*`: heat maps for clinical correlation
Some `*.rds` files were saved to prevent excess API usage and avoid recomputing variables.
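The project caches intermediate R objects as `.rds` files; the same compute-once, load-thereafter pattern can be sketched in Python (the `cached` helper, `compute_fn`, and the cache path are illustrative, not repo code):

```python
import os
import pickle

def cached(path, compute_fn):
    """Load a pickled result if the cache file exists;
    otherwise compute it, save it, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# First call runs the expensive step and writes the cache;
# any later call just reloads the saved result.
squares = cached("squares.pkl", lambda: [i * i for i in range(10)])
```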
This project implements several big data strategies to efficiently handle large-scale genomic data from TCGA (The Cancer Genome Atlas):
- DVC (Data Version Control): large raw data files (`.dvc` files) are tracked separately from the git repository, enabling efficient versioning and sharing of multi-gigabyte methylation datasets without bloating the repository
- Selective Processing: only normal tissue samples are extracted from the full TCGA-PANCAN dataset, reducing computational overhead
- Cached Intermediate Results: key data structures (`.rds` files) are saved to prevent redundant API calls and expensive recomputations
- Snakemake Workflow: orchestrates parallel execution across 24 TCGA projects, automatically managing dependencies and utilizing multiple CPU cores
- Multi-core Downloads: the `download_gdc_pancan.R` script uses R's `parallel::mclapply()` to download data from multiple projects concurrently, significantly reducing wall-clock time
- Project-level Parallelism: each TCGA project is processed independently, allowing Snakemake to distribute work across available cores without waiting for sequential completion
- Adaptive Core Usage: scripts automatically detect available CPU cores and use half of them to balance performance with system stability
- Incremental Execution: Snakemake tracks which outputs exist and only reprocesses missing or outdated files, enabling efficient iterative development
- Chunk-based Downloads: GDC downloads use `files.per.chunk = 20` to prevent timeout issues when fetching large batches of methylation files
- Memory-efficient Processing: data is processed project by project rather than loading the entire pan-cancer dataset into memory at once
- Error Recovery: parallel processing includes robust error handling so that remaining projects continue even if individual projects fail
- Binary Serialization: R's `.rds` format provides fast, compressed storage for intermediate results
- Lazy Evaluation: phenotype and methylation data are only loaded when needed for specific analyses
- Distributed I/O: multiple projects write to separate output files simultaneously, avoiding I/O bottlenecks
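The multi-core download, adaptive half-core usage, and per-project error recovery described above can be sketched as a Python analogue of the R `parallel::mclapply()` pattern (the `fetch_project` function and the project list are placeholders, not repo code):

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_project(project):
    """Stand-in for a per-project GDC download; raises on failure."""
    if project == "TCGA-BAD":
        raise RuntimeError(f"download failed for {project}")
    return f"{project}: ok"

projects = ["TCGA-BRCA", "TCGA-LUAD", "TCGA-BAD"]
# Adaptive core usage: take half the available cores, as the scripts do
workers = max(1, (os.cpu_count() or 2) // 2)

results, failures = {}, {}
with ThreadPoolExecutor(max_workers=workers) as pool:
    futures = {pool.submit(fetch_project, p): p for p in projects}
    for fut in as_completed(futures):
        proj = futures[fut]
        try:
            results[proj] = fut.result()   # error recovery: one failed
        except Exception as exc:           # project does not abort the rest
            failures[proj] = str(exc)
```

Threads suit I/O-bound downloads; the R script's `mclapply()` uses processes instead, but the collect-errors-and-continue structure is the same.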