This repository contains bash scripts for genomic variant association analysis. All specific file paths have been replaced with configurable environment variables for portability and security.
Edit config_paths.sh to set your environment-specific paths:
export BASE_DIR="/path/to/base/directory"
export IMPUTATION_DIR="/path/to/imputation/directory"
export PROJECT_DIR="/path/to/project/directory"

Before running any analysis script, source the configuration file:
source config_paths.sh

Create all necessary directories:
source config_paths.sh
create_directories

| Variable | Description | Original Path |
|---|---|---|
| BASE_DIR | Base directory for original genotype data | NA |
| IMPUTATION_DIR | Imputed data directory | NA |
| PROJECT_DIR | Project root directory | NA |
| DATA_DIR | Data directory (relative) | NA |
| PHENO_FILE | Phenotype file | NA |
| SEX_FILE | Sex update file | NA |
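
As a rough illustration of how `config_paths.sh` could be organized, the sketch below exports the top-level paths, derives the working directories from `PROJECT_DIR`, and defines `create_directories` as a shell function. The subdirectory names (`data`, `gemma_array`, `mds`, `list`) and the `${VAR:-default}` fallbacks are assumptions made for this example, not the repository's actual layout.

```shell
#!/usr/bin/env bash
# Hypothetical layout: every subdirectory name below is an assumption.

# Top-level paths; export them before sourcing this file to override.
export BASE_DIR="${BASE_DIR:-/path/to/base/directory}"
export IMPUTATION_DIR="${IMPUTATION_DIR:-/path/to/imputation/directory}"
export PROJECT_DIR="${PROJECT_DIR:-/path/to/project/directory}"

# Derived paths (names assumed for illustration).
export DATA_DIR="${PROJECT_DIR}/data"
export GEMMA_ARRAY_DIR="${PROJECT_DIR}/gemma_array"
export MDS_DIR="${PROJECT_DIR}/mds"
export LIST_DIR="${PROJECT_DIR}/list"

# Create every derived directory; mkdir -p does nothing if it already exists.
create_directories() {
    local dir
    for dir in "$DATA_DIR" "$GEMMA_ARRAY_DIR" "$MDS_DIR" "$LIST_DIR"; do
        mkdir -p "$dir" || return 1
    done
}
```

Keeping the function in the same file as the variables means a single `source config_paths.sh` is enough before calling `create_directories`.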
source config_paths.sh
bash cmd05_sampleCleanup.sh

- Filters samples and removes duplicates
- Identifies samples with high genotype missing rate
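
The missing-rate check can be sketched as below, assuming the per-sample rates come from a PLINK `--missing` report (`.imiss`, where column 6 is `F_MISS`). The function name `high_missing_samples` and the 0.05 default threshold are hypothetical; the cutoff actually used by `cmd05_sampleCleanup.sh` is not stated here.

```shell
# List FID/IID pairs whose per-sample missing rate (F_MISS, column 6 of a
# PLINK .imiss file) exceeds a threshold. The 0.05 default is an assumption.
# Usage: high_missing_samples <file.imiss> [threshold]
high_missing_samples() {
    local imiss="$1" threshold="${2:-0.05}"
    # NR > 1 skips the header; "+ 0" forces a numeric comparison.
    awk -v thr="$threshold" 'NR > 1 && $6 + 0 > thr + 0 { print $1, $2 }' "$imiss"
}
```

The two-column output has the FID/IID layout that PLINK's `--remove` option expects, so it could be redirected to a removal list.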
source config_paths.sh
bash cmd06_mds.sh
bash cmd06_2_mds_ocm50k.sh

- Calculates MDS eigenvalues for pooled and population-specific data
- Generates IBD estimates
- Performs population stratification analysis
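
For orientation, the IBD and MDS steps listed above might look roughly like the following with PLINK 1.9. This is a sketch under assumptions: the function name `run_mds`, the file prefixes, and the choice of 10 dimensions are placeholders, not values taken from `cmd06_mds.sh`.

```shell
# Sketch only: prefixes are placeholders and 10 MDS dimensions is an assumption.
# Usage: run_mds <plink_bfile_prefix> <output_prefix>
run_mds() {
    local bfile="$1" out="$2"
    # Pairwise IBD estimates, written to <out>.genome
    plink --bfile "$bfile" --genome --out "$out" || return 1
    # MDS on the pairwise distances; coordinates are written to <out>_mds.mds
    plink --bfile "$bfile" --read-genome "${out}.genome" \
        --cluster --mds-plot 10 --out "${out}_mds"
}

# Example call (hypothetical prefixes):
#   run_mds "${DATA_DIR}/pooled" "${MDS_DIR}/pooled"
```

Running the function once per dataset (pooled, JAM, MAL) would yield one `*_mds.mds` file for each.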
source config_paths.sh
bash cmd07_gemma_array_1_GRM.sh # Generate GRM
bash cmd07_gemma_array_2_maf_run.sh # Run GEMMA
bash cmd07_gemma_array_3_plot.sh # Plot results

- Filters to 711 samples with valid phenotypes
- Generates genetic relationship matrices
- Runs GWAS with MAF thresholds (0.01, 0.05)
- Adjusts for sex and principal components
- Generates Manhattan and QQ plots
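
The GRM and association steps above can be sketched as a loop over the two MAF thresholds and the four covariate models. The GEMMA options used (`-bfile`, `-gk`, `-k`, `-lmm`, `-maf`, `-c`, `-o`) are standard, but everything else is assumed for illustration: the function name, the `cov_<MODEL>.txt` naming, the output names, and the choice of `-lmm 4` (all three tests) rather than a single test.

```shell
# Sketch only: file names and the -lmm 4 test choice are assumptions.
# $1 = PLINK binary fileset prefix (phenotype is read from the .fam file)
# $2 = directory holding covariate files named cov_<MODEL>.txt
gemma_array_gwas() {
    local bfile="$1" cov_dir="$2" maf model
    # Standardized relatedness matrix; GEMMA writes it under ./output/
    gemma -bfile "$bfile" -gk 2 -o array_standrel || return 1
    for maf in 0.01 0.05; do
        for model in SEX SEX_C1 SEX_C1-C2 SEX_C1-C3; do
            # One mixed-model run per MAF threshold and covariate set
            gemma -bfile "$bfile" -k output/array_standrel.sXX.txt \
                -lmm 4 -maf "$maf" -c "${cov_dir}/cov_${model}.txt" \
                -o "gemma_out_maf${maf}_${model}" || return 1
        done
    done
}
```

With two thresholds and four models this produces eight association runs after the single GRM step.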
source config_paths.sh
bash cmd09_gemma_impuOCM50k_1_dataPrep.sh # Prepare data
bash cmd09_gemma_impuOCM50k_2_run.sh # Run GEMMA

- Focuses on OCM 50kb region
- r_scp_Gemma_dataPrep_cov_array.R: Builds GEMMA covariate files (intercept + sex + selected MDS PCs) for the array genotype dataset.
- r_scp_Gemma_dataPrep_cov_oriOCM50k.R: Builds GEMMA covariate files (intercept + sex + selected MDS PCs) for the original OCM50k dataset.
- r_scp_plot_GEMMA_gwas_arg.R: Generates QQ and Manhattan plots from a GEMMA GWAS association output specified via command-line arguments.
- r_proc_plot_GEMMA_OCM50k.R: Produces QQ and Manhattan plots for GEMMA results in the OCM50k region, including an alternative labeled Manhattan plot.
All cleaned scripts remove hard-coded absolute paths and instead use environment variables:
- `SAMOCM_PROJECT_DIR`: directory where these scripts live (default: current working directory).
- `SAMOCM_Analyst_DATA_DIR`: path to `Analyst_data` (default: `file.path(SAMOCM_PROJECT_DIR, '..', 'Analyst_data')`).
- ${GEMMA_ARRAY_DIR}/*standrel.sXX.txt: Standardized genetic relationship matrices (GRM)
- ${GEMMA_ARRAY_DIR}/gemma_out_*.assoc.txt: Array-based GWAS
- ${GEMMA_IMPU_DIR}/gemma_out_*.assoc.txt: Imputation-based GWAS
- ${GEMMA_OCM50K_DIR}/gemma_out_*.assoc.txt: OCM50k region GWAS
- ${MDS_DIR}/*_mds.mds: MDS coordinates
- ${LIST_DIR}/list.id.rm: Samples to remove (n=19)
- ${LIST_DIR}/list.id.JAM: Jamaican samples
- ${LIST_DIR}/list.id.MAL: Malawian samples
module load plink/1.9.0-beta4.4
module load GEMMA/0.96
module load R

- 0.01 (1% minor allele frequency)
- 0.05 (5% minor allele frequency)
- SEX: Sex only
- SEX_C1: Sex + PC1
- SEX_C1-C2: Sex + PC1-2
- SEX_C1-C3: Sex + PC1-3
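
The repository builds these covariate files in R. Purely to illustrate the file layout (an intercept column of 1s, then sex, then the selected MDS components, one row per sample in genotype order), here is a shell/awk sketch. The function name `make_covariates` and the input layouts are assumptions: a sex file with `FID IID SEX` columns, a PLINK `.mds` file with a header and columns `FID IID SOL C1 C2 ...`, and a `.fam` file defining sample order.

```shell
# Sketch only: input layouts are assumed, see the lead-in.
# Usage: make_covariates <fam> <sex_file> <mds_file> <n_pcs>
#   n_pcs = 0 gives the SEX model, 1 gives SEX_C1, 2 gives SEX_C1-C2, ...
make_covariates() {
    local fam="$1" sex="$2" mds="$3" npc="$4"
    awk -v npc="$npc" '
        # First file: sex codes keyed by FID and IID
        FILENAME == ARGV[1] { sexof[$1 SUBSEP $2] = $3; next }
        # Second file: MDS components C1..Cn start at column 4; skip header
        FILENAME == ARGV[2] {
            if (FNR > 1)
                for (i = 1; i <= npc; i++) pc[$1 SUBSEP $2, i] = $(3 + i)
            next
        }
        # Third file (.fam): emit one covariate row per sample, in order
        {
            key = $1 SUBSEP $2
            if (!(key in sexof) || (npc > 0 && !((key, 1) in pc))) {
                print "missing covariates for " $1 " " $2 > "/dev/stderr"
                exit 1
            }
            row = "1 " sexof[key]
            for (i = 1; i <= npc; i++) row = row " " pc[key, i]
            print row
        }
    ' "$sex" "$mds" "$fam"
}
```

Driving the row order from the `.fam` file matters because GEMMA matches covariate rows to samples by position, not by ID.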
- Window: 50 SNPs
- Step: 10 SNPs
- r² threshold: 0.1
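
These three values correspond, in order, to the arguments of PLINK's `--indep-pairwise` option. A minimal sketch of LD pruning with them follows; the function name `ld_prune` and the output prefixes are hypothetical, and the scripts in this repository may structure the step differently.

```shell
# Sketch only: prefixes are placeholders.
# Usage: ld_prune <plink_bfile_prefix> <output_prefix>
ld_prune() {
    local bfile="$1" out="$2"
    # Window of 50 SNPs, step of 10 SNPs, r^2 threshold 0.1;
    # retained variants are listed in <out>.prune.in
    plink --bfile "$bfile" --indep-pairwise 50 10 0.1 --out "$out" || return 1
    # Write a new binary fileset restricted to the retained variants
    plink --bfile "$bfile" --extract "${out}.prune.in" \
        --make-bed --out "${out}_ld"
}
```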
- Final analysis: 711 samples
- Jamaican (JAM): 340 samples
- Malawian (MAL): 371 samples
If using these scripts, please cite the original study and acknowledge the data sources.
For questions about the analysis pipeline, please contact the study investigators.