Pipeline to analyze ATACseq data
##################################################################################################
last update: 26 June 2018
The script is still under development and might contain bugs.
For any problems, please write to tsviya.olender@weizmann.ac.il; feedback is appreciated        #
##################################################################################################
The pipeline is written in Perl.
Update the $progD parameter to the full path of the code location.
The script collect_bedtools_count.pl has a hard-coded parameter, $pipelinePATH, which should be set to the location of the pipeline.
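As an illustration, the hard-coded parameter can be updated from the shell with GNU sed. The exact line inside collect_bedtools_count.pl may look different, so inspect the script first and adjust the pattern; the demo below edits a stand-in file rather than the real script, and the path is a placeholder.

```shell
# Illustrative only: we assume the script contains a line of the form
#   my $pipelinePATH = "...";
# Demo on a stand-in file (the real file is collect_bedtools_count.pl):
printf 'my $pipelinePATH = "/old/location";\n' > demo_script.pl
sed -i 's|^my \$pipelinePATH.*|my $pipelinePATH = "/path/to/pipeline";|' demo_script.pl
cat demo_script.pl   # -> my $pipelinePATH = "/path/to/pipeline";
```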
#####################################################################################################
LOCATION OF FASTQ
Raw reads ("crude reads" in the scripts) often come in multiple fastq files per sample. In this case the script expects one folder per sample,
with all sample folders under the same parent directory. The fastq files should be gzipped, with the string R1 in the names of R1 files and R2 in the names of R2 files.
In the parameter file:
- define the raw-reads location with the parameter crude_reads_location_for_merging
- set the option combine_fastq to 1
Otherwise, if you already have 2 fastq files per sample (one for R1 and one for R2), put the files under 1_fastq in the working directory
and set the option combine_fastq to 0.
The fastq files must be named XXX_R1.fastq.gz and XXX_R2.fastq.gz, where XXX is the sample name (i.e. gzipped).
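As an illustration, the two layouts might look like this (the sample name, file names, and the crude_reads parent folder are hypothetical):

```shell
# Case 1: combine_fastq = 1 -- one folder per sample, several gzipped fastq parts,
# R1/R2 distinguished by the strings R1/R2 in the file names (names are illustrative).
mkdir -p crude_reads/sampleA
touch crude_reads/sampleA/sampleA_lane1_R1.fastq.gz crude_reads/sampleA/sampleA_lane1_R2.fastq.gz
touch crude_reads/sampleA/sampleA_lane2_R1.fastq.gz crude_reads/sampleA/sampleA_lane2_R2.fastq.gz

# Case 2: combine_fastq = 0 -- two gzipped fastq files per sample under 1_fastq
# in the working directory, named XXX_R1.fastq.gz / XXX_R2.fastq.gz.
mkdir -p 1_fastq
touch 1_fastq/sampleA_R1.fastq.gz 1_fastq/sampleA_R2.fastq.gz
ls 1_fastq
```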
########################################################################################################
########################################################################################################
The script atacpipeline.pl submits LSF jobs to the server; it runs the script run_ATAC_Ts_V2.pl in parallel.
The script accepts the following arguments:
- a file listing all samples, one sample name per row.
- the parameters file, e.g. run_ATAC_Ts_V2_params.txt
- the name of the LSF queue. The default is new-short.
- the memory (default is 4000). To change the default memory, the queue must also be given on the command line, otherwise the script will not work.
Note that the script runs with 4 threads, so a memory request of 4000 means 4 x 4000 = 16000 per job.
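For example, the samples file might be created like this (the sample names are illustrative; the file simply lists one sample name per row):

```shell
# Hypothetical samples file: one sample name per row.
cat > samples.txt <<'EOF'
sampleA
sampleB
sampleC
EOF
wc -l < samples.txt   # -> 3
```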
perl PATH_TO_prog/atacpipeline.pl samples.txt params_file [queue_name memory]
Examples:
perl PATH_TO_prog/atacpipeline.pl samples.txt run_ATAC_Ts_V2_params.txt
submits all samples listed in samples.txt to the analysis. The parameters are defined in the file run_ATAC_Ts_V2_params.txt.
The jobs are submitted to the default queue (new-short) with the default memory of 4000.
perl PATH_TO_prog/atacpipeline.pl samples.txt run_ATAC_Ts_V2_params.txt new-short 8000
Here the jobs are submitted with more memory; because the default memory was changed, the queue must also be given.
perl PATH_TO_prog/atacpipeline.pl samples.txt run_ATAC_Ts_V2_params.txt new-all.q
Here the jobs are submitted to a different queue, with the default memory.
perl PATH_TO_prog/atacpipeline.pl samples.txt run_ATAC_Ts_V2_params.txt new-all.q 8000
Here both the queue name and the memory differ from the defaults.
########################################################################################################
########################################################################################################
The script collect_qual_params.pl generates a report with the quality parameters of the run (e.g. the number of reads at every analysis step).
perl PATH_TO_prog/collect_qual_params.pl samples.txt > qual_report.txt
This script also generates a file called TSS_counts_table.txt with the raw read counts and the FPKM counts of the promoters.
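For reference, FPKM normalization presumably follows the standard formula, fragments / ((region length in kb) x (total mapped fragments in millions)); the exact implementation is in the pipeline scripts. A toy awk sketch, with all numbers made up:

```shell
# Toy FPKM calculation: 500 fragments on a 2,000 bp promoter, out of
# 20 million total mapped fragments (all values are illustrative).
awk 'BEGIN {
  count = 500; len_bp = 2000; total = 20000000
  fpkm = count / ((len_bp / 1000) * (total / 1000000))
  print fpkm   # -> 12.5
}'
```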
########################################################################################################
########################################################################################################
PARAMETERS FILE
This file allows the user to set up the run parameters.
It is built from 2 parts:
[params] and [setup_run]
The parameters in [params] are self-explanatory.
In [setup_run] the user decides which parts of the analysis will be run:
0 means no, 1 means yes
collectReads = 0 => merges the raw reads when the fastq of a sample is split into several parts. The script expects to find a folder per sample, with gzipped files.
R1 files should contain the string R1 in the file name, and R2 files the string R2.
The location of the raw reads is defined by crude_reads_location_for_merging.
fastqc = 0 => runs FastQC
trim_adapter = 1 => runs cutadapt
make_body = 1 => runs the bowtie alignment, and filters the alignment to contain only uniquely mapped reads. The output is a sorted bam.
make_plots = 1 => generates ngsplot plots
nucleosome_free = 1 => filters the bam file to contain only paired-end reads with an insert size < 130 bp
call_peaks = 1 => uses MACS2 to call peaks
countTSS_reads = 1 => counts the reads in the TSS regions and performs FPKM normalization. The report contains the raw read count and the FPKM count.
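Putting it together, a parameters file might look roughly like this. This is only a sketch based on the options listed above: the exact keys and syntax should be taken from the run_ATAC_Ts_V2_params.txt shipped with the pipeline, and the path is a placeholder.

```
[params]
crude_reads_location_for_merging = /path/to/crude_reads
# ...other [params] keys as in the shipped file...

[setup_run]
collectReads    = 1
fastqc          = 1
trim_adapter    = 1
make_body       = 1
make_plots      = 1
nucleosome_free = 1
call_peaks      = 1
countTSS_reads  = 1
```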