This repository was archived by the owner on Jan 5, 2022. It is now read-only.
Run SCAPE on paired-end scRNA-seq datasets (Microwell-seq)
zhou-ran edited this page Dec 3, 2020 · 3 revisions
In this tutorial, we use a dataset generated with Microwell-seq.
We first download the bone marrow FASTQ files from EBI.
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR695/003/SRR6954503/SRR6954503_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR695/003/SRR6954503/SRR6954503_2.fastq.gz
The GRCm38.cr.gtf.gz file was downloaded from Cell Ranger. A GTF file from any other source can also be used.
python main.py prepare \
--gtf GRCm38.cr.gtf.gz \
--prefix GRCm38
A Snakemake-based workflow is available for processing Microwell-seq datasets. It runs the following steps:
- FastqToSam (rule.fq2bam)
# Convert the paired FASTQ files into an unmapped BAM file.
java -Xmx20G -jar picard.jar \
FastqToSam \
F1=sample_R1.fastq.gz \
F2=sample_R2.fastq.gz \
SM=DS \
O=sample_unmapped.bam
- TagBamWithReadSequenceExtended (rule.addBarcode)
# Add the cell barcode sequence to the tags of the unmapped BAM.
# In Microwell-seq, the barcode is located at bases 1-6, 22-27, and 43-48 of R1 reads.
# In 10x Genomics data, the barcode is located at bases 1-16 of R1 reads.
droptools/TagBamWithReadSequenceExtended \
SUMMARY=addBarcode.log \
BASE_RANGE=1-6:22-27:43-48 \
DISCARD_READ=false \
BARCODED_READ=1 \
TAG_NAME=CB \
NUM_BASES_BELOW_QUALITY=10 \
INPUT=sample_unmapped.bam \
OUTPUT=sample_BC.bam
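To make the `BASE_RANGE` syntax concrete, here is a minimal Python sketch of how a range string such as `1-6:22-27:43-48` selects bases from an R1 read. It is an illustration only, not the Drop-seq tools implementation; the helper names `parse_base_range` and `extract_bases` and the toy read are hypothetical.

```python
def parse_base_range(spec):
    """Parse a range string such as '1-6:22-27:43-48' into (start, end) pairs.

    Positions are 1-based and inclusive, as in the BASE_RANGE argument above.
    """
    ranges = []
    for part in spec.split(":"):
        start, end = (int(x) for x in part.split("-"))
        if start < 1 or end < start:
            raise ValueError(f"invalid range: {part!r}")
        ranges.append((start, end))
    return ranges


def extract_bases(read, spec):
    """Concatenate the bases of `read` covered by the ranges in `spec`."""
    pieces = []
    for start, end in parse_base_range(spec):
        if end > len(read):
            raise ValueError("read is shorter than the requested range")
        pieces.append(read[start - 1:end])  # convert to 0-based slicing
    return "".join(pieces)


# Toy R1 read (made-up sequence): three 6-base barcode segments separated by
# 15-base linkers, then a 6-base UMI, then a polyT stretch.
r1 = "AACCGG" + "N" * 15 + "TTGGCC" + "N" * 15 + "GATTAC" + "ACGTAC" + "T" * 20

barcode = extract_bases(r1, "1-6:22-27:43-48")
print(barcode)
```

The three segments are joined into a single 18-base cell barcode; the linker bases between them are ignored.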
- AddPolyTInfo (rule.addToanother)
# Add polyT length information to the BAM tags.
# The polyT length is used to infer accurate polyadenylation sites.
python script/manibam.py \
-i sample_BC.bam \
-o sample_BC_dT.bam \
-b 54
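The internals of `manibam.py` are not shown on this page. As a rough sketch of the idea, assuming `-b 54` gives the number of leading R1 bases (barcode, linkers, and UMI) that precede the polyT stretch, the polyT length could be measured as below. The function name `polyt_length` is hypothetical and the logic is an assumption, not the script's actual code.

```python
def polyt_length(read, offset=54):
    """Count consecutive T bases that start right after the first `offset` bases.

    `offset` is assumed to be the combined length of barcode, linkers and UMI
    (54 for the Microwell-seq layout described above).
    """
    length = 0
    for base in read[offset:]:
        if base != "T":
            break  # the run ends at the first non-T base
        length += 1
    return length


# Toy R1 read: 54 placeholder bases followed by 12 Ts and some cDNA sequence.
toy_read = "A" * 54 + "T" * 12 + "ACGT"
print(polyt_length(toy_read))
```

A strict run of Ts is the simplest choice here; a real implementation might tolerate sequencing errors inside the stretch.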
- TagBamWithReadSequenceExtended (rule.addUMI)
# Add the UMI sequence to the tags of the unmapped BAM.
# In Microwell-seq, the UMI is located at bases 49-54 of R1 reads.
# In 10x Genomics data, the UMI is located at bases 17-26 of R1 reads.
droptools/TagBamWithReadSequenceExtended \
SUMMARY=addUMI.log \
BASE_RANGE=49-54 \
DISCARD_READ=true \
BARCODED_READ=1 \
TAG_NAME=UB \
NUM_BASES_BELOW_QUALITY=10 \
INPUT=sample_BC_dT.bam \
OUTPUT=sample_BC_dT_UMI.bam
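`NUM_BASES_BELOW_QUALITY` is read here as a cap on how many bases inside the range may fall below a quality threshold. The Python sketch below illustrates that kind of check under stated assumptions: the interpretation of the option, the threshold of 10, and Phred+33 encoding are all assumptions, and `count_low_quality` is a hypothetical helper rather than part of Drop-seq tools.

```python
def count_low_quality(qualities, start, end, min_quality=10, phred_offset=33):
    """Count bases in the 1-based inclusive range [start, end] whose Phred
    score is below `min_quality`.

    `qualities` is the FASTQ quality string; Phred+33 encoding is assumed.
    """
    window = qualities[start - 1:end]
    return sum(1 for ch in window if ord(ch) - phred_offset < min_quality)


# Toy quality string: 'I' encodes Phred 40 and '#' encodes Phred 2.
qual = "I" * 48 + "I#I##I" + "I" * 20

low_in_umi = count_low_quality(qual, 49, 54)
print(low_in_umi)
```

Under this reading, a 6-base UMI can have at most 6 low-quality bases, so a cap of 10 would never be exceeded.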
- Alignment
# Convert the unmapped BAM to FASTQ and align it to the reference genome.
java -Djava.io.tmpdir=tmp \
-Xmx20G -jar picard.jar SamToFastq \
INPUT=sample_BC_dT_UMI.bam FASTQ=/dev/stdout | STAR \
--genomeDir star_index \
--alignMatesGapMax 5000 --runThreadN 12 \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes All \
--readFilesIn /dev/stdin \
--sjdbGTFfile gtf \
--outFileNamePrefix alignment/sample_ \
--outFilterScoreMinOverLread 0.4 \
--outFilterMatchNminOverLread 0.4 \
--limitBAMsortRAM 70000000000
# Merge the unmapped and mapped BAM files to carry the barcode, UMI, and dT tags over to the mapped file.
java -Djava.io.tmpdir=tmp \
-Xmx20G -jar picard.jar \
MergeBamAlignment \
REFERENCE_SEQUENCE=genome.fa \
UNMAPPED_BAM=sample_BC_dT_UMI.bam \
ALIGNED_BAM=alignment/sample_Aligned.sortedByCoord.out.bam \
INCLUDE_SECONDARY_ALIGNMENTS=false OUTPUT=sample.bam
# Index the BAM file.
samtools index sample.bam
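The merge step exists because the alignments produced from FASTQ no longer carry the tags stored in the unmapped BAM. The idea can be sketched in Python as a join on read name. This is a conceptual illustration with plain dictionaries standing in for BAM records, not Picard's implementation; the record fields, the `merge_tags` helper, and the tag values are hypothetical.

```python
def merge_tags(unmapped, aligned):
    """Copy tags from unmapped records onto aligned records with the same name.

    Each record is a dict with a 'name' key and an optional 'tags' dict.
    Tags already present on the aligned record take precedence.
    """
    tags_by_name = {rec["name"]: rec.get("tags", {}) for rec in unmapped}
    merged = []
    for rec in aligned:
        extra = tags_by_name.get(rec["name"], {})
        merged.append({**rec, "tags": {**extra, **rec.get("tags", {})}})
    return merged


# Toy records: the unmapped side holds the barcode and UMI tags,
# the aligned side holds the mapping position and aligner tags.
unmapped = [
    {"name": "read1", "tags": {"CB": "AACCGGTTGGCCGATTAC", "UB": "ACGTAC"}},
    {"name": "read2", "tags": {"CB": "TTGGCCAACCGGGATTAC", "UB": "GGTACA"}},
]
aligned = [{"name": "read1", "chrom": "chr1", "pos": 100, "tags": {"NH": 1}}]

merged = merge_tags(unmapped, aligned)
print(merged[0]["tags"])
```

Reads that appear only in the unmapped input (`read2` here) are simply absent from this sketch's output.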
- Run SCAPE
python scape/main.py apamix \
--bed utr.bed \
--bam sample.bam \
--out sample/ \
--cores 12 \
--cb sample.tsv
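This page does not describe how `sample.tsv` is produced. One possible way to build it, assuming the file lists one cell barcode per line, is to keep the barcodes supported by a minimum number of reads. The Python sketch below shows that idea; the helper names, the `min_reads` threshold, and the toy barcodes are hypothetical, and the file format is an assumption.

```python
from collections import Counter
import os
import tempfile


def select_barcodes(read_barcodes, min_reads=2):
    """Return the sorted barcodes seen in at least `min_reads` reads."""
    counts = Counter(read_barcodes)
    return sorted(bc for bc, n in counts.items() if n >= min_reads)


def write_barcode_list(path, barcodes):
    """Write one barcode per line (assumed format of the --cb file)."""
    with open(path, "w") as handle:
        for bc in barcodes:
            handle.write(bc + "\n")


# Toy input: one barcode per read, e.g. collected from the CB tag.
observed = ["AACCGG", "AACCGG", "TTGGCC", "AACCGG", "GATTAC", "GATTAC"]
kept = select_barcodes(observed, min_reads=2)

# Write to a temporary directory so the example does not touch real data.
out_path = os.path.join(tempfile.mkdtemp(), "sample.tsv")
write_barcode_list(out_path, kept)
print(kept)
```

In practice the threshold would be chosen from the read-count distribution of the dataset rather than fixed in advance.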