Skip to content
This repository was archived by the owner on Jan 5, 2022. It is now read-only.

Run SCAPE on single end scRNAseq datasets (10X genomics)

zhou-ran edited this page Dec 5, 2020 · 4 revisions

1. Inferring pA sites for R2-only datasets

Downloading datasets

We first download the PBMC datasets from 10X genomics websites.

wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_possorted_genome_bam.bam

wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_possorted_genome_bam.bam.bai

wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_filtered_feature_bc_matrix.tar.gz

Prepare 3UTR region

The GRCh38.p12.cr.gtf.gz was download from Cell Ranger. This will generate a 3UTR bed file.

python main.py prepare \
--gtf GRCh38.p12.cr.gtf.gz \
--prefix GRCh38

Running SCAPE

The APA expression matrix was stored in pasite.csv.gz.

python main.py apamix \
--bed GRCh38_utr.bed \
--bam pbmc_1k_v3_possorted_genome_bam.bam \
--out pbmc1k/ \
--cores 12 \
--cb barcodes.tsv.gz

Merge multiple results from scRNAseq

if you hava multiple datasets, then run this.

python script/group_pa.py \
--files pbmc1k_1/pasite.csv.gz,pbmc1k_2/pasite.csv.gz \
--labels pbmc1k_1,pbmc1k_2 \
--outfile collapse_pa.tsv.gz

Or if you want to re-group your pA sites by different threshold.

python script/group_pa.py \
--files pbmc1k_1/pasite.csv.gz \
--labels pbmc1k_1 \
--outfile collapse_pa.tsv.gz \
--group_threshold 200

2. Inferring pA sites for pair-end datasets

In this tutorial, we used the datasets from Microwell-seq.

Prepare datasets

1. Downloading PE scRNAseq(e.g. microwell-seq) datasets

We first download the bone marrow fastq files from EBI.

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR695/003/SRR6954503/SRR6954503_1.fastq.gz

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR695/003/SRR6954503/SRR6954503_2.fastq.gz

2. Prepare 3UTR region

The GRCm38.cr.gtf.gz was download from Cell Ranger. The gtf from any other source also can be used.

python main.py prepare \
--gtf GRCm38.cr.gtf.gz \
--prefix GRCm38

Process datasets

Snakemake pipeline

There is a snakemake-based workflow for processing the microwell-seq datasets.

Run pipe step by step

  1. FastqToSam (rule.fq2bam)
# Convert fastq file into unmapped bam file.

java -Xmx20G -jar picard.jar\
FastqToSam \
F1=sample_R1.fastq.gz \
F2=sample_R2.fastq.gz \
SM=DS \
O=sample_unmapped.bam
  1. TagBamWithReadSequenceExtended (rule.addBarcode)
# Add barcode sequence into tags of unmapped bam.
# In Microwell-seq, the barcode sequence located in 1-6:22-27:43-48 of R1 reads.
# In 10X genomics, the barcode sequence located in 1:16 of R1 reads.

droptools/TagBamWithReadSequenceExtended \
SUMMARY=addBarcode.log \
BASE_RANGE=1-6:22-27:43-48 \
DISCARD_READ=false \
BARCODED_READ=1 \
TAG_NAME=CB \
NUM_BASES_BELOW_QUALITY=10 \
INPUT=sample_unmapped.bam \
OUTPUT=sample_BC.bam
  1. AddPolyTInfo (rule.addToanother)
# Add polyT length information into bam tags. 
# The polyT length was used to infer the accurate polyadenylation sites.

python script/manibam.py \
-i sample_BC.bam \
-o sample_BC_dT.bam \
-b 54
  1. TagBamWithReadSequenceExtended (rule.addUMI)
# Add UMI sequence into tags of unmapped bam.
# In Microwell-seq, the UMI sequence located in 49-54 of R1 reads.
# In 10X genomics, the UMI sequence located in 17:26 of R1 reads.

droptools/TagBamWithReadSequenceExtended \
SUMMARY=addBarcode.log \
BASE_RANGE=49-54 \
DISCARD_READ=true \
BARCODED_READ=1 \
TAG_NAME=CB \
NUM_BASES_BELOW_QUALITY=10 \
INPUT=sample_BC_dT.bam \
OUTPUT=sample_BC_dT_UMI.bam
  1. Alignment
# Convert the unmmaped bam into fq and align to reference genome

java -Djava.io.tmpdir=tmp \
-Xmx20G -jar picard.jar SamToFastq \
INPUT=sample_BC_dT_UMI.bam FASTQ=/dev/stdout | STAR \
--genomeDir star_index \
--alignMatesGapMax 5000 --runThreadN 12 \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes All \
--readFilesIn /dev/stdin \
--sjdbGTFfile gtf \
--outFileNamePrefix alignment/sample_ \
--outFilterScoreMinOverLread 0.4 \
--outFilterMatchNminOverLread 0.4 \
--limitBAMsortRAM 70000000000

# Merge unmapped and mapped file to add the barcode, umi and dT tag into mapped file.

java -Djava.io.tmpdir=tmp \
-Xmx20G -jar picard.jar \
MergeBamAlignment \
REFERENCE_SEQUENCE=genome.fa \
UNMAPPED_BAM=sample_BC_dT_UMI.bam \
ALIGNED_BAM=alignment/sample_Aligned.sortedByCoord.out.bam \
INCLUDE_SECONDARY_ALIGNMENTS=false OUTPUT=sample.bam

# index the bam file

samtools index sample.bam
  1. Run SCAPE
python scape/main.py apamix \
--bed utr.bed \
--bam sample.bam \
--out sample/ \
--cores 12 \
--cb sample.tsv