KIDS25-Team1

PacBio Iso-Seq and TAGET Analysis Pipeline

Authors: Zhongshan Cheng1, Jia-Hua “George” Qu2, Nick Peterson3, Meghan Carr4

Affiliations:

Center for Applied Bioinformatics, St. Jude Children’s Research Hospital
Host-Microbe Interactions, St. Jude Children’s Research Hospital
Host-Microbe Interactions, St. Jude Graduate School (Ph.D. track)
Student, CodeCrew Coding School

This repository documents the steps for running a long-read RNA-Seq analysis pipeline integrating PacBio Iso-Seq, IsoQuant, and our modified TAGET workflow. The pipeline enables isoform-level and gene-level quantification, followed by differential expression (DE) analysis with DESeq2.

1. Environment Setup

Create a dedicated conda environment:

conda create -n pacbio python=3.8.20
conda activate pacbio

Install required packages:

python=3.8.20
samtools=1.10
openssl=1.1.1w
minimap2=2.26
hisat2=2.2.1
pandas=1.3.5
numpy=1.21.6

Export environment if needed:

conda env export -n pacbio > environment-pacbio.yaml

2. Input Data Preparation

Download example BAM:

wget https://downloads.pacbcloud.com/public/dataset/Melanoma2019_IsoSeq/FullLengthReads/flnc.bam

Split BAM into per-sample files. Example:

samtools view transcripts.bam -h | perl ... > sample.sam
bash PreparePacbioClusteredBAM2TAGET.sh sample.sam sampleID

3. Reference Genome

Download hg38 reference and annotation (GENCODE or Ensembl).

4. Conversion & CPM Generation

Convert BAM to FASTA per sample.

Generate TPM/CPM tables:

conda activate pacbio
python Py4LongReadRNASeq/DetermineExp4ClusteredReads.py -f input.fasta -o output_for_TAGET.tpm.txt

5. Configure and Run TAGET

Update config file with CPM paths.

Run TAGET refinement:

conda activate pacbio
python RefinedLongReadMappingAndQuantification.py -c config

6. Merge Outputs

Merge isoform results:

conda activate pacbio
python TransAnnotMerge.py -c merge.config -o outputdir -m FLC

Outputs include: gene.exp, transcript.exp, and supporting files.

7. IsoQuant (Optional)

Install isoquant:

conda create -c bioconda -n isoquant python=3.8 isoquant=3.7.1
conda activate isoquant

conda env export -n isoquant > environment-isoquant.yaml

Download the fastq file

wget https://downloads.pacbcloud.com/public/dataset/Melanoma2019_IsoSeq/FullLengthReads/flnc.fastq

Run isoquant on FASTQ:
```
isoquant.py --reference hg38.fa --genedb hg38.gtf --fastq flnc.fastq \
--data_type pacbio_ccs -o output_isoquant
```
This step also generated the new gft file, the IsoQuant-derived GTFs incorporating novel transcripts.

8. PacBio Standard Pipeline (Optional)

Supports running using Singularity containers: pbskera, isoseq, pbmm2, pbpigeon.
Includes alignment, isoquant refinement, pigeon classification, and annotation processing.
This optional step is to generate results to compare the functions of the PacBio Standard Pipeline and our modified TAGET pipeline.
Either the PacBio Standard Pipeline or our modified TAGET pipeline produces at least two output results, including gene.exp and transcript.exp.

To save time, we performed the analysis in the St. Jude HPC this time.

module load parallel/20240222
export USER=zcheng
export HG38=/research/rgs01/applications/hpcf/authorized_apps/hartwell/Automation/REF/Kinnex-IsoSeq/RefGenomes/Human_hg38_Gencode_v39
export REFS=/research/rgs01/applications/hpcf/authorized_apps/hartwell/Automation/REF/Kinnex-IsoSeq

mkdir -p /scratch_space/$USER/KinnexBulkIsoSeqAnalysis
cd /scratch_space/$USER/KinnexBulkIsoSeqAnalysis

singularity pull docker://quay.io/biocontainers/pbskera:1.4.0--hdfd78af_0
singularity pull docker://quay.io/biocontainers/lima:2.13.0--h9ee0642_0
singularity pull docker://quay.io/biocontainers/isoseq:4.3.0--h9ee0642_0
singularity pull docker://quay.io/pacbio/pbmm2:1.17.0_build1
singularity pull docker://quay.io/biocontainers/pbfusion:0.5.1--hdfd78af_0

# Run ISOSEQ REFINE
# Don't need to run this step, because we have downloaded the flnc.bam in the beginning.
# wget https://downloads.pacbcloud.com/public/dataset/Melanoma2019_IsoSeq/FullLengthReads/flnc.bam
# If we don't have the refined file, flnc.bam, the user needs to run this code.
singularity run -B $PWD -B $REFS -B $HG38 docker://quay.io/biocontainers/isoseq:4.3.0--h9ee0642_0 \
	isoseq refine --require-polya -j 16 \
	02_COLO829Pacbio.{1}.fl.{2}.bam \
	$REFS/bulkRNA/02_lima-primers/IsoSeq_v2_primers_12.fasta \
	03_COLO829Pacbio.{1}.flnc.bam

### x.sh
# Use previously generated or downloaded flnc BAM;
https://downloads.pacbcloud.com/public/dataset/Melanoma2019_IsoSeq/
https://github.com/RhettRautsaw/StJude_PacBio-WDL-tutorial/blob/main/Kinnex_IsoSeq_Pipelines/KinnexBulkIsoSeq.lsf
# extract reads for each sample;
# Search and extract the reads identified in a certain sample, and generate the subset sam file.
export sampleID=$1; samtools view flnc.bam -h |perl -anE 'if
 (/^\@/){print}else{print if /$ENV{sampleID}/}
' |samtools view -Sb - -o $sampleID.bam;
samtools index $sampleID.bam
# put the above into x.sh to run it for each sample in the cluster;

bsub_Grace_Next -n 1 -m 20 "bash x.sh m54019_190120_021709"
bsub_Grace_Next -n 1 -m 20 "bash x.sh m54026_190120_000756"
bsub_Grace_Next -n 1 -m 20 "sleep 40s;bash x.sh m54119_190202_095143"
bsub_Grace_Next -n 1 -m 20 "sleep 100s;bash x.sh m54119_190203_061153"
bsub_Grace_Next -n 1 -m 20 "sleep 200s;bash x.sh m54119_190131_171128"
bsub_Grace_Next -n 1 -m 20 "sleep 300s;bash x.sh m54119_190201_133141"
 
### y.sh
export USER=zcheng
export HG38=/research/rgs01/applications/hpcf/authorized_apps/hartwell/Automation/REF/Kinnex-IsoSeq/RefGenomes/Human_hg38_Gencode_v39
export REFS=/research/rgs01/applications/hpcf/authorized_apps/hartwell/Automation/REF/Kinnex-IsoSeq
realpath $1.bam > 03_COLO829Pacbio.$1.flnc.fofn
# Run PBMM2
singularity run -B $PWD -B $REFS -B $HG38 docker://quay.io/pacbio/pbmm2:1.17.0_build1 \
	pbmm2 align -j 16 --preset ISOSEQ --sort \
	$HG38/human_GRCh38_no_alt_analysis_set.fasta \
	03_COLO829Pacbio.$1.flnc.fofn \
	04_COLO829Pacbio.$1.align.bam
#put the above codes into y.sh to run it for each sample;

#Run all samples using the y.sh
ls m*bam | perl -pe 's/\.bam//' | perl -ane 'chomp;`bsub_Grace_Next -n 1 -m 100 "bash y.sh $_"`'
	
realpath 04_COLO829Pacbio.*.align.bam > 04_COLO829Pacbio.align.fofn
perl -pe 's/.*kinnex.//g' 04_COLO829Pacbio.align.fofn | perl -pe 's/.align.bam//g' > 04_COLO829Pacbio.align.labels

singularity run -B $PWD -B $REFS -B $HG38 docker://quay.io/pacbio/pb_wdl_base:build3 \
	python $REFS/isoquant_generateYAML.py -b 04_COLO829Pacbio.align.fofn -l 04_COLO829Pacbio.align.labels -e 05_COLO829Pacbio.isoquant -o 05_isoquant4COLO829.yaml
rm 05_COLO829Pacbio.isoquant -rf 
singularity run -B $PWD -B $REFS -B $HG38 docker://quay.io/biocontainers/isoquant:3.6.3--hdfd78af_0 \
	isoquant.py -t 64 -d pacbio --yaml 05_isoquant4COLO829.yaml \
	-r $HG38/human_GRCh38_no_alt_analysis_set.fasta \
	-g $HG38/gencode.v39.annotation.sorted.gtf.db --complete_genedb \
	-o 05_COLO829Pacbio.isoquant \
	--sqanti_output

ln -s 05_COLO829Pacbio.isoquant/05_COLO829Pacbio.isoquant/05_COLO829Pacbio.isoquant.* .

singularity run -B $PWD -B $REFS -B $HG38 docker://quay.io/pacbio/pb_wdl_base:build3 \
	python $REFS/isoquant2pigeon.py \
	--gtf 05_COLO829Pacbio.isoquant.transcript_models.gtf \
	--tsv 05_COLO829Pacbio.isoquant.transcript_model_grouped_counts.tsv \
	--output 06_COLO829Pacbio.pigeon.transcript_model_grouped_counts.csv

singularity run -B $PWD -B $REFS -B $HG38 docker://quay.io/biocontainers/pbpigeon:1.4.0--h9948957_0 \
	pigeon classify -j 64 -o 06_COLO829Pacbio.pigeon \
	05_COLO829Pacbio.isoquant.transcript_models.gtf \
	$HG38/gencode.v39.annotation.sorted.gtf \
	$HG38/human_GRCh38_no_alt_analysis_set.fasta \
	--flnc 06_COLO829Pacbio.pigeon.transcript_model_grouped_counts.csv \
	--cage-peak $HG38/refTSS_v3.3_human_coordinate.hg38.sorted.bed \
	--poly-a $HG38/polyA.list.txt \
	--coverage $HG38/intropolis.v1.hg19_with_liftover_to_hg38.tsv.min_count_10.modified2.sorted.tsv

singularity run -B $PWD -B $REFS -B $HG38 docker://quay.io/biocontainers/pbpigeon:1.4.0--h9948957_0 \
	pigeon report -j 64 06_COLO829Pacbio.pigeon_classification.txt 06_COLO829Pacbio.pigeon_classification.report.txt

cat 05_COLO829Pacbio.isoquant.extended_annotation.gtf | grep transcript | perl -pe 's/.*gene_id.{2}(ENSG\d+\.\d+).; transcript_id .(ENST\d+\.\d+).;.*; .*gene_name "(\S+)".*transcript_name "(\S+)".*/$1\t$2\t$3\t$4/' |sort -u |grep gene_id -v >ensembl_gene2tx.txt

#Clean the file headers;
perl -i.bak -pe 's/\/[^\t]+\/04_COLO829Pacbio.//g' 05_COLO829Pacbio.isoquant.transcript_model_grouped_counts.tsv
perl -i.bak -pe 's/\/[^\t]+\/04_COLO829Pacbio.//g' 05_COLO829Pacbio.isoquant.gene_grouped_counts.tsv
ln -s 05_COLO829Pacbio.isoquant.transcript_model_grouped_counts.tsv 07_isoquant.isoforms4COLO829.matrix
ln -s 05_COLO829Pacbio.isoquant.gene_grouped_counts.tsv 07_isoquant.genes4COLO829.matrix
#Sample matrix, including sample grp and sample ids;
#COLO8299T	m54026_190120_000756 
#COLO8299T	m54119_190202_095143  
#COLO8299T	m54119_190203_061153
#COLO829BL	m54019_190120_021709
#COLO829BL	m54119_190131_171128
#COLO829BL	m54119_190201_133141

9. Differential Expression (DE) Analysis

Performed with DESeq2 (via Trinity Singularity image). Below are the detailed steps for both PacBio and TAGET pipelines.

9.1 DE Analysis of PacBio Pipeline Outputs

9.1.1 Isoform Matrix

Annotate transcript IDs with gene names:

vlookup 07_isoquant.isoforms4COLO829.matrix 1 ensembl_gene2tx.txt 2 3,4 y | \
perl -ane '$F[0]=$F[-1] unless /NaN/;print join("\t",@F[0..($#F-2)]),"\n";' \
> 07_isoquant.isoforms4COLO829.matrix.new

Run DESeq2:

singularity run -B $PWD -B $REFS -B $HG38 \
  docker://quay.io/biocontainers/trinity:2.15.2--pl5321h077b44d_3 \
  run_DE_analysis.pl \
  --matrix 07_isoquant.isoforms4COLO829.matrix.new \
  --method DESeq2 \
  --samples_file 07_samples.matrix4COLO829.txt \
  --output 07_deseq2.isoforms

Extract top 20 isoforms:

head -n21 07_deseq2.isoforms/07_isoquant.isoforms4COLO829.matrix.new.COLO8299T_vs_COLO829BL.DESeq2.DE_results | \
cut -f1 | tail -n +2 > 07_deseq2.top20.isoforms.new.txt

9.1.2 Gene Matrix

Annotate gene IDs with gene names:

vlookup 07_isoquant.genes4COLO829.matrix 1 ensembl_gene2tx.txt 1 3 y | \
perl -ane '$F[0]=$F[-1] unless /NaN/;print join("\t",@F[0..($#F-1)]),"\n";' | \
SortFileByCols.sh - '-k1,1 -u ' 1 > 07_isoquant.genes4COLO829.matrix.new

Run DESeq2:

singularity run -B $PWD -B $REFS -B $HG38 \
  docker://quay.io/biocontainers/trinity:2.15.2--pl5321h077b44d_3 \
  run_DE_analysis.pl \
  --matrix 07_isoquant.genes4COLO829.matrix.new \
  --method DESeq2 \
  --samples_file 07_samples.matrix4COLO829.txt \
  --output 07_deseq2.genes

Extract top 20 genes:

head -n21 07_deseq2.genes/07_isoquant.genes4COLO829.matrix.new.COLO8299T_vs_COLO829BL.DESeq2.DE_results | \
cut -f1 | tail -n +2 > 07_deseq2.top20.genes.new.txt

9.2 DE Analysis of TAGET Pipeline Outputs

9.2.1 Isoform Matrix

Format TAGET transcript expression file:

cat ~/working_scripts/Py4LongReadRNASeq/outputdir/transcript.exp | \
delete_column 1 | \
perl -pe 's/Transcript/#feature_id/' | \
SortFileByCols.sh - '-k1,1 -u ' 1 > 07_TAGET.isoforms4COLO829.matrix.new

Run DESeq2:

singularity run -B $PWD -B $REFS -B $HG38 \
  docker://quay.io/biocontainers/trinity:2.15.2--pl5321h077b44d_3 \
  run_DE_analysis.pl \
  --matrix 07_TAGET.isoforms4COLO829.matrix.new \
  --method DESeq2 \
  --samples_file 07_samples.matrix4COLO829.txt \
  --output 07_deseq2.isoforms_TAGET

Extract top 20 isoforms:

head -n21 07_deseq2.isoforms_TAGET/07_TAGET.isoforms4COLO829.matrix.new.COLO8299T_vs_COLO829BL.DESeq2.DE_results | \
cut -f1 | tail -n +2 > 07_deseq2.top20.isoforms.TAGET.new.txt

9.2.2 Gene Matrix

Format TAGET gene expression file:

cat ~/working_scripts/Py4LongReadRNASeq/outputdir/gene.exp | \
perl -pe 's/Gene/#feature_id/' | \
SortFileByCols.sh - '-k1,1 -u ' 1 > 07_TAGET.genes4COLO829.matrix.new

Run DESeq2:

singularity run -B $PWD -B $REFS -B $HG38 \
  docker://quay.io/biocontainers/trinity:2.15.2--pl5321h077b44d_3 \
  run_DE_analysis.pl \
  --matrix 07_TAGET.genes4COLO829.matrix.new \
  --method DESeq2 \
  --samples_file 07_samples.matrix4COLO829.txt \
  --output 07_deseq2.genes_TAGET

Extract top 20 genes:

head -n21 07_deseq2.genes_TAGET/07_TAGET.genes4COLO829.matrix.new.COLO8299T_vs_COLO829BL.DESeq2.DE_results | \
cut -f1 | tail -n +2 > 07_deseq2.top20.genes.TAGET.new.txt

10. Comparison and Summary

Both PacBio Standard Pipeline and TAGET-modified Pipeline produce isoform- and gene-level DE results.
Top isoforms and genes can be extracted for downstream biological interpretation.
Running both pipelines in parallel enables comparison and validation of results, highlighting concordance as well as novel isoform discoveries.

Notes

Ensure reference genome and annotation files are consistent across steps.
Results can be further rerun using IsoQuant-derived GTFs to incorporate novel transcripts.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

KIDS25-Team1

PacBio Iso-Seq and TAGET Analysis Pipeline

1. Environment Setup

2. Input Data Preparation

3. Reference Genome

4. Conversion & CPM Generation

5. Configure and Run TAGET

6. Merge Outputs

7. IsoQuant (Optional)

8. PacBio Standard Pipeline (Optional)

9. Differential Expression (DE) Analysis

9.1 DE Analysis of PacBio Pipeline Outputs

9.1.1 Isoform Matrix

9.1.2 Gene Matrix

9.2 DE Analysis of TAGET Pipeline Outputs

9.2.1 Isoform Matrix

9.2.2 Gene Matrix

10. Comparison and Summary

Notes

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

KIDS25-Team1

PacBio Iso-Seq and TAGET Analysis Pipeline

1. Environment Setup

2. Input Data Preparation

3. Reference Genome

4. Conversion & CPM Generation

5. Configure and Run TAGET

6. Merge Outputs

7. IsoQuant (Optional)

8. PacBio Standard Pipeline (Optional)

9. Differential Expression (DE) Analysis

9.1 DE Analysis of PacBio Pipeline Outputs

9.1.1 Isoform Matrix

9.1.2 Gene Matrix

9.2 DE Analysis of TAGET Pipeline Outputs

9.2.1 Isoform Matrix

9.2.2 Gene Matrix

10. Comparison and Summary

Notes

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages