|
| 1 | +Introduction |
| 2 | +------------ |
| 3 | + |
| 4 | +The ST Pipeline contains the tools and scripts needed to process |
| 5 | +and analyze the raw FASTQ files generated with Spatial Transcriptomics |
| 6 | +or Visium, producing datasets for downstream analysis. |
| 7 | +The ST Pipeline can also be used to process single-cell RNA-seq data as |
| 8 | +long as a file with barcodes identifying each cell is provided. |
| 9 | +It can also process RNA-seq datasets generated with |
| 10 | +or without UMIs. |
| 11 | + |
| 12 | +The ST Pipeline has been optimized for speed and robustness, and |
| 13 | +it is easy to use, with many parameters to adjust its settings. |
| 14 | +The ST Pipeline is fully parallel and has constant memory use. |
| 15 | +It lets you skip any of the steps and use either the |
| 16 | +genome or the transcriptome as reference. |
| 17 | + |
| 18 | +The following files/parameters are used as input (some are optional, as noted): |
| 19 | + |
| 20 | +- FASTQ files (read 1 containing the spatial information and the UMI, |
| 21 | + and read 2 containing the genomic sequence) |
| 22 | +- A genome index generated with STAR |
| 23 | +- An annotation file in GTF or GFF format (optional) |
| 24 | +- A file containing the barcodes and array coordinates |
| 25 | + (look in the folder "ids" and choose the correct one). |
| 26 | + This file contains 3 columns (BARCODE, X and Y), |
| 27 | + so if you provide a file with barcodes identifying cells (for example), |
| 28 | + the ST Pipeline can be used for single-cell data. |
| 29 | + This file is also optional. |
| 30 | +- A name for the dataset |
| 31 | + |
| 32 | +The ST Pipeline has multiple parameters, mostly related to trimming, |
| 33 | +mapping and annotation, but the default values are generally good enough. |
| 34 | +You can see a full description of the parameters by |
| 35 | +typing "st_pipeline_run.py --help" after you have installed the ST Pipeline. |
| 36 | + |
| 37 | +The input FASTQ files can be given in gzip/bzip format as well. |
| 38 | + |
| 39 | +In short, the ST Pipeline performs the following steps: |
| 40 | + |
| 41 | +- Quality trimming (read 1 and read 2): |
| 42 | + - Remove low quality bases |
| 43 | + - Sanity check (reads same length, reads order, etc.) |
| 44 | + - Check UMI quality (if provided) |
| 45 | + - Remove artifacts (PolyT, PolyA, PolyG, PolyN and PolyC) of user-defined length |
| 46 | + - Check for AT and GC content |
| 47 | + - Discard reads that fall below a minimum number of bases or that failed any of the checks above |
| 48 | +- Contaminant filter, e.g. rRNA genome (optional) |
| 49 | +- Mapping with STAR (only read 2) |
| 50 | +- Demultiplexing with [Taggd](https://github.com/SpatialTranscriptomicsResearch/taggd) (only read 1) |
| 51 | +- Keep reads (read 2) that contain a valid barcode and are correctly mapped |
| 52 | +- Annotate the reads with htseq-count (optional) |
| 53 | +- Group annotated reads by barcode (spot position) and gene to get a read count |
| 54 | +- In the grouping/counting, only unique molecules (UMIs) are kept |
| 55 | + |
| 56 | +A more detailed graphical description of the workflow is available in the documents workflow.pdf and workflow_extended.pdf. |
| 57 | + |
| 58 | +The output will be a matrix of counts (genes as columns, spots as rows), |
| 59 | +a BED file containing the transcripts (read name, coordinate, gene, etc.), and a JSON |
| 60 | +file with useful stats. |
| 61 | +The ST Pipeline will also output a log file with useful information. |
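
To illustrate the grouping/counting step and the shape of the output matrix described in the README text, here is a minimal sketch. It is not the ST Pipeline's own code: the function names (`count_unique_umis`, `to_matrix`), the toy barcodes and the in-memory tuples are all hypothetical, and UMIs are compared by exact match only, which is a simplification and may differ from how the pipeline actually decides that two UMIs are the same molecule.

```python
from collections import defaultdict


def count_unique_umis(reads):
    """Count distinct UMIs per (barcode, gene).

    `reads` is an iterable of (barcode, gene, umi) tuples, standing in
    for reads that were mapped, demultiplexed and annotated.
    Assumption: two UMIs are the same molecule only if they are identical.
    """
    umis = defaultdict(set)
    for barcode, gene, umi in reads:
        umis[(barcode, gene)].add(umi)
    return {key: len(seen) for key, seen in umis.items()}


def to_matrix(counts, spot_of_barcode):
    """Arrange counts as a matrix with spots as rows and genes as columns.

    `spot_of_barcode` maps each barcode to its (x, y) array coordinate,
    mirroring the BARCODE, X, Y columns of the barcodes file.
    """
    genes = sorted({gene for _, gene in counts})
    spots = sorted({spot_of_barcode[barcode] for barcode, _ in counts})
    row_of = {spot: i for i, spot in enumerate(spots)}
    col_of = {gene: j for j, gene in enumerate(genes)}
    matrix = [[0] * len(genes) for _ in spots]
    for (barcode, gene), n in counts.items():
        matrix[row_of[spot_of_barcode[barcode]]][col_of[gene]] = n
    return spots, genes, matrix


# Toy data (hypothetical barcodes, genes and UMIs).
barcodes = {"AAAC": (1, 1), "GGGT": (1, 2)}
reads = [
    ("AAAC", "GeneA", "UMI1"),
    ("AAAC", "GeneA", "UMI1"),  # duplicate molecule, counted once
    ("AAAC", "GeneA", "UMI2"),
    ("AAAC", "GeneB", "UMI3"),
    ("GGGT", "GeneB", "UMI1"),
]

counts = count_unique_umis(reads)
spots, genes, matrix = to_matrix(counts, barcodes)
for spot, row in zip(spots, matrix):
    print(spot, dict(zip(genes, row)))
```

Keeping a set of UMIs per (barcode, gene) pair means duplicated reads of the same molecule collapse to a single count, which is the idea behind "only unique molecules (UMIs) are kept".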