Add modules and subworkflows for single-cell DNA processing. #79
base: dev
Performing merge for first release
Post-release DOI
1.1.0 Updates
Release 1.2.0
…lt. Now using chopper
…pt the fastq file
…ode_formats. Also change dlogic for whitelist gunzipping
| The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. | ||
| TODO: Should here be added which output is cDNA/DNA specific? |
I think that makes sense, to put CDNA-ONLY, DNA-ONLY, or SHARED, or something to that effect in the sections and maybe put a blurb at the top that there are notes about which steps are type-specific.
Agreed, maybe easiest to indeed just have * = cDNA specific output, ** = DNA specific output or something. Will add something coming week and tag you once I finish the output.md
docs/output.md
Outdated
| The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps: | ||
| - [Preprocessing](#preprocessing) | ||
| - [Nanofilt](#nanofilt) - Read Quality Filtering and Trimming |
We'll need to get rid of the Nanofilt section and replace it with Chopper
Looking at this, is the bit under output files correct anyways? We do not publish any of these files so should we include this? Will also need to adjust some other things in the output.md, will put it on my TODO.
docs/usage.md
Outdated
| Please note that while the above command specifies both transcriptome and genome fasta files, only one is needed for the pipeline and is dependent on which quantifier you wish to use. Furthermore, if you have any DNA samples, the `genome_fasta` is required. | ||
| Additionally, for the `quantifier` parameter in the above command, we've listed the quantifiers as a comma-delimited string. It is possible to only use one quantifier, and can be accomplished by just providing the name of the quantifying tool you wish to run as a single value, i.e. providing `oarfish` if you only wish to run `oarfish`. | ||
| The pipeline supports barcode identification and extraction through both `flexiplex` and `blaze` and can be set through `demux_tool` parameter. The barcode format can be specified through the `barcode_format` parameter. When working with completely custom barcode structures, you can additionally specify these with `custom_flexiplex_barcode_dna` and `custom_flexiplex_barcode_cdna` parameters. Note: ensure that you are using `flexiplex` as the barcode calling tool. This can be a string formatted as follows `"-x CTACACGACGCTCTTCCGATCT -b ???????????????? -u ?????????? -x TTTCTTATATGGG -f 8 -e 2"`, for more information check the documentation: https://davidsongroup.github.io/flexiplex/ |
We'll want to put a couple sentences to more explicitly state that dna sample_types only support flexiplex as a barcode caller.
Actually related: is this because BLAZE can't be used on the dna sample types?
True, good call. And indeed BLAZE does not support DNA; it basically only supports the most commonly used 10x barcodes. Flexiplex supports anything I've encountered so far.
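A minimal config sketch of the setup discussed in this thread, assuming a custom barcode structure on DNA samples. The parameter names are the ones shown in this PR's `nextflow.config`. The flexiplex string is the example quoted from `docs/usage.md` and only stands in for whatever structure your library actually uses. The file name `custom.config` is hypothetical.

```groovy
// custom.config (hypothetical file name), passed to the run with `-c custom.config`
params {
    // docs/usage.md asks for flexiplex as the barcode caller when a custom structure is given
    demux_tool = 'flexiplex'

    // Example pattern copied from docs/usage.md; replace it with your own barcode structure
    custom_flexiplex_barcode_dna = '-x CTACACGACGCTCTTCCGATCT -b ???????????????? -u ?????????? -x TTTCTTATATGGG -f 8 -e 2'
}
```

Leaving `custom_flexiplex_barcode_cdna` unset keeps its `null` default from `nextflow.config`.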
| ## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/) | ||
| > Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311. | ||
I know you've added your flexiformatter tool, which is a public GitHub repo, so it might be worth adding the link to the repo here. We've got a couple of tools like that (e.g. pigz) that don't have a true citation to add here.
nextflow.config
Outdated
| barcode_format = null | ||
| custom_flexiplex_barcode_dna = null | ||
| custom_flexiplex_barcode_cdna = null | ||
| demux_tool = 'flexiplex' // Options: flexiplex, blaze |
I know there's only one option for dna, but it may be good to go ahead and implement a demux_tool_cdna and demux_tool_dna if there are going to be different subsets for each type. But this depends on whether BLAZE can run on the dna samples or not.
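As a rough illustration of the suggestion above, the split could look like this in `nextflow.config`. The names `demux_tool_cdna` and `demux_tool_dna` are taken from the review comment only; they are not implemented in this PR, and the defaults shown are assumptions.

```groovy
params {
    // Hypothetical parameters from the review suggestion; not part of this PR
    demux_tool_cdna = 'flexiplex' // Options: flexiplex, blaze
    demux_tool_dna  = 'flexiplex' // Options: flexiplex (BLAZE does not support DNA)
}
```

Keeping a separate DNA parameter with a single allowed value would mostly serve as documentation until a second DNA-capable barcode caller exists.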
| include { MINIMAP2_ALIGN } from '../../modules/nf-core/minimap2/align' | ||
| include { PICARD_MARKDUPLICATES } from '../../modules/nf-core/picard/markduplicates' | ||
| include { BAM_SORT_STATS_SAMTOOLS } from '../../subworkflows/nf-core/bam_sort_stats_samtools' | ||
| include { FLEXIFORMATTER } from '../../modules/local/flexiformatter' |
Is this import needed?
Great catch, import is indeed needed but I was missing the actual flexiformatter process (this will move barcode and umi tags from read name to the tags). Will adjust, thanks!
| ch_flexiplex_barcodes = Channel.empty() | ||
| // Unzip the whitelist if needed | ||
| if (whitelist.extension == "gz"){ |
Does flexiplex work with .zip files?
Nope, it does not work with .zip files. Additionally, BLAZE also works with .gzip, and I am sure that .gzip is more commonly used for these files than .zip. I also changed the input test files to match this.
Got it, I was thinking of adding a central demultiplex workflow that calls out the demux_flexiplex or demux_blaze based on the demux_tool because there's a decent bit of shared code between the two -- so wanted to check on that
Hey @ljwharbers, I've been running the pipeline using the test profile. Currently got it working using BLAZE as the barcode caller, however, using flexiplex causes an error at the flexiformatter step (this is also the error with the automated tests). It looks like it may be caused by the print statements in the flexiformatter code (at lines 37 and 42), so may be worth outputting that information directly to a log or to stderr? Here's the error for reference:
Oops looks like that one indeed slipped through the cracks when I updated
I know we talked in slack, but just to put it here as well. One of the other items we'll need is test data for the DNA portions of the pipeline (and updates to the test.configs or perhaps even new test configs/profiles), with the downsampled and normal FASTQs going here: https://github.com/nf-core/test-datasets/tree/scnanoseq
The following key changes have been made to enable both cDNA and DNA sample processing with the same pipeline.
Modules added:
- `flexiplex/discovery`, `flexiplex/filter`, `flexiplex/assign` and `flexiformatter`. The flexiplex modules are used to extract, filter and assign barcodes. `flexiformatter` is used to move the barcodes from the read name to the BAM tags.

Subworkflows added:
- `demultiplex_blaze.nf` and `demultiplex_flexiplex.nf`, where the demultiplexing steps are executed.
- `align_deduplicate_dna.nf`, which performs `minimap2` alignment followed by `picard MarkDuplicates` and `BAM_SORT_STATS_SAMTOOLS`.

Input changes:
- `type`, which can be either dna or cdna. It is an optional column and defaults to cdna to maintain compatibility with older samplesheets.

Config changes:
- The `whitelist` parameter is removed and replaced by two new parameters, `whitelist_dna` and `whitelist_cdna`, to specify a DNA and a cDNA whitelist, respectively.
- Whitelist files in `assets/` are now `.gz` instead of `.zip`, for easier compatibility with blaze/flexiplex: one simple gunzip module runs if the file ends with `.gz`.
- `demux_tool`, to specify whether to use `flexiplex` or `blaze` for demultiplexing cDNA samples.

QC changes:
- `READ_COUNTS()` is edited to be able to use flexiplex and blaze output as input.

TODO:
- Update readmes and ...?
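The config changes listed above could be exercised with a small params block like the following. This is a sketch only: the parameter names come from this PR, while the whitelist paths are hypothetical placeholders.

```groovy
params {
    // Hypothetical paths; point these at your own whitelist files
    whitelist_cdna = '/path/to/cdna_whitelist.txt.gz'
    whitelist_dna  = '/path/to/dna_whitelist.txt.gz'

    // Barcode caller; 'flexiplex' is the value shown in this PR's nextflow.config
    demux_tool = 'flexiplex' // Options: flexiplex, blaze
}
```

Samples without a `type` column in the samplesheet are treated as cdna, so older samplesheets only need `whitelist_cdna`.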
PR checklist
- Make sure your code lints (`nf-core pipelines lint`).
- Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- Check for unexpected warnings in debug mode (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).