Add modules and subworkflows for single-cell DNA processing. #79

ljwharbers · 2025-08-14T14:53:23Z

The following key changes have been made to enable both cDNA and DNA sample processing with the same pipeline.

Modules added:

The following modules have been added: flexiplex/discovery, flexiplex/filter, flexiplex/assign, and flexiformatter. The flexiplex modules extract, filter, and assign barcodes. flexiformatter moves the barcodes from the read name to the BAM tags.

Subworkflows added:

Demultiplexing has been moved into two subworkflows, demultiplex_blaze.nf and demultiplex_flexiplex.nf, where the demultiplexing steps are executed.
One new subworkflow has been added: align_deduplicate_dna.nf, which runs minimap2 alignment followed by Picard MarkDuplicates and BAM_SORT_STATS_SAMTOOLS.

Input changes:

The samplesheet now accepts an additional column, type, which can be either dna or cdna. The column is optional and defaults to cdna to maintain compatibility with older samplesheets.
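To illustrate the backward-compatible default, here is a minimal Python sketch of how an optional type column could be read. It is not the pipeline's actual samplesheet validation; the `sample` and `fastq` column names and the `read_samplesheet` helper are hypothetical and used only for illustration.

```python
import csv
import io

VALID_TYPES = {"dna", "cdna"}


def read_samplesheet(handle):
    """Return samplesheet rows, using type='cdna' when the column is absent or empty."""
    rows = []
    for row in csv.DictReader(handle):
        # Older samplesheets have no 'type' column, so row.get() returns None.
        sample_type = (row.get("type") or "cdna").strip().lower()
        if sample_type not in VALID_TYPES:
            raise ValueError(f"type must be 'dna' or 'cdna', got {sample_type!r}")
        row["type"] = sample_type
        rows.append(row)
    return rows


# Hypothetical samplesheets: one without the column, one with it.
old_style = io.StringIO("sample,fastq\nS1,s1.fastq.gz\n")
new_style = io.StringIO("sample,fastq,type\nS2,s2.fastq.gz,dna\nS3,s3.fastq.gz,\n")
print([r["type"] for r in read_samplesheet(old_style)])  # ['cdna']
print([r["type"] for r in read_samplesheet(new_style)])  # ['dna', 'cdna']
```

An older samplesheet therefore behaves exactly as before, while any value other than dna or cdna fails early.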

Config changes:

The whitelist parameter has been removed and replaced by two new parameters, whitelist_dna and whitelist_cdna, to specify a DNA and a cDNA whitelist, respectively.
Included whitelists in assets/ are now .gz instead of .zip for easier compatibility with blaze/flexiplex; a single gunzip module decompresses the whitelist if the file ends with .gz.
New config parameter: demux_tool, to specify whether to use flexiplex or blaze for demultiplexing cDNA samples.
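The whitelist handling described above can be sketched in Python as follows. This is an illustration of the logic only, not the Nextflow implementation in the PR; `select_whitelist`, `gunzip_if_needed`, and the file names are hypothetical.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def select_whitelist(sample_type, whitelist_dna, whitelist_cdna):
    """Pick the whitelist that matches the sample type ('dna' or 'cdna')."""
    return whitelist_dna if sample_type == "dna" else whitelist_cdna


def gunzip_if_needed(whitelist, out_dir):
    """Decompress the whitelist into out_dir only when it ends with .gz."""
    whitelist = Path(whitelist)
    if whitelist.suffix != ".gz":
        return whitelist
    target = Path(out_dir) / whitelist.stem  # 'x.txt.gz' -> 'x.txt'
    with gzip.open(whitelist, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical gzipped DNA whitelist containing a single barcode.
    gz_path = Path(tmp) / "whitelist_dna.txt.gz"
    with gzip.open(gz_path, "wt") as handle:
        handle.write("AAACCCAAGAAACACT\n")
    chosen = select_whitelist("dna", gz_path, Path(tmp) / "whitelist_cdna.txt.gz")
    plain = gunzip_if_needed(chosen, tmp)
    print(plain.name, plain.read_text().strip())
```

Uncompressed whitelists pass through untouched, so users who supply their own plain-text file need no extra step.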

QC changes:

The script used by READ_COUNTS() has been edited so that it accepts both flexiplex and blaze output as input.

TODO:
Update readmes and ...?


atrull314 · 2025-09-18T17:39:15Z

docs/output.md


 The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

+TODO: Should here be added which output is cDNA/DNA specific?


I think that makes sense, to put CDNA-ONLY, DNA-ONLY, or SHARED, or something to that effect in the sections and maybe put a blurb at the top that there are notes about which steps are type-specific.

Agreed, it may be easiest to just use * = cDNA-specific output and ** = DNA-specific output, or something similar. I'll add something this coming week and tag you once I finish the output.md.

atrull314 · 2025-09-18T17:40:37Z

docs/output.md

 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

 - [Preprocessing](#preprocessing)
  - [Nanofilt](#nanofilt) - Read Quality Filtering and Trimming


We'll need to get rid of the Nanofilt section and replace it with Chopper

Looking at this, is the bit under output files correct anyways? We do not publish any of these files so should we include this? Will also need to adjust some other things in the output.md, will put it on my TODO.

atrull314 · 2025-09-18T17:47:38Z

docs/usage.md

+Please note that while the above command specifies both transcriptome and genome fasta files, only one is needed for the pipeline and is dependent on which quantifier you wish to use. Furthermore, if you have any DNA samples, the `genome_fasta` is required.
 Additionally, for the `quantifier` parameter in the above command, we've listed the quantifiers as a comma-delimited string. It is possible to only use one quantifier, and can be accomplished by just providing the name of the quantifying tool you wish to run as a single value, i.e. providing `oarfish` if you only wish to run `oarfish`.

+The pipeline supports barcode identification and extraction through both `flexiplex` and `blaze` and can be set through `demux_tool` parameter. The barcode format can be specified through the `barcode_format` parameter. When working with completely custom barcode structures, you can additionally specify these with `custom_flexiplex_barcode_dna` and `custom_flexiplex_barcode_cdna` parameters. Note: ensure that you are using `flexiplex` as the barcode calling tool. This can be a string formatted as follows `"-x CTACACGACGCTCTTCCGATCT -b ???????????????? -u ?????????? -x TTTCTTATATGGG -f 8 -e 2"`, for more information check the documentation: https://davidsongroup.github.io/flexiplex/


We'll want to put a couple sentences to more explicitly state that dna sample_types only support flexiplex as a barcode caller.

Actually related, is this because BLAZE can't be used on the dna sample types?

True, good call. And indeed, BLAZE does not support DNA; it basically only supports the (most commonly used) 10x barcodes. Flexiplex supports everything I've encountered so far.
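The constraint discussed in this thread (DNA samples can only be demultiplexed with flexiplex, while cDNA samples can use either tool) could be enforced with a check along these lines. This is a minimal sketch, not code from the PR; `SUPPORTED_TOOLS` and `check_demux_tool` are hypothetical names.

```python
# Which barcode callers are valid for each sample type, as stated in the thread:
# BLAZE does not support DNA, flexiplex supports both.
SUPPORTED_TOOLS = {
    "cdna": {"flexiplex", "blaze"},
    "dna": {"flexiplex"},
}


def check_demux_tool(sample_type, demux_tool):
    """Return demux_tool if it is valid for sample_type, else raise ValueError."""
    allowed = SUPPORTED_TOOLS[sample_type]
    if demux_tool not in allowed:
        raise ValueError(
            f"demux_tool={demux_tool!r} is not supported for {sample_type} samples; "
            f"choose one of {sorted(allowed)}"
        )
    return demux_tool


print(check_demux_tool("cdna", "blaze"))
print(check_demux_tool("dna", "flexiplex"))
try:
    check_demux_tool("dna", "blaze")
except ValueError as err:
    print(err)
```

Failing at parameter validation gives a clearer message than letting BLAZE run on a DNA sample and fail later.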

atrull314 · 2025-09-18T17:53:14Z

CITATIONS.md

 ## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

 > Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.



I know you've added your flexiformatter tool, which is a public GitHub repo, so it might be worth adding the link to the repo here. We've got a couple of tools like that (e.g. pigz) that don't have a true citation to add here.

atrull314 · 2025-09-18T18:29:06Z

nextflow.config

    barcode_format                      = null
+    custom_flexiplex_barcode_dna        = null
+    custom_flexiplex_barcode_cdna       = null
+    demux_tool                          = 'flexiplex' // Options: flexiplex, blaze


I know there's only one option for dna, but it may be good to go ahead and implement a demux_tool_cdna and demux_tool_dna if there are going to be different subsets for each type. But this is dependent on whether BLAZE can run on the dna samples or not.

atrull314 · 2025-09-18T20:46:58Z

subworkflows/local/align_deduplicate_dna.nf

+include { MINIMAP2_ALIGN                          } from '../../modules/nf-core/minimap2/align'
+include { PICARD_MARKDUPLICATES                   } from '../../modules/nf-core/picard/markduplicates'
+include { BAM_SORT_STATS_SAMTOOLS                 } from '../../subworkflows/nf-core/bam_sort_stats_samtools'
+include { FLEXIFORMATTER                          } from '../../modules/local/flexiformatter'


Is this import needed?

Great catch, import is indeed needed but I was missing the actual flexiformatter process (this will move barcode and umi tags from read name to the tags). Will adjust, thanks!
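For context, the kind of transformation described here (moving barcode and UMI from the read name into tags) can be sketched in Python on a single SAM line. This is not flexiformatter's actual code. It assumes read names of the form `BARCODE_UMI#read_id` and uses the `CB` and `UB` tags; both the name layout and the tag choice are assumptions made for illustration.

```python
def move_barcode_to_tags(sam_line):
    """Strip a 'BARCODE_UMI#' prefix from the read name and append it as tags.

    Assumed read-name layout: BARCODE_UMI#read_id (UMI may be empty).
    Assumed tags: CB:Z for the barcode, UB:Z for the UMI.
    """
    line = sam_line.rstrip("\n")
    if line.startswith("@"):
        return line  # header lines are passed through unchanged
    fields = line.split("\t")
    prefix, sep, read_id = fields[0].partition("#")
    if not sep:
        return line  # no barcode prefix found, leave the record as-is
    barcode, _, umi = prefix.partition("_")
    fields[0] = read_id
    fields.append(f"CB:Z:{barcode}")
    if umi:  # barcode-only reads get no UMI tag
        fields.append(f"UB:Z:{umi}")
    return "\t".join(fields)


record = "ACGT_TTGG#read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
print(move_barcode_to_tags(record))
```

Keeping the barcode in tags rather than the read name lets downstream tools that expect tags pick it up without parsing names.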

atrull314 · 2025-09-18T20:52:01Z

subworkflows/local/demultiplex_flexiplex.nf

+        ch_flexiplex_barcodes = Channel.empty()
+
+        // Unzip the whitelist if needed
+        if (whitelist.extension == "gz"){


Does flexiplex work with .zip files?

Nope, it does not work with .zip files. Additionally, BLAZE also works with gzipped files, and I am sure that .gz is more commonly used for these files than .zip. I also changed the input test files to match this.

Got it, I was thinking of adding a central demultiplex workflow that calls out the demux_flexiplex or demux_blaze based on the demux_tool because there's a decent bit of shared code between the two -- so wanted to check on that

atrull314 · 2025-09-24T19:54:09Z

Hey @ljwharbers,

I've been running the pipeline using the test profile. Currently got it working using BLAZE as the barcode caller, however, using flexiplex causes an error at the flexiformatter step (this is also the error with the automated tests). It looks like it may be caused by the print statements in the flexiformatter code (at lines 37 and 42), so may be worth outputting that information directly to a log or to stderr?

Here's the error for reference:

Command error:
  /usr/local/lib/python3.13/site-packages/simplesam.py:23: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
    from pkg_resources import get_distribution
  [W::sam_read1_sam] Parse error at line 11
  samtools sort: truncated file. Aborting
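If the diagnosis above is right, the parse error comes from diagnostic text being mixed into the SAM stream on stdout that is piped into samtools. A minimal Python sketch of the suggested fix, keeping records on stdout and sending messages to stderr, is shown below. It is not flexiformatter's code; `emit_records` and the field-count check are hypothetical.

```python
import io
import sys


def emit_records(records, out=None, log=None):
    """Write SAM lines to `out` and all diagnostics to `log`.

    Defaults are stdout for records and stderr for messages, so a
    downstream consumer reading stdout only ever sees SAM content.
    """
    out = sys.stdout if out is None else out
    log = sys.stderr if log is None else log
    kept = 0
    for record in records:
        # Header lines start with '@'; alignments have 11 mandatory fields.
        if record.startswith("@") or record.count("\t") >= 10:
            out.write(record + "\n")
            kept += 1
        else:
            log.write(f"skipping malformed record: {record!r}\n")
    log.write(f"wrote {kept} lines\n")
    return kept


sam_out, messages = io.StringIO(), io.StringIO()
emit_records(
    ["@HD\tVN:1.6", "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII", "oops"],
    out=sam_out,
    log=messages,
)
print(sam_out.getvalue(), end="")
print(messages.getvalue(), end="", file=sys.stderr)
```

With this separation, the informational messages still reach the Nextflow task log without breaking the pipe into samtools sort.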

ljwharbers · 2025-09-25T08:19:00Z

Oops looks like that one indeed slipped through the cracks when I updated flexiformatter to work with barcode only (no umi) reads. Will fix and do the testing. I'll also work on it today and tomorrow so should be able to resolve some/most of the outstanding work still.

atrull314 · 2025-09-29T15:32:25Z

I know we talked in slack, but just to put it here as well.

One of the other items we'll need is test data for the DNA portions of the pipeline (and updates to the test.configs or perhaps even new test configs/profiles), with the downsampled and normal FASTQs going here: https://github.com/nf-core/test-datasets/tree/scnanoseq

atrull314 and others added 30 commits October 7, 2024 09:21

Merge pull request nf-core#25 from nf-core/dev

35da74a

Performing merge for first release

Merge pull request nf-core#39 from nf-core/dev

7e8db60

Post-release DOI

Merge pull request nf-core#42 from nf-core/dev

500954e

1.1.0 Updates

Merge pull request nf-core#50 from nf-core/dev

05d705a

Release 1.2.0

add "type" in samplesheet

d9c226d

Add 10X multiome whitelists

b0c993b

Wrap demultiplexing (blaze and flexiplex) in separate subworkflows

bed76ee

blaze subworkflow updates

64e76b8

add 10x multiome

8f1fbe9

add flexiplex related modules

e64e1a5

initial refactoring to include flexiplex

56f40a5

add whitelist_dna param

aed6f32

change blaze ext args

9180db1

added copper

9cd659d

replaced nanofilt with chopper

90bd00f

update flexiplex version

424ae35

add chopper and seqkit split2

c8b9368

replacing blaze with flexiplex

27029d7

fixed flexiplex for cDNA libraries

b39352a

changed to .txt.gz whitelists

99fea25

removing code that was unzipping and splitting fastq files for nanofi…

3b771ce

…lt. Now using chopper

reverted flexiplex version to 1.02.3 because of a bug that will corru…

49c2de3

…pt the fastq file

adding modules

38eae51

add dna_whitelist and cdna_whitelist params

ad24d06

adding mergebarcodefile module

399b8c8

adjusting modules.config for possible barcode formats

8a1a0d8

working version of demultiplex_flexiplex subworkflow

3bd6534

blaze and flexiplex adjustments and subworkflows integration

f7ad359

add custom flexiplex barcode format option

99fc266

minor changes for blaze to make it compatible with new potentila barc…

2b09e88

…ode_formats. Also change dlogic for whitelist gunzipping

ljwharbers and others added 7 commits August 19, 2025 11:23

fixed barcode outputs and publishdir

e9cf356

added flexiplex to output.md

7ec8211

added usage docs

f4e8d37

updated flexiplex version

3e1b237

linting

e63fc6f

Merge branch 'dev' into scdnalong

213da77

linting

edf2e2f

atrull314 mentioned this pull request Sep 9, 2025

Exposing parameters to control barcode and UMI length for datasets generated with technologies other than 10X Genomics #66

Open

atrull314 reviewed Sep 18, 2025


ljwharbers and others added 9 commits September 19, 2025 11:02

fixed mawk versioning

cfa7711

fixed merging output

d52d140

update flexiformatter version to fix cb matching

f68031c

added note to specify DNA is only compatible with flexiplex

ea939b2

added dna and cdna specific demultiplex options

52c042c

added flexiformatter in dna subworkflow

1d9df4e

fix versioning of mergebarcodecounts

a36b47f

output docs updates

99f1928

Instantiating channels and updating parameter names

4485b66

ljwharbers added 2 commits September 25, 2025 13:31

fix demux tool references and update flexiformatter

34a81a4

merge

1605c8c

fix skip trimming

27c72f5

