@@ -25,96 +25,96 @@ It is highly efficient and multi-threaded for high performance.
Usage for `fqtk demux` follows:

+ <!-- start usage -->
``` console
+
Performs sample demultiplexing on FASTQs.

- The sample barcode for each sample in the metadata TSV will be compared against
- the sample barcode bases extracted from the FASTQs, to assign each read to a
- sample. Reads that do not match any sample within the given error tolerance
- will be placed in the ``unmatched_prefix`` file.
+ The sample barcode for each sample in the metadata TSV will be compared against the sample
+ barcode bases extracted from the FASTQs, to assign each read to a sample. Reads that do not
+ match any sample within the given error tolerance will be placed in the ``unmatched_prefix``
+ file.

FASTQs and associated read structures for each sub-read should be given:

- - a single fragment read (with inline index) should have one FASTQ and one read
- structure
- - paired end reads should have two FASTQs and two read structures
- - a dual-index sample with paired end reads should have four FASTQs and four read
- structures given: two for the two index reads, and two for the template reads.
+ - a single fragment read (with inline index) should have one FASTQ and one read structure
+ - paired end reads should have two FASTQs and two read structures
+ - a dual-index sample with paired end reads should have four FASTQs and four read structures
+ given: two for the two index reads, and two for the template reads.

- If multiple FASTQs are present for each sub-read, then the FASTQs for each
- sub-read should be concatenated together prior to running this tool (e.g.
- `zcat s_R1_L001.fq.gz s_R1_L002.fq.gz | bgzip -c > s_R1.fq.gz`).
+ If multiple FASTQs are present for each sub-read, then the FASTQs for each sub-read should be
+ concatenated together prior to running this tool
+ (e.g. `zcat s_R1_L001.fq.gz s_R1_L002.fq.gz | bgzip -c > s_R1.fq.gz`).

- Read structures are made up of `<number><operator>` pairs much like the `CIGAR`
- string in BAM files. Four kinds of operators are recognized:
+ (Read structures)[<https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures>] are made up of
+ `<number><operator>` pairs much like the `CIGAR` string in BAM files.
+ Four kinds of operators are recognized:

1. `T` identifies a template read
2. `B` identifies a sample barcode read
3. `M` identifies a unique molecular index read
4. `S` identifies a set of bases that should be skipped or ignored

- The last `<number><operator>` pair may be specified using a `+` sign instead of
- number to denote "all remaining bases". This is useful if, e.g., fastqs have
- been trimmed and contain reads of varying length. Both reads must have template
- bases. Any molecular identifiers will be concatenated using the `-` delimiter
- and placed in the given SAM record tag (`RX` by default). Similarly, the sample
- barcode bases from the given read will be placed in the `BC` tag.
+ The last `<number><operator>` pair may be specified using a `+` sign instead of number to
+ denote "all remaining bases". This is useful if, e.g., fastqs have been trimmed and contain
+ reads of varying length. Both reads must have template bases. Any molecular identifiers will
+ be concatenated using the `-` delimiter and placed in the given SAM record tag (`RX` by
+ default). Similarly, the sample barcode bases from the given read will be placed in the `BC`
+ tag.

- Metadata about the samples should be given as a headered metadata TSV file with
- at least the following two columns present:
+ Metadata about the samples should be given as a headered metadata TSV file with at least the
+ following two columns present:

1. `sample_id` - the id of the sample or library.
2. `barcode` - the expected barcode sequence associated with the `sample_id`.

- For reads containing multiple barcodes (such as dual-indexed reads), all barcodes
- should be concatenated together in the order they are read and stored in the
- `barcode` field.
+ For reads containing multiple barcodes (such as dual-indexed reads), all barcodes should be
+ concatenated together in the order they are read and stored in the `barcode` field.

- The read structures will be used to extract the observed sample barcode, template
- bases, and molecular identifiers from each read. The observed sample barcode
- will be matched to the sample barcodes extracted from the bases in the sample
- metadata and associated read structures.
+ The read structures will be used to extract the observed sample barcode, template bases, and
+ molecular identifiers from each read. The observed sample barcode will be matched to the
+ sample barcodes extracted from the bases in the sample metadata and associated read structures.

An observed barcode matches an expected barcode if all the following are true:
-
- 1. The number of mismatches (edits/substitutions) is less than or equal to the
- maximum mismatches (see --max-mismatches).
- 2. The difference between number of mismatches in the best and second best
- barcodes is greater than or equal to the minimum mismatch delta
- (`--min-mismatch-delta`). The expected barcode sequence may contains Ns,
- which are not counted as mismatches regardless of the observed base (e.g.
- the expected barcode `AAN` will have zero mismatches relative to both the
- observed barcodes `AAA` and `AAN`).
+ 1. The number of mismatches (edits/substitutions) is less than or equal to the maximum
+ mismatches (see `--max-mismatches`).
+ 2. The difference between number of mismatches in the best and second best barcodes is greater
+ than or equal to the minimum mismatch delta (`--min-mismatch-delta`).
+ The expected barcode sequence may contains Ns, which are not counted as mismatches regardless
+ of the observed base (e.g. the expected barcode `AAN` will have zero mismatches relative to
+ both the observed barcodes `AAA` and `AAN`).

## Outputs

- All outputs are generated in the provided `--output` directory. For each sample
- plus the unmatched reads, FASTQ files are written for each read segment
- (specified in the read structures) of one of the types supplied to
- `--output-types`.
-
- FASTQ files have names of the format:
+ All outputs are generated in the provided `--output` directory. For each sample plus the
+ unmatched reads, FASTQ files are written for each read segment (specified in the read
+ structures) of one of the types supplied to `--output-types`. FASTQ files have names
+ of the format:

+ ```bash
{sample_id}.{segment_type}{read_num}.fq.gz
+ ```

- where `segment_type` is one of `R`, `I`, and `U` (for template, barcode/index
- and molecular barcode/UMI reads respectively) and `read_num` is a number starting
- at 1 for each segment type.
+ where `segment_type` is one of `R`, `I`, and `U` (for template, barcode/index and molecular
+ barcode/UMI reads respectively) and `read_num` is a number starting at 1 for each segment
+ type.

- In addition a `demux-metrics.txt` file is written that is a tab-delimited file
- with counts of how many reads were assigned to each sample and derived metrics.
+ In addition a `demux-metrics.txt` file is written that is a tab-delimited file with counts
+ of how many reads were assigned to each sample and derived metrics.

## Example Command Line

- As an example, if the sequencing run was 2x100bp (paired end) with two 8bp index
- reads both reading a sample barcode, as well as an in-line 8bp sample barcode in
- read one, the command line would be:
+ As an example, if the sequencing run was 2x100bp (paired end) with two 8bp index reads both
+ reading a sample barcode, as well as an in-line 8bp sample barcode in read one, the command
+ line would be:

+ ``` bash
fqtk demux \
- --inputs r1.fq.gz i1.fq.gz i2.fq.gz r2.fq.gz \
- --read-structures 8B92T 8B 8B 100T \
- --sample-metadata metadata.tsv \
- --output output_folder
+ --inputs r1.fq.gz i1.fq.gz i2.fq.gz r2.fq.gz \
+ --read-structures 8B92T 8B 8B 100T \
+ --sample-metadata metadata.tsv \
+ --output output_folder
+ ```

Usage: fqtk demux [OPTIONS] --inputs <INPUTS>... --read-structures <READ_STRUCTURES>... --sample-metadata <SAMPLE_METADATA> --output <OUTPUT>

@@ -126,8 +126,7 @@ Options:
The read structures, one per input FASTQ in the same order

-b, --output-types <OUTPUT_TYPES>...
- The read structure types to write to their own files (Must be one of T, B,
- or M for template reads, sample barcode reads, and molecular barcode reads)
+ The read structure types to write to their own files (Must be one of T, B, or M for template reads, sample barcode reads, and molecular barcode reads).

Multiple output types may be specified as a space-delimited list.

@@ -150,8 +149,7 @@ Options:
[default: 1]

-d, --min-mismatch-delta <MIN_MISMATCH_DELTA>
- Minimum difference between number of mismatches in the best and second best barcodes
- for a barcode to be considered a match
+ Minimum difference between number of mismatches in the best and second best barcodes for a barcode to be considered a match

[default: 2]

@@ -168,16 +166,15 @@ Options:
-S, --skip-reasons <SKIP_REASONS>
Skip demultiplexing reads for any of the following reasons, otherwise panic.

- 1. `too-few-bases`: there are too few bases or qualities to extract given the
- read structures. For example, if a read is 8bp long but the read structure
- is `10B`, or if a read is empty and the read structure is `+T`.
+ 1. `too-few-bases`: there are too few bases or qualities to extract given the read structures. For example, if a read is 8bp long but the read structure is `10B`, or if a read is empty and the read structure is `+T`.

-h, --help
Print help information (use `-h` for a summary)

-V, --version
Print version information
```
+ <!-- end usage -->
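The read-structure rules described in the usage text can be illustrated with a small Python sketch. This is not fqtk's implementation; `parse_read_structure` and `extract_segments` are hypothetical names chosen for illustration. The sketch assumes that a trailing `+` must consume at least one base, which is consistent with the `too-few-bases` description of an empty read with the read structure `+T`.

```python
import re

# One <number><operator> pair; "+" may stand in for the number.
_SEGMENT = re.compile(r"(\d+|\+)([TBMS])")


def parse_read_structure(structure):
    """Split e.g. "8B92T" into [(8, "B"), (92, "T")]; "+" becomes a length of None."""
    segments = []
    pos = 0
    for match in _SEGMENT.finditer(structure):
        if match.start() != pos:
            raise ValueError(f"unexpected text in read structure {structure!r}")
        length, op = match.groups()
        segments.append((None if length == "+" else int(length), op))
        pos = match.end()
    if pos != len(structure) or not segments:
        raise ValueError(f"invalid read structure {structure!r}")
    if any(length is None for length, _ in segments[:-1]):
        raise ValueError("'+' is only allowed in the last pair")
    return segments


def extract_segments(bases, structure):
    """Group the bases of one read by operator (T, B, M, S)."""
    out = {"T": [], "B": [], "M": [], "S": []}
    offset = 0
    for length, op in parse_read_structure(structure):
        if length is None:
            # Assumption: "+" takes all remaining bases and needs at least one.
            end = max(len(bases), offset + 1)
        else:
            end = offset + length
        if end > len(bases):
            raise ValueError("too few bases for the read structure")
        out[op].append(bases[offset:end])
        offset = end
    return out
```

With the read structure `8B92T` from the example command line, a 100-base read one would yield 8 sample barcode bases followed by 92 template bases.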
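The two barcode matching conditions from the usage text can also be sketched in Python. Again, this is an illustration under stated assumptions rather than fqtk's code: `count_mismatches` and `assign_sample` are hypothetical names, the default of 2 for the minimum mismatch delta comes from the option listing, and the default of 1 for the maximum mismatches is assumed. The sketch also assumes that an observed `N` opposite a non-`N` expected base counts as a mismatch, and that a lone sample has no second-best barcode to compete with.

```python
def count_mismatches(observed, expected):
    """Count positions that differ; an N in the expected barcode never counts."""
    if len(observed) != len(expected):
        raise ValueError("barcodes must have the same length")
    return sum(1 for o, e in zip(observed, expected) if e != "N" and o != e)


def assign_sample(observed, barcodes, max_mismatches=1, min_mismatch_delta=2):
    """Return the matching sample_id, or None if the read is unmatched.

    `barcodes` maps sample_id -> expected barcode (hypothetical input shape).
    """
    scored = sorted(
        (count_mismatches(observed, expected), sample_id)
        for sample_id, expected in barcodes.items()
    )
    if not scored:
        return None
    best, best_id = scored[0]
    # Assumption: with a single sample the delta condition is always satisfied.
    second = scored[1][0] if len(scored) > 1 else float("inf")
    if best <= max_mismatches and second - best >= min_mismatch_delta:
        return best_id
    return None
```

For example, with expected barcodes `ACGTACGT` and `TTTTACGT`, the observed barcode `ACGTACGA` has one mismatch to the first and four to the second, so both conditions hold and the read is assigned to the first sample.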
## Installing