@@ -198,9 +198,9 @@ receive a different suffix when the name is assigned.
CHIIMP breaks the genotyping process into two parts. First a sample file is
de-replicated and a table of unique sequences is created, with no filtering yet
applied. Second the table is filtered to just candidate allele sequences, and
- up to sequences are reported as the genotype. Both the per-sample table and the
- final genotypes are saved in the final output, as spreadsheets in the
- `processed-samples` directory and as the `summary.csv` spreadsheet.
+ up to two sequences are reported as the genotype. Both the per-sequence table
+ and the final genotypes are saved in the final output, as spreadsheets in the
+ `processed-files` directory and as the `summary.csv` spreadsheet.
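
As a rough illustration of the de-replication step described above, the following Python sketch collapses a list of reads into a table of unique sequences with read counts. It is not the package's R code: the function name `dereplicate` and the field names `Seq`, `Count`, and `Length` are assumptions made for this example.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse reads into unique sequences with counts, most abundant first.

    Hypothetical helper; the field names are assumptions for illustration.
    """
    counts = Counter(reads)
    # Sort by descending read count, breaking ties alphabetically by sequence
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"Seq": seq, "Count": n, "Length": len(seq)} for seq, n in ordered]


reads = ["ACGT", "ACGT", "ACGA", "ACGT", "ACGA", "TTTT"]
table = dereplicate(reads)
```

No filtering happens here; every distinct sequence keeps a row, which matches the "no filtering yet applied" description of the first stage.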

### Sample Processing

@@ -213,8 +213,9 @@ locus attributes described above. First each locus' forward primer is compared
with the sequence and the matching locus name is stored in a MatchingLocus
column. The sequence is then checked for several tandem repeats of the motif
for that locus, and compared to the length range expected for that locus.
- TRUE/FALSE values for these are stored in MotifMatch and LengthMatch columns
- respectively.
+ TRUE/FALSE values for these are stored in MotifMatch and LengthMatch columns
+ respectively. The Ambiguous column marks any sequences containing bases outside
+ of A, C, T, and G (such as N).
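
The Ambiguous check described above amounts to a simple per-sequence test. A minimal Python sketch follows, assuming the test is case-insensitive (the text does not say); the function name `is_ambiguous` is hypothetical and this is not the package's R code.

```python
def is_ambiguous(seq):
    """Return True if seq contains any character other than A, C, G, or T.

    Upper-casing first is an assumption made for this example.
    """
    return any(base not in "ACGT" for base in seq.upper())


flags = [is_ambiguous(s) for s in ["TAGATAGA", "TAGANAGA", "tagataga"]]
```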

PCR artifacts can obscure real allele sequences with incorrect sequences. There
are extra filters to attempt to remove these if possible or highlight cases that
@@ -239,7 +240,7 @@ Lastly, the ratio of read counts for each sequence to the total reads in the
sample and the reads with the same MatchingLocus value is stored in
FractionOfTotal and FractionOfLocus columns respectively.
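
To make the two ratios concrete, here is a small Python sketch that fills in FractionOfTotal and FractionOfLocus for a table of unique sequences. The column names MatchingLocus, FractionOfTotal, and FractionOfLocus come from the text; the `Count` field, the use of `None` for sequences matching no locus, and the function name `add_fractions` are assumptions for this example.

```python
def add_fractions(table):
    """Add FractionOfTotal and FractionOfLocus to each row (hypothetical helper)."""
    total = sum(row["Count"] for row in table)
    # Total reads per MatchingLocus value, skipping sequences with no match
    locus_totals = {}
    for row in table:
        locus = row["MatchingLocus"]
        if locus is not None:
            locus_totals[locus] = locus_totals.get(locus, 0) + row["Count"]
    for row in table:
        locus = row["MatchingLocus"]
        row["FractionOfTotal"] = row["Count"] / total if total else 0.0
        # Assumption: rows with no matching locus get no locus fraction
        row["FractionOfLocus"] = (
            row["Count"] / locus_totals[locus] if locus is not None else None
        )
    return table


rows = add_fractions([
    {"Seq": "TAGATAGATAGA", "Count": 60, "MatchingLocus": "A"},
    {"Seq": "TAGATAGA", "Count": 20, "MatchingLocus": "A"},
    {"Seq": "GGGGCCCC", "Count": 20, "MatchingLocus": None},
])
```

In this toy table the first sequence has 60 of 100 total reads but 60 of the 80 reads assigned to locus A, so the two fractions differ.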

- This is the `analyze_sample` function in the R package.
+ This is the `analyze_seqs` function in the R package.

### Genotype Calling

@@ -254,31 +255,36 @@ LengthMatch columns). If the resulting total read count is below a minimum
value (by default `r config.defaults$sample_summary$counts.min`, customizable
via the `sample_summary: counts.min` setting) no genotyping will be attempted.
Next only those sequences accounting for at least a minimum fraction of the
- remaining reads are kept. (The default value is
- `r config.defaults$sample_summary$fraction.min`. This can be changed via the
- `sample_summary: fraction.min` setting.) Sequences that are marked as potential
- stutter or other artifacts (via the Stutter and Artifact columns of the table)
- are removed next.
+ remaining reads are considered. (The default value is
+ `r config.defaults$sample_analysis$fraction.min`. This can be changed via the
+ `sample_analysis: fraction.min` setting.) Sequences that are marked as
+ potential stutter or other artifacts (via the Stutter and Artifact columns of
+ the table) or contain ambiguous sequence content (via the Ambiguous column) are
+ excluded next.

After these filters are applied, the top one or two remaining sequences are
- reported as the alleles. (If only one sequence remains, the sample is labeled
- homozygous; if two or more, heterozygous.) The details kept are:
+ labeled as the alleles. (If only one sequence remains, the sample is labeled
+ homozygous; if two or more, heterozygous.) The final details kept for each
+ sample are:

* the sequence content, length, and counts for the one or two alleles
* the zygosity of the sample
+ * whether the ambiguous-sequence filter removed a potential allele
* whether the stutter and/or artifact filter removed a potential allele
* The read counts of the entire sample before any filtering
* The read counts of just those sequences matching the locus primer, motif, and
length range
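
The filtering order and the summary fields listed above can be sketched in Python as follows. This is an illustration under several assumptions, not the package's R implementation: the function name `call_genotype`, the `Count` and `Seq` fields, the output keys, and the treatment of Stutter, Artifact, and Ambiguous as simple true/false flags are all hypothetical, and the threshold values passed in the example are placeholders rather than the package defaults. The "filter removed a potential allele" flags are simplified to "any sequence above the fraction threshold was excluded by that filter".

```python
def call_genotype(table, locus, counts_min, fraction_min):
    """Sketch of genotype calling for one sample and locus (hypothetical helper)."""
    # Keep only sequences matching the locus primer, motif, and length range
    candidates = [
        r for r in table
        if r["MatchingLocus"] == locus and r["MotifMatch"] and r["LengthMatch"]
    ]
    count_locus = sum(r["Count"] for r in candidates)
    result = {
        "Alleles": [],
        "Zygosity": None,
        "AmbiguousRemoved": False,
        "StutterOrArtifactRemoved": False,
        "CountTotal": sum(r["Count"] for r in table),
        "CountLocus": count_locus,
    }
    # Too few reads: no genotyping is attempted
    if count_locus == 0 or count_locus < counts_min:
        return result
    # Consider only sequences above the minimum fraction of remaining reads
    kept = [r for r in candidates if r["Count"] / count_locus >= fraction_min]
    kept.sort(key=lambda r: -r["Count"])
    # Exclude flagged sequences, recording which filter removed something
    clean = []
    for r in kept:
        if r.get("Ambiguous"):
            result["AmbiguousRemoved"] = True
        elif r.get("Stutter") or r.get("Artifact"):
            result["StutterOrArtifactRemoved"] = True
        else:
            clean.append(r)
    # Top one or two remaining sequences are the alleles
    result["Alleles"] = [
        {"Seq": r["Seq"], "Length": len(r["Seq"]), "Count": r["Count"]}
        for r in clean[:2]
    ]
    if len(clean) == 1:
        result["Zygosity"] = "Homozygous"
    elif len(clean) >= 2:
        result["Zygosity"] = "Heterozygous"
    return result


def row(seq, count, locus="A", **flags):
    """Build a table row with MotifMatch and LengthMatch set to True."""
    base = {"Seq": seq, "Count": count, "MatchingLocus": locus,
            "MotifMatch": True, "LengthMatch": True}
    base.update(flags)
    return base


sample = [
    row("TAGATAGATAGA", 500),
    row("TAGATAGATAGATAGA", 400),
    row("TAGATAGA", 90, Stutter=True),
    row("TAGANAGATAGA", 60, Ambiguous=True),
    row("TAGATAGATAGG", 5),
    row("GGCCGGCC", 200, locus="B"),
]
# Placeholder thresholds chosen for this example only
genotype = call_genotype(sample, "A", counts_min=500, fraction_min=0.05)
```

In this example the two most abundant clean sequences are called as alleles, while the stutter-flagged and ambiguous sequences are excluded and recorded in the corresponding flags.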

- This is the `summarize_sample` function in the R package.
+ These tasks (the filtering and categorizing of each sequence in the table and
+ the short genotype summary) are the `analyze_sample` and `summarize_sample`
+ functions in the R package.

### Summary and Reporting

The genotype and details identified in the previous step for each sample are
aggregated into a spreadsheet with a row for each sample. This summary
- spreadsheet and the more detailed per-sample tables are all saved in the final
- output.
+ spreadsheet and the more detailed per-file and per-sample tables are all saved
+ in the final output.

For inter-sample comparisons, the alleles identified across samples for each
locus are aligned to one another. The genotypes for each sample are clustered
@@ -358,11 +364,16 @@ A the end of an analysis CHIIMP creates a directory of files with all results.
dataset spreadsheet including locus, replicate, and sample identifiers, the
sequences, sequence lengths, and counts of the identified allele(s), and
several additional attributes.
- * `processed-samples`: directory of spreadsheets for each sample. Each
+ * `processed-files`: directory of spreadsheets for each input data file. Each
spreadsheet contains one unique sequence per row with attributes on columns.
- These represent the intermediate data CHIIMP uses to call a genotype for each
- sample, and each spreadsheet here corresponds to a single row in the
- `summary.csv` file.
+ At this stage no filtering for sample/locus-specific attributes has been
+ applied. (This is particularly relevant for sequencer-multiplexed samples as
+ one input data file may contain data for multiple samples.)
+ * `processed-samples`: directory of spreadsheets for each sample. As for
+ `processed-files`, each spreadsheet contains one unique sequence per row with
+ attributes on columns. These represent the intermediate sample-specific data
+ CHIIMP uses to call a genotype for each sample, and each spreadsheet here
+ corresponds to a single row in the `summary.csv` file.
* `histograms`: directory of counts-versus-length histograms for each sample.
Counts are tallied on a by-sequence basis rather than by-length for alleles, so
the bars for called alleles (in red) are generally shorter than the bars for