Commit b0217d4

Merge branch 'release-0.2.0'
2 parents c79ec30 + bc4a97e commit b0217d4

67 files changed: +2844 -1571 lines changed

DESCRIPTION

Lines changed: 1 addition & 3 deletions
@@ -1,6 +1,6 @@
 Package: chiimp
 Title: Computational, High-throughput Individual Identification through Microsatellite Profiling
-Version: 0.1.0
+Version: 0.2.0
 Authors@R: person("Jesse", "Connell", email = "[email protected]", role = c("aut", "cre"))
 Description: An R package to analyze microsatellites in high-throughput sequencing datasets.
 Depends: R (>= 3.2.3)
@@ -11,12 +11,10 @@ Imports:
 argparser (>= 0.4),
 dnaplotr (>= 0.1),
 dnar (>= 0.1),
-dplyr (>= 0.7.4),
 graphics (>= 3.2.3),
 grDevices (>= 3.2.3),
 kableExtra (>= 0.2.1.9000),
 knitr (>= 1.16),
-magrittr (>= 1.5),
 methods (>= 3.2.3),
 msa (>= 1.2.1),
 openssl (>= 0.9.6),

GUIDE.Rmd

Lines changed: 31 additions & 20 deletions
@@ -198,9 +198,9 @@ receive a different suffix when the name is assigned.
 CHIIMP breaks the genotyping process into two parts. First a sample file is
 de-replicated and a table of unique sequences is created, with no filtering yet
 applied. Second the table is filtered to just candidate allele sequences, and
-up to sequences are reported as the genotype. Both the per-sample table and the
-final genotypes are saved in the final output, as spreadsheets in the
-`processed-samples` directory and as the `summary.csv` spreadsheet.
+up to two sequences are reported as the genotype. Both the per-sequence table
+and the final genotypes are saved in the final output, as spreadsheets in the
+`processed-files` directory and as the `summary.csv` spreadsheet.

 ### Sample Processing

@@ -213,8 +213,9 @@ locus attributes described above. First each locus' forward primer is compared
 with the sequence and the matching locus name is stored in a MatchingLocus
 column. The sequence is then checked for several tandem repeats of the motif
 for that locus, and compared to the length range expected for that locus.
-TRUE/FALSE values for these are stored in MotifMatch and LengthMatch columns
-respectively.
+TRUE/FALSE values for these are stored in MotifMatch and LengthMatch columns
+respectively. The Ambiguous column marks any sequences containing bases outside
+of A, C, T, and G (such as N).

 PCR artifacts can obscure real allele sequences with incorrect sequences. There
 are extra filters to attempt to remove these if possible or highlight cases that
@@ -239,7 +240,7 @@ Lastly, the ratio of read counts for each sequence to the total reads in the
 sample and the reads with the same MatchingLocus value is stored in
 FractionOfTotal and FractionOfLocus columns respectively.

-This is the `analyze_sample` function in the R package.
+This is the `analyze_seqs` function in the R package.

 ### Genotype Calling

@@ -254,31 +255,36 @@ LengthMatch columns). If the resulting total read count is below a minimum
 value (by default `r config.defaults$sample_summary$counts.min`, customizable
 via the `sample_summary: counts.min` setting) no genotyping will be attempted.
 Next only those sequences accounting for at least a minimum fraction of the
-remaining reads are kept. (The default value is
-`r config.defaults$sample_summary$fraction.min`. This can be changed via the
-`sample_summary: fraction.min` setting.) Sequences that are marked as potential
-stutter or other artifacts (via the Stutter and Artifact columns of the table)
-are removed next.
+remaining reads are considered. (The default value is
+`r config.defaults$sample_analysis$fraction.min`. This can be changed via the
+`sample_analysis: fraction.min` setting.) Sequences that are marked as
+potential stutter or other artifacts (via the Stutter and Artifact columns of
+the table) or contain ambiguous sequence content (via the Ambiguous column) are
+excluded next.

 After these filters are applied, the top one or two remaining sequences are
-reported as the alleles. (If only one sequence remains, the sample is labeled
-homozygous; if two or more, heterozygous.) The details kept are:
+labeled as the alleles. (If only one sequence remains, the sample is labeled
+homozygous; if two or more, heterozygous.) The final details kept for each
+sample are:

 * the sequence content, length, and counts for the one or two alleles
 * the zygosity of the sample
+* whether the ambiguous-sequence filter removed a potential allele
 * whether the stutter and/or artifact filter removed a potential allele
 * The read counts of the entire sample before any filtering
 * The read counts of just those sequences matching the locus primer, motif, and
 length range

-This is the `summarize_sample` function in the R package.
+These tasks (the filtering and categorizing of each sequence in the table and
+the short genotype summary) are the `analyze_sample` and `summarize_sample`
+functions in the R package.

 ### Summary and Reporting

 The genotype and details identified in the previous step for each sample are
 aggregated into a spreadsheet with a row for each sample. This summary
-spreadsheet and the more detailed per-sample tables are all saved in the final
-output.
+spreadsheet and the more detailed per-file and per-sample tables are all saved
+in the final output.

 For inter-sample comparisons, the alleles identified across samples for each
 locus are aligned to one another. The genotypes for each sample are clustered
@@ -358,11 +364,16 @@ A the end of an analysis CHIIMP creates a directory of files with all results.
 dataset spreadsheet including locus, replicate, and sample identifiers, the
 sequences, sequence lengths, and counts of the identified allele(s), and
 several additional attributes.
-* `processed-samples`: directory of spreadsheets for each sample. Each
+* `processed-files`: directory of spreadsheets for each input data file. Each
 spreadsheet contains one unique sequence per row with attributes on columns.
-These represent the intermediate data CHIIMP uses to call a genotype for each
-sample, and each spreadsheet here corresponds to a single row in the
-`summary.csv` file.
+At this stage no filtering for sample/locus-specific attributes has been
+applied. (This is particularly relevant for sequencer-multiplexed samples as
+one input data file may contain data for multiple samples.)
+* `processed-samples`: directory of spreadsheets for each sample. As for
+`processed-files`, each spreadsheet contains one unique sequence per row with
+attributes on columns. These represent the intermediate sample-specific data
+CHIIMP uses to call a genotype for each sample, and each spreadsheet here
+corresponds to a single row in the `summary.csv` file.
 * `histograms`: directory of counts-versus-length histograms for each sample.
 Counts are tallied on a by-sequence basis rather than by-length for alleles, so
 the bars for called alleles (in red) are generally shorter than the bars for
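
Note on the GUIDE.Rmd changes: the genotype-calling steps described in the revised text can be sketched roughly as below. This is an illustrative Python sketch, not the package's R code. The names `SeqRow`, `is_ambiguous`, and `call_genotype` are hypothetical, the thresholds are required arguments because the guide's defaults are not shown in this diff, and the rule for when a filter counts as having "removed a potential allele" (an excluded sequence ranked above the second kept allele) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SeqRow:
    """One unique sequence from a sample table (hypothetical structure)."""
    seq: str
    count: int
    matches_locus: bool = True  # primer, motif, and length range all match
    stutter: bool = False
    artifact: bool = False


def is_ambiguous(seq):
    """True if the sequence has any base outside A, C, T, and G (such as N)."""
    return any(base not in "ACTG" for base in seq.upper())


def call_genotype(rows, counts_min, fraction_min):
    """Filter a per-sample table and summarize up to two alleles."""
    locus_rows = [r for r in rows if r.matches_locus]
    count_locus = sum(r.count for r in locus_rows)
    summary = {
        "alleles": [],
        "zygosity": None,
        "ambiguous": False,
        "stutter": False,
        "artifact": False,
        "count_total": sum(r.count for r in rows),
        "count_locus": count_locus,
    }
    # Too few matching reads: no genotyping is attempted.
    if count_locus < counts_min:
        return summary
    # Consider only sequences above the minimum fraction of matching reads.
    candidates = [r for r in locus_rows if r.count / count_locus >= fraction_min]
    candidates.sort(key=lambda r: r.count, reverse=True)
    kept = []
    for row in candidates:
        if len(kept) == 2:
            break
        # Excluded sequences are recorded as flags rather than silently dropped.
        if is_ambiguous(row.seq):
            summary["ambiguous"] = True
        elif row.stutter:
            summary["stutter"] = True
        elif row.artifact:
            summary["artifact"] = True
        else:
            kept.append(row)
    summary["alleles"] = [(r.seq, len(r.seq), r.count) for r in kept]
    if len(kept) == 1:
        summary["zygosity"] = "homozygous"
    elif len(kept) == 2:
        summary["zygosity"] = "heterozygous"
    return summary
```

Because each flag is set independently, any combination of the ambiguous, stutter, and artifact entries can be true for one sample, which is the behavior the NEWS.md entry for `summarize_sample` describes.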

NAMESPACE

Lines changed: 8 additions & 4 deletions
@@ -3,12 +3,14 @@
 export(align_alleles)
 export(analyze_dataset)
 export(analyze_sample)
+export(analyze_sample_guided)
+export(analyze_sample_naive)
+export(analyze_seqs)
 export(calc_genotype_distance)
 export(config.defaults)
 export(find_closest_matches)
 export(full_analysis)
 export(histogram)
-export(histogram2)
 export(load_allele_names)
 export(load_config)
 export(load_dataset)
@@ -30,18 +32,20 @@ export(plot_heatmap_stutter)
 export(prepare_dataset)
 export(report_genotypes)
 export(report_idents)
+export(sample_analysis_funcs)
+export(sample_summary_funcs)
 export(save_alignment_images)
 export(save_alignments)
 export(save_allele_seqs)
+export(save_dataset)
 export(save_dist_mat)
 export(save_histograms)
 export(save_results_summary)
 export(save_sample_data)
+export(save_seqfile_data)
 export(summarize_attribute)
 export(summarize_dataset)
-export(summarize_genotypes)
-export(summarize_genotypes_known)
 export(summarize_sample)
 export(summarize_sample_guided)
+export(tabulate_allele_names)
 export(tally_cts_per_locus)
-importFrom(magrittr,"%>%")

NEWS.md

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+# chiimp 0.2.0
+
+* Restructured code to avoid analyzing multiplexed samples more than once ([#3]).
+* Reorganized output into per-sequence-file (full) and per-sample (filtered)
+sections ([#5]).
+* Added a new saving function, `save_seqfile_data`, to save a directory tree
+of per-sequence-file output files starting from the first shared directory
+in the input file paths.
+* Moved functionality from `analyze_sample` into a new `analyze_seqs`
+function to be used as a separate step (enabling shared processing between
+multiplexed samples in a single data file for [#3]).
+* Split data list from `analyze_dataset` output into two separate lists
+called files and samples.
+* Added sequence name matching to `analyze_dataset`. The summary data frame
+now has Allele1Name and Allele2Name columns, and the sample data frames a
+SeqName column, matching any sequence recognized as a called allele from any
+sample in the current analysis (or a previous analysis if the allele table
+is provided).
+* Improved `histogram` function to recognize more categories of unique
+sequences including sequences called as alleles elsewhere, and return the
+counts-by-length data by category in a list.
+* Removed `histogram2` to centralize functionality in `histogram`.
+* Fixed bugs causing failure of report generation for completely blank
+dataset analysis results ([#7]).
+* Removed `summarize_sample_by_length` function.
+* Clarified behavior of `summarize_sample` functions to allow any combination
+of TRUE/FALSE values in the Ambiguous/Stutter/Artifact entries. Previously
+only the first (highest-count) case would be flagged.
+* Added features to track and filter ambiguous sequences ([#4]).
+* Added column named Ambiguous to `analyze_sample` output to flag sequences
+with non-ACTG characters.
+* Added entry named Ambiguous to `summarize_sample` output to track
+filtering of sequences with non-ACTG characters.
+
+[#7]: https://github.com/ShawHahnLab/chiimp/issues/7
+[#5]: https://github.com/ShawHahnLab/chiimp/issues/5
+[#4]: https://github.com/ShawHahnLab/chiimp/issues/4
+[#3]: https://github.com/ShawHahnLab/chiimp/issues/3
+
+# chiimp 0.1.0
+
+* Initial release
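
Note on the `save_seqfile_data` entry in NEWS.md: one way to read "starting from the first shared directory in the input file paths" is that output files mirror the input layout below the deepest directory common to all inputs. The Python sketch below illustrates that reading only. The function name `seqfile_output_paths`, the `.csv` suffix, and the interpretation of "first shared directory" are assumptions and are not taken from the package source.

```python
import posixpath


def seqfile_output_paths(input_paths, out_dir, suffix=".csv"):
    """Map each input file to an output path that preserves its location
    relative to the directory shared by all inputs (hypothetical helper)."""
    # Use the containing directories so a single input still resolves
    # to its own parent directory rather than to the file itself.
    shared = posixpath.commonpath([posixpath.dirname(p) for p in input_paths])
    return {
        path: posixpath.join(out_dir, posixpath.relpath(path, shared) + suffix)
        for path in input_paths
    }
```

With inputs under `/data/run1` and `/data/run2`, the shared directory would be `/data`, so the output tree would contain `run1/` and `run2/` subdirectories and files from different runs that happen to share a name would not collide.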
