ShawHahnLab
diff --git a/DESCRIPTION b/DESCRIPTION
Lines changed: 1 addition & 3 deletions
diff --git a/GUIDE.Rmd b/GUIDE.Rmd
Lines changed: 31 additions & 20 deletions
diff --git a/NAMESPACE b/NAMESPACE
Lines changed: 8 additions & 4 deletions
diff --git a/NEWS.md b/NEWS.md
Lines changed: 42 additions & 0 deletions
@@ -1,6 +1,6 @@
 Package: chiimp
 Title: Computational, High-throughput Individual Identification through Microsatellite Profiling
-Version: 0.1.0
+Version: 0.2.0
 Authors@R: person("Jesse", "Connell", email = "[email protected]", role = c("aut", "cre"))
 Description: An R package to analyze microsatellites in high-throughput sequencing datasets.
 Depends: R (>= 3.2.3)
@@ -11,12 +11,10 @@ Imports:
   argparser (>= 0.4),
   dnaplotr (>= 0.1),
   dnar (>= 0.1),
-  dplyr (>= 0.7.4),
   graphics (>= 3.2.3),
   grDevices (>= 3.2.3),
   kableExtra (>= 0.2.1.9000),
   knitr (>= 1.16),
-  magrittr (>= 1.5),
   methods (>= 3.2.3),
   msa (>= 1.2.1),
   openssl (>= 0.9.6),
 
@@ -198,9 +198,9 @@ receive a different suffix when the name is assigned.
 CHIIMP breaks the genotyping process into two parts.  First a sample file is 
 de-replicated and a table of unique sequences is created, with no filtering yet 
 applied.  Second the table is filtered to just candidate allele sequences, and 
-up to sequences are reported as the genotype.  Both the per-sample table and the
-final genotypes are saved in the final output, as spreadsheets in the
-`processed-samples` directory and as the `summary.csv` spreadsheet.
+up to two sequences are reported as the genotype.  Both the per-sequence table
+and the final genotypes are saved in the final output, as spreadsheets in the 
+`processed-files` directory and as the `summary.csv` spreadsheet.
 
 ### Sample Processing
 
@@ -213,8 +213,9 @@ locus attributes described above.  First each locus' forward primer is compared
 with the sequence and the matching locus name is stored in a MatchingLocus 
 column.  The sequence is then checked for several tandem repeats of the motif 
 for that locus, and compared to the length range expected for that locus. 
-TRUE/FALSE values for these are stored in MotifMatch and LengthMatch columns
-respectively.
+TRUE/FALSE values for these are stored in MotifMatch and LengthMatch columns 
+respectively.  The Ambiguous column marks any sequences containing bases outside
+of A, C, T, and G (such as N).
 
 PCR artifacts can obscure real allele sequences with incorrect sequences.  There 
 are extra filters to attempt to remove these if possible or highlight cases that
@@ -239,7 +240,7 @@ Lastly, the ratio of read counts for each sequence to the total reads in the
 sample and the reads with the same MatchingLocus value is stored in
 FractionOfTotal and FractionOfLocus columns respectively.
 
-This is the `analyze_sample` function in the R package.
+This is the `analyze_seqs` function in the R package.
 
 ### Genotype Calling
 
@@ -254,31 +255,36 @@ LengthMatch columns).  If the resulting total read count is below a minimum
 value (by default `r config.defaults$sample_summary$counts.min`, customizable
 via the `sample_summary: counts.min` setting) no genotyping will be attempted. 
 Next only those sequences accounting for at least a minimum fraction of the
-remaining reads are kept.  (The default value is
-`r config.defaults$sample_summary$fraction.min`.  This can be changed via the 
-`sample_summary: fraction.min` setting.)  Sequences that are marked as potential
-stutter or other artifacts (via the Stutter and Artifact columns of the table)
-are removed next.
+remaining reads are considered.  (The default value is
+`r config.defaults$sample_analysis$fraction.min`.  This can be changed via the 
+`sample_analysis: fraction.min` setting.)  Sequences that are marked as
+potential stutter or other artifacts (via the Stutter and Artifact columns of 
+the table) or contain ambiguous sequence content (via the Ambiguous column) are
+excluded next.
 
 After these filters are applied, the top one or two remaining sequences are 
-reported as the alleles.  (If only one sequence remains, the sample is labeled 
-homozygous; if two or more, heterozygous.)  The details kept are:
+labeled as the alleles.  (If only one sequence remains, the sample is labeled 
+homozygous; if two or more, heterozygous.)  The final details kept for each
+sample are:
 
  * the sequence content, length, and counts for the one or two alleles
  * the zygosity of the sample
+ * whether the ambiguous-sequence filter removed a potential allele
  * whether the stutter and/or artifact filter removed a potential allele
  * The read counts of the entire sample before any filtering
  * The read counts of just those sequences matching the locus primer, motif, and
  length range
 
-This is the `summarize_sample` function in the R package.
+These tasks (the filtering and categorizing of each sequence in the table and
+the short genotype summary) are the `analyze_sample` and `summarize_sample`
+functions in the R package.
 
 ### Summary and Reporting
 
 The genotype and details identified in the previous step for each sample are 
 aggregated into a spreadsheet with a row for each sample.  This summary 
-spreadsheet and the more detailed per-sample tables are all saved in the final
-output.
+spreadsheet and the more detailed per-file and per-sample tables are all saved
+in the final output.
 
 For inter-sample comparisons, the alleles identified across samples for each 
 locus are aligned to one another.  The genotypes for each sample are clustered 
@@ -358,11 +364,16 @@ A the end of an analysis CHIIMP creates a directory of files with all results.
  dataset spreadsheet including locus, replicate, and sample identifiers, the 
  sequences, sequence lengths, and counts of the identified allele(s), and
  several additional attributes.
- * `processed-samples`: directory of spreadsheets for each sample.  Each
+ * `processed-files`: directory of spreadsheets for each input data file.  Each 
  spreadsheet contains one unique sequence per row with attributes on columns. 
- These represent the intermediate data CHIIMP uses to call a genotype for each
- sample, and each spreadsheet here corresponds to a single row in the
- `summary.csv` file.
+ At this stage no filtering for sample/locus-specific attributes has been
+ applied.  (This is particularly relevant for sequencer-multiplexed samples as
+ one input data file may contain data for multiple samples.)
+ * `processed-samples`: directory of spreadsheets for each sample.  As for 
+ `processed-files`, each spreadsheet contains one unique sequence per row with 
+ attributes on columns.  These represent the intermediate sample-specific data
+ CHIIMP uses to call a genotype for each sample, and each spreadsheet here
+ corresponds to a single row in the `summary.csv` file.
  * `histograms`: directory of counts-versus-length histograms for each sample. 
  Counts are tallied on a by-sequence basis rather than by-length for alleles, so
  the bars for called alleles (in red) are generally shorter than the bars for
 
@@ -3,12 +3,14 @@
 export(align_alleles)
 export(analyze_dataset)
 export(analyze_sample)
+export(analyze_sample_guided)
+export(analyze_sample_naive)
+export(analyze_seqs)
 export(calc_genotype_distance)
 export(config.defaults)
 export(find_closest_matches)
 export(full_analysis)
 export(histogram)
-export(histogram2)
 export(load_allele_names)
 export(load_config)
 export(load_dataset)
@@ -30,18 +32,20 @@ export(plot_heatmap_stutter)
 export(prepare_dataset)
 export(report_genotypes)
 export(report_idents)
+export(sample_analysis_funcs)
+export(sample_summary_funcs)
 export(save_alignment_images)
 export(save_alignments)
 export(save_allele_seqs)
+export(save_dataset)
 export(save_dist_mat)
 export(save_histograms)
 export(save_results_summary)
 export(save_sample_data)
+export(save_seqfile_data)
 export(summarize_attribute)
 export(summarize_dataset)
-export(summarize_genotypes)
-export(summarize_genotypes_known)
 export(summarize_sample)
 export(summarize_sample_guided)
+export(tabulate_allele_names)
 export(tally_cts_per_locus)
-importFrom(magrittr,"%>%")
 
@@ -0,0 +1,42 @@
+# chiimp 0.2.0
+
+ * Restructured code to avoid analyzing multiplexed samples more than once ([#3]).
+ * Reorganized output into per-sequence-file (full) and per-sample (filtered)
+   sections ([#5]).
+   * Added a new saving function, `save_seqfile_data`, to save a directory tree
+     of per-sequence-file output files starting from the first shared directory
+     in the input file paths.
+   * Moved functionality from `analyze_sample` into a new `analyze_seqs`
+     function to be used as a separate step (enabling shared processing between
+     multiplexed samples in a single data file for [#3]).
+   * Split data list from `analyze_dataset` output into two separate lists
+     called files and samples.
+ * Added sequence name matching to `analyze_dataset`.  The summary data frame
+   now has Allele1Name and Allele2Name columns, and the sample data frames a
+   SeqName column, matching any sequence recognized as a called allele from any
+   sample in the current analysis (or a previous analysis if the allele table
+   is provided).
+ * Improved `histogram` function to recognize more categories of unique
+   sequences including sequences called as alleles elsewhere, and return the
+   counts-by-length data by category in a list.
+   * Removed `histogram2` to centralize functionality in `histogram`.
+ * Fixed bugs causing failure of report generation for completely blank
+   dataset analysis results ([#7]).
+ * Removed `summarize_sample_by_length` function.
+ * Clarified behavior of `summarize_sample` functions to allow any combination
+   of TRUE/FALSE values in the Ambiguous/Stutter/Artifact entries.  Previously
+   only the first (highest-count) case would be flagged.
+ * Added features to track and filter ambiguous sequences ([#4]).
+   * Added column named Ambiguous to `analyze_sample` output to flag sequences
+     with non-ACTG characters.
+   * Added entry named Ambiguous to `summarize_sample` output to track
+     filtering of sequences with non-ACTG characters.
+
+[#7]: https://github.com/ShawHahnLab/chiimp/issues/7
+[#5]: https://github.com/ShawHahnLab/chiimp/issues/5
+[#4]: https://github.com/ShawHahnLab/chiimp/issues/4
+[#3]: https://github.com/ShawHahnLab/chiimp/issues/3
+
+# chiimp 0.1.0
+
+ * Initial release
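
Reviewer note: the genotype-calling order described in the GUIDE.Rmd changes (locus filter, minimum read count, minimum fraction, then exclusion of stutter, artifact, and ambiguous sequences) can be sketched as below. This is a standalone Python illustration, not the package's R code. The names `Seq` and `call_genotype`, the result keys, and the threshold values are hypothetical placeholders (the real defaults come from `config.defaults` and are not shown in this diff), and the `*_removed` flags are a simplification of whatever the package actually records.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    """One unique sequence from a de-replicated sample (hypothetical type)."""
    seq: str
    count: int
    locus_ok: bool = True   # primer, motif, and length range all match
    stutter: bool = False   # flagged as potential stutter
    artifact: bool = False  # flagged as potential PCR artifact

    @property
    def ambiguous(self) -> bool:
        # Any base outside A, C, T, and G (such as N) marks the sequence.
        return bool(set(self.seq.upper()) - set("ACTG"))


def call_genotype(seqs, counts_min=500, fraction_min=0.05):
    """Apply the filters in the documented order and report up to two alleles.

    counts_min and fraction_min are placeholder values, not CHIIMP defaults.
    """
    result = {"alleles": [], "zygosity": None,
              "ambiguous_removed": False, "stutter_or_artifact_removed": False}
    # 1. Keep only sequences matching the locus primer, motif, and length.
    kept = [s for s in seqs if s.locus_ok]
    total = sum(s.count for s in kept)
    # 2. Too few reads: no genotyping is attempted.
    if total == 0 or total < counts_min:
        return result
    # 3. Consider only sequences above the minimum fraction of remaining reads.
    candidates = sorted(
        (s for s in kept if s.count / total >= fraction_min),
        key=lambda s: s.count, reverse=True)
    # 4. Exclude stutter, artifact, and ambiguous candidates.
    #    Simplification: each flag records whether any candidate was excluded.
    result["ambiguous_removed"] = any(s.ambiguous for s in candidates)
    result["stutter_or_artifact_removed"] = any(
        s.stutter or s.artifact for s in candidates)
    passing = [s for s in candidates
               if not (s.stutter or s.artifact or s.ambiguous)]
    # 5. The top one or two remaining sequences are labeled as the alleles.
    result["alleles"] = [s.seq for s in passing[:2]]
    if len(passing) == 1:
        result["zygosity"] = "homozygous"
    elif len(passing) >= 2:
        result["zygosity"] = "heterozygous"
    return result
```

For example, a sample with read counts of 600, 300, 90 (containing an N), and 10 would, under these placeholder thresholds, yield the first two sequences as a heterozygous call, with the ambiguous-sequence flag set and the 10-read sequence dropped by the fraction filter.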