Merge branch 'release-0.2.0'
ressy committed May 24, 2018
2 parents c79ec30 + bc4a97e commit b0217d4
Showing 67 changed files with 2,844 additions and 1,571 deletions.
4 changes: 1 addition & 3 deletions DESCRIPTION
@@ -1,6 +1,6 @@
Package: chiimp
Title: Computational, High-throughput Individual Identification through Microsatellite Profiling
-Version: 0.1.0
+Version: 0.2.0
Authors@R: person("Jesse", "Connell", email = "[email protected]", role = c("aut", "cre"))
Description: An R package to analyze microsatellites in high-throughput sequencing datasets.
Depends: R (>= 3.2.3)
@@ -11,12 +11,10 @@ Imports:
argparser (>= 0.4),
dnaplotr (>= 0.1),
dnar (>= 0.1),
dplyr (>= 0.7.4),
graphics (>= 3.2.3),
grDevices (>= 3.2.3),
kableExtra (>= 0.2.1.9000),
knitr (>= 1.16),
magrittr (>= 1.5),
methods (>= 3.2.3),
msa (>= 1.2.1),
openssl (>= 0.9.6),
51 changes: 31 additions & 20 deletions GUIDE.Rmd
@@ -198,9 +198,9 @@ receive a different suffix when the name is assigned.
CHIIMP breaks the genotyping process into two parts. First a sample file is
de-replicated and a table of unique sequences is created, with no filtering yet
applied. Second the table is filtered to just candidate allele sequences, and
-up to sequences are reported as the genotype. Both the per-sample table and the
-final genotypes are saved in the final output, as spreadsheets in the
-`processed-samples` directory and as the `summary.csv` spreadsheet.
+up to two sequences are reported as the genotype. Both the per-sequence table
+and the final genotypes are saved in the final output, as spreadsheets in the
+`processed-files` directory and as the `summary.csv` spreadsheet.

### Sample Processing

@@ -213,8 +213,9 @@ locus attributes described above. First each locus' forward primer is compared
with the sequence and the matching locus name is stored in a MatchingLocus
column. The sequence is then checked for several tandem repeats of the motif
for that locus, and compared to the length range expected for that locus.
-TRUE/FALSE values for these are stored in MotifMatch and LengthMatch columns
-respectively.
+TRUE/FALSE values for these are stored in MotifMatch and LengthMatch columns
+respectively. The Ambiguous column marks any sequences containing bases outside
+of A, C, T, and G (such as N).

PCR artifacts can obscure real allele sequences with incorrect sequences. There
are extra filters to attempt to remove these if possible or highlight cases that
@@ -239,7 +240,7 @@ Lastly, the ratio of read counts for each sequence to the total reads in the
sample and the reads with the same MatchingLocus value is stored in
FractionOfTotal and FractionOfLocus columns respectively.

-This is the `analyze_sample` function in the R package.
+This is the `analyze_seqs` function in the R package.
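
The per-sequence table described in this hunk can be illustrated with a minimal base-R sketch. This is not the package's `analyze_seqs` implementation: the helper name `tabulate_seqs_sketch` and the `Seq`, `Count`, and `Length` column names are assumptions made for illustration, and only the Ambiguous and FractionOfTotal columns named in the guide are reproduced.

```r
# Hypothetical helper, not part of chiimp: build a table of unique sequences.
tabulate_seqs_sketch <- function(seqs) {
  # De-replicate: count how many reads share each unique sequence.
  counts <- table(seqs)
  tbl <- data.frame(
    Seq = names(counts),
    Count = as.integer(counts),
    stringsAsFactors = FALSE
  )
  # Sort by descending read count so the most abundant sequences come first.
  tbl <- tbl[order(-tbl$Count), ]
  rownames(tbl) <- NULL
  tbl$Length <- nchar(tbl$Seq)
  # Flag any sequence containing characters other than A, C, T, and G.
  tbl$Ambiguous <- grepl("[^ACTG]", tbl$Seq)
  # Fraction of all reads in the sample accounted for by each sequence.
  tbl$FractionOfTotal <- tbl$Count / sum(tbl$Count)
  tbl
}

reads <- c("ACTGACTG", "ACTGACTG", "ACTGACTG", "ACTGNCTG", "ACTG")
seq_table <- tabulate_seqs_sketch(reads)
```

No filtering happens at this stage; the flags are only recorded so that a later step can decide which sequences to exclude.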

### Genotype Calling

@@ -254,31 +255,36 @@ LengthMatch columns). If the resulting total read count is below a minimum
value (by default `r config.defaults$sample_summary$counts.min`, customizable
via the `sample_summary: counts.min` setting) no genotyping will be attempted.
Next only those sequences accounting for at least a minimum fraction of the
-remaining reads are kept. (The default value is
-`r config.defaults$sample_summary$fraction.min`. This can be changed via the
-`sample_summary: fraction.min` setting.) Sequences that are marked as potential
-stutter or other artifacts (via the Stutter and Artifact columns of the table)
-are removed next.
+remaining reads are considered. (The default value is
+`r config.defaults$sample_analysis$fraction.min`. This can be changed via the
+`sample_analysis: fraction.min` setting.) Sequences that are marked as
+potential stutter or other artifacts (via the Stutter and Artifact columns of
+the table) or contain ambiguous sequence content (via the Ambiguous column) are
+excluded next.

After these filters are applied, the top one or two remaining sequences are
-reported as the alleles. (If only one sequence remains, the sample is labeled
-homozygous; if two or more, heterozygous.) The details kept are:
+labeled as the alleles. (If only one sequence remains, the sample is labeled
+homozygous; if two or more, heterozygous.) The final details kept for each
+sample are:

* the sequence content, length, and counts for the one or two alleles
* the zygosity of the sample
+* whether the ambiguous-sequence filter removed a potential allele
* whether the stutter and/or artifact filter removed a potential allele
* The read counts of the entire sample before any filtering
* The read counts of just those sequences matching the locus primer, motif, and
length range

-This is the `summarize_sample` function in the R package.
+These tasks (the filtering and categorizing of each sequence in the table and
+the short genotype summary) are the `analyze_sample` and `summarize_sample`
+functions in the R package.
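
The genotype-calling steps in this hunk can be sketched in base R as follows. This is a simplified illustration rather than the package's `analyze_sample` or `summarize_sample` code. The helper name `call_genotype_sketch`, the `Seq` and `Count` columns, the use of plain TRUE/FALSE values in the Stutter and Artifact columns, and the threshold values 500 and 0.05 are all assumptions; the actual defaults come from `config.defaults` and are not shown in this diff.

```r
# Hypothetical helper, not part of chiimp: call up to two alleles for a locus.
call_genotype_sketch <- function(tbl, locus_name,
                                 counts_min = 500, fraction_min = 0.05) {
  empty <- list(alleles = character(0), zygosity = NA_character_,
                ambiguous = FALSE, stutter_or_artifact = FALSE)
  # Keep only sequences matching the locus primer, motif, and length range.
  keep <- tbl$MatchingLocus %in% locus_name & tbl$MotifMatch & tbl$LengthMatch
  tbl <- tbl[keep, ]
  # With too few reads remaining, no genotyping is attempted.
  total <- sum(tbl$Count)
  if (total < counts_min) return(empty)
  # Consider only sequences above the minimum fraction of remaining reads.
  tbl <- tbl[tbl$Count / total >= fraction_min, ]
  tbl <- tbl[order(-tbl$Count), ]
  # Simplification: record whether any considered sequence gets excluded.
  ambiguous <- any(tbl$Ambiguous)
  flagged <- any(tbl$Stutter | tbl$Artifact)
  tbl <- tbl[!(tbl$Ambiguous | tbl$Stutter | tbl$Artifact), ]
  # The top one or two remaining sequences are labeled as the alleles.
  alleles <- tbl$Seq[seq_len(min(2, nrow(tbl)))]
  zygosity <- if (nrow(tbl) == 0) {
    NA_character_
  } else if (nrow(tbl) == 1) {
    "Homozygous"
  } else {
    "Heterozygous"
  }
  list(alleles = alleles, zygosity = zygosity,
       ambiguous = ambiguous, stutter_or_artifact = flagged)
}

seq_table <- data.frame(
  Seq = c("TAGATAGATAGATAGA", "TAGATAGATAGA", "TAGANAGATAGA", "CCCC"),
  Count = c(600L, 450L, 100L, 50L),
  MatchingLocus = c("A", "A", "A", NA),
  MotifMatch = c(TRUE, TRUE, TRUE, FALSE),
  LengthMatch = c(TRUE, TRUE, TRUE, FALSE),
  Ambiguous = c(FALSE, FALSE, TRUE, FALSE),
  Stutter = FALSE,
  Artifact = FALSE,
  stringsAsFactors = FALSE
)
genotype <- call_genotype_sketch(seq_table, locus_name = "A")
```

In this example the third sequence passes the read-fraction threshold but is excluded for containing an N, so the result records that the ambiguous-sequence filter was triggered while still reporting two alleles.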

### Summary and Reporting

The genotype and details identified in the previous step for each sample are
aggregated into a spreadsheet with a row for each sample. This summary
-spreadsheet and the more detailed per-sample tables are all saved in the final
-output.
+spreadsheet and the more detailed per-file and per-sample tables are all saved
+in the final output.

For inter-sample comparisons, the alleles identified across samples for each
locus are aligned to one another. The genotypes for each sample are clustered
@@ -358,11 +364,16 @@ A the end of an analysis CHIIMP creates a directory of files with all results.
dataset spreadsheet including locus, replicate, and sample identifiers, the
sequences, sequence lengths, and counts of the identified allele(s), and
several additional attributes.
-* `processed-samples`: directory of spreadsheets for each sample. Each
+* `processed-files`: directory of spreadsheets for each input data file. Each
spreadsheet contains one unique sequence per row with attributes on columns.
-These represent the intermediate data CHIIMP uses to call a genotype for each
-sample, and each spreadsheet here corresponds to a single row in the
-`summary.csv` file.
+At this stage no filtering for sample/locus-specific attributes has been
+applied. (This is particularly relevant for sequencer-multiplexed samples as
+one input data file may contain data for multiple samples.)
+* `processed-samples`: directory of spreadsheets for each sample. As for
+`processed-files`, each spreadsheet contains one unique sequence per row with
+attributes on columns. These represent the intermediate sample-specific data
+CHIIMP uses to call a genotype for each sample, and each spreadsheet here
+corresponds to a single row in the `summary.csv` file.
* `histograms`: directory of counts-versus-length histograms for each sample.
Counts are tallied on a by-sequence basis rather than by-length for alleles, so
the bars for called alleles (in red) are generally shorter than the bars for
12 changes: 8 additions & 4 deletions NAMESPACE
@@ -3,12 +3,14 @@
export(align_alleles)
export(analyze_dataset)
export(analyze_sample)
export(analyze_sample_guided)
export(analyze_sample_naive)
export(analyze_seqs)
export(calc_genotype_distance)
export(config.defaults)
export(find_closest_matches)
export(full_analysis)
export(histogram)
export(histogram2)
export(load_allele_names)
export(load_config)
export(load_dataset)
@@ -30,18 +32,20 @@ export(plot_heatmap_stutter)
export(prepare_dataset)
export(report_genotypes)
export(report_idents)
export(sample_analysis_funcs)
export(sample_summary_funcs)
export(save_alignment_images)
export(save_alignments)
export(save_allele_seqs)
export(save_dataset)
export(save_dist_mat)
export(save_histograms)
export(save_results_summary)
export(save_sample_data)
export(save_seqfile_data)
export(summarize_attribute)
export(summarize_dataset)
export(summarize_genotypes)
export(summarize_genotypes_known)
export(summarize_sample)
export(summarize_sample_guided)
export(tabulate_allele_names)
export(tally_cts_per_locus)
importFrom(magrittr,"%>%")
42 changes: 42 additions & 0 deletions NEWS.md
@@ -0,0 +1,42 @@
# chiimp 0.2.0

* Restructured code to avoid analyzing multiplexed samples more than once ([#3]).
* Reorganized output into per-sequence-file (full) and per-sample (filtered)
sections ([#5]).
* Added a new saving function, `save_seqfile_data`, to save a directory tree
of per-sequence-file output files starting from the first shared directory
in the input file paths.
* Moved functionality from `analyze_sample` into a new `analyze_seqs`
function to be used as a separate step (enabling shared processing between
multiplexed samples in a single data file for [#3]).
* Split data list from `analyze_dataset` output into two separate lists
called files and samples.
* Added sequence name matching to `analyze_dataset`. The summary data frame
now has Allele1Name and Allele2Name columns, and the sample data frames a
SeqName column, matching any sequence recognized as a called allele from any
sample in the current analysis (or a previous analysis if the allele table
is provided).
* Improved `histogram` function to recognize more categories of unique
sequences including sequences called as alleles elsewhere, and return the
counts-by-length data by category in a list.
* Removed `histogram2` to centralize functionality in `histogram`.
* Fixed bugs causing failure of report generation for completely blank
dataset analysis results ([#7]).
* Removed `summarize_sample_by_length` function.
* Clarified behavior of `summarize_sample` functions to allow any combination
of TRUE/FALSE values in the Ambiguous/Stutter/Artifact entries. Previously
only the first (highest-count) case would be flagged.
* Added features to track and filter ambiguous sequences ([#4]).
* Added column named Ambiguous to `analyze_sample` output to flag sequences
with non-ACTG characters.
* Added entry named Ambiguous to `summarize_sample` output to track
filtering of sequences with non-ACTG characters.
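
The restructuring noted above for [#3], where each data file is processed once and the result is shared between multiplexed samples, might look roughly like the sketch below. The dataset columns (`Filename`, `Sample`, `Locus`) and the function `process_file_sketch` are hypothetical stand-ins, and the placeholder does no real sequence analysis; a call counter is included only to show that each file is handled once.

```r
# Hypothetical dataset: two samples are multiplexed in the same data file.
dataset <- data.frame(
  Filename = c("run1.fastq", "run1.fastq", "run2.fastq"),
  Sample = c("S1", "S2", "S3"),
  Locus = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

# Stand-in for the expensive per-file step; counts how often it runs.
calls <- 0
process_file_sketch <- function(path) {
  calls <<- calls + 1
  # Placeholder result: an empty per-sequence table.
  data.frame(Seq = character(0), Count = integer(0), stringsAsFactors = FALSE)
}

# Process each unique file exactly once.
files <- unique(dataset$Filename)
file_results <- lapply(files, process_file_sketch)
names(file_results) <- files

# Each sample then looks up the shared per-file result by filename.
sample_results <- lapply(seq_len(nrow(dataset)), function(i) {
  file_results[[dataset$Filename[i]]]
})
names(sample_results) <- dataset$Sample
```

Here three samples are served by two per-file results, mirroring the split into separate files and samples lists described above. Any sample-specific or locus-specific filtering would be applied to the looked-up table afterwards.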

[#7]: https://github.com/ShawHahnLab/chiimp/issues/7
[#5]: https://github.com/ShawHahnLab/chiimp/issues/5
[#4]: https://github.com/ShawHahnLab/chiimp/issues/4
[#3]: https://github.com/ShawHahnLab/chiimp/issues/3
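
One possible reading of "starting from the first shared directory in the input file paths" is sketched below: find the deepest directory common to all input files and express each file relative to it. This is an interpretation, not the code of `save_seqfile_data`; the helper name `shared_dir_sketch` is hypothetical and the sketch assumes "/"-separated paths.

```r
# Hypothetical helper, not part of chiimp: find the deepest directory shared
# by all input file paths and give each path relative to that directory.
shared_dir_sketch <- function(paths) {
  # Split the directory portion of each path into its components.
  parts <- strsplit(dirname(paths), "/", fixed = TRUE)
  n <- min(lengths(parts))
  shared <- 0
  for (i in seq_len(n)) {
    ith <- vapply(parts, `[`, character(1), i)
    # Stop at the first component where the paths diverge.
    if (length(unique(ith)) > 1) break
    shared <- i
  }
  root <- paste(parts[[1]][seq_len(shared)], collapse = "/")
  # Keep the directory components below the shared root plus the file name.
  relative <- vapply(seq_along(paths), function(j) {
    below <- parts[[j]][seq_along(parts[[j]]) > shared]
    paste(c(below, basename(paths[j])), collapse = "/")
  }, character(1))
  list(root = root, relative = relative)
}

paths <- c("data/run1/sampleA.fastq",
           "data/run1/sampleB.fastq",
           "data/run2/sampleC.fastq")
result <- shared_dir_sketch(paths)
```

The relative paths could then be recreated under an output directory, so that files with the same name in different run directories do not overwrite each other.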

# chiimp 0.1.0

* Initial release