- Updated dependencies to remove known vulnerabilities
- Exposes a new option `--max-branch-factor` in both `compare` and `merge` modes. This option controls a heuristic that limits the exploration of the sequence search space during query optimization. Increasing the value may lead to slightly more accurate results at the cost of increased compute requirements.
- Improved error reporting from VCF parsing to indicate the most recently parsed record
- Fixed an off-by-one error in the output of the merge region BED files
- Improved inline documentation that appears on docs.rs
- Updates the GitHub workflows to patch an issue with tarball version naming
- Fixed a versioning issue with dependencies
- GitHub release workflow updated to auto-publish to crates.io
- The library has been renamed to `aardvark-bio` for publishing to crates.io; the binary name is unchanged
- Adds the optional `RECORD_BP` comparison type to all summary files. Review our methods for details on this metric. This metric is enabled with `--enable-record-basepair-metrics`.
- The `GT` and `BASEPAIR` metrics are recommended for most users. Thus, the `HAP` and `WEIGHTED_HAP` metrics are now disabled by default. Two new options have been added to enable these secondary metrics: `--enable-hapotype-metrics` and `--enabled-weighted-haplotype-metrics`.
- Refactored the way group metrics are internally represented for easier long-term maintenance
- Added new GitHub automations for developer ease-of-use
- Adds an automated workflow for static binary build and release. Aardvark functionality is identical to v0.8.0.
- Adds beta support for medium-sized variant types (<= 10 kbp):
  - `SvDeletion` and `SvInsertion` - Structural variants identified by the `SVTYPE` tag in the provided VCF files
  - `TrContraction` and `TrExpansion` - Tandem repeat variants identified by a `TRID` tag in the provided VCF files
  - Add `JointStructuralVariant` and `JointTandemRepeat` types, which are similar to `JointIndel`
  - Including these larger variant types increases the compute cost of Aardvark
- Adds additional heuristic limitations to reduce compute costs from larger variant types:
  - Adds a filter on variants larger than 10 kbp
  - Adds a short-circuit for GT candidate scoring such that any edit immediately ends the candidate exploration
  - Adds a limit on the number of candidates that can be generated before a sync point is identified in GT scoring, preventing exponential blow-up of GT scoring when large variants mismatch
- Adds filter on variants that are not sequence resolved (e.g., symbolic ALT entries are not supported)
- Updates `noodles` crate to resolve some parsing issues:
  - Fixed issue with parsing `SVLEN` field
- Fixed a VCF parsing issue where absent chromosomes would error instead of gracefully returning no variants
- Fixed issue with parsing
- Updated build to `rust:1.88.0` to resolve some compilation issues
- Adds a new scoring mode, `WEIGHTED_HAP`. This scoring mode is similar to `HAP` scoring, but variants are weighted by the number of changes between the REF and ALT sequences. For SNPs, the `HAP` and `WEIGHTED_HAP` scores should be identical since all SNPs have the same weight. For indels, each variant is effectively weighted by its length, so longer variants have an increased weight.
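
The weighting idea can be illustrated with a minimal sketch. It assumes the "number of changes" is the edit distance between REF and ALT, which the changelog does not confirm; `edit_distance` and `matched_fraction` are hypothetical helpers for illustration, not part of Aardvark.

```python
def edit_distance(ref: str, alt: str) -> int:
    """Levenshtein distance, used here as an assumed stand-in for the variant weight."""
    prev = list(range(len(alt) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, a in enumerate(alt, start=1):
            cost = 0 if r == a else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def matched_fraction(variants, weighted: bool) -> float:
    """Fraction of variants matched; each variant is a (ref, alt, matched) tuple.

    With weighted=False every variant counts once (HAP-like).
    With weighted=True each variant counts by its edit distance (WEIGHTED_HAP-like).
    """
    total = 0
    hit = 0
    for ref, alt, matched in variants:
        weight = edit_distance(ref, alt) if weighted else 1
        total += weight
        if matched:
            hit += weight
    return hit / total if total else 0.0


# One matched SNP (weight 1) and one missed 2 bp insertion (weight 2).
calls = [("A", "C", True), ("A", "ACC", False)]
unweighted = matched_fraction(calls, weighted=False)
weighted = matched_fraction(calls, weighted=True)
```

In this toy input the unweighted fraction is 1/2, while the weighted fraction drops to 1/3 because the missed insertion carries twice the weight of the SNP.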
- Removed an exact-match shortcut in `compare` mode that would occasionally under-estimate the variant-level errors
- Added REF/ALT trimming of identical tail basepair sequences, which resolves some edge-case variant conflicts in the T2T truth set. This minor change tends to slightly improve the overall accuracy by reducing variant conflicts, allowing for overlapping changes when the trimming removes previously conflicting bases. Some Indel variants may be classified differently compared to previous versions. See the output VCF files for trimmed representations. Examples:
- AC->CC : this is trimmed to A->C and classified as a SNV now.
- ACC->AC : this is trimmed to AC->A and classified as a Deletion now.
- AC->ACC : this is trimmed to A->AC and classified as an Insertion now.
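
The three examples above can be reproduced with a short sketch. It assumes trimming stops once either allele is down to a single base, and it classifies by allele length only; `trim_shared_tail` and `classify` are hypothetical names for illustration, not Aardvark's implementation.

```python
def trim_shared_tail(ref: str, alt: str) -> tuple[str, str]:
    """Remove identical trailing bases, keeping at least one base in each allele."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    return ref, alt


def classify(ref: str, alt: str) -> str:
    """Simplified length-based classification (an assumption for this sketch)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) > len(alt):
        return "Deletion"
    if len(alt) > len(ref):
        return "Insertion"
    return "Other"


for ref, alt in [("AC", "CC"), ("ACC", "AC"), ("AC", "ACC")]:
    trimmed = trim_shared_tail(ref, alt)
    print(f"{ref}->{alt} trims to {trimmed[0]}->{trimmed[1]}: {classify(*trimmed)}")
```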
- Adds a new option `--disable-variant-trimming` to both `compare` and `merge` modes, which disables the above trimming behavior.
- Updated the install documentation to reflect bioconda support
- Added changes to build script to enable bioconda building from source
- Added two new `GT`-specific statistics to the summary files:
  - `truth_fn_gt` - The number of `truth_fn` where the allelic sequence was matched in both inputs, but with the wrong genotype (e.g., 1/1 in truth, 0/1 in query). This value is only populated if the `comparison` is `GT`.
  - `query_fp_gt` - The number of `query_fp` where the allelic sequence was matched in both inputs, but with the wrong genotype (e.g., 0/1 in truth, 1/1 in query). This value is only populated if the `comparison` is `GT`.
- Added parallelization to the writers for both `compare` and `merge`.
- Added a new option to `compare` mode: `--stratifications`. If provided, this will post-annotate all regions by the provided stratifications and add additional rows to the output summary TSV; see documentation for more details.
- Added `query_total` to the main summary and region summary output files. This metric is `query_tp + query_fp`, and is analogous to `truth_total`.
- Replaced the info statements during file writing with a progress bar.
- Fixed an issue where "*" ALT alleles were treated as alternate sequence. They are now ignored.
- In `compare` mode, the `--confidence-regions` parameter has been replaced with `--regions` for consistency.
- The progress bar has been added to the parallelized variant loading step.
- Fixed an issue where variants that were hemizygous would be ignored entirely. They are now treated as homozygous variants for comparison.
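
As a rough illustration of this behavior, a single-allele genotype can be expanded to a homozygous pair before comparison. The `normalize_genotype` helper below is hypothetical and only assumes standard VCF `GT` strings; how Aardvark represents genotypes internally is not described here.

```python
def normalize_genotype(gt: str) -> tuple[str, ...]:
    """Split a VCF GT string into alleles, duplicating a lone (hemizygous) allele."""
    alleles = tuple(gt.replace("|", "/").split("/"))
    if len(alleles) == 1:
        # Hemizygous call such as "1": treat as homozygous ("1", "1").
        return alleles * 2
    return alleles


for gt in ["1", "0/1", "1|1"]:
    print(gt, "->", normalize_genotype(gt))
```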
- Added a short-circuit in the `merge` routine that checks variant lengths prior to trying a full comparison. This reduces run-time significantly when larger events are unique to an input.
- Adds parallelization to the file loading for both `compare` and `merge`. The parallelization is across both file and chromosome for the typical process, significantly reducing initial variant loading times.
- Added a new optional output file for merging (`--output-summary`) which contains statistics on the merge. See documentation for details.
- Added a new file (`region_sequences.tsv.gz`) to the debug output for `aardvark compare`. This file contains the constructed haplotype sequences for each region.
- Replaced the region summary file in the debug output with a gzip-compressed version (`region_summary.tsv` -> `region_summary.tsv.gz`) for `aardvark compare`. In internal tests, this reduced the disk footprint by ~90% for this file.
- Updated `crossbeam-channel` for security update
Initial release.