Sample size for error estimation is too small! #28

@SimonHegele

Hello,

first of all: Thanks for this great tool, which provides a comprehensive set of metrics along with neat visualisations!

To estimate the error rate, AlignQC samples the best alignments from the first n reads, stopping at the first n for which n % 100 = 1 and the total alignment length has reached at least 1,000,000 bases. For datasets with an average read length over 2000 bases, this results in a sample of at most about 501 reads. As a consequence, the reported error rates can be significantly influenced by the number of threads used during the mapping step. This is because higher-quality reads often map more quickly and therefore tend to appear earlier in the BAM file, so they are overrepresented in the sample that AlignQC analyzes.
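
To make the arithmetic concrete, here is a minimal sketch of the stopping rule as described above. It is not AlignQC's actual code: the function name `sample_size` and its parameters are hypothetical, and it assumes one best alignment length per read, in BAM order.

```python
def sample_size(aligned_lengths, min_bases=1_000_000, step=100):
    """Return how many reads are consumed before sampling stops.

    Hypothetical re-statement of the rule from this issue: stop at the
    first read count n with n % step == 1 once the summed alignment
    length has reached min_bases.
    """
    total = 0
    for n, length in enumerate(aligned_lengths, start=1):
        total += length
        if n % step == 1 and total >= min_bases:
            return n
    # Fewer reads than needed: everything is used.
    return len(aligned_lengths)


# Uniform 2000-base alignments: 500 reads reach 1,000,000 bases,
# and the next n with n % 100 == 1 is 501.
print(sample_size([2000] * 10_000))  # 501
```

Longer reads shrink the sample further (for example, uniform 2500-base alignments stop at 401 reads), so the estimate rests entirely on whichever reads happen to come first in the file.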

I tested this behavior using a small dataset of corrected long reads aligned with Minimap2, once using 1 thread and once using 128 threads. The resulting AlignQC-reported error rates were:

  • 0.387% with 128 threads
  • 0.447% with 1 thread

Best, Simon
