Wrong output on assemblies with long scaffolds #18

@KirillKryukov

assembly-stats 1.0.1 produces the following wrong output on one of our assemblies:

sum = 3595598458, n = 896, ave = 4012944.71, largest = 1193142484
N50 = 844155996, n = 2
N60 = 836758171, n = 3
N70 = 836758171, n = 3
N80 = 762492267, n = 4
N90 = 762492267, n = 4
N100 = 18446744071783539901, n = 896
N_count = 178800
Gaps = 1788

Command: assembly-stats scaffolds.fa >scaffolds.stats. The machine is a rather ordinary Linux server.

Correct output from another tool:

Filepath	TotSeqs	TotLen	N50	N75	N90	I50	GC	Avg	Min	Max	AuN
scaffolds.fa	896	7890565754	1193142484	836758171	729054925	891	44.80	8806434.99	17617	2368955581	1224440811

In particular, please see that assembly-stats shows the total assembly length as 3,595,598,458, while it should be 7,890,565,754. Also note the N100 of 18446744071783539901.

Also, assembly-stats works fine on our other assemblies of similar total size, but consisting of smaller scaffolds.

Here is the test input: https://biokirr.com/Supporting-Data/assembly-stats-bug-report/scaffolds-N.fa.zstd. It's the same assembly filled with N, so it's only 682 kB compressed (even more compact in NAF format: https://biokirr.com/Supporting-Data/assembly-stats-bug-report/scaffolds-N.fa.naf, 125 kB). The decompressed size is 8 GB.

I guess there is some kind of integer overflow, so I hope it will be easy to fix. Please let me know if you need any other information, or the full repro script.
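A quick arithmetic check seems to support the overflow guess. The sketch below assumes (this is not taken from the assembly-stats source) that each scaffold length is held in a 32-bit signed integer and later widened to an unsigned 64-bit value. The helper names `as_int32` and `as_uint64` are made up for illustration. Under that assumption, the single 2,368,955,581 bp scaffold alone would reproduce both the odd N100 and the wrong total:

```python
# Hypothetical reconstruction of the suspected overflow; the storage types
# are an assumption, not something confirmed from the assembly-stats code.

def as_int32(value: int) -> int:
    """Wrap an integer the way a 32-bit signed variable would hold it."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def as_uint64(value: int) -> int:
    """Reinterpret a (possibly negative) integer as an unsigned 64-bit value."""
    return value & 0xFFFFFFFFFFFFFFFF


true_max = 2368955581    # "Max" reported by the other tool
true_total = 7890565754  # "TotLen" reported by the other tool

# The longest scaffold exceeds 2**31 - 1, so it would wrap to a negative value.
wrapped_max = as_int32(true_max)
print(wrapped_max)                          # -1926011715

# Widening that negative value to unsigned 64 bits gives the reported N100.
print(as_uint64(wrapped_max))               # 18446744071783539901

# Summing with the wrapped length in place of the true one gives the reported sum.
print(true_total - true_max + wrapped_max)  # 3595598458
```

If this is what happens, it would also fit the other symptoms: a negative length sorts below every real scaffold, so it shows up as N100 instead of as the largest, and assemblies without any scaffold longer than 2,147,483,647 bp are unaffected.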

EDIT: Added test data.
