Skip to content

Sentinel character in BWT increases input alphabet size? #38

@cpockrandt

Description

@cpockrandt

I noticed when building a csa_wt<wt_pc<>> that the input sequence (an int_vector) cannot contain a zero since the sentinel character in the BWT will be represented by zero.

As far as I understand the code, the alphabet size of the input sequence increases by one (the sentinel character) when building the wavelet tree which increases the wavelet tree size. Wouldn't it be more efficient in terms of space to represent the sentinel character by some character from the input alphabet and store its position?

In SeqAn2 we replace the sentinel character by some character of the input alphabet and store the position of the sentinel in the BWT. When counting characters in a prefix of the BWT in the wavelet tree, the rank is decremented by one iff. the character to be counted is equal to the sentinel replacement and its position is within the prefix.

For collections of strings (which is yet to come), the sentinels would be replaced by the least common character of the input sequence (assuming that rare characters in the input text will also occur less frequently in the pattern to be searched) and we would keep a (sparse) bitvector storing the sentinel positions with rank support.

Is this already implemented in the SDSL (or in progress) or have you discussed this approach earlier @mpetri @simongog?

FYI @h-2

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions