C: Evaluating string pooling #262

@ojwb

Xapian has been using a patched version of Snowball which includes a change to merge string literals into a single block of characters. Many stemmers have a lot of overlap between their strings, so this can reduce the size of the constant data in the compiled C code. Probably more importantly, it reduces the amount of data that needs to be cached, which reduces cache pressure and improves cache utilisation.
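To illustrate the idea (this is only a minimal sketch, not the actual patch or Snowball's generated code), overlapping suffixes can share one pooled array and be referenced by offset and length. The names `pool`, `pooled_str`, `suffixes` and `ends_with` are hypothetical and chosen just for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pool: "ational", "tional", "ation" and "ion" all live
 * inside these 7 bytes. Stored as four separate arrays they would
 * need 7 + 6 + 5 + 3 = 21 bytes. No trailing zero byte is stored. */
static const unsigned char pool[] = {
    'a', 't', 'i', 'o', 'n', 'a', 'l'
};

/* A pooled string is just a slice of the pool. */
struct pooled_str {
    unsigned short offset;
    unsigned short length;
};

static const struct pooled_str suffixes[] = {
    { 0, 7 },   /* "ational" */
    { 1, 6 },   /* "tional"  */
    { 0, 5 },   /* "ation"   */
    { 2, 3 },   /* "ion"     */
};

/* Return 1 if word (word_len bytes, not NUL-terminated) ends with the
 * pooled string s, otherwise 0. */
static int ends_with(const unsigned char *word, size_t word_len,
                     struct pooled_str s)
{
    assert((size_t)s.offset + s.length <= sizeof pool);
    if (word_len < s.length) return 0;
    return memcmp(word + word_len - s.length,
                  pool + s.offset, s.length) == 0;
}
```

Because every suffix is addressed by offset and length, lookups touch the same few bytes repeatedly, which is where the cache benefit described above would come from.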

C toolchains (I think the linker) can merge string literals which end the same way, but they don't seem to be able to do the same for overlapping const unsigned char[] arrays. We could perhaps generate string literals instead, but we don't need a trailing zero byte, and including it significantly reduces the opportunities for overlap, since a zero-terminated literal can only be shared with the tail of another literal.
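The effect of the trailing zero byte can be shown with a small sketch. It does not model how any particular compiler or linker merges data; `find_overlap` is a hypothetical helper that just searches one byte sequence for another:

```c
#include <stddef.h>
#include <string.h>

/* A string literal carries an implicit trailing zero byte... */
static const char as_literal[] = "ing";                    /* 4 bytes */
/* ...while an explicit array holds only the bytes we compare. */
static const unsigned char as_array[] = { 'i', 'n', 'g' }; /* 3 bytes */

/* Return the first offset at which needle occurs inside hay, or -1 if
 * it does not occur. A non-negative result means needle could be
 * stored as a slice of hay instead of separately. */
static long find_overlap(const unsigned char *hay, size_t hay_len,
                         const unsigned char *needle, size_t needle_len)
{
    if (needle_len > hay_len) return -1;
    for (size_t i = 0; i + needle_len <= hay_len; i++) {
        if (memcmp(hay + i, needle, needle_len) == 0) return (long)i;
    }
    return -1;
}
```

Without the terminator, the 3 bytes of "ing" are found at offset 0 of "ingly". With it, the 4 bytes of "ing\0" do not occur anywhere in the 6 bytes of "ingly\0", whereas "ly\0" still matches at offset 3 because it is a tail.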

I've now switched Xapian to using Snowball git master, but this patch seems worth rebasing and evaluating. I originally implemented it many years ago (before I took over Snowball maintenance, in a period when it was difficult to get patches reviewed and included in Snowball, which is why it didn't get proposed back then). I don't remember what evaluation we did back in the day, but computers have evolved since then anyway, and so has Snowball's generated code. If it's still useful, it should be generally useful, so it really belongs in upstream Snowball.
