Support for non-ASCII characters #5

@sykesd

Thanks for this library. It looks fantastic.

However, the current implementation does not appear to support characters outside the ASCII range 0-127. Specifically, this condition in EdgeBag.get(char c) seems to trigger whenever a character with a code above 127 appears in the input text:

    public Edge get(char c) {
        if (c != (char) (byte) c) {
            throw new IllegalArgumentException("Illegal input character " + c + ".");
        }
...
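
To make the effect of that condition concrete, here is a small standalone sketch that applies the same `(char) (byte) c` round trip to a few sample characters. The class name `ByteRoundTripCheck` and the helper `passes` are made up for illustration and are not part of the library. Under Java's narrowing and widening rules, the cast keeps only the low 8 bits and then sign-extends them, so ASCII values survive the round trip while accented Latin and CJK characters do not. As a side effect of the sign extension, the values U+FF80 to U+FFFF also appear to pass.

```java
// Illustrative only: not part of the library.
class ByteRoundTripCheck {
    // Mirrors the quoted condition: true when the char survives a round trip through byte.
    static boolean passes(char c) {
        return c == (char) (byte) c;
    }

    public static void main(String[] args) {
        // ASCII, the ASCII boundary, Latin-1, CJK, and the top of the char range.
        char[] samples = {'a', '\u007F', '\u0080', '\u00E9', '\u4E2D', '\uFF80', '\uFFFF'};
        for (char c : samples) {
            System.out.printf("U+%04X passes=%b%n", (int) c, passes(c));
        }
    }
}
```

Any character for which `passes` returns false would hit the `IllegalArgumentException` in the snippet above.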

I am happy to dig in and try to implement support for at least the normal Java char range, but before I do, is there an inherent reason for the current limitation?
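
As a rough sketch of one possible direction, an edge container keyed by `Character` would accept any Java `char` without a range check. This assumes nothing about how `EdgeBag` stores its edges internally. The class `CharKeyedBag` and its `put` and `get` methods are hypothetical names chosen for this example, not the library's API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an edge container; names are illustrative only.
class CharKeyedBag<E> {
    // Keyed by the first character of each edge; any char value is a valid key.
    private final Map<Character, E> edges = new HashMap<>();

    void put(char c, E edge) {
        edges.put(c, edge);
    }

    // Returns null when no edge is stored for c; never rejects a char by value.
    E get(char c) {
        return edges.get(c);
    }

    public static void main(String[] args) {
        CharKeyedBag<String> bag = new CharKeyedBag<>();
        bag.put('a', "edge-a");
        bag.put('\u00E9', "edge-e-acute");
        bag.put('\u4E2D', "edge-cjk");
        System.out.println(bag.get('\u00E9'));
        System.out.println(bag.get('z'));
    }
}
```

A map like this boxes each key and uses more memory per entry than a compact byte-indexed structure would, which may or may not matter for the library's design goals.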

The application I am considering this library for is part of a search function over a large text index, and I need to support multiple languages, most of which use characters outside the currently supported range.
