
Handle ZWNBSP the same way as Word Joiner #145

Open
@dhouck


wink-nlp version: 2.3.0
wink-eng-lite-web-model: 1.8.0

The Unicode Zero Width No-Break Space character (ZWNBSP, U+FEFF) is now only supposed to be used as a byte-order mark. It previously had the same job as the Word Joiner character (U+2060) and is still occasionally used that way. Unicode recommends treating a ZWNBSP that is not at the start of the file the same way as a Word Joiner.

Currently, the old ZWNBSP character is missing from the token stream, similar to #135. For example, I had a text with the date range 1830<U+FEFF>–<U+FEFF>1832, and the output did not include the U+FEFF characters at all. When I replace the deprecated U+FEFF characters with U+2060 Word Joiners, all characters are correctly reproduced in the output stream.
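Until this is handled inside Wink, the substitution described above could be applied on the caller's side before the text is tokenized. The following is only a minimal sketch: `normalizeZwnbsp` is a hypothetical helper name, not part of wink-nlp, and it assumes that a U+FEFF at the very start of the string is a byte-order mark that should be left alone.

```javascript
// Hypothetical pre-processing helper; NOT part of wink-nlp.
const ZWNBSP = '\uFEFF';      // Zero Width No-Break Space / byte-order mark
const WORD_JOINER = '\u2060'; // Word Joiner

function normalizeZwnbsp(text) {
  // Assumption: a leading U+FEFF is a byte-order mark, so keep it as-is.
  const head = text.startsWith(ZWNBSP) ? ZWNBSP : '';
  const body = text.slice(head.length);
  // Everywhere else, swap the legacy ZWNBSP for a Word Joiner.
  return head + body.replace(/\uFEFF/g, WORD_JOINER);
}

// The date range from this report, with U+FEFF around the en dash.
const input = '1830\uFEFF\u2013\uFEFF1832';
const normalized = normalizeZwnbsp(input);
```

The string length is unchanged because each U+FEFF is replaced by exactly one U+2060, so character offsets into the original text should still line up. This only sidesteps the problem for callers; it does not change how the tokenizer itself treats U+FEFF.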

Note: I found this bug while debugging an issue in another project which uses Wink, and I donʼt know much about Wink myself. I expect this is enough information to identify the issue, but if not then I might need extra help to provide more useful information.
