Ambiguous tokenization for AnchorText #139

Open
@jklaise

Description

We use spaCy to produce AnchorText explanations, but the tokenization does not always line up with the metadata we return (e.g. the positions of words in the anchor under exp['raw']['features']). This happens when spaCy splits a contraction into several tokens (e.g. doesn't -> [does, n't]), which shifts the positions that follow by one. We need to provide an unambiguous representation of which tokens are in the anchor, e.g. a list of tuples of indices (ix1, ix2) denoting the start and end of each word in the anchor with respect to the original text.

Also, commas currently seem to be treated as separate tokens too.
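
To make the proposal concrete, here is a minimal sketch of the (ix1, ix2) representation. It deliberately does not use spaCy: the regex below is a toy stand-in that only imitates the two behaviours described above (splitting n't off contractions and emitting punctuation as separate tokens), and the names tokenize_with_spans and anchor_spans are hypothetical, not part of the library. With spaCy itself, the same offsets could presumably be derived from each token's character offset (token.idx) and its length.

```python
import re

# Toy stand-in for a tokenizer (an assumption, NOT spaCy's rules):
#   n't            -> the contraction suffix as its own token
#   \w+?(?=n't)    -> the word stem immediately before an n't suffix
#   \w+            -> any other word
#   [^\w\s]        -> a single punctuation character
TOKEN_RE = re.compile(r"n't|\w+?(?=n't)|\w+|[^\w\s]")


def tokenize_with_spans(text):
    """Return (token, start, end) triples; start/end are character offsets in text."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def anchor_spans(text, anchor_token_indices):
    """Map token indices of the anchor to (ix1, ix2) character spans in the original text."""
    wanted = set(anchor_token_indices)
    return [
        (start, end)
        for i, (_, start, end) in enumerate(tokenize_with_spans(text))
        if i in wanted
    ]


if __name__ == "__main__":
    text = "This doesn't work, sadly"
    print([tok for tok, _, _ in tokenize_with_spans(text)])
    # ['This', 'does', "n't", 'work', ',', 'sadly']

    # "work" is token 3 here, but word 2 if the text is split on whitespace:
    # this is the off-by-one. The character span is unambiguous either way.
    spans = anchor_spans(text, [3])
    print(spans)                           # [(13, 17)]
    print([text[a:b] for a, b in spans])   # ['work']
```

Because the spans index into the original string, a consumer can recover the anchor words with plain slicing, regardless of how contractions or commas were tokenized.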
