Ambiguous tokenization for AnchorText

We use spaCy to produce AnchorText explanations. However, the tokenization does not always line up with some of the metadata we return (e.g. positions of words in the anchor under `exp['raw']['features']`). When spaCy splits a contraction into several tokens (e.g. `doesn't` -> `[does, n't]`), the returned positions end up off by one. We need to provide an unambiguous representation of which tokens are in the anchor, e.g. a list of tuples of indices `(ix1, ix2)` denoting the start and end of each word in the anchor with respect to the original text.
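To make the proposal concrete, here is a minimal sketch of how token positions could be mapped to `(ix1, ix2)` character spans. It is not the library's implementation: the function name `anchor_char_spans` and the example sentence are hypothetical, and the token list is written by hand to mimic the tokenization described above (with spaCy, the text and start offset of each token should be available as `token.text` and `token.idx`).

```python
from typing import List, Tuple

def anchor_char_spans(
    tokens: List[Tuple[str, int]], anchor_positions: List[int]
) -> List[Tuple[int, int]]:
    """Map token positions to (start, end) character offsets in the original text.

    `tokens` holds (token_text, start_offset) pairs; `end` is exclusive.
    """
    spans = []
    for pos in anchor_positions:
        tok_text, start = tokens[pos]
        spans.append((start, start + len(tok_text)))
    return spans

text = "This doesn't work, sadly"

# Hand-written tokens mimicking the behaviour described in this issue:
# the contraction is split in two and the comma is its own token.
tokens = [("This", 0), ("does", 5), ("n't", 9), ("work", 13), (",", 17), ("sadly", 19)]

# Suppose the anchor is the token at position 3 ("work").
spans = anchor_char_spans(tokens, [3])
anchor_words = [text[start:end] for start, end in spans]

# The same position applied to a whitespace split points at a different word.
naive_word = text.split()[3]
```

Here `spans` identifies `work` directly in the original string, whereas indexing `text.split()` with the same position lands on `sadly`, which is the ambiguity described above.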

Also, commas currently seem to be emitted as separate tokens too.

Ambiguous tokenization for AnchorText #139
