Description
We use spaCy to produce AnchorText explanations. However, the tokenization sometimes does not line up with some of the metadata we return (e.g. the positions of words in the anchor under `exp['raw']['features']`) in cases where spaCy creates several tokens from a contracted form (e.g. `doesn't` -> `[does, n't]` results in an off-by-one error). We need to provide an unambiguous representation of which tokens are in the anchor, e.g. a list of tuples of indices `(ix1, ix2)` denoting the start and end of each word in the anchor with respect to the original text.
Also, commas currently seem to be tokenized too.
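
A rough sketch of what the proposed `(ix1, ix2)` representation could look like. This is not the current implementation: `TOKEN_RE`, `tokens_with_spans` and `anchor_spans` are hypothetical names, the regex is only a simplified stand-in for spaCy's tokenizer (it splits `n't` and punctuation, nothing more), and treating `ix2` as an exclusive end offset is an assumption. With spaCy itself, the same offsets should be obtainable from each token's `idx` attribute plus the token length.

```python
import re

# Simplified stand-in for a tokenizer that splits contractions and punctuation.
# This is NOT spaCy's actual rule set, just enough to reproduce the mismatch.
TOKEN_RE = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")


def tokens_with_spans(text):
    """Return (token, start, end) triples; end is exclusive (assumed convention)."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def anchor_spans(text, anchor_token_positions):
    """Map token positions in the anchor to character spans in the original text."""
    toks = tokens_with_spans(text)
    return [(toks[i][1], toks[i][2]) for i in anchor_token_positions]


text = "This doesn't work, sadly"

# Whitespace splitting and the tokenizer disagree on the position of "work".
whitespace_words = text.split()
tokens = [t for t, _, _ in tokens_with_spans(text)]

# Character spans are unambiguous regardless of how the text was tokenized.
spans = anchor_spans(text, [3])
anchor_words = [text[start:end] for start, end in spans]
```

Here position 3 in the token list is `work`, whereas position 3 after a whitespace split is `sadly`; the character span points at the same substring either way, so consumers of the explanation would not need to know which tokenizer was used.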