Description
Context
I'm looking to get the original token positions of keyterms when performing keyterm extraction with, e.g., TextRank, though this can apply to the other extractors. Example:

```python
>>> doc = nlp("I survived because the fire inside me burned brighter than the fire around me.")
>>> textrank(doc, return_positions=True)
[("fire", 0.1, 4, 4)]
```
This would provide a mapping back to the original spaCy `Doc` that can be used to find the keyterm regardless of how it was normalized during keyterm extraction.
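To illustrate the re-indexing this enables, here is a minimal sketch using a plain token list in place of a spaCy `Doc` (the `(term, score, start, end)` tuple shape and inclusive end index are part of the proposal, not an existing API):

```python
# Hypothetical output of textrank(doc, return_positions=True):
# (term, score, start_token_index, end_token_index), with end inclusive.
tokens = "I survived because the fire inside me burned brighter than the fire around me .".split()
results = [("fire", 0.1, 4, 4)]

for term, score, start, end in results:
    # Slice straight back into the token sequence -- no rescanning needed.
    span = tokens[start : end + 1]
    print(term, score, span)  # fire 0.1 ['fire']
```

With a real `Doc`, the same indices would give `doc[start : end + 1]` as a `Span`.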
Proposed solution
The main solution I envision would be to add a new keyword argument to the extractors, such as `return_positions` (defaulting to `False` so as not to break existing workflows), that would return the original token indices alongside each term string and its score. Since textaCy has access to the `Token` positions while considering candidate terms (before they're normalized to strings), it would be a matter of passing these indices along when returning the candidate tuples.
The main issue with this approach is that it would clutter the existing implementation with effectively two return types: the current `List[Tuple[str, float]]` if `return_positions` is `False`, or `List[Tuple[str, float, int, int]]` if `True`. Any preprocessing functions (such as `_get_candidates` for TextRank) would have to be modified to pass these indices along, and any code calling these functions would then have to handle both return types.
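To make the typing concern concrete, here is a toy sketch of what such a dual return type looks like (all names here are hypothetical stand-ins, not textaCy's actual internals):

```python
from typing import List, Tuple, Union

def extract_terms(
    tokens: List[str],
    return_positions: bool = False,
) -> Union[List[Tuple[str, float]], List[Tuple[str, float, int, int]]]:
    """Toy extractor: candidate selection keeps (term, score, start, end)
    internally, since token indices are still known at this stage."""
    candidates = [
        (tok.lower(), round(1.0 / (i + 1), 2), i, i)
        for i, tok in enumerate(tokens)
        if len(tok) > 3  # stand-in for real candidate filtering
    ]
    if return_positions:
        return candidates  # List[Tuple[str, float, int, int]]
    # Drop the indices to preserve the current return type.
    return [(term, score) for term, score, *_ in candidates]
```

Callers now have to branch on (or `typing.overload` over) the flag, which is the clutter described above.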
Alternative solutions?
One solution I considered (outside of the keyterm extraction functions) was to rescan the document for each keyterm, but that requires an additional pass through the `Doc`, whereas textaCy already has the original token positions, which would let a client re-index into the `Doc` immediately.
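For comparison, the rescanning alternative would look roughly like this (a plain-Python sketch over a token list rather than a spaCy `Doc`). Besides the extra pass, note that rescanning cannot disambiguate repeated terms, such as "fire" in the example sentence:

```python
def find_term_positions(tokens, term):
    """Rescan a token sequence for every occurrence of a keyterm.
    This is the alternative considered above: it costs an extra pass and
    cannot tell which occurrence(s) the extractor actually scored."""
    term_tokens = term.lower().split()
    n = len(term_tokens)
    hits = []
    for i in range(len(tokens) - n + 1):
        if [t.lower() for t in tokens[i : i + n]] == term_tokens:
            hits.append((i, i + n - 1))  # inclusive token span
    return hits

sentence = "I survived because the fire inside me burned brighter than the fire around me ."
print(find_term_positions(sentence.split(), "fire"))  # [(4, 4), (11, 11)]
```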
I have some proof-of-concept code adding this feature for TextRank in a fork, and would be willing to extend this feature to the other extractors if this sounds like a useful idea!