Return keyterm positions in original document when performing keyterm extraction #323

Open
@ChrisJBlake

context

I'm looking to get the original token positions of keyterms when performing keyterm extraction with, e.g., TextRank, though this applies equally to the other extractors. Example:

>>> doc = nlp("I survived because the fire inside me burned brighter than the fire around me.")
>>> textrank(doc, return_positions=True)
[("fire", 0.1, 4, 4)]

This would provide a mapping back to the original spaCy doc that can be used to find the keyterm regardless of how it was normalized during keyterm extraction.

proposed solution

The main solution I envision is to add a new keyword argument to the extractors, such as return_positions (defaulting to False so existing workflows don't break), that returns the original token indices alongside each term's string and score. Since textacy has access to the Token positions while considering candidate terms (before they're normalized to strings), it would be a matter of passing these indices along when returning the set of candidate tuples.
The main drawback of this approach is that it clutters the existing implementation with effectively two return types: the current List[Tuple[str, float]] if return_positions is False, or List[Tuple[str, float, int, int]] if True. Any preprocessing functions (such as _get_candidates for TextRank) would have to be modified to pass these indices along, and any code calling these functions would have to handle both return types.
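
To make the two return shapes concrete, here is a minimal sketch of the idea. It is not textacy's TextRank: toy_extract, STOP_WORDS, and the frequency-based score are all hypothetical stand-ins, and the document is modeled as a plain list of token strings rather than a spaCy Doc. It only shows candidates keeping their token index through normalization, with start and end as inclusive indices of the first occurrence (as in the example above).

```python
from collections import Counter
from typing import List, Tuple, Union

Keyterm = Tuple[str, float]
KeytermWithPos = Tuple[str, float, int, int]

# Hypothetical stop list, just large enough for the example sentence.
STOP_WORDS = {"i", "the", "me", "than", "because", "inside", "around"}


def toy_extract(
    tokens: List[str], return_positions: bool = False
) -> Union[List[Keyterm], List[KeytermWithPos]]:
    """Toy extractor: scores single-token candidates by relative frequency."""
    # Each candidate keeps its token index next to its normalized form.
    candidates = [
        (tok.lower(), i)
        for i, tok in enumerate(tokens)
        if tok.isalpha() and tok.lower() not in STOP_WORDS
    ]
    counts = Counter(norm for norm, _ in candidates)
    total = sum(counts.values())

    # Remember only the first position at which each normalized term occurs.
    first_pos = {}
    for norm, i in candidates:
        first_pos.setdefault(norm, i)

    # Highest score first; ties are broken by position in the document.
    ranked = sorted(first_pos.items(), key=lambda kv: (-counts[kv[0]], kv[1]))
    if return_positions:
        return [(norm, counts[norm] / total, i, i) for norm, i in ranked]
    return [(norm, counts[norm] / total) for norm, _ in ranked]


tokens = [
    "I", "survived", "because", "the", "fire", "inside", "me", "burned",
    "brighter", "than", "the", "fire", "around", "me", ".",
]
print(toy_extract(tokens)[0])
print(toy_extract(tokens, return_positions=True)[0])
```

Callers that opt in can index straight back into the token sequence with the returned start and end, while callers that don't are unaffected. The cost is the Union return type shown in the signature.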

alternative solutions?

One solution I considered (outside of the keyterm extraction functions) was to rescan the document for each keyterm. That requires an additional pass through the Doc, whereas textacy already has the original token positions during extraction and could hand them to the client, who could then index into the Doc immediately.
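
For comparison, the rescan workaround looks roughly like the sketch below. find_positions is a hypothetical helper, the document is again modeled as a list of token strings, and lowercasing stands in for whatever normalization the extractor applied. That last assumption is the weak point: the rescan only works if the client can reproduce the extractor's normalization exactly.

```python
from typing import Callable, List, Tuple


def find_positions(
    tokens: List[str],
    keyterm: str,
    normalize: Callable[[str], str] = str.lower,
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) token spans whose normalized form matches keyterm."""
    target = keyterm.split()
    n = len(target)
    if n == 0:
        return []
    # One full pass over the document per keyterm.
    normalized = [normalize(tok) for tok in tokens]
    return [
        (i, i + n - 1)
        for i in range(len(normalized) - n + 1)
        if normalized[i:i + n] == target
    ]


tokens = [
    "I", "survived", "because", "the", "fire", "inside", "me", "burned",
    "brighter", "than", "the", "fire", "around", "me", ".",
]
print(find_positions(tokens, "fire"))
```

Every keyterm costs another scan of the document, and the normalization step is duplicated on the client side.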

I have some proof-of-concept code adding this feature for TextRank in a fork, and would be willing to extend this feature to the other extractors if this sounds like a useful idea!
