Description
Context
I'm looking to get the original token positions of keyterms when performing keyterm extraction with, e.g., TextRank, though this can apply to the other extractors. Example:

```python
>>> doc = nlp("I survived because the fire inside me burned brighter than the fire around me.")
>>> textrank(doc, return_positions=True)
[("fire", 0.1, 4, 4)]
```
This would provide a mapping back to the original spaCy `Doc` that can be used to find the keyterm regardless of how it was normalized during keyterm extraction.
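To illustrate the re-indexing this enables, here is a minimal sketch using a plain token list in place of a spaCy `Doc` (the `(term, score, start, end)` tuple shape and inclusive end index are part of the proposal, not an existing API):

```python
# Hypothetical output of textrank(doc, return_positions=True):
# (term, score, start_token_index, end_token_index), with end inclusive.
tokens = "I survived because the fire inside me burned brighter than the fire around me .".split()
results = [("fire", 0.1, 4, 4)]

for term, score, start, end in results:
    # Slice straight back into the token sequence -- no rescanning needed.
    span = tokens[start : end + 1]
    print(term, score, span)  # fire 0.1 ['fire']
```

With a real `Doc`, the same indices would give `doc[start : end + 1]` as a `Span`.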
Proposed solution
The main solution I envision would be to add a new keyword argument to the extractors, such as `return_positions` (defaulting to `False` so as not to break existing workflows), that would return the original token indices alongside each term string and its score. Since textaCy has access to the `Token` positions while considering candidate terms (before they're normalized to strings), it would be a matter of passing these indices along when returning the candidate tuples.
The main issue with this approach is that it would clutter the existing implementation with effectively two return types: the current `List[Tuple[str, float]]` if `return_positions` is `False`, or `List[Tuple[str, float, int, int]]` if `True`. Any preprocessing functions (such as `_get_candidates` for TextRank) would have to be modified to pass these indices along, and any code calling these functions would then have to handle both return types.
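To make the typing concern concrete, here is a toy sketch of what such a dual return type looks like (all names here are hypothetical stand-ins, not textaCy's actual internals):

```python
from typing import List, Tuple, Union

def extract_terms(
    tokens: List[str],
    return_positions: bool = False,
) -> Union[List[Tuple[str, float]], List[Tuple[str, float, int, int]]]:
    """Toy extractor: candidate selection keeps (term, score, start, end)
    internally, since token indices are still known at this stage."""
    candidates = [
        (tok.lower(), round(1.0 / (i + 1), 2), i, i)
        for i, tok in enumerate(tokens)
        if len(tok) > 3  # stand-in for real candidate filtering
    ]
    if return_positions:
        return candidates  # List[Tuple[str, float, int, int]]
    # Drop the indices to preserve the current return type.
    return [(term, score) for term, score, *_ in candidates]
```

Callers now have to branch on (or `typing.overload` over) the flag, which is the clutter described above.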
Alternative solutions?
One solution I considered (outside of the keyterm extraction functions) was to rescan the document for each keyterm, but that requires an additional pass through the `Doc`, whereas textaCy already has the original token positions, which would let a client re-index into the `Doc` immediately.
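For comparison, the rescanning alternative would look roughly like this (a plain-Python sketch over a token list rather than a spaCy `Doc`). Besides the extra pass, note that rescanning cannot disambiguate repeated terms, such as "fire" in the example sentence:

```python
def find_term_positions(tokens, term):
    """Rescan a token sequence for every occurrence of a keyterm.
    This is the alternative considered above: it costs an extra pass and
    cannot tell which occurrence(s) the extractor actually scored."""
    term_tokens = term.lower().split()
    n = len(term_tokens)
    hits = []
    for i in range(len(tokens) - n + 1):
        if [t.lower() for t in tokens[i : i + n]] == term_tokens:
            hits.append((i, i + n - 1))  # inclusive token span
    return hits

sentence = "I survived because the fire inside me burned brighter than the fire around me ."
print(find_term_positions(sentence.split(), "fire"))  # [(4, 4), (11, 11)]
```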
I have some proof-of-concept code adding this feature for TextRank in a fork, and would be willing to extend this feature to the other extractors if this sounds like a useful idea!