About CanopyIndex implementation #1125

@lmores

I am curious about the current implementation of the CanopyIndex class in predicates.py.
Given a doc (a string) and the doc_id that identifies it inside the inverted index, the main steps of __call__(self, record, **kwargs) (line 214 onwards) seem to be:

  • If doc_id is already a key of self.canopy, set block_key to its associated value (even if self.canopy[doc_id] is None).
  • Otherwise, retrieve from the index the list of the other docs (member_ids) that are close enough to doc and, for each of them, set self.canopy[member_id] = doc_id unless self.canopy[member_id] is already set (in which case it is left unchanged). Then:
    • if at least one other doc was found (i.e., len(member_ids) > 0), point the current doc to itself inside the canopy (self.canopy[doc_id] = doc_id);
    • otherwise, set self.canopy[doc_id] = None (meaning, I suppose, that the current doc is "isolated" with respect to all the other docs inside the index).
  • Finally, if block_key is None, there is nothing close to the current doc (except the doc itself); otherwise, the id of the "local representative" of the region where doc_id lies is returned.
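
To make sure I am reading the steps above correctly, here is a minimal sketch of the logic as I understand it. It is not the code from predicates.py: the class name CanopySketch and the search callable (a stand-in for the inverted-index lookup that returns the ids of the other docs close enough to a given doc) are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, Hashable, List, Optional


class CanopySketch:
    """Toy restatement of the steps listed above (not the library's code)."""

    def __init__(self, search: Callable[[Hashable], List[Hashable]]) -> None:
        # Hypothetical stand-in for the index lookup: maps a doc_id to the
        # ids of the *other* docs that are close enough to it.
        self.search = search
        # Maps a doc_id to the id of its "local representative", or to None
        # when the doc turned out to be isolated.
        self.canopy: Dict[Hashable, Optional[Hashable]] = {}

    def block_key(self, doc_id: Hashable) -> Optional[Hashable]:
        # Step 1: reuse a previously stored value, even if it is None.
        if doc_id in self.canopy:
            return self.canopy[doc_id]

        # Step 2: claim every neighbour that has no representative yet.
        member_ids = self.search(doc_id)
        for member_id in member_ids:
            if member_id not in self.canopy:
                self.canopy[member_id] = doc_id

        # Step 2a/2b: the doc represents itself if it has neighbours,
        # otherwise it is recorded as isolated.
        self.canopy[doc_id] = doc_id if member_ids else None
        return self.canopy[doc_id]


# Toy neighbourhoods: docs 1, 2 and 3 are mutually close, doc 4 is isolated.
neighbours = {1: [2, 3], 2: [1, 3], 3: [1, 2], 4: []}
sketch = CanopySketch(lambda doc_id: neighbours[doc_id])
keys = [sketch.block_key(doc_id) for doc_id in (1, 2, 3, 4)]
```

In this toy run the first doc processed becomes the representative of its neighbours, so the result depends on the order in which docs are seen.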

I have two questions/doubts:

  1. If my understanding of the current implementation is correct, I do not see how it relates to the "traditional" concept of "canopy" as it appears in the literature (see Wikipedia).
  2. Does the current implementation of CanopyIndex differ significantly from the behaviour of SearchIndex? A SearchIndex returns a list of all the doc_ids near the current doc, whereas the current CanopyIndex returns only the id of the "local representative" of the same set of docs, which carries (more or less) the same amount of information. Am I wrong?
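
For comparison, this is roughly what I mean by the "traditional" canopy construction, following the description on Wikipedia with a loose and a tight threshold. It is only a sketch under my own assumptions: the function name classic_canopies, its parameters, and the one-dimensional toy data are hypothetical and have nothing to do with the library's API.

```python
from typing import Callable, Dict, List, Sequence, Set


def classic_canopies(
    points: Sequence[float],
    distance: Callable[[float, float], float],
    loose: float,
    tight: float,
) -> Dict[int, Set[int]]:
    """Return a mapping from canopy centre index to member indices."""
    if tight > loose:
        raise ValueError("the tight threshold must not exceed the loose one")

    remaining: List[int] = list(range(len(points)))
    canopies: Dict[int, Set[int]] = {}
    while remaining:
        # Remove a point from the set and start a new canopy around it.
        center = remaining.pop(0)
        members = {center}
        still_remaining: List[int] = []
        for i in remaining:
            d = distance(points[center], points[i])
            if d < loose:
                # Within the loose threshold: the point joins this canopy.
                members.add(i)
            if d >= tight:
                # Not within the tight threshold: the point stays in the set
                # and may later join or start other canopies.
                still_remaining.append(i)
        canopies[center] = members
        remaining = still_remaining
    return canopies


toy_points = [0.0, 1.0, 2.0, 10.0]
toy_canopies = classic_canopies(
    toy_points, lambda a, b: abs(a - b), loose=2.5, tight=1.5
)
```

Here the point at index 2 ends up in two canopies (the one centred on index 0 and its own), whereas in my reading of CanopyIndex each doc gets at most one block key.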

Thank you!
