I am curious about the current implementation of the CanopyIndex class in predicates.py.
Given a doc (a string) and the doc_id used to identify the current doc inside the inverted index, the main steps inside __call__(self, record, **kwargs) (line 214 and subsequent ones) seem to be:
- if the key doc_id is already inside self.canopy, set block_key to its associated value (even if self.canopy[doc_id] is None),
- otherwise retrieve from the index the list of the other docs (member_ids) that are close enough to doc and, for each of them, set self.canopy[member_id] = doc_id if self.canopy[member_id] is not already set (or leave it unchanged otherwise).
- if we found at least one other doc (i.e., len(member_ids) > 0), point the current doc to itself inside the canopy (self.canopy[doc_id] = doc_id),
- otherwise set self.canopy[doc_id] = None (meaning that the current doc is "isolated" with respect to all the other docs inside the index, I suppose).
- Finally, if block_key is None there is nothing close to the current doc (except the doc itself); otherwise the id of the "local representative" of the region where doc_id lies is returned.
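To make sure I am reading the steps correctly, here is a minimal sketch of the logic as I understand it. It is not the code from predicates.py: ToyIndex and CanopySketch are hypothetical names, the index is replaced by a precomputed neighbour table, and I return the block key directly rather than whatever the real method returns.

```python
class ToyIndex:
    """Hypothetical stand-in for the inverted index (assumption, not dedupe's API)."""

    def __init__(self, neighbours):
        # neighbours: {doc_id: [ids of the other docs considered close enough]}
        self.neighbours = neighbours

    def search(self, doc_id):
        return self.neighbours.get(doc_id, [])


class CanopySketch:
    """Sketch of the steps described above, under my reading of them."""

    def __init__(self, index):
        self.index = index
        self.canopy = {}

    def __call__(self, doc_id):
        if doc_id in self.canopy:
            # Already assigned by an earlier call (the value may be None).
            return self.canopy[doc_id]

        member_ids = self.index.search(doc_id)
        for member_id in member_ids:
            # Only assign docs that do not have a representative yet.
            if member_id not in self.canopy:
                self.canopy[member_id] = doc_id

        # The current doc represents itself if it has neighbours,
        # otherwise it is marked as isolated.
        self.canopy[doc_id] = doc_id if member_ids else None
        return self.canopy[doc_id]


index = ToyIndex({1: [2, 3], 2: [1, 3], 3: [1, 2], 4: []})
predicate = CanopySketch(index)
keys = [predicate(doc_id) for doc_id in (1, 2, 3, 4)]
```

With this toy data, docs 1, 2 and 3 all end up with block key 1, and doc 4 gets None. Note that the result depends on the order in which the docs are visited, since the first doc seen in a region becomes its representative.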
I have two questions/doubts:
- If my understanding of the current implementation is correct, I do not see how it relates to the "traditional" concept of a "canopy" as it appears in the literature (see Wikipedia).
- Does the current implementation of CanopyIndex significantly differ from the behaviour of SearchIndex? A SearchIndex returns a list of all the doc_ids near the current doc, whereas the current CanopyIndex just returns the id of the "local representative" of the same set of docs, which carries (more or less) the same amount of information. Am I wrong?
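To make the second question concrete, here is a toy comparison of the two return styles as I understand them. Everything in it is an assumption for illustration: NEIGHBOURS is a made-up neighbour table, and search_style and canopy_style are hypothetical helpers, not functions from the library.

```python
# Hypothetical neighbour lists (each doc's neighbours, excluding itself):
# doc 1 is close to 2, doc 2 is close to 1 and 3, doc 3 is close to 2.
NEIGHBOURS = {1: [2], 2: [1, 3], 3: [2]}


def search_style(doc_id):
    """Return every nearby doc id, as I understand SearchIndex to do."""
    return sorted(NEIGHBOURS[doc_id])


def canopy_style(doc_ids):
    """Return one representative id per doc, following the steps listed earlier."""
    canopy = {}
    for doc_id in doc_ids:
        if doc_id in canopy:
            continue  # already assigned by an earlier doc
        members = NEIGHBOURS[doc_id]
        for member_id in members:
            # Keep the first representative a doc was assigned to.
            canopy.setdefault(member_id, doc_id)
        canopy[doc_id] = doc_id if members else None
    return canopy


search_keys = {doc_id: search_style(doc_id) for doc_id in NEIGHBOURS}
canopy_keys = canopy_style([1, 2, 3])
```

In this toy case the search-style output lists docs 2 and 3 as neighbours of each other, while the canopy-style output gives them different representatives (1 and 3) when the docs are visited in the order 1, 2, 3. I am not sure whether this matches what the real classes do, which is part of what I am asking.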
Thank you!