IndexedSeq instead of Iterator in NearestNeighborIterator [Priority Queue Serialization Error]#7
oscaroboto wants to merge 3 commits into
Conversation
LGTM, but does this work? I uncommented the test in
Hi,
Hi @uzadude, very glad to hear that this library helped. I am no longer at LinkedIn, though, so I don't have permission to change collaborators. Also, given it is under
Well, at my company (PayPal), we work with our private accounts in the public GitHub space, so we would still have permissions in the future. Maybe you still know someone there who can add you back as a collaborator with your current user account?
There seems to have been a concurrency issue when returning the Iterator from the NearestNeighborIterator class in LSHNearestNeighborSearchModel.scala.
To fix it, Iterator[(ItemId, Iterator[ItemIdDistancePair])] was changed to Iterator[(ItemId, IndexedSeq[ItemIdDistancePair])]. The inner iterator is not serialized, which causes a problem with the groupByKey in the getAllNearestNeighbors function. What I think was happening is that, during the groupByKey, the inner iterator pointed to a location in memory on a particular node; once that iterator was copied to another node, it pointed to an arbitrary position in memory rather than the expected one.
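To illustrate the difference between the two return types, here is a minimal plain-Scala sketch (no Spark involved). The `ItemId` and `ItemIdDistancePair` aliases and the `lazyNeighbors` / `materialized` helpers are hypothetical stand-ins, not the library's actual definitions. It only shows that an inner Iterator is single-pass state, while an IndexedSeq is a fully materialized value that can be traversed repeatedly and copied safely.

```scala
object IteratorVsIndexedSeq {
  // Hypothetical stand-ins for the library's types (assumptions for this sketch).
  type ItemId = Long
  type ItemIdDistancePair = (ItemId, Double)

  // Old shape: each item carries a lazy, single-pass inner iterator.
  def lazyNeighbors(): Iterator[(ItemId, Iterator[ItemIdDistancePair])] =
    Iterator(1L -> Iterator((2L, 0.1), (3L, 0.4)))

  // New shape: the inner iterator is drained into an IndexedSeq up front.
  def materialized(): Iterator[(ItemId, IndexedSeq[ItemIdDistancePair])] =
    lazyNeighbors().map { case (id, neighbors) => (id, neighbors.toIndexedSeq) }

  def main(args: Array[String]): Unit = {
    val (_, inner) = lazyNeighbors().next()
    val firstPass = inner.toList
    val secondPass = inner.toList // empty: the iterator is already exhausted
    println(s"iterator, first pass:  $firstPass")
    println(s"iterator, second pass: $secondPass")

    val (_, seq) = materialized().next()
    // An IndexedSeq holds its elements, so every traversal sees the same data.
    println(s"indexed seq, first pass:  ${seq.toList}")
    println(s"indexed seq, second pass: ${seq.toList}")
  }
}
```

Whether a given Iterator implementation can be serialized at all depends on the Scala version and the concrete iterator class, so materializing before the shuffle avoids relying on that.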
As a bonus, I also rewrote the groupByKey as an aggregateByKey, since aggregateByKey combines values within each partition before the shuffle and should therefore be more efficient here. I have not done any benchmarking, but in my experience aggregateByKey has been more efficient than groupByKey.
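For reference, a rough sketch of what replacing groupByKey with aggregateByKey can look like. This is not the code in this PR: the type aliases, the `mergeTopK` helper, and the sample data are all made up for illustration, and it assumes Spark is on the classpath and runs in local mode.

```scala
import org.apache.spark.sql.SparkSession

object AggregateByKeySketch {
  // Hypothetical stand-ins for the library's types (assumptions for this sketch).
  type ItemId = Long
  type ItemIdDistancePair = (ItemId, Double)
  type Neighbors = IndexedSeq[ItemIdDistancePair]

  // Hypothetical helper: merge two candidate lists and keep the k closest.
  def mergeTopK(k: Int): (Neighbors, Neighbors) => Neighbors =
    (a, b) => (a ++ b).sortBy(_._2).take(k)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("aggregateByKey-sketch")
      .getOrCreate()
    val sc = spark.sparkContext
    val k = 2

    // Partial candidate lists per item, as they might arrive from different buckets.
    val partial = sc.parallelize(Seq[(ItemId, Neighbors)](
      1L -> IndexedSeq((2L, 0.3), (3L, 0.1)),
      1L -> IndexedSeq((4L, 0.2)),
      2L -> IndexedSeq((1L, 0.3))
    ), numSlices = 2)

    // seqOp folds values into a per-partition accumulator; combOp merges accumulators
    // across partitions. Both keep at most k pairs, so the shuffle carries at most
    // k pairs per key per partition, instead of every candidate as with groupByKey.
    val merged = partial.aggregateByKey(IndexedSeq.empty[ItemIdDistancePair])(
      mergeTopK(k), mergeTopK(k))

    merged.collect().foreach(println)
    spark.stop()
  }
}
```

With this sample data, item 1 ends up with its two closest candidates, (3, 0.1) and (4, 0.2), and item 2 keeps its single candidate. The same function works as both seqOp and combOp here only because the accumulator and the values share one type.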
Code compiled with ./gradlew build and passed all tests.