Update Document.__eq__ to intelligently compare floats #8412

sjrl · 2024-09-26T11:29:27Z

Unsure if this would classify as a bug or a feature, but the Document equality method does a direct dict comparison between two documents.

I think this is potentially suboptimal when the score value is set in the Document. Two Documents can match in every other respect while their score values differ only slightly due to floating-point imprecision. For example,

from haystack import Document
doc1 = Document(content="doc1", id="1", score=0.123456782)
doc2 = Document(content="doc1", id="1", score=0.12345678)
doc1 == doc2
# False

To me this feels misleading since I'd normally say these two Documents should be considered the same.
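
One way to picture the requested behavior is the minimal sketch below. It does not use haystack itself: `Doc` is a hypothetical stand-in dataclass with only three fields, and `docs_close` and its `rel_tol` default are assumptions made for illustration, not part of the library.

```python
import math
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for haystack's Document (three fields only)."""
    content: str
    id: str
    score: Optional[float] = None


def docs_close(a: Doc, b: Doc, rel_tol: float = 1e-6) -> bool:
    """Compare every field exactly, except score, which uses a relative tolerance."""
    da, db = asdict(a), asdict(b)
    sa, sb = da.pop("score"), db.pop("score")
    if da != db:
        return False
    if sa is None or sb is None:
        # A missing score only matches another missing score.
        return sa is sb
    return math.isclose(sa, sb, rel_tol=rel_tol)


doc1 = Doc(content="doc1", id="1", score=0.123456782)
doc2 = Doc(content="doc1", id="1", score=0.12345678)
print(doc1 == doc2)            # False: plain field-by-field equality
print(docs_close(doc1, doc2))  # True: scores agree within rel_tol
```

The tolerance value is an arbitrary choice here; picking one is exactly the open question.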


davidsbatista · 2024-09-26T13:01:19Z

I think the score should be excluded from this comparison: it is not part of the document itself but is associated with the retrieval process.
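
As a rough sketch of what excluding the score could look like, assuming a hypothetical stand-in dataclass `Doc` rather than haystack's real Document, and a hypothetical helper `same_ignoring_score`:

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for haystack's Document (three fields only)."""
    content: str
    id: str
    score: Optional[float] = None


def same_ignoring_score(a: Doc, b: Doc) -> bool:
    """Compare two documents on every field except the retrieval score."""
    da, db = asdict(a), asdict(b)
    da.pop("score", None)
    db.pop("score", None)
    return da == db


retrieved_a = Doc(content="doc1", id="1", score=0.91)
retrieved_b = Doc(content="doc1", id="1", score=0.17)
print(same_ignoring_score(retrieved_a, retrieved_b))  # True
```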

silvanocerza · 2024-09-30T16:18:29Z

I'm really unsure about this. 🤔

This would also be a breaking change, and not an easy one to handle. I'd have to think about how we could follow the deprecation policy for this kind of change.

We'd also need to decide on the tolerance, and that could be a huge debate in itself. 😅

Also what about the embedding? Should we compare them with a tolerance too?
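
To make the embedding question concrete, here is a minimal sketch of an element-wise tolerant comparison, assuming embeddings are plain sequences of floats. The helper name `embeddings_close` and both tolerance defaults are assumptions for illustration only.

```python
import math
from typing import Optional, Sequence


def embeddings_close(
    a: Optional[Sequence[float]],
    b: Optional[Sequence[float]],
    rel_tol: float = 1e-6,
    abs_tol: float = 1e-9,
) -> bool:
    """Element-wise comparison of two embeddings with a tolerance."""
    if a is None or b is None:
        # A missing embedding only matches another missing embedding.
        return a is b
    if len(a) != len(b):
        return False
    # abs_tol matters because embedding components are often close to zero,
    # where a purely relative tolerance would never match.
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )


emb1 = [0.1, 0.0, -0.25]
emb2 = [0.1000000001, 1e-12, -0.25]
print(embeddings_close(emb1, emb2))  # True
```

Note that this needs two tolerances instead of one, which widens the debate rather than settling it.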

I think it would be better to leave the current implementation as is; I don't see many benefits in changing it. Also, as @davidsbatista says, most of the time a Document won't have a score set. If one needs to compare Documents by score, I'd expect them to do it explicitly and not just with doc1 == doc2.

Just for reference, the current implementation comes from #6323; before that we were only comparing the id field.

davidsbatista · 2024-10-01T08:07:09Z

In the context of Information Retrieval, my point is that document content and document score derived from a retrieval process are two completely distinct aspects, and it seems that Sebastian needs to compare retrieved documents by content.

julian-risch added the P2 Medium priority, add to the next sprint if no P1 available label Sep 27, 2024

julian-risch assigned silvanocerza Sep 30, 2024

julian-risch mentioned this issue Sep 30, 2024

feat: Add DocumentNDCGEvaluator component #8419

Merged
