Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Document.__eq__ to intelligently compare floats #8412

Open
sjrl opened this issue Sep 26, 2024 · 3 comments
Open

Update Document.__eq__ to intelligently compare floats #8412

sjrl opened this issue Sep 26, 2024 · 3 comments
Assignees
Labels
P2 Medium priority, add to the next sprint if no P1 available

Comments

@sjrl
Copy link
Contributor

sjrl commented Sep 26, 2024

Unsure if this would classify as a bug or a feature, but the Document equality method does a direct dict comparison between two documents.

I think this is potentially sub-optimal in regards to when the score value is set in the Document. It's possible that all other aspects of two Documents match except the score value could differ only slightly due to float imprecision. For example,

from haystack import Document
doc1 = Document(content="doc1", id="1", score=0.123456782)
doc2 = Document(content="doc1", id="1", score=0.12345678)
doc1 == doc2
# False

To me this feels misleading since I'd normally say these two Documents should be considered the same.

@davidsbatista
Copy link
Contributor

I think the score should be excluded from this comparison, since the score is not part of the document itself, it's associated with the retrieval process.

@julian-risch julian-risch added the P2 Medium priority, add to the next sprint if no P1 available label Sep 27, 2024
@silvanocerza
Copy link
Contributor

I'm really unsure about this. 🤔

This would also be a breaking change too, not easy to handle either. I'd have to think how we could follow the deprecation policy for this kind of changes.

We'd also need to decide on the tolerance too, and that could be a huge debate on itself. 😅

Also what about the embedding? Should we compare them with a tolerance too?

I think it would be better to leave the current implementation as is, I don't see many benefits to change this. Also as @davidsbatista says most of the times Document won't have a score set. If one needs to compare Documents by score I'd expect they would do it explicitly and not just with doc1 == doc2.

Just for reference this current implementation comes from #6323, before that we were just comparing the id field.

@davidsbatista
Copy link
Contributor

In the context of Information Retrieval, my point is that document content and document score derived from a retrieval process are two completely distinct aspects, and it seems that Sebastian needs to compare retrieved documents by content.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 Medium priority, add to the next sprint if no P1 available
Projects
None yet
Development

No branches or pull requests

4 participants