straeter
commented
Feb 10, 2026
- add one_on_one parameter for merge
- add / uncomment some system files to gitignore
jackwildman
left a comment
I think this needs to be better explained to the user, and the bug Sentry spotted also looks genuine. Other than that, good to go.
CallumMcMahon
left a comment
does this need to be a bool? what about "1:1"/"m:1"/"1:m" or enums, to handle the other options?
OK, good point. At the moment we can only have m:1 and 1:1, but in the future we might have 1:m and m:m. I will instead introduce a parameter "relationship_type", which is an enum and defaults to m:1.
@jackwildman is it now clearer with the relationship_type enum?
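For illustration, an enum-valued parameter of the kind described above might look roughly like the following. This is only a sketch, not the code from this PR: the class name `RelationshipType`, its member names, the `left_key` argument, and the simplified `merge` signature are all assumptions; only `right_key`, `relationship_type`, and the "m:1" / "1:1" values come from the discussion.

```python
from enum import Enum


class RelationshipType(str, Enum):
    """Hypothetical enum; member names are assumptions, values come from the thread."""

    MANY_TO_ONE = "m:1"
    ONE_TO_ONE = "1:1"


def merge(
    left_key: str,
    right_key: str,
    relationship_type: RelationshipType = RelationshipType.MANY_TO_ONE,
) -> dict:
    """Simplified stand-in for the real merge call; it only echoes its settings."""
    # Accept plain strings such as "1:1" as well; unknown values raise ValueError.
    relationship_type = RelationshipType(relationship_type)
    return {
        "left_key": left_key,
        "right_key": right_key,
        "relationship_type": relationship_type.value,
    }
```

Compared with a boolean such as one_on_one, a design like this leaves room to add 1:m or m:m members later without changing the signature.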
jackwildman
left a comment
Looks good, and definitely more obvious now. I'm not the keenest on relationship_type, but I'm happy to go with whatever you ultimately go with
    right_key=right_key,
    use_web_search=use_web_search,
    one_on_one=one_on_one,
    relationship_type=relationship_type,
Nittiest of nits: relationship_type is kind of vague, and arguably a bit inaccurate. cardinality is maybe more accurate (or at least where terms like "many-to-many" often come in), but at the expense of not being immediately obvious to anyone who doesn't live in a database. We already have a strategy field on dedupe, so strategy here might be a good choice. If nothing else, it would add some harmony between the operations.
I was thinking about cardinality, but to me this is a very mathematical term that most people have never heard of, whereas relationship should be more familiar to anyone who has worked with SQL. I don't think strategy would be a good choice here: it would rather be understood as the strategy for how to perform the merge, like "first try fuzzy, then web agents". What we want to describe is really an existing relationship between the data / rows, and our algorithm then figures out the best way + strategy to cope with that relationship.
update: I just realized that cardinality has two very different meanings in mathematics and computer science
what I am a bit confused about is that you call relationship_type "inaccurate" -> is it not just a synonym of cardinality? https://www.geeksforgeeks.org/dbms/types-of-relationship-in-database/
Yeah, I think cardinality also suffers the same inaccuracy now that I think about it more. I think, fundamentally, it's about whether we call this a "relationship", as we're not really establishing a relationship but more operating in a manner where x on one side can match and merge with y on the other side, so it is more like a mode or principle of operation than a relationship.
Saying this, I think basically any term can be nitpicked for this, so probably best just to pick relationship_type and move on with it. The key thing is that even if the term isn't immediately obvious, it doesn't take long to figure out from reading the doc string or the enumerated values, and from there it's easy enough to understand
... and the blog articles we will soon write about it :)
update: I just realized that cardinality has two very different meanings in mathematics and computer science
Yet more confusingly, computer science often uses both definitions. I mean, I do see why one might arrive at "cardinality" for n:m relationships in a database if we consider that element x has a set of connections of cardinality m, but it's definitely a bit of an overloaded term.
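To make the two senses concrete, here is a small sketch. The `matches` mapping and the helper name `relationship_cardinality` are made up for illustration and are not part of this PR.

```python
from collections import Counter

# Hypothetical match result: each left row id mapped to the right row id it matched.
matches = {"l1": "r1", "l2": "r1", "l3": "r2"}

# Sense 1 (set theory): cardinality is the number of elements in a set.
right_ids = set(matches.values())
set_cardinality = len(right_ids)


# Sense 2 (data modelling): cardinality describes how many rows on one side
# may pair with a single row on the other side.
def relationship_cardinality(matches: dict) -> str:
    # A dict gives every left row exactly one right row, so the result is
    # either "1:1" or "m:1" depending on whether any right row is shared.
    counts = Counter(matches.values())
    return "1:1" if all(c == 1 for c in counts.values()) else "m:1"
```

Here `set_cardinality` is a plain count, while `relationship_cardinality(matches)` yields a label such as "m:1", which is the sense relationship_type is meant to capture.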