Is your proposal related to a problem?
Currently, the link_type setting is an enum with only three options, which rules out some configurations. For example, I want to take two datasets A and B, dedupe within A, and link A to B, but not dedupe B.
This problem is worse with 3 or more datasets: I also can't, for example, say "link dataset A to B and A to C, and dedupe within B, but don't link B to C directly."
Quibble about names
This may be a pedantic point, but I find the names "link_only", "link_and_dedupe", and "dedupe_only" confusing. They control which pairs of records are compared, not what the resulting clusters look like, which means even "link_only" can lead to some deduplication.
Consider the following datasets with three records each:
graph LR
    subgraph dataset_1
        record_1
        record_2
        record_3
    end
    subgraph dataset_2
        record_a
        record_b
        record_c
    end
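"link_and_dedupe" would evaluate all of these pairs:
[diagram not preserved]
Whereas "link_only" would only evaluate these:
[diagram not preserved]
But with either of these options, if you then run cluster_pairwise_predictions_at_threshold, you can end up doing deduplication, e.g.:
[diagram not preserved]
results in a single cluster, saying all the records in each dataset are duplicates of each other, without directly evaluating any pairs of records from the same dataset.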
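To make the point concrete, here is a minimal sketch in plain Python (no Splink calls). It assumes clustering behaves like connected components over the matched pairs, and the particular set of cross-dataset matches is made up for illustration; it is not the one from the original diagram.

```python
def connected_components(nodes, edges):
    """Group nodes into clusters by following matched pairs (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())


dataset_1 = ["record_1", "record_2", "record_3"]
dataset_2 = ["record_a", "record_b", "record_c"]

# Hypothetical matches above the threshold. Every pair crosses datasets,
# which is all that "link_only" would ever evaluate.
cross_matches = [
    ("record_1", "record_a"),
    ("record_2", "record_a"),
    ("record_3", "record_a"),
    ("record_1", "record_b"),
    ("record_1", "record_c"),
]

clusters = connected_components(dataset_1 + dataset_2, cross_matches)
print(len(clusters), sorted(clusters[0]))
```

All six records land in one cluster, so records within each dataset are reported as duplicates of each other even though no within-dataset pair was ever scored.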
Describe the solution you'd like
Keep link_type as is, but make it just syntactic sugar for a more flexible system.
The more flexible system would be to list the acceptable dataset pairings, e.g. "A-B", "A-C", "B-B" for the example above, or to list the ones to exclude from the set of all combinations.
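A rough sketch of what this could look like, again in plain Python. The function names (pairings_for_link_type, candidate_pairs) and the tuple format for pairings are hypothetical, not an existing Splink API, and the reading of "dedupe_only" as "within-dataset pairs only" is an assumption.

```python
from itertools import combinations, product


def pairings_for_link_type(link_type, dataset_names):
    """Expand a link_type value into the explicit dataset pairings it stands for."""
    within = [(d, d) for d in dataset_names]
    across = list(combinations(dataset_names, 2))
    if link_type == "dedupe_only":
        return within
    if link_type == "link_only":
        return across
    if link_type == "link_and_dedupe":
        return within + across
    raise ValueError(f"unknown link_type: {link_type!r}")


def candidate_pairs(datasets, pairings):
    """Yield record pairs only for the dataset pairings that are allowed."""
    for left, right in pairings:
        if left == right:
            # Dedupe within one dataset: each unordered pair once.
            for a, b in combinations(datasets[left], 2):
                yield (left, a), (left, b)
        else:
            # Link two datasets: every cross-dataset pair.
            for a, b in product(datasets[left], datasets[right]):
                yield (left, a), (right, b)


datasets = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1"]}

# "A-B", "A-C", "B-B": link A to B and A to C, dedupe within B only.
custom_pairings = [("A", "B"), ("A", "C"), ("B", "B")]
pairs = list(candidate_pairs(datasets, custom_pairings))
```

With this shape, the existing enum values stay available as shorthand, and the exclusion form ("all combinations except ...") is just pairings_for_link_type("link_and_dedupe", names) with some entries removed.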
Describe alternatives you've considered
You can accomplish this by doing a "link_and_dedupe" and manually excluding matches, but that is way more computation than necessary.
You can also accomplish it by putting your datasets in 1 at a time for dedupe, and 2 at a time for pairs of datasets that should have links between them, and combining all the links from all models before clustering, but this requires making a bunch of Splink objects.
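For a sense of scale on the first alternative, here is a back-of-the-envelope count of record comparisons. The dataset sizes are made up, and it ignores blocking rules, which would shrink both numbers in practice.

```python
from math import comb


def n_comparisons(sizes, pairings):
    """Count record pairs implied by a list of dataset pairings (no blocking)."""
    total = 0
    for left, right in pairings:
        if left == right:
            total += comb(sizes[left], 2)  # within-dataset pairs
        else:
            total += sizes[left] * sizes[right]  # cross-dataset pairs
    return total


# Hypothetical sizes, for illustration only.
sizes = {"A": 1000, "B": 1000, "C": 1000}
names = list(sizes)

# Everything "link_and_dedupe" would compare.
all_pairings = [(d, d) for d in names] + [
    (a, b) for i, a in enumerate(names) for b in names[i + 1:]
]
# Only what is actually wanted: A-B, A-C, B-B.
wanted = [("A", "B"), ("A", "C"), ("B", "B")]

full = n_comparisons(sizes, all_pairings)
restricted = n_comparisons(sizes, wanted)
print(full, restricted)
```

Under these assumptions the unrestricted run scores a little under twice as many pairs as the restricted one, and the gap widens as more unwanted pairings are added.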
Additional context