[FEAT] Generalize link_type #2545

Open
zmbc opened this issue Dec 6, 2024 · 0 comments
Labels
enhancement New feature or request

zmbc (Contributor) commented Dec 6, 2024

Is your proposal related to a problem?

Currently, the link_type setting is an enum with only three options. This prevents some configurations; for example, I want to take two datasets A and B, dedupe within A, link A to B, but not dedupe B.

This problem is worse with 3 or more datasets: I also can't, for example, say "link dataset A to B and A to C, and dedupe within B, but don't link B to C directly."

Quibble about names

This may be a pedantic point, but I find the names "link_only," "link_and_dedupe," and "dedupe_only" confusing. They describe which pairs get compared, not which clusters result, which means even "link_only" can lead to some deduplication.

Consider the following datasets with three records each:

graph LR
  subgraph dataset_1
     record_1
     record_2
     record_3
  end

  subgraph dataset_2
     record_a
     record_b
     record_c
  end

"link_and_dedupe" would evaluate all of these pairs:

graph LR
  subgraph dataset_1
     record_1
     record_2
     record_3
  end

  subgraph dataset_2
     record_a
     record_b
     record_c
  end

  record_1 <-.-> record_2
  record_1 <-.-> record_3
  record_1 <-.-> record_a
  record_1 <-.-> record_b
  record_1 <-.-> record_c

  record_2 <-.-> record_3
  record_2 <-.-> record_a
  record_2 <-.-> record_b
  record_2 <-.-> record_c

  record_3 <-.-> record_a
  record_3 <-.-> record_b
  record_3 <-.-> record_c

  record_a <-.-> record_b
  record_a <-.-> record_c

  record_b <-.-> record_c

Whereas "link_only" would only evaluate these:

graph LR
  subgraph dataset_1
     record_1
     record_2
     record_3
  end

  subgraph dataset_2
     record_a
     record_b
     record_c
  end

  record_1 <-.-> record_a
  record_1 <-.-> record_b
  record_1 <-.-> record_c

  record_2 <-.-> record_a
  record_2 <-.-> record_b
  record_2 <-.-> record_c

  record_3 <-.-> record_a
  record_3 <-.-> record_b
  record_3 <-.-> record_c
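To make the two diagrams concrete, here is a minimal sketch in plain Python (not Splink code) that enumerates the pairs each option would consider for the six records above. The helper `pairs_for` is a hypothetical name used only for illustration, and it ignores blocking rules entirely.

```python
from itertools import combinations, product

dataset_1 = ["record_1", "record_2", "record_3"]
dataset_2 = ["record_a", "record_b", "record_c"]


def pairs_for(link_type, left, right):
    """Return the record pairs considered for a link_type (illustrative, no blocking)."""
    # Pairs where both records come from the same dataset.
    within = list(combinations(left, 2)) + list(combinations(right, 2))
    # Pairs where the records come from different datasets.
    across = list(product(left, right))
    if link_type == "link_only":
        return across
    if link_type == "link_and_dedupe":
        return within + across
    raise ValueError(f"unsupported link_type in this sketch: {link_type}")


for link_type in ("link_only", "link_and_dedupe"):
    print(link_type, len(pairs_for(link_type, dataset_1, dataset_2)))
# link_only 9
# link_and_dedupe 15
```

The counts match the number of dotted arrows in the two diagrams: 15 for "link_and_dedupe" and 9 for "link_only".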

But with either of these options, if you run cluster_pairwise_predictions_at_threshold on the predictions, you can still end up doing deduplication, e.g. with the solid arrows being the pairs scored above the threshold:

graph LR
  subgraph dataset_1
     record_1
     record_2
     record_3
  end

  subgraph dataset_2
     record_a
     record_b
     record_c
  end

  record_1 <---> record_a
  record_1 <-.-> record_b
  record_1 <---> record_c

  record_2 <---> record_a
  record_2 <---> record_b
  record_2 <-.-> record_c

  record_3 <-.-> record_a
  record_3 <---> record_b
  record_3 <---> record_c

This results in a single cluster, which says that all the records within each dataset are duplicates of one another, even though no pair of records from the same dataset was ever evaluated directly.
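The effect can be reproduced with a small sketch, assuming that clustering at a threshold amounts to taking connected components over the above-threshold pairs. The union-find below is a stand-in written for this example, not Splink's implementation, and the edge list is my reading of the solid arrows in the diagram.

```python
def cluster(nodes, edges):
    """Group nodes into connected components with a simple union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        # Walk up to the root, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())


nodes = ["record_1", "record_2", "record_3", "record_a", "record_b", "record_c"]

# Cross-dataset pairs assumed to score above the threshold (the solid arrows).
matches = [
    ("record_1", "record_a"),
    ("record_1", "record_c"),
    ("record_2", "record_a"),
    ("record_2", "record_b"),
    ("record_3", "record_b"),
    ("record_3", "record_c"),
]

clusters = cluster(nodes, matches)
print(len(clusters))  # 1
```

Every edge crosses datasets, yet record_1, record_2, and record_3 all land in the same cluster.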

Describe the solution you'd like

Keep link_type as is, but make it just syntactic sugar for a more flexible system.

The more flexible system would be to list the acceptable dataset pairings, e.g. "A-B", "A-C", "B-B" for the example above, or to list the ones to exclude from the set of all combinations.
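As a rough sketch of what this could look like (everything here is hypothetical: `expand_link_type`, `candidate_pairs`, and the tuple-based pairing format are names I made up for illustration, not existing Splink settings), the current link_type values would expand into an explicit list of dataset pairings, and a custom list would be accepted in the same form:

```python
from itertools import combinations, product


def expand_link_type(link_type, dataset_names):
    """Hypothetical: translate an existing link_type into explicit dataset pairings."""
    within = [(name, name) for name in dataset_names]
    across = list(combinations(dataset_names, 2))
    if link_type == "dedupe_only":
        return within
    if link_type == "link_only":
        return across
    if link_type == "link_and_dedupe":
        return within + across
    raise ValueError(f"unknown link_type: {link_type}")


def candidate_pairs(datasets, pairings):
    """Yield record pairs only for the allowed dataset pairings (no blocking)."""
    for left, right in pairings:
        if left == right:
            # Dedupe within a single dataset.
            yield from combinations(datasets[left], 2)
        else:
            # Link across two different datasets.
            yield from product(datasets[left], datasets[right])


datasets = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"]}

# The configuration from the example: link A-B and A-C, dedupe B, nothing else.
custom = [("A", "B"), ("A", "C"), ("B", "B")]

print(len(list(candidate_pairs(datasets, custom))))  # 9
everything = expand_link_type("link_and_dedupe", list(datasets))
print(len(list(candidate_pairs(datasets, everything))))  # 15
```

With two records per dataset, the custom pairing list evaluates 9 pairs instead of the 15 that "link_and_dedupe" would, and never compares B to C, A to A, or C to C.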

Describe alternatives you've considered

You can accomplish this by doing a "link_and_dedupe" and manually excluding matches, but that is way more computation than necessary.
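To give a feel for the extra work, here is a back-of-the-envelope count for the two-dataset case (dedupe A, link A to B, don't dedupe B). The dataset sizes are made up, and the numbers ignore blocking rules, which would reduce both figures in practice, so treat them as upper bounds only.

```python
from math import comb

size_a = 1_000        # hypothetical size of dataset A
size_b = 1_000_000    # hypothetical size of dataset B

# "link_and_dedupe": every pair among all records in A and B.
full = comb(size_a + size_b, 2)

# What is actually wanted: pairs within A, plus pairs across A and B.
needed = comb(size_a, 2) + size_a * size_b

# Pairs within B that are compared and then discarded.
wasted = full - needed

print(f"link_and_dedupe: {full:,}")
print(f"needed:          {needed:,}")
print(f"ratio:           {full / needed:.0f}x")
```

Under these assumptions, almost all of the comparisons are B-to-B pairs that get thrown away afterwards.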

You can also accomplish it by running a separate model for each dataset that should be deduped (one dataset at a time) and for each pair of datasets that should be linked (two at a time), then combining the links from all models before clustering. But this requires making a bunch of Splink objects.

Additional context

zmbc added the enhancement (New feature or request) label Dec 6, 2024