[FEAT] Generalize `link_type` #2545

zmbc · 2024-12-06T23:58:15Z

Is your proposal related to a problem?

Currently, the link_type setting is an enum with only three options. This rules out some configurations. For example, I want to take two datasets, A and B, dedupe within A and link A to B, but not dedupe within B.

This problem is worse with 3 or more datasets: I also can't, for example, say "link dataset A to B and A to C, and dedupe within B, but don't link B to C directly."

Quibble about names

This may be a pedantic point, but I find the names "link_only," "link_and_dedupe," and "dedupe_only" confusing. That is because they operate on pairs, not clusters, which means even "link_only" can lead to some deduplication.

Consider the following datasets with three records each:

graph LR
  subgraph dataset_1
     record_1
     record_2
     record_3
  end

  subgraph dataset_2
     record_a
     record_b
     record_c
  end

"link_and_dedupe" would evaluate all of these pairs:

graph LR
  subgraph dataset_1
     record_1
     record_2
     record_3
  end

  subgraph dataset_2
     record_a
     record_b
     record_c
  end

  record_1 <-.-> record_2
  record_1 <-.-> record_3
  record_1 <-.-> record_a
  record_1 <-.-> record_b
  record_1 <-.-> record_c

  record_2 <-.-> record_3
  record_2 <-.-> record_a
  record_2 <-.-> record_b
  record_2 <-.-> record_c

  record_3 <-.-> record_a
  record_3 <-.-> record_b
  record_3 <-.-> record_c

  record_a <-.-> record_b
  record_a <-.-> record_c

  record_b <-.-> record_c

Whereas "link_only" would only evaluate these:

graph LR
  subgraph dataset_1
     record_1
     record_2
     record_3
  end

  subgraph dataset_2
     record_a
     record_b
     record_c
  end

  record_1 <-.-> record_a
  record_1 <-.-> record_b
  record_1 <-.-> record_c

  record_2 <-.-> record_a
  record_2 <-.-> record_b
  record_2 <-.-> record_c

  record_3 <-.-> record_a
  record_3 <-.-> record_b
  record_3 <-.-> record_c
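
As a rough illustration of the two diagrams above, the pair sets can be enumerated in plain Python. This is only a sketch: it ignores any blocking rules, and it is not how Splink generates comparisons internally. The variable names are made up for the example.

```python
from itertools import combinations, product

dataset_1 = ["record_1", "record_2", "record_3"]
dataset_2 = ["record_a", "record_b", "record_c"]

# "link_and_dedupe": every unordered pair drawn from the pooled records.
link_and_dedupe_pairs = list(combinations(dataset_1 + dataset_2, 2))

# "link_only": only pairs with one record from each dataset.
link_only_pairs = list(product(dataset_1, dataset_2))

print(len(link_and_dedupe_pairs))  # 15
print(len(link_only_pairs))        # 9
```

The counts match the number of dotted edges in each diagram: 15 for "link_and_dedupe" and 9 for "link_only".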

But with either of these options, if you then run cluster_pairwise_predictions_at_threshold, you can still end up doing deduplication. For example, this set of predictions (solid lines are pairs predicted to match):

graph LR
  subgraph dataset_1
     record_1
     record_2
     record_3
  end

  subgraph dataset_2
     record_a
     record_b
     record_c
  end

  record_1 <---> record_a
  record_1 <-.-> record_b
  record_1 <---> record_c

  record_2 <---> record_a
  record_2 <---> record_b
  record_2 <-.-> record_c

  record_3 <-.-> record_a
  record_3 <---> record_b
  record_3 <---> record_c

This results in a single cluster, which says that all the records within each dataset are duplicates of each other, even though no pair of records from the same dataset was ever directly evaluated.
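
To make this concrete, here is a minimal sketch that treats the solid edges in the diagram above as matches and groups records by connected components. It assumes that clustering at a threshold behaves like connected components over the matching pairs. The `connected_components` helper is hypothetical and written for this example; it is not Splink code.

```python
def connected_components(nodes, edges):
    """Group nodes into clusters using a simple union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())


records = ["record_1", "record_2", "record_3",
           "record_a", "record_b", "record_c"]

# Solid edges from the diagram: every one of them crosses datasets.
matches = [
    ("record_1", "record_a"), ("record_1", "record_c"),
    ("record_2", "record_a"), ("record_2", "record_b"),
    ("record_3", "record_b"), ("record_3", "record_c"),
]

clusters = connected_components(records, matches)
print(len(clusters))  # 1
```

Every edge links dataset_1 to dataset_2, yet all six records land in one cluster.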

Describe the solution you'd like

Keep link_type as is, but make it just syntactic sugar for a more flexible system.

The more flexible system would be to list the acceptable dataset pairings, e.g. "A-B", "A-C", "B-B" for the example above, or to list the ones to exclude from the set of all combinations.
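
One way the sugar could work, sketched in plain Python. All of the function names here (`all_pairings`, `expand_link_type`, `pairings_excluding`) are hypothetical and exist only for illustration; none of them is an existing Splink API. A pairing is written as a tuple of dataset names, with `("B", "B")` meaning "dedupe within B".

```python
from itertools import combinations


def all_pairings(names):
    """Every within-dataset and across-dataset pairing."""
    names = sorted(names)
    within = {(n, n) for n in names}
    across = set(combinations(names, 2))
    return within | across


def expand_link_type(link_type, names):
    """Translate a legacy link_type value into explicit pairings."""
    names = sorted(names)
    within = {(n, n) for n in names}
    across = set(combinations(names, 2))
    if link_type == "dedupe_only":
        return within
    if link_type == "link_only":
        return across
    if link_type == "link_and_dedupe":
        return within | across
    raise ValueError(f"unknown link_type: {link_type!r}")


def pairings_excluding(names, excluded):
    """The alternative form: start from everything and list exclusions."""
    return all_pairings(names) - set(excluded)


# The three-dataset example: link A-B and A-C, dedupe within B only.
wanted = {("A", "B"), ("A", "C"), ("B", "B")}
same = pairings_excluding(
    ["A", "B", "C"], {("A", "A"), ("B", "C"), ("C", "C")}
)
print(wanted == same)  # True
```

Under this sketch, the existing three enum values are just named shortcuts for particular pairing sets, so current settings would keep working unchanged.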

Describe alternatives you've considered

You can accomplish this by doing a "link_and_dedupe" and manually excluding matches, but that is way more computation than necessary.

You can also accomplish it by running a separate model for each pairing: one dataset at a time for dedupe, and two at a time for each pair of datasets that should have links between them. You then combine the links from all the models before clustering, but this requires making a bunch of Splink objects.
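
The two workarounds can be compared with a toy example. This is a sketch in plain Python with made-up record IDs. It only counts candidate pairs and does not call Splink. Both routes end up with the same pairs, but the first evaluates more of them than needed.

```python
from itertools import combinations, product

# Toy datasets; the record IDs are invented for this example.
datasets = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1"]}
wanted = {("A", "B"), ("A", "C"), ("B", "B")}

# Look up which dataset each record came from.
source = {rec: name for name, recs in datasets.items() for rec in recs}

# Workaround 1: evaluate everything ("link_and_dedupe"), then filter.
all_records = [rec for recs in datasets.values() for rec in recs]
all_pairs = list(combinations(all_records, 2))
filtered = {
    (left, right)
    for left, right in all_pairs
    if tuple(sorted((source[left], source[right]))) in wanted
}

# Workaround 2: one run per pairing, then combine the results.
combined = set()
for left_name, right_name in wanted:
    if left_name == right_name:
        combined.update(combinations(datasets[left_name], 2))
    else:
        combined.update(product(datasets[left_name], datasets[right_name]))

print(len(all_pairs))        # 10
print(len(filtered))         # 7
print(filtered == combined)  # True
```

Here three of the ten pairs are evaluated only to be thrown away. With realistic dataset sizes the wasted share can be much larger, for example when the dataset that should not be deduped is the biggest one.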

Additional context

zmbc added the enhancement label on Dec 6, 2024
