[FEAT] block_on Method Should Support Arrays And SUBSTR #2563

ModeMonkey · 2024-12-16T03:09:24Z

Is your proposal related to a problem?

The block_on method does not seem to support array columns when a substr expression is used. I've provided sample code below to demonstrate the point. With the release of v4.0.6 and its new PairwiseStringDistanceFunctionAtThresholds comparison for measuring similarity between arrays, it seems reasonable to expect data to be in arrays and for block_on to be able to work on those arrays.

Describe the solution you'd like

I would like the block_on method to be able to function on arrays when using the substr method.

Describe alternatives you've considered

One alternative is to provide a string blocking rule like:

EXISTS (
    SELECT 1
    FROM UNNEST(l.surname) AS l_surname_struct,
         UNNEST(r.surname) AS r_surname_struct
    WHERE SUBSTR(l_surname_struct.unnest, 1, 3) = SUBSTR(r_surname_struct.unnest, 1, 3)
)

This blocking rule works with DuckDB, at least.
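For readers less familiar with the UNNEST pattern, the intended semantics of the rule above can be sketched in plain Python: two records are paired if any surname on the left shares its first three characters with any surname on the right. This is only an illustration of the logic, not Splink code; the function name shares_prefix is made up here, and NULL handling is ignored.

```python
from itertools import product


def shares_prefix(left_values, right_values, length=3):
    """Return True if any left/right pair agrees on its first `length` characters."""
    # Mirrors the EXISTS subquery: every left element is compared
    # with every right element, so the work grows with the product
    # of the two array lengths.
    return any(
        left[:length] == right[:length]
        for left, right in product(left_values, right_values)
    )


print(shares_prefix(["Smith", "Smythe"], ["Smithson"]))  # True ("Smi" == "Smi")
print(shares_prefix(["Jones"], ["Smith"]))               # False
print(shares_prefix([], ["Smith"]))                      # False (empty array)
```

Because the condition sits inside a subquery rather than being a plain equality between two columns, it is likely harder for the database to evaluate as an efficient join, which may explain the slowness described below.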

However, this ends up being incredibly slow. Adding a second UNNEST and SUBSTR condition, for example on first_name, fails to complete in a reasonable time on a 32-core computer. After 30 minutes it still had not estimated the probability that two random records match on the fake_1000 sample.

Perhaps leaving substr on arrays out of the block_on method is intentional, because it can't be made performant. If that is the case, perhaps block_on could raise an error message that notifies the user of the performance issue and suggests the blocking rule string above for those who really want this functionality?
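Another possible workaround, sketched here under assumptions, is to precompute the prefixes as their own array column before handing the data to Splink, so that blocking becomes a plain equality on exploded values instead of a SUBSTR inside an EXISTS subquery. The helper prefix_array and the column name first_name_prefix2 are hypothetical, and this approach has not been benchmarked here.

```python
def prefix_array(values, length):
    """Return the distinct prefixes of the given length, in first-seen order."""
    seen = {}
    for value in values:
        # Skip anything that is not a non-empty string.
        if isinstance(value, str) and value:
            seen.setdefault(value[:length], None)
    return list(seen)


print(prefix_array(["Robert", "Rob", "Bob"], 2))  # ['Ro', 'Bo']

# Hypothetical use with the DataFrame from the sample code in this issue
# (the derived column name is made up for illustration):
#   df["first_name_prefix2"] = df["first_name"].apply(lambda v: prefix_array(v, 2))
#   block_on("first_name_prefix2", arrays_to_explode=["first_name_prefix2"])
```

Deduplicating the prefixes keeps the exploded table from growing more than necessary when several values in one array share a prefix.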

Additional context

Here is some sample code to demonstrate the point. I've modeled it on the intro Splink tutorial. I made this while trying to understand how to implement the new PairwiseStringDistanceFunctionAtThresholds method.

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

df = splink_datasets.fake_1000
df = df.drop(columns=["cluster"])
df.head(5)

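def make_list(in_value):
    # Wrap a non-empty string in a one-element list; anything else (NaN, "", non-strings) becomes an empty list
    if isinstance(in_value, str) and len(in_value) > 0:
        return [in_value]
    return []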

# turn all columns into arrays
for column in df.columns:
    if column != "unique_id":
        df[column] = df[column].apply(make_list)

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.PairwiseStringDistanceFunctionAtThresholds("first_name", "levenshtein", [0,1,2,3,4,5,6,7,8,9]),
        cl.PairwiseStringDistanceFunctionAtThresholds("surname",    "levenshtein", [0,1,2,3,4,5,6,7,8,9]),
        cl.PairwiseStringDistanceFunctionAtThresholds("dob",        "levenshtein", [0,1,2,3,4,5,6,7,8,9]),
        cl.PairwiseStringDistanceFunctionAtThresholds("city",       "levenshtein", [0,1,2,3,4,5,6,7,8,9]).configure(term_frequency_adjustments=True),
        cl.PairwiseStringDistanceFunctionAtThresholds("email",      "levenshtein", [0,1,2,3,4,5,6,7,8,9]),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "city", arrays_to_explode=["first_name","city"]),
        block_on("surname",            arrays_to_explode=["surname"]),
    ],
    retain_intermediate_calculation_columns=True,
)

linker = Linker(df, settings, db_api=DuckDBAPI())

deterministic_rules = [
    block_on("first_name", "dob",        arrays_to_explode=["first_name", "dob"]),
    block_on("email",                    arrays_to_explode=["email"]),
    block_on("substr(first_name, 1, 2)", arrays_to_explode=["first_name"]),
]

linker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
# ^^^^^^^ Error appears here, seemingly with the addition of the substr block_on rule

# vvvvvvvv This is the rest of the code, which works after removing the third block_on rule with the substr: 
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
training_blocking_rule = block_on("first_name", "surname")
training_session_fname_sname = (
    linker.training.estimate_parameters_using_expectation_maximisation(training_blocking_rule)
)
training_blocking_rule = block_on("dob")
training_session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
    training_blocking_rule
)
df_predictions = linker.inference.predict(threshold_match_probability=0.2)
df_predictions.as_pandas_dataframe(limit=5)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predictions, threshold_match_probability=0.5
)
clusters.as_pandas_dataframe(limit=10)

I know the comparison rules above aren't great; I was just making a proof of concept. Overall I really love the new PairwiseStringDistanceFunctionAtThresholds method - it makes things so much easier than the alternatives I've messed with. Great work on that feature!

Happy to help, though I don't really know where to start.


ModeMonkey added the enhancement New feature or request label Dec 16, 2024

ModeMonkey changed the title ~~[FEAT] <title>~~ [FEAT] block_on Method Should Support Arrays And SUBSTR Dec 16, 2024
