Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] block_on Method Should Support Arrays And SUBSTR #2563

Open
ModeMonkey opened this issue Dec 16, 2024 · 0 comments
Open

[FEAT] block_on Method Should Support Arrays And SUBSTR #2563

ModeMonkey opened this issue Dec 16, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@ModeMonkey
Copy link

Is your proposal related to a problem?

The block_on method does not seem to support arrays when implementing the substr method. I've provided sample code below to demonstrate the point. With the release of v4.0.6 and its new PairwiseStringDistanceFunctionAtThresholds function for comparing similarities between arrays, it would seem appropriate to expect data be in arrays and for the block_on function to be able to work on those arrays.

Describe the solution you'd like

I would like the block_on method to be able to function on arrays when using the substr method.

Describe alternatives you've considered

One alternative is to provide a string blocking rule like:

EXISTS (
    SELECT 1
    FROM UNNEST(l.surname) AS l_surname_struct,
         UNNEST(r.surname) AS r_surname_struct,
    WHERE SUBSTR(l_surname_struct.unnest, 1, 3) = SUBSTR(r_surname_struct.unnest, 1, 3)
)

This blocking rule works with at least DuckDB.

However, this ends up being incredibly slow. Adding a second unnest and substr method, like for first_name, fails to compute in a timely fashion on a 32-core computer. After 30 minutes the computer did not seem to be able to estimate the probability two random records match in the fake_1000 sample.

Perhaps not including the ability to use the substr method on arrays in the block_on method is intentional, as it can't be made performant. If this is the case, perhaps an error message that notifies the user of the performance issues and provides the blocking rule string above if they really want this functionality?

Additional context

Here is some sample code to demonstrate the point. I've mimicked it off the the intro Splink tutorial. I made this while trying to understand how to implement the new PairwiseStringDistanceFunctionAtThresholds method.

from splink import  splink_datasets
import splink.comparison_library as cl
from splink import Linker, SettingsCreator, block_on, DuckDBAPI

df = splink_datasets.fake_1000
df = df.drop(columns=["cluster"])
df.head(5)

def make_list(in_value):
    if type(in_value) == str and len(in_value) > 0:
        return [in_value]
    else:
        return []

# turn all columns into arrays
for column in df.columns:
    if column != "unique_id":
        df[column] = df[column].apply(make_list)

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.PairwiseStringDistanceFunctionAtThresholds("first_name", "levenshtein", [0,1,2,3,4,5,6,7,8,9]),
        cl.PairwiseStringDistanceFunctionAtThresholds("surname",    "levenshtein", [0,1,2,3,4,5,6,7,8,9]),
        cl.PairwiseStringDistanceFunctionAtThresholds("dob",        "levenshtein", [0,1,2,3,4,5,6,7,8,9]),
        cl.PairwiseStringDistanceFunctionAtThresholds("city",       "levenshtein", [0,1,2,3,4,5,6,7,8,9]).configure(term_frequency_adjustments=True),
        cl.PairwiseStringDistanceFunctionAtThresholds("email",      "levenshtein", [0,1,2,3,4,5,6,7,8,9]),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "city", arrays_to_explode=["first_name","city"]),
        block_on("surname",            arrays_to_explode=["surname"]),
    ],
    retain_intermediate_calculation_columns=True,
)

linker = Linker(df, settings, db_api=DuckDBAPI())

deterministic_rules = [
    block_on("first_name", "dob",        arrays_to_explode=["first_name", "dob"]),
    block_on("email",                    arrays_to_explode=["email"]),
    block_on("substr(first_name, 1, 2)", arrays_to_explode=["first_name"]),
    
]

linker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
# ^^^^^^^ Error appears here, seemingly with the addition of the substr block_on rule

# vvvvvvvv This is the rest of the code, which works after removing the third block_on rule with the substr: 
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
training_blocking_rule = block_on("first_name", "surname")
training_session_fname_sname = (
    linker.training.estimate_parameters_using_expectation_maximisation(training_blocking_rule)
)
training_blocking_rule = block_on("dob")
training_session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
    training_blocking_rule
)
df_predictions = linker.inference.predict(threshold_match_probability=0.2)
df_predictions.as_pandas_dataframe(limit=5)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predictions, threshold_match_probability=0.5
)
clusters.as_pandas_dataframe(limit=10)

I know the comparison rules above aren't great, I was just making a proof of concept. Overall I really love the new PairwiseStringDistanceFunctionAtThresholds method - makes things so much easier than the alternatives I've messed with. Great work on that feature!

Happy to help, though I don't really know where to start.

@ModeMonkey ModeMonkey added the enhancement New feature or request label Dec 16, 2024
@ModeMonkey ModeMonkey changed the title [FEAT] <title> [FEAT] block_on Method Should Support Arrays And SUBSTR Dec 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant