You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The block_on method does not seem to support arrays when implementing the substr method. I've provided sample code below to demonstrate the point. With the release of v4.0.6 and its new PairwiseStringDistanceFunctionAtThresholds function for comparing similarities between arrays, it would seem appropriate to expect data be in arrays and for the block_on function to be able to work on those arrays.
Describe the solution you'd like
I would like the block_on method to be able to function on arrays when using the substr method.
Describe alternatives you've considered
One alternative is to provide a string blocking rule like:
EXISTS (
SELECT 1
FROM UNNEST(l.surname) AS l_surname_struct,
UNNEST(r.surname) AS r_surname_struct,
WHERE SUBSTR(l_surname_struct.unnest, 1, 3) = SUBSTR(r_surname_struct.unnest, 1, 3)
)
This blocking rule works with at least DuckDB.
However, this ends up being incredibly slow. Adding a second unnest and substr method, like for first_name, fails to compute in a timely fashion on a 32-core computer. After 30 minutes the computer did not seem to be able to estimate the probability two random records match in the fake_1000 sample.
Perhaps not including the ability to use the substr method on arrays in the block_on method is intentional, as it can't be made performant. If this is the case, perhaps an error message that notifies the user of the performance issues and provides the blocking rule string above if they really want this functionality?
Additional context
Here is some sample code to demonstrate the point. I've mimicked it off the the intro Splink tutorial. I made this while trying to understand how to implement the new PairwiseStringDistanceFunctionAtThresholds method.
from splink import splink_datasets
import splink.comparison_library as cl
from splink import Linker, SettingsCreator, block_on, DuckDBAPI
df = splink_datasets.fake_1000
df = df.drop(columns=["cluster"])
df.head(5)
def make_list(in_value):
if type(in_value) == str and len(in_value) > 0:
return [in_value]
else:
return []
# turn all columns into arrays
for column in df.columns:
if column != "unique_id":
df[column] = df[column].apply(make_list)
settings = SettingsCreator(
link_type="dedupe_only",
comparisons=[
cl.PairwiseStringDistanceFunctionAtThresholds("first_name", "levenshtein", [0,1,2,3,4,5,6,7,8,9]),
cl.PairwiseStringDistanceFunctionAtThresholds("surname", "levenshtein", [0,1,2,3,4,5,6,7,8,9]),
cl.PairwiseStringDistanceFunctionAtThresholds("dob", "levenshtein", [0,1,2,3,4,5,6,7,8,9]),
cl.PairwiseStringDistanceFunctionAtThresholds("city", "levenshtein", [0,1,2,3,4,5,6,7,8,9]).configure(term_frequency_adjustments=True),
cl.PairwiseStringDistanceFunctionAtThresholds("email", "levenshtein", [0,1,2,3,4,5,6,7,8,9]),
],
blocking_rules_to_generate_predictions=[
block_on("first_name", "city", arrays_to_explode=["first_name","city"]),
block_on("surname", arrays_to_explode=["surname"]),
],
retain_intermediate_calculation_columns=True,
)
linker = Linker(df, settings, db_api=DuckDBAPI())
deterministic_rules = [
block_on("first_name", "dob", arrays_to_explode=["first_name", "dob"]),
block_on("email", arrays_to_explode=["email"]),
block_on("substr(first_name, 1, 2)", arrays_to_explode=["first_name"]),
]
linker.training.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)
# ^^^^^^^ Error appears here, seemingly with the addition of the substr block_on rule
# vvvvvvvv This is the rest of the code, which works after removing the third block_on rule with the substr:
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
training_blocking_rule = block_on("first_name", "surname")
training_session_fname_sname = (
linker.training.estimate_parameters_using_expectation_maximisation(training_blocking_rule)
)
training_blocking_rule = block_on("dob")
training_session_dob = linker.training.estimate_parameters_using_expectation_maximisation(
training_blocking_rule
)
df_predictions = linker.inference.predict(threshold_match_probability=0.2)
df_predictions.as_pandas_dataframe(limit=5)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
df_predictions, threshold_match_probability=0.5
)
clusters.as_pandas_dataframe(limit=10)
I know the comparison rules above aren't great, I was just making a proof of concept. Overall I really love the new PairwiseStringDistanceFunctionAtThresholds method - makes things so much easier than the alternatives I've messed with. Great work on that feature!
Happy to help, though I don't really know where to start.
The text was updated successfully, but these errors were encountered:
Is your proposal related to a problem?
The block_on method does not seem to support arrays when implementing the substr method. I've provided sample code below to demonstrate the point. With the release of v4.0.6 and its new PairwiseStringDistanceFunctionAtThresholds function for comparing similarities between arrays, it would seem appropriate to expect data be in arrays and for the block_on function to be able to work on those arrays.
Describe the solution you'd like
I would like the block_on method to be able to function on arrays when using the substr method.
Describe alternatives you've considered
One alternative is to provide a string blocking rule like:
This blocking rule works with at least DuckDB.
However, this ends up being incredibly slow. Adding a second unnest and substr method, like for first_name, fails to compute in a timely fashion on a 32-core computer. After 30 minutes the computer did not seem to be able to estimate the probability two random records match in the fake_1000 sample.
Perhaps not including the ability to use the substr method on arrays in the block_on method is intentional, as it can't be made performant. If this is the case, perhaps an error message that notifies the user of the performance issues and provides the blocking rule string above if they really want this functionality?
Additional context
Here is some sample code to demonstrate the point. I've mimicked it off the the intro Splink tutorial. I made this while trying to understand how to implement the new PairwiseStringDistanceFunctionAtThresholds method.
I know the comparison rules above aren't great, I was just making a proof of concept. Overall I really love the new PairwiseStringDistanceFunctionAtThresholds method - makes things so much easier than the alternatives I've messed with. Great work on that feature!
Happy to help, though I don't really know where to start.
The text was updated successfully, but these errors were encountered: