IWeb Administrator Guide

Probabilistic Rules/Logic

Probabilistic logic uses a scoring technique to classify each deduplication result as a match, a non-match, or unsure (do not know). This logic is also referred to as a record linkage process.

The logic includes field comparison functions that return the basic matching weights for each record pair; these weights are stored in weight vectors. The weight vectors are then given to a classifier, which calculates a matching decision (match, non-match, or possible match).

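Agreement and disagreement weights are computed from the M-probability (the probability that a field agrees when the two records are a true match) and the U-probability (the probability that a field agrees when the two records are not a match), as shown below: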

$\text{agreement\_weight} = \log_2 \left( \frac{\text{m\_probability}}{\text{u\_probability}} \right)$

$\text{disagreement\_weight} = \log_2 \left( \frac{1 - \text{m\_probability}}{1 - \text{u\_probability}} \right)$
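The two weight formulas can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the function names and the sample probabilities (0.95 and 0.01) are assumptions chosen for the example.

```python
import math


def agreement_weight(m_probability: float, u_probability: float) -> float:
    """log2(m / u): weight returned when the two field values agree."""
    return math.log2(m_probability / u_probability)


def disagreement_weight(m_probability: float, u_probability: float) -> float:
    """log2((1 - m) / (1 - u)): weight returned when the field values differ."""
    return math.log2((1.0 - m_probability) / (1.0 - u_probability))


# Hypothetical probabilities for a single field (illustrative values only).
m, u = 0.95, 0.01
agree_wt = agreement_weight(m, u)
disagree_wt = disagreement_weight(m, u)
print(f"agreement weight:    {agree_wt:.3f}")
print(f"disagreement weight: {disagree_wt:.3f}")
```

When the M-probability is larger than the U-probability, as it is here, the agreement weight is positive and the disagreement weight is negative.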

The Exact String Comparator function compares the two fields given to it and returns the agreement weight if they are the same, or the disagreement weight if they differ.

The Approximate String Comparator function allows for partial agreement if the strings are not exactly alike but are almost the same. Such differences are typically caused by typographical and other errors.

All implemented string comparison functions return a similarity measure between zero (the two strings are completely different) and one (the two strings are the same). This measure is compared against a minimal approximate string similarity tolerance (the minimal approximate value), which is also a number between zero and one.
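To make the zero-to-one scale concrete, the sketch below uses Python's standard difflib.SequenceMatcher as a stand-in similarity measure. This is an assumption made purely for illustration; the guide does not say which comparison algorithms the product implements, and the helper name similarity is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(first: str, second: str) -> float:
    """Return a similarity measure between 0.0 and 1.0 for two strings.

    Stand-in only: SequenceMatcher.ratio() is used for illustration and is
    not necessarily a comparator the product uses.
    """
    return SequenceMatcher(None, first, second).ratio()


# Identical strings, near-identical strings, and unrelated strings.
for a, b in [("smith", "smith"), ("smith", "smyth"), ("abc", "xyz")]:
    print(f"{a!r} vs {b!r}: {similarity(a, b):.2f}")
```

Identical strings score one, strings with no characters in common score zero, and near matches such as a one-letter typo fall in between.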

If the two strings are the same (that is, the similarity measure returned by the approximate string comparator is one), the agreement weight is returned. If the value is less than one but greater than or equal to the minimal approximate value, a partial agreement weight is calculated using this formula:

$\text{partial\_agreement} = \text{agree\_wt} - \frac{1 - \text{sim\_measure}}{1 - \text{min\_approx\_value}} \times \left( \text{agree\_wt} + \left| \text{disagree\_wt} \right| \right)$
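The partial agreement rule can be sketched as a single Python function. The function name and the sample weights are hypothetical. The guide does not state what is returned when the similarity falls below the minimal approximate value; this sketch assumes the disagreement weight, which is also the value the formula produces exactly at the tolerance when the disagreement weight is negative.

```python
def approximate_comparison_weight(
    sim_measure: float,
    min_approx_value: float,
    agree_wt: float,
    disagree_wt: float,
) -> float:
    """Turn a similarity measure (0.0 to 1.0) into a matching weight."""
    if not 0.0 <= min_approx_value < 1.0:
        raise ValueError("min_approx_value must be in the range [0, 1)")
    if sim_measure >= 1.0:
        # The strings are the same: full agreement.
        return agree_wt
    if sim_measure >= min_approx_value:
        # Partial agreement: scale down from the agreement weight.
        scale = (1.0 - sim_measure) / (1.0 - min_approx_value)
        return agree_wt - scale * (agree_wt + abs(disagree_wt))
    # Assumption: below the tolerance, treat the fields as disagreeing.
    return disagree_wt


# Hypothetical weights and tolerance (illustrative values only).
for sim in (1.0, 0.75, 0.5, 0.2):
    weight = approximate_comparison_weight(sim, 0.5, 6.0, -4.0)
    print(f"similarity {sim:.2f} -> weight {weight:.2f}")
```

With these sample values, the weight falls linearly from the agreement weight at a similarity of one to the disagreement weight at the tolerance of 0.5.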

After the records have been compared and the weight vectors calculated, each record pair is classified as a link, a non-link, or, if the decision requires human review, a possible link. The classifier simply sums all of the weights in a weight vector and then compares the sum against two thresholds (lower and upper). These thresholds are the boundaries that define a definite match or non-match: a sum above the upper threshold is a link, and a sum below the lower threshold is a non-link. Anything in between these boundaries is sent to the system administrator for human review.
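A minimal sketch of this two-threshold classification follows. The function name, the sample weights, and the threshold values are hypothetical. How a sum that falls exactly on a threshold is treated is also an assumption; here it is sent to review as a possible link.

```python
from typing import Sequence


def classify(
    weight_vector: Sequence[float],
    lower_threshold: float,
    upper_threshold: float,
) -> str:
    """Sum the field weights and classify the record pair."""
    total = sum(weight_vector)
    if total > upper_threshold:
        return "link"
    if total < lower_threshold:
        return "non-link"
    # Assumption: sums on or between the thresholds go to human review.
    return "possible link"


# Hypothetical weight vectors for three record pairs (illustrative values).
pairs = {
    "pair A": [6.0, 5.0, 4.0],
    "pair B": [-4.0, -3.0, 1.0],
    "pair C": [6.0, -4.0, 1.0],
}
for name, weights in pairs.items():
    print(name, classify(weights, lower_threshold=0.0, upper_threshold=10.0))
```

Moving the two thresholds closer together sends fewer pairs to human review, at the cost of more automatic decisions on borderline pairs.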