Match and Deduplicate Data
Matching and Deduplicating Overview
We can use the built in index type on a pipe to perform different types of merging. For complete details, search for the Index Type on the Pipe help page. Two common matches provided by the index type are:
- Exact Match: The pipe retrieves data from its cache based on an "Exact Match" lookup with the values provided.
- Near Match: The pipe will retrieve data from its cache based on a "Near Match" lookup, taking into account the numeric value provided by Maximum Number of Edits Expression which sets a tolerance for near match. For example, if the index key is "Smyhte" and the result of the index expression is "Smith" this would still be a match providing that the allowed number of edits is 3 or more because the two values are 3 changes away from one another. Specifically, substitute the 'i' for a 'y', transpose the 't' and the 'h' and insert an 'e' at the end.
This is standard functionality, however you can extend this by creating your own attributes in your tables specifically for matching. For example creating a version of a company name with business suffixes removed, such as Ltd and Limited, to improve matching. Consider other standardisations that can be applied, such as replacing accented characters with a normalised version so that "Anthóny" would match "Anthony".
When matching with a Near Match it is possible set an integer in the Maximum Number of Edits Expression, this allows for a level of difference, specifically the number of edits allowed before two strings match. For example, setting Maximum Number of Edits Expression to 1, "Anthony" and "Antony" would match.
Consider how you want to match, if you hard code a value of 3 for your Maximum Number of Edits Expression, names such as Lee and Zoe would match giving you a lot of false positive matches. Using a calculated value can improve the matching accuracy. For example, the function stringLength to calculate a percentage of character you would allow e.g. 33%. Here is a worked example:
Solution
Here is our example incoming data, we want to tune our results so that we avoid false positive matches but still capture all potential matches:
UID | Name |
---|---|
1 | Dan |
2 | Yan |
3 | Gary |
4 | Barry |
5 | Zoe |
6 | Gus |
- We have a set of incoming records that contain a Name and a unique ID (UID). Set the table to be static so that we can test our matching without running analysis on this table.
- Create a table called Match Results with the same attributes as the Incoming records. We will expand this later to perform matching.
- Connect a pipe between the two tables.
- Connect a second pipe between the two tables. This will be setup to look for duplicates.
- Set its type to be Look-up
- Call the pipe
nc
. - Set the filter to on the nc pipe to:
UID not equals to in.UID
.
This will ensure that when we perform the lookup to see if there are duplicate names matching, we do not check against the record performing the lookup. - Click Apply.
Set the Order/Index as follows:
$distance variable defines the character distance between matches.
- Save and close the pipe properties by clicking Apply and Close.
- On the analysis model, double click Matching Results to open its properties.
- In the attributes section add a new attribute:
- Set Name to
Potential Matches.
- Set type to be BigString.
Set expression to the following, the comments explain each step:
do( // Calculate 33% of the String length. $i = stringLength(_out.Names)*0.33, // Round the value to an integer $i = roundUp($i, 0), // Perform the lookup to see if there are any potential matches // we set the variable $distance to be 33% of the length of the name. $matches = lookup(nc, $distance = $i).Names, // Set the array of names to a string for output listToString($matches) )
- Set Name to
- Save and close the properties by clicking Apply and Close.
- Run Analysis and the matching names will appear in the Potential Matches column of the matching Results Table.
- Tune your threshold value for your data set.
Alternatively, if you are looking for an Exact Match, in Order/Index set the Match Type to
Exact Name
and remove all values from Maximum Number of Edits Expression, and the expression for the Potential Matches would be:do( // Perform the exact name match lookup. $matches = lookup(nc).Names, // Set the array of names to a string for output listToString($matches) )
Next Steps
Once you have identified your potential matches you need to decide what to do next. There are two mains courses of action, which can be used in unison:
- Create a Screen that allows a user to decide if the name combinations are a match, and if they are these can be filtered out.
- Programmatically handle the matches. Introduce other attributes into the matching and score the potentially matches, for example if a name, DOB and address match then you can programmatically merge these records as your confidence in the match is sufficiently high. Where it is not high enough a score these can be decided by a human using option 1 above.
Self-referencing Pipes and Updates
It is possible for a table to look at itself to check for existing matches. For example, if you are loading a second set of data into the Match Results table, we can have a second lookup, shown here as ec, to check the existing content for matches.
Updating Existing Records
To update an existing record with new data we use a technique called Superseding. This technique will supersede the existing record, making the original record a historic record. This means that only the latest version of the record will display in a view, unless a view is specifically setup to see historic records. Historic records are shown with a grey background.
For information on the setup, see Superseding.