Match and Deduplicate Data

Matching and Deduplicating Overview

We can use the built in index type on a pipe to perform different types of merging. For complete details, search for the Index Type on the Pipe help page. Two common matches provided by the index type are:

  • Exact MatchThe pipe retrieves data from its cache based on an "Exact Match" lookup with the values provided. 
  • Near MatchThe pipe will retrieve data from its cache based on a "Near Match" lookup, taking into account the numeric value provided by Maximum Number of Edits Expression which sets a tolerance for near match. For example, if the index key is "Smyhte" and the result of the index expression is "Smith" this would still be a match providing that the allowed number of edits is 3 or more because the two values are 3 changes away from one another. Specifically, substitute the 'i' for a 'y', transpose the 't' and the 'h' and insert an 'e' at the end.

This is standard functionality, however you can extend this by creating your own attributes in your tables specifically for matching. For example creating a version of a company name with business suffixes removed, such as Ltd and Limited, to improve matching. Consider other standardisations that can be applied, such as replacing accented characters with a normalised version so that "Anthóny" would match "Anthony". 

When matching with a Near Match it is possible set an integer in the Maximum Number of Edits Expression, this allows for a level of difference, specifically the number of edits allowed before two strings match. For example, setting Maximum Number of Edits Expression to 1, "Anthony" and "Antony" would match. 

Consider how you want to match, if you hard code a value of 3 for your Maximum Number of Edits Expression, names such as Lee and Zoe would match giving you a lot of false positive matches. Using a calculated value can improve the matching accuracy. For example, the function stringLength to calculate a percentage of character you would allow e.g. 33%. Here is a worked example:

Solution

Here is our example incoming data, we want to tune our results so that we avoid false positive matches but still capture all potential matches:

UIDName

1

Dan
2Yan
3Gary
4Barry
5Zoe
6Gus
  1. We have a set of incoming records that contain a Name and a unique ID (UID). Set the table to be static so that we can test our matching without running analysis on this table.
  2. Create a table called Match Results with the same attributes as the Incoming records. We will expand this later to perform matching.
  3. Connect a pipe between the two tables.
  4. Connect a second pipe between the two tables. This will be setup to look for duplicates.
    1. Set its type to be Look-up
    2. Call the pipe nc.
    3. Set the filter to on the nc pipe to: UID not equals to in.UID.
      This will ensure that when we perform the lookup to see if there are duplicate names matching, we do not check against the record performing the lookup. 
    4. Click  Apply.
    5. Set the Order/Index as follows:

      1. $distance variable  defines the character distance between matches.

    6. Save and close the pipe properties by clicking  Apply and Close.
  5. On the analysis model, double click Matching Results to open its properties.
  6. In the attributes section add a new attribute:
    1. Set Name to Potential Matches.
    2. Set type to be BigString.
    3. Set expression to the following, the comments explain each step:

      do(
          // Calculate 33% of the String length.
          $i = stringLength(_out.Names)*0.33,
       
          // Round the value to an integer
          $i = roundUp($i, 0),
          
          // Perform the lookup to see if there are any potential matches
      	// we set the variable $distance to be 33% of the length of the name.
          $matches = lookup(nc, $distance = $i).Names,
          
          // Set the array of names to a string for output
          listToString($matches)
      )
  7. Save and close the properties by clicking  Apply and Close.
  8. Run Analysis and the matching names will appear in the Potential Matches column of the matching Results Table.
  9. Tune your threshold value for your data set.
  10. Alternatively, if you are looking for an Exact Match, in Order/Index set the Match Type to Exact Name and remove all values from Maximum Number of Edits Expression, and the expression for the Potential Matches would be:

    do(
        // Perform the exact name match lookup.	 
        $matches = lookup(nc).Names,
        
        // Set the array of names to a string for output
        listToString($matches)
    )

Next Steps

Once you have identified your potential matches you need to decide what to do next. There are two mains courses of action, which can be used in unison:

  1. Create a Screen that allows a user to decide if the name combinations are a match, and if they are these can be filtered out.
  2. Programmatically handle the matches. Introduce other attributes into the matching and score the potentially matches, for example if a name, DOB and address match then you can programmatically merge these records as your confidence in the match is sufficiently high. Where it is not high enough a score these can be decided by a human using option 1 above.   


Self-referencing Pipes and Updates

It is possible for a table to look at itself to check for existing matches. For example, if you are loading a second set of data into the Match Results table, we can have a second lookup, shown here as ec, to check the existing content for matches. 

Updating Existing Records

To update an existing record with new data we use a technique called Superseding. This technique will supersede the existing record, making the original record a historic record. This means that only the latest version of the record will display in a view, unless a view is specifically setup to see historic records. Historic records are shown with a grey background.

For information on the setup, see Superseding.