Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Insert excerpt
_Banners
_Banners
nameanalysis
nopaneltrue

Matching

Data from Different TablesWhen Merging two or more Tables, we

and Deduplicating

We can use the built in index type to perform different types of merging. This is available on a pipe. For complete details search for the Index Type on the Pipe help page. Here will look at the following two matchesTwo common matches are:

  1. Exact MatchThe pipe retrieves data from its cache based on an "Exact Match" lookup with the values provided.
  2. Near MatchThe pipe will retrieve data from its cache based on a "Near Match" lookup, taking into account the numeric value provided by Maximum Number of Edits Expression which sets a tolerance for near match. For example, if the index key is "Smyhte" and the result of the index expression is "Smith" this would still be a match providing that the allowed number of edits is 3 or more because the two values are 3 changes away from one another. Specifically, substitute the 'i' for a 'y', transpose the 't' and the 'h' and insert an 'e' at the end.

This is standard functionality however you can create your own attributes in your tables specifically for matching, for example creating a version of a company name with business suffixes removed such as Ltd and Limited to improve matching. Consider other standardisations that can be applied, such as replacing accented characters with a normalised version so that "Anthóny" becomes "Anthony" which will standardise matching. 

When matching with a Near Match it is also possible pass the Maximum Number of Edits Expression a variable, this allows you to allow a number of edits based on the length of your value. For examplefor a level of difference, specifically the number of edits allowed before two strings match. For example, setting Maximum Number of Edits Expression to 1, "Anthony" and "Antony" would match. 

Consider how you want to match, if you hard code a value of 3 for your Maximum Number of Edits Expression, names such as Lee and Zoe would match giving you a lot of false positive matches. Consider using Using a calculated value can improve the matching accuracy, for example the function stringLength to calculate a percentage change of character you would allow e.g. 33%, you can use the round function or one of its variants to create an integer. This will be explored in more detail in the next section.

Code BlockMaximum Number of Edits Expression do(

. Here is a worked example:

Solution

Here is our example incoming data, we want to tune our results so that we avoid false positive matches but still capture all potential matches:

UIDName

1

Dan
2Yan
3Gary
4Barry
5Zoe
6Gus

Setup

  1. We have a set of incoming records that contain a Name and a unique ID (UID). Set the table to be static so that we can test our matching without running analysis on this table.
  2. Create a table called Match Results with the same attributes as the Incoming records. We will expand this later to perform matching.
  3. Connect a pipe between the two tables.
  4. Connect a second pipe between the two tables. This will be setup to look for duplicates.
    1. Set its type to be Look-up
    2. Call the pipe nc.
    3. at this point the setup will now look like this:
    4. Image Added
    5. Set the filter to on the nc pipe to: UID not equals to in.UID.
      1. This will ensure that when we perform the lookup to see if there are duplicate names matching, we do not check against the record performing the lookup. 
    6. Click 
      Insert excerpt
      _save
      _save
      nopaneltrue
      .
    7. Set the Order/Index as follows:
      1. Image Added
      2. Note: $distance is a variable we will declare that will define the character distance between matches.
    8. Save and close the pipe properties by clicking 
      Insert excerpt
      _finish
      _finish
      nopaneltrue
      .
  5. On the analysis model double click Matching Results to open its properties.
  6. In the attributes section add a new attribute:
    1. Set Name to Potential Matches.
    2. Set type to be BigString.
    3. Set expression to the following, the comments explain each step:

    4. Code Block
      do(
          // Calculate 33% of the String
    1.  length.
          $i = stringLength(_out.
myString
    1. Names)
/
    1. *0.33,
       
    1. 
          // Round the value to an integer
         
    1.  $i = 
round
    1. roundUp($i, 0),
          
          // Perform the lookup to see if there are any potential matches
      	// we set the variable $distance to be 33% of the length of the name.
          $matches = lookup(nc, $distance = $i).Names,
          
          // Set the array of names to a string for output
          listToString($matches)
      )


  1. Save and close the properties by clicking 
    Insert excerpt
    _finish
    _finish
    nopaneltrue
    .
  2. Run Analysis and the matching names will appear in the Potential Matches column of the matching Results Table.
  3. Tune your threshold value for your data set.
  4. Alternatively if you are looking for an Exact Match, in Order/Index set the Match Type to Exact Name and remove all values from Maximum Number of Edits Expression, and the expression for the Potential Matches would be:

    Code Block
    do(
        /
Output $i )

Deduplicating Data in a Single Table

self referencing pipes and updates
  1. / Perform the exact name match lookup.	 
        $matches = lookup(nc).Names,
        
        // Set the array of names to a string for output
        listToString($matches)
    )



Next Steps

Once you have identified your potential matches you need to decide what to do next. There are two mains routes, which can be used in unison:

  1. Create a Screen that allows a user to decide if the name combinations are a match and if they are these can be filtered out of an output.
  2. Programmatically handle the matches. Introduce other attributes into the matching and score the potentially matches for example if a name, DOB and address match then you can programmatically merge these records.   


Self referencing pipes and updates

It is possible for a table to look at itself to check for existing matches. For example, if you are loading a second set of data into the Match Results table, we can have a second lookup check the existing content for matches. 

Image Added

Updating Existing Content

There 

Superseeding

Checking if a Value Already Exists