Scenario

Files (or database records) often contain duplicate data. Sometimes this is acceptable, but sometimes duplicate records must be ignored.

...

Panel: What counts as a duplicate?

There are three cases of duplicate records:

  1. Two or more records have identical values in each and every field (true duplicates).
  2. Two or more records have identical values in some fields, and the fields that do not have matching values are of no consequence (it does not matter which record we take).
  3. Two or more records have identical values in some fields, and one of the varying fields gives us a value we can select on (in practice, usually a datetime field such as 'last updated time').
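To make the three cases concrete, here is a minimal Python sketch. The records and field names are invented for illustration and are not taken from PhixFlow:

```python
from collections import defaultdict

# Hypothetical records: each tuple is (account_id, name, last_updated).
records = [
    ("A1", "Alice", "2024-01-01"),
    ("A1", "Alice", "2024-01-01"),   # case 1: identical in every field
    ("B2", "Bob",   "2024-02-01"),
    ("B2", "Bob",   "2024-03-15"),   # case 3: same key, later 'last_updated'
]

# Group on the key field, as the merge/aggregate steps in this guide do.
groups = defaultdict(list)
for rec in records:
    groups[rec[0]].append(rec)

# Cases 1 and 2: any record in the group will do, so take the first.
first_per_key = {key: recs[0] for key, recs in groups.items()}

# Case 3: pick the record with the latest 'last_updated' value.
latest_per_key = {key: max(recs, key=lambda r: r[2]) for key, recs in groups.items()}
```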

Step-by-step guide: Identifying Duplicate Records

  1. Click to select the stream that may contain duplicates.
  2. Right-click on the model view pane, and select 'Merge selected streams'. 

  3. In the pipe configuration dialog that appears, group on the field containing duplicated data and click the green tick to save your input.

  4. In the Automatic Stream Configuration dialog that appears, select 'just key attributes' from the drop-down.
  5. Run analysis on the resulting stream. Viewing the data, you can see that for each value of the grouping key,
    PhixFlow reports the number of records in that group and highlights lines where the count is greater than one.
  6. If, after inspecting the data, you think the groups of duplicated records are defined by some key other than the one you tried, repeat the above procedure, generating a new merge stream with different grouping keys.
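The identification step above amounts to counting records per grouping key and flagging keys that occur more than once. A rough Python equivalent, where 'phone' is a hypothetical grouping key:

```python
from collections import Counter

# Hypothetical rows; 'phone' stands in for whichever field you group on.
rows = [
    {"phone": "555-0101", "name": "Alice"},
    {"phone": "555-0102", "name": "Bob"},
    {"phone": "555-0101", "name": "Alice"},
]

# Count records per grouping key, mirroring the group counts in the merged stream.
counts = Counter(row["phone"] for row in rows)

# Keys with a count greater than one are the duplicates PhixFlow would highlight.
duplicated_keys = [key for key, n in counts.items() if n > 1]
```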

Step-by-step guide: Removing Duplicates

  1. Load all data (including duplicates) into a stream.
  2. Create a new stream from this stream, and make it an aggregate stream.
  3. On the pipe linking the two streams, set the maximum number of records to one, and group it on the field with duplicated data.

    Panel: Optional

    If this is case 3, apply sorting on another attribute, depending on which record you want. E.g. if you want to get the latest record, sort by the last updated date.

  4. In the second stream, reference data coming from the input pipe using an array index, i.e. in[1].value, to retrieve just the first of the grouped records.
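Steps 3 and 4 together behave like a group, sort, and take-first operation. A minimal Python sketch of that logic, using invented field names ('id' as the grouping key, 'last_updated' as the case-3 sort attribute):

```python
from itertools import groupby

# Hypothetical rows; 'id' is the field with duplicated data,
# 'last_updated' is the case-3 sort attribute.
rows = [
    {"id": "B2", "last_updated": "2024-02-01"},
    {"id": "A1", "last_updated": "2024-01-01"},
    {"id": "B2", "last_updated": "2024-03-15"},
]

# Two stable sorts: latest record first within each id, then ids grouped together.
rows.sort(key=lambda r: r["last_updated"], reverse=True)
rows.sort(key=lambda r: r["id"])

# Keep only the first record of each group -- the analogue of in[1] on an
# aggregate pipe with a maximum of one record per group.
deduped = [next(group) for _, group in groupby(rows, key=lambda r: r["id"])]
```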

 

...

  5. See 'outMult' in the screenshot below for illustration.

    [Image: outMult pipe configuration]

  6. Run the analysis and inspect the resulting stream data. There should now be only unique records.


...