Scenario

A duplicate record is one which has the same values for key attributes as another record, and in addition there is no 

There are three types of duplicate record:

Two or more records have identical values in each and every field (true duplicates)

Two or more records have identical values in some fields, and the fields that do not have matching values are of no consequence (it does not matter which value we take)

Two or more records have identical values in some fields, and the fields that do not have matching values are of consequence (it matters which value we take)
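As a rough illustration of the distinction between true duplicates and records that only share key values, the following Python sketch groups records by their key fields and labels each group. The function name `classify_duplicates`, the labels `"true"` and `"partial"`, and the sample field names are hypothetical and not part of the product; whether the differing fields in a partial duplicate matter is a business decision that code cannot make.

```python
from collections import defaultdict


def classify_duplicates(records, key_fields):
    """Label each group of records that share the same key values.

    Returns a dict mapping the key tuple to "true" (every field matches)
    or "partial" (only the key fields match). Keys that occur once are
    not duplicates and are left out.
    """
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[f] for f in key_fields)].append(record)

    labels = {}
    for key, group in groups.items():
        if len(group) < 2:
            continue  # a single record is not a duplicate
        identical = all(r == group[0] for r in group[1:])
        labels[key] = "true" if identical else "partial"
    return labels


# Hypothetical sample data for illustration only.
rows = [
    {"customer_id": 1, "name": "Ann", "city": "Leeds"},
    {"customer_id": 1, "name": "Ann", "city": "Leeds"},
    {"customer_id": 2, "name": "Bob", "city": "York"},
    {"customer_id": 2, "name": "Bob", "city": "Hull"},
    {"customer_id": 3, "name": "Cy", "city": "Bath"},
]
labels = classify_duplicates(rows, ["customer_id"])
```

Here customer 1 is a true duplicate, customer 2 is a partial duplicate (the `city` values differ), and customer 3 is not reported because it appears only once.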

Files and database tables often contain duplicate data. Ignoring the duplicate records is often acceptable, and is sometimes required.

...

  1. Load all data (including duplicates) into a stream.
  2. Create a new stream from this stream and make it an aggregate stream.
  3. On the pipe linking the two streams, set the maximum number of records to one and group by the field that contains the duplicated data.
  4. Run the analysis and inspect the resultant stream data. There should now be only unique records.
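Outside the tool, the effect of the steps above (group by the duplicated field and keep at most one record per group) can be sketched in plain Python. This is only an approximation of what the aggregate stream does; the function name `unique_by_key` and the sample fields are hypothetical, and the sketch assumes the first record encountered in each group is the one kept.

```python
def unique_by_key(records, key):
    """Keep the first record seen for each distinct value of `key`."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


# Hypothetical sample data: customer 1 appears twice.
rows = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 1, "name": "Ann"},
]
deduped = unique_by_key(rows, "customer_id")
```

After this runs, `deduped` holds one record per `customer_id`, in the order the ids first appeared.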

What counts as a duplicate?

To control which record is kept from each group, sort on another attribute. For example, to get the latest record, sort by the last updated date.
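The idea of choosing which record survives can be sketched in Python as follows. Instead of sorting, this sketch makes a single pass and remembers the record with the greatest value of the ordering field for each key, which gives the same result as sorting by that field from newest to oldest and keeping the first record per group. The names `latest_per_key`, `customer_id`, and `last_updated` are hypothetical.

```python
from datetime import date


def latest_per_key(records, key, order_by):
    """For each distinct `key` value, keep the record with the largest `order_by`."""
    latest = {}
    for record in records:
        current = latest.get(record[key])
        if current is None or record[order_by] > current[order_by]:
            latest[record[key]] = record
    return list(latest.values())


# Hypothetical sample data: customer 1 has an older and a newer record.
rows = [
    {"customer_id": 1, "city": "Leeds", "last_updated": date(2020, 1, 5)},
    {"customer_id": 2, "city": "York", "last_updated": date(2021, 3, 1)},
    {"customer_id": 1, "city": "Hull", "last_updated": date(2022, 6, 9)},
]
latest = latest_per_key(rows, "customer_id", "last_updated")
```

For customer 1, the record with city `Hull` is kept because it has the later `last_updated` date.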

...

A table of data can be divided into recordsets: all records that share the same value x for attribute X belong to the same recordset.
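A minimal Python sketch of this partitioning, assuming each record is a dictionary and using a hypothetical function name `split_into_recordsets`:

```python
from collections import defaultdict


def split_into_recordsets(records, attribute):
    """Group records so that each recordset shares one value of `attribute`."""
    recordsets = defaultdict(list)
    for record in records:
        recordsets[record[attribute]].append(record)
    return dict(recordsets)


# Hypothetical sample data: attribute X is "region".
table = [
    {"region": "north", "sales": 10},
    {"region": "south", "sales": 7},
    {"region": "north", "sales": 4},
]
recordsets = split_into_recordsets(table, "region")
```

Each key of `recordsets` is one value x of the attribute, and the list under it is the recordset for that value.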


...