PhixFlow Help

Scenario

What counts as a duplicate?

There are three cases of duplicate records:

  1. Two or more records have identical values in each and every field (true duplicates).
  2. Two or more records have identical values in some fields, and the fields that do not have matching values are of no consequence (it does not matter which record we take).
  3. Two or more records have identical values in some fields, and one of the fields that varies gives us a value we can select on (in practice, usually a datetime field like 'last updated time').

Name    AddressLine1   LastUpdatedDate  FavouriteDog
Andrew  10 Gwydir St.  01/02/2017       Blue
Andrew  10 Gwydir St.  01/02/2017       Blue

Name    AddressLine1   LastUpdatedDate  FavouriteDog
Andrew  10 Gwydir St.  01/02/2017       Blue
Andrew  15 Mill Road   03/04/2018       Red
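
Outside PhixFlow, the difference between a true duplicate and a duplicate on selected fields can be sketched in a few lines of Python. This is only an illustration: the function names `is_true_duplicate` and `is_key_duplicate` are hypothetical, and the choice of `Name` as the matching field is an assumption made for this example.

```python
# Sample records mirroring the example tables above.
records = [
    {"Name": "Andrew", "AddressLine1": "10 Gwydir St.",
     "LastUpdatedDate": "01/02/2017", "FavouriteDog": "Blue"},
    {"Name": "Andrew", "AddressLine1": "10 Gwydir St.",
     "LastUpdatedDate": "01/02/2017", "FavouriteDog": "Blue"},
    {"Name": "Andrew", "AddressLine1": "15 Mill Road",
     "LastUpdatedDate": "03/04/2018", "FavouriteDog": "Red"},
]


def is_true_duplicate(a, b):
    """Case 1: every field in the two records matches."""
    return a == b


def is_key_duplicate(a, b, key_fields):
    """Cases 2 and 3: only the chosen key fields need to match."""
    return all(a[field] == b[field] for field in key_fields)


# The first two records match on every field.
true_dup = is_true_duplicate(records[0], records[1])
# The first and third records match on Name only (assumed key field).
key_dup = is_key_duplicate(records[0], records[2], ["Name"])
full_match = is_true_duplicate(records[0], records[2])
```

Whether a pair matching only on key fields falls under case 2 or case 3 depends on whether one of the remaining fields, such as `LastUpdatedDate`, should decide which record is kept.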

Step-by-step guide: Identifying Duplicate Records

  1. Click to select the stream that may contain duplicates.
  2. Right-click on the model view pane, and select 'Merge selected streams'. 

  3. In the pipe configuration dialog that pops up, group on the field with duplicated data and click the green tick to save your input.

  4. In the Automatic Stream Configuration dialog that appears, select 'just key attributes' from the drop-down list.
  5. Run analysis on the resulting stream. When you view the data, PhixFlow reports the number of records for each value of the grouping key and highlights the lines where that number is greater than one.
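
The grouping and counting that the steps above configure can be sketched in plain Python as a rough equivalent. This is not PhixFlow code: the helper names `count_by_key` and `duplicated_keys` are hypothetical, and the sample rows (including the 'Beth' record) are made up for illustration.

```python
from collections import Counter


def count_by_key(records, key_field):
    """Count how many records share each value of the grouping key."""
    return Counter(record[key_field] for record in records)


def duplicated_keys(records, key_field):
    """Return only the key values that occur more than once."""
    counts = count_by_key(records, key_field)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample data: two records share the same Name.
sample = [
    {"Name": "Andrew", "AddressLine1": "10 Gwydir St."},
    {"Name": "Andrew", "AddressLine1": "15 Mill Road"},
    {"Name": "Beth", "AddressLine1": "2 High St."},
]

flagged = duplicated_keys(sample, "Name")
```

Here `flagged` holds the key values with a count greater than one, which corresponds to the lines highlighted in the stream data.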

Step-by-step guide: Removing Duplicates

 

  1. Load all data (including duplicates) into a stream.
  2. Create a new stream from this stream, and make it an aggregate stream.
  3. On the pipe linking the two streams, set the maximum number of records to one and group on the field with duplicated data.

    Optional

    If this is case 3, sort on another attribute to control which record is kept. For example, to keep the latest record, sort by the last updated date. See 'outMult' in the screenshot below for illustration.

  4. Run the analysis and inspect the data in the resulting stream. It should now contain only one record for each value of the grouping field.
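
For comparison, the same "group, sort, keep one record" logic can be sketched in plain Python. This is a minimal illustration and not how PhixFlow implements it: `keep_latest` and `parse_date` are hypothetical helpers, and the day/month/year date format is an assumption about the sample data.

```python
from datetime import datetime


def parse_date(text):
    # Assumption: dates are written day/month/year, e.g. 03/04/2018.
    return datetime.strptime(text, "%d/%m/%Y")


def keep_latest(records, key_field, date_field):
    """Keep one record per key value: the one with the latest date."""
    latest = {}
    for record in records:
        key = record[key_field]
        current = latest.get(key)
        if current is None or parse_date(record[date_field]) > parse_date(current[date_field]):
            latest[key] = record
    return list(latest.values())


rows = [
    {"Name": "Andrew", "AddressLine1": "10 Gwydir St.",
     "LastUpdatedDate": "01/02/2017", "FavouriteDog": "Blue"},
    {"Name": "Andrew", "AddressLine1": "15 Mill Road",
     "LastUpdatedDate": "03/04/2018", "FavouriteDog": "Red"},
]

unique_rows = keep_latest(rows, "Name", "LastUpdatedDate")
```

For cases 1 and 2, where it does not matter which record survives, the date comparison could be dropped and the first record seen for each key kept instead.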
