PhixFlow Help


Scenario

Files (or database records) often contain duplicate data. Sometimes this is acceptable, and sometimes duplicate records must be ignored.

Duplicate data, or data with duplicate keys, is a feature of most enterprise systems. PhixFlow provides many ways of dealing with duplicated data, and how a model uses them depends entirely on the system requirements. In this case, we just want to ignore duplicates.

Step-by-step guide: Identifying Duplicate Records

  1. Click to select the stream that may contain duplicates.
  2. Right-click on the model view pane and select 'Merge selected streams'.
  3. In the pipe configuration dialog that pops up, group on the field with duplicated data and click the green tick to save your input.
  4. In the Automatic Stream Configuration dialog that appears, select 'just key attributes' from the drop-down.
  5. Run analysis on the resulting stream. Viewing the data, you can see that for each value of the grouping key, PhixFlow reports the number of records in that group, and highlights lines where the count is greater than one.
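The logic of the steps above, grouping on a key and counting records per group, can be sketched outside PhixFlow as well. This is a minimal illustration only; the record and field names (`account`, `value`) are assumptions, not part of any PhixFlow API.

```python
# Sketch of what the grouped stream computes: for each value of the
# grouping key, count the records and flag groups with more than one.
from collections import Counter

records = [
    {"account": "A1", "value": 10},
    {"account": "A2", "value": 20},
    {"account": "A1", "value": 10},  # duplicate key
]

# Count records per grouping-key value.
counts = Counter(r["account"] for r in records)

for key, n in counts.items():
    flag = "DUPLICATE" if n > 1 else ""
    print(key, n, flag)
```

Any key with a count greater than one marks a group of duplicate records, which is exactly what the highlighted lines in the stream view show.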

Step-by-step guide: Removing Duplicates

  1. Load all data (including duplicates) into a stream.
  2. Create a new stream from this stream, making it an aggregate stream.
  3. Make the pipe linking the two streams an aggregate pipe, grouped on the field with duplicated data and sorted by another field, depending on which record you want to keep. For example, to keep the latest record, sort by the updated date.
  4. In the second stream, reference data coming from the input pipe using an array index, i.e. in[1].value, to retrieve just the first of the grouped records.
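The aggregate-pipe behaviour described above can be sketched as: group records by the key, order each group by the updated date (latest first), then keep only the first record of each group, which is what `in[1]` retrieves in PhixFlow terms. The field names here are illustrative assumptions.

```python
# Sketch of "group on key, sort by updated date, take the first record".
from itertools import groupby

records = [
    {"account": "A1", "updated": "2021-01-05", "value": 10},
    {"account": "A1", "updated": "2021-03-01", "value": 15},
    {"account": "A2", "updated": "2021-02-10", "value": 20},
]

# Sort by updated date descending, then (stably) by key, so within each
# key the latest record comes first.
records.sort(key=lambda r: r["updated"], reverse=True)
records.sort(key=lambda r: r["account"])

# Keep only the first record of each group -- the equivalent of in[1].
deduped = [next(group) for _, group in groupby(records, key=lambda r: r["account"])]
```

Because the second sort is stable, the latest record stays first within each key, so `deduped` holds exactly one record per key: the most recently updated one.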
