Scenario
...
There are three cases of duplicate records:
...
...
Files (or database records) often contain duplicate data. Ignoring the duplicate records is often acceptable and is sometimes required.
...
Step-by-step guide: Identifying Duplicate Records
- Click to select the stream that may contain duplicates.
- Right-click on the model view pane, and select 'Merge selected streams'.
- In the pipe configuration dialog that pops up, group on the field with duplicated data and click the green tick to save your input.
- In the Automatic Stream Configuration dialog that appears, select 'just key attributes' from the drop-down.
- Run analysis on the resulting stream. Viewing the data shows that, for each value of the grouping key, PhixFlow reports the number of records in that group and highlights lines where that number is greater than one.
- If, after inspecting the data, you think the groups of duplicated records are defined by a key other than the one you tried, repeat the above procedure, generating a new merge stream with different grouping keys.
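The idea behind these steps (group on a candidate key, count the records in each group, flag counts above one) can be sketched outside PhixFlow. The following is a minimal illustration in plain Python, not PhixFlow's implementation; the sample records and the field names `account_id` and `amount` are hypothetical.

```python
from collections import Counter

# Hypothetical sample data; "account_id" stands in for the field
# suspected of holding duplicated values.
records = [
    {"account_id": "A1", "amount": 10},
    {"account_id": "A2", "amount": 20},
    {"account_id": "A1", "amount": 10},
    {"account_id": "A3", "amount": 5},
    {"account_id": "A2", "amount": 25},
]


def count_by_key(rows, key_fields):
    """Map each grouping-key value (as a tuple) to its record count."""
    return Counter(tuple(row[f] for f in key_fields) for row in rows)


def duplicated_keys(rows, key_fields):
    """Return only the key values that occur more than once."""
    return {k: n for k, n in count_by_key(rows, key_fields).items() if n > 1}


# First attempt: group on a single field.
by_account = duplicated_keys(records, ["account_id"])

# Second attempt with a different grouping key, mirroring the
# "repeat with different grouping keys" step.
by_account_and_amount = duplicated_keys(records, ["account_id", "amount"])

print(by_account)
print(by_account_and_amount)
```

Comparing the two results shows how the choice of grouping key changes which records count as duplicates: grouping on `account_id` alone flags two groups, while adding `amount` to the key flags only the group whose records match on both fields.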
Step-by-step guide: Removing Duplicates
...
- Load all data (including duplicates) into a stream.
- Create a new stream from this stream and make it an aggregate stream.
- On the pipe linking the two streams, set the maximum number of records to one and group on the field with duplicated data.
...
Optional: If this is case 3, apply sorting on another attribute, depending on which record you want. For example, to get the latest record, sort by the last updated date.
...
What counts as a duplicate?
A table of data can be divided into recordsets by saying that every record with the same value x for attribute X belongs to the same recordset.
...
See 'outMult' in the screenshot below for illustration.
- Run the analysis and inspect the resultant stream data. There should now be only unique records.
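The removal procedure amounts to keeping at most one record per group, optionally after sorting so that the kept record is the one you want. A minimal sketch in plain Python follows, assuming hypothetical field names (`id`, `status`, `last_updated`) and ISO-formatted date strings; it illustrates the logic only and is not how PhixFlow implements aggregate streams or pipes.

```python
def remove_duplicates(rows, key_field, sort_field=None, descending=True):
    """Keep one record per distinct value of key_field.

    Without sort_field, the first record encountered in each group is kept.
    With sort_field, rows are ordered first, so the kept record is the one
    with the largest (descending=True) or smallest value of sort_field.
    """
    if sort_field is not None:
        rows = sorted(rows, key=lambda r: r[sort_field], reverse=descending)
    kept = {}
    for row in rows:
        # setdefault stores a row only the first time its key is seen,
        # which corresponds to a maximum of one record per group.
        kept.setdefault(row[key_field], row)
    return list(kept.values())


# Hypothetical sample data. ISO date strings sort correctly as text.
records = [
    {"id": "A1", "status": "new", "last_updated": "2024-01-05"},
    {"id": "A2", "status": "new", "last_updated": "2024-01-06"},
    {"id": "A1", "status": "closed", "last_updated": "2024-02-01"},
]

# No sorting: whichever record of each group comes first is kept.
first_seen = remove_duplicates(records, "id")

# Optional sorting: keep the most recently updated record per id.
latest = remove_duplicates(records, "id", sort_field="last_updated")

print(first_seen)
print(latest)
```

Without sorting, which record survives depends on input order; sorting by the last updated date makes the choice deliberate.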
...