/
Deduplicating Similar Data Sets

PhixFlow Help

Deduplicating Similar Data Sets

Scenario

Finding records with the same key in a large stream of data. For each pair of matching records from the data sets, a single record is produced in the output.

Example

Finding account details for 1 million records in a reference list of all (~20m) accounts.

The configuration for this scenario is initially similar to Merging Similar Data Sets. A pull pipe is created from each source stream to the result stream. A single grouping key is put on each pipe. The key fields chosen should contain the same sort of data in the same format in both source streams, but do not have to have the same attribute name.

For the example scenario, where you want to find account details for 1 million records in a reference list of all (~20m) accounts, 'Source Stream 1', in the above screenshot, would be the stream containing data the 1 million records and 'Source Stream 2' would be the reference list of all accounts.

The pipe from 'Source Stream 1' should be set to mandatory.

The pipe from 'Source Stream 2' should be set to 'Directed' and the 'Max Workers' and 'Worker Size' fields populated. 'Max Workers' is the number of groups of keys that will be processed at the same time, i.e. threads. The 'Worker Size' is the number of keys that you want in each group.

Please note that if you set 'Max workers' to the maximum number of threads available in your environment, then no other tasks will run until your directed merge completes.

Your stream should be configured like a standard merge: 

The attribute expressions for your key fields will be _key[n], where n is the order number of your key field grouping.

All other attributes will be referred to by prefixing their pipe name. For example, to return the contact name from the contacts pipe, you would write 'contacts.ContactName'.

Watch out for the multiple records in each group. You will either need to:

  • choose a record in each group (e.g. contacts.1.ContactName will choose the first record in the group),
  • use an aggregate function to find the best record in the the group (e.g. max(contacts.ContactName),
  • or tick 'multiplier' on the pipe, which will cause the resulting stream to display one row per record in the group, rather than one record for the group.

Check List For a Directed Merge From Database Collector

  • One mandatory pipe, sorted by the key field.
  • One directed pipe, with max workers and worker size set, sorted by the key field.
  • Query in database collector includes the following, where <key field> is replaced by the name of the key field: <key field> is in ({_keyList})
  • Query in database collector is ordered by the key field.
  • The key field is the stream has the expression _key[1].

Please let us know if we could improve this page feedback@phixflow.com