PhixFlow Help

Enriching large data sets from very large data sets

By the end of this chapter you will be able to:

  • User Multiplier Pipes
  • Use directed merge method

In this exercise you will re-model the enrichment you did in the previous exercise, but this time using a merge with a pipe multiplier. Although the data sets in the exercise are small, bear in mind that the technique you use in this exercise is useful when the data set you wish to enrich (in this case, sales orders) is large, and the data set you are using to enrich this data (in this case, phone numbers) is very large.

Enrichment using Merge with Multiplier Pipe

Create a new model, and add:

  • A Database Collector to read from the table SOURCE_AM_PHONE_NUMBERS, using the Datasource CRM
  • A Stream to read data from this Database Collector
  • A Database Collector to read from the table SOURCE_AM_SALES_ORDERS, using the Datasource CRM
  • A Stream to read data from this Database Collector

Read the contents of both tables into PhixFlow. As in the previous exercise, you will see that there are 15 sales orders, and 100 phone numbers.
You will now add a Merge Stream to build a list of sales orders, enriched with phone numbers:

  • Add a Merge Stream to your model
  • Give this Stream the Name Sales Orders with Phone Numbers
  • Add pipes into this Stream from SOURCE_AM_PHONE_NUMBERS_3 and SOURCE_AM_SALES_ORDERS_2_2
  • Add the grouping attribute CUSTOMER_REF to input both pipes
  • To complete the configuration of the pipe linking SOURCE_AM_SALES_ORDERS_2_2 to Sales Orders with Phone Numbers:
    • Go to the Advanced tab
      • Tick the Mandatory flag
      • Tick the Multiplier flag
  • Drag all attributes from SOURCE_AM_SALES_ORDERS_2_2 into Sales Orders with Phone Numbers
  • Drag the attribute PHONE_NUMBER from SOURCE_AM_PHONE_NUMBERS_3 into Sales Orders with Phone Numbers
  • To complete configuration of the stream Sales Orders with Phone Numbers make the expression for the attribute CUSTOMER_REF: _key[1]

Run Analysis on Sales Orders with Phone Numbers. You will see that the result is the same as in the previous exercise – this Merge, using the Mandatory Multiplier pipe, is equivalent to a Calculate using a lookup to get the phone number for each sales order. However, this configuration is useful when the set being enriched (here, sales orders) is large and the data set being used to enrich this (here, phone numbers) is very large.

Enrichment using Merge with Multiplier Pipe, with Execution Strategy

Update the model you created in the previous exercise:

  • Open the configuration form for the pipe linking SOURCE_AM_PHONE_NUMBERS_3 to Sales Orders with Phone Numbers
    • Go to the Advanced tab
      • Set the Execution Strategy to Directed
      • (Leave the Max Workers and Worker Size fields blank)

Run Analysis on Sales Orders with Phone Numbers again. You will get the same result. The Execution Strategy does not make any difference to the logic of this model, but it will change the underlying mechanism used to get values used in the merge.

Where you are dealing with very large data sets and are using a configuration like this – enrichment through a Merge where the main input pipe is a Mandatory Multiplier, it is often useful to use the Execute Strategy option to optimise performance. See the PhixFlow help for more details.

Enrichment using merge with multiplier pipe from external table

Create a new model, and add:

  • A Database Collector to read from the table SOURCE_AM_PHONE_NUMBERS, using the Datasource CRM
  • A Database Collector to read from the table SOURCE_AM_SALES_ORDERS, using the Datasource CRM
  • A Stream to read data from the database collector SOURCE_AM_SALES_ORDERS_3
  • Add a Merge Stream to the model with name Sales Orders with Phone Numbers 2
  • Add a pipe from the stream SOURCE_AM_SALES_ORDERS_3 into Sales Orders with Phone Numbers 2
  • Add a pipe from the database collector SOURCE_AM_PHONE_NUMBERS_4 into Sales Orders with Phone Numbers 2
  • Drag all attributes from stream SOURCE_AM_SALES_ORDERS_3 into Sales Orders with Phone Numbers 2
  • Complete configuration of the pipe from database collector SOURCE_AM_PHONE_NUMBERS_4 into Sales Orders with Phone Numbers 2:
    • Add the grouping attribute CUSTOMER_REF
  • Edit the query in the Database Collector SOURCE_AM_PHONE_NUMBERS_4:
    • Add the clause

order by CUSTOMER_REF asc
Remember that any time you merge directly from an external database, you must match the ordering on the query to the grouping/ sorting on the output pipe

  • Complete configuration of the pipe from stream SOURCE_AM_SALES_ORDERS_3 into Sales Orders with Phone Numbers 2:
    • Add the grouping attribute CUSTOMER_REF
    • Tick the Mandatory flag
    • Tick the Multiplier flag
  • To complete configuration of Sales Orders with Phone Numbers 2:
    • Add an attribute PHONE_NUMBER
    • Give this attribute an expression to read the phone number from the Database Collector SOURCE_AM_PHONE_NUMBERS_4

Run Analysis on Sales Orders with Phone Numbers 2. You will get 15 records – the same results as in the previous few exercises.

Enrichment using directed merge from external table

  • Copy the four objects you created in the previous exercise (2 Database Collectors and 2 Streams)
  • Edit the new version of the configuration, so that you are using Execution Strategy = Directed in the pipe from the database collector SOURCE_AM_PHONE_NUMBERS_5 into the merge Sales Orders with Phone Numbers 2_2
    • (Leave the Max Workers and Worker Size fields blank)
  • Edit the query in SOURCE_AM_PHONE_NUMBERS_5:

select * from SOURCE_AM_PHONE_NUMBERS
where CUSTOMER_REF in ({_keyList})
order by CUSTOMER_REF asc
Run Analysis on Sales Orders with Phone Numbers 2_2. You will get the same result as before – as when reading from a stream, using Execution Strategy does not affect the results of this model. However, it changes the underlying mechanism used to retrieve the merge values. To see this, turn on the logging option Log Collector Statements. Run Analysis on Sales Orders with Phone Numbers 2_2 again. Find the log message that gives the query sent by the Database Collector SOURCE_AM_PHONE_NUMBERS_5 – you will see the query built using a list of values from the output Stream, in the in clause of the query.

This is an important configuration and is commonly used to enrich a large data set in PhixFlow from a very large external table.

Please let us know if we could improve this page feedback@phixflow.com