/
Enriching small data sets from very large data sets

PhixFlow Help

Enriching small data sets from very large data sets

By the end of this chapter you will be able to:

  • Do individual lookups into Streams
  • Do individual lookups into external databases
  • Use data from the output Stream of a Database Collector in the collector query

Individual lookups into Streams

Create a new model, and add:

  • A Database Collector to read from the table SOURCE_AM_PHONE_NUMBERS, using the Datasource CRM
  • A Stream to read data from this Database Collector
  • A Database Collector to read from the table SOURCE_AM_SALES_ORDERS, using the Datasource CRM
  • A Stream to read data from this Database Collector

Read the contents of both tables into PhixFlow. You will see that there are 15 sales orders, and 100 phone numbers. Although 100 phone numbers would easily fit into memory – this exercise uses a technique which would be useful if the number of phone numbers was very large - several hundred thousand or even millions.
You will enrich each sales order with the customer's phone number:

  • Add a lookup pipe from SOURCE_AM_PHONE_NUMBERS into SOURCE_AM_SALES_ORDERS
  • Set the Index Type of the lookup pipe as None
  • Add a filter:

CUSTOMER_REF equals _out.CUSTOMER_REF
Remember to make the filter value an expression (fx) rather than a fixed value (ABC)

  • Save your changes to the pipe
  • Drag the attribute PHONE_NUMBER from SOURCE_AM_PHONE_NUMBERS to SOURCE_AM_SALES_ORDERS

Run Analysis on SOURCE_AM_SALES_ORDERS. You will see the phone number populated for each sales order in the output in SOURCE_AM_SALES_ORDERS.
By using this technique, you looked up only the phone numbers you needed, rather than loading all 100 into memory.

Individual lookups into external databases

To your model add:

  • Another Stream to read from the Database Collector SOURCE_AM_SALES_ORDERS
  • Another Database Collector to read from the table SOURCE_AM_PHONE_NUMBERS, using the Datasource CRM
  • Add a lookup pipe from the Database Collector SOURCE_AM_PHONE_NUMBERS_2 to the Stream SOURCE_AM_SALES_ORDERS_2.
  • Set Index Type set to None
  • Update the query in the Database Collector SOURCE_AM_PHONE_NUMBERS_2 to

select * from SOURCE_AM_PHONE_NUMBERS
where CUSTOMER_REF = {_out.CUSTOMER_REF}

  • Add an attribute PHONE_NUMBER to the Stream SOURCE_AM_SALES_ORDERS_2 to read the phone number from the Database Collector SOURCE_AM_PHONE_NUMBERS_2

Run Analysis on the Stream SOURCE_AM_SALES_ORDERS_2. You will see the phone number populated for each sales order.
By using this technique, you looked up from the external table only the phone numbers you needed, rather than loading the entire table into memory.

In some databases, e.g. Oracle, it is often very efficient doing quite a large number of individual lookups rather than trying to cache large amounts of data using only a few queries. When you set up data enrichment directly from large external database tables, it is often worth experimenting with different techniques to see which is fastest – and also to check that you aren't trying to exceed the limits of the memory on your server!

Apply index to input stream

In this exercise you will apply an index to a stream you created in exercise 9.1. Although you will not see a difference in performance in this case – because the data volumes are so small – you can see how to apply an index, and an important part of the process for assessing and improving the performance of your models.

  • Open the stream SOURCE_AM_PHONE_NUMBERS

This is an input stream to SOURCE_AM_SALES_ORDERS – the phone numbers are read in from this stream by filtering on CUSTOMER_REF, so we will add an index on CUSTOMER_REF to improve the performance of this lookup

  • In the stream configuration for SOURCE_AM_PHONE_NUMBERS, double-click on the attribute CUSTOMER_REF
    • In the configuration form for the attribute tick the flag Indexed
  • Save your changes
  • Run SOURCE_AM_SALES_ORDERS

When the input stream – in this example SOURCE_AM_PHONE_NUMBERS – is large, with this type of configuration where several lookups are made (rather than caching all input records in one go) indexing the correct attribute in the input stream can make a profound difference to the performance.

Please let us know if we could improve this page feedback@phixflow.com