PhixFlow Help

Using Directed Pipes to Read from Large Data Sets

The directed pipe strategy can often be useful when reading from a large data set, especially when the number of records being used from the source is small compared to the overall size of the data set.

The directed pipe strategy can be used in a number of configurations, including reading from PhixFlow streams - see pipes for details. The example given on this page is for the most common use case: reading data from a large table in an external database. When applying this technique to reading from a stream, apply the same steps except for Update query to use _keyList - no special configuration is needed from the source stream that is being read by the directed pipe.

Create merge

Create a merge with an external table as normal. In the example show, we are using the merge to enrich a list of sales orders (account number, package name) with the category of the account, held in an external table. This results in the configuration:

The following settings have been applied in this model:

Mandatory input pipe/ output filter

The pipe acc is mandatory. An alternative to this is to check whether a value was received from the pipe acc in the output filter of 010 Sales Orders with Acc Category - if not, don't write the record.

Order query by grouping attributes

Remember to order the query by the grouping attributes in the pipe - in this case, the pipe cat is grouped by ACCOUNT_NUM and so the query is

select * from source_010_account_details
order by ACCOUNT_NUM

Apply directed pipe strategy

Open the pipe cat and set the pipe execution Strategy to Directed:

Note also the settings:

  • Max Workers: the number of queries that can be sent to the database in parallel
  • Worker Size: the number of elements that can be put into the in () clauses that are constructed by the pipe; 1000 is the standard maximum for Oracle.

Remove mandatory flag from input

At this point you can remove the mandatory flag from the pipe acc. You don't need to; in this example it will make no difference. But this might prove useful in the long term if you add additional inputs - it will then be clear to the designer updating the model that the mandatory flag is not required on this input pipe.

Update query to use _keyList

Update the query to use the internal variable _keyList in an in () clause:

select * from source_010_account_details
where ACCOUNT_NUM in ({_keyList})
order by ACCOUNT_NUM

Review results of configuration

Note that with this configuration, using the directed pipe strategy on the cat pipe, the results are identical to the original model. The directed pipe strategy has no impact on the output of a stream calculation - it is purely for optimising the performance of data enrichments from large data sets.

Model Ref: 010


Please let us know if we could improve this page feedback@phixflow.com