Enrich Stream Data

Scenario

Very often, a data extract will contains lots of references, and keys and links, but not necessarily data elements that are readable or useful to users. For example, lots of database tables have "type" fields where 1=home, 2=office, 3=school ... etc. An output report would want to show the text value of type, rather than just the number.

There are numerous ways to do data enrichment, the most common methods are:

Merge All data and then filter out unwanted data
Lookup to DB, via a lookup pipe to the DB collector
Use a SQL IN clause via an Input Multiplier

Solution:

Create a stream to serve as the source of the lookup.
Create a DB collector to retrieve the data that needs to be "looked up"
Implement data merges based on the 3 methods listed above.

Note : where stream and data sizes are small, the lookup method will have a negligible impact on the overall model. However, where data volumes are very large, choosing the correct lookup / enrichment solution can greatly impact the performance of the model. Try to determine which of the 3 methods listed here should be used in the following instances :

Stream contains 5,000 records, and DB table for lookup contains 200,000,000 records.
Stream contains 10,000,000 records, and DB table for lookup contains 10,000,000 records with an expected hit rate of 5%
Stream contains 10,000,000 records, and DB table for lookup contains 10,000,000 records with an expected hit rate of 85%
Stream contains 100,000,000 records, and DB table for lookup contains 100,000 records

What other data volume scenarios might impact data enrichment?