File Collector Properties
Basic Settings
Field | Description | ||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Name | Enter a name for this file collector. | ||||||||||||||||||||||||
Auto Configuration | This check box is available for a new file collector that has not been configured. It is ticked by default.
| ||||||||||||||||||||||||
Enabled |
| ||||||||||||||||||||||||
Source Type | Select whether or not the file is already on the PhixFlow server. Specified Directory: use for files that are already stored on the PhixFlow server. By default, PhixFlow assumes files are in the System Configuration → System Directories → Import File Location, or a subdirectory of it, and you must specify the relative path to the file(s) in the Input Directory Expression. If your file is not in the Import File Location, select Specified Directory and tick Ignore Base Directory; in this case you must specify the full path to the file(s) in the Input Directory Expression. See also Input Directory Expression and Ignore Base Directory, below. Managed File: use for files that you have on your local machine or the network. For a managed file, you must load the file before running the file collector.
If you have selected Auto Configuration, PhixFlow automatically starts the process for loading a managed file; see Easy Loading for Excel Spreadsheets, below. | ||||||||||||||||||||||||
Number of Header Lines | Enter the number of lines in the header of the file. These lines are ignored when reading the file. (This option is not available for the Binary File, XML File or HTML File types.) | ||||||||||||||||||||||||
File Type | Can have values:
For information about the PhixFlow internal variables you can use to specify the attributes you want, see File Collector Attributes, below. | ||||||||||||||||||||||||
Next Sequence | Available when File Location Strategy is All Files in Folder. Enter the next sequence number expected to be found within the name of the file being imported. | ||||||||||||||||||||||||
Allow Non-Scheduled Collection |
| ||||||||||||||||||||||||
FTP Site | Specify the FTP Site on which the import file is stored. If no site is specified, the file is assumed to be on the local machine. If a site is specified, every directory path on this form must be the full path to the file, because the base directory set in the system configuration is specific to the local machine and is therefore ignored. | ||||||||||||||||||||||||
File Location Strategy | Select one of:
This input database collector or stream must be attached to the file collector by a lookup pipe with no index set. Specify the attribute of the input stream or collector that contains the file path names in the File Name Attribute field. Enter the value as plain text, for example myFilePaths, not quoted as "myFilePaths". Each file path name is interpreted as a path relative to the Import Directory. A path name may be a simple file name, or it may have multiple levels of directory. Compressed files are interpreted as directories. On both Linux and Windows platforms, the directory separator must be a forward slash (/).
Read in file locations from a collector or stream. This input collector or stream must be attached to the file collector by a pipe. The attribute of the input stream or collector which contains the file locations is specified in the field: File Name Attribute.
| ||||||||||||||||||||||||
Tag
| Available when Source Type field is Managed File. Specify a sub-directory of System Configuration → System Directories → File Upload Directory using string literals only. Do not use PhixFlow variables. PhixFlow loads files into a directory whose full path is a combination of:
For the example paths above, the file collector looks in the directory If you are creating a file collector to load email messages and/or attached files, you can specify a tag here if one has been provided in the subject line of the incoming emails. See Reading Data From an Email Account for further details. | ||||||||||||||||||||||||
Ignore Base Directory | Available when Source Type is Specified Directory. The base directory is set in System Configuration → System Directories → Import File Location.
| ||||||||||||||||||||||||
Input Directory Expression
| Specify the location of the files you want to load. When Source Type is Managed File This field contains a read-only value of When Source Type is Specified Directory Specify a path:
If File Location Strategy is:
See also System Configuration for File Upload Directory and Import File Location. | ||||||||||||||||||||||||
Directory Pattern Expression | Enter a regular expression to specify one or more relative paths from the Input Directory to its sub-directories. PhixFlow looks in the input directory and its sub-directories for files whose names match the File Pattern Expression. PhixFlow can recognise some compressed file types; see Handling Compressed Files, below. PhixFlow treats these compressed files as a matching sub-directory. Internal Variables You can use the following PhixFlow internal variables in this expression:
| ||||||||||||||||||||||||
Exclude Dir. Pattern Expr. | If the Directory Pattern Expression includes some sub-directories that you do not want to search, enter a regular expression that will exclude those sub-directories. Internal Variables You can use the following PhixFlow internal variables in this expression:
| ||||||||||||||||||||||||
File Pattern Expression | Available when File Location Strategy is All Files in Folder. Enter a regular expression that will match one or more files in the Input Directory, in order to generate a list of files to load. PhixFlow uses the File Pattern Expression combined with the fields that specify the directories in which to look. This means PhixFlow:
If the specified directories contain no files that match the regular expression, the file collector will not load any files. Internal Variables You can use the following PhixFlow internal variables in this expression:
| ||||||||||||||||||||||||
Archive Directory Expression | Optionally, enter an expression for a directory path. The expression must resolve to a Regular Expression.
The Archive Directory Expression is the location to which all files processed by the file collector will be written. The Error Directory Expression is the location to which any files that cause an error during processing will be written. | ||||||||||||||||||||||||
Error Directory Expression | |||||||||||||||||||||||||
Local Archive Directory | Available when FTP Site is specified. Specify whether the archive directory is on the PhixFlow server (local) or on the original server.
| ||||||||||||||||||||||||
Local Error Directory | Available when FTP Site is specified. Specify whether the error directory is on the PhixFlow server (local) or on the original server. |
Advanced
Field | Description | ||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Maximum Files | Enter the maximum number of files that PhixFlow will process when the file collector runs. | ||||||||||||||||||||||
Minimum Files | Enter the minimum number of files that PhixFlow expects to find when the file collector runs. If fewer files are found then this is treated as an error. | ||||||||||||||||||||||
Max Records Per File | Enter the maximum number of records that PhixFlow will read from each file. | ||||||||||||||||||||||
Errors Before Rollback |
| ||||||||||||||||||||||
Parallel Readers | Enter the number of files to process in parallel. If blank, this defaults to 1. If the file collector is configured to read files in sequence, this field is ignored and a single file reader is used. | ||||||||||||||||||||||
Unreadable Directories | Select what PhixFlow will do if there are unreadable directories when it is searching a directory hierarchy for files to import.
| ||||||||||||||||||||||
XPath Expression | Available when File Type is XML File or HTML File. Enter valid XPath syntax. For information about how to use XPath expressions and how to use the returned data in the corresponding stream attribute expressions, see XPath Examples. | ||||||||||||||||||||||
Character Set |
| ||||||||||||||||||||||
Column Separator |
| ||||||||||||||||||||||
Separator Character |
| ||||||||||||||||||||||
Quote Style |
| ||||||||||||||||||||||
Quote Character |
| ||||||||||||||||||||||
Ignore Extra Columns |
| ||||||||||||||||||||||
Ignore Missing Columns |
| ||||||||||||||||||||||
Import Rows Matching | Enter an expression. PhixFlow compares each line in the file against the expression and imports only the lines that match. | ||||||||||||||||||||||
Replace Text Matching | These two settings work in conjunction with each other. For each imported line, PhixFlow finds all occurrences of text that match the expression you enter in Replace Text Matching, and replaces each occurrence with the value you enter in With. | ||||||||||||||||||||||
With | |||||||||||||||||||||||
Excel Data Range Expression |
Leave this field blank or enter an expression for the spreadsheet data range that PhixFlow will look in. The data that PhixFlow extracts from the range is defined in the File Columns section, below. The expression can specify:
You cannot specify:
If the worksheet name contains:
| ||||||||||||||||||||||
Ignore Undefined Values | Available when File Type is Excel Spreadsheet. When importing the file:
| ||||||||||||||||||||||
Maximum Excel File Size (MB) | Specify the maximum file size, in megabytes, to process. This prevents upload attempts on excessively large files from slowing down the server. The default is 0, which means no restriction. Specify a number greater than 0, for example | ||||||||||||||||||||||
File Password | Available if File Type is Excel Spreadsheet. If you are reading a spreadsheet which is password protected, enter the password here so that the file can be unlocked. | ||||||||||||||||||||||
Confirm Password | Available when File Type is Excel Spreadsheet. If you are reading a spreadsheet which is password protected, confirm the password here so that the file can be unlocked. | ||||||||||||||||||||||
Log Traffic |
|
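As a rough illustration of how Import Rows Matching and Replace Text Matching / With interact, the sketch below filters lines first and then rewrites the lines that were kept. It assumes the expressions behave like regular expressions, which this page does not state explicitly; the function name `import_lines` and the sample data are hypothetical and are not part of PhixFlow.

```python
import re


def import_lines(lines, import_pattern=None, replace_pattern=None, replacement=""):
    """Keep lines matching import_pattern, then replace text in each kept line.

    Assumption: both patterns are treated as regular expressions.
    """
    kept = []
    for line in lines:
        # Import Rows Matching: skip any line that does not match.
        if import_pattern and not re.search(import_pattern, line):
            continue
        # Replace Text Matching / With: replace every occurrence in the line.
        if replace_pattern:
            line = re.sub(replace_pattern, replacement, line)
        kept.append(line)
    return kept


sample = ["# comment", "2012-09-18;100", "2012-09-19;200"]
result = import_lines(
    sample,
    import_pattern=r"^\d{4}-",  # only lines that start with a year
    replace_pattern=";",        # text to find
    replacement=",",            # value from the With setting
)
print(result)
```

The replacement is applied only to lines that pass the filter, which mirrors the wording "for each imported line" in the table above.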
File Columns
Available when File Type is Comma Separated Values, Fixed Length Records and Excel Spreadsheet. Enter the attributes of the data columns that you want to extract from the input file. The grid has the standard toolbar and the extra buttons:
- Once you have uploaded a file, automatically populate the File Columns grid with the data attributes. PhixFlow samples some rows in the file to determine the values to use.
- Open a repository tab listing all the streams.
- Open a repository tab listing all the file collectors.
To add attributes, click the add button. To edit an attribute, double-click its row in the grid. PhixFlow opens an attribute form.
Field | Description | ||||||||
---|---|---|---|---|---|---|---|---|---|
Name | Enter the name of the column, which can contain any combination of letters, numbers and _ the underscore character. If you use PhixFlow always uses this name to refer to this attribute. | ||||||||
Order | Enter a number that matches the column number in the input file. For example, if you want to extract the third, first and fifth columns of data from a file, the three rows in this grid have the Order values 3, 1 and 5.
| ||||||||
Type | Enter one of the data types:
If you use | ||||||||
Length | Enter the maximum length of the field in the input file
This is:
If you use |
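The Order property above maps each attribute to a column number in the input file. The following sketch shows that idea for the third, first and fifth columns. The attribute names, the sample data and the `extract` helper are hypothetical, and the sketch assumes column numbers start at 1.

```python
import csv
import io

# Hypothetical File Columns grid: attribute name -> Order (1-based column number).
file_columns = {"callType": 3, "callId": 1, "duration": 5}


def extract(row, columns):
    """Pick values from one input row using 1-based column numbers."""
    return {name: row[order - 1] for name, order in columns.items()}


# Two sample rows with five columns each.
data = io.StringIO("A1,B1,C1,D1,E1\nA2,B2,C2,D2,E2\n")
records = [extract(row, file_columns) for row in csv.reader(data)]
print(records[0])
```

Columns that have no row in the grid (the second and fourth here) are simply not extracted.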
XML Namespaces
This section is available when File Type is XML. It has a toolbar with standard buttons. The grid contains a list of the namespaces defined in an XML response.
To add a namespace to the list, click
File Format Description
Available when File Type is Binary File. Enter details about the binary input file format and the data you want to extract.
Field | Description | |||||
---|---|---|---|---|---|---|
Validate File Format | Tick to indicate you have completed the properties and the XML description matches the file format. | |||||
Stream Item Node | Enter an expression for a string or list of strings. Each string is a node name in the binary file that will generate an output record.
| |||||
File Format Description | Enter an expression, using XML Binary File Grammar, describing the format of data in the file. |
Error Handling
All the properties that support error handling are documented individually above; this is a summary of these features, and what to consider when designing the error handling needed for the files you are loading.
Ignore Missing Columns
If you tick Advanced → Ignore Missing Columns, PhixFlow still imports rows that have fewer columns than expected. The missing fields have blank values in the stream.
PhixFlow lines columns up from the left, on the assumption that any missing columns are missing from the right-hand side. If columns are missing from the middle or the beginning of a line, the line may fail to import because the data types do not line up. This is not a reliable error trap: whether such a line fails is arbitrary. If you think lines might be missing values in the middle or at the beginning, be careful about using this option. In any case, you will need further validation.
Ignore Extra Columns
If you tick Advanced → Ignore Extra Columns, PhixFlow again assumes that any extra columns appear on the right. If they appear in the middle or at the beginning of the row, the row may or may not import, depending on how the data types of the resulting record line up with a standard record. If you think this may occur, you need further validation.
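The left-to-right alignment described for Ignore Missing Columns and Ignore Extra Columns can be pictured with a small sketch. This is only an illustration of the alignment rule, not PhixFlow's implementation; the `align_left` helper and the column names are hypothetical.

```python
def align_left(values, column_names):
    """Assign values to columns from the left.

    Missing trailing columns become None (blank); extra trailing values are dropped.
    """
    padded = list(values) + [None] * (len(column_names) - len(values))
    return dict(zip(column_names, padded))


columns = ["id", "name", "amount"]

# Last column missing: the remaining values still line up correctly.
print(align_left(["7", "Ann"], columns))

# Middle column missing: "12.50" lands in "name" and "amount" is blank,
# so the row is misaligned even though nothing is flagged.
print(align_left(["7", "12.50"], columns))

# Extra column on the right: it is ignored.
print(align_left(["7", "Ann", "12.50", "unexpected"], columns))
```

The second call shows why further validation is needed: the row is accepted, but the values sit in the wrong columns.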
Errors Before Rollback
If you set a number in Advanced → Errors Before Rollback and the error count across the whole run (not for individual files) stays below that threshold, PhixFlow loads all records that are not in error. The console shows a warning message for each record that fails, with the file name and details of the line that failed, but these records are discarded from the import. The files are then placed in the archive directory.
If the threshold is reached during the run, all files that have been processed up to that point are placed in the error directory, including files that don't contain errors, because they are associated with a failed run. The run is in error: it is shown as a red line in the console, and the resulting data set for the run (the stream set) is invalid and therefore not available to the next step in your process. Any files that had not yet been processed when the threshold was reached remain, untouched, in the input directory.
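The two outcomes described for Errors Before Rollback can be summarised in a simplified model. The sketch below counts errors per file in processing order and reports where each file ends up. The `run_collector` function and the file names are hypothetical, and the model ignores what happens to the rest of the file in which the threshold is reached.

```python
def run_collector(files, errors_before_rollback):
    """Simulate a run.

    files: list of (file_name, error_count) in processing order.
    Returns the run status and the directory each file ends up in.
    """
    total_errors = 0
    processed = []
    for index, (name, errors) in enumerate(files):
        processed.append(name)
        total_errors += errors
        if total_errors >= errors_before_rollback:
            # Threshold reached: everything processed so far goes to the
            # error directory; unprocessed files stay in the input directory.
            remaining = [n for n, _ in files[index + 1:]]
            return {"status": "error", "archive": [], "error": processed, "input": remaining}
    # Threshold never reached: all files are archived.
    return {"status": "ok", "archive": processed, "error": [], "input": []}


run = [("a.csv", 1), ("b.csv", 0), ("c.csv", 2), ("d.csv", 0)]
print(run_collector(run, errors_before_rollback=5))
print(run_collector(run, errors_before_rollback=3))
```

In the second call `b.csv` contains no errors but still ends up in the error directory, because it belongs to a failed run.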
Supported Formats and Values
Supported Date/Datetime Format Patterns
The following formats are available for use in Date and Datetime type fields:
Valid Date Formats | Valid Datetime Formats |
---|---|
dd/MM/yy; dd/MMM/yy; dd/MM/yyyy; dd/MMM/yyyy; dd-MM-yy; dd-MMM-yy; dd-MM-yyyy; dd-MMM-yyyy; MM/dd/yy; MMM/dd/yy; MM/dd/yyyy; MMM/dd/yyyy; MM-dd-yy; MMM-dd-yy; MM-dd-yyyy; MMM-dd-yyyy; yyyyMMdd | dd/MM/yy HH:mm:ss; dd/MMM/yy HH:mm:ss; dd/MM/yyyy HH:mm:ss; dd/MMM/yyyy HH:mm:ss; dd-MM-yy HH:mm:ss; dd-MMM-yy HH:mm:ss; dd-MM-yyyy HH:mm:ss; dd-MMM-yyyy HH:mm:ss; MM/dd/yy HH:mm:ss; MMM/dd/yy HH:mm:ss; MM/dd/yyyy HH:mm:ss; MMM/dd/yyyy HH:mm:ss; MM-dd-yy HH:mm:ss; MMM-dd-yy HH:mm:ss; MM-dd-yyyy HH:mm:ss; MMM-dd-yyyy HH:mm:ss; yyyyMMdd.HHmmss |
The symbols used in these formats are explained in the following table:
Symbol | Meaning | Presentation | Examples |
---|---|---|---|
y | year | year | 1996 |
M | month of year | month | Jul; 07 |
d | day of month | number | 10 |
H | hour of day (0~23) | number | 0 |
m | minute of hour | number | 30 |
s | second of minute | number | 55 |
The number of letters used in the pattern determines the format.
- Number: The minimum number of digits. Shorter numbers are zero-padded to this amount.
- Year: Numeric presentation of the year field is handled specially. For example, if the count of 'y' is 2, the year is given as the zero-based year of the century, which is two digits.
- Month: If the count of 'M' is 3 or more, the month is text; otherwise it is a number.
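To check sample values against one of the patterns above outside PhixFlow, the pattern letters can be translated into Python `strptime` directives. The sketch below is only an approximation: the `to_strptime` and `parse` helpers are hypothetical, `%b` depends on the current locale for month names, and Python's own rule for expanding two-digit years may differ from PhixFlow's.

```python
import re
from datetime import datetime

# Pattern letters used on this page mapped to Python strptime directives.
TOKENS = {
    "yyyy": "%Y", "yy": "%y",
    "MMM": "%b", "MM": "%m",
    "dd": "%d", "HH": "%H", "mm": "%M", "ss": "%S",
}
# Longer alternatives come first so that "yyyy" is not read as two "yy" tokens.
TOKEN_RE = re.compile("yyyy|yy|MMM|MM|dd|HH|mm|ss")


def to_strptime(pattern):
    """Translate a pattern such as dd/MMM/yyyy into a strptime format string."""
    return TOKEN_RE.sub(lambda match: TOKENS[match.group()], pattern)


def parse(value, pattern):
    """Parse a text value using one of the patterns listed above."""
    return datetime.strptime(value, to_strptime(pattern))


print(to_strptime("dd/MMM/yyyy HH:mm:ss"))
print(parse("10/Jul/1996 00:30:55", "dd/MMM/yyyy HH:mm:ss"))
print(parse("19960710.003055", "yyyyMMdd.HHmmss"))
```

The translation is done in a single pass so that the lower-case `mm` (minutes) and upper-case `MM` (month) tokens cannot interfere with each other.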
Supported TrueFalse Values
The following values are available in upper, lower and mixed cases for use in the TrueFalse type field:
Valid True Values | Valid False Values |
---|---|
true,yes,T,Y,1 | false,no,F,N,0 |
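If you need to pre-validate a column before loading it as a TrueFalse field, the table above translates into a simple case-insensitive lookup. The `parse_true_false` function is a hypothetical helper; how PhixFlow itself reports an unrecognised value is not covered here.

```python
# Accepted values from the table above, stored in lower case.
TRUE_VALUES = {"true", "yes", "t", "y", "1"}
FALSE_VALUES = {"false", "no", "f", "n", "0"}


def parse_true_false(text):
    """Return True or False for an accepted value, in any letter case."""
    key = text.lower()
    if key in TRUE_VALUES:
        return True
    if key in FALSE_VALUES:
        return False
    raise ValueError(f"not a recognised TrueFalse value: {text!r}")


print(parse_true_false("Yes"))
print(parse_true_false("F"))
```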
File Collector Attributes
You can use the following PhixFlow internal variables to specify the attributes you want to load using a file collector. These attributes are available for all file types.
Attribute | Description |
---|---|
_fileName | The name of the file. |
_lineNumber | The line number of the record within the file it was read from. The _lineNumber attribute is not available for file collectors whose File Type is File Details Only. |
_modifiedDate | The datetime when the file was last modified. PhixFlow cannot determine the last modified time of a single file inside a .gz or .tgz container; instead, it returns the datetime when the corresponding gz/tgz container was created. |
_path | The full path to the file which is the result of concatenating the _rootDirectory and the _subDirectory values. |
_rootDirectory | The root base directory (if specified) concatenated with the evaluated value of the collector's Input Directory Expression field. |
_size | The size of the file in bytes. PhixFlow cannot determine the size of a single file inside a .gz or .tgz container; instead, it returns a size of -1. |
_subDirectory | The sub-directory relative to the _rootDirectory in which the corresponding file resides. |
_worksheet | The name of the current worksheet of the Excel file. The _worksheet is only available if the File Type is Excel Spreadsheet. |
_range | The Excel range expression that was used. The _range attribute is only available if the File Type is Excel Spreadsheet. |
Handling Compressed Files
In the majority of cases a compressed file contains just a single file. For example, a simple zip file called DailyCalls.zip would contain a single file named DailyCalls.csv.
However, some compressed files contain directories, sub-directories, files and further compressed files. PhixFlow treats such a compressed file as a directory, and treats any compressed file inside it as a directory within that directory structure. Compressed files therefore obey the same rules as normal directories when matching the Directory Pattern Expression and the Exclude Dir Pattern Expression. The same applies to all directories, sub-directories and compressed files within a compressed file. Files contained anywhere inside the directory structure of the compressed file (including files in a compressed file within the compressed file) are treated as normal files when matching the File Pattern Expression.
Supported Compressed Files
Compression Type | Files ending with extension | Description |
---|---|---|
zip | ".zip" | A zip archive created by either Windows programs such as WinZip or Unix commands such as zip. For example, to create a compressed zip file called dailyCalls.zip that includes a single file called dailyCalls_20120918.csv:
|
tar | ".tar" | A tar archive created by the Unix tar command. For example, to create a tar file called dailyCalls.tar that includes a single file called dailyCalls_20120918.csv:
|
gz | ".gz" | A gz archive created by the Unix gzip command. For example, to create a compressed gz file called dailyCalls_20120918.csv.gz that includes a single file called dailyCalls_20120918.csv:
The Unix gzip command always assumes the named container has a single file of the same name contained within. |
tgz | ".tgz" | A combined tar and gz archive created by combining the tar and gzip commands into a single command. For example, to create a compressed tgz file called dailyCalls.tgz that includes a single file called dailyCalls_20120918.csv:
|
Note |
---|
RAR compression is not currently supported. |
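For testing a file collector, it can be handy to generate one archive of each supported type from the same source file. The sketch below does this with the Python standard library instead of the Windows or Unix tools mentioned in the table. The temporary working directory and the sample CSV content are assumptions made for the example; the archive and file names are the ones used in the table.

```python
import gzip
import shutil
import tarfile
import tempfile
import zipfile
from pathlib import Path

# Work in a throwaway directory with a small sample file.
work = Path(tempfile.mkdtemp())
source = work / "dailyCalls_20120918.csv"
source.write_text("caller,duration\n0123,42\n")

# zip: dailyCalls.zip containing the single CSV file.
with zipfile.ZipFile(work / "dailyCalls.zip", "w", zipfile.ZIP_DEFLATED) as archive:
    archive.write(source, arcname=source.name)

# tar: uncompressed dailyCalls.tar containing the single CSV file.
with tarfile.open(work / "dailyCalls.tar", "w") as archive:
    archive.add(source, arcname=source.name)

# gz: the container is named after the single file it holds.
with open(source, "rb") as plain, gzip.open(work / "dailyCalls_20120918.csv.gz", "wb") as packed:
    shutil.copyfileobj(plain, packed)

# tgz: tar and gzip combined in one step.
with tarfile.open(work / "dailyCalls.tgz", "w:gz") as archive:
    archive.add(source, arcname=source.name)

print(sorted(path.name for path in work.iterdir()))
```

The `arcname` argument stores only the file name in the archive, so the archive does not reproduce the temporary directory path.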
File Compression Examples
The following table shows how each compressed file found will be treated given the following values for 'File Pattern Expression', 'Directory Pattern Expression' and the 'Exclude Dir Pattern Expression'.
Compressed File Name | Compressed File Sub System | File Pattern Expression | Directory Pattern Expression | Exclude Dir Pattern Expression | Matching/Processed Files |
---|---|---|---|---|---|
DailyCalls10.zip | /DailyCalls10.csv | ".*Calls10.*" | ".*" | | DailyCalls10.zip/DailyCalls10.csv |
DailyCalls.tar | /subdir1/calls10.csv; /subdir1/calls20.csv; /subdir2/calls100.csv; /subdir2/calls200.csv; /subdir3/calls1000.csv; /subdir3/calls2000.csv | ".*calls10.*" | ".*" | ".*subdir2.*" | DailyCalls.tar/subdir1/calls10.csv; DailyCalls.tar/subdir3/calls1000.csv |
Outer.zip | /subdir1/calls10.csv; /subdir1/calls20.csv; /subdir1/Inner.zip/innerdir/calls100.csv; /subdir1/Inner.zip/subdir2/calls1000.csv. Note that Outer.zip contains a compressed zip file called Inner.zip. | ".*calls10.*" | ".*subdir1.*" | ".*subdir2.*" | Outer.zip/subdir1/calls10.csv; Outer.zip/subdir1/Inner.zip/innerdir/calls100.csv |
Outer.tar.gz | /Outer.tar/subdir1/calls10.csv; /Outer.tar/innerdir/calls100.csv; /Outer.tar/subdir1/Inner.zip/innerdir/calls1000.csv. Note that Outer.tar.gz contains a tar container which in turn contains a compressed zip file called Inner.zip. | ".*calls10.*" | ".*subdir1.*innerdir.*" | | Outer.tar.gz/Outer.tar/subdir1/Inner.zip/innerdir/calls1000.csv |
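The rows in the table above can be reproduced with a short simulation. The sketch assumes that the directory patterns must match the whole directory path (including any compressed-file names, which act as directories) and that the File Pattern Expression must match the whole file name. This matching rule is an assumption chosen because it gives the results shown in the table; it is not a statement of how PhixFlow is implemented. The `matching_files` helper is hypothetical.

```python
import re
from pathlib import PurePosixPath


def matching_files(paths, file_pattern, dir_pattern, exclude_dir_pattern=None):
    """Return the paths selected by the three pattern expressions."""
    matches = []
    for path in paths:
        parts = PurePosixPath(path)
        directory, name = str(parts.parent), parts.name
        # Directory Pattern Expression: the directory path must match.
        if not re.fullmatch(dir_pattern, directory):
            continue
        # Exclude Dir Pattern Expression: drop directories that match.
        if exclude_dir_pattern and re.fullmatch(exclude_dir_pattern, directory):
            continue
        # File Pattern Expression: the file name must match.
        if re.fullmatch(file_pattern, name):
            matches.append(path)
    return matches


outer_zip = [
    "Outer.zip/subdir1/calls10.csv",
    "Outer.zip/subdir1/calls20.csv",
    "Outer.zip/subdir1/Inner.zip/innerdir/calls100.csv",
    "Outer.zip/subdir1/Inner.zip/subdir2/calls1000.csv",
]
print(matching_files(outer_zip, r".*calls10.*", r".*subdir1.*", r".*subdir2.*"))
```

For the Outer.zip row this selects `calls10.csv` and the nested `calls100.csv`, and skips `calls1000.csv` because its directory path contains `subdir2`.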
Easy Loading for Excel Spreadsheets
With PhixFlow, you can quickly and easily load all the data from a worksheet in an Excel file, without having to set all the different file collector and stream properties.
- Drag a file collector into an analysis model.
- In the file collector properties, enter a name for the collector. Leave Auto Configuration ticked.
- Hover your mouse pointer over the file collector and click Managed File.
- Find and select the Excel spreadsheet you want to load into PhixFlow. Click Open, then click the upload button.
- For an .xlsx file only, PhixFlow prompts you to specify the worksheet that you want to upload to the file collector.
- PhixFlow automatically:
  - uploads the Excel data:
    - from the specified worksheet in an .xlsx file
    - from the first worksheet in an .xls file
  - sets the file collector properties
  - creates a stream
  - sets the stream properties, including adding attributes for each data column in the Excel spreadsheet.
You cannot use the Auto Configuration option to:
- select specific Excel data columns to load
- always load the latest version of an Excel worksheet when you run analysis on the model
- load a data file that is not an Excel file.