Forms: File Collector
A File Collector describes the structure, content, naming patterns and location of files of data to be imported into PhixFlow.
Note that File Collectors can also be used to process files that reside inside compressed file archives such as zip files. Please see the section below on Compressed Files for further information.
Info |
---|
This page describes, in detail, all configuration options for file collectors. If you are setting up a file collector to load email messages and/or attached files, a good starting place is the article How To: Read data from an email account, which covers a number of common examples. |
Form: File Collector Details
The following fields are configured on the Details tab:
Field | Type | Description | ||||
---|---|---|---|---|---|---|
Name | Text | The name of the file collector. | ||||
Enabled | Checkbox | Tick when the configuration is complete and the file collector is ready for use. | ||||
Source Type | Dropdown | This field can have any of the following values:
| ||||
Number of Header Lines | Text | The number of lines in the header of the file. These are ignored when reading the file. (This option is not available for Binary File, XML and HTML file types). | ||||
Tag | Text Expression (special case: only string literals allowed, no PhixFlow variables) | This field is only available if the Source Type field is set to Managed File. When files are uploaded by PhixFlow they are placed into a directory whose full path is a combination of the root File Upload Directory (specified in System Configuration on the System Directories tab), the tag value specified here and the Input Directory specified below (hard coded to 'in' for Managed Files). If you are creating a file collector to load email messages and/or attached files, you can specify a tag here if one has been provided in the subject line of the incoming emails. See How To: Read data from an email account for further details. | ||||
Allow Non-Scheduled Collection | Checkbox | If this is turned on, then the collector will run as part of any ad-hoc Analysis Engine run which requires this data. If not, it will only run as part of a scheduled task under the Analysis Engine. | ||||
File Type | Dropdown | Can have values:
| ||||
Next Sequence | The next sequence number expected to be found within the name of the file being imported. This field is only available if File Location Strategy = All Files in Folder. | |||||
FTP Site | The FTP Site on which the import file is stored. If no site is specified, the file is assumed to be on the local machine. If a site is specified, all directory paths on this form must be full paths to the file, because the base directory specified in System Configuration is specific to the local machine and is therefore ignored. | |||||
Ignore Base Directory | This field is only available if Source Type = Specified Directory. Normally the base directory, specified in the "System Directories" tab of the "System Configuration" screen, is prepended to all directories specified on this form. However, if this flag is ticked then this does not happen and the directories specified on this form alone are used as the full path specifications for the import file. | |||||
File Location Strategy | Text | Can have values:
| ||||
Input Directory Expression | Text Expression | Source Type = Specified Directory: If the Source Type is Specified Directory, files will be read from the directory specified in Input Directory Expression. Unless the flag Ignore Base Directory is ticked, the path specified in this field is added to the default input directory root, which is specified in the System Configuration File Upload Location. If the flag Ignore Base Directory is ticked, the full path for the input directory must be specified. This field is an expression that must evaluate to a plain text string. In the simple case, this will be text surrounded by quotes, e.g.
Also, because this is an expression, you must always use / rather than \, even on Windows platforms. You can include PhixFlow variables in this expression, e.g.:
If you need to include wildcards or some other variable element in the resulting path, you must use the Directory Pattern Expression. If File Location Strategy = All Files in Folder, PhixFlow will look in this directory to find files matching the pattern specified in File Pattern Expression. If File Location Strategy = Read Names, this is added to the start of the file location read from the file name attribute. Source Type = Managed File: If the Source Type is Managed File, this will contain the non-editable value "in". | ||||
Directory Pattern Expression | Regular Expression | This field is used to identify valid sub-directories of the input directory. If a Directory Pattern Expression is provided then PhixFlow will not only check the Input Directory for files but will also check all sub-directories of the Input Directory. Each file found will then not only have its name checked against the File Pattern Expression but will also have the relative path from the Input Directory to the file (referred to as the sub-directory path) checked against the Directory Pattern Expression. For example, suppose the Input Directory has the sub-directories: 'region1/teamA'; 'region1/teamB'; 'region2/teamA'. If you want all the files across all regions for teamA, but not teamB, then you could use the following Directory Pattern Expression to pick out just the files for teamA: ".*/teamA/" Alternatively, if you wanted all the files for all teams in region 1 only, you could use the following Directory Pattern Expression: "region1/.*" Regular expression rules are used to perform this match rather than the sort of pattern matching rules you might be used to when listing files. For example:
A number of internal variables are available in these expressions:
Note that there are also a number of predefined compressed file expressions that will always be checked to determine whether a file within a valid sub-directory is actually a compressed file. If so, the file is assumed to be a valid compressed file and is recursed into as if it were a standard matching directory. Please see Compressed Files for a list of valid compressed file expressions. | ||||
Exclude Dir. Pattern Expr. | Regular Expression | This field can be used to exclude certain sub-directories found by the Directory Pattern Expression. For example, suppose the Input Directory has the sub-directories: 'region1/teamA'; 'region1/teamB'; 'region2/teamA'. If you want all the files across all regions for teamA, but not teamB, then you could use the following Directory Pattern Expression to find all files: ".*" combined with the following Exclude Dir. Pattern Expr to exclude those for teamB: ".*/teamB/" Regular expression rules are used to perform this match rather than the sort of pattern matching rules you might be used to when listing files. For example:
A number of internal variables are available in these expressions: | ||||
File Pattern Expression | Text Expression | This field is only available if File Location Strategy = All Files in Folder. An expression used to generate a list of files to be read. This expression must itself resolve to a Regular Expression, used to match files in the input directory. Note that regular expression rules are used to perform this match, not the shell replacement style rules used in many file systems. E.g. to match all files, you must use ".*" and not "*". A number of internal variables are available in these expressions:
Examples: "inputRecords.txt" will read files called "inputRecords.txt" from the input directories; ".*" will read all files in the input directories; ".*\\.txt" will read all files in the input directories with the extension ".txt"; "teamA.*" will read all files in the input directories whose names start with "teamA"; "record_" + toString(now(),"yyyy-MM-dd") + "\\.txt" will read files in the input directories with the format "record_yyyy-MM-dd.txt", where yyyy-MM-dd is the current date, e.g. "record_2013-03-26.txt"; "("+listToString(_context.f,"|")+")" will read files whose names are in the list of files uploaded by the Stream Action that caused the File Collector to run, but only if a Context Value called 'f' is set in the Action and its value expression is '_files'. | ||||
Archive Directory Expression | Regular Expression | If set, processed files will be written to this directory. This field is an expression that must resolve to a Regular Expression. Note that because this is an expression field, if you supply a simple directory definition in plain text it must be surrounded by quotes. Also, directory separators must be / and not \, even if the file is being moved to a directory on a Windows platform. E.g. "C:/data/address/archive/". | ||||
Error Directory Expression | Regular Expression | If set, files that error during processing will be written to this directory. This field is an expression that must resolve to a Regular Expression. Note that because this is an expression field, if you supply a simple directory definition in plain text it must be surrounded by quotes. Also, directory separators must be / and not \, even if the file is being moved to a directory on a Windows platform. E.g. "C:/data/address/error/". | ||||
Local Archive Directory | Text Expression | |||||
Local Error Directory | Text Expression |
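The sub-directory examples given for Directory Pattern Expression and Exclude Dir. Pattern Expr. can be tried outside PhixFlow with any regular expression engine. The Python sketch below is illustrative only: the function name select_sub_directories is hypothetical, and it assumes that the whole sub-directory path (written with a trailing /) must match the pattern. That assumption is consistent with the examples in the table, but it is not confirmed here as PhixFlow's exact matching behaviour.

```python
import re

def select_sub_directories(sub_dirs, dir_pattern, exclude_pattern=None):
    """Keep sub-directory paths that match dir_pattern and do not match exclude_pattern.

    Assumption: the entire path must match (re.fullmatch), not just a part of it.
    """
    selected = []
    for sub_dir in sub_dirs:
        if not re.fullmatch(dir_pattern, sub_dir):
            continue  # not picked up by the Directory Pattern Expression
        if exclude_pattern is not None and re.fullmatch(exclude_pattern, sub_dir):
            continue  # removed by the Exclude Dir. Pattern Expr.
        selected.append(sub_dir)
    return selected

# Sub-directory paths relative to the Input Directory, as in the examples.
SUB_DIRS = ["region1/teamA/", "region1/teamB/", "region2/teamA/"]

team_a = select_sub_directories(SUB_DIRS, r".*/teamA/")
region_1 = select_sub_directories(SUB_DIRS, r"region1/.*")
all_but_team_b = select_sub_directories(SUB_DIRS, r".*", r".*/teamB/")

print(team_a)
print(region_1)
print(all_but_team_b)
```

Under these assumptions, the first and third calls select the same two teamA directories, which shows that a narrow include pattern and a broad include pattern combined with an exclude pattern can be used interchangeably.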
...
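The difference between regular expression matching and shell-style wildcards in the File Pattern Expression can be checked with a short script. The Python sketch below is not PhixFlow code: the file names and the helper names are hypothetical, and it assumes the whole file name must match the pattern. Note that the PhixFlow examples write the escaped dot as "\\." inside a quoted expression, whereas a Python raw string needs only a single backslash.

```python
import re
from datetime import date

def matching_files(names, pattern):
    """Return the names that match the regular expression in full (assumed semantics)."""
    return [name for name in names if re.fullmatch(pattern, name)]

def is_valid_regex(pattern):
    """Report whether the pattern compiles as a regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

def dated_pattern(day):
    """Build the equivalent of "record_" + toString(now(),"yyyy-MM-dd") + "\\.txt"."""
    return "record_" + day.strftime("%Y-%m-%d") + r"\.txt"

# Hypothetical directory listing.
NAMES = ["inputRecords.txt", "teamA_calls.csv", "record_2013-03-26.txt", "notes.txt.bak"]

all_files = matching_files(NAMES, r".*")
txt_files = matching_files(NAMES, r".*\.txt")
team_a_files = matching_files(NAMES, r"teamA.*")
dated_files = matching_files(NAMES, dated_pattern(date(2013, 3, 26)))

# The shell wildcard "*" on its own is not a valid regular expression.
star_is_valid = is_valid_regex("*")

print(txt_files, team_a_files, dated_files, star_is_valid)
```

The date is passed in explicitly here so that the result is repeatable; in a File Pattern Expression the current date comes from now().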
Field | Description | |
---|---|---|
File Columns | A list of the File Attributes configured on this File Collector. Selecting an attribute by double clicking it brings up the details of that attribute in the File Collector Attribute Details form. | |
XML Namespaces | XML Namespaces are used for providing uniquely named elements and attributes in an XML document. An XML instance may contain element or attribute names from more than one XML vocabulary. If each vocabulary is given a namespace, the ambiguity between identically named elements or attributes can be resolved. Note: this tab is only available for XML File Collectors. | |
Description | Text | Description of the file collector. |
Forms: File Columns
The data columns present in the import file are defined here.
Form: File Column Details
Field | Description |
---|---|
Name | The name of the column. The attribute will be referred to by this name elsewhere in PhixFlow. The name can contain any combination of letters, numbers and the underscore character '_'. |
Type | This can be one of:
|
Length | For a String, the maximum length of the String. For an Integer, the maximum number of digits. Note that for a comma-separated file it is not necessary to set field sizes; only the field sizes of the fields in the stream that this collector will write to are important. For a file with fixed length records, however, it is crucial to set the field lengths correctly. |
Order | The order in which the attributes are placed. This must match the order of the fields in the files that will be read by this file collector. |
Description | Description of the file column. |
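To see why Length and Order matter for fixed length records, consider how such a record is cut into fields. The Python sketch below is only an illustration, not PhixFlow's reader: the column names and lengths are invented, and trimming the padding from each value is an assumption.

```python
def split_fixed_width(line, columns):
    """Cut one fixed length record into named fields.

    columns is a list of (name, length) pairs in file order.
    Assumption: padding spaces around each value are trimmed.
    """
    record = {}
    start = 0
    for name, length in columns:
        record[name] = line[start:start + length].strip()
        start += length  # the next field begins where this one ends
    return record

# Hypothetical column definitions, listed in their Order.
COLUMNS = [("account", 6), ("name", 10), ("minutes", 4)]

# Build a 20-character record that follows those lengths exactly.
LINE = "A00001" + "Alice".ljust(10) + "12".rjust(4)

record = split_fixed_width(LINE, COLUMNS)

# Declaring the second column one character too short shifts every later field.
WRONG_COLUMNS = [("account", 6), ("name", 9), ("minutes", 4)]
shifted = split_fixed_width(LINE, WRONG_COLUMNS)

print(record)
print(shifted)
```

With the wrong length, the last field is read from the wrong offset and loses its final digit, which is the kind of silent error that correct field lengths prevent.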
Supported Date/Datetime Format Patterns
The following formats are available for use in Date and Datetime type fields:
...
Supported TrueFalse Values
The following values are available in upper, lower and mixed cases for use in the TrueFalse type field:
Valid True Values | Valid False Values |
---|---|
true,yes,T,Y,1 | false,no,F,N,0 |
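The case-insensitive matching described above can be sketched in a few lines. This Python example is an illustration of the table, not PhixFlow's own parser: the function name parse_true_false is hypothetical, and raising an error for any other value is an assumption.

```python
# Values from the table, stored in lower case for case-insensitive comparison.
TRUE_VALUES = {"true", "yes", "t", "y", "1"}
FALSE_VALUES = {"false", "no", "f", "n", "0"}

def parse_true_false(text):
    """Convert a TrueFalse field value to a bool, ignoring upper/lower case.

    Assumption: any value outside the two lists is rejected with ValueError.
    """
    key = text.lower()
    if key in TRUE_VALUES:
        return True
    if key in FALSE_VALUES:
        return False
    raise ValueError("not a recognised TrueFalse value: " + repr(text))

print(parse_true_false("Yes"), parse_true_false("F"), parse_true_false("1"))
```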
Form Icons
The form provides the standard form icons.
File Collector Attributes
A number of attributes are available on all types of File Collector:
...
Compressed Files
In the majority of cases a compressed file will just contain a single file.
For example, a simple zip file called DailyCalls.zip would contain a single file named DailyCalls.csv.
However, some compressed files contain directories, sub-directories, files and further compressed files. In such cases the compressed file can be thought of as a directory, and any compressed files within it can be thought of as directories in the directory structure inside the compressed file. Therefore, compressed files are treated like normal directories and obey the same rules when matching the 'Directory Pattern Expression' and the 'Exclude Dir Pattern Expression'. Similarly, all directories, sub-directories and compressed files within a compressed file are also treated as normal directories when matching the 'Directory Pattern Expression' and the 'Exclude Dir Pattern Expression'. Files contained anywhere inside the directory structure in the compressed file (including files contained in a compressed file within the compressed file) are treated as normal files when matching the 'File Pattern Expression'.
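The idea of treating a compressed file as a directory can be illustrated with Python's standard zipfile module. The sketch below is not how PhixFlow is implemented: it handles zip archives only, the helper names are hypothetical, and the archive is built in memory purely so that the example is self-contained.

```python
import io
import zipfile

def list_zip_paths(data, prefix=""):
    """List every regular file inside zip bytes, descending into nested
    .zip members as if each nested archive were a directory."""
    paths = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # explicit directory entries hold no data
            full_path = prefix + info.filename
            if info.filename.lower().endswith(".zip"):
                # Recurse: the nested archive's name becomes a path component.
                paths.extend(list_zip_paths(archive.read(info), full_path + "/"))
            else:
                paths.append(full_path)
    return paths

def make_zip(members):
    """Build a zip archive in memory from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()

inner = make_zip({
    "innerdir/calls100.csv": b"a,b\n",
    "subdir2/calls1000.csv": b"a,b\n",
})
outer = make_zip({
    "subdir1/calls10.csv": b"a,b\n",
    "subdir1/calls20.csv": b"a,b\n",
    "subdir1/Inner.zip": inner,
})

paths = list_zip_paths(outer, "Outer.zip/")
for path in paths:
    print(path)
```

Each resulting path reads like an ordinary directory path, which is why the same directory and file patterns can be applied to files inside archives.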
Supported Compressed Files
Compression Type | Files ending with extension | Description |
---|---|---|
zip | ".zip" | A zip archive created by either Windows programs such as WinZip or Unix commands such as zip. E.g. zip dailyCalls.zip dailyCalls_20120918.csv would create a compressed zip file called dailyCalls.zip containing a single file called dailyCalls_20120918.csv |
tar | ".tar" | A tar archive created by the Unix tar command. E.g. tar -cvf dailyCalls.tar dailyCalls_20120918.csv would create a tar file called dailyCalls.tar containing a single file called dailyCalls_20120918.csv |
gz | ".gz" | A gz archive created by the Unix gzip command. E.g. gzip dailyCalls_20120918.csv would create a compressed gz file called dailyCalls_20120918.csv.gz containing a single file called dailyCalls_20120918.csv. Note that the Unix gzip command always assumes the named container has a single file of the same name contained within. |
tgz | ".tgz" | A joint tarred gz archive created by combining the tar and gzip commands into a single command. E.g. tar -cvzf dailyCalls.tgz dailyCalls_20120918.csv would create a compressed tgz file called dailyCalls.tgz containing a single file called dailyCalls_20120918.csv |
Note that there is currently no support for rar type compressions.
File Compression Examples
The following table shows how each compressed file found will be treated given the following values for 'File Pattern Expression', 'Directory Pattern Expression' and the 'Exclude Dir Pattern Expression'.
Compressed File Name | Compressed File Sub System | File Pattern Expression | Directory Pattern Expression | Exclude Dir Pattern Expression | Matching/Processed Files |
---|---|---|---|---|---|
DailyCalls10.zip | /DailyCalls10.csv | ".*Calls10.*" | ".*" | | DailyCalls10.zip/DailyCalls10.csv |
DailyCalls.tar | /subdir1/calls10.csv /subdir1/calls20.csv /subdir2/calls100.csv /subdir2/calls200.csv /subdir3/calls1000.csv /subdir3/calls2000.csv | ".*calls10.*" | ".*" | ".*subdir2.*" | DailyCalls.tar/subdir1/calls10.csv DailyCalls.tar/subdir3/calls1000.csv |
Outer.zip | /subdir1/calls10.csv /subdir1/calls20.csv /subdir1/Inner.zip/innerdir/calls100.csv /subdir1/Inner.zip/subdir2/calls1000.csv Note that Outer.zip contains a compressed zip file called Inner.zip | ".*calls10.*" | ".*subdir1.*" | ".*subdir2.*" | Outer.zip/subdir1/calls10.csv Outer.zip/subdir1/Inner.zip/innerdir/calls100.csv |
Outer.tar.gz | /Outer.tar/subdir1/calls10.csv /Outer.tar/innerdir/calls100.csv /Outer.tar/subdir1/Inner.zip/innerdir/calls1000.csv Note that Outer.tar.gz contains a tar container which in turn contains a compressed zip file called Inner.zip | ".*calls10.*" | ".*subdir1.*innerdir.*" | | Outer.tar.gz/Outer.tar/subdir1/Inner.zip/innerdir/calls1000.csv |
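The rows of the table can be replayed as plain string matching on the paths inside each archive. The Python sketch below is a hedged illustration rather than PhixFlow's algorithm: the function name processed_files is hypothetical, the directory part of each path is taken to be everything up to and including the last /, and both the file name and the directory part are assumed to need a full match.

```python
import re

def processed_files(paths, file_pattern, dir_pattern, exclude_pattern=None):
    """Return the archive-internal paths selected by the three patterns."""
    selected = []
    for path in paths:
        directory, _, name = path.rpartition("/")
        directory += "/"  # e.g. "/subdir1/Inner.zip/innerdir/"
        if not re.fullmatch(file_pattern, name):
            continue  # File Pattern Expression does not match
        if not re.fullmatch(dir_pattern, directory):
            continue  # Directory Pattern Expression does not match
        if exclude_pattern is not None and re.fullmatch(exclude_pattern, directory):
            continue  # Exclude Dir Pattern Expression matches
        selected.append(path)
    return selected

# Contents of DailyCalls.tar from the second row of the table.
TAR_PATHS = [
    "/subdir1/calls10.csv", "/subdir1/calls20.csv",
    "/subdir2/calls100.csv", "/subdir2/calls200.csv",
    "/subdir3/calls1000.csv", "/subdir3/calls2000.csv",
]
tar_result = processed_files(TAR_PATHS, r".*calls10.*", r".*", r".*subdir2.*")

# Contents of Outer.tar.gz from the last row of the table.
TGZ_PATHS = [
    "/Outer.tar/subdir1/calls10.csv",
    "/Outer.tar/innerdir/calls100.csv",
    "/Outer.tar/subdir1/Inner.zip/innerdir/calls1000.csv",
]
tgz_result = processed_files(TGZ_PATHS, r".*calls10.*", r".*subdir1.*innerdir.*")

print(tar_result)
print(tgz_result)
```

Under these assumptions the results agree with the Matching/Processed Files column for those two rows, once the archive name is prefixed to each path.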
Form Icons
The form provides the standard form icons as well as the following:
...
If a valid location has been configured in the file collector to locate an existing CSV file, the user can click on a button at the top of the grid to automatically create the file column descriptions in this form. The column names are taken from the first row of the file. To construct the name all invalid characters are stripped out of the value found in each cell and the result is assumed to be the name of the column. The remaining rows are examined to try to determine the type and length for each column definition based on the values found in the file. If a type cannot be determined then the column is defined as a string. The length of the field is set to be the length of the longest value found. | |
Shows the list of File Collectors. | |
Shows the list of Streams. | |
Deletes the selected object from the list. | |
Adds a new File Attribute. See the File Collector Attribute Details form. |
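The automatic creation of file column descriptions from a CSV file, described above, follows a simple procedure that can be sketched in Python. This is an approximation for illustration, not PhixFlow's implementation: the type names Integer, Float and String, the rule that valid name characters are letters, digits and '_', and the function names are all assumptions.

```python
import csv
import io
import re

def infer_type(values):
    """Guess a column type from its values, falling back to String."""
    non_empty = [value for value in values if value != ""]
    if not non_empty:
        return "String"
    if all(re.fullmatch(r"[+-]?\d+", value) for value in non_empty):
        return "Integer"
    if all(re.fullmatch(r"[+-]?\d+(\.\d+)?", value) for value in non_empty):
        return "Float"
    return "String"

def describe_columns(csv_text):
    """Derive name, type, length and order for each column of a CSV file."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = []
    for index, raw_name in enumerate(header):
        values = [row[index] for row in data if index < len(row)]
        columns.append({
            # Strip characters assumed to be invalid in a column name.
            "name": re.sub(r"[^A-Za-z0-9_]", "", raw_name),
            "type": infer_type(values),
            # Length is the length of the longest value found.
            "length": max((len(value) for value in values), default=0),
            "order": index + 1,
        })
    return columns

# Hypothetical sample file: the first row holds the column names.
SAMPLE = "Call ID,Caller-Name,Duration (s)\n1001,Alice,12.5\n1002,Bartholomew,7\n"

columns = describe_columns(SAMPLE)
for column in columns:
    print(column)
```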