Use a file collector to specify the structure, content, naming patterns and location of files of data. When you run the file collector, PhixFlow imports the data. You can also use a file collector to process files inside compressed file archives such as zip files; see Compressed Files, below.
You can also use a file collector to import email messages and their attached files; see How To: Read data from an email account for details.
To open a file collector's settings tab, double-click on:
- the file collector name in the repository browser
- the file collector icon in a model.
The toolbar has the standard icons and an extra button:
If the file collector has a valid location configured that points to an existing CSV file, you can click a button at the top of the grid to create the file column descriptions in this form automatically. The column names are taken from the first row of the file: all invalid characters are stripped from the value in each cell, and the result is used as the column name. The remaining rows are examined to determine the type and length of each column from the values found in the file. If a type cannot be determined, the column is defined as a string. The length of the field is set to the length of the longest value found.
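The column detection described above can be approximated outside PhixFlow to preview what a CSV file is likely to produce. The following Python is a minimal sketch, not PhixFlow's implementation. It assumes that "invalid characters" means anything other than letters, digits and the underscore, and the type labels `Integer`, `Float` and `String` are hypothetical names chosen for illustration.

```python
import csv
import io
import re


def guess_type(values):
    """Guess a column type; fall back to String when nothing else fits."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "String"
    if all(re.fullmatch(r"[+-]?\d+", v) for v in non_empty):
        return "Integer"
    if all(re.fullmatch(r"[+-]?\d+(\.\d+)?", v) for v in non_empty):
        return "Float"
    return "String"


def infer_columns(csv_text):
    """Return a list of (name, type, length) tuples inferred from CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = []
    for index, raw_name in enumerate(header):
        # Assumption: only letters, digits and underscore are valid in a name.
        name = re.sub(r"[^A-Za-z0-9_]", "", raw_name)
        values = [row[index] for row in data if index < len(row)]
        # The length is the length of the longest value found in the column.
        length = max((len(v) for v in values), default=0)
        columns.append((name, guess_type(values), length))
    return columns


sample = "Cust ID,Call-Cost,Region\n101,1.50,North\n22,12.25,South West\n"
columns = infer_columns(sample)
print(columns)
```

In this sample the header cell `Cust ID` becomes the column name `CustID`, and `Region` is treated as a string whose length is that of its longest value.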
The settings tab has the following sections:
Field | Description |
---|---|
Basic Settings | |
Name | Enter a name for this file collector. |
Enabled | Tick to indicate you have completed the settings and the file collector is ready to use. |
Source Type | Select the source type, for example Managed File or Specified Directory. |
Number of Header Lines | The number of lines in the header of the file. These are ignored when reading the file. This option is not available for the Binary File, XML File and HTML File types. |
File Type | One of the following values. Comma Separated Values: fields are delimited by a comma (or another character). Fixed Length Records: fields have a fixed column width. Binary File: data is extracted from the file using a Binary File Grammar (in XML) specified in the File Format Description tab. File Details Only: only attribute details about the file itself are available. Excel Spreadsheet: data is extracted from an Excel spreadsheet with a .xls or .xlsx extension. XML File: data is extracted from an XML file. HTML File: data is extracted from an HTML file. |
Allow Non-Scheduled Collection | If this is turned on, the collector runs as part of any ad-hoc Analysis Engine run that requires this data. If not, it only runs as part of a scheduled task under the Analysis Engine. |
FTP Site | The FTP site on which the import file is stored. If no site is specified, the file is assumed to be on the local machine. If a site is specified, all directory paths on this form must be full paths to the file, because the base directory specified in system configuration is specific to the local machine and is therefore ignored. |
File Location Strategy | One of the following values. All Files in Folder: read all files matching the pattern specified in File Pattern Expression. Read File Paths: read file path names from a collector or stream. The input database collector or stream must be attached to the file collector by a lookup pipe with no index set. Specify the attribute of the input stream or collector that contains the file path names in the File Name Attribute field. Enter the value as plain text, e.g. myFilePaths and not the quoted "myFilePaths". Each file path name is interpreted as a path relative to the Import Directory. A path name may be a simple file name, or it may have multiple levels of directory, including compressed files (which are interpreted as directories). The directory separator must be '/' (forward slash) and not '\' (backslash), even on a Windows platform, and there must be no leading '/'. E.g. 'abc.csv', 'dir1/dir2.zip/abc.csv'. Read Names: this option is deprecated. Read file locations from a collector or stream. The input collector or stream must be attached to the file collector by a pipe. Specify the attribute of the input stream or collector that contains the file locations in the File Name Attribute field. |
Tag | This field is only available if the Source Type field is set to Managed File. Specify a directory using string literals only; do not use PhixFlow variables. When files are uploaded by PhixFlow, they are placed in a directory whose full path combines the root File Upload Directory (specified in System Configuration on the System Directories tab), the tag value specified here and the Input Directory specified below (hard-coded to 'in' for managed files). If you are creating a file collector to load email messages and/or attached files, you can specify a tag here if one has been provided in the subject line of the incoming emails. See How To: Read data from an email account for further details. |
Ignore Base Directory | This field is only available if Source Type = Specified Directory. Normally the base directory, specified in the "System Directories" tab of the "System Configuration" screen, is prepended to all directories specified on this form. If this flag is ticked, the base directory is not prepended and the directories specified on this form are used as the full paths for the import file. |
Input Directory Expression | Source Type = Specified Directory: Specify a directory using string literals only. Do not use PhixFlow variables. Files are read from the directory specified in Input Directory Expression. Unless Ignore Base Directory is ticked, the path specified in this field is added to the default input directory root, which is specified in the System Configuration File Upload Location. If Ignore Base Directory is ticked, you must specify the full path for the input directory. This field is an expression that must evaluate to a plain text string. In the simple case, this is text surrounded by quotes. Because this is an expression, you must always use / rather than \, even on Windows platforms. You can include PhixFlow variables in this expression. If you need to include wildcards or some other variable element in the resulting path, you must use the Directory Pattern Expression. If File Location Strategy = All Files in Folder, PhixFlow looks in this directory for files matching the pattern specified in File Pattern Expression. If File Location Strategy = Read Names, this is added to the start of the file location read from the file name attribute. Source Type = Managed File: this field contains a non-editable value of "in". |
Directory Pattern Expression | This field is used to identify valid sub-directories of the input directory. If a Directory Pattern Expression is provided, PhixFlow checks not only the Input Directory for files but also all sub-directories of the Input Directory. Each file found has its name checked against the File Pattern Expression, and the relative path from the Input Directory to the file (referred to as the sub-directory path) checked against the Directory Pattern Expression. For example, suppose the Input Directory has the sub-directories 'region1/teamA', 'region1/teamB' and 'region2/teamA'. If you want all the files across all regions for teamA, but not teamB, you could use the Directory Pattern Expression ".*/teamA/". Alternatively, if you want all the files for all teams in region 1 only, you could use the Directory Pattern Expression "region1/.*". Regular expression rules are used to perform this match, rather than the pattern matching rules you might be used to when listing files. A number of internal variables are available in these expressions. Note that a number of predefined compressed file expressions are always checked to determine whether a file within a valid sub-directory is actually a compressed file. If so, the file is assumed to be a valid compressed file and is recursed into as if it were a standard matching directory. See Compressed Files for a list of valid compressed file expressions. |
Exclude Dir. Pattern Expr. | This field can be used to exclude certain sub-directories found by the Directory Pattern Expression. For example, suppose the Input Directory has the sub-directories 'region1/teamA', 'region1/teamB' and 'region2/teamA'. If you want all the files across all regions for teamA, but not teamB, you could use the Directory Pattern Expression ".*" to find all files, combined with the Exclude Dir. Pattern Expr. ".*/teamB/" to exclude those for teamB. Regular expression rules are used to perform this match, rather than the pattern matching rules you might be used to when listing files. A number of internal variables are available in these expressions. |
File Pattern Expression | This field is only available if File Location Strategy = All Files in Folder. An expression used to generate a list of files to be read. This expression must itself resolve to a regular expression, which is used to match files in the input directory. Note that regular expression rules are used to perform this match, not the shell replacement style rules used in many file systems. E.g. to match all files, you must use ".*" and not "*". A number of internal variables are available in these expressions. Examples: "inputRecords.txt" reads files called "inputRecords.txt" from the input directories. ".*" reads all files in the input directories. ".*\\.txt" reads all files in the input directories with the extension ".txt". "teamA.*" reads all files in the input directories starting with "teamA". "record_" + toString(now(),"yyyy-MM-dd") + "\\.txt" reads files in the input directories with the format "record_yyyy-MM-dd.txt", where yyyy-MM-dd is the current date, e.g. "record_2013-03-26.txt". "("+listToString(_context.f,"|")+")" reads files whose names are in the list of files uploaded by the Stream Action that caused the File Collector to run, but only if a Context Value called 'f' is set in the Action and its value expression is '_files'. |
Archive Directory Expression | If set, processed files will be written to this directory. This field is an expression that must resolve to a Regular Expression. Note that because this is an expression field, if you supply a simple directory definition in plain text it must be surrounded by quotes. Also, directory separators must be / and not \, even if the file is being moved to a directory on a Windows platform. E.g. "C:/data/address/archive/". |
Error Directory Expression | If set, files that error during processing will be written to this directory. This field is an expression that must resolve to a Regular Expression. Note that because this is an expression field, if you supply a simple directory definition in plain text it must be surrounded by quotes. Also, directory separators must be / and not \, even if the file is being moved to a directory on a Windows platform. E.g. "C:/data/address/error/". |
Local Archive Directory | For files retrieved via FTP, specifies whether the archive directory is on the PhixFlow server (local) or on the original server. |
Local Error Directory | For files retrieved via FTP, specifies whether the error directory is on the PhixFlow server (local) or on the original server. |
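The regular expression behaviour described for File Pattern Expression and Directory Pattern Expression can be tried out with any regular expression engine before configuring a collector. The Python below is a sketch under stated assumptions: it treats a pattern as matching only when it matches the whole name, and it assumes sub-directory paths end with "/", as the ".*/teamA/" example suggests. The helper `matches` and all file names are hypothetical. Note that the PhixFlow string ".*\\.txt" corresponds to the regular expression `.*\.txt`, written here as a Python raw string.

```python
import re
from datetime import date


def matches(pattern, candidates):
    """Return the candidates that the whole pattern matches (assumed full match)."""
    regex = re.compile(pattern)
    return [c for c in candidates if regex.fullmatch(c)]


files = ["inputRecords.txt", "teamA_jan.csv", "notes.txt.bak"]
txt_files = matches(r".*\.txt", files)
team_files = matches(r"teamA.*", files)

# A shell-style "*" is not a valid regular expression on its own.
try:
    re.compile("*")
    star_is_valid = True
except re.error:
    star_is_valid = False

# Date-stamped pattern, mirroring "record_" + toString(now(),"yyyy-MM-dd") + "\\.txt".
# A fixed date is used here so the result is repeatable.
dated_pattern = "record_" + date(2013, 3, 26).strftime("%Y-%m-%d") + r"\.txt"
dated_files = matches(dated_pattern, ["record_2013-03-26.txt", "record_2013-03-27.txt"])

# Sub-directory paths, assumed to end with "/".
sub_dirs = ["region1/teamA/", "region1/teamB/", "region2/teamA/"]
team_a_dirs = matches(r".*/teamA/", sub_dirs)
region1_dirs = matches(r"region1/.*", sub_dirs)

print(txt_files, team_files, star_is_valid, dated_files, team_a_dirs, region1_dirs)
```

The failed compile of "*" is the practical reason the page insists on ".*" for matching all files.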
Advanced | |
---|---|
Maximum Files | Specifies the maximum number of files that will be processed whenever the collector runs. |
Minimum Files | Specifies the minimum number of files that are expected to be found whenever the collector runs. If fewer files are found, this is treated as an error. |
Max Records Per File | Specifies the maximum number of records that will be read from each file processed. |
| The maximum number of errors PhixFlow permits while processing a file. Once this number is exceeded, PhixFlow abandons the attempted file load. See error handling summary below. |
Parallel Readers | The number of files to process in parallel. If blank, this defaults to 1. If the collector is configured to read files in sequence, this field is ignored and a single file reader is used. |
Next Sequence | The next sequence number expected to be found within the name of the file being imported. This field is only available if File Location Strategy = All Files in Folder. |
Unreadable Directories | The action to take on finding an unreadable directory when searching a directory hierarchy for files to import. |
XPath Expression | This field is only available if File Type = XML File or HTML File. Populate this field using valid XPath syntax. See XPath Examples for how to use XPath expressions and how the returned data can be used and evaluated in the corresponding stream attribute expressions. |
Character Set | Text/Dropdown. The character encoding to be used. Select a value from the drop-down list. If Other is selected, a new box opens in which you can enter a character set. The full list of available character sets can be found here (canonical names from both columns can be used). |
Column Separator | Text/Dropdown. Select a value from the drop-down list. If Other is selected, a new box opens in which you can enter a column separator. |
Separator Character | This field is only available if Column Separator = Other. Allows a custom column separator to be entered. |
Quote Style | Text/Dropdown. Select a value from the drop-down list. If Other is selected, a new box opens in which you can enter a quote character. |
Quote Character | This field is only available if Quote Style = Other. Allows a custom quote character to be entered. |
| This field is only available if File Type = Comma Separated Values. If this flag is set, PhixFlow does not throw an error if the record being read contains more columns than expected. If this flag is not set, an error is reported if there are too many columns. See error handling summary below. |
| This field is only available if File Type = Comma Separated Values. If this flag is set, PhixFlow does not throw an error if the record being read contains fewer columns than expected. If this flag is not set, an error is reported if there are too few columns. See error handling summary below. |
Import Rows Matching | PhixFlow attempts to match each line in the file against the expression, and imports only the lines that match. |
Replace Text Matching | In each imported line, all occurrences of the text matched by Replace Text Matching are replaced with the value of With. |
With | See the description of Replace Text Matching above. |
Expression | This field is only available if File Type = Excel Spreadsheet. Enter an expression that evaluates to a list of ranges with the format "WorksheetName!TopLeftCell:BottomRightCell". If this field is left blank, PhixFlow reads the first worksheet it finds in the Excel file (even if this is a hidden sheet), with a range covering the whole sheet. E.g. if just a single range is needed: "DailyCallsSheet!A1:G100". E.g. if a list of ranges is required: ["DailyCallsSheet!A1:G100", "A1:B20", "Calls!A1:C100"]. Remember that in all cases PhixFlow only reads the columns that have been defined in the File Columns tab. Because this field is an expression, the resulting list can be generated with any valid PhixFlow expression. You can also use the internal variable _worksheets, which gives you the list of worksheets that PhixFlow found in the file. See the example below for how you might use this.
Related internal variables: see the notes for the internal variables _worksheet and _range below. These can be used in stream attribute expressions to record the source worksheet and range for data you have loaded into PhixFlow.
Constraints: if a worksheet is specified, the full cell range must also be specified. It is not possible to select a 'worksheet only' or 'columns only for a specified worksheet', e.g. DailyCallsSheet and DailyCallsSheet!A:C are not supported. If the worksheet name contains spaces or single quotes, it must be enclosed in single quotes, and embedded single quotes must be doubled up. E.g. if the sheet name is "Gary's Sheet", a single range expression looks like "'Gary''s Sheet'!A1:G100".
Examples:
Example - Specify all rows and columns in the default 1st worksheet. Leave this field empty, as this is the default behaviour.
Example - Specify columns A to C only on the 1st worksheet. Set the range expression to "A:C".
Example - Specify columns A to G, rows 1 to 10 only on the 1st worksheet. Set the range expression to "A1:G10".
Example - Specify columns A to G, rows 1 to 10 only on the "Daily" worksheet. Set the range expression to "Daily!A1:G10".
Example - Specify columns A to G, rows 1 to 10 only on the "John's Data" worksheet. Set the range expression to "'John''s Data'!A1:G10". Note the additional and doubled single quotes.
Example - Examine the list of worksheets that have been found and specify ranges for only certain sheets, if they exist. The worksheet names contain spaces and/or single quotes. Set the range expression to:
This expression evaluates to a list of at most two ranges, one for each of the worksheets sheet A and sheet'B that exists. Crucially, if a sheet is not found, its range is not included. This is important for error handling: if you specify a range that is not in the Excel file, PhixFlow reports an error. So if you are not sure that a worksheet will always be included, write an expression like this to check, and only specify the range when the sheet is found. Note the expression for $safeSheet, which doubles up embedded single quotes and wraps the result in single quotes. If this is omitted, PhixFlow will not recognise the generated ranges as valid. |
Ignore Undefined Values | This checkbox is only available if File Type = Excel Spreadsheet. Tick this checkbox if all unsupported Excel values, such as #N/A, #REF!, #DIV/0 etc., should be ignored and replaced with null values during processing. In this case, a single warning message is displayed once processing has completed, stating the number of unsupported cell values found and giving a detailed message about the first unsupported cell value. If this checkbox is unticked, each unsupported Excel value is reported as an individual warning/error in the console, and processing is terminated if the maximum number of errors/warnings is exceeded. |
File Password | This field is only available if File Type = Excel Spreadsheet. If you are reading a spreadsheet which is password protected, enter the password here so that the file can be unlocked. |
Confirm Password | This field is only available if File Type = Excel Spreadsheet. If you are reading a spreadsheet which is password protected, confirm the password here so that the file can be unlocked. |
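The quoting rule for worksheet names in the Excel range expression, and the advice to include a range only when its worksheet exists, can be illustrated outside PhixFlow. The Python below is a sketch of the logic only, not PhixFlow expression syntax. The function names `safe_sheet` and `build_ranges`, the cell ranges and the extra worksheet name are hypothetical, and the list passed in stands in for the `_worksheets` internal variable.

```python
def safe_sheet(name):
    """Double embedded single quotes and wrap the sheet name in single quotes."""
    return "'" + name.replace("'", "''") + "'"


def build_ranges(found_worksheets, wanted):
    """Return ranges only for the wanted sheets that exist in the file."""
    return [
        safe_sheet(sheet) + "!" + cells
        for sheet, cells in wanted
        if sheet in found_worksheets
    ]


# Hypothetical sheets and cell ranges to load, if present.
wanted = [("sheet A", "A1:G100"), ("sheet'B", "A1:C50")]

both = build_ranges(["sheet A", "sheet'B", "Summary"], wanted)
only_a = build_ranges(["sheet A"], wanted)

print(both)
print(only_a)
```

When only one of the two sheets is present, the result contains a single range, so no range is requested for a worksheet that is missing from the file.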
File Columns | |
---|---|
Available when File Type is Comma Separated Values, Fixed Length Records or Excel Spreadsheet. Enter details about the attributes of the data in the input file. The grid has the standard toolbar and extra buttons, including one that shows the list of Streams.
To add attributes, click | |||||||||||||||||||||||||||||||||
Name | The name of the column, which can contain any combination of letters, numbers and the underscore character (_). PhixFlow always uses this name to refer to this attribute. |
Order | A number. This must match the order of the columns in the file. |
Type |
One of the data types:
|
|
|
|
If you use |
, PhixFlow uses the type it recognises from the data, or | |
Length | The maximum length of the field in the input file
|
|
This is:
|
If you use | ||||||||
Xml Namespaces | |
---|---|
Available when File Type is XML. Enter details about the XML namespace of the XML input file. | |
Name | For the XML namespace, enter a name that matches the name you use in XPath expressions to extract data from the XML response.
For example, |
if a default namespace is |
but you use the alias | |
Value | Enter the XML namespace, for example http://schemas.xmlsoap.org/soap/envelope/ . |
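The relationship between a namespace Name and the alias used in XPath expressions can be demonstrated with Python's standard `xml.etree.ElementTree` module. This is a sketch of the general XML namespace mechanism, not of PhixFlow's XPath engine. The sample document and the aliases `env` and `soap` are made up for illustration; only the namespace URI comes from the Value example above.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

# Hypothetical input file content. It uses the prefix "env" for the namespace.
document = (
    '<env:Envelope xmlns:env="' + SOAP_NS + '">'
    "<env:Body><reading>42</reading></env:Body>"
    "</env:Envelope>"
)
root = ET.fromstring(document)

# Name -> Value pairs, as they might be entered in the Xml Namespaces grid.
# The alias "soap" is our own choice and differs from the prefix in the file;
# what matters is that it matches the alias used in the path expression below.
namespaces = {"soap": SOAP_NS}

body = root.find("soap:Body", namespaces)
reading = body.find("reading").text

print(reading)
```

The lookup succeeds because the alias resolves to the same namespace URI as the prefix in the file, which is why the Name entered must match the alias used in the XPath expression rather than the prefix in the document.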
File Format Description | |
---|---|
Available when File Type is Binary File. Enter details about the binary input file format and the data you want to extract. | |
Validate File Format | Tick to indicate you have completed the settings and the XML description matches the file format. |
Stream Item Node | Enter an expression for a string or list of strings. Each string is a node name in the binary file that will generate an output record. |
File Format Description | Enter an expression, using XML Binary File Grammar, describing the format of data in the file. |
...
Supported
...
True or False Values
The following values are available in upper, lower and mixed cases for use in the TrueFalse type field:
...
File Collector Attributes
The following attributes are available for all types of File Collector:
Attribute | Description |
---|---|
_fileName | The name of the file. |
_lineNumber | The line number of the record within the file it was read from. The _lineNumber attribute is not available for File Collectors of type File Details Only. |
_modifiedDate | The datetime when the file was last modified. PhixFlow cannot determine the last modified time of a single file residing within a .gz or .tgz container; instead, it returns the datetime when the corresponding gz/tgz container was created. |
_path | The full path to the file, which is the result of concatenating the _rootDirectory and the _subDirectory values. |
_rootDirectory | The root base directory (if specified) concatenated with the value evaluated in the collector's 'Input Directory Expr' field. |
_size | The size of the file in bytes. PhixFlow cannot determine the size of a single file residing within a .gz or .tgz container; instead, it returns a size of -1. |
_subDirectory | The sub-directory, relative to the _rootDirectory, in which the corresponding file resides. |
_worksheet | The name of the current worksheet of the Excel file. The _worksheet attribute is not available if the file type is not Excel Spreadsheet. |
_range | The Excel range expression that was used. The _range attribute is not available if the file type is not Excel Spreadsheet. |
...