Forms: File Collector

A File Collector describes the structure, content, naming patterns and location of files of data to be imported into PhixFlow.

Note that File Collectors can also be used to process files that reside inside compressed file archives such as zip files. Please see the section below on Compressed Files for further information.

Info
Image Added This page describes, in detail, all configuration options for file collectors. But if you are setting up a file collector to load email messages and/ or attached files, a good starting place is the article How To: Read data from an email account, which covers a number of common examples.

Form: File Collector Details

The following fields are configured on the Details tab:

...

Specified Directory: This option will cause the file collector to use the Import File Location (specified in System Configuration on the System Directories tab) as the root input directory when looking for files to load.
Managed File: This option will cause the file collector to use the File Upload Directory (specified in System Configuration on the System Directories tab) as the root input directory when looking for files to load.

...

form provides the standard form icons.

Extra buttons:

Image Added

If a valid location has been configured in the file collector to locate an existing CSV file, the user can click on a button at the top of the grid to automatically create the file column descriptions in this form.

The column names are taken from the first row of the file. To construct the name all invalid characters are stripped out of the value found in each cell and the result is assumed to be the name of the column.

The remaining rows are examined to try to determine the type and length for each column definition based on the values found in the file. If a type cannot be determined then the column is defined as a string. The length of the field is set to be the length of the longest value found.

The following fields are configured on the Details tab:

Field	Type	Description
Name	Text	The name of the file collector.
Enabled	Checkbox	Tick when the configuration is complete and the file collector is ready for use.
Source Type	Dropdown	This field can have any of the following values: Specified Directory: This option will cause the file collector to use the Import File Location (specified in System Configuration on the System Directories tab)

, the tag value specified here and the Input Directory specified below (hard coded to 'in' for Managed files).
For example, if the System Configuration File Upload Directory is set to C:\ManagedFiles and Tag is set to CVFiles then the File Collector will look within C:\ManagedFiles\CVFiles\in for files to process.Allow Non-Scheduled CollectionIf this is turned on, then the collector will run as part of any ad-hoc Analysis Engine run which requires this data. If not, it will only run as part of a scheduled task under the Analysis Engine.File Type

as the root input directory when looking for files to load. Managed File: This option will cause the file collector to use the File Upload Directory (specified in System Configuration on the System Directories tab) as the root input directory when looking for files to load.
Number of Header Lines	Text	The number of lines in the header of the file. These are ignored when reading the file. (This option is not available for Binary File, XML and HTML file types).
File Type	Dropdown	Can have values: Comma Separated Values: fields are delimited by a comma, (or other character). Fixed Length Records: fields have a fixed column width. Binary File: Data is extracted from the file using a Binary File Grammar (in XML) specified in the File Format Description tab. File Details Only: Only attribute details about the file itself will be available. Excel Spreadsheet: Data is extracted from the an excel spreadsheet supporting a .xls or .xlsx extension. .xlsx type excel files containing in excess of 10,000 rows are not supported by PhixFlow and should not be imported using a File Collector. XML File: Data is extracted from an XML file HTML File: Data is extracted from an HTML file

Next Sequence

The next sequence number expected to be found within the name of the file being imported.

This field is only available if File Location Strategy = All Files in Folder


Allow Non-Scheduled Collection	Checkbox	If this is turned on, then the collector will run as part of any ad-hoc Analysis Engine run which requires this data. If not, it will only run as part of a scheduled task under the Analysis Engine.
FTP Site		The FTP Site on which the import file is stored. If no site is specified then the file is assumed to be on the local machine. If a site is specified then all directory paths specified on this form should be the full path to the file since the base directory specified in system configuration is ignored (since the base directory is specific to the local machine).

Ignore Base DirectoryThis field is only available if Source Type = Specified Directory.
Normally the base directory, specified in the "System Directories" tab of the "System Configuration" screen, is prepended to all directories specified on this form. However, if this flag is ticked then this does not happen and the directories specified on this form alone are used as the full path specifications for the import file.File Location StrategyCan have values:

All Files in Folder: read all files matching the pattern specified in File Pattern Expression.

Read


File Location Strategy	Drop-down	Can have values: All Files in Folder: read all files matching the pattern specified in File Pattern Expression. Read File Paths: read in file path names from a collector or stream. This input database collector or stream must be attached to the file collector by a lookup pipe

.

with no index set. The attribute of the input stream or collector which contains the file path names is specified in the field: File Name Attribute. The value entered into this field should be plain text, e.g. myFilePaths but not quoted "myFilePaths".
Each file path name is interpreted as a pathname relative to the Import Directory. A path name may be a simple file name, or it may have multiple levels of directory, including compressed files (which will be interpreted as directories). The directory separator must be '/' (forward-slash), and not '\' (back-slash), even on a Windows platform. There should be no leading '/'.
E.g. 'abc.csv', 'dir1/dir2.zip/abc.csv'

Read Names: read in file locations from a collector or stream. This input collector or stream must be attached to the file collector If

Please note that once an attempt is made to read a file via the Read Paths method, the stream will not attempt to re-import the contents of that file if instructed to do so during another analysis run, even if the original import failed.
Please also note that the stream containing the file paths will not run automatically in this situation, and so must be added to a taskplan if it is required to run it before the file collector stream.
Read Names: read in file locations from a collector or stream. This input collector or stream must be attached to the file collector by a pipe. The attribute of the input stream or collector which contains the file locations is specified in the field: File Name Attribute.
Read Names is deprecated.

Input Directory Expression

Source Type = Specified Directory

Tag

Text Expression

(special case: only string literals allowed, no PhixFlow variables)

This field is only available if the Source Type

is Specified Directory, files will be ready from the directory specified in Input Directory Expression.

Unless the flag Ignore Base Directory is ticked, the path specified in this field will be added to the default input directory root - this is specified in the System Configuration File Upload Location. If the flag Ignore Base Directory is ticked, the full path for the input directory must be specified.

In fact, this field is an expression. This must evaluate to a fixed string. In the simple case, this will be text surrounded by quotes, e.g.

Code Block
"C:/data/address/input/accountValues"

Also, because this is an expression, you must always use / rather than \, even on windows platforms.

You can include PhixFlow variables in this expression, providing that the entire expression will evaluate to a fixed string. E.g.

Code Block
"C:/data/address/input/" + _inputMultiplier

If you need to include wildcards or some other variable element in the resulting path, you must use the Directory Pattern Expression.

If File Location Strategy = All Files in Folder PhixFlow will look in this directory to find files matching the pattern specified in File Pattern Expression.

If File Location Strategy = Read Names this is added to the start of the file location read from the file name attribute.

Source Type = Managed File

If the Source Type is Managed file, this will contain a non editable value of "in"
This will be appended to the combined path of System Configuration File Upload Directory and Tag to give the input directory that files will be read from.

Directory Pattern Expression

This field is an expression used to identify valid sub-directories of the input directory. This expression must itself resolve to a "pattern matching" Regular Expression.

If a Directory Pattern Expression is provided then PhixFlow will not only check the Input Directory for files but will also check all sub-directories of the Input Directory. Each file found will then not only have its name checked against the File Pattern Expression but will also have the relative path from the Input Directory to the file (referred to as the sub-directory path) checked against the Directory Pattern Expression.

For example, suppose the Input Directory has the sub-directories: 'region1/teamA'; 'region1/teamB'; 'region2/teamA'. If you want all the files across all regions for teamA, but not teamB, then you could use the following Directory Pattern Expression to pick out just the files for teamA:

".*/teamA/"

Alternatively, if you wanted all the files for all teams in region 1 only, you could use the following Directory Pattern Expression:

"region1/.*"

Regular expression rules are used to perform this match rather than the sort of pattern matching rules you might be used to when listing files. For example:

To match any string of characters, you must use ".*" and not "*"
To match a "." you must use "\\." and not "." (which means any character)
You must use forward slashes "/" instead of backslashes "\" for directory separators

A number of internal variables are available in these expressions:

_fromDate: the start date of the period of the stream being processed.
_toDate: the end date of the period of the the stream being processed.

Note that there are also a number of predefined compressed file expressions that will always be checked to determine if a file within a valid sub directory is actually a compressed file. If so then this file will assumed to be a valid compressed file and hence will be recursed into as if it was a standard matching directory. Please see Compressed Files for a list of valid compressed file expressions.

Exclude Dir. Pattern Expr.

This field is an expression that can be used to exclude certain sub-directories found by the Directory Pattern Expression. As with the Directory Pattern Expression, this expression must itself resolve to a Regular Expression.

For example, suppose the Input Directory has the sub-directories: 'region1/teamA'; 'region1/teamB'; 'region2/teamA'. If you want all the files across all regions for teamA, but not teamB, then you could use the following Directory Pattern Expression to find all files:

".*"

combined with the following Exclude Dir. Pattern Expr to exclude those for teamB:

".*/teamB/"

Regular expression rules are used to perform this match rather than the sort of pattern matching rules you might be used to when listing files. For example:

To match any string of characters, you must use

field is set to Managed File. When files are uploaded by PhixFlow they are placed into a directory whose full path is a combination of the root File Upload Directory (specified in System Configuration on the System Directories tab), the tag value specified here and the Input Directory specified below (hard coded to 'in' for Managed files).
For example, if the System Configuration File Upload Directory is set to C:\ManagedFiles and Tag is set to CVFiles then the File Collector will look within C:\ManagedFiles\CVFiles\in for files to process.

If you are creating a file collector to load email messages and/ or attached files, you can specify a tag here if one has been provided in the subject line of the incoming emails. See How To: Read data from an email account for further details.

Ignore Base Directory

Checkbox

This field is only available if Source Type = Specified Directory.
Normally the base directory, specified in the "System Directories" tab of the "System Configuration" screen, is prepended to all directories specified on this form. However, if this flag is ticked then this does not happen and the directories specified on this form alone are used as the full path specifications for the import file.

Input Directory Expression

Text Expression

Source Type = Specified Directory

If the Source Type is Specified Directory, files will be ready from the directory specified in Input Directory Expression.

Unless the flag Ignore Base Directory is ticked, the path specified in this field will be added to the default input directory root - this is specified in the System Configuration File Upload Location. If the flag Ignore Base Directory is ticked, the full path for the input directory must be specified.

In fact, this field is an expression. This must evaluate to a plain text string. In the simple case, this will be text surrounded by quotes, e.g.

Code Block
"C:/data/address/input/accountValues"

Also, because this is an expression, you must always use / rather than \, even on windows platforms.

You can include PhixFlow variables in this expression, e.g.:

Code Block
"C:/data/address/input/" + _inputMultiplier

If you need to include wildcards or some other variable element in the resulting path, you must use the Directory Pattern Expression.

If File Location Strategy = All Files in Folder PhixFlow will look in this directory to find files matching the pattern specified in File Pattern Expression.

If File Location Strategy = Read Names this is added to the start of the file location read from the file name attribute.

Source Type = Managed File

If the Source Type is Managed file, this will contain a non editable value of "in"
This will be appended to the combined path of System Configuration File Upload Directory and Tag to give the input directory that files will be read from.

Directory Pattern Expression

Regular Expression

This field is used to identify valid sub-directories of the input directory.

If a Directory Pattern Expression is provided then PhixFlow will not only check the Input Directory for files but will also check all sub-directories of the Input Directory. Each file found will then not only have its name checked against the File Pattern Expression but will also have the relative path from the Input Directory to the file (referred to as the sub-directory path) checked against the Directory Pattern Expression.

For example, suppose the Input Directory has the sub-directories: 'region1/teamA'; 'region1/teamB'; 'region2/teamA'. If you want all the files across all regions for teamA, but not teamB, then you could use the following Directory Pattern Expression to pick out just the files for teamA:

".*/teamA/"

Alternatively, if you wanted all the files for all teams in region 1 only, you could use the following Directory Pattern Expression:

"region1/.*"

Regular expression rules are used to perform this match rather than the sort of pattern matching rules you might be used to when listing files. For example:

To match any string of characters, you must use ".*" and not "*"
To match a "." you must use "\\." and not "." (which means any character)
You must use forward slashes "/" instead of backslashes "\" for directory separators

A number of internal variables are available in these expressions:

_fromDate: the start date of the period of the stream being processed.
_toDate: the end date of the period of the the stream being processed.

Note that there are also a number of predefined compressed file expressions that will always be checked to determine if a file within a valid sub directory is actually a compressed file. If so then this file will assumed to be a valid compressed file and hence will be recursed into as if it was a standard matching directory. Please see Compressed Files for a list of valid compressed file expressions.

Exclude Dir. Pattern Expr.

Regular Expression

This field can be used to exclude certain sub-directories found by the Directory Pattern Expression.

For example, suppose the Input Directory has the sub-directories: 'region1/teamA'; 'region1/teamB'; 'region2/teamA'. If you want all the files across all regions for teamA, but not teamB, then you could use the following Directory Pattern Expression to find all files:

".*"

combined with the following Exclude Dir. Pattern Expr to exclude those for teamB:

".*/teamB/"

Regular expression rules are used to perform this match rather than the sort of pattern matching rules you might be used to when listing files. For example:

To match any string of characters, you must use ".*" and not "*"
To match a "." you must use "\\." and not "." (which means any character)
You must use forward slashes "/" instead of backslashes "\" for directory separators

A number of internal variables are available in these expressions:

_fromDate: the start date of the period of the stream being processed.
_toDate: the end date of the period of the the stream being processed.

File Pattern Expression

This

Text Expression	This field is only available if File Location Strategy = All Files in Folder. An expression used to generate a list of files to be read. This expression must itself resolve to a Regular Expression, used to match files in the input directory. Note that regular expression rules are used to perform this match, not the shell replacement style rules used in many file systems. E.g. to match all files, you must use "." and not "". A number of internal variables are available in these expressions: _fromDate: the start date of the period of the stream being processed. _toDate: the end date of the period of the the stream being processed. %SEQ%: the current sequence number. Examples: "inputRecords.txt" will read files called "inputRecords.txt" from the input directories. "." will read all files in the input directories. ".\\.txt" will read all files in the input directories with the extension ".txt" "teamA.*" will read all files in the input directories starting with "teamA." "record_" + toString(now(),"yyyy-MM-dd") + "\\.txt" will read files in the input directories with the format "record_yyyy-MM-dd.txt", where yyyy-MM-dd is the current date. E.g. "record_2013-03-26.txt". "("+listToString(_context.f,"\|")+")" will read files with name contained in the list of files uploaded by the Stream Action which caused the File Collector to run, but only if a Context Value called 'f' is set in the Action and its value expression is '_files'
Archive Directory Expression	Regular Expression	If set, processed files will be written to this directory. This field is an expression that must resolve to a Regular Expression. Note that because this is an expression field, if you supply a simple directory definition in plain text it must be surrounded by quotes. Also, directory separators must be / and not \, even if the file is being moved to a directory on a Windows platform. E.g. "C:/data/address/archive/".
Error Directory Expression	Regular Expression	If set, files that error during processing will be written to this directory. This field is an expression that must resolve to a Regular Expression. Note that because this is an expression field, if you supply a simple directory definition in plain text it must be surrounded by quotes. Also, directory separators must be / and not \, even if the file is being moved to a directory on a Windows platform. E.g. "C:/data/address/error/".
Local Archive Directory

Checkbox	For files retrieved via FTP, whether the archive directory is on the PhixFlow server (local) or on the original server.
Local Error Directory

The following fields are configured on the Advanced tab:

FieldDescriptionMinimum FilesSpecifies the minimum number of files that are expected to be found

Checkbox	For files retrieved via FTP, whether the error directory is on the PhixFlow server (local) or on the original server.
*Advanced*
Maximum Files	Text	Specifies the maximum number of files that will be processed whenever the collector runs.

If fewer files are found then this is treated as an error.Maximum


Minimum Files	Text	Specifies the

maximum

minimum number of files that

will

are expected to be

processed

found whenever the collector runs. If fewer files are found then this is treated as an error.

Max Records Per File

Text

Specifies the maximum number of records that will be read from each file processed.

Anchor

	fileCollectorErrorsBeforeRollback
	fileCollectorErrorsBeforeRollback

Errors Before Rollback

Text

The maximum number of errors PhixFlow will permit during the processing of a file before. Once this number has been exceeded, PhixFlow will abandon the attempted file load.

See error handling summary below.

Parallel Readers

Text

The number of files to process in parallel. If blank, this defaults to 1.

If the collector is configured to read files in sequence, this field is ignored and a single file reader is used.

Unreadable Directories

Next Sequence

The

action to take on finding an unreadable directory

next sequence number expected to be found within the name of the file being imported.

This field is only available if File Location Strategy = All Files in Folder.

Unreadable Directories

Drop-down

The action to take on finding an unreadable directory when searching a directory hierarchy for files to import.

Error: unreadable directories will be reported, and if any are found, the file search will fail.
Warning: unreadable directories will be reported, but the file search will continue unaffected.
Ignore: unreadable directories will be silently ignored

Character SetThe character encoding to be used.
Select a value from the drop down list. If Other if selected, a

XPath Expression	XPath	This field is only available if File Type = XML File or HTML File This field should be populated according to valid XPath syntax. Please see XPath Examples for how to use XPath expressions and how the returned data can be used and evaluated in the corresponding stream attribute expressions.
Character Set	Text/ Dropdown	The character encoding to be used. Select a value from the drop down list. If Other if selected, a new box opens and a new character set can be entered. Full list of available character sets can be found here (Canonical Names from both columns can be used).
Column Separator	Text/ Dropdown	Select a value from the drop down list. If Other is selected, a new box opens and a new column separator can be entered.
Separator Character	Text	This field is only available if Column Separator = Other. Allows a custom column separator to be entered.
Quote Style	Text/ Dropdown	Select a value from the drop down list. If Other if selected, a new box opens and a new quote character can be entered.
Quote Character	Text	This field is only available if Quote Style = Other. Allows a custom quote style character to be entered.

Ignore Missing

Anchor

	fileCollectorIgnoreExtraColumns
	fileCollectorIgnoreExtraColumns

Ignore Extra Columns

Checkbox

This field is only available if File Type = Comma Separated Values.

If this flag is set then PhixFlow will not throw an error if the record being read contains

fewer

more columns than expected. If this flag is not set then an error will be reported if there are too

few columns.Ignore Extra Columns

many columns.

See error handling summary below.

Anchor

	fileCollectorIgnoreMissingColumns
	fileCollectorIgnoreMissingColumns

Ignore Missing Columns

Checkbox

This field is only available if File Type = Comma Separated Values.

If this flag is set then PhixFlow will not throw an error if the record being read contains

more

fewer columns than expected. If this flag is not set then an error will be reported if there are too

many columns

few columns.

See error handling summary below.

Import Rows Matching

An expression, that must resolve to a

Regular Expression

, can be specified in this field.

	PhixFlow will attempt to match each line in the file against the expression, and only those that match will be imported.
Replace Text Matching

Both fields Replace Text Matching and With are expressions that must resolve to Regular Expressions.

Regular Expression	In each imported line, replace all occurrences of the text matched with Replace Text Matching with With.
With	Regular Expression	See description of Replace Text Matching above.

Excel Data Expression

Anchor

	excelRangeExpression
	excelRangeExpression

Excel Data Range Expression

Expression: List

This field is only available if File Type = Excel Spreadsheet.

This field is an expression that must evaluate to a list of ranges with the format "WorksheetName!TopLeftCell:BottomRightCell"

If this field is left blank, PhixFlow will read the first worksheet it finds in the excel file (even if this is a hidden sheet) with a range covering the whole sheet.

E.g. if just a single range is needed: "DailyCallsSheet!A1:G100"

E.g. if a list of ranges is required: ["DailyCallsSheet!A1:G100", "A1:B20", "Calls!A1:C100"]

Remember that in all cases PhixFlow will only read the columns that have been defined in the File Columns tab.

Because this field is an expression, the resulting list can be generated with any valid PhixFlow expression. You can also use the internal variable _worksheets which gives you the list of worksheets that PhixFlow found in the file. See the example below for how you might use this.

Examples

All rows and columns in the default 1st worksheet:- leave this field empty as this is the default behaviour
Specified columns only on the default 1st worksheet:- "A:C"
Specified cell range only on the default 1st worksheet:- "B1:G10"
Specified cell range on a specified worksheet:- "DailyCallsSheet!A2:F20"
List of specified cell ranges on multiple worksheets:- ["DailyCallsSheet!A2:F20", "Calls!A1:C400", "Accounts!A5:F50"]
Examine the list of worksheets that have been found and specify ranges for only certain sheets, if they exist:

do (    $rangeList = [],

    forEach($sheet, _worksheets,

        if ( listContains(["sheetA", "sheetB"], $sheet),

            addElement($rangeList, $sheet + "!A1:B10")

),

    $rangeList

This expression will evaluate to the list of a maximum of two ranges, if both worksheets sheetA and sheetB exist - in this case, this is equivalent to ["sheetA!A1:B10", "sheetB!A1:B10"]. Crucially, if a sheet is not found, the range will not be included. This is important for error handling. If you specify a range that is not in the excel file PhixFlow will error. So if you are not sure that a worksheet will always be included, write an expression like this to check, and only specify the range when the sheet is found.

Related internal variables

See notes for the internal variables _worksheet and _range below. These can be used in stream attribute expressions to record the source worksheet and range for data you have loaded into PhixFlow.

Constraints

Note that if a worksheet is specified, then the full cell range must also be specified. Hence it is not possible to select a 'worksheet only' or 'columns only for a specified worksheet'. e.g DailyCallsSheet or DailyCallsSheet!A:C are not supported.

Ignore Undefined ValuesThis checkbox is only available if File Type = Excel Spreadsheet.This checkbox should be ticked if all unsupported excel values such as #N/A, #REF! #DIV/0 etc should be ignored and replaced with null values during processing. In this case a single warning message will be displayed to the user once processing has completed stating the number of unsupported cell values found during the processing and a detailed message about the first unsupported cell value.
If this checkbox is unticked then each unsupported excel value will be reported as

Related internal variables

See notes for the internal variables _worksheet and _range below. These can be used in stream attribute expressions to record the source worksheet and range for data you have loaded into PhixFlow.

Constraints

If a worksheet is specified, then the full cell range must also be specified. Hence it is not possible to select a 'worksheet only' or 'columns only for a specified worksheet'. e.g DailyCallsSheet or DailyCallsSheet!A:C are not supported.

If the worksheet name contains spaces or single quotes, it must be enclosed in single quotes and embedded single quotes must be doubled up.

E.g. if the Sheet name is "Gary's Sheet", a single range expression would look like "'Gary''s Sheet'!A1:G100".

Examples

Example - Specify all rows and columns in the default 1st worksheet

Leave this field empty as this is the default behaviour

Example - Specify columns A to C only on the 1st worksheet

Set the range expression to "A:C"

Example - Specify columns A to G, rows 1 to 10 only on the 1st worksheet

Set the range expression to "A1:G10"

Example - Specify columns A to G, rows 1 to 10 only on the "Daily" worksheet

Set the range expression to "Daily!A1:G10"

Example - Specify columns A to G, rows 1 to 10 only on the "John's Data" worksheet

Set the range expression to "'John''s Data'!A1:G10". Note the additional and doubled single quotes.

Example Examine the list of worksheets that have been found and specify ranges for only certain sheets, if they exist

The worksheet names contain spaces and/or single quotes.

Set the range expression to:

Code Block

do (
    $rangeList = [],
    forEach($sheet, _worksheets,
        if ( listContains(["sheet A", "sheet'B"], $sheet),
            do(
                $safeSheet = "'" + replaceAll($sheet, "'", "''") + "'",
                $range = $safeSheet + "!A1:B10",
                addElement($rangeList, $range)
            )
        )
    ),
    $rangeList
)

This expression will evaluate to the list of a maximum of two ranges, if both worksheets sheet A and sheet'B exist. Crucially, if a sheet is not found, the range will not be included. This is important for error handling. If you specify a range that is not in the excel file PhixFlow will error. So if you are not sure that a worksheet will always be included, write an expression like this to check, and only specify the range when the sheet is found.

Note the expression for $safeSheet, which doubles up embedded single quotes and wraps the result in single quotes - if this is omitted, PhixFlow will not recognise the generated ranges as valid.

Ignore Undefined Values

Checkbox

This checkbox is only available if File Type = Excel Spreadsheet.

This checkbox should be ticked if all unsupported excel values such as #N/A, #REF! #DIV/0 etc should be ignored and replaced with null values during processing. In this case a single warning message will be displayed to the user once processing has completed stating the number of unsupported cell values found during the processing and a detailed message about the first unsupported cell value.

If this checkbox is unticked then each unsupported excel value will be reported as an individual warning/error in the console and processing will be terminated if the maximum number of errors/warnings is exceeded.

File Password

XPath Expression

Text

This

field FieldDescriptionFile ColumnsA list of the File Attributes configured on this File Collector. Selecting an attribute by double clicking it brings up the details of that attribute in the File Collector Attribute Details form.XML

checkbox is only available if File Type =

XML File or HTML File

This field should be populated according to valid XPath syntax. Please see XPath Examples for how to use XPath expressions and how the returned data can be used and evaluated in the corresponding stream attribute expressions.

The following fields are configured through the File Format Description tab on the form: Note : this tab is only available for Binary File Collectors.

Field

Description

Stream Item Node

A list of target node names - that is, the names of nodes that will generate an output record. This field is an expression that must resolve to a single string, or a list of strings.

Example of a single target:

"DATA_RECORD"

Example of a list of targets:

["DATA_RECORD1","DATA_RECORD2","DATA_RECORD3"]

Validate File Format

Validate that the file matches the XML description.

File Format Description

A Binary File Grammar, in XML, describing the format of data in the file.

The following fields are configured through separate tabs on the form:

Excel Spreadsheet.

If you are reading a spreadsheet which is password protected, enter the password here so that the file can be unlocked.

Confirm Password

Text

This checkbox is only available if File Type = Excel Spreadsheet.

If you are reading a spreadsheet which is password protected, confirm the password here so that the file can be unlocked.

File Columns

Extra buttons:

Image Added	Shows the list of Streams.
Image Added	Shows the list of file collectors.

Name

Text

The name of the column. The attribute will be referred by this name elsewhere in PhixFlow. This can contain any combination of letters and numbers, and the characters: '_'.

Order

Number

The order in which the attributes are placed. This must match the order of the fields in the files that will be read by this file collector.

Type

Drop-down

This can be one of:

String
Integer
Float
Date - please see valid Date formats
Datetime - please see valid Datetime formats
TrueFalse - please see valid TrueFalse values

Length

Number

For a String, the maximum length of the String. For an Integer, the maximum number of digits. Note that for a comma-separated file it is not necessary to set field sizes; only the field sizes of the fields in the stream that this collector will write to are important. Of course, for a file with fixed length records it is crucial to set the fields lengths correctly.

Xml Namespaces

Xml Namespaces are used for providing uniquely named elements and attributes in an XML document. An XML instance may contain element or attribute names from more than one XML vocabulary. If each vocabulary is given a namespace, the ambiguity between identically named elements or attributes can be resolved. Note : this tab is only available for XML File Collectors.

Description

Name

Text

Description

Name of the

file collector.

File Collector Attributes

A number of attributes are available on all types of File Collector:

Attribute	Description
_fileName	The name of the file.
_lineNumber	The line number of the record within the file it was read from. The _lineNumber attribute is not available for File Collectors of Type File Details Only
_modifiedDate	The datetime of when the file was last modified. The last modified time of a single file residing within a .gz or a .tgz container can not be determined by phixflow, instead the datetime of when the corresponding gz/tgz container was created will be returned.
_path	The full path to the file which is the result of concatenating the _rootDirectory and the _subDirectory values.
_rootDirectory	The root base directory (if specified) concatenated with the value evaluated in the Collectors 'Input Directory Expr' field.
_size	The size of the file in bytes. The size of a single file residing within a .gz or a .tgz container can not be determined by phixflow, instead a size of -1 will be returned.
_subDirectory	The sub directory relative to the _rootDirectory in which the corresponding file resides.
_worksheet	The name of the current worksheet of the excel file. The _worksheet is not available if the file type is not 'excel'.
_range	The excel range expression that was used. The _range attribute is not available if the file type is not 'excel'.

...

Compressed Files

...

Supported Compressed Files

...

Note that there is currently no support for rar type compressions.

File Compression Examples

...

XML namespace. However, this is not mandatory - you can give any namespace any name - all that matters is that the names defined here match those you use in XPath expressions to extract data from the XML response. In particular, default namespaces, e.g. xmlns="urn:xmlns:company-com:message" can be given any name, providing that you use this name in XPath expressions.
Value	Text	Value of the XML namespace. E.g. http://schemas.xmlsoap.org/soap/envelope/
File Format Description This tab is only available for Binary File Collectors.
Validate File Format	Checkbox	Validate that the file matches the XML description.
Stream Item Node	Expression: list of strings	A list of target node names - that is, the names of nodes that will generate an output record. Example of a single target: "DATA_RECORD" Example of a list of targets: ["DATA_RECORD1","DATA_RECORD2","DATA_RECORD3"]
File Format Description		A Binary File Grammar, in XML, describing the format of data in the file.
*Description*
Description	Text	Description of the file collector.

Anchor
fileCollectorErrorHandlingSummary
fileCollectorErrorHandlingSummary
Error handling

All the settings that support error handling are documented individually above; this is a summary of these features, and what to consider when designing the error handling needed for the files you are loading.

Ignore missing columns

If you tick Ignore Missing Columns you will still import rows which are missing columns – the missing columns will give you blank values in the stream for the missing fields.

However, PhixFlow starts lining columns up from the left, based on the idea that if there are missing columns, they will be missing from the right hand side. This means if columns are missing from somewhere in the middle or at the beginning of the line, the line may fail to import if the data types don't line up. However, this is not an error trap, it is arbitrary - if you think lines might be missing values in the middle or at the beginning of lines you should be careful about using this option – in any case, you will need further validation.

Ignore extra columns

If you tick Ignore Extra Columns - again, this assumes that extra columns will appear on the right. If they appear in the middle or at the beginning of the row, the row may or may not import depending on how the data types of the resulting record line up with a standard record. Again, if you think this may occur, you need further validation.

Errors before rollback

If you set a number in Errors Before Rollback – providing that, across the run (not individual files), the error count is less than the threshold you have set - this will load all records that are not in error. You will get a warning message about any that fail in the console – with the file name, and details of the line that failed. But these are essentially discarded from the import. These files will then be placed in the archive directory.

If, across the run, the threshold is reached – all files that have been processed up to that point will be placed in the error directory, including files don’t contain errors (because these are associated with a failed run). The run will be in error - that is, a red line in the console, the resulting data set for the run (the stream set) will be invalid, and therefore not available to the next step in your process. Any files remaining after the error threshold was reached will remain, untouched, in the input directory.

Anchor
validdateformats
validdateformats
Supported Date/Datetime Format Patterns

The following formats are available for use in Date and Datetime type fields:

Valid Date Formats	Valid Datetime Formats
dd/MM/yy dd/MMM/yy dd/MM/yyyy dd/MMM/yyyy dd-MM-yy dd-MMM-yy dd-MM-yyyy dd-MMM-yyyy MM/dd/yy MMM/dd/yy MM/dd/yyyy MMM/dd/yyyy MM-dd-yy MMM-dd-yy MM-dd-yyyy MMM-dd-yyyy yyyyMMdd	dd/MM/yy HH:mm:ss dd/MMM/yy HH:mm:ss dd/MM/yyyy HH:mm:ss dd/MMM/yyyy HH:mm:ss dd-MM-yy HH:mm:ss dd-MMM-yy HH:mm:ss dd-MM-yyyy HH:mm:ss dd-MMM-yyyy HH:mm:ss MM/dd/yy HH:mm:ss MMM/dd/yy HH:mm:ss MM/dd/yyyy HH:mm:ss MMM/dd/yyyy HH:mm:ss MM-dd-yy HH:mm:ss MMM-dd-yy HH:mm:ss MM-dd-yyyy HH:mm:ss MMM-dd-yyyy HH:mm:ss yyyyMMdd.HHmmss

The symbols used in these formats are explained in the following table

Symbol	Meaning	Presentation	Examples
y	year	year	1996
M	month of year	month	Jul; 07
d	day of month	number	10
H	hour of day (0~23)	number	0
m	minute of hour	number	30
s	second of minute	number	55

The number of letters used in the pattern determines the format.

Number: The minimum number of digits. Shorter numbers are zero-padded to this amount.
Year: Numeric presentation for the year field are handled specially. For example, if the count of 'y' is 2, the year should be displayed as the zero-based year of the century, which is two digits.
Month: 3 or over, use text, otherwise use number.

Anchor

	validtruefalsevalues
	validtruefalsevalues

Supported TrueFalse Values

The following values are available in upper, lower and mixed cases for use in the TrueFalse type field:

Valid True Values	Valid False Values
true,yes,T,Y,1	false,no,F,N,0

File Collector Attributes

A number of attributes are available on all types of File Collector:

Attribute	Description
_fileName	The name of the file.
_lineNumber	The line number of the record within the file it was read from. The _lineNumber attribute is not available for File Collectors of Type File Details Only
_modifiedDate	The datetime of when the file was last modified. The last modified time of a single file residing within a .gz or a .tgz container can not be determined by phixflow, instead the datetime of when the corresponding gz/tgz container was created will be returned.
_path	The full path to the file which is the result of concatenating the _rootDirectory and the _subDirectory values.
_rootDirectory	The root base directory (if specified) concatenated with the value evaluated in the Collectors 'Input Directory Expr' field.
_size	The size of the file in bytes. The size of a single file residing within a .gz or a .tgz container can not be determined by phixflow, instead a size of -1 will be returned.
_subDirectory	The sub directory relative to the _rootDirectory in which the corresponding file resides.
_worksheet	The name of the current worksheet of the excel file. The _worksheet is not available if the file type is not 'excel'.
_range	The excel range expression that was used. The _range attribute is not available if the file type is not 'excel'.

Anchor

	containers
	containers

Compressed Files

In the majority of cases a compressed file will just contain a single file.
e.g A simple zip file called DailyCalls.zip would contain a single file named DailyCalls.csv.

However, some compressed files contain directories, sub-directories, files and further compressed files. In such cases the compressed file can be thought of as a directory, and further, any compressed files within the compressed file can be thought of as directories in the directory structure inside the compressed file. Therefore, Compressed Files are treated like normal directories and obey the same rules when matching the 'Directory Pattern Expression' and the 'Exclude Dir Pattern Expression'. Similarly all directories, sub-directoires, and compressed files within a compressed file will also be treated as normal directories when matching the 'Directory Pattern Expression' and the 'Exclude Dir Pattern Expression'. Files contained anywhere inside the directory structure in the compressed file (including files contained in a compressed file within the compressed file) are treated as normal files when matching the 'File Pattern Expression'.

Supported Compressed Files

Compression Type	Files ending with extension	Description
zip	".zip"	A zip archive created by either windows programs such as winzip etc or unix commands such as zip. e.g zip dailyCalls.zip dailyCalls_20120918.csv would result in a compressed zip file called dailyCalls.zip being created which would include a single file called dailyCalls_20120918.csv
tar	".tar"	A tar archive created by the unix tar command. e.g tar -cvf dailyCalls.tar dailyCalls_20120918.csv would result in a tar file called dailyCalls.tar being created which would include a single file called dailyCalls_20120918.csv
gz	".gz"	A gz archive created by the unix gzip command e.g gzip dailyCalls_20120918.csv would result in a compressed gz file called dailyCalls_20120918.csv.gz being created which would include a single file called dailyCalls_20120918.csv. Note that the unix gzip command always assumes the named container has a single file of the same name contained within.
tgz	".tgz"	A joint tarred gz archive created by combining both the tar and gzip commands into a single command. e.g tar -cvzf dailyCalls.tgz dailyCalls_20120918.csv would result in a compressed tgz file called dailyCalls.tgz being created which would include a single file called dailyCalls_20120918.csv

Note that there is currently no support for rar type compressions.

File Compression Examples

The following table shows how each compressed file found will be treated given the following values for 'File Pattern Expression', 'Directory Pattern Expression' and the 'Exclude Dir Pattern Expression'.

Image Removed

If a valid location has been configured in the file collector to locate an existing CSV file, the user can click on a button at the top of the grid to automatically create the file column descriptions in this form.

The column names are taken from the first row of the file. To construct the name all invalid characters are stripped out of the value found in each cell and the result is assumed to be the name of the column.

The remaining rows are examined to try to determine the type and length for each column definition based on the values found in the file. If a type cannot be determined then the column is defined as a string. The length of the field is set to be the length of the longest value found.

Image Removed

Shows the list of File Collectors.

Image Removed

Shows the list of Streams.

Image Removed

Deletes the selected object from the list.

Image Removed

Adds a new File Attribute. See the File Collector Attribute Details form./subdir1/calls10.csv
/subdir1/calls20.csv
/subdir2/calls100.csv
/subdir2/calls200.csv
/subdir3/calls1000.csv
/subdir3/calls2000

Compressed File Name	Compressed File Sub System	File Pattern Expression	Directory Pattern Expression	Exclude Dir Pattern Expression	Matching/Processed Files
DailyCalls10.zip	/DailyCalls10.csv	".Calls10."	".*		DailyCalls10.zip/DailyCalls10.csv
DailyCalls.tar		Directory Pattern Expression	Exclude Dir Pattern Expression	Matching/Processed Files
DailyCalls10.zip	/DailyCalls10.csv	".calls10Calls10."	".*	".subdir2.		DailyCalls.tar/subdir1/calls10DailyCalls10.zip/DailyCalls10.csv
DailyCalls.tar/subdir3/calls1000.csv	Outer.zip	/subdir1/calls10.csv /subdir1/calls20.csv /subdir1/Inner.zip/innerdirsubdir2/calls100.csv /subdir1subdir2/Innercalls200.zipcsv /subdir2subdir3/calls1000.csv Note that Outer.zip contains a compressed zip file called Inner.zip/subdir3/calls2000.csv	".calls10."	".subdir1.	".subdir2.	OuterDailyCalls.ziptar/subdir1/calls10.csv OuterDailyCalls.ziptar/subdir1/Inner.zip/innerdir/calls100subdir3/calls1000.csv
Outer.tar.gzzip	/subdir1/Outercalls10.tarcsv /subdir1/calls10calls20.csv /Outersubdir1/Inner.tarzip/innerdir/calls100.csv /Outer.tar/subdir1/Inner.zip/innerdirsubdir2/calls1000.csv Note that Outer.tar.gz contains a tar container which in turn zip contains a compressed zip file called Inner.zip	".calls10."	".subdir1.	".innerdirsubdir2.	Outer.tar.gz/zip/subdir1/calls10.csv Outer.tarzip/subdir1/Inner.zip/innerdir/calls1000calls100.csv

Form Icons

The form provides the standard form icons as well as the following:

Image Removed

Shows the model predecessors of the file collector.

The form also provides the following icons on the File Columns tab:

Outer.tar.gz

/Outer.tar/subdir1/calls10.csv
/Outer.tar/innerdir/calls100.csv
/Outer.tar/subdir1/Inner.zip/innerdir/calls1000.csv

Note that Outer.tar.gz contains a tar container which in turn contains a compressed zip file called Inner.zip

".*calls10.*"

".*subdir1.*innerdir.*

Outer.tar.gz/Outer.tar/subdir1/Inner.zip/innerdir/calls1000.csv

Versions Compared

Old Version 10

New Version Current

Key

Forms: File Collector

Form: File Collector Details

Source Type = Specified Directory

Source Type = Managed File

Source Type = Specified Directory

Source Type = Managed File

Related internal variables

Constraints

Examples

Example - Specify all rows and columns in the default 1st worksheet

Example - Specify columns A to C only on the 1st worksheet

Example - Specify columns A to G, rows 1 to 10 only on the 1st worksheet

Example - Specify columns A to G, rows 1 to 10 only on the "Daily" worksheet

Example - Specify columns A to G, rows 1 to 10 only on the "John's Data" worksheet

Example Examine the list of worksheets that have been found and specify ranges for only certain sheets, if they exist

File Collector Attributes

Compressed Files

Supported Compressed Files

File Compression Examples

Anchor
fileCollectorErrorHandlingSummary
fileCollectorErrorHandlingSummary
Error handling

Ignore missing columns

Ignore extra columns

Errors before rollback

Anchor
validdateformats
validdateformats
Supported Date/Datetime Format Patterns

Supported TrueFalse Values

File Collector Attributes

Compressed Files

Supported Compressed Files

File Compression Examples

Form Icons

See Also

Page Comparison

Versions Compared

Old Version 10

New Version Current

Key

Forms: File Collector

Form: File Collector Details

Source Type = Specified Directory

Source Type = Managed File

Source Type = Specified Directory

Source Type = Managed File

Related internal variables

Constraints

Examples

Example - Specify all rows and columns in the default 1st worksheet

Example - Specify columns A to C only on the 1st worksheet

Example - Specify columns A to G, rows 1 to 10 only on the 1st worksheet

Example - Specify columns A to G, rows 1 to 10 only on the "Daily" worksheet

Example - Specify columns A to G, rows 1 to 10 only on the "John's Data" worksheet

Example Examine the list of worksheets that have been found and specify ranges for only certain sheets, if they exist

File Collector Attributes

Compressed Files

Supported Compressed Files

File Compression Examples

AnchorfileCollectorErrorHandlingSummaryfileCollectorErrorHandlingSummaryError handling

Ignore missing columns

Ignore extra columns

Errors before rollback

AnchorvaliddateformatsvaliddateformatsSupported Date/Datetime Format Patterns

Supported TrueFalse Values

File Collector Attributes

Compressed Files

Supported Compressed Files

File Compression Examples

Form Icons

See Also

Anchor
fileCollectorErrorHandlingSummary
fileCollectorErrorHandlingSummary
Error handling

Anchor
validdateformats
validdateformats
Supported Date/Datetime Format Patterns