File Collector

Overview

Use a file collector to specify the structure, content, naming patterns and location of files of data. When you run the file collector, PhixFlow imports the data. You can also use a file collector to process files inside compressed file archives such as zip files; see Handling Compressed Files, below.

You can also use a file collector to import email messages and their attached files; see Reading Data From an Email Account for details.

To add a new file collector object to an analysis model:

  1. Go to the model's toolbar → Create group.
  2. Click  File to expand the menu.
  3. Drag a  File Collector onto the analysis model.

To add an existing file collector object to an analysis model:

  1. Go to the model toolbar → List group.
  2. Click  File to expand the menu.
  3. Click  File Collector to open the list of available file collectors.
  4. Drag a file collector into the analysis model.

To open a file collector's property tab, double-click on:

  • the file collector name in the repository
  • the file collector icon in a model.

Uploading Files

Before running a model, you must upload any files that do not already exist on the PhixFlow server. These are called managed files. For example, you may have CSV or Microsoft Excel files saved on your computer or the network.

  1. In the property tab, set the Input Directory Expression.
  2. In the model, right-click on the file collector icon to open the context toolbar.
  3. Click  Upload File.
  4. PhixFlow opens a file browser. Find and select the file.
  5. Click Upload File.

PhixFlow uploads the managed file to its database in the following location:

directory/tag/date/id/

where

  • directory is configured in System Configuration → File Upload Directory
  • tag is a sub-directory, configured in the file collector → Tag property
    This allows you to group related files.
  • date is the current date when you run the model
  • id is a unique import identifier automatically generated by PhixFlow.

If you want to reference a managed file, use the _fileUploadId internal variable, which references the file import identifier; see Internal Variables.

Archive and Error Folders

When you specify archive and error folder locations, after running the model, PhixFlow moves a managed file:

  • either to the archive directory, after successful processing:
      directory/tag/archive/date/id/
    where directory, tag, date and id are defined above, and
    archive is configured in the file collector → Archive Folder expression.

  • or to the error directory, if PhixFlow cannot process the file:
      directory/tag/error/date/id/
    where directory, tag, date and id are defined above, and
    error is configured in the file collector → Error Folder expression.

If you do not specify the archive or error folder, PhixFlow leaves the file in the original folder.
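
The path rules above can be sketched in a few lines of Python. This is only an illustration of how the documented segments combine, not PhixFlow code: the function name, its parameters and the date format are assumptions made for the example.

```python
from datetime import date
from pathlib import PurePosixPath
from typing import Optional


def managed_file_destination(directory: str, tag: str, run_date: date,
                             import_id: int, succeeded: bool,
                             archive: Optional[str] = None,
                             error: Optional[str] = None) -> PurePosixPath:
    """Return where a managed file would sit after a run (illustrative only)."""
    # Successful files go to the archive folder, failed files to the error folder.
    folder = archive if succeeded else error
    base = PurePosixPath(directory) / tag
    if folder:
        base = base / folder
    # If no folder is configured, the file stays under directory/tag/date/id.
    # The date format below is an assumption; the documentation does not state it.
    return base / run_date.strftime("%Y%m%d") / str(import_id)
```

For example, with a hypothetical upload directory /uploads and tag invoices, a successfully processed file with import identifier 42 would resolve to /uploads/invoices/archive/20240131/42 when the archive folder is set to archive.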

When PhixFlow moves a managed file to the archive or error folder, it is not available when you next run analysis. If you want to reuse a file in subsequent analysis runs, you must:

  • either upload the file for each run
  • or do not set an archive folder location in the Archive Folder expression.
    The managed file remains in the directory/tag/in/date/id/ location and PhixFlow can reuse it.


Properties Tab

Property Pane Toolbar

For information about the toolbar options, see the Common Properties page, Toolbars and Controls section.

Parent Details

If this item is within or belongs to another, its parent name is shown here. See the Parent Details section on the Common Properties page for more details.

Basic Settings

Field | Description
Name
Enter a name for this file collector.
Auto Configuration

This tick box is available for a new file collector that has not been configured. It is ticked by default.

 Tick to load an Excel spreadsheet and automatically configure the file collector and table properties using the data in the spreadsheet; see Easy Loading for Excel Spreadsheets, below.

 Untick to switch off auto-configuration. More file collector properties become available so that you can specify details about the file(s) you want to load. You also need to add the table and pipe to the model.

Enabled
Tick to indicate you have completed the properties and the file collector is ready to use.
Source Type
Select whether or not the file is already on the PhixFlow server.

Specified Directory

Use for files that are already stored on the PhixFlow server. By default, PhixFlow assumes files are in the System Configuration → System Directories → Import File Location, or a subdirectory of it. You must also specify the relative path to the file(s) in the Input Directory Expression.

If your file is not in the Import File Location, select Specified Directory and tick Ignore Base Directory. In this case, you must also specify the full path to the file(s) in the Input Directory Expression.

See also Input Directory Expression and Ignore Base Directory, below.

Managed File

Use for files that you have on your local machine or the network.
PhixFlow loads the file to its System Configuration → System Directories → File Upload Directory. You can optionally specify a sub-directory of the File Upload Directory using the Tag property.

For a managed file, you must load the file before running the file collector.

 How to load a managed file
  1. In the analysis model, hover your mouse pointer over the new file collector to display the pop-up toolbar.
  2. Click  Upload File.
  3. In the file explorer, find your file and click Open.
  4. In the Upload Managed File window, click the  Upload button.

If you have selected Auto Configuration, PhixFlow automatically starts the process for loading a managed file; see Easy Loading for Excel Spreadsheets, below.

Number of Header Lines
Enter the number of lines in the header of the file. These are ignored when reading the file. This option is not available for the Binary File or XML File types.
File Type
Select one of:
  • Comma Separated Values: fields are delimited by a comma (or another character).
  • Fixed Length Records: fields have a fixed column width.
  • Binary File: data is extracted from the file using a Binary File Grammar (in XML) specified in the File Format Description tab.
  • File Details Only: only attribute details about the file itself will be available.
  • Excel Spreadsheet: data is extracted from an Excel spreadsheet with a .xls or .xlsx extension.

    PhixFlow does not support .xlsx files containing more than 10,000 rows; do not import them using a file collector.

  • XML File: data is extracted from an XML file.

For information about the PhixFlow internal variables you can use to specify the attributes you want, see File Collector Attributes, below.

Next Sequence

Available when File Location Strategy is All Files in Folder.

Enter the next sequence number expected to be found within the name of the file being imported.

Allow Non-Scheduled Collection

 Tick to run the file collector as part of any analysis run that requires this data. 

 Untick to only run the file collector as part of a scheduled task.

FTP Site
Specify the FTP site on which the import file is stored. If no site is specified, PhixFlow assumes the file is on the local machine. If a site is specified, all directory paths on this form must be full paths to the file, because the base directory specified in System Configuration is specific to the local machine and is therefore ignored.

File Location Strategy

Select one of:

  • All Files in Folder: read all files matching the pattern specified in File Pattern Expression.
  • Read File Paths: read in file path names from a collector or table.

This input database collector or table must be attached to the file collector by a lookup pipe with no index set.
The attribute of the input table or file collector that contains the file path names is specified in File Name Attribute. The value entered into this field should be plain text and not in quotes, for example myFilePaths.

Each file path name is interpreted as a path name relative to the Import Directory. A path name may be a simple file name, or it may have multiple levels of directory. Compressed files are interpreted as directories. On Linux and Windows platforms, the directory separator must be forward-slash /, not backslash \. Do not include a leading forward-slash /. For example: 'abc.csv' or 'dir1/dir2.zip/abc.csv'.

  • Read Names: This option is deprecated.

Read in file locations from a collector or table. This input collector or table must be attached to the file collector by a pipe. The attribute of the input table or collector which contains the file locations is specified in File Name Attribute.

After a table has attempted to read a file using Read File Paths, the table will not attempt to re-read the contents of that file during another analysis run, even if the original import failed.

This means the table containing the file paths will not run during an analysis run. You must add it to a task plan in order to run it before the file collector table.
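
Because Read File Paths expects path names in a specific form, it can help to check the values in the source table before the file collector runs. The following Python sketch applies the two rules stated above (forward-slash separators, no leading slash). The function name is hypothetical and the check is not part of PhixFlow.

```python
from typing import List


def check_file_path_name(path_name: str) -> List[str]:
    """Return a list of problems found in a relative file path name."""
    problems = []
    if "\\" in path_name:
        # Backslash separators are not accepted, even on Windows.
        problems.append("use forward-slash / as the directory separator")
    if path_name.startswith("/"):
        # Path names are relative, so a leading slash is not allowed.
        problems.append("remove the leading forward-slash")
    return problems
```

An empty list means the path name follows both rules, as 'dir1/dir2.zip/abc.csv' does.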

Tag 

Available when Source Type field is Managed File.

Specify a sub-directory of System Configuration → System Directories →  File Upload Directory using string literals only. Do not use PhixFlow variables.

PhixFlow loads files into a directory whose full path is a combination of:

  • File Upload Directory, for example  C:\LoadedManagedFiles
  • Tag, for example CVFiles
  • Input Directory, which is automatically set to in for managed files.

For the example paths above, the file collector looks in the directory C:\LoadedManagedFiles\CVFiles\in

If you are creating a file collector to load email messages and/or attached files, you can specify a tag here if one has been provided in the subject line of the incoming emails. See Reading Data From an Email Account for further details.

Ignore Base Directory 

Available when Source Type is Specified Directory.

The base directory is set in System Configuration → System Directories → Import File Location.

 Untick to automatically prepend the Import File Location directory path to the directories specified in the file collector properties.

 Tick so that PhixFlow reads directories specified in the file collector properties as the full path to the file.

For improved security, your administrator can set a System Configuration → System Directories → Restricted Directory. If it is set, PhixFlow will only load files from the Restricted Directory or a sub-directory of it. Even if you tick Ignore Base Directory, all directories specified in the file collector properties must be within the Restricted Directory.

Input Directory Expression

Specify the location of the files you want to load.

When Source Type is Managed File

This field contains a read-only value of in.
PhixFlow reads files from the combined path: File Upload Directory + Tag + in

When Source Type is Specified Directory

Specify a path:
  - either to a directory relative to the Import File Location. PhixFlow reads files from the directory whose full path is a combination of Import File Location and the path specified here.
  - or, if Ignore Base Directory is ticked, the full path to the file(s).
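
As a rough illustration of how these settings interact for a Specified Directory, the Python sketch below resolves the directory with and without Ignore Base Directory, then checks it against a Restricted Directory. All names are hypothetical, paths are assumed to use forward slashes, and the check does not normalise ".." segments; PhixFlow's own resolution logic may differ.

```python
from pathlib import PurePosixPath


def resolve_input_directory(input_dir: str, import_file_location: str,
                            ignore_base_directory: bool) -> PurePosixPath:
    """Combine the base directory and the input directory (illustrative only)."""
    if ignore_base_directory:
        # The input directory is treated as the full path.
        return PurePosixPath(input_dir)
    # Otherwise it is relative to the Import File Location.
    return PurePosixPath(import_file_location) / input_dir


def is_within(directory: PurePosixPath, restricted_directory: str) -> bool:
    """True if directory is the restricted directory or lies below it."""
    restricted = PurePosixPath(restricted_directory)
    return directory == restricted or restricted in directory.parents
```

With an Import File Location of C:/data/import, the relative input directory accounts resolves to C:/data/import/accounts, which is within a Restricted Directory of C:/data. A full path such as D:/other/files, used with Ignore Base Directory ticked, is not.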

 Tips on specifying a directory in an expression

This expression must evaluate to a plain text string.

Use string literals only.

Always use forward-slash / rather than backslash \, even on Windows.

Enclose the string in quotes, for example:

"C:/data/address/input/accountValues"


Optionally include PhixFlow variables, for example:

"C:/data/address/input/" + _inputMultiplier

Do not use wildcards. If you need to include wildcards or some other variable element in the resulting path, use the Directory Pattern Expression.

If File Location Strategy is:

  •  All Files in Folder, PhixFlow looks in this directory to find files matching the pattern specified in File Pattern Expression.
  • Read Names, this is added to the start of the file location read from the file name attribute.

See also System Configuration for File Upload Directory and Import File Location.

Directory Pattern Expression

Enter a regular expression to specify one or more relative paths from the Input Directory to its sub-directories. PhixFlow looks in the input directory and its sub-directories for files whose names match the File Pattern Expression.

PhixFlow can recognise some compressed file types; see Handling Compressed Files, below. PhixFlow treats these compressed files as a matching sub-directory.

Internal Variables

You can use the following PhixFlow internal variables in this expression:

  • _fromDate: the start date of the period of the table being processed.
  • _toDate: the end date of the period of the table being processed.
Examples

The Input Directory specifies three directories:
    'region1/teamA'; 'region1/teamB'; 'region2/teamA'

To load files from all regions for teamA only, enter:  ".*/teamA/"
To load files for all teams in region 1 only, enter:  "region1/.*"
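
To try these two patterns outside PhixFlow, you can run them against the example directories with Python's re module as a stand-in regular expression engine. This sketch assumes that the pattern must match the whole relative path and that each path ends with a forward slash; neither detail is confirmed by this page.

```python
import re
from typing import List

# Example sub-directories, written with a trailing slash (an assumption).
subdirectories = ["region1/teamA/", "region1/teamB/", "region2/teamA/"]


def matching_directories(pattern: str, candidates: List[str]) -> List[str]:
    """Return the candidates that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [d for d in candidates if compiled.fullmatch(d)]


team_a_only = matching_directories(".*/teamA/", subdirectories)
region_1_only = matching_directories("region1/.*", subdirectories)
```

Here team_a_only holds the two teamA directories and region_1_only holds both region1 directories.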


 Tips on Regular Expressions

As this field takes a regular expression, remember:

  • Directory separators are forward slashes / (not backslashes \)
  • Any character is a full stop (period) .
  • Any string of characters is .* (not just *)
  • A full stop (period) must be escaped like this \\.

More information can be found on the web, such as the Mozilla Regular Expression page.

File Pattern Expression

Available when File Location Strategy is All Files in Folder.

Enter a regular expression that will match one or more files in the Input Directory, in order to generate a list of files to load.

PhixFlow uses the File Pattern Expression combined with the fields that specify the directories in which to look. This means PhixFlow:

  • looks in directories specified in the Input Directory
  • and looks in sub-directories specified by the Directory Pattern Expression
  • but ignores sub-directories that have been excluded by the Exclude Dir Pattern Expr.

If the specified directories contain no files that match the regular expression, the file collector will not load any files.

Internal Variables

You can use the following PhixFlow internal variables in this expression:

  • _fromDate: the start date of the period of the table being processed.
  • _toDate: the end date of the period of the table being processed.
  • %SEQ%: the current sequence number.
Examples

An Input Directory, inputRecords.txt, contains multiple files. To load:

  • a single file called inputRecords.txt, enter: "inputRecords.txt"
  • all the files, enter:  ".*"
  • files with the extension .txt, enter: ".*\\.txt"
  • files whose names start with teamA, enter: "teamA.*"
  • files with the format record_yyyy-MM-dd.txt, where yyyy-MM-dd is the current date, such as record_2013-03-26.txt, enter:  "record_" + toString(now(),"yyyy-MM-dd") + "\\.txt"

You can also load files whose names are contained in the list of files uploaded by the table-action that caused the file collector to run. The table-action must have a Context Value called 'f' with the value expression '_files'.

In this case, enter the file pattern expression:  "("+listToString(_context.f,"|")+")"
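
The example patterns above can be checked against a list of file names using Python's re module as a stand-in for PhixFlow's regular expression engine. This is a sketch under assumptions: the file names are invented, whole-name matching is assumed, and the "|".join call only approximates what listToString does. In a Python raw string a single backslash escapes the full stop, where the PhixFlow expressions above use \\.

```python
import re
from datetime import date
from typing import List


def select_files(pattern: str, file_names: List[str]) -> List[str]:
    """Return the file names that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [name for name in file_names if compiled.fullmatch(name)]


# Hypothetical directory contents.
files = ["inputRecords.txt", "teamA_jan.csv", "teamA_feb.txt",
         "record_2013-03-26.txt"]

txt_pattern = r".*\.txt"
# A fixed date stands in for now() so that the result is repeatable.
dated_pattern = "record_" + date(2013, 3, 26).strftime("%Y-%m-%d") + r"\.txt"
# Alternation built from a list of uploaded file names.
uploaded = ["inputRecords.txt", "teamA_jan.csv"]
list_pattern = "(" + "|".join(uploaded) + ")"
```

Note that the names in list_pattern are not escaped, so a full stop in a file name still matches any character. If that matters, escape each name first, for example with re.escape in Python.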


 Tips on Regular Expressions

As this field takes a regular expression, remember:

  • Directory separators are forward slashes / (not backslashes \)
  • Any character is a full stop (period) .
  • Any string of characters is .* (not just *)
  • A full stop (period) must be escaped like this \\.

More information can be found on the web, such as the Mozilla Regular Expression page.

Exclude Dir. Pattern Expr.

If the Directory Pattern Expression includes some sub-directories that you do not want to search, enter a regular expression that will exclude those sub-directories.

Internal Variables

You can use the following PhixFlow internal variables in this expression:

  • _fromDate: the start date of the period of the table being processed.
  • _toDate: the end date of the period of the table being processed.
Examples

Input Directory has the sub-directories:
    'region1/teamA'; 'region1/teamB'; 'region2/teamA'
Directory Pattern Expression is:  ".*"
This means PhixFlow will load all files from all regions.

To exclude files from teamB, enter the expression:  ".*/teamB/"
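
The way the include and exclude patterns combine can be sketched in Python as follows. The function is hypothetical and assumes whole-path matching with a trailing forward slash on each sub-directory, which this page does not confirm.

```python
import re
from typing import List, Optional


def directories_to_search(candidates: List[str], include_pattern: str,
                          exclude_pattern: Optional[str] = None) -> List[str]:
    """Keep directories matching include_pattern but not exclude_pattern."""
    included = [d for d in candidates if re.fullmatch(include_pattern, d)]
    if not exclude_pattern:
        return included
    return [d for d in included if not re.fullmatch(exclude_pattern, d)]


searched = directories_to_search(
    ["region1/teamA/", "region1/teamB/", "region2/teamA/"],
    include_pattern=".*",
    exclude_pattern=".*/teamB/",
)
```

With these inputs, searched contains only the two teamA directories.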


 Tips on Regular Expressions

As this field takes a regular expression, remember:

  • Directory separators are forward slashes / (not backslashes \)
  • Any character is a full stop (period) .
  • Any string of characters is .* (not just *)
  • A full stop (period) must be escaped like this \\.

More information can be found on the web, such as the Mozilla Regular Expression page.

Archive Directory Expression

Optionally, enter an expression for a directory path. The expression must resolve to a Regular Expression. See Regular Expressions.

To enter a plain text directory as an expression:

  • enclose the path in quotes
  • use forward-slash / as a separator, even for paths on a Windows platform, for example, "C:/data/address/directory/".

The Archive Directory Expression is the location to which all files processed by the file collector will be written.

The Error Directory Expression is the location to which any files that cause an error during processing will be written.

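Error Directory Expression

See the description of Archive Directory Expression, above, which also covers the Error Directory Expression.

Local Archive Directory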

Available when FTP Site is specified.

Specify whether the archive directory is on the PhixFlow server (local) or on the original server.

If you have uploaded a file to the file collector, when you run analysis, PhixFlow reads the data into the file collector's table and then archives the loaded data. This means the file collector no longer has any data. If you subsequently run analysis again, the recordset will be empty. To keep the data in the file collector, you must do one of the following:

  • do not specify an archive directory
  • upload the file to the file collector again
  • once the data is in the table, set as static:
    • either the table
    • or the pipe to the table.
Local Error Directory

Available when FTP Site is specified.

Specify whether the error directory is on the PhixFlow server (local) or on the original server.

File Columns

Available when File Type is Comma Separated Values, Fixed Length Records or Excel Spreadsheet. Enter the attributes of the data columns that you want to extract from the input file. The grid has the standard toolbar and the following extra buttons:

  • Populate Attributes: once you have uploaded a file, automatically populate the File Columns grid with the data attributes. PhixFlow samples some rows in the file to determine the values to use.
  • List Tables: open a repository tab listing all the tables.
  • File Collector: open a repository tab listing all the file collectors.

To add an attribute, click Create New. To edit an attribute, double-click on its row in the grid. In both cases, PhixFlow opens an attribute form.

Field | Description
Name

Enter the name of the column, which can contain any combination of letters, numbers and the underscore character (_).

If you use Populate Attributes, PhixFlow uses data in the first row of a file to generate column names. Any invalid characters are stripped out.

PhixFlow always uses this name to refer to this attribute.

Order

Enter a number that matches the column number in the input file. For example, if you want to extract the third, first and fifth column of data from a file, the three rows in this grid will have the order:

  • 3
  • 1
  • 5
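
The effect of the Order values can be pictured with a short Python sketch that pulls the third, first and fifth columns from comma-separated text. The function name and sample data are invented for illustration; the header_lines parameter mirrors the Number of Header Lines property.

```python
import csv
import io
from typing import List


def extract_columns(csv_text: str, orders: List[int],
                    header_lines: int = 0) -> List[List[str]]:
    """Return the requested 1-based columns from each data row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Skip the header lines, then pick columns in the configured order.
    return [[row[order - 1] for order in orders] for row in rows[header_lines:]]


sample = "id,name,city,age,score\n1,Ann,Leeds,34,9.5\n"
extracted = extract_columns(sample, [3, 1, 5], header_lines=1)
```

Here extracted holds one record containing the city, id and score values, in that order.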
Type
Enter one of the data types:

If you use Populate Attributes, PhixFlow uses the type it recognises from the data, or String.

Length

Enter the maximum length of the field in the input file:

  • For String, the maximum length of the string
  • For Integer, the maximum number of digits.

This is:

  • required when File Type is Fixed Length Records
  • optional when File Type is Comma Separated Values or Excel Spreadsheet, as only the field sizes of the fields in the table that this collector will write to are important.
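
Length is required for Fixed Length Records because the column boundaries come only from the field widths. The Python sketch below shows the idea under stated assumptions: the function name is hypothetical, and trimming the padding from each field is a choice made for the example rather than documented PhixFlow behaviour.

```python
from typing import List


def split_fixed_length(record: str, lengths: List[int]) -> List[str]:
    """Cut one record into fields using the configured lengths."""
    fields = []
    start = 0
    for length in lengths:
        # Each field occupies exactly `length` characters.
        fields.append(record[start:start + length].strip())
        start += length
    return fields


# Build a record with widths 10, 4 and 8.
record = "ACME".ljust(10) + "0042" + "London".ljust(8)
fields = split_fixed_length(record, [10, 4, 8])
```

For this record, fields contains the three values with their padding removed.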

If you use Populate Attributes, PhixFlow uses the length of the longest value it has seen. As PhixFlow does not check all the records, longer fields may exist in the file.


Analysis Models

If this item is used by an analysis model, its name is listed here. See the Common Properties page, Analysis Model section for more details.


Advanced

Field | Description
Maximum Files
Enter the maximum number of files that PhixFlow will process when the file collector runs.
Minimum Files
Enter the minimum number of files that PhixFlow expects to find when the file collector runs. If fewer files are found, PhixFlow treats this as an error.
Max Records Per File
Enter the maximum number of records that PhixFlow will read from each file.

Errors Before Rollback

Enter the maximum number of errors that PhixFlow permits while processing a file. Once this number has been exceeded, PhixFlow abandons the attempted file load; see File Collector Error Handling Summary, below.

Parallel Readers
Enter the number of files to process in parallel. If blank, this defaults to 1. If the file collector is configured to read files in sequence, this field is ignored and a single file reader is used.
Unreadable Directories
Select what PhixFlow does if it finds unreadable directories while searching a directory hierarchy for files to import:
  • Error: the search fails and the unreadable directories are reported in the log.
  • Warning: the search continues and unreadable directories are reported in the log.
  • Ignore: the search continues and unreadable directories are ignored.
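
The three options differ in two respects: whether the search continues and whether anything is logged. The Python sketch below restates that behaviour. The function name and the log message wording are invented for the example and do not reflect PhixFlow's actual log format.

```python
from typing import List


def report_unreadable(policy: str, unreadable: List[str],
                      log: List[str]) -> bool:
    """Log unreadable directories per policy; return True to keep searching."""
    if not unreadable:
        return True
    if policy == "Error":
        log.extend(f"ERROR unreadable directory: {d}" for d in unreadable)
        return False  # the search fails
    if policy == "Warning":
        log.extend(f"WARNING unreadable directory: {d}" for d in unreadable)
        return True  # the search continues
    if policy == "Ignore":
        return True  # the search continues and nothing is logged
    raise ValueError(f"unknown policy: {policy}")
```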
XPath Expression
Available when File Type is XML File.

Enter valid XPath syntax. For information about how to use XPath expressions and how to use the returned data in the corresponding stream attribute expressions, see XPath Examples.

Character Set
Select the character encoding to be used:
  • utf-8
  • utf-16
  • ascii
  • other - see Other Character Set, below
  • cp1252
  • iso-8859-1
Column Separator
Select a value from the drop-down list:
  • Comma
  • Space
  • Tab
  • Other -  see Separator Character, below
Separator Character
Available when Column Separator is Other. Enter a custom column separator, which can be a single character or a sequence of multiple characters.
Quote Style

Select a value from the drop-down list.

  • Double Quote
  • Single Quote
  • Back Tick
  • Other - see Quote Character below
  • None
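
To see how Column Separator and Quote Style change the way a record is split, you can experiment with Python's csv module. This is only an analogy: the mapping tables and function name are assumptions for the example, and Python's csv reader accepts single-character separators only, so the multi-character Other separator is not covered.

```python
import csv
import io
from typing import List

# Hypothetical mapping from the drop-down values to characters.
SEPARATORS = {"Comma": ",", "Space": " ", "Tab": "\t"}
QUOTES = {"Double Quote": '"', "Single Quote": "'", "Back Tick": "`"}


def read_records(text: str, column_separator: str = "Comma",
                 quote_style: str = "Double Quote") -> List[List[str]]:
    """Split text into records using the chosen separator and quote style."""
    delimiter = SEPARATORS[column_separator]
    if quote_style == "None":
        # Quote characters get no special treatment.
        reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                            quoting=csv.QUOTE_NONE)
    else:
        reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                            quotechar=QUOTES[quote_style])
    return list(reader)
```

For example, reading the text 1,`x,y` with the Back Tick quote style yields two fields, because the comma inside the back ticks is treated as data rather than as a separator.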

Quote Character
Available when Quote Style is Other. Enter a custom quote character.

Ignore Extra Columns

Available when File Type is Comma Separated Values.

 Tick so that PhixFlow does not report an error if the record being read contains more columns than expected.

 Untick so that PhixFlow reports an error if there are too many columns; see File Collector Error Handling Summary, below.

Ignore Missing Columns

Available when File Type is Comma Separated Values.

 Tick so that PhixFlow does not report an error if the record being read contains fewer columns than expected.