Regular expressions provide a powerful way of matching patterns, that is, anything more complex than a simple string literal.
When you enter a regular expression in a PhixFlow field, remember that it must be contained inside "quotes", except where you are using PhixFlow variables that contain strings, or functions that return strings.
"inputFile_[0-9]{3}\\.txt"
will match data - in this case file names - that have the format inputFile_NNN.txt, where each N is a digit. E.g. inputFile_034.txt.
"inputFile_" + today() + "\\.txt"
will match data - in this case file names - that have the format inputFile_YYYYMMDD.txt, where YYYYMMDD is the current date. E.g. inputFile_20130828.txt.
"inputFile_" + $fileSeq + "\\.txt"
will match data - in this case file names - that have the format inputFile_$fileSeq.txt. E.g. if $fileSeq="034", this will match inputFile_034.txt.
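The three patterns above can be tried out with a minimal sketch in Python, used here only as a stand-in for PhixFlow expressions. The names `matches`, `today_string` and `file_seq` are illustrative assumptions (they stand in for a match test, `today()` and `$fileSeq`), and the sketch assumes that the whole file name must match the pattern.

```python
import re
from datetime import date

# Pattern 1: a literal prefix, exactly three digits, then a literal ".txt".
# The doubled backslash mirrors how the pattern is typed in a PhixFlow field.
fixed_pattern = "inputFile_[0-9]{3}\\.txt"

# Pattern 2: built by concatenation. today_string stands in for today();
# the YYYYMMDD format and the date are taken from the example above.
today_string = date(2013, 8, 28).strftime("%Y%m%d")
dated_pattern = "inputFile_" + today_string + "\\.txt"

# Pattern 3: built from a variable; file_seq stands in for $fileSeq.
file_seq = "034"
seq_pattern = "inputFile_" + file_seq + "\\.txt"


def matches(pattern, text):
    """Return True when the whole of text matches pattern (assumed behaviour)."""
    return re.fullmatch(pattern, text) is not None


print(matches(fixed_pattern, "inputFile_034.txt"))       # True
print(matches(fixed_pattern, "inputFile_34.txt"))        # False: only two digits
print(matches(dated_pattern, "inputFile_20130828.txt"))  # True
print(matches(seq_pattern, "inputFile_034.txt"))         # True
```

Without the escaped dot (`\\.`), the `.` would match any character, so a name such as inputFile_034Xtxt would also be accepted.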
Regular expressions are used in many languages (C, Java, Perl) and Unix shells. PhixFlow regular expressions largely follow the common rules of POSIX regular expressions; the specification for the elements of regular expressions is given below. Useful examples of regular expressions are found in replaceAll, replaceFirst and matches.
In PhixFlow, to escape a character you must use two backslashes: \\, e.g. \\n for a new line. To match a backslash in the data you are matching, you will need four backslashes: \\\\ in your regular expression. See below for details.
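As a rough illustration of the two-backslash and four-backslash rules, the sketch below uses ordinary (non-raw) Python string literals, which happen to need the same doubling. It is an analogy, not PhixFlow code, and the path and variable names are made up for the example.

```python
import re

# Four backslashes in the typed pattern reach the regex engine as two (\\),
# which the engine reads as "one literal backslash in the data".
backslash_pattern = "C:\\\\data\\\\[a-z]+\\.csv"

# The data contains single backslashes; they are doubled here only because
# this is a Python string literal.
windows_path = "C:\\data\\sales.csv"

# Two backslashes followed by n form the newline construct.
newline_pattern = "first\\nsecond"

path_found = re.fullmatch(backslash_pattern, windows_path) is not None
newline_found = re.fullmatch(newline_pattern, "first\nsecond") is not None

print(path_found)     # True
print(newline_found)  # True
```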
Regular expression constructs, and what they match
Construct | Matches |
---|---|
Characters | |
x | The character x |
\\\\ | The backslash character |
\\0n | The character with octal value 0n (0 <= n <= 7) |
\\0nn | The character with octal value 0nn (0 <= n <= 7) |
\\0mnn | The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7) |
\\xhh | The character with hexadecimal value 0xhh |
\\uhhhh | The character with hexadecimal value 0xhhhh |
\\t | The tab character ('\u0009') |
\\n | The newline (line feed) character ('\u000A') |
\\r | The carriage-return character ('\u000D') |
\\f | The form-feed character ('\u000C') |
\\a | The alert (bell) character ('\u0007') |
\\e | The escape character ('\u001B') |
\\cx | The control character corresponding to x |
Character classes | |
[abc] | a, b, or c (simple class) |
[^abc] | Any character except a, b, or c (negation) |
[a-zA-Z] | a through z or A through Z, inclusive (range) |
[a-d[m-p]] | a through d, or m through p: [a-dm-p] (union) |
[a-z&&[def]] | d, e, or f (intersection) |
[a-z&&[^bc]] | a through z, except for b and c: [ad-z] (subtraction) |
[a-z&&[^m-p]] | a through z, and not m through p: [a-lq-z] (subtraction) |
Predefined character classes | |
. | Any character (may or may not match line terminators) |
\\d | A digit: [0-9] |
\\D | A non-digit: [^0-9] |
\\s | A whitespace character: [ \\t\\n\\x0B\\f\\r] |
\\S | A non-whitespace character: [^\\s] |
\\w | A word character: [a-zA-Z_0-9] |
\\W | A non-word character: [^\\w] |
POSIX character classes (US-ASCII only) | |
\\p{Lower} | A lower-case alphabetic character: [a-z] |
\\p{Upper} | An upper-case alphabetic character: [A-Z] |
\\p{ASCII} | All ASCII: [\\x00-\\x7F] |
\\p{Alpha} | An alphabetic character: [\\p{Lower}\\p{Upper}] |
\\p{Digit} | A decimal digit: [0-9] |
\\p{Alnum} | An alphanumeric character: [\\p{Alpha}\\p{Digit}] |
\\p{Punct} | Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ |
\\p{Graph} | A visible character: [\\p{Alnum}\\p{Punct}] |
\\p{Print} | A printable character: [\\p{Graph}] |
\\p{Blank} | A space or a tab: [ \\t] |
\\p{Cntrl} | A control character: [\\x00-\\x1F\\x7F] |
\\p{XDigit} | A hexadecimal digit: [0-9a-fA-F] |
\\p{Space} | A whitespace character: [ \\t\\n\\x0B\\f\\r] |
Classes for Unicode blocks and categories | |
\\p{InGreek} | A character in the Greek block (simple block) |
\\p{Lu} | An uppercase letter (simple category) |
\\p{Sc} | A currency symbol |
\\P{InGreek} | Any character except one in the Greek block (negation) |
[\\p{L}&&[^\\p{Lu}]] | Any letter except an uppercase letter (subtraction) |
Boundary matchers | |
^ | The beginning of a line |
$ | The end of a line |
\\b | A word boundary |
\\B | A non-word boundary |
\\A | The beginning of the input |
\\G | The end of the previous match |
\\Z | The end of the input but for the final terminator, if any |
\\z | The end of the input |
Greedy quantifiers | |
X? | X, once or not at all |
X* | X, zero or more times |
X+ | X, one or more times |
X{n} | X, exactly n times |
X{n,} | X, at least n times |
X{n,m} | X, at least n but not more than m times |
Reluctant quantifiers | |
X?? | X, once or not at all |
X*? | X, zero or more times |
X+? | X, one or more times |
X{n}? | X, exactly n times |
X{n,}? | X, at least n times |
X{n,m}? | X, at least n but not more than m times |
Possessive quantifiers | |
X?+ | X, once or not at all |
X*+ | X, zero or more times |
X++ | X, one or more times |
X{n}+ | X, exactly n times |
X{n,}+ | X, at least n times |
X{n,m}+ | X, at least n but not more than m times |
Logical operators | |
XY | X followed by Y |
X|Y | Either X or Y |
(X) | X, as a capturing group |
Back references | |
$n | Whatever the nth capturing group matched |
Quotation | |
\\ | Nothing, but quotes the following character |
\\Q | Nothing, but quotes all characters until \\E |
\\E | Nothing, but ends quoting started by \\Q |
Special constructs (non-capturing) | |
(?:X) | X, as a non-capturing group |
(?idmsux-idmsux) | Nothing, but turns match flags on - off |
(?idmsux-idmsux:X) | X, as a non-capturing group with the given flags on - off |
(?=X) | X, via zero-width positive lookahead |
(?!X) | X, via zero-width negative lookahead |
(?<=X) | X, via zero-width positive lookbehind |
(?<!X) | X, via zero-width negative lookbehind |
(?>X) | X, as an independent, non-capturing group |
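A few of the constructs in the table are easier to follow with concrete input. The sketch below uses Python's `re` module as a stand-in; it covers only constructs that Python shares with the table (greedy and reluctant quantifiers, word boundaries, lookahead). Other entries, such as class intersection with `&&`, are not available in Python and are not shown. The sample strings are invented for the illustration.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: .* takes as much as it can, so the match runs to the last </b>.
greedy = re.search("<b>.*</b>", text).group(0)

# Reluctant: .*? takes as little as it can, so the match stops at the first </b>.
reluctant = re.search("<b>.*?</b>", text).group(0)

# Word boundaries: \\b stops "cat" from matching inside "concatenate".
whole_words = re.findall("\\bcat\\b", "cat concatenate cat.")

# Zero-width positive lookahead: digits only when "kg" follows;
# the "kg" itself is not part of the match.
weights = re.findall("[0-9]+(?=kg)", "12kg 30cm 7kg")

print(greedy)       # <b>one</b> and <b>two</b>
print(reluctant)    # <b>one</b>
print(whole_words)  # ['cat', 'cat']
print(weights)      # ['12', '7']
```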
Backslashes, escapes, and quoting
A double backslash \\ either makes an alphanumeric character a construct (e.g. \\n means newline), or allows a special character to be quoted (e.g. use \\{ if you want to match the { character). If you escape an alphabetic character that is not a construct when escaped, you will get an error.
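Both halves of this rule can be seen in a short Python sketch. Python's `re` module behaves the same way on these two points, but it is only a stand-in, and the error raised in PhixFlow will look different. The letter q is used as an example of a letter that has no meaning as a construct.

```python
import re

# "\\{" and "\\}" quote the braces so that they are matched literally
# instead of being read as the start and end of a quantifier.
brace_pattern = "\\{[a-z]+\\}"
brace_match = re.fullmatch(brace_pattern, "{name}") is not None

# Escaping a letter that is not a construct is rejected when the
# pattern is compiled.
try:
    re.compile("\\q")
    bad_escape_rejected = False
except re.error:
    bad_escape_rejected = True

print(brace_match)          # True
print(bad_escape_rejected)  # True
```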
Line terminators
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:
- A newline (line feed) character ('\\n'),
- A carriage-return character followed immediately by a newline character ("\\r\\n"),
- A standalone carriage-return character ('\\r'),
- A next-line character ('\\u0085'),
- A line-separator character ('\\u2028'), or
- A paragraph-separator character ('\\u2029').
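The list above can be written as a single alternation. The following Python sketch is an illustration only: `LINE_TERMINATOR` and `split_lines` are hypothetical names, and the pattern is one possible way of expressing the list, not the expression PhixFlow uses internally.

```python
import re

# "\\r\\n" is listed first so that a carriage return followed by a newline
# is consumed as one terminator rather than two.
LINE_TERMINATOR = "\\r\\n|[\\n\\r\\u0085\\u2028\\u2029]"


def split_lines(text):
    """Split text at any of the line terminators listed above."""
    return re.split(LINE_TERMINATOR, text)


sample = "one\r\ntwo\nthree\rfour\u2028five"
lines = split_lines(sample)
print(lines)  # ['one', 'two', 'three', 'four', 'five']
```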
Groups and capturing
Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
Group | Sub-expression |
---|---|
1 | ((A)(B(C))) |
2 | (A) |
3 | (B(C)) |
4 | (C) |
Group zero always stands for the entire expression.
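The numbering can be checked with a short Python sketch, which counts groups the same way. The input "ABC" and the variable names are chosen only for the illustration.

```python
import re

match = re.fullmatch("((A)(B(C)))", "ABC")

# match.re.groups is the number of capturing groups in the pattern;
# group 0 is the entire match.
numbered_groups = {n: match.group(n) for n in range(match.re.groups + 1)}

for number, captured in numbered_groups.items():
    print(number, captured)
# 0 ABC
# 1 ABC
# 2 A
# 3 BC
# 4 C
```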
During a match, each sub-sequence of the input sequence that matches a group is saved. The captured sub-sequence may be used later in the expression, via a back reference. In PhixFlow, a back reference has the format $n, where n is the number of the capturing group. See examples for replaceFirst.
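As a sketch of how $n references behave in a replacement, the Python code below imitates a replace-first operation. Python writes group references as `\g<n>` rather than $n, so the hypothetical helper `to_python_replacement` translates between the two. `replace_first` and its argument order are also assumptions made for this example and are not the signature of the PhixFlow function.

```python
import re


def to_python_replacement(replacement):
    """Rewrite $n group references in Python's \\g<n> form (hypothetical helper)."""
    return re.sub("\\$([0-9]+)", "\\\\g<\\1>", replacement)


def replace_first(text, pattern, replacement):
    """Replace only the first match of pattern in text (illustrative stand-in)."""
    return re.sub(pattern, to_python_replacement(replacement), text, count=1)


# Three capturing groups: year, month, day. The replacement reorders them.
date_pattern = "([0-9]{4})-([0-9]{2})-([0-9]{2})"
swapped = replace_first("2013-08-28 and 2014-01-02", date_pattern, "$3/$2/$1")
print(swapped)  # 28/08/2013 and 2014-01-02
```

The helper does not handle a replacement that itself contains a backslash; it is kept deliberately small.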
The captured input associated with a group is always the sub-sequence that the group most recently matched. If a group is evaluated a second time because of a quantifier then its previously captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total.
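The "aba" example and the rule about non-capturing groups can both be checked in Python, whose `re` module behaves the same way on these two points. This is a sketch for illustration, and the second pattern is invented for the example.

```python
import re

# Second pass of the outer group matches only "a"; the inner group does not
# match again, so it keeps the "b" captured on the first pass.
repeated = re.fullmatch("(a(b)?)+", "aba")
group_one = repeated.group(1)
group_two = repeated.group(2)

# (?:ab) groups the two characters for the quantifier but is not counted,
# so (c) is group 1 and the pattern has one capturing group in total.
non_capturing = re.compile("(?:ab)+(c)")
group_total = non_capturing.groups
first_group = non_capturing.fullmatch("ababc").group(1)

print(group_one, group_two)      # a b
print(group_total, first_group)  # 1 c
```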