Regular Expressions

Regular expressions provide a powerful way to match patterns in text, that is, anything more general than a simple string literal.

When you enter a regular expression in a PhixFlow field, remember that it must be contained inside "quotes", except where you are using PhixFlow variables that contain strings, or functions that return strings.

"inputFile_[0-9]{3}\\.txt"

will match data - in this case file names - that have the format inputFile_NNN.txt, where each N is a digit. E.g. inputFile_034.txt.

"inputFile_" + today() + "\\.txt"

will match data - in this case file names - that have the format inputFile_YYYYMMDD.txt, where YYYYMMDD is the current date. E.g. inputFile_20130828.txt.

"inputFile_" + $fileSeq + "\\.txt"

will match data - in this case file names - that have the format inputFile_$fileSeq.txt. E.g. if $fileSeq="034", this will match inputFile_034.txt.
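
The three patterns above can be tried outside PhixFlow. The following is a minimal sketch in Java, on the assumption that PhixFlow regular expressions behave like Java's java.util.regex, which the construct list on this page closely resembles. The class name FileNamePatterns and the method forSequence are illustrative only, and fileSeq stands in for the $fileSeq variable.

```java
import java.util.regex.Pattern;

class FileNamePatterns {
    // Fixed prefix, exactly three digits, then a literal ".txt".
    static final Pattern NUMBERED = Pattern.compile("inputFile_[0-9]{3}\\.txt");

    // Builds the pattern by string concatenation, as in the $fileSeq example.
    // fileSeq is assumed to contain no regex metacharacters.
    static Pattern forSequence(String fileSeq) {
        return Pattern.compile("inputFile_" + fileSeq + "\\.txt");
    }

    public static void main(String[] args) {
        System.out.println(NUMBERED.matcher("inputFile_034.txt").matches()); // true
        System.out.println(NUMBERED.matcher("inputFile_34.txt").matches());  // false: only two digits
        System.out.println(forSequence("034").matcher("inputFile_034.txt").matches()); // true
    }
}
```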

Regular expressions are used in many languages (C, Java, Perl) and in Unix shells. PhixFlow regular expressions largely follow the common rules of POSIX regular expressions; the specification for the elements of regular expressions is given below. Useful examples of regular expressions are given under the functions replaceAll, replaceFirst and matches.

In PhixFlow, to escape a character you must use two backslashes: \\, e.g. \\n for a new line. To match a backslash in the data you are matching, you will need four backslashes: \\\\ in your regular expression. See below for details.
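
Java string literals happen to need the same doubling, so the rule can be illustrated with a short Java sketch. This assumes PhixFlow passes the resulting expression to an engine that behaves like java.util.regex; the class and method names are illustrative only.

```java
import java.util.regex.Pattern;

class BackslashEscapes {
    // "\\." is the two-character regex \. and matches a literal dot.
    static boolean isTxtName(String s) {
        return Pattern.matches("[a-z]+\\.txt", s);
    }

    // Four backslashes in the quoted string become the regex \\,
    // which matches a single backslash in the data.
    static boolean containsBackslash(String s) {
        return Pattern.compile("\\\\").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isTxtName("notes.txt"));          // true
        System.out.println(isTxtName("notesxtxt"));          // false: the dot is literal
        System.out.println(containsBackslash("C:\\temp"));   // true: the data is C:\temp
        System.out.println(containsBackslash("C:/temp"));    // false
    }
}
```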

Regular expression constructs, and what they match

Construct	Matches
Characters
x	The character x
\\\\	The backslash character
\\0n	The character with octal value 0n (0 <= n <= 7)
\\0nn	The character with octal value 0nn (0 <= n <= 7)
\\0mnn	The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\\xhh	The character with hexadecimal value 0xhh
\\uhhhh	The character with hexadecimal value 0xhhhh
\\t	The tab character ('\u0009')
\\n	The newline (line feed) character ('\u000A')
\\r	The carriage-return character ('\u000D')
\\f	The form-feed character ('\u000C')
\\a	The alert (bell) character ('\u0007')
\\e	The escape character ('\u001B')
\\cx	The control character corresponding to x
Character classes
[abc]	a, b, or c (simple class)
[^abc]	Any character except a, b, or c (negation)
[a-zA-Z]	a through z or A through Z, inclusive (range)
[a-d[m-p]]	a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]	d, e, or f (intersection)
[a-z&&[^bc]]	a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]	a through z, and not m through p: [a-lq-z] (subtraction)
Predefined character classes
.	Any character (may or may not match line terminators)
\\d	A digit: [0-9]
\\D	A non-digit: [^0-9]
\\s	A whitespace character: [ \\t\\n\\x0B\\f\\r]
\\S	A non-whitespace character: [^\\s]
\\w	A word character: [a-zA-Z_0-9]
\\W	A non-word character: [^\\w]
POSIX character classes (US-ASCII only)
\\p{Lower}	A lower-case alphabetic character: [a-z]
\\p{Upper}	An upper-case alphabetic character: [A-Z]
\\p{ASCII}	All ASCII: [\\x00-\\x7F]
\\p{Alpha}	An alphabetic character: [\\p{Lower}\\p{Upper}]
\\p{Digit}	A decimal digit: [0-9]
\\p{Alnum}	An alphanumeric character: [\\p{Alpha}\\p{Digit}]
\\p{Punct}	Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\\p{Graph}	A visible character: [\\p{Alnum}\\p{Punct}]
\\p{Print}	A printable character: [\\p{Graph}]
\\p{Blank}	A space or a tab: [ \\t]
\\p{Cntrl}	A control character: [\\x00-\\x1F\\x7F]
\\p{XDigit}	A hexadecimal digit: [0-9a-fA-F]
\\p{Space}	A whitespace character: [ \\t\\n\\x0B\\f\\r]
Classes for Unicode blocks and categories
\\p{InGreek}	A character in the Greek block (simple block)
\\p{Lu}	An uppercase letter (simple category)
\\p{Sc}	A currency symbol
\\P{InGreek}	Any character except one in the Greek block (negation)
[\\p{L}&&[^\\p{Lu}]]	Any letter except an uppercase letter (subtraction)
Boundary matchers
^	The beginning of a line
$	The end of a line
\\b	A word boundary
\\B	A non-word boundary
\\A	The beginning of the input
\\G	The end of the previous match
\\Z	The end of the input but for the final terminator, if any
\\z	The end of the input
Greedy quantifiers
X?	X, once or not at all
X*	X, zero or more times
X+	X, one or more times
X{n}	X, exactly n times
X{n,}	X, at least n times
X{n,m}	X, at least n but not more than m times
Reluctant quantifiers
X??	X, once or not at all
X*?	X, zero or more times
X+?	X, one or more times
X{n}?	X, exactly n times
X{n,}?	X, at least n times
X{n,m}?	X, at least n but not more than m times
Possessive quantifiers
X?+	X, once or not at all
X*+	X, zero or more times
X++	X, one or more times
X{n}+	X, exactly n times
X{n,}+	X, at least n times
X{n,m}+	X, at least n but not more than m times
Logical operators
XY	X followed by Y
X|Y	Either X or Y
(X)	X, as a capturing group
Back references
$n	Whatever the nth capturing group matched
Quotation
\\	Nothing, but quotes the following character
\\Q	Nothing, but quotes all characters until \\E
\\E	Nothing, but ends quoting started by \\Q
Special constructs (non-capturing)
(?:X)	X, as a non-capturing group
(?idmsux-idmsux)	Nothing, but turns match flags on - off
(?idmsux-idmsux:X)	X, as a non-capturing group with the given flags on - off
(?=X)	X, via zero-width positive lookahead
(?!X)	X, via zero-width negative lookahead
(?<=X)	X, via zero-width positive lookbehind
(?<!X)	X, via zero-width negative lookbehind
(?>X)	X, as an independent, non-capturing group
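
The difference between greedy, reluctant and possessive quantifiers is easiest to see on one input. The sketch below is written in Java and assumes PhixFlow quantifiers behave as they do in java.util.regex; the class QuantifierModes and the helper firstMatch are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModes {
    // Returns the first substring the regex matches, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* takes everything, then backs off until "foo" fits.
        System.out.println(firstMatch(".*foo", input));   // xfooxxxxxxfoo
        // Reluctant: .*? takes as little as possible before "foo".
        System.out.println(firstMatch(".*?foo", input));  // xfoo
        // Possessive: .*+ takes everything and never backs off, so "foo" cannot match.
        System.out.println(firstMatch(".*+foo", input));  // null
    }
}
```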

Backslashes, escapes, and quoting

A double backslash \\ either makes an alphanumeric character a construct (e.g. \\n means newline), or allows a special character to be quoted (e.g. use \\{ if you want to match the { character). If you escape an alphabetic character that is not a construct when escaped, you will get an error.
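
Both uses of the double backslash, and the error case, can be sketched in Java. This assumes the PhixFlow engine accepts and rejects the same escapes as java.util.regex; the class EscapeRules and the helper compiles are illustrative names, and \\y is used only as an example of an escaped letter that is not a construct in Java.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class EscapeRules {
    // Returns true if the regex compiles, false if the engine rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // An escaped special character is matched literally.
        System.out.println(Pattern.matches("a\\{b", "a{b")); // true
        // \n is a construct (newline), so it compiles.
        System.out.println(compiles("\\n"));                 // true
        // \y is not a construct, so the engine reports an error.
        System.out.println(compiles("\\y"));                 // false
    }
}
```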

Line terminators

A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

A newline (line feed) character ('\\n'),
A carriage-return character followed immediately by a newline character ("\\r\\n"),
A standalone carriage-return character ('\\r'),
A next-line character ('\\u0085'),
A line-separator character ('\\u2028'), or
A paragraph-separator character ('\\u2029').
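
Line terminators matter mainly for the ^, $ and . constructs. The Java sketch below assumes PhixFlow treats them as java.util.regex does: with the m flag, ^ and $ match next to any of the terminators listed above, and . does not match a terminator unless the s flag is set. The class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineTerminators {
    // Counts the lines that consist only of digits.
    // (?m) makes ^ and $ match at line terminators, not just at the ends of the input.
    static int countDigitLines(String input) {
        Matcher m = Pattern.compile("(?m)^[0-9]+$").matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Lines: "12", "34", "ab", "56", "78", separated by \r\n, \r, \n and \u2028.
        System.out.println(countDigitLines("12\r\n34\rab\n56\u202878")); // 4
        // By default . does not match a line terminator; (?s) changes that.
        System.out.println(Pattern.matches("a.b", "a\nb"));     // false
        System.out.println(Pattern.matches("(?s)a.b", "a\nb")); // true
    }
}
```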

Groups and capturing

Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:

1	((A)(B(C)))
2	(A)
3	(B(C))
4	(C)

Group zero always stands for the entire expression.
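
The numbering can be checked with a short Java sketch, assuming PhixFlow numbers groups as java.util.regex does. The class GroupNumbering and the helper groups are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {
    // Returns group 0 (the whole match) followed by each numbered group,
    // or an empty array if the input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("((A)(B(C)))", "ABC");
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + " = " + g[i]);
        }
        // 0 = ABC, 1 = ABC, 2 = A, 3 = BC, 4 = C
    }
}
```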

During a match, each sub-sequence of the input sequence that matches a group is saved. The captured sub-sequence may be used later in the expression, via a back reference. In PhixFlow, a back reference has the format $n, where n is the number of the capturing group. See examples for replaceFirst.
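
As an illustration of $n, the Java sketch below reorders a date inside a file name. It assumes the PhixFlow replaceFirst function treats $n in the replacement text the way Java's String.replaceFirst does; the class and method names are illustrative.

```java
class BackReferenceReplace {
    // Rewrites the first YYYYMMDD run as DD-MM-YYYY using groups 1 to 3.
    static String reorderDate(String fileName) {
        return fileName.replaceFirst("(\\d{4})(\\d{2})(\\d{2})", "$3-$2-$1");
    }

    public static void main(String[] args) {
        System.out.println(reorderDate("inputFile_20130828.txt")); // inputFile_28-08-2013.txt
    }
}
```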

The captured input associated with a group is always the sub-sequence that the group most recently matched. If a group is evaluated a second time because of a quantifier then its previously captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
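
The "aba" case can be reproduced in Java, assuming PhixFlow retains captures as java.util.regex does. The class RepeatedGroupCapture and the helper groupAfterMatch are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupCapture {
    // Returns what the given group captured after a full match, or null if the input does not match.
    static String groupAfterMatch(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        // Group 1 is evaluated twice ("ab", then "a") and keeps the most recent value.
        System.out.println(groupAfterMatch("(a(b)?)+", "aba", 1)); // a
        // Group 2 fails on the second pass, so its earlier capture is retained.
        System.out.println(groupAfterMatch("(a(b)?)+", "aba", 2)); // b
    }
}
```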

Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total.
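
A short Java sketch of the effect on numbering, assuming PhixFlow counts groups as java.util.regex does. The pattern, the class NonCapturingGroups and its methods are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingGroups {
    // (?:...) groups the alternatives without capturing, so the digits are group 1.
    static final Pattern FILE = Pattern.compile("(?:inputFile|outputFile)_([0-9]{3})");

    // Returns the three-digit sequence, or null if the name does not match.
    static String sequenceOf(String name) {
        Matcher m = FILE.matcher(name);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(FILE.matcher("").groupCount()); // 1
        System.out.println(sequenceOf("outputFile_034"));  // 034
    }
}
```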