Regular Expressions
...
Regular expressions provide a powerful way of matching patterns in text, that is, anything more general than a simple string literal.
...
will match data - in this case file names - that have the format inputFile_NNN.txt, where each N is a digit. E.g. inputFile_034.txt.
...
will match data - in this case file names - that have the format inputFile_YYYYMMDD.txt, where YYYYMMDD is the current date. E.g. inputFile_20130828.txt.
...
will match data - in this case file names - that have the format inputFile_$fileSeq.txt. E.g. if $fileSeq="034", this will match inputFile_034.txt.
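The three file-name cases above can be sketched in plain Java. This is only an illustration, not PhixFlow syntax: it assumes the pattern language behaves like `java.util.regex`, and the class and method names (`FileNamePatterns`, `matchesSequence`, `matchesDate`, `matchesVariable`) are hypothetical. The date is passed in as a parameter rather than read from the clock so that the result is repeatable.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.regex.Pattern;

class FileNamePatterns {

    // inputFile_NNN.txt, where each N is a digit.
    static boolean matchesSequence(String fileName) {
        return Pattern.matches("inputFile_\\d{3}\\.txt", fileName);
    }

    // inputFile_YYYYMMDD.txt for one specific date.
    static boolean matchesDate(String fileName, LocalDate date) {
        // BASIC_ISO_DATE formats a LocalDate as, for example, 20130828.
        String stamp = date.format(DateTimeFormatter.BASIC_ISO_DATE);
        return Pattern.matches("inputFile_" + stamp + "\\.txt", fileName);
    }

    // inputFile_<fileSeq>.txt; Pattern.quote keeps any special
    // characters in fileSeq literal.
    static boolean matchesVariable(String fileName, String fileSeq) {
        return Pattern.matches("inputFile_" + Pattern.quote(fileSeq) + "\\.txt", fileName);
    }

    public static void main(String[] args) {
        System.out.println(matchesSequence("inputFile_034.txt"));   // true
        System.out.println(matchesSequence("inputFile_34.txt"));    // false: only two digits
        System.out.println(matchesDate("inputFile_20130828.txt",
                LocalDate.of(2013, 8, 28)));                        // true
        System.out.println(matchesVariable("inputFile_034.txt", "034")); // true
    }
}
```

Note that the dot before `txt` is escaped in every pattern. An unescaped `.` would match any character, so `inputFile_034xtxt` would also be accepted.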
...
In PhixFlow, to escape a character you must use two backslashes: \\, e.g. \\n for a new line. To match a literal backslash in the data, you need four backslashes: \\\\ in your regular expression. See below for details.
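Java string literals follow the same doubling rule, so a short Java sketch can show why two and four backslashes are needed. This is an analogy, not PhixFlow code; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class BackslashEscapes {

    // Two backslashes in the source text become the regex \d (a digit).
    static boolean isDigit(String s) {
        return Pattern.matches("\\d", s);
    }

    // Four backslashes in the source text become the regex for one
    // escaped backslash, which matches a single backslash in the data.
    static boolean containsBackslash(String s) {
        return Pattern.compile("\\\\").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isDigit("7"));                   // true
        System.out.println(containsBackslash("C:\\temp"));  // true: the data is C:, one backslash, temp
        System.out.println(containsBackslash("C:/temp"));   // false
    }
}
```

The first level of escaping is consumed when the string is read, and the second when the regular expression is compiled, which is why a single backslash in the data costs four in the expression.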
Regular expression constructs, and what they match
Construct | Matches | |
---|---|---|
Characters | ||
x | The character x | |
\\\\ | The backslash character | |
\\/ | The forward slash character | |
\\\" | The double quote character | |
\\0n | The character with octal value 0n (0 <= n <= 7) | |
\\0nn | The character with octal value 0nn (0 <= n <= 7) | |
\\0mnn | The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7) | |
\\xhh | The character with hexadecimal value 0xhh | |
\uhhhh | The character with hexadecimal value 0xhhhh | |
\\t | The tab character ('\u0009') | |
\\n | The newline (line feed) character ('\u000A') | |
\\r | The carriage-return character ('\u000D') | |
\\f | The form-feed character ('\u000C') | |
\\a | The alert (bell) character ('\u0007') | |
\\e | The escape character ('\u001B') | |
\\cx | The control character corresponding to x | |
Character classes | ||
[abc] | a, b, or c (simple class) | |
[^abc] | Any character except a, b, or c (negation) | |
[a-zA-Z] | a through z or A through Z, inclusive (range) | |
[a-d[m-p]] | a through d, or m through p: [a-dm-p] (union) | |
[a-z&&[def]] | d, e, or f (intersection) | |
[a-z&&[^bc]] | a through z, except for b and c: [ad-z] (subtraction) | |
[a-z&&[^m-p]] | a through z, and not m through p: [a-lq-z] (subtraction) | |
Predefined character classes | ||
. | Any character (may or may not match line terminators) | |
\\d | A digit: [0-9] | |
\\D | A non-digit: [^0-9] | |
\\s | A whitespace character: [ \\t\\n\\x0B\\f\\r] | |
\\S | A non-whitespace character: [^\\s] | |
\\w | A word character: [a-zA-Z_0-9] | |
\\W | A non-word character: [^\\w] | |
POSIX character classes (US-ASCII only) | ||
\\p{Lower} | A lower-case alphabetic character: [a-z] | |
\\p{Upper} | An upper-case alphabetic character: [A-Z] | |
\\p{ASCII} | All ASCII: [\\x00-\\x7F] | |
\\p{Alpha} | An alphabetic character: [\\p{Lower}\\p{Upper}] | |
\\p{Digit} | A decimal digit: [0-9] | |
\\p{Alnum} | An alphanumeric character: [\\p{Alpha}\\p{Digit}] | |
\\p{Punct} | Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ | |
\\p{Graph} | A visible character: [\\p{Alnum}\\p{Punct}] | |
\\p{Print} | A printable character: [\\p{Graph}] | |
\\p{Blank} | A space or a tab: [ \\t] | |
\\p{Cntrl} | A control character: [\\x00-\\x1F\\x7F] | |
\\p{XDigit} | A hexadecimal digit: [0-9a-fA-F] | |
\\p{Space} | A whitespace character: [ \\t\\n\\x0B\\f\\r] | |
Classes for Unicode blocks and categories | ||
\\p{InGreek} | A character in the Greek block (simple block) | |
\\p{Lu} | An uppercase letter (simple category) | |
\\p{Sc} | A currency symbol | |
\\P{InGreek} | Any character except one in the Greek block (negation) | |
[\\p{L}&&[^\\p{Lu}]] | Any letter except an uppercase letter (subtraction) | |
Boundary matchers | ||
^ | The beginning of a line | |
$ | The end of a line | |
\\b | A word boundary | |
\\B | A non-word boundary | |
\\A | The beginning of the input | |
\\G | The end of the previous match | |
\\Z | The end of the input but for the final terminator, if any | |
\\z | The end of the input | |
Greedy quantifiers | ||
X? | X, once or not at all | |
X* | X, zero or more times | |
X+ | X, one or more times | |
X{n} | X, exactly n times | |
X{n,} | X, at least n times | |
X{n,m} | X, at least n but not more than m times | |
Reluctant quantifiers | ||
X?? | X, once or not at all | |
X*? | X, zero or more times | |
X+? | X, one or more times | |
X{n}? | X, exactly n times | |
X{n,}? | X, at least n times | |
X{n,m}? | X, at least n but not more than m times | |
Possessive quantifiers | ||
X?+ | X, once or not at all | |
X*+ | X, zero or more times | |
X++ | X, one or more times | |
X{n}+ | X, exactly n times | |
X{n,}+ | X, at least n times | |
X{n,m}+ | X, at least n but not more than m times | |
Logical operators | ||
XY | X followed by Y | |
X|Y | Either X or Y | |
(X) | X, as a capturing group | |
Back references | ||
$n | Whatever the nth capturing group matched | |
Quotation | ||
\\ | Nothing, but quotes the following character | |
\\Q | Nothing, but quotes all characters until \\E | |
\\E | Nothing, but ends quoting started by \\Q | |
Special constructs (non-capturing) | ||
(?:X) | X, as a non-capturing group | |
(?idmsux-idmsux) | Nothing, but turns match flags on - off | |
(?idmsux-idmsux:X) | X, as a non-capturing group with the given flags on - off | |
(?=X) | X, via zero-width positive lookahead | |
(?!X) | X, via zero-width negative lookahead | |
(?<=X) | X, via zero-width positive lookbehind | |
(?<!X) | X, via zero-width negative lookbehind | |
(?>X) | X, as an independent, non-capturing group |
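
The greedy, reluctant and possessive rows in the table share the same descriptions, so the difference is easiest to see by running one input through all three. The sketch below uses plain Java and assumes the engine behaves like `java.util.regex`; `QuantifierModes` and `firstMatch` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModes {

    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        // Greedy: takes as much as possible, then backs off to the last '>'.
        System.out.println(firstMatch("<.*>", input));   // <a><b>
        // Reluctant: takes as little as possible, stopping at the first '>'.
        System.out.println(firstMatch("<.*?>", input));  // <a>
        // Possessive: takes everything and never gives any back,
        // so the closing '>' can no longer be matched.
        System.out.println(firstMatch("<.*+>", input));  // null
    }
}
```

Possessive quantifiers are mainly useful when you know backtracking cannot help and want the match to fail quickly.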
A double backslash \\ either makes an alphanumeric character a construct (e.g. \\n means newline), or allows a special character to be quoted (e.g. use \\{ if you want to match the { character). If you escape an alphabetic character that is not a construct when escaped, you will get an error.
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:
- A newline (line feed) character ('\\n'),
- A carriage-return character followed immediately by a newline character ("\\r\\n"),
- A standalone carriage-return character ('\\r'),
- A next-line character ('\\u0085'),
- A line-separator character ('\\u2028'), or
- A paragraph-separator character ('\\u2029').
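Line terminators affect both the `.` construct and the `^` boundary matcher. The following sketch shows this in plain Java, assuming `java.util.regex` behaviour; the class and method names are hypothetical, and the `DOTALL` and `MULTILINE` flags are Java's equivalents of the `s` and `m` match flags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineTerminators {

    // True if the whole string s is matched by a single '.'.
    static boolean dotMatches(String s, boolean dotAll) {
        int flags = dotAll ? Pattern.DOTALL : 0;
        return Pattern.compile(".", flags).matcher(s).matches();
    }

    // Counts line starts by finding every position where ^ matches
    // in MULTILINE mode.
    static int countLineStarts(String s) {
        Matcher m = Pattern.compile("^", Pattern.MULTILINE).matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(dotMatches("x", false));   // true
        System.out.println(dotMatches("\n", false));  // false: '.' skips line terminators by default
        System.out.println(dotMatches("\n", true));   // true: DOTALL lets '.' match them
        // A carriage return plus newline counts as one terminator,
        // and the line separator character is a terminator too.
        System.out.println(countLineStarts("a\r\nb\u2028c")); // 3
    }
}
```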
Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
...
Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total.
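The numbering rule can be checked with a short Java sketch that prints each group of a match. It assumes `java.util.regex` behaviour; `GroupNumbering` and `groups` are hypothetical names. Group 0 is always the whole match and is not counted in the group total.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {

    // Returns group 0 (the whole match) followed by each numbered
    // capturing group, or an empty array if input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Four capturing groups, numbered by their opening parentheses.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC")));
        // prints [ABC, ABC, A, BC, C]

        // (?:A) is non-capturing, so (B) is group 1.
        System.out.println(Arrays.toString(groups("(?:A)(B)", "AB")));
        // prints [AB, B]
    }
}
```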