XRegExp provides four new flags (n, s, x, A), which can be combined with native flags and arranged in any order. Unlike native flags, non-native flags do not show up as properties on regular expression objects.

New flags:
n — Named capture only
s — Dot matches all (singleline) — Added as a native flag in ES2018, but XRegExp always supports it
x — Free-spacing and line comments (extended)
A — 21-bit Unicode properties (astral) — Requires the Unicode Base addon

Native flags:
g — All matches, or advance lastIndex after matches (global)
i — Case insensitive (ignoreCase)
m — ^ and $ match at newlines (multiline)
u — Handle surrogate pairs as code points and enable \u{…} and \p{…} (unicode) — Requires native ES6 support
y — Matches must start at lastIndex (sticky) — Requires Firefox 3+ or native ES6 support
d — Include indices for capturing groups on match results (hasIndices) — Requires native ES2022 support

Named capture only (n)

Specifies that the only captures are explicitly named groups of the form (?<name>…). This allows unnamed (…) parentheses to act as noncapturing groups without the syntactic clumsiness of the expression (?:…).
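
As a point of comparison, the following sketch shows in native JavaScript the capture behavior that flag n changes. It makes no XRegExp call and assumes an engine with ES2018 named groups; the variable names are illustrative only. Under flag n, the plain (…) in the first pattern would be expected to behave like the (?:…) in the second.

```javascript
// Native JavaScript only (ES2018 named groups); no XRegExp call is made here.
const date = '2024-05';

// Unnamed parentheses capture natively, whether or not the capture is wanted.
const capturing = /(\d{4})-(?<month>\d{2})/.exec(date);
console.log(capturing.length);      // 3: whole match plus two captures

// Natively, grouping without capturing needs the (?:…) spelling.
const explicit = /(?:\d{4})-(?<month>\d{2})/.exec(date);
console.log(explicit.length);       // 2: whole match plus the named group
console.log(explicit.groups.month); // "05"
```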

The n flag frees the (…) syntax from its often-undesired capturing side effect, while still allowing explicitly-named capturing groups.
The n flag is illegal in native JavaScript regular expressions.
The n flag comes from .NET, where it's called "explicit capture."

Dot matches all (s)

Usually, a dot does not match newlines. However, a mode in which dots match any code unit (including newlines) can be as useful as one where dots don't. The s flag allows the mode to be selected on a per-regex basis. Escaped dots (\.) and dots within character classes ([.]) are always equivalent to literal dots. The newline code points are as follows:
U+000A — Line feed — \n
U+000D — Carriage return — \r
U+2028 — Line separator
U+2029 — Paragraph separator

Without this mode, matching any single code unit requires a workaround such as [\s\S], [\0-\uFFFF], [^] (JavaScript only; doesn't work in some browsers without XRegExp), or, god forbid, (.|\s) (which requires unnecessary backtracking).
The s flag is illegal in native JavaScript regular expressions prior to ES2018.
The s flag comes from Perl.

When using XRegExp's Unicode Properties addon, you can match any code point without using the s flag via \p{Any}.
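
The behavior of the dot with and without flag s can be checked against each of the four newline code points. This is a small sketch using the native s flag (ES2018) rather than XRegExp, on the assumption that both behave the same way for this case; the variable names are illustrative.

```javascript
// The four newline code points that a dot does not match by default.
const newlines = ['\n', '\r', '\u2028', '\u2029'];

// Without flag s, a dot matches none of them.
const withoutS = newlines.map((ch) => /^.$/.test(ch));

// With flag s (native since ES2018), a dot matches all of them.
const withS = newlines.map((ch) => /^.$/s.test(ch));

// A workaround that needs no flag at all.
const workaround = newlines.map((ch) => /^[\s\S]$/.test(ch));

console.log(withoutS);   // [ false, false, false, false ]
console.log(withS);      // [ true, true, true, true ]
console.log(workaround); // [ true, true, true, true ]
```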

Free-spacing and line comments (x)

This flag has two complementary effects. First, it causes all whitespace recognized natively by \s to be ignored, so you can free-format the regex pattern for readability. Second, it allows comments with a leading #. Specifically, it turns whitespace into an "ignore me" metacharacter, and # into an "ignore me and everything else up to the next newline" metacharacter. They aren't taken as metacharacters within character classes (which means that classes are not free-format even with x, following precedent from most other regex libraries that support x), and as with other metacharacters, you can escape whitespace and # that you want to be taken literally. Of course, you can always use \s to match whitespace.
It might be better to think of whitespace and comments as do-nothing (rather than ignore-me) metacharacters. This distinction is important with something like \12 3, which with the x flag is taken as \12 followed by 3, and not \123. However, quantifiers following whitespace or comments apply to the preceding token, so x + is equivalent to x+.
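
The do-nothing model can be sketched in plain JavaScript. The helper below, stripFreeSpacing, is hypothetical and is not XRegExp's implementation: it is a simplified model that assumes comments end at a line feed and that a following ?, *, + or { is a quantifier. It replaces each run of whitespace and comments with an empty group unless a quantifier follows, and it leaves escapes and character classes untouched.

```javascript
// Hypothetical helper for illustration; this is not XRegExp's implementation.
function stripFreeSpacing(pattern) {
  let out = '';
  let inClass = false;
  let i = 0;
  while (i < pattern.length) {
    const ch = pattern[i];
    if (ch === '\\' && i + 1 < pattern.length) {
      // Escaped characters, including escaped whitespace and \#, are kept.
      out += ch + pattern[i + 1];
      i += 2;
    } else if (inClass) {
      // Character classes are not free-format.
      if (ch === ']') inClass = false;
      out += ch;
      i += 1;
    } else if (ch === '[') {
      inClass = true;
      out += ch;
      i += 1;
    } else if (/\s/.test(ch) || ch === '#') {
      // Skip a whole run of whitespace and line comments.
      while (i < pattern.length && (/\s/.test(pattern[i]) || pattern[i] === '#')) {
        if (pattern[i] === '#') {
          while (i < pattern.length && pattern[i] !== '\n') i += 1;
        } else {
          i += 1;
        }
      }
      // Keep neighboring tokens apart, unless a quantifier follows.
      const next = pattern[i];
      if (next !== undefined && !/[?*+{]/.test(next)) out += '(?:)';
    } else {
      out += ch;
      i += 1;
    }
  }
  return out;
}

// "\12 3" stays two tokens, while "x +" collapses to "x+".
const octalThenThree = new RegExp('^' + stripFreeSpacing('\\12 3') + '$');
console.log(octalThenThree.test('\n3')); // true: \12 (line feed) followed by 3
console.log(octalThenThree.test('S'));   // false: not read as \123
console.log(stripFreeSpacing('x +'));    // "x+"

const yearMonth = new RegExp(stripFreeSpacing(`
  (\\d{4})  # year
  -         # separator
  (\\d{2})  # month
`));
console.log(yearMonth.exec('2024-05').slice(1)); // [ '2024', '05' ]
```

The empty group (?:) is one way to model a do-nothing token in native syntax: it matches nothing but still ends the preceding escape sequence.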
The ignored whitespace characters are those matched natively by \s. ES3 whitespace is based on Unicode 2.1.0 or later. ES5 whitespace is based on Unicode 3.0.0 or later, plus U+FEFF. Following are the code points that should be matched by \s according to ES5 and Unicode 4.0.1:
U+0009 — Tab — \t
U+000A — Line feed — \n
U+000B — Vertical tab — \v
U+000C — Form feed — \f
U+000D — Carriage return — \r
U+0020 — Space
U+00A0 — No-break space
U+1680 — Ogham space mark
U+180E — Mongolian vowel separator
U+2000 — En quad
U+2001 — Em quad
U+2002 — En space
U+2003 — Em space
U+2004 — Three-per-em space
U+2005 — Four-per-em space
U+2006 — Six-per-em space
U+2007 — Figure space
U+2008 — Punctuation space
U+2009 — Thin space
U+200A — Hair space
U+2028 — Line separator
U+2029 — Paragraph separator
U+202F — Narrow no-break space
U+205F — Medium mathematical space
U+3000 — Ideographic space
U+FEFF — Zero width no-break space

The x flag is illegal in native JavaScript regular expressions.
The x flag comes from Perl, and was originally inspired by Jeffrey Friedl's pretty-printing of complex regexes.

Unicode 1.1.5–4.0.0 assigned code point U+200B (ZWSP) to the Zs (Space separator) category, which means that some browsers or regex engines might include this additional code point in those matched by \s, etc. Unicode 4.0.1 moved ZWSP to the Cf (Format) category.
Unicode 1.1.5 assigned code point U+FEFF (ZWNBSP) to the Zs category. Unicode 2.0.14 moved ZWNBSP to the Cf category. ES5 explicitly includes ZWNBSP in its list of whitespace characters, even though this does not match any version of the Unicode standard since 1996.
U+180E (Mongolian vowel separator) was introduced in Unicode 3.0.0, which assigned it the Cf category. Unicode 4.0.0 moved it into the Zs category, and Unicode 6.3.0 moved it back to the Cf category.
JavaScript's \s is similar but not equivalent to \p{Z} (the Separator category) from regex libraries that support Unicode categories, including XRegExp's own Unicode Categories addon. The difference is that \s includes code points U+0009–U+000D and U+FEFF, which are not assigned the Separator category in the Unicode character database.
JavaScript's \s is nearly equivalent to \p{White_Space} from the Unicode Properties addon. The differences are: 1. \p{White_Space} does not include U+FEFF (ZWNBSP), and 2. \p{White_Space} includes U+0085 (NEL), which is not assigned the Separator category in the Unicode character database.
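
The two differences between \s and \p{White_Space} can be checked directly. This sketch uses native property escapes (ES2018, flag u) in place of the XRegExp Unicode Properties addon, on the assumption that both follow the same Unicode White_Space property; the variable names are illustrative.

```javascript
// Native property escapes (ES2018, flag u) stand in for the XRegExp addon here.
const zwnbsp = '\uFEFF'; // zero width no-break space
const nel = '\u0085';    // next line

const whitespaceChecks = {
  sMatchesZwnbsp: /\s/.test(zwnbsp),                        // true
  whiteSpaceMatchesZwnbsp: /\p{White_Space}/u.test(zwnbsp), // false
  sMatchesNel: /\s/.test(nel),                              // false
  whiteSpaceMatchesNel: /\p{White_Space}/u.test(nel),       // true
};
console.log(whitespaceChecks);
```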
Aside: Not all JavaScript regex syntax is Unicode-aware. According to JavaScript specs, \s, \S, ., ^, and $ use Unicode-based interpretations of whitespace and newline, while \d, \D, \w, \W, \b, and \B use ASCII-only interpretations of digit, word character, and word boundary. Some browsers and browser versions get aspects of these details wrong.
For more details, see JavaScript, Regex, and Unicode.
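
The split described in the aside is easy to observe in a standards-compliant engine. The sample characters below are arbitrary picks for illustration.

```javascript
const arabicIndicThree = '\u0663'; // a decimal digit outside ASCII
const eAcute = '\u00E9';           // é, a letter outside ASCII
const emSpace = '\u2003';          // whitespace outside ASCII

const awareness = {
  digit: /\d/.test(arabicIndicThree),     // false: \d is ASCII-only
  word: /\w/.test(eAcute),                // false: \w is ASCII-only
  space: /\s/.test(emSpace),              // true: \s is Unicode-based
  boundary: /^caf\b/.test('caf\u00E9'),   // true: \b treats é as a non-word character
};
console.log(awareness);
```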

21-bit Unicode properties (A)

Requires the Unicode Base addon.
By default, \p{…} and \P{…} support the Basic Multilingual Plane (i.e. code points up to U+FFFF). You can opt in to full 21-bit Unicode support (with code points up to U+10FFFF) on a per-regex basis by using flag A. In XRegExp, this is called astral mode. You can automatically add flag A for all new regexes by running XRegExp.install('astral'). When in astral mode, \p{…} and \P{…} always match a full code point rather than a code unit, using surrogate pairs for code points above U+FFFF.

    // Using flag A to match astral code points
    XRegExp('^\\p{S}$').test('💩'); // -> false
    XRegExp('^\\p{S}$', 'A').test('💩'); // -> true
    XRegExp('(?A)^\\p{S}$').test('💩'); // -> true

    // Using surrogate pair U+D83D U+DCA9 to represent U+1F4A9 (pile of poo)
    XRegExp('(?A)^\\p{S}$').test('\uD83D\uDCA9'); // -> true

    // Implicit flag A
    XRegExp.install('astral');
    XRegExp('^\\p{S}$').test('💩'); // -> true

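
The difference between a code unit and a code point, which underlies astral mode, can be seen with native JavaScript alone. This sketch uses the native u flag rather than flag A, only to illustrate how a surrogate pair is counted; it does not show XRegExp's own handling.

```javascript
const poo = '\uD83D\uDCA9'; // U+1F4A9 written as a surrogate pair

const codeUnits = poo.length;        // 2 code units
const codePoints = [...poo].length;  // 1 code point
const dotPerUnit = /^.$/.test(poo);  // false: the dot consumes one code unit
const dotPerPoint = /^.$/u.test(poo); // true: with flag u the dot consumes one code point
const hex = poo.codePointAt(0).toString(16); // "1f4a9"

console.log(codeUnits, codePoints, dotPerUnit, dotPerPoint, hex);
```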
Important: Opting in to astral mode disables the use of \p{…} and \P{…} within character classes. In astral mode, use e.g. (\p{L}|[0-9_])+ instead of [\p{L}0-9_]+.

The A flag is illegal in native JavaScript regular expressions.