WARNING: This page is under development and its content is untested.
XRegExp.addToken
================
The XRegExp.addToken function lets you easily extend the regex language by searching for a specific
pattern (your new token) and providing a handler that returns natively-supported regex syntax to
replace it. Some examples will help show how this works. None of the following example tokens are
included in XRegExp.
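To make the mechanism concrete, here is a minimal sketch of token expansion in plain JavaScript. It is not XRegExp's implementation: the expandTokens function, the shape of the token objects, and the numeric scope values are all assumptions made for illustration. The sketch walks the pattern, tracks whether the current position is inside a character class, and lets the first matching token's handler supply the replacement syntax.

```javascript
// Hypothetical stand-ins for XRegExp's scope constants (the values are assumed).
var OUTSIDE_CLASS = 1;
var INSIDE_CLASS = 2;

// tokens: array of {regex, handler, scope} objects (an illustrative shape).
function expandTokens(pattern, tokens) {
    var output = "";
    var pos = 0;
    var inClass = false;
    while (pos < pattern.length) {
        var scope = inClass ? INSIDE_CLASS : OUTSIDE_CLASS;
        var rest = pattern.slice(pos);
        var handled = false;
        for (var i = 0; i < tokens.length && !handled; i++) {
            if (!(tokens[i].scope & scope)) continue;
            // Anchor the token regex so it can only match at the current position.
            var match = new RegExp("^(?:" + tokens[i].regex.source + ")").exec(rest);
            if (match && match[0].length > 0) {
                output += tokens[i].handler(match, scope);
                pos += match[0].length;
                handled = true;
            }
        }
        if (handled) continue;
        var ch = pattern.charAt(pos);
        if (ch === "\\") {
            // Copy an escaped character verbatim so it never triggers a token.
            output += pattern.substr(pos, 2);
            pos += 2;
            continue;
        }
        if (ch === "[" && !inClass) inClass = true;
        else if (ch === "]" && inClass) inClass = false;
        output += ch;
        pos += 1;
    }
    return output;
}

var bell = {
    regex: /\\a/,
    handler: function () {return "\\x07";},
    scope: INSIDE_CLASS | OUTSIDE_CLASS
};
var source = expandTokens("[\\a]x\\a", [bell]); // "[\\x07]x\\x07"
new RegExp(source).test("\x07x\x07"); // true
```

Because escaped characters are copied through untouched, an escaped backslash followed by "a" is left alone in this sketch.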
Custom Escape
-------------
Many regex flavors include the token \a for matching the bell control character (ASCII position 0x07).
JavaScript does not, so let's add it.
[script]
XRegExp.addToken(
/\\a/,
function () {return "\\x07"},
XRegExp.INSIDE_CLASS | XRegExp.OUTSIDE_CLASS
);
[/script]
The \a token can now be used inside and outside character classes.
Making Syntax More Descriptive
------------------------------
Here's another simple one. Perl 5.10 and PCRE 7 include a (*FAIL) token that works equivalently to
\b\B. Let's add it.
[script]
XRegExp.addToken(
/\(\*FAIL\)/,
function () {return "\\b\\B"}
);
XRegExp("(*FAIL)").test(anyString); // returns false regardless of the value of anyString
[/script]
With this token, the third argument (scope) is omitted, and therefore defaults to
XRegExp.OUTSIDE_CLASS.
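The replacement \b\B can be checked in native JavaScript without XRegExp. The sample strings below are arbitrary and only serve as a quick sanity check, not a proof.

```javascript
// \b asserts a word boundary and \B asserts the absence of one at the same
// position, so the combination fails at every position of every string.
var fail = /\b\B/;
var samples = ["", "a", "hello world", " ", "(*FAIL)"];
var anyMatched = false;
for (var i = 0; i < samples.length; i++) {
    if (fail.test(samples[i])) anyMatched = true;
}
anyMatched; // false
```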
Shorthand
---------
Next, let's add a way to embed a longer pattern by name. In this case, our token will match
variations of the copyright symbol.
[script]
XRegExp.addToken(
/{copy}/,
function () {return "(?:\\xA9|\\([Cc]\\)|&copy;|&\\#169;|&\\#x[Aa]9;)";}
);
var copy = XRegExp("{copy}");
copy.validate("©"); // true
copy.validate("(c)"); // true
copy.validate("&copy;"); // true
copy.validate("&#169;"); // true
copy.validate("&#xA9;"); // true
[/script]
(validate is a method that XRegExp adds to RegExp.prototype. It's similar to
RegExp.prototype.test, but only returns true if the regex matches the entire string, rather
than somewhere within it.)
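The difference between test and a whole-string check can be sketched with a native RegExp. The matchesWhole helper below is hypothetical and only approximates the behavior described above; it is not XRegExp's validate method, and it carries over only the ignoreCase flag.

```javascript
// Hypothetical helper: true only if the regex matches the entire string.
function matchesWhole(regex, str) {
    var flags = regex.ignoreCase ? "i" : "";
    // Wrap the source in a non-capturing group so alternation stays intact.
    return new RegExp("^(?:" + regex.source + ")$", flags).test(str);
}

var copy = /\xA9|\([Cc]\)/;
copy.test("(c) 2009");          // true: matches somewhere within the string
matchesWhole(copy, "(c) 2009"); // false: the match does not cover the whole string
matchesWhole(copy, "(C)");      // true
```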
Like the last example, the scope of this token defaults to
XRegExp.OUTSIDE_CLASS.
A few things to note:
- After adding tokens, escaped characters automatically continue to work as expected.
E.g., XRegExp("\\{copy}") will not trigger this token, and XRegExp("\\\\{copy}") will.
- [TODO: Check if this is needed.] The pound signs (#) are escaped, so that this token will continue to work correctly
with a regex that uses the "x" (free-spacing and comments) flag.
Vertical Whitespace (and Its Inverse)
-------------------------------------
Here's another, more useful feature from Perl 5.10 and PCRE 7. \v matches any single vertical
whitespace character, and \V matches any other character. As the name denotes, vertical
whitespace does not include horizontal whitespace like spaces and tabs.
[script]
XRegExp.addToken(
/\\([Vv])/,
function (match, scope) {
var range = "\\n-\\r\\x85\\u2028\\u2029",
invRange = "\\0-\\t\\x0E-\\x84\\x86-\\u2027\\u202A-\\uFFFF",
negated = match[1] == "V";
return scope == XRegExp.INSIDE_CLASS ?
(negated ? invRange : range) :
"[" + (negated ? "^" : "") + range + "]";
},
XRegExp.INSIDE_CLASS | XRegExp.OUTSIDE_CLASS
);
[/script]
Here we're using the match and scope arguments that are provided to token handlers.
The third addToken argument specifies that this token should apply both inside and
outside character classes. We then use the scope argument within the handler to
provide different handling depending on whether the token was found inside or outside
of a character class. Alternatively, you could add two separate tokens, one with the
third argument set to XRegExp.OUTSIDE_CLASS, and the other with it set to
XRegExp.INSIDE_CLASS.
This token also uses the handler's match argument to get to backreference 1 from our
token's capturing group ([Vv]).
This works pretty well, but note that we cannot detect [\0-\v] or [\0-\V] as invalid
character class ranges. So don't do that. :-)
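For the handler to be correct, the two range strings have to be exact complements of each other. The following sketch checks that in native JavaScript by testing every UTF-16 code unit against both classes. It copies the range strings from the handler above and does not use XRegExp.

```javascript
var range = "\\n-\\r\\x85\\u2028\\u2029";
var invRange = "\\0-\\t\\x0E-\\x84\\x86-\\u2027\\u202A-\\uFFFF";
var vertical = new RegExp("^[" + range + "]$");
var other = new RegExp("^[" + invRange + "]$");

var verticalCount = 0;
var mismatches = 0;
for (var code = 0; code <= 0xFFFF; code++) {
    var ch = String.fromCharCode(code);
    var isVertical = vertical.test(ch);
    if (isVertical) verticalCount++;
    // Exactly one of the two classes should accept each code unit.
    if (isVertical === other.test(ch)) mismatches++;
}
verticalCount; // 7
mismatches; // 0
```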
Implementing the reverse -- \h and \H for horizontal whitespace -- is left as an exercise.
Platform-Independent Line Separator
-----------------------------------
[script]
// line separator: CRLF, CR, LF, etc.
XRegExp.addToken(
/\\R/,
function () {return "(?:\\r\\n?|[\\n-\\f\\x85\\u2028\\u2029])";}
);
[/script]
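The pattern returned by this handler can be tried directly as a native regex. The sketch below uses it to split a string that mixes several line break styles; the sample text is made up.

```javascript
// Same alternation as the handler's return value, written as a regex literal.
var lineBreak = /(?:\r\n?|[\n-\f\x85\u2028\u2029])/;
var lines = "one\r\ntwo\rthree\nfour\u2028five".split(lineBreak);
// lines is ["one", "two", "three", "four", "five"]
```

Listing \r\n? first means a CRLF pair is consumed as a single separator rather than as two.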
Variable Interpolation, Activated Via a New Custom Flag
-------------------------------------------------------
[script]
window.data = "check * me * out";
XRegExp.addToken(
/\${([^}]+)}/,
function (match) {return XRegExp.escape(String(window[match[1]]));},
XRegExp.OUTSIDE_CLASS,
function () {return this.hasFlag("$");}
);
XRegExp("${data}", "$").test("check * me * out"); // true
[/script]
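- The interpolated value is passed through XRegExp.escape, so its regex metacharacters
(here, the asterisks) are matched literally.
- The fourth addToken argument is a function that decides whether the token applies; here
it returns true only when the regex uses the custom $ flag.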
...
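The escaping step is what makes interpolation safe. Here is a rough native sketch of the same idea. Both escapeRegex (a stand-in for XRegExp.escape) and interpolate are hypothetical helpers, and unlike a real token this simple replace would also expand an escaped \${name}.

```javascript
// Stand-in for XRegExp.escape: backslash-escapes regex metacharacters.
function escapeRegex(str) {
    return str.replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
}

// Hypothetical helper: replaces ${name} with the escaped value of data[name].
function interpolate(pattern, data) {
    return pattern.replace(/\$\{([^}]+)\}/g, function (whole, name) {
        return escapeRegex(String(data[name]));
    });
}

var source = interpolate("^${data}$", {data: "check * me * out"});
// source is "^check \\* me \\* out$"
new RegExp(source).test("check * me * out"); // true
```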
POSIX Character Classes
-----------------------
[script]
XRegExp.addToken(
/\[:([a-z\d]+):]/i,
function () {
var POSIX = {
alnum: "A-Za-z0-9",
alpha: "A-Za-z",
ascii: "\\0-\\x7F",
blank: " \\t",
cntrl: "\\0-\\x1F\\x7F",
digit: "0-9",
graph: "\\x21-\\x7E",
lower: "a-z",
print: "\\x20-\\x7E",
punct: "!\"#$%&'()*+,\\-./:;<=>?@[\\\\\\]^_`{|}~",
space: " \\t\\r\\n\\v\\f",
upper: "A-Z",
word: "A-Za-z0-9_",
xdigit: "A-Fa-f0-9"
};
// this is the actual handler function, which has access to POSIX through closure
return function (match) {
if (!POSIX[match[1]])
throw SyntaxError(match[1] + " is not a valid POSIX character class");
return POSIX[match[1]];
};
}(),
XRegExp.INSIDE_CLASS
);
XRegExp("[[:xdigit:]]+").validate("00A9"); // true
[/script]
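This token uses the XRegExp.INSIDE_CLASS scope only, so [:name:] has no special meaning
outside character classes. Note that [\0-[:alnum:]] won't be detected as an invalid range.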
...
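The lookup-table approach can be sketched without XRegExp as a plain string expansion. The expandPosix helper is hypothetical, the table is a subset of the one above, and unlike a scoped token it does not check that [:name:] appears inside a character class.

```javascript
// A subset of the table above, enough for illustration.
var POSIX = {
    alpha: "A-Za-z",
    digit: "0-9",
    xdigit: "A-Fa-f0-9"
};

// Hypothetical helper: expands every [:name:] in a pattern string.
function expandPosix(pattern) {
    return pattern.replace(/\[:([a-z\d]+):\]/gi, function (whole, name) {
        // hasOwnProperty avoids accepting inherited names such as "constructor".
        if (!POSIX.hasOwnProperty(name))
            throw new SyntaxError(name + " is not a valid POSIX character class");
        return POSIX[name];
    });
}

var hex = new RegExp("^" + expandPosix("[[:xdigit:]]+") + "$");
hex.test("00A9"); // true
hex.test("00G9"); // false
```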
Escape Sequence
---------------
Add support for literal-text escape sequences: \Q...\E, or \Q... running to the end of the pattern.
[script]
// escape sequence
XRegExp.addToken(
/\\Q([\s\S]*?)(?:\\E|$)/,
function (match) {return XRegExp.escape(match[1]);},
XRegExp.INSIDE_CLASS | XRegExp.OUTSIDE_CLASS
);
new XRegExp("^\\Q({?*+})").test("({?*+})"); // true
[/script]
...
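The same transformation can be sketched as a plain string rewrite. The expandQuoted and escapeRegex helpers are hypothetical; escapeRegex stands in for XRegExp.escape, and no scope handling is attempted.

```javascript
// Stand-in for XRegExp.escape: backslash-escapes regex metacharacters.
function escapeRegex(str) {
    return str.replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
}

// Hypothetical helper: a literal run ends at \E or at the end of the pattern.
function expandQuoted(pattern) {
    return pattern.replace(/\\Q([\s\S]*?)(?:\\E|$)/g, function (whole, literal) {
        return escapeRegex(literal);
    });
}

expandQuoted("^\\Q({?*+})"); // "^\\(\\{\\?\\*\\+\\}\\)"
expandQuoted("\\Qa.b\\E+");  // "a\\.b+"
new RegExp(expandQuoted("^\\Q({?*+})")).test("({?*+})"); // true
```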
Subpattern Definition (Oniguruma-Style)
---------------------------------------
[script]
XRegExp.addToken(
/\(\?<([$\w]+)>([\s\S]*?)\)\{0}/,
function (match) {
if (!this.defs)
this.defs = {};
this.defs[match[1]] = match[2];
return "";
}
);
XRegExp.addToken(
/\\g<([$\w]+)>/,
function (match) {
if (!this.defs || this.defs[match[1]] === undefined)
throw SyntaxError("subpattern definition not found: " + match[1]);
return "(?:" + this.defs[match[1]] + ")";
}
);
// The group names "ref" and "smiley" are illustrative.
XRegExp("(?<ref>\\1|\\2){0} \
(?<smiley>(?::-[D)])+){0} \
([ab]) ([12]) \\g<ref>+\
\\g<smiley> ", "x").test("a22aaa:-):-D"); // true
[/script]
- adds this.defs for data storage during regex construction.
- does not support using capturing groups within the subpattern definitions!
- does not support nested subpattern definitions.
- does not support x and s flags within the subpattern definitions.
- does not support using \g<...> with names of capturing groups that are not
followed by {0}.
- subpattern definitions end at the first occurrence of "){0}", so something
like "(?<name>(?:.){0}){0}" does not work correctly.
- \g<...> in Oniguruma is similar to (?&...) in Perl/PCRE.
...
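The define-then-reference idea can be sketched as two string passes in plain JavaScript. The expandSubpatterns helper and the names "ref" and "smiley" are made up for this example. Unlike the tokens above, the sketch collects every definition before resolving references, so a definition may appear after its use.

```javascript
// Hypothetical helper: collects (?<name>...){0} definitions, removes them,
// then replaces each \g<name> with the stored text in a non-capturing group.
function expandSubpatterns(pattern) {
    var defs = {};
    var body = pattern.replace(/\(\?<([$\w]+)>([\s\S]*?)\)\{0\}/g, function (whole, name, def) {
        defs[name] = def;
        return "";
    });
    return body.replace(/\\g<([$\w]+)>/g, function (whole, name) {
        if (!defs.hasOwnProperty(name))
            throw new SyntaxError("subpattern definition not found: " + name);
        return "(?:" + defs[name] + ")";
    });
}

var source = expandSubpatterns(
    "(?<ref>\\1|\\2){0}" +
    "(?<smiley>(?::-[D)])+){0}" +
    "([ab])([12])\\g<ref>+\\g<smiley>"
);
// source is "([ab])([12])(?:\\1|\\2)+(?:(?::-[D)])+)"
new RegExp("^" + source + "$").test("a22aaa:-):-D"); // true
```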
For more examples, take a look at the XRegExp source code.