WARNING: This page is under development and its content is untested.

XRegExp.addToken
================

The XRegExp.addToken function lets you extend the regex language by searching for a specific pattern (your new token) and providing a handler that returns natively supported regex syntax to replace it. Some examples will help show how this works. None of the following example tokens are included in XRegExp.

Custom Escape
-------------

Many regex flavors include the token \a for matching the bell control character (ASCII position 0x07). JavaScript does not, so let's add it.

[script]
XRegExp.addToken(
    /\\a/,
    function () {return "\\x07";},
    XRegExp.INSIDE_CLASS | XRegExp.OUTSIDE_CLASS
);
[/script]

The \a token can now be used inside and outside character classes.

Making Syntax More Descriptive
------------------------------

Here's another simple one. Perl 5.10 and PCRE 7 include a (*FAIL) token that works equivalently to \b\B: it can never match. Let's add it.

[script]
XRegExp.addToken(
    /\(\*FAIL\)/,
    function () {return "\\b\\B";}
);

XRegExp("(*FAIL)").test(anyString); // returns false regardless of the value of anyString
[/script]

With this token, the third argument (scope) is omitted, and therefore defaults to XRegExp.OUTSIDE_CLASS.

Shorthand
---------

Next, let's add a way to embed a longer pattern by name. In this case, our token will match variations of the copyright symbol.

[script]
XRegExp.addToken(
    /{copy}/,
    function () {return "(?:\\xA9|\\([Cc]\\)|&copy;|&\\#169;|&\\#x[Aa]9;)";}
);

var copy = XRegExp("{copy}");
copy.validate("©");      // true
copy.validate("(c)");    // true
copy.validate("&copy;"); // true
copy.validate("&#169;"); // true
copy.validate("&#xA9;"); // true
[/script]

(validate is a method that XRegExp adds to RegExp.prototype. It's similar to RegExp.prototype.test, but only returns true if the regex matches the entire string, rather than somewhere within it.)

Like the last example, the scope of this token defaults to XRegExp.OUTSIDE_CLASS.
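To see what the {copy} handler's output does on its own, here is a minimal sketch that uses only native RegExp, with no XRegExp involved. The helper name matchesWhole is hypothetical; it approximates the whole-string behavior described for validate by wrapping the source in ^(?:...)$.

```javascript
// The same native source string that the {copy} handler returns.
var copySource = "(?:\\xA9|\\([Cc]\\)|&copy;|&\\#169;|&\\#x[Aa]9;)";

// Hypothetical stand-in for validate: true only if the whole string matches.
function matchesWhole(source, str) {
    return new RegExp("^(?:" + source + ")$").test(str);
}

var samples = ["\u00A9", "(c)", "&copy;", "&#169;", "&#xA9;"];
var results = samples.map(function (s) {
    return matchesWhole(copySource, s);
});

// A string that merely contains a copyright variant is not a whole-string match.
var partial = matchesWhole(copySource, "(c) 2010");
```

Each entry of results is true, while partial is false because of the trailing text.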
A few things to note:

- After adding tokens, escaped characters automatically continue to work as expected. E.g., XRegExp("\\{copy}") will not trigger this token, and XRegExp("\\\\{copy}") will.
- [TODO: Check if this is needed.] The pound signs (#) are escaped so that this token continues to work correctly with a regex that uses the "x" (free-spacing and comments) flag.

Vertical Whitespace (and Its Inverse)
-------------------------------------

Here's another, more useful feature from Perl 5.10 and PCRE 7. \v matches any single vertical whitespace character, and \V matches any other character. As the name suggests, vertical whitespace does not include horizontal whitespace like spaces and tabs.

[script]
XRegExp.addToken(
    /\\([Vv])/,
    function (match, scope) {
        var range = "\\n-\\r\\x85\\u2028\\u2029",
            invRange = "\\0-\\t\\x0E-\\x84\\x86-\\u2027\\u202A-\\uFFFF",
            negated = match[1] == "V";
        return scope == XRegExp.INSIDE_CLASS ?
            (negated ? invRange : range) :
            "[" + (negated ? "^" : "") + range + "]";
    },
    XRegExp.INSIDE_CLASS | XRegExp.OUTSIDE_CLASS
);
[/script]

Here we're using the match and scope arguments that are provided to token handlers. The third addToken argument specifies that this token should apply both inside and outside character classes. We then use the scope argument within the handler to provide different handling depending on whether the token was found inside or outside of a character class. Alternatively, you could add two separate tokens, one with the third argument set to XRegExp.OUTSIDE_CLASS, and the other with it set to XRegExp.INSIDE_CLASS.

This token also uses the handler's match argument to get to backreference 1 from our token's capturing group ([Vv]).

This works pretty well, but note that we cannot detect [\0-\v] or [\0-\V] as invalid character class ranges. So don't do that. :-)

Implementing the reverse -- \h and \H for horizontal whitespace -- is left as an exercise.
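Because the handler returns invRange directly when \V appears inside a character class, the two range strings need to be exact complements over the 16-bit code unit space. The following sketch checks that with native RegExp only; it does not call XRegExp, and the variable names are just illustrative.

```javascript
// The same range strings used by the \v and \V handler.
var range = "\\n-\\r\\x85\\u2028\\u2029";
var invRange = "\\0-\\t\\x0E-\\x84\\x86-\\u2027\\u202A-\\uFFFF";

// Native character classes built from those strings.
var vertical = new RegExp("^[" + range + "]$");
var notVertical = new RegExp("^[" + invRange + "]$");

// Count code units that match both classes or neither class.
var mismatches = 0;
var verticalCount = 0;
for (var code = 0; code <= 0xFFFF; code++) {
    var ch = String.fromCharCode(code);
    var isVertical = vertical.test(ch);
    if (isVertical) verticalCount++;
    if (isVertical === notVertical.test(ch)) mismatches++;
}
```

After the loop, mismatches is 0 and verticalCount is 7 (U+000A through U+000D, U+0085, U+2028, and U+2029). The same kind of check may be useful when working out ranges for \h and \H.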
Platform-Independent Line Separator
-----------------------------------

[script]
// line separator: CRLF, CR, LF, etc.
XRegExp.addToken(
    /\\R/,
    function () {return "(?:\\r\\n?|[\\n-\\f\\x85\\u2028\\u2029])";}
);
[/script]

Variable Interpolation, Activated Via a New Custom Flag
-------------------------------------------------------

[script]
window.data = "check * me * out";

XRegExp.addToken(
    /\${([^}]+)}/,
    function (match) {return XRegExp.escape(String(window[match[1]]));},
    XRegExp.OUTSIDE_CLASS,
    function () {return this.hasFlag("$");}
);

XRegExp("${data}", "$").test("check * me * out"); // true
[/script]

- escaped
- $ flag

...

POSIX Character Classes
-----------------------

[script]
XRegExp.addToken(
    /\[:([a-z\d]+):]/i,
    function () {
        var POSIX = {
            alnum:  "A-Za-z0-9",
            alpha:  "A-Za-z",
            ascii:  "\\0-\\x7F",
            blank:  " \\t",
            cntrl:  "\\0-\\x1F\\x7F",
            digit:  "0-9",
            graph:  "\\x21-\\x7E",
            lower:  "a-z",
            print:  "\\x20-\\x7E",
            punct:  "!\"#$%&'()*+,\\-./:;<=>?@[\\\\\\]^_`{|}~",
            space:  " \\t\\r\\n\\v\\f",
            upper:  "A-Z",
            word:   "A-Za-z0-9_",
            xdigit: "A-Fa-f0-9"
        };
        // this is the actual handler function, which has access to POSIX through closure
        return function (match) {
            if (!POSIX[match[1]])
                throw SyntaxError(match[1] + " is not a valid POSIX character class");
            return POSIX[match[1]];
        };
    }(),
    XRegExp.INSIDE_CLASS
);

XRegExp("[[:xdigit:]]+").validate("00A9"); // true
[/script]

This token applies inside character classes only (XRegExp.INSIDE_CLASS). Note that [\0-[:alnum:]] won't be detected as an invalid range.

...

Escape Sequence
---------------

Add support for escape sequences: \Q...\E, and \Q... without a closing \E, which escapes everything to the end of the pattern.

[script]
// escape sequence
XRegExp.addToken(
    /\\Q([\s\S]*?)(?:\\E|$)/,
    function (match) {return XRegExp.escape(match[1]);},
    XRegExp.INSIDE_CLASS | XRegExp.OUTSIDE_CLASS
);

new XRegExp("^\\Q({?*+})").test("({?*+})"); // true
[/script]

...
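As a rough illustration of what the \Q...\E token does, the sketch below performs the same expansion on a plain pattern string using only native RegExp. The helpers escapeRegex and expandQuoted are hypothetical names, not part of XRegExp, and escapeRegex is a simplified stand-in for XRegExp.escape. Unlike a real token, this sketch does not skip escaped backslashes (such as \\Q) and knows nothing about character class scope.

```javascript
// Simplified stand-in for XRegExp.escape: backslash-escape regex metacharacters.
function escapeRegex(str) {
    return str.replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
}

// Hypothetical helper: expand \Q...\E (or \Q... to end of pattern) in a pattern string.
function expandQuoted(pattern) {
    return pattern.replace(/\\Q([\s\S]*?)(?:\\E|$)/g, function (whole, literal) {
        // A function replacement avoids special handling of "$" in the result.
        return escapeRegex(literal);
    });
}

var unterminated = expandQuoted("^\\Q({?*+})");
var terminated = expandQuoted("a\\Q.*\\Eb+");

var matchesLiteral = new RegExp(unterminated).test("({?*+})");
var matchesDotStar = new RegExp("^" + terminated + "$").test("a.*bb");
var rejectsOther = new RegExp("^" + terminated + "$").test("axyzbb");
```

Here matchesLiteral and matchesDotStar are true, and rejectsOther is false because the quoted .* is treated as two literal characters.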
Subpattern Definition (Oniguruma-Style)
---------------------------------------

[script]
XRegExp.addToken(
    /\(\?<([$\w]+)>([\s\S]*?)\)\{0}/,
    function (match) {
        if (!this.defs)
            this.defs = {};
        this.defs[match[1]] = match[2];
        return "";
    }
);

XRegExp.addToken(
    /\\g<([$\w]+)>/,
    function (match) {
        if (!this.defs || this.defs[match[1]] === undefined)
            throw SyntaxError("subpattern definition not found: " + match[1]);
        return "(?:" + this.defs[match[1]] + ")";
    }
);

XRegExp("(?<any_backref>\\1|\\2){0} \
         (?<smiles>(?::-[D)])+){0}  \
         ([ab]) ([12])              \
         \\g<any_backref>+          \
         \\g<smiles>                ", "x").test("a22aaa:-):-D"); // true
[/script]

Notes on this implementation:

- It adds this.defs for data storage during regex construction.
- It does not support using capturing groups within the subpattern definitions!
- It does not support nested subpattern definitions.
- It does not support the x and s flags within the subpattern definitions.
- It does not support using \g<...> with names of capturing groups that are not followed by {0}.
- Subpattern definitions end at the first occurrence of "){0}", so something like "(?<x>(?:.){0}){0}" does not work correctly.
- \g<...> in Oniguruma is similar to (?&...) in Perl/PCRE.

...

For more examples, take a look at the XRegExp source code.
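The two subpattern tokens above can also be pictured as a plain string transformation. The sketch below is a standalone approximation using only native RegExp; expandSubpatterns is a hypothetical helper, not part of XRegExp. It assumes the same limitations listed above, and it differs from the token version in one respect: it collects all definitions in a first pass, so a \g<...> reference may appear before its definition.

```javascript
// Hypothetical helper: expand (?<name>...){0} definitions and \g<name>
// references in a pattern string into native regex source.
function expandSubpatterns(pattern) {
    var defs = {};

    // Pass 1: record each definition and remove it from the pattern.
    var withoutDefs = pattern.replace(
        /\(\?<([$\w]+)>([\s\S]*?)\)\{0\}/g,
        function (whole, name, body) {
            defs[name] = body;
            return "";
        }
    );

    // Pass 2: replace each reference with its body in a non-capturing group.
    return withoutDefs.replace(/\\g<([$\w]+)>/g, function (whole, name) {
        if (!Object.prototype.hasOwnProperty.call(defs, name)) {
            throw new SyntaxError("subpattern definition not found: " + name);
        }
        return "(?:" + defs[name] + ")";
    });
}

// Same pattern as the example above, written without free-spacing whitespace.
var source = expandSubpatterns(
    "(?<any_backref>\\1|\\2){0}" +
    "(?<smiles>(?::-[D)])+){0}" +
    "([ab])([12])\\g<any_backref>+\\g<smiles>"
);

var matched = new RegExp("^" + source + "$").test("a22aaa:-):-D");
```

With this input, source is ([ab])([12])(?:\1|\2)+(?:(?::-[D)])+) and matched is true.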
