This plugin adds support for Unicode categories and blocks. It follows the Unicode 5.2 character database, which is the latest version as of 2010-03-10.
The Unicode plugin enables the following Unicode categories/properties:
- \p{L}: Letter
- \p{M}: Mark
- \p{N}: Number
- \p{P}: Punctuation
- \p{S}: Symbol
- \p{Z}: Separator
- \p{C}: Other (control, format, private use, surrogate, and unassigned codes)

It also enables all 150 blocks into which Unicode 5.2 divides the code points U+0000 through U+FFFF. Unicode blocks use the prefix "In", following Perl and Java (.NET uses "Is"). The supported blocks are listed below in alphabetical order:
- \p{InAlphabeticPresentationForms}
- \p{InArabic}
- \p{InArabicPresentationFormsA}
- \p{InArabicPresentationFormsB}
- \p{InArabicSupplement}
- \p{InArmenian}
- \p{InArrows}
- \p{InBalinese}
- \p{InBamum}
- \p{InBasicLatin}
- \p{InBengali}
- \p{InBlockElements}
- \p{InBopomofo}
- \p{InBopomofoExtended}
- \p{InBoxDrawing}
- \p{InBraillePatterns}
- \p{InBuginese}
- \p{InBuhid}
- \p{InCham}
- \p{InCherokee}
- \p{InCJKCompatibility}
- \p{InCJKCompatibilityForms}
- \p{InCJKCompatibilityIdeographs}
- \p{InCJKRadicalsSupplement}
- \p{InCJKStrokes}
- \p{InCJKSymbolsandPunctuation}
- \p{InCJKUnifiedIdeographs}
- \p{InCJKUnifiedIdeographsExtensionA}
- \p{InCombiningDiacriticalMarks}
- \p{InCombiningDiacriticalMarksforSymbols}
- \p{InCombiningDiacriticalMarksSupplement}
- \p{InCombiningHalfMarks}
- \p{InCommonIndicNumberForms}
- \p{InControlPictures}
- \p{InCoptic}
- \p{InCurrencySymbols}
- \p{InCyrillic}
- \p{InCyrillicExtendedA}
- \p{InCyrillicExtendedB}
- \p{InCyrillicSupplement}
- \p{InDevanagari}
- \p{InDevanagariExtended}
- \p{InDingbats}
- \p{InEnclosedAlphanumerics}
- \p{InEnclosedCJKLettersandMonths}
- \p{InEthiopic}
- \p{InEthiopicExtended}
- \p{InEthiopicSupplement}
- \p{InGeneralPunctuation}
- \p{InGeometricShapes}
- \p{InGeorgian}
- \p{InGeorgianSupplement}
- \p{InGlagolitic}
- \p{InGreekandCoptic}
- \p{InGreekExtended}
- \p{InGujarati}
- \p{InGurmukhi}
- \p{InHalfwidthandFullwidthForms}
- \p{InHangulCompatibilityJamo}
- \p{InHangulJamo}
- \p{InHangulJamoExtendedA}
- \p{InHangulJamoExtendedB}
- \p{InHangulSyllables}
- \p{InHanunoo}
- \p{InHebrew}
- \p{InHighPrivateUseSurrogates}
- \p{InHighSurrogates}
- \p{InHiragana}
- \p{InIdeographicDescriptionCharacters}
- \p{InIPAExtensions}
- \p{InJavanese}
- \p{InKanbun}
- \p{InKangxiRadicals}
- \p{InKannada}
- \p{InKatakana}
- \p{InKatakanaPhoneticExtensions}
- \p{InKayahLi}
- \p{InKhmer}
- \p{InKhmerSymbols}
- \p{InLao}
- \p{InLatinExtendedAdditional}
- \p{InLatinExtendedA}
- \p{InLatinExtendedB}
- \p{InLatinExtendedC}
- \p{InLatinExtendedD}
- \p{InLatin1Supplement}
- \p{InLepcha}
- \p{InLetterlikeSymbols}
- \p{InLimbu}
- \p{InLisu}
- \p{InLowSurrogates}
- \p{InMalayalam}
- \p{InMathematicalOperators}
- \p{InMeeteiMayek}
- \p{InMiscellaneousMathematicalSymbolsA}
- \p{InMiscellaneousMathematicalSymbolsB}
- \p{InMiscellaneousSymbols}
- \p{InMiscellaneousSymbolsandArrows}
- \p{InMiscellaneousTechnical}
- \p{InModifierToneLetters}
- \p{InMongolian}
- \p{InMyanmar}
- \p{InMyanmarExtendedA}
- \p{InNewTaiLue}
- \p{InNKo}
- \p{InNumberForms}
- \p{InOgham}
- \p{InOlChiki}
- \p{InOpticalCharacterRecognition}
- \p{InOriya}
- \p{InPhagspa}
- \p{InPhoneticExtensions}
- \p{InPhoneticExtensionsSupplement}
- \p{InPrivateUseArea}
- \p{InRejang}
- \p{InRunic}
- \p{InSamaritan}
- \p{InSaurashtra}
- \p{InSinhala}
- \p{InSmallFormVariants}
- \p{InSpacingModifierLetters}
- \p{InSpecials}
- \p{InSundanese}
- \p{InSuperscriptsandSubscripts}
- \p{InSupplementalArrowsA}
- \p{InSupplementalArrowsB}
- \p{InSupplementalMathematicalOperators}
- \p{InSupplementalPunctuation}
- \p{InSylotiNagri}
- \p{InSyriac}
- \p{InTagalog}
- \p{InTagbanwa}
- \p{InTaiLe}
- \p{InTaiTham}
- \p{InTaiViet}
- \p{InTamil}
- \p{InTelugu}
- \p{InThaana}
- \p{InThai}
- \p{InTibetan}
- \p{InTifinagh}
- \p{InUnifiedCanadianAboriginalSyllabics}
- \p{InUnifiedCanadianAboriginalSyllabicsExtended}
- \p{InVai}
- \p{InVariationSelectors}
- \p{InVedicExtensions}
- \p{InVerticalForms}
- \p{InYiRadicals}
- \p{InYiSyllables}
- \p{InYijingHexagramSymbols}

In accordance with the Unicode standard, casing, spaces, hyphens, and underscores are ignored when comparing block names. Hence, \p{InLatinExtendedA} is equivalent to \p{InLatin Extended-A} and \p{in latin extended a}.
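The loose name matching described above can be pictured as a normalization step applied before names are compared. The following is only a sketch of that idea: `normalizeBlockName` is a hypothetical helper written for illustration, not a function exposed by XRegExp or the plugin, and the plugin's internal approach may differ.

```javascript
// Hypothetical helper (not an XRegExp API): removes spaces, hyphens, and
// underscores, then lowercases, so differently written names compare equal.
function normalizeBlockName(name) {
  return name.replace(/[\s_-]+/g, "").toLowerCase();
}

var a = normalizeBlockName("InLatinExtendedA");
var b = normalizeBlockName("InLatin Extended-A");
var c = normalizeBlockName("in latin extended a");

// All three spellings reduce to the same key
var allEqual = a === b && b === c; // -> true
```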
All properties and blocks can be inverted by using an uppercase P. For example, \P{N} matches any code point that is not in the Number category. \P{InArabic} matches code points that are not in the Arabic block.
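To make the inversion concrete without loading the plugin, the sketch below uses JavaScript's native Unicode property escapes (ES2018, requires the `u` flag) as a stand-in. This is an assumption for illustration: it needs a modern engine, and the native syntax does not include block names such as \p{InArabic}. With the plugin, the equivalent patterns would be passed to XRegExp as strings, for example XRegExp("^\\P{N}+$").

```javascript
// Native ES2018 property escapes used as a stand-in for the plugin's syntax
var number = /^\p{N}+$/u;     // only code points in the Number category
var notNumber = /^\P{N}+$/u;  // uppercase P inverts the category

var results = [
  number.test("2010"),    // digits are Numbers
  notNumber.test("2010"),
  number.test("abc"),     // letters are not Numbers
  notNumber.test("abc")
];
// -> [true, false, false, true]
```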
IMPORTANT: Unicode properties and blocks cannot currently be used within character classes, although support is planned for a future version. In the meantime, you can emulate this usage with alternation and/or lookahead, as shown in the following table:
| Instead of: | Use: |
|---|---|
| `[\p{N}]` | `\p{N}` |
| `[\p{N}a-z~]` | `(?:\p{N}\|[a-z~])` |
| `[\p{N}\P{Z}]` | `(?:\p{N}\|\P{Z})` |
| `[\p{N}\P{Z}a-z~]` | `(?:\p{N}\|\P{Z}\|[a-z~])` |
| `[^\p{N}]` | `\P{N}` |
| `[^\p{N}a-z~]` | `(?:(?!\p{N})[^a-z~])` |
| `[^\p{N}\P{Z}]` | `(?:(?!\p{N}\|\P{Z})[\s\S])` |
| `[^\p{N}\P{Z}a-z~]` | `(?:(?!\p{N}\|\P{Z})[^a-z~])` |
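The equivalences in the table can be checked with plain JavaScript by substituting a built-in shorthand for the Unicode token. In this sketch, `\d` is only a stand-in for \p{N}, chosen because it is legal inside a character class, so the class form and the emulated form can be compared directly. The sample characters are arbitrary.

```javascript
// Stand-in for illustration only: \d plays the role of \p{N}
var classPositive = /^[\da-z~]$/;           // analogous to [\p{N}a-z~]
var emulatedPositive = /^(?:\d|[a-z~])$/;   // analogous to (?:\p{N}|[a-z~])

var classNegated = /^[^\da-z~]$/;               // analogous to [^\p{N}a-z~]
var emulatedNegated = /^(?:(?!\d)[^a-z~])$/;    // analogous to (?:(?!\p{N})[^a-z~])

var samples = ["5", "q", "~", "Q", " ", "-"];

// Each emulated pattern should accept and reject exactly the same characters
var positiveAgrees = samples.every(function (ch) {
  return classPositive.test(ch) === emulatedPositive.test(ch);
});
var negatedAgrees = samples.every(function (ch) {
  return classNegated.test(ch) === emulatedNegated.test(ch);
});
// -> both true
```

The negative lookahead does the work in the negated case: it rejects the position if the excluded token would match there, and the remaining negated class then consumes one character.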
Additionally, Unicode subcategories like \p{Sc} (currency symbol) and scripts like \p{Latin} are not currently supported. For more information about the use of Unicode in regular expressions, see regexp.info/unicode.html.
To activate this plugin, simply load it after loading XRegExp 1.0 or later.
<script src="xregexp.js"></script>
<script src="xregexp-unicode.js"></script>
<script>
var unicodeWord = XRegExp("^\\p{L}+$");
unicodeWord.test("Русский"); // -> true
unicodeWord.test("日本語"); // -> true
unicodeWord.test("العربية"); // -> true
</script>
Download the Unicode plugin (7.1 KB when minified and gzipped).
The Match Recursive plugin adds the following function to the XRegExp namespace:
XRegExp.matchRecursive(string, left, right, [flags], [options])

Accepts a string to search, left and right delimiters as regex pattern strings, optional regex flags, and optional extended options. Returns an array of matches, allowing nested instances of the left and right delimiters. Use the g flag to return all matches; otherwise, only the first is returned.
var input = "(t((e))s)t()(ing)";
var output = XRegExp.matchRecursive(input, "\\(", "\\)");
// -> ["t((e))s"]
// Global match
output = XRegExp.matchRecursive(input, "\\(", "\\)", "g");
// -> ["t((e))s", "", "ing"]
// Unbalanced delimiter on the left or right
output = XRegExp.matchRecursive("<<t>est", "<", ">", "g");
output = XRegExp.matchRecursive("<t>>est", "<", ">", "g");
// Both lines throw an error
// Ignoring escaped delimiters
input = "t\\{e\\\\{s{t\\{i}ng}";
output = XRegExp.matchRecursive(input, "{", "}", "g", {escapeChar: "\\"});
// -> ["s{t\\{i}ng"]
// Extended information mode with valueNames
input = "HTML: <div id='x'>A <div>nested <div /></div> element.</div>";
// The left delimiter is designed to skip self-closed <div /> elements
output = XRegExp.matchRecursive(input, "<div\\b(?:[^>](?!/>))*>", "</div>", "i",
    {valueNames: ["text", "left", "match", "right"]});
/* ->
[ ["text", "HTML: ", 0, 6],
["left", "<div id='x'>", 6, 18],
["match", "A <div>nested <div /></div> element.", 18, 54],
["right", "</div>", 54, 60] ]
*/
// Omitting unneeded parts with null valueNames
input = "...{1}..{function(a,b){return a+b;}}";
output = XRegExp.matchRecursive(input, "{", "}", "g",
    {valueNames: ["literal", null, "value", null]});
/* ->
[ ["literal", "...", 0, 3],
["value", "1", 4, 5],
["literal", "..", 6, 8],
["value", "function(a,b){return a+b;}", 9, 35] ]
*/
/* The matchRecursive function specifically supports the y flag (sticky mode).
This mode requires the first match to appear at the beginning of the string,
with each subsequent match immediately following the last. (Outside of the
matchRecursive function, the y flag requires native browser support.) */
input = "<1><2><3>4<5>";
output = XRegExp.matchRecursive(input, "<", ">", "gy");
// -> ["1", "2", "3"]
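For readers curious how nested delimiters can be matched at all, the sketch below shows one common approach: scan for either delimiter and track the nesting depth with a counter. It is a simplified illustration, not the plugin's implementation. `matchBalanced` is a hypothetical name, it always returns all outermost matches, and it supports none of the flags or options described above (no escapeChar, valueNames, or sticky mode).

```javascript
// Hypothetical helper, not part of XRegExp: returns the contents of every
// outermost left/right pair. `left` and `right` are regex pattern strings.
function matchBalanced(str, left, right) {
  // One global regex finds either delimiter; group 1 is set only for `left`
  var delimiter = new RegExp("(" + left + ")|(?:" + right + ")", "g");
  var results = [];
  var depth = 0;   // current nesting level
  var start = 0;   // index just after the outermost left delimiter
  var m;
  while ((m = delimiter.exec(str)) !== null) {
    if (m[0] === "") {
      // Guard against zero-length delimiter matches looping forever
      delimiter.lastIndex++;
      continue;
    }
    if (m[1] !== undefined) {
      // Left delimiter: remember where the outermost content begins
      if (depth === 0) start = delimiter.lastIndex;
      depth++;
    } else {
      // Right delimiter: it must close something that is open
      if (depth === 0) throw new Error("Unbalanced right delimiter at index " + m.index);
      depth--;
      if (depth === 0) results.push(str.slice(start, m.index));
    }
  }
  if (depth > 0) throw new Error("Unbalanced left delimiter");
  return results;
}

var nested = matchBalanced("(t((e))s)t()(ing)", "\\(", "\\)");
// -> ["t((e))s", "", "ing"]
```

Only the outermost pairs produce results because a match is recorded when the depth counter returns to zero, which is also why an unmatched delimiter on either side can be detected and reported.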
Download the Match Recursive plugin (0.8 KB when minified and gzipped).