<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Dynamic Regex Highlighting with Javascript! (Rev:20110421_2300)</title>
<meta name="author" content="Jeff Roberson">
<meta name="version" content="20110421_2300">
<meta name="License" content="MIT: http://www.opensource.org/licenses/mit-license.php">
<style type="text/css" media="screen">
body {margin: 2em; color:#333; background:#DDB;}
div {margin: 0; padding: 0em 1.5em 1em; border: 2px solid #555; background:#EEC;}
pre {margin: 1em 0; padding: 1em; border: 2px solid #555; font-size: 1.2em; overflow: auto;}
h1 {font-family: monospace; text-align: center;}
.regex_err {color: #FFF; background-color: #F00;}
.regex_hl {color: #FFF; background-color: #060;}
samp {color: #080; font-weight: bold;}
</style>
<script type="text/javascript" src="DynamicRegexHighlighter.js"></script>
<!-- To add regex syntax colorizing to this page, get jsresyntaxhighlighter.js and
jsresyntaxhighlighter.css from: http://stevenlevithan.com/regex/syntaxhighlighter/.
Both: (copyright) 2010 Steven Levithan <http://stevenlevithan.com>, MIT License.
Copy both to this file's folder then uncomment the following three lines. -->
<!--
<link rel="stylesheet" type="text/css" href="jsresyntaxhighlighter.css">
<script type="text/javascript" src="jsresyntaxhighlighter.js"></script>
<script type="text/javascript" src="ColorizeRegexSyntax.js"></script>
-->
</head>
<body>
<div>
<h1 class="regex_x">Dynamic (?:Regex Highlighting)++ with Javascript!</h1>
<p style="text-align: center; font: bold 1.2em monospace;"><a title="Download latest version from Github" href="http://github.com/jmrware/DynamicRegexHighlighter/archives/master">Version 20110421_2300</a></p>
<p id="headerlinks">Interactive tester: <a title="Test the highlighter with your own regular expressions" href="DynamicRegexHighlighterTester.html">DynamicRegexHighlighterTester.html</a>.</p>
<p>This page demonstrates, documents and tests the Javascript dynamic regex highlighter script: <tt><a title="Source listing on Github repository" href="http://github.com/jmrware/DynamicRegexHighlighter/blob/master/DynamicRegexHighlighter.js">DynamicRegexHighlighter.js</a></tt>. To see the highlighting in action, place the mouse cursor over the various parts of the regular expressions below. The components which are highlighted include: matching character class delimiters (and any quantifiers), comments, comment groups and matching (possibly nested) group delimiters (and their quantifiers). When moused over, these regex components are highlighted in <span class="regex_hl">THIS COLOR</span> (or more precisely, in whatever style is defined by the CSS class: <tt>.regex_hl</tt>). The script identifies erroneous unbalanced parentheses and these are displayed in <span class="regex_err">RED</span> (or more precisely, in whatever style is defined by the CSS class: <tt>.regex_err</tt>). When the mouse is placed over a numbered capturing group, a tooltip appears indicating the capture group number (e.g. "Capture group $3"). Note, however, that if the regex has a "branch reset" construct, <tt class="regex_x">(?|(like)|(this)|(one))</tt>, then the script is not smart enough to compute capture group numbers beyond that point. In this case, affected capture group numbers are simply not included in the tooltips.</p>
<p>Each regular expression from the script is presented below in two formats: 1.) fully commented in free-spacing format, and 2.) uncommented in native Javascript format. Additionally, an extended example pseudo regex is provided which tests the various <a title="The Perl Compatible Regular Expression library" href="http://www.pcre.org/">PCRE</a> regex constructs that this script recognizes. During the page load process (which can take quite some time on pages that contain many large, highly complex regexes), progress information is displayed in the browser's status bar at the bottom of the page (if the browser is set up to allow this). Once the page is loaded, the script is idle except for <tt>mouseover</tt> and <tt>mouseout</tt> events, which are handled very quickly. When the page is unloaded, the script once again comes back to life to free up the memory it allocated and to null out all its references to DOM node objects (to prevent the nefarious memory leaks that otherwise occur in IE).</p>
<h2 class="regex_x">Usage: (easy as one two three!)</h2>
<ol>
<li>Add a script tag to the document head element to include: <tt>DynamicRegexHighlighter.js</tt>.<br>e.g. <samp><script type="text/javascript" src="DynamicRegexHighlighter.js"></script></samp></li>
<li>Add a <tt>.regex_hl</tt> class selector to the stylesheet to define what the highlighted regex text should look like. To show unbalanced parentheses as visible errors, add a <tt>.regex_err</tt> class selector to the stylesheet before the <tt>.regex_hl</tt> rule.<br>
e.g. <samp><style type="text/css">.regex_err {color: #FFF; background-color: #F00;}<br>
.regex_hl {color: #333; background-color: #0F0;}</style></samp></li>
<li>Wrap the regex to be highlighted in an element having either <tt>class="regex"</tt> or <tt>class="regex_x"</tt>. The regular expression should be valid and in <em title="i.e. Extra backslashes may confuse the script">native regex format</em> and should have all "<tt><</tt>", "<tt>></tt>" and "<tt>&</tt>" characters converted to HTML entities, so that the web page is valid. If the regex is written in free-spacing mode with <tt>#comments</tt> (i.e. with Perl syntax "x" modifier set), use the "<tt>.regex_x</tt>" class variation, and wrap it in a <tt>PRE</tt> tag (to preserve whitespace for IE).<br>
e.g. <samp class="regex_x"><h1 class="regex">Dynamic (?:Regex Highlighting)++ with Javascript!</h1></samp><br>
e.g. <samp><pre class="regex_x"><span class="regex_x">Free spacing ("x" mode) regex with #comments here.</span></pre></samp></li>
</ol>
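<p>For pages that are generated by a script, the entity conversion described in step 3 can be sketched as follows. This is only an illustration: the helper names <tt>escapeRegexForHtml</tt> and <tt>wrapRegex</tt> are hypothetical and are not part of <tt>DynamicRegexHighlighter.js</tt>.</p>

```javascript
// Hypothetical helper (not part of DynamicRegexHighlighter.js): convert the
// three HTML-special characters of a regex source string to entities.
function escapeRegexForHtml(source) {
    return source
        .replace(/&/g, "&amp;")   // must run first, or "&lt;" would be re-escaped
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
}

// Hypothetical helper: build the wrapper element described in step 3.
// Pass freeSpacing = true for regexes written in "x" mode with #comments.
function wrapRegex(source, freeSpacing) {
    var cls = freeSpacing ? "regex_x" : "regex";
    return '<pre class="' + cls + '">' + escapeRegexForHtml(source) + "</pre>";
}

var html = wrapRegex("(?<=a&b)c", false);
// html is now: <pre class="regex">(?&lt;=a&amp;b)c</pre>
```

<p>The resulting string is meant to be written into the page source, so that the markup is already present when the highlighter runs during page load.</p>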
<h2>Adding HTML markup to the Regexes:</h2>
<p>You can apply HTML markup to the regex and the dynamic highlighter script will still work correctly as long as you follow a few rules. There are several atomic multi-character regex tokens which must not be split up by an HTML opening or closing tag. These <em>not-to-be-interrupted</em> regex tokens include the following:</p>
<ul>
<li>The opening sequence of a group is atomic:<br>
OK:
<samp title="Non-capture group"><B>(?:</B>...)</samp>,
<samp title="Non-capture group"><B>(?:...</B>)</samp>,
<samp title="Non-capture group"><B>(?:...)</B></samp>,
<samp title="Non-capture group">(?:<B>...)</B></samp>,
<samp title="Non-capture group">(?:...<B>)</B></samp>.<br>
BAD:
<samp title="Non-capture group"><B>(</B>?:...)</samp>,
<samp title="Non-capture group"><B>(?</B>:...)</samp>,
<samp title="Non-capture group">(<B>?:...)</B></samp>,
<samp title="Non-capture group">(?<B>:...)</B></samp>.<br>
MORE BAD:
<samp title="Atomic group"><B>(</B>?>...)</samp>,
<samp title="Positive lookahead"><B>(?</B>=...)</samp>,
<samp title="Positive lookbehind"><B>(?<</B>=...)</samp>,
<samp title="Negative lookahead">(<B>?!...</B>)</samp>,
<samp title="Negative lookbehind">(?<B><!...)</B></samp>, etc.</li>
<li>These escaped metacharacters "<tt>()|[]\#</tt>" are atomic:<br>
OK:
<samp title="Literal left parenthesis"><B>\(</B></samp>,
<samp title="Literal right parenthesis"><B>\)</B></samp>,
<samp title="Literal OR pipe character"><B>\|</B></samp>,
<samp title="Left square bracket"><B>\[</B></samp>,
<samp title="Right square bracket"><B>\]</B></samp>,
<samp title="Backslash"><B>\\</B></samp>,
<samp title="Pound sign"><B>\#</B></samp>.<br>
ALSO OK: You <em title="but it's probably not a good idea!">may</em> split up other escaped metacharacters:
<samp title="Space char class">\<B>s</B></samp>,
<samp title="Word char class"><B>\</B>w</samp>,
<samp title="">\<B>2</B></samp>,
<samp title=""><B>\</B>u89AF</samp>, etc.<br>
BAD:
<samp title="Literal left parenthesis">\<B>(</B></samp>,
<samp title="Literal right parenthesis"><B>\</B>)</samp>,
<samp title="Literal OR pipe character">\<B>|</B></samp>,
<samp title="Left square bracket"><B>\</B>[</samp>,
<samp title="Right square bracket">\<B>]</B></samp>,
<samp title="Backslash">\<B>\</B></samp>,
<samp title="Pound sign"><B>\</B>#</samp>.
</li>
<li>The opening sequence of a character class is atomic:<br>
OK:
<samp title="Negative character class"><B>[^</B>...]</samp>,
<samp title="Positive class with ] first char"><B>[]</B>...]</samp>,
<samp title="Negative class with ] first char"><B>[^]</B>...]</samp>,
<samp title="Negative character class"><B>[^...</B>]</samp>,
<samp title="Negative character class">[^<B>...</B>]</samp>,
<samp title="Negative character class">[^<B>...]</B></samp>,
<samp title="Negative character class">[^...<B>]</B></samp>.<br>
BAD:
<samp title="Negative character class"><B>[</B>^...]</samp>,
<samp title="Positive class with ] first char"><B>[</B>]...]</samp>,
<samp title="Negative class with ] first char">[<B>^]</B>...]</samp>,
<samp title="Negative class with ] first char">[^<B>]</B>...]</samp>, etc.
</li>
<li>An embedded POSIX character class is atomic:<br>
OK:
<samp title="POSIX character class"><B>[...</B>[:alpha:]...]</samp>,
<samp title="Negated POSIX character class">[...<B>[:^alpha:]</B>...]</samp>,
<samp title="POSIX character class">[...[:alpha:]<B>...]</B></samp>.<br>
BAD:
<samp title="POSIX character class">[...<B>[</B>:alpha:]...]</samp>,
<samp title="POSIX character class">[...<B>[:</B>alpha:]...]</samp>,
<samp title="Negated POSIX character class">[...<B>[:^al</B>pha:]...]</samp>,
<samp title="POSIX character class">[...[:alp<B>ha:]</B>...]</samp>, etc.<br>
</li>
<li>Quantifiers applied to character classes and groups are atomic:<br>
OK:
<samp title="Possessive plus"><B>[A-F]</B>++</samp>,
<samp title="Lazy star">(A|B)<B>*?</B></samp>,
<samp title="10 or more">[ABC]<B>{10,}</B></samp>,
<samp title="From 3 to 5 possessively">(\s\w+)<B>{3,5}+</B></samp>.<br>
ALSO OK: You <em title="but it's probably not a good idea!">may</em> split up other quantifiers:
<samp title="One or more word chars (possessive)"><B>\w+</B>+</samp>,
<samp title="Zero or more spaces (lazy)">\s*<B>?</B></samp>,
<samp title="More than 10 backslashes">\\<B>{10</B>,}</samp>,
<samp title="From 3 to 5 X chars (possessive)">X{3,<B>5}+</B></samp>.<br>
BAD:
<samp title="Possessive plus"><B>[A-F]+</B>+</samp>,
<samp title="Lazy star">(A|B)*<B>?</B></samp>,
<samp title="10 or more">[ABC]<B>{10</B>,}</samp>,
<samp title="From 3 to 5 possessively">(\s\w+){<B>3,5}+</B></samp>.<br>
</li>
</ul>
<p>Note that these rules apply only to the <tt>DynamicRegexHighlighter.js</tt> script and have nothing to do with regular expression syntax itself. For the ultimate regular expression tutorial and reference, <a href="http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124">Mastering Regular Expressions (3rd Edition)</a> by Jeffrey Friedl is the definitive guide and is highly recommended by this author.</p>
<p><strong>Colorization note:</strong> Steven Levithan's color <a href="http://stevenlevithan.com/regex/syntaxhighlighter/">JavaScript Regex Syntax Highlighter</a> script does <em>not</em> allow HTML markup in its input text. However, its generated output (which does contain lots of HTML markup) may be fed into this script. Thus, if you are using both scripts, be sure to apply the colorizer script first (as is demonstrated on this and the tester page). Note, however, that if a non-Javascript regex is fed into the colorizer script, it will identify many common valid regex constructs (such as possessive quantifiers, lookbehind, etc.) as errors, and will also mark up these errors in a way that violates the markup rules described above (e.g. it will split up a "<tt>++</tt>" possessive quantifier to show the second plus as an error). The dynamic highlighter will then misinterpret these expressions. Bottom line: when using the colorizer, only feed it <em>Javascript syntax</em> regular expressions!</p>
<h2>Regular Expression Goodness:</h2>
<p>Following are all the regexes from the <tt>DynamicRegexHighlighter.js</tt> script. First up is the Phase 1 regex used to parse regexes having commenting turned on (i.e. regexes with the Perl style "x" modifier set). This first listing is presented in verbose format with a liberal sprinkling of comments and indentation for clarity.</p>
<pre class="regex_x" id="n1"># Rev:20100913_0900 github.com/jmrware/DynamicRegexHighlighter
# re_1_cmt: Match character classes, comment groups, HTML tags, and comments.
( [^[(#<\\]+(?:\\[^<][^[(#<\\]*)* # $1: Everything else (starting w/non-escape)
| (?:\\[^<][^[(#<\\]*)+ # or everything else (starting w/escape).
) # End $1. (Note: No escaped "\<" allowed.)
| (\[\^?) # $2: Character class opening delim.
( # $3: Character class contents.
\]? # Unescaped ] allowed if first char.
[^[\]\\]*(?:\\[\S\s][^[\]\\]*)* # Non-[], escaped-anything (normal*).
(?: \[ # Allow a non-escaped "[", and it
(?::\^?\w+:\])? # may be embedded POSIX char class.
[^[\]\\]*(?:\\[\S\s][^[\]\\]*)* # More non-[], escaped-anything.
)* # Unroll-the-loop (special normal*)*
) # End $3. Character class contents.
\] # Character class closing delimiter.
((?:</?\w+\b[^>]*>)*) # $4: HTML tags between "]" and quantifier.
((?:(?:[?*+]|\{\d+(?:,\d*)?\})[+?]?)?) # $5: Optional char class quantifier.
| (\((?!\?\#)) # $6: Opening "(" (non comment group).
| (\(\?\#[^)]*\)) # $7: Comment group (cmt_grp).
| ((?:</?\w+\b[^>]*>)+) # $8: Embedded HTML tags (open or close).
| (\#.*) # $9: Comment (cmt).
</pre>
<p>Here is the exact same regex in its raw, uncommented native Javascript format as it appears in the script (with a few added newlines to avoid it going off screen).</p>
<pre class="regex">var re_1_cmt = /([^[(#<\\]+(?:\\[^<][^[(#<\\]*)*|(?:\\[^<][^[(#<\\]*)+)|
(\[\^?)(\]?[^[\]\\]*(?:\\[\S\s][^[\]\\]*)*(?:\[(?::\^?\w+:\])?[^[\]\\]*
(?:\\[\S\s][^[\]\\]*)*)*)\]((?:<\/?\w+\b[^>]*>)*)((?:(?:[?*+]|\{\d+(?:,\d*)?
\})[+?]?)?)|(\((?!\?#))|(\(\?#[^)]*\))|((?:<\/?\w+\b[^>]*>)+)|(#.*)/g;</pre>
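<p>The character class portion of the regex above can be tried out in isolation. The following sketch copies that portion into a standalone regex, leaving out the $4 HTML-tag group, so the capture groups are renumbered: $1 is the opening delimiter, $2 the contents and $3 the quantifier. The names <tt>re_charclass</tt> and <tt>listCharClasses</tt> are made up for this illustration and do not appear in the script.</p>

```javascript
// Character class portion of re_1_cmt, copied without the HTML-tag group.
// $1: opening delimiter, $2: contents, $3: optional quantifier.
var re_charclass = /(\[\^?)(\]?[^[\]\\]*(?:\\[\S\s][^[\]\\]*)*(?:\[(?::\^?\w+:\])?[^[\]\\]*(?:\\[\S\s][^[\]\\]*)*)*)\]((?:(?:[?*+]|\{\d+(?:,\d*)?\})[+?]?)?)/g;

// Hypothetical helper: list every character class found in a regex string.
function listCharClasses(regexText) {
    var found = [];
    var m;
    re_charclass.lastIndex = 0;           // start scanning from the beginning
    while ((m = re_charclass.exec(regexText)) !== null) {
        found.push({ open: m[1], contents: m[2], quantifier: m[3] });
    }
    return found;
}

// Input regex text: x[^]a\]b]++y[[:alpha:]0-9]{2,}
var classes = listCharClasses("x[^]a\\]b]++y[[:alpha:]0-9]{2,}");
// classes[0]: open "[^", contents "]a\]b",        quantifier "++"
// classes[1]: open "[",  contents "[:alpha:]0-9", quantifier "{2,}"
```

<p>Note how the first class starts with an unescaped "]" and contains an escaped one, and how the second class contains an embedded POSIX class. Both are matched as a single unit, together with the trailing quantifier.</p>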
<p>Following are the remaining regular expressions from the script in both commented and non-commented formats.</p>
<pre class="regex_x"># Rev:20100913_0900 github.com/jmrware/DynamicRegexHighlighter
# re_1_nocmt: Match character classes and comment groups (no comments).
( [^[(\\]+(?:\\[\S\s][^[(\\]*)* # $1: Everything else (starting w/non-escape)
| (?:\\[\S\s][^[(\\]*)+ # or everything else (starting w/escape).
) # End $1.
| (\[\^?) # $2: Character class opening delim.
( # $3: Character class contents.
\]? # Unescaped ] allowed if first char.
[^[\]\\]*(?:\\[\S\s][^[\]\\]*)* # Non-[], escaped-anything (normal*).
(?: \[ # Allow a non-escaped "[", and it
(?::\^?\w+:\])? # may be embedded POSIX char class.
[^[\]\\]*(?:\\[\S\s][^[\]\\]*)* # More non-[], escaped-anything.
)* # Unroll-the-loop (special normal*)*
) # End $3. Character class contents.
\] # Character class closing delimiter.
((?:</?\w+\b[^>]*>)*) # $4: HTML tags between "]" and quantifier.
((?:(?:[?*+]|\{\d+(?:,\d*)?\})[+?]?)?) # $5: Optional char class quantifier.
| (\((?!\?\#)) # $6: Opening "(" (non comment group).
| (\(\?\#[^)]*\)) # $7: Comment group (cmt_grp).
</pre>
<pre class="regex">var re_1_nocmt = /([^[(\\]+(?:\\[\S\s][^[(\\]*)*|(?:\\[\S\s][^[(\\]*)+)|
(\[\^?)(\]?[^[\]\\]*(?:\\[\S\s][^[\]\\]*)*(?:\[(?::\^?\w+:\])?[^[\]\\]*
(?:\\[\S\s][^[\]\\]*)*)*)\]((?:<\/?\w+\b[^>]*>)*)((?:(?:[?*+]|\{\d+(?:,\d*)?
\})[+?]?)?)|(\((?!\?#))|(\(\?#[^)]*\))/g;</pre>
<pre class="regex_x"># Rev:20100913_0900 github.com/jmrware/DynamicRegexHighlighter
# re_2: Match inner (non-nested) PCRE syntax regex groups.
\( # Regex group opening "(" delimiter.
( # $1: Optional group type specification.
\? # All special group types start with a "?".
(?: # Non-capture group for group types alternatives.
[:|>=!] # Types specified with a single character.
| &gt; # Atomic group (HTML entity).
| &lt;[=!] # Look behind (HTML entity).
| <[=!] # Look behind (Note 1).
| P?&lt;\w+&gt; # Named capture group (Python/Perl) (HTML entity).
| P?<\w+> # Named capture group (Python/Perl) (Note 1).
| '\w+' # Named capturing group (Perl).
| (?=<span[^>]*>&\#40;) # Previously-marked nested generic conditional.
| \( # Begin conditional group with "(" delimiter.
(?: # Non-capture group for conditional alternatives.
[+\-]?\d+ # Absolute/+-relative reference condition.
| &lt;\w+&gt; # Named reference condition (Perl) (HTML entity).
| <\w+> # Named reference condition (Perl) (Note 1).
| '\w+' # Named reference condition (Perl).
| R&amp;\w+ # specific recursion condition (HTML entity).
| R&\w+ # specific recursion condition (Note 1).
| \w+ # Named reference condition (PCRE)
) \) # End conditional group with ")" delimiter.
| (?: # Group types that must have zero content.
R # Recurse whole pattern.
| (?:-?[iJmsUx])+ # Flag modifiers (PCRE).
| [+\-]?\d+ # Call subpattern by absolute/+-relative number.
| &amp;\w+ # Call subpattern by name (Perl) (HTML entity).
| &\w+ # Call subpattern by name (Perl) (Note 1).
| P&gt;\w+ # Call subpattern by name (Python) (HTML entity).
| P>\w+ # Call subpattern by name (Python) (Note 1).
| P=\w+ # Reference by name (Python).
)(?=\)) # Ensure this group type has no contents.
) # End non-capture group of group types alternatives.
)? # End $1: Optional group type specification.
([^()]*) # $2: Inner group contents.
\) # Regex group closing ")" delimiter.
((?:</?\w+\b[^>]*>)*) # $3 HTML between ")" and quantifier.
((?:(?:[?*+]|\{\d+(?:,\d*)?\})[+?]?)?) # $4: Optional quantifier.
# Note 1: Handle "<", ">" and "&", even if not converted to HTML entities.
</pre>
<pre class="regex">var re_2 = /\((\?(?:[:|>=!]|&gt;|&lt;[=!]|<[=!]|P?&lt;\w+&gt;|P?<\w+>|'\w+'|
(?=<span[^>]*>&#40;)|\((?:[+\-]?\d+|&lt;\w+&gt;|<\w+>|'\w+'|R&amp;\w+|R&\w+|\w+)
\)|(?:R|(?:-?[iJmsUx])+|[+\-]?\d+|&amp;\w+|&\w+|P&gt;\w+|P>\w+|P=\w+)(?=\))))?
([^()]*)\)((?:<\/?\w+\b[^>]*>)*)((?:(?:[?*+]|\{\d+(?:,\d*)?\})[+?]?)?)/g;</pre>
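<p>The "$2: Inner group contents" part of <tt>re_2</tt>, <tt>[^()]*</tt>, can only match a group that contains no other group, so applying such a regex repeatedly resolves groups from the innermost outward. The following is a much simplified sketch of that idea and not the script's actual code: the function names are hypothetical, the group type prefix and quantifier are not examined, and escaped parentheses, character classes and comments are ignored.</p>

```javascript
// Hypothetical helper: repeatedly remove innermost groups, i.e. "(" followed
// by non-parenthesis characters and ")", until nothing more can be removed.
function stripBalancedGroups(text) {
    var prev;
    do {
        prev = text;
        text = text.replace(/\([^()]*\)/g, "");
    } while (text !== prev);
    return text;
}

// Hypothetical helper: any parenthesis that survives the stripping has no
// partner, so the text has unbalanced parentheses.
function hasUnbalancedParens(text) {
    return /[()]/.test(stripBalancedGroups(text));
}

var ok = hasUnbalancedParens("a(?:b(c)d)e");   // false: every group is closed
var bad = hasUnbalancedParens(") ) ( (");      // true: nothing can be paired
```

<p>Each pass of the loop handles one nesting level: for "a(?:b(c)d)e" the first pass removes "(c)" and the second removes "(?:bd)".</p>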
<pre class="regex_x"># Rev:20100913_0900 github.com/jmrware/DynamicRegexHighlighter
# re_escapedgroupdelims: Convert escaped group delimiter chars to HTML entities.
( [^\\]+(?:\\[^()|][^\\]*)* # $1: Everything else (starting with non-escape),
| (?:\\[^()|][^\\]*)+ # or everything else (starting with escape).
) # End $1.
| \\([()|]) # $2: Escaped "(", ")" or "|".
</pre>
<pre class="regex">/([^\\]+(?:\\[^()|][^\\]*)*|(?:\\[^()|][^\\]*)+)|\\([()|])/g</pre>
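<p>A sketch of how this regex might be used with a replacement callback. The regex is copied from the listing above, but the rest is an assumption: the function name <tt>hideEscapedDelims</tt> is hypothetical, and so are the choices to use numeric entities and to keep the backslash in the output.</p>

```javascript
// Copied from the listing above.
var re_escapedgroupdelims = /([^\\]+(?:\\[^()|][^\\]*)*|(?:\\[^()|][^\\]*)+)|\\([()|])/g;

// Assumed mapping from delimiter to numeric HTML entity.
var DELIM_ENTITY = { "(": "&#40;", ")": "&#41;", "|": "&#124;" };

// Hypothetical helper: replace escaped "(", ")" and "|" with entities so that
// later passes, which look for literal parentheses, no longer see them.
function hideEscapedDelims(regexText) {
    return regexText.replace(re_escapedgroupdelims, function (match, other, delim) {
        // $2 is set only for an escaped delimiter; otherwise $1 matched
        // ordinary text, which is returned unchanged.
        return delim ? "\\" + DELIM_ENTITY[delim] : match;
    });
}

// Input regex text:  a\(b\\(c\|d
var hidden = hideEscapedDelims("a\\(b\\\\(c\\|d");
// hidden is now:     a\&#40;b\\(c\&#124;d
```

<p>The "(" that follows the escaped backslash "\\" is a real group delimiter and is left alone. Only the delimiters that are themselves escaped are converted.</p>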
<pre class="regex_x"># Rev:20100913_0900 github.com/jmrware/DynamicRegexHighlighter
# re_open_html_tag: Match HTML opening tag with at least one attribute.
< # Opening tag opening "<" delimiter.
( # $1: Opening tag name and attribute contents.
\w+\b # Tag name.
(?: # Non-capture group for required attribute(s).
\s+ # Attributes must be separated by whitespace.
[\w\-.:]+ # Attribute name is required for attr=value pair.
(?: # Non-capture group for optional attribute value.
\s*=\s* # Name and value separated by "=" and optional ws.
(?: # Non-capture group for attrib value alternatives.
"[^"]*" # Double quoted string (Note: may contain "&<>").
| '[^']*' # Single quoted string (Note: may contain "&<>").
| [\w\-.:]+ # Non-quoted attrib value can be A-Z0-9-._:
) # End of attribute value
)? # Attribute value is optional.
)+ # One or more attributes required.
\s* /? # Optional whitespace and "/" before ">".
) # End $1. Opening tag name and attribute contents.
> # Opening tag closing ">" delimiter.
</pre>
<pre class="regex">/<(\w+\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)+\s*\/?)>/g</pre>
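<p>A short sketch showing what this regex does and does not match. The regex is copied from the listing above; the function name <tt>findTagsWithAttributes</tt> is hypothetical, and how the script itself uses the matches is not shown here.</p>

```javascript
// Copied from the listing above. $1: tag name plus attributes.
var re_open_html_tag = /<(\w+\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[\w\-.:]+))?)+\s*\/?)>/g;

// Hypothetical helper: collect $1 for every opening tag that has at least
// one attribute. Closing tags and attribute-less tags are skipped.
function findTagsWithAttributes(html) {
    var found = [];
    var m;
    re_open_html_tag.lastIndex = 0;       // start scanning from the beginning
    while ((m = re_open_html_tag.exec(html)) !== null) {
        found.push(m[1]);
    }
    return found;
}

var tags = findTagsWithAttributes('<b title="a > b (c)">x</b> <i>y</i>');
// tags has one entry: b title="a > b (c)"
```

<p>Because quoted attribute values are matched as a unit, the ">" and the parentheses inside the <tt>title</tt> value do not end the tag early. The <tt>&lt;i&gt;</tt> tag is not reported since it has no attributes.</p>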
<h2>A Script Test Chamber:</h2>
<pre class="regex_x"># Example pseudo-regex demonstrating all recognized PCRE component types.
(?# CHARACTER CLASSES)
[...] # positive character class
[^...] # negative character class
[]...] # unescaped ] allowed if first char
[^]...] # unescaped ] allowed if first char
[x-y] # range (can be used for hex characters)
[[:xxx:]] # positive POSIX named set
[[:^xxx:]] # negative POSIX named set
[[:alpha:][:alpha:][:alpha:]] # can have multiple embedded POSIX cc
[[[[[:alpha:][[[[:alpha:][[[] # can have unescaped non-POSIX class "["
(?# QUANTIFIERS applied to character classes and simple capture groups.)
[x]? (x)? # 0 or 1, greedy
[x]?+ (x)?+ # 0 or 1, possessive
[x]?? (x)?? # 0 or 1, lazy
[x]* (x)* # 0 or more, greedy
[x]*+ (x)*+ # 0 or more, possessive
[x]*? (x)*? # 0 or more, lazy
[x]+ (x)+ # 1 or more, greedy
[x]++ (x)++ # 1 or more, possessive
[x]+? (x)+? # 1 or more, lazy
[x]{1} (x){1} # exactly n
[x]{1,2} (x){1,2} # at least n, no more than m, greedy
[x]{1,2}+ (x){1,2}+ # at least n, no more than m, possessive
[x]{1,2}? (x){1,2}? # at least n, no more than m, lazy
[x]{1,} (x){1,} # n or more, greedy
[x]{1,}+ (x){1,}+ # n or more, possessive
[x]{1,}? (x){1,}? # n or more, lazy
[x]{10} (x){10} # exactly nn (multiple digits)
[x]{10,20} (x){10,20} # at least nn, no more than mm, greedy
[x]{10,20}+ (x){10,20}+ # at least nn, no more than mm, possessive
[x]{10,20}? (x){10,20}? # at least nn, no more than mm, lazy
[x]{10,} (x){10,} # nn or more, greedy
[x]{10,}+ (x){10,}+ # nn or more, possessive
[x]{10,}? (x){10,}? # nn or more, lazy
(?# CAPTURING)
(...) # capturing group
(?<name>...) # named capturing group (Perl)
(?'name'...) # named capturing group (Perl)
(?P<name>...) # named capturing group (Python)
(?:...) # non-capturing group
(?|(...)|(...)) # "branch reset" non-capturing group; reset group
# numbers for capturing groups in each alternative
(?# ATOMIC GROUPS)
(?>...) # atomic, non-capturing group
(?# OPTION SETTING)
(?i) # caseless
(?J) # allow duplicate names
(?m) # multiline
(?s) # single line (dotall)
(?U) # default ungreedy (lazy)
(?x) # extended (ignore white space)
(?-i) # NOT caseless
(?-J) # NOT allow duplicate names
(?-m) # NOT multiline
(?-s) # NOT single line (dotall)
(?-U) # NOT default ungreedy (lazy)
(?-x) # NOT extended (ignore white space)
(?i-Jm-sU-x) # multiple options at once.
(?-iJ-ms-Ux) # multiple options at once.
(?# LOOKAHEAD AND LOOKBEHIND ASSERTIONS)
(?=...) # positive look ahead
(?!...) # negative look ahead
(?<=...) # positive look behind
(?<!...) # negative look behind
(?# BACKREFERENCES)
(?P=name) # reference by name (Python)
(?# SUBROUTINE REFERENCES {POSSIBLY RECURSIVE})
(?R) # recurse whole pattern
(?1) # call subpattern by absolute number
(?+1) # call subpattern by relative number
(?-1) # call subpattern by relative number
(?&name) # call subpattern by name (Perl)
(?P>name) # call subpattern by name (Python)
(?# CONDITIONAL PATTERNS)
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
(?(1)...) # absolute reference condition
(?(+1)...) # relative reference condition
(?(-1)...) # relative reference condition
(?(<name>)...) # named reference condition (Perl)
(?('name')...) # named reference condition (Perl)
(?(name)...) # named reference condition (PCRE)
(?(R)...) # overall recursion condition
(?(R1)...) # specific group recursion condition
(?(R&name)...) # specific recursion condition
(?(DEFINE)...) # define subpattern for reference
(?(?=...)...) # assertion condition (positive lookahead)
(?(?!...)...) # assertion condition (negative lookahead)
(?(?<=...)...) # assertion condition (positive lookbehind)
(?(?<!...)...) # assertion condition (negative lookbehind)
<!-- The following tests have HTML tags with attribute values having unescaped
&<> chars which are invalid for XHTML. However, these chars are valid for
the HTML 4.01 Strict doctype. This is why this HTML page is served with
HTML 4.01 Strict doctype and not the preferred XHTML 1.0 Strict.-->
(?# MISCELLANEOUS TESTS)
# test HTML tags having "&<>()|[]" delimiter chars in attribute values.
<b title="Title &<attribute> (with | [regex] | delims) in open regex)">HTML TAG</b> # in open regex
(?# <b title="Title &<attribute> (with | [regex] | delims) in (?#comment group)">HTML TAG</b>) # in comment group
# <b title="Title &<attribute> (with | [regex] | delims) in # comment">HTML TAG</b> # in comment
[<b title="Title &<attribute> (with | [regex] | delims) in [character class]">HTML TAG</b> in character class]
(<b title="Title &<attribute> (with | [regex] | delims) in (group)">HTML TAG</b> in group)
\<b title="Title &<attribute> (with | [regex] | delims) in open regex)">HTML TAG</b> # with \ escape immediately before <
# character class regexes with HTML tags
<b>[</b>charclass] <b>[char</b>class] <b>[charclass</b>] <b>[charclass]</b>
[charclass<b>]</b> [char<b>class]</b> [<b>charclass]</b> <b>[charclass]</b>
<b>[</b>charclass]++ <b>[char</b>class]++ <b>[charclass</b>]++ <b>[charclass]</b>++ <b>[charclass]++</b>
[charclass]<b>++</b> [charclass<b>]++</b> [char<b>class]++</b> [<b>charclass]++</b> <b>[charclass]++</b>
# character class regexes with multiple HTML tags
<i><b>[</b></i>charclass] <i><b>[char</b></i>class] <i><b>[charclass</b></i>] <i><b>[charclass]</b></i>
[charclass<i><b>]</b></i> [char<i><b>class]</b></i> [<i><b>charclass]</b></i> <i><b>[charclass]</b></i>
<i><b>[</b></i>charclass]++ <i><b>[char</b></i>class]++ <i><b>[charclass</b></i>]++ <i><b>[charclass]</b></i>++ <i><b>[charclass]++</b></i>
[charclass]<i><b>++</b></i> [charclass<i><b>]++</b></i> [char<i><b>class]++</b></i> [<i><b>charclass]++</b></i> <i><b>[charclass]++</b></i>
# group regexes with HTML tags
<b>(?:</b>group) <b>(?:gr</b>oup) <b>(?:group</b>) <b>(?:group)</b>
(?:group<b>)</b> (?:gr<b>oup)</b> (?:<b>group)</b> <b>(?:group)</b>
<b>(?:</b>group)++ <b>(?:gr</b>oup)++ <b>(?:group</b>)++ <b>(?:group)</b>++ <b>(?:group)++</b>
(?:group)<b>++</b> (?:group<b>)++</b> (?:gr<b>oup)++</b> (?:<b>group)++</b> <b>(?:group)++</b>
# group regexes with multiple HTML tags
<i><b>(?:</b></i>group) <i><b>(?:gr</b></i>oup) <i><b>(?:group</b></i>) <i><b>(?:group)</b></i>
(?:group<i><b>)</b></i> (?:gr<i><b>oup)</b></i> (?:<i><b>group)</b></i> <i><b>(?:group)</b></i>
<i><b>(?:</b></i>group)++ <i><b>(?:gr</b></i>oup)++ <i><b>(?:group</b></i>)++ <i><b>(?:group)</b></i>++ <i><b>(?:group)++</b></i>
(?:group)<i><b>++</b></i> (?:group<i><b>)++</b></i> (?:gr<i><b>oup)++</b></i> (?:<i><b>group)++</b></i> <i><b>(?:group)++</b></i>
[ ( ) | ] # unescaped group delimiters inside char class
[ \( \) \| ] # escaped group delimiters inside char class
( \( \) \| ) # escaped group delimiters inside group
\( \) \| # escaped group delimiters outside
) ) ( ( # unbalanced parentheses
</pre>
<h2>Notes and limitations:</h2>
<ul>
<li>During Phase 1 processing, the <tt>re_1</tt> regex will match (invalid) empty character classes. (i.e. <tt class="regex_x">/[]/</tt> or <tt class="regex_x">/[^]/</tt>). It is best to use only valid regexes.</li>
<li>The HTML document should not have any element having <tt>id="xREx"</tt> as this may cause an error during parsing when running Internet Explorer. This problem should be very rare and is easily avoided.</li>
<li>Firefox 2 refuses to break up very long words and will display them as one very long line (with a horizontal scroll bar). It is best to add some line breaks to long regexes.</li>
<li>When using the interactive tester, it is important to choose the correct value for the <em>Perl "x" free spacing mode</em> checkbox option. If you fail to check this option for regexes having #comments, the parser will get confused if there are any unbalanced metacharacters within the comments, and may erroneously report unbalanced parentheses.</li>
<li>When using the color syntax highlighting option, remember that the colorization script is designed to only handle regexes written in the Javascript regex flavor (i.e. no lookbehind, possessive quantifiers, named capture groups, atomic grouping, etc.). Most importantly, the Javascript flavor does not allow comments. For this reason, the interactive tester will not allow you to select both the "x" flag option and the color option at the same time.</li>
</ul>
<p>Happy regexing!<br>
©2010 Jeff Roberson.<br>
Released as open source under the <a href="http://www.opensource.org/licenses/mit-license.php">MIT License</a></p>
</div>
</body></html>