@node Overview
@chapter Overview
A @dfn{regular expression} (or @dfn{regexp}, or @dfn{pattern}) is a text
string that describes some (mathematical) set of strings. A regexp
@var{r} @dfn{matches} a string @var{s} if @var{s} is in the set of
strings described by @var{r}.
Using the Regex library, you can:
@itemize @bullet
@item
see if a string matches a specified pattern as a whole, and
@item
search within a string for a substring matching a specified pattern.
@end itemize
Some regular expressions match only one string, i.e., the set they
describe has only one member. For example, the regular expression
@samp{foo} matches the string @samp{foo} and no others. Other regular
expressions match more than one string, i.e., the set they describe has
more than one member. For example, the regular expression @samp{f*}
matches the set of strings made up of any number (including zero) of
@samp{f}s. As you can see, some characters in regular expressions match
themselves (such as @samp{f}) and some don't (such as @samp{*}); the
ones that don't match themselves instead let you specify patterns that
describe many different strings.
To either match or search for a regular expression with the Regex
library functions, you must first compile it with a Regex pattern
compiling function. A @dfn{compiled pattern} is a regular expression
converted to the internal format used by the library functions. Once
you've compiled a pattern, you can use it for matching or searching any
number of times.
The Regex library is used by including @file{regex.h}.
@pindex regex.h
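The compile-once, use-many-times pattern described above can be sketched
with the POSIX functions @code{regcomp}, @code{regexec} and
@code{regfree} from @file{regex.h}. This is a minimal sketch, not a
definitive implementation: the helper names @code{matches_whole},
@code{search_start} and @code{count_whole_matches} are hypothetical, and
the whole-match test assumes the leftmost-longest matching rule of POSIX.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Return 1 if the compiled pattern RE matches TEXT as a whole, else 0.
   Assumes leftmost-longest matching: a whole-string match, if one
   exists, is the longest match starting at offset 0.  */
int
matches_whole (const regex_t *re, const char *text)
{
  regmatch_t m;

  assert (re != NULL && text != NULL);
  return regexec (re, text, 1, &m, 0) == 0
         && m.rm_so == 0
         && (size_t) m.rm_eo == strlen (text);
}

/* Return the offset of the first substring of TEXT that RE matches,
   or -1 if there is none.  */
int
search_start (const regex_t *re, const char *text)
{
  regmatch_t m;

  assert (re != NULL && text != NULL);
  return regexec (re, text, 1, &m, 0) == 0 ? (int) m.rm_so : -1;
}

/* Compile PATTERN once (POSIX extended syntax), then reuse the
   compiled pattern on each of the N strings in TEXTS.  Returns how
   many strings match as a whole, or -1 if PATTERN does not compile.  */
int
count_whole_matches (const char *pattern, const char *const *texts, size_t n)
{
  regex_t re;
  int count = 0;
  size_t i;

  assert (pattern != NULL && texts != NULL);
  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  for (i = 0; i < n; i++)
    count += matches_whole (&re, texts[i]);
  regfree (&re);                /* release the compiled pattern */
  return count;
}
```

Given the pattern @samp{f*} and the strings @samp{f}, @samp{fff} and
@samp{ffg}, @code{count_whole_matches} returns 2, since @samp{ffg}
contains a character that is not an @samp{f}.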
Regex provides three groups of functions with which you can operate on
regular expressions. One group---the GNU group---is more
powerful but not completely compatible with the other two, namely the
POSIX and Berkeley Unix groups; its interface was designed
specifically for GNU.
We wrote this chapter with programmers in mind, not users of
programs---such as Emacs---that use Regex. We describe the Regex
library in its entirety, not how to write regular expressions that a
particular program understands.
@node Regular Expression Syntax
@chapter Regular Expression Syntax
@cindex regular expressions, syntax of
@cindex syntax of regular expressions
@dfn{Characters} are things you can type. @dfn{Operators} are things in
a regular expression that match one or more characters. You compose
regular expressions from operators, which in turn you specify using one
or more characters.
Most characters represent what we call the match-self operator, i.e.,
they match themselves; we call these characters @dfn{ordinary}. Other
characters represent either all or parts of fancier operators; e.g.,
@samp{.} represents what we call the match-any-character operator
(which, no surprise, matches (almost) any character); we call these
characters @dfn{special}. Two different things determine what
characters represent what operators:
@enumerate
@item
the regular expression syntax your program has told the Regex library to
recognize, and
@item
the context of the character in the regular expression.
@end enumerate
In the following sections, we describe these things in more detail.
@menu
* Syntax Bits::
* Predefined Syntaxes::
* Collating Elements vs. Characters::
* The Backslash Character::
@end menu
@node Syntax Bits
@section Syntax Bits
@cindex syntax bits
In any particular syntax for regular expressions, some characters are
always special, others are sometimes special, and others are never
special. The particular syntax that Regex recognizes for a given
regular expression depends on the current syntax (as set by
@code{re_set_syntax}) when the pattern buffer of that regular expression
was compiled.
You get a pattern buffer by compiling a regular expression. @xref{GNU
Pattern Buffers}, for more information on pattern buffers. @xref{GNU
Regular Expression Compiling}, and @ref{BSD Regular Expression
Compiling}, for more information on compiling.
Regex considers the current syntax to be a collection of bits; we refer
to these bits as @dfn{syntax bits}. In most cases, they affect what
characters represent what operators. We describe the meanings of the
operators to which we refer in @ref{Common Operators}, @ref{GNU
Operators}, and @ref{GNU Emacs Operators}.
For reference, here is the complete list of syntax bits, in alphabetical
order:
@table @code
@cnindex RE_BACKSLASH_ESCAPE_IN_LISTS
@item RE_BACKSLASH_ESCAPE_IN_LISTS
If this bit is set, then @samp{\} inside a list (@pxref{List Operators})
quotes (makes ordinary, if it's special) the following character; if
this bit isn't set, then @samp{\} is an ordinary character inside lists.
(@xref{The Backslash Character}, for what @samp{\} does outside of lists.)
@cnindex RE_BK_PLUS_QM
@item RE_BK_PLUS_QM
If this bit is set, then @samp{\+} represents the match-one-or-more
operator and @samp{\?} represents the match-zero-or-one operator; if
this bit isn't set, then @samp{+} represents the match-one-or-more
operator and @samp{?} represents the match-zero-or-one operator. This
bit is irrelevant if @code{RE_LIMITED_OPS} is set.
@cnindex RE_CHAR_CLASSES
@item RE_CHAR_CLASSES
If this bit is set, then you can use character classes in lists; if this
bit isn't set, then you can't.
@cnindex RE_CONTEXT_INDEP_ANCHORS
@item RE_CONTEXT_INDEP_ANCHORS
If this bit is set, then @samp{^} and @samp{$} are special anywhere outside
a list; if this bit isn't set, then these characters are special only in
certain contexts. @xref{Match-beginning-of-line Operator}, and
@ref{Match-end-of-line Operator}.
@cnindex RE_CONTEXT_INDEP_OPS
@item RE_CONTEXT_INDEP_OPS
If this bit is set, then certain characters are special anywhere outside
a list; if this bit isn't set, then those characters are special only in
some contexts and are ordinary elsewhere. Specifically, if this bit
isn't set then @samp{*}, and (if the syntax bit @code{RE_LIMITED_OPS}
isn't set) @samp{+} and @samp{?} (or @samp{\+} and @samp{\?}, depending
on the syntax bit @code{RE_BK_PLUS_QM}) represent repetition operators
only if they're not first in a regular expression or just after an
open-group or alternation operator. The same holds for @samp{@{} (or
@samp{\@{}, depending on the syntax bit @code{RE_NO_BK_BRACES}) if
it is the beginning of a valid interval and the syntax bit
@code{RE_INTERVALS} is set.
@cnindex RE_CONTEXT_INVALID_DUP
@item RE_CONTEXT_INVALID_DUP
If this bit is set, then an open-interval operator cannot occur at the
start of a regular expression, or immediately after an alternation,
open-group or close-interval operator.
@cnindex RE_CONTEXT_INVALID_OPS
@item RE_CONTEXT_INVALID_OPS
If this bit is set, then repetition and alternation operators can't be
in certain positions within a regular expression. Specifically, the
regular expression is invalid if it has:
@itemize @bullet
@item
a repetition operator first in the regular expression or just after a
match-beginning-of-line, open-group, or alternation operator; or
@item
an alternation operator first or last in the regular expression, just
before a match-end-of-line operator, or just after an alternation or
open-group operator.
@end itemize
If this bit isn't set, then you can put the characters representing the
repetition and alternation operators anywhere in a regular expression.
Whether or not they will in fact be operators in certain positions
depends on other syntax bits.
@cnindex RE_DEBUG
@item RE_DEBUG
If this bit is set, and the regex library was compiled with
@code{-DDEBUG}, then internal debugging is turned on; if unset, then
it is turned off.
@cnindex RE_DOT_NEWLINE
@item RE_DOT_NEWLINE
If this bit is set, then the match-any-character operator matches
a newline; if this bit isn't set, then it doesn't.
@cnindex RE_DOT_NOT_NULL
@item RE_DOT_NOT_NULL
If this bit is set, then the match-any-character operator doesn't match
a null character; if this bit isn't set, then it does.
@cnindex RE_HAT_LISTS_NOT_NEWLINE
@item RE_HAT_LISTS_NOT_NEWLINE
If this bit is set, nonmatching lists @samp{[^...]} do not match
newline; if not set, they do.
@cnindex RE_ICASE
@item RE_ICASE
If this bit is set, then ignore case when matching; otherwise, case is
significant.
@cnindex RE_INTERVALS
@item RE_INTERVALS
If this bit is set, then Regex recognizes interval operators; if this bit
isn't set, then it doesn't.
@cnindex RE_INVALID_INTERVAL_ORD
@item RE_INVALID_INTERVAL_ORD
If this bit is set, a syntactically invalid interval is treated as a
string of ordinary characters. For example, the extended regular
expression @samp{a@{1} is treated as @samp{a\@{1}.
@cnindex RE_LIMITED_OPS
@item RE_LIMITED_OPS
If this bit is set, then Regex doesn't recognize the match-one-or-more,
match-zero-or-one or alternation operators; if this bit isn't set, then
it does.
@cnindex RE_NEWLINE_ALT
@item RE_NEWLINE_ALT
If this bit is set, then newline represents the alternation operator; if
this bit isn't set, then newline is ordinary.
@cnindex RE_NO_BK_BRACES
@item RE_NO_BK_BRACES
If this bit is set, then @samp{@{} represents the open-interval operator
and @samp{@}} represents the close-interval operator; if this bit isn't
set, then @samp{\@{} represents the open-interval operator and
@samp{\@}} represents the close-interval operator. This bit is relevant
only if @code{RE_INTERVALS} is set.
@cnindex RE_NO_BK_PARENS
@item RE_NO_BK_PARENS
If this bit is set, then @samp{(} represents the open-group operator and
@samp{)} represents the close-group operator; if this bit isn't set, then
@samp{\(} represents the open-group operator and @samp{\)} represents
the close-group operator.
@cnindex RE_NO_BK_REFS
@item RE_NO_BK_REFS
If this bit is set, then Regex doesn't recognize @samp{\}@var{digit} as
the back reference operator; if this bit isn't set, then it does.
@cnindex RE_NO_BK_VBAR
@item RE_NO_BK_VBAR
If this bit is set, then @samp{|} represents the alternation operator;
if this bit isn't set, then @samp{\|} represents the alternation
operator. This bit is irrelevant if @code{RE_LIMITED_OPS} is set.
@cnindex RE_NO_EMPTY_RANGES
@item RE_NO_EMPTY_RANGES
If this bit is set, then a regular expression with a range whose ending
point collates lower than its starting point is invalid; if this bit
isn't set, then Regex considers such a range to be empty.
@cnindex RE_NO_GNU_OPS
@item RE_NO_GNU_OPS
If this bit is set, GNU regex operators are not recognized; otherwise,
they are.
@cnindex RE_NO_POSIX_BACKTRACKING
@item RE_NO_POSIX_BACKTRACKING
If this bit is set, succeed as soon as we match the whole pattern,
without further backtracking. This means that a match may not be
the leftmost longest; @pxref{What Gets Matched?} for what this means.
@cnindex RE_NO_SUB
@item RE_NO_SUB
If this bit is set, then @code{no_sub} will be set to one during
@code{re_compile_pattern}. This causes matching and searching routines
not to record substring match information.
@cnindex RE_UNMATCHED_RIGHT_PAREN_ORD
@item RE_UNMATCHED_RIGHT_PAREN_ORD
If this bit is set and the regular expression has no matching open-group
operator, then Regex considers what would otherwise be a close-group
operator (based on how @code{RE_NO_BK_PARENS} is set) to match @samp{)}.
@end table
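The effect of individual syntax bits can be observed by compiling the
same pattern under different settings. The following is a minimal
sketch that assumes the GNU functions @code{re_set_syntax},
@code{re_compile_pattern} and @code{re_match} as declared by the GNU C
Library's @file{regex.h} when @code{_GNU_SOURCE} is defined before any
system header is included; the helper name @code{match_length} is
hypothetical.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE 1          /* assumption: glibc; exposes the GNU API */
#endif
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Compile PATTERN under SYNTAX and match it at the start of TEXT.
   Returns the number of characters matched, -1 if there is no match,
   or -2 if PATTERN does not compile (or on an internal error).  */
int
match_length (reg_syntax_t syntax, const char *pattern, const char *text)
{
  struct re_pattern_buffer buf;
  reg_syntax_t old;
  const char *error;
  int result;

  assert (pattern != NULL && text != NULL);
  memset (&buf, 0, sizeof buf);         /* let Regex allocate the buffer */
  old = re_set_syntax (syntax);         /* the syntax is read at compile time */
  error = re_compile_pattern (pattern, strlen (pattern), &buf);
  re_set_syntax (old);                  /* restore the previous syntax */
  if (error != NULL)
    return -2;
  result = (int) re_match (&buf, text, (regoff_t) strlen (text), 0, NULL);
  regfree (&buf);
  return result;
}
```

With no syntax bits set, @code{match_length (0, "\\(a\\|b\\)", "b")}
returns 1, while the pattern @samp{(a|b)} matches only those five
literal characters. With @code{RE_NO_BK_PARENS | RE_NO_BK_VBAR}, the
unescaped form @samp{(a|b)} becomes a grouped alternation.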
@node Predefined Syntaxes
@section Predefined Syntaxes
If you're programming with Regex, you can set a pattern buffer's
(@pxref{GNU Pattern Buffers})
syntax either to an arbitrary combination of syntax bits
(@pxref{Syntax Bits}) or else to the configurations defined by Regex.
These configurations define the syntaxes used by certain
programs---GNU Emacs,
@cindex Emacs
POSIX Awk,
@cindex POSIX Awk
traditional Awk,
@cindex Awk
Grep,
@cindex Grep
@cindex Egrep
Egrep---in addition to syntaxes for POSIX basic and extended
regular expressions.
The predefined syntaxes---taken directly from @file{regex.h}---are:
@smallexample
#define RE_SYNTAX_EMACS 0
#define RE_SYNTAX_AWK \
(RE_BACKSLASH_ESCAPE_IN_LISTS | RE_DOT_NOT_NULL \
| RE_NO_BK_PARENS | RE_NO_BK_REFS \
| RE_NO_BK_VBAR | RE_NO_EMPTY_RANGES \
| RE_UNMATCHED_RIGHT_PAREN_ORD)
#define RE_SYNTAX_POSIX_AWK \
(RE_SYNTAX_POSIX_EXTENDED | RE_BACKSLASH_ESCAPE_IN_LISTS)
#define RE_SYNTAX_GREP \
(RE_BK_PLUS_QM | RE_CHAR_CLASSES \
| RE_HAT_LISTS_NOT_NEWLINE | RE_INTERVALS \
| RE_NEWLINE_ALT)
#define RE_SYNTAX_EGREP \
(RE_CHAR_CLASSES | RE_CONTEXT_INDEP_ANCHORS \
| RE_CONTEXT_INDEP_OPS | RE_HAT_LISTS_NOT_NEWLINE \
| RE_NEWLINE_ALT | RE_NO_BK_PARENS \
| RE_NO_BK_VBAR)
#define RE_SYNTAX_POSIX_EGREP \
(RE_SYNTAX_EGREP | RE_INTERVALS | RE_NO_BK_BRACES)
/* P1003.2/D11.2, section 4.20.7.1, lines 5078ff. */
#define RE_SYNTAX_ED RE_SYNTAX_POSIX_BASIC
#define RE_SYNTAX_SED RE_SYNTAX_POSIX_BASIC
/* Syntax bits common to both basic and extended POSIX regex syntax. */
#define _RE_SYNTAX_POSIX_COMMON \
(RE_CHAR_CLASSES | RE_DOT_NEWLINE | RE_DOT_NOT_NULL \
| RE_INTERVALS | RE_NO_EMPTY_RANGES)
#define RE_SYNTAX_POSIX_BASIC \
(_RE_SYNTAX_POSIX_COMMON | RE_BK_PLUS_QM)
/* Differs from ..._POSIX_BASIC only in that RE_BK_PLUS_QM becomes
RE_LIMITED_OPS, i.e., \? \+ \| are not recognized. Actually, this
isn't minimal, since other operators, such as \`, aren't disabled. */
#define RE_SYNTAX_POSIX_MINIMAL_BASIC \
(_RE_SYNTAX_POSIX_COMMON | RE_LIMITED_OPS)
#define RE_SYNTAX_POSIX_EXTENDED \
(_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS \
| RE_CONTEXT_INDEP_OPS | RE_NO_BK_BRACES \
| RE_NO_BK_PARENS | RE_NO_BK_VBAR \
| RE_UNMATCHED_RIGHT_PAREN_ORD)
/* Differs from ..._POSIX_EXTENDED in that RE_CONTEXT_INVALID_OPS
replaces RE_CONTEXT_INDEP_OPS and RE_NO_BK_REFS is added. */
#define RE_SYNTAX_POSIX_MINIMAL_EXTENDED \
(_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS \
| RE_CONTEXT_INVALID_OPS | RE_NO_BK_BRACES \
| RE_NO_BK_PARENS | RE_NO_BK_REFS \
| RE_NO_BK_VBAR | RE_UNMATCHED_RIGHT_PAREN_ORD)
@end smallexample
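A predefined syntax can be used as is or combined with further syntax
bits. The sketch below assumes the GNU functions @code{re_set_syntax},
@code{re_compile_pattern} and @code{re_search} from the GNU C Library's
@file{regex.h} (with @code{_GNU_SOURCE} defined before any system header
is included); the helper name @code{search_with_syntax} is hypothetical.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE 1          /* assumption: glibc; exposes the GNU API */
#endif
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Compile PATTERN under SYNTAX and search TEXT for it.  Returns the
   index where the first match starts, -1 if there is none, or -2 if
   PATTERN does not compile (or on an internal error).  */
int
search_with_syntax (reg_syntax_t syntax, const char *pattern,
                    const char *text)
{
  struct re_pattern_buffer buf;
  reg_syntax_t old;
  const char *error;
  regoff_t len;
  int start;

  assert (pattern != NULL && text != NULL);
  memset (&buf, 0, sizeof buf);         /* let Regex allocate the buffer */
  old = re_set_syntax (syntax);
  error = re_compile_pattern (pattern, strlen (pattern), &buf);
  re_set_syntax (old);                  /* restore the previous syntax */
  if (error != NULL)
    return -2;
  len = (regoff_t) strlen (text);
  start = (int) re_search (&buf, text, len, 0, len, NULL);
  regfree (&buf);
  return start;
}
```

For example, @code{RE_SYNTAX_POSIX_EXTENDED | RE_ICASE} lets the pattern
@samp{fo+} find @samp{FOO} in @samp{a FOO}, whereas plain
@code{RE_SYNTAX_POSIX_EXTENDED} does not. Under
@code{RE_SYNTAX_POSIX_BASIC}, which sets @code{RE_BK_PLUS_QM}, the
repetition has to be written @samp{fo\+}.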
@node Collating Elements vs. Characters
@section Collating Elements vs.@: Characters
POSIX generalizes the notion of a character to that of a
collating element. It defines a @dfn{collating element} to be ``a
sequence of one or more bytes defined in the current collating sequence
as a unit of collation.''
This generalizes the notion of a character in
two ways. First, a single character can map into two or more collating
elements. For example, the German
@tex
``\ss''
@end tex
@ifinfo
``es-zet''
@end ifinfo
collates as the collating element @samp{s} followed by another collating
element @samp{s}. Second, two or more characters can map into one
collating element. For example, the Spanish @samp{ll} collates after
@samp{l} and before @samp{m}.
Since POSIX's ``collating element'' preserves the essential idea of
a ``character,'' we use the latter, more familiar, term in this document.
@node The Backslash Character
@section The Backslash Character
@cindex \
The @samp{\} character has one of four different meanings, depending on
the context in which you use it and what syntax bits are set
(@pxref{Syntax Bits}). It can: 1) stand for itself, 2) quote the next
character, 3) introduce an operator, or 4) do nothing.
@enumerate
@item
It stands for itself inside a list
(@pxref{List Operators}) if the syntax bit
@code{RE_BACKSLASH_ESCAPE_IN_LISTS} is not set. For example, @samp{[\]}
would match @samp{\}.
@item
It quotes (makes ordinary, if it's special) the next character when you
use it either:
@itemize @bullet
@item
outside a list,@footnote{Sometimes
you don't have to explicitly quote special characters to make
them ordinary. For instance, most characters lose any special meaning
inside a list (@pxref{List Operators}). In addition, if the syntax bits
@code{RE_CONTEXT_INVALID_OPS} and @code{RE_CONTEXT_INDEP_OPS}
aren't set, then (for historical reasons) the matcher considers special
characters ordinary if they are in contexts where the operations they
represent make no sense; for example, the match-zero-or-more
operator (represented by @samp{*}) matches itself in the regular
expression @samp{*foo} because there is no preceding expression on which
it can operate. It is poor practice, however, to depend on this
behavior; if you want a special character to be ordinary outside a list,
it's better to always quote it, regardless.} or
@item
inside a list and the syntax bit @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is set.
@end itemize
@item
It introduces an operator when followed by certain ordinary
characters---sometimes only when certain syntax bits are set. See the
cases @code{RE_BK_PLUS_QM}, @code{RE_NO_BK_BRACES}, @code{RE_NO_BK_VBAR},
@code{RE_NO_BK_PARENS}, @code{RE_NO_BK_REFS} in @ref{Syntax Bits}. Also:
@itemize @bullet
@item
@samp{\b} represents the match-word-boundary operator
(@pxref{Match-word-boundary Operator}).
@item
@samp{\B} represents the match-within-word operator
(@pxref{Match-within-word Operator}).
@item
@samp{\<} represents the match-beginning-of-word operator @*
(@pxref{Match-beginning-of-word Operator}).
@item
@samp{\>} represents the match-end-of-word operator
(@pxref{Match-end-of-word Operator}).
@item
@samp{\w} represents the match-word-constituent operator
(@pxref{Match-word-constituent Operator}).
@item
@samp{\W} represents the match-non-word-constituent operator
(@pxref{Match-non-word-constituent Operator}).
@item
@samp{\`} represents the match-beginning-of-buffer
operator and @samp{\'} represents the match-end-of-buffer operator
(@pxref{Buffer Operators}).
@item
If Regex was compiled with the C preprocessor symbol @code{emacs}
defined, then @samp{\s@var{class}} represents the match-syntactic-class
operator and @samp{\S@var{class}} represents the
match-not-syntactic-class operator (@pxref{Syntactic Class Operators}).
@end itemize
@item
In all other cases, Regex ignores @samp{\}. For example,
@samp{\n} matches @samp{n}.
@end enumerate
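The first two cases above can be sketched with the POSIX functions
@code{regcomp} and @code{regexec}. The sketch assumes POSIX extended
syntax, in which @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is not set, so
@samp{\} quotes outside a list and stands for itself inside one. The
helper name @code{found} is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN (POSIX extended syntax) matches somewhere in
   TEXT, 0 if it does not, and -1 if PATTERN fails to compile.  */
int
found (const char *pattern, const char *text)
{
  regex_t re;
  int status;

  assert (pattern != NULL && text != NULL);
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  status = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return status == 0;
}
```

Here @code{found ("a\\.b", "axb")} returns 0 because the quoted
@samp{.} is ordinary, while @code{found ("[\\]", "a\\b")} returns 1
because the list contains the single item @samp{\}. Note that each
backslash is doubled in the C string literals.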
@node Common Operators
@chapter Common Operators
You compose regular expressions from operators. In the following
sections, we describe the regular expression operators specified by
POSIX; GNU also uses these. Most operators have more than one
representation as characters. @xref{Regular Expression Syntax}, for
what characters represent what operators under what circumstances.
For most operators that can be represented in two ways, one
representation is a single character and the other is that character
preceded by @samp{\}. For example, either @samp{(} or @samp{\(}
represents the open-group operator. Which one does depends on the
setting of a syntax bit, in this case @code{RE_NO_BK_PARENS}. Why is
this so? Historical reasons dictate some of the varying
representations, while POSIX dictates others.
Finally, almost all characters lose any special meaning inside a list
(@pxref{List Operators}).
@menu
* Match-self Operator:: Ordinary characters.
* Match-any-character Operator:: .
* Concatenation Operator:: Juxtaposition.
* Repetition Operators:: * + ? @{@}
* Alternation Operator:: |
* List Operators:: [...] [^...]
* Grouping Operators:: (...)
* Back-reference Operator:: \digit
* Anchoring Operators:: ^ $
@end menu
@node Match-self Operator
@section The Match-self Operator (@var{ordinary character})
This operator matches the character itself. All ordinary characters
(@pxref{Regular Expression Syntax}) represent this operator. For
example, @samp{f} is always an ordinary character, so the regular
expression @samp{f} matches only the string @samp{f}. In
particular, it does @emph{not} match the string @samp{ff}.
@node Match-any-character Operator
@section The Match-any-character Operator (@code{.})
@cindex @samp{.}
This operator matches any single printing or nonprinting character,
except that it won't match a:
@table @asis
@item newline
if the syntax bit @code{RE_DOT_NEWLINE} isn't set.
@item null
if the syntax bit @code{RE_DOT_NOT_NULL} is set.
@end table
The @samp{.} (period) character represents this operator. For example,
@samp{a.b} matches any three-character string beginning with @samp{a}
and ending with @samp{b}.
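The newline exception can be sketched with the POSIX functions
@code{regcomp} and @code{regexec}. This sketch relies on the POSIX flag
@code{REG_NEWLINE}, which among other things stops the
match-any-character operator from matching a newline; without that flag,
the POSIX syntaxes have @code{RE_DOT_NEWLINE} set. The helper name
@code{found_with_flags} is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN, compiled with the regcomp flags CFLAGS,
   matches somewhere in TEXT, 0 if it does not, and -1 if PATTERN
   fails to compile.  */
int
found_with_flags (const char *pattern, const char *text, int cflags)
{
  regex_t re;
  int status;

  assert (pattern != NULL && text != NULL);
  if (regcomp (&re, pattern, cflags | REG_NOSUB) != 0)
    return -1;
  status = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return status == 0;
}
```

With @code{REG_EXTENDED} alone, @samp{a.b} matches an @samp{a} and a
@samp{b} separated by a newline; adding @code{REG_NEWLINE} makes that
match fail, while @samp{axb} still matches.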
@node Concatenation Operator
@section The Concatenation Operator
This operator concatenates two regular expressions @var{a} and @var{b}.
No character represents this operator; you simply put @var{b} after
@var{a}. The result is a regular expression that will match a string if
@var{a} matches its first part and @var{b} matches the rest. For
example, @samp{xy} (two match-self operators) matches @samp{xy}.
@node Repetition Operators
@section Repetition Operators
Repetition operators repeat the preceding regular expression a specified
number of times.
@menu
* Match-zero-or-more Operator:: *
* Match-one-or-more Operator:: +
* Match-zero-or-one Operator:: ?
* Interval Operators:: @{@}
@end menu
@node Match-zero-or-more Operator
@subsection The Match-zero-or-more Operator (@code{*})
@cindex @samp{*}
This operator repeats the smallest possible preceding regular expression
as many times as necessary (including zero) to match the pattern.
@samp{*} represents this operator. For example, @samp{o*}
matches any string made up of zero or more @samp{o}s. Since this
operator operates on the smallest preceding regular expression,
@samp{fo*} has a repeating @samp{o}, not a repeating @samp{fo}. So,
@samp{fo*} matches @samp{f}, @samp{fo}, @samp{foo}, and so on.
Since the match-zero-or-more operator is a suffix operator, it may be
useless as such when no regular expression precedes it. This is the
case when it:
@itemize @bullet
@item
is first in a regular expression, or
@item
follows a match-beginning-of-line, open-group, or alternation
operator.
@end itemize
@noindent
Three different things can happen in these cases:
@enumerate
@item
If the syntax bit @code{RE_CONTEXT_INVALID_OPS} is set, then the
regular expression is invalid.
@item
If @code{RE_CONTEXT_INVALID_OPS} isn't set, but
@code{RE_CONTEXT_INDEP_OPS} is, then @samp{*} represents the
match-zero-or-more operator (which then operates on the empty string).
@item
Otherwise, @samp{*} is ordinary.
@end enumerate
@cindex backtracking
The matcher processes a match-zero-or-more operator by first matching as
many repetitions of the smallest preceding regular expression as it can.
Then it continues to match the rest of the pattern.
If it can't match the rest of the pattern, it backtracks (as many times
as necessary), each time discarding one of the matches until it can
either match the entire pattern or be certain that it cannot get a
match. For example, when matching @samp{ca*ar} against @samp{caaar},
the matcher first matches all three @samp{a}s of the string with the
@samp{a*} of the regular expression. However, it cannot then match the
final @samp{ar} of the regular expression against the final @samp{r} of
the string. So it backtracks, discarding the match of the last @samp{a}
in the string. It can then match the remaining @samp{ar}.
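The greedy-then-backtrack strategy described above can be illustrated
with a toy matcher. This is only a sketch of the idea and not the code
Regex uses: it handles just literal characters and a @samp{*} that
follows a single character, and the function name @code{toy_match} and
its @code{backtracks} counter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy matcher: P may contain literal characters, each optionally
   followed by '*'.  Returns 1 if P matches all of S, else 0.
   *BACKTRACKS is incremented each time one repetition is given back.  */
int
toy_match (const char *p, const char *s, int *backtracks)
{
  assert (p != NULL && s != NULL && backtracks != NULL);
  if (*p == '\0')
    return *s == '\0';
  if (p[1] == '*')
    {
      size_t n = 0;

      while (s[n] == p[0])      /* greedy: take every repetition */
        n++;
      for (;;)
        {
          if (toy_match (p + 2, s + n, backtracks))
            return 1;           /* the rest of the pattern matched */
          if (n == 0)
            return 0;           /* nothing left to give back */
          n--;                  /* backtrack: discard one repetition */
          ++*backtracks;
        }
    }
  return *s == *p && toy_match (p + 1, s + 1, backtracks);
}
```

For @samp{ca*ar} against @samp{caaar}, the sketch first lets @samp{a*}
take all three @samp{a}s, fails to match the remaining @samp{ar}, gives
back one @samp{a}, and then succeeds with the counter at 1.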
@node Match-one-or-more Operator
@subsection The Match-one-or-more Operator (@code{+} or @code{\+})
@cindex @samp{+}
If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't recognize
this operator. Otherwise, if the syntax bit @code{RE_BK_PLUS_QM} isn't
set, then @samp{+} represents this operator; if it is, then @samp{\+}
does.
This operator is similar to the match-zero-or-more operator except that
it repeats the preceding regular expression at least once;
@pxref{Match-zero-or-more Operator}, for what it operates on, how some
syntax bits affect it, and how Regex backtracks to match it.
For example, supposing that @samp{+} represents the match-one-or-more
operator, @samp{ca+r} matches, e.g., @samp{car} and
@samp{caaaar}, but not @samp{cr}.
@node Match-zero-or-one Operator
@subsection The Match-zero-or-one Operator (@code{?} or @code{\?})
@cindex @samp{?}
If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't
recognize this operator. Otherwise, if the syntax bit
@code{RE_BK_PLUS_QM} isn't set, then @samp{?} represents this operator;
if it is, then @samp{\?} does.
This operator is similar to the match-zero-or-more operator except that
it repeats the preceding regular expression once or not at all;
@pxref{Match-zero-or-more Operator}, to see what it operates on, how
some syntax bits affect it, and how Regex backtracks to match it.
For example, supposing that @samp{?} represents the match-zero-or-one
operator, @samp{ca?r} matches both @samp{car} and @samp{cr}, but
nothing else.
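The match-one-or-more and match-zero-or-one examples can be checked with
a short sketch. It assumes POSIX extended syntax, where @samp{+} and
@samp{?} represent these operators, and uses the POSIX functions
@code{regcomp} and @code{regexec}; the helper name @code{whole_match} is
hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Return 1 if PATTERN (POSIX extended syntax) matches all of TEXT,
   0 if it does not, and -1 if PATTERN fails to compile.  */
int
whole_match (const char *pattern, const char *text)
{
  regex_t re;
  regmatch_t m;
  int ok;

  assert (pattern != NULL && text != NULL);
  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  ok = regexec (&re, text, 1, &m, 0) == 0
       && m.rm_so == 0
       && (size_t) m.rm_eo == strlen (text);
  regfree (&re);
  return ok;
}
```

With this helper, @samp{ca+r} matches @samp{car} and @samp{caaaar} but
not @samp{cr}, and @samp{ca?r} matches @samp{car} and @samp{cr} but not
@samp{caar}.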
@node Interval Operators
@subsection Interval Operators (@code{@{} @dots{} @code{@}} or @code{\@{} @dots{} @code{\@}})
@cindex interval expression
@cindex @samp{@{}
@cindex @samp{@}}
@cindex @samp{\@{}
@cindex @samp{\@}}
If the syntax bit @code{RE_INTERVALS} is set, then Regex recognizes
@dfn{interval expressions}. They repeat the smallest possible preceding
regular expression a specified number of times.
If the syntax bit @code{RE_NO_BK_BRACES} is set, @samp{@{} represents
the @dfn{open-interval operator} and @samp{@}} represents the
@dfn{close-interval operator}; otherwise, @samp{\@{} and @samp{\@}} do.
Specifically, supposing that @samp{@{} and @samp{@}} represent the
open-interval and close-interval operators, then:
@table @code
@item @{@var{count}@}
matches exactly @var{count} occurrences of the preceding regular
expression.
@item @{@var{min},@}
matches @var{min} or more occurrences of the preceding regular
expression.
@item @{@var{min},@var{max}@}
matches at least @var{min} but no more than @var{max} occurrences of
the preceding regular expression.
@end table
The interval expression (but not necessarily the regular expression that
contains it) is invalid if:
@itemize @bullet
@item
@var{min} is greater than @var{max}, or
@item
any of @var{count}, @var{min}, or @var{max} is outside the range
zero to @code{RE_DUP_MAX} (which symbol @file{regex.h}
defines).
@end itemize
If the interval expression is invalid and the syntax bit
@code{RE_NO_BK_BRACES} is set, then Regex considers all the
characters in the would-be interval to be ordinary. If that bit
isn't set, then the regular expression is invalid.
If the interval expression is valid but there is no preceding regular
expression on which to operate, then if the syntax bit
@code{RE_CONTEXT_INVALID_OPS} is set, the regular expression is invalid.
If that bit isn't set, then Regex considers all the characters---other
than backslashes, which it ignores---in the would-be interval to be
ordinary.
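The two spellings of valid interval expressions can be compared with the
POSIX functions @code{regcomp} and @code{regexec}. This sketch assumes
that POSIX extended syntax (@code{REG_EXTENDED}) uses @samp{@{} and
@samp{@}}, and that POSIX basic syntax (no flag) uses @samp{\@{} and
@samp{\@}}; the helper name @code{whole_match_flags} is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Return 1 if PATTERN, compiled with the regcomp flags CFLAGS,
   matches all of TEXT, 0 if it does not, and -1 if PATTERN fails
   to compile.  */
int
whole_match_flags (const char *pattern, const char *text, int cflags)
{
  regex_t re;
  regmatch_t m;
  int ok;

  assert (pattern != NULL && text != NULL);
  if (regcomp (&re, pattern, cflags) != 0)
    return -1;
  ok = regexec (&re, text, 1, &m, 0) == 0
       && m.rm_so == 0
       && (size_t) m.rm_eo == strlen (text);
  regfree (&re);
  return ok;
}
```

In extended syntax, @samp{a@{2,3@}} matches @samp{aa} and @samp{aaa}
but not @samp{aaaa}, and @samp{ab@{2@}} matches @samp{abb} because the
interval applies only to the @samp{b}. In basic syntax the same
interval is written @samp{a\@{2,3\@}}, and unescaped braces are
ordinary.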
@node Alternation Operator
@section The Alternation Operator (@code{|} or @code{\|})
@kindex |
@kindex \|
@cindex alternation operator
@cindex or operator
If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't
recognize this operator. Otherwise, if the syntax bit
@code{RE_NO_BK_VBAR} is set, then @samp{|} represents this operator;
otherwise, @samp{\|} does.
Alternatives match one of a choice of regular expressions:
if you put the character(s) representing the alternation operator between
any two regular expressions @var{a} and @var{b}, the result matches
the union of the strings that @var{a} and @var{b} match. For
example, supposing that @samp{|} is the alternation operator, then
@samp{foo|bar|quux} would match any of @samp{foo}, @samp{bar} or
@samp{quux}.
The alternation operator operates on the @emph{largest} possible
surrounding regular expressions. (Put another way, it has the lowest
precedence of any regular expression operator.)
Thus, the only way you can
delimit its arguments is to use grouping. For example, if @samp{(} and
@samp{)} are the open and close-group operators, then @samp{fo(o|b)ar}
would match either @samp{fooar} or @samp{fobar}. (@samp{foo|bar} would
match @samp{foo} or @samp{bar}.)
@cindex backtracking
The matcher usually tries all combinations of alternatives so as to
match the longest possible string. For example, when matching
@samp{(fooq|foo)*(qbarquux|bar)} against @samp{fooqbarquux}, it cannot
take, say, the first (``depth-first'') combination it could match, since
then it would be content to match just @samp{fooqbar}.
Note that since the default behavior is to return the leftmost longest
match, when more than one of a series of alternatives matches, the actual
match will be the longest matching alternative, not necessarily the
first in the list.
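The leftmost-longest behavior of alternation can be observed by looking
at where a match ends. The sketch below assumes POSIX extended syntax,
where @samp{|}, @samp{(} and @samp{)} are operators, and uses the POSIX
functions @code{regcomp} and @code{regexec}; the helper name
@code{match_end} is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return the offset just past the leftmost match of PATTERN (POSIX
   extended syntax) in TEXT, -1 if there is no match, or -2 if
   PATTERN fails to compile.  */
int
match_end (const char *pattern, const char *text)
{
  regex_t re;
  regmatch_t m;
  int end;

  assert (pattern != NULL && text != NULL);
  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -2;
  end = regexec (&re, text, 1, &m, 0) == 0 ? (int) m.rm_eo : -1;
  regfree (&re);
  return end;
}
```

Under leftmost-longest matching,
@code{match_end ("(fooq|foo)*(qbarquux|bar)", "fooqbarquux")} is expected
to return 11, the length of the whole string, rather than the 7 that
stopping at @samp{fooqbar} would give. Likewise, @samp{a|ab} against
@samp{ab} is expected to end at offset 2, because the longer second
alternative wins.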
@node List Operators
@section List Operators (@code{[} @dots{} @code{]} and @code{[^} @dots{} @code{]})
@cindex matching list
@cindex @samp{[}
@cindex @samp{]}
@cindex @samp{^}
@cindex @samp{-}
@cindex @samp{\}
@cindex @samp{[^}
@cindex nonmatching list
@cindex matching newline
@cindex bracket expression
@dfn{Lists}, also called @dfn{bracket expressions}, are a set of one or
more items. An @dfn{item} is a character,
a collating symbol, an equivalence class expression,
a character class expression, or a range expression. The syntax bits
affect which kinds of items you can put in a list. We explain the last
four items in subsections below. Empty lists are invalid.
A @dfn{matching list} matches a single character represented by one of
the list items. You form a matching list by enclosing one or more items
within an @dfn{open-matching-list operator} (represented by @samp{[})
and a @dfn{close-list operator} (represented by @samp{]}).
For example, @samp{[ab]} matches either @samp{a} or @samp{b}.
@samp{[ad]*} matches the empty string and any string composed of just
@samp{a}s and @samp{d}s in any order. Regex considers a regular
expression with a @samp{[} but no matching @samp{]} to be invalid.
@dfn{Nonmatching lists} are similar to matching lists except that they
match a single character @emph{not} represented by one of the list
items. You use an @dfn{open-nonmatching-list operator} (represented by
@samp{[^}@footnote{Regex therefore doesn't consider the @samp{^} to be
the first character in the list. If you put a @samp{^} character first
in (what you think is) a matching list, you'll turn it into a
nonmatching list.}) instead of an open-matching-list operator to start a
nonmatching list.
For example, @samp{[^ab]} matches any character except @samp{a} or
@samp{b}.
If the syntax bit @code{RE_HAT_LISTS_NOT_NEWLINE} is set, then
nonmatching lists do not match a newline.
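The following C sketch exercises matching and nonmatching lists through
the POSIX interface. It assumes that the POSIX flag @code{REG_NEWLINE}
has, among its other effects, the effect described above for
@code{RE_HAT_LISTS_NOT_NEWLINE}; POSIX specifies that with
@code{REG_NEWLINE} a nonmatching list not containing a newline does not
match a newline. The function name @code{list_matches} is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if the extended regex PATTERN matches somewhere in TEXT,
   0 if it does not, and -1 if PATTERN does not compile.  CFLAGS is
   OR'ed into the flags given to regcomp (pass 0 or REG_NEWLINE).
   This is a hypothetical helper, not a Regex entry point.  */
int
list_matches (const char *pattern, const char *text, int cflags)
{
  regex_t re;
  int rc;

  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB | cflags) != 0)
    return -1;
  rc = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return rc == 0;
}
```

For instance, @code{list_matches ("[^ab]", "\n", 0)} should return 1,
@code{list_matches ("[^ab]", "\n", REG_NEWLINE)} should return 0, and
@code{list_matches ("[", "a", 0)} should return -1 because the list is
never closed.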
Most characters lose any special meaning inside a list. The special
characters inside a list follow.
@table @samp
@item ]
ends the list if it's not the first list item. So, if you want to make
the @samp{]} character a list item, you must put it first.
@item \
quotes the next character if the syntax bit @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is
set.
@item [.
represents the open-collating-symbol operator (@pxref{Collating Symbol
Operators}).
@item .]
represents the close-collating-symbol operator.
@item [=
represents the open-equivalence-class operator (@pxref{Equivalence Class
Operators}).
@item =]
represents the close-equivalence-class operator.
@item [:
represents the open-character-class operator (@pxref{Character Class
Operators}) if the syntax bit @code{RE_CHAR_CLASSES} is set and what
follows is a valid character class expression.
@item :]
represents the close-character-class operator if the syntax bit
@code{RE_CHAR_CLASSES} is set and what precedes it is an
open-character-class operator followed by a valid character class name.
@item -
represents the range operator (@pxref{Range Operator}) if it's
not first or last in a list or the ending point of a range.
@end table
@noindent
All other characters are ordinary. For example, @samp{[.*]} matches
@samp{.} and @samp{*}.
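A short C sketch can check how these special characters behave. It
assumes the POSIX extended syntax selected by @code{REG_EXTENDED}, in
which, as far as we know, @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is not
set, so a backslash inside a list is ordinary. The function name
@code{char_in_list} is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if the bracket expression LIST matches the single character
   C, 0 if it does not, and -1 if LIST does not compile.
   This is a hypothetical helper, not a Regex entry point.  */
int
char_in_list (const char *list, char c)
{
  char text[2] = { c, '\0' };   /* one-character subject string */
  regex_t re;
  int rc;

  if (regcomp (&re, list, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  rc = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return rc == 0;
}
```

Under these assumptions, @code{char_in_list ("[]a]", ']')} should
return 1 because the @samp{]} comes first, @code{char_in_list ("[a-]", '-')}
should return 1 because the @samp{-} comes last, and
@code{char_in_list ("[.*]", 'x')} should return 0.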
@menu
* Collating Symbol Operators:: [.elem.]
* Equivalence Class Operators:: [=class=]
* Character Class Operators:: [:class:]
* Range Operator:: start-end
@end menu
@node Collating Symbol Operators
@subsection Collating Symbol Operators (@code{[.} @dots{} @code{.]})
Collating symbols can be represented inside lists.
You form a @dfn{collating symbol} by
putting a collating element between an @dfn{open-collating-symbol
operator} and a @dfn{close-collating-symbol operator}. @samp{[.}
represents the open-collating-symbol operator and @samp{.]} represents
the close-collating-symbol operator. For example, if @samp{ll} is a
collating element, then @samp{[[.ll.]]} would match @samp{ll}.
@node Equivalence Class Operators
@subsection Equivalence Class Operators (@code{[=} @dots{} @code{=]})
@cindex equivalence class expression in regex
@cindex @samp{[=} in regex
@cindex @samp{=]} in regex
Regex recognizes equivalence class
expressions inside lists. An @dfn{equivalence class expression} is a set
of collating elements which all belong to the same equivalence class.
You form an equivalence class expression by putting a collating
element between an @dfn{open-equivalence-class operator} and a
@dfn{close-equivalence-class operator}. @samp{[=} represents the
open-equivalence-class operator and @samp{=]} represents the
close-equivalence-class operator. For example, if @samp{a} and @samp{A}
were an equivalence class, then both @samp{[[=a=]]} and @samp{[[=A=]]}
would match both @samp{a} and @samp{A}. If the collating element in an
equivalence class expression isn't part of an equivalence class, then
the matcher considers the equivalence class expression to be a collating
symbol.
@node Character Class Operators
@subsection Character Class Operators (@code{[:} @dots{} @code{:]})
@cindex character classes
@cindex @samp{[colon} in regex
@cindex @samp{colon]} in regex
If the syntax bit @code{RE_CHAR_CLASSES} is set, then Regex recognizes
character class expressions inside lists. A @dfn{character class
expression} matches one character from a given class. You form a
character class expression by putting a character class name between
an @dfn{open-character-class operator} (represented by @samp{[:}) and
a @dfn{close-character-class operator} (represented by @samp{:]}).
The character class names and their meanings are:
@table @code
@item alnum
letters and digits
@item alpha
letters
@item blank
system-dependent; for GNU, a space or tab
@item cntrl
control characters (in the ASCII encoding, code 0177 and codes
less than 040)
@item digit
digits
@item graph
same as @code{print} except omits space
@item lower
lowercase letters
@item print
printable characters (in the ASCII encoding, space through
tilde---codes 040 through 0176)
@item punct
printable characters that are neither spaces nor alphanumeric
@item space
space, horizontal tab, carriage return, newline, vertical tab, and form feed
@item upper
uppercase letters
@item xdigit
hexadecimal digits: @code{0}--@code{9}, @code{a}--@code{f}, @code{A}--@code{F}
@end table
@noindent
These correspond to the definitions in the C library's @file{<ctype.h>}
facility. For example, @samp{[:alpha:]} corresponds to the standard
facility @code{isalpha}. Regex recognizes character class expressions
only inside of lists; so @samp{[[:alpha:]]} matches any letter, but
@samp{[:alpha:]} on its own is an ordinary list that matches any one of
the characters @samp{:}, @samp{a}, @samp{l}, @samp{p} and @samp{h}.
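The correspondence with @file{<ctype.h>} can be checked with a small C
sketch. It assumes the POSIX interface with @code{REG_EXTENDED}, the
default @samp{C} locale, and the ASCII encoding. The function name
@code{class_disagreements} is hypothetical.

```c
#include <ctype.h>
#include <regex.h>
#include <stdio.h>

/* Count the ASCII characters 1..127 on which the character class
   expression [[:NAME:]] and the <ctype.h> predicate PRED disagree.
   Return -1 if the pattern cannot be built or does not compile.
   This is a hypothetical helper, not a Regex entry point.  */
int
class_disagreements (const char *name, int (*pred) (int))
{
  char pattern[32];
  regex_t re;
  int c, n, bad = 0;

  /* The class expression is valid only inside a list, hence "[[:" ":]]".  */
  n = snprintf (pattern, sizeof pattern, "[[:%s:]]", name);
  if (n < 0 || (size_t) n >= sizeof pattern)
    return -1;
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  for (c = 1; c < 128; c++)
    {
      char text[2] = { (char) c, '\0' };
      int in_class = regexec (&re, text, 0, NULL, 0) == 0;
      if (in_class != (pred (c) != 0))
        bad++;
    }
  regfree (&re);
  return bad;
}
```

Under these assumptions, @code{class_disagreements ("alpha", isalpha)}
should return 0, and an unknown class name should make the function
return -1 because the pattern fails to compile.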
@node Range Operator
@subsection The Range Operator (@code{-})
Regex recognizes @dfn{range expressions} inside a list. They represent
those characters
that fall between two elements in the current collating sequence. You
form a range expression by putting a @dfn{range operator} between two
of any of the following: characters, collating elements, collating symbols,
and equivalence class expressions. The starting point of the range and
the ending point of the range don't have to be the same kind of item,
e.g., the starting point could be a collating element and the ending
point could be an equivalence class expression. If a range's ending
point is an equivalence class, then all the collating elements in that
class will be in the range.@footnote{You can't use a character class for the starting
or ending point of a range, since a character class is not a single
character.} @samp{-} represents the range operator. For example,
@samp{a-f} within a list represents all the characters from @samp{a}
through @samp{f}
inclusively.
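As a minimal sketch, the following C function lists the ASCII characters
that a list containing ranges matches. It assumes the POSIX interface
with @code{REG_EXTENDED} and the default @samp{C} locale, in which the
collating sequence follows the character codes; in other locales a range
may cover different characters. The function name @code{chars_matched}
is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Store in OUT, as a string, every ASCII character from 1 to 127 that
   the bracket expression LIST matches.  OUT must have room for 128
   bytes.  Return the number of characters stored, or -1 if LIST does
   not compile.  This is a hypothetical helper, not a Regex entry point.  */
int
chars_matched (const char *list, char *out)
{
  regex_t re;
  int c, count = 0;

  out[0] = '\0';
  if (regcomp (&re, list, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  for (c = 1; c < 128; c++)
    {
      char text[2] = { (char) c, '\0' };
      if (regexec (&re, text, 0, NULL, 0) == 0)
        out[count++] = (char) c;
    }
  out[count] = '\0';
  regfree (&re);
  return count;
}
```

Under these assumptions, @code{chars_matched ("[a-f]", buf)} should
return 6 and leave @samp{abcdef} in @code{buf}, since both end points
belong to the range.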
If the syntax bit @code{RE_NO_EMPTY_RANGES} is set, then if the range's
ending point collates less than its starting point, the range (and the
regular expression containing it) is invalid. For example, the regular