update docs
sharpchen committed Jun 15, 2024
1 parent 8b1341a commit 2c12341
Showing 9 changed files with 259 additions and 0 deletions.
@@ -0,0 +1,26 @@
# Match Literals

## Escaping

## Case-insensitive

Regex is case-sensitive by default. To ignore case, add a leading `(?i)`.

:::code-group

```regex
(?i)ascii
```

```cs
_ = new Regex(@"ascii", RegexOptions.IgnoreCase);
```

:::

To ignore case for only part of a regex, enclose that part as `(?i)<regex>(?-i)`.

The following matches `ASCIIasciiascii` but not `asciiasciiASCII`:

```regex
ASCII(?i)aScIi(?-i)ascii
```
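
As a minimal sketch of the pattern above in .NET (the variable names are illustrative only), the middle part accepts any casing while the outer parts stay case-sensitive:

```cs
using System;
using System.Text.RegularExpressions;

// Only the middle part of the pattern ignores case.
var partial = new Regex(@"ASCII(?i)aScIi(?-i)ascii");

bool lowerMiddle = partial.IsMatch("ASCIIasciiascii"); // middle part in lower case
bool upperMiddle = partial.IsMatch("ASCIIASCIIascii"); // middle part in upper case
bool wrongEdges = partial.IsMatch("asciiasciiASCII");  // outer parts have the wrong case

Console.WriteLine($"{lowerMiddle} {upperMiddle} {wrongEdges}"); // True True False
```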
@@ -0,0 +1,57 @@
# Match Nonprintable Characters

## ASCII control characters

Seven commonly used ASCII control characters can be written in either of two ways:

- an escape sequence such as `\a`
- a hexadecimal escape such as `\x07`

|Character|Meaning|Hex|
|---|---|---|
|`\a`|alert|`0x07`|
|`\e`|escape|`0x1B`|
|`\f`|form feed|`0x0C`|
|`\n`|new line|`0x0A`|
|`\r`|carriage return|`0x0D`|
|`\t`|horizontal tab|`0x09`|
|`\v`|vertical tab|`0x0B`|
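
A short sketch, assuming .NET's `Regex` class, showing that the escape form and the hexadecimal form match the same character (variable names are illustrative):

```cs
using System;
using System.Text.RegularExpressions;

string bell = "\a";       // U+0007
string tab = "\t";        // U+0009
string escape = "\u001B"; // U+001B

bool bellByEscape = Regex.IsMatch(bell, @"\a");
bool bellByHex = Regex.IsMatch(bell, @"\x07");
bool tabByHex = Regex.IsMatch(tab, @"\x09");
bool escapeByEscape = Regex.IsMatch(escape, @"\e");
bool tabIsNotBell = Regex.IsMatch(tab, @"\a");

Console.WriteLine($"{bellByEscape} {bellByHex} {tabByHex} {escapeByEscape} {tabIsNotBell}");
// True True True True False
```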

:::details click to check all ASCII control characters

|Control Character|Description|Caret Notation|Regex|Key Combination|
|---|---|---|---|---|
|NUL|Null Character|`^@`|`\c@`|Ctrl + @|
|SOH|Start of Header|`^A`|`\cA`|Ctrl + A|
|STX|Start of Text|`^B`|`\cB`|Ctrl + B|
|ETX|End of Text|`^C`|`\cC`|Ctrl + C|
|EOT|End of Transmission|`^D`|`\cD`|Ctrl + D|
|ENQ|Enquiry|`^E`|`\cE`|Ctrl + E|
|ACK|Acknowledge|`^F`|`\cF`|Ctrl + F|
|BEL|Bell|`^G`|`\cG`|Ctrl + G|
|BS|Backspace |`^H`|`\cH`|Ctrl + H|
|TAB|Horizontal Tab|`^I`|`\cI`|Ctrl + I|
|LF|Line Feed |`^J`|`\cJ`|Ctrl + J|
|VT|Vertical Tab|`^K`|`\cK`|Ctrl + K|
|FF|Form Feed |`^L`|`\cL`|Ctrl + L|
|CR|Carriage Return|`^M`|`\cM`|Ctrl + M|
|SO|Shift Out |`^N`|`\cN`|Ctrl + N|
|SI|Shift In |`^O`|`\cO`|Ctrl + O|
|DLE|Data Link Escape|`^P`|`\cP`|Ctrl + P|
|DC1|Device Control 1|`^Q`|`\cQ`|Ctrl + Q|
|DC2|Device Control 2|`^R`|`\cR`|Ctrl + R|
|DC3|Device Control 3|`^S`|`\cS`|Ctrl + S|
|DC4|Device Control 4|`^T`|`\cT`|Ctrl + T|
|NAK|Negative Acknowledge|`^U`|`\cU`|Ctrl + U|
|SYN|Synchronous Idle|`^V`|`\cV`|Ctrl + V|
|ETB|End of Transmission Block|`^W`|`\cW`|Ctrl + W|
|CAN|Cancel|`^X`|`\cX`|Ctrl + X|
|EM|End of Medium|`^Y`|`\cY`|Ctrl + Y|
|SUB|Substitute|`^Z`|`\cZ`|Ctrl + Z|
|ESC|Escape|`^[`|`\c[`|Ctrl + [|
|FS|File Separator|`^\`|`\c\`|Ctrl + \\|
|GS|Group Separator|`^]`|`\c]`|Ctrl + ]|
|RS|Record Separator|`^^`|`\c^`|Ctrl + ^|
|US|Unit Separator|`^_`|`\c_`|Ctrl + _|

:::
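
The `\c` notation can be checked with a small .NET sketch. Only a few letter-based entries are exercised here, and the variable names are illustrative:

```cs
using System;
using System.Text.RegularExpressions;

bool bellByControl = Regex.IsMatch("\a", @"\cG"); // BEL is Ctrl + G
bool tabByControl = Regex.IsMatch("\t", @"\cI");  // TAB is Ctrl + I
bool crByControl = Regex.IsMatch("\r", @"\cM");   // CR is Ctrl + M
bool lfIsNotCr = Regex.IsMatch("\n", @"\cM");     // LF is Ctrl + J, not Ctrl + M

Console.WriteLine($"{bellByControl} {tabByControl} {crByControl} {lfIsNotCr}");
// True True True False
```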
@@ -0,0 +1,69 @@
# Match One of Many Characters

## Character class

A *character class*, written inside `[]`, matches exactly one occurrence of any of the characters it lists.

```regex
c[ae]l[ae]nd[ae]r
```
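
As an illustration (the word list is made up for this sketch), the class pattern above accepts `calendar` and its `a`/`e` misspellings, but not a spelling that uses a different vowel:

```cs
using System;
using System.Text.RegularExpressions;

var calendar = new Regex(@"c[ae]l[ae]nd[ae]r");

// Each [ae] accepts exactly one character: either 'a' or 'e'.
string[] candidates = { "calendar", "celendar", "calandar", "calender", "calindar" };
foreach (string word in candidates)
{
    Console.WriteLine($"{word}: {calendar.IsMatch(word)}");
}

bool acceptsMisspelling = calendar.IsMatch("calender");
bool acceptsOtherVowel = calendar.IsMatch("calindar"); // 'i' is not in [ae]
```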

## Range operator

Specify a range of characters using `-`.

Match a single hexadecimal character:

```regex
[a-fA-F0-9]
```

> Reversed ranges such as `[f-a]` are not valid.

## Negation operator

Negate a character class with a leading `^`.

Match a single non-hexadecimal character:

```regex
[^a-fA-F0-9]
```
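
A minimal .NET sketch of the hexadecimal class and its negation. The sample strings and the `\A...\z` anchoring are choices made for this example:

```cs
using System;
using System.Text.RegularExpressions;

// A string made only of hexadecimal characters.
bool allHex = Regex.IsMatch("DeadBeef01", @"\A[a-fA-F0-9]+\z");

// 'g' is outside the ranges, so the whole-string check fails...
bool notAllHex = Regex.IsMatch("c0ffeeg", @"\A[a-fA-F0-9]+\z");

// ...and the negated class finds exactly that character.
string firstNonHex = Regex.Match("c0ffeeg", @"[^a-fA-F0-9]").Value;

Console.WriteLine($"{allHex} {notAllHex} {firstNonHex}"); // True False g
```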

## Escape inside character class

Besides the backslash itself, four special characters may need to be escaped inside a character class:

- `-` range operator
- `^` negation operator
- `[` and `]` start and end of character class

Any other character does not need to be escaped:

```regex
[$()*+.?{|]
```

A `^` that is not the first character of the class does not act as negation, so escaping it is optional:

```regex
[a-f^A-F\^0-9]
```

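The same applies to `-` and `[`/`]` in positions where they cannot be read as operators, for example a `-` placed first or last in the class.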

:::hint
It's recommended to always escape metacharacters in character classes
:::
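
The escaping rules in this section can be tried out with a sketch like the following. It assumes .NET's regex engine, and the third pattern is a hypothetical example that escapes all four special characters:

```cs
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Metacharacters that are literal inside a class without escaping.
var unescaped = new Regex(@"[$()*+.?{|]");
bool allLiteral = "$()*+.?{|".All(c => unescaped.IsMatch(c.ToString()));
bool letterMatches = unescaped.IsMatch("a");

// A caret that is not first in the class is literal, escaped or not.
var caret = new Regex(@"[a-f^A-F\^0-9]");
bool caretLiteral = caret.IsMatch("^");

// Escaping the four special characters always works.
var escaped = new Regex(@"[\-\^\[\]]");
bool escapedAll = "-^[]".All(c => escaped.IsMatch(c.ToString()));

Console.WriteLine($"{allLiteral} {letterMatches} {caretLiteral} {escapedAll}");
// True False True True
```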

## Shorthand character classes

- `\d` matches any single *digit*, equivalent to `[\d]` and `[0-9]`
- `\D` matches any character that is *not a digit*, equivalent to `[^\d]` and `[^0-9]`
- `\w` matches any *word character*, equivalent to `[a-zA-Z0-9_]`
- `\W` matches any character that is *not a word character*, equivalent to `[^\w]`
- `\s` matches any *whitespace character*, like tabs, spaces, line breaks.
- `\S` matches any character that is *not a whitespace character*

> In .NET, `\w` matches more than `[a-zA-Z0-9_]`: it also includes letters from other scripts, such as Cyrillic and Thai. Likewise, `\d` also matches digits from other scripts.
>
> `\s` also matches Unicode whitespace characters in .NET and JavaScript.
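
A small sketch of the Unicode behavior in .NET. The sample characters are arbitrary picks, and `RegexOptions.ECMAScript` is used here only as one way to get the ASCII-only meaning of the shorthands:

```cs
using System;
using System.Text.RegularExpressions;

string cyrillic = "\u0416";    // Cyrillic capital letter Zhe
string arabicDigit = "\u0663"; // Arabic-Indic digit three
string emSpace = "\u2003";     // em space

bool wordDefault = Regex.IsMatch(cyrillic, @"\w");
bool wordEcma = Regex.IsMatch(cyrillic, @"\w", RegexOptions.ECMAScript);
bool digitDefault = Regex.IsMatch(arabicDigit, @"\d");
bool digitEcma = Regex.IsMatch(arabicDigit, @"\d", RegexOptions.ECMAScript);
bool spaceDefault = Regex.IsMatch(emSpace, @"\s");

Console.WriteLine($"{wordDefault} {wordEcma} {digitDefault} {digitEcma} {spaceDefault}");
// True False True False True
```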
@@ -0,0 +1,20 @@
# Match Any Character

- By default, `.` matches any character except line breaks
- With the `Singleline` option enabled, `.` matches any character, including line breaks

```cs
_ = new Regex(@".", RegexOptions.Singleline);
// or
_ = new Regex(@"(?s).");
```

- `[\s\S]`, `[\w\W]` and `[\d\D]` each match any character, including line breaks, in any mode

:::Warning
Use `.` only when you really want to allow any character. Use a character class or negated character class in any other situation.
:::

## Mode modifier

Use `(?s)`/`(?-s)` to enable/disable `singleline` mode inside the pattern itself.
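
The difference between the modes shows up as soon as the subject contains a line break. A minimal sketch with a made-up two-line string:

```cs
using System;
using System.Text.RegularExpressions;

string twoLines = "a\nb";

bool defaultDot = Regex.IsMatch(twoLines, "a.b");                           // '.' stops at \n
bool singleline = Regex.IsMatch(twoLines, "a.b", RegexOptions.Singleline);  // API option
bool inlineMode = Regex.IsMatch(twoLines, "(?s)a.b");                       // inline modifier
bool turnedOff = Regex.IsMatch(twoLines, "(?s)a(?-s).b");                   // disabled before '.'
bool classTrick = Regex.IsMatch(twoLines, @"a[\s\S]b");                     // works in any mode

Console.WriteLine($"{defaultDot} {singleline} {inlineMode} {turnedOff} {classTrick}");
// False True True False True
```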
@@ -0,0 +1,30 @@
# Match Start and End of Line

- By default, `^abc` and `\Aabc` match `abc` at the start of the whole string
- By default, `abc$` and `abc\Z` match `abc` at the end of the whole string
- In multiline mode, `^abc` matches `abc` at the start of each line
- In multiline mode, `abc$` matches `abc` at the end of each line

```cs
_ = new Regex(@"^abcefg$", RegexOptions.Multiline);
```
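
To see the effect of multiline mode on the anchors, the following sketch counts matches in a made-up two-line string:

```cs
using System;
using System.Text.RegularExpressions;

string text = "abc\nabc";

// Without multiline mode the anchors refer to the whole string.
int startDefault = Regex.Matches(text, "^abc").Count;
int endDefault = Regex.Matches(text, "abc$").Count;

// With multiline mode they apply to every line.
int startMultiline = Regex.Matches(text, "^abc", RegexOptions.Multiline).Count;
int endMultiline = Regex.Matches(text, "abc$", RegexOptions.Multiline).Count;
int inlineMultiline = Regex.Matches(text, "(?m)^abc$").Count;

Console.WriteLine($"{startDefault} {endDefault} {startMultiline} {endMultiline} {inlineMultiline}");
// 1 1 2 2 2
```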

> - In multiline mode, `^` always matches after `\n`, so `\n^` is redundant
> - In multiline mode, `$` always matches before `\n`, so `$\n` is redundant
> - `\A\Z` matches an empty string and a string that consists of a single line break
> - `\A\z` matches only an empty string

:::hint
Always use `\A` and `\Z` instead of `^` and `$` when you want to match the start or end of the whole string
:::

## Mode modifier

Use `(?m)`/`(?-m)` to enable/disable `multiline` mode inside the pattern itself.

## Conclusion

- `\A` and `\Z` always match the start and end of the subject string, regardless of multiline mode
- `(?-m)^abc` and `(?-m)abc$` are equivalent to `\Aabc` and `abc\Z`
- `\z` matches only at the very end of the subject string
- `abc\Z` also matches before a final line break, while `abc\z` won't match if a line break follows `abc`
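
The points above can be verified with a short .NET sketch (sample strings are made up for illustration):

```cs
using System;
using System.Text.RegularExpressions;

// \Z tolerates one trailing line break, \z does not.
bool bigZ = Regex.IsMatch("abc\n", @"abc\Z");
bool smallZ = Regex.IsMatch("abc\n", @"abc\z");

// \A\Z accepts "" and "\n"; \A\z accepts only "".
bool emptyBigZ = Regex.IsMatch("", @"\A\Z");
bool newlineBigZ = Regex.IsMatch("\n", @"\A\Z");
bool newlineSmallZ = Regex.IsMatch("\n", @"\A\z");

// \A ignores multiline mode: only the first "abc" is found.
int startOfString = Regex.Matches("abc\nabc", @"\Aabc", RegexOptions.Multiline).Count;

Console.WriteLine($"{bigZ} {smallZ} {emptyBigZ} {newlineBigZ} {newlineSmallZ} {startOfString}");
// True False True True False 1
```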
@@ -0,0 +1,25 @@
# Match Whole Words

## Word boundary

`\b<word>\b` matches a whole word

`\b` strictly matches the following positions:

- Before the first character in subject string if it's a word character
- After the last character in subject string if it's a word character
- Between a word character and a character that is not a word character in subject string

- `\b<wordchar>` and `<nonwordchar>\b` only match at the start of a word
- `<wordchar>\b` and `\b<nonwordchar>` only match at the end of a word
- `<wordchar>\b<wordchar>` and `<nonwordchar>\b<nonwordchar>` can never match
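
A minimal sketch of word boundaries in .NET, using made-up sample text:

```cs
using System;
using System.Text.RegularExpressions;

bool wholeWord = Regex.IsMatch("My cat is brown", @"\bcat\b"); // "cat" stands alone
bool insideWord = Regex.IsMatch("category", @"\bcat\b");       // no boundary after "cat"
bool startOnly = Regex.IsMatch("category", @"\bcat");          // boundary before "cat" only
bool betweenWordChars = Regex.IsMatch("aa", @"a\ba");          // never a boundary between two word characters

Console.WriteLine($"{wholeWord} {insideWord} {startOnly} {betweenWordChars}");
// True False True False
```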

## Nonboundary

`\B` strictly matches the following positions:

- Before the first character in subject string if it's not a word character
- After the last character in subject string if it's not a word character
- Between two word characters
- Between two nonword characters
- At the only position of an empty string
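
The nonboundary positions can be checked in the same way. This is a sketch with illustrative strings, assuming .NET's definition of word characters:

```cs
using System;
using System.Text.RegularExpressions;

// Only the "cat" buried inside "staccato" has word characters on both sides.
int innerMatches = Regex.Matches("staccato cat", @"\Bcat\B").Count;

bool atWordStart = Regex.IsMatch("cat", @"\Bcat");  // start of a word is a boundary
bool betweenNonword = Regex.IsMatch("--", @"-\B-"); // between two nonword characters
bool emptyString = Regex.IsMatch("", @"\B");        // empty subject string

Console.WriteLine($"{innerMatches} {atWordStart} {betweenNonword} {emptyString}");
// 1 False True True
```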
@@ -0,0 +1,10 @@
# Match One of Many Alternatives

`cat|dog|bird` matches one of `cat`, `dog` and `bird`.

> The order of the alternatives in the regex matters only when two of them *can match at the same position* in the string.

Alternatives are *short-circuited* (or *eager*). Once an alternative matches at the current position, the alternatives to its right are not tried there unless the rest of the pattern fails.

So `Jane|Janet` can't match `Janet` in `Her name is Janet`; only `Jane` is matched.
To match whole words, use `\bJane\b|\bJanet\b` instead.
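
The eager behavior is easy to observe in .NET. A minimal sketch using the sentence from the text:

```cs
using System;
using System.Text.RegularExpressions;

string subject = "Her name is Janet";

// The first alternative wins, so only "Jane" is returned.
string eager = Regex.Match(subject, "Jane|Janet").Value;

// Word boundaries make "Jane" fail before the trailing 't', so "Janet" is tried.
string wholeWord = Regex.Match(subject, @"\bJane\b|\bJanet\b").Value;

Console.WriteLine($"{eager} {wholeWord}"); // Jane Janet
```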
@@ -0,0 +1,22 @@
# Grouping and Captured Groups

## Grouping alternatives

A more compact way to write `\bJane\b|\bJanet\b` is `\b(Jane|Janet)\b`, using `()` to group the alternatives.
However, this also creates a capturing group. If you don't need to reuse the captured text, see [Noncapturing groups](#noncapturing-groups).

## Noncapturing groups

`\b(?:Jane|Janet)\b` uses `(?:)` to group without capturing. The text matched by the group is not stored, which can improve performance.
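
The difference between the two kinds of group is visible in the `Groups` collection of a match. A sketch with an illustrative subject string:

```cs
using System;
using System.Text.RegularExpressions;

string subject = "Her name is Janet";

Match capturing = Regex.Match(subject, @"\b(Jane|Janet)\b");
Match noncapturing = Regex.Match(subject, @"\b(?:Jane|Janet)\b");

// Group 0 is always the whole match; group 1 exists only for the capturing form.
int capturingGroups = capturing.Groups.Count;
int noncapturingGroups = noncapturing.Groups.Count;
string captured = capturing.Groups[1].Value;

Console.WriteLine($"{capturingGroups} {noncapturingGroups} {captured}"); // 2 1 Janet
```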

## Group with mode modifier

Use `(?i:<regex>)`, `(?s:<regex>)` or `(?m:<regex>)` to apply a mode to the grouped alternatives only:

- `\b(?i:Jane|Janet)\b`

To combine different modes:

- `(?ism:<regex>)` enables `case-insensitive`, `singleline` and `multiline`
- `(?-ism:<regex>)` disables `case-insensitive`, `singleline` and `multiline`
- `(?i-sm:<regex>)` enables `case-insensitive` and disables `singleline` and `multiline`
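
A sketch of mode-modified groups in .NET. The `Doe` surname and the `a.b` patterns are hypothetical additions for this example:

```cs
using System;
using System.Text.RegularExpressions;

// The modifier applies only inside the group.
bool scopedIgnoreCase = Regex.IsMatch("JANET", @"\b(?i:Jane|Janet)\b");

var mixed = new Regex(@"(?i:jane) Doe");
bool insideGroup = mixed.IsMatch("JANE Doe");  // casing differs only inside the group
bool outsideGroup = mixed.IsMatch("JANE DOE"); // "Doe" is still case-sensitive

// Combined modifiers: 'i' on and 's' on, versus 'i' on and 's' off.
bool bothOn = Regex.IsMatch("A\nB", @"(?is:a.b)");
bool singlelineOff = Regex.IsMatch("A\nB", @"(?i-s:a.b)");

Console.WriteLine($"{scopedIgnoreCase} {insideGroup} {outsideGroup} {bothOn} {singlelineOff}");
// True True False True False
```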
Empty file.
