This is a personal memo.
Note: Some not-so-popular Unicode characters are used. Recommended fonts to see this document are:
Throughout this document, I use these symbols:
Symbol | Meaning |
---|---|
▮ | EOF |
A | Alphabets and numbers |
. | One of the "wordSeparators" characters |
␣ | Whitespace |
⏎ | End of line |
Additionally, the location of the cursor after executing a command is expressed by
a vertical bar (|) in a sequence of symbols. For example, A|▮ means that,
assuming there is a sequence consisting of alphabets and numbers at the cursor
location and nothing follows the sequence (EOF), the command under discussion
moves the cursor to just after the sequence.
- cursorWordRight
  - Procedure:
    - If the cursor is at the end of the document, return the original position.
    - If the cursor is at the end of a line, return the start position of the next line.
    - If the cursor is at one or more whitespace (WSP) characters, skip the WSPs starting there.
    - If no characters exist after the WSPs, return that position.
    - If there is a non-WSP character after the WSPs, return the end position of the non-WSP character sequence that starts with it.
  - Illustration:
    |▮ A|▮ .|▮ ␣|▮ ⏎|▮ .|A ␣A|▮ ⏎|A ␣A|. ␣A|␣ ␣A|⏎ A|. ␣|. ⏎|. A|␣ .|␣ ⏎|␣ A|⏎ .|⏎ ␣|⏎ ⏎|⏎
- cursorWordStartRight
  - Procedure:
    - If the cursor is at the end of the document, return the original position.
    - If the cursor is at the end of a line, return the start position of the next line.
    - Find the end position of the sequence that starts with the character at the cursor. Then return the position where the WSPs following that sequence end.
  - Illustration:
    |▮ A|▮ .|▮ ␣|▮ ⏎|▮ .|A ␣|A ⏎|A A|. ␣|. ⏎|. A␣|▮ .␣|▮ ⏎|␣ A␣|A .␣|A A␣|. .␣|. A␣|⏎ .␣|⏎ A|⏎ .|⏎ ␣|⏎ ⏎|⏎
- cursorWordEndLeft
  - Procedure:
    - If the cursor is at the start of the document, return the original position.
    - If the cursor is at the start of a line, return the end position of the previous line.
    - Find the start position of the sequence that ends at the cursor position. Then return the position where the WSPs preceding that sequence start.
  - Illustration:
    ▮| ▮|A ▮|. ▮|␣ ▮|⏎ A|. A|␣ A|⏎ .|A .|␣ .|⏎ ▮|␣A ▮|␣. ␣|⏎ A|␣A A|␣. .|␣A .|␣. ⏎|␣A ⏎|␣. ⏎|A ⏎|. ⏎|␣ ⏎|⏎
This logic can be implemented as a finite-state automaton, but I feel that doing so is "overkill". So I implemented these commands as imperative procedures.
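As a rough illustration of the imperative style, the three procedures above can be sketched in Python as follows. This is a minimal sketch and not the actual implementation. It assumes that the document is a single string with "\n" as the end of line, that a cursor position is an index into that string, and that a "sequence" is a run of characters of the same class. `SEPARATORS` is a made-up subset of separator characters, and all function and constant names are hypothetical. The sketch follows the written steps literally and has not been checked against every pattern in the illustrations.

```python
# Hypothetical sketch of the three procedures. Names and the separator set
# are assumptions for illustration, not taken from any real implementation.
SEPARATORS = ".,;:!?()[]{}<>-=+*/"  # illustrative subset of separators

WSP, EOL, SEP, WORD = "wsp", "eol", "sep", "word"


def classify(ch):
    """Return the character class of a single character."""
    if ch == "\n":
        return EOL
    if ch in " \t":
        return WSP
    if ch in SEPARATORS:
        return SEP
    return WORD


def skip_forward(text, pos, cls):
    """Advance pos while the character at pos belongs to class cls."""
    while pos < len(text) and classify(text[pos]) == cls:
        pos += 1
    return pos


def skip_backward(text, pos, cls):
    """Move pos back while the character before pos belongs to class cls."""
    while pos > 0 and classify(text[pos - 1]) == cls:
        pos -= 1
    return pos


def cursor_word_right(text, pos):
    if pos >= len(text):      # end of document: stay where we are
        return pos
    if text[pos] == "\n":     # end of line: go to the start of the next line
        return pos + 1
    pos = skip_forward(text, pos, WSP)
    if pos >= len(text) or text[pos] == "\n":
        return pos            # nothing follows the whitespace on this line
    # Stop at the end of the run that starts at the first non-WSP character.
    return skip_forward(text, pos, classify(text[pos]))


def cursor_word_start_right(text, pos):
    if pos >= len(text):
        return pos
    if text[pos] == "\n":
        return pos + 1
    # Skip the run under the cursor, then any whitespace that follows it.
    pos = skip_forward(text, pos, classify(text[pos]))
    return skip_forward(text, pos, WSP)


def cursor_word_end_left(text, pos):
    if pos <= 0:                  # start of document: stay where we are
        return pos
    if text[pos - 1] == "\n":     # start of line: end of the previous line
        return pos - 1
    # Skip the run that ends at the cursor, then any whitespace before it.
    pos = skip_backward(text, pos, classify(text[pos - 1]))
    return skip_backward(text, pos, WSP)
```

Under these assumptions, `cursor_word_right("foo.bar", 0)` returns 3 (the A|. pattern), and `cursor_word_end_left("foo  bar", 8)` returns 3 (the A|␣A pattern).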
VSCode has two sets of word-by-word cursor movement logic. The first is used in most cases, that is, for everything except "word part" related actions. The second is used for "word part" related actions.
Commands of the second kind have "part" in their names (e.g., cursorWordPartRight), and they can recognize words inside camelCasedWords or snake_case_words. It seems that these commands are not affected by the "wordSeparators" configuration.
- cursorWordEndRight
  |▮ ⏎|▮ A|▮ ⏎A|▮ A|. ⏎A|. A|␣ ⏎A|␣ A|⏎ ⏎A|⏎ .|▮ ⏎.|▮ .|A ⏎.|A .|␣ ⏎.|␣ .|⏎ ⏎.|⏎ ␣|▮ ⏎␣|▮ ␣A|▮ ⏎␣A|▮ ␣A|. ⏎␣A|. ␣A|␣ ⏎␣A|␣ ␣A|⏎ ⏎␣A|⏎ ␣.|▮ ⏎␣.|▮ ␣.|A ⏎␣.|A ␣.|␣ ⏎␣.|␣ ␣.|⏎ ⏎␣.|⏎ ␣|⏎ ⏎␣|⏎ ⏎|⏎
- cursorWordStartRight
  |▮ A|▮ .|▮ ⏎|▮ .|A ⏎|A A|. ⏎|. A␣|▮ .␣|▮ ␣|▮ ⏎␣|▮ A␣|A .␣|A ␣|A ⏎␣|A A␣|. .␣|. ␣|. ⏎␣|. A␣|⏎ .␣|⏎ ␣|⏎ ⏎␣|⏎ A|⏎ .|⏎ ⏎|⏎
Essentially, the difference between the "word part" commands and the default ones is that the "word part" commands can also stop inside a sequence of alphabets when one of the following conditions is met:
- The previous character is an underscore and the next is not an underscore (for snake_cased_words)
- The previous character is a lowercase alphabet and the next is an uppercase alphabet (for camelCasedWords or PascalCasedWords)
- The previous character is an uppercase alphabet, the next is an uppercase alphabet, and the one after that is a lowercase alphabet (for all-capital words inside camelCASEDWords or PascalCASEDWords)
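The three conditions above can be written down directly as a predicate. The following is a minimal Python sketch that only encodes the conditions as listed. It is not VSCode's code, and the names `is_word_part_boundary` and `word_part_stops` are hypothetical.

```python
def is_word_part_boundary(text, i):
    """Return True if a stop is allowed between text[i - 1] and text[i]."""
    if i <= 0 or i >= len(text):
        return False
    prev, nxt = text[i - 1], text[i]
    # snake_cased_words: an underscore followed by a non-underscore
    if prev == "_" and nxt != "_":
        return True
    # camelCasedWords or PascalCasedWords: lowercase followed by uppercase
    if prev.islower() and nxt.isupper():
        return True
    # camelCASEDWords: uppercase, uppercase, then lowercase
    if prev.isupper() and nxt.isupper():
        return i + 1 < len(text) and text[i + 1].islower()
    return False


def word_part_stops(text):
    """List every index inside text where one of the conditions holds."""
    return [i for i in range(1, len(text)) if is_word_part_boundary(text, i)]
```

For example, `word_part_stops("camelCASEDWords")` returns `[5, 10]`, which corresponds to camel|CASED|Words.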
Vim separates words by character classification.
When classifying a character, Vim first checks whether its code point is 0xFF
or less. If so, the character is classified as white space, punctuation, or a
"word character", as specified by the iskeyword option (the counterpart of
wordSeparators in VSCode). If the code point is greater than 0xFF, Vim
classifies it by this basic rule: white space is 0, punctuation is 1, emoji
are 3, and all others are equal to or greater than 2 (but not 3).
Punctuation characters in various languages and known character sets are defined
in a table, and each resolves to either 1 or the code point of the first
character in the set.
For example, unique class values are assigned to both Hiragana and Katakana, so they are always separated from other character types.
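The classification described above can be sketched roughly as follows. This is a simplified Python sketch, not a port of Vim's code. The function name `char_class`, the `KEYWORD_CHARS` stand-in for iskeyword, the four-entry `CLASS_TABLE`, and the single emoji range are all assumptions chosen for illustration. The real table covers far more character sets.

```python
import string

# Stand-in for the iskeyword option (assumed: ASCII letters, digits, "_").
KEYWORD_CHARS = set(string.ascii_letters + string.digits + "_")

# (first, last, class) ranges. An illustrative excerpt only.
CLASS_TABLE = [
    (0x3000, 0x3000, 0),       # ideographic space: white space
    (0x3001, 0x3020, 1),       # CJK punctuation: punctuation
    (0x3040, 0x309F, 0x3040),  # Hiragana: own class (first code point)
    (0x30A0, 0x30FF, 0x30A0),  # Katakana: own class (first code point)
]

EMOJI_RANGES = [(0x1F600, 0x1F64F)]  # assumed subset: the emoticons block


def char_class(ch):
    """Return 0 (white space), 1 (punctuation), 3 (emoji), or 2 and above."""
    cp = ord(ch)
    if cp <= 0xFF:
        # Small code points are decided by the keyword setting.
        if ch in " \t":
            return 0
        return 2 if ch in KEYWORD_CHARS else 1
    if any(lo <= cp <= hi for lo, hi in EMOJI_RANGES):
        return 3
    for first, last, cls in CLASS_TABLE:
        if first <= cp <= last:
            return cls
    return 2  # everything else is an ordinary word character
```

With this table, `char_class("あ")` and `char_class("ア")` return 0x3040 and 0x30A0 respectively, so a run of Hiragana and a run of Katakana are treated as different words.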
- src/search.c
- src/mbyte.c