Tokenize strings of text: extract arrays of words and, optionally, numbers, emojis, tags, usernames and email addresses. Works in Node.js and the browser. For when you need more than a plain [a-z] regular expression. Part of the document processing for search-index and nowsearch.xyz.
Inspired by extractwords
From v8.0.0, the emojis regular expression extracts single emojis, so there are no more "words" formed by several emojis in a row. This is because each emoji is, in a sense, a word. You can still write a custom regular expression that grabs several consecutive emojis as one item, for example const customEmojis = '\\p{Emoji_Presentation}+', and pass it as your custom regex.
Meaning that instead of:
extract('A ticket to 大阪 costs ¥2000 👌😄 😢', { regex: emojis})
// ['👌😄', '😢']
...you will get:
extract('A ticket to 大阪 costs ¥2000 👌😄 😢', { regex: emojis})
// ['👌', '😄', '😢']
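The same difference can be reproduced with nothing but the built-in String.prototype.match. This is a minimal sketch: the two patterns are assumptions standing in for the older and the newer behaviour and are not copied from the library's source.

```javascript
const ticketText = 'A ticket to 大阪 costs ¥2000 👌😄 😢'

// One match per emoji (the behaviour described for v8.0.0 and later).
const singleEmojis = ticketText.match(/\p{Emoji_Presentation}/gu)

// A run of adjacent emojis becomes one match (the older behaviour).
const emojiRuns = ticketText.match(/\p{Emoji_Presentation}+/gu)

console.log(singleEmojis) // [ '👌', '😄', '😢' ]
console.log(emojiRuns)    // [ '👌😄', '😢' ]
```

The only change between the two patterns is the trailing + quantifier, which is what a custom "several emojis as one item" regex would need.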
const { extract, words, numbers, emojis, tags, usernames, email } = require('words-n-numbers')
// extract, words, numbers, emojis, tags, usernames, email available
import { extract, words, numbers, emojis, tags, usernames, email } from 'words-n-numbers'
// extract, words, numbers, emojis, tags, usernames, email available
<script src="https://cdn.jsdelivr.net/npm/words-n-numbers/dist/words-n-numbers.umd.min.js"></script>
<script>
//wnn.extract, wnn.words, wnn.numbers, wnn.emojis, wnn.tags, wnn.usernames, wnn.email available
</script>
A simple browser demo of wnn to show how it works.
The default regex should catch every Unicode character from every language. The default regex flags are giu. The emojisCustom regex won't work with the u flag (unicode).
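To illustrate why a Unicode-aware pattern matters, here is a small sketch comparing an ASCII-only character class with a Unicode letter class. The \p{L}+ pattern is an assumption chosen for illustration, not necessarily the pattern the library ships.

```javascript
const mixedText = 'Smörgåsbord in 大阪'

// ASCII-only class: accented letters split the word and CJK is dropped.
const asciiWords = mixedText.match(/[a-z]+/gi)

// Unicode letter class; \p{...} escapes require the u flag.
const unicodeWords = mixedText.match(/\p{L}+/giu)

console.log(asciiWords)   // [ 'Sm', 'rg', 'sbord', 'in' ]
console.log(unicodeWords) // [ 'Smörgåsbord', 'in', '大阪' ]
```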
const stringOfWords = 'A 1000000 dollars baby!'
extract(stringOfWords)
// returns ['A', 'dollars', 'baby']
const stringOfWords = 'A 1000000 dollars baby!'
extract(stringOfWords, { toLowercase: true })
// returns ['a', 'dollars', 'baby']
const stringOfWords = 'A 1000000 dollars baby!'
extract(stringOfWords, { regex: [words, numbers], toLowercase: true })
// returns ['a', '1000000', 'dollars', 'baby']
const stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
extract(stringOfWords, { regex: [words, emojis], toLowercase: true })
// returns [ 'a', 'ticket', 'to', '大阪', 'costs', '👌', '😄', '😢' ]
const stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
extract(stringOfWords, { regex: [numbers, emojis], toLowercase: true })
// returns [ '2000', '👌', '😄', '😢' ]
const stringOfWords = 'A ticket to 大阪 costs ¥2000 👌😄 😢'
extract(stringOfWords, { regex: [words, numbers, emojis], toLowercase: true })
// returns [ 'a', 'ticket', 'to', '大阪', 'costs', '2000', '👌', '😄', '😢' ]
const stringOfWords = 'A #49ticket to #大阪 or two#tickets costs ¥2000 👌😄😄 😢'
extract(stringOfWords, { regex: tags, toLowercase: true })
// returns [ '#49ticket', '#大阪' ]
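Note that two#tickets is not returned: a tag only counts when the # starts a token. One way to get that behaviour is a negative lookbehind, sketched below. The tagPattern here is hypothetical and written for illustration; the library's own tags regex may be built differently.

```javascript
const tagText = 'A #49ticket to #大阪 or two#tickets'

// Hypothetical tag pattern: a # that is not preceded by a letter or digit,
// followed by one or more letters or digits in any script.
const tagPattern = /(?<![\p{L}\p{N}])#[\p{L}\p{N}]+/gu

const foundTags = tagText.match(tagPattern)
console.log(foundTags) // [ '#49ticket', '#大阪' ]
```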
const stringOfWords = 'A #ticket to #大阪 costs [email protected], @alice and @美林 ¥2000 👌😄😄 😢'
extract(stringOfWords, { regex: usernames, toLowercase: true })
// returns [ '@alice', '@美林' ]
const stringOfWords = 'A #ticket to #大阪 costs [email protected], [email protected], [email protected] and @美林 ¥2000 👌😄😄 😢'
extract(stringOfWords, { regex: email, toLowercase: true })
// returns [ '[email protected]', '[email protected]', '[email protected]' ]
const stringOfWords = 'A #ticket to #大阪 costs [email protected], [email protected], [email protected] and @美林 ¥2000 👌😄😄 😢👩🏽‍🤝‍👨🏻 👩🏽‍🤝‍👨🏻'
extract(stringOfWords, { regex: emojisCustom, flags: 'g' })
// returns [ '👌', '😄', '😄', '😢', '👩🏽‍🤝‍👨🏻', '👩🏽‍🤝‍👨🏻' ]
Some characters need to be escaped, like \ and '. You escape them with a backslash (\).
const stringOfWords = 'This happens at 5 o\'clock !!!'
extract(stringOfWords, { regex: '[a-z\'0-9]+' })
// returns ['This', 'happens', 'at', '5', 'o\'clock']
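The reason for the extra backslashes is that a regex given as a string is parsed twice: first as a JavaScript string literal, then as a regular expression. The sketch below uses the built-in RegExp constructor directly, on the assumption that a custom regex string ends up there; the variable names are made up for the example.

```javascript
// In a string literal, '\d' is just 'd', so the backslash must be doubled.
const unescaped = new RegExp('\d+', 'g')
const escaped = new RegExp('\\d+', 'g')

console.log(unescaped.source) // d+
console.log(escaped.source)   // \d+

const clockText = 'This happens at 5 o\'clock !!!'
console.log(clockText.match(escaped)) // [ '5' ]

// The escaped quote only matters to the string literal; the regex sees a plain '.
const clockWords = clockText.match(new RegExp('[a-z\'0-9]+', 'giu'))
console.log(clockWords) // [ 'This', 'happens', 'at', '5', "o'clock" ]
```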
Returns an array of words and optionally numbers.
extract(stringOfText, <options-object>)
{
  regex: 'custom or predefined regex', // defaults to words
  toLowercase: [true / false], // defaults to false
  flags: 'gmixsuUAJD' // regex flags, defaults to giu - /[regexPattern]/[regexFlags]
}
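To make the role of each option concrete, here is a rough stand-in written from the description above. sketchExtract, its default pattern and its empty-array fallback are all assumptions made for illustration; this is not the library's implementation.

```javascript
// Hypothetical stand-in for extract(); not the library's implementation.
function sketchExtract (text, options = {}) {
  // Assumed defaults: a Unicode letter pattern, no lowercasing, giu flags.
  const { regex = '\\p{L}+', toLowercase = false, flags = 'giu' } = options
  const input = toLowercase ? text.toLowerCase() : text
  // Builds /[regexPattern]/[regexFlags] from the two strings.
  // Assumption: return an empty array when nothing matches.
  return input.match(new RegExp(regex, flags)) || []
}

console.log(sketchExtract('A 1000000 dollars baby!'))
// [ 'A', 'dollars', 'baby' ]
console.log(sketchExtract('A 1000000 dollars baby!', { regex: '\\p{L}+|\\p{N}+', toLowercase: true }))
// [ 'a', '1000000', 'dollars', 'baby' ]
```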
You can add an array of different regexes or just a string. If you add an array, they will be joined with a | separator, making it an OR-regex. Put email, usernames and tags before words to get the extraction right.
// email addresses before usernames before words can give a different outcome
extract(oldString, { regex: [email, usernames, words] })
// than words before usernames before email addresses
extract(oldString, { regex: [words, usernames, email] })
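The order matters because a regex alternation tries its branches from left to right at each position and takes the first one that matches. The sketch below shows the effect with three deliberately simplified patterns. emailLike, usernameLike and wordLike are hypothetical stand-ins, and the address is an invented example.

```javascript
// Simplified, hypothetical patterns; the library's own are more thorough.
const emailLike = '[\\w.]+@[\\w.]+\\.\\w+'
const usernameLike = '@\\w+'
const wordLike = '\\w+'
const contactText = 'mail bob@example.com or ping @alice'

// Joining with | turns the array into a single OR-regex.
const specificFirst = new RegExp([emailLike, usernameLike, wordLike].join('|'), 'g')
const genericFirst = new RegExp([wordLike, usernameLike, emailLike].join('|'), 'g')

console.log(contactText.match(specificFirst))
// [ 'mail', 'bob@example.com', 'or', 'ping', '@alice' ]
console.log(contactText.match(genericFirst))
// [ 'mail', 'bob', '@example', 'com', 'or', 'ping', '@alice' ]
```

With the generic word pattern first, the address is consumed piece by piece before the email branch gets a chance to match it whole.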
words // only words, any language <-- default
numbers // only numbers
emojis // only emojis
emojisCustom // only emojis. Works with the `g`-flag, not `giu`. Based on custom emoji extractor from https://github.com/mathiasbynens/rgi-emoji-regex-pattern
tags // #tags (any language)
usernames // @usernames (any language)
email // email addresses. Most valid addresses,
// but not to be used as a validator
All but one regex use the giu flags. The exception is emojisCustom, which needs only the g flag. emojisCustom was added because the standard emojis regex, based on \\p{Emoji_Presentation}, isn't able to grab all emojis. When browsers support \p{RGI_Emoji} under the giu flags, the library will be changed.
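A quick way to see the limitation is to run a plain \p{Emoji_Presentation} pattern over an emoji that is really a sequence of several code points glued together with zero-width joiners. The sketch below uses only built-ins; the variable names are made up for the example.

```javascript
// Woman and man holding hands with skin tones: seven code points,
// joined by ZWJ (U+200D). Written as escapes so the sequence is unambiguous.
const couple = '\u{1F469}\u{1F3FD}\u200D\u{1F91D}\u200D\u{1F468}\u{1F3FB}'

const pieces = couple.match(/\p{Emoji_Presentation}/gu)

console.log([...couple].length)      // 7
console.log(pieces.length > 1)       // true: the sequence is split into parts
console.log(pieces.includes(couple)) // false: it is never returned as one item
```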
Supports most languages supported by stopword, and others too. Some languages, like Japanese and Simplified Chinese, need to be tokenized first. Tokenizers may be added at a later stage.
PRs and issues are more than welcome =)