This plugin adds the Homoglyph token filter, which replaces homoglyphs and various unicode versions of alphanumeric characters with the character itself. For example, the tokens "héllo" will be replaced with "hello".
In case a character in the token has more than one homoglyphic alphanumeric characters, a token with each of the possibilities will be generated. For example, the character 'l' which, in some fonts, is homoglyphic to both lower case L and the digit 1, will be mapped to both of them. The resulting set of tokens will contain all of the permutations of all of the possible homoglyphs. The token 'hello' will be replaced with the following tokens:
- hello
- hel1o
- he1lo
- he11o
- The filter is not case sensitive, and the resulting tokens will be lowercase. Uppercase letters are mapped to their lowercase equivalents.
- The original token is not preserved. If no character in the token has alphanumeric homoglyphs, the token is returned as is.
- If the token can be translated to too many ascii tokens (more than 200), it will produce no translations at all, and will just be returned as is. This is to protect from out-of-memory problems.
Plugin version | Elasticsearch version |
---|---|
8.x | 8.x |
7.17.3 | 7.17.3 |
0.2.3 | 7.10.0 |
0.2.2 | 7.8.1 |
0.2.1 | 7.6.2 |
0.2.0 | 7.3.1 |
0.1.1 | 7.3.1 |
0.1.0 | 7.3.1 |
(version 8.8.0 is not available because of the absence of elasticsearch-preallocate
8.8.0, as seen here. I will add an 8.8.0 release if and when this problem is fixed)
To install the plugin on an Elasticsearch node make sure you are in the Elasticsearch directory and then run:
bin/elasticsearch-plugin install https://github.com/intsights/elasticsearch-analysis-homoglyph/releases/download/v0.2.2/analysis-homoglyph-0.2.2.zip
You can verify the plugin has been installed by running:
bin/elasticsearch-plugin list
and making sure analysis-plugin
is in the list of installed plugins.
For the plugin to start working you need to restart the node.
You can create a Docker container with the plugin installed. Save the following in a Dockerfile:
FROM docker.elastic.co/elasticsearch/elasticsearch:7.6.2
RUN /usr/share/elasticsearch/bin/elasticsearch-plugin install https://github.com/intsights/elasticsearch-analysis-homoglyph/releases/download/v0.2.2/analysis-homoglyph-0.2.2.zip
Then run:
docker build -f /path/to/Dockerfile -t elaticsearch-with-homoglyph-plugin:0.2.2 .
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" elaticsearch-with-homoglyph-plugin:0.2.2
Depending on your system, you might need to run the Docker with sudo
PUT /homoglyph_example
{
"settings" : {
"analysis" : {
"analyzer" : {
"standard_homoglyph" : {
"tokenizer" : "standard",
"filter" : ["homoglyph"]
}
}
}
}
}