Merge pull request #125 from WorksApplications/feature/update-doc
Update test, doc, build script
mh-northlander authored May 29, 2024
2 parents 28d6e94 + 311e687 commit 83102de
Showing 9 changed files with 245 additions and 68 deletions.
README.md (130 changes: 91 additions & 39 deletions)

a. Using the release package
```
$ bin/elasticsearch-plugin install https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v3.1.1/analysis-sudachi-8.13.4-3.1.1.zip
```
b. Using a self-built package
```
$ bin/elasticsearch-plugin install file:///path/to/analysis-sudachi-8.13.4-3.1.1.zip
```
(Specify the absolute path in URI format)
3. Download the Sudachi dictionary archive from https://github.com/WorksApplications/SudachiDict
If you want to update Sudachi that is included in a plugin you have installed, do the following:

2. Extract the Sudachi JAR file from the zip.
3. Delete the sudachi JAR file in $ES_HOME/plugins/analysis-sudachi and replace it with the JAR file you extracted in step 2.

# Analyzer

An analyzer named "sudachi" is provided.
This is equivalent to the following custom analyzer.

```json
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default_sudachi_analyzer": {
            "type": "custom",
            "tokenizer": "sudachi_tokenizer",
            "filter": [
              "sudachi_baseform",
              "sudachi_part_of_speech",
              "sudachi_ja_stop"
            ]
          }
        }
      }
    }
  }
}
```

See the following sections for the details of the tokenizer and each filter.
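
For example, you can inspect its output with the `_analyze` API; this is a minimal sketch that assumes the plugin and a Sudachi dictionary are already installed (`POST _analyze`):

```json
{
  "analyzer": "sudachi",
  "text": "関西国際空港"
}
```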

# Tokenizer

- split_mode: Select the splitting mode of Sudachi. (A, B, C) (string, default: C)
  - C: Extracts named entities
    - Ex) 選挙管理委員会
  - B: Into the middle units
    - Ex) 選挙/管理/委員会
  - A: Equivalent to UniDic short unit
    - Ex) 選挙/管理/委員/会
- discard_punctuation: Select to discard punctuation or not. (bool, default: true)
- settings_path: Sudachi setting file path. The path may be absolute or relative; relative paths are resolved with respect to es_config. (string, default: null)
- resources_path: Sudachi dictionary path. The path may be absolute or relative; relative paths are resolved with respect to es_config. (string, default: null)
- additional_settings: Describes a configuration JSON string for Sudachi. This JSON string will be merged into the default configuration. If this property is set, `settings_path` will be overridden.

## Dictionary

By default, `ES_HOME/config/sudachi/system_core.dic` is used.
You can specify the dictionary either in the file specified by `settings_path` or by `additional_settings`.
Due to the security manager, you need to put resources (setting file, dictionaries, and others) under the elasticsearch config directory.
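
For example, here is a minimal sketch that loads a custom Sudachi setting file placed under the config directory; the file name `sudachi/sudachi.json` is an assumption for illustration, not a file shipped with the plugin:

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "settings_path": "sudachi/sudachi.json",
            "resources_path": "sudachi"
          }
        }
      }
    }
  }
}
```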

## Example

tokenizer configuration

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "split_mode": "C",
            "discard_punctuation": true,
            "resources_path": "/etc/elasticsearch/config/sudachi"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "type": "custom",
            "tokenizer": "sudachi_tokenizer"
          }
        }
      }
    }
  }
}
```

dictionary settings

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "additional_settings": "{\"systemDict\":\"system_core.dic\",\"userDict\":[\"user.dic\"]}"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "type": "custom",
            "tokenizer": "sudachi_tokenizer"
          }
        }
      }
    }
  }
}
```
# Filters

## sudachi_split

This filter works like `mode` of kuromoji.

- mode
  - "search": Additional segmentation useful for search. (Use C and A mode)
    - Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ
  - "extended": Similar to search mode, but also unigram unknown words.
    - Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ, ア, バ, ラ, カ, ダ, ブ, ラ

Note: In a search query, split subwords are handled as a phrase (in the same way as multi-word synonyms). If you want to search with both A and C units, use multiple tokenizers instead, as sketched below.
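
For reference, here is a minimal sketch of such a setup, indexing one field with both split modes; the tokenizer, analyzer, and field names (`sudachi_a_tokenizer`, `title`, and so on) are illustrative assumptions, not names defined by the plugin:

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_a_tokenizer": {
            "type": "sudachi_tokenizer",
            "split_mode": "A"
          },
          "sudachi_c_tokenizer": {
            "type": "sudachi_tokenizer",
            "split_mode": "C"
          }
        },
        "analyzer": {
          "sudachi_a_analyzer": {
            "type": "custom",
            "tokenizer": "sudachi_a_tokenizer"
          },
          "sudachi_c_analyzer": {
            "type": "custom",
            "tokenizer": "sudachi_c_tokenizer"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "sudachi_c_analyzer",
        "fields": {
          "a": {
            "type": "text",
            "analyzer": "sudachi_a_analyzer"
          }
        }
      }
    }
  }
}
```

With a mapping like this, a query can target `title` (C units) and `title.a` (A units) together, for example with a `multi_match` query over both fields.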

### PUT sudachi_sample

```json
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_searchfilter": {
            "type": "sudachi_split",
            "mode": "search"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": ["my_searchfilter"],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
```

### POST sudachi_sample/_analyze

```json
{
  "analyzer": "sudachi_analyzer",
  "text": "関西国際空港"
}
```

Which responds with:

```json
{
  "tokens" : [
    {
      "token" : "関西国際空港",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "関西",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "国際",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "空港",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    }
  ]
}
```

## sudachi_part_of_speech

The sudachi_part_of_speech token filter removes tokens that match a set of part-of-speech tags (`stoptags`). Sudachi POS tags have six levels: the first four are hierarchical POS categories, and the last two are the conjugation type and conjugation form. With the `stoptags`, you can filter out the result in any of these forward matching forms:

- 1 - e.g., `名詞`
- 1,2 - e.g., `名詞,固有名詞`
- 1,2,3 - e.g., `名詞,固有名詞,人名`
- 1,2,3,4 - e.g., `名詞,固有名詞,人名,姓`
- 5 - e.g., `五段-カ行`
- 6 - e.g., `終止形-一般`
- 5,6 - e.g., `五段-カ行,終止形-一般`

### PUT sudachi_sample

```json
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_posfilter": {
            "type": "sudachi_part_of_speech",
            "stoptags": [
              "助詞",
              "助動詞"
            ]
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": ["my_posfilter"],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
```

### POST sudachi_sample/_analyze

```json
{
  "analyzer": "sudachi_analyzer",
  "text": "寿司がおいしいね"
}
```

Which responds with:

```json
{
  "tokens": [
    {
      "token": "寿司",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "おいしい",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 2
    }
  ]
}
```

## sudachi_ja_stop

The sudachi_ja_stop token filter filters out Japanese stopwords (_japanese_), and any other custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, then use the `stop` token filter instead.

### PUT sudachi_sample

```json
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_stopfilter": {
            "type": "sudachi_ja_stop",
            "stopwords": [
              "_japanese_",
              "は",
              "です"
            ]
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": ["my_stopfilter"],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
```

### POST sudachi_sample/_analyze

```json
{
  "analyzer": "sudachi_analyzer",
  "text": "私は宇宙人です。"
}
```

Which responds with:

```json
{
  "tokens": [
    {
      "token": "私",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "宇宙",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 2
    },
    {
      "token": "人",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 3
    }
  ]
}
```

## sudachi_baseform

The sudachi_baseform token filter replaces terms with their SudachiBaseFormAttribute.
This acts as a lemmatizer for verbs and adjectives.

### PUT sudachi_sample

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": ["sudachi_baseform"],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
```

### POST sudachi_sample/_analyze

```json
{
  "analyzer": "sudachi_analyzer",
  "text": "飲み"
}
```

Which responds with:

```json
{
  "tokens": [
    {
      "token": "飲む",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}
```

## sudachi_normalizedform

The sudachi_normalizedform token filter replaces terms with their SudachiNormalizedFormAttribute.
This filter lemmatizes verbs and adjectives too. You don't need to use the sudachi_baseform filter with this filter.

### PUT sudachi_sample

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": ["sudachi_normalizedform"],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
```

### POST sudachi_sample/_analyze

```json
{
  "analyzer": "sudachi_analyzer",
  "text": "呑み"
}
```

Which responds with:

```json
{
  "tokens": [
    {
      "token": "飲む",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}
```

## sudachi_readingform

The sudachi_readingform token filter replaces terms with their reading form, in either katakana or romaji. It accepts the following setting:

- use_romaji: Whether romaji reading form should be output instead of katakana. Defaults to false.

When using the pre-defined sudachi_readingform filter, use_romaji is set to true. The default when defining a custom sudachi_readingform, however, is false. The only reason to use the custom form is if you need the katakana reading form:

### PUT sudachi_sample

```json
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "romaji_readingform": {
            "type": "sudachi_readingform",
            "use_romaji": true
          },
          "katakana_readingform": {
            "type": "sudachi_readingform",
            "use_romaji": false
          }
        },
        "analyzer": {
          "romaji_analyzer": {
            "tokenizer": "sudachi_tokenizer",
            "filter": ["romaji_readingform"]
          },
          "katakana_analyzer": {
            "tokenizer": "sudachi_tokenizer",
            "filter": ["katakana_readingform"]
          }
        }
      }
    }
  }
}
```

### POST sudachi_sample/_analyze

```json
{
  "analyzer": "katakana_analyzer",
  "text": "寿司"
}
```

Returns `スシ`.

```json
{
  "analyzer": "romaji_analyzer",
  "text": "寿司"
}
```

Returns `susi`.

# License
