explain morpheme detail #121

mh-northlander · 2024-05-20T01:21:35Z

Return surface, dictionaryForm, normalizedForm, readingForm and partOfSpeech for each token when explain: true is set.

We may change key names (e.g. reading instead of readingForm, following analysis-kuromoji)
We may drop morpheme grouping.
~~We may drop instance information (from morphemeConsumerAttribute, see the example below)~~
- Let's keep it to see which tokenizer/filter writes the term attribute

implementation note

The key-value pairs set in the reflectWith method of each *AttributeImpl class are used to construct the explain output.
Each value should be a primitive type or a standard collection type, or implement the ToXContent interface.

The documentation for reflectWith says:

This method is for introspection of attributes, it should simply add the key/values this attribute holds to the given {@link AttributeReflector}

So here I chose to set "morpheme": morpheme in reflectWith and to implement a wrapper class for the morpheme that implements ToXContent.
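A rough sketch of that pattern, for illustration only. It does not use the Lucene or Elasticsearch classes: ReflectorSketch, MorphemeSketch, ExplainableMorpheme and MorphemeAttributeSketch are hypothetical stand-ins for AttributeReflector, the Sudachi morpheme, the ToXContent wrapper and the attribute implementation, and toMap() is an assumed substitute for toXContent. The actual classes in this PR may look different.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Lucene's AttributeReflector: it receives key/value pairs.
interface ReflectorSketch {
    void reflect(String key, Object value);
}

// Hypothetical stand-in for a Sudachi morpheme (only the fields shown in explain).
final class MorphemeSketch {
    final String surface;
    final String dictionaryForm;
    final String normalizedForm;
    final String readingForm;
    final List<String> partOfSpeech;

    MorphemeSketch(String surface, String dictionaryForm, String normalizedForm,
                   String readingForm, List<String> partOfSpeech) {
        this.surface = surface;
        this.dictionaryForm = dictionaryForm;
        this.normalizedForm = normalizedForm;
        this.readingForm = readingForm;
        this.partOfSpeech = partOfSpeech;
    }
}

// Wrapper that renders a morpheme as structured content.
// toMap() is an assumption standing in for ToXContent's toXContent method.
final class ExplainableMorpheme {
    private final MorphemeSketch morpheme;

    ExplainableMorpheme(MorphemeSketch morpheme) {
        this.morpheme = morpheme;
    }

    Map<String, Object> toMap() {
        Map<String, Object> fields = new LinkedHashMap<>(); // keeps key order stable
        fields.put("surface", morpheme.surface);
        fields.put("dictionaryForm", morpheme.dictionaryForm);
        fields.put("normalizedForm", morpheme.normalizedForm);
        fields.put("readingForm", morpheme.readingForm);
        fields.put("partOfSpeech", morpheme.partOfSpeech);
        return fields;
    }
}

// Sketch of an attribute implementation: reflectWith only reports what it holds.
final class MorphemeAttributeSketch {
    private MorphemeSketch morpheme;

    void setMorpheme(MorphemeSketch morpheme) {
        this.morpheme = morpheme;
    }

    void reflectWith(ReflectorSketch reflector) {
        // Hand over the wrapper rather than the raw morpheme so the caller can serialize it.
        reflector.reflect("morpheme", morpheme == null ? null : new ExplainableMorpheme(morpheme));
    }
}

class ReflectWithDemo {
    public static void main(String[] args) {
        MorphemeAttributeSketch attribute = new MorphemeAttributeSketch();
        attribute.setMorpheme(new MorphemeSketch("すだち", "すだち", "酢橘", "スダチ",
                List.of("名詞", "普通名詞", "一般", "*", "*", "*")));

        // The caller plays the role of the explain machinery: collect and serialize.
        Map<String, Object> explain = new LinkedHashMap<>();
        attribute.reflectWith((key, value) ->
                explain.put(key, value instanceof ExplainableMorpheme
                        ? ((ExplainableMorpheme) value).toMap() : value));
        System.out.println(explain);
    }
}
```

The point of the wrapper is that reflectWith stays a plain "report what I hold" method, while the knowledge of how to render a morpheme lives in one serializable class.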

example

curl -X POST "localhost:9200/_analyze?pretty" \
 -H 'Content-Type: application/json' \
 -d @- <<EOF
{
    "text":"すだち",
    "tokenizer": "sudachi_tokenizer",
    "explain":true
}
EOF

{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "sudachi_tokenizer",
      "tokens" : [
        {
          "token" : "すだち",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0,
          "bytes" : "[e3 81 99 e3 81 a0 e3 81 a1]",
          "instance" : "com.worksap.nlp.lucene.sudachi.ja.SudachiTokenizer",
          "morpheme" : {
            "surface" : "すだち",
            "dictionaryForm" : "すだち",
            "normalizedForm" : "酢橘",
            "readingForm" : "スダチ",
            "partOfSpeech" : [
              "名詞",
              "普通名詞",
              "一般",
              "*",
              "*",
              "*"
            ]
          },
          "positionLength" : 1
        }
      ]
    },
    "tokenfilters" : [ ]
  }
}
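For reference, the same request could be issued from Java with the JDK's built-in HTTP client (Java 11+). This is a minimal sketch, not part of the PR: the class name AnalyzeExplainRequest and its build helper are hypothetical, the host and port are the ones from the curl example, and the body is assembled by string concatenation without JSON escaping, so it assumes the text contains no quotes or backslashes.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class AnalyzeExplainRequest {
    // Builds the same POST as the curl example above.
    // No JSON escaping is done: assumes text has no quotes or backslashes.
    static HttpRequest build(String text) {
        String body = "{\"text\":\"" + text
                + "\",\"tokenizer\":\"sudachi_tokenizer\",\"explain\":true}";
        return HttpRequest.newBuilder(URI.create("http://localhost:9200/_analyze?pretty"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body)) // sent as UTF-8
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Requires an Elasticsearch node with the plugin installed on localhost:9200.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(build("すだち"), HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```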

sonarcloud · 2024-05-21T07:47:50Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

mh-northlander changed the title ~~explain tokenizer detail~~ WIP: explain tokenizer detail May 20, 2024

mh-northlander added 2 commits May 21, 2024 10:18

add wrapper class for morpheme with ToXContent

29a58b6

support lower versions

e3b33b5

mh-northlander force-pushed the feature/explain branch from d335693 to 242828c Compare May 21, 2024 02:46

add tests

54576a5

mh-northlander force-pushed the feature/explain branch from 242828c to 54576a5 Compare May 21, 2024 02:49

test xcontent serialization

fede704

mh-northlander changed the title ~~WIP: explain tokenizer detail~~ explain morpheme detail May 21, 2024

mh-northlander requested a review from kazuma-t May 21, 2024 07:53

kazuma-t approved these changes May 21, 2024

mh-northlander merged commit e4e3012 into develop May 21, 2024
25 checks passed

mh-northlander deleted the feature/explain branch May 21, 2024 23:58
