explain morpheme detail #121

Merged
merged 4 commits into from
May 21, 2024

mh-northlander
Collaborator

@mh-northlander mh-northlander commented May 20, 2024

fix #102.

Return surface, dictionaryForm, normalizedForm, readingForm, and partOfSpeech for each morpheme when explain: true is set.

  • We may change key names (e.g. reading instead of readingForm, following analysis-kuromoji)
  • We may drop morpheme grouping.
  • We may drop instance information (from morphemeConsumerAttribute; see the example below)
    • Let's keep it to see which tokenizer/filter writes the term attribute

implementation note

The key-value pairs set in the reflectWith method of each *AttributeImpl class are used to construct the explain output.
Each value should be a primitive type or a standard collection type, or implement the ToXContent interface.

The documentation for reflectWith says:

This method is for introspection of attributes, it should simply add the key/values this attribute holds to the given {@link AttributeReflector}

So here I chose to set "morpheme": morpheme in reflectWith and implemented a wrapper class for the morpheme that implements ToXContent.
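
The constraint described in this note can be illustrated with a language-neutral sketch. The Python below is not the plugin's Java code: Reflector, MorphemeView, to_content, and render are hypothetical stand-ins for AttributeReflector, the morpheme wrapper class, ToXContent, and the code that builds the explain output. It only shows why a wrapper is needed: a reflected value must be a primitive, a standard collection, or an object that knows how to render itself.

```python
import json


class MorphemeView:
    """Hypothetical wrapper exposing morpheme fields in a serializable form."""

    def __init__(self, surface, dictionary_form, normalized_form,
                 reading_form, part_of_speech):
        self.surface = surface
        self.dictionary_form = dictionary_form
        self.normalized_form = normalized_form
        self.reading_form = reading_form
        self.part_of_speech = part_of_speech

    def to_content(self):
        # Plays the role of ToXContent: return only primitives and collections.
        return {
            "surface": self.surface,
            "dictionaryForm": self.dictionary_form,
            "normalizedForm": self.normalized_form,
            "readingForm": self.reading_form,
            "partOfSpeech": list(self.part_of_speech),
        }


class Reflector:
    """Hypothetical stand-in that collects the reflected key/value pairs."""

    def __init__(self):
        self.values = {}

    def reflect(self, key, value):
        self.values[key] = value


def render(value):
    """Turn a reflected value into plain data, or fail if it cannot be rendered."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if isinstance(value, (list, tuple)):
        return [render(v) for v in value]
    if isinstance(value, dict):
        return {k: render(v) for k, v in value.items()}
    if hasattr(value, "to_content"):
        return render(value.to_content())
    raise TypeError(f"cannot render {type(value).__name__}")


def reflect_with(reflector, morpheme):
    # Mirrors the choice in this PR: set "morpheme" to the wrapped morpheme.
    reflector.reflect("morpheme", morpheme)


reflector = Reflector()
reflect_with(reflector, MorphemeView(
    "すだち", "すだち", "酢橘", "スダチ",
    ["名詞", "普通名詞", "一般", "*", "*", "*"],
))
explained = render(reflector.values)
print(json.dumps(explained, ensure_ascii=False, indent=2))
```

Passing an object without to_content to render raises TypeError, which is the sketch's analogue of a value that cannot be written into the explain output.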

example

curl -X POST "localhost:9200/_analyze?pretty" \
 -H 'Content-Type: application/json' \
 -d @- <<EOF
{
    "text":"すだち",
    "tokenizer": "sudachi_tokenizer",
    "explain":true
}
EOF
{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "sudachi_tokenizer",
      "tokens" : [
        {
          "token" : "すだち",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0,
          "bytes" : "[e3 81 99 e3 81 a0 e3 81 a1]",
          "instance" : "com.worksap.nlp.lucene.sudachi.ja.SudachiTokenizer",
          "morpheme" : {
            "surface" : "すだち",
            "dictionaryForm" : "すだち",
            "normalizedForm" : "酢橘",
            "readingForm" : "スダチ",
            "partOfSpeech" : [
              "名詞",
              "普通名詞",
              "一般",
              "*",
              "*",
              "*"
            ]
          },
          "positionLength" : 1
        }
      ]
    },
    "tokenfilters" : [ ]
  }
}
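
For reading the new fields on the client side, the response above can be walked directly. The following is a minimal Python sketch that assumes the response shape shown in this example; the helper name morphemes_from_explain is hypothetical, and the embedded response is an abridged copy of the one above. It only looks at the tokenizer section.

```python
import json

# Abridged copy of the response from the example above.
RESPONSE = """
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "sudachi_tokenizer",
      "tokens": [
        {
          "token": "すだち",
          "start_offset": 0,
          "end_offset": 3,
          "type": "word",
          "position": 0,
          "morpheme": {
            "surface": "すだち",
            "dictionaryForm": "すだち",
            "normalizedForm": "酢橘",
            "readingForm": "スダチ",
            "partOfSpeech": ["名詞", "普通名詞", "一般", "*", "*", "*"]
          },
          "positionLength": 1
        }
      ]
    },
    "tokenfilters": []
  }
}
"""


def morphemes_from_explain(response_text):
    """Return (token, morpheme dict) pairs from the tokenizer section."""
    detail = json.loads(response_text)["detail"]
    pairs = []
    for token in detail["tokenizer"]["tokens"]:
        # Skip tokens that carry no "morpheme" entry.
        morpheme = token.get("morpheme")
        if morpheme is not None:
            pairs.append((token["token"], morpheme))
    return pairs


pairs = morphemes_from_explain(RESPONSE)
for token_text, morpheme in pairs:
    print(token_text, morpheme["normalizedForm"], morpheme["readingForm"],
          "-".join(morpheme["partOfSpeech"]))
```

If the key names change later (for example reading instead of readingForm, as mentioned above), only the field lookups in the final loop would need to be updated.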

@mh-northlander mh-northlander changed the title explain tokenizer detail WIP: explain tokenizer detail May 20, 2024

sonarcloud bot commented May 21, 2024

@mh-northlander mh-northlander changed the title WIP: explain tokenizer detail explain morpheme detail May 21, 2024
@mh-northlander mh-northlander requested a review from kazuma-t May 21, 2024 07:53
@mh-northlander mh-northlander merged commit e4e3012 into develop May 21, 2024
25 checks passed
@mh-northlander mh-northlander deleted the feature/explain branch May 21, 2024 23:58
Linked issue that this pull request closes: Analyzing with explain: true flag produces an exception
2 participants