Commit

Merge pull request #127 from WorksApplications/feature/remove-mca
remove MorphemeConsumerAttribute
mh-northlander authored May 30, 2024
2 parents 27d2078 + 7e67b2c commit e966ea8
Showing 7 changed files with 18 additions and 137 deletions.
29 changes: 17 additions & 12 deletions README.md
@@ -63,7 +63,7 @@ If you want to update Sudachi that is included in a plugin you have installed, d

# Analyzer

An analyzer named "sudachi" is provided.
An analyzer `sudachi` is provided.
This is equivalent to the following custom analyzer.

```json
@@ -92,6 +92,8 @@ See following sections for the detail of the tokenizer and each filters.

# Tokenizer

The `sudachi_tokenizer` tokenizer tokenizes input texts using Sudachi.

- split_mode: Select splitting mode of Sudachi. (A, B, C) (string, default: C)
- C: Extracts named entities
- Ex) 選挙管理委員会
@@ -168,7 +170,7 @@ dictionary settings

## sudachi\_split

This filter works like `mode` of kuromoji.
The `sudachi_split` token filter works like `mode` of kuromoji.

- mode
- "search": Additional segmentation useful for search. (Use C and A mode)
@@ -258,7 +260,7 @@ Which responds with:

## sudachi\_part\_of\_speech

The sudachi\_part\_of\_speech token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:
The `sudachi_part_of_speech` token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:

The `stoptags` is an array of part-of-speech and/or inflection tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analysis-sudachi.jar.

@@ -348,7 +350,7 @@ Which responds with:

## sudachi\_ja\_stop

The sudachi\_ja\_stop token filter filters out Japanese stopwords (_japanese_), and any other custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, then use the stop token filter instead.
The `sudachi_ja_stop` token filter filters out Japanese stopwords (_japanese_), and any other custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, then use the stop token filter instead.

### PUT sudachi_sample

@@ -426,7 +428,9 @@ Which responds with:

## sudachi\_baseform

The sudachi\_baseform token filter replaces terms with their SudachiBaseFormAttribute. This acts as a lemmatizer for verbs and adjectives.
The `sudachi_baseform` token filter replaces terms with their Sudachi dictionary form. This acts as a lemmatizer for verbs and adjectives.

This will be overridden by `sudachi_split`, `sudachi_normalizedform` or `sudachi_readingform` token filters.

### PUT sudachi_sample
```json
@@ -479,9 +483,10 @@ Which responds with:

## sudachi\_normalizedform

The sudachi\_normalizedform token filter replaces terms with their SudachiNormalizedFormAttribute. This acts as a normalizer for spelling variants.
The `sudachi_normalizedform` token filter replaces terms with their Sudachi normalized form. This acts as a normalizer for spelling variants.
This filter lemmatizes verbs and adjectives too. You don't need to use `sudachi_baseform` filter with this filter.

This filter lemmatizes verbs and adjectives too. You don't need to use sudachi\_baseform filter with this filter.
This will be overridden by `sudachi_split`, `sudachi_baseform` or `sudachi_readingform` token filters.
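
As a rough sketch of the two notes above, an analyzer that wants both spelling normalization and lemmatization can list `sudachi_normalizedform` alone. The settings layout and the analyzer name `sudachi_analyzer` are illustrative assumptions and are not part of this commit:

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "type": "custom",
            "tokenizer": "sudachi_tokenizer",
            "filter": ["sudachi_normalizedform"]
          }
        }
      }
    }
  }
}
```

Adding `sudachi_baseform` to the same `filter` array would presumably be redundant, because only one of the two forms ends up as the term.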

### PUT sudachi_sample

@@ -535,14 +540,14 @@ Which responds with:

## sudachi\_readingform

Convert to katakana or romaji reading.
The sudachi\_readingform token filter replaces the token with its reading form in either katakana or romaji. It accepts the following setting:
The `sudachi_readingform` token filter replaces the terms with their reading form in either katakana or romaji.

### use_romaji
This will be overridden by `sudachi_split`, `sudachi_baseform` or `sudachi_normalizedform` token filters.

Whether romaji reading form should be output instead of katakana. Defaults to false.
Accepts the following setting:

When using the pre-defined sudachi_readingform filter, use_romaji is set to true. The default when defining a custom sudachi_readingform, however, is false. The only reason to use the custom form is if you need the katakana reading form:
- use_romaji
- Whether romaji reading form should be output instead of katakana. Defaults to false.
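
A minimal sketch of how the `use_romaji` setting might be supplied through a custom filter definition. The filter name `romaji_readingform` and the analyzer name `sudachi_analyzer` are hypothetical, chosen only for illustration:

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "filter": {
          "romaji_readingform": {
            "type": "sudachi_readingform",
            "use_romaji": true
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "type": "custom",
            "tokenizer": "sudachi_tokenizer",
            "filter": ["romaji_readingform"]
          }
        }
      }
    }
  }
}
```

Leaving out `use_romaji` in the filter definition would fall back to the documented default of `false`, giving katakana readings.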

### PUT sudachi_sample


This file was deleted.

@@ -17,9 +17,7 @@
package com.worksap.nlp.lucene.sudachi.ja

import com.worksap.nlp.lucene.sudachi.ja.attributes.MorphemeAttribute
import com.worksap.nlp.lucene.sudachi.ja.attributes.MorphemeConsumerAttribute
import com.worksap.nlp.sudachi.Morpheme
import org.apache.logging.log4j.LogManager
import org.apache.lucene.analysis.TokenFilter
import org.apache.lucene.analysis.TokenStream
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
@@ -39,8 +37,6 @@ abstract class MorphemeFieldFilter(input: TokenStream) : TokenFilter(input) {
@JvmField protected val morphemeAtt = existingAttribute<MorphemeAttribute>()
@JvmField protected val keywordAtt = addAttribute<KeywordAttribute>()
@JvmField protected val termAtt = addAttribute<CharTermAttribute>()
@JvmField
protected val consumer = addAttribute<MorphemeConsumerAttribute> { it.currentConsumer = this }

/**
* Override this method to customize returned value. This method will not be called if
@@ -64,16 +60,4 @@ abstract class MorphemeFieldFilter(input: TokenStream) : TokenFilter(input) {

return true
}

override fun reset() {
super.reset()
if (!consumer.shouldConsume(this)) {
logger.warn(
"an instance of ${javaClass.name} is a no-op, it is not a filter which produces terms in one of your filter chains")
}
}

companion object {
private val logger = LogManager.getLogger(MorphemeFieldFilter::class.java)
}
}
@@ -86,7 +86,6 @@ public int offset() {
private final PositionIncrementAttribute posIncAtt;
private final PositionLengthAttribute posLengthAtt;
private final MorphemeAttribute morphemeAtt;
private final MorphemeConsumerAttribute consumerAttribute;
private ListIterator<Morpheme> aUnitIterator;
private final OovChars oovChars = new OovChars();

@@ -102,8 +101,6 @@ public SudachiSplitFilter(TokenStream input, Mode mode, Tokenizer.SplitMode spli
posIncAtt = addAttribute(PositionIncrementAttribute.class);
posLengthAtt = addAttribute(PositionLengthAttribute.class);
morphemeAtt = addAttribute(MorphemeAttribute.class);
consumerAttribute = addAttribute(MorphemeConsumerAttribute.class);
consumerAttribute.setCurrentConsumer(this);
}

@Override
@@ -17,7 +17,6 @@
package com.worksap.nlp.lucene.sudachi.ja

import com.worksap.nlp.lucene.sudachi.ja.attributes.MorphemeAttribute
import com.worksap.nlp.lucene.sudachi.ja.attributes.MorphemeConsumerAttribute
import com.worksap.nlp.lucene.sudachi.ja.attributes.SudachiAttribute
import com.worksap.nlp.lucene.sudachi.ja.attributes.SudachiAttributeFactory
import org.apache.lucene.analysis.Tokenizer
@@ -37,7 +36,6 @@ class SudachiTokenizer(
private val offsetAtt = addAttribute<OffsetAttribute>()
private val posIncAtt = addAttribute<PositionIncrementAttribute>()
private val posLenAtt = addAttribute<PositionLengthAttribute>()
private val consumer = addAttribute<MorphemeConsumerAttribute> { it.currentConsumer = this }

init {
addAttribute<SudachiAttribute> { it.dictionary = tokenizer.dictionary }

This file was deleted.

@@ -1,5 +1,5 @@
/*
* Copyright (c) 2023 Works Applications Co., Ltd.
* Copyright (c) 2023-2024 Works Applications Co., Ltd.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -25,7 +25,6 @@ class SudachiAttributeFactory(private val parent: AttributeFactory) : AttributeF
override fun createAttributeInstance(attClass: Class<out Attribute>?): AttributeImpl {
return when (attClass) {
MorphemeAttribute::class.java -> MorphemeAttributeImpl()
MorphemeConsumerAttribute::class.java -> MorphemeConsumerAttributeImpl()
SudachiAttribute::class.java -> SudachiAttributeImpl()
else -> parent.createAttributeInstance(attClass)
}
