
go-jp-text-ripper


go-jp-text-ripper separates long Japanese text into words and puts spaces between the words.
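
For example (taken from the example output shown later in this README; the exact result depends on the dictionary and options used):

吾輩は猫である。名前はまだ無い。
→ 吾輩 猫 名前 無い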

Quick Usage

Install

# install
$ go get github.com/evalphobia/go-jp-text-ripper

# or clone and build
# $ git clone --depth 1 https://github.com/evalphobia/go-jp-text-ripper.git
# $ cd ./go-jp-text-ripper
# $ make build
$ go-jp-text-ripper -h

Commands:

  help   show help
  rip    Separate japanese text into words from CSV/TSV file
  rank   Show ranking of the word frequency

Subcommands

rip

The rip command separates Japanese text into words from the --input file.

$ go-jp-text-ripper rip -h

Separate japanese text into words from CSV/TSV file

Options:

  -h, --help            display help information
  -c, --column          target column name in input file
      --columnn         target column index in input file (1st col=1)
  -i, --input          *input file path --input='/path/to/input.csv'
  -o, --output          output file path --output='./my_result.csv'
      --dic             custom dictionary path (MeCab IPA dictionary)
      --stopword        stop word list file path
      --show            print separated words to console
      --original        output original form of word
      --noun            output 'noun' type of word
      --verb            output 'verb' type of word
      --adjective       output 'adjective' type of word
      --neologd         use prefilter for neologd
      --progress[=30]   print current progress (sec)
      --min[=1]         minimum letter size for output
      --quote           columns to add double-quotes (separated by comma)
      --prefix          prefix name for new columns
  -r, --replace         replace from text column data to output result
      --debug           print debug result to console
      --dropempty       remove empty result from output
      --stoptop         use ranking from top as stopword
      --stoptopp        use ranking from top by percent as stopword (0.0 ~ 1.0)
      --stoplast        use ranking from last as stopword
      --stoplastp       use ranking from last by percent as stopword (0.0 ~ 1.0)
      --stopunique      use ranking stopword as unique per line

For example, to separate words from the example TSV file, try the command below.

# check the file contents
$ head -n 2 ./example/aozora_bunko.tsv

id	author	title	url	exerpt
1	夏目 漱石	吾輩は猫である	https://www.aozora.gr.jp/cards/000148/card789.html	一 吾輩は猫である。名前はまだ無い。 ...


# run rip command
$ go-jp-text-ripper rip \
    --input ./example/aozora_bunko.tsv \
    --column exerpt \
    --output ./output.tsv

[INFO]	[Run]	read and write lines...
[INFO]	[Run]	finish process

# check the results
$ head -n 2 ./output.tsv
id	author	title	url	exerpt	op_text	op_word_count	op_non_word_count	op_raw_char_count
1	夏目 漱石	吾輩は猫である	https://www.aozora.gr.jp/cards/000148/card789.html	一 吾輩は猫である。名前はまだ無い。...	一 吾輩 猫 名前 無い どこ 生れ 見当 つか 何 薄暗い し 所 ニャーニャー 泣い いた事 記憶 ...	562	719	2000

Advanced options

# `--columnn` sets column by index
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --show \
    --columnn 5

# `--dic` uses a custom dictionary for kagome (https://github.com/ikawaha/kagome)
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --dic /opt/data/neologd.dic

# `--stopword` sets a custom stopword file path and ignores the listed words
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --stopword ./stopwords.txt
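
# note: ./stopwords.txt is not included in this README.
# a minimal sketch of such a file, assuming one word per line
# (both the format and the words below are illustrative assumptions)
$ cat ./stopwords.txt
です
ます
する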

# `--show` prints the result to the console
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt \
    --show

# `--original` uses the original (dictionary) form (原形) of each word in the results.
# this corresponds to `node.feature.split(",")[6]` in MeCab's Python binding
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --original

# if `--noun` is set, the results contain nouns.
# if `--verb` is set, the results contain verbs.
# if `--adjective` is set, the results contain adjectives.
# (the defaults are 'noun', 'verb', and 'adjective')
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --noun --verb  # in this example, only 'noun' and 'verb' are used

# `--neologd` uses the special prefilter for neologd to normalize text
# ref: https://github.com/evalphobia/go-jp-text-ripper/blob/master/prefilter/neologd.go
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --neologd

# `--progress` sets the interval (in seconds) for printing the current progress
# the default is '30' seconds
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --progress 5

# `--min` sets the minimum word length (in letters) for the result
# if you set '2', the result ignores one-letter words (e.g. 'お', 'の', '犬', '猫', '嵐')
# the default is '1'
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --min 3

# `--prefix` sets the prefix for the new columns
# the default is 'op_'
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --output ./output.tsv \
    --prefix n_

# `--replace` overwrites the target column with the result
# the default is false, which outputs the result in a new column 'op_text'
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --replace

# `--dropempty` removes rows with an empty result
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --dropempty

# `--stoptop`, `--stoptopp`, `--stoplast`, `--stoplastp` use the rank command's results as stopwords
# `--stoptop` and `--stoptopp` treat high-frequency words as stopwords
# `--stoplast` and `--stoplastp` treat low-frequency words as stopwords
# if you set both `--stoptop` and `--stoptopp` (or both `--stoplast` and `--stoplastp`), the larger threshold is applied
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --stoptop 300 \
    --stoptopp 0.1  # whichever is bigger, 300 words or 10% of words

# `--stopunique` is used together with the `--stoptop`/`--stoplast` options
# it counts each word at most once per row when computing the frequency
$ go-jp-text-ripper rip --input ./example/aozora_bunko.tsv --column exerpt --show \
    --stoptop 300 \
    --stopunique

rank

The rank command produces a word frequency ranking from the --input file.

$ go-jp-text-ripper rank -h

Show ranking of the word frequency

Options:

  -h, --help            display help information
  -c, --column          target column name in input file
      --columnn         target column index in input file (1st col=1)
  -i, --input          *input file path --input='/path/to/input.csv'
  -o, --output          output file path --output='./my_result.csv'
      --dic             custom dictionary path (MeCab IPA dictionary)
      --stopword        stop word list file path
      --show            print separated words to console
      --original        output original form of word
      --noun            output 'noun' type of word
      --verb            output 'verb' type of word
      --adjective       output 'adjective' type of word
      --neologd         use prefilter for neologd
      --progress[=30]   print current progress (sec)
      --min[=1]         minimum letter size for output
      --top             rank from top by count
      --topp            rank from top by percent (0.0 ~ 1.0)
      --last            rank from last by count
      --lastp           rank from last by percent (0.0 ~ 1.0)
  -u, --unique          count as one word if the same word exists in a line

For example, to get the word frequency ranking from the example TSV file, try the command below.

# check the file contents
$ head -n 2 ./example/aozora_bunko.tsv

id	author	title	url	exerpt
1	夏目 漱石	吾輩は猫である	https://www.aozora.gr.jp/cards/000148/card789.html	一 吾輩は猫である。名前はまだ無い。 ...


# run rank command
$ go-jp-text-ripper rank \
    --input ./example/aozora_bunko.tsv \
    --column exerpt \
    --output ./output_rank.tsv \
    --stopword ./stopwords.txt

[INFO]	[DoWithProgress]	read lines...
[INFO]	[Do]	Total Words:1041
[INFO]	[DoWithProgress]	finish process

# check the results
$ head -n 10 ./output_rank.tsv
type	rank	word	countN	countP
top	1	し	52	0.02802
top	2	の	41	0.02209
top	3	い	31	0.01670
top	4	いる	23	0.01239
top	5	吾輩	18	0.00970
top	6	ゐる	16	0.00862
top	7	政治	14	0.00754
top	8	人間	13	0.00700
top	9	れ	12	0.00647

Custom Go App

Import go-jp-text-ripper and add plugins to the config. You can also add your own custom plugins.

package main

import (
	"strconv"

	"github.com/evalphobia/go-jp-text-ripper/plugin"
	"github.com/evalphobia/go-jp-text-ripper/postfilter"
	"github.com/evalphobia/go-jp-text-ripper/prefilter"
	"github.com/evalphobia/go-jp-text-ripper/ripper"
)

// cli entry point
func main() {
	common := ripper.CommonConfig{}

	// prefilters to normalize raw text
	common.PreFilters = []*ripper.PreFilter{
		prefilter.Neologd,
	}

	// plugins
	common.Plugins = []*ripper.Plugin{
		plugin.KanaCountPlugin,
		plugin.AlphaNumCountPlugin,
		plugin.CharTypeCountPlugin,
		plugin.MaxCharCountPlugin,
		plugin.MaxWordCountPlugin,
		plugin.SymbolCountPlugin,
		plugin.NounNameCountPlugin,
		plugin.NounHasFullNamePlugin,
		plugin.NounNumberCountPlugin,
		plugin.KanaNumberLikeCountPlugin,
		plugin.KanaAlphabetLikeCountPlugin,
		plugin.NounLocationCountPlugin,
		plugin.NounOrganizationCountPlugin,
		// MyCustomPlugin,
		&ripper.Plugin{
			Title: "proper_noun_count",
			Fn: func(text *ripper.TextData) string {
				return strconv.Itoa(text.GetWords().CountFeatures("固有名詞"))
			},
		},
	}

	// postfilters run after all of the plugins have been processed
	common.PostFilters = []*ripper.PostFilter{
		postfilter.RatioJP,
		postfilter.RatioAlphaNum,
	}

	err := ripper.DoRip(ripper.RipConfig{
		CommonConfig:      common,
		DropEmpty:         true,
		StopWordTopNumber: 300,
	})
	if err != nil {
		panic(err)
	}
}

Then, build and run!
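
For reference, the commented-out MyCustomPlugin slot above could point to a plugin defined as its own variable. Below is a minimal sketch that reuses only the ripper.Plugin fields shown in the example; the column title "verb_count" and the counted feature (動詞, verbs) are illustrative assumptions, not names defined by the library.

import (
	"strconv"

	"github.com/evalphobia/go-jp-text-ripper/ripper"
)

// MyCustomPlugin counts words whose feature matches 動詞 (verb),
// mirroring the inline proper_noun_count plugin shown above.
var MyCustomPlugin = &ripper.Plugin{
	Title: "verb_count",
	Fn: func(text *ripper.TextData) string {
		return strconv.Itoa(text.GetWords().CountFeatures("動詞"))
	},
}

Register it by appending MyCustomPlugin to common.Plugins in the main function above.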

License

Apache License, Version 2.0

Credit

This project depends on these awesome libraries,
