A simple Ruby project for Chinese & Japanese word segmentation
git clone https://github.com/Reoooh/sRWS.git
Please make sure the Ruby environment is installed.
PS: I recommend using brew to install rbenv to manage Ruby versions.
All segmentation methods in this project are based on string matching against a dictionary (the dictionary scheme).
This project can be applied to Chinese and Japanese word segmentation.
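As a rough illustration of dictionary-based string matching, here is a minimal sketch of forward maximum matching. The dictionary contents and the names `DICTIONARY`, `MAX_WORD_LENGTH` and `forward_max_match` are hypothetical and are not taken from this project's source; the real project loads a far larger word list.

```ruby
require "set"

# Hypothetical toy dictionary, for illustration only.
DICTIONARY = Set.new(%w[中文 分词 测试]).freeze
MAX_WORD_LENGTH = DICTIONARY.map(&:length).max

# Scan left to right, always taking the longest dictionary word
# that starts at the current position.
def forward_max_match(text, dictionary = DICTIONARY, max_len = MAX_WORD_LENGTH)
  tokens = []
  pos = 0
  while pos < text.length
    # Start with the longest candidate and shrink it until it is a known word.
    len = [max_len, text.length - pos].min
    len -= 1 while len > 1 && !dictionary.include?(text[pos, len])
    # When len reaches 1 the character is emitted as-is (unknown-word fallback).
    tokens << text[pos, len]
    pos += len
  end
  tokens
end

p forward_max_match("中文分词测试")
```

With this toy dictionary the call prints `["中文", "分词", "测试"]`. Any character that does not start a dictionary word is emitted as a single-character token.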
Segmentation algorithms for Chinese:
forward maximum matching method
backward maximum matching method
bidirectional maximum matching method
Dictionary source:
The dictionary for Chinese Word Segmentation comes from the open source project HanLP.
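The three methods listed above can be sketched as follows. This is a sketch under assumptions, not the project's implementation: the toy word list, the method names, and in particular the tie-breaking rule of the bidirectional method (fewer tokens first, then fewer single-character tokens, then the backward result) are illustrative choices that the project may make differently.

```ruby
require "set"

# Hypothetical toy dictionary, chosen so that the two directions disagree.
WORDS = Set.new(%w[研究 研究生 生命 起源]).freeze
MAX_LEN = WORDS.map(&:length).max

# Forward pass: longest dictionary word starting at the current position.
def forward_match(text, words = WORDS, max_len = MAX_LEN)
  tokens = []
  pos = 0
  while pos < text.length
    len = [max_len, text.length - pos].min
    len -= 1 while len > 1 && !words.include?(text[pos, len])
    tokens << text[pos, len]
    pos += len
  end
  tokens
end

# Backward pass: longest dictionary word ending at the current position.
def backward_match(text, words = WORDS, max_len = MAX_LEN)
  tokens = []
  stop = text.length
  while stop > 0
    len = [max_len, stop].min
    len -= 1 while len > 1 && !words.include?(text[stop - len, len])
    tokens.unshift(text[stop - len, len])
    stop -= len
  end
  tokens
end

# Bidirectional pass: run both directions and pick one result.
# The selection rule here is an assumed heuristic.
def bidirectional_match(text, words = WORDS, max_len = MAX_LEN)
  fwd = forward_match(text, words, max_len)
  bwd = backward_match(text, words, max_len)
  return fwd if fwd == bwd
  return [fwd, bwd].min_by(&:length) if fwd.length != bwd.length

  singles = ->(tokens) { tokens.count { |t| t.length == 1 } }
  singles.call(fwd) < singles.call(bwd) ? fwd : bwd
end

p forward_match("研究生命起源")
p backward_match("研究生命起源")
p bidirectional_match("研究生命起源")
```

For this input the forward pass yields `["研究生", "命", "起源"]` and the backward pass yields `["研究", "生命", "起源"]`. Both have three tokens, so the assumed heuristic prefers the backward result because it contains no single-character token.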
Segmentation algorithm for Japanese:
forward maximum matching method
Dictionary source:
The dictionary for Japanese word segmentation is generated by the kernel program. The final dictionary includes most of the inflected forms of verbs and adjectives, so it contains more than 20 times as many words as the original dictionary.
WARNING: The original dictionary comes from the book "红宝书10000日语单词随身带". I created the electronic version manually. Please make sure you own a copy of this book in some form if you need to use the original dictionary.
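To give an idea of how a dictionary can grow when inflected forms are added, here is a small sketch of that kind of expansion. It is not the kernel program: the names `ENDINGS` and `expand_entry`, the part-of-speech tags, and the short ending tables are hypothetical, and only ichidan verbs and i-adjectives are handled. Other word classes, such as godan verbs, would need their own rules.

```ruby
# Hypothetical ending tables; each list replaces the final kana of the base form.
ENDINGS = {
  ichidan_verb: %w[る ます ません ない た て],
  i_adjective:  %w[い く くない かった くて]
}.freeze

# Return the base form plus its inflected forms,
# or just the word itself for classes without a table.
def expand_entry(word, kind)
  endings = ENDINGS[kind]
  return [word] unless endings

  stem = word[0...-1] # drop the final る or い
  endings.map { |ending| stem + ending }
end

# Hypothetical tagged base dictionary.
base = { "食べる" => :ichidan_verb, "高い" => :i_adjective, "テスト" => :noun }
expanded = base.flat_map { |word, kind| expand_entry(word, kind) }
p expanded
```

Here three base entries become twelve dictionary entries. A fuller rule set covering more inflections would multiply the size further.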
You can start like this:
/WS/CWS
>>ruby Connecter.rb "中文分词测试"
FORWARD: ["中文", "分词", "测试"]
BACKWARD: ["中文", "分词", "测试"]
BIDIRECTIONAL: ["中文", "分词", "测试"]
You can also measure the time spent on word segmentation:
/WS/CWS
>>ruby Connectertest.rb "中文分词测试"
FORWARD: ["中文", "分词", "测试"]
Rehearsal ----------------------------------------------
time: 3.469000 0.046000 3.515000 ( 3.531568)
------------------------------------- total: 3.515000sec
user system total real
time: 3.281000 0.063000 3.344000 ( 3.373417)
BACKWARD: ["中文", "分词", "测试"]
Rehearsal ----------------------------------------------
time: 2.719000 0.015000 2.734000 ( 2.758816)
------------------------------------- total: 2.734000sec
user system total real
time: 2.625000 0.047000 2.672000 ( 2.796045)
BIDIRECTIONAL: ["中文", "分词", "测试"]
Rehearsal ----------------------------------------------
time: 5.828000 0.078000 5.906000 ( 5.913542)
------------------------------------- total: 5.906000sec
user system total real
time: 5.828000 0.093000 5.921000 ( 5.946587)
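The "Rehearsal" lines in the timings above match the format printed by `Benchmark.bmbm` from Ruby's standard library, which runs each block once as a rehearsal and then again for the reported measurement. Whether Connectertest.rb is written this way is an assumption. The sketch below shows how such a measurement can be set up, with a hypothetical `segment` method standing in for a real segmenter.

```ruby
require "benchmark"

# Hypothetical stand-in for a segmenter; it only splits the text into characters.
def segment(text)
  text.chars
end

# bmbm runs the block twice: a rehearsal pass, then the measured pass.
Benchmark.bmbm do |x|
  x.report("time:") { 1_000.times { segment("中文分词测试") } }
end
```

The columns printed are user CPU time, system CPU time, their total, and elapsed real time in parentheses, all in seconds.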
/WS/JWS
>>ruby Connecter.rb "日本語言葉区分するテスト"
FORWARD: ["日本語", "言葉", "区分する", "テスト"]
/WS/JWS
>>ruby Connectertest.rb "日本語言葉区分するテスト"
["日本語", "言葉", "区分する", "テスト"]
Rehearsal ----------------------------------------------
time: 32.922000 0.734000 33.656000 ( 33.757088)
------------------------------------ total: 33.656000sec
user system total real
time: 37.218000 0.860000 38.078000 ( 39.532077)
Actually, the correct Japanese sentence is as follows:
/WS/JWS
>>ruby Connecter.rb "日本語単語分割テスト"
FORWARD: ["日本語", "単", "語", "分", "割", "テスト"]
/WS/JWS
>>ruby Connectertest.rb "日本語単語分割テスト"
["日本語", "単", "語", "分", "割", "テスト"]
Rehearsal ----------------------------------------------
time: 32.344000 0.563000 32.907000 ( 32.941230)
------------------------------------ total: 32.907000sec
user system total real
time: 31.969000 0.984000 32.953000 ( 32.970740)
This reveals a flaw in this word segmentation project: the program cannot recognize words that are not included in the dictionary.
GNU General Public License v3.0