Merge pull request #322 from KEINOS/revamping-example-directory

Refactor example directory
ikawaha · Apr 13, 2024 · 907eda5
2 parents 2ef2717 + b65900b
commit 907eda5
Showing 32 changed files with 315 additions and 73 deletions.
diff --git a/.github/workflows/go.yml b/.github/workflows/go.yml
@@ -28,6 +28,13 @@ jobs:
           go-version-file: 'go.mod'
           cache: true
 
+      - name: Remove symlink 2to3
+        if: matrix.os == 'macos-latest'
+        run: |
+          : # Workaround GitHub Actions Python issues
+          : # https://github.com/Homebrew/homebrew-core/issues/165793#issuecomment-1989441193
+          brew unlink python && brew link --overwrite python
+
       - name: Set up Graphviz
         uses: ts-graphviz/setup-graphviz@v2
 

diff --git a/_examples/db_search/README.md b/_examples/db_search/README.md
@@ -0,0 +1,71 @@
+# Full-text search with Kagome and SQLite3
+
+This example shows how to work with Japanese text data and **perform efficient [full-text search](https://en.wikipedia.org/wiki/Full-text_search) using Kagome and SQLite3**.
+
+- Target text data is as follows:
+
+```text
+人魚は、南の方の海にばかり棲んでいるのではありません。
+北の海にも棲んでいたのであります。
+北方の海の色は、青うございました。
+ある時、岩の上に、女の人魚があがって、
+あたりの景色を眺めながら休んでいました。
+小川未明 『赤い蝋燭と人魚』
+```
+
+- Example output:
+
+```shellsession
+$ cd /path/to/kagome/_examples/db_search
+$ go run .
+Searching for: 人魚
+  Found content: 人魚は、南の方の海にばかり棲んでいるのではありません。 at line: 1
+  Found content: ある時、岩の上に、女の人魚があがって、 at line: 4
+  Found content: 小川未明 『赤い蝋燭と人魚』 at line: 6
+Searching for: 人
+  No results found
+Searching for: 北方
+  Found content: 北方の海の色は、青うございました。 at line: 3
+Searching for: 北
+  Found content: 北の海にも棲んでいたのであります。 at line: 2
+```
+
+- [View main.go](main.go)
+
+## Details
+
+In this example, each line of text is inserted into a row of the SQLite3 database, and then the database is searched for the words "人魚", "人", "北方", and "北".
+
+When inserting text data into the database, Kagome is used to tokenize the text into words.
+
+Each line is tokenized by Kagome (a.k.a. "Wakati"), and the space-separated result is recorded in a separate [FTS4](https://www.sqlite.org/fts3.html) (full-text search) table, apart from the table that holds the original text.
+
+This allows Unicode text data that is not separated by spaces, such as Japanese, to be searched by FTS.
+
+Note that the search matches whole words, not individual characters. For example, "人" doesn't match "人魚". Likewise, "北" doesn't match "北方".
+
+This is because the FTS4 module in SQLite3 is designed to search for words, not characters.
+
+### Aim of this example
+
+This example can be useful in scenarios where you need to perform full-text searches on Japanese text.
+
+It demonstrates how to tokenize Japanese text using Kagome, which is a common requirement when working with text data in the Japanese language.
+
+By using SQLite with FTS4, it efficiently manages and searches through a large amount of text data, making it suitable for applications like:
+
+1. **Search Engines:** You can use this code as a basis for building a search engine that indexes and searches Japanese text content.
+2. **Document Management Systems:** This code can be integrated into a document management system to enable full-text search capabilities for Japanese documents.
+3. **Content Recommendation Systems:** When you have a large collection of Japanese content, you can use this code to implement content recommendation systems based on user queries.
+4. **Chatbots and NLP:** If you're building chatbots or natural language processing (NLP) systems for the Japanese language, this code can assist in text analysis and search within the chatbot's knowledge base.
+
+## Acknowledgements
+
+This example is based in part on the following book.
+
+- p.204, 9.2 "データーベース登録プログラム", "Go言語プログラミングエッセンス エンジニア選書"
+  - Written by: [Mattn](https://github.com/mattn)
+  - Published: 2023/3/9 (技術評論社)
+  - ISBN: 4297134195 / 978-4297134198
+  - ASIN: B0BVZCJQ4F / [https://amazon.co.jp/dp/4297134195](https://amazon.co.jp/dp/4297134195)
+  - Original sample code: [https://github.com/mattn/aozora-search](https://github.com/mattn/aozora-search)
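
The word-versus-character behaviour described in the README above can be approximated without SQLite. The following is a minimal sketch only: `containsWord` is a hypothetical helper, and the token strings are hand-segmented stand-ins rather than real Kagome output. It mimics whole-token matching on space-separated text and does not reproduce how FTS4 works internally.

```go
package main

import (
	"fmt"
	"strings"
)

// containsWord reports whether query equals one of the space-separated
// tokens in tokenized. Hypothetical helper for illustration only.
func containsWord(tokenized, query string) bool {
	for _, tok := range strings.Fields(tokenized) {
		if tok == query {
			return true
		}
	}
	return false
}

func main() {
	// Hand-segmented stand-ins for Wakati output (assumed, not produced
	// by actually running the tokenizer).
	lines := []string{
		"人魚 は 、 南 の 方 の 海 に",
		"北 の 海 に も",
		"北方 の 海 の 色 は",
	}

	for _, q := range []string{"人魚", "人", "北方", "北"} {
		fmt.Printf("Searching for: %s\n", q)
		for i, l := range lines {
			if containsWord(l, q) {
				fmt.Printf("  matched line %d\n", i+1)
			}
		}
	}
}
```

Because the comparison is made per token, "人" finds nothing even though the character appears inside "人魚", which mirrors the "No results found" case in the example output.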
diff --git a/sample/_example/go.mod → _examples/db_search/go.mod b/sample/_example/go.mod → _examples/db_search/go.mod
@@ -1,4 +1,4 @@
-module kagome/examples
+module kagome/examples/db_search
 
 go 1.19
 

diff --git a/sample/_example/go.sum → _examples/db_search/go.sum b/sample/_example/go.sum → _examples/db_search/go.sum
diff --git a/sample/_example/db_search/main.go → _examples/db_search/main.go b/sample/_example/db_search/main.go → _examples/db_search/main.go
@@ -1,35 +1,9 @@
 /*
-# TL; DR
+# Full-text search with Kagome and SQLite3
 
 This example shows how to work with Japanese text data and perform efficient full-text search using Kagome and SQLite3.
 
-# TS; WM
-
-In this example, each line of text is inserted into a row of the SQLite3 database, and then the database is searched for the word "人魚" and "人".
-
-Note that the string tokenized by Kagome, a.k.a. "Wakati", is recorded in a separate table for FTS (Full-Text-Search) at the same time as the original text.
-
-This allows Unicode text data that is not separated by spaces, such as Japanese, to be searched by FTS.
-
-Aim of this example:
-
-This example can be useful in scenarios where you need to perform full-text searches on Japanese text. It demonstrates how to tokenize Japanese text using Kagome, which is a common requirement when working with text data in the Japanese language. By using SQLite with FTS4, it efficiently manages and searches through a large amount of text data, making it suitable for applications like:
-
-1. **Search Engines:** You can use this code as a basis for building a search engine that indexes and searches Japanese text content.
-2. **Document Management Systems:**	This code can be integrated into a document management system to enable full-text search capabilities for Japanese documents.
-3. **Content Recommendation Systems:** When you have a large collection of Japanese content, you can use this code to implement content recommendation systems based on user queries.
-4. **Chatbots and NLP:**  If you're building chatbots or natural language processing (NLP) systems for Japanese language, this code can assist in text analysis and search within the chatbot's knowledge base.
-
-Acknowledgements:
-
-This example is taken in part from the following book for reference.
-
-- p.204, 9.2 "データーベース登録プログラム", "Go言語プログラミングエッセンス エンジニア選書"
-  - Written by: Mattn
-  - Published: 2023/3/9 (技術評論社)
-  - ISBN: 4297134195 / 978-4297134198
-  - ASIN: B0BVZCJQ4F / https://amazon.co.jp/dp/4297134195
-  - Original sample code: https://github.com/mattn/aozora-search
+For details and acknowledgements, see the README.md file in the same directory.
 */
 package main
 
@@ -39,6 +13,7 @@ import (
 	"fmt"
 	"log"
 	"os"
+	"slices"
 	"strings"
 
 	"github.com/ikawaha/kagome-dict/ipa"
@@ -165,6 +140,9 @@ func insertSearchToken(db *sql.DB, rowID int64, content string) error {
 	}
 
 	seg := tknzr.Wakati(content)
+
+	seg = slices.Compact(seg) // remove consecutive duplicate tokens
+
 	tokenizedContent := strings.Join(seg, " ")
 
 	_, err = db.Exec(
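
A note on the `slices.Compact` call in the hunk above: it collapses only runs of adjacent equal elements, and the `slices` package has been part of the standard library since Go 1.21. The sketch below contrasts it with a hypothetical `uniqueTokens` helper that drops every repeated token. Neither `uniqueTokens` nor `compactCopy` is part of the example's code; they are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"slices"
)

// compactCopy applies slices.Compact to a copy of tokens, because
// slices.Compact modifies its argument in place.
func compactCopy(tokens []string) []string {
	return slices.Compact(slices.Clone(tokens))
}

// uniqueTokens returns tokens with every repeated value removed,
// keeping the first occurrence and the original order.
// Hypothetical helper; not part of the example in the diff.
func uniqueTokens(tokens []string) []string {
	seen := make(map[string]struct{}, len(tokens))
	out := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if _, ok := seen[t]; ok {
			continue
		}
		seen[t] = struct{}{}
		out = append(out, t)
	}
	return out
}

func main() {
	seg := []string{"すもも", "も", "もも", "も", "もも", "の", "うち"}

	// No two adjacent tokens are equal, so Compact keeps all seven.
	fmt.Println(len(compactCopy(seg))) // 7

	// Removing every repeat leaves five distinct tokens.
	fmt.Println(len(uniqueTokens(seg))) // 5
}
```

Whether full de-duplication is desirable depends on the search features in use, since dropping repeated tokens changes token positions and frequencies in the indexed text.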

diff --git a/_examples/go.work b/_examples/go.work
@@ -0,0 +1,9 @@
+go 1.19
+
+use (
+	./db_search
+	./tokenize
+	./user_dict
+	./wakati
+	./wasm
+)
diff --git a/_examples/tokenize/README.md b/_examples/tokenize/README.md
@@ -0,0 +1,28 @@
+# Tokenizing Example with Kagome
+
+## Analyzing a Japanese text into words and parts of speech with Kagome
+
+This example demonstrates how to analyze (tokenize) a sentence and get the part of speech (POS) of each word using Kagome.
+
+- Target text data is as follows:
+
+```text
+すもももももももものうち
+```
+
+- Example output:
+
+```shellsession
+$ cd /path/to/kagome/_examples/tokenize
+$ go run .
+---tokenize---
+すもも  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
+も      助詞,係助詞,*,*,*,*,も,モ,モ
+もも    名詞,一般,*,*,*,*,もも,モモ,モモ
+も      助詞,係助詞,*,*,*,*,も,モ,モ
+もも    名詞,一般,*,*,*,*,もも,モモ,モモ
+の      助詞,連体化,*,*,*,*,の,ノ,ノ
+うち    名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
+```
+
+> __Note__ that tokenization varies depending on the dictionary used. In this example we use the IPA dictionary.
diff --git a/_examples/tokenize/go.mod b/_examples/tokenize/go.mod
@@ -0,0 +1,12 @@
+module kagome/examples/tokenize
+
+go 1.19
+
+require (
+	github.com/ikawaha/kagome-dict/ipa v1.0.10
+	github.com/ikawaha/kagome/v2 v2.9.3
+)
+
+require github.com/ikawaha/kagome-dict v1.0.9 // indirect
+
+replace github.com/ikawaha/kagome/v2 => ../../
diff --git a/_examples/tokenize/go.sum b/_examples/tokenize/go.sum
@@ -0,0 +1,4 @@
+github.com/ikawaha/kagome-dict v1.0.9 h1:1Gg735LbBYsdFu13fdTvW6eVt0qIf5+S2qXGJtlG8C0=
+github.com/ikawaha/kagome-dict v1.0.9/go.mod h1:mn9itZLkFb6Ixko7q8eZmUabHbg3i9EYewnhOtvd2RM=
+github.com/ikawaha/kagome-dict/ipa v1.0.10 h1:wk9I21yg+fKdL6HJB9WgGiyXIiu1VttumJwmIRwn0g8=
+github.com/ikawaha/kagome-dict/ipa v1.0.10/go.mod h1:rbaOKrF58zhtpV2+2sVZBj0sUSp9dVKPjr660MehJbs=
diff --git a/sample/_example/tokenize/main.go → _examples/tokenize/main.go b/sample/_example/tokenize/main.go → _examples/tokenize/main.go
@@ -22,12 +22,12 @@ func main() {
 	}
 
 	// Output:
-	//---tokenize---
-	//すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
-	//も	助詞,係助詞,*,*,*,*,も,モ,モ
-	//もも	名詞,一般,*,*,*,*,もも,モモ,モモ
-	//も	助詞,係助詞,*,*,*,*,も,モ,モ
-	//もも	名詞,一般,*,*,*,*,もも,モモ,モモ
-	//の	助詞,連体化,*,*,*,*,の,ノ,ノ
-	//うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
+	// ---tokenize---
+	// すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
+	// も	助詞,係助詞,*,*,*,*,も,モ,モ
+	// もも	名詞,一般,*,*,*,*,もも,モモ,モモ
+	// も	助詞,係助詞,*,*,*,*,も,モ,モ
+	// もも	名詞,一般,*,*,*,*,もも,モモ,モモ
+	// の	助詞,連体化,*,*,*,*,の,ノ,ノ
+	// うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
 }
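
Each line of the output in the comment above is a surface form, a tab, and a comma-separated feature list. As a rough sketch of how such a line could be consumed downstream, the snippet below splits it back into its parts. `parsedToken` and `parseLine` are hypothetical names introduced here, and the code works on the printed text only; it does not call Kagome.

```go
package main

import (
	"fmt"
	"strings"
)

// parsedToken holds the fields of one printed output line.
// Hypothetical type for illustration.
type parsedToken struct {
	Surface  string
	Features []string
}

// parseLine splits "surface<TAB>f1,f2,..." into its parts.
// It returns false when the line contains no tab (e.g. a header line).
func parseLine(line string) (parsedToken, bool) {
	surface, feats, ok := strings.Cut(line, "\t")
	if !ok {
		return parsedToken{}, false
	}
	return parsedToken{
		Surface:  surface,
		Features: strings.Split(feats, ","),
	}, true
}

func main() {
	lines := []string{
		"---tokenize---",
		"すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ",
		"も\t助詞,係助詞,*,*,*,*,も,モ,モ",
	}
	for _, l := range lines {
		tok, ok := parseLine(l)
		if !ok {
			continue // skip lines that are not token rows
		}
		// Features[0] is the top-level part of speech in this output.
		fmt.Printf("%s -> %s\n", tok.Surface, tok.Features[0])
	}
}
```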
diff --git a/_examples/user_dict/go.mod b/_examples/user_dict/go.mod
@@ -0,0 +1,11 @@
+module kagome/examples/user_dict
+
+go 1.19
+
+require (
+	github.com/ikawaha/kagome-dict v1.0.9
+	github.com/ikawaha/kagome-dict/ipa v1.0.10
+	github.com/ikawaha/kagome/v2 v2.9.3
+)
+
+replace github.com/ikawaha/kagome/v2 => ../../
diff --git a/_examples/user_dict/go.sum b/_examples/user_dict/go.sum
@@ -0,0 +1,4 @@
+github.com/ikawaha/kagome-dict v1.0.9 h1:1Gg735LbBYsdFu13fdTvW6eVt0qIf5+S2qXGJtlG8C0=
+github.com/ikawaha/kagome-dict v1.0.9/go.mod h1:mn9itZLkFb6Ixko7q8eZmUabHbg3i9EYewnhOtvd2RM=
+github.com/ikawaha/kagome-dict/ipa v1.0.10 h1:wk9I21yg+fKdL6HJB9WgGiyXIiu1VttumJwmIRwn0g8=
+github.com/ikawaha/kagome-dict/ipa v1.0.10/go.mod h1:rbaOKrF58zhtpV2+2sVZBj0sUSp9dVKPjr660MehJbs=
diff --git a/_examples/user_dict/main.go b/_examples/user_dict/main.go
@@ -0,0 +1,36 @@
+package main
+
+import (
+	"fmt"
+
+	"github.com/ikawaha/kagome-dict/dict"
+	"github.com/ikawaha/kagome-dict/ipa"
+	"github.com/ikawaha/kagome/v2/tokenizer"
+)
+
+func main() {
+	// Use IPA dictionary as a system dictionary.
+	sysDic := ipa.Dict()
+
+	// Build a user dictionary from a file.
+	userDic, err := dict.NewUserDict("userdict.txt")
+	if err != nil {
+		panic(err)
+	}
+
+	// Specify the user dictionary as an option.
+	t, err := tokenizer.New(sysDic, tokenizer.UserDict(userDic), tokenizer.OmitBosEos())
+	if err != nil {
+		panic(err)
+	}
+
+	tokens := t.Analyze("関西国際空港限定トートバッグ", tokenizer.Search)
+	for _, token := range tokens {
+		fmt.Printf("%s\t%v\n", token.Surface, token.Features())
+	}
+
+	// Output:
+	// 関西国際空港    [テスト名詞 関西/国際/空港 カンサイ/コクサイ/クウコウ]
+	// 限定    [名詞 サ変接続 * * * * 限定 ゲンテイ ゲンテイ]
+	// トートバッグ    [名詞 一般 * * * * *]
+}
diff --git a/sample/dict/userdict.txt → _examples/user_dict/userdict.txt b/sample/dict/userdict.txt → _examples/user_dict/userdict.txt
diff --git a/_examples/wakati/README.md b/_examples/wakati/README.md
@@ -0,0 +1,25 @@
+# Wakati Example with Kagome
+
+## Segmenting Japanese text into words with Kagome
+
+In this example, we demonstrate how to segment Japanese text into words using Kagome.
+
+- Target text data is as follows:
+
+```text
+すもももももももものうち
+```
+
+- Example output:
+
+```shellsession
+$ cd /path/to/kagome/_examples/wakati
+$ go run .
+----wakati---
+すもも/も/もも/も/もも/の/うち
+```
+
+> __Note__ that segmentation varies depending on the dictionary used.
+> In this example we use the IPA dictionary; for search purposes, however, the Uni dictionary is recommended.
+>
+> - [What is a Kagome dictionary?](https://github.com/ikawaha/kagome/wiki/About-the-dictionary#what-is-a-kagome-dictionary) | Wiki | kagome @ GitHub
diff --git a/_examples/wakati/go.mod b/_examples/wakati/go.mod
@@ -0,0 +1,12 @@
+module kagome/examples/wakati
+
+go 1.19
+
+require (
+	github.com/ikawaha/kagome-dict/ipa v1.0.10
+	github.com/ikawaha/kagome/v2 v2.9.3
+)
+
+require github.com/ikawaha/kagome-dict v1.0.9 // indirect
+
+replace github.com/ikawaha/kagome/v2 => ../../
diff --git a/_examples/wakati/go.sum b/_examples/wakati/go.sum
@@ -0,0 +1,4 @@
+github.com/ikawaha/kagome-dict v1.0.9 h1:1Gg735LbBYsdFu13fdTvW6eVt0qIf5+S2qXGJtlG8C0=
+github.com/ikawaha/kagome-dict v1.0.9/go.mod h1:mn9itZLkFb6Ixko7q8eZmUabHbg3i9EYewnhOtvd2RM=
+github.com/ikawaha/kagome-dict/ipa v1.0.10 h1:wk9I21yg+fKdL6HJB9WgGiyXIiu1VttumJwmIRwn0g8=
+github.com/ikawaha/kagome-dict/ipa v1.0.10/go.mod h1:rbaOKrF58zhtpV2+2sVZBj0sUSp9dVKPjr660MehJbs=
diff --git a/sample/_example/wakati/main.go → _examples/wakati/main.go b/sample/_example/wakati/main.go → _examples/wakati/main.go
diff --git a/_examples/wasm/README.md b/_examples/wasm/README.md
@@ -0,0 +1,34 @@
+# WebAssembly Example with Kagome
+
+In this example we will demonstrate how to use Kagome in a WebAssembly application and show how responsive it can be.
+
+- See: "[Kagome As a Server Side Tokenizer (Feeling Kagome Slow?)](https://github.com/ikawaha/kagome/wiki/Kagome-As-a-Server-Side-Tokenizer)" | Wiki | kagome @ GitHub
+
+## How to Use
+
+```sh
+# Build the wasm binary
+GOOS=js GOARCH=wasm go build -o kagome.wasm main.go
+
+# Copy the wasm_exec.js that matches the compiled binary
+cp "$(go env GOROOT)/misc/wasm/wasm_exec.js" .
+**snip**
+```
+
+Now load `wasm_exec.js` and `kagome.wasm` from the HTML file and serve it with a web server.
+
+- Online demo: [https://ikawaha.github.io/kagome/](https://ikawaha.github.io/kagome/)
+
+```shellsession
+├── docs                 ... gh-pages
+│   ├── index.html
+│   ├── kagome.wasm
+│   └── wasm_exec.js
+├── _examples
+│   └── wasm
+│       ├── README.md     ... this document
+│       ├── kagome.html   ... html sample
+│       ├── main.go       ... source code
+│       ├── go.mod
+│       └── go.sum
+```
diff --git a/_examples/wasm/go.mod b/_examples/wasm/go.mod
@@ -0,0 +1,12 @@
+module kagome/examples/wasm
+
+go 1.19
+
+require (
+	github.com/ikawaha/kagome-dict/ipa v1.0.10
+	github.com/ikawaha/kagome/v2 v2.9.3
+)
+
+require github.com/ikawaha/kagome-dict v1.0.9 // indirect
+
+replace github.com/ikawaha/kagome/v2 => ../../
diff --git a/_examples/wasm/go.sum b/_examples/wasm/go.sum
@@ -0,0 +1,4 @@
+github.com/ikawaha/kagome-dict v1.0.9 h1:1Gg735LbBYsdFu13fdTvW6eVt0qIf5+S2qXGJtlG8C0=
+github.com/ikawaha/kagome-dict v1.0.9/go.mod h1:mn9itZLkFb6Ixko7q8eZmUabHbg3i9EYewnhOtvd2RM=
+github.com/ikawaha/kagome-dict/ipa v1.0.10 h1:wk9I21yg+fKdL6HJB9WgGiyXIiu1VttumJwmIRwn0g8=
+github.com/ikawaha/kagome-dict/ipa v1.0.10/go.mod h1:rbaOKrF58zhtpV2+2sVZBj0sUSp9dVKPjr660MehJbs=
diff --git a/sample/wasm/kagome.html → _examples/wasm/kagome.html b/sample/wasm/kagome.html → _examples/wasm/kagome.html
diff --git a/sample/wasm/main.go → _examples/wasm/main.go b/sample/wasm/main.go → _examples/wasm/main.go
@@ -1,5 +1,5 @@
-//go:build ignore
-// +build ignore
+//go:build js && wasm
+// +build js,wasm
 
 package main