
Merge pull request #322 from KEINOS/revamping-example-directory
Refactor example directory
ikawaha authored Apr 13, 2024
2 parents 2ef2717 + b65900b commit 907eda5
Showing 32 changed files with 315 additions and 73 deletions.
7 changes: 7 additions & 0 deletions .github/workflows/go.yml
Original file line number Diff line number Diff line change
@@ -28,6 +28,13 @@ jobs:
go-version-file: 'go.mod'
cache: true

- name: Remove symlink 2to3
if: matrix.os == 'macos-latest'
run: |
: # Workaround GitHub Actions Python issues
: # https://github.com/Homebrew/homebrew-core/issues/165793#issuecomment-1989441193
brew unlink python && brew link --overwrite python
- name: Set up Graphviz
uses: ts-graphviz/setup-graphviz@v2

71 changes: 71 additions & 0 deletions _examples/db_search/README.md
@@ -0,0 +1,71 @@
# Full-text search with Kagome and SQLite3

This example provides a practical example of how to work with Japanese text data and **perform efficient [full-text search](https://en.wikipedia.org/wiki/Full-text_search) using Kagome and SQLite3**.

- Target text data is as follows:

```text
人魚は、南の方の海にばかり棲んでいるのではありません。
北の海にも棲んでいたのであります。
北方の海の色は、青うございました。
ある時、岩の上に、女の人魚があがって、
あたりの景色を眺めながら休んでいました。
小川未明 『赤い蝋燭と人魚』
```

- Example output:

```shellsession
$ cd /path/to/kagome/_examples/db_search
$ go run .
Searching for: 人魚
Found content: 人魚は、南の方の海にばかり棲んでいるのではありません。 at line: 1
Found content: ある時、岩の上に、女の人魚があがって、 at line: 4
Found content: 小川未明 『赤い蝋燭と人魚』 at line: 6
Searching for: 人
No results found
Searching for: 北方
Found content: 北方の海の色は、青うございました。 at line: 3
Searching for: 北
Found content: 北の海にも棲んでいたのであります。 at line: 2
```

- [View main.go](main.go)

## Details

In this example, each line of text is inserted as a row into the SQLite3 database, and the database is then searched for the words "人魚", "人", "北方", and "北".

When inserting text data into the database, Kagome is used to tokenize the text into words.

Each line is tokenized by Kagome into space-separated words (so-called "Wakati" output) and recorded, alongside the original text, in a separate table for [FTS4](https://www.sqlite.org/fts3.html) (Full-Text Search).

This allows Unicode text data that is not separated by spaces, such as Japanese, to be searched by FTS.

Note that the search matches whole words, not characters: "人" does not match "人魚", and likewise "北" does not match "北方".

This is because the FTS4 module in SQLite3 is designed to index and search words, not characters.
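The word-level behavior can be sketched without SQLite at all: FTS4's simple tokenizer splits the indexed text on spaces, so a query term matches only if it equals one of the space-separated tokens Kagome produced. The helper below is a hypothetical, stdlib-only illustration of that test (not SQLite's actual implementation), and the tokenized line is an illustrative approximation of Kagome's output for line 2 of the target text:

```go
package main

import (
	"fmt"
	"slices"
	"strings"
)

// matchesWord reports whether query equals one of the space-separated
// tokens in the tokenized text. This mirrors the word-level test FTS4
// applies when matching a query term against wakati-tokenized content.
func matchesWord(tokenized, query string) bool {
	return slices.Contains(strings.Fields(tokenized), query)
}

func main() {
	// Line 2 of the target text, pre-tokenized into space-separated words
	// (an illustrative approximation of Kagome's wakati output).
	tokenized := "北 の 海 に も 棲ん で い た の で あり ます 。"

	fmt.Println(matchesWord(tokenized, "北"))  // a whole token: matches
	fmt.Println(matchesWord(tokenized, "北方")) // not a token here: no match
}
```

This is why "北" finds line 2 in the example output above while "北方" does not.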

### Aim of this example

This example can be useful in scenarios where you need to perform full-text searches on Japanese text.

It demonstrates how to tokenize Japanese text using Kagome, which is a common requirement when working with text data in the Japanese language.

By using SQLite with FTS4, it efficiently manages and searches through a large amount of text data, making it suitable for applications like:

1. **Search Engines:** You can use this code as a basis for building a search engine that indexes and searches Japanese text content.
2. **Document Management Systems:** This code can be integrated into a document management system to enable full-text search capabilities for Japanese documents.
3. **Content Recommendation Systems:** When you have a large collection of Japanese content, you can use this code to implement content recommendation systems based on user queries.
4. **Chatbots and NLP:** If you're building chatbots or natural language processing (NLP) systems for Japanese language, this code can assist in text analysis and search within the chatbot's knowledge base.

## Acknowledgements

This example is adapted in part from the following book:

- p.204, 9.2 "データーベース登録プログラム", "Go言語プログラミングエッセンス エンジニア選書"
- Written by: [Mattn](https://github.com/mattn)
- Published: 2023/3/9 (技術評論社)
- ISBN: 4297134195 / 978-4297134198
- ASIN: B0BVZCJQ4F / [https://amazon.co.jp/dp/4297134195](https://amazon.co.jp/dp/4297134195)
- Original sample code: [https://github.com/mattn/aozora-search](https://github.com/mattn/aozora-search)
2 changes: 1 addition & 1 deletion sample/_example/go.mod → _examples/db_search/go.mod
@@ -1,4 +1,4 @@
module kagome/examples
module kagome/examples/db_search

go 1.19

File renamed without changes.
34 changes: 6 additions & 28 deletions sample/_example/db_search/main.go → _examples/db_search/main.go
@@ -1,35 +1,9 @@
/*
# TL; DR
# Full-text search with Kagome and SQLite3
This example provides a practical example of how to work with Japanese text data and perform efficient full-text search using Kagome and SQLite3.
# TS; WM
In this example, each line of text is inserted into a row of the SQLite3 database, and then the database is searched for the word "人魚" and "人".
Note that the string tokenized by Kagome, a.k.a. "Wakati", is recorded in a separate table for FTS (Full-Text-Search) at the same time as the original text.
This allows Unicode text data that is not separated by spaces, such as Japanese, to be searched by FTS.
Aim of this example:
This example can be useful in scenarios where you need to perform full-text searches on Japanese text. It demonstrates how to tokenize Japanese text using Kagome, which is a common requirement when working with text data in the Japanese language. By using SQLite with FTS4, it efficiently manages and searches through a large amount of text data, making it suitable for applications like:
1. **Search Engines:** You can use this code as a basis for building a search engine that indexes and searches Japanese text content.
2. **Document Management Systems:** This code can be integrated into a document management system to enable full-text search capabilities for Japanese documents.
3. **Content Recommendation Systems:** When you have a large collection of Japanese content, you can use this code to implement content recommendation systems based on user queries.
4. **Chatbots and NLP:** If you're building chatbots or natural language processing (NLP) systems for Japanese language, this code can assist in text analysis and search within the chatbot's knowledge base.
Acknowledgements:
This example is taken in part from the following book for reference.
- p.204, 9.2 "データーベース登録プログラム", "Go言語プログラミングエッセンス エンジニア選書"
- Written by: Mattn
- Published: 2023/3/9 (技術評論社)
- ISBN: 4297134195 / 978-4297134198
- ASIN: B0BVZCJQ4F / https://amazon.co.jp/dp/4297134195
- Original sample code: https://github.com/mattn/aozora-search
For details and acknowledgements, see the README.md file in the same directory.
*/
package main

@@ -39,6 +13,7 @@ import (
"fmt"
"log"
"os"
"slices"
"strings"

"github.com/ikawaha/kagome-dict/ipa"
@@ -165,6 +140,9 @@ func insertSearchToken(db *sql.DB, rowID int64, content string) error {
}

seg := tknzr.Wakati(content)

seg = slices.Compact(seg) // remove duplicate segment tokens

tokenizedContent := strings.Join(seg, " ")

_, err = db.Exec(
9 changes: 9 additions & 0 deletions _examples/go.work
@@ -0,0 +1,9 @@
go 1.19

use (
./db_search
./tokenize
./user_dict
./wakati
./wasm
)
28 changes: 28 additions & 0 deletions _examples/tokenize/README.md
@@ -0,0 +1,28 @@
# Tokenizing Example with Kagome

## Analyzing a Japanese text into words and parts of speech with Kagome

This example demonstrates how to analyze (tokenize) a sentence and get the part of speech (POS) of each word using Kagome.

- Target text data is as follows:

```text
すもももももももものうち
```

- Example output:

```shellsession
$ cd /path/to/kagome/_examples/tokenize
$ go run .
---tokenize---
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
```

> __Note__ that tokenization varies depending on the dictionary used. In this example we use the IPA dictionary.
12 changes: 12 additions & 0 deletions _examples/tokenize/go.mod
@@ -0,0 +1,12 @@
module kagome/examples/tokenize

go 1.19

require (
github.com/ikawaha/kagome-dict/ipa v1.0.10
github.com/ikawaha/kagome/v2 v2.9.3
)

require github.com/ikawaha/kagome-dict v1.0.9 // indirect

replace github.com/ikawaha/kagome/v2 => ../../
4 changes: 4 additions & 0 deletions _examples/tokenize/go.sum
@@ -0,0 +1,4 @@
github.com/ikawaha/kagome-dict v1.0.9 h1:1Gg735LbBYsdFu13fdTvW6eVt0qIf5+S2qXGJtlG8C0=
github.com/ikawaha/kagome-dict v1.0.9/go.mod h1:mn9itZLkFb6Ixko7q8eZmUabHbg3i9EYewnhOtvd2RM=
github.com/ikawaha/kagome-dict/ipa v1.0.10 h1:wk9I21yg+fKdL6HJB9WgGiyXIiu1VttumJwmIRwn0g8=
github.com/ikawaha/kagome-dict/ipa v1.0.10/go.mod h1:rbaOKrF58zhtpV2+2sVZBj0sUSp9dVKPjr660MehJbs=
16 changes: 8 additions & 8 deletions sample/_example/tokenize/main.go → _examples/tokenize/main.go
@@ -22,12 +22,12 @@ func main() {
}

// Output:
//---tokenize---
//すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
//も 助詞,係助詞,*,*,*,*,も,モ,モ
//もも 名詞,一般,*,*,*,*,もも,モモ,モモ
//も 助詞,係助詞,*,*,*,*,も,モ,モ
//もも 名詞,一般,*,*,*,*,もも,モモ,モモ
//の 助詞,連体化,*,*,*,*,の,ノ,ノ
//うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
// ---tokenize---
// すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
// も 助詞,係助詞,*,*,*,*,も,モ,モ
// もも 名詞,一般,*,*,*,*,もも,モモ,モモ
// も 助詞,係助詞,*,*,*,*,も,モ,モ
// もも 名詞,一般,*,*,*,*,もも,モモ,モモ
// の 助詞,連体化,*,*,*,*,の,ノ,ノ
// うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
}
11 changes: 11 additions & 0 deletions _examples/user_dict/go.mod
@@ -0,0 +1,11 @@
module kagome/examples/user_dict

go 1.19

require (
github.com/ikawaha/kagome-dict v1.0.9
github.com/ikawaha/kagome-dict/ipa v1.0.10
github.com/ikawaha/kagome/v2 v2.9.3
)

replace github.com/ikawaha/kagome/v2 => ../../
4 changes: 4 additions & 0 deletions _examples/user_dict/go.sum
@@ -0,0 +1,4 @@
github.com/ikawaha/kagome-dict v1.0.9 h1:1Gg735LbBYsdFu13fdTvW6eVt0qIf5+S2qXGJtlG8C0=
github.com/ikawaha/kagome-dict v1.0.9/go.mod h1:mn9itZLkFb6Ixko7q8eZmUabHbg3i9EYewnhOtvd2RM=
github.com/ikawaha/kagome-dict/ipa v1.0.10 h1:wk9I21yg+fKdL6HJB9WgGiyXIiu1VttumJwmIRwn0g8=
github.com/ikawaha/kagome-dict/ipa v1.0.10/go.mod h1:rbaOKrF58zhtpV2+2sVZBj0sUSp9dVKPjr660MehJbs=
36 changes: 36 additions & 0 deletions _examples/user_dict/main.go
@@ -0,0 +1,36 @@
package main

import (
"fmt"

"github.com/ikawaha/kagome-dict/dict"
"github.com/ikawaha/kagome-dict/ipa"
"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
// Use IPA dictionary as a system dictionary.
sysDic := ipa.Dict()

// Build a user dictionary from a file.
userDic, err := dict.NewUserDict("userdict.txt")
if err != nil {
panic(err)
}

// Specify the user dictionary as an option.
t, err := tokenizer.New(sysDic, tokenizer.UserDict(userDic), tokenizer.OmitBosEos())
if err != nil {
panic(err)
}

tokens := t.Analyze("関西国際空港限定トートバッグ", tokenizer.Search)
for _, token := range tokens {
fmt.Printf("%s\t%v\n", token.Surface, token.Features())
}

// Output:
// 関西国際空港 [テスト名詞 関西/国際/空港 カンサイ/コクサイ/クウコウ]
// 限定 [名詞 サ変接続 * * * * 限定 ゲンテイ ゲンテイ]
// トートバッグ [名詞 一般 * * * * *]
}
File renamed without changes.
25 changes: 25 additions & 0 deletions _examples/wakati/README.md
@@ -0,0 +1,25 @@
# Wakati Example with Kagome

## Segmenting Japanese text into words with Kagome

In this example, we demonstrate how to segment Japanese text into words using Kagome.

- Target text data is as follows:

```text
すもももももももものうち
```

- Example output:

```shellsession
$ cd /path/to/kagome/_examples/wakati
$ go run .
----wakati---
すもも/も/もも/も/もも/の/うち
```

> __Note__ that segmentation varies depending on the dictionary used.
> In this example, we use the IPA dictionary, but for search purposes the Uni dictionary is recommended.
>
> - [What is a Kagome dictionary?](https://github.com/ikawaha/kagome/wiki/About-the-dictionary#what-is-a-kagome-dictionary) | Wiki | kagome @ GitHub
12 changes: 12 additions & 0 deletions _examples/wakati/go.mod
@@ -0,0 +1,12 @@
module kagome/examples/wakati

go 1.19

require (
github.com/ikawaha/kagome-dict/ipa v1.0.10
github.com/ikawaha/kagome/v2 v2.9.3
)

require github.com/ikawaha/kagome-dict v1.0.9 // indirect

replace github.com/ikawaha/kagome/v2 => ../../
4 changes: 4 additions & 0 deletions _examples/wakati/go.sum
@@ -0,0 +1,4 @@
github.com/ikawaha/kagome-dict v1.0.9 h1:1Gg735LbBYsdFu13fdTvW6eVt0qIf5+S2qXGJtlG8C0=
github.com/ikawaha/kagome-dict v1.0.9/go.mod h1:mn9itZLkFb6Ixko7q8eZmUabHbg3i9EYewnhOtvd2RM=
github.com/ikawaha/kagome-dict/ipa v1.0.10 h1:wk9I21yg+fKdL6HJB9WgGiyXIiu1VttumJwmIRwn0g8=
github.com/ikawaha/kagome-dict/ipa v1.0.10/go.mod h1:rbaOKrF58zhtpV2+2sVZBj0sUSp9dVKPjr660MehJbs=
File renamed without changes.
34 changes: 34 additions & 0 deletions _examples/wasm/README.md
@@ -0,0 +1,34 @@
# WebAssembly Example with Kagome

In this example we will demonstrate how to use Kagome in a WebAssembly application and show how responsive it can be.

- See: "[Kagome As a Server Side Tokenizer (Feeling Kagome Slow?)](https://github.com/ikawaha/kagome/wiki/Kagome-As-a-Server-Side-Tokenizer)" | Wiki | kagome @ GitHub

## How to Use

```sh
# Build the wasm binary
GOOS=js GOARCH=wasm go build -o kagome.wasm main.go

# Copy the wasm_exec.js that matches the compiled binary
cp "$(go env GOROOT)/misc/wasm/wasm_exec.js" .
**snip**
```

Now call the `wasm_exec.js` and `kagome.wasm` from the HTML file and run a web server.

- Online demo: [https://ikawaha.github.io/kagome/](https://ikawaha.github.io/kagome/)

```shellsession
├── docs ... gh-pages
│   ├── index.html
│   ├── kagome.wasm
│   └── wasm_exec.js
├── _examples
│   └── wasm
│   ├── README.md ... this document
│   ├── kagome.html ... html sample
│   ├── main.go ... source code
│   ├── go.mod
│   └── go.sum
```
12 changes: 12 additions & 0 deletions _examples/wasm/go.mod
@@ -0,0 +1,12 @@
module kagome/examples/wasm

go 1.19

require (
github.com/ikawaha/kagome-dict/ipa v1.0.10
github.com/ikawaha/kagome/v2 v2.9.3
)

require github.com/ikawaha/kagome-dict v1.0.9 // indirect

replace github.com/ikawaha/kagome/v2 => ../../
4 changes: 4 additions & 0 deletions _examples/wasm/go.sum
@@ -0,0 +1,4 @@
github.com/ikawaha/kagome-dict v1.0.9 h1:1Gg735LbBYsdFu13fdTvW6eVt0qIf5+S2qXGJtlG8C0=
github.com/ikawaha/kagome-dict v1.0.9/go.mod h1:mn9itZLkFb6Ixko7q8eZmUabHbg3i9EYewnhOtvd2RM=
github.com/ikawaha/kagome-dict/ipa v1.0.10 h1:wk9I21yg+fKdL6HJB9WgGiyXIiu1VttumJwmIRwn0g8=
github.com/ikawaha/kagome-dict/ipa v1.0.10/go.mod h1:rbaOKrF58zhtpV2+2sVZBj0sUSp9dVKPjr660MehJbs=
File renamed without changes.
4 changes: 2 additions & 2 deletions sample/wasm/main.go → _examples/wasm/main.go
@@ -1,5 +1,5 @@
//go:build ignore
// +build ignore
//go:build js && wasm
// +build js,wasm

package main

