add readme
Lips7 committed Jun 12, 2024
1 parent 3a0e2bf commit aabfa73
Showing 6 changed files with 337 additions and 7 deletions.
Empty file added DESIGN.md
Empty file.
50 changes: 49 additions & 1 deletion README.md
@@ -1,3 +1,51 @@
# Matcher

A high performance matcher for massive amounts of sensitive words.

## Features

- Supports simple word matching, regex-based matching, and similarity-based matching.
- Supports different text normalization methods for matching.
  - `Fanjian`: Convert traditional Chinese characters to simplified ones, e.g. `蟲艸` -> `虫艹`.
  - `DeleteNormalize`: Delete whitespace, punctuation, and other non-alphanumeric characters, and normalize the remaining characters, e.g. `𝜢𝕰𝕃𝙻Ϙ 𝙒ⓞƦℒ𝒟!` -> `helloworld`.
  - `PinYin`: Convert Chinese characters to Pinyin with each syllable delimited, which can be used for fuzzy matching, e.g. `西安` -> `/xi//an/` matches `洗按` -> `/xi//an/`, but does not match a single character whose Pinyin is `/xian/`.
  - `PinYinChar`: Convert Chinese characters to Pinyin without delimiters, which can be used for fuzzy matching, e.g. `西安` -> `xian` matches both `洗按` -> `xian` and a single character whose Pinyin is `xian`.
- Supports combination word matching and repeated word matching, e.g. `hello,world` matches both `hello world` and `world,hello`; `无,法,无,天` matches `无无法天` but not `无法天`, because `无` appears twice in the word.
- Supports customizable exemption lists to exclude certain words from matching.
- Can handle large amounts of sensitive words efficiently.
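
The combination and repeated word rules above can be illustrated with a minimal Python sketch. It is not the library's implementation: the function name `combination_match` is hypothetical, no text normalization is applied, and it assumes that a sub-word listed `n` times in the word must occur at least `n` times in the text.

```python
from collections import Counter


def combination_match(word: str, text: str) -> bool:
    """Return True if every comma-separated part of `word` occurs in `text`.

    Assumption for this sketch: a part listed n times must occur at least
    n times (non-overlapping) in the text, in any order.
    """
    required = Counter(part for part in word.split(",") if part)
    return all(text.count(part) >= count for part, count in required.items())


if __name__ == "__main__":
    print(combination_match("hello,world", "hello world"))
    print(combination_match("无,法,无,天", "无无法天"))
    print(combination_match("无,法,无,天", "无法天"))
```

Counting occurrences rather than tracking positions keeps the check order-independent, which is why `world,hello` satisfies the word `hello,world`.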

## Limitations

- Matchers can only handle words containing no more than 32 combined words and no more than 8 repeated words.
- It is the user's responsibility to ensure that the input data is correct and that `match_id`, `table_id`, and `word_id` are globally unique.

## Usage

- Non-Rust users have to use **msgpack** to serialize the matcher config to bytes.
- Why msgpack and not JSON? JSON does not handle backslashes well: text such as `It's /\/\y duty` will be processed incorrectly when passed through JSON. msgpack is also faster than JSON.
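
As an illustration of the backslash problem, here is a small sketch using only Python's standard `json` module. It shows one way the issue can appear, namely when JSON text is assembled by hand without escaping; the helper names are hypothetical. A JSON library that escapes backslashes properly does round-trip the text, so the sketch covers only the unescaped case.

```python
import json

RAW = r"It's /\/\y duty"


def naive_json_string(text: str) -> str:
    """Wrap text in quotes without escaping, as hand-built JSON might."""
    return '"' + text + '"'


def naive_round_trip(text: str):
    """Parse the hand-built JSON; return None if the parser rejects it."""
    try:
        return json.loads(naive_json_string(text))
    except json.JSONDecodeError:
        return None


def safe_round_trip(text: str) -> str:
    """Round-trip through json.dumps, which escapes backslashes."""
    return json.loads(json.dumps(text))


if __name__ == "__main__":
    print(naive_round_trip(RAW))      # rejected: "\y" is not a valid JSON escape
    print(naive_round_trip(r"/\/"))   # silently altered: "\/" decodes to "/"
    print(safe_round_trip(RAW) == RAW)
```

A binary format such as msgpack stores the string bytes as-is behind a length prefix, so no escaping step is needed at all.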

### For Rust User

See [Rust Readme](./matcher_rs/README.md)

### For Python User

See [Python Readme](./matcher_py/README.md)

### For Java User

Install Rust, clone this repository, and run `cargo build --release`. Then copy `target/release/libmatcher.so` (or `target/release/libmatcher.dylib` on macOS) to `matcher_java/src/main/resources/matcher_c.so`.

See [Java Readme](./matcher_java/README.md)

### For C User

Install Rust, clone this repository, and run `cargo build --release`. Then copy `target/release/libmatcher.so` (or `target/release/libmatcher.dylib` on macOS) to `matcher_c/matcher_c.so`.

See [C Readme](./matcher_c/README.md)
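
The copy step for the Java and C bindings could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the script is expected to run from the repository root after `cargo build --release`, and only the two platforms mentioned above are handled.

```python
import platform
import shutil

# File names produced by the release build, per platform (from the text above).
LIBRARY_NAMES = {"Linux": "libmatcher.so", "Darwin": "libmatcher.dylib"}


def built_library_path(system: str = "") -> str:
    """Return the path of the built shared library for the given platform."""
    system = system or platform.system()
    if system not in LIBRARY_NAMES:
        raise ValueError(f"platform not covered by this sketch: {system}")
    return "target/release/" + LIBRARY_NAMES[system]


def install_library(destination: str, system: str = "") -> str:
    """Copy the built library to `destination` and return the source path."""
    source = built_library_path(system)
    shutil.copyfile(source, destination)
    return source


if __name__ == "__main__":
    # Example: copy the library for the C binding.
    print(install_library("matcher_c/matcher_c.so"))
```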

## Design

Currently, most features are based on [aho_corasick](https://github.com/BurntSushi/aho-corasick), which can find occurrences of many patterns at once, with SIMD acceleration in some cases.
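
To give a feel for how many patterns can be found in a single pass, here is a textbook Aho-Corasick automaton in plain Python. It is only an illustration of the algorithm: it is not how the `aho_corasick` crate is implemented, it has no SIMD acceleration, and the function names are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links, and output lists."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence in one scan."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits


if __name__ == "__main__":
    automaton = build_automaton(["he", "she", "his", "hers"])
    print(find_all("ushers", automaton))
```

The text is scanned once regardless of how many patterns are loaded, which is what makes this approach suitable for large word lists.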

For more implementation details, see [Design](./DESIGN.md).
2 changes: 2 additions & 0 deletions matcher_c/README.md
@@ -1,3 +1,5 @@
# Matcher Rust Implement C FFI bindings

## Notice

Python cffi usage is in the [test.ipynb](test.ipynb) file.
234 changes: 233 additions & 1 deletion matcher_java/README.md
@@ -1,3 +1,235 @@
# Matcher Rust Implement JAVA FFI bindings

## Notice
`./src/main/resources/matcher_c.so` should be the same file as `../matcher_c/matcher_c.so`.

## Usage

[Demo.java](./src/main/java/com/matcher_java/Demo.java)
```java
package com.matcher_java;

import org.msgpack.core.MessageBufferPacker;
import org.msgpack.core.MessagePack;

import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.Pointer;

import java.io.IOException;

/**
 * Interface that uses Java Native Access (JNA) to load native shared library functions for matching text.
 *
 * This interface defines methods to initialize matchers, check for text matches, process match results,
 * and clean up resources. The native library is loaded from the path of the shared object file "matcher_c.so".
 *
 * Methods:
 * - init_matcher(byte[] match_table_map_bytes): Initializes a matcher with a given table map in byte array form.
 * - matcher_is_match(Pointer matcher, byte[] text_bytes): Checks if the given text matches the initialized matcher.
 * - matcher_word_match(Pointer matcher, byte[] text_bytes): Processes a match and returns the result as a Pointer.
 * - drop_matcher(Pointer matcher): Releases the resources held by the matcher.
 * - init_simple_matcher(byte[] simple_match_type_word_map_bytes): Initializes a simple matcher with given map bytes.
 * - simple_matcher_is_match(Pointer simple_matcher, byte[] text_bytes): Checks if the given text matches the simple matcher.
 * - simple_matcher_process(Pointer simple_matcher, byte[] text_bytes): Processes a simple match and returns the result.
 * - drop_simple_matcher(Pointer simple_matcher): Releases the resources held by the simple matcher.
 * - drop_string(Pointer ptr): Releases the memory of a string generated by the matcher.
 */
interface Matcher extends Library {
    Matcher INSTANCE = (Matcher) Native.load(
            Matcher.class.getResource("/matcher_c.so").getPath(),
            Matcher.class);

    Pointer init_matcher(byte[] match_table_map_bytes);

    boolean matcher_is_match(Pointer matcher, byte[] text_bytes);

    Pointer matcher_word_match(Pointer matcher, byte[] text_bytes);

    void drop_matcher(Pointer matcher);

    Pointer init_simple_matcher(byte[] simple_match_type_word_map_bytes);

    boolean simple_matcher_is_match(Pointer simple_matcher, byte[] text_bytes);

    Pointer simple_matcher_process(Pointer simple_matcher, byte[] text_bytes);

    void drop_simple_matcher(Pointer simple_matcher);

    void drop_string(Pointer ptr);
}

public class Demo {
    /**
     * Main method to demonstrate the functionality of simple matcher and matcher processes.
     *
     * This method performs the following steps:
     * 1. Prints a message indicating the start of the Simple Matcher Test.
     * 2. Calls the simple_matcher_process_demo() method to demonstrate simple matcher functionality.
     * 3. Prints a separator line.
     * 4. Prints a message indicating the start of the Matcher Test.
     * 5. Calls the matcher_process_demo() method to demonstrate matcher functionality.
     *
     * @param args Command line arguments (not used in this demo).
     * @throws IOException if an I/O error occurs during the demonstration process.
     */
    public static void main(String[] args) throws IOException {
        System.out.println("Simple Matcher Test");
        simple_matcher_process_demo();

        System.out.println("\n");

        System.out.println("Matcher Test");
        matcher_process_demo();
    }

    /**
     * Demonstrates the functionality of the simple matcher process.
     *
     * This method performs the following steps:
     * 1. Initializes a MessagePack buffer packer and packs a map with a single key-value pair.
     * 2. Converts the packed data to a byte array.
     * 3. Uses the Matcher instance to initialize a simple matcher with the byte array.
     * 4. Converts the Chinese string "你好" to a UTF-8 encoded byte array.
     * 5. Checks if the string matches the simple matcher and prints the result.
     * 6. Processes the match and retrieves the result as a Pointer, then converts it to a UTF-8 string and prints it.
     * 7. Releases the memory of the match result string and the simple matcher.
     *
     * @throws IOException if an I/O error occurs during the byte array processing.
     */
    public static void simple_matcher_process_demo() throws IOException {
        // Create a MessagePack buffer packer to pack simple match type word map data
        MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();
        // Pack a map header with one key-value pair
        packer.packMapHeader(1);
        // Pack the integer key 30
        packer.packInt(30);
        // Pack another map header with one key-value pair
        packer.packMapHeader(1);
        // Pack the integer key 1
        packer.packInt(1);
        // Pack the string value "你好" (Hello in Chinese)
        packer.packString("你好");
        // Close the packer to finalize the byte array
        packer.close();

        // Convert the packed data to a byte array
        byte[] simple_match_type_word_map_bytes = packer.toByteArray();

        // Get an instance of the Matcher loaded from the native library
        Matcher instance = Matcher.INSTANCE;

        // Initialize a simple matcher with the packed word map bytes
        Pointer simple_matcher = instance.init_simple_matcher(simple_match_type_word_map_bytes);

        // Convert the Chinese string "你好" to a UTF-8 encoded byte array
        byte[] str_bytes = "你好".getBytes("utf-8");
        // Create a new byte array with an additional null terminator for C strings
        byte[] c_str_bytes = new byte[str_bytes.length + 1];
        // Copy the UTF-8 bytes into the new array
        System.arraycopy(str_bytes, 0, c_str_bytes, 0, str_bytes.length);

        // Check if the string matches the simple matcher
        boolean is_match = instance.simple_matcher_is_match(simple_matcher, c_str_bytes);
        // Print the match result
        System.out.printf("is_match: %s\n", is_match);

        // Process the match and retrieve the result as a Pointer
        Pointer match_res_ptr = instance.simple_matcher_process(simple_matcher, c_str_bytes);
        // Convert the result Pointer to a UTF-8 string
        String match_res = match_res_ptr.getString(0, "utf-8");
        // Print the match result string
        System.out.printf("match_res: %s\n", match_res);

        // Release the memory of the match result string
        instance.drop_string(match_res_ptr);
        // Release the resources held by the simple matcher
        instance.drop_simple_matcher(simple_matcher);
    }

    /**
     * Demonstrates the functionality of the matcher process.
     *
     * This method performs the following steps:
     * 1. Initializes a MessagePack buffer packer and packs a map with various key-value pairs representing a match table.
     * 2. Converts the packed data to a byte array.
     * 3. Uses the Matcher instance to initialize a matcher with the byte array.
     * 4. Converts the Chinese string "你好" to a UTF-8 encoded byte array.
     * 5. Checks if the string matches the matcher and prints the result.
     * 6. Processes the match and retrieves the result as a Pointer, then converts it to a UTF-8 string and prints it.
     * 7. Releases the memory of the match result string and the matcher.
     *
     * @throws IOException if an I/O error occurs during the byte array processing.
     */
    public static void matcher_process_demo() throws IOException {
        // Create a MessagePack buffer packer to pack the match table dictionary data
        MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();

        // Pack a map header with one key-value pair
        packer.packMapHeader(1);
        // Pack the string key "test"
        packer.packString("test");
        // Pack an array header with one item
        packer.packArrayHeader(1);
        // Pack another map header with six key-value pairs
        packer.packMapHeader(6);

        // Pack the string key "table_id" and integer value 1
        packer.packString("table_id");
        packer.packInt(1);
        // Pack the string key "match_table_type" and string value "simple"
        packer.packString("match_table_type");
        packer.packString("simple");
        // Pack the string key "simple_match_type" and integer value 30
        packer.packString("simple_match_type");
        packer.packInt(30);
        // Pack the string key "word_list" and an array with one string "你好"
        packer.packString("word_list");
        packer.packArrayHeader(1);
        packer.packString("你好");
        // Pack the string key "exemption_simple_match_type" and integer value 30
        // (the key must be packed exactly once, or the six-entry map is misaligned)
        packer.packString("exemption_simple_match_type");
        packer.packInt(30);
        // Pack the string key "exemption_word_list" and an empty array
        packer.packString("exemption_word_list");
        packer.packArrayHeader(0);

        // Convert the packed data to a byte array
        byte[] table_match_dict_bytes = packer.toByteArray();
        // Close the packer to release its resources
        packer.close();

        // Get an instance of the Matcher loaded from the native library
        Matcher instance = Matcher.INSTANCE;

        // Initialize a matcher with the packed match table dictionary bytes
        Pointer matcher = instance.init_matcher(table_match_dict_bytes);

        // Convert the Chinese string "你好" to a UTF-8 encoded byte array
        byte[] str_bytes = "你好".getBytes("utf-8");
        // Create a new byte array with an additional null terminator for C strings
        byte[] c_str_bytes = new byte[str_bytes.length + 1];
        // Copy the UTF-8 bytes into the new array
        System.arraycopy(str_bytes, 0, c_str_bytes, 0, str_bytes.length);

        // Check if the string matches the matcher
        boolean is_match = instance.matcher_is_match(matcher, c_str_bytes);
        // Print the match result
        System.out.printf("is_match: %s\n", is_match);

        // Process the match and retrieve the result as a Pointer
        Pointer match_res_ptr = instance.matcher_word_match(matcher, c_str_bytes);
        // Convert the result Pointer to a UTF-8 string
        String match_res = match_res_ptr.getString(0, "utf-8");
        // Print the match result string
        System.out.printf("match_res: %s\n", match_res);

        // Release the memory of the match result string
        instance.drop_string(match_res_ptr);
        // Release the resources held by the matcher
        instance.drop_matcher(matcher);
    }
}
```
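
To make the packing steps in the demo more concrete, the following Python sketch hand-encodes the same simple matcher config (`{30: {1: "你好"}}`) using the compact MessagePack forms (fixmap, fixarray, positive fixint, fixstr) and builds the null-terminated UTF-8 input. The functions `pack` and `to_c_string` are hypothetical helpers written for this illustration; they cover only the small values used in the demo and are not a substitute for a real msgpack library.

```python
def pack(value) -> bytes:
    """Encode a tiny subset of MessagePack, enough for the demo config."""
    if isinstance(value, bool):
        raise ValueError("booleans are not handled in this sketch")
    if isinstance(value, int):
        if not 0 <= value <= 0x7F:
            raise ValueError("only positive fixint (0..127) is handled")
        return bytes([value])
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) > 31:
            raise ValueError("only fixstr (up to 31 bytes) is handled")
        return bytes([0xA0 | len(data)]) + data
    if isinstance(value, dict):
        if len(value) > 15:
            raise ValueError("only fixmap (up to 15 entries) is handled")
        body = b"".join(pack(k) + pack(v) for k, v in value.items())
        return bytes([0x80 | len(value)]) + body
    if isinstance(value, list):
        if len(value) > 15:
            raise ValueError("only fixarray (up to 15 items) is handled")
        return bytes([0x90 | len(value)]) + b"".join(pack(v) for v in value)
    raise ValueError(f"unsupported type: {type(value).__name__}")


def to_c_string(text: str) -> bytes:
    """UTF-8 encode and append the null terminator expected by C."""
    return text.encode("utf-8") + b"\x00"


if __name__ == "__main__":
    print(pack({30: {1: "你好"}}).hex())
    print(to_c_string("你好").hex())
```

The map header, both integer keys, and the string header each take a single byte here, so the whole config is 11 bytes long.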
8 changes: 7 additions & 1 deletion matcher_py/README.md
@@ -1,10 +1,16 @@
# Matcher Rust Implement PyO3 binding

## Installation

```
pip install matcher_py
```

## Usage

- Python usage is in the [test.ipynb](./matcher_py/test.ipynb) file.
- `msgspec` is used to serialize the matcher config. You can also use `ormsgpack` or another msgpack serialization library; all the types are defined in [extension_types.py](./matcher_py/extension_types.py). For performance reasons, `msgspec` is recommended.

### Matcher
```Python
import msgspec
# ...
```
50 changes: 46 additions & 4 deletions matcher_rs/README.md
@@ -1,6 +1,48 @@
# Matcher Rust Implement
# Matcher

A high performance multiple functional word matcher implemented in Rust.

## Usage
Many usage examples can be found in [test.rs](./tests/test.rs).

## Limitations
Matchers can only handle words that contain no more than 32 combined words and no more than 8 repeated words.
To use `matcher_rs` in your Rust project, add the following to your `Cargo.toml` file:

```toml
[dependencies]
matcher_rs = "0.1.3"
```

You can then use the Matcher struct to perform text matching. Here's a basic example:

```rust
// Note: the `gxhash` crate must also be listed as a dependency.
use gxhash::HashMap as GxHashMap;
use matcher_rs::{MatchTable, MatchTableMap, MatchTableType, Matcher, SimpleMatchType};

fn main() {
    // Map a key to the list of match tables that belong to it.
    let match_table_map: MatchTableMap = GxHashMap::from_iter(vec![(
        "key1",
        vec![MatchTable {
            table_id: 1,
            match_table_type: MatchTableType::Simple,
            simple_match_type: SimpleMatchType::FanjianDeleteNormalize,
            word_list: vec!["example", "test"],
            exemption_simple_match_type: SimpleMatchType::FanjianDeleteNormalize,
            exemption_word_list: vec![],
        }],
    )]);

    // Build the matcher once, then reuse it for every text to check.
    let matcher = Matcher::new(&match_table_map);
    let text = "This is an example text.";
    let results = matcher.word_match(text);
}
```

For more detailed usage examples, please refer to the [test.rs](./tests/test.rs) file.

## Benchmarks
- The matcher_rs library includes benchmarks to measure the performance of the matcher. You can find the benchmarks in the [bench.rs](./benches/bench.rs) file. To run the benchmarks, use the following command:

```shell
cargo bench
```

## Contributing
Contributions to matcher_rs are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.

## License
matcher_rs is licensed under the MIT OR Apache-2.0 license. See the [LICENSE](../License.md) file for more information.
