Showing 6 changed files with 337 additions and 7 deletions.
@@ -1,3 +1,51 @@
# Matcher

A high performance matcher for massive amounts of sensitive words.

## Features

- Supports simple word matching, regex-based matching, and similarity-based matching.
- Supports different text normalization methods for matching.
  - `Fanjian`: Simplify traditional Chinese characters to simplified ones, e.g. `蟲艸` -> `虫艹`.
  - `DeleteNormalize`: Delete white space, punctuation, and other non-alphanumeric characters, e.g. `𝜢𝕰𝕃𝙻Ϙ 𝙒ⓞƦℒ𝒟!` -> `helloworld`.
  - `PinYin`: Convert Chinese characters to Pinyin with syllable boundaries kept, which can be used for fuzzy matching, e.g. `西安` -> `/xi//an/` will match `洗按` -> `/xi//an/`, but won't match `先` -> `/xian/`.
  - `PinYinChar`: Convert Chinese characters to Pinyin without syllable boundaries, which can be used for fuzzy matching, e.g. `西安` -> `xian` will match both `洗按` -> `xian` and `先` -> `xian`.
- Supports combination word matching and repeated word matching, e.g. `hello,world` will match `hello world` and `world,hello`; `无,法,无,天` will match `无无法天` but won't match `无法天`, because `无` appears twice in the word.
- Supports customizable exemption lists to exclude certain words from matching.
- Can handle large amounts of sensitive words efficiently.
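
The combination and repeated word rules in the list above can be illustrated with a few lines of plain Rust. This is only a sketch of the matching semantics as described by the examples, not the library's implementation; the function name `combined_match` is hypothetical and not part of `matcher_rs`.

```rust
use std::collections::HashMap;

/// Returns true when `text` contains every comma-separated part of `pattern`
/// at least as many times as that part is listed in the pattern.
/// Hypothetical helper for illustration; not part of matcher_rs.
fn combined_match(pattern: &str, text: &str) -> bool {
    // Count how often each part is required, e.g. "无,法,无,天" needs "无" twice.
    let mut required: HashMap<&str, usize> = HashMap::new();
    for part in pattern.split(',').filter(|p| !p.is_empty()) {
        *required.entry(part).or_insert(0) += 1;
    }
    // Every part must occur (counting non-overlapping occurrences)
    // at least the required number of times, in any order.
    !required.is_empty()
        && required
            .iter()
            .all(|(&part, &need)| text.matches(part).count() >= need)
}
```

With this sketch, `combined_match("hello,world", "world,hello")` returns `true`, while `combined_match("无,法,无,天", "无法天")` returns `false` because `无` occurs only once in the text.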

## Limitations

- Matchers can only handle words containing no more than 32 combined words and no more than 8 repeated words.
- It is the user's responsibility to ensure the correctness of the input data and to ensure that `match_id`, `table_id`, and `word_id` are globally unique.
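
Since input validation is left to the caller, a pre-check along the following lines may be useful before a word list is handed to the matcher. It is a sketch under one reading of the limits above: "combined words" is taken to mean the distinct comma-separated parts of a word, and "repeated words" the number of times one part is listed. The constants and the function `check_limits` are hypothetical names, not part of the library.

```rust
use std::collections::HashMap;

// Assumed interpretation of the documented limits (see the lead-in).
const MAX_COMBINED: usize = 32;
const MAX_REPEATED: usize = 8;

/// Checks one comma-separated word against the two limits.
/// Hypothetical helper for illustration; not part of matcher_rs.
fn check_limits(word: &str) -> Result<(), String> {
    // Count how many times each part is listed.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for part in word.split(',').filter(|p| !p.is_empty()) {
        *counts.entry(part).or_insert(0) += 1;
    }
    if counts.len() > MAX_COMBINED {
        return Err(format!("{} distinct parts, limit is {}", counts.len(), MAX_COMBINED));
    }
    for (part, &n) in &counts {
        if n > MAX_REPEATED {
            return Err(format!("{:?} is repeated {} times, limit is {}", part, n, MAX_REPEATED));
        }
    }
    Ok(())
}
```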

## Usage

- Non-Rust users have to use **msgpack** to serialize the matcher config to bytes.
  - Why msgpack and not JSON? Because JSON doesn't handle backslashes well: for example, `It's /\/\y duty` will be processed incorrectly if JSON is used. msgpack is also faster than JSON.
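
One plausible reading of the backslash concern is that JSON has to escape every backslash inside a string, so the text on the wire differs from the word itself, whereas a MessagePack `str` is a length header followed by the raw UTF-8 bytes. The sketch below hand-encodes both forms with the standard library only; `json_string` and `msgpack_str` are hypothetical helpers written for this illustration (a real project would use a JSON or MessagePack library), and `json_string` handles only the two escapes relevant here.

```rust
/// Encodes `s` as a JSON string literal, escaping only backslashes and
/// double quotes (a full encoder must also escape control characters).
fn json_string(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Encodes `s` as a MessagePack str: a length header, then the raw bytes.
fn msgpack_str(s: &str) -> Vec<u8> {
    let bytes = s.as_bytes();
    let mut out = Vec::with_capacity(bytes.len() + 5);
    match bytes.len() {
        // fixstr: 101XXXXX, length up to 31 bytes
        n if n <= 31 => out.push(0xa0 | n as u8),
        // str 8
        n if n <= 0xff => out.extend_from_slice(&[0xd9, n as u8]),
        // str 16
        n if n <= 0xffff => {
            out.push(0xda);
            out.extend_from_slice(&(n as u16).to_be_bytes());
        }
        // str 32
        n => {
            out.push(0xdb);
            out.extend_from_slice(&(n as u32).to_be_bytes());
        }
    }
    out.extend_from_slice(bytes);
    out
}
```

For `It's /\/\y duty`, `json_string` produces `"It's /\\/\\y duty"` with both backslashes doubled, while `msgpack_str` produces the header byte `0xaf` followed by the 15 original bytes unchanged.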

### For Rust User

See [Rust Readme](./matcher_rs/README.md)

### For Python User

See [Python Readme](./matcher_py/README.md)

### For Java User

Install Rust, clone this repo, run `cargo build --release`, and copy `target/release/libmatcher.so` (or `target/release/libmatcher.dylib` if you are on macOS) to `matcher_java/src/main/resources/matcher_c.so`.

See [Java Readme](./matcher_java/README.md)

### For C User

Install Rust, clone this repo, run `cargo build --release`, and copy `target/release/libmatcher.so` (or `target/release/libmatcher.dylib` if you are on macOS) to `matcher_c/matcher_c.so`.

See [C Readme](./matcher_c/README.md)

## Design

Currently most features are based on [aho_corasick](https://github.com/BurntSushi/aho-corasick), which provides the ability to find occurrences of many patterns at once, with SIMD acceleration in some cases.

For more implementation details, see [Design](./DESIGN.md).
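
To give a feel for what "finding occurrences of many patterns at once" means, here is a deliberately small, standard-library-only Aho-Corasick automaton. It is a teaching sketch, not the `aho_corasick` crate and not this project's code: the type `Automaton` and its methods are hypothetical, it works on bytes, and it has none of the crate's optimisations.

```rust
use std::collections::{HashMap, VecDeque};

/// A tiny byte-based Aho-Corasick automaton (illustration only).
struct Automaton {
    next: Vec<HashMap<u8, usize>>, // trie edges of each state
    fail: Vec<usize>,              // failure link of each state
    out: Vec<Vec<usize>>,          // pattern indices recognised at each state
    lens: Vec<usize>,              // byte length of each pattern
}

impl Automaton {
    fn new(patterns: &[&str]) -> Self {
        let mut ac = Automaton {
            next: vec![HashMap::new()],
            fail: vec![0],
            out: vec![Vec::new()],
            lens: patterns.iter().map(|p| p.len()).collect(),
        };
        // Step 1: insert every pattern into the trie.
        for (id, pat) in patterns.iter().enumerate() {
            let mut state = 0;
            for &b in pat.as_bytes() {
                let existing = ac.next[state].get(&b).copied();
                state = match existing {
                    Some(s) => s,
                    None => {
                        let s = ac.next.len();
                        ac.next.push(HashMap::new());
                        ac.fail.push(0);
                        ac.out.push(Vec::new());
                        ac.next[state].insert(b, s);
                        s
                    }
                };
            }
            ac.out[state].push(id);
        }
        // Step 2: breadth-first pass that sets failure links and lets each
        // state inherit the outputs of the state its failure link points to.
        let mut queue: VecDeque<usize> = ac.next[0].values().copied().collect();
        while let Some(state) = queue.pop_front() {
            let edges: Vec<(u8, usize)> =
                ac.next[state].iter().map(|(&b, &s)| (b, s)).collect();
            for (b, child) in edges {
                let mut f = ac.fail[state];
                while f != 0 && !ac.next[f].contains_key(&b) {
                    f = ac.fail[f];
                }
                let target = ac.next[f].get(&b).copied().unwrap_or(0);
                ac.fail[child] = target;
                let inherited = ac.out[target].clone();
                ac.out[child].extend(inherited);
                queue.push_back(child);
            }
        }
        ac
    }

    /// Scans `text` once and returns (pattern index, start byte offset)
    /// for every occurrence, including overlapping ones.
    fn find_all(&self, text: &str) -> Vec<(usize, usize)> {
        let mut hits = Vec::new();
        let mut state = 0;
        for (i, &b) in text.as_bytes().iter().enumerate() {
            while state != 0 && !self.next[state].contains_key(&b) {
                state = self.fail[state];
            }
            state = self.next[state].get(&b).copied().unwrap_or(0);
            for &id in &self.out[state] {
                hits.push((id, i + 1 - self.lens[id]));
            }
        }
        hits
    }
}
```

For the patterns `["he", "she", "his", "hers"]` and the text `ushers`, `find_all` reports `she` at offset 1, `he` at offset 2, and `hers` at offset 2, all found in a single pass over the text.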
@@ -1,3 +1,5 @@
# Matcher Rust Implement C FFI bindings

## Notice

Python cffi usage is in the [test.ipynb](test.ipynb) file.
@@ -1,3 +1,235 @@
# Matcher Rust Implement JAVA FFI bindings

## Notice

`./src/main/resources/matcher_c.so` should be the same as `../matcher_c/matcher_c.so`.

## Usage

[Demo.java](./src/main/java/com/matcher_java/Demo.java)
```java
package com.matcher_java;

import org.msgpack.core.MessageBufferPacker;
import org.msgpack.core.MessagePack;

import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.Pointer;

import java.io.IOException;

/**
 * Interface that uses Java Native Access (JNA) to load native shared library functions for matching text.
 *
 * This interface defines methods to initialize matchers, check for text matches, process match results,
 * and clean up resources. The native library is loaded from the path of the shared object file "matcher_c.so".
 *
 * Methods:
 * - init_matcher(byte[] match_table_map_bytes): Initializes a matcher with a given table map in byte array form.
 * - matcher_is_match(Pointer matcher, byte[] text_bytes): Checks if the given text matches the initialized matcher.
 * - matcher_word_match(Pointer matcher, byte[] text_bytes): Processes a match and returns the result as a Pointer.
 * - drop_matcher(Pointer matcher): Releases the resources held by the matcher.
 * - init_simple_matcher(byte[] simple_match_type_word_map_bytes): Initializes a simple matcher with given map bytes.
 * - simple_matcher_is_match(Pointer simple_matcher, byte[] text_bytes): Checks if the given text matches the simple matcher.
 * - simple_matcher_process(Pointer simple_matcher, byte[] text_bytes): Processes a simple match and returns the result.
 * - drop_simple_matcher(Pointer simple_matcher): Releases the resources held by the simple matcher.
 * - drop_string(Pointer ptr): Releases the memory of a string generated by the matcher.
 */
interface Matcher extends Library {
    Matcher INSTANCE = (Matcher) Native.load(
            Matcher.class.getResource("/matcher_c.so").getPath(),
            Matcher.class);

    Pointer init_matcher(byte[] match_table_map_bytes);

    boolean matcher_is_match(Pointer matcher, byte[] text_bytes);

    Pointer matcher_word_match(Pointer matcher, byte[] text_bytes);

    void drop_matcher(Pointer matcher);

    Pointer init_simple_matcher(byte[] simple_match_type_word_map_bytes);

    boolean simple_matcher_is_match(Pointer simple_matcher, byte[] text_bytes);

    Pointer simple_matcher_process(Pointer simple_matcher, byte[] text_bytes);

    void drop_simple_matcher(Pointer simple_matcher);

    void drop_string(Pointer ptr);
}

public class Demo {
    /**
     * Main method to demonstrate the functionality of simple matcher and matcher processes.
     *
     * This method performs the following steps:
     * 1. Prints a message indicating the start of the Simple Matcher Test.
     * 2. Calls the simple_matcher_process_demo() method to demonstrate simple matcher functionality.
     * 3. Prints a separator line.
     * 4. Prints a message indicating the start of the Matcher Test.
     * 5. Calls the matcher_process_demo() method to demonstrate matcher functionality.
     *
     * @param args Command line arguments (not used in this demo).
     * @throws IOException if an I/O error occurs during the demonstration process.
     */
    public static void main(String[] args) throws IOException {
        System.out.println("Simple Matcher Test");
        simple_matcher_process_demo();

        System.out.println("\n");

        System.out.println("Matcher Test");
        matcher_process_demo();
    }

    /**
     * Demonstrates the functionality of the simple matcher process.
     *
     * This method performs the following steps:
     * 1. Initializes a MessagePack buffer packer and packs a map with a single key-value pair.
     * 2. Converts the packed data to a byte array.
     * 3. Uses the Matcher instance to initialize a simple matcher with the byte array.
     * 4. Converts the Chinese string "你好" to a UTF-8 encoded byte array.
     * 5. Checks if the string matches the simple matcher and prints the result.
     * 6. Processes the match and retrieves the result as a Pointer, then converts it to a UTF-8 string and prints it.
     * 7. Releases the memory of the match result string and the simple matcher.
     *
     * @throws IOException if an I/O error occurs during the byte array processing.
     */
    public static void simple_matcher_process_demo() throws IOException {
        // Create a MessagePack buffer packer to pack simple match type word map data
        MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();
        // Pack a map header with one key-value pair
        packer.packMapHeader(1);
        // Pack the integer key 30 (the simple match type, as in the matcher demo below)
        packer.packInt(30);
        // Pack another map header with one key-value pair
        packer.packMapHeader(1);
        // Pack the integer key 1
        packer.packInt(1);
        // Pack the string value "你好" (Hello in Chinese)
        packer.packString("你好");
        // Close the packer to finalize the byte array
        packer.close();

        // Convert the packed data to a byte array
        byte[] simple_match_type_word_map_bytes = packer.toByteArray();

        // Get an instance of the Matcher loaded from the native library
        Matcher instance = Matcher.INSTANCE;

        // Initialize a simple matcher with the packed word map bytes
        Pointer simple_matcher = instance.init_simple_matcher(simple_match_type_word_map_bytes);

        // Convert the Chinese string "你好" to a UTF-8 encoded byte array
        byte[] str_bytes = "你好".getBytes("utf-8");
        // Create a new byte array with one extra byte; Java zero-initializes arrays,
        // so the last byte is the null terminator that C strings need
        byte[] c_str_bytes = new byte[str_bytes.length + 1];
        // Copy the UTF-8 bytes into the new array
        System.arraycopy(str_bytes, 0, c_str_bytes, 0, str_bytes.length);

        // Check if the string matches the simple matcher
        boolean is_match = instance.simple_matcher_is_match(simple_matcher, c_str_bytes);
        // Print the match result
        System.out.printf("is_match: %s\n", is_match);

        // Process the match and retrieve the result as a Pointer
        Pointer match_res_ptr = instance.simple_matcher_process(simple_matcher, c_str_bytes);
        // Convert the result Pointer to a UTF-8 string
        String match_res = match_res_ptr.getString(0, "utf-8");
        // Print the match result string
        System.out.printf("match_res: %s\n", match_res);

        // Release the memory of the match result string
        instance.drop_string(match_res_ptr);
        // Release the resources held by the simple matcher
        instance.drop_simple_matcher(simple_matcher);
    }

    /**
     * Demonstrates the functionality of the matcher process.
     *
     * This method performs the following steps:
     * 1. Initializes a MessagePack buffer packer and packs a map with various key-value pairs representing a match table.
     * 2. Converts the packed data to a byte array.
     * 3. Uses the Matcher instance to initialize a matcher with the byte array.
     * 4. Converts the Chinese string "你好" to a UTF-8 encoded byte array.
     * 5. Checks if the string matches the matcher and prints the result.
     * 6. Processes the match and retrieves the result as a Pointer, then converts it to a UTF-8 string and prints it.
     * 7. Releases the memory of the match result string and the matcher.
     *
     * @throws IOException if an I/O error occurs during the byte array processing.
     */
    public static void matcher_process_demo() throws IOException {
        // Create a MessagePack buffer packer to pack the match table dictionary data
        MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();

        // Pack a map header with one key-value pair
        packer.packMapHeader(1);
        // Pack the string key "test"
        packer.packString("test");
        // Pack an array header with one item
        packer.packArrayHeader(1);
        // Pack another map header with six key-value pairs
        packer.packMapHeader(6);

        // Pack the string key "table_id" and integer value 1
        packer.packString("table_id");
        packer.packInt(1);
        // Pack the string key "match_table_type" and string value "simple"
        packer.packString("match_table_type");
        packer.packString("simple");
        // Pack the string key "simple_match_type" and integer value 30
        packer.packString("simple_match_type");
        packer.packInt(30);
        // Pack the string key "word_list" and an array with one string "你好"
        packer.packString("word_list");
        packer.packArrayHeader(1);
        packer.packString("你好");
        // Pack the string key "exemption_simple_match_type" and integer value 30
        packer.packString("exemption_simple_match_type");
        packer.packInt(30);
        // Pack the string key "exemption_word_list" and an empty array
        packer.packString("exemption_word_list");
        packer.packArrayHeader(0);
        // Close the packer to finalize the byte array
        packer.close();

        // Convert the packed data to a byte array
        byte[] table_match_dict_bytes = packer.toByteArray();

        // Get an instance of the Matcher loaded from the native library
        Matcher instance = Matcher.INSTANCE;

        // Initialize a matcher with the packed match table dictionary bytes
        Pointer matcher = instance.init_matcher(table_match_dict_bytes);

        // Convert the Chinese string "你好" to a UTF-8 encoded byte array
        byte[] str_bytes = "你好".getBytes("utf-8");
        // Create a new byte array with one extra (zero) byte as the null terminator for C strings
        byte[] c_str_bytes = new byte[str_bytes.length + 1];
        // Copy the UTF-8 bytes into the new array
        System.arraycopy(str_bytes, 0, c_str_bytes, 0, str_bytes.length);

        // Check if the string matches the matcher
        boolean is_match = instance.matcher_is_match(matcher, c_str_bytes);
        // Print the match result
        System.out.printf("is_match: %s\n", is_match);

        // Process the match and retrieve the result as a Pointer
        Pointer match_res_ptr = instance.matcher_word_match(matcher, c_str_bytes);
        // Convert the result Pointer to a UTF-8 string
        String match_res = match_res_ptr.getString(0, "utf-8");
        // Print the match result string
        System.out.printf("match_res: %s\n", match_res);

        // Release the memory of the match result string
        instance.drop_string(match_res_ptr);
        // Release the resources held by the matcher
        instance.drop_matcher(matcher);
    }
}
```
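
The demo builds the text argument by hand as a byte array that is one byte longer than the UTF-8 data, relying on the trailing zero byte as the C string terminator. For readers who want to see that buffer layout from the Rust side, the sketch below uses only `std::ffi`. It assumes, as the demo's comments suggest, that the native functions read the text as a NUL-terminated C string; `to_c_bytes` and `from_c_bytes` are hypothetical helper names and not part of the bindings.

```rust
use std::ffi::{CStr, CString};

/// Builds the NUL-terminated byte buffer a C API expects for `text`.
/// Hypothetical helper for illustration only.
fn to_c_bytes(text: &str) -> Vec<u8> {
    // CString::new fails if `text` already contains an interior NUL byte.
    CString::new(text)
        .expect("text must not contain interior NUL bytes")
        .into_bytes_with_nul()
}

/// Reads a NUL-terminated UTF-8 buffer back into a string slice.
/// Hypothetical helper for illustration only.
fn from_c_bytes(bytes: &[u8]) -> &str {
    CStr::from_bytes_with_nul(bytes)
        .expect("buffer must end with exactly one NUL byte")
        .to_str()
        .expect("buffer must be valid UTF-8")
}
```

For `你好`, `to_c_bytes` returns seven bytes: the six UTF-8 bytes `e4 bd a0 e5 a5 bd` followed by `00`, which is the same content as `c_str_bytes` in the Java demo.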
@@ -1,6 +1,48 @@
# Matcher

A high-performance, multi-functional word matcher implemented in Rust.

## Usage

To use `matcher_rs` in your Rust project, add the following to your `Cargo.toml` file:

```toml
[dependencies]
matcher_rs = "0.1.3"
```

You can then use the `Matcher` struct to perform text matching. The example below also imports `GxHashMap` from the `gxhash` crate, so that crate needs to be listed as a dependency as well. Here's a basic example:

```rust
use gxhash::HashMap as GxHashMap;
use matcher_rs::{MatchTable, MatchTableMap, MatchTableType, Matcher, SimpleMatchType};

fn main() {
    // One match group ("key1") holding a single simple-match table.
    let match_table_map: MatchTableMap = GxHashMap::from_iter(vec![
        ("key1", vec![MatchTable {
            table_id: 1,
            match_table_type: MatchTableType::Simple,
            simple_match_type: SimpleMatchType::FanjianDeleteNormalize,
            word_list: vec!["example", "test"],
            exemption_simple_match_type: SimpleMatchType::FanjianDeleteNormalize,
            exemption_word_list: vec![],
        }]),
    ]);

    // Build the matcher from the table map, then match a text against it.
    let matcher = Matcher::new(&match_table_map);
    let text = "This is an example text.";
    let results = matcher.word_match(text);
}
```

For more detailed usage examples, please refer to the [test.rs](./tests/test.rs) file.

## Benchmarks

The `matcher_rs` library includes benchmarks to measure the performance of the matcher. You can find them in the [bench.rs](./benches/bench.rs) file. To run the benchmarks, use the following command:

```shell
cargo bench
```

## Contributing

Contributions to `matcher_rs` are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.

## License

`matcher_rs` is licensed under the MIT OR Apache-2.0 license. See the [LICENSE](../License.md) file for more information.