feat: improves chinese and japanese tokenizers #899
Merged
This pull request includes significant changes to the `packages/tokenizers` module, primarily focusing on migrating from a custom build system using Rust and WebAssembly to a TypeScript-based build system. The changes also include updates to the configuration files, package scripts, and test files to accommodate this migration.

Migration to TypeScript-based build system:

* `packages/tokenizers/.tshy/build.json`: Added TypeScript build configuration with `nodenext` module and module resolution options.
* `packages/tokenizers/.tshy/commonjs.json`: Added configuration for CommonJS output, specifying source files to include and exclude, and setting the output directory.
* `packages/tokenizers/.tshy/esm.json`: Added configuration for ES module output, specifying source files to include and exclude, and setting the output directory.

Package configuration updates:

* `packages/tokenizers/package.json`: Updated the `exports` field to reflect new paths for ES and CommonJS modules, added a `tshy` section for build configuration, and updated scripts to use `tshy` for building and `tsx` for testing.

Removal of Rust and WebAssembly build system:

* `packages/tokenizers/scripts/build.mjs`: Removed the script for building tokenizers using Rust and WebAssembly.
* `packages/tokenizers/src/tokenizer-japanese/.gitignore`, `packages/tokenizers/src/tokenizer-mandarin/.gitignore`: Removed entries related to Rust build artifacts.
* `packages/tokenizers/src/tokenizer-japanese/Cargo.toml`, `packages/tokenizers/src/tokenizer-mandarin/Cargo.toml`: Removed Rust project configuration files.
* `packages/tokenizers/src/tokenizer-japanese/src/lib.rs`, `packages/tokenizers/src/tokenizer-mandarin/src/lib.rs`: Removed Rust source files for tokenizers.
* `packages/tokenizers/src/tokenizer-japanese/src/tokenizer.ts`, `packages/tokenizers/src/tokenizer-mandarin/src/tokenizer.ts`: Removed TypeScript wrappers for Rust-based tokenizers.

Addition of new TypeScript tokenizers:

* `packages/tokenizers/src/japanese.ts`: Added a new Japanese tokenizer implemented in TypeScript.
* `packages/tokenizers/src/mandarin.ts`: Added a new Mandarin tokenizer implemented in TypeScript.

Test updates:

* `packages/tokenizers/tests/japanese.test.ts`: Updated the Japanese tokenizer test to use the new TypeScript-based tokenizer.
* `packages/tokenizers/tests/mandarin.test.ts`: Updated the Mandarin tokenizer test to use the new TypeScript-based tokenizer.
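
For readers wondering what a tokenizer for these languages can look like without a Rust/WebAssembly dependency, here is a minimal sketch. It is an assumption for illustration only: the summary above does not say which technique `japanese.ts` and `mandarin.ts` use, and the `tokenize` function and `SupportedLocale` type below are hypothetical names, not the package's API. The sketch relies on the standard `Intl.Segmenter` built into modern JavaScript runtimes.

```typescript
// Hypothetical sketch, not the code from this PR: word segmentation for
// Japanese ("ja") or Mandarin ("zh") text using the built-in Intl.Segmenter.

type SupportedLocale = "ja" | "zh";

function tokenize(text: string, locale: SupportedLocale): string[] {
  // "word" granularity splits on word boundaries, which matters for
  // languages that do not separate words with spaces.
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const tokens: string[] = [];

  for (const part of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments,
    // so those are dropped from the result.
    if (part.isWordLike) {
      tokens.push(part.segment);
    }
  }

  return tokens;
}

// Example call; the exact token boundaries depend on the runtime's ICU data.
console.log(tokenize("私は猫が好きです。", "ja"));
```

Because segmentation quality depends on the ICU data shipped with the runtime, results from a sketch like this may differ slightly between Node.js versions and browsers.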