Update to tokenizers 0.19 (#57)
jonatanklosko authored Apr 24, 2024
1 parent 0c8f4b7 commit a7c4cef
Showing 10 changed files with 130 additions and 80 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -14,6 +14,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Support for regular expressions to split pre-tokenizer. See
`Tokenizers.PreTokenizer.split_regex/3`.

+### Removed
+
+- **(Breaking)** `:add_prefix_space` option in favour of `:prepend_scheme` for metaspace
+decoder and pre-tokenizer
+
## [v0.4.0] - 2023-08-09

### Added
7 changes: 5 additions & 2 deletions lib/tokenizers/decoder.ex
@@ -74,8 +74,11 @@ defmodule Tokenizers.Decoder do
* `replacement` - the replacement character. Defaults to `▁`
(as char)
-* `add_prefix_space` - whether to add a space to the first word.
-Defaults to `true`
+* `:prepend_scheme` - whether to add a space to the first word if there
+isn't already one. This lets us treat "hello" exactly like "say hello".
+Either of `:always`, `:never`, `:first`. `:first` means the space is
+only added on the first token (relevant when special tokens are used
+or other pre_tokenizer are used). Defaults to `:always`
"""
@spec metaspace(keyword()) :: t()
8 changes: 5 additions & 3 deletions lib/tokenizers/pre_tokenizer.ex
@@ -103,9 +103,11 @@ defmodule Tokenizers.PreTokenizer do
* `:replacement` - the replacement character to use. Defaults to `"▁"`
-* `:add_prefix_space` - whether to add a space to the first word
-if there isn’t already one. This lets us treat hello exactly
-like say hello. Defaults to `true`
+* `:prepend_scheme` - whether to add a space to the first word if there
+isn't already one. This lets us treat "hello" exactly like "say hello".
+Either of `:always`, `:never`, `:first`. `:first` means the space is
+only added on the first token (relevant when special tokens are used
+or other pre_tokenizer are used). Defaults to `:always`
"""
@spec metaspace(keyword()) :: t()
116 changes: 57 additions & 59 deletions native/ex_tokenizers/Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion native/ex_tokenizers/Cargo.toml
@@ -13,5 +13,5 @@ crate-type = ["cdylib"]
anyhow = "1"
rustler = "0.29.1"
thiserror = "1"
-tokenizers = { version = "0.15.0", default-features = false, features = ["onig", "esaxx_fast"]}
+tokenizers = { version = "0.19.1", default-features = false, features = ["onig", "esaxx_fast"]}
serde = { version = "1.0", features = [ "rc", "derive" ] }
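
Callers that still pass `:add_prefix_space` to the metaspace decoder or pre-tokenizer need to switch to `:prepend_scheme` after this commit. The sketch below shows one way that translation could look. `MetaspaceOptsMigration` and `migrate/1` are hypothetical names, not part of the package, and mapping `true` to `:always` and `false` to `:never` is an assumption drawn from the documented defaults (`true` before, `:always` now), not something the commit itself states.

```elixir
defmodule MetaspaceOptsMigration do
  # Hypothetical helper, not part of the tokenizers package.
  # Translates the removed :add_prefix_space option into :prepend_scheme.
  # Assumed mapping: true -> :always, false -> :never.
  @spec migrate(keyword()) :: keyword()
  def migrate(opts) do
    case Keyword.pop(opts, :add_prefix_space) do
      # Option absent: nothing to translate.
      {nil, rest} -> rest
      # put_new/3 leaves an explicitly given :prepend_scheme untouched.
      {true, rest} -> Keyword.put_new(rest, :prepend_scheme, :always)
      {false, rest} -> Keyword.put_new(rest, :prepend_scheme, :never)
    end
  end
end

# Intended use with the functions touched by this commit (requires the
# tokenizers package, so it is left as a comment here):
#
#   opts = MetaspaceOptsMigration.migrate(add_prefix_space: false)
#   Tokenizers.PreTokenizer.metaspace(opts)
#   Tokenizers.Decoder.metaspace(opts)
```

The third documented value, `:first`, has no counterpart in the old boolean option, so it can only be chosen explicitly.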