All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
v0.5.1 - 2024-10-02
- Add new ByteLevel normalizer (Tokenizers.Normalizer.byte_level/0).
- Reduce memory copies when encoding.
- Bump Rust tokenizers to v0.20.0.
v0.5.0 - 2024-04-24
- Support for regular expressions in the replace normalizer. See Tokenizers.Normalizer.replace_regex/2.
- Support for regular expressions in the split pre-tokenizer. See Tokenizers.PreTokenizer.split_regex/3.
- (Breaking) Removed the :add_prefix_space option in favour of :prepend_scheme for the metaspace decoder and pre-tokenizer.
v0.4.0 - 2023-08-09
- Support for training a tokenizer from scratch. See Tokenizers.Tokenizer.train_from_files/3 and Tokenizers.Model for available models.
- Support for changing tokenizer configuration, such as Tokenizers.Tokenizer.set_padding/2 and Tokenizers.Tokenizer.set_truncation/2. See the "Configuration" functions group in Tokenizers.Tokenizer.
- Support for applying multiple encoding transformations without additional data copies. See Tokenizers.Encoding.Transformation. Transformations can be passed to Tokenizers.Tokenizer.encode/3 via :encoding_transformations or applied via Tokenizers.Encoding.transform/2.
- (Breaking) Tokenizers.Tokenizer.encode/3 no longer accepts a batch of inputs. To encode a batch, use Tokenizers.Tokenizer.encode_batch/3 instead.
- (Breaking) Tokenizers.Tokenizer.decode/3 no longer accepts a batch of inputs. To decode a batch, use Tokenizers.Tokenizer.decode_batch/3 instead.
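As a rough migration sketch for the two breaking changes above, the snippet below encodes and decodes a single input and then a batch. The Mix.install/1 setup, the "bert-base-cased" model name, and the sample strings are illustrative assumptions, not part of this changelog; loading the tokenizer needs network access.

```elixir
# Assumption: a standalone script that pulls the package with Mix.install/1.
Mix.install([{:tokenizers, "~> 0.4"}])

alias Tokenizers.{Encoding, Tokenizer}

# Assumption: "bert-base-cased" is only an example of a pretrained tokenizer.
{:ok, tokenizer} = Tokenizer.from_pretrained("bert-base-cased")

# Single input: encode/3 returns one encoding.
{:ok, encoding} = Tokenizer.encode(tokenizer, "Hello there!")
ids = Encoding.get_ids(encoding)

# Batch of inputs: encode_batch/3 returns a list of encodings.
{:ok, encodings} = Tokenizer.encode_batch(tokenizer, ["Hello there!", "How are you?"])
batch_ids = Enum.map(encodings, &Encoding.get_ids/1)

# Decoding follows the same split: decode/3 for one list of ids,
# decode_batch/3 for a list of lists.
{:ok, text} = Tokenizer.decode(tokenizer, ids)
{:ok, texts} = Tokenizer.decode_batch(tokenizer, batch_ids)
```

Code written for earlier versions that passed a list to encode/3 or decode/3 would call the batch variants instead.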
v0.3.2 - 2023-04-19
- Bump tokenizers to v0.13.3 in the crate's dependencies.
v0.3.1 - 2023-04-06
- Add binary variants for accessing encoding data. This way we can convert encoding data to tensors without additional allocations. The following functions were added:
  - get_u32_ids/1
  - get_u32_attention_mask/1
  - get_u32_type_ids/1
  - get_u32_special_tokens_mask/1
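A minimal sketch of how the binary variants might feed tensors, assuming Nx as the tensor library and the module layout of v0.4.0 and later (Tokenizers.Tokenizer and Tokenizers.Encoding). The Nx dependency, the model name, and the sample string are assumptions made for illustration.

```elixir
# Assumption: standalone script; Nx is an illustrative choice of tensor library.
Mix.install([{:tokenizers, "~> 0.4"}, {:nx, "~> 0.7"}])

alias Tokenizers.{Encoding, Tokenizer}

# Assumption: "bert-base-cased" is only an example of a pretrained tokenizer.
{:ok, tokenizer} = Tokenizer.from_pretrained("bert-base-cased")
{:ok, encoding} = Tokenizer.encode(tokenizer, "Hello there!")

# Each call returns a binary of unsigned 32-bit integers, one per token.
ids_binary = Encoding.get_u32_ids(encoding)
mask_binary = Encoding.get_u32_attention_mask(encoding)

# Nx.from_binary/2 builds a tensor from the binary, so no intermediate
# Elixir list of integers is constructed.
input_ids = Nx.from_binary(ids_binary, :u32)
attention_mask = Nx.from_binary(mask_binary, :u32)
```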
v0.3.0 - 2023-03-04
- Add an option to use a cache when downloading pretrained files. We check the ETag of the file before trying to download it. This introduces the :use_cache and :cache_dir options to the Tokenizers.from_pretrained/2 function.
- Support adding special tokens when creating a tokenizer. This allows a pretrained tokenizer to be loaded with additional special tokens. This change adds the :additional_special_tokens option to the Tokenizers.from_pretrained/2 function.
- Add support for the riscv64gc-unknown-linux-gnu target, which is useful for Nerves projects running on 64-bit RISC-V computers. This means that we are precompiling the project to run on those machines.
- Change the minimum required version of Rustler Precompiled to ~> 0.6. With this, aarch64-unknown-linux-musl and riscv64gc-unknown-linux-gnu become default targets, but support for NIF version 2.14 is dropped.
v0.2.0 - 2022-12-07
- Add a minimal http server to avoid problems with openssl
- Expose Encoding.get_special_tokens_mask/1 and Encoding.get_offsets/1 for NER
v0.1.0 - 2022-08-25
First release.