# v0.1.0 - New models, interactive apps, overhauled benchmarks
This is a large release including many new features:
## New models
- Implemented support for arbitrary scikit-learn models via `SKLearnClassifier` and for TF-IDF as a baseline embedding approach via `TfidfEmbedder`.
- Implemented support for spaCy text categorizer models and spacy-transformers models via `SpaCyModel`.
- Upgraded `pytorch_transformers` v1.0.0 to `transformers` v2.4.1, which added support for several new models.
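
As a rough illustration of the kind of estimator `SKLearnClassifier` is meant to wrap, here is a plain scikit-learn pipeline combining TF-IDF features with a linear classifier. Only scikit-learn itself is used below; the gobbli wrapper's exact import path and constructor arguments are not reproduced here, so see the docs for the real API.

```python
# A scikit-learn pipeline of the sort SKLearnClassifier can wrap:
# TF-IDF features feeding a linear model. Plain scikit-learn, not
# gobbli-specific code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Toy data just to show the estimator trains and predicts end to end.
X_train = [
    "great product, works well",
    "works great, very happy",
    "terrible, broke after a day",
    "awful quality, do not buy",
]
y_train = ["positive", "positive", "negative", "negative"]

pipeline.fit(X_train, y_train)
print(pipeline.predict(["works great"]))
```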
## Interactive apps
gobbli now comes bundled with a few Streamlit apps that can be used to explore datasets, evaluate gobbli model performance, and generate local explanations for gobbli model predictions. See the docs for more information.
## Overhauled benchmarks
Completely overhauled the benchmark framework. Benchmark output is now stored as Markdown files, which are much easier to read on GitHub, and benchmarks can be selectively rerun when new models are added. Also added a "benchmark" for embeddings, which plots each model's embeddings in two dimensions, allowing a qualitative assessment of how well the model differentiates between the classes in the dataset. See the benchmark output folder.
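
The embedding benchmark itself isn't reproduced here, but the underlying idea is simple: project each model's document embeddings down to two dimensions and color the points by class. A minimal sketch of that technique using scikit-learn's PCA and matplotlib (illustrative only, not gobbli's benchmark code) might look like this:

```python
# Sketch of the embedding-benchmark idea: reduce high-dimensional document
# embeddings to 2-D and color by class to eyeball how well classes separate.
# Illustrative only; not gobbli's benchmark implementation.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for embeddings produced by a gobbli model: 200 docs x 512 dims.
embeddings = rng.normal(size=(200, 512))
labels = rng.integers(0, 3, size=200)  # three hypothetical classes

points = PCA(n_components=2).fit_transform(embeddings)

for cls in np.unique(labels):
    mask = labels == cls
    plt.scatter(points[mask, 0], points[mask, 1], label=f"class {cls}", alpha=0.6)
plt.legend()
plt.title("Document embeddings projected to 2-D")
plt.show()
```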
## Miscellaneous improvements
- Add new BERT weights from NCBI trained on PubMed data (`ncbi-bert-base-pubmed-uncased`, `ncbi-bert-base-pubmed-mimic-uncased`, `ncbi-bert-large-pubmed-uncased`, `ncbi-bert-large-pubmed-mimic-uncased`) (thanks @pmbaumgartner!)
- Upgrade fastText to a more recent version which supports autotuning parameters
- Add support for optional gradient accumulation in Transformer models, allowing smaller batch sizes and larger models while retaining performance (see the sketch after this list)
- Upgrade the USE implementation to the TensorFlow 2.0 version and add support for multilingual weights (`universal-sentence-encoder-multilingual`, `universal-sentence-encoder-multilingual-large`)
- Add a couple of utilities for inspecting and cleaning up disk usage
- Fix memory issues with USE model by batching input data
- Fix potential encoding issues with non-ASCII text in USE model
- Reuse static pretrained weights across instances of models instead of redownloading every time
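
As referenced above, gradient accumulation takes optimizer steps computed from several small batches, approximating a larger effective batch size. The PyTorch snippet below sketches the general technique; the model, data, and `accumulation_steps` value are placeholders, not gobbli's internal implementation.

```python
# Generic gradient-accumulation loop in PyTorch: losses from several small
# batches are accumulated before a single optimizer step, approximating a
# larger effective batch size. Illustrative only; not gobbli's internals.
import torch
import torch.nn as nn

model = nn.Linear(128, 2)            # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 4               # effective batch = 4 x micro-batch size
# Placeholder micro-batches standing in for a real DataLoader.
batches = [(torch.randn(8, 128), torch.randint(0, 2, (8,))) for _ in range(16)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(batches):
    loss = loss_fn(model(inputs), targets)
    # Scale so the accumulated gradient matches one large-batch update.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```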