Releases: JohnSnowLabs/spark-nlp
Spark NLP 4.1.0: Vision Transformer (ViT) is here! The very first Computer Vision pipeline for the state-of-the-art Image Classification task, AWS Graviton/ARM64 support, new EMR & Databricks support, 1000+ state-of-the-art models, and more!
Overview
An Image is Worth 16x16 Words!
For the first time ever we are delighted to announce support for Image Classification in Spark NLP by using state-of-the-art Vision Transformer (ViT) models at scale. This release comes with official support for AWS Graviton and ARM64 processors, new Databricks and EMR support, and 1000+ state-of-the-art models.
Spark NLP 4.1 also celebrates crossing 8000+ free and open-source models & pipelines available on Models Hub. As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features & Improvements
- NEW: Introducing ViTForImageClassification annotator in Spark NLP. `ViTForImageClassification` can load Vision Transformer (ViT) models with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token), e.g. for ImageNet. This annotator is compatible with all the models trained/fine-tuned by using `ViTForImageClassification` for PyTorch or `TFViTForImageClassification` for TensorFlow models in HuggingFace (#11536)
An overview of the ViT model structure as introduced in Google Research's original 2021 paper
from pyspark.ml import Pipeline
from sparknlp.base import *        # ImageAssembler
from sparknlp.annotator import *   # ViTForImageClassification

# Read raw images with Spark's built-in image data source
data_df = spark.read.format("image") \
    .load(path="images/")

# Convert the image column into Spark NLP's image annotation format
image_assembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

# Default pretrained ViT model for image classification
image_classifier = ViTForImageClassification \
    .pretrained() \
    .setInputCols("image_assembler") \
    .setOutputCol("class")

pipeline = Pipeline(stages=[
    image_assembler,
    image_classifier,
])

model = pipeline.fit(data_df)
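For reference, a minimal way to run the fitted pipeline and inspect the predictions; this is a hedged sketch, where the `class.result` access follows the standard Spark NLP annotation schema and the exact metadata keys vary by model:

```python
# Score the images and show the predicted label next to each file path
predictions = model.transform(data_df)
predictions.select("image.origin", "class.result").show(truncate=False)
```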
- NEW: Support for AWS Graviton/Graviton2 with up to 3x better price-performance. For the first time, Spark NLP supports Graviton and ARM64 (ARMv8 and above) processors. (#10939)
- NEW: Introducing TFNerDLGraphBuilder annotator. `TFNerDLGraphBuilder` can be used to automatically detect the parameters of a required NerDL graph and generate that graph within a pipeline when the default NER graphs are not suitable for your training dataset (a usage sketch follows below). `TFNerDLGraphBuilder` supports local, DBFS, and S3 file systems. (#10564)
- Allow passing confidence scores from all XXXForTokenClassification annotators to NerConverter. It is now possible to access the confidence scores coming from the following annotators in the NerConverter metadata (similar to NerDLModel): AlbertForTokenClassification, BertForTokenClassification, DeBertaForTokenClassification, DistilBertForTokenClassification, LongformerForTokenClassification, RoBertaForTokenClassification, XlmRoBertaForTokenClassification, and XlnetForTokenClassification
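The following is a minimal, hedged sketch of how `TFNerDLGraphBuilder` can be placed before `NerDLApproach` in a training pipeline. The import location, the `"auto"` graph-file value, and the column names are assumptions based on the typical NerDL training setup rather than text from this release note:

```python
from pyspark.ml import Pipeline
from sparknlp.annotator import *  # TFNerDLGraphBuilder, NerDLApproach (exact module assumed)

graph_folder = "./ner_graphs"  # local path; DBFS and S3 URIs are stated to work as well

# Detect the required NerDL graph parameters from the training data and generate the graph
graph_builder = TFNerDLGraphBuilder() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setGraphFile("auto") \
    .setGraphFolder(graph_folder)

# Train NerDL using the freshly generated graph instead of a default one
ner_approach = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setGraphFolder(graph_folder)

pipeline = Pipeline(stages=[graph_builder, ner_approach])
```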
- Introducing PushToHub Python class to easily push public models & pipelines to Models Hub
- Introducing `fullAnnotateImage` to the existing LightPipeline to support ImageAssembler and ViTForImageClassification annotators in a Spark NLP pipeline. `fullAnnotateImage` supports paths to images hosted locally, on DBFS, or on S3.
light_pipeline = LightPipeline(model)
annotations_result = light_pipeline.fullAnnotateImage("images/hippopotamus.JPEG")
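A hedged sketch of reading the result back; whether `fullAnnotateImage` returns a dict or a list of dicts may differ between versions, so the snippet handles both:

```python
result = annotations_result[0] if isinstance(annotations_result, list) else annotations_result

# The "class" key matches the ViTForImageClassification output column from the pipeline above
for annotation in result["class"]:
    print(annotation.result)  # predicted label for hippopotamus.JPEG
```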
- Welcoming a new EMR 6.x series to our Spark NLP family:
- EMR 6.7.0 (now supports Apache Spark 3.2.1, Apache Hive 3.1.3, HUDI 0.11, PrestoDB 0.272, and Trino 0.378.)
- Welcoming 3 new Databricks runtimes to our Spark NLP family:
- Databricks 11.2 LTS
- Databricks 11.2 LTS ML
- Databricks 11.2 LTS ML GPU
- Welcoming new AWS Graviton-enabled Databricks runtimes to our Spark NLP family
Models
Spark NLP 4.1.0 comes with 1000+ state-of-the-art pre-trained transformer models for Image Classification, Token Classification, and Sequence Classification in many languages.
Featured Models
Model | Name | Lang |
---|---|---|
ViTForImageClassification | image_classifier_vit_base_patch16_224 | en |
ViTForImageClassification | image_classifier_vit_base_patch16_384 | en |
ViTForImageClassification | image_classifier_vit_base_patch32_384 | en |
ViTForImageClassification | image_classifier_vit_base_xray_pneumonia | en |
ViTForImageClassification | image_classifier_vit_finetuned_chest_xray_pneumonia | en |
ViTForImageClassification | image_classifier_vit_food | en |
ViTForImageClassification | image_classifier_vit_base_food101 | en |
ViTForImageClassification | image_classifier_vit_autotrain_dog_vs_food | en |
ViTForImageClassification | image_classifier_vit_baseball_stadium_foods | en |
ViTForImageClassification | image_classifier_vit_south_indian_foods | en |
ViTForImageClassification | image_classifier_vit_denver_nyc_paris | en |
ViTForImageClassification | image_classifier_vit_CarViT | en |
Check out 240 (ViT) models on Models Hub - Image Classification
Spark NLP covers the following languages:
English, Multilingual, Afrikaans, Afro-Asiatic languages, Albanian, Altaic languages, American Sign Language, Amharic, Arabic, Argentine Sign Language, Armenian, Artificial languages, Atlantic-Congo languages, Austro-Asiatic languages, Austronesian languages, Azerbaijani, Baltic languages, Bantu languages, Basque, Basque (family), Belarusian, Bemba (Zambia), Bengali, Bangla, Berber languages, Bihari, Bislama, Bosnian, Brazilian Sign Language, Breton, Bulgarian, Catalan, Caucasian languages, Cebuano, Celtic languages, Central Bikol, Chichewa, Chewa, Nyanja, Chilean Sign Language, Chinese, Chuukese, Colombian Sign Language, Congo Swahili, Croatian, Cushitic languages, Czech, Danish, Dholuo, Luo (Kenya and Tanzania), Dravidian languages, Dutch, East Slavic languages, Eastern Malayo-Polynesian languages, Efik, Esperanto, Estonian, Ewe, Fijian, Finnish, Finnish Sign Language, Finno-Ugrian languages, French, French-based creoles and pidgins, Ga, Galician, Ganda, Georgian, German, Germanic languages, Gilbertese, Greek (modern), Greek languages, Gujarati, Gun, Haitian, Haitian Creole, Hausa, Hebrew (modern), Hiligaynon, Hindi, Hiri Motu, Hungarian, Icelandic, Igbo, Iloko, Indic languages, Indo-European languages, Indo-Iranian languages, Indonesian, Irish, Isoko, Isthmus Zapotec, Italian, Italic languages, Japanese, Javanese, Kabyle, Kalaallisut, Greenlandic, Kannada, Kaonde, Kinyarwanda, Kirundi, Kongo, Korean, Kwangali, Kwanyama, Kuanyama, Latin, Latvian, Lingala, Lithuanian, Louisiana Creole, Lozi, Luba-Katanga, Luba-Lulua, Lunda, Lushai, Luvale, Macedonian, Malagasy, Malay, Malayalam, Malayo-Polynesian languages, Maltese, Manx, Marathi (Marāṭhī), Marshallese, Mexican Sign Language, Mon-Khmer languages, Morisyen, Mossi, Multiple languages, Ndonga, Nepali, Niger-Kordofanian languages, Nigerian Pidgin, Niuean, North Germanic languages, Northern Sotho, Pedi, Sepedi, Norwegian, Norwegian Bokmål, Norwegian Nynorsk, Nyaneka, Oromo, Pangasinan, Papiamento, Persian (Farsi), Peruvian Sign Language, Philippine languages, Pijin, Pohnpeian, Polish, Portuguese, Portuguese-based creoles and pidgins, Punjabi (Eastern), Romance languages, Romanian, Rundi, Russian, Ruund, Salishan languages, Samoan, San Salvador Kongo, Sango, Semitic languages, Serbo-Croatian, Seselwa Creole French, Shona, Sindhi, Sino-Tibetan languages, Slavic languages, Slovak, Slovene, Somali, South Caucasian languages, South Slavic languages, Southern Sotho, Spanish, Spanish Sign Language, Sranan Tongo, Swahili, Swati, Swedish, Tagalog, Tahitian, Tai, Tamil, Telugu, Tetela, Tetun Dili, Thai, Tigrinya, ...
Spark NLP 4.0.2: Over 620 new state-of-the-art models in 21 languages, full support for Apache Spark 3.3.0, new Databricks runtime 11.1, and bug fixes
Overview
We are pleased to release Spark NLP 4.0.2! This release comes with full compatibility with the newly-released Apache Spark 3.3.0 and official support for Databricks' new 11.1 Beta runtimes (which include Apache Spark 3.3.0 and Scala 2.12).
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features
- Welcoming new Databricks runtimes based on Spark/PySpark 3.3.0 to our Spark NLP family:
- Databricks 11.1 Beta
- Databricks 11.1 ML Beta
- Databricks 11.1 ML Beta GPU
- SentenceDetector now comes with a new parameter `customBoundsStrategy` for returning custom bounds #10567
For example, with `setCustomBounds([r"\.", ";"])` on the text
This is a sentence. This one uses custom bounds; As is this one;
the result without the new flag will be
["This is a sentence", "This one uses custom bounds", "As is this one"]
With the new flag:
.setCustomBounds([r"\.", ";"])
.setCustomBoundsStrategy("append")
the result will be
["This is a sentence.", "This one uses custom bounds;", "As is this one;"]
Similarly with "prepend", given the text
1. This is a list
1.1 This is a subpoint
2. Second thing
2.2 Second subthing
and
.setCustomBounds([r"\n[\d\.]+"])
.setCustomBoundsStrategy("prepend")
the result will be
[
"1. This is a list",
"1.1 This is a subpoint",
"2. Second thing",
"2.2 Second subthing"
]
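To make the example above concrete, here is a minimal, hedged pipeline sketch; the DocumentAssembler boilerplate and the sample DataFrame are illustrative, and only `setCustomBounds` and `setCustomBoundsStrategy` are the new pieces described in this release:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split on periods and semicolons, keeping ("append") the matched bound on each sentence
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setCustomBounds([r"\.", ";"]) \
    .setCustomBoundsStrategy("append")

data = spark.createDataFrame(
    [["This is a sentence. This one uses custom bounds; As is this one;"]]
).toDF("text")

pipeline = Pipeline(stages=[document_assembler, sentence_detector])
pipeline.fit(data).transform(data).selectExpr("explode(sentence.result)").show(truncate=False)
```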
Bug Fixes
- Fix bug that attempts to create spark session on executors when using GraphExtraction in Spark/PySpark 3.3 #9905
Models and Pipelines
Spark NLP 4.0.2 comes with 620+ state-of-the-art pre-trained transformer models in 21 languages including multi-lingual models.
Featured Models
The complete list of all 6900+ models & pipelines in 230+ languages is available on Models Hub
π Documentation & Articles
- Spark NLP: Hardware Acceleration
- Serving Spark NLP via API in Java
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==4.0.2
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.2
M1
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.2
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>4.0.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>4.0.2</version>
</dependency>
spark-nlp-m1:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-m1_2.12</artifactId>
<version>4.0.2</version>
</dependency>
FAT JARs
-
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.2.jar
-
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.0.2.jar
-
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.0.2.jar
What's Changed
Contributors
@gadde5300 @danilojsl @hsaglamlar @Cabir40 @ahmedlone127 @muhammetsnts @KshitizGIT @maziyarpanahi @albertoandreottiATgmail @DevinTDHa @luca-martial @Damla-Gurbaz @jsl-models @Meryem1425
New Contributors
- @hsaglamlar made their first contribution in #10544
Full Changelog: 4.0.1...4.0.2
Spark NLP 4.0.1: Full support for Apache Spark 3.3.0, new Databricks runtime 11, enhancements, and other bug fixes!
Overview
We are pleased to release Spark NLP 4.0.1! This release comes with support for the newly-released Apache Spark 3.3.0, which brings improved join query performance via Bloom filters, increased Pandas API coverage, and many other improvements. In addition, Spark NLP now officially supports Databricks Runtime 11, along with other enhancements and bug fixes.
As always, we would like to thank our community for their feedback, questions, and feature requests.
Features & Enhancements
- Full support for Apache Spark & PySpark 3.3.0
- Add Apache Spark 3.3.0 to Google Colab and Kaggle setup scripts
- New `-g` option for the Google Colab and Kaggle setup scripts on GPU devices to upgrade `libcudnn8` to 8.1.0 and solve the GPU issue
- Welcoming new Databricks runtimes based on Spark/PySpark 3.3.0 to our Spark NLP family:
- Databricks 11.0 LTS
- Databricks 11.0 LTS ML
- Databricks 11.0 LTS ML GPU
Bug Fixes
- Fix the error caused by PySpark 3.3.0 in CoNLL, CoNLLU, POS, and PubTator annotators as training helpers
- Fix and re-upload Dependency and Type Dependency parser pre-trained models
- Update pre-trained pipelines with issues on PySpark 3.2 and 3.3
Documentation
- Serving Spark NLP via API in Java
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==4.0.1
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.1
M1
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.1
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>4.0.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>4.0.1</version>
</dependency>
spark-nlp-m1:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-m1_2.12</artifactId>
<version>4.0.1</version>
</dependency>
FAT JARs
-
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.1.jar
-
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.0.1.jar
-
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.0.1.jar
What's Changed
Contributors
@muhammetsnts @jsl-models @Meryem1425 @Damla-Gurbaz @jsl-builder @rpranab @danilojsl @josejuanmartinez @Cabir40 @DevinTDHa @agsfer @suvrat-joshi @ahmedlone127 @albertoandreottiATgmail @KshitizGIT @mahmoodbayeshi @maziyarpanahi
New Contributors
- @ahmedlone127 made their first contribution in #9887
Full Changelog: 4.0.0...4.0.1
Spark NLP 4.0.0: New modern extractive Question answering (QA) annotators for ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa, official support for Apple silicon M1, support oneDNN to improve CPU up to 97%, improved transformers on GPU up to +700%, 1000+ state-of-the-art models, and lots more!
Overview
We are very excited to release Spark NLP 4.0.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community!
This release comes with official support for Apple silicon M1 chips (for the first time), official support for Spark/PySpark 3.2, support for the oneAPI Deep Neural Network Library (oneDNN) to improve TensorFlow on CPU by up to 97%, and optimized transformer-based embeddings on GPU with performance gains of up to +700%. It also introduces brand-new, modern extractive transformer-based Question Answering (QA) annotators for tasks like SQuAD based on the ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa architectures, along with 1000+ state-of-the-art models. WordEmbeddingsModel now works in clusters without HDFS/DBFS/S3, such as Kubernetes. The release further adds new Databricks and EMR support, new NER models achieving the highest F1 scores in Spark NLP, and many more enhancements and bug fixes!
We would like to mention that Spark NLP 4.0.0 drops the support for Spark 2.3 and 2.4 (Scala 2.11). Starting 4.0.0 we only support Spark/PySpark 3.x on Scala 2.12.
As always, we would like to thank our community for their feedback, questions, and feature requests.
Major features and improvements
- NEW: Support for oneAPI Deep Neural Network Library (oneDNN) optimizations to improve TensorFlow on CPU. Enabling oneDNN can improve some transformer-based models by up to 97%. By default, the oneDNN optimizations are turned off. To enable them, set the environment variable `TF_ENABLE_ONEDNN_OPTS`. On Linux systems, for instance:
export TF_ENABLE_ONEDNN_OPTS=1
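If you prefer enabling this from Python, a hedged equivalent is to export the variable before Spark NLP starts; this sketch assumes a local, single-machine session, and on a cluster the variable must also be set in the executors' environment:

```python
import os

# Must be set before the Spark NLP session (and its TensorFlow backend) is created
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

import sparknlp
spark = sparknlp.start()
```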
- NEW: Optimizing batch processing for transformer-based Word Embeddings on a GPU device. These optimizations can result in performance improvements up to +700% (more details in the Benchmarks section)
- NEW: Official support for Apple silicon M1 on macOS devices. You can use the `spark-nlp-m1` package that supports Apple silicon M1 on your macOS machine in Spark NLP 4.0.0
- NEW: Introducing AlbertForQuestionAnswering annotator in Spark NLP. `AlbertForQuestionAnswering` can load `ALBERT` models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `AlbertForQuestionAnswering` for PyTorch or `TFAlbertForQuestionAnswering` for TensorFlow models in HuggingFace
- NEW: Introducing BertForQuestionAnswering annotator in Spark NLP. `BertForQuestionAnswering` can load `BERT` & `ELECTRA` models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `BertForQuestionAnswering` and `ElectraForQuestionAnswering` for PyTorch or `TFBertForQuestionAnswering` and `TFElectraForQuestionAnswering` for TensorFlow models in HuggingFace
- NEW: Introducing DeBertaForQuestionAnswering annotator in Spark NLP. `DeBertaForQuestionAnswering` can load `DeBERTa` v2 & v3 models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2ForQuestionAnswering` for PyTorch or `TFDebertaV2ForQuestionAnswering` for TensorFlow models in HuggingFace
- NEW: Introducing DistilBertForQuestionAnswering annotator in Spark NLP. `DistilBertForQuestionAnswering` can load `DistilBERT` models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `DistilBertForQuestionAnswering` for PyTorch or `TFDistilBertForQuestionAnswering` for TensorFlow models in HuggingFace
- NEW: Introducing LongformerForQuestionAnswering annotator in Spark NLP. `LongformerForQuestionAnswering` can load `Longformer` models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `LongformerForQuestionAnswering` for PyTorch or `TFLongformerForQuestionAnswering` for TensorFlow models in HuggingFace
- NEW: Introducing RoBertaForQuestionAnswering annotator in Spark NLP. `RoBertaForQuestionAnswering` can load `RoBERTa` models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `RobertaForQuestionAnswering` for PyTorch or `TFRobertaForQuestionAnswering` for TensorFlow models in HuggingFace
- NEW: Introducing XlmRoBertaForQuestionAnswering annotator in Spark NLP. `XlmRoBertaForQuestionAnswering` can load `XLM-RoBERTa` models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `XLMRobertaForQuestionAnswering` for PyTorch or `TFXLMRobertaForQuestionAnswering` for TensorFlow models in HuggingFace
- NEW: Introducing MultiDocumentAssembler annotator for cases where multiple inputs need to be converted to DOCUMENT, such as the XXXForQuestionAnswering annotators. A usage sketch follows below.
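As referenced in the MultiDocumentAssembler item above, here is a minimal, hedged sketch of an extractive QA pipeline; the default `pretrained()` model, the column names, and the sample data are illustrative assumptions, not taken from these notes:

```python
from pyspark.ml import Pipeline
from sparknlp.base import MultiDocumentAssembler
from sparknlp.annotator import BertForQuestionAnswering

# Turn the question and its context into two DOCUMENT columns
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

# Default pretrained extractive QA model (SQuAD-style span prediction)
span_classifier = BertForQuestionAnswering.pretrained() \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")

pipeline = Pipeline(stages=[document_assembler, span_classifier])

data = spark.createDataFrame(
    [["What is the capital of France?", "Paris is the capital of France."]]
).toDF("question", "context")

pipeline.fit(data).transform(data).select("answer.result").show(truncate=False)
```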
- NEW: Introducing SpanBertCorefModel annotator for Coreference Resolution with BERT and SpanBERT models, based on the BERT for Coreference Resolution: Baselines and Analysis paper. It is an implementation of a SpanBERT-based coreference resolution model.
- NEW: Introducing the `enableInMemoryStorage` parameter in the `WordEmbeddingsModel` annotator. Enabling this parameter means the annotator no longer requires distributed storage to unpack indices and performs everything in memory. A usage sketch follows below.
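A hedged sketch of the parameter above; the setter name `setEnableInMemoryStorage` is assumed from the `enableInMemoryStorage` param name, and the default `pretrained()` embeddings are only illustrative:

```python
from sparknlp.annotator import WordEmbeddingsModel

# Keep the embeddings index in memory instead of unpacking it to distributed storage
word_embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setEnableInMemoryStorage(True)
```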
- Official support for Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP is shipped by default for Spark 3.2.x and additionally supports Spark/PySpark 3.0.x and 3.1.x
- Unifying all supported Apache Spark packages on Maven into `spark-nlp` for CPU, `spark-nlp-gpu` for GPU, and `spark-nlp-m1` for the new Apple silicon M1 on macOS. The need for Apache Spark-specific packages like `spark-nlp-spark32` has been removed.
- Adding a new param to the `sparknlp.start()` function in Python and Scala for Apple silicon M1 on macOS (`m1=True`)
- Upgrade TensorFlow to 2.7.1 and start supporting Apple silicon M1
- Upgrade RocksDB with new enhancements and support for Apple silicon M1
- Upgrade SentencePiece tokenizer TF ops to 2.7.1
- Upgrade SentencePiece JNI to v0.1.96 and provide support for Apple silicon M1 on macOS support
- Upgrade to Scala 2.12.15
- Update Colab, Kaggle, and SageMaker scripts
- Refactor the entire Python module in Spark NLP to make the development and maintenance easier
- Refactor unit tests in Python and migrate to pytest
- Welcoming 6x new Databricks runtimes to our Spark NLP family:
- Databricks 10.4 LTS
- Databricks 10.4 LTS ML
- Databricks 10.4 LTS ML GPU
- Databricks 10.5
- Databricks 10.5 ML
- Databricks 10.5 ML GPU
- Welcoming a new EMR 6.x series to our Spark NLP family:
- EMR 6.6.0 (Apache Spark 3.2.0 / Hadoop 3.2.1)
- Migrate T5Transformer to TensorFlow v2 architecture by re-uploading all the existing models
- Support for 2 inputs in LightPipeline with MultiDocumentAssembler
- Add new default NerDL graph for xsmall DeBERTa embeddings model (384 dimensions)
- Adding annotateJava method to PretrainedPipeline class in Java to facilitate the use of LightPipelines
- Allow changing case sensitivity. Previously, the user could not set the `setCaseSensitive` param. This allows users to change this value if the model was saved/uploaded with the wrong case sensitivity parameter (BERT, ALBERT, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, and Longformer for XXXForSequenceClassification and XXXForTokenClassification).
- Keep accuracy in ClassifierDL and SentimentDL during the training between 0.0 and 1.0
- Preserve the original form of the token in BPE Tokenizer used in RoBERTa annotators (used in embeddings, sequence and token classification)
Performance Improvements (Benchmarks)
We have introduced two major performance improvements for GPU and CPU devices in Spark NLP 4.0.0 release.
The following benchmarks have been done by using a single Dell Server with the following specs:
- GPU: Tesla P100 PCIe 12GB - CUDA Version: 11.3 - Driver Version: 465.19.01
- CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz - 40 Cores
- Memory: 80G
GPU
We have improved our batch processing approach for transformer-based Word Embeddings to improve their performance on a GPU device. These optimizations result in performance improvements up to +700%. The detailed list of improved transformer models on GPU in comparison to Spark NLP 3.4.x:
Model on GPU | Spark NLP 3.4.3 vs. 4.0.0 |
---|---|
RoBERTa base | +560%(6.6x) |
RoBERTa Large | +332%(4.3x) |
Albert Base | +587%(6.9x... |
Spark NLP 3.4.4: New DeBERTa for Token Classification, new CamemBERT embeddings, speed improvements for Tokenizer and UniversalSentenceEncoder annotators, over 160 new state-of-the-art models, and other improvements!
Overview
We are very excited to release Spark NLP 3.4.4! This release comes with a new DeBERTa for Token Classification annotator compatible with existing or fine-tuned models on HuggingFace, a new annotator for CamemBERT embeddings models, up to 18x speed improvements for UniversalSentenceEncoder on GPU devices, up to 400% speed improvements in Tokenizer with a list of exceptions, and new state-of-the-art NER, French embeddings, DistilBERT embeddings, and ALBERT embeddings models!
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features
- NEW: Introducing DeBertaForTokenClassification annotator in Spark NLP. `DeBertaForTokenClassification` can load DeBERTa v2 & v3 models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2ForTokenClassification` for PyTorch or `TFDebertaV2ForTokenClassification` for TensorFlow models in HuggingFace #8082
- NEW: Introducing CamemBertEmbeddings annotator in Spark NLP #8237. CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture, pretrained on the French subcorpus of the newly available multilingual corpus OSCAR. For further information or requests, please go to the CamemBERT website. A usage sketch follows below.
- Add support for batching rows to improve UniversalSentenceEncoder on GPU devices. This new feature will increase GPU speed between 2x and 18x depending on the distribution of sentences #8234
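As mentioned in the CamemBertEmbeddings item above, here is a minimal, hedged usage sketch; the `camembert_base` model name comes from the Models table below, while the document/token boilerplate is standard:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, CamemBertEmbeddings

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# French CamemBERT embeddings, one vector per token
embeddings = CamemBertEmbeddings.pretrained("camembert_base", "fr") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings])
```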
Bug Fixes & Enhancements
- Optimizing Tokenizer performance by up to 400% when there is an exceptions list. The exceptions list now scales to a large number of exceptions without impacting overall performance #7881
- Support latest PySpark releases in Colab, Kaggle, and SageMaker scripts #8028
- Fix bug that caused get input/output/LazyAnnotator to return None #8043
- Fix DeBertaForSequenceClassification in Python failing to load pretrained models #8060
- Fix missing Lemma and POS models from 3.4.3 release
Dependencies
- Removing outdated trove4j dependency in favour of native Java modules #8236
- Upgrade the base Apache Spark to 2.4.8, 3.0.3, and 3.2.1
- Upgrade Typesafe Config to 1.4.2
- Upgrade sbt to 1.6.2
Models
Spark NLP 3.4.4 comes with 160+ state-of-the-art multi-lingual pretrained models. Some of the featured models:
New DeBERTa Token Classification Models
New fine-tuned DeBERTa v3 models for token classifications over CoNLL03 and OntoNotes datasets that reach state-of-the-art metrics.
Model | Name | Lang | F1 Dev |
---|---|---|---|
DeBertaForTokenClassification | deberta_v3_large_token_classifier_conll03 | en | 0.97 |
DeBertaForTokenClassification | deberta_v3_base_token_classifier_conll03 | en | 0.96 |
DeBertaForTokenClassification | deberta_v3_small_token_classifier_conll03 | en | 0.95 |
DeBertaForTokenClassification | deberta_v3_xsmall_token_classifier_conll03 | en | 0.93 |
DeBertaForTokenClassification | deberta_v3_large_token_classifier_ontonotes | en | 0.89 |
DeBertaForTokenClassification | deberta_v3_base_token_classifier_ontonotes | en | 0.88 |
DeBertaForTokenClassification | deberta_v3_small_token_classifier_ontonotes | en | 0.87 |
DeBertaForTokenClassification | deberta_v3_xsmall_token_classifier_ontonotes | en | 0.86 |
New CamemBERT Models
Model | Name | Lang |
---|---|---|
CamemBertEmbeddings | camembert_large | fr |
CamemBertEmbeddings | camembert_base | fr |
CamemBertEmbeddings | camembert_base_ccnet_4gb | fr |
CamemBertEmbeddings | camembert_base_ccnet | fr |
CamemBertEmbeddings | camembert_base_oscar_4gb | fr |
CamemBertEmbeddings | camembert_base_wikipedia_4gb | fr |
New DistilBERT Embeddings Models
Model | Name | Lang |
---|---|---|
DistilBertEmbeddings | distilbert_embeddings_distilbert_base_fr_cased | fr |
DistilBertEmbeddings | distilbert_embeddings_marathi_distilbert | mr |
DistilBertEmbeddings | distilbert_embeddings_distilbert_base_indonesian | id |
DistilBertEmbeddings | distilbert_embeddings_javanese_distilbert_small | jv |
DistilBertEmbeddings | distilbert_embeddings_malaysian_distilbert_small | ms |
DistilBertEmbeddings | distilbert_embeddings_distilbert_base_ar_cased | ar |
New ALBERT Embeddings Models
Model | Name | Lang |
---|---|---|
AlbertEmbeddings | albert_embeddings_fralbert_base | fr |
AlbertEmbeddings | albert_embeddings_albert_base_arabic | ar |
AlbertEmbeddings | albert_embeddings_marathi_albert_v2 | mr |
AlbertEmbeddings | albert_embeddings_albert_fa_base_v2 | fa |
AlbertEmbeddings | albert_embeddings_albert_large_bahasa_cased | ms |
AlbertEmbeddings | albert_embeddings_marathi_albert | mr |
The complete list of all 5000+ models & pipelines in 200+ languages is available on Models Hub.
New Notebooks
Import CamemBERT models to Spark NLP
Spark NLP | HuggingFace Notebooks | Colab |
---|---|---|
CamemBertEmbeddings | HuggingFace in Spark NLP - CamemBERT |
You can visit Import Transformers in Spark NLP for more info
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- [Discussions](https://github.com/John...
John Snow Labs Spark-NLP 3.4.3: New DeBERTa for Sequence Classification, sigmoid activation for sequence classifiers, new features for SentenceDetectorDL, over 600 new multi-lingual models, and other improvements!
Overview
We are very excited to release Spark NLP 3.4.3! This release comes with a new DeBERTa for Sequence Classification annotator compatible with existing or fine-tuned models on HuggingFace, a new sigmoid activation function in addition to softmax to support multi-label models in all ForSequenceClassification annotators, new features added to SentenceDetectorDL, new features added to CoNLLU and Lemmatizer, more than 600 new multi-lingual models for DeBERTa, BERT, DistilBERT, fastText, Lemmatizer and Part of Speech, and other improvements!
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features
- NEW: Introducing DeBertaForSequenceClassification annotator in Spark NLP. `DeBertaForSequenceClassification` can load DeBERTa v2 & v3 models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `DebertaForSequenceClassification` for PyTorch or `TFDebertaForSequenceClassification` for TensorFlow models in HuggingFace #7713
- New multi-label feature in all ForSequenceClassification annotators. The following annotators now have the option to switch to a sigmoid activation function instead of softmax for the output layer (see the sketch after this list): AlbertForSequenceClassification, BertForSequenceClassification, DeBertaForSequenceClassification, DistilBertForSequenceClassification, LongformerForSequenceClassification, RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, and XlnetForSequenceClassification #7479
- New minLength, maxLength, splitLength, customBounds, and useCustomBoundsOnly parameters in SentenceDetectorDL #7214
- New impossiblePenultimates in SentenceDetectorDLModel #7685
- New feature to set names for columns in CoNLLU class: textCol, documentCol, sentenceCol, formCol, uposCol, xposCol, and lemmaCol #7344
- New formCol and lemmaCol parameters in Lemmatizer annotator #7344
- Add new functionality to download and extract models from S3 via direct link #7682
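As referenced in the multi-label item above, a hedged sketch of switching a sequence classifier from softmax to sigmoid; the `setActivation` setter name is assumed from the new activation option, and the model name is taken from the Featured Models table below:

```python
from sparknlp.annotator import DeBertaForSequenceClassification

# Emit an independent sigmoid score per class (multi-label) instead of one softmax distribution
sequence_classifier = DeBertaForSequenceClassification \
    .pretrained("deberta_v3_base_sequence_classifier_imdb", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class") \
    .setActivation("sigmoid")
```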
Enhancements
- Fix and train new English spell checker models for Spark NLP 3.4.1 on Spark 3.x and 2.x
- Update SentenceDetector Python and Scala documentation
- Add a missing notebook to demonstrate training a WordSegmenterApproach annotator for word segmentation https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/chinese/word-segmentation/WordSegmenter_train_chinese_segmentation.ipynb
Models
New DeBERTa Classification Models
New fine-tuned DeBERTa v3 models for text classifications over IMDB reviews in English and Urdu, AG News categories in English, and Allocine French reviews.
Model | Name | Lang |
---|---|---|
DeBertaForSequenceClassification | mdeberta_v3_base_sequence_classifier_imdb | ur |
DeBertaForSequenceClassification | mdeberta_v3_base_sequence_classifier_allocine | fr |
DeBertaForSequenceClassification | deberta_v3_xsmall_sequence_classifier_imdb | en |
DeBertaForSequenceClassification | deberta_v3_small_sequence_classifier_imdb | en |
DeBertaForSequenceClassification | deberta_v3_base_sequence_classifier_imdb | en |
DeBertaForSequenceClassification | deberta_v3_large_sequence_classifier_imdb | en |
DeBertaForSequenceClassification | deberta_v3_xsmall_sequence_classifier_ag_news | en |
DeBertaForSequenceClassification | deberta_v3_small_sequence_classifier_ag_news | en |
New BERT Models
Spark NLP now has up to 250 state-of-the-art BERT models in 27 languages including Arabic, Bengali, Chinese, Dutch, English, Finnish, French, German, Greek, Hindi, Italian, Japanese, Javanese, Korean, Marathi, Panjabi, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Telugu, Turkish, Urdu, Vietnamese, and Multi-lingual.
Model | Name | Lang |
---|---|---|
BertEmbeddings | bert_embeddings_ARBERT | ar |
BertEmbeddings | bert_embeddings_German_MedBERT | de |
BertEmbeddings | bert_embeddings_bangla_bert_base | bn |
BertEmbeddings | bert_embeddings_bert_base_5lang_cased | zh |
BertEmbeddings | bert_embeddings_bert_base_5lang_cased | fr |
BertEmbeddings | bert_embeddings_bert_base_hi_cased | hi |
BertEmbeddings | bert_embeddings_bert_base_it_cased | it |
BertEmbeddings | bert_embeddings_bert_base | ko |
BertEmbeddings | bert_embeddings_bert_base_tr_cased | tr |
BertEmbeddings | bert_embeddings_bert_base_ur_cased | ur |
BertEmbeddings | bert_embeddings_bert_base_vi_cased | vi |
New fastText Models
Over 128 new Word2Vec models in 128 languages built from fastText word embeddings.
Model | Name | Lang |
---|---|---|
WordEmbeddingsModel | w2v_cc_300d | hi |
WordEmbeddingsModel | w2v_cc_300d | azb |
WordEmbeddingsModel | w2v_cc_300d | bo |
WordEmbeddingsModel | w2v_cc_300d | diq |
WordEmbeddingsModel | w2v_cc_300d | cy |
WordEmbeddingsModel | w2v_cc_300d | ckb |
WordEmbeddingsModel | w2v_cc_300d | el |
WordEmbeddingsModel | w2v_cc_300d | es |
New Lemmatizer and Part of Speech Models
234 new Lemmatizer and Part of Speech models in 62 languages based on the new Universal Dependency treebank 2.9 release.
Model | Name | Lang |
---|---|---|
LemmatizerModel | lemma_afribooms | af |
LemmatizerModel | lemma_alksnis | lt |
LemmatizerModel | lemma_alpino | nl |
LemmatizerModel | lemma_arcosg | gd |
LemmatizerModel | lemma_ancora | es |
LemmatizerModel | lemma_ancora | ca |
PerceptronModel | pos_mtg | te |
PerceptronModel | pos_ttb | ta |
PerceptronModel | pos_vtb | vi |
PerceptronModel | pos_cac | cs |
PerceptronModel | pos_btb | bg |
PerceptronModel | pos_afribooms | af |
The complete list of all 4800+ models & pipelines in 200+ languages is available on Models Hub.
Documentation
- [T...
John Snow Labs Spark-NLP 3.4.2: DeBERTa embeddings, new caching in Word2Vec and Doc2Vec, new state-of-the-art models, and bug fixes!
Overview
We are pleased to release Spark NLP 3.4.2! This release comes with a new DeBERTa transformer for word embeddings, new caching to speed up training Word2Vec and Doc2Vec, new English and multi-lingual state-of-the-art models, and bug fixes!
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features
- Introducing DeBertaEmbeddings annotator. DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%), and on RACE by +3.6% (83.2% vs. 86.8%). This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2Model` for PyTorch or `TFDebertaV2Model` for TensorFlow models (DeBERTa-v2 & DeBERTa-v3) in HuggingFace
- Introducing a new `enableCaching` param in Doc2VecApproach to speed up training (see the sketch after this list)
- Introducing a new `enableCaching` param in Word2VecApproach to speed up training
- Support Databricks runtimes 10.3, 10.3 ML, and 10.3 ML & GPU
- Support EMR emr-5.34.0 and emr-6.5.0
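As referenced in the `enableCaching` items above, a hedged sketch for Doc2Vec training; the setter name `setEnableCaching` is assumed from the param name, and the column names are the usual Doc2VecApproach setup:

```python
from sparknlp.annotator import Doc2VecApproach

# Cache intermediate data during training to speed up Doc2Vec
doc2vec = Doc2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("sentence_embeddings") \
    .setEnableCaching(True)
```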
Bug Fixes
- Fix bestModelMetric param when the set value was ignored #6978
New Notebooks
Import DeBERTa models to Spark NLP
Spark NLP | HuggingFace Notebooks | Colab |
---|---|---|
DeBertaEmbeddings | HuggingFace in Spark NLP - DeBERTa |
You can visit Import Transformers in Spark NLP for more info
Models
New state-of-the-art DeBERTa models:
Model | Name | Lang |
---|---|---|
DeBertaEmbeddings | deberta_v3_xsmall | en |
DeBertaEmbeddings | deberta_v3_small | en |
DeBertaEmbeddings | deberta_v3_base | en |
DeBertaEmbeddings | deberta_v3_large | en |
DeBertaEmbeddings | mdeberta_v3_base | xx |
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==3.4.2
Spark Packages
spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.2
spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.2
spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.2
spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.2
Maven
spark-nlp on Apache Spark 3.0.x and 3.1.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp on Apache Spark 3.2.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark32_2.12</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp on Apache Spark 2.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark24_2.11</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp on Apache Spark 2.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark23_2.11</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
<version>3.4.2</version>
</dependency>
FAT JARs
-
CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.2.jar
-
GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.2.jar
-
CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.2.jar
-
GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.2.jar
-
CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.2.jar
-
GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.2.jar
-
CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.2.jar
-
GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.2.jar
What's Changed
Full Changelog: 3.4.1...3.4.2
New Contributors
- @mahmoodbayeshi made their first contribution in #6835
- @bunyamin-polat made their first contribution in #6969
@agsfer @KshitizGIT @gadde5300 @kolia1985 @jsl-models @rpranab @josejuanmartinez @bunyamin-polat @maziyarpanahi @jsl-builder @Damla-Gurbaz @xusliebana @mahmoodbayeshi @luca-martial @dependabot @muhammetsnts @albertoandreottiATgmail
John Snow Labs Spark-NLP 3.4.1: TF session warmup, a new F1 metric to track to save the best model in NerDL, new T5 models like WikiSQL or grammar corrector, other new multi-lingual state-of-the-art models, and bug fixes!
Overview
We are pleased to release Spark NLP 3.4.1! This release comes with a TF session warmup in 3 annotators where the first inference was slower than the rest, a new param to choose which F1 to track when saving the best model during NerDL training, new T5 models such as text-to-SQL or grammar correction, new multi-lingual state-of-the-art models, and other bug fixes!
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features & Enhancements
- Implement TF session warmup for MarianTransformer, T5Transformer, and GPT2Transformer annotators. The first inference for these annotators used to take 15-20 seconds; now, with the warmup session, all inferences, including the first one, take roughly the same time #6773
- Add bestModelMetric param to choose between Micro-average or Macro-average for best model #6749
- Add trimWhitespace and preservePosition params to RegexTokenizer (see the sketch after this list) #6806
- Add a new `setSentenceMatch` param to EntityRuler to match entities across documents/sentences and not just tokens #6841
- Add support for using the spark32 and real_time_output flags in the sparknlp.start() function at the same time #6822
- Allow users to set tasks in the T5Transformer annotator
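As referenced in the RegexTokenizer item above, a hedged sketch of the new params; the whitespace pattern and column names are illustrative:

```python
from sparknlp.annotator import RegexTokenizer

# Tokenize on whitespace, trimming surrounding whitespace from each token while
# preserving the original character positions in the annotation metadata
regex_tokenizer = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setPattern("\\s+") \
    .setTrimWhitespace(True) \
    .setPreservePosition(True)
```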
Bug Fixes
- Fix random NullPointerException when using TensorFlow models without Kyro serialization #6741
- Fix RecursiveTokenizerModel not being readable in a saved Pipeline #6748
- Fix ContextSpellCheckerApproach not being trained on Databricks #6750
- Fix ContextSpellCheckerModel producing the wrong order of tokens when it's used with sentence detectors #6799
- Fix GraphExtraction when fullAnnotate and document are used at the same time #6845
- Fix Word2VecModel being cast to Doc2VecModel by mistake #6849
- Fix broken sentence indexing in BertEmbeddings that impacted SentenceEmbeddings for text classification #6867
- Fix missing setExceptionsPath param in Tokenizer when it's used in Python #6868
- Fix the wrong metrics being mentioned when useBestModel was enabled. The documentation said Micro-averaged F1 but in fact, it was Macro-average F1 (the option to choose which metric to be tracked is now available as well)
- Update broken slow unit tests #6767
Models
New state-of-the-art models in English, French, Vietnamese, Dutch, and Indian languages (Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu)
Featured Pretrained Models
Model | Name | Lang |
---|---|---|
T5Transformer | t5_informal_to_formal_styletransfer | en |
T5Transformer | t5_formal_to_informal_styletransfer | en |
T5Transformer | t5_passive_to_active_styletransfer | en |
T5Transformer | t5_active_to_passive_styletransfer | en |
T5Transformer | t5_grammar_error_corrector | en |
T5Transformer | t5_small_wikiSQL | en |
LongformerEmbeddings | clinical_longformer | en |
AlbertEmbeddings | albert_indic | xx |
DistilBertEmbeddings | distilbert_base_cased | vi |
BertForSequenceClassification | bert_sequence_classifier_news_sentiment | de |
BertForSequenceClassification | bert_sequence_classifier_emotion | en |
DistilBertForTokenClassification | distilbert_token_classifier_typo_detector | en |
DistilBertForTokenClassification | distilbert_base_token_classifier_masakhaner | xx |
WordEmbeddingsModel | word2vec_wiki_1000 | fr |
WordEmbeddingsModel | word2vec_wac_200 | fr |
WordEmbeddingsModel | w2v_cc_300d | fr |
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==3.4.1
Spark Packages
spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.1
spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.1
spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.1
spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.1
Maven
spark-nlp on Apache Spark 3.0.x and 3.1.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp on Apache Spark 3.2.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark32_2.12</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp on Apache Spark 2.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark24_2.11</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp on Apache Spark 2.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark23_2.11</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark23_2.11...
John Snow Labs Spark-NLP 3.4.0: New OpenAI GPT-2, new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer for Sequence Classification, support for Spark 3.2, new distributed Word2Vec, extend support to more Databricks & EMR runtimes, new state-of-the-art transformer models, bug fixes, and lots more!
Overview
We are very excited to release Spark NLP 3.4.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community at the dawn of 2022!
Spark NLP 3.4.0 extends the support for Apache Spark 3.2.x major releases on Scala 2.12. We now support all 5 major Apache Spark and PySpark releases of 2.3.x, 2.4.x, 3.0.x, 3.1.x, and 3.2.x at once helping our community to migrate from earlier Apache Spark versions to newer releases without being worried about Spark NLP end of life support. We also extend support for new Databricks and EMR instances on Spark 3.2.x clusters.
This release also comes with a brand new GPT2Transformer using OpenAI GPT-2 models for prediction at scale, new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer annotators to use existing or fine-tuned models for Sequence Classification, new distributed and trainable Word2Vec annotators, new state-of-the-art transformer models in many languages, a new param to useBestModel in NerDL during training, bug fixes, and lots more!
As always, we would like to thank our community for their feedback, questions, and feature requests.
Major features and improvements
- NEW: Introducing GPT2Transformer annotator in Spark NLP for text generation purposes. `GPT2Transformer` uses OpenAI GPT-2 models from HuggingFace for prediction at scale in Spark NLP. `GPT-2` is a transformer model trained on a very large corpus of English data in a self-supervised fashion. This means it was trained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences
- NEW: Introducing RoBertaForSequenceClassification annotator in Spark NLP. `RoBertaForSequenceClassification` can load RoBERTa models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `RobertaForSequenceClassification` for PyTorch or `TFRobertaForSequenceClassification` for TensorFlow models in HuggingFace
- NEW: Introducing XlmRoBertaForSequenceClassification annotator in Spark NLP. `XlmRoBertaForSequenceClassification` can load XLM-RoBERTa models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLMRobertaForSequenceClassification` for PyTorch or `TFXLMRobertaForSequenceClassification` for TensorFlow models in HuggingFace
- NEW: Introducing LongformerForSequenceClassification annotator in Spark NLP. `LongformerForSequenceClassification` can load Longformer models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `LongformerForSequenceClassification` for PyTorch or `TFLongformerForSequenceClassification` for TensorFlow models in HuggingFace
- NEW: Introducing AlbertForSequenceClassification annotator in Spark NLP. `AlbertForSequenceClassification` can load ALBERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `AlbertForSequenceClassification` for PyTorch or `TFAlbertForSequenceClassification` for TensorFlow models in HuggingFace
- NEW: Introducing XlnetForSequenceClassification annotator in Spark NLP. `XlnetForSequenceClassification` can load XLNet models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLNetForSequenceClassification` for PyTorch or `TFXLNetForSequenceClassification` for TensorFlow models in HuggingFace
- NEW: Introducing trainable and distributed Word2Vec annotators based on Word2Vec in Spark ML. You can train Word2Vec in a cluster on multiple machines to handle large-scale datasets and use the trained model for token-level classifications such as NerDL (see the sketch after this item)
- Introducing the `useBestModel` param in the NerDLApproach annotator. This param preserves and restores the model that has achieved the best performance at the end of the training. The priority is metrics from testDataset (micro F1), then metrics from validationSplit (micro F1); if neither is set, it will keep track of loss during the training
- Support Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP by default is shipped for Spark 3.0.x/3.1.x, but now you have `spark-nlp-spark32` and `spark-nlp-gpu-spark32` packages
- Adding a new param to the sparknlp.start() function in Python for Apache Spark 3.2.x (`spark32=True`)
- Update Colab and Kaggle scripts for faster setup. We no longer need to remove Java 11 in order to install Java 8 since Spark NLP works on Java 11. This makes the installation of Spark NLP on Colab and Kaggle as fast as `pip install spark-nlp pyspark==3.1.2`
- Add new scripts/notebook to generate custom TensorFlow graphs for the `ContextSpellCheckerApproach` annotator
- Add a new `graphFolder` param to the `ContextSpellCheckerApproach` annotator. This param allows training ContextSpellChecker from a custom-made TensorFlow graph
- Support DBFS file system in the `graphFolder` param. Starting with Spark NLP 3.4.0 you can point NerDLApproach or ContextSpellCheckerApproach to a TF graph hosted on Databricks
- Add a new feature to all classifiers (`ForTokenClassification` and `ForSequenceClassification`) to retrieve classes from the pretrained models
sequenceClassifier = XlmRoBertaForSequenceClassification \
.pretrained('xlm_roberta_base_sequence_classifier_ag_news', 'en') \
.setInputCols(['token', 'document']) \
.setOutputCol('class')
print(sequenceClassifier.getClasses())
#Sports, Business, World, Sci/Tech
- Add an `inputFormats` param to the DateMatcher and MultiDateMatcher annotators. DateMatcher and MultiDateMatcher can now define a list of acceptable input formats via date patterns to search for in the text. The output format then defines the single output pattern.
date_matcher = DateMatcher() \
    .setInputCols(['document']) \
    .setOutputCol("date") \
    .setInputFormats(["yyyy", "yyyy/dd/MM", "MM/yyyy"]) \
    .setOutputFormat("yyyyMM") \
    .setSourceLanguage("en")
# Note: setOutputFormat was previously called setDateFormat
- Enable batch processing in T5Transformer and MarianTransformer annotators
- Add Schema to `readDataset` in the CoNLL() class
- Welcoming 6x new Databricks runtimes to our Spark NLP family:
- Databricks 10.0
- Databricks 10.0 ML GPU
- Databricks 10.1
- Databricks 10.1 ML GPU
- Databricks 10.2
- Databricks 10.2 ML GPU
- Welcoming 3x new EMR 6.x series to our Spark NLP family:
- EMR 5.33.1 (Apache Spark 2.4.7 / Hadoop 2.10.1)
- EMR 6.3.1 (Apache Spark 3.1.1 / Hadoop 3.2.1)
- EMR 6.4.0 (Apache Spark 3.1.2 / Hadoop 3.2.1)
Bug Fixes
- Fix a race condition in cluster mode when the TF session is accessed as many times as the number of available cores on the driver machine for the very first time. Loading a model multiple times at once results in higher disk usage, and IO may become a bottleneck for larger models, especially on machines with slower disks. Thanks to @jerrychenhf for finding this issue and offering a solution #6575
- Fix a performance issue introduced in the 3.3.3 release for T5Transformer and MarianTransformer annotators. While we added support for ignored tokens, accidentally we introduced a bug that degraded the performance for these two annotators (sometimes up to 2x slower). Please update to 3.4.0 if you are using any of these two annotators #6605
- Fix a bug in model resolution by not filtering based on the timestamp
- Fix configProtoBytes param type in Python #6549
- Fix missing DefaultParamsReadable in RegexTokenizer annotator #6653
- Fix missing models `lemma_antbnc`, `sentiment_vivekn`, and `spellcheck_norvig` for Spark 3.x
- Fix missing pipelines `clean_slang`, `check_spelling`, `match_chunks`, and `match_datetime` for Spark 3.x
- Fix `saveModel` in TrainingHelper
- Fix Keyword/Yake module naming in Scala #6562
Models Hub
Models Hub now comes with new features to easily filter and find your desired models & pipelines by:
- NLP Task
- Natural Language
- Spark NLP version
In addition, you can also filter models & pipelines by:
- Models or Pipelines (finally!)
- Tags used inside Model's card
- Or even by predicted entities (which labels/classes a model can predict)
As always, you can host your own pre-trained models & pipelines easily accessible to you for free & forever!
Models and Pipelines
--------------...
John Snow Labs Spark-NLP 3.3.4: Patch release
Patch release
- Fix `ClassCastException` error in the pretrained function for DistilBertForSequenceClassification in Python #6513
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP publications
- Spark NLP in Action
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==3.3.4
Spark Packages
spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.4
spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.4
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.4
spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.4
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.4
Maven
spark-nlp on Apache Spark 3.0.x and 3.1.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>3.3.4</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>3.3.4</version>
</dependency>
spark-nlp on Apache Spark 2.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark24_2.11</artifactId>
<version>3.3.4</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
<version>3.3.4</version>
</dependency>
spark-nlp on Apache Spark 2.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark23_2.11</artifactId>
<version>3.3.4</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
<version>3.3.4</version>
</dependency>
FAT JARs
-
CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.3.4.jar
-
GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.3.4.jar
-
CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.3.4.jar
-
GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.3.4.jar
-
CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.3.4.jar
-
GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.3.4.jar
What's Changed
- Update documentation of ChunkKeyPhraseExtraction by @vankov in #6508
- Fixes `new` instantiation in the Scala section by @josejuanmartinez in #6469
- Fix the wrong name for DistilBertForSequenceClassification in Python by @maziyarpanahi in #6513
- Release/334 release candidate by @maziyarpanahi in #6514
Full Changelog: 3.3.3...3.3.4