Releases: JohnSnowLabs/spark-nlp
Spark NLP 4.1.0: Vision Transformer (ViT) is here! The very first Computer Vision pipeline for the state-of-the-art Image Classification task, AWS Graviton/ARM64 support, new EMR & Databricks support, 1000+ state-of-the-art models, and more!
Overview
An Image is Worth 16x16 Words!
For the first time ever we are delighted to announce support for Image Classification in Spark NLP by using state-of-the-art Vision Transformer (ViT) models at scale. This release comes with official support for AWS Graviton and ARM64 processors, new Databricks and EMR support, and 1000+ state-of-the-art models.
Spark NLP 4.1 also celebrates crossing 8000+ free and open-source models & pipelines available on Models Hub. As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features & Improvements
- NEW: Introducing ViTForImageClassification annotator in Spark NLP. `ViTForImageClassification` can load Vision Transformer (ViT) models with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token), e.g. for ImageNet. This annotator is compatible with all the models trained/fine-tuned by using `ViTForImageClassification` for PyTorch or `TFViTForImageClassification` for TensorFlow models in HuggingFace (#11536)
An overview of the ViT model structure as introduced in Google Research's original 2021 paper
from pyspark.ml import Pipeline
from sparknlp.base import *        # ImageAssembler
from sparknlp.annotator import *   # ViTForImageClassification

# Read raw images with Spark's built-in image data source
data_df = spark.read.format("image") \
    .load(path="images/")

# Convert the image column into Spark NLP's image annotation format
image_assembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

# Default pretrained ViT model for image classification
image_classifier = ViTForImageClassification \
    .pretrained() \
    .setInputCols("image_assembler") \
    .setOutputCol("class")

pipeline = Pipeline(stages=[
    image_assembler,
    image_classifier,
])

model = pipeline.fit(data_df)
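For reference, a minimal way to run the fitted pipeline and inspect the predictions; this is a hedged sketch, where the `class.result` access follows the standard Spark NLP annotation schema and the exact metadata keys vary by model:

```python
# Score the images and show the predicted label next to each file path
predictions = model.transform(data_df)
predictions.select("image.origin", "class.result").show(truncate=False)
```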
- NEW: Support for AWS Graviton/Graviton2 with up to 3x better price-performance. For the first time, Spark NLP supports Graviton and ARM64 (ARMv8 and above) processors. (#10939)
- NEW: Introducing TFNerDLGraphBuilder annotator. `TFNerDLGraphBuilder` can be used to automatically detect the parameters of a required NerDL graph and generate that graph within a pipeline when the default NER graphs are not suitable for your training dataset (a usage sketch follows below). `TFNerDLGraphBuilder` supports local, DBFS, and S3 file systems. (#10564)
- Allow passing confidence scores from all XXXForTokenClassification annotators to NerConverter. It is now possible to access the confidence scores coming from the following annotators in the NerConverter metadata (similar to NerDLModel): AlbertForTokenClassification, BertForTokenClassification, DeBertaForTokenClassification, DistilBertForTokenClassification, LongformerForTokenClassification, RoBertaForTokenClassification, XlmRoBertaForTokenClassification, and XlnetForTokenClassification
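The following is a minimal, hedged sketch of how `TFNerDLGraphBuilder` can be placed before `NerDLApproach` in a training pipeline. The import location, the `"auto"` graph-file value, and the column names are assumptions based on the typical NerDL training setup rather than text from this release note:

```python
from pyspark.ml import Pipeline
from sparknlp.annotator import *  # TFNerDLGraphBuilder, NerDLApproach (exact module assumed)

graph_folder = "./ner_graphs"  # local path; DBFS and S3 URIs are stated to work as well

# Detect the required NerDL graph parameters from the training data and generate the graph
graph_builder = TFNerDLGraphBuilder() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setGraphFile("auto") \
    .setGraphFolder(graph_folder)

# Train NerDL using the freshly generated graph instead of a default one
ner_approach = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setGraphFolder(graph_folder)

pipeline = Pipeline(stages=[graph_builder, ner_approach])
```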
- Introducing PushToHub Python class to easily push public models & pipelines to Models Hub
- Introducing `fullAnnotateImage` to the existing LightPipeline to support ImageAssembler and ViTForImageClassification annotators in a Spark NLP pipeline. `fullAnnotateImage` supports paths to images hosted locally, on DBFS, or on S3.
light_pipeline = LightPipeline(model)
annotations_result = light_pipeline.fullAnnotateImage("images/hippopotamus.JPEG")
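A hedged sketch of reading the result back; whether `fullAnnotateImage` returns a dict or a list of dicts may differ between versions, so the snippet handles both:

```python
result = annotations_result[0] if isinstance(annotations_result, list) else annotations_result

# The "class" key matches the ViTForImageClassification output column from the pipeline above
for annotation in result["class"]:
    print(annotation.result)  # predicted label for hippopotamus.JPEG
```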
- Welcoming a new EMR 6.x series to our Spark NLP family:
- EMR 6.7.0 (now supports Apache Spark 3.2.1, Apache Hive 3.1.3, HUDI 0.11, PrestoDB 0.272, and Trino 0.378.)
- Welcoming 3 new Databricks runtimes to our Spark NLP family:
- Databricks 11.2 LTS
- Databricks 11.2 LTS ML
- Databricks 11.2 LTS ML GPU
- Welcoming new AWS Graviton-enabled Databricks runtimes to our Spark NLP family
Models
Spark NLP 4.1.0 comes with 1000+ state-of-the-art pre-trained transformer models for Image Classification, Token Classification, and Sequence Classification in many languages.
Featured Models
Model | Name | Lang |
---|---|---|
ViTForImageClassification | image_classifier_vit_base_patch16_224 | en |
ViTForImageClassification | image_classifier_vit_base_patch16_384 | en |
ViTForImageClassification | image_classifier_vit_base_patch32_384 | en |
ViTForImageClassification | image_classifier_vit_base_xray_pneumonia | en |
ViTForImageClassification | image_classifier_vit_finetuned_chest_xray_pneumonia | en |
ViTForImageClassification | image_classifier_vit_food | en |
ViTForImageClassification | image_classifier_vit_base_food101 | en |
ViTForImageClassification | image_classifier_vit_autotrain_dog_vs_food | en |
ViTForImageClassification | image_classifier_vit_baseball_stadium_foods | en |
ViTForImageClassification | image_classifier_vit_south_indian_foods | en |
ViTForImageClassification | image_classifier_vit_denver_nyc_paris | en |
ViTForImageClassification | image_classifier_vit_CarViT | en |
Check out 240 (ViT) models on Models Hub - Image Classification
Spark NLP covers the following languages:
English, Multilingual, Afrikaans, Afro-Asiatic languages, Albanian, Altaic languages, American Sign Language, Amharic, Arabic, Argentine Sign Language, Armenian, Artificial languages, Atlantic-Congo languages, Austro-Asiatic languages, Austronesian languages, Azerbaijani, Baltic languages, Bantu languages, Basque, Basque (family), Belarusian, Bemba (Zambia), Bengali, Bangla, Berber languages, Bihari, Bislama, Bosnian, Brazilian Sign Language, Breton, Bulgarian, Catalan, Caucasian languages, Cebuano, Celtic languages, Central Bikol, Chichewa, Chewa, Nyanja, Chilean Sign Language, Chinese, Chuukese, Colombian Sign Language, Congo Swahili, Croatian, Cushitic languages, Czech, Danish, Dholuo, Luo (Kenya and Tanzania), Dravidian languages, Dutch, East Slavic languages, Eastern Malayo-Polynesian languages, Efik, Esperanto, Estonian, Ewe, Fijian, Finnish, Finnish Sign Language, Finno-Ugrian languages, French, French-based creoles and pidgins, Ga, Galician, Ganda, Georgian, German, Germanic languages, Gilbertese, Greek (modern), Greek languages, Gujarati, Gun, Haitian, Haitian Creole, Hausa, Hebrew (modern), Hiligaynon, Hindi, Hiri Motu, Hungarian, Icelandic, Igbo, Iloko, Indic languages, Indo-European languages, Indo-Iranian languages, Indonesian, Irish, Isoko, Isthmus Zapotec, Italian, Italic languages, Japanese, Javanese, Kabyle, Kalaallisut, Greenlandic, Kannada, Kaonde, Kinyarwanda, Kirundi, Kongo, Korean, Kwangali, Kwanyama, Kuanyama, Latin, Latvian, Lingala, Lithuanian, Louisiana Creole, Lozi, Luba-Katanga, Luba-Lulua, Lunda, Lushai, Luvale, Macedonian, Malagasy, Malay, Malayalam, Malayo-Polynesian languages, Maltese, Manx, Marathi (Marāṭhī), Marshallese, Mexican Sign Language, Mon-Khmer languages, Morisyen, Mossi, Multiple languages, Ndonga, Nepali, Niger-Kordofanian languages, Nigerian Pidgin, Niuean, North Germanic languages, Northern Sotho, Pedi, Sepedi, Norwegian, Norwegian Bokmål, Norwegian Nynorsk, Nyaneka, Oromo, Pangasinan, Papiamento, Persian (Farsi), Peruvian Sign Language, Philippine languages, Pijin, Pohnpeian, Polish, Portuguese, Portuguese-based creoles and pidgins, Punjabi (Eastern), Romance languages, Romanian, Rundi, Russian, Ruund, Salishan languages, Samoan, San Salvador Kongo, Sango, Semitic languages, Serbo-Croatian, Seselwa Creole French, Shona, Sindhi, Sino-Tibetan languages, Slavic languages, Slovak, Slovene, Somali, South Caucasian languages, South Slavic languages, Southern Sotho, Spanish, Spanish Sign Language, Sranan Tongo, Swahili, Swati, Swedish, Tagalog, Tahitian, Tai, Tamil, Telugu, Tetela, Tetun Dili, Thai, Tigrinya, ...
Spark NLP 4.0.2: Over 620 new state-of-the-art models in 21 languages, full support for Apache Spark 3.3.0, new Databricks runtime 11.1, and bug fixes
Overview
We are pleased to release Spark NLP 4.0.2! This release comes with full compatibility with the newly-released Apache Spark 3.3.0 and official support for Databricks' new 11.1 Beta runtimes (which include Apache Spark 3.3.0 and Scala 2.12).
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features
- Welcoming new Databricks runtimes based on Spark/PySpark 3.3.0 to our Spark NLP family:
- Databricks 11.1 Beta
- Databricks 11.1 ML Beta
- Databricks 11.1 ML Beta GPU
- SentenceDetector now comes with a new parameter `customBoundsStrategy` for returning custom bounds #10567
For example, with `setCustomBounds([r"\.", ";"])` on the text
This is a sentence. This one uses custom bounds; As is this one;
the result without the new flag will be
["This is a sentence", "This one uses custom bounds", "As is this one"]
With the new flag:
.setCustomBounds([r"\.", ";"])
.setCustomBoundsStrategy("append")
the result will be
["This is a sentence.", "This one uses custom bounds;", "As is this one;"]
Similarly with "prepend", given the text
1. This is a list
1.1 This is a subpoint
2. Second thing
2.2 Second subthing
and
.setCustomBounds([r"\n[\d\.]+"])
.setCustomBoundsStrategy("prepend")
the result will be
[
"1. This is a list",
"1.1 This is a subpoint",
"2. Second thing",
"2.2 Second subthing"
]
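To make the example above concrete, here is a minimal, hedged pipeline sketch; the DocumentAssembler boilerplate and the sample DataFrame are illustrative, and only `setCustomBounds` and `setCustomBoundsStrategy` are the new pieces described in this release:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split on periods and semicolons, keeping ("append") the matched bound on each sentence
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setCustomBounds([r"\.", ";"]) \
    .setCustomBoundsStrategy("append")

data = spark.createDataFrame(
    [["This is a sentence. This one uses custom bounds; As is this one;"]]
).toDF("text")

pipeline = Pipeline(stages=[document_assembler, sentence_detector])
pipeline.fit(data).transform(data).selectExpr("explode(sentence.result)").show(truncate=False)
```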
Bug Fixes
- Fix bug that attempts to create spark session on executors when using GraphExtraction in Spark/PySpark 3.3 #9905
Models and Pipelines
Spark NLP 4.0.2 comes with 620+ state-of-the-art pre-trained transformer models in 21 languages including multi-lingual models.
Featured Models
The complete list of all 6900+ models & pipelines in 230+ languages is available on Models Hub
π Documentation & Articles
- Spark NLP: Hardware Acceleration
- Serving Spark NLP via API in Java
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==4.0.2
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.2
M1
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.2
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>4.0.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>4.0.2</version>
</dependency>
spark-nlp-m1:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-m1_2.12</artifactId>
<version>4.0.2</version>
</dependency>
FAT JARs
-
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.2.jar
-
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.0.2.jar
-
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.0.2.jar
What's Changed
Contributors
@gadde5300 @danilojsl @hsaglamlar @Cabir40 @ahmedlone127 @muhammetsnts @KshitizGIT @maziyarpanahi @albertoandreottiATgmail @DevinTDHa @luca-martial @Damla-Gurbaz @jsl-models @Meryem1425
New Contributors
- @hsaglamlar made their first contribution in #10544
Full Changelog: 4.0.1...4.0.2
Spark NLP 4.0.1: Full support for Apache Spark 3.3.0, new Databricks runtime 11, enhancements, and other bug fixes!
Overview
We are pleased to release Spark NLP 4.0.1! This release comes with support for the newly-released Apache Spark 3.3.0, which brings improved join query performance via Bloom filters, increased Pandas API coverage, and many other improvements. In addition, Spark NLP now officially supports Databricks Runtime 11, along with other enhancements and bug fixes.
As always, we would like to thank our community for their feedback, questions, and feature requests.
Features & Enhancements
- Full support for Apache Spark & PySpark 3.3.0
- Add Apache Spark 3.3.0 to Google Colab and Kaggle setup scripts
- New `-g` option for the Google Colab and Kaggle setup scripts on GPU devices to upgrade `libcudnn8` to 8.1.0 and solve the GPU issue
- Welcoming new Databricks runtimes based on Spark/PySpark 3.3.0 to our Spark NLP family:
- Databricks 11.0 LTS
- Databricks 11.0 LTS ML
- Databricks 11.0 LTS ML GPU
Bug Fixes
- Fix the error caused by PySpark 3.3.0 in CoNLL, CoNLLU, POS, and PubTator annotators as training helpers
- Fix and re-upload Dependency and Type Dependency parser pre-trained models
- Update pre-trained pipelines with issues on PySpark 3.2 and 3.3
Documentation
- Serving Spark NLP via API in Java
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==4.0.1
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.1
M1
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.1
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>4.0.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>4.0.1</version>
</dependency>
spark-nlp-m1:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-m1_2.12</artifactId>
<version>4.0.1</version>
</dependency>
FAT JARs
-
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.1.jar
-
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.0.1.jar
-
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.0.1.jar
What's Changed
Contributors
@muhammetsnts @jsl-models @Meryem1425 @Damla-Gurbaz @jsl-builder @rpranab @danilojsl @josejuanmartinez @Cabir40 @DevinTDHa @agsfer @suvrat-joshi @ahmedlone127 @albertoandreottiATgmail @KshitizGIT @mahmoodbayeshi @maziyarpanahi
New Contributors
- @ahmedlone127 made their first contribution in #9887
Full Changelog: 4.0.0...4.0.1
Spark NLP 4.0.0: New modern extractive Question answering (QA) annotators for ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa, official support for Apple silicon M1, support oneDNN to improve CPU up to 97%, improved transformers on GPU up to +700%, 1000+ state-of-the-art models, and lots more!
Overview
We are very excited to release Spark NLP 4.0.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community!
This release comes with official support for Apple silicon M1 chips (for the first time), official support for Spark/PySpark 3.2, support for the oneAPI Deep Neural Network Library (oneDNN) to improve TensorFlow on CPU by up to 97%, and optimized transformer-based embeddings on GPU with performance gains of up to +700%. It also introduces brand-new, modern extractive transformer-based Question Answering (QA) annotators for tasks like SQuAD based on the ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa architectures, along with 1000+ state-of-the-art models. WordEmbeddingsModel now works in clusters without HDFS/DBFS/S3, such as Kubernetes. The release further adds new Databricks and EMR support, new NER models achieving the highest F1 scores in Spark NLP, and many more enhancements and bug fixes!
We would like to mention that Spark NLP 4.0.0 drops the support for Spark 2.3 and 2.4 (Scala 2.11). Starting 4.0.0 we only support Spark/PySpark 3.x on Scala 2.12.
As always, we would like to thank our community for their feedback, questions, and feature requests.
Major features and improvements
- NEW: Support for oneAPI Deep Neural Network Library (oneDNN) optimizations to improve TensorFlow on CPU. Enabling oneDNN can improve some transformer-based models by up to 97%. By default, the oneDNN optimizations are turned off. To enable them, set the environment variable `TF_ENABLE_ONEDNN_OPTS`. On Linux systems, for instance:
export TF_ENABLE_ONEDNN_OPTS=1
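If you prefer enabling this from Python, a hedged equivalent is to export the variable before Spark NLP starts; this sketch assumes a local, single-machine session, and on a cluster the variable must also be set in the executors' environment:

```python
import os

# Must be set before the Spark NLP session (and its TensorFlow backend) is created
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

import sparknlp
spark = sparknlp.start()
```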
- NEW: Optimizing batch processing for transformer-based Word Embeddings on a GPU device. These optimizations can result in performance improvements up to +700% (more details in the Benchmarks section)
- NEW: Official support for Apple silicon M1 on macOS devices. You can use the `spark-nlp-m1` package that supports Apple silicon M1 on your macOS machine in Spark NLP 4.0.0
- NEW: Introducing AlbertForQuestionAnswering annotator in Spark NLP. `AlbertForQuestionAnswering` can load `ALBERT` models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `AlbertForQuestionAnswering` for PyTorch or `TFAlbertForQuestionAnswering` for TensorFlow models in HuggingFace
- NEW: Introducing BertForQuestionAnswering annotator in Spark NLP. `BertForQuestionAnswering` can load `BERT` & `ELECTRA` models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `BertForQuestionAnswering` and `ElectraForQuestionAnswering` for PyTorch or `TFBertForQuestionAnswering` and `TFElectraForQuestionAnswering` for TensorFlow models in HuggingFace
- NEW: Introducing DeBertaForQuestionAnswering annotator in Spark NLP. `DeBertaForQuestionAnswering` can load `DeBERTa` v2 & v3 models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2ForQuestionAnswering` for PyTorch or `TFDebertaV2ForQuestionAnswering` for TensorFlow models in HuggingFace
- NEW: Introducing DistilBertForQuestionAnswering annotator in Spark NLP. `DistilBertForQuestionAnswering` can load `DistilBERT` models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `DistilBertForQuestionAnswering` for PyTorch or `TFDistilBertForQuestionAnswering` for TensorFlow models in HuggingFace
- NEW: Introducing LongformerForQuestionAnswering annotator in Spark NLP. `LongformerForQuestionAnswering` can load `Longformer` models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `LongformerForQuestionAnswering` for PyTorch or `TFLongformerForQuestionAnswering` for TensorFlow models in HuggingFace
- NEW: Introducing RoBertaForQuestionAnswering annotator in Spark NLP. `RoBertaForQuestionAnswering` can load `RoBERTa` models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `RobertaForQuestionAnswering` for PyTorch or `TFRobertaForQuestionAnswering` for TensorFlow models in HuggingFace
- NEW: Introducing XlmRoBertaForQuestionAnswering annotator in Spark NLP. `XlmRoBertaForQuestionAnswering` can load `XLM-RoBERTa` models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `XLMRobertaForQuestionAnswering` for PyTorch or `TFXLMRobertaForQuestionAnswering` for TensorFlow models in HuggingFace
- NEW: Introducing MultiDocumentAssembler annotator for cases where multiple inputs need to be converted to DOCUMENT, such as the XXXForQuestionAnswering annotators. A usage sketch follows below.
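As referenced in the MultiDocumentAssembler item above, here is a minimal, hedged sketch of an extractive QA pipeline; the default `pretrained()` model, the column names, and the sample data are illustrative assumptions, not taken from these notes:

```python
from pyspark.ml import Pipeline
from sparknlp.base import MultiDocumentAssembler
from sparknlp.annotator import BertForQuestionAnswering

# Turn the question and its context into two DOCUMENT columns
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

# Default pretrained extractive QA model (SQuAD-style span prediction)
span_classifier = BertForQuestionAnswering.pretrained() \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")

pipeline = Pipeline(stages=[document_assembler, span_classifier])

data = spark.createDataFrame(
    [["What is the capital of France?", "Paris is the capital of France."]]
).toDF("question", "context")

pipeline.fit(data).transform(data).select("answer.result").show(truncate=False)
```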
- NEW: Introducing SpanBertCorefModel annotator for Coreference Resolution with BERT and SpanBERT models, based on the BERT for Coreference Resolution: Baselines and Analysis paper. It is an implementation of a SpanBERT-based coreference resolution model.
- NEW: Introducing the `enableInMemoryStorage` parameter in the `WordEmbeddingsModel` annotator. Enabling this parameter means the annotator no longer requires distributed storage to unpack indices and performs everything in memory. A usage sketch follows below.
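A hedged sketch of the parameter above; the setter name `setEnableInMemoryStorage` is assumed from the `enableInMemoryStorage` param name, and the default `pretrained()` embeddings are only illustrative:

```python
from sparknlp.annotator import WordEmbeddingsModel

# Keep the embeddings index in memory instead of unpacking it to distributed storage
word_embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setEnableInMemoryStorage(True)
```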
- Official support for Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP is shipped by default for Spark 3.2.x and additionally supports Spark/PySpark 3.0.x and 3.1.x
- Unifying all supported Apache Spark packages on Maven into `spark-nlp` for CPU, `spark-nlp-gpu` for GPU, and `spark-nlp-m1` for the new Apple silicon M1 on macOS. The need for Apache Spark-specific packages like `spark-nlp-spark32` has been removed.
- Adding a new param to the `sparknlp.start()` function in Python and Scala for Apple silicon M1 on macOS (`m1=True`)
- Upgrade TensorFlow to 2.7.1 and start supporting Apple silicon M1
- Upgrade RocksDB with new enhancements and support for Apple silicon M1
- Upgrade SentencePiece tokenizer TF ops to 2.7.1
- Upgrade SentencePiece JNI to v0.1.96 and provide support for Apple silicon M1 on macOS support
- Upgrade to Scala 2.12.15
- Update Colab, Kaggle, and SageMaker scripts
- Refactor the entire Python module in Spark NLP to make the development and maintenance easier
- Refactor unit tests in Python and migrate to pytest
- Welcoming 6x new Databricks runtimes to our Spark NLP family:
- Databricks 10.4 LTS
- Databricks 10.4 LTS ML
- Databricks 10.4 LTS ML GPU
- Databricks 10.5
- Databricks 10.5 ML
- Databricks 10.5 ML GPU
- Welcoming a new EMR 6.x series to our Spark NLP family:
- EMR 6.6.0 (Apache Spark 3.2.0 / Hadoop 3.2.1)
- Migrate T5Transformer to TensorFlow v2 architecture by re-uploading all the existing models
- Support for 2 inputs in LightPipeline with MultiDocumentAssembler
- Add new default NerDL graph for xsmall DeBERTa embeddings model (384 dimensions)
- Adding annotateJava method to PretrainedPipeline class in Java to facilitate the use of LightPipelines
- Allow changing case sensitivity. Previously, the user could not set the `setCaseSensitive` param. This allows users to change this value if the model was saved/uploaded with the wrong case sensitivity parameter (BERT, ALBERT, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, and Longformer for XXXForSequenceClassification and XXXForTokenClassification).
- Keep accuracy in ClassifierDL and SentimentDL during the training between 0.0 and 1.0
- Preserve the original form of the token in BPE Tokenizer used in RoBERTa annotators (used in embeddings, sequence and token classification)
Performance Improvements (Benchmarks)
We have introduced two major performance improvements for GPU and CPU devices in Spark NLP 4.0.0 release.
The following benchmarks have been done by using a single Dell Server with the following specs:
- GPU: Tesla P100 PCIe 12GB - CUDA Version: 11.3 - Driver Version: 465.19.01
- CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz - 40 Cores
- Memory: 80G
GPU
We have improved our batch processing approach for transformer-based Word Embeddings to improve their performance on a GPU device. These optimizations result in performance improvements up to +700%. The detailed list of improved transformer models on GPU in comparison to Spark NLP 3.4.x:
Model on GPU | Spark NLP 3.4.3 vs. 4.0.0 |
---|---|
RoBERTa base | +560%(6.6x) |
RoBERTa Large | +332%(4.3x) |
Albert Base | +587%(6.9x... |
Spark NLP 3.4.4: New DeBERTa for Token Classification, new CamemBERT embeddings, speed improvements for Tokenizer and UniversalSentenceEncoder annotators, over 160 new state-of-the-art models, and other improvements!
Overview
We are very excited to release Spark NLP 3.4.4! This release comes with a new DeBERTa for Token Classification annotator compatible with existing or fine-tuned models on HuggingFace, a new annotator for CamemBERT embeddings models, up to 18x speed improvements for UniversalSentenceEncoder on GPU devices, up to 400% speed improvements in Tokenizer with a list of exceptions, and new state-of-the-art NER, French embeddings, DistilBERT embeddings, and ALBERT embeddings models!
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features
- NEW: Introducing DeBertaForTokenClassification annotator in Spark NLP. `DeBertaForTokenClassification` can load DeBERTa v2 & v3 models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2ForTokenClassification` for PyTorch or `TFDebertaV2ForTokenClassification` for TensorFlow models in HuggingFace #8082
- NEW: Introducing CamemBertEmbeddings annotator in Spark NLP #8237. CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture, pretrained on the French subcorpus of the newly available multilingual corpus OSCAR. For further information or requests, please go to the CamemBERT website. A usage sketch follows below.
- Add support for batching rows to improve UniversalSentenceEncoder on GPU devices. This new feature will increase GPU speed between 2x and 18x depending on the distribution of sentences #8234
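As mentioned in the CamemBertEmbeddings item above, here is a minimal, hedged usage sketch; the `camembert_base` model name comes from the Models table below, while the document/token boilerplate is standard:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, CamemBertEmbeddings

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# French CamemBERT embeddings, one vector per token
embeddings = CamemBertEmbeddings.pretrained("camembert_base", "fr") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings])
```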
Bug Fixes & Enhancements
- Optimizing Tokenizer performance by up to 400% when there is an exceptions list. The exceptions list now scales to a large number of exceptions without impacting overall performance #7881
- Support latest PySpark releases in Colab, Kaggle, and SageMaker scripts #8028
- Fix bug that caused get input/output/LazyAnnotator to return None #8043
- Fix DeBertaForSequenceClassification in Python failing to load pretrained models #8060
- Fix missing Lemma and POS models from 3.4.3 release
Dependencies
- Removing outdated trove4j dependency in favour of native Java modules #8236
- Upgrade the base Apache Spark to 2.4.8, 3.0.3, and 3.2.1
- Upgrade Typesafe Config to 1.4.2
- Upgrade sbt to 1.6.2
Models
Spark NLP 3.4.4 comes with 160+ state-of-the-art multi-lingual pretrained models. Some of the featured models:
New DeBERTa Token Classification Models
New fine-tuned DeBERTa v3 models for token classifications over CoNLL03 and OntoNotes datasets that reach state-of-the-art metrics.
Model | Name | Lang | F1 Dev |
---|---|---|---|
DeBertaForTokenClassification | deberta_v3_large_token_classifier_conll03 | en | 0.97 |
DeBertaForTokenClassification | deberta_v3_base_token_classifier_conll03 | en | 0.96 |
DeBertaForTokenClassification | deberta_v3_small_token_classifier_conll03 | en | 0.95 |
DeBertaForTokenClassification | deberta_v3_xsmall_token_classifier_conll03 | en | 0.93 |
DeBertaForTokenClassification | deberta_v3_large_token_classifier_ontonotes | en | 0.89 |
DeBertaForTokenClassification | deberta_v3_base_token_classifier_ontonotes | en | 0.88 |
DeBertaForTokenClassification | deberta_v3_small_token_classifier_ontonotes | en | 0.87 |
DeBertaForTokenClassification | deberta_v3_xsmall_token_classifier_ontonotes | en | 0.86 |
New CamemBERT Models
Model | Name | Lang |
---|---|---|
CamemBertEmbeddings | camembert_large | fr |
CamemBertEmbeddings | camembert_base | fr |
CamemBertEmbeddings | camembert_base_ccnet_4gb | fr |
CamemBertEmbeddings | camembert_base_ccnet | fr |
CamemBertEmbeddings | camembert_base_oscar_4gb | fr |
CamemBertEmbeddings | camembert_base_wikipedia_4gb | fr |
New DistilBERT Embeddings Models
Model | Name | Lang |
---|---|---|
DistilBertEmbeddings | distilbert_embeddings_distilbert_base_fr_cased | fr |
DistilBertEmbeddings | distilbert_embeddings_marathi_distilbert | mr |
DistilBertEmbeddings | distilbert_embeddings_distilbert_base_indonesian | id |
DistilBertEmbeddings | distilbert_embeddings_javanese_distilbert_small | jv |
DistilBertEmbeddings | distilbert_embeddings_malaysian_distilbert_small | ms |
DistilBertEmbeddings | distilbert_embeddings_distilbert_base_ar_cased | ar |
New ALBERT Embeddings Models
Model | Name | Lang |
---|---|---|
AlbertEmbeddings | albert_embeddings_fralbert_base | fr |
AlbertEmbeddings | albert_embeddings_albert_base_arabic | ar |
AlbertEmbeddings | albert_embeddings_marathi_albert_v2 | mr |
AlbertEmbeddings | albert_embeddings_albert_fa_base_v2 | fa |
AlbertEmbeddings | albert_embeddings_albert_large_bahasa_cased | ms |
AlbertEmbeddings | albert_embeddings_marathi_albert | mr |
The complete list of all 5000+ models & pipelines in 200+ languages is available on Models Hub.
New Notebooks
Import CamemBERT models to Spark NLP
Spark NLP | HuggingFace Notebooks | Colab |
---|---|---|
CamemBertEmbeddings | HuggingFace in Spark NLP - CamemBERT |
You can visit Import Transformers in Spark NLP for more info
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- [Discussions](https://github.com/John...
John Snow Labs Spark-NLP 3.4.3: New DeBERTa for Sequence Classification, sigmoid activation for sequence classifiers, new features for SentenceDetectorDL, over 600 new multi-lingual models, and other improvements!
Overview
We are very excited to release Spark NLP 3.4.3! This release comes with a new DeBERTa for Sequence Classification annotator compatible with existing or fine-tuned models on HuggingFace, a new sigmoid activation function in addition to softmax to support multi-label models in all ForSequenceClassification annotators, new features added to SentenceDetectorDL, new features added to CoNLLU and Lemmatizer, more than 600 new multi-lingual models for DeBERTa, BERT, DistilBERT, fastText, Lemmatizer and Part of Speech, and other improvements!
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features
- NEW: Introducing DeBertaForSequenceClassification annotator in Spark NLP. `DeBertaForSequenceClassification` can load DeBERTa v2 & v3 models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `DebertaForSequenceClassification` for PyTorch or `TFDebertaForSequenceClassification` for TensorFlow models in HuggingFace #7713
- New multi-label feature in all ForSequenceClassification annotators. The following annotators now have the option to switch to a sigmoid activation function instead of softmax for the output layer (see the sketch after this list): AlbertForSequenceClassification, BertForSequenceClassification, DeBertaForSequenceClassification, DistilBertForSequenceClassification, LongformerForSequenceClassification, RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, and XlnetForSequenceClassification #7479
- New minLength, maxLength, splitLength, customBounds, and useCustomBoundsOnly parameters in SentenceDetectorDL #7214
- New impossiblePenultimates in SentenceDetectorDLModel #7685
- New feature to set names for columns in CoNLLU class: textCol, documentCol, sentenceCol, formCol, uposCol, xposCol, and lemmaCol #7344
- New formCol and lemmaCol parameters in Lemmatizer annotator #7344
- Add new functionality to download and extract models from S3 via direct link #7682
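As referenced in the multi-label item above, a hedged sketch of switching a sequence classifier from softmax to sigmoid; the `setActivation` setter name is assumed from the new activation option, and the model name is taken from the Featured Models table below:

```python
from sparknlp.annotator import DeBertaForSequenceClassification

# Emit an independent sigmoid score per class (multi-label) instead of one softmax distribution
sequence_classifier = DeBertaForSequenceClassification \
    .pretrained("deberta_v3_base_sequence_classifier_imdb", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class") \
    .setActivation("sigmoid")
```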
Enhancements
- Fix and train new English spell checker models for Spark NLP 3.4.1 on Spark 3.x and 2.x
- Update SentenceDetector Python and Scala documentation
- Add a missing notebook to demonstrate training a WordSegmenterApproach annotator for word segmentation https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/chinese/word-segmentation/WordSegmenter_train_chinese_segmentation.ipynb
Models
New DeBERTa Classification Models
New fine-tuned DeBERTa v3 models for text classifications over IMDB reviews in English and Urdu, AG News categories in English, and Allocine French reviews.
Model | Name | Lang |
---|---|---|
DeBertaForSequenceClassification | mdeberta_v3_base_sequence_classifier_imdb | ur |
DeBertaForSequenceClassification | mdeberta_v3_base_sequence_classifier_allocine | fr |
DeBertaForSequenceClassification | deberta_v3_xsmall_sequence_classifier_imdb | en |
DeBertaForSequenceClassification | deberta_v3_small_sequence_classifier_imdb | en |
DeBertaForSequenceClassification | deberta_v3_base_sequence_classifier_imdb | en |
DeBertaForSequenceClassification | deberta_v3_large_sequence_classifier_imdb | en |
DeBertaForSequenceClassification | deberta_v3_xsmall_sequence_classifier_ag_news | en |
DeBertaForSequenceClassification | deberta_v3_small_sequence_classifier_ag_news | en |
New BERT Models
Spark NLP now has up to 250 state-of-the-art BERT models in 27 languages including Arabic, Bengali, Chinese, Dutch, English, Finnish, French, German, Greek, Hindi, Italian, Japanese, Javanese, Korean, Marathi, Panjabi, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Telugu, Turkish, Urdu, Vietnamese, and Multi-lingual.
Model | Name | Lang |
---|---|---|
BertEmbeddings | bert_embeddings_ARBERT | ar |
BertEmbeddings | bert_embeddings_German_MedBERT | de |
BertEmbeddings | bert_embeddings_bangla_bert_base | bn |
BertEmbeddings | bert_embeddings_bert_base_5lang_cased | zh |
BertEmbeddings | bert_embeddings_bert_base_5lang_cased | fr |
BertEmbeddings | bert_embeddings_bert_base_hi_cased | hi |
BertEmbeddings | bert_embeddings_bert_base_it_cased | it |
BertEmbeddings | bert_embeddings_bert_base | ko |
BertEmbeddings | bert_embeddings_bert_base_tr_cased | tr |
BertEmbeddings | bert_embeddings_bert_base_ur_cased | ur |
BertEmbeddings | bert_embeddings_bert_base_vi_cased | vi |
New fastText Models
Over 128 new Word2Vec models in 128 languages built from fastText word embeddings.
Model | Name | Lang |
---|---|---|
WordEmbeddingsModel | w2v_cc_300d | hi |
WordEmbeddingsModel | w2v_cc_300d | azb |
WordEmbeddingsModel | w2v_cc_300d | bo |
WordEmbeddingsModel | w2v_cc_300d | diq |
WordEmbeddingsModel | w2v_cc_300d | cy |
WordEmbeddingsModel | w2v_cc_300d | ckb |
WordEmbeddingsModel | w2v_cc_300d | el |
WordEmbeddingsModel | w2v_cc_300d | es |
New Lemmatizer and Part of Speech Models
234 new Lemmatizer and Part of Speech models in 62 languages based on the new Universal Dependency treebank 2.9 release.
Model | Name | Lang |
---|---|---|
LemmatizerModel | lemma_afribooms | af |
LemmatizerModel | lemma_alksnis | lt |
LemmatizerModel | lemma_alpino | nl |
LemmatizerModel | lemma_arcosg | gd |
LemmatizerModel | lemma_ancora | es |
LemmatizerModel | lemma_ancora | ca |
PerceptronModel | pos_mtg | te |
PerceptronModel | pos_ttb | ta |
PerceptronModel | pos_vtb | vi |
PerceptronModel | pos_cac | cs |
PerceptronModel | pos_btb | bg |
PerceptronModel | pos_afribooms | af |
The complete list of all 4800+ models & pipelines in 200+ languages is available on Models Hub.
Documentation
- [T...
John Snow Labs Spark-NLP 3.4.2: DeBERTa embeddings, new caching in Word2Vec and Doc2Vec, new state-of-the-art models, and bug fixes!
Overview
We are pleased to release Spark NLP 3.4.2! This release comes with a new DeBERTa transformer for word embeddings, new caching to speed up training Word2Vec and Doc2Vec, new English and multi-lingual state-of-the-art models, and bug fixes!
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features
- Introducing DeBertaEmbeddings annotator. DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%), and on RACE by +3.6% (83.2% vs. 86.8%). This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2Model` for PyTorch or `TFDebertaV2Model` for TensorFlow models (DeBERTa-v2 & DeBERTa-v3) in HuggingFace
- Introducing a new `enableCaching` param in Doc2VecApproach to speed up training (see the sketch after this list)
- Introducing a new `enableCaching` param in Word2VecApproach to speed up training
- Support Databricks runtimes 10.3, 10.3 ML, and 10.3 ML & GPU
- Support EMR emr-5.34.0 and emr-6.5.0
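As referenced in the `enableCaching` items above, a hedged sketch for Doc2Vec training; the setter name `setEnableCaching` is assumed from the param name, and the column names are the usual Doc2VecApproach setup:

```python
from sparknlp.annotator import Doc2VecApproach

# Cache intermediate data during training to speed up Doc2Vec
doc2vec = Doc2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("sentence_embeddings") \
    .setEnableCaching(True)
```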
Bug Fixes
- Fix bestModelMetric param when the set value was ignored #6978
New Notebooks
Import DeBERTa models to Spark NLP
Spark NLP | HuggingFace Notebooks | Colab |
---|---|---|
DeBertaEmbeddings | HuggingFace in Spark NLP - DeBERTa |
You can visit Import Transformers in Spark NLP for more info
Models
New state-of-the-art DeBERTa models:
Model | Name | Lang |
---|---|---|
DeBertaEmbeddings | deberta_v3_xsmall | en |
DeBertaEmbeddings | deberta_v3_small | en |
DeBertaEmbeddings | deberta_v3_base | en |
DeBertaEmbeddings | deberta_v3_large | en |
DeBertaEmbeddings | mdeberta_v3_base | xx |
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==3.4.2
Spark Packages
spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.2
spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.2
spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.2
spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.2
Maven
spark-nlp on Apache Spark 3.0.x and 3.1.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp on Apache Spark 3.2.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark32_2.12</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp on Apache Spark 2.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark24_2.11</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp on Apache Spark 2.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark23_2.11</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
<version>3.4.2</version>
</dependency>
FAT JARs
-
CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.2.jar
-
GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.2.jar
-
CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.2.jar
-
GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.2.jar
-
CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.2.jar
-
GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.2.jar
-
CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.2.jar
-
GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.2.jar
What's Changed
Full Changelog: 3.4.1...3.4.2
New Contributors
- @mahmoodbayeshi made their first contribution in #6835
- @bunyamin-polat made their first contribution in #6969
@agsfer @KshitizGIT @gadde5300 @kolia1985 @jsl-models @rpranab @josejuanmartinez @bunyamin-polat @maziyarpanahi @jsl-builder @Damla-Gurbaz @xusliebana @mahmoodbayeshi @luca-martial @dependabot @muhammetsnts @albertoandreottiATgmail
John Snow Labs Spark-NLP 3.4.1: TF session warmup, a new F1 metric to track to save the best model in NerDL, new T5 models like WikiSQL or grammar corrector, other new multi-lingual state-of-the-art models, and bug fixes!
Overview
We are pleased to release Spark NLP 3.4.1! This release comes with a TF session warmup in 3 annotators where the first inference was slower than the rest, a new param to choose which F1 to track when saving the best model during NerDL training, new T5 models such as text-to-SQL or grammar correction, new multi-lingual state-of-the-art models, and other bug fixes!
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features & Enhancements
- Implement TF session warmup for MarianTransformer, T5Transformer, and GPT2Transformer annotators. The first inference for these annotators used to take 15-20 seconds; now, with the warmup session, all inferences, including the first one, take roughly the same time #6773
- Add bestModelMetric param to choose between Micro-average or Macro-average for best model #6749
- Add trimWhitespace and preservePosition params to RegexTokenizer (see the sketch after this list) #6806
- Add a new `setSentenceMatch` param to EntityRuler to match entities across documents/sentences and not just tokens #6841
- Add support for using the spark32 and real_time_output flags in the sparknlp.start() function at the same time #6822
- Allow users to set tasks in the T5Transformer annotator
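As referenced in the RegexTokenizer item above, a hedged sketch of the new params; the whitespace pattern and column names are illustrative:

```python
from sparknlp.annotator import RegexTokenizer

# Tokenize on whitespace, trimming surrounding whitespace from each token while
# preserving the original character positions in the annotation metadata
regex_tokenizer = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setPattern("\\s+") \
    .setTrimWhitespace(True) \
    .setPreservePosition(True)
```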
Bug Fixes
- Fix random NullPointerException when using TensorFlow models without Kyro serialization #6741
- Fix RecursiveTokenizerModel not being readable in a saved Pipeline #6748
- Fix ContextSpellCheckerApproach not being trained on Databricks #6750
- Fix ContextSpellCheckerModel producing the wrong order of tokens when it's used with sentence detectors #6799
- Fix GraphExtraction when fullAnnotate and document are used at the same time #6845
- Fix Word2VecModel being cast to Doc2VecModel by mistake #6849
- Fix broken sentence indexing in BertEmbeddings that impacted SentenceEmbeddings for text classification #6867
- Fix missing setExceptionsPath param in Tokenizer when it's used in Python #6868
- Fix the wrong metrics being mentioned when useBestModel was enabled. The documentation said Micro-averaged F1 but in fact, it was Macro-average F1 (the option to choose which metric to be tracked is now available as well)
- Update broken slow unit tests #6767
Models
New state-of-the-art models in English, French, Vietnamese, Dutch, and Indian languages (Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu)
Featured Pretrained Models
Model | Name | Lang |
---|---|---|
T5Transformer | t5_informal_to_formal_styletransfer | en |
T5Transformer | t5_formal_to_informal_styletransfer | en |
T5Transformer | t5_passive_to_active_styletransfer | en |
T5Transformer | t5_active_to_passive_styletransfer | en |
T5Transformer | t5_grammar_error_corrector | en |
T5Transformer | t5_small_wikiSQL | en |
LongformerEmbeddings | clinical_longformer | en |
AlbertEmbeddings | albert_indic | xx |
DistilBertEmbeddings | distilbert_base_cased | vi |
BertForSequenceClassification | bert_sequence_classifier_news_sentiment | de |
BertForSequenceClassification | bert_sequence_classifier_emotion | en |
DistilBertForTokenClassification | distilbert_token_classifier_typo_detector | en |
DistilBertForTokenClassification | distilbert_base_token_classifier_masakhaner | xx |
WordEmbeddingsModel | word2vec_wiki_1000 | fr |
WordEmbeddingsModel | word2vec_wac_200 | fr |
WordEmbeddingsModel | w2v_cc_300d | fr |
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==3.4.1
Spark Packages
spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.1
spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.1
spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.1
spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.1
Maven
spark-nlp on Apache Spark 3.0.x and 3.1.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp on Apache Spark 3.2.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark32_2.12</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp on Apache Spark 2.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark24_2.11</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp on Apache Spark 2.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark23_2.11</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark23_2.11...
John Snow Labs Spark-NLP 3.4.0: New OpenAI GPT-2, new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer for Sequence Classification, support for Spark 3.2, new distributed Word2Vec, extend support to more Databricks & EMR runtimes, new state-of-the-art transformer models, bug fixes, and lots more!
Overview
We are very excited to release Spark NLP 3.4.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community at the dawn of 2022!
Spark NLP 3.4.0 extends the support for Apache Spark 3.2.x major releases on Scala 2.12. We now support all 5 major Apache Spark and PySpark releases of 2.3.x, 2.4.x, 3.0.x, 3.1.x, and 3.2.x at once helping our community to migrate from earlier Apache Spark versions to newer releases without being worried about Spark NLP end of life support. We also extend support for new Databricks and EMR instances on Spark 3.2.x clusters.
This release also comes with a brand new GPT2Transformer using OpenAI GPT-2 models for prediction at scale, new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer annotators to use existing or fine-tuned models for Sequence Classification, new distributed and trainable Word2Vec annotators, new state-of-the-art transformer models in many languages, a new param to useBestModel in NerDL during training, bug fixes, and lots more!
As always, we would like to thank our community for their feedback, questions, and feature requests.
Major features and improvements
- NEW: Introducing GPT2Transformer annotator in Spark NLP for text generation purposes. `GPT2Transformer` uses OpenAI GPT-2 models from HuggingFace for prediction at scale in Spark NLP. `GPT-2` is a transformer model trained on a very large corpus of English data in a self-supervised fashion. This means it was trained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences
- NEW: Introducing RoBertaForSequenceClassification annotator in Spark NLP. `RoBertaForSequenceClassification` can load RoBERTa models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `RobertaForSequenceClassification` for PyTorch or `TFRobertaForSequenceClassification` for TensorFlow models in HuggingFace
- NEW: Introducing XlmRoBertaForSequenceClassification annotator in Spark NLP. `XlmRoBertaForSequenceClassification` can load XLM-RoBERTa models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLMRobertaForSequenceClassification` for PyTorch or `TFXLMRobertaForSequenceClassification` for TensorFlow models in HuggingFace
- NEW: Introducing LongformerForSequenceClassification annotator in Spark NLP. `LongformerForSequenceClassification` can load Longformer models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `LongformerForSequenceClassification` for PyTorch or `TFLongformerForSequenceClassification` for TensorFlow models in HuggingFace
- NEW: Introducing AlbertForSequenceClassification annotator in Spark NLP. `AlbertForSequenceClassification` can load ALBERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `AlbertForSequenceClassification` for PyTorch or `TFAlbertForSequenceClassification` for TensorFlow models in HuggingFace
- NEW: Introducing XlnetForSequenceClassification annotator in Spark NLP. `XlnetForSequenceClassification` can load XLNet models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLNetForSequenceClassification` for PyTorch or `TFXLNetForSequenceClassification` for TensorFlow models in HuggingFace
- NEW: Introducing trainable and distributed Word2Vec annotators based on Word2Vec in Spark ML. You can train Word2Vec in a cluster on multiple machines to handle large-scale datasets and use the trained model for token-level classifications such as NerDL (see the sketch after this item)
- Introducing the `useBestModel` param in the NerDLApproach annotator. This param preserves and restores the model that has achieved the best performance at the end of the training. The priority is metrics from testDataset (micro F1), then metrics from validationSplit (micro F1); if neither is set, it will keep track of loss during the training
- Support Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP by default is shipped for Spark 3.0.x/3.1.x, but now you have `spark-nlp-spark32` and `spark-nlp-gpu-spark32` packages
- Adding a new param to the sparknlp.start() function in Python for Apache Spark 3.2.x (`spark32=True`)
- Update Colab and Kaggle scripts for faster setup. We no longer need to remove Java 11 in order to install Java 8 since Spark NLP works on Java 11. This makes the installation of Spark NLP on Colab and Kaggle as fast as `pip install spark-nlp pyspark==3.1.2`
- Add new scripts/notebook to generate custom TensorFlow graphs for the `ContextSpellCheckerApproach` annotator
- Add a new `graphFolder` param to the `ContextSpellCheckerApproach` annotator. This param allows training ContextSpellChecker from a custom-made TensorFlow graph
- Support DBFS file system in the `graphFolder` param. Starting with Spark NLP 3.4.0 you can point NerDLApproach or ContextSpellCheckerApproach to a TF graph hosted on Databricks
- Add a new feature to all classifiers (`ForTokenClassification` and `ForSequenceClassification`) to retrieve classes from the pretrained models
sequenceClassifier = XlmRoBertaForSequenceClassification \
.pretrained('xlm_roberta_base_sequence_classifier_ag_news', 'en') \
.setInputCols(['token', 'document']) \
.setOutputCol('class')
print(sequenceClassifier.getClasses())
#Sports, Business, World, Sci/Tech
- Add an `inputFormats` param to the DateMatcher and MultiDateMatcher annotators. DateMatcher and MultiDateMatcher can now define a list of acceptable input formats via date patterns to search for in the text. The output format then defines the single output pattern.
date_matcher = DateMatcher() \
    .setInputCols(['document']) \
    .setOutputCol("date") \
    .setInputFormats(["yyyy", "yyyy/dd/MM", "MM/yyyy"]) \
    .setOutputFormat("yyyyMM") \
    .setSourceLanguage("en")
# Note: setOutputFormat was previously called setDateFormat
- Enable batch processing in T5Transformer and MarianTransformer annotators
- Add Schema to `readDataset` in the CoNLL() class
- Welcoming 6x new Databricks runtimes to our Spark NLP family:
- Databricks 10.0
- Databricks 10.0 ML GPU
- Databricks 10.1
- Databricks 10.1 ML GPU
- Databricks 10.2
- Databricks 10.2 ML GPU
- Welcoming 3x new EMR 6.x series to our Spark NLP family:
- EMR 5.33.1 (Apache Spark 2.4.7 / Hadoop 2.10.1)
- EMR 6.3.1 (Apache Spark 3.1.1 / Hadoop 3.2.1)
- EMR 6.4.0 (Apache Spark 3.1.2 / Hadoop 3.2.1)
Bug Fixes
- Fix a race condition in cluster mode when the TF session is accessed as many times as the number of available cores on the driver machine for the very first time. Loading a model multiple times at once results in higher disk usage, and IO may become a bottleneck for larger models, especially on machines with slower disks. Thanks to @jerrychenhf for finding this issue and offering a solution #6575
- Fix a performance issue introduced in the 3.3.3 release for T5Transformer and MarianTransformer annotators. While we added support for ignored tokens, accidentally we introduced a bug that degraded the performance for these two annotators (sometimes up to 2x slower). Please update to 3.4.0 if you are using any of these two annotators #6605
- Fix a bug in model resolution by not filtering based on the timestamp
- Fix configProtoBytes param type in Python #6549
- Fix missing DefaultParamsReadable in RegexTokenizer annotator #6653
- Fix missing models `lemma_antbnc`, `sentiment_vivekn`, and `spellcheck_norvig` for Spark 3.x
- Fix missing pipelines `clean_slang`, `check_spelling`, `match_chunks`, and `match_datetime` for Spark 3.x
- Fix `saveModel` in TrainingHelper
- Fix Keyword/Yake module naming in Scala #6562
Models Hub
Models Hub now comes with new features to easily filter and find your desired models & pipelines by:
- NLP Task
- Natural Language
- Spark NLP version
In addition, you can also filter models & pipelines by:
- Models or Pipelines (finally!)
- Tags used inside Model's card
- Or even by predicted entities (which labels/classes a model can predict)
As always, you can host your own pre-trained models & pipelines easily accessible to you for free & forever!
Models and Pipelines
--------------...
John Snow Labs Spark-NLP 3.3.4: Patch release
Patch release
- Fix `ClassCastException` error in the pretrained function for DistilBertForSequenceClassification in Python #6513
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP publications
- Spark NLP in Action
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==3.3.4
Spark Packages
spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.4
spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.4
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.4
spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.4
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.4
Maven
spark-nlp on Apache Spark 3.0.x and 3.1.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>3.3.4</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>3.3.4</version>
</dependency>
spark-nlp on Apache Spark 2.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark24_2.11</artifactId>
<version>3.3.4</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
<version>3.3.4</version>
</dependency>
spark-nlp on Apache Spark 2.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark23_2.11</artifactId>
<version>3.3.4</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
<version>3.3.4</version>
</dependency>
FAT JARs
-
CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.3.4.jar
-
GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.3.4.jar
-
CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.3.4.jar
-
GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.3.4.jar
-
CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.3.4.jar
-
GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.3.4.jar
What's Changed
- Update documentation of ChunkKeyPhraseExtraction by @vankov in #6508
- Fixes `new` instantiation in the Scala section by @josejuanmartinez in #6469
- Fix the wrong name for DistilBertForSequenceClassification in Python by @maziyarpanahi in #6513
- Release/334 release candidate by @maziyarpanahi in #6514
Full Changelog: 3.3.3...3.3.4