Commit b9a0f6a

SPARKNLP-601 Update docs and Website [run doc]
- Update Gem: github-pages and nokogiri
- Fix the bad merge overriding ViT announcement on the Website
- Update README with new Wav2Vec2, CamemBERT for token classification, and TAPAS Q&A
- Update CHANGELOG
1 parent 4249331 commit b9a0f6a

6 files changed

+112
-62
lines changed

CHANGELOG

+24
@@ -1,3 +1,27 @@
1+
========
2+
4.2.0
3+
========
4+
----------------
5+
New Features & Enhancements
6+
----------------
7+
* **NEW:** Introducing **Wav2Vec2ForCTC** annotator in Spark NLP 🚀. `Wav2Vec2ForCTC` can load `Wav2Vec2` models for the Automatic Speech Recognition (ASR) task. Wav2Vec2 is a multi-modal model that combines speech and text. It is the first multi-modal model of its kind in Spark NLP. This annotator is compatible with all the models trained/fine-tuned by using `Wav2Vec2ForCTC` for **PyTorch** or `TFWav2Vec2ForCTC` for **TensorFlow** models in HuggingFace 🤗 (https://github.com/JohnSnowLabs/spark-nlp/pull/12767)
8+
* **NEW:** Introducing **TapasForQuestionAnswering** annotator in Spark NLP 🚀. `TapasForQuestionAnswering` can load TAPAS Models with a cell selection head and optional aggregation head on top for question-answering tasks on tables (linear layers on top of the hidden-states output to compute logits and optional logits_aggregation), e.g. for SQA, WTQ or WikiSQL-supervised tasks. TAPAS is a BERT-based model specifically designed (and pre-trained) for answering questions about tabular data. This annotator is compatible with all the models trained/fine-tuned by using `TapasForQuestionAnswering` for **PyTorch** or `TFTapasForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
9+
* **NEW:** Introducing **CamemBertForTokenClassification** annotator in Spark NLP 🚀. `CamemBertForTokenClassification` can load CamemBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `CamembertForTokenClassification` for PyTorch or `TFCamembertForTokenClassification` for TensorFlow in HuggingFace 🤗
10+
(https://github.com/JohnSnowLabs/spark-nlp/pull/12752)
11+
* Implementing `setTestDataset` to evaluate metrics on an external dataset during training of Text Classifiers in Spark NLP. As in `NerDLApproach`, metrics are calculated on each epoch. The feature has been added to the following multi-class/multi-label text classifier annotators: `ClassifierDLApproach`, `SentimentDLApproach`, and `MultiClassifierDLApproach` (https://github.com/JohnSnowLabs/spark-nlp/pull/12796)
12+
* Refactoring the `EntityRuler` annotator to make inference up to 24x faster, especially when used with a long list of labels/entities. We speed up the inference process by implementing the Aho-Corasick algorithm to match patterns in a string. This requires changes when using `EntityRuler`; see https://github.com/JohnSnowLabs/spark-nlp/pull/12634
13+
* Add support for S3 storage in the `cache_folder` where models are downloaded, extracted, and loaded from. Previously, only local file systems, HDFS, and DBFS were supported. This new feature is especially useful for users on Kubernetes clusters with no access to HDFS or any other distributed file systems (https://github.com/JohnSnowLabs/spark-nlp/pull/12707)
14+
* Implementing `lookaround` functionalities in `DocumentNormalizer` annotator. Currently, `DocumentNormalizer` has both `lookahead` and `lookbehind` functionalities. To extend support for more complex normalizations, especially within clinical text, we are introducing the `lookaround` feature (https://github.com/JohnSnowLabs/spark-nlp/pull/12735)
15+
* Implementing `setReplaceEntities` param to `NerOverwriter` annotator to replace all the NER labels (entities) with the given new labels (entities) (https://github.com/JohnSnowLabs/spark-nlp/pull/12745)
16+
17+
----------------
18+
Bug Fixes
19+
----------------
20+
* Fix a bug in generating the NerDL graph by using TF v2. The previous graph generated by the `TFGraphBuilder` annotator resulted in an exception when the length of the sequence was 1. This issue has been resolved and the new graphs created by `TFGraphBuilder` won't have this issue anymore (https://github.com/JohnSnowLabs/spark-nlp/pull/12636)
21+
* Fix a bug introduced in the 4.0.0 release in Transformer-based Word Embeddings annotators. In the 4.0.0 release, the following annotators were migrated to BatchAnnotate to improve their performance, especially on GPU: AlbertEmbeddings, CamemBertEmbeddings, DeBertaEmbeddings, DistilBertEmbeddings, LongformerEmbeddings, RoBertaEmbeddings, XlmRoBertaEmbeddings, and XlnetEmbeddings. However, the migration introduced a bug in sentence indices that resulted in low accuracy when these annotators were combined with SentenceEmbeddings for Text Classification tasks (ClassifierDLApproach and SentimentDLApproach) (https://github.com/JohnSnowLabs/spark-nlp/pull/12641)
22+
* Add support for a list of questions and contexts in LightPipeline. Previously, only one context and question at a time were supported in LightPipeline for Question Answering annotators. We have added support to `fullAnnotate` and `annotate` to receive two lists of questions and contexts (https://github.com/JohnSnowLabs/spark-nlp/pull/12653)
23+
* Fix division by zero exception in the `GPT2Transformer` annotator when the `setDoSample` param was set to true (https://github.com/JohnSnowLabs/spark-nlp/pull/12661)
24+
125
========
226
4.1.0
327
========
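
The `EntityRuler` entry above credits its speed-up to the Aho-Corasick algorithm. As a rough illustration of why that helps with long keyword lists, here is a minimal pure-Python sketch of the technique. It is not the Spark NLP implementation; the function names (`build_automaton`, `find_matches`), the keyword-to-label dictionary, and the sample labels are all hypothetical and chosen only for this example.

```python
from collections import deque


def build_automaton(patterns):
    """Build an Aho-Corasick automaton from a {keyword: label} dict (hypothetical input shape)."""
    goto = [{}]   # goto[state][char] -> next state (the trie edges)
    fail = [0]    # fail[state] -> state for the longest proper suffix that is also a trie prefix
    out = [[]]    # out[state] -> (keyword, label) pairs that end at this state

    # Phase 1: insert every keyword into the trie.
    for keyword, label in patterns.items():
        state = 0
        for ch in keyword:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append((keyword, label))

    # Phase 2: breadth-first pass to compute failure links and merge outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Keywords ending at the failure target also end here.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_matches(text, automaton):
    """Scan text once and return (start, end, keyword, label) for every occurrence."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        # Follow failure links until some state accepts ch, or we reach the root.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for keyword, label in out[state]:
            matches.append((i - len(keyword) + 1, i + 1, keyword, label))
    return matches


# Hypothetical keywords and labels, for illustration only.
automaton = build_automaton({"he": "A", "she": "B", "hers": "C", "his": "D"})
matches = find_matches("ushers", automaton)
```

The text is scanned once regardless of how many keywords are loaded, which is the property that makes this approach attractive when the list of labels/entities is long.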

README.md

+10-2
@@ -17,8 +17,8 @@
1717
</p>
1818

1919
Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides **simple**, **performant** & **accurate** NLP annotations for machine learning pipelines that **scale** easily in a distributed environment.
20-
Spark NLP comes with **8000+** pretrained **pipelines** and **models** in more than **200+** languages.
21-
It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of-Speech Tagging**, Word and Sentence **Embeddings**, **Named Entity Recognition**, **Dependency Parsing**, **Spell Checking**, **Text Classification**, **Sentiment Analysis**, **Token Classification**, **Machine Translation** (+180 languages), **Summarization** & **Question Answering**, **Text Generation**, **Image Classification**, and many more [NLP tasks](#features).
20+
Spark NLP comes with **14000+** pretrained **pipelines** and **models** in more than **200+** languages.
21+
It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of-Speech Tagging**, Word and Sentence **Embeddings**, **Named Entity Recognition**, **Dependency Parsing**, **Spell Checking**, **Text Classification**, **Sentiment Analysis**, **Token Classification**, **Machine Translation** (+180 languages), **Summarization**, **Question Answering**, **Table Question Answering**, **Text Generation**, **Image Classification**, **Automatic Speech Recognition**, and many more [NLP tasks](#features).
2222

2323
**Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Google T5**, **MarianMT**, **GPT2**, and **Vision Transformers (ViT)** not only to **Python** and **R**, but also to **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.
2424

@@ -115,6 +115,7 @@ Take a look at our official Spark NLP page: [http://nlp.johnsnowlabs.com/](http:
115115
- Multi-class Text Classification (Deep learning)
116116
- BERT for Token & Sequence Classification
117117
- DistilBERT for Token & Sequence Classification
118+
- CamemBERT for Token & Sequence Classification
118119
- ALBERT for Token & Sequence Classification
119120
- RoBERTa for Token & Sequence Classification
120121
- DeBERTa for Token & Sequence Classification
@@ -128,10 +129,12 @@ Take a look at our official Spark NLP page: [http://nlp.johnsnowlabs.com/](http:
128129
- DeBERTa for Question Answering
129130
- XLM-RoBERTa for Question Answering
130131
- Longformer for Question Answering
132+
- Table Question Answering (TAPAS)
131133
- Neural Machine Translation (MarianMT)
132134
- Text-To-Text Transfer Transformer (Google T5)
133135
- Generative Pre-trained Transformer 2 (OpenAI GPT2)
134136
- Vision Transformer (ViT)
137+
- Automatic Speech Recognition (Wav2Vec2)
135138
- Named entity recognition (Deep learning)
136139
- Easy TensorFlow integration
137140
- GPU Support
@@ -214,6 +217,7 @@ Spark NLP *4.2.0* has been built on top of Apache Spark 3.2 while fully supports
214217

215218
| Spark NLP | Apache Spark 2.3.x | Apache Spark 2.4.x | Apache Spark 3.0.x | Apache Spark 3.1.x | Apache Spark 3.2.x | Apache Spark 3.3.x |
216219
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
220+
| 4.2.x | NO | NO | YES | YES | YES | YES |
217221
| 4.1.x | NO | NO | YES | YES | YES | YES |
218222
| 4.0.x | NO | NO | YES | YES | YES | YES |
219223
| 3.4.x | YES | YES | YES | YES | Partially | N/A |
@@ -231,6 +235,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github
231235

232236
| Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Scala 2.11 | Scala 2.12 |
233237
|-----------|------------|------------|------------|------------|------------|------------|
238+
| 4.2.x | YES | YES | YES | YES | NO | YES |
234239
| 4.1.x | YES | YES | YES | YES | NO | YES |
235240
| 4.0.x | YES | YES | YES | YES | NO | YES |
236241
| 3.4.x | YES | YES | YES | YES | YES | YES |
@@ -264,6 +269,8 @@ Spark NLP 4.2.0 has been tested and is compatible with the following runtimes:
264269
- 11.0 ML
265270
- 11.1
266271
- 11.1 ML
272+
- 11.2
273+
- 11.2 ML
267274

268275
**GPU:**
269276

@@ -275,6 +282,7 @@ Spark NLP 4.2.0 has been tested and is compatible with the following runtimes:
275282
- 10.5 ML & GPU
276283
- 11.0 ML & GPU
277284
- 11.1 ML & GPU
285+
- 11.2 ML & GPU
278286

279287
NOTE: Spark NLP 4.0.x is based on TensorFlow 2.7.x which is compatible with CUDA11 and cuDNN 8.0.2. The only Databricks runtimes supporting CUDA 11 are 9.x and above as listed under GPU.
280288

docs/Gemfile

+2-2
@@ -1,7 +1,7 @@
11
source "https://rubygems.org"
22

3-
gem "github-pages", "225"
4-
gem "nokogiri", ">= 1.13.2"
3+
gem "github-pages", "227"
4+
gem "nokogiri", ">= 1.13.8"
55

66
gem "elasticsearch", "~> 7.10"
77

docs/Gemfile.lock

+50-44
@@ -1,63 +1,69 @@
11
GEM
22
remote: https://rubygems.org/
33
specs:
4-
activesupport (6.0.4.7)
4+
activesupport (6.0.6)
55
concurrent-ruby (~> 1.0, >= 1.0.2)
66
i18n (>= 0.7, < 2)
77
minitest (~> 5.1)
88
tzinfo (~> 1.1)
99
zeitwerk (~> 2.2, >= 2.2.2)
10-
addressable (2.8.0)
11-
public_suffix (>= 2.0.2, < 5.0)
10+
addressable (2.8.1)
11+
public_suffix (>= 2.0.2, < 6.0)
1212
coffee-script (2.4.1)
1313
coffee-script-source
1414
execjs
1515
coffee-script-source (1.11.1)
1616
colorator (1.1.0)
17-
commonmarker (0.23.4)
18-
concurrent-ruby (1.1.9)
17+
commonmarker (0.23.6)
18+
concurrent-ruby (1.1.10)
1919
dnsruby (1.61.9)
2020
simpleidn (~> 0.1)
21-
elasticsearch (7.13.3)
22-
elasticsearch-api (= 7.13.3)
23-
elasticsearch-transport (= 7.13.3)
24-
elasticsearch-api (7.13.3)
21+
elasticsearch (7.17.1)
22+
elasticsearch-api (= 7.17.1)
23+
elasticsearch-transport (= 7.17.1)
24+
elasticsearch-api (7.17.1)
2525
multi_json
26-
elasticsearch-transport (7.13.3)
26+
elasticsearch-transport (7.17.1)
2727
faraday (~> 1)
2828
multi_json
29-
em-websocket (0.5.2)
29+
em-websocket (0.5.3)
3030
eventmachine (>= 0.12.9)
31-
http_parser.rb (~> 0.6.0)
31+
http_parser.rb (~> 0)
3232
ethon (0.15.0)
3333
ffi (>= 1.15.0)
3434
eventmachine (1.2.7)
3535
eventmachine (1.2.7-x64-mingw32)
3636
execjs (2.8.1)
37-
faraday (1.5.1)
37+
faraday (1.10.2)
3838
faraday-em_http (~> 1.0)
3939
faraday-em_synchrony (~> 1.0)
4040
faraday-excon (~> 1.1)
41-
faraday-httpclient (~> 1.0.1)
41+
faraday-httpclient (~> 1.0)
42+
faraday-multipart (~> 1.0)
4243
faraday-net_http (~> 1.0)
43-
faraday-net_http_persistent (~> 1.1)
44+
faraday-net_http_persistent (~> 1.0)
4445
faraday-patron (~> 1.0)
45-
multipart-post (>= 1.2, < 3)
46+
faraday-rack (~> 1.0)
47+
faraday-retry (~> 1.0)
4648
ruby2_keywords (>= 0.0.4)
4749
faraday-em_http (1.0.0)
4850
faraday-em_synchrony (1.0.0)
4951
faraday-excon (1.1.0)
5052
faraday-httpclient (1.0.1)
53+
faraday-multipart (1.0.4)
54+
multipart-post (~> 2)
5155
faraday-net_http (1.0.1)
5256
faraday-net_http_persistent (1.2.0)
5357
faraday-patron (1.0.0)
54-
ffi (1.15.4)
55-
ffi (1.15.4-x64-mingw32)
58+
faraday-rack (1.0.0)
59+
faraday-retry (1.0.3)
60+
ffi (1.15.5)
61+
ffi (1.15.5-x64-mingw32)
5662
forwardable-extended (2.6.0)
5763
gemoji (3.0.1)
58-
github-pages (225)
64+
github-pages (227)
5965
github-pages-health-check (= 1.17.9)
60-
jekyll (= 3.9.0)
66+
jekyll (= 3.9.2)
6167
jekyll-avatar (= 0.7.0)
6268
jekyll-coffeescript (= 1.1.1)
6369
jekyll-commonmark-ghpages (= 0.2.0)
@@ -92,12 +98,12 @@ GEM
9298
jekyll-theme-time-machine (= 0.2.0)
9399
jekyll-titles-from-headings (= 0.5.3)
94100
jemoji (= 0.12.0)
95-
kramdown (= 2.3.1)
101+
kramdown (= 2.3.2)
96102
kramdown-parser-gfm (= 1.1.0)
97103
liquid (= 4.0.3)
98104
mercenary (~> 0.3)
99105
minima (= 2.5.1)
100-
nokogiri (>= 1.12.5, < 2.0)
106+
nokogiri (>= 1.13.6, < 2.0)
101107
rouge (= 3.26.0)
102108
terminal-table (~> 1.4)
103109
github-pages-health-check (1.17.9)
@@ -106,13 +112,13 @@ GEM
106112
octokit (~> 4.0)
107113
public_suffix (>= 3.0, < 5.0)
108114
typhoeus (~> 1.3)
109-
html-pipeline (2.14.0)
115+
html-pipeline (2.14.2)
110116
activesupport (>= 2)
111117
nokogiri (>= 1.4)
112-
http_parser.rb (0.6.0)
118+
http_parser.rb (0.8.0)
113119
i18n (0.9.5)
114120
concurrent-ruby (~> 1.0)
115-
jekyll (3.9.0)
121+
jekyll (3.9.2)
116122
addressable (~> 2.4)
117123
colorator (~> 1.0)
118124
em-websocket (~> 0.5)
@@ -220,12 +226,12 @@ GEM
220226
gemoji (~> 3.0)
221227
html-pipeline (~> 2.2)
222228
jekyll (>= 3.0, < 5.0)
223-
kramdown (2.3.1)
229+
kramdown (2.3.2)
224230
rexml
225231
kramdown-parser-gfm (1.1.0)
226232
kramdown (~> 2.0)
227233
liquid (4.0.3)
228-
listen (3.7.0)
234+
listen (3.7.1)
229235
rb-fsevent (~> 0.10, >= 0.10.3)
230236
rb-inotify (~> 0.9, >= 0.9.10)
231237
mercenary (0.3.6)
@@ -234,22 +240,22 @@ GEM
234240
jekyll (>= 3.5, < 5.0)
235241
jekyll-feed (~> 0.9)
236242
jekyll-seo-tag (~> 2.1)
237-
minitest (5.15.0)
243+
minitest (5.16.3)
238244
multi_json (1.15.0)
239-
multipart-post (2.1.1)
240-
nokogiri (1.13.3)
245+
multipart-post (2.2.3)
246+
nokogiri (1.13.8)
241247
mini_portile2 (~> 2.8.0)
242248
racc (~> 1.4)
243-
nokogiri (1.13.3-x64-mingw32)
249+
nokogiri (1.13.8-x64-mingw32)
244250
racc (~> 1.4)
245-
octokit (4.22.0)
246-
faraday (>= 0.9)
247-
sawyer (~> 0.8.0, >= 0.5.3)
251+
octokit (4.25.1)
252+
faraday (>= 1, < 3)
253+
sawyer (~> 0.9)
248254
pathutil (0.16.2)
249255
forwardable-extended (~> 2.6)
250-
public_suffix (4.0.6)
256+
public_suffix (4.0.7)
251257
racc (1.6.0)
252-
rb-fsevent (0.11.0)
258+
rb-fsevent (0.11.2)
253259
rb-inotify (0.10.1)
254260
ffi (~> 1.0)
255261
rexml (3.2.5)
@@ -262,36 +268,36 @@ GEM
262268
sass-listen (4.0.0)
263269
rb-fsevent (~> 0.9, >= 0.9.4)
264270
rb-inotify (~> 0.9, >= 0.9.7)
265-
sawyer (0.8.2)
271+
sawyer (0.9.2)
266272
addressable (>= 2.3.5)
267-
faraday (> 0.8, < 2.0)
273+
faraday (>= 0.17.3, < 3)
268274
simpleidn (0.2.1)
269275
unf (~> 0.1.4)
270276
terminal-table (1.8.0)
271277
unicode-display_width (~> 1.1, >= 1.1.1)
272278
thread_safe (0.3.6)
273279
typhoeus (1.4.0)
274280
ethon (>= 0.9.0)
275-
tzinfo (1.2.9)
281+
tzinfo (1.2.10)
276282
thread_safe (~> 0.1)
277283
unf (0.1.4)
278284
unf_ext
279-
unf_ext (0.0.8)
280-
unf_ext (0.0.8-x64-mingw32)
285+
unf_ext (0.0.8.2)
286+
unf_ext (0.0.8.2-x64-mingw32)
281287
unicode-display_width (1.8.0)
282288
wdm (0.1.1)
283289
webrick (1.7.0)
284-
zeitwerk (2.5.4)
290+
zeitwerk (2.6.0)
285291

286292
PLATFORMS
287293
ruby
288294
x64-mingw32
289295

290296
DEPENDENCIES
291297
elasticsearch (~> 7.10)
292-
github-pages (= 225)
298+
github-pages (= 227)
293299
jekyll (~> 3.9)
294-
nokogiri (>= 1.13.2)
300+
nokogiri (>= 1.13.8)
295301
wdm (~> 0.1.0)
296302
webrick
297303
