
Commit e41cf90

Merge pull request #13036 from JohnSnowLabs/release/423-release-candidate

Release/423 release candidate

2 parents: 84464d9 + 9b6869b

File tree: 1,415 files changed (+6711, -4653 lines)


CHANGELOG (+31)
@@ -1,3 +1,34 @@
+========
+4.2.3
+========
+----------------
+New Features & Enhancements
+----------------
+* Implement a new control over the number of accepted columns in Python. This syncs the behavior between Scala and Python when a user sets more columns than allowed in setInputCols
+* Add a metadata sentence key parameter to select which metadata field to use as the sentence for the CoNLLGenerator annotator
+* Include escaping in the CoNLLGenerator annotator when writing to CSV, and preserve special-character tokens
+* Add documentation for the new `IAnnotation` feature for Scala users
+* Add rules and delimiter parameters to the RegexMatcher annotator to support a string as input in addition to a file
+```python
+regexMatcher = RegexMatcher() \
+    .setRules(["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]) \
+    .setDelimiter(",") \
+    .setInputCols(["sentence"]) \
+    .setOutputCol("regex") \
+    .setStrategy("MATCH_ALL")
+```
+
+----------------
+Bug Fixes
+----------------
+* Fix NotSerializableException when WordEmbeddings is used over a K8s cluster while `setEnableInMemoryStorage` is set to `true`
+* Fix a bug in the RegexTokenizer annotator where it output wrong indexes if the pattern includes splits that are not followed by a space
+* Fix training modules failing on EMR due to bad Apache Spark version detection. The following classes were fixed: `CoNLL()`, `CoNLLU()`, `POS()`, and `PubTator()`
+* Fix a bug in the CoNLLGenerator annotator where a token has non-integer metadata
+* Fix the wrong SentencePiece model name required for DeBertaForQuestionAnswering and DeBertaEmbeddings when importing models
+* Fix `NaN` results in some ViTForImageClassification models/pipelines
+
 ========
 4.2.2
 ========
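To see the new string-based rules in context, here is a minimal end-to-end sketch built around the snippet in the changelog entry above. The RegexMatcher settings are taken from the diff; the sample text, DataFrame, and surrounding pipeline stages are illustrative assumptions, not part of this commit.

```python
# A runnable sketch, assuming spark-nlp==4.2.3 and pyspark are installed.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, RegexMatcher
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Standard stages feeding a "sentence" column into the matcher.
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")

# Settings copied from the changelog snippet: rules passed as strings,
# each "pattern,identifier" pair split on the "," delimiter.
regexMatcher = RegexMatcher() \
    .setRules(["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]) \
    .setDelimiter(",") \
    .setInputCols(["sentence"]) \
    .setOutputCol("regex") \
    .setStrategy("MATCH_ALL")

pipeline = Pipeline(stages=[document, sentence, regexMatcher])
data = spark.createDataFrame([["The deadline was 2022/11/21, then moved to 22/11/28."]]).toDF("text")
pipeline.fit(data).transform(data).select("regex.result").show(truncate=False)
```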

README.md (+44, -44)
@@ -152,7 +152,7 @@ To use Spark NLP you need the following requirements:
 
 **GPU (optional):**
 
-Spark NLP 4.2.2 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:
+Spark NLP 4.2.3 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:
 
 - NVIDIA® GPU drivers version 450.80.02 or higher
 - CUDA® Toolkit 11.2
@@ -168,7 +168,7 @@ $ java -version
 $ conda create -n sparknlp python=3.7 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==4.2.2 pyspark==3.2.1
+$ pip install spark-nlp==4.2.3 pyspark==3.2.1
 ```
 
 In Python console or Jupyter `Python3` kernel:
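That context line introduces a quickstart that falls outside the hunk; for reference, a minimal sketch of what a first session typically looks like (the pipeline name and sample text are illustrative, not from this commit):

```python
# A quickstart sketch, assuming spark-nlp==4.2.3 and pyspark==3.2.1 from the
# pip command above, plus internet access to fetch the pretrained pipeline.
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()
pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("John Snow Labs released Spark NLP 4.2.3.")
print(result["entities"])
```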
@@ -213,7 +213,7 @@ For more examples, you can visit our dedicated [repository](https://github.com/J
 
 ## Apache Spark Support
 
-Spark NLP *4.2.2* has been built on top of Apache Spark 3.2 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:
+Spark NLP *4.2.3* has been built on top of Apache Spark 3.2 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:
 
 | Spark NLP | Apache Spark 2.3.x | Apache Spark 2.4.x | Apache Spark 3.0.x | Apache Spark 3.1.x | Apache Spark 3.2.x | Apache Spark 3.3.x |
 |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -247,7 +247,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github
 
 ## Databricks Support
 
-Spark NLP 4.2.2 has been tested and is compatible with the following runtimes:
+Spark NLP 4.2.3 has been tested and is compatible with the following runtimes:
 
 **CPU:**
 
@@ -288,7 +288,7 @@ NOTE: Spark NLP 4.0.x is based on TensorFlow 2.7.x which is compatible with CUDA
 
 ## EMR Support
 
-Spark NLP 4.2.2 has been tested and is compatible with the following EMR releases:
+Spark NLP 4.2.3 has been tested and is compatible with the following EMR releases:
 
 - emr-6.2.0
 - emr-6.3.0
@@ -326,23 +326,23 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x,
 ```sh
 # CPU
 
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 ```
 
 The `spark-nlp` has been published to the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp).
 
 ```sh
 # GPU
 
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.3
 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.3
 
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.3
 
 ```
 
@@ -351,11 +351,11 @@ The `spark-nlp-gpu` has been published to the [Maven Repository](https://mvnrepo
 ```sh
 # AArch64
 
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.3
 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.3
 
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.3
 
 ```
 
@@ -364,11 +364,11 @@ The `spark-nlp-aarch64` has been published to the [Maven Repository](https://mvn
 ```sh
 # M1
 
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3
 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3
 
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3
 
 ```
 
@@ -380,7 +380,7 @@ The `spark-nlp-m1` has been published to the [Maven Repository](https://mvnrepos
 spark-shell \
   --driver-memory 16g \
   --conf spark.kryoserializer.buffer.max=2000M \
-  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 ```
 
 ## Scala
@@ -396,7 +396,7 @@ Spark NLP supports Scala 2.12.15 if you are using Apache Spark 3.0.x, 3.1.x, 3.2
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp_2.12</artifactId>
-    <version>4.2.2</version>
+    <version>4.2.3</version>
 </dependency>
 ```
 
@@ -407,7 +407,7 @@ Spark NLP supports Scala 2.12.15 if you are using Apache Spark 3.0.x, 3.1.x, 3.2
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp-gpu_2.12</artifactId>
-    <version>4.2.2</version>
+    <version>4.2.3</version>
 </dependency>
 ```
 
@@ -418,7 +418,7 @@ Spark NLP supports Scala 2.12.15 if you are using Apache Spark 3.0.x, 3.1.x, 3.2
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp-aarch64_2.12</artifactId>
-    <version>4.2.2</version>
+    <version>4.2.3</version>
 </dependency>
 ```
 
@@ -429,7 +429,7 @@ Spark NLP supports Scala 2.12.15 if you are using Apache Spark 3.0.x, 3.1.x, 3.2
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp-m1_2.12</artifactId>
-    <version>4.2.2</version>
+    <version>4.2.3</version>
 </dependency>
 ```
 
@@ -439,28 +439,28 @@ Spark NLP supports Scala 2.12.15 if you are using Apache Spark 3.0.x, 3.1.x, 3.2
 
 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "4.2.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "4.2.3"
 ```
 
 **spark-nlp-gpu:**
 
 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "4.2.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "4.2.3"
 ```
 
 **spark-nlp-aarch64:**
 
 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "4.2.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "4.2.3"
 ```
 
 **spark-nlp-m1:**
 
 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-m1
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-m1" % "4.2.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-m1" % "4.2.3"
 ```
 
 Maven Central: [https://mvnrepository.com/artifact/com.johnsnowlabs.nlp](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp)
@@ -480,7 +480,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through
 Pip:
 
 ```bash
-pip install spark-nlp==4.2.2
+pip install spark-nlp==4.2.3
 ```
 
 Conda:
@@ -508,7 +508,7 @@ spark = SparkSession.builder \
     .config("spark.driver.memory","16G")\
     .config("spark.driver.maxResultSize", "0") \
     .config("spark.kryoserializer.buffer.max", "2000M")\
-    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2")\
+    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3")\
     .getOrCreate()
 ```
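As an aside, `sparknlp.start()` is a shortcut that builds a session comparable to the `SparkSession.builder` block in the hunk above and pulls in the matching Maven package; a small sketch:

```python
# A minimal sketch: start() applies similar memory/serializer settings and
# fetches the Maven package matching the installed spark-nlp version.
import sparknlp

spark = sparknlp.start()   # sparknlp.start(gpu=True) selects spark-nlp-gpu
print(sparknlp.version())  # should report 4.2.3 for this release
print(spark.version)
```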

@@ -576,7 +576,7 @@ Use either one of the following options
 - Add the following Maven Coordinates to the interpreter's library list
 
 ```bash
-com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 ```
 
 - Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is available to driver path
@@ -586,7 +586,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
 Apart from the previous step, install the python module through pip
 
 ```bash
-pip install spark-nlp==4.2.2
+pip install spark-nlp==4.2.3
 ```
 
 Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -611,7 +611,7 @@ The easiest way to get this done on Linux and macOS is to simply install `spark-
 $ conda create -n sparknlp python=3.8 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==4.2.2 pyspark==3.2.1 jupyter
+$ pip install spark-nlp==4.2.3 pyspark==3.2.1 jupyter
 $ jupyter notebook
 ```
 
@@ -627,7 +627,7 @@ export PYSPARK_PYTHON=python3
 export PYSPARK_DRIVER_PYTHON=jupyter
 export PYSPARK_DRIVER_PYTHON_OPTS=notebook
 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 ```
 
 Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp`
@@ -652,7 +652,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
 # -s is for spark-nlp
 # -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage
 # by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.1 -s 4.2.2
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.1 -s 4.2.3
 ```
 
 [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/quick_start_google_colab.ipynb) is a live demo on Google Colab that performs named entity recognitions and sentiment analysis by using Spark NLP pretrained pipelines.
@@ -673,7 +673,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
 # -s is for spark-nlp
 # -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage
 # by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.1 -s 4.2.2
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.1 -s 4.2.3
 ```
 
 [Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP pretrained pipeline.
@@ -691,9 +691,9 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
 
 3. In `Libraries` tab inside your cluster you need to follow these steps:
 
-    3.1. Install New -> PyPI -> `spark-nlp==4.2.2` -> Install
+    3.1. Install New -> PyPI -> `spark-nlp==4.2.3` -> Install
 
-    3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2` -> Install
+    3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3` -> Install
 
 4. Now you can attach your notebook to the cluster and use Spark NLP!
 
@@ -741,7 +741,7 @@ A sample of your software configuration in JSON on S3 (must be public access):
         "spark.kryoserializer.buffer.max": "2000M",
         "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
         "spark.driver.maxResultSize": "0",
-        "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2"
+        "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3"
     }
 }]
 ```
@@ -750,7 +750,7 @@ A sample of AWS CLI to launch EMR cluster:
 
 ```.sh
 aws emr create-cluster \
---name "Spark NLP 4.2.2" \
+--name "Spark NLP 4.2.3" \
 --release-label emr-6.2.0 \
 --applications Name=Hadoop Name=Spark Name=Hive \
 --instance-type m4.4xlarge \
@@ -814,7 +814,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
   --enable-component-gateway \
   --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
   --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
-  --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+  --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 ```
 
 2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI.
@@ -853,7 +853,7 @@ spark = SparkSession.builder \
     .config("spark.kryoserializer.buffer.max", "2000m") \
     .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") \
    .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage") \
-    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2") \
+    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3") \
     .getOrCreate()
 ```
 
@@ -867,7 +867,7 @@ spark-shell \
   --conf spark.kryoserializer.buffer.max=2000M \
   --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
   --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
-  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 ```
 
 **pyspark:**
@@ -880,7 +880,7 @@ pyspark \
   --conf spark.kryoserializer.buffer.max=2000M \
   --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
   --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
-  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 ```
 
 **Databricks:**
@@ -1144,12 +1144,12 @@ spark = SparkSession.builder \
     .config("spark.driver.memory","16G")\
     .config("spark.driver.maxResultSize", "0") \
     .config("spark.kryoserializer.buffer.max", "2000M")\
-    .config("spark.jars", "/tmp/spark-nlp-assembly-4.2.2.jar")\
+    .config("spark.jars", "/tmp/spark-nlp-assembly-4.2.3.jar")\
     .getOrCreate()
 ```
 
 - You can download provided Fat JARs from each [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases), please pay attention to pick the one that suits your environment depending on the device (CPU/GPU) and Apache Spark version (3.0.x, 3.1.x, 3.2.x, and 3.3.x)
-- If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (i.e., `hdfs:///tmp/spark-nlp-assembly-4.2.2.jar`)
+- If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (i.e., `hdfs:///tmp/spark-nlp-assembly-4.2.3.jar`)
 
 Example of using pretrained Models and Pipelines in offline:
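Picking up that last context line, a sketch of offline loading once a model archive has been downloaded from the Models Hub and unpacked; the local path and directory name are hypothetical, and `.load()` is the standard Spark ML reader on model classes:

```python
# Offline-load sketch: no network access needed once the directory exists.
from sparknlp.annotator import WordEmbeddingsModel

embeddings = WordEmbeddingsModel.load("/tmp/glove_100d_en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
```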

build.sbt (+1, -1)

@@ -6,7 +6,7 @@ name := getPackageName(is_m1, is_gpu, is_aarch64)
 
 organization := "com.johnsnowlabs.nlp"
 
-version := "4.2.2"
+version := "4.2.3"
 
 (ThisBuild / scalaVersion) := scalaVer
 
conda/meta.yaml (+4, -4)

@@ -1,15 +1,15 @@
 package:
   name: "spark-nlp"
-  version: 4.2.2
+  version: 4.2.3
 
 app:
   entry: spark-nlp
   summary: Natural Language Understanding Library for Apache Spark.
 
 source:
-  fn: spark-nlp-4.2.2.tar.gz
-  url: https://files.pythonhosted.org/packages/78/7e/1ed94f903c0dfe0e6d4900bf61d0210cb39dadf918c7a21f9cfdf924fc50/spark-nlp-4.2.2.tar.gz
-  sha256: 276abca3fc807a4dd0ffa5a299f11359c402670ad20166a01dd7ff6392719f65
+  fn: spark-nlp-4.2.3.tar.gz
+  url: https://files.pythonhosted.org/packages/09/81/d5644f8ff89839da85b1ef70cf38ca0cab2fc12041d724c5e794a47c14f5/spark-nlp-4.2.3.tar.gz
+  sha256: 430aa24d0e325138140ef92b6e7c4c797838e9eee14aaf636af08276956549f9
 build:
   noarch: generic
   number: 0
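As a quick check, the new `url` and `sha256` above can be verified locally; a sketch using only the Python standard library, with the URL and hash copied from the meta.yaml lines in this diff:

```python
import hashlib
import urllib.request

URL = ("https://files.pythonhosted.org/packages/09/81/"
       "d5644f8ff89839da85b1ef70cf38ca0cab2fc12041d724c5e794a47c14f5/"
       "spark-nlp-4.2.3.tar.gz")
EXPECTED = "430aa24d0e325138140ef92b6e7c4c797838e9eee14aaf636af08276956549f9"

data = urllib.request.urlopen(URL).read()
digest = hashlib.sha256(data).hexdigest()
assert digest == EXPECTED, f"checksum mismatch: {digest}"
print("spark-nlp-4.2.3.tar.gz matches the pinned sha256")
```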
