
Commit e41cf90

Merge pull request #13036 from JohnSnowLabs/release/423-release-candidate

Release/423 release candidate

2 parents: 84464d9 + 9b6869b

File tree: 1,415 files changed (+6711, -4653 lines)


CHANGELOG (+31)
@@ -1,3 +1,34 @@
+========
+4.2.3
+========
+----------------
+New Features & Enhancements
+----------------
+* Implement a new control over the number of accepted columns in Python. This syncs the behavior between Scala and Python when a user sets more columns than allowed in setInputCols
+* Add a metadata sentence key parameter to select which metadata field to use as the sentence for the CoNLLGenerator annotator
+* Include escaping in the CoNLLGenerator annotator when writing to CSV, and preserve special-character tokens
+* Add documentation for the new `IAnnotation` feature for Scala users
+* Add rules and delimiter parameters to the RegexMatcher annotator to support a string as input in addition to a file
+```python
+regexMatcher = RegexMatcher() \
+    .setRules(["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]) \
+    .setDelimiter(",") \
+    .setInputCols(["sentence"]) \
+    .setOutputCol("regex") \
+    .setStrategy("MATCH_ALL")
+```
+
+----------------
+Bug Fixes
+----------------
+* Fix NotSerializableException when WordEmbeddings is used over a K8s cluster while `setEnableInMemoryStorage` is set to `true`
+* Fix a bug in the RegexTokenizer annotator where it output wrong indexes if the pattern includes splits that are not followed by a space
+* Fix training modules failing on EMR due to bad Apache Spark version detection. The following classes were fixed: `CoNLL()`, `CoNLLU()`, `POS()`, and `PubTator()`
+* Fix a bug in the CoNLLGenerator annotator where a token has non-integer metadata
+* Fix the wrong SentencePiece model name required for DeBertaForQuestionAnswering and DeBertaEmbeddings when importing models
+* Fix `NaN` results in some ViTForImageClassification models/pipelines
+
 ========
 4.2.2
 ========
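To see the new string-based rules in context, here is a minimal end-to-end sketch built around the snippet in the changelog entry above. The RegexMatcher settings are taken from the diff; the sample text, DataFrame, and surrounding pipeline stages are illustrative assumptions, not part of this commit.

```python
# A runnable sketch, assuming spark-nlp==4.2.3 and pyspark are installed.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, RegexMatcher
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Standard stages feeding a "sentence" column into the matcher.
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")

# Settings copied from the changelog snippet: rules passed as strings,
# each "pattern,identifier" pair split on the "," delimiter.
regexMatcher = RegexMatcher() \
    .setRules(["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]) \
    .setDelimiter(",") \
    .setInputCols(["sentence"]) \
    .setOutputCol("regex") \
    .setStrategy("MATCH_ALL")

pipeline = Pipeline(stages=[document, sentence, regexMatcher])
data = spark.createDataFrame([["The deadline was 2022/11/21, then moved to 22/11/28."]]).toDF("text")
pipeline.fit(data).transform(data).select("regex.result").show(truncate=False)
```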

README.md (+44, -44)
@@ -152,7 +152,7 @@ To use Spark NLP you need the following requirements:
 
 **GPU (optional):**
 
-Spark NLP 4.2.2 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:
+Spark NLP 4.2.3 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:
 
 - NVIDIA® GPU drivers version 450.80.02 or higher
 - CUDA® Toolkit 11.2
@@ -168,7 +168,7 @@ $ java -version
 $ conda create -n sparknlp python=3.7 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==4.2.2 pyspark==3.2.1
+$ pip install spark-nlp==4.2.3 pyspark==3.2.1
 ```
 
 In Python console or Jupyter `Python3` kernel:
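That context line introduces a quickstart that falls outside the hunk; for reference, a minimal sketch of what a first session typically looks like (the pipeline name and sample text are illustrative, not from this commit):

```python
# A quickstart sketch, assuming spark-nlp==4.2.3 and pyspark==3.2.1 from the
# pip command above, plus internet access to fetch the pretrained pipeline.
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()
pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("John Snow Labs released Spark NLP 4.2.3.")
print(result["entities"])
```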
@@ -213,7 +213,7 @@ For more examples, you can visit our dedicated [repository](https://github.com/J
 
 ## Apache Spark Support
 
-Spark NLP *4.2.2* has been built on top of Apache Spark 3.2 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:
+Spark NLP *4.2.3* has been built on top of Apache Spark 3.2 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:
 
 | Spark NLP | Apache Spark 2.3.x | Apache Spark 2.4.x | Apache Spark 3.0.x | Apache Spark 3.1.x | Apache Spark 3.2.x | Apache Spark 3.3.x |
 |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -247,7 +247,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github
 
 ## Databricks Support
 
-Spark NLP 4.2.2 has been tested and is compatible with the following runtimes:
+Spark NLP 4.2.3 has been tested and is compatible with the following runtimes:
 
 **CPU:**
 
@@ -288,7 +288,7 @@ NOTE: Spark NLP 4.0.x is based on TensorFlow 2.7.x which is compatible with CUDA
 
 ## EMR Support
 
-Spark NLP 4.2.2 has been tested and is compatible with the following EMR releases:
+Spark NLP 4.2.3 has been tested and is compatible with the following EMR releases:
 
 - emr-6.2.0
 - emr-6.3.0
@@ -326,23 +326,23 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x,
 ```sh
 # CPU
 
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 ```
 
 The `spark-nlp` has been published to the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp).
 
 ```sh
 # GPU
 
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.3
 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.3
 
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.3
 
 ```
 
@@ -351,11 +351,11 @@ The `spark-nlp-gpu` has been published to the [Maven Repository](https://mvnrepo
 ```sh
 # AArch64
 
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.3
 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.3
 
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.3
 
 ```
 
@@ -364,11 +364,11 @@ The `spark-nlp-aarch64` has been published to the [Maven Repository](https://mvn
 ```sh
 # M1
 
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.2
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3
 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3
 
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.2
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3
 
 ```
 
@@ -380,7 +380,7 @@ The `spark-nlp-m1` has been published to the [Maven Repository](https://mvnrepos
 spark-shell \
   --driver-memory 16g \
   --conf spark.kryoserializer.buffer.max=2000M \
-  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 ```
 
 ## Scala
@@ -396,7 +396,7 @@ Spark NLP supports Scala 2.12.15 if you are using Apache Spark 3.0.x, 3.1.x, 3.2
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp_2.12</artifactId>
-    <version>4.2.2</version>
+    <version>4.2.3</version>
 </dependency>
 ```
 
@@ -407,7 +407,7 @@ Spark NLP supports Scala 2.12.15 if you are using Apache Spark 3.0.x, 3.1.x, 3.2
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp-gpu_2.12</artifactId>
-    <version>4.2.2</version>
+    <version>4.2.3</version>
 </dependency>
 ```
 
@@ -418,7 +418,7 @@ Spark NLP supports Scala 2.12.15 if you are using Apache Spark 3.0.x, 3.1.x, 3.2
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp-aarch64_2.12</artifactId>
-    <version>4.2.2</version>
+    <version>4.2.3</version>
 </dependency>
 ```
 
@@ -429,7 +429,7 @@ Spark NLP supports Scala 2.12.15 if you are using Apache Spark 3.0.x, 3.1.x, 3.2
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp-m1_2.12</artifactId>
-    <version>4.2.2</version>
+    <version>4.2.3</version>
 </dependency>
 ```
 
@@ -439,28 +439,28 @@ Spark NLP supports Scala 2.12.15 if you are using Apache Spark 3.0.x, 3.1.x, 3.2
 
 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "4.2.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "4.2.3"
 ```
 
 **spark-nlp-gpu:**
 
 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "4.2.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "4.2.3"
 ```
 
 **spark-nlp-aarch64:**
 
 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "4.2.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "4.2.3"
 ```
 
 **spark-nlp-m1:**
 
 ```sbtshell
 // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-m1
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-m1" % "4.2.2"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-m1" % "4.2.3"
 ```
 
 Maven Central: [https://mvnrepository.com/artifact/com.johnsnowlabs.nlp](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp)
@@ -480,7 +480,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through
 Pip:
 
 ```bash
-pip install spark-nlp==4.2.2
+pip install spark-nlp==4.2.3
 ```
 
 Conda:
@@ -508,7 +508,7 @@ spark = SparkSession.builder \
     .config("spark.driver.memory","16G")\
     .config("spark.driver.maxResultSize", "0") \
     .config("spark.kryoserializer.buffer.max", "2000M")\
-    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2")\
+    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3")\
     .getOrCreate()
 ```
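As an aside, `sparknlp.start()` is a shortcut that builds a session comparable to the `SparkSession.builder` block in the hunk above and pulls in the matching Maven package; a small sketch:

```python
# A minimal sketch: start() applies similar memory/serializer settings and
# fetches the Maven package matching the installed spark-nlp version.
import sparknlp

spark = sparknlp.start()   # sparknlp.start(gpu=True) selects spark-nlp-gpu
print(sparknlp.version())  # should report 4.2.3 for this release
print(spark.version)
```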

@@ -576,7 +576,7 @@ Use either one of the following options
 - Add the following Maven Coordinates to the interpreter's library list
 
 ```bash
-com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 ```
 
 - Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is available to driver path
@@ -586,7 +586,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
 Apart from the previous step, install the python module through pip
 
 ```bash
-pip install spark-nlp==4.2.2
+pip install spark-nlp==4.2.3
 ```
 
 Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -611,7 +611,7 @@ The easiest way to get this done on Linux and macOS is to simply install `spark-
 $ conda create -n sparknlp python=3.8 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==4.2.2 pyspark==3.2.1 jupyter
+$ pip install spark-nlp==4.2.3 pyspark==3.2.1 jupyter
 $ jupyter notebook
 ```
 
@@ -627,7 +627,7 @@ export PYSPARK_PYTHON=python3
 export PYSPARK_DRIVER_PYTHON=jupyter
 export PYSPARK_DRIVER_PYTHON_OPTS=notebook
 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 ```
 
 Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp`
@@ -652,7 +652,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
 # -s is for spark-nlp
 # -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage
 # by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.1 -s 4.2.2
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.1 -s 4.2.3
 ```
 
 [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/quick_start_google_colab.ipynb) is a live demo on Google Colab that performs named entity recognitions and sentiment analysis by using Spark NLP pretrained pipelines.
@@ -673,7 +673,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
 # -s is for spark-nlp
 # -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage
 # by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.1 -s 4.2.2
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.1 -s 4.2.3
 ```
 
 [Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP pretrained pipeline.
@@ -691,9 +691,9 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
 
 3. In `Libraries` tab inside your cluster you need to follow these steps:
 
-    3.1. Install New -> PyPI -> `spark-nlp==4.2.2` -> Install
+    3.1. Install New -> PyPI -> `spark-nlp==4.2.3` -> Install
 
-    3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2` -> Install
+    3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3` -> Install
 
 4. Now you can attach your notebook to the cluster and use Spark NLP!
 
@@ -741,7 +741,7 @@ A sample of your software configuration in JSON on S3 (must be public access):
         "spark.kryoserializer.buffer.max": "2000M",
         "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
         "spark.driver.maxResultSize": "0",
-        "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2"
+        "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3"
     }
 }]
 ```
@@ -750,7 +750,7 @@ A sample of AWS CLI to launch EMR cluster:
 
 ```.sh
 aws emr create-cluster \
---name "Spark NLP 4.2.2" \
+--name "Spark NLP 4.2.3" \
 --release-label emr-6.2.0 \
 --applications Name=Hadoop Name=Spark Name=Hive \
 --instance-type m4.4xlarge \
@@ -814,7 +814,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
   --enable-component-gateway \
   --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
   --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
-  --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+  --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 ```
 
 2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI.
@@ -853,7 +853,7 @@ spark = SparkSession.builder \
     .config("spark.kryoserializer.buffer.max", "2000m") \
     .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") \
    .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage") \
-    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2") \
+    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3") \
     .getOrCreate()
 ```
 
@@ -867,7 +867,7 @@ spark-shell \
   --conf spark.kryoserializer.buffer.max=2000M \
   --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
   --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
-  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 ```
 
 **pyspark:**
@@ -880,7 +880,7 @@ pyspark \
   --conf spark.kryoserializer.buffer.max=2000M \
   --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
   --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
-  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
 ```
 
 **Databricks:**
@@ -1144,12 +1144,12 @@ spark = SparkSession.builder \
     .config("spark.driver.memory","16G")\
     .config("spark.driver.maxResultSize", "0") \
     .config("spark.kryoserializer.buffer.max", "2000M")\
-    .config("spark.jars", "/tmp/spark-nlp-assembly-4.2.2.jar")\
+    .config("spark.jars", "/tmp/spark-nlp-assembly-4.2.3.jar")\
     .getOrCreate()
 ```
 
 - You can download provided Fat JARs from each [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases), please pay attention to pick the one that suits your environment depending on the device (CPU/GPU) and Apache Spark version (3.0.x, 3.1.x, 3.2.x, and 3.3.x)
-- If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (i.e., `hdfs:///tmp/spark-nlp-assembly-4.2.2.jar`)
+- If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (i.e., `hdfs:///tmp/spark-nlp-assembly-4.2.3.jar`)
 
 Example of using pretrained Models and Pipelines in offline:
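Picking up that last context line, a sketch of offline loading once a model archive has been downloaded from the Models Hub and unpacked; the local path and directory name are hypothetical, and `.load()` is the standard Spark ML reader on model classes:

```python
# Offline-load sketch: no network access needed once the directory exists.
from sparknlp.annotator import WordEmbeddingsModel

embeddings = WordEmbeddingsModel.load("/tmp/glove_100d_en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
```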

build.sbt (+1, -1)

@@ -6,7 +6,7 @@ name := getPackageName(is_m1, is_gpu, is_aarch64)
 
 organization := "com.johnsnowlabs.nlp"
 
-version := "4.2.2"
+version := "4.2.3"
 
 (ThisBuild / scalaVersion) := scalaVer
 
conda/meta.yaml (+4, -4)

@@ -1,15 +1,15 @@
 package:
   name: "spark-nlp"
-  version: 4.2.2
+  version: 4.2.3
 
 app:
   entry: spark-nlp
   summary: Natural Language Understanding Library for Apache Spark.
 
 source:
-  fn: spark-nlp-4.2.2.tar.gz
-  url: https://files.pythonhosted.org/packages/78/7e/1ed94f903c0dfe0e6d4900bf61d0210cb39dadf918c7a21f9cfdf924fc50/spark-nlp-4.2.2.tar.gz
-  sha256: 276abca3fc807a4dd0ffa5a299f11359c402670ad20166a01dd7ff6392719f65
+  fn: spark-nlp-4.2.3.tar.gz
+  url: https://files.pythonhosted.org/packages/09/81/d5644f8ff89839da85b1ef70cf38ca0cab2fc12041d724c5e794a47c14f5/spark-nlp-4.2.3.tar.gz
+  sha256: 430aa24d0e325138140ef92b6e7c4c797838e9eee14aaf636af08276956549f9
 build:
   noarch: generic
   number: 0
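As a quick check, the new `url` and `sha256` above can be verified locally; a sketch using only the Python standard library, with the URL and hash copied from the meta.yaml lines in this diff:

```python
import hashlib
import urllib.request

URL = ("https://files.pythonhosted.org/packages/09/81/"
       "d5644f8ff89839da85b1ef70cf38ca0cab2fc12041d724c5e794a47c14f5/"
       "spark-nlp-4.2.3.tar.gz")
EXPECTED = "430aa24d0e325138140ef92b6e7c4c797838e9eee14aaf636af08276956549f9"

data = urllib.request.urlopen(URL).read()
digest = hashlib.sha256(data).hexdigest()
assert digest == EXPECTED, f"checksum mismatch: {digest}"
print("spark-nlp-4.2.3.tar.gz matches the pinned sha256")
```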
