
Commit 2221e60

Merge pull request #14473 from JohnSnowLabs/release/552-release-candidate
Spark NLP 5.5.2 Release Candidate
2 parents 9e7ed7f + 573eefc commit 2221e60

226 files changed (+116734 −3527 lines)


CHANGELOG

+34
@@ -1,3 +1,37 @@
+========
+5.5.2
+========
+----------------
+New Features & Enhancements
+----------------
+* OpenVINO Support for Transformers (PR #14408):
+Added OpenVINO inference support to a broad range of transformer-based annotators, including DeBertaForQuestionAnswering, DeBertaForSequenceClassification, RoBertaForTokenClassification, XlmRoBertaForZeroShotClassification, BartTransformer, GPT2Transformer, and many others.
+* BLIPForQuestionAnswering Transformer (PR #14422):
+Introduced a new transformer, BLIPForQuestionAnswering, for image-based question answering tasks. The transformer processes images alongside associated questions to provide relevant answers.
+* AutoGGUFEmbeddings Annotator (PR #14433):
+Added AutoGGUFEmbeddings to support embeddings from AutoGGUFModels, providing rich sentence embeddings. Includes an end-to-end example notebook for usage.
+* HTML Parsing into DataFrame (PR #14449):
+Introduced sparknlp.read().html() to parse local or remote HTML files and convert them into structured Spark DataFrames for easier analysis.
+* Email Parsing into DataFrame (PR #14455):
+Added the sparknlp.read().email() method to parse email files into structured DataFrames, enabling scalable analysis of email content. (Note: depends on #14449.)
+* Microsoft Word Document Parsing into DataFrame (PR #14476):
+Added a new feature to parse .docx and .doc files into a Spark DataFrame, streamlining the integration of Word documents into NLP pipelines.
+* Microsoft Fabric Support (PR #14467):
+Introduced support for leveraging Microsoft Fabric for word embeddings storage and retrieval, enhancing scalability and efficiency.
+* cuDNN Upgrade Instructions on Databricks (PR #14451):
+Added instructions on upgrading cuDNN for GPU inference and cleaned up redundant Databricks installation instructions.
+* ChunkEmbeddings Metadata Preservation (PR #14462):
+Modified ChunkEmbeddings to preserve the original chunk’s metadata in the resulting embeddings, ensuring richer contextual information is retained.
+* Default Names and Languages for Annotators (PR #14469):
+Updated default names and language configurations for newly created seq2seq annotators to improve consistency and clarity.
+
+----------------
+Bug Fixes
+----------------
+* Spark Version Errors (PR #14467):
+Resolved issues related to long Spark versions when integrating Microsoft Fabric support.
+
 ========
 5.5.1
 ========
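The new BLIPForQuestionAnswering annotator (PR #14422) pairs each image with a question supplied in the same row. Below is a minimal sketch of how it might be wired into a pipeline: the image folder path and question are placeholders, an active `spark` session from `sparknlp.start()` is assumed, and passing the question through a `text` column follows the pattern of Spark NLP's other vision annotators rather than being confirmed by this commit.

```python
from pyspark.sql.functions import lit
from pyspark.ml import Pipeline
from sparknlp.base import ImageAssembler
from sparknlp.annotator import BLIPForQuestionAnswering

# Load a folder of images and attach the same question to every row
image_df = spark.read.format("image").load("./images")
data = image_df.withColumn("text", lit("What is in this picture?"))

image_assembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

vqa = BLIPForQuestionAnswering.pretrained() \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("answer")

pipeline = Pipeline(stages=[image_assembler, vqa])
pipeline.fit(data).transform(data).select("answer.result").show(truncate=False)
```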

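Likewise, the three new readers (PRs #14449, #14455, #14476) share the `sparknlp.read()` entry point. A minimal sketch follows: the `email()` call and its path mirror the Databricks note in docs/en/advanced_settings.md further down this commit, the directory paths are placeholders, and the `doc()` method name for Word files is an assumption inferred from the PR description.

```python
import sparknlp

spark = sparknlp.start()

# HTML: parse a local file or remote URL into a structured DataFrame
html_df = sparknlp.read().html("https://www.wikipedia.org")
html_df.show()

# Email: parse email files into a DataFrame (path as in the Databricks note)
email_df = sparknlp.read().email("./email-files")
email_df.show()

# Word: parse .docx/.doc files; the doc() method name is assumed here
word_df = sparknlp.read().doc("./word-files")
word_df.show()
```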
README.md

+10-10
@@ -55,15 +55,15 @@ documentation and examples
 
 ## Quick Start
 
-This is a quick example of how to use Spark NLP pre-trained pipeline in Python and PySpark:
+This is a quick example of how to use a Spark NLP pre-trained pipeline in Python and PySpark:
 
 ```sh
 $ java -version
 # should be Java 8 or 11 (Oracle or OpenJDK)
 $ conda create -n sparknlp python=3.7 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==5.5.1 pyspark==3.3.1
+$ pip install spark-nlp==5.5.2 pyspark==3.3.1
 ```
 
 In Python console or Jupyter `Python3` kernel:
@@ -129,7 +129,7 @@ For a quick example of using pipelines and models take a look at our official [d
 
 ### Apache Spark Support
 
-Spark NLP *5.5.1* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+Spark NLP *5.5.2* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
 
 | Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
 |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -157,7 +157,7 @@ Find out more about 4.x `SparkNLP` versions in our official [documentation](http
 
 ### Databricks Support
 
-Spark NLP 5.5.1 has been tested and is compatible with the following runtimes:
+Spark NLP 5.5.2 has been tested and is compatible with the following runtimes:
 
 | **CPU** | **GPU** |
 |--------------------|--------------------|
@@ -174,7 +174,7 @@ We are compatible with older runtimes. For a full list check databricks support
 
 ### EMR Support
 
-Spark NLP 5.5.1 has been tested and is compatible with the following EMR releases:
+Spark NLP 5.5.2 has been tested and is compatible with the following EMR releases:
 
 | **EMR Release** |
 |--------------------|
@@ -205,7 +205,7 @@ deployed to Maven central. To add any of our packages as a dependency in your ap
 from our official documentation.
 
 If you are interested, there is a simple SBT project for Spark NLP to guide you on how to use it in your
-projects [Spark NLP SBT S5.5.1r](https://github.com/maziyarpanahi/spark-nlp-starter)
+projects [Spark NLP SBT S5.5.2r](https://github.com/maziyarpanahi/spark-nlp-starter)
 
 ### Python
 
@@ -214,7 +214,7 @@ Check all available installations for Python in our official [documentation](htt
 
 ### Compiled JARs
 
-To compile the jars from source follow [these instructions](https://sparknlp.org/docs/en/compiled#jars) from our official documenation
+To compile the jars from source follow [these instructions](https://sparknlp.org/docs/en/compiled#jars) from our official documentation
 
 ## Platform-Specific Instructions
 
@@ -234,7 +234,7 @@ For detailed instructions on how to use Spark NLP on supported platforms, please
 
 Spark NLP library and all the pre-trained models/pipelines can be used entirely offline with no access to the Internet.
 Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation
-to use Spark NLP offline
+to use Spark NLP offline.
 
 ## Advanced Settings
 
@@ -250,7 +250,7 @@ In Spark NLP we can define S3 locations to:
 
 Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation.
 
-## Document5.5.1
+## Document5.5.2
 
 ### Examples
 
@@ -283,7 +283,7 @@ the Spark NLP library:
 keywords = {Spark, Natural language processing, Deep learning, Tensorflow, Cluster},
 abstract = {Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in more than 192+ languages. It supports nearly all the NLP tasks and modules that can be used seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing 9x growth since January 2020, Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise.}
 }
-}5.5.1
+}5.5.2
 ```
 
 ## Community support

build.sbt

+10-2
@@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64)
 
 organization := "com.johnsnowlabs.nlp"
 
-version := "5.5.1"
+version := "5.5.2"
 
 (ThisBuild / scalaVersion) := scalaVer
 
@@ -157,7 +157,14 @@ lazy val utilDependencies = Seq(
 greex,
 azureIdentity,
 azureStorage,
-jsoup)
+jsoup,
+jakartaMail,
+angusMail,
+poiDocx
+  exclude ("org.apache.logging.log4j", "log4j-api"),
+scratchpad
+  exclude ("org.apache.logging.log4j", "log4j-api")
+)
 
 lazy val typedDependencyParserDependencies = Seq(junit)
 
@@ -230,6 +237,7 @@ lazy val root = (project in file("."))
 
 (assembly / assemblyMergeStrategy) := {
   case PathList("META-INF", "versions", "9", "module-info.class") => MergeStrategy.discard
+  case PathList("module-info.class") => MergeStrategy.discard // Discard any module-info.class globally
   case PathList("apache.commons.lang3", _ @_*) => MergeStrategy.discard
   case PathList("org.apache.hadoop", _ @_*) => MergeStrategy.first
   case PathList("com.amazonaws", _ @_*) => MergeStrategy.last

docs/_layouts/landing.html

+1-1
@@ -201,7 +201,7 @@ <h3 class="grey h3_title">{{ _section.title }}</h3>
 <div class="highlight-box">
 {% highlight bash %}
 # Using PyPI
-$ pip install spark-nlp==5.5.1
+$ pip install spark-nlp==5.5.2
 
 # Using Anaconda/Conda
 $ conda install -c johnsnowlabs spark-nlp

docs/en/advanced_settings.md

+13-3
@@ -52,7 +52,7 @@ spark = SparkSession.builder
     .config("spark.kryoserializer.buffer.max", "2000m")
     .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
     .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
-    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.1")
+    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.2")
     .getOrCreate()
 ```
 
@@ -66,7 +66,7 @@ spark-shell \
   --conf spark.kryoserializer.buffer.max=2000M \
   --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
   --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
-  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.1
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.2
 ```
 
 **pyspark:**
@@ -79,7 +79,7 @@ pyspark \
   --conf spark.kryoserializer.buffer.max=2000M \
   --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
   --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
-  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.1
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.2
 ```
 
 **Databricks:**
@@ -96,6 +96,16 @@ spark.jsl.settings.annotator.log_folder dbfs:/PATH_TO_LOGS
 
 NOTE: If this is an existing cluster, after adding new configs or changing existing properties you need to restart it.
 
+#### Additional Configuration for Databricks
+When running Email Reader feature `sparknlp.read().email("./email-files")` on Databricks, it is necessary to include the following Spark configurations to avoid dependency conflicts:
+
+```bash
+spark.driver.userClassPathFirst true
+spark.executor.userClassPathFirst true
+```
+These configurations are required because the Databricks runtime environment includes a bundled version of the `com.sun.mail:jakarta.mail` library, which conflicts with `jakarta.activation`.
+By setting these properties, the application ensures that the user-provided libraries take precedence over those bundled in the Databricks environment, resolving the dependency conflict.
 
 </div><div class="h3-box" markdown="1">
 
 ### S3 Integration
New file (AutoGGUFEmbeddings annotator documentation page)
@@ -0,0 +1,123 @@
+{%- capture title -%}
+AutoGGUFEmbeddings
+{%- endcapture -%}
+
+{%- capture description -%}
+Annotator that uses the llama.cpp library to generate text embeddings with large language
+models.
+
+The type of embedding pooling can be set with the `setPoolingType` method. The default is
+`"MEAN"`. The available options are `"NONE"`, `"MEAN"`, `"CLS"`, and `"LAST"`.
+
+If the parameters are not set, the annotator will default to using the parameters provided by
+the model.
+
+Pretrained models can be loaded with `pretrained` of the companion object:
+
+```scala
+val autoGGUFEmbeddings = AutoGGUFEmbeddings.pretrained()
+  .setInputCols("document")
+  .setOutputCol("embeddings")
+```
+
+The default model is `"nomic-embed-text-v1.5.Q8_0.gguf"`, if no name is provided.
+
+For available pretrained models please see the [Models Hub](https://sparknlp.org/models).
+
+For extended examples of usage, see the
+[AutoGGUFEmbeddingsTest](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/test/scala/com/johnsnowlabs/nlp/annotators/seq2seq/AutoGGUFEmbeddingsTest.scala)
+and the
+[example notebook](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/llama.cpp/llama.cpp_in_Spark_NLP_AutoGGUFEmbeddings.ipynb).
+
+**Note**: To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set
+the number of GPU layers with the `setNGpuLayers` method.
+
+When using larger models, we recommend adjusting GPU usage with `setNCtx` and `setNGpuLayers`
+according to your hardware to avoid out-of-memory errors.
+{%- endcapture -%}
+
+{%- capture input_anno -%}
+DOCUMENT
+{%- endcapture -%}
+
+{%- capture output_anno -%}
+SENTENCE_EMBEDDINGS
+{%- endcapture -%}
+
+{%- capture python_example -%}
+>>> import sparknlp
+>>> from sparknlp.base import *
+>>> from sparknlp.annotator import *
+>>> from pyspark.ml import Pipeline
+>>> document = DocumentAssembler() \
+...     .setInputCol("text") \
+...     .setOutputCol("document")
+>>> autoGGUFEmbeddings = AutoGGUFEmbeddings.pretrained() \
+...     .setInputCols(["document"]) \
+...     .setOutputCol("embeddings") \
+...     .setBatchSize(4) \
+...     .setNGpuLayers(99) \
+...     .setPoolingType("MEAN")
+>>> pipeline = Pipeline().setStages([document, autoGGUFEmbeddings])
+>>> data = spark.createDataFrame([["The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones."]]).toDF("text")
+>>> result = pipeline.fit(data).transform(data)
+>>> result.select("embeddings.embeddings").show()
++--------------------------------------------------------------------------------+
+|                                                                      embeddings|
++--------------------------------------------------------------------------------+
+|[[-0.034486726, 0.07770534, -0.15982522, -0.017873349, 0.013914132, 0.0365736...|
++--------------------------------------------------------------------------------+
+{%- endcapture -%}
+
+{%- capture scala_example -%}
+import com.johnsnowlabs.nlp.base._
+import com.johnsnowlabs.nlp.annotator._
+import org.apache.spark.ml.Pipeline
+import spark.implicits._
+
+val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")
+
+val autoGGUFEmbeddings = AutoGGUFEmbeddings
+  .pretrained()
+  .setInputCols("document")
+  .setOutputCol("embeddings")
+  .setBatchSize(4)
+  .setPoolingType("MEAN")
+
+val pipeline = new Pipeline().setStages(Array(document, autoGGUFEmbeddings))
+
+val data = Seq(
+  "The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones.")
+  .toDF("text")
+val result = pipeline.fit(data).transform(data)
+result.select("embeddings.embeddings").show(1, truncate = 80)
++--------------------------------------------------------------------------------+
+|                                                                      embeddings|
++--------------------------------------------------------------------------------+
+|[[-0.034486726, 0.07770534, -0.15982522, -0.017873349, 0.013914132, 0.0365736...|
++--------------------------------------------------------------------------------+
+{%- endcapture -%}
+
+{%- capture api_link -%}
+[AutoGGUFEmbeddings](/api/com/johnsnowlabs/nlp/embeddings/AutoGGUFEmbeddings)
+{%- endcapture -%}
+
+{%- capture python_api_link -%}
+[AutoGGUFEmbeddings](/api/python/reference/autosummary/sparknlp/annotator/embeddings/auto_gguf_embeddings/index.html)
+{%- endcapture -%}
+
+{%- capture source_link -%}
+[AutoGGUFEmbeddings](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/embeddings/AutoGGUFEmbeddings.scala)
+{%- endcapture -%}
+
+{% include templates/anno_template.md
+title=title
+description=description
+input_anno=input_anno
+output_anno=output_anno
+python_example=python_example
+scala_example=scala_example
+api_link=api_link
+python_api_link=python_api_link
+source_link=source_link
+%}

docs/en/annotators.md

+1
@@ -45,6 +45,7 @@ There are two types of Annotators:
 {:.table-model-big}
 |Annotator|Description|Version |
 |---|---|---|
+{% include templates/anno_table_entry.md path="" name="AutoGGUFEmbeddings" summary="Annotator that uses the llama.cpp library to generate text embeddings with large language models."%}
 {% include templates/anno_table_entry.md path="" name="AutoGGUFModel" summary="Annotator that uses the llama.cpp library to generate text completions with large language models."%}
 {% include templates/anno_table_entry.md path="" name="BGEEmbeddings" summary="Sentence embeddings using BGE."%}
 {% include templates/anno_table_entry.md path="" name="BigTextMatcher" summary="Annotator to match exact phrases (by token) provided in a file against a Document."%}

docs/en/concepts.md

+1-1
@@ -66,7 +66,7 @@ $ java -version
 $ conda create -n sparknlp python=3.7 -y
 $ conda activate sparknlp
 # spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==5.5.1 pyspark==3.3.1 jupyter
+$ pip install spark-nlp==5.5.2 pyspark==3.3.1 jupyter
 $ jupyter notebook
 ```

docs/en/examples.md

+2-2
@@ -18,7 +18,7 @@ $ java -version
 # should be Java 8 (Oracle or OpenJDK)
 $ conda create -n sparknlp python=3.7 -y
 $ conda activate sparknlp
-$ pip install spark-nlp==5.5.1 pyspark==3.3.1
+$ pip install spark-nlp==5.5.2 pyspark==3.3.1
 ```
 
 </div><div class="h3-box" markdown="1">
@@ -40,7 +40,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
 # -p is for pyspark
 # -s is for spark-nlp
 # by default they are set to the latest
-!bash colab.sh -p 3.2.3 -s 5.5.1
+!bash colab.sh -p 3.2.3 -s 5.5.2
 ```
 
 [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) is a live demo on Google Colab that performs named entity recognitions and sentiment analysis by using Spark NLP pretrained pipelines.

docs/en/hardware_acceleration.md

+1-1
@@ -50,7 +50,7 @@ Since the new Transformer models such as BERT for Word and Sentence embeddings a
 | DeBERTa Large | +477%(5.8x) |
 | Longformer Base | +52%(1.5x) |
 
-Spark NLP 5.5.1 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:
+Spark NLP 5.5.2 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:
 
 - NVIDIA® GPU drivers version 450.80.02 or higher
 - CUDA® Toolkit 11.2
