Spark NLP 6.0.0: PDF Reader, Excel Reader, PowerPoint Reader, Vision Language Models, Native Multimodal in GGUF, and many more! #14558

maziyarpanahi (Maintainer) · 2025-04-28T19:13:51Z

📢 Spark NLP 6.0.0: A New Era for Universal Ingestion and Multimodal LLM Processing at Scale

From raw documents to multimodal insights at enterprise scale

With Spark NLP 6.0.0, we are setting a new standard for building scalable, distributed AI pipelines. This release transforms Spark NLP from a pure NLP library into the de facto platform for distributed LLM ingestion and multimodal batch processing.

This release introduces native ingestion for enterprise file types including PDFs, Excel spreadsheets, PowerPoint decks, and raw text logs, with automatic structure extraction, semantic segmentation, and metadata preservation — all in scalable, zero-code Spark pipelines.

At the same time, Spark NLP now natively supports Vision-Language Models (VLMs). It loads quantized multimodal models like LLAVA, Phi Vision, DeepSeek Janus, and Llama 3.2 Vision directly via the Llama.cpp, ONNX, and OpenVINO runtimes, with no external inference servers and no API bottlenecks.

With 6.0.0, Spark NLP offers a complete, distributed architecture for universal data ingestion, multimodal understanding, and LLM batch inference at scale — enabling retrieval-augmented generation (RAG), document understanding, compliance audits, enterprise search, and multimodal analytics — all within the native Spark ecosystem.

One unified framework. Text, vision, documents — at Spark scale. Zero boilerplate. Maximum performance.

🌟 Spotlight Feature: AutoGGUFVisionModel — Native Multimodal Inference with Llama.cpp

Spark NLP 6.0.0 introduces the new AutoGGUFVisionModel, enabling native multimodal inference for quantized GGUF models directly within Spark pipelines. Powered by Llama.cpp, this annotator makes it effortless to run Vision-Language Models (VLMs) like LLAVA-1.5-7B Q4_0, Qwen2 VL, and others fully on-premises, at scale, with no external servers or APIs required.

With Spark NLP 6.0.0, Llama.cpp vision models are now first-class citizens inside DataFrames, delivering multimodal inference at scale with native Spark performance.

Why it matters

For the first time, Spark NLP supports pure vision-text workflows, allowing you to pass raw images and captions directly into LLMs that can describe, summarize, or reason over visual inputs.
This unlocks batch multimodal processing across massive datasets with Spark’s native scalability — perfect for product catalogs, compliance audits, document analysis, and more.

How it works

Accepts raw image bytes (not Spark's OpenCV format) for true end-to-end multimodal inference.
Provides a convenient helper function ImageAssembler.loadImagesAsBytes to prepare image datasets effortlessly.
Supports all Llama.cpp runtime parameters like context length (nCtx), top-k/top-p sampling, temperature, and repeat penalties, allowing fine control over completions.
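To make the sampling parameters above more concrete, here is a minimal pure-Python sketch of how temperature, top-k, and top-p can narrow a next-token distribution. It is an illustration only: the function name `filter_distribution`, the toy logits, and the order of the steps are assumptions made for this example, and the actual sampler chain inside Llama.cpp may apply the filters differently.

```python
import math


def filter_distribution(logits, temperature=0.05, top_k=40, top_p=0.95):
    """Return {token: probability} after temperature, top-k, and top-p filtering.

    Hypothetical helper for illustration; temperature must be > 0.
    """
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}

    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # Top-k: keep only the k most likely tokens.
    ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


# Toy logits (made up for this example).
toy_logits = {"cat": 2.0, "dog": 1.0, "car": 0.0, "sky": -1.0}
print(filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.8))
print(filter_distribution(toy_logits, temperature=0.05, top_k=40, top_p=0.95))
```

With a temperature as low as 0.05, almost all probability mass lands on the single most likely token, so the output is close to deterministic regardless of the top-k and top-p settings.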

Example usage

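import sparknlp
from sparknlp.base import DocumentAssembler, ImageAssembler
from sparknlp.annotator import AutoGGUFVisionModel
from pyspark.ml import Pipeline
from pyspark.sql.functions import lit

# Start (or reuse) a Spark session with Spark NLP available
spark = sparknlp.start()

# Wrap the text prompt column as a document annotation
documentAssembler = DocumentAssembler() \
    .setInputCol("caption") \
    .setOutputCol("caption_document")

# Wrap the raw image bytes as an image annotation
imageAssembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

# Load images as raw bytes and attach the same prompt to every row
data = ImageAssembler \
    .loadImagesAsBytes(spark, "src/test/resources/image/") \
    .withColumn("caption", lit("Caption this image."))

# Default pretrained vision model with Llama.cpp sampling parameters
model = AutoGGUFVisionModel.pretrained() \
    .setInputCols(["caption_document", "image_assembler"]) \
    .setOutputCol("completions") \
    .setBatchSize(4) \
    .setNPredict(40) \
    .setTopK(40) \
    .setTopP(0.95) \
    .setTemperature(0.05)

pipeline = Pipeline().setStages([documentAssembler, imageAssembler, model])
results = pipeline.fit(data).transform(data)

# Show the file name of each image next to its generated completion
results.selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "completions.result").show(truncate=False)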

📚 A full notebook walkthrough is available here.

🔥 New Features & Enhancements

PDF Reader

Font-aware PDF ingestion is now available with automatic page segmentation, encrypted file support, and token-level coordinate extraction, ideal for legal discovery and document Q&A.

Excel Reader

Spark NLP can now ingest .xls and .xlsx files directly into Spark DataFrames with automatic schema detection, multiple sheet support, and rich-text extraction for LLM pipelines.

PowerPoint Reader

Spark NLP introduces a native reader for .ppt and .pptx files. Capture slides, speaker notes, themes, and alt text at the document level for downstream summarization and retrieval.

Extractor and Cleaner Annotators

New Extractor and Cleaner annotators allow you to pull structured data (emails, IP addresses, dates) from text or clean noisy text artifacts like bullets, dashes, and non-ASCII characters at scale.
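As a rough illustration of what extraction and cleaning mean here, the sketch below pulls emails and IPv4 addresses out of a string and strips bullets, leading dashes, and non-ASCII characters using only the Python standard library. It is not the Spark NLP implementation: the function names, the regular expressions, and the sample text are assumptions chosen for this example, and the real annotators apply their own patterns across a DataFrame.

```python
import re

# Simplified patterns, assumed for illustration; production patterns are stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Leading bullets, dashes, or asterisks at the start of a line.
BULLET_RE = re.compile(r"^[ \t]*[\u2022\u25CF\-\*]+[ \t]*", re.MULTILINE)


def extract_entities(text):
    """Return the emails and IPv4-looking addresses found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "ip_addresses": IPV4_RE.findall(text),
    }


def clean_text(text):
    """Drop list markers and non-ASCII characters, then collapse spaces."""
    text = BULLET_RE.sub("", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[ \t]+", " ", text).strip()


sample = "\u2022 Contact ops@example.com from 10.0.0.12\n- caf\u00e9 logs"
print(extract_entities(sample))
print(clean_text(sample))
```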

Text Reader

A high-performance TextReader is now available to load .txt, .csv, .log and similar files. It automatically detects encoding and line endings for massive ingestion jobs.
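To show what detecting encoding and line endings involves, here is a small standard-library sketch. It is only a conceptual illustration under simple assumptions (byte-order-mark check, then a UTF-8 attempt, then a Latin-1 fallback); the helper name `sniff_text` is hypothetical, and the detection logic used by the actual reader may be quite different.

```python
import codecs

# Byte-order marks checked first; the codec names strip the BOM when decoding.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_text(raw):
    """Guess the encoding and dominant line ending of raw bytes.

    Returns (encoding, line_ending, lines). Hypothetical helper for illustration.
    """
    encoding = None
    for bom, name in BOMS:
        if raw.startswith(bom):
            encoding = name
            break
    if encoding is None:
        try:
            raw.decode("utf-8")
            encoding = "utf-8"
        except UnicodeDecodeError:
            # Latin-1 can decode any byte sequence, so it serves as a last resort.
            encoding = "latin-1"

    text = raw.decode(encoding)

    # Count each line-ending style without double-counting "\r\n".
    crlf = text.count("\r\n")
    counts = {
        "\r\n": crlf,
        "\n": text.count("\n") - crlf,
        "\r": text.count("\r") - crlf,
    }
    line_ending = max(counts, key=counts.get) if any(counts.values()) else None
    return encoding, line_ending, text.splitlines()


print(sniff_text(b"line one\r\nline two\r\n"))
```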

AutoGGUFVisionModel for Multimodal Llama.cpp Inference

Spark NLP now supports vision-language models in GGUF format using the new AutoGGUFVisionModel annotator. Run models like LLAVA-1.5-7B Q4_0 or Qwen2 VL entirely within Spark using Llama.cpp, enabling native multimodal batch inference without servers.

DeepSeek Janus Multimodal Model

The DeepSeek Janus model, tuned for instruction-following across text and images, is now fully integrated and available via a simple pretrained call.

Qwen-2 Vision-Language Model Catalog

Support for Alibaba’s Qwen-2 VL series (0.5B to 7B parameters) is now available. Use Qwen-2 checkpoints for OCR, product search, and multimodal retrieval tasks with unified APIs.

Native Multimodal Support with Phi-3.5 Vision

The new Phi3Vision annotator brings Microsoft’s Phi-3.5 multimodal model into Spark NLP. Process images and prompts together to generate grounded captions or visual Q&A results, all with a model footprint of less than 1 GB.

LLAVA 1.5 Vision-Language Transformer

Spark NLP now supports LLAVA 1.5 (7B) natively for screenshot Q&A, chart reading, and UI testing tasks. Build fully distributed multimodal inference pipelines without external services or dependencies.

Native Cohere Command-R Models

Cohere’s multilingual Command-R models (up to 35B parameters) are now fully integrated. Perform reasoning, RAG, and summarization tasks with no REST API latency and no token limits.

OLMo Family Support

Spark NLP now supports the full OLMo suite of open-weight language models (7B, 1.7B, and more) directly in Scala and Python. OLMo models come with full training transparency, Dolma-sized vocabularies, and reproducible experiment logs, making them ideal for academic research and benchmarking.

Multiple-Choice Heads for LLMs

New lightweight multiple-choice heads are now available for ALBERT, DistilBERT, RoBERTa, and XLM-RoBERTa models. These are perfect for building auto-grading systems, educational quizzes, and choice ranking pipelines.

AlbertForMultipleChoice
DistilBertForMultipleChoice
RoBertaForMultipleChoice
XlmRoBertaForMultipleChoice
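Conceptually, a multiple-choice head scores each (question, choice) pair and selects the highest-scoring choice. The sketch below shows only that final selection step in plain Python, assuming the per-choice scores have already been produced by a model. The function name `pick_choice` and the logits are made up for illustration and do not come from any of the annotators listed above.

```python
import math


def pick_choice(choices, logits):
    """Softmax the per-choice scores and return (best_choice, probabilities)."""
    if len(choices) != len(logits):
        raise ValueError("Need exactly one score per choice")

    # Softmax across the choices, subtracting the maximum for stability.
    peak = max(logits)
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # The predicted answer is the choice with the highest probability.
    best_index = max(range(len(choices)), key=probs.__getitem__)
    return choices[best_index], probs


# Hypothetical scores for the question "What is the capital of France?"
answer, probabilities = pick_choice(["Paris", "Berlin", "Rome"], [3.2, 0.4, -1.0])
print(answer, probabilities)
```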

VisionEncoderDecoder Improvements

The Scala API for VisionEncoderDecoder has been fully refactored to expose .generate() parameters like batch size and maximum tokens, aligning it one-to-one with the Python API.

🐛 Bug Fixes

Better GGUF Error Reporting

When a GGUF file is missing tensors or uses unsupported quantization, Spark NLP now provides clear and actionable error messages, including guidance on how to fix or convert the model.
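Before handing a model file to a pipeline, a quick pre-flight check can catch files that are not GGUF at all. The sketch below reads the first eight bytes of a file and checks the `GGUF` magic and the little-endian version field that open the GGUF header. It is a standalone illustration, not the validation Spark NLP performs; the helper name `read_gguf_version` is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file (bad magic bytes)")
    # The version is a 32-bit unsigned little-endian integer after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

A check like this only confirms the container format. Missing tensors or unsupported quantization types can only be detected by the loader itself.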

Fixed MXBAI Typo

A small typo related to the MXBAI integration was corrected to ensure consistency across annotator names and pretrained model references.

VisionEncoderDecoder Alignment

The Scala VisionEncoderDecoder wrapper has been updated to fully match the Python API. It now exposes parameters like batch size and maximum tokens, fixing discrepancies that could occur in cross-language pipelines.

Minor Naming Improvements

Variable naming inconsistencies have been cleaned up throughout the codebase to ensure a more uniform and predictable developer experience.

📝 Models

We have added more than 110,000 new models and pipelines. The complete list of all 88,000+ models & pipelines in 230+ languages is available on our Models Hub.

❤️ Community support

Slack: For live discussion with the Spark NLP community and the team
GitHub: Bug reports, feature requests, and contributions
Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
Medium: Spark NLP articles
JohnSnowLabs: official Medium
YouTube: Spark NLP video tutorials

Installation

Python

#PyPI

pip install spark-nlp==6.0.0

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.0

Apple Silicon (M1 & M2)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.0

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.0
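The package coordinates above all follow the same `group:artifact_scalaVersion:version` pattern, so they can be generated rather than copied by hand. The sketch below builds the coordinate for each variant listed in this section. The helper name `package_coordinate` and the short variant keys are assumptions made for this example; the group ID, artifact names, Scala version, and release version are taken from the commands above.

```python
GROUP_ID = "com.johnsnowlabs.nlp"
SCALA_VERSION = "2.12"
VERSION = "6.0.0"

# Variant key (hypothetical shorthand) -> artifact name used in this release.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}


def package_coordinate(variant="cpu"):
    """Build the Maven-style coordinate for the given hardware variant."""
    if variant not in ARTIFACTS:
        raise ValueError(f"Unknown variant: {variant!r}")
    return f"{GROUP_ID}:{ARTIFACTS[variant]}_{SCALA_VERSION}:{VERSION}"


for name in ARTIFACTS:
    print(name, "->", package_coordinate(name))
```

The resulting string is what goes after `--packages` on the command line, or into the `spark.jars.packages` setting when building a Spark session programmatically.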

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>6.0.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>6.0.0</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>6.0.0</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>6.0.0</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.0.0.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-6.0.0.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-6.0.0.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-6.0.0.jar

What's Changed

Update create_search_index.yml by @agsfer in #14526
SPARKNLP-1006: Introducing OLMo by @prabod in #14242
Sparknlp 1060 implement phi 3.5 vision by @prabod in #14444
SparkNLP 1033: Introducing LLAVA by @prabod in #14450
SparkNLP 1032: Introducing CoHere by @prabod in #14457
SparkNLP 1077- Introducing Qwen2 - VL by @prabod in #14474
updating python and scala model names by @ahmedlone127 in #14488
[SPARKNLP-1102] Adding support to read Excel files by @danilojsl in #14489
[SPARKNLP-1103] Adding support to read power point files by @danilojsl in #14491
[SPARKNLP-1105] Introducing AlbertForMultipleChoice Transformer by @danilojsl in #14492
[SPARKNLP-1106] Introducing DistilBertForMultipleChoice Transformer by @danilojsl in #14493
[SPARKNLP-1107] Introducing RoBertaForMultipleChoice by @danilojsl in #14495
[SPARKNLP-1108] Introducing XlmRoBertaForMultipleChoice Transformer by @danilojsl in #14497
[SPARKNLP-1098] Adding PDF reader support by @danilojsl in #14499
Sparknlp 1078 Introducing llama 3.2 vision models by @prabod in #14502
[SPARKNLP-1079] AutoGGUFVisionModel by @DevinTDHa in #14505
fixing typo in MXBAI notebook by @ahmedlone127 in #14510
SPARKNLP-1109 Adding Extractor to Sparknlp by @danilojsl in #14519
[SPARKNLP-1113] Adding Text Reader by @danilojsl in #14524
SparkNLP 1088 - Introducing Deepseek Janus by @prabod in #14532
Improved Error Handling for AutoGGUF models by @DevinTDHa in #14533
Update VisionEncoderDecoder.scala by @ahmedlone127 in #14553
fixing name by @ahmedlone127 in #14554
Models hub by @maziyarpanahi in #14557
Release/600 release candidate by @maziyarpanahi in #14534

Full Changelog: 5.5.3...6.0.0

This discussion was created from the release Spark NLP 6.0.0: PDF Reader, Excel Reader, PowerPoint Reader, Vision Language Models, Native Multimodal in GGUF, and many more!.
