Spark NLP 6.0.0: PDF Reader, Excel Reader, PowerPoint Reader, Vision Language Models, Native Multimodal in GGUF, and many more!
📢 Spark NLP 6.0.0: A New Era for Universal Ingestion and Multimodal LLM Processing at Scale
From raw documents to multimodal insights at enterprise scale
With Spark NLP 6.0.0, we are setting a new standard for building scalable, distributed AI pipelines. This release transforms Spark NLP from a pure NLP library into the de facto platform for distributed LLM ingestion and multimodal batch processing.
This release introduces native ingestion for enterprise file types including PDFs, Excel spreadsheets, PowerPoint decks, and raw text logs, with automatic structure extraction, semantic segmentation, and metadata preservation — all in scalable, zero-code Spark pipelines.
At the same time, Spark NLP now natively supports Vision-Language Models (VLMs), loading quantized multimodal models like LLAVA, Phi Vision, DeepSeek Janus, and Llama 3.2 Vision directly via Llama.cpp, ONNX, and OpenVINO runtimes with no external inference servers, no API bottlenecks.
With 6.0.0, Spark NLP offers a complete, distributed architecture for universal data ingestion, multimodal understanding, and LLM batch inference at scale — enabling retrieval-augmented generation (RAG), document understanding, compliance audits, enterprise search, and multimodal analytics — all within the native Spark ecosystem.
One unified framework. Text, vision, documents — at Spark scale. Zero boilerplate. Maximum performance.
🌟 Spotlight Feature: AutoGGUFVisionModel — Native Multimodal Inference with Llama.cpp
Spark NLP 6.0.0 introduces the new `AutoGGUFVisionModel`, enabling native multimodal inference for quantized GGUF models directly within Spark pipelines. Powered by Llama.cpp, this annotator makes it effortless to run Vision-Language Models (VLMs) like LLAVA-1.5-7B Q4_0, Qwen2 VL, and others fully on-premises, at scale, with no external servers or APIs required.
With Spark NLP 6.0.0, Llama.cpp vision models are now first-class citizens inside DataFrames, delivering multimodal inference at scale with native Spark performance.
Why it matters
For the first time, Spark NLP supports pure vision-text workflows, allowing you to pass raw images and captions directly into LLMs that can describe, summarize, or reason over visual inputs.
This unlocks batch multimodal processing across massive datasets with Spark’s native scalability — perfect for product catalogs, compliance audits, document analysis, and more.
How it works
- Accepts raw image bytes (not Spark's OpenCV format) for true end-to-end multimodal inference.
- Provides a convenient helper function, `ImageAssembler.loadImagesAsBytes`, to prepare image datasets effortlessly.
- Supports all Llama.cpp runtime parameters such as context length (`nCtx`), top-k/top-p sampling, temperature, and repeat penalties, allowing fine control over completions.
Example usage
```python
from sparknlp.base import DocumentAssembler, ImageAssembler
from sparknlp.annotator import AutoGGUFVisionModel
from pyspark.ml import Pipeline
from pyspark.sql.functions import lit

documentAssembler = DocumentAssembler() \
    .setInputCol("caption") \
    .setOutputCol("caption_document")

imageAssembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

data = ImageAssembler \
    .loadImagesAsBytes(spark, "src/test/resources/image/") \
    .withColumn("caption", lit("Caption this image."))

model = AutoGGUFVisionModel.pretrained() \
    .setInputCols(["caption_document", "image_assembler"]) \
    .setOutputCol("completions") \
    .setBatchSize(4) \
    .setNPredict(40) \
    .setTopK(40) \
    .setTopP(0.95) \
    .setTemperature(0.05)

pipeline = Pipeline().setStages([documentAssembler, imageAssembler, model])
results = pipeline.fit(data).transform(data)

results.selectExpr(
    "reverse(split(image.origin, '/'))[0] as image_name",
    "completions.result"
).show(truncate=False)
```
📚 A full notebook walkthrough is available here.
🔥 New Features & Enhancements
PDF Reader
Font-aware PDF ingestion is now available with automatic page segmentation, encrypted file support, and token-level coordinate extraction, ideal for legal discovery and document Q&A.
Excel Reader
Spark NLP can now ingest `.xls` and `.xlsx` files directly into Spark DataFrames with automatic schema detection, multiple-sheet support, and rich-text extraction for LLM pipelines.
PowerPoint Reader
Spark NLP introduces a native reader for `.ppt` and `.pptx` files. Capture slides, speaker notes, themes, and alt text at the document level for downstream summarization and retrieval.
Extractor and Cleaner Annotators
New Extractor and Cleaner annotators allow you to pull structured data (emails, IP addresses, dates) from text or clean noisy text artifacts like bullets, dashes, and non-ASCII characters at scale.
Text Reader
A high-performance TextReader is now available to load `.txt`, `.csv`, `.log`, and similar files. It automatically detects encoding and line endings for massive ingestion jobs.
AutoGGUFVisionModel for Multimodal Llama.cpp Inference
Spark NLP now supports vision-language models in GGUF format using the new AutoGGUFVisionModel annotator. Run models like LLAVA-1.5-7B Q4_0 or Qwen2 VL entirely within Spark using Llama.cpp, enabling native multimodal batch inference without servers.
DeepSeek Janus Multimodal Model
The DeepSeek Janus model, tuned for instruction-following across text and images, is now fully integrated and available via a simple pretrained call.
Qwen-2 Vision-Language Model Catalog
Support for Alibaba’s Qwen-2 VL series (0.5B to 7B parameters) is now available. Use Qwen-2 checkpoints for OCR, product search, and multimodal retrieval tasks with unified APIs.
Native Multimodal Support with Phi-3.5 Vision
The new Phi3Vision annotator brings Microsoft’s Phi-3.5 multimodal model into Spark NLP. Process images and prompts together to generate grounded captions or visual Q&A results, all with a model footprint of less than 1 GB.
LLAVA 1.5 Vision-Language Transformer
Spark NLP now supports LLAVA 1.5 (7B) natively for screenshot Q&A, chart reading, and UI testing tasks. Build fully distributed multimodal inference pipelines without external services or dependencies.
Native Cohere Command-R Models
Cohere’s multilingual Command-R models (up to 35B parameters) are now fully integrated. Perform reasoning, RAG, and summarization tasks with no REST API latency and no token limits.
OLMo Family Support
Spark NLP now supports the full OLMo suite of open-weight language models (7B, 1.7B, and more) directly in Scala and Python. OLMo models come with full training transparency, Dolma-sized vocabularies, and reproducible experiment logs, making them ideal for academic research and benchmarking.
Multiple-Choice Heads for LLMs
New lightweight multiple-choice heads are now available for `ALBERT`, `DistilBERT`, `RoBERTa`, and `XLM-RoBERTa` models. These are perfect for building auto-grading systems, educational quizzes, and choice-ranking pipelines.

- `AlbertForMultipleChoice`
- `DistilBertForMultipleChoice`
- `RoBertaForMultipleChoice`
- `XlmRoBertaForMultipleChoice`
VisionEncoderDecoder Improvements
The Scala API for VisionEncoderDecoder has been fully refactored to expose `.generate()` parameters like batch size and maximum tokens, aligning it one-to-one with the Python API.
🐛 Bug Fixes
Better GGUF Error Reporting
When a GGUF file is missing tensors or uses unsupported quantization, Spark NLP now provides clear and actionable error messages, including guidance on how to fix or convert the model.
Fixed MXBAI Typo
A small typo related to the MXBAI integration was corrected to ensure consistency across annotator names and pretrained model references.
VisionEncoderDecoder Alignment
The Scala VisionEncoderDecoder wrapper has been updated to fully match the Python API. It now exposes parameters like batch size and maximum tokens, fixing discrepancies that could occur in cross-language pipelines.
Minor Naming Improvements
Variable naming inconsistencies have been cleaned up throughout the codebase to ensure a more uniform and predictable developer experience.
📝 Models
We have added more than 110,000 new models and pipelines. The complete list of all 88,000+ models & pipelines in 230+ languages is available on our Models Hub.
❤️ Community support
- Slack: live discussion with the Spark NLP community and the team
- GitHub: bug reports, feature requests, and contributions
- Discussions: engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium: Spark NLP articles
- JohnSnowLabs: official Medium
- YouTube: Spark NLP video tutorials
Installation
Python
```bash
# PyPI
pip install spark-nlp==6.0.0
```
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):
```bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.0
```
GPU
```bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.0
```
Apple Silicon (M1 & M2)
```bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.0
```
AArch64
```bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.0
```
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:
```xml
<dependency>
  <groupId>com.johnsnowlabs.nlp</groupId>
  <artifactId>spark-nlp_2.12</artifactId>
  <version>6.0.0</version>
</dependency>
```
spark-nlp-gpu:
```xml
<dependency>
  <groupId>com.johnsnowlabs.nlp</groupId>
  <artifactId>spark-nlp-gpu_2.12</artifactId>
  <version>6.0.0</version>
</dependency>
```
spark-nlp-silicon:
```xml
<dependency>
  <groupId>com.johnsnowlabs.nlp</groupId>
  <artifactId>spark-nlp-silicon_2.12</artifactId>
  <version>6.0.0</version>
</dependency>
```
spark-nlp-aarch64:
```xml
<dependency>
  <groupId>com.johnsnowlabs.nlp</groupId>
  <artifactId>spark-nlp-aarch64_2.12</artifactId>
  <version>6.0.0</version>
</dependency>
```
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.0.0.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-6.0.0.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-6.0.0.jar
- AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-6.0.0.jar
What's Changed
- Update create_search_index.yml by @agsfer in #14526
- SPARKNLP-1006: Introducing OLMo by @prabod in #14242
- Sparknlp 1060 implement phi 3.5 vision by @prabod in #14444
- SparkNLP 1033: Introducing LLAVA by @prabod in #14450
- SparkNLP 1032: Introducing CoHere by @prabod in #14457
- SparkNLP 1077- Introducing Qwen2 - VL by @prabod in #14474
- updating python and scala model names by @ahmedlone127 in #14488
- [SPARKNLP-1102] Adding support to read Excel files by @danilojsl in #14489
- [SPARKNLP-1103] Adding support to read power point files by @danilojsl in #14491
- [SPARKNLP-1105] Introducing AlbertForMultipleChoice Transformer by @danilojsl in #14492
- [SPARKNLP-1106] Introducing DistilBertForMultipleChoice Transformer by @danilojsl in #14493
- [SPARKNLP-1107] Introducing RoBertaForMultipleChoice by @danilojsl in #14495
- [SPARKNLP-1108] Introducing XlmRoBertaForMultipleChoice Transformer by @danilojsl in #14497
- [SPARKNLP-1098] Adding PDF reader support by @danilojsl in #14499
- Sparknlp 1078 Introducing llama 3.2 vision models by @prabod in #14502
- [SPARKNLP-1079] AutoGGUFVisionModel by @DevinTDHa in #14505
- fixing typo in MXBAI notebook by @ahmedlone127 in #14510
- SPARKNLP-1109 Adding Extractor to Sparknlp by @danilojsl in #14519
- [SPARKNLP-1113] Adding Text Reader by @danilojsl in #14524
- SparkNLP 1088 - Introducing Deepseek Janus by @prabod in #14532
- Improved Error Handling for AutoGGUF models by @DevinTDHa in #14533
- Update VisionEncoderDecoder.scala by @ahmedlone127 in #14553
- fixing name by @ahmedlone127 in #14554
- Models hub by @maziyarpanahi in #14557
- Release/600 release candidate by @maziyarpanahi in #14534
Full Changelog: 5.5.3...6.0.0