Spark NLP 6.0.0: PDF Reader, Excel Reader, PowerPoint Reader, Vision Language Models, Native Multimodal in GGUF, and many more! #14558
maziyarpanahi
announced in
Announcement
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
📢 Spark NLP 6.0.0: A New Era for Universal Ingestion and Multimodal LLM Processing at Scale
From raw documents to multimodal insights at enterprise scale
With Spark NLP 6.0.0, we are setting a new standard for building scalable, distributed AI pipelines. This release transforms Spark NLP from a pure NLP library into the de facto platform for distributed LLM ingestion and multimodal batch processing.
This release introduces native ingestion for enterprise file types including PDFs, Excel spreadsheets, PowerPoint decks, and raw text logs, with automatic structure extraction, semantic segmentation, and metadata preservation — all in scalable, zero-code Spark pipelines.
At the same time, Spark NLP now natively supports Vision-Language Models (VLMs), loading quantized multimodal models like LLAVA, Phi Vision, DeepSeek Janus, and Llama 3.2 Vision directly via Llama.cpp, ONNX, and OpenVINO runtimes with no external inference servers, no API bottlenecks.
With 6.0.0, Spark NLP offers a complete, distributed architecture for universal data ingestion, multimodal understanding, and LLM batch inference at scale — enabling retrieval-augmented generation (RAG), document understanding, compliance audits, enterprise search, and multimodal analytics — all within the native Spark ecosystem.
One unified framework. Text, vision, documents — at Spark scale. Zero boilerplate. Maximum performance.
🌟 Spotlight Feature: AutoGGUFVisionModel — Native Multimodal Inference with Llama.cpp
Spark NLP 6.0.0 introduces the new
AutoGGUFVisionModel
, enabling native multimodal inference for quantized GGUF models directly within Spark pipelines. Powered by Llama.cpp, this annotator makes it effortless to run Vision-Language Models (VLMs) like LLAVA-1.5-7B Q4_0, Qwen2 VL, and others fully on-premises, at scale, with no external servers or APIs required.With Spark NLP 6.0.0, Llama.cpp vision models are now first-class citizens inside DataFrames, delivering multimodal inference at scale with native Spark performance.
Why it matters
For the first time, Spark NLP supports pure vision-text workflows, allowing you to pass raw images and captions directly into LLMs that can describe, summarize, or reason over visual inputs.
This unlocks batch multimodal processing across massive datasets with Spark’s native scalability — perfect for product catalogs, compliance audits, document analysis, and more.
How it works
ImageAssembler.loadImagesAsBytes
to prepare image datasets effortlessly.nCtx
), top-k/top-p sampling, temperature, and repeat penalties, allowing fine control over completions.Example usage
📚 A full notebook walkthrough is available here.
🔥 New Features & Enhancements
PDF Reader
Font-aware PDF ingestion is now available with automatic page segmentation, encrypted file support, and token-level coordinate extraction, ideal for legal discovery and document Q&A.
Excel Reader
Spark NLP can now ingest
.xls
and.xlsx
files directly into Spark DataFrames with automatic schema detection, multiple sheet support, and rich-text extraction for LLM pipelines.PowerPoint Reader
Spark NLP introduces a native reader for
.ppt
and.pptx
files. Capture slides, speaker notes, themes, and alt text at the document level for downstream summarization and retrieval.Extractor and Cleaner Annotators
New Extractor and Cleaner annotators allow you to pull structured data (emails, IP addresses, dates) from text or clean noisy text artifacts like bullets, dashes, and non-ASCII characters at scale.
Text Reader
A high-performance TextReader is now available to load
.txt
,.csv
,.log
and similar files. It automatically detects encoding and line endings for massive ingestion jobs.AutoGGUFVisionModel for Multimodal Llama.cpp Inference
Spark NLP now supports vision-language models in GGUF format using the new AutoGGUFVisionModel annotator. Run models like LLAVA-1.5-7B Q4_0 or Qwen2 VL entirely within Spark using Llama.cpp, enabling native multimodal batch inference without servers.
DeepSeek Janus Multimodal Model
The DeepSeek Janus model, tuned for instruction-following across text and images, is now fully integrated and available via a simple pretrained call.
Qwen-2 Vision-Language Model Catalog
Support for Alibaba’s Qwen-2 VL series (0.5B to 7B parameters) is now available. Use Qwen-2 checkpoints for OCR, product search, and multimodal retrieval tasks with unified APIs.
Native Multimodal Support with Phi-3.5 Vision
The new Phi3Vision annotator brings Microsoft’s Phi-3.5 multimodal model into Spark NLP. Process images and prompts together to generate grounded captions or visual Q&A results, all with a model footprint of less than 1 GB.
LLAVA 1.5 Vision-Language Transformer
Spark NLP now supports LLAVA 1.5 (7B) natively for screenshot Q&A, chart reading, and UI testing tasks. Build fully distributed multimodal inference pipelines without external services or dependencies.
Native Cohere Command-R Models
Cohere’s multilingual Command-R models (up to 35B parameters) are now fully integrated. Perform reasoning, RAG, and summarization tasks with no REST API latency and no token limits.
OLMo Family Support
Spark NLP now supports the full OLMo suite of open-weight language models (7B, 1.7B, and more) directly in Scala and Python. OLMo models come with full training transparency, Dolma-sized vocabularies, and reproducible experiment logs, making them ideal for academic research and benchmarking.
Multiple-Choice Heads for LLMs
New lightweight multiple-choice heads are now available for
ALBERT
,DistilBERT
,RoBERTa
, andXLM-RoBERTa
models. These are perfect for building auto-grading systems, educational quizzes, and choice ranking pipelines.AlbertForMultipleChoice
DistilBertForMultipleChoice
RoBertaForMultipleChoice
XlmRoBertaForMultipleChoice
VisionEncoderDecoder Improvements
The Scala API for VisionEncoderDecoder has been fully refactored to expose
.generate()
parameters like batch size and maximum tokens, aligning it one-to-one with the Python API.🐛 Bug Fixes
Better GGUF Error Reporting
When a GGUF file is missing tensors or uses unsupported quantization, Spark NLP now provides clear and actionable error messages, including guidance on how to fix or convert the model.
Fixed MXBAI Typo
A small typo related to the MXBAI integration was corrected to ensure consistency across annotator names and pretrained model references.
VisionEncoderDecoder Alignment
The Scala VisionEncoderDecoder wrapper has been updated to fully match the Python API. It now exposes parameters like batch size and maximum tokens, fixing discrepancies that could occur in cross-language pipelines.
Minor Naming Improvements
Variable naming inconsistencies have been cleaned up throughout the codebase to ensure a more uniform and predictable developer experience.
📝 Models
We have added more than 110,000 new models and pipelines. The complete list of all 88,000+ models & pipelines in 230+ languages is available on our Models Hub.
❤️ Community support
and show off how you use Spark NLP!
Installation
Python
#PyPI pip install spark-nlp==6.0.0
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):
GPU
Apple Silicon (M1 & M2)
AArch64
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:
spark-nlp-gpu:
spark-nlp-silicon:
spark-nlp-aarch64:
FAT JARs
CPU on Apache Spark 3.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-6.0.0.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-6.0.0.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-6.0.0.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-6.0.0.jar
What's Changed
Full Changelog: 5.5.3...6.0.0
This discussion was created from the release Spark NLP 6.0.0: PDF Reader, Excel Reader, PowerPoint Reader, Vision Language Models, Native Multimodal in GGUF, and many more!.
Beta Was this translation helpful? Give feedback.
All reactions