
Spark NLP 6.0.0: PDF Reader, Excel Reader, PowerPoint Reader, Vision Language Models, Native Multimodal in GGUF, and many more!


📢 Spark NLP 6.0.0: A New Era for Universal Ingestion and Multimodal LLM Processing at Scale

From raw documents to multimodal insights at enterprise scale

With Spark NLP 6.0.0, we are setting a new standard for building scalable, distributed AI pipelines. This release transforms Spark NLP from a pure NLP library into the de facto platform for distributed LLM ingestion and multimodal batch processing.

This release introduces native ingestion for enterprise file types including PDFs, Excel spreadsheets, PowerPoint decks, and raw text logs, with automatic structure extraction, semantic segmentation, and metadata preservation — all in scalable, zero-code Spark pipelines.

At the same time, Spark NLP now natively supports Vision-Language Models (VLMs), loading quantized multimodal models like LLAVA, Phi Vision, DeepSeek Janus, and Llama 3.2 Vision directly via the Llama.cpp, ONNX, and OpenVINO runtimes, with no external inference servers and no API bottlenecks.

With 6.0.0, Spark NLP offers a complete, distributed architecture for universal data ingestion, multimodal understanding, and LLM batch inference at scale — enabling retrieval-augmented generation (RAG), document understanding, compliance audits, enterprise search, and multimodal analytics — all within the native Spark ecosystem.

One unified framework. Text, vision, documents — at Spark scale. Zero boilerplate. Maximum performance.


🌟 Spotlight Feature: AutoGGUFVisionModel — Native Multimodal Inference with Llama.cpp

Spark NLP 6.0.0 introduces the new AutoGGUFVisionModel, enabling native multimodal inference for quantized GGUF models directly within Spark pipelines. Powered by Llama.cpp, this annotator makes it effortless to run Vision-Language Models (VLMs) like LLAVA-1.5-7B Q4_0, Qwen2 VL, and others fully on-premises, at scale, with no external servers or APIs required.

With Spark NLP 6.0.0, Llama.cpp vision models are now first-class citizens inside DataFrames, delivering multimodal inference at scale with native Spark performance.

Why it matters

For the first time, Spark NLP supports pure vision-text workflows, allowing you to pass raw images and captions directly into LLMs that can describe, summarize, or reason over visual inputs.
This unlocks batch multimodal processing across massive datasets with Spark’s native scalability — perfect for product catalogs, compliance audits, document analysis, and more.

How it works

  • Accepts raw image bytes (not Spark's OpenCV format) for true end-to-end multimodal inference.
  • Provides a convenient helper function ImageAssembler.loadImagesAsBytes to prepare image datasets effortlessly.
  • Supports all Llama.cpp runtime parameters like context length (nCtx), top-k/top-p sampling, temperature, and repeat penalties, allowing fine control over completions.

Example usage

import sparknlp
from sparknlp.base import DocumentAssembler, ImageAssembler
from sparknlp.annotator import AutoGGUFVisionModel
from pyspark.ml import Pipeline
from pyspark.sql.functions import lit

spark = sparknlp.start()

# Assemble the text prompt that accompanies each image
documentAssembler = DocumentAssembler() \
    .setInputCol("caption") \
    .setOutputCol("caption_document")

# Assemble the raw image bytes
imageAssembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

# Load images as raw bytes and attach the same prompt to every row
data = ImageAssembler \
    .loadImagesAsBytes(spark, "src/test/resources/image/") \
    .withColumn("caption", lit("Caption this image."))

# Quantized GGUF vision model with Llama.cpp sampling parameters
model = AutoGGUFVisionModel.pretrained() \
    .setInputCols(["caption_document", "image_assembler"]) \
    .setOutputCol("completions") \
    .setBatchSize(4) \
    .setNPredict(40) \
    .setTopK(40) \
    .setTopP(0.95) \
    .setTemperature(0.05)

pipeline = Pipeline().setStages([documentAssembler, imageAssembler, model])
results = pipeline.fit(data).transform(data)

# Show each image's file name next to the generated caption
results.selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "completions.result").show(truncate=False)

📚 A full notebook walkthrough is available here.


🔥 New Features & Enhancements

PDF Reader

Font-aware PDF ingestion is now available with automatic page segmentation, encrypted-file support, and token-level coordinate extraction, making it ideal for legal discovery and document Q&A.
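
A minimal sketch of reading PDFs into a DataFrame, assuming the sparknlp.read() entry point and an illustrative local path:

import sparknlp

spark = sparknlp.start()

# Each PDF becomes a row with its extracted content and metadata
pdf_df = sparknlp.read().pdf("path/to/pdf/files")
pdf_df.show(truncate=False)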

Excel Reader

Spark NLP can now ingest .xls and .xlsx files directly into Spark DataFrames with automatic schema detection, multiple sheet support, and rich-text extraction for LLM pipelines.
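
A similar sketch for spreadsheets, again assuming the sparknlp.read() entry point and an illustrative path:

import sparknlp

spark = sparknlp.start()

# Sheets and cells are flattened into rows of extracted content
xls_df = sparknlp.read().xls("path/to/excel/files")
xls_df.show(truncate=False)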

PowerPoint Reader

Spark NLP introduces a native reader for .ppt and .pptx files. Capture slides, speaker notes, themes, and alt text at the document level for downstream summarization and retrieval.
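
A sketch for slide decks, assuming the sparknlp.read() entry point and an illustrative path:

import sparknlp

spark = sparknlp.start()

# Slide text and notes are captured at the document level
ppt_df = sparknlp.read().ppt("path/to/powerpoint/files")
ppt_df.show(truncate=False)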

Extractor and Cleaner Annotators

New Extractor and Cleaner annotators allow you to pull structured data (emails, IP addresses, dates) from text or clean noisy text artifacts like bullets, dashes, and non-ASCII characters at scale.
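
A hedged sketch of pulling email addresses out of free text; the sparknlp.annotator.cleaners module path and the email_address mode name are assumptions:

from sparknlp.base import DocumentAssembler
from sparknlp.annotator.cleaners import Extractor
from pyspark.ml import Pipeline

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Extract email addresses from free text; the mode name is an assumption
extractor = Extractor() \
    .setInputCols(["document"]) \
    .setOutputCol("emails") \
    .setExtractorMode("email_address")

pipeline = Pipeline().setStages([document, extractor])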

Text Reader

A high-performance TextReader is now available to load .txt, .csv, .log, and similar files. It automatically detects encodings and line endings, making it suitable for massive ingestion jobs.
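
A sketch for raw text files, assuming the sparknlp.read() entry point and an illustrative path:

import sparknlp

spark = sparknlp.start()

# Plain-text, CSV, and log files land in a single text column
txt_df = sparknlp.read().txt("path/to/text/files")
txt_df.show(truncate=False)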

AutoGGUFVisionModel for Multimodal Llama.cpp Inference

Spark NLP now supports vision-language models in GGUF format using the new AutoGGUFVisionModel annotator. Run models like LLAVA-1.5-7B Q4_0 or Qwen2 VL entirely within Spark using Llama.cpp, enabling native multimodal batch inference without servers.

DeepSeek Janus Multimodal Model

The DeepSeek Janus model, tuned for instruction-following across text and images, is now fully integrated and available via a simple pretrained call.
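
A minimal sketch, assuming the class name JanusForMultiModal and the image-assembler input convention used by the other VLM annotators:

from sparknlp.base import ImageAssembler
from sparknlp.annotator import JanusForMultiModal

imageAssembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

# Class name and default checkpoint are assumptions
janus = JanusForMultiModal.pretrained() \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("answer")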

Qwen-2 Vision-Language Model Catalog

Support for Alibaba’s Qwen-2 VL series (0.5B to 7B parameters) is now available. Use Qwen-2 checkpoints for OCR, product search, and multimodal retrieval tasks with unified APIs.
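
Usage mirrors the other VLM annotators; the class name Qwen2VLTransformer is an assumption:

from sparknlp.annotator import Qwen2VLTransformer

# Swap pretrained checkpoints by name to move between 0.5B and 7B variants
qwen = Qwen2VLTransformer.pretrained() \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("answer")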

Native Multimodal Support with Phi-3.5 Vision

The new Phi3Vision annotator brings Microsoft’s Phi-3.5 multimodal model into Spark NLP. Process images and prompts together to generate grounded captions or visual Q&A results, all with a model footprint of less than 1 GB.
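
A minimal sketch using the Phi3Vision annotator named above; the input column convention is assumed to match the other VLM annotators:

from sparknlp.annotator import Phi3Vision

# Prompts and images flow in through the standard image_assembler column
phi = Phi3Vision.pretrained() \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("caption")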

LLAVA 1.5 Vision-Language Transformer

Spark NLP now supports LLAVA 1.5 (7B) natively for screenshot Q&A, chart reading, and UI testing tasks. Build fully distributed multimodal inference pipelines without external services or dependencies.
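
A sketch assuming the class name LLAVAForMultiModal:

from sparknlp.annotator import LLAVAForMultiModal

# Class name is an assumption; usage follows the VLM annotator pattern
llava = LLAVAForMultiModal.pretrained() \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("description")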

Native Cohere Command-R Models

Cohere’s multilingual Command-R models (up to 35B parameters) are now fully integrated. Perform reasoning, RAG, and summarization tasks with no REST API latency and no token limits.
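
A text-generation sketch, assuming the class name CoHereTransformer and generation parameters that mirror the other seq2seq annotators:

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import CoHereTransformer

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Class name and parameter names are assumptions
cohere = CoHereTransformer.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("generation") \
    .setMaxOutputLength(200) \
    .setTemperature(0.3)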

OLMo Family Support

Spark NLP now supports the full OLMo suite of open-weight language models (7B, 1.7B, and more) directly in Scala and Python. OLMo models come with full training transparency, Dolma-sized vocabularies, and reproducible experiment logs, making them ideal for academic research and benchmarking.
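
A sketch assuming the class name OLMoTransformer:

from sparknlp.annotator import OLMoTransformer

# Class name and parameter names are assumptions
olmo = OLMoTransformer.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("generation") \
    .setMaxOutputLength(100)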

Multiple-Choice Heads for LLMs

New lightweight multiple-choice heads are now available for ALBERT, DistilBERT, RoBERTa, and XLM-RoBERTa models. These are perfect for building auto-grading systems, educational quizzes, and choice-ranking pipelines; a minimal usage sketch follows the list below.

  • AlbertForMultipleChoice
  • DistilBertForMultipleChoice
  • RoBertaForMultipleChoice
  • XlmRoBertaForMultipleChoice
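
A minimal sketch with AlbertForMultipleChoice; the input column names and the comma-separated choices convention are assumptions:

from sparknlp.base import MultiDocumentAssembler
from sparknlp.annotator import AlbertForMultipleChoice
from pyspark.ml import Pipeline

# One column holds the question, another the comma-separated choices
assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "choices"]) \
    .setOutputCols(["document_question", "document_choices"])

mc = AlbertForMultipleChoice.pretrained() \
    .setInputCols(["document_question", "document_choices"]) \
    .setOutputCol("answer")

pipeline = Pipeline().setStages([assembler, mc])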

VisionEncoderDecoder Improvements

The Scala API for VisionEncoderDecoder has been fully refactored to expose .generate() parameters like batch size and maximum tokens, aligning it one-to-one with the Python API.


🐛 Bug Fixes

Better GGUF Error Reporting

When a GGUF file is missing tensors or uses unsupported quantization, Spark NLP now provides clear and actionable error messages, including guidance on how to fix or convert the model.

Fixed MXBAI Typo

A small typo related to the MXBAI integration was corrected to ensure consistency across annotator names and pretrained model references.

VisionEncoderDecoder Alignment

The Scala VisionEncoderDecoder wrapper has been updated to fully match the Python API. It now exposes parameters like batch size and maximum tokens, fixing discrepancies that could occur in cross-language pipelines.

Minor Naming Improvements

Variable naming inconsistencies have been cleaned up throughout the codebase to ensure a more uniform and predictable developer experience.


📝 Models

We have added more than 110,000 new models and pipelines. The complete list of all 88,000+ models & pipelines in 230+ languages is available on our Models Hub.


❤️ Community support

  • Slack: live discussion with the Spark NLP community and the team
  • GitHub: bug reports, feature requests, and contributions
  • Discussions: engage with other community members, share ideas, and show off how you use Spark NLP!
  • Medium: Spark NLP articles on the official JohnSnowLabs publication
  • YouTube: Spark NLP video tutorials

Installation

Python

#PyPI

pip install spark-nlp==6.0.0

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:6.0.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:6.0.0

Apple Silicon (M1 & M2)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:6.0.0

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:6.0.0

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>6.0.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>6.0.0</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>6.0.0</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>6.0.0</version>
</dependency>

What's Changed

Full Changelog: 5.5.3...6.0.0