Release Spark NLP 4.2.1: Over 230 state-of-the-art Transformer Vision (ViT) pretrained pipelines, new multi-lingual support for Word Segmentation, add LightPipeline support to Automatic Speech Recognition pipelines, support for processed audio files in type Double for Wav2Vec2, and bug fixes · JohnSnowLabs/spark-nlp

📢 Overview

Spark NLP 4.2.1 🚀 comes with new multi-lingual support for Word Segmentation, mostly used for (but not limited to) Chinese, Japanese, and Korean. It also adds Automatic Speech Recognition (ASR) pipelines to the LightPipeline arsenal for faster computation on smaller datasets without Apache Spark (e.g. a RESTful API use case), support for processed audio files of type Double in addition to Float for Wav2Vec2, more than 230 state-of-the-art Transformer Vision (ViT) pretrained pipelines for 1-line Image Classification, and bug fixes.

Do not forget to visit Models Hub, which hosts 11400+ free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉

⭐ New Features & improvements

NEW: Support for multi-lingual WordSegmenter. Add enableRegexTokenizer feature in WordSegmenter to support word segmentation within mixed and multi-lingual content #12854
NEW: Add Audio/ASR (Wav2Vec2) support to LightPipeline #12895
NEW: Add support for Double type in addition to Float type in the AudioAssembler annotator #12904
Improve error handling in fullAnnotateImage for LightPipeline #12868
Add SpanBertCoref annotator to all docs #12889
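
The Double support in AudioAssembler matters mostly on the Python side: Python floats are 64-bit doubles, and PySpark's schema inference maps them to DoubleType, so decoded audio tends to arrive as an array of doubles unless it is cast explicitly. The sketch below is a minimal, library-free illustration of where such values come from. The helper name `pcm16_to_samples` and the synthetic tone are assumptions made for the example, and the Spark side (building the DataFrame and running AudioAssembler) is intentionally not shown.

```python
import math
import struct


def pcm16_to_samples(raw):
    """Decode little-endian 16-bit PCM bytes into floats in [-1.0, 1.0)."""
    count = len(raw) // 2
    ints = struct.unpack("<%dh" % count, raw[: count * 2])
    # Dividing by 32768.0 yields Python floats, i.e. 64-bit doubles.
    return [value / 32768.0 for value in ints]


# Synthetic input: 10 ms of a 440 Hz tone at 16 kHz (made-up data for the demo).
sample_rate = 16000
tone = [int(32767 * math.sin(2 * math.pi * 440 * n / sample_rate)) for n in range(160)]
raw_bytes = struct.pack("<160h", *tone)

samples = pcm16_to_samples(raw_bytes)
print(type(samples[0]).__name__, len(samples))
```

A list like `samples` is the kind of "processed audio" column that previously had to be cast to Float before being passed on.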

Bug Fixes

Fix feeding fullAnnotate in LightPipeline with a list, which started to fail in the 4.2.0 release
Fix exception in ContextSpellCheckerModel when updateVocabClass is used with append set to true #12875
Fix exception in Chunker annotator #12901

📓 New Notebooks

Spark NLP	Notebooks	Colab
SpanBertCorefModel	Coreference Resolution with SpanBertCorefModel
WordSegmenter	Train and inference multi-lingual Word Segmenter
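
To give an intuition for why a regex tokenizer helps with mixed content, here is a conceptual sketch in plain Python. It is not Spark NLP's implementation: it assumes the regex step splits on whitespace first and that only chunks written entirely in Han characters are handed to a segmenter. `TOY_VOCAB`, `toy_segment`, and `segment_mixed` are hypothetical names, and the greedy dictionary matcher merely stands in for a trained WordSegmenter model.

```python
import re

# Toy dictionary standing in for a trained segmenter (hypothetical data).
TOY_VOCAB = {"我", "喜欢", "自然", "语言", "处理"}

# Chunks made only of CJK Unified Ideographs go to the segmenter.
HAN_ONLY = re.compile(r"[\u4e00-\u9fff]+")


def toy_segment(chunk, vocab=TOY_VOCAB, max_len=4):
    """Greedy longest-match segmentation; a stand-in for a trained model."""
    words, i = [], 0
    while i < len(chunk):
        for size in range(min(max_len, len(chunk) - i), 0, -1):
            piece = chunk[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def segment_mixed(text):
    """Split on whitespace first, then segment only the Han-only chunks."""
    tokens = []
    for chunk in re.split(r"\s+", text.strip()):
        if not chunk:
            continue
        if HAN_ONLY.fullmatch(chunk):
            tokens.extend(toy_segment(chunk))
        else:
            tokens.append(chunk)
    return tokens


print(segment_mixed("I love 自然语言处理 with Spark NLP"))
```

Without the whitespace pass, a character-level segmenter would also have to decide where "Spark" and "NLP" begin and end, which is the kind of mixed-script input the new option is meant to handle.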

You can visit Import Transformers in Spark NLP
You can visit Spark NLP Workshop for 100+ examples

Models

Spark NLP 4.2.1 comes with 230+ state-of-the-art pre-trained Transformer Vision (ViT) pipelines:

Featured Pipelines

Pipeline	Name	Lang
PretrainedPipeline	pipeline_image_classifier_vit_base_patch16_224_finetuned_eurosat	`en`
PretrainedPipeline	pipeline_image_classifier_vit_base_beans_demo_v5	`en`
PretrainedPipeline	pipeline_image_classifier_vit_animal_classifier_huggingface	`en`
PretrainedPipeline	pipeline_image_classifier_vit_Infrastructures	`en`
PretrainedPipeline	pipeline_image_classifier_vit_blocks	`en`
PretrainedPipeline	pipeline_image_classifier_vit_beer_whisky_wine_detection	`en`
PretrainedPipeline	pipeline_image_classifier_vit_base_xray_pneumonia	`en`
PretrainedPipeline	pipeline_image_classifier_vit_baseball_stadium_foods	`en`
PretrainedPipeline	pipeline_image_classifier_vit_dog_vs_chicken	`en`
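
As a rough sketch of how one of the featured pipelines might be used from Python: the code below assumes a working Spark NLP 4.2.1 installation, network access to download the pipeline, and a directory of images. The function name `classify_images` and the `image_dir` argument are made up for this example, and the imports sit inside the function so that the module loads even where Spark NLP is not installed. The names of the output columns depend on the pipeline, so the sketch returns the full DataFrame rather than selecting specific columns.

```python
# One of the featured pipeline names from the table above.
PIPELINE_NAME = "pipeline_image_classifier_vit_dog_vs_chicken"


def classify_images(image_dir, pipeline_name=PIPELINE_NAME):
    """Run a pretrained ViT pipeline over every image found in image_dir."""
    # Imported lazily: these require Spark NLP and PySpark to be installed.
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    spark = sparknlp.start()

    # Spark's built-in image data source yields a DataFrame with an "image" column.
    images = (
        spark.read.format("image")
        .option("dropInvalid", True)
        .load(image_dir)
    )

    pipeline = PretrainedPipeline(pipeline_name, lang="en")
    return pipeline.transform(images)
```

Calling `classify_images("path/to/images").printSchema()` (the path is a placeholder) is a quick way to see which annotation columns the chosen pipeline produces.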

Check 460+ Transformer Vision (ViT) models & pipelines on Models Hub - Image Classification

Spark NLP covers the following languages:

English ,Multilingual ,Afrikaans ,Afro-Asiatic languages ,Albanian ,Altaic languages ,American Sign Language ,Amharic ,Arabic ,Argentine Sign Language ,Armenian ,Artificial languages ,Atlantic-Congo languages ,Austro-Asiatic languages ,Austronesian languages ,Azerbaijani ,Baltic languages ,Bantu languages ,Basque ,Basque (family) ,Belarusian ,Bemba (Zambia) ,Bengali, Bangla ,Berber languages ,Bihari ,Bislama ,Bosnian ,Brazilian Sign Language ,Breton ,Bulgarian ,Catalan ,Caucasian languages ,Cebuano ,Celtic languages ,Central Bikol ,Chichewa, Chewa, Nyanja ,Chilean Sign Language ,Chinese ,Chuukese ,Colombian Sign Language ,Congo Swahili ,Croatian ,Cushitic languages ,Czech ,Danish ,Dholuo, Luo (Kenya and Tanzania) ,Dravidian languages ,Dutch ,East Slavic languages ,Eastern Malayo-Polynesian languages ,Efik ,Esperanto ,Estonian ,Ewe ,Fijian ,Finnish ,Finnish Sign Language ,Finno-Ugrian languages ,French ,French-based creoles and pidgins ,Ga ,Galician ,Ganda ,Georgian ,German ,Germanic languages ,Gilbertese ,Greek (modern) ,Greek languages ,Gujarati ,Gun ,Haitian, Haitian Creole ,Hausa ,Hebrew (modern) ,Hiligaynon ,Hindi ,Hiri Motu ,Hungarian ,Icelandic ,Igbo ,Iloko ,Indic languages ,Indo-European languages ,Indo-Iranian languages ,Indonesian ,Irish ,Isoko ,Isthmus Zapotec ,Italian ,Italic languages ,Japanese ,Japanese ,Kabyle ,Kalaallisut, Greenlandic ,Kannada ,Kaonde ,Kinyarwanda ,Kirundi ,Kongo ,Korean ,Kwangali ,Kwanyama, Kuanyama ,Latin ,Latvian ,Lingala ,Lithuanian ,Louisiana Creole ,Lozi ,Luba-Katanga ,Luba-Lulua ,Lunda ,Lushai ,Luvale ,Macedonian ,Malagasy ,Malay ,Malayalam ,Malayo-Polynesian languages ,Maltese ,Manx ,Marathi (Marāṭhī) ,Marshallese ,Mexican Sign Language ,Mon-Khmer languages ,Morisyen ,Mossi ,Multiple languages ,Ndonga ,Nepali ,Niger-Kordofanian languages ,Nigerian Pidgin ,Niuean ,North Germanic languages ,Northern Sotho, Pedi, Sepedi ,Norwegian ,Norwegian Bokmål ,Norwegian Nynorsk ,Nyaneka ,Oromo ,Pangasinan ,Papiamento ,Persian (Farsi) 
,Peruvian Sign Language ,Philippine languages ,Pijin ,Pohnpeian ,Polish ,Portuguese ,Portuguese-based creoles and pidgins ,Punjabi (Eastern) ,Romance languages ,Romanian ,Rundi ,Russian ,Ruund ,Salishan languages ,Samoan ,San Salvador Kongo ,Sango ,Semitic languages ,Serbo-Croatian ,Seselwa Creole French ,Shona ,Sindhi ,Sino-Tibetan languages ,Slavic languages ,Slovak ,Slovene ,Somali ,South Caucasian languages ,South Slavic languages ,Southern Sotho ,Spanish ,Spanish Sign Language ,Sranan Tongo ,Swahili ,Swati ,Swedish ,Tagalog ,Tahitian ,Tai ,Tamil ,Telugu ,Tetela ,Tetun Dili ,Thai ,Tigrinya ,Tiv ,Tok Pisin ,Tonga (Tonga Islands) ,Tonga (Zambia) ,Tsonga ,Tswana ,Tumbuka ,Turkic languages ,Turkish ,Tuvalu ,Tzotzil ,Ukrainian ,Umbundu ,Uralic languages ,Urdu ,Venda ,Venezuelan Sign Language ,Vietnamese ,Wallisian ,Walloon ,Waray (Philippines) ,Welsh ,West Germanic languages ,West Slavic languages ,Western Malayo-Polynesian languages ,Wolaitta, Wolaytta ,Wolof ,Xhosa ,Yapese ,Yiddish ,Yoruba ,Yucatec Maya, Yucateco ,Zande (individual language) ,Zulu

The complete list of all 11000+ models & pipelines in 230+ languages is available on Models Hub

📖 Documentation

TF Hub & HuggingFace to Spark NLP
Models Hub with new models
Spark NLP documentation
Spark NLP Scala APIs
Spark NLP Python APIs
Spark NLP Workshop notebooks
Spark NLP publications
Spark NLP in Action
Spark NLP training certification notebooks for Google Colab and Databricks
Spark NLP Display for visualization of different types of annotations
Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

#PyPI

pip install spark-nlp==4.2.1

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.1

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.1
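
The package coordinates above all follow one pattern: group `com.johnsnowlabs.nlp`, an artifact name that varies by build, the Scala suffix `_2.12`, and the version. A small helper like the one below (the name `spark_nlp_package` and the variant keys are hypothetical, chosen only for this sketch) can produce the right string when a session is configured programmatically instead of through the command line.

```python
# Artifact names as they appear in the commands above.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "m1": "spark-nlp-m1",
}


def spark_nlp_package(variant="cpu", version="4.2.1", scala="2.12"):
    """Build the Maven coordinate for a given Spark NLP build variant."""
    try:
        artifact = ARTIFACTS[variant]
    except KeyError:
        raise ValueError("unknown variant: %r" % (variant,))
    return "com.johnsnowlabs.nlp:%s_%s:%s" % (artifact, scala, version)


print(spark_nlp_package("gpu"))
```

The resulting string can be passed to `--packages` or set as the value of the `spark.jars.packages` configuration key when building a SparkSession.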

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.2.1</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.2.1</version>
</dependency>

spark-nlp-m1:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-m1_2.12</artifactId>
    <version>4.2.1</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.1.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.1.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.1.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.1.jar

What's Changed

Contributors

@Meryem1425 @muhammetsnts @jsl-models @josejuanmartinez @DevinTDHa @ArshaanNazir @C-K-Loan @KshitizGIT @agsfer @diatrambitas @danilojsl @Damla-Gurbaz @maziyarpanahi @jsl-builder

New Contributors

@ArshaanNazir made their first contribution in #12881

Full Changelog: 4.2.0...4.2.1
