Spark Stemming

Snowball is a small string processing language designed for creating stemming algorithms for use in information retrieval. This package makes the Snowball stemmers available as part of the Spark ML Pipeline API.

Linking

Link against this library using SBT:

libraryDependencies += "com.github.master" %% "spark-stemming" % "0.2.1"

Using Maven:

<dependency>
    <groupId>com.github.master</groupId>
    <artifactId>spark-stemming_2.10</artifactId>
    <version>0.2.1</version>
</dependency>

Or include it when starting the Spark shell:

$ bin/spark-shell --packages com.github.master:spark-stemming_2.10:0.2.1
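
The SBT line above uses `%%`, which appends the project's Scala binary version to the artifact name. That is why it resolves to the same `spark-stemming_2.10` artifact named in the Maven and spark-shell forms when the build targets Scala 2.10. A minimal `build.sbt` sketch follows; the exact `scalaVersion` value is an assumption chosen only to match the `_2.10` suffix.

```scala
// build.sbt (sketch)

// Assumption: any 2.10.x release works here; it only needs to match the _2.10 artifact suffix.
scalaVersion := "2.10.6"

// %% appends the Scala binary version, so this resolves spark-stemming_2.10.
libraryDependencies += "com.github.master" %% "spark-stemming" % "0.2.1"

// Equivalent explicit form, using a single % and the full artifact name:
// libraryDependencies += "com.github.master" % "spark-stemming_2.10" % "0.2.1"
```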

Features

Currently implemented algorithms:

  • Arabic
  • English
  • English (Porter)
  • Romance stemmers:
    • French
    • Spanish
    • Portuguese
    • Italian
    • Romanian
  • Germanic stemmers:
    • German
    • Dutch
  • Scandinavian stemmers:
    • Swedish
    • Norwegian (Bokmål)
    • Danish
  • Russian
  • Finnish
  • Greek

More details are on the Snowball stemming algorithms page.

Usage

The Stemmer transformer can be used directly or as a stage in an ML Pipeline. In particular, it combines well with Tokenizer.

import org.apache.spark.mllib.feature.Stemmer

// sqlContext is predefined in the Spark shell.
val data = sqlContext
  .createDataFrame(Seq(("мама", 1), ("мыла", 2), ("раму", 3)))
  .toDF("word", "id")

// Stem the "word" column with the Russian stemmer and write the result to "stemmed".
val stemmed = new Stemmer()
  .setInputCol("word")
  .setOutputCol("stemmed")
  .setLanguage("Russian")
  .transform(data)

stemmed.show()
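
The example above applies Stemmer directly. As a rough sketch of the Pipeline usage mentioned earlier, the stemmer can be chained after Spark's Tokenizer. This is an illustration, not a tested recipe: it assumes a Spark shell session with `sqlContext` available, and it assumes that Stemmer accepts the array-of-strings column that Tokenizer produces, which may depend on the library version. The sample sentences and the column names `text`, `words`, and `stems` are made up for the example.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.mllib.feature.Stemmer

// Illustrative input: one sentence per row.
val sentences = sqlContext
  .createDataFrame(Seq((1, "the cats are running"), (2, "she walked and talked")))
  .toDF("id", "text")

// Tokenizer lowercases the text and splits it on whitespace into an array of words.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// Assumption: this Stemmer version takes an array-of-strings input column.
val stemmer = new Stemmer()
  .setInputCol("words")
  .setOutputCol("stems")
  .setLanguage("English")

// Both stages are plain transformers, so fit() has nothing to train
// and simply wraps them in a PipelineModel.
val pipeline = new Pipeline().setStages(Array(tokenizer, stemmer))
val model = pipeline.fit(sentences)

model.transform(sentences).select("id", "stems").show()
```

Wrapping the two stages in a Pipeline keeps tokenization and stemming together as one reusable unit, so further stages (for example a feature extractor) can be appended to the same `setStages` array.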
