For English, please go to README.eng.md
This package wraps seunjeon, the morphological analyzer of the 은전한닢 (eunjeon) project, to make it easy to use from Apache Spark. spark-nkp provides the following two Transformers:

- `Tokenizer`: splits sentences into morphemes. It can also filter for just the parts of speech you want.
- `Analyzer`: performs morphological analysis, outputting a DataFrame with detailed information about the words in each sentence.

In addition, a `Dictionary` is provided to support user-defined dictionaries.
## How To Use

### Spark Shell

```bash
spark-shell --packages com.github.uosdmlab:spark-nkp_2.11:0.3.3
```
### Zeppelin

Zeppelin supports two ways of loading the package:

- Interpreter Setting
- Dynamic Dependency Loading (`%spark.dep`)

#### Interpreter Setting

Interpreter Setting > Spark Interpreter > Edit > Dependencies

artifact: `com.github.uosdmlab:spark-nkp_2.11:0.3.3`

#### Dynamic Dependency Loading

```
%spark.dep
z.load("com.github.uosdmlab:spark-nkp_2.11:0.3.3")
```
## Example

### Tokenizer

```scala
import com.github.uosdmlab.nkp.Tokenizer

val df = spark.createDataset(
  Seq(
    "아버지가방에들어가신다.",
    "사랑해요 제플린!",
    "스파크는 재밌어",
    "나는야 데이터과학자",
    "데이터야~ 놀자~"
  )
).toDF("text")

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

val result = tokenizer.transform(df)

result.show(truncate = false)
```
output:

```
+------------+--------------------------+
|text        |words                     |
+------------+--------------------------+
|아버지가방에들어가신다.|[아버지, 가, 방, 에, 들어가, 신다, .]|
|사랑해요 제플린!  |[사랑, 해요, 제플린, !]         |
|스파크는 재밌어   |[스파크, 는, 재밌, 어]          |
|나는야 데이터과학자 |[나, 는, 야, 데이터, 과학자]      |
|데이터야~ 놀자~  |[데이터, 야, ~, 놀자, ~]       |
+------------+--------------------------+
```
### Analyzer

```scala
import org.apache.spark.sql.functions._
import com.github.uosdmlab.nkp.Analyzer

val df = spark.createDataset(
  Seq(
    "아버지가방에들어가신다.",
    "사랑해요 제플린!",
    "스파크는 재밌어",
    "나는야 데이터과학자",
    "데이터야~ 놀자~"
  )
).toDF("text")
  .withColumn("id", monotonically_increasing_id)

val analyzer = new Analyzer

val result = analyzer.transform(df)

result.show(truncate = false)
```
output:

```
+---+----+-------+-----------------------------------------------------+-----+---+
|id |word|pos    |feature                                              |start|end|
+---+----+-------+-----------------------------------------------------+-----+---+
|0  |아버지 |[N]    |[NNG, *, F, 아버지, *, *, *, *]                         |0    |3  |
|0  |가   |[J]    |[JKS, *, F, 가, *, *, *, *]                           |3    |4  |
|0  |방   |[N]    |[NNG, *, T, 방, *, *, *, *]                           |4    |5  |
|0  |에   |[J]    |[JKB, *, F, 에, *, *, *, *]                           |5    |6  |
|0  |들어가 |[V]    |[VV, *, F, 들어가, *, *, *, *]                         |6    |9  |
|0  |신다  |[EP, E]|[EP+EF, *, F, 신다, Inflect, EP, EF, 시/EP/*+ᆫ다/EF/*]  |9    |11 |
|0  |.   |[S]    |[SF, *, *, *, *, *, *, *]                            |11   |12 |
|1  |사랑  |[N]    |[NNG, *, T, 사랑, *, *, *, *]                          |0    |2  |
|1  |해요  |[XS, E]|[XSV+EF, *, F, 해요, Inflect, XSV, EF, 하/XSV/*+아요/EF/*]|2    |4  |
|1  |제플린 |[N]    |[NNP, *, T, 제플린, *, *, *, *]                        |5    |8  |
|1  |!   |[S]    |[SF, *, *, *, *, *, *, *]                            |8    |9  |
|2  |스파크 |[N]    |[NNG, *, F, 스파크, *, *, *, *]                        |0    |3  |
|2  |는   |[J]    |[JX, *, T, 는, *, *, *, *]                            |3    |4  |
|2  |재밌  |[V]    |[VA, *, T, 재밌, *, *, *, *]                           |5    |7  |
|2  |어   |[E]    |[EC, *, F, 어, *, *, *, *]                            |7    |8  |
|3  |나   |[N]    |[NP, *, F, 나, *, *, *, *]                            |0    |1  |
|3  |는   |[J]    |[JX, *, T, 는, *, *, *, *]                            |1    |2  |
|3  |야   |[I]    |[IC, *, F, 야, *, *, *, *]                            |2    |3  |
|3  |데이터 |[N]    |[NNG, *, F, 데이터, *, *, *, *]                        |4    |7  |
|3  |과학자 |[N]    |[NNG, *, F, 과학자, Compound, *, *, 과학/NNG/*+자/NNG/*]  |7    |10 |
+---+----+-------+-----------------------------------------------------+-----+---+
only showing top 20 rows
```
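Because the result is an ordinary DataFrame, the analysis can be post-processed with standard Spark SQL operations. Below is a minimal sketch, reusing `result` and `df` from the example above, that keeps only substantives (pos `N`) and gathers them back per sentence; the aggregation itself is just one plausible use, not part of the spark-nkp API:

```scala
import org.apache.spark.sql.functions._

// Keep rows whose pos array contains "N" (substantives), then
// collect the surviving words back into one list per sentence id.
val nouns = result
  .where(array_contains(col("pos"), "N"))
  .groupBy("id")
  .agg(collect_list(col("word")).as("nouns"))

// Join back to the original sentences for readability.
nouns.join(df, "id").select("text", "nouns").show(truncate = false)
```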
### Dictionary

```scala
import com.github.uosdmlab.nkp.{Tokenizer, Dictionary}

val df = spark.createDataset(
  Seq(
    "덕후냄새가 난다.",
    "넌 눈치도 없니? 낄끼빠빠!",
    "버카충했어?",
    "C++"))
  .toDF("text")

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

Dictionary.addWords("덕후", "낄끼+빠빠,-100", "버카충,-100", "C\\+\\+")

val result = tokenizer.transform(df)

result.show(truncate = false)
```
output:

```
+---------------+------------------------------+
|text           |words                         |
+---------------+------------------------------+
|덕후냄새가 난다.      |[덕후, 냄새, 가, 난다, .]          |
|넌 눈치도 없니? 낄끼빠빠!|[넌, 눈치, 도, 없, 니, ?, 낄끼빠빠, !]|
|버카충했어?         |[버카충, 했, 어, ?]              |
|C++            |[C++]                         |
+---------------+------------------------------+
```
### TF-IDF

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{CountVectorizer, IDF}
import com.github.uosdmlab.nkp.Tokenizer

val df = spark.createDataset(
  Seq(
    "아버지가방에들어가신다.",
    "사랑해요 제플린!",
    "스파크는 재밌어",
    "나는야 데이터과학자",
    "데이터야~ 놀자~"
  )
).toDF("text")

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setFilter("N")

val cntVec = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("tf")

val idf = new IDF()
  .setInputCol("tf")
  .setOutputCol("tfidf")

val pipe = new Pipeline()
  .setStages(Array(tokenizer, cntVec, idf))

val pipeModel = pipe.fit(df)

val result = pipeModel.transform(df)

result.show
```

output:

```
+------------+-------------+--------------------+--------------------+
|        text|        words|                  tf|               tfidf|
+------------+-------------+--------------------+--------------------+
|아버지가방에들어가신다.|    [아버지, 방]| (9,[1,5],[1.0,1.0])|(9,[1,5],[1.09861...|
|   사랑해요 제플린!|   [사랑, 제플린]| (9,[3,8],[1.0,1.0])|(9,[3,8],[1.09861...|
|    스파크는 재밌어|        [스파크]|       (9,[6],[1.0])|(9,[6],[1.0986122...|
|  나는야 데이터과학자|[나, 데이터, 과학자]|(9,[0,2,7],[1.0,1...|(9,[0,2,7],[0.693...|
|  데이터야~ 놀자~|   [데이터, 놀자]| (9,[0,4],[1.0,1.0])|(9,[0,4],[0.69314...|
+------------+-------------+--------------------+--------------------+
```
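The fitted `pipeModel` can also be applied to sentences it has not seen before; the vocabulary and IDF weights learned from `df` are reused as-is. A small sketch (the new sentence is made up for illustration):

```scala
// Vectorize previously unseen text with the already-fitted pipeline.
val newDf = spark.createDataset(
  Seq("스파크로 데이터과학자가 되자")  // hypothetical new sentence
).toDF("text")

pipeModel.transform(newDf).select("words", "tfidf").show
```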
## API

### Tokenizer

A transformer that splits sentences into morphemes. With the `setFilter` method you can keep only the morphemes whose parts of speech you want. See the POS tag table below for the available tags.
```scala
import com.github.uosdmlab.nkp.Tokenizer

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setFilter("N", "V", "SN") // keep only substantives, predicates, and numbers
```
POS tags:

Tag | Description |
---|---|
EP | pre-final ending (선어말어미) |
E | ending (어미) |
I | independent words such as interjections (독립언) |
J | relational words, i.e. particles (관계언) |
M | modifiers (수식언) |
N | substantives (체언; this is where nouns belong) |
S | symbols (부호) |
SL | foreign words (외국어) |
SH | Chinese characters (한자) |
SN | numbers (숫자) |
V | predicates (용언; this is where verbs belong) |
VCP | positive copula (긍정지정사) |
XP | prefixes (접두사) |
XS | suffixes (접미사) |
XR | roots (어근) |
```scala
transform(dataset: Dataset[_]): DataFrame
setFilter(pos: String, poses: String*): Tokenizer
setInputCol(value: String): Tokenizer
setOutputCol(value: String): Tokenizer
getFilter: Array[String]
getInputCol: String
getOutputCol: String
```
### Analyzer

A transformer for morphological analysis. It takes as input the sentences to analyze and an `id` that distinguishes each sentence.

```scala
import com.github.uosdmlab.nkp.Analyzer

val analyzer = new Analyzer
```

The input DataFrame must have the following columns. An error occurs if the values of the `id` column are not unique. A unique ID can easily be generated with Spark's SQL function `monotonically_increasing_id`, as in the Analyzer example above.
Name | Description |
---|---|
id | unique ID distinguishing each text |
text | text to analyze |
The output DataFrame has the following columns:

Name | Description |
---|---|
id | unique ID distinguishing each text |
word | word |
pos | part of speech (POS) |
char | characteristic; seunjeon's feature |
start | word start position |
end | word end position |
For a detailed description of each POS tag, please refer to seunjeon's POS tag description spreadsheet.
```scala
transform(dataset: Dataset[_]): DataFrame
setIdCol(value: String): Analyzer
setTextCol(value: String): Analyzer
setWordCol(value: String): Analyzer
setPosCol(value: String): Analyzer
setCharCol(value: String): Analyzer
setStartCol(value: String): Analyzer
setEndCol(value: String): Analyzer
getIdCol: String
getTextCol: String
getWordCol: String
getPosCol: String
getCharCol: String
getStartCol: String
getEndCol: String
```
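All column names have defaults matching the tables above (`id`, `text`, `word`, `pos`, `char`, `start`, `end`) and can be changed with these setters. A minimal sketch, assuming the setters chain like their `Tokenizer` counterparts; the column names `docId` and `sentence` are made up:

```scala
import com.github.uosdmlab.nkp.Analyzer

// Read input from "docId"/"sentence" instead of the default "id"/"text".
val analyzer = new Analyzer()
  .setIdCol("docId")
  .setTextCol("sentence")
```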
### Dictionary

An object for managing the user-defined dictionary. Words added to `Dictionary` are applied to both `Tokenizer` and `Analyzer`. User-defined words can be added through the `addWords` or `addWordsFromCSV` methods.
```scala
import com.github.uosdmlab.nkp.Dictionary

Dictionary
  .addWords("덕후", "낄끼+빠빠,-100")
  .addWords(Seq("버카충,-100", "C\\+\\+"))
  .addWordsFromCSV("path/to/CSV1", "path/to/CSV2")
  .addWordsFromCSV("path/to/*.csv")

Dictionary.reset() // reset the user-defined dictionary
```
```scala
addWords(word: String, words: String*): Dictionary
addWords(words: Traversable[String]): Dictionary
addWordsFromCSV(path: String, paths: String*): Dictionary
addWordsFromCSV(paths: Traversable[String]): Dictionary
reset(): Dictionary
```
CSV files passed to `addWordsFromCSV` must have no header and must have two columns, `word` and `cost`. `cost` is the word's appearance cost: the lower it is, the more likely the word is to appear. `cost` may be omitted. Because CSV files are loaded with `spark.read.csv`, files residing on HDFS can be used as well.

Below is an example CSV file:
```
덕후
낄끼+빠빠,-100
버카충,-100
C\+\+
```
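Such a file is registered by path; since `spark.read.csv` does the loading, a local path, a glob, or an HDFS URI should all work. A minimal sketch with a made-up location:

```scala
import com.github.uosdmlab.nkp.Dictionary

// "hdfs:///user/me/userdict.csv" is a hypothetical path; use your own file.
Dictionary.addWordsFromCSV("hdfs:///user/me/userdict.csv")
```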
As in the sample above, `+` can be used to register a compound noun. To register the `+` character itself in the dictionary, use `\+`.
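For example, the following sketch registers one compound noun and one word containing literal `+` characters (the word choices and cost are illustrative):

```scala
import com.github.uosdmlab.nkp.Dictionary

Dictionary.addWords(
  "데이터+과학자,-100", // compound noun: 데이터과학자 = 데이터 + 과학자
  "C\\+\\+"          // escaped '+' registers the literal word C++
)
```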
## Test

```bash
sbt test
```
## Warning

This package was built against Spark 2.0.
## Thanks

Many thanks to 유영호 and 이용운 of the 은전한닢 (eunjeon) project! It was a great help to my research.