Korean Semantic Network of Adposition and Case Supersense
This is a work in progress. Our goal is to extend the currently available public dataset and publish it as an addition to Universal Dependencies in the CoNLL-U format, as well as to Xposition in the augmented CoNLL-U-Lex format, complete with adposition annotation.
In addition to the dataset described below, we also publicly share the code that produces the published dataset, starting from our previously published dataset. The code relies on Stanza parsers (Qi et al. 2020) to generate much of the syntactic and morphological analysis. We share the code so that our dataset is replicable, flexible with respect to new models, and transparent about linguistic decisions.
Running `run.sh` will produce `little_prince_ko.conllu` as well as other byproducts:
- `little_prince_ko.tsv` is our previously published dataset.
- `little_prince_ko.json` is the same content, in JSON format.
- `little_prince_raw_sentences.json` is a list of sentences, used as input to the Stanza parsers.
- `little_prince_stanza.json` holds the analyses from the Stanza parsers.
- `little_prince_merged.json` is the result of aligning tokens from the original annotations with Stanza tokens.
- `little_prince_annotation_ready.json` is the UD-compliant version of K-SNACS.
- `little_prince_ko.conllu` is the same dataset in CoNLL-U form.
- `little_prince_ko.conllulex` is the same dataset in CoNLL-U-Lex form, with manual and automatic additions made via `util.generate_col19()`.
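As a quick sanity check on the final output, the CoNLL-U file can be consumed with nothing beyond the standard library. Below is a minimal sketch that counts sentences and word tokens; the two-sentence sample is invented for illustration and is not actual corpus data.

```python
# Minimal sketch of consuming a CoNLL-U file with only the standard library.
# The two-sentence sample below is invented for illustration.
sample_lines = [
    "# sent_id = 1",
    "\t".join(["1", "나는", "나", "PRON", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "간다", "가다", "VERB", "_", "_", "0", "root", "_", "_"]),
    "",
    "# sent_id = 2",
    "\t".join(["1", "안녕", "안녕", "INTJ", "_", "_", "0", "root", "_", "_"]),
]

def count_conllu(lines):
    """Count sentences and plain word tokens in a sequence of CoNLL-U lines."""
    sentences, tokens, in_sentence = 0, 0, False
    for line in lines:
        if line.startswith("#"):
            continue             # comment line
        if not line.strip():
            in_sentence = False  # blank line ends a sentence
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        token_id = line.split("\t")[0]
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if token_id.isdigit():
            tokens += 1
    return sentences, tokens

print(count_conllu(sample_lines))  # prints (2, 3)
```

The same loop works unchanged on the real `little_prince_ko.conllu` when read line by line.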
File: little_prince_ko.conllu
Data Version:
- Current: 1.1
- Compatible with K-SNACS Guidelines v0.9
Data Info:
- Title: 어린 왕자 (erin wangca) "The Little Prince"
- Author: Antoine de Saint-Exupéry
- Original Language: French (Le Petit Prince)
- Genre: Children's literature, novella
Column Description: Follows CoNLL-U format
- ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
- FORM: Word form or punctuation symbol.
- LEMMA: Lemma or stem of word form.
- UPOS: Universal part-of-speech tag.
- XPOS: Optional language-specific (or treebank-specific) part-of-speech / morphological tag; underscore if not available.
- FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
- HEAD: Head of the current word, which is either a value of ID or zero (0).
- DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
- DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
- MISC: Any other annotation.
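Since every non-comment line is simply ten tab-separated fields in the order above, unpacking a token line takes only a few lines of code. The example line below is invented for illustration, not taken from the corpus.

```python
# The ten CoNLL-U columns, in order, as described above.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def parse_token_line(line):
    """Unpack one tab-separated CoNLL-U token line into a column -> value dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError("a CoNLL-U token line has exactly 10 columns")
    return dict(zip(COLUMNS, fields))

# Invented example line, not actual corpus data:
token = parse_token_line("\t".join(
    ["3", "왕자", "왕자", "NOUN", "_", "_", "2", "obj", "_", "_"]))
print(token["FORM"], token["UPOS"], token["HEAD"])
```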
License: This dataset's supersense annotations are licensed under CC BY 4.0 (Creative Commons Attribution 4.0 International license).
File: k-snacs-guideline-appendix-v0.9.pdf
Guideline Version:
- Current: 0.9
- Compatible with English SNACS v2.5
- Please note that this document is an appendix to the English SNACS guidelines above, containing only language-specific information that merits further detail. For full definitions of labels and use cases, please refer to the English guidelines.
Here, we note that while the UD datasets for Japanese and Hindi (where adpositions likewise attach to head words to form a single orthographic word) split adpositions off as free-standing tokens, we instead annotate them on special nodes in the enhanced dependency graph. Each such node occupies its own line, so it can receive its own annotations, and the treatment is consistent (adpositions receive the ADP tag). This choice reflects the highly agglutinative nature of Korean, and the consequent agreement among Korean linguists that adpositions (and other meaning-carrying sub-orthographic-word elements) should not be separated into free-standing tokens for UD and similar projects.
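In CoNLL-U terms, such nodes are distinguishable by their ID alone: basic word lines carry integer IDs, multiword-token lines carry ranges like 1-2, and empty nodes in the enhanced graph carry decimal IDs like 1.1. A small sketch of telling the three apart (the helper name is ours, not part of any released code):

```python
def id_kind(token_id):
    """Classify a CoNLL-U ID string as 'word', 'range', or 'empty'.

    Empty nodes (decimal IDs such as '1.1') are the extra lines on which
    adpositions can receive their own ADP tag and annotations.
    """
    if token_id.isdigit():
        return "word"
    head, sep, tail = token_id.partition("-")
    if sep and head.isdigit() and tail.isdigit():
        return "range"   # multiword token, e.g. "1-2"
    head, sep, tail = token_id.partition(".")
    if sep and head.isdigit() and tail.isdigit():
        return "empty"   # empty node in the enhanced graph
    raise ValueError(f"not a CoNLL-U ID: {token_id!r}")
```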
When using this data, please cite the following as appropriate:
Original K-SNACS annotations (Hwang et al., 2020):
Hwang, Jena D., Hanwool Choe, Na-Rae Han, and Nathan Schneider. "K-SNACS: Annotating Korean adposition semantics." In Proceedings of the Second International Workshop on Designing Meaning Representations. 2020.
- Jena Hwang - Allen Institute for AI
- Na-Rae Han - University of Pittsburgh
- Hanwool Choe - University of Hong Kong
- Hyun Min - Georgetown University
- Nathan Schneider - Georgetown University
- Elli Ahn - Harvard University
- Vivek Srikumar - University of Utah
- Austin Blodgett - Georgetown University
- NSF award IIS-1812778
- BSF grant 2016375