Skip to content

Aatlantise/k-snacs-ud

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

74 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

K-SNACS Dataset and Guidelines

Korean Semantic Network of Adposition and Case Supersense

Work in progress

This is a work in progress. Our goal is to add to the currently publicly available dataset and publish as an addition to Universal Dependencies in the CoNLL-U format, as well as to Xposition in the augmented CoNLL-U-Lex format complete with adposition annotation.

Codebase

In addition to the dataset, as described below, we also share publicly code that results in the published dataset, starting from our previously published dataset. In the code, we rely on Stanza parsers (Qi et al. 2020) to generate much of the syntactic and morphological analyses. We share the code so that our dataset is replicable, flexible to new models, and transparent with linguistic decisions.

Running run.sh will produce little_prince_ko.conllu as well as other byproducts:

little_prince_ko.tsv is our previously published dataset. little_prince_ko.json is the same content, in json format. little_prince_raw_sentences.json is a list of sentences, as inputs to Stanza parsers. little_prince_stanza.json holds analyses from Stanza parsers. little_prince_merged.json is the result of alignment between tokens from original annotations and Stanza tokens. little_prince_annotation_ready.json is the UD-compliant version of k-SNACS, and little_prince_ko.conllu is the same dataset in CoNLL-U form. little_prince_ko.conllulex is the same dataset in CoNLL-U-Lex form, with manual and automatic additions using util.generate_col19().

Dataset

File: little_prince_ko.conllu

Data Version:

  • Current: 1.1
  • Compatible with K-SNACS Guidelines v0.9

Data Info:

  • Title: 어린 왕자 (erin wangca) "The Little Prince"
  • Author: Atoine de Saint-Exupéry
  • Original Language: French (Le Petit Prince)
  • Genre: Childrens literature, Novella

Column Description: Follows CoNLL-U format

  • ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
  • FORM: Word form or punctuation symbol.
  • LEMMA: Lemma or stem of word form.
  • UPOS: Universal part-of-speech tag.
  • XPOS: Optional language-specific (or treebank-specific) part-of-speech / morphological tag; underscore if not available.
  • FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
  • HEAD: Head of the current word, which is either a value of ID or zero (0).
  • DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
  • DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
  • MISC: Any other annotation.

License: This dataset's supersense annotations are licensed under CC BY 4.0 (Creative Commons Attribution-ShareAlike 4.0 International license).

K-SNACS Guidelines

File: k-snacs-guideline-appendix-v0.9.pdf

Guideline Version:

  • Current: 0.9
  • Compatible with English SNACS v2.5
  • Please note that this document is an appendix to the above English SNACS guidelines, including only language-specific information that merits further detailing. For full definitions of labels and use cases, please refer to English guidelines.

Here, we note that while UD datasets in Japanese and Hindi (where adpositions add to head words to form orthographic word) separate adposition as separate, free-standing tokens, we annotate them under a special node in the enhance graph. Each node is a separate line, so can receive annotations, and is consistent (adpositions receive ADP). This is due to the highly agglutinative nature of Korean, and the consequent agreement among Korean linguists that we will not separate adpositions (or other meaning-carrying sub-orthographic-word elements) as free-standing tokens for UD and similar projects.

Citing this work

When using this data, please cite the following as appropriate:

Original k-SNACS annotations Hwang et al., 2020:

Hwang, Jena D., Hanwool Choe, Na-Rae Han, and Nathan Schneider. "K-SNACS: Annotating Korean adposition semantics." In Proceedings of the Second International Workshop on Designing Meaning Representations. 2020.

K-SNACS Team

Key Collaborators:

Special Thanks to:

This research was supported in part by:

  • NSF award IIS-1812778
  • BSF grant 2016375

About

k-sncacs dataset for Universal Depdencies

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •