UCREL/korean-lexicon-spreadsheet-to-tsv

Korean lexicon spreadsheet to TSV

This converts an Excel spreadsheet to CSV, and then to TSV, so that the result is in the Single Word Lexicon TSV format used by PyMUSAS.

More specifically, this converts an Excel spreadsheet that contains the following fields:

  • "K_WORD" - The lemma
  • "K_POS" - Part Of Speech (POS) tag associated with the lemma
  • "USAS1" - Most likely set of semantic tags associated with the lemma and POS
  • "USAS2" - Second most likely set of semantic tags associated with the lemma and POS
  • "USAS3" - Third most likely set of semantic tags associated with the lemma and POS
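
As a rough illustration of the expected layout, the sketch below reads a one-row sample with those five columns and reports any that are missing. It uses only the Python standard library and is not part of the repository's scripts. The sample lemma, the POS tag "NNG", the semantic tag "P1", and the helper name `missing_columns` are all assumptions made for the example.

```python
import csv
import io

# Column names taken from the field list above.
REQUIRED_COLUMNS = ["K_WORD", "K_POS", "USAS1", "USAS2", "USAS3"]

# Hypothetical sample data: the lemma, POS tag, and USAS tag are illustrative only.
SAMPLE_CSV = (
    "K_WORD,K_POS,USAS1,USAS2,USAS3\n"
    "학교,NNG,P1,,\n"
)


def missing_columns(header):
    """Return the required columns that are absent from a header row."""
    return [column for column in REQUIRED_COLUMNS if column not in header]


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
# An empty list means every required column is present.
print(missing_columns(reader.fieldnames))
for row in reader:
    print(row["K_WORD"], row["K_POS"], row["USAS1"])
```

A check like this can be useful before running the conversion, since a renamed or missing column in the spreadsheet would otherwise only surface later.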

Python requirements

Supports Python versions 3.10-3.14.

pip install "openpyxl>=3.1.5" "pandas>=2.3.3" "typer>=0.25.1"

Lexicon file conversion

The following converts the data on sheet "sheet1" of "lexicon.xlsx" into CSV format and saves it to "lexicon.csv":

python spreadsheet_to_csv_lexicon.py "lexicon.xlsx" "sheet1" lexicon.csv

Help guide:

                                                                                                                                               
 Usage: spreadsheet_to_csv_lexicon.py [OPTIONS] LEXICON_SPREADSHEET_FILE_PATH                                                                         
                                      SPREADSHEET_SHEET_NAME OUTPUT_FILE_PATH                                                                         
                                                                                                                                                      
 Convert the lexicon spreadsheet file to a CSV file.                                                                                                  
                                                                                                                                                      
╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    lexicon_spreadsheet_file_path      PATH  Path to the lexicon spreadsheet file to convert to a CSV [required]                                  │
│ *    spreadsheet_sheet_name             TEXT  Name of the spreadsheet sheet to convert to a CSV [required]                                         │
│ *    output_file_path                   PATH  Path to the output CSV file [required]                                                               │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                                                                                        │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Then we can convert the CSV file into a Single Word Lexicon TSV format file.

python csv_lexicon_to_pymusas_lexicon.py lexicon.csv lexicon.tsv

Note: This conversion script:

  • Raises a ValueError if duplicate (lemma and POS tag) entries exist.
  • Skips, with a printed warning, any row that has an empty lemma or an empty POS tag.
  • Removes Z99 semantic tags, and skips any entry whose only semantic tag is Z99.
  • Removes duplicated semantic tags per (lemma and POS tag) entry.
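
The four rules above can be sketched as follows. This is a minimal illustration using only the standard library, not the code in csv_lexicon_to_pymusas_lexicon.py. The function names `convert_rows` and `write_tsv`, the sample rows, the order in which the checks run, and the way the three USAS columns are merged into one tag list are all assumptions. The output header (`lemma`, `semantic_tags`, `pos`, with tags separated by spaces) reflects my understanding of the PyMUSAS single word lexicon format and should be checked against the PyMUSAS documentation.

```python
import csv
import io

USAS_COLUMNS = ("USAS1", "USAS2", "USAS3")


def convert_rows(rows):
    """Apply the four rules to an iterable of CSV row dictionaries."""
    entries = {}
    for row_number, row in enumerate(rows, start=1):
        lemma = (row.get("K_WORD") or "").strip()
        pos = (row.get("K_POS") or "").strip()
        # Rule 2: skip rows with an empty lemma or POS tag, with a warning.
        if not lemma or not pos:
            print(f"Warning: skipping row {row_number}, empty lemma or POS tag")
            continue
        # Rule 1: duplicate (lemma, POS tag) entries are an error.
        if (lemma, pos) in entries:
            raise ValueError(f"Duplicate entry: {lemma!r} with POS {pos!r}")
        tags = []
        for column in USAS_COLUMNS:
            # Assumption: a cell may hold several tags separated by whitespace.
            for tag in (row.get(column) or "").split():
                # Rules 3 and 4: drop Z99 and repeated tags, keeping order.
                if tag != "Z99" and tag not in tags:
                    tags.append(tag)
        entries[(lemma, pos)] = tags
    # Rule 3 (continued): entries left with no tags are skipped.
    return [(lemma, pos, tags) for (lemma, pos), tags in entries.items() if tags]


def write_tsv(entries, handle):
    """Write the entries as tab-separated rows with a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["lemma", "semantic_tags", "pos"])
    for lemma, pos, tags in entries:
        writer.writerow([lemma, " ".join(tags), pos])


# Hypothetical sample rows: one normal entry, one Z99-only entry, one empty lemma.
SAMPLE_CSV = (
    "K_WORD,K_POS,USAS1,USAS2,USAS3\n"
    "학교,NNG,P1,Z99,P1\n"
    "가다,VV,Z99,,\n"
    ",NNG,A1,,\n"
)
entries = convert_rows(csv.DictReader(io.StringIO(SAMPLE_CSV)))
output = io.StringIO()
write_tsv(entries, output)
print(output.getvalue())
```

With the sample rows, only the first entry survives: its repeated P1 tag is collapsed and its Z99 tag removed, the second entry is dropped because it contains only Z99, and the third row is skipped with a warning because its lemma is empty.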

Help guide:

 Usage: csv_lexicon_to_pymusas_lexicon.py [OPTIONS] INPUT_PATH OUTPUT_PATH                                                                            
                                                                                                                                                      
╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    input_path       PATH  Input CSV lexicon file [required]                                                                                      │
│ *    output_path      PATH  Output TSV lexicon file [required]                                                                                     │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                                                                                        │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Development Setup

If you want to develop on this codebase, follow this setup guide.

You can either use the dev container with your favourite editor, e.g. VSCode, or create your setup locally. Both approaches are described below.

Both approaches use the same tools:

  • uv for Python packaging and development
  • make (OPTIONAL) for automation of tasks, not strictly required but makes life easier.

Dev Container

A dev container uses a docker container to create the required development environment. The Dockerfile we use for this dev container can be found at ./.devcontainer/Dockerfile. Running it locally requires docker to be installed. You can also run it in a cloud based code editor; for a list of supported editors/cloud editors see the following webpage.

To run it for the first time in a local VSCode editor (a slightly more detailed guide is available on the VSCode website):

  1. Ensure docker is running.
  2. Ensure the VSCode Dev Containers extension is installed in your VSCode editor.
  3. Open the command palette (CMD + SHIFT + P) and then select Dev Containers: Rebuild and Reopen in Container

You should now have everything you need to develop: uv, make, and, for VSCode, various extensions such as Pylance.

If you have any trouble, see the VSCode website.

Local

To run locally, first ensure you have the following tools installed:

  • uv for Python packaging and development. (version 0.9.9)
  • make (OPTIONAL) for automation of tasks, not strictly required but makes life easier.
    • Ubuntu: apt-get install make
    • Mac: The Xcode command line tools include make; otherwise you can use brew.
    • Windows: This blog post proposes various ways to install make on Windows, including Cygwin and Windows Subsystem for Linux.

When developing on the project you will want to install the Python package locally in editable format with all the extra requirements. This can be done like so:

uv sync

Running linters and tests

Linting and formatting are done with ruff, which is a replacement for tools like Flake8, isort, and Black, and we use ty for type checking.

To run the linting:

make lint

Testing

To run the tests (uses pytest and coverage) and generate a coverage report:

make test
