54 changes: 54 additions & 0 deletions resources/examples/tag_to_tsv/README.md
@@ -0,0 +1,54 @@
# Tag and export to TSV for manual annotation

This example shows how to tag the text from a file and then export the tagged text to TSV format for manual annotation or checking. Once in TSV format, the output can be opened in Microsoft Excel or another spreadsheet editor, e.g. Google Sheets.

The TSV file we will create will have the following headers:

- "Token" - The predicted word or token text, e.g. "Bank". The token text itself is not generated, but the PyMUSAS tool does determine how to break the text up into tokens. This is what we mean by predicted, and it means that the splitting of the text into tokens could be incorrect.
- "POS" - The predicted Part Of Speech of the token, e.g. "NOUN".
- "Predicted-USAS" - A semicolon-separated list of predicted USAS tags, whereby the first USAS tag in the list should be the most probable, e.g. "H4;I1.2".
- "Corrected-USAS" - A semicolon-separated list of corrected USAS tags, whereby the first USAS tag in the list should be the most probable, e.g. "H1;I1.1".
- "Errors" - A semicolon-separated list of errors, e.g. "WRONG-POS;WRONG-TOKEN". It is best to pre-define the set of error labels so that error statistics can be generated when analysing the dataset, e.g. 10% of samples have POS tag errors.
- "Notes" - Any additional notes that are relevant to annotating this sample. These notes can be anything, but they should be useful to another annotator or colleague who is working on this project now or in the future.

Only the first three headers, "Token", "POS", and "Predicted-USAS", will have content; the remaining headers have to be filled in by an annotator.
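
As a rough illustration of the layout described above, the following sketch writes the six headers and a single row to an in-memory TSV using Python's standard `csv` module. The helper name `rows_to_tsv` is hypothetical and is not part of PyMUSAS or of `tag_and_export.py`, and the row values are the illustrative ones from the header descriptions rather than real tagger output.

```python
import csv
import io

FIELDNAMES = ["Token", "POS", "Predicted-USAS", "Corrected-USAS", "Errors", "Notes"]


def rows_to_tsv(rows):
    """Serialise rows (dicts keyed by header name) to a TSV string.

    Hypothetical helper for illustration only.
    """
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, dialect="excel-tab")
    writer.writeheader()
    for row in rows:
        # Headers missing from `row` are written as empty cells,
        # which is how the annotator-only columns start out.
        writer.writerow(row)
    return buffer.getvalue()


# Illustrative values, not real tagger output.
example_tsv = rows_to_tsv(
    [{"Token": "Bank", "POS": "NOUN", "Predicted-USAS": "H4;I1.2"}]
)
print(example_tsv)
```

The "Corrected-USAS", "Errors", and "Notes" cells come out empty, ready to be filled in within a spreadsheet editor.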


## Chinese Example

This example shows how to create this TSV format from a Chinese text. We assume the text file is at the following path: [./data/zh_text.txt](./data/zh_text.txt)

First install both the [Chinese PyMUSAS RuleBasedTagger spaCy component](https://github.com/UCREL/pymusas-models/releases/tag/cmn_dual_upos2usas_contextual-0.3.3) and the [Transformer Chinese spaCy model](https://spacy.io/models/zh#zh_core_web_trf) (**Note** you can use any spaCy model, but we are choosing the most powerful model so that we get the most accurate tokenizer and POS tagger):

``` bash
pip install https://github.com/UCREL/pymusas-models/releases/download/cmn_dual_upos2usas_contextual-0.3.3/cmn_dual_upos2usas_contextual-0.3.3-py3-none-any.whl
pip install zh_core_web_trf@https://github.com/explosion/spacy-models/releases/download/zh_core_web_trf-3.8.0/zh_core_web_trf-3.8.0.tar.gz
```

Then we can tag the text and export the generated TSV file to `./zh_tagged_text.tsv`. The text we are tagging is the introduction to the ["Bank" Wikipedia page](https://zh.wikipedia.org/wiki/%E9%8A%80%E8%A1%8C).

``` bash
python tag_and_export.py zh_core_web_trf ./data/zh_text.txt ./zh_tagged_text.tsv
```

**Note** - `zh_core_web_trf` is the name of the spaCy model we have installed and want to use.

**Note** - The POS tags are from the [Universal Dependency POS tagset.](https://universaldependencies.org/u/pos/)
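
Once an annotator has filled in the "Errors" column, error statistics of the kind mentioned in the header descriptions can be computed from the TSV. The sketch below is one possible way to do this with the standard library only. The function name `error_rates`, the sample tokens, and the error labels are all assumptions made for illustration. It tolerates an optional space after each semicolon.

```python
import csv
import io
from collections import Counter


def error_rates(tsv_text):
    """Return {error_label: fraction of rows that carry that label}.

    Hypothetical helper: assumes an "Errors" header holding a
    semicolon-separated list of labels, as described in this README.
    """
    reader = csv.DictReader(io.StringIO(tsv_text, newline=""), dialect="excel-tab")
    counts = Counter()
    total_rows = 0
    for row in reader:
        total_rows += 1
        # Use a set so a label repeated within one cell is counted once per row.
        labels = {label.strip() for label in (row.get("Errors") or "").split(";")}
        labels.discard("")
        counts.update(labels)
    if total_rows == 0:
        return {}
    return {label: count / total_rows for label, count in counts.items()}


# Made-up annotated sample: four tokens, two of which have errors recorded.
annotated = (
    "Token\tPOS\tPredicted-USAS\tCorrected-USAS\tErrors\tNotes\n"
    "t1\tNOUN\tH4;I1.2\tH1;I1.1\tWRONG-POS\t\n"
    "t2\tVERB\tH4\t\t\t\n"
    "t3\tNOUN\tI1.2\t\tWRONG-POS; WRONG-TOKEN\tsplit incorrectly\n"
    "t4\tPUNCT\tH4\t\t\t\n"
)
rates = error_rates(annotated)
print(rates)
```

Keeping the error labels to a pre-defined set, as recommended above, is what makes the keys of the returned dictionary comparable across annotators.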

### Excel specific

If you would like to export to Excel format specifically, first install the following Python dependency:

``` bash
pip install XlsxWriter
```

Then run the following to get an Excel version:

``` bash
python tag_and_export.py zh_core_web_trf ./data/zh_text.txt ./zh_tagged_text.xlsx --excel-format
```

**Note** - the output file name extension is different here: it is `.xlsx`, the extension used by Excel.
3 changes: 3 additions & 0 deletions resources/examples/tag_to_tsv/data/zh_text.txt
@@ -0,0 +1,3 @@
銀行是吸收公众存款、发放貸款、办理结算等業務的金融機構。绝大多数银行都实行部分准备金制度,它允许银行在向中央银行交存一定比例准备金后,将剩余部分资金用于发放贷款等业务,银行將存戶的錢拿來发放信用贷款與投資理財的同时,自然派生出新的經濟活動與金錢收益,一吸一吐使得其具备等於货币创造的能力,並透過支付轉移,向市場注入流動性的強心劑,打通國家經濟血脈的活絡,以維持現代社會機能的運作。

银行在金融体系中扮演着重要角色,对一国乃至全球金融稳定产生巨大影响,通常都被置于各国金融管理当局的严格监管之下。在满足不同国家或地区金融监管当局各具差异的监管规则之余,银行还普遍遵循基于巴塞尔协定的最低资本要求。
71 changes: 71 additions & 0 deletions resources/examples/tag_to_tsv/tag_and_export.py
@@ -0,0 +1,71 @@
import csv
from pathlib import Path

import typer
from typing_extensions import Annotated
import spacy


def main(spacy_model_name: str,
         text_to_tag_file_path: Annotated[Path,
                                          typer.Argument(exists=True, file_okay=True, dir_okay=False, readable=True)],
         tagged_text_output_file_path: Annotated[Path,
                                                 typer.Argument(dir_okay=False)],
         excel_format: bool = False) -> None:
    # We exclude the following components as we do not need them.
    nlp = spacy.load(spacy_model_name, exclude=['parser', 'ner'])
    # Load the Chinese PyMUSAS rule-based tagger in a separate spaCy pipeline
    chinese_tagger_pipeline = spacy.load('cmn_dual_upos2usas_contextual')
    # Adds the Chinese PyMUSAS rule-based tagger to the main spaCy pipeline
    nlp.add_pipe('pymusas_rule_based_tagger', source=chinese_tagger_pipeline)

    with text_to_tag_file_path.open('r', encoding='utf-8') as text_fp:
        fieldnames = ['Token', 'POS', 'Predicted-USAS', 'Corrected-USAS', 'Errors', 'Notes']
        if not excel_format:
            # newline='' lets the csv module control the line endings itself.
            with tagged_text_output_file_path.open('w', encoding='utf-8', newline='') as csv_fp:
                tsv_writer = csv.DictWriter(csv_fp, fieldnames=fieldnames, dialect='excel-tab', delimiter='\t')
                tsv_writer.writeheader()
                for line in text_fp:
                    line = line.strip()
                    # Skip blank lines, e.g. the gap between paragraphs.
                    if not line:
                        continue
                    output_doc = nlp(line)
                    for token in output_doc:
                        token_text = token.text
                        pos_tag = token.pos_
                        usas_tags = token._.pymusas_tags
                        formatted_usas_tags = '; '.join(usas_tags).strip(" ")
                        # The annotator-only headers are left empty.
                        tsv_writer.writerow({
                            'Token': token_text,
                            'POS': pos_tag,
                            'Predicted-USAS': formatted_usas_tags
                        })
        else:
            # Imported here so that XlsxWriter is only required for Excel output.
            import xlsxwriter

            with xlsxwriter.Workbook(str(tagged_text_output_file_path.resolve())) as excel_workbook:
                worksheet = excel_workbook.add_worksheet()
                bold = excel_workbook.add_format({'bold': 1})

                # Write a bold header row and size each column to fit its header.
                for field_index, field_name in enumerate(fieldnames):
                    worksheet.write(0, field_index, field_name, bold)
                    worksheet.set_column(field_index, field_index, len(field_name) + 2)
                row = 1
                for line in text_fp:
                    line = line.strip()
                    if not line:
                        continue
                    output_doc = nlp(line)
                    for token in output_doc:
                        token_text = token.text
                        pos_tag = token.pos_
                        usas_tags = token._.pymusas_tags
                        formatted_usas_tags = '; '.join(usas_tags).strip(" ")
                        worksheet.write_string(row, 0, token_text)
                        worksheet.write_string(row, 1, pos_tag)
                        worksheet.write_string(row, 2, formatted_usas_tags)
                        worksheet.write_blank(row, 3, None)
                        worksheet.write_blank(row, 4, None)
                        row += 1


if __name__ == "__main__":
    typer.run(main)