Commit 8deb01c

committed
Combine/merge lexicon collection usage example
1 parent 6131236 commit 8deb01c

2 files changed

Lines changed: 265 additions & 0 deletions

File tree

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
1616
- The ability to merge `LexiconCollection`s through `pymusas.lexicon_collection.LexiconCollection.merge`.
1717
- The ability to merge `LexiconCollection` data through a list of file paths to TSV files using `pymusas.lexicon_collection.LexiconCollection.tsv_merge`, which when merged will allow the creation of a combined `LexiconCollection` instance.
1818
- The ability to merge `MWELexiconCollection` data through a list of file paths to TSV files using `pymusas.lexicon_collection.MWELexiconCollection.tsv_merge`, which when merged will allow the creation of a combined `MWELexiconCollection` instance.
19+
- Added a usage example to the documentation showing how to combine/merge lexicon collections and add them to a PyMUSAS rule-based tagger.
1920

2021
### Changed
2122

Lines changed: 264 additions & 0 deletions
@@ -0,0 +1,264 @@
1+
---
2+
title: Combine/Merge Lexicons
3+
sidebar_position: 3
4+
---
5+
6+
In this guide we will show how to combine two lexicons, for both single word and Multi Word Expression (MWE) lexicons, so that the combined lexicon can be used in a single PyMUSAS [RuleBasedTagger](/api/spacy_api/taggers/rule_based#rulebasedtagger).
7+
8+
This approach is useful if you want the coverage of the existing lexicons that are available for a given language but also want to customize them. For example, you might want to:
9+
* Add domain specific language to the lexicons, e.g. `flat_* white_*` = `F2/Z3` (type of coffee).
10+
* Override an existing lexicon entry with a different semantic tag. For example, in the [English semantic lexicon](https://github.com/UCREL/Multilingual-USAS/blob/6b305509016b21cd9062c5f77c1f29313ca9cc53/English/semantic_lexicon_en.tsv#L586C1-L586C18) `Amazon PROPN` is associated with `Z2 M7`, the semantic tags for *Geographical names* and *Places*, but in your corpus you might want it to refer to the company and therefore change the semantic tag to `Z3`.
11+
12+
All of the existing lexicons for different languages can be found at the [Multilingual-USAS repository](https://github.com/UCREL/Multilingual-USAS/tree/master). In this guide we will only use the [English lexicons](https://github.com/UCREL/Multilingual-USAS/tree/master/English).
13+
14+
This guide shows how to create a PyMUSAS [RuleBasedTagger](/api/spacy_api/taggers/rule_based#rulebasedtagger) that uses the existing [English lexicons](https://github.com/UCREL/Multilingual-USAS/tree/master/English) together with additional custom lexicons that both extend and override them. The guide is broken down into:
15+
16+
1. Setup
17+
2. How the existing tagger performs
18+
3. How to customize the tagger through combining the existing lexicon with a custom lexicon
19+
20+
## Setup
21+
22+
Download both the [English PyMUSAS `RuleBasedTagger` spaCy component](https://github.com/UCREL/pymusas-models/releases/tag/en_dual_none_contextual-0.3.3) and the [small English spaCy model](https://spacy.io/models/en):
23+
24+
``` bash
25+
pip install https://github.com/UCREL/pymusas-models/releases/download/en_dual_none_contextual-0.3.3/en_dual_none_contextual-0.3.3-py3-none-any.whl
26+
python -m spacy download en_core_web_sm
27+
```
28+
29+
We are going to use 2 example custom lexicons. These are for example purposes only: we assume the custom lexicons you will use contain different/more entries, and you do not need to have both a single word and an MWE lexicon.
30+
31+
The custom single word lexicon, which we assume is saved to a file at `./custom_semantic_lexicon.tsv`:
32+
```tsv title="custom_semantic_lexicon.tsv"
33+
lemma pos semantic_tags
34+
Amazon PROPN Z3
35+
broligarchy NOUN S5
36+
```
37+
38+
The custom MWE lexicon, which we assume is saved to a file at `./custom_mwe.tsv`:
39+
``` tsv title="custom_mwe.tsv"
40+
mwe_template semantic_tags
41+
battery_NOUN farm_NOUN Z3/Y1/W3
42+
flat_* white_* F2/Z3
43+
```
44+
45+
These files can be saved anywhere locally or even hosted at a URL. Just change the file path in the code to the location of these files.
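
If you would rather create the two example files from Python than by hand, the following is a minimal sketch. It assumes the lexicon files are plain tab-separated text with the header rows shown above; the helper name `write_tsv` is made up for this example and is not part of PyMUSAS.

``` python
import csv
from pathlib import Path

# Rows copied from the two example custom lexicons shown above.
single_word_rows = [
    ("lemma", "pos", "semantic_tags"),
    ("Amazon", "PROPN", "Z3"),
    ("broligarchy", "NOUN", "S5"),
]
mwe_rows = [
    ("mwe_template", "semantic_tags"),
    ("battery_NOUN farm_NOUN", "Z3/Y1/W3"),
    ("flat_* white_*", "F2/Z3"),
]


def write_tsv(path: Path, rows: list[tuple[str, ...]]) -> None:
    """Hypothetical helper: write `rows` to `path` as tab-separated values."""
    with path.open("w", encoding="utf-8", newline="") as tsv_file:
        writer = csv.writer(tsv_file, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


write_tsv(Path("custom_semantic_lexicon.tsv"), single_word_rows)
write_tsv(Path("custom_mwe.tsv"), mwe_rows)
```

Writing the files with a tab delimiter keeps the space inside an MWE template such as `flat_* white_*` in a single column.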
46+
47+
The example sentence we are going to use throughout is:
48+
49+
``` python
50+
sentence = ("While drinking my flat white I was reading about the "
51+
"new battery farm that Amazon is creating which is owned by "
52+
"one of the broligarchy")
53+
```
54+
55+
Using this sentence and the custom lexicons we will show that we can better reflect the meaning of this sentence.
56+
57+
## How the existing tagger performs
58+
59+
Using the off-the-shelf [English PyMUSAS `RuleBasedTagger` spaCy component](https://github.com/UCREL/pymusas-models/releases/tag/en_dual_none_contextual-0.3.3) with its single word and MWE lexicons, we get the following tags for the example sentence:
60+
61+
``` tsv
62+
Text Lemma POS USAS Tags
63+
While while SCONJ ['Z5']
64+
drinking drink VERB ['A5.4+']
65+
my my PRON ['A5.4+']
66+
flat flat ADJ ['O4.4', 'O3', 'O4.1', 'K2', 'A5.3+']
67+
white white NOUN ['O4.3', 'O4.3/S2mf', 'F2', 'F1', 'B1']
68+
I I PRON ['Z8mf']
69+
was be AUX ['A3+', 'Z5']
70+
reading read VERB ['Q3', 'Q1.2', 'X3.2+', 'X2.5+', 'P1', 'A10+']
71+
about about ADP ['Z5']
72+
the the DET ['Z5']
73+
new new ADJ ['T3-']
74+
battery battery NOUN ['F4']
75+
farm farm NOUN ['F4']
76+
that that SCONJ ['Z5', 'Z8']
77+
Amazon Amazon PROPN ['Z2', 'M7']
78+
is be AUX ['A3+', 'Z5']
79+
creating create VERB ['A1.1.1', 'A2.2', 'E1']
80+
which which DET ['Z5', 'Z8']
81+
is be AUX ['A3+', 'Z5']
82+
owned own VERB ['A9+']
83+
by by ADP ['Z5']
84+
one one NUM ['N1', 'T3', 'T1.2']
85+
of of ADP ['Z5']
86+
the the DET ['Z5']
87+
broligarchy broligarchy NOUN ['Z99']
88+
```
89+
90+
As you can see, `flat white` is not recognised as a drink, `broligarchy` is not recognised at all (it is tagged `Z99`) as it is a new word according to the [Collins Dictionary](https://www.collinsdictionary.com/dictionary/english/brollies), `Amazon` is assumed to be a place, such as the rainforest in Brazil, and `battery farm` is recognised as a farm with livestock rather than a farm with batteries.
91+
92+
This was created using the following code:
93+
94+
<details>
95+
<summary>Python Script</summary>
96+
97+
``` python
98+
import spacy
99+
100+
101+
sentence = ("While drinking my flat white I was reading about the "
102+
"new battery farm that Amazon is creating which is owned by "
103+
"one of the broligarchy")
104+
105+
# We exclude the following components as we do not need them.
106+
nlp = spacy.load('en_core_web_sm', exclude=['parser', 'ner'])
107+
# Load the English PyMUSAS rule-based tagger in a separate spaCy pipeline
108+
english_tagger_pipeline = spacy.load('en_dual_none_contextual')
109+
# Adds the English PyMUSAS rule-based tagger to the main spaCy pipeline
110+
nlp.add_pipe('pymusas_rule_based_tagger', source=english_tagger_pipeline)
111+
112+
output_doc = nlp(sentence)
113+
114+
print(f'Text\tLemma\tPOS\tUSAS Tags')
115+
for token in output_doc:
116+
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')
117+
```
118+
119+
</details>
120+
121+
## How to customize the tagger through combining the existing lexicon with a custom lexicon
122+
123+
In the code below we first create a combined/merged single word lexicon from the existing single word lexicon in the Multilingual-USAS GitHub repository and our custom lexicon. This is done with `LexiconCollection.tsv_merge`, which downloads/loads the TSV files and then merges them. The last TSV file in the list overrides any matching lexicon entries from the files before it, so the custom lexicon(s) should come after the existing/general lexicon:
124+
125+
``` python
126+
# Get the existing single word lexicon from the Multilingual USAS repository
127+
existing_single_lexicon_url = ("https://raw.githubusercontent.com/UCREL/"
128+
"Multilingual-USAS/refs/heads/master/"
129+
"English/semantic_lexicon_en.tsv")
130+
custom_single_lexicon_path = Path("custom_semantic_lexicon.tsv")
131+
132+
# Download and merge with only lemma/word information
133+
combined_single_lexicon_data = LexiconCollection.tsv_merge(*[existing_single_lexicon_url,
134+
custom_single_lexicon_path],
135+
include_pos=False)
136+
137+
# Download and merge with POS information
138+
combined_single_pos_lexicon_data = LexiconCollection.tsv_merge(*[existing_single_lexicon_url,
139+
custom_single_lexicon_path],
140+
include_pos=True)
141+
```
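
The "last file wins" behaviour described above can be pictured with plain dictionaries. This is only an illustrative sketch, not how PyMUSAS implements `tsv_merge`: the function `merge_lexicons` and the `lemma|pos` key format are assumptions made for this example.

``` python
def merge_lexicons(*lexicons: dict[str, list[str]]) -> dict[str, list[str]]:
    """Hypothetical helper: merge lexicons so later ones override earlier ones."""
    merged: dict[str, list[str]] = {}
    for lexicon in lexicons:
        # dict.update replaces the tags of any key that already exists.
        merged.update(lexicon)
    return merged


# Illustrative entries only; the key format is an assumption for this sketch.
existing_lexicon = {"Amazon|PROPN": ["Z2", "M7"]}
custom_lexicon = {"Amazon|PROPN": ["Z3"], "broligarchy|NOUN": ["S5"]}

merged = merge_lexicons(existing_lexicon, custom_lexicon)
print(merged)
```

Because the custom lexicon is passed last, its `Amazon` entry replaces the existing one and `broligarchy` is added. Reversing the order would keep the existing `Z2`, `M7` tags instead.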
142+
143+
Then do the same for the MWE lexicon:
144+
145+
``` python
146+
# Get the existing MWE lexicon from the Multilingual USAS repository
147+
existing_mwe_lexicon_url = ("https://raw.githubusercontent.com/UCREL/"
148+
"Multilingual-USAS/refs/heads/master/"
149+
"English/mwe-en.tsv")
150+
custom_mwe_lexicon_path = Path("custom_mwe.tsv")
151+
combined_mwe_lexicon_data = MWELexiconCollection.tsv_merge(*[existing_mwe_lexicon_url,
152+
custom_mwe_lexicon_path])
153+
```
154+
155+
After this we need to set up the rest of the tagger and add it to the English spaCy pipeline. The full code for this can be found below:
156+
157+
<details>
158+
<summary>Python Script</summary>
159+
160+
``` python
161+
from pathlib import Path
162+
163+
import spacy
164+
from pymusas.lexicon_collection import LexiconCollection, MWELexiconCollection
165+
from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker
166+
from pymusas.taggers.rules.single_word import SingleWordRule
167+
from pymusas.taggers.rules.mwe import MWERule
168+
169+
# Get the existing single word lexicon from the Multilingual USAS repository
170+
existing_single_lexicon_url = ("https://raw.githubusercontent.com/UCREL/"
171+
"Multilingual-USAS/refs/heads/master/"
172+
"English/semantic_lexicon_en.tsv")
173+
custom_single_lexicon_path = Path("custom_semantic_lexicon.tsv")
174+
175+
# Download and merge with only lemma/word information
176+
combined_single_lexicon_data = LexiconCollection.tsv_merge(*[existing_single_lexicon_url,
177+
custom_single_lexicon_path],
178+
include_pos=False)
179+
180+
# Download and merge with POS information
181+
combined_single_pos_lexicon_data = LexiconCollection.tsv_merge(*[existing_single_lexicon_url,
182+
custom_single_lexicon_path],
183+
include_pos=True)
184+
185+
# Get the existing MWE lexicon from the Multilingual USAS repository
186+
existing_mwe_lexicon_url = ("https://raw.githubusercontent.com/UCREL/"
187+
"Multilingual-USAS/refs/heads/master/"
188+
"English/mwe-en.tsv")
189+
custom_mwe_lexicon_path = Path("custom_mwe.tsv")
190+
combined_mwe_lexicon_data = MWELexiconCollection.tsv_merge(*[existing_mwe_lexicon_url,
191+
custom_mwe_lexicon_path])
192+
193+
# Creating the PyMUSAS tagger resources
194+
single_word_rule = SingleWordRule(lexicon_collection=combined_single_pos_lexicon_data,
195+
lemma_lexicon_collection=combined_single_lexicon_data,
196+
pos_mapper=None)
197+
mwe_word_rule = MWERule(mwe_lexicon_lookup=combined_mwe_lexicon_data,
198+
pos_mapper=None)
199+
rules = [single_word_rule, mwe_word_rule]
200+
ranker = ContextualRuleBasedRanker(*ContextualRuleBasedRanker.get_construction_arguments(rules))
201+
202+
# Loading the English spaCy pipeline
203+
# We exclude the following components as we do not need them.
204+
nlp = spacy.load('en_core_web_sm', exclude=['parser', 'ner'])
205+
# Adding a blank PyMUSAS tagger
206+
pymusas_tagger = nlp.add_pipe('pymusas_rule_based_tagger')
207+
# Adding our custom resources to the tagger
208+
pymusas_tagger.initialize(rules=rules,
209+
ranker=ranker,
210+
default_punctuation_tags=["PUNCT"],
211+
default_number_tags=["NUM"])
212+
213+
sentence = ("While drinking my flat white I was reading about the "
214+
"new battery farm that Amazon is creating which is owned by "
215+
"one of the broligarchy")
216+
output_doc = nlp(sentence)
217+
218+
print(f'Text\tLemma\tPOS\tUSAS Tags')
219+
for token in output_doc:
220+
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')
221+
222+
```
223+
224+
</details>
225+
226+
When run on the same sentence, this produces the following:
227+
228+
``` tsv
229+
Text Lemma POS USAS Tags
230+
While while SCONJ ['Z5']
231+
drinking drink VERB ['A5.4+']
232+
my my PRON ['A5.4+']
233+
flat flat ADJ ['F2/Z3']
234+
white white NOUN ['F2/Z3']
235+
I I PRON ['Z8mf']
236+
was be AUX ['A3+', 'Z5']
237+
reading read VERB ['Q3', 'Q1.2', 'X3.2+', 'X2.5+', 'P1', 'A10+']
238+
about about ADP ['Z5']
239+
the the DET ['Z5']
240+
new new ADJ ['T3-']
241+
battery battery NOUN ['Z3/Y1/W3']
242+
farm farm NOUN ['Z3/Y1/W3']
243+
that that SCONJ ['Z5', 'Z8']
244+
Amazon Amazon PROPN ['Z3']
245+
is be AUX ['A3+', 'Z5']
246+
creating create VERB ['A1.1.1', 'A2.2', 'E1']
247+
which which DET ['Z5', 'Z8']
248+
is be AUX ['A3+', 'Z5']
249+
owned own VERB ['A9+']
250+
by by ADP ['Z5']
251+
one one NUM ['N1', 'T3', 'T1.2']
252+
of of ADP ['Z5']
253+
the the DET ['Z5']
254+
broligarchy broligarchy NOUN ['S5']
255+
```
256+
257+
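As you can see, `flat white` is now recognised as a drink and a proper name (`F2/Z3`), `broligarchy` is recognised as a group (`S5`), `Amazon` is tagged as a proper name (`Z3`), which is how we mark the company, and `battery farm` is tagged with the custom `Z3/Y1/W3` entry (proper name/science and technology/geographical terms) rather than as a farm with livestock.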
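
To see which tokens changed between the two runs without reading both tables, you could compare the tags programmatically. The sketch below hard-codes a few rows from the two outputs shown in this guide. The helper `changed_tags` is hypothetical, and in practice you would build the dictionaries from `token._.pymusas_tags`.

``` python
TagsByToken = dict[str, list[str]]

# A subset of the rows from the two output tables in this guide.
tags_before: TagsByToken = {
    "flat": ["O4.4", "O3", "O4.1", "K2", "A5.3+"],
    "new": ["T3-"],
    "battery": ["F4"],
    "Amazon": ["Z2", "M7"],
    "broligarchy": ["Z99"],
}
tags_after: TagsByToken = {
    "flat": ["F2/Z3"],
    "new": ["T3-"],
    "battery": ["Z3/Y1/W3"],
    "Amazon": ["Z3"],
    "broligarchy": ["S5"],
}


def changed_tags(old: TagsByToken,
                 new: TagsByToken) -> dict[str, tuple[list[str], list[str]]]:
    """Hypothetical helper: return tokens whose tags differ between two runs."""
    return {
        token: (old[token], new[token])
        for token in old
        if token in new and old[token] != new[token]
    }


for token, (old_tags, new_tags) in changed_tags(tags_before, tags_after).items():
    print(f"{token}\t{old_tags} -> {new_tags}")
```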
258+
259+
260+
:::note
261+
262+
At the moment we assume that the lexicons you merge all use the same Part Of Speech (POS) tagset.
263+
264+
:::
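
One way to catch a tagset mismatch early is to check the POS column of a custom single word lexicon before merging. The sketch below assumes the lexicons use the Universal POS (UPOS) tagset, as the `PROPN` and `NOUN` values in the examples above suggest. The function `unexpected_pos_tags` is hypothetical and not part of PyMUSAS.

``` python
import csv
import io

# The 17 Universal POS (UPOS) tags.
UPOS_TAGS = frozenset({
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
})


def unexpected_pos_tags(tsv_text: str,
                        allowed: frozenset[str] = UPOS_TAGS) -> set[str]:
    """Hypothetical helper: return POS values in `tsv_text` not in `allowed`.

    Assumes a header row that contains a `pos` column.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["pos"] for row in reader if row["pos"] not in allowed}


custom_lexicon_text = (
    "lemma\tpos\tsemantic_tags\n"
    "Amazon\tPROPN\tZ3\n"
    "broligarchy\tNOUN\tS5\n"
)
print(unexpected_pos_tags(custom_lexicon_text))
```

An empty set means every POS value in the custom lexicon belongs to the assumed tagset. Any value returned is worth checking before you merge.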
