| 1 | +--- |
| 2 | +title: Combine/Merge Lexicons |
| 3 | +sidebar_position: 3 |
| 4 | +--- |
| 5 | + |
| 6 | +In this guide we will show how to combine two lexicons, for both single-word and Multi Word Expression (MWE) lexicons, so that the combined lexicon can be used in a single PyMUSAS [RuleBasedTagger](/api/spacy_api/taggers/rule_based#rulebasedtagger). |
| 7 | + |
| 8 | +This approach is useful if you want the coverage of the existing lexicons that are available for a given language but also want to customize them. For example, you might want to: |
| 9 | +* Add domain-specific language to the lexicons, e.g. `flat_* white_*` = `F2/Z3` (a type of coffee). |
| 10 | +* Override an existing lexicon entry with a different semantic tag. For example, in the [English semantic lexicon](https://github.com/UCREL/Multilingual-USAS/blob/6b305509016b21cd9062c5f77c1f29313ca9cc53/English/semantic_lexicon_en.tsv#L586C1-L586C18) `Amazon PROPN` is associated with `Z2 M7`, the semantic tags for *Geographical names* and *Places*, but in your corpus you might want it to refer to the company and therefore change the semantic tag to `Z3`. |
| 11 | + |
| 12 | +All of the existing lexicons for the different languages can be found in the [Multilingual-USAS repository](https://github.com/UCREL/Multilingual-USAS/tree/master). In this guide we will only use the [English lexicons](https://github.com/UCREL/Multilingual-USAS/tree/master/English). |
| 13 | + |
| 14 | +This guide shows how to create a PyMUSAS [RuleBasedTagger](/api/spacy_api/taggers/rule_based#rulebasedtagger) that uses the existing [English lexicons](https://github.com/UCREL/Multilingual-USAS/tree/master/English) together with additional custom lexicons that both extend the existing lexicons and override some of their entries. The guide is broken down into: |
| 15 | + |
| 16 | +1. Setup |
| 17 | +2. How the existing tagger performs |
| 18 | +3. How to customize the tagger through combining the existing lexicon with a custom lexicon |
| 19 | + |
| 20 | +## Setup |
| 21 | + |
| 22 | +Download both the [English PyMUSAS `RuleBasedTagger` spaCy component](https://github.com/UCREL/pymusas-models/releases/tag/en_dual_none_contextual-0.3.3) and the [small English spaCy model](https://spacy.io/models/en): |
| 23 | + |
| 24 | +``` bash |
| 25 | +pip install https://github.com/UCREL/pymusas-models/releases/download/en_dual_none_contextual-0.3.3/en_dual_none_contextual-0.3.3-py3-none-any.whl |
| 26 | +python -m spacy download en_core_web_sm |
| 27 | +``` |
| 28 | + |
| 29 | +We are going to use two example custom lexicons. These are for illustration only: we assume the custom lexicons you will use contain different and/or more entries, and you do not need to have both a single-word and an MWE lexicon. |
| 30 | + |
| 31 | +The custom single-word lexicon, which we assume is saved to a file at `./custom_semantic_lexicon.tsv`: |
| 32 | +```tsv title="custom_semantic_lexicon.tsv" |
| 33 | +lemma pos semantic_tags |
| 34 | +Amazon PROPN Z3 |
| 35 | +broligarchy NOUN S5 |
| 36 | +``` |
| 37 | + |
| 38 | +The custom MWE lexicon, which we assume is saved to a file at `./custom_mwe.tsv`: |
| 39 | +``` tsv title="custom_mwe.tsv" |
| 40 | +mwe_template semantic_tags |
| 41 | +battery_NOUN farm_NOUN Z3/Y1/W3 |
| 42 | +flat_* white_* F2/Z3 |
| 43 | +``` |
| 44 | + |
| 45 | +These files can be saved anywhere locally or even at a URL; just change the file paths in the code to the location of these files. |
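
Because both files are tab-separated and the MWE templates themselves contain spaces, it can be easier to generate them from code than to type the tabs by hand. The following is a minimal sketch using only the Python standard library; the helper names `write_tsv` and `write_custom_lexicons` are hypothetical and are not part of PyMUSAS.

```python
import csv
import tempfile
from pathlib import Path

# Rows copied from the two example lexicons above.
SINGLE_WORD_ROWS = [
    ("lemma", "pos", "semantic_tags"),
    ("Amazon", "PROPN", "Z3"),
    ("broligarchy", "NOUN", "S5"),
]
MWE_ROWS = [
    ("mwe_template", "semantic_tags"),
    ("battery_NOUN farm_NOUN", "Z3/Y1/W3"),
    ("flat_* white_*", "F2/Z3"),
]


def write_tsv(path, rows):
    """Write rows to `path` with a tab between fields (hypothetical helper)."""
    with path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


def write_custom_lexicons(directory):
    """Create both example lexicon files inside `directory` and return their paths."""
    directory = Path(directory)
    single_path = directory / "custom_semantic_lexicon.tsv"
    mwe_path = directory / "custom_mwe.tsv"
    write_tsv(single_path, SINGLE_WORD_ROWS)
    write_tsv(mwe_path, MWE_ROWS)
    return single_path, mwe_path


# A temporary directory is used here; pass "." to write next to your script.
single_path, mwe_path = write_custom_lexicons(tempfile.mkdtemp())
print(single_path.read_text(encoding="utf-8"))
print(mwe_path.read_text(encoding="utf-8"))
```

Note that in the MWE file only the gap between the template and the semantic tag is a tab; the gap between `battery_NOUN` and `farm_NOUN` is a single space.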
| 46 | + |
| 47 | +The example sentence we are going to use throughout is: |
| 48 | + |
| 49 | +``` python |
| 50 | +sentence = ("While drinking my flat white I was reading about the " |
| 51 | + "new battery farm that Amazon is creating which is owned by " |
| 52 | + "one of the broligarchy") |
| 53 | +``` |
| 54 | + |
| 55 | +Using this sentence and the custom lexicons we will show that we can better reflect the meaning of the sentence. |
| 56 | + |
| 57 | +## How the existing tagger performs |
| 58 | + |
| 59 | +Using the off-the-shelf [English PyMUSAS `RuleBasedTagger` spaCy component](https://github.com/UCREL/pymusas-models/releases/tag/en_dual_none_contextual-0.3.3) with its single-word and MWE lexicons, we get the following tags for the example sentence: |
| 60 | + |
| 61 | +``` tsv |
| 62 | +Text Lemma POS USAS Tags |
| 63 | +While while SCONJ ['Z5'] |
| 64 | +drinking drink VERB ['A5.4+'] |
| 65 | +my my PRON ['A5.4+'] |
| 66 | +flat flat ADJ ['O4.4', 'O3', 'O4.1', 'K2', 'A5.3+'] |
| 67 | +white white NOUN ['O4.3', 'O4.3/S2mf', 'F2', 'F1', 'B1'] |
| 68 | +I I PRON ['Z8mf'] |
| 69 | +was be AUX ['A3+', 'Z5'] |
| 70 | +reading read VERB ['Q3', 'Q1.2', 'X3.2+', 'X2.5+', 'P1', 'A10+'] |
| 71 | +about about ADP ['Z5'] |
| 72 | +the the DET ['Z5'] |
| 73 | +new new ADJ ['T3-'] |
| 74 | +battery battery NOUN ['F4'] |
| 75 | +farm farm NOUN ['F4'] |
| 76 | +that that SCONJ ['Z5', 'Z8'] |
| 77 | +Amazon Amazon PROPN ['Z2', 'M7'] |
| 78 | +is be AUX ['A3+', 'Z5'] |
| 79 | +creating create VERB ['A1.1.1', 'A2.2', 'E1'] |
| 80 | +which which DET ['Z5', 'Z8'] |
| 81 | +is be AUX ['A3+', 'Z5'] |
| 82 | +owned own VERB ['A9+'] |
| 83 | +by by ADP ['Z5'] |
| 84 | +one one NUM ['N1', 'T3', 'T1.2'] |
| 85 | +of of ADP ['Z5'] |
| 86 | +the the DET ['Z5'] |
| 87 | +broligarchy broligarchy NOUN ['Z99'] |
| 88 | +``` |
| 89 | + |
| 90 | +As you can see, `flat white` is not recognised as a drink; `broligarchy` is not recognised at all (it receives the unmatched tag `Z99`), as it is a new word according to the [Collins Dictionary](https://www.collinsdictionary.com/dictionary/english/brollies); `Amazon` is assumed to be a geographical place, such as the rainforest; and `battery farm` is recognised as a farm with livestock rather than a farm with batteries. |
| 91 | + |
| 92 | +This output was produced using the following code: |
| 93 | + |
| 94 | +<details> |
| 95 | +<summary>Python Script</summary> |
| 96 | + |
| 97 | +``` python |
| 98 | +import spacy |
| 99 | + |
| 100 | + |
| 101 | +sentence = ("While drinking my flat white I was reading about the " |
| 102 | + "new battery farm that Amazon is creating which is owned by " |
| 103 | + "one of the broligarchy") |
| 104 | + |
| 105 | +# We exclude the following components as we do not need them. |
| 106 | +nlp = spacy.load('en_core_web_sm', exclude=['parser', 'ner']) |
| 107 | +# Load the English PyMUSAS rule-based tagger in a separate spaCy pipeline |
| 108 | +english_tagger_pipeline = spacy.load('en_dual_none_contextual') |
| 109 | +# Adds the English PyMUSAS rule-based tagger to the main spaCy pipeline |
| 110 | +nlp.add_pipe('pymusas_rule_based_tagger', source=english_tagger_pipeline) |
| 111 | + |
| 112 | +output_doc = nlp(sentence) |
| 113 | + |
| 114 | +print('Text\tLemma\tPOS\tUSAS Tags') |
| 115 | +for token in output_doc: |
| 116 | + print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}') |
| 117 | +``` |
| 118 | + |
| 119 | +</details> |
| 120 | + |
| 121 | +## How to customize the tagger through combining the existing lexicon with a custom lexicon |
| 122 | + |
| 123 | +In the code below we first create a combined/merged single-word lexicon from the existing single-word lexicon in the Multilingual USAS GitHub repository and our custom lexicon. This is done with the `LexiconCollection.tsv_merge` function, which downloads/loads the TSV files and then merges them. The last TSV file in the list overrides any matching lexicon entries that come before it, so the custom lexicon(s) should come after the existing/general lexicon. The merged data is what we later pass to the PyMUSAS [RuleBasedTagger](/api/spacy_api/taggers/rule_based#rulebasedtagger): |
| 124 | + |
| 125 | +``` python |
| 126 | +# Get the existing single word lexicon from the Multilingual USAS repository |
| 127 | +existing_single_lexicon_url = ("https://raw.githubusercontent.com/UCREL/" |
| 128 | + "Multilingual-USAS/refs/heads/master/" |
| 129 | + "English/semantic_lexicon_en.tsv") |
| 130 | +custom_single_lexicon_path = Path("custom_semantic_lexicon.tsv") |
| 131 | + |
| 132 | +# Download and merge with only lemma/word information |
| 133 | +combined_single_lexicon_data = LexiconCollection.tsv_merge(*[existing_single_lexicon_url, |
| 134 | + custom_single_lexicon_path], |
| 135 | + include_pos=False) |
| 136 | + |
| 137 | +# Download and merge with POS information |
| 138 | +combined_single_pos_lexicon_data = LexiconCollection.tsv_merge(*[existing_single_lexicon_url, |
| 139 | + custom_single_lexicon_path], |
| 140 | + include_pos=True) |
| 141 | +``` |
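
To make the "last file wins" behaviour concrete, here is a small standalone sketch that mimics it with plain dictionaries. It is an illustration under assumptions, not the PyMUSAS implementation: the functions `read_single_word_tsv` and `merge_last_wins` are hypothetical, the `lemma|pos` key format is chosen only for this example, and the "existing" lexicon is a one-row stand-in rather than the real English lexicon.

```python
import csv
import io


def read_single_word_tsv(text, include_pos):
    """Parse single-word lexicon TSV text into {key: [semantic tags]}.

    Hypothetical helper: the key is the lemma, or "lemma|pos" when
    include_pos is True.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    entries = {}
    for row in reader:
        key = row["lemma"]
        if include_pos:
            key = f"{row['lemma']}|{row['pos']}"
        entries[key] = row["semantic_tags"].split()
    return entries


def merge_last_wins(*lexicons):
    """Merge dictionaries so that later ones override earlier ones."""
    merged = {}
    for lexicon in lexicons:
        merged.update(lexicon)
    return merged


# One-row stand-in for the existing lexicon, followed by the custom lexicon.
existing_text = "lemma\tpos\tsemantic_tags\nAmazon\tPROPN\tZ2 M7\n"
custom_text = ("lemma\tpos\tsemantic_tags\n"
               "Amazon\tPROPN\tZ3\n"
               "broligarchy\tNOUN\tS5\n")

merged_with_pos = merge_last_wins(
    read_single_word_tsv(existing_text, include_pos=True),
    read_single_word_tsv(custom_text, include_pos=True),
)
print(merged_with_pos["Amazon|PROPN"])       # ['Z3']
print(merged_with_pos["broligarchy|NOUN"])   # ['S5']
```

If the two arguments to `merge_last_wins` were swapped, `Amazon|PROPN` would keep `['Z2', 'M7']`, which is why the custom lexicon is listed last.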
| 142 | + |
| 143 | +Then do the same for the MWE lexicon: |
| 144 | + |
| 145 | +``` python |
| 146 | +# Get the existing MWE lexicon from the Multilingual USAS repository |
| 147 | +existing_mwe_lexicon_url = ("https://raw.githubusercontent.com/UCREL/" |
| 148 | + "Multilingual-USAS/refs/heads/master/" |
| 149 | + "English/mwe-en.tsv") |
| 150 | +custom_mwe_lexicon_path = Path("custom_mwe.tsv") |
| 151 | +combined_mwe_lexicon_data = MWELexiconCollection.tsv_merge(*[existing_mwe_lexicon_url, |
| 152 | + custom_mwe_lexicon_path]) |
| 153 | +``` |
| 154 | + |
| 155 | +After this we need to set up the rest of the tagger and add it to the English spaCy pipeline. The full code for this can be found below: |
| 156 | + |
| 157 | +<details> |
| 158 | +<summary>Python Script</summary> |
| 159 | + |
| 160 | +``` python |
| 161 | +from pathlib import Path |
| 162 | + |
| 163 | +import spacy |
| 164 | +from pymusas.lexicon_collection import LexiconCollection, MWELexiconCollection |
| 165 | +from pymusas.rankers.lexicon_entry import ContextualRuleBasedRanker |
| 166 | +from pymusas.taggers.rules.single_word import SingleWordRule |
| 167 | +from pymusas.taggers.rules.mwe import MWERule |
| 168 | + |
| 169 | +# Get the existing single word lexicon from the Multilingual USAS repository |
| 170 | +existing_single_lexicon_url = ("https://raw.githubusercontent.com/UCREL/" |
| 171 | + "Multilingual-USAS/refs/heads/master/" |
| 172 | + "English/semantic_lexicon_en.tsv") |
| 173 | +custom_single_lexicon_path = Path("custom_semantic_lexicon.tsv") |
| 174 | + |
| 175 | +# Download and merge with only lemma/word information |
| 176 | +combined_single_lexicon_data = LexiconCollection.tsv_merge(*[existing_single_lexicon_url, |
| 177 | + custom_single_lexicon_path], |
| 178 | + include_pos=False) |
| 179 | + |
| 180 | +# Download and merge with POS information |
| 181 | +combined_single_pos_lexicon_data = LexiconCollection.tsv_merge(*[existing_single_lexicon_url, |
| 182 | + custom_single_lexicon_path], |
| 183 | + include_pos=True) |
| 184 | + |
| 185 | +# Get the existing MWE lexicon from the Multilingual USAS repository |
| 186 | +existing_mwe_lexicon_url = ("https://raw.githubusercontent.com/UCREL/" |
| 187 | + "Multilingual-USAS/refs/heads/master/" |
| 188 | + "English/mwe-en.tsv") |
| 189 | +custom_mwe_lexicon_path = Path("custom_mwe.tsv") |
| 190 | +combined_mwe_lexicon_data = MWELexiconCollection.tsv_merge(*[existing_mwe_lexicon_url, |
| 191 | + custom_mwe_lexicon_path]) |
| 192 | + |
| 193 | +# Creating the PyMUSAS tagger resources |
| 194 | +single_word_rule = SingleWordRule(lexicon_collection=combined_single_pos_lexicon_data, |
| 195 | + lemma_lexicon_collection=combined_single_lexicon_data, |
| 196 | + pos_mapper=None) |
| 197 | +mwe_word_rule = MWERule(mwe_lexicon_lookup=combined_mwe_lexicon_data, |
| 198 | + pos_mapper=None) |
| 199 | +rules = [single_word_rule, mwe_word_rule] |
| 200 | +ranker = ContextualRuleBasedRanker(*ContextualRuleBasedRanker.get_construction_arguments(rules)) |
| 201 | + |
| 202 | +# Loading the English spaCy pipeline |
| 203 | +# We exclude the following components as we do not need them. |
| 204 | +nlp = spacy.load('en_core_web_sm', exclude=['parser', 'ner']) |
| 205 | +# Adding a blank PyMUSAS tagger |
| 206 | +pymusas_tagger = nlp.add_pipe('pymusas_rule_based_tagger') |
| 207 | +# Adding our custom resources to the tagger |
| 208 | +pymusas_tagger.initialize(rules=rules, |
| 209 | + ranker=ranker, |
| 210 | + default_punctuation_tags=["PUNCT"], |
| 211 | + default_number_tags=["NUM"]) |
| 212 | + |
| 213 | +sentence = ("While drinking my flat white I was reading about the " |
| 214 | + "new battery farm that Amazon is creating which is owned by " |
| 215 | + "one of the broligarchy") |
| 216 | +output_doc = nlp(sentence) |
| 217 | + |
| 218 | +print('Text\tLemma\tPOS\tUSAS Tags') |
| 219 | +for token in output_doc: |
| 220 | + print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}') |
| 221 | + |
| 222 | +``` |
| 223 | + |
| 224 | +</details> |
| 225 | + |
| 226 | +When run on the same sentence, this script produces the following: |
| 227 | + |
| 228 | +``` tsv |
| 229 | +Text Lemma POS USAS Tags |
| 230 | +While while SCONJ ['Z5'] |
| 231 | +drinking drink VERB ['A5.4+'] |
| 232 | +my my PRON ['A5.4+'] |
| 233 | +flat flat ADJ ['F2/Z3'] |
| 234 | +white white NOUN ['F2/Z3'] |
| 235 | +I I PRON ['Z8mf'] |
| 236 | +was be AUX ['A3+', 'Z5'] |
| 237 | +reading read VERB ['Q3', 'Q1.2', 'X3.2+', 'X2.5+', 'P1', 'A10+'] |
| 238 | +about about ADP ['Z5'] |
| 239 | +the the DET ['Z5'] |
| 240 | +new new ADJ ['T3-'] |
| 241 | +battery battery NOUN ['Z3/Y1/W3'] |
| 242 | +farm farm NOUN ['Z3/Y1/W3'] |
| 243 | +that that SCONJ ['Z5', 'Z8'] |
| 244 | +Amazon Amazon PROPN ['Z3'] |
| 245 | +is be AUX ['A3+', 'Z5'] |
| 246 | +creating create VERB ['A1.1.1', 'A2.2', 'E1'] |
| 247 | +which which DET ['Z5', 'Z8'] |
| 248 | +is be AUX ['A3+', 'Z5'] |
| 249 | +owned own VERB ['A9+'] |
| 250 | +by by ADP ['Z5'] |
| 251 | +one one NUM ['N1', 'T3', 'T1.2'] |
| 252 | +of of ADP ['Z5'] |
| 253 | +the the DET ['Z5'] |
| 254 | +broligarchy broligarchy NOUN ['S5'] |
| 255 | +``` |
| 256 | + |
| 257 | +As you can see, `flat white` is now recognised as a drink and a proper noun, `broligarchy` is recognised as a group, `Amazon` is linked to a company, and `battery farm` now carries the custom `Z3/Y1/W3` tag, which relates it to science and technology rather than livestock farming. |
| 258 | + |
| 259 | + |
| 260 | +:::note |
| 261 | + |
| 262 | +At the moment, merging assumes that all of the lexicons being merged use the same Part Of Speech (POS) tagset. |
| 263 | + |
| 264 | +::: |
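
Since a mismatched tagset would not be detected by the merge itself, a quick sanity check before merging can help. The sketch below is one possible approach using only the standard library; `pos_tags` and `unexpected_pos_tags` are hypothetical helpers, and the sample TSV strings (including the `np1` tag) are made up for illustration.

```python
import csv
import io


def pos_tags(tsv_text):
    """Return the set of values found in the `pos` column of lexicon TSV text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["pos"] for row in reader if row.get("pos")}


def unexpected_pos_tags(existing_tsv, custom_tsv):
    """Return POS tags used in the custom lexicon but never in the existing one."""
    return pos_tags(custom_tsv) - pos_tags(existing_tsv)


# Made-up sample data: a tiny stand-in for the existing lexicon and two custom ones.
existing = ("lemma\tpos\tsemantic_tags\n"
            "Amazon\tPROPN\tZ2 M7\n"
            "read\tVERB\tQ3\n")
custom_same_tagset = "lemma\tpos\tsemantic_tags\nAmazon\tPROPN\tZ3\n"
custom_other_tagset = "lemma\tpos\tsemantic_tags\nAmazon\tnp1\tZ3\n"

print(unexpected_pos_tags(existing, custom_same_tagset))   # set()
print(unexpected_pos_tags(existing, custom_other_tagset))  # {'np1'}
```

An empty result does not prove the tagsets are identical, but a non-empty one is a strong hint that the custom lexicon needs its POS tags mapped before merging.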