@@ -312,9 +312,12 @@ combination of the lemma and pos and the value are the semantic tags.
312312The lemma and pos are combined as follows: `{lemma}| {pos}` , e.g.
313313`Car| Noun`
314314
315- If the pos value is None then then only the lemma is used: `{lemma}` ,
315+ If the pos value is None then only the lemma is used: `{lemma}` ,
316316e.g. `Car`
317317
318+ ** Note** If the key already exists then the most recent entry will
319+ overwrite the existing entry.
320+
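The key rule and the overwrite behaviour described above can be sketched with a plain dictionary. This is a minimal illustration only, not the library's implementation; `make_key` and `add_entry` are hypothetical helper names introduced here.

```python
from typing import Dict, List, Optional


def make_key(lemma: str, pos: Optional[str] = None) -> str:
    """Build the lookup key: `lemma|pos`, or just `lemma` when pos is None."""
    return lemma if pos is None else f"{lemma}|{pos}"


def add_entry(lexicon: Dict[str, List[str]], lemma: str,
              semantic_tags: List[str], pos: Optional[str] = None) -> None:
    """Store the tags under the key; an existing key is overwritten."""
    lexicon[make_key(lemma, pos)] = semantic_tags


lexicon: Dict[str, List[str]] = {}
add_entry(lexicon, "Car", ["Z2"], pos="Noun")
add_entry(lexicon, "Car", ["Z1"])              # no POS: the key is just "Car"
add_entry(lexicon, "Car", ["M3"], pos="Noun")  # same key: replaces ["Z2"]
```

After these calls the dictionary holds two keys, `Car|Noun` and `Car`, and the `Car|Noun` entry carries only the most recently added tags.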
318321< h4 id = " add_lexicon_entry.parameters" > Parameters< a className = " headerlink" href = " #add_lexicon_entry.parameters" title = " Permanent link" > & para;< / a>< / h4>
319322
320323
@@ -466,6 +469,132 @@ welsh_lexicon_collection = LexiconCollection(welsh_lexicon_dict)
466469assert welsh_lexicon_dict[' ceir' ][0 ] == ' M3fn'
467470```
468471
472+ < a id = " pymusas.lexicon_collection.LexiconCollection.merge" >< / a>
473+
474+ # ## merge
475+
476+ ```python
477+ class LexiconCollection(MutableMapping):
478+ | ...
479+ | @ staticmethod
480+ | def merge(
481+ | * lexicon_collections: " LexiconCollection"
482+ | ) -> " LexiconCollection"
483+ ```
484+
485+ Given more than one lexicon collection it will create a single lexicon
486+ collection whereby the lexicon data from each will be combined.
487+
488+ **Note** the data is loaded in list order, so the last lexicon
489+ collection takes precedence, e.g. if the last contains `London`: [`Z3`]
490+ and the first contains `London`: [`Z2`] then the returned
491+ LexiconCollection will contain only the one entry, `London`: [`Z3`].
492+
493+ **Note** if the lexicon collections contain POS information, we assume
494+ that they all use the same POS tagset. If they do not, this could
495+ cause issues at tagging time.
496+
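The precedence rule in the note above can be sketched with plain dictionaries. This is an illustration under the assumption that merging behaves like successive dictionary updates; `merge_lexicons` is a hypothetical helper, not part of the library.

```python
from typing import Dict, List


def merge_lexicons(*lexicons: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Combine lexicons in argument order; later ones win on shared keys."""
    merged: Dict[str, List[str]] = {}
    for lexicon in lexicons:
        # update() replaces the whole tag list for any key already present.
        merged.update(lexicon)
    return merged


first = {"London": ["Z2"], "Paris": ["Z2"]}
last = {"London": ["Z3"]}
merged = merge_lexicons(first, last)
```

Note that the tag lists are replaced rather than concatenated, so swapping the argument order changes which `London` entry survives.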
497+ < h4 id = " merge.parameters" > Parameters< a className = " headerlink" href = " #merge.parameters" title = " Permanent link" > & para;< / a>< / h4>
498+
499+
500+ - __* lexicon\_collections__ : `LexiconCollection` <br/>
501+ Two or more lexicon collections to be merged.
502+
503+ < h4 id = " merge.returns" > Returns< a className = " headerlink" href = " #merge.returns" title = " Permanent link" > & para;< / a>< / h4>
504+
505+
506+ - [`LexiconCollection` ](# lexiconcollection) <br/>
507+
508+ < h4 id = " merge.examples" > Examples< a className = " headerlink" href = " #merge.examples" title = " Permanent link" > & para;< / a>< / h4>
509+
510+
511+ ``` python
512+ from pymusas.lexicon_collection import LexiconCollection
513+ welsh_lexicon_url = " https://raw.githubusercontent.com/UCREL/Multilingual-USAS/refs/heads/master/Welsh/semantic_lexicon_cy.tsv"
514+ english_lexicon_url = " https://raw.githubusercontent.com/UCREL/Multilingual-USAS/refs/heads/master/English/semantic_lexicon_en.tsv"
515+ welsh_lexicon_data = LexiconCollection.from_tsv(welsh_lexicon_url, include_pos = True )
516+ welsh_lexicon = LexiconCollection(welsh_lexicon_data)
517+ english_lexicon_data = LexiconCollection.from_tsv(english_lexicon_url, include_pos = True )
518+ english_lexicon = LexiconCollection(english_lexicon_data)
519+ combined_lexicon_collection = LexiconCollection.merge(welsh_lexicon, english_lexicon)
520+ assert isinstance (combined_lexicon_collection, LexiconCollection)
521+ assert combined_lexicon_collection[" Aber-lash|pnoun" ] == [" Z2" ]
522+ assert combined_lexicon_collection[" Aqua|PROPN" ] == [" Z3c" ]
523+ ```
524+
525+ < a id = " pymusas.lexicon_collection.LexiconCollection.tsv_merge" >< / a>
526+
527+ # ## tsv\_merge
528+
529+ ```python
530+ class LexiconCollection(MutableMapping):
531+ | ...
532+ | @ staticmethod
533+ | def tsv_merge(
534+ | * tsv_file_paths: PathLike,
535+ | * ,
536+ | include_pos: bool = True
537+ | ) -> dict[str , list[str ]]
538+ ```
539+
540+ Given one or more TSV files, creates a single dictionary that
541+ combines the lexicon data from all of the TSV files. This dictionary
542+ can then be used to create a [`LexiconCollection`](#lexiconcollection).
543+
544+ For more information on how the TSV data is loaded see [`from_tsv` ](# from_tsv).
545+
546+ **Note** the data is loaded in list order, so the last TSV file
547+ takes precedence, e.g. if the last TSV file contains `London`: [`Z3`]
548+ and the first TSV file contains `London`: [`Z2`] then the returned
549+ dictionary will contain only the one entry, `London`: [`Z3`].
550+
551+ **Note** if the TSV files contain POS information, we assume that all
552+ of the TSV files use the same POS tagset. If they do not, this could
553+ cause issues at tagging time.
554+
555+ < h4 id = " tsv_merge.parameters" > Parameters< a className = " headerlink" href = " #tsv_merge.parameters" title = " Permanent link" > & para;< / a>< / h4>
556+
557+
558+ - __* tsv\_file\_paths__ : `PathLike` <br/>
559+ File paths and/or URLs to TSV files, each containing at least two
560+ fields, plus an optional third, with the following headings:
561+
562+ 1. `lemma`
563+ 2. `semantic_tags`
564+ 3. `pos` (Optional)
565+
566+ All other fields will be ignored.
567+ - __include\_pos__ : `bool`, optional (default = `True`) <br/>
568+ Whether to include the POS information when it is available.
569+ See [`add_lexicon_entry`](#add_lexicon_entry) for more information on this
570+ parameter.
571+
572+ < h4 id = " tsv_merge.returns" > Returns< a className = " headerlink" href = " #tsv_merge.returns" title = " Permanent link" > & para;< / a>< / h4>
573+
574+
575+ - `dict[str , list[str ]]` < br/ >
576+
577+ < h4 id = " tsv_merge.raises" > Raises< a className = " headerlink" href = " #tsv_merge.raises" title = " Permanent link" > & para;< / a>< / h4>
578+
579+
580+ - `ValueError ` < br/ >
581+ If the minimum field headings, `lemma` and `semantic_tags` , do not
582+ exist in the given TSV files.
583+
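The field requirements and the `ValueError` condition above can be sketched with the standard `csv` module. This is a hedged illustration rather than the library's code: `read_lexicon_tsv` is a hypothetical helper, it reads from a string instead of a path or URL, and splitting `semantic_tags` on whitespace is an assumption.

```python
import csv
import io
from typing import Dict, List

REQUIRED_FIELDS = {"lemma", "semantic_tags"}


def read_lexicon_tsv(tsv_text: str, include_pos: bool = True) -> Dict[str, List[str]]:
    """Parse TSV text into a key -> semantic tags dictionary."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    field_names = set(reader.fieldnames or [])
    missing = REQUIRED_FIELDS - field_names
    if missing:
        raise ValueError(f"Missing required field headings: {sorted(missing)}")
    # POS is only used when requested and when the TSV provides the field.
    use_pos = include_pos and "pos" in field_names
    lexicon: Dict[str, List[str]] = {}
    for row in reader:
        key = f"{row['lemma']}|{row['pos']}" if use_pos else row["lemma"]
        # Assumption: tags are whitespace separated within the field.
        lexicon[key] = row["semantic_tags"].split()
    return lexicon


good_tsv = "lemma\tsemantic_tags\tpos\tnotes\nLondon\tZ2 Z3\tnoun\tignored\n"
with_pos = read_lexicon_tsv(good_tsv)
without_pos = read_lexicon_tsv(good_tsv, include_pos=False)
```

Fields other than the three named headings, such as `notes` in this sample, are simply never read.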
584+ < h4 id = " tsv_merge.examples" > Examples< a className = " headerlink" href = " #tsv_merge.examples" title = " Permanent link" > & para;< / a>< / h4>
585+
586+
587+ ``` python
588+ from pymusas.lexicon_collection import LexiconCollection
589+ welsh_lexicon_url = " https://raw.githubusercontent.com/UCREL/Multilingual-USAS/refs/heads/master/Welsh/semantic_lexicon_cy.tsv"
590+ english_lexicon_url = " https://raw.githubusercontent.com/UCREL/Multilingual-USAS/refs/heads/master/English/semantic_lexicon_en.tsv"
591+ tsv_urls = [welsh_lexicon_url, english_lexicon_url]
592+ combined_lexicon_collection = LexiconCollection.tsv_merge(* tsv_urls, include_pos = True )
593+ assert isinstance (combined_lexicon_collection, dict )
594+ assert combined_lexicon_collection[" Aber-lash|pnoun" ] == [" Z2" ]
595+ assert combined_lexicon_collection[" Aqua|PROPN" ] == [" Z3c" ]
596+ ```
597+
469598< a id = " pymusas.lexicon_collection.LexiconCollection.__str__" >< / a>
470599
471600# ## \_\_str\_\_
@@ -565,7 +694,7 @@ this.
565694 If not `None ` , maps from the lexicon' s POS tagset to the desired
566695 POS tagset, whereby the mapping is a `List` of tags, at the moment there
567696 is no preference order in this list of POS tags. The POS mapping is
568- useful in situtation whereby the leixcon ' s POS tagset is different to
697+ useful in situations whereby the lexicon ' s POS tagset is different to
569698 the token' s. **Note** that the longer the `List[str]` for each POS
570699 mapping the longer it will take to match MWE templates. A one to one
571700 mapping will have no speed impact on the tagger. A selection of POS
@@ -825,6 +954,69 @@ assert mwe_lexicon_dict['abaixo_adv de_prep'][0] == 'M6'
825954assert mwe_lexicon_dict[' arco_noun e_conj flecha_noun' ][0 ] == ' K5.1'
826955```
827956
957+ < a id = " pymusas.lexicon_collection.MWELexiconCollection.tsv_merge" >< / a>
958+
959+ # ## tsv\_merge
960+
961+ ```python
962+ class MWELexiconCollection(MutableMapping):
963+ | ...
964+ | @ staticmethod
965+ | def tsv_merge(* tsv_file_paths: PathLike) -> dict[str , list[str ]]
966+ ```
967+
968+ Given one or more TSV files it will create a dictionary
969+ object that can be used to create a [`MWELexiconCollection` ](# mwelexiconcollection) whereby
970+ this dictionary is the combination of all of the lexicon information
971+ in the TSV files.
972+
973+ **Note** the data is loaded in list order, so the last TSV file
974+ takes precedence, e.g. if the last TSV file contains
975+ `London_* city_*`: [`Z3`] and the first TSV file contains
976+ `London_* city_*`: [`Z2`] then the returned dictionary will
977+ contain only the one entry, `London_* city_*`: [`Z3`].
978+
979+ **Note** if the TSV files use different POS tagsets, this
980+ could cause issues at tagging time.
981+
982+ < h4 id = " tsv_merge.parameters" > Parameters< a className = " headerlink" href = " #tsv_merge.parameters" title = " Permanent link" > & para;< / a>< / h4>
983+
984+
985+ - __* tsv\_file\_paths__ : `Union[PathLike, str]` <br/>
986+ File paths or URLs to TSV files, each containing at least these two
987+ fields:
988+
989+ 1 . `mwe_template` ,
990+ 2 . `semantic_tags`
991+
992+ All other fields will be ignored.
993+
994+ < h4 id = " tsv_merge.returns" > Returns< a className = " headerlink" href = " #tsv_merge.returns" title = " Permanent link" > & para;< / a>< / h4>
995+
996+
997+ - `dict[str , list[str ]]` < br/ >
998+
999+ < h4 id = " tsv_merge.raises" > Raises< a className = " headerlink" href = " #tsv_merge.raises" title = " Permanent link" > & para;< / a>< / h4>
1000+
1001+
1002+ - `ValueError ` < br/ >
1003+ If the minimum field headings, `mwe_template` and `semantic_tags`,
1004+ do not exist in the given TSV files.
1005+
1006+ < h4 id = " tsv_merge.examples" > Examples< a className = " headerlink" href = " #tsv_merge.examples" title = " Permanent link" > & para;< / a>< / h4>
1007+
1008+
1009+ ``` python
1010+ from pymusas.lexicon_collection import MWELexiconCollection
1011+ welsh_lexicon_url = " https://raw.githubusercontent.com/UCREL/Multilingual-USAS/refs/heads/master/Welsh/mwe-welsh.tsv"
1012+ english_lexicon_url = " https://raw.githubusercontent.com/UCREL/Multilingual-USAS/refs/heads/master/English/mwe-en.tsv"
1013+ tsv_urls = [welsh_lexicon_url, english_lexicon_url]
1014+ combined_lexicon_data = MWELexiconCollection.tsv_merge(* tsv_urls)
1015+ assert isinstance (combined_lexicon_data, dict )
1016+ assert combined_lexicon_data[" Academy_NOUN Award_NOUN" ] == [" A5.1+/K1" ]
1017+ assert combined_lexicon_data[" Ffwrnais_* Dyfi_*" ] == [" Z2" ]
1018+ ```
1019+
8281020< a id = " pymusas.lexicon_collection.MWELexiconCollection.escape_mwe" >< / a>
8291021
8301022# ## escape\_mwe