3 changes: 2 additions & 1 deletion .devcontainer/devcontainer.json
@@ -22,7 +22,8 @@
"ms-vscode.makefile-tools",
"redhat.vscode-yaml",
"tamasfe.even-better-toml",
"github.vscode-github-actions"],
"github.vscode-github-actions",
"mechatroner.rainbow-csv"],
"settings": {
"python-envs.defaultEnvManager": "ms-python.python:venv"
}
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -13,6 +13,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- The documentation now has a `How-to Tag Text` guide for Finnish and English.
- Support for [developer/dev containers](https://containers.dev/); the configuration files can be found in the [.devcontainer folder](./.devcontainer). This allows for easier onboarding and development consistency.
- Functional tests have been added and can be found in the following directory: [./tests/functional_tests/](./tests/functional_tests/)
- The ability to merge `LexiconCollection`s through `pymusas.lexicon_collection.LexiconCollection.merge`.
- The ability to merge `LexiconCollection` data through a list of file paths to TSV files using `pymusas.lexicon_collection.LexiconCollection.tsv_merge`, which when merged will allow the creation of a combined `LexiconCollection` instance.
- The ability to merge `MWELexiconCollection` data through a list of file paths to TSV files using `pymusas.lexicon_collection.MWELexiconCollection.tsv_merge`, which when merged will allow the creation of a combined `MWELexiconCollection` instance.
- Added a usage example to the documentation showing how to combine/merge lexicon collections together and add them to a PyMUSAS rule based tagger.

### Changed

196 changes: 194 additions & 2 deletions docs/docs/api/lexicon_collection.md
@@ -312,9 +312,12 @@ combination of the lemma and pos and the value are the semantic tags.
The lemma and pos are combined as follows: `{lemma}|{pos}`, e.g.
`Car|Noun`

If the pos value is None then then only the lemma is used: `{lemma}`,
If the pos value is None then only the lemma is used: `{lemma}`,
e.g. `Car`

**Note** If the key already exists then the most recent entry will
overwrite the existing entry.
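
The key format and overwrite behaviour described above can be sketched with a plain dictionary. This is an illustration only: `make_key` and `add_entry` are hypothetical helpers, not PyMUSAS functions, the real method's signature may differ, and the tag values are made up.

```python
from typing import Optional


def make_key(lemma: str, pos: Optional[str] = None) -> str:
    """Build a `{lemma}|{pos}` key, or just `{lemma}` when pos is None."""
    return lemma if pos is None else f"{lemma}|{pos}"


def add_entry(data: dict[str, list[str]], lemma: str,
              semantic_tags: list[str], pos: Optional[str] = None) -> None:
    """Store the tags under the key; an existing key is overwritten."""
    data[make_key(lemma, pos)] = semantic_tags


# Hypothetical entries used only to show the behaviour.
lexicon: dict[str, list[str]] = {}
add_entry(lexicon, "Car", ["Z0"], pos="Noun")
add_entry(lexicon, "Car", ["Z1"])              # no POS: key is "Car"
add_entry(lexicon, "Car", ["Z2"], pos="Noun")  # overwrites "Car|Noun"
```

After these calls the dictionary holds two keys, `Car|Noun` and `Car`, and `Car|Noun` maps to the most recently added tags.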

<h4 id="add_lexicon_entry.parameters">Parameters<a className="headerlink" href="#add_lexicon_entry.parameters" title="Permanent link">&para;</a></h4>


@@ -466,6 +469,132 @@ welsh_lexicon_collection = LexiconCollection(welsh_lexicon_dict)
assert welsh_lexicon_dict['ceir'][0] == 'M3fn'
```

<a id="pymusas.lexicon_collection.LexiconCollection.merge"></a>

### merge

```python
class LexiconCollection(MutableMapping):
| ...
| @staticmethod
| def merge(
| *lexicon_collections: "LexiconCollection"
| ) -> "LexiconCollection"
```

Given more than one lexicon collection it will create a single lexicon
collection whereby the lexicon data from each will be combined.

**Note** the data is loaded in list order therefore the last lexicon
collection will take precedence, i.e. if the last contains `London`: [`Z3`]
and the first contains `London`: [`Z2`] then the returned
LexiconCollection will only contain the one entry; `London`: [`Z3`].

**Note** if the lexicon collections contain POS information we assume
that all of the lexicon collections use the same POS tagset;
if they do not, this could cause issues at tagging time.

<h4 id="merge.parameters">Parameters<a className="headerlink" href="#merge.parameters" title="Permanent link">&para;</a></h4>


- __*lexicon\_collections__ : `LexiconCollection` <br/>
Two or more lexicon collections that are to be merged.

<h4 id="merge.returns">Returns<a className="headerlink" href="#merge.returns" title="Permanent link">&para;</a></h4>


- [`LexiconCollection`](#lexiconcollection) <br/>

<h4 id="merge.examples">Examples<a className="headerlink" href="#merge.examples" title="Permanent link">&para;</a></h4>


``` python
from pymusas.lexicon_collection import LexiconCollection
welsh_lexicon_url = "https://raw.githubusercontent.com/UCREL/Multilingual-USAS/refs/heads/master/Welsh/semantic_lexicon_cy.tsv"
english_lexicon_url = "https://raw.githubusercontent.com/UCREL/Multilingual-USAS/refs/heads/master/English/semantic_lexicon_en.tsv"
welsh_lexicon_data = LexiconCollection.from_tsv(welsh_lexicon_url, include_pos=True)
welsh_lexicon = LexiconCollection(welsh_lexicon_data)
english_lexicon_data = LexiconCollection.from_tsv(english_lexicon_url, include_pos=True)
english_lexicon = LexiconCollection(english_lexicon_data)
combined_lexicon_collection = LexiconCollection.merge(welsh_lexicon, english_lexicon)
assert isinstance(combined_lexicon_collection, LexiconCollection)
assert combined_lexicon_collection["Aber-lash|pnoun"] == ["Z2"]
assert combined_lexicon_collection["Aqua|PROPN"] == ["Z3c"]
```
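
The two notes above (last collection wins, and all collections should share a POS tagset) can be illustrated without PyMUSAS by using plain dictionaries. This is a minimal sketch: `merge_dicts` is a hypothetical helper rather than the library's implementation, and the entries are made up.

```python
def merge_dicts(*lexicons: dict[str, list[str]]) -> dict[str, list[str]]:
    """Combine lexicons left to right; later entries replace earlier ones."""
    combined: dict[str, list[str]] = {}
    for lexicon in lexicons:
        combined.update(lexicon)
    return combined


# Hypothetical data: the two lexicons use different POS tagsets for "car".
first = {"London": ["Z2"], "car|noun": ["M3"]}
last = {"London": ["Z3"], "car|NOUN": ["M3"]}
combined = merge_dicts(first, last)

# The last lexicon wins for the shared key.
assert combined["London"] == ["Z3"]
# Differing POS tagsets produce distinct keys for the same lemma, so a
# lookup using one tagset will not find entries stored under the other.
assert "car|noun" in combined and "car|NOUN" in combined
```

The second assertion shows why mixed tagsets matter: both keys survive the merge, but a tagger that emits only one of the two tagsets can match only one of them.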

<a id="pymusas.lexicon_collection.LexiconCollection.tsv_merge"></a>

### tsv\_merge

```python
class LexiconCollection(MutableMapping):
| ...
| @staticmethod
| def tsv_merge(
| *tsv_file_paths: PathLike,
| include_pos: bool = True
| ) -> dict[str, list[str]]
```

Given one or more TSV files it will create a single dictionary object
with the combination of all the lexicon data in each TSV, this dictionary
object can then be used to create a [`LexiconCollection`](#lexiconcollection).

For more information on how the TSV data is loaded see [`from_tsv`](#from_tsv).

**Note** the data is loaded in list order therefore the last TSV file
will take precedence, i.e. if the last TSV file contains `London`: [`Z3`]
and the first TSV file contains `London`: [`Z2`] then the returned
dictionary will only contain the one entry; `London`: [`Z3`].

**Note** if the TSV files contain POS information we assume that all
of the TSV files use the same POS tagset; if they do not, this could
cause issues at tagging time.

<h4 id="tsv_merge.parameters">Parameters<a className="headerlink" href="#tsv_merge.parameters" title="Permanent link">&para;</a></h4>


- __*tsv\_file\_paths__ : `PathLike` <br/>
File paths and/or URLs to TSV files, each of which contains at least two
fields, with an optional third, with the following headings:

1. `lemma`,
2. `semantic_tags`
3. `pos` (Optional)

All other fields will be ignored.
- __include\_pos__ : `bool`, optional (default = `True`) <br/>
Whether to include the POS information, if the information is available,
or not. See [`add_lexicon_entry`](#add_lexicon_entry) for more information on this
parameter.

<h4 id="tsv_merge.returns">Returns<a className="headerlink" href="#tsv_merge.returns" title="Permanent link">&para;</a></h4>


- `dict[str, list[str]]` <br/>

<h4 id="tsv_merge.raises">Raises<a className="headerlink" href="#tsv_merge.raises" title="Permanent link">&para;</a></h4>


- `ValueError` <br/>
If the minimum field headings, `lemma` and `semantic_tags`, do not
exist in the given TSV files.

<h4 id="tsv_merge.examples">Examples<a className="headerlink" href="#tsv_merge.examples" title="Permanent link">&para;</a></h4>


``` python
from pymusas.lexicon_collection import LexiconCollection
welsh_lexicon_url = "https://raw.githubusercontent.com/UCREL/Multilingual-USAS/refs/heads/master/Welsh/semantic_lexicon_cy.tsv"
english_lexicon_url = "https://raw.githubusercontent.com/UCREL/Multilingual-USAS/refs/heads/master/English/semantic_lexicon_en.tsv"
tsv_urls = [welsh_lexicon_url, english_lexicon_url]
combined_lexicon_collection = LexiconCollection.tsv_merge(*tsv_urls, include_pos=True)
assert isinstance(combined_lexicon_collection, dict)
assert combined_lexicon_collection["Aber-lash|pnoun"] == ["Z2"]
assert combined_lexicon_collection["Aqua|PROPN"] == ["Z3c"]
```
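
To make the loading order, the `include_pos` flag, and the `ValueError` condition concrete, here is a rough sketch that merges in-memory TSV strings using only the standard library. It is not the PyMUSAS implementation: `tsv_merge_sketch` is a hypothetical name, it takes TSV text rather than file paths or URLs, it assumes semantic tags are whitespace separated within the field, and the rows are made up.

```python
import csv
import io


def tsv_merge_sketch(*tsv_texts: str,
                     include_pos: bool = True) -> dict[str, list[str]]:
    """Merge TSV lexicon text in argument order; later rows overwrite."""
    combined: dict[str, list[str]] = {}
    for text in tsv_texts:
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        headings = set(reader.fieldnames or [])
        if not {"lemma", "semantic_tags"} <= headings:
            raise ValueError("TSV must contain the `lemma` and "
                             "`semantic_tags` headings")
        for row in reader:
            key = row["lemma"]
            pos = row.get("pos")
            # Only append the POS when requested and when it is available.
            if include_pos and pos:
                key = f"{key}|{pos}"
            # Assumption: tags within the field are whitespace separated.
            combined[key] = row["semantic_tags"].split()
    return combined


# Hypothetical TSV content.
first_tsv = "lemma\tpos\tsemantic_tags\nLondon\tnoun\tZ2\n"
last_tsv = "lemma\tpos\tsemantic_tags\nLondon\tnoun\tZ3 Z1\nParis\tnoun\tZ2\n"

with_pos = tsv_merge_sketch(first_tsv, last_tsv)
without_pos = tsv_merge_sketch(first_tsv, last_tsv, include_pos=False)
```

With POS included the keys take the form `London|noun`; with `include_pos=False` they are bare lemmas. In both cases the `London` entry from the last TSV replaces the one from the first.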

<a id="pymusas.lexicon_collection.LexiconCollection.__str__"></a>

### \_\_str\_\_
@@ -565,7 +694,7 @@ this.
If not `None`, maps from the lexicon's POS tagset to the desired
POS tagset, whereby the mapping is a `List` of tags, at the moment there
is no preference order in this list of POS tags. The POS mapping is
useful in situtation whereby the leixcon's POS tagset is different to
useful in situations whereby the lexicon's POS tagset is different to
the token's. **Note** that the longer the `List[str]` for each POS
mapping the longer it will take to match MWE templates. A one to one
mapping will have no speed impact on the tagger. A selection of POS
@@ -825,6 +954,69 @@ assert mwe_lexicon_dict['abaixo_adv de_prep'][0] == 'M6'
assert mwe_lexicon_dict['arco_noun e_conj flecha_noun'][0] == 'K5.1'
```

<a id="pymusas.lexicon_collection.MWELexiconCollection.tsv_merge"></a>

### tsv\_merge

```python
class MWELexiconCollection(MutableMapping):
| ...
| @staticmethod
| def tsv_merge(*tsv_file_paths: PathLike) -> dict[str, list[str]]
```

Given one or more TSV files it will create a dictionary
object that can be used to create a [`MWELexiconCollection`](#mwelexiconcollection) whereby
this dictionary is the combination of all of the lexicon information
in the TSV files.

**Note** the data is loaded in list order therefore the last TSV file
will take precedence, i.e. if the last TSV file contains
`London_* city_*`: [`Z3`] and the first TSV file contains
`London_* city_*`: [`Z2`] then the returned dictionary will only
contain the one entry; `London_* city_*`: [`Z3`].

**Note** if the POS tagsets used in the TSV files differ, this
could cause issues at tagging time.

<h4 id="tsv_merge.parameters">Parameters<a className="headerlink" href="#tsv_merge.parameters" title="Permanent link">&para;</a></h4>


- __*tsv\_file\_paths__ : `Union[PathLike, str]` <br/>
File paths or URLs to TSV files, each of which contains at least these two
fields:

1. `mwe_template`,
2. `semantic_tags`

All other fields will be ignored.

<h4 id="tsv_merge.returns">Returns<a className="headerlink" href="#tsv_merge.returns" title="Permanent link">&para;</a></h4>


- `dict[str, list[str]]` <br/>

<h4 id="tsv_merge.raises">Raises<a className="headerlink" href="#tsv_merge.raises" title="Permanent link">&para;</a></h4>


- `ValueError` <br/>
If the minimum field headings, `mwe_template` and `semantic_tags`,
do not exist in the given TSV files.

<h4 id="tsv_merge.examples">Examples<a className="headerlink" href="#tsv_merge.examples" title="Permanent link">&para;</a></h4>


``` python
from pymusas.lexicon_collection import MWELexiconCollection
welsh_lexicon_url = "https://raw.githubusercontent.com/UCREL/Multilingual-USAS/refs/heads/master/Welsh/mwe-welsh.tsv"
english_lexicon_url = "https://raw.githubusercontent.com/UCREL/Multilingual-USAS/refs/heads/master/English/mwe-en.tsv"
tsv_urls = [welsh_lexicon_url, english_lexicon_url]
combined_lexicon_data = MWELexiconCollection.tsv_merge(*tsv_urls)
assert isinstance(combined_lexicon_data, dict)
assert combined_lexicon_data["Academy_NOUN Award_NOUN"] == ["A5.1+/K1"]
assert combined_lexicon_data["Ffwrnais_* Dyfi_*"] == ["Z2"]
```

<a id="pymusas.lexicon_collection.MWELexiconCollection.escape_mwe"></a>

### escape\_mwe