You can either use the dev container with your favourite editor, e.g. VSCode, or create your setup locally. Below we demonstrate both.
Both setups share the same tools:
- uv for Python packaging and development
- make (OPTIONAL) for automation of tasks, not strictly required but makes life easier.
A dev container uses a docker container to create the required development environment. The Dockerfile we use for this dev container can be found at ./.devcontainer/Dockerfile. Running it locally requires docker to be installed. You can also run it in a cloud based code editor; for a list of supported editors/cloud editors see the following webpage.
To run for the first time on a local VSCode editor (a slightly more detailed guide is available on the VSCode website):
- Ensure docker is running.
- Ensure the VSCode Dev Containers extension is installed in your VSCode editor.
- Open the command palette (CMD + SHIFT + P) and then select Dev Containers: Rebuild and Reopen in Container
You should now have everything you need to develop: uv, make, and, for VSCode, various extensions like Pylance.
If you have any trouble see the VSCode website.
To run locally first ensure you have the following tools installed locally:
- uv for Python packaging and development.
- make (OPTIONAL) for automation of tasks, not strictly required but makes life easier.
  - Ubuntu: apt-get install make
  - Mac: Xcode command line tools includes make, else you can use brew.
  - Windows: Various solutions are proposed in this blog post on how to install on Windows, including Cygwin and Windows Subsystem for Linux.
When developing on the project you will want to install the Python package locally in editable format with all the extra requirements. This can be done like so:
uv sync
This code base uses isort, flake8 and mypy to ensure that the format of the code is consistent and that it contains type hints. The isort and mypy settings can be found within ./pyproject.toml and the flake8 settings can be found in ./.flake8. To run these linters:
make lint
The default or recommended Python version is shown in [.python-version](./.python-version), currently 3.12. This can be changed using the uv command:
uv python pin
# uv python pin 3.13
- /pymusas_models - contains the code that creates all of the PyMUSAS models.
- /model_release.py - Releases the models, that have been created locally, to GitHub as a GitHub release per model.
- /model_creation_tests
- /model_function_tests - The tests are divided up by language, using each language's BCP 47 language code, and then by model (either rule based tagger or neural tagger).
  - /model_function_tests/fr
    - /model_function_tests/fr/test_rule_based_tagger.py
  - /model_function_tests/it
    - /model_function_tests/it/test_rule_based_tagger.py
  - /model_function_tests/en
    - /model_function_tests/en/test_neural_tagger.py
    - /model_function_tests/en/test_rule_based_tagger.py
  - other language codes
- /model_creation_tests/test_create_and_install_models.py - This creates and installs the models used within tests and in doing so tests that this part of the code base works. Note that we install the models to a temporary Python virtual environment.
The testing structure of /model_function_tests has been heavily influenced by how spaCy tests their models.
Each model is created using a spaCy configuration and meta file, and each language can have more than one model. These configuration and meta files are automatically created using the Command Line Interface (CLI) to pymusas_models and then used to create the PyMUSAS spaCy models with their relevant installation data (distribution files, README, meta data, etc). This process is done per model and all model data is stored in its own model folder, named according to the model naming convention specified in the main README, within the directory you have specified to store this data in. Each of these spaCy model folders contains the relevant information to create a GitHub release for that model.
The CLI knows which models to create and how to create them by utilising the meta data stored in the language_resources.json file (for more information on the language_resources.json file see the Language Resource Meta Data section).
To create all of the models and store them in the folder ./models run the following:
python pymusas_models/__main__.py create-models --models-directory ./models --language-resource-file ./language_resources.json
This will create the following folders:
- ./models/cmn_dual_upos2usas_contextual_none-0.3.0
- ./models/cmn_single_upos2usas_contextual_none-0.3.0
- ./models/cy_dual_basiccorcencc2usas_contextual_none-0.3.0
- other model folders
To automate the release of the models we have created we are going to use the GitHub REST API. This REST API has a rate limit of 5000 calls per hour when you are running it as an authenticated client; for details on authentication see the GitHub documentation. As we are creating releases we need a Personal Access Token (PAT) with public_repo scope for authentication. The PAT can be created at the following link; we named our PAT pymusas-models.
Once you have created your PAT add it to the file GITHUB_TOKEN.json. This file should never be added to the repository as it will contain your PAT, which is sensitive information. The PAT should be added to the JSON file like so:
{"PAT": "YOUR PAT TOKEN"}
Now, assuming all of the models you would like to release are in the ./models directory, we can release the models to GitHub using the model_release.py script like so:
python model_release.py
Once it has run successfully it will state the rate limit you had and have left on the GitHub REST API, like below:
Current rate limit: {'limit': 5000, 'used': 74, 'remaining': 4926, 'reset': 1652299539}
Rate limit after model releases: {'limit': 5000, 'used': 76, 'remaining': 4924, 'reset': 1652299539}
In addition you should see the models you wanted to release to GitHub now on GitHub within the releases section.
Some errors that can occur when running the model_release.py script:
- The model you want to release has already been released. If this occurs and is a mistake then delete the model from the ./models folder. If this is not a mistake then you may need to change the model version of the model (the c element as described in the Model Versioning section from the main README) as each model that is released has to have a unique model name.
- The model did not upload correctly.
Once you have corrected the error re-run the model_release.py script.
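Before running the release script it can be useful to check that the PAT file is readable and to see how much of the rate limit is left. The following is a minimal sketch of that check, not the code used by model_release.py: the function names are hypothetical, and it assumes the public GitHub REST API rate limit endpoint (https://api.github.com/rate_limit) and a GITHUB_TOKEN.json file shaped as described above.

```python
import json
import urllib.request
from pathlib import Path


def load_pat(token_file: Path) -> str:
    """Read the Personal Access Token from a JSON file shaped like {"PAT": "..."}."""
    with token_file.open("r", encoding="utf-8") as fp:
        return json.load(fp)["PAT"]


def build_rate_limit_request(pat: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated request to the rate limit endpoint."""
    return urllib.request.Request(
        "https://api.github.com/rate_limit",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {pat}",
        },
    )


def fetch_core_rate_limit(pat: str) -> dict:
    """Send the request and return the 'core' rate limit numbers (needs network access)."""
    with urllib.request.urlopen(build_rate_limit_request(pat)) as response:
        return json.load(response)["resources"]["core"]
```

Calling fetch_core_rate_limit(load_pat(Path("GITHUB_TOKEN.json"))) should return a dictionary with the same kind of keys as the rate limit output shown above (limit, used, remaining, reset).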
If you want to specify the version of the model, i.e. the c part of the model version as described in the Model Versioning section within the main README, use the --model-version command line option (default value "0").
In addition, to specify the version of spaCy that the model will be compatible with, use the --spacy-version command line option (default value ">=3.0,<4.0"). This spaCy version is overridden per language if the language resource file for a given language specifies a spaCy version. See python pymusas_models/__main__.py create-models --help for more details.
Below we show how the --model-version command line option can be used:
python pymusas_models/__main__.py create-models \
--models-directory ./models \
--language-resource-file ./language_resources.json \
--model-version 1
This will create the following folders, assuming we are using PyMUSAS version 0.3.0:
- ./models/cmn_dual_upos2usas_contextual_none-0.3.1
- ./models/cmn_single_upos2usas_contextual_none-0.3.1
- ./models/cy_dual_basiccorcencc2usas_contextual_none-0.3.1
- other model folders
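The folder names above suggest how the model version is combined with the PyMUSAS version. The sketch below illustrates that naming pattern only; the function model_folder_name is hypothetical, and the rule that the first two version parts come from the PyMUSAS version is inferred from the examples in this document rather than taken from the pymusas_models code.

```python
def model_folder_name(model_name: str, pymusas_version: str, model_version: str = "0") -> str:
    """Join a model name with an a.b.c version string.

    Assumption: a and b are the major and minor parts of the PyMUSAS version
    and c is the value given to --model-version.
    """
    major, minor, _patch = pymusas_version.split(".")
    return f"{model_name}-{major}.{minor}.{model_version}"


print(model_folder_name("cmn_dual_upos2usas_contextual_none", "0.3.0", "1"))
# cmn_dual_upos2usas_contextual_none-0.3.1
```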
To create the overview of the models table from the main README:
- If you have not already done so create all of the models (if you have done this please skip this step):
python pymusas_models/__main__.py create-models --models-directory ./models --language-resource-file ./language_resources.json
- Run the following which will print out the Markdown overview of the models table, which can then be copied into the main README:
python pymusas_models/__main__.py overview-of-models --models-directory ./models
The tests are testing both:
- That the models can be created and installed via pip locally.
- Once created and installed the models function as expected.
This has resulted in two test folders, as shown in General Folder Structure: /model_function_tests and /model_creation_tests.
The /model_creation_tests folder tests the first bullet point and /model_function_tests tests the second bullet point.
More details about these tests can be found below, including how to run them individually or together.
As the /model_function_tests require the installed models that are created from /model_creation_tests, the /model_creation_tests tests are run first. The models created will be installed to a virtual environment that will be saved to ./temp_venv. NOTE: ./temp_venv is assumed to not exist and an error will occur if the directory does exist, unless you specify the --overwrite flag, which will first delete the directory if it exists and then re-create it.
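The directory handling described above can be sketched in a few lines of Python. This is an illustration of the described behaviour of --overwrite only, not the code in the test suite; the function name prepare_virtual_env_directory and the choice of FileExistsError are assumptions.

```python
import shutil
from pathlib import Path


def prepare_virtual_env_directory(directory: Path, overwrite: bool = False) -> Path:
    """Create an empty directory for the virtual environment.

    If the directory already exists an error is raised, unless overwrite is
    True, in which case the directory is deleted first and then re-created.
    """
    if directory.exists():
        if not overwrite:
            raise FileExistsError(
                f"{directory} already exists, use the overwrite option to delete it first."
            )
        shutil.rmtree(directory)
    directory.mkdir(parents=True)
    return directory
```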
Linux/Mac
pytest --virtual-env-directory=./temp_venv ./model_creation_tests
Using the overwrite flag, which will first delete ./temp_venv if it exists:
pytest --virtual-env-directory=./temp_venv --overwrite ./model_creation_tests
This last command can be run as a make command:
make model-creation-tests
Note for Mac users: I have found that make might not work if using the make command version that comes as default with your Mac (version 3.81), but the make command you can install through Conda (version 4.2.1) will work.
Windows
pytest --virtual-env-directory=.\temp_venv .\model_creation_tests
Using the overwrite flag, which will first delete .\temp_venv if it exists:
pytest --virtual-env-directory=.\temp_venv --overwrite .\model_creation_tests
Separating these tests into two different test folders allows the virtual environment to be cached, which allows the second set of tests, /model_function_tests, to be run as many times as you like without having to re-create the virtual environment.
Linux/Mac
source ./temp_venv/venv/bin/activate # Used to activate the virtual environment
pytest ./model_function_tests
deactivate
Windows
.\temp_venv\venv\Scripts\Activate.ps1 # Used to activate the virtual environment
pytest .\model_function_tests
deactivate
Linux/Mac
There is a make command that will run all tests:
make run-all-tests
Note for Mac users: I have found that make might not work if using the make command version that comes as default with your Mac (version 3.81), but the make command you can install through Conda (version 4.2.1) will work.
Windows
To run all tests:
pytest --virtual-env-directory=.\temp_venv --overwrite .\model_creation_tests
.\temp_venv\venv\Scripts\Activate.ps1 # Used to activate the virtual environment
pytest .\model_function_tests
deactivate
On the GitHub Actions CI the tests run on a subset of the models created. This is due to disk space restrictions on the GitHub Actions runners, see issue #14. The models we do not run on are:
- All xx multilingual models.
- English base neural model.
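The two-stage flow described in this section (creation tests first, then function tests inside the created virtual environment) can also be sketched as a small cross-platform Python helper, which may be handy where make is unavailable. This is only a sketch: the function names are hypothetical, and it assumes that pytest is installed in the created virtual environment so that its interpreter can run python -m pytest without activating the environment first.

```python
import sys
from pathlib import Path


def venv_python(venv_directory: Path, windows: bool) -> Path:
    """Path to the interpreter inside <venv_directory>/venv, as laid out in this guide."""
    venv = venv_directory / "venv"
    if windows:
        return venv / "Scripts" / "python.exe"
    return venv / "bin" / "python"


def build_test_commands(
    venv_directory: Path,
    github_ci: bool = False,
    windows: bool = sys.platform == "win32",
) -> list[list[str]]:
    """Return the creation test command followed by the function test command."""
    extra = ["--github-ci"] if github_ci else []
    creation = [
        sys.executable, "-m", "pytest",
        f"--virtual-env-directory={venv_directory}", "--overwrite",
        *extra, "model_creation_tests",
    ]
    # Assumption: running pytest through the virtual environment's own
    # interpreter is equivalent to activating the environment first.
    function = [
        str(venv_python(venv_directory, windows)), "-m", "pytest",
        *extra, "model_function_tests",
    ]
    return [creation, function]
```

Each returned command can be passed to subprocess.run(command, check=True) in order, so that the function tests only start if the creation tests succeed.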
To run the tests locally in the way they are run on the GitHub actions CI:
rm -rf ./temp_venv
uv run -m pytest -vvv --virtual-env-directory=./temp_venv --overwrite --github-ci ./model_creation_tests
source ./temp_venv/venv/bin/activate
pytest --github-ci ./model_function_tests
deactivate
Language resource meta data is stored in the language_resources.json file. It is used by the entry points to the main package, pymusas_models, to create the models. The structure of the JSON file is the following, and is validated using the LanguageResources Pydantic Base Model class within pymusas_models/language_resource.py:
{
"language_resources": {
"Language one BCP 47 code": {
"spacy_version": ">=3.0,<4.0",
"models": [
{
"name": "NAME OF MODEL",
"model_type": "pymusas_rule_based_tagger",
"resources":{
"ranker": "RANKER NAME",
"rules": [
{
"rule_type": "single",
"pos_mapper": "POS MAPPER NAME OR NONE",
"lexicon_url": "PERMANENT URL TO LEXICON",
"with_pos": true
},
{
"rule_type": "mwe",
"pos_mapper": null,
"lexicon_url": ""
}
],
"default_punctuation_tags": ["PUNCT"],
"default_number_tags": ["NUM"]
},
"config": {
"pos_attribute": "tag_"
}
},
{
"name": "xx_none_none_none_multilingualsmallbem",
"model_type": "pymusas_neural_tagger",
"pretrained_model_name_or_path": "ucrelnlp/PyMUSAS-Neural-Multilingual-Small-BEM",
"config": {
"tokenizer_kwargs": {
"add_prefix_space": true
}
}
}
],
"language_data": {
"description": "LANGUAGE NAME",
"macrolanguage": "Macrolanguage code",
"script": "ISO 15924 script code"
}
},
...
}
}
- The BCP 47 code of the language. The BCP47 language subtag lookup tool is a great tool to use to find a BCP 47 code for a language.
  - spacy_version - Optional, this key is only required if the version of spaCy required has to be more specific than the default, which is ">=3.0,<4.0". The version of spaCy required should be a String and follow the standard Python pip install syntax of the version specifier, e.g. >=3.3.
  - models - a list of models that are associated with this language. Each model is represented as a dictionary. We support two model types, pymusas_rule_based_tagger and pymusas_neural_tagger, which reflect the models available in pymusas (neural and rule_based).
    - pymusas_rule_based_tagger:
      - name - a unique model name that follows the model naming convention.
      - model_type - this should be pymusas_rule_based_tagger, which was chosen as it follows the spaCy component name of the tagger in pymusas.
      - resources - The keys and values in this dictionary follow the arguments that the rule based tagger accepts within its initialize method.
        - ranker - The name of the ranker, in this case it can only be contextual as this is the only supported ranker.
        - default_punctuation_tags - A list of POS tags that represent punctuation. Can be null.
        - default_number_tags - A list of POS tags that represent numbers. Can be null.
        - rules - A list of rules to apply to the sequence of tokens. This has to be a list of dictionaries representing either Single or MWE rules.
          - rule_type - single or mwe
          - pos_mapper - this is the mapping for linking the POS tags of the tokens to the POS tags of the lexicon. It can either be null for no mapping, upos2usas for UPOS to USAS or basiccorcencc2usas for Basic CorCenCC to USAS.
          - lexicon_url - URL to the lexicon to use. This should be a permanent URL, e.g. if the URL is to a GitHub repository the URL should be to a specific commit rather than to the HEAD of the main branch.
          - with_pos - Only for single rules, whether the lexicon has POS tags or not.
      - config - The keys and values in this dictionary follow the arguments that the rule based tagger accepts within its init method:
        - pymusas_tags_token_attr - The name of the attribute to assign the predicted tags to under the Token._ class.
        - pymusas_mwe_indexes_attr - The name of the attribute to assign the start and end token index of the associated MWE to under the Token._ class.
        - pos_attribute - The name of the attribute that the Part Of Speech (POS) tag is assigned to within the Token class.
        - lemma_attribute - The name of the attribute that the lemma is assigned to within the Token class.
    - pymusas_neural_tagger:
      - name - a unique model name that follows the model naming convention.
      - model_type - this should be pymusas_neural_tagger, which was chosen as it follows the spaCy component name of the tagger in pymusas.
      - resources - The keys and values in this dictionary follow the arguments that the neural tagger accepts within its initialize method.
        - pretrained_model_name_or_path - The string ID or path of the pretrained neural Word Sense Disambiguation (WSD) model to load.
      - config - The keys and values in this dictionary follow the arguments that the neural tagger accepts within its init method:
        - pymusas_tags_token_attr - The name of the attribute to assign the predicted tags to under the Token._ class.
        - pymusas_mwe_indexes_attr - The name of the attribute to assign the start and end token index of the associated MWE to under the Token._ class.
        - top_n - The number of tags to predict. If -1 all tags will be predicted. If 0, or a negative value other than -1, a ValueError will be raised.
        - device - The device to load the model, wsd_model, on, e.g. 'cpu'.
        - tokenizer_kwargs - Keyword arguments to pass to the tokenizer's transformers.AutoTokenizer.from_pretrained method.
  - language_data - this is data that is associated with the BCP 47 language code. To some degree this is redundant as we can look this data up through the BCP 47 code, however we thought it is better to have it in the meta data for easy lookup. All of this data can be easily found by looking up the BCP 47 language code in the BCP47 language subtag lookup tool.
    - description - The description of the language code.
    - macrolanguage - The macrolanguage tag. If this does not exist then give the primary language tag, which could be the same as the whole BCP 47 code. The macrolanguage tag could be useful in future for grouping languages.
    - script - The ISO 15924 script code of the language code. The BCP 47 code by default does not always include the script of the language as the default script for that language is assumed, therefore this data is here to make the default more explicit.
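Some of the constraints listed above can be checked with a few lines of standard library Python before running the CLI. The sketch below is a simplified stand-in and not the LanguageResources Pydantic class: the function find_problems is hypothetical and it only checks model name uniqueness, the two supported model types, the two rule types, and that with_pos is only given for single rules.

```python
ALLOWED_MODEL_TYPES = {"pymusas_rule_based_tagger", "pymusas_neural_tagger"}
ALLOWED_RULE_TYPES = {"single", "mwe"}


def find_problems(language_resources: dict) -> list[str]:
    """Return human readable problems found in a language resources dictionary."""
    problems = []
    seen_names = set()
    for code, entry in language_resources["language_resources"].items():
        for model in entry.get("models", []):
            name = model.get("name", "<missing name>")
            # Each released model has to have a unique name.
            if name in seen_names:
                problems.append(f"{code}: duplicate model name {name!r}")
            seen_names.add(name)
            model_type = model.get("model_type")
            if model_type not in ALLOWED_MODEL_TYPES:
                problems.append(f"{code}/{name}: unknown model_type {model_type!r}")
            # Neural tagger entries have no rules, so this loop is skipped for them.
            for rule in model.get("resources", {}).get("rules", []):
                rule_type = rule.get("rule_type")
                if rule_type not in ALLOWED_RULE_TYPES:
                    problems.append(f"{code}/{name}: unknown rule_type {rule_type!r}")
                if "with_pos" in rule and rule_type != "single":
                    problems.append(f"{code}/{name}: with_pos is only for single rules")
    return problems
```

To use it, load the file with json.load and pass the resulting dictionary to find_problems; an empty list means none of these particular checks failed.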
Below is an extract of ./language_resources.json, given as an example of this JSON structure:
{ "language_resources":
{
"cmn":
{
"models": [
{
"name": "cmn_single_upos2usas_contextual_none",
"model_type": "pymusas_rule_based_tagger",
"resources":{
"ranker": "contextual",
"rules": [
{
"rule_type": "single",
"pos_mapper": "upos2usas",
"lexicon_url": "https://raw.githubusercontent.com/UCREL/Multilingual-USAS/7ccc8baaea36f3fd249e77671db5638c1cba6136/Chinese/semantic_lexicon_chi.tsv",
"with_pos": true
}
],
"default_punctuation_tags": ["PUNCT"],
"default_number_tags": ["NUM"]
}
},
{
"name": "cmn_dual_upos2usas_contextual_none",
"model_type": "pymusas_rule_based_tagger",
"resources":{
"ranker": "contextual",
"rules": [
{
"rule_type": "single",
"pos_mapper": "upos2usas",
"lexicon_url": "https://raw.githubusercontent.com/UCREL/Multilingual-USAS/7ccc8baaea36f3fd249e77671db5638c1cba6136/Chinese/semantic_lexicon_chi.tsv",
"with_pos": true
},
{
"rule_type": "mwe",
"pos_mapper": "upos2usas",
"lexicon_url": "https://raw.githubusercontent.com/UCREL/Multilingual-USAS/7ccc8baaea36f3fd249e77671db5638c1cba6136/Chinese/mwe-chi.tsv"
}
],
"default_punctuation_tags": ["PUNCT"],
"default_number_tags": ["NUM"]
}
}
],
"language_data": {
"description": "Mandarin Chinese",
"macrolanguage": "zh",
"script": "Hani"
}
},
...
"xx":
{
"models":[
{
"name": "xx_none_none_none_multilingualsmallbem",
"model_type": "pymusas_neural_tagger",
"pretrained_model_name_or_path": "ucrelnlp/PyMUSAS-Neural-Multilingual-Small-BEM",
"config": {
"tokenizer_kwargs": {
"add_prefix_space": true
}
}
},
{
"name": "xx_none_none_none_multilingualbasebem",
"model_type": "pymusas_neural_tagger",
"pretrained_model_name_or_path": "ucrelnlp/PyMUSAS-Neural-Multilingual-Base-BEM",
"config": {
"tokenizer_kwargs": {
"add_prefix_space": true
}
}
}
],
"language_data":{
"description": "Multilingual",
"macrolanguage": "xx",
"script": "xx"
}
}
}
}
If you see an error like the following (zsm is a language code):
KeyError: 'zsm'
or
FAILED model_creation_tests/test_create_and_install_models.py::test_create_and_install_models - AssertionError: assert 0 == 1
+ where 1 = <Result KeyError('zsm')>.exit_code
Then it is due to the language code of the language newly added to the ./language_resources.json file not mapping directly to a spaCy language code for that language (this is required to create a blank spacy.Language object). To fix this, add a mapping from the newly added language code to the spaCy language code to the PYMUSAS_LANG_TO_SPACY dictionary within ./pymusas_models/__main__.py. spaCy language codes can be found at https://spacy.io/usage/models.
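The kind of lookup that produces this KeyError can be illustrated as follows. This is a hedged sketch: the dictionary contents and the function to_spacy_language_code are hypothetical examples and are not copied from the pymusas_models code, whose actual lookup logic may differ.

```python
# Hypothetical example entries; the real PYMUSAS_LANG_TO_SPACY dictionary
# in the code base may contain different mappings.
EXAMPLE_LANG_TO_SPACY = {
    "cmn": "zh",
    "zsm": "ms",
}


def to_spacy_language_code(language_code: str, mapping: dict = EXAMPLE_LANG_TO_SPACY) -> str:
    """Map a language code from language_resources.json to a spaCy language code."""
    try:
        return mapping[language_code]
    except KeyError:
        raise KeyError(
            f"No spaCy language code is mapped for {language_code!r}, "
            "add an entry for it to the mapping dictionary."
        ) from None
```

With the example mapping above, to_spacy_language_code("zsm") returns "ms", while a code with no entry raises a KeyError that names the missing language code.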