Pretrained Language Models (PLMs) have revolutionized NLP, but their linguistic underpinnings still raise several open questions. This thesis addresses some of these questions by investigating PLMs' ability to encode morphosyntactic information, focusing on tense and subject-verb agreement.
A novel probing method that leverages neural probes is developed to test the representations generated by three PLM architectures: BERT, RoBERTa, and Sentence Transformer. These PLMs are tested across three morphologically diverse languages: English, Italian, and German.
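At a high level, probing trains a small supervised classifier on frozen PLM representations and takes its accuracy as evidence of what the representations encode. As a rough, self-contained illustration of this general idea (not the thesis's exact method, data, or probe architecture; the synthetic "embeddings" and the plain logistic-regression probe below are stand-ins for illustration only), one can train a probe to recover a binary tense label:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for contextual sentence embeddings: two classes
# (e.g. past vs. present tense) whose vectors differ along a few dimensions.
dim, n = 32, 400
labels = rng.integers(0, 2, size=n)
embeddings = rng.normal(size=(n, dim))
embeddings[:, :4] += 2.0 * labels[:, None]  # inject a weak "tense" signal

def train_probe(X, y, lr=0.1, epochs=200):
    """Logistic-regression probe trained with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad = p - y                            # gradient of the log loss
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# Train on one half, evaluate on the held-out half.
split = n // 2
w, b = train_probe(embeddings[:split], labels[:split])
preds = (embeddings[split:] @ w + b) > 0
accuracy = (preds == labels[split:]).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

If the probe recovers the label well above chance, the representations plausibly encode the property; the actual experiments apply this logic to real PLM embeddings and morphosyntactic labels across the three languages.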
This repository hosts the code, data and results of my Master's thesis. For a more in-depth explanation of the research questions, data, methodology and findings, please refer to the thesis report.
This project uses Poetry, a dependency management and packaging tool for Python. To install Poetry, follow the steps described at https://python-poetry.org/docs/#installation. Additionally, depending on your GPU, you may need to adjust the following line in pyproject.toml to get the appropriate torch version for your setup:

```toml
torch = {file = "./torch-2.0.0+rocm5.4.2-cp310-cp310-linux_x86_64.whl"}
```

After installing Poetry and updating pyproject.toml, you can install the required dependencies and create a dedicated environment by running:
```shell
poetry install
```

The details for the tense and agreement experiments in each language are outlined in the respective JSON files:

- `it_experiments.json` for Italian
- `en_experiments.json` for English
- `de_experiments.json` for German
To run the experiments for a specific language, execute the following command:
```shell
python main.py [language_code]_experiments.json
```

Replace `[language_code]` with the desired language code (e.g., `it`, `en`, or `de`). This will execute `main.py` with the specified JSON file containing the experiment's configuration and data.