### **Thesis projects**
- :hourglass_flowing_sand: *Computational Dialectology.* Language use often differs by sociodemographic background; linguistic differences tied to the geographical origin of the speaker are traditionally studied in the field of dialectology. While qualitative studies of dialectal differences have yielded valuable insights into language variation, they often rely on labor-intensive data collection, annotation, and analysis. Computational approaches have therefore emerged as a promising route to the large-scale study of dialects. For students interested in this project, multiple directions are possible, including (but not limited to): (a) interpreting which features dialect models rely on for differentiation, (b) creating (parallel) resources for dialect continua, (c) developing new methods to quantify dialectal or sociolinguistic variation, and (d) adapting existing models to better accommodate dialect variation.
  **References:** [Bartelds & Wieling 2022](https://aclanthology.org/2022.naacl-main.273/), [Bafna et al. 2025](https://aclanthology.org/2025.acl-long.989/), [Shim et al. 2026](https://arxiv.org/pdf/2601.02906).
  **Level: BSc or MSc.**
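Direction (c) can be made concrete with a minimal sketch: one simple, illustrative way to quantify variation between two varieties is the mean normalized edit distance over aligned word pairs. The word pairs below are made-up placeholders, not real dialect data, and real work would use phonetically informed distances.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def dialect_distance(pairs):
    """Mean edit distance over aligned word pairs, each normalized
    by the length of the longer word (0 = identical, 1 = maximally different)."""
    return sum(levenshtein(a, b) / max(len(a), len(b)) for a, b in pairs) / len(pairs)

# Hypothetical aligned (standard form, dialect form) word pairs.
pairs = [("huis", "hoes"), ("klein", "kleen"), ("maken", "moaken")]
print(round(dialect_distance(pairs), 3))
```

A higher mean distance would suggest the two varieties diverge more in surface form; aggregating such scores per location is one starting point for dialectometric maps.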
- *Methods for mining low-resource parallel corpora.* Parallel corpora are critical for developing and evaluating dedicated machine translation systems, as well as general-purpose large language models capable of translation. One strategy for obtaining such corpora is to mine unstructured text collections (typically web crawls) for parallel sentences. Standard methods score candidate sentence pairs by the cosine similarity of their sentence embeddings, which requires strong sentence encoders; these are typically weaker for very low-resource languages, including language varieties such as dialects. Alternative strategies include bootstrapping, building classifiers, devising simple heuristics such as word-edit distance, or relying on metadata like HTML tags. Depending on the student's interest and academic level, this project can focus on specific directions such as evaluating the impact of different methods, methods for scoring candidate sentences, or strategies for obtaining candidate sentences.
  **References:** [Improving Parallel Sentence Mining for Low-Resource and Endangered Languages](https://aclanthology.org/2025.acl-short.17/), [Obtaining Parallel Sentences in Low-Resource Language Pairs with Minimal Supervision](https://pmc.ncbi.nlm.nih.gov/articles/PMC9365574/).
  **Level: BSc or MSc.**
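The embedding-based scoring idea can be illustrated with a toy sketch: score each candidate source/target pair by cosine similarity and keep pairs above a threshold. The three-dimensional "embeddings" below are hand-made stand-ins for real multilingual sentence-encoder outputs, and the 0.9 threshold is an arbitrary illustrative choice.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def mine_pairs(src, tgt, threshold=0.9):
    """Greedily pair each source sentence with its best-scoring target,
    keeping only pairs whose similarity clears the threshold."""
    mined = []
    for s_id, s_vec in src.items():
        t_id, score = max(((t, cosine(s_vec, v)) for t, v in tgt.items()),
                          key=lambda x: x[1])
        if score >= threshold:
            mined.append((s_id, t_id, round(score, 3)))
    return mined

# Hypothetical sentence IDs mapped to toy embedding vectors.
src = {"s1": [0.9, 0.1, 0.0], "s2": [0.1, 0.8, 0.2]}
tgt = {"t1": [0.85, 0.15, 0.0], "t2": [0.0, 0.1, 0.9]}
print(mine_pairs(src, tgt))
```

In a real pipeline, margin-based scoring (relative to each sentence's nearest neighbors) is typically preferred over raw cosine, and for weak encoders the heuristics mentioned above could replace or filter these scores.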
- :hourglass_flowing_sand: *Synthetic language variation for robust NLP.* Robust NLP requires models that can process human language variation, such as dialects and other language varieties. These varieties typically show high variation in their orthography, lexicon, and syntax, each of which presents challenges for NLP. They are also typically low-resourced, so building NLP models for them relies heavily on transfer from standard language data. One strategy for improving robustness to linguistic variation is to introduce synthetic variation, ranging from naive character perturbation, which induces more varied tokenization of standard training data, to targeted de-standardization of training data.
  **References:** [Improving Zero-Shot Cross-lingual Transfer Between Closely Related Languages by Injecting Character-Level Noise](https://aclanthology.org/2022.findings-acl.321/), [Text Normalization for Luxembourgish Using Real-Life Variation Data](https://aclanthology.org/2025.vardial-1.9/).
  **Level: BSc or MSc** (scope adjusted by languages covered and complexity of the approaches).
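The naive character perturbation mentioned above can be sketched in a few lines: randomly delete, duplicate, or swap characters to simulate orthographic variation in otherwise standard training data. The choice of operations and the perturbation rate below are illustrative assumptions, not a method taken from the references.

```python
import random

def perturb(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Apply random character-level noise: with probability `rate`,
    a character is deleted, duplicated, or swapped with its predecessor."""
    rng = random.Random(seed)  # fixed seed for reproducible noise
    out = []
    for ch in text:
        r = rng.random()
        if r < rate / 3:
            continue                   # delete the character
        elif r < 2 * rate / 3:
            out.append(ch * 2)         # duplicate it
        elif r < rate and out:
            out[-1], ch = ch, out[-1]  # swap with the previous character
            out.append(ch)
        else:
            out.append(ch)             # keep it unchanged
    return "".join(out)

print(perturb("the quick brown fox jumps over the lazy dog", rate=0.15))
```

Training on a mix of clean and perturbed text in this style is one way to expose a model to more varied tokenizations; targeted de-standardization would instead apply linguistically informed rewrite rules.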