Commit 32da09c: Update thesis.md (1 parent 5b2ff06)

1 file changed: _pages/thesis.md (+2, -2 lines)

@@ -86,15 +86,15 @@ This research vector covers methods and resources for processing dialectal, low-

### **Thesis projects**

- :hourglass_flowing_sand: *Computational Dialectology.* Language use often varies with speakers' sociodemographic background; linguistic differences tied to geographical origin are traditionally studied in the field of dialectology. While qualitative studies of dialectal differences have yielded valuable insights into language variation, they typically rely on labor-intensive data collection, annotation, and analysis. Computational approaches to dialect differences have therefore emerged as a promising method for the large-scale study of dialects. For students interested in this project, multiple directions are possible, including (but not limited to): (a) interpreting which features dialect models rely on for differentiation, (b) creating (parallel) resources for dialect continua, (c) developing new methods to quantify dialectal or sociolinguistic variation, and (d) adapting existing models to better accommodate dialect variation.
**References:** [Bartelds & Wieling 2022](https://aclanthology.org/2022.naacl-main.273/), [Bafna et al. 2025](https://aclanthology.org/2025.acl-long.989/), [Shim et al. 2026](https://arxiv.org/pdf/2601.02906).
**Level: BSc or MSc.**

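As a concrete example of direction (c), one classic dialectometric measure is the mean length-normalized Levenshtein distance between aligned word forms from two varieties. A minimal sketch in plain Python; the word pairs below are illustrative examples, not real survey data:

```python
# One classic dialectometric measure: mean length-normalized Levenshtein
# distance over aligned word forms from two varieties.

def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def dialect_distance(forms_a, forms_b):
    """Mean edit distance over aligned cognate pairs, normalized by the
    length of the longer form so each pair contributes a value in [0, 1]."""
    dists = [levenshtein(x, y) / max(len(x), len(y), 1)
             for x, y in zip(forms_a, forms_b)]
    return sum(dists) / len(dists)

# Illustrative English/Dutch-like word pairs (hypothetical data).
standard = ["house", "mouse", "brown"]
variety = ["huis", "muis", "bruin"]
print(round(dialect_distance(standard, variety), 3))  # → 0.533
```

In practice, dialectometric studies compute such distances over large aligned word or pronunciation lists per site and feed the resulting distance matrix into clustering or multidimensional scaling.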
- *Methods for mining low-resource parallel corpora.* Parallel corpora are critical for developing and evaluating dedicated machine translation systems, as well as general-purpose large language models capable of translation. One strategy for obtaining such corpora is to mine unstructured text corpora (typically web crawls) for parallel sentences. However, standard methods score candidate sentence pairs by the cosine similarity of their sentence embeddings, which requires strong sentence encoders; such encoders are typically weaker for very low-resource languages, including language varieties such as dialects. Alternative strategies include bootstrapping, training classifiers, simple heuristics such as word-edit distance, or relying on metadata such as HTML tags. Depending on the student's interest and academic level, this project can focus to a greater or lesser extent on specific directions such as evaluating the impact of different methods, scoring candidate sentences, or obtaining candidate sentences.
**References:** [Improving Parallel Sentence Mining for Low-Resource and Endangered Languages](https://aclanthology.org/2025.acl-short.17/), [Obtaining Parallel Sentences in Low-Resource Language Pairs with Minimal Supervision](https://pmc.ncbi.nlm.nih.gov/articles/PMC9365574/).
**Level: BSc or MSc.**

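The embedding-based scoring described above can be sketched as follows: rate each candidate pair by the cosine similarity of its sentence embeddings, then apply a ratio-margin criterion (dividing by the average similarity to each side's nearest neighbours) to penalize "hub" sentences that are close to everything. The toy vectors below stand in for the output of a multilingual sentence encoder; a real system would mine millions of candidates with approximate nearest-neighbour search.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def margin_score(i, j, src, tgt, k=2):
    """Cosine of pair (i, j) divided by the mean similarity of each side
    to its k nearest neighbours on the other side (ratio margin)."""
    nn_tgt = sorted((cosine(src[i], t) for t in tgt), reverse=True)[:k]
    nn_src = sorted((cosine(s, tgt[j]) for s in src), reverse=True)[:k]
    denom = (sum(nn_tgt) + sum(nn_src)) / (2 * k)
    return cosine(src[i], tgt[j]) / denom

def mine_pairs(src, tgt, threshold=1.0):
    """For each source embedding, keep its best-scoring target if the
    margin score clears the threshold."""
    pairs = []
    for i in range(len(src)):
        best_j, best = max(
            ((j, margin_score(i, j, src, tgt)) for j in range(len(tgt))),
            key=lambda p: p[1])
        if best >= threshold:
            pairs.append((i, best_j))
    return pairs

# Toy embeddings standing in for multilingual encoder outputs.
src = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1]]
tgt = [[0.9, 0.1, 0.1], [0.1, 0.9, 0.1], [0.5, 0.5, 0.5]]
print(mine_pairs(src, tgt))  # → [(0, 0), (1, 1)]
```

The margin normalization is what makes embedding-based mining workable in practice: raw cosine thresholds are hard to calibrate across languages, while the ratio to local neighbourhood similarity is more stable.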
- :hourglass_flowing_sand: *Synthetic language variation for robust NLP.* Robust NLP requires models that can process human language variation, such as dialects and other language varieties. These varieties typically show high variation in orthography, lexicon, and syntax, each of which presents challenges to NLP. Furthermore, they are typically low-resourced, so building NLP models for them relies heavily on transfer from standard language data. One strategy for improving robustness to linguistic variation is to introduce synthetic variation. This can range from naive perturbation of characters, which induces more varied tokenization of standard training data, to targeted de-standardization of training data.
**References:** [Improving Zero-Shot Cross-lingual Transfer Between Closely Related Languages by Injecting Character-Level Noise](https://aclanthology.org/2022.findings-acl.321/), [Text Normalization for Luxembourgish Using Real-Life Variation Data](https://aclanthology.org/2025.vardial-1.9/).
**Level: BSc or MSc** (scope adjusted by the languages covered and the complexity of the approaches).

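The naive character-perturbation end of this spectrum can be sketched as a simple noise function. The specific operations (delete, duplicate, swap) and the noise rate below are illustrative choices, not those of the cited papers:

```python
import random

def perturb(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Randomly delete, duplicate, or swap alphabetic characters at the
    given per-character rate, keeping whitespace and punctuation intact."""
    rng = random.Random(seed)  # fixed seed for reproducible noise
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["delete", "duplicate", "swap"])
            if op == "delete":
                i += 1
                continue
            if op == "duplicate":
                out.extend([c, c])
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], c])
                i += 2
                continue
        out.append(c)
        i += 1
    return "".join(out)

print(perturb("the quick brown fox jumps over the lazy dog", rate=0.2))
```

Perturbed copies of standard-language training data can then be mixed into pretraining or fine-tuning; the targeted de-standardization mentioned above would instead replace this random noise with rules or examples derived from real variation data.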