`_pages/thesis.md` (3 additions, 3 deletions)
- *Climate Change Insights through NLP.* Climate change is a pressing global issue that is receiving increasing attention. It influences regulations and decision-making in many parts of society, such as politics, agriculture, and business, and it is discussed extensively on social media. For students interested in real-world societal applications, this project aims to contribute insights into the discussion surrounding climate change. Example project: analyzing social media data. The data will have to be collected (potentially from existing sources), cleaned, and analyzed using NLP techniques to examine aspects or features of interest such as stance, sentiment, the extraction of key players, etc. **References:** [Luo et al., 2020](https://aclanthology.org/2020.findings-emnlp.296v2.pdf), [Stede & Patz, 2021](https://aclanthology.org/2021.nlp4posimpact-1.2.pdf), [Vaid et al., 2022](https://aclanthology.org/2022.acl-srw.35.pdf).
**Level: BSc or MSc.**
- :hourglass_flowing_sand: *Multi-agent debate for summarization or simplification.* Automatic summarization (or simplification) is often performed in an end-to-end manner, using a single model (e.g., an LLM). Recent work on multi-agent systems suggests that “interaction” between LLMs can improve reasoning and reduce errors. This project explores whether multi-agent debate can improve summarization (or simplification) quality by having agents summarize the same document, critique each other's summaries, and provide a final version. The project might investigate, e.g., the effect of prompting different agents with different priorities (factuality, conciseness, etc.), whether debate improves performance compared to single-pass summaries, and/or whether certain aggregation strategies (rounds of critique, voting, consensus-building) outperform others.
**References:** [Du et al., ICLR 2024](https://openreview.net/forum?id=QAwaaLJNCk); [Koupaee et al., NAACL 2025](https://aclanthology.org/2025.naacl-long.609.pdf); [Wan et al., NAACL 2025](https://aclanthology.org/2025.naacl-long.498/)
**Level: MSc** (preferred); adaptation to BSc is also possible.
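The draft-critique-revise loop described above could be sketched as follows. This is an illustrative skeleton only: the prompts, the `agents` abstraction (any prompt-to-text callable standing in for an LLM API wrapper), and the take-the-first-draft aggregation are all made up for the example, not a prescribed design.

```python
from typing import Callable, List

def debate_summarize(document: str,
                     agents: List[Callable[[str], str]],
                     rounds: int = 2) -> str:
    """Run a toy multi-agent debate over a document and return a summary.

    Each agent is any prompt -> text callable (e.g., a wrapper around an
    LLM API). Agents first draft summaries independently, then repeatedly
    read each other's drafts and revise their own.
    """
    # Round 0: independent drafts.
    drafts = [agent(f"Summarize:\n{document}") for agent in agents]
    for _ in range(rounds):
        revised = []
        for i, agent in enumerate(agents):
            others = "\n---\n".join(d for j, d in enumerate(drafts) if j != i)
            prompt = (
                f"Document:\n{document}\n\n"
                f"Other agents' summaries:\n{others}\n\n"
                f"Your current summary:\n{drafts[i]}\n\n"
                "Critique the other summaries and output an improved summary."
            )
            revised.append(agent(prompt))
        drafts = revised
    # Naive aggregation: take the first agent's final draft; voting or
    # consensus-building would be alternative strategies to compare.
    return drafts[0]
```

The aggregation step is exactly where the project's research questions live: comparing this naive choice against voting or consensus rounds would be one concrete experiment.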
Other projects in summarization or simplification (e.g., resource building, multilinguality) are also possible depending on student interests.
- *Aggregating Individual Opinions through Discourse: Human vs. Chain-of-Thought Reasoning.* Public discourse often consists of multiple, partially conflicting individual opinions that may be synthesized into a coherent collective interpretation. This thesis investigates how humans and language models differ in aggregating such opinions, with a particular focus on the role of discourse structure and reasoning strategies. These opinion pieces could come from different news agencies, Wikipedia articles, political debates, discussion forums, etc. The student will examine whether humans and LLMs preserve disagreement (e.g., minority viewpoints) and how they build argumentative structure. The study is well suited for students interested in discourse analysis, the evaluation of language models, and the representation of multiple voices.
**Level: MSc.**
- :hourglass_flowing_sand: *Synthetic data for metric meta-evaluation.* Automatic metrics (including LLMs-as-judges) are typically meta-evaluated by measuring their correlation with human scores or preferences. However, this evaluation is often global (making it difficult to diagnose when and why a metric fails), and human judgments are expensive and complex to collect. An alternative is behavioural testing, e.g., checklist-like approaches ([Ribeiro et al., ACL 2020](https://aclanthology.org/2020.acl-main.442/)), where targeted perturbations are designed to probe sensitivity to specific phenomena. This approach uses challenge sets to better understand metric failure modes and to test for specific biases. The goal of this project is to explore methods for automatically generating synthetic test sets for metric meta-evaluation, to assess their validity and limitations, and to use them to systematically benchmark evaluation metrics (including LLM-based judges). Depending on student interests, the project may focus on multilingual settings, specific evaluation dimensions (e.g., factuality or societal bias), robustness to perturbations, or other task- or domain-specific aspects. References include: [Sai et al., EMNLP 2021](https://aclanthology.org/2021.emnlp-main.575.pdf), [Ye et al., ICLR 2025](https://iclr.cc/virtual/2025/poster/31088).
**Level: BSc or MSc.**
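As a toy illustration of the perturbation-based idea, a meta-evaluation pass could pair each faithful text with a synthetically corrupted version and count how often a metric ranks the pair correctly. The negation edit and the pass criterion below are hypothetical simplifications for the sketch, not a proposed test set:

```python
def perturb_negate(text: str) -> str:
    """Hypothetical factuality perturbation: insert a negation once.
    Real challenge sets would use validated, more varied edits."""
    return text.replace(" is ", " is not ", 1)

def meta_evaluate(metric, pairs):
    """Fraction of (faithful, perturbed) text pairs the metric ranks
    correctly, i.e., for which it scores the faithful text higher."""
    correct = sum(metric(good) > metric(bad) for good, bad in pairs)
    return correct / len(pairs)
```

Broken down per perturbation type, such accuracies localize failure modes in a way a single global correlation with human scores cannot.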
- *Pseudoword generation.* Pseudowords are words that look and sound like they could exist in a particular language, but don’t actually have any meaning. Pseudowords are frequently used in (psycho)linguistic studies to investigate how humans learn and process language (e.g., [lexical decision](https://en.wikipedia.org/wiki/Lexical_decision_task)). Often, pseudowords need to fulfill certain criteria, e.g., they should appear like a specific part of speech. However, it is quite difficult to come up with good pseudowords manually. Previous approaches have often been based on phonotactic templates or Markov chains. Neural networks also have the potential to work well. A student project could involve implementing and evaluating a new pseudoword generator for a less-studied language. Depending on interest and available resources, the approach could be rule-based, or based on statistical or neural models.
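A minimal sketch of the Markov-chain variant mentioned above, using character bigrams with boundary markers. A real generator would add phonotactic constraints, higher-order n-grams, and an evaluation against attested words:

```python
import random

def train_bigrams(words):
    """Build a character-bigram transition table from a word list.
    '^' and '$' mark word start and end."""
    table = {}
    for w in words:
        chars = ["^"] + list(w) + ["$"]
        for a, b in zip(chars, chars[1:]):
            table.setdefault(a, []).append(b)
    return table

def pseudoword(table, max_len=10, rng=random):
    """Sample a pseudoword by walking the transition table until the
    end marker is drawn or max_len characters are emitted."""
    out, cur = [], "^"
    while len(out) < max_len:
        cur = rng.choice(table[cur])
        if cur == "$":
            break
        out.append(cur)
    return "".join(out)
```

Because transitions are sampled by frequency, generated strings respect the bigram statistics of the training language, which is the core of the Markov-chain approach.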
- *Data Mining and LLM-as-a-Judge to better understand LLM behavior:* While the behavior of LLMs and their nuanced, complex output data is challenging to evaluate, data mining approaches can be leveraged to explain model behavior, to bring structure into evaluation, and to gain new insights, e.g., on cultural biases or task failure [1]. In this thesis project, we want to take this approach further by evaluating the use of newly proposed data mining algorithms and/or the combination of LLM-as-a-Judge with data mining processes. The project offers the possibility to work on a technical evaluation of methods as well as to develop and evaluate a new method. **References:** [1] [https://aclanthology.org/2025.acl-long.985/](https://aclanthology.org/2025.acl-long.985/)
**Level: MSc.**
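As a rough illustration of combining LLM-as-a-Judge labels with a data mining pass, one could mine attribute combinations that frequently co-occur with judged failures. The attribute strings, the `judge=fail` label, and the support threshold below are invented for the example; real projects would use established frequent-pattern algorithms:

```python
from collections import Counter
from itertools import combinations

def mine_failure_patterns(records, min_support=2):
    """Toy frequent-pattern pass over judged model outputs.

    Each record is a set of attribute strings describing one output,
    including a hypothetical judge verdict such as 'judge=fail'.
    Returns attribute pairs that co-occur with failures at least
    min_support times, i.e., candidate failure modes to inspect.
    """
    counts = Counter()
    for attrs in records:
        if "judge=fail" in attrs:
            for pair in combinations(sorted(attrs - {"judge=fail"}), 2):
                counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}
```

Mined patterns like this give structure to otherwise free-form judge outputs, which is the kind of bridge between judging and mining the project would investigate.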
- :hourglass_flowing_sand: *Understanding Post-Training Effects Through Model Behavior Analysis and Interpretability:* Post-training has become an essential technique for adapting pretrained language models, e.g., to improve instruction following [1] or abilities for underrepresented languages [2], or to align model behavior with safety standards [3]. Correctly adapting models through post-training is, however, a complex and difficult process that can, e.g., trigger broad misalignments and unexpected effects such as safety failures [4]. To better control post-training, it is crucial to better understand how models change during the process.
This thesis will study the effects of post-training through a dual lens. Using model behavior analysis tools like Spotlight [5], it will explore how a model changes with respect to non-performance metrics like gender [6] and cultural biases [7]. Using probing, the logit lens, or other interpretability techniques, it will then go one step further and also start explaining how these changes occur within the model. Depending on scope and resource availability, this thesis can either work with existing models (checkpoints) or post-train specific model aspects.
**References:**
[1] [Ouyang et al. (2022): Training language models to follow instructions with human feedback. arXiv 2203.02155.](https://arxiv.org/pdf/2203.02155)