Solutions and report for the International Computer Science Competition (Pre-Final Round 2025)
📄 View the main report (PDF)
📘 View the submission paper (PDF)
Problem C.1: Zipf’s Meaning-Frequency Law (8 Points)
This problem requires you to read the following recently published scientific article:
A New Formulation of Zipf’s Meaning-Frequency Law through Contextual Diversity by Ryo Nagata and Kumiko Tanaka-Ishii (2025) Link: https://aclanthology.org/2025.acl-long.744.pdf
Answer the following questions related to this article:
(a) What are the limitations of dictionary-based studies on measuring Zipf’s Meaning-Frequency Law?
(b) Explain the von Mises-Fisher distribution and how v = 1/κ measures contextual diversity.
(c) Why do the authors use the von Mises-Fisher distribution instead of simpler measures like average pairwise cosine similarity between word vectors?
(d) How do autoregressive models compare to masked language models for observing the Meaning-Frequency Law?
(e) How can the proposed method serve as a diagnostic tool for language models?
(f) What does the observation that meaning-frequency law breaks down for small models and out-of-domain data suggest?
(Bonus) What factors may lead more frequent words to have more meanings? What factors may lead to fewer meanings? Give examples of each.
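For intuition on part (b), here is a minimal sketch of how contextual diversity v = 1/κ can be computed from a word's context embeddings. It uses the standard Banerjee et al. approximation for the von Mises-Fisher concentration κ; the paper's exact estimator may differ, and the function names here are illustrative, not from the article.

```python
import numpy as np

def vmf_kappa(vectors):
    """Approximate the vMF concentration kappa of unit vectors using
    kappa ~ R_bar * (d - R_bar**2) / (1 - R_bar**2),
    where R_bar is the norm of the mean direction (Banerjee et al.)."""
    X = np.asarray(vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # project to unit sphere
    d = X.shape[1]
    r_bar = np.linalg.norm(X.mean(axis=0))
    return r_bar * (d - r_bar**2) / (1 - r_bar**2)

def contextual_diversity(vectors):
    """v = 1/kappa: large when a word's context vectors point in many
    directions (many senses), small when they cluster tightly (one sense)."""
    return 1.0 / vmf_kappa(vectors)

# Toy check: tightly clustered contexts should be less diverse than random ones.
rng = np.random.default_rng(0)
base = rng.normal(size=8)
tight = base + 0.05 * rng.normal(size=(50, 8))   # one dominant direction
spread = rng.normal(size=(50, 8))                # many directions
print(contextual_diversity(tight) < contextual_diversity(spread))  # True
```

The design choice worth noting for part (c): κ is a single parameter of a fitted directional distribution, so it summarizes the whole cloud of contexts rather than averaging over O(n²) pairwise similarities.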
Problem C.2: Self-Improvement Capabilities of LLMs (8 Points)
This problem requires you to read the following recently published scientific article:
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models by Y. Song, H. Zhang, C. Eisenach, S. M. Kakade, D. Foster, and U. Ghai (2025) Link: https://openreview.net/pdf?id=mtJSMcF3ek
Answer the following questions related to this article:
(a) Describe the term self-improvement using the authors’ framework. What key assumption are the authors making that allows for self-improvement?
(b) What is the generation-verification gap (GV-Gap)? Why is it a better metric than measuring performance differences after model updates?
(c) What is greedy decoding and why is self-improvement with greedy decoding impossible?
(d) Explain why the relative GV-Gap scales monotonically with pre-training FLOPs for certain verification methods but not others.
(e) Why do most models fail to self-improve on Sudoku puzzles despite the exponential computational complexity separation between generation and verification?
(f) Propose a task domain where you would expect self-improvement to improve performance and explain why.
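For part (c), the key property of greedy decoding is determinism: the model always emits its single highest-probability token, so repeated sampling produces one and only one candidate, leaving a verifier nothing to select between. A minimal sketch (the toy `logits_fn` is illustrative, not the paper's setup):

```python
import numpy as np

def greedy_decode(logits_fn, start, steps):
    """Greedy decoding: at each step append the argmax token.
    Because argmax is deterministic, every run from the same prefix
    yields the identical sequence -- there is no candidate pool for a
    verifier to filter, so the generation-verification gap is zero."""
    seq = list(start)
    for _ in range(steps):
        logits = logits_fn(seq)
        seq.append(int(np.argmax(logits)))
    return seq

# Toy next-token model over 5 tokens: the most likely successor of
# token t is (t + 1) mod 5.
toy_logits = lambda seq: np.eye(5)[(seq[-1] + 1) % 5]

print(greedy_decode(toy_logits, [0], 3))  # [0, 1, 2, 3]
print(greedy_decode(toy_logits, [0], 3) == greedy_decode(toy_logits, [0], 3))  # True
```

Contrast this with temperature sampling, where multiple distinct candidates exist and verifier-based filtering can in principle outperform the raw generator.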