Talk2Me_Hackathon_2025/rag_documents.json at main · MarcAmil30/Talk2Me_Hackathon_2025 · GitHub

[
  {
    "text": "PutnamBench: A Multilingual Competition-Mathematics Benchmark for Formal Theorem-Proving\n\nWe present PutnamBench, a benchmark of 1337 formalizations of Putnam competition problems in Lean 4, Isabelle, and Coq.\n\nWe present PutnamBench, a new multilingual evaluation benchmark for formal theorem-proving. PutnamBench consists of formalizations of problems sourced from the William Lowell Putnam Mathematical Competition, the premier undergraduate-level mathematics competition in North America. All the problem statements come with formalizations in Lean 4 and Isabelle; a substantial subset have Coq formalizations as well. PutnamBench consists of 1337 hand-written formalizations across the three proof assistants, and aims to benchmark the next generation of theorem-proving algorithms for competition mathematics.  Proving the theorems requires significant problem-solving ability and proficiency in a broad range of topics taught in undergraduate mathematics courses. We evaluate several established neural and symbolic theorem provers using PutnamBench. These approaches can only solve a handful of the problems, establishing our benchmark as a difficult open challenge for research on formal theorem-proving. PutnamBench is available at https://github.com/trishullab/PUTNAM.",
    "metadata": {
      "title": "PutnamBench: A Multilingual Competition-Mathematics Benchmark for Formal Theorem-Proving",
      "authors": [
        "George Tsoukalas",
        "Jasper Lee",
        "John Jennings",
        "Jimmy Xin",
        "Michelle Ding",
        "Michael Jennings",
        "Amitayush Thakur",
        "Swarat Chaudhuri"
      ],
      "url": "https://openreview.net/pdf/f3aad42fb7a93a0a80417c4bf1486c090f4651a0.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "Learning to Reason by Failing: Offline RL on Sub-optimal Rollouts Scales Synthetic Data by 8x\n\n\n\nTraining on model-generated synthetic data is a promising approach for finetuning LLMs, but it remains unclear when it helps or hurts. In this paper, we investigate this for reasoning problems via an empirical study, followed by a theoretical formalization. First, we find that while the typical approach of finetuning a model on synthetic correct or *positive* problem-solution pairs generated by capable models offers modest performance gains, sampling more correct solutions from the finetuned learner **doubles** the sample efficiency of synthetic data. At the same time, training on model-generated positives can amplify spurious correlations, resulting in flat or even inverse scaling trends as the amount of data increases. Surprisingly, we find that several of these issues can be addressed if we also utilize *negative* responses, i.e., model-generated responses that are deemed incorrect via final answer checking. Crucially, these negatives must be constructed such that the training can appropriately recover the utility or credit of each intermediate step in the negative response. With this *per-step* scheme, we are able to attain consistent gains over only positive data, attaining performance similar to amplifying the amount of synthetic data by **8x**. We show that training on per-step negatives can help to unlearn spurious correlations in the positive data, and is equivalent to advantage-weighted reinforcement learning (RL), implying that it inherits benefits of RL over imitating positive data alone.",
    "metadata": {
      "title": "Learning to Reason by Failing: Offline RL on Sub-optimal Rollouts Scales Synthetic Data by 8x",
      "authors": [
        "Amrith Setlur",
        "Saurabh Garg",
        "Xinyang Geng",
        "Naman Garg",
        "Virginia Smith",
        "Aviral Kumar"
      ],
      "url": "https://openreview.net/pdf/aa31d8eb1de7fb2f55359bbe2174056de902fe7d.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "Lean4trace: Data augmentation for neural theorem proving in Lean\n\n\n\nIntegrating large language models as proof assistants with theorem provers has shown great promise. However, one of the major challenges in this field is the scarcity of training data. To address this, we release a new open-source tool, *Lean4trace*, for training data extraction from Lean 4 sources. Unlike previous approaches, *Lean4trace* is deeply integrated into the Lean elaborator, allowing us to modify proofs on-the-fly. Leveraging this feature, we propose two methods of data augmentation in Lean: (1) decomposing composite proof steps into multiple simpler steps;  (2) testing existing proof automation tactics at each proof state and collecting the successful ones. Models trained on this augmented data are capable of proving 58.0% of theorems from a hold-out subset of Mathlib and 35.6% of the test subset of the MiniF2F benchmark.",
    "metadata": {
      "title": "Lean4trace: Data augmentation for neural theorem proving in Lean",
      "authors": [
        "Vasilii Nesterov",
        "Yermek Kapushev",
        "Mikhail Burtsev"
      ],
      "url": "https://openreview.net/pdf/7bc6e168a9fe100095a13b5221a503d8d173f3ab.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "Efficient Linear System Solver with Transformers\n\nA novel, efficient transformer-based approach to solving small linear systems.\n\nThis paper investigates the potential of linear Transformers as solvers for systems of linear equations. We propose a novel approach where the Transformer encodes each equation as a separate token, allowing the model to process the system in a permutation-invariant manner. To enhance generalizability and reduce the parameter count, we introduce a block-wise re-parameterization technique for the attention weight matrices. This technique decouples the problem dimension from the model's parameter count, enabling the Transformer to effectively handle systems of varying sizes. Our experiments demonstrate the Transformer's competitive performance compared to established classical methods such as Conjugate Gradient, especially for systems with smaller sizes. We further explore the model's ability to extrapolate to larger systems, providing evidence for its potential as a versatile and efficient solver for linear equations.",
    "metadata": {
      "title": "Efficient Linear System Solver with Transformers",
      "authors": [
        "Max Vladymyrov",
        "Johannes von Oswald",
        "Nolan Andrew Miller",
        "Mark Sandler"
      ],
      "url": "https://openreview.net/pdf/70ff41de083363020856e5381537edb482990dd0.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "Large Language Models Can Self-Correct with Minimal Effort\n\nUnleash inherent capabilities of LLMs to detect and rectify incorrect responses without external feedback.\n\nIntrinsic self-correct was a method that instructed large language models (LLMs) to verify and correct their responses without external feedback. Unfortunately, the study concluded that the LLMs could not self-correct reasoning yet. We find that a simple yet effective verification method can unleash inherent capabilities of the LLMs. That is to mask a key condition in the question, add the current response to construct a verification question, and predict the condition to verify the response. The condition can be an entity in an open-domain question or a numeric value in a math question, which requires minimal effort (via prompting) to identify. We propose an iterative verify-then-correct framework to progressively identify and correct (probably) false responses, named ProCo. We conduct experiments on three reasoning tasks. On average, ProCo, with GPT-3.5-Turbo as the backend LLM, yields $+6.8$ exact match on four open-domain question answering datasets, $+14.1$ accuracy on three arithmetic reasoning datasets, and $+9.6$ accuracy on a commonsense reasoning dataset, compared to Self-Correct.",
    "metadata": {
      "title": "Large Language Models Can Self-Correct with Minimal Effort",
      "authors": [
        "Zhenyu Wu",
        "Qingkai Zeng",
        "Zhihan Zhang",
        "Zhaoxuan Tan",
        "Chao Shen",
        "Meng Jiang"
      ],
      "url": "https://openreview.net/pdf/0afed850bcc759f25fe19106cf23af0f53a7d57d.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "Teaching Large Language Models to Reason with Reinforcement Learning\n\nWe compare the performance of multiple RL algorithms across multiple setups for improving LLM reasoning\n\nReinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (PPO), Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate both sparse and dense rewards provided to the LLM both heuristically and via a learned reward model. We additionally start from multiple initializations with and without supervised fine-tuning (SFT) data. Overall, we find models fine-tuned with Expert Iteration to consistently achieve the highest task accuracy with PPO and RCRL close behind. Surprisingly, the sample complexity of Expert Iteration is similar to that of PPO, requiring at most on the order of $10^6$ samples to converge from a pretrained checkpoint. We investigate why this is the case, concluding that during RL training models fail to explore significantly beyond solutions already produced by SFT models. Additionally, we discuss a trade-off between maj@1 and pass@96 metric performance during SFT training and how conversely RL training improves both simultaneously. We then conclude by discussing the implications of our findings for RLHF and the future role of RL in LLM fine-tuning.",
    "metadata": {
      "title": "Teaching Large Language Models to Reason with Reinforcement Learning",
      "authors": [
        "Alexander Havrilla",
        "Yuqing Du",
        "Sharath Chandra Raparthy",
        "Christoforos Nalmpantis",
        "Jane Dwivedi-Yu",
        "Eric Hambro",
        "Sainbayar Sukhbaatar",
        "Roberta Raileanu"
      ],
      "url": "https://openreview.net/pdf/8b3bf32d71f7ec91ebdcfd798094623345755e15.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "AI for an inverse problem: Physical model solving quantum gravity\n\n\n\nMathematical inverse problems of determining a governing differential equation for given solution data remain a fundamental challenge. To find a working example of AI for math, we provide a concrete example using a physical setup of a quantum gravity problem. We present a novel sparse Neural Network (NN) model which is interpretable, to solve the inverse problem: the AdS/CFT correspondence. According to the conjectured correspondence, a special condensed matter system on a ring is equivalent to a gravity system on a bulk disk. The inverse problem is to reconstruct the higher-dimensional gravity metric from the data of the condensed matter system. We use the response functions of a condensed matter system as our data, and by supervised machine learning, we successfully train the neural network which is equivalent to a scalar field equation on an emergent geometry of the bulk spacetime. The developed method may work as a ground for generic bulk reconstruction, i.e. a solution to the inverse problem of the AdS/CFT correspondence. From a technical perspective, to achieve better numerical control, our neural network model incorporates a novel layer that implements the Runge-Kutta method.",
    "metadata": {
      "title": "AI for an inverse problem: Physical model solving quantum gravity",
      "authors": [
        "Koji Hashimoto",
        "Koshiro Matsuo",
        "Masaki Murata",
        "Gakuto Ogiwara",
        "Daichi Takeda"
      ],
      "url": "https://openreview.net/pdf/1843f03c96256409a143dfdb23c8318c0813895d.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "Specify What? A Case-Study using GPT-4 and Formal Methods For Specification Synthesis\n\nWe study how including the output of formal methods tools in LLM prompts can affect specification synthesis.\n\nFormal specifications are supposed to unambiguously describe the behaviour of (parts of) programs and are usually provided as extra annotations of the program code. The intention is both to document the code and to be able to automatically check compliance of programs using formal methods tools. Writing good specifications can however be both difficult and time-consuming for the programmer. In this case-study, we investigate how GPT-4 can help with the task. We propose a neuro-symbolic integration, by which we augment the LLM prompts with outputs from two formal methods tools in the Frama-C ecosystem (Pathcrawler and EVA), and produce C program annotations in the specification language ACSL. We demonstrate how this impacts the quality of annotations: information about input/output examples from Pathcrawler produces more context-aware annotations, while the inclusion of EVA reports yields annotations more attuned to runtime errors.",
    "metadata": {
      "title": "Specify What? A Case-Study using GPT-4 and Formal Methods For Specification Synthesis",
      "authors": [
        "George Granberry",
        "Wolfgang Ahrendt",
        "Moa Johansson"
      ],
      "url": "https://openreview.net/pdf/92d4eaa2df1188934d60f18bffc337a55e590e99.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation\n\nWe introduce DACO, a dataset for data analysis, containing (1) 440 databases of tabular data, (2) 2k automatically generated query-answer pairs that can serve as weak supervision for model training, and (3) a manually refined test set for evaluation.\n\nData analysis is a crucial analytical process to generate in-depth studies and conclusive insights to comprehensively answer a given user query for tabular data. In this work, we aim to propose new resources and benchmarks to inspire future research on this crucial yet challenging and under-explored task. However, collecting data analysis annotations curated by experts can be prohibitively expensive. We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs with a multi-turn prompting technique. We construct the **DACO dataset**, containing (1) 440 databases (of tabular data) collected from real-world scenarios, (2) ~2k query-answer pairs that can serve as weak supervision for model training, and (3) a concentrated but high-quality test set with human-refined annotations that serves as our main evaluation benchmark. We train a 6B supervised fine-tuning (SFT) model on the DACO dataset, and find that the SFT model learns reasonable data analysis capabilities. To further align the models with human preference, we use reinforcement learning to encourage generating analysis perceived by humans as helpful, and design a set of dense rewards to propagate the sparse human preference reward to intermediate code generation steps. Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than the SFT model in 57.72% of cases, validating the effectiveness of our proposed algorithm.",
    "metadata": {
      "title": "DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation",
      "authors": [
        "Xueqing Wu",
        "Rui Zheng",
        "Jingzhen Sha",
        "Te-Lin Wu",
        "Hanyu Zhou",
        "Tang Mohan",
        "Kai-Wei Chang",
        "Nanyun Peng",
        "Haoran Huang"
      ],
      "url": "https://openreview.net/pdf/46caee17a170a1a8ce500d42f3be4659ebc8f34b.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "Distilling LLMs\u2019 Decomposition Abilities into Compact Language Models\n\nIn this work we develop an AI-generated benchmark for the distillation of LLMs' decomposition abilities into smaller models and provide multiple baselines\n\nLarge Language Models (LLMs) have demonstrated proficiency in their reasoning abilities, yet their large size presents scalability challenges and limits any further customization. In contrast, compact models offer customized training but often fall short in solving complex reasoning tasks. This study focuses on distilling the LLMs' decomposition skills into compact models using offline reinforcement learning. We leverage the advancements in the LLMs' capabilities to provide feedback and generate a specialized task-specific dataset for training compact models. The development of an AI-generated dataset and the establishment of baselines constitute the primary contributions of our work, underscoring the potential of compact models in replicating complex problem-solving skills.",
    "metadata": {
      "title": "Distilling LLMs\u2019 Decomposition Abilities into Compact Language Models",
      "authors": [
        "Denis Tarasov",
        "Kumar Shridhar"
      ],
      "url": "https://openreview.net/pdf/b5ceba0148c6dcdfa3dd15ada8c3e29bc88ec2c8.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "Progressive-Hint Prompting Improves Reasoning in Large Language Models\n\nWe propose a new prompting strategy, Progressive-Hint Prompting, that can be easily combined with Chain-Of-Thought and Self-Consistency to improve performance.\n\nThe performance of Large Language Models (LLMs) in reasoning tasks depends heavily on prompt design, with Chain-of-Thought (CoT) and self-consistency being critical methods that enhance this ability. However, these methods do not fully exploit the answers generated by the LLM to guide subsequent responses. This paper proposes a new prompting method, named Progressive-Hint Prompting (PHP), that enables automatic multiple interactions between users and LLMs by using previously generated answers as hints to progressively guide toward the correct answers. PHP is orthogonal to CoT and self-consistency, making it easy to combine with state-of-the-art techniques to further improve performance. We conducted extensive and comprehensive experiments on seven benchmarks. The results show that PHP significantly improves accuracy while remaining highly efficient. For instance, with text-davinci-003, we observed a 4.2% improvement on GSM8K with greedy decoding compared to Complex CoT, and a 46.17% reduction in sample paths with self-consistency. With GPT-4 and PHP, we achieve state-of-the-art performances on SVAMP (89.1% \u2192 91.9%), GSM8K (92% \u2192 95.5%), AQuA (76.4% \u2192 79.9%) and MATH (50.3% \u2192 53.9%).",
    "metadata": {
      "title": "Progressive-Hint Prompting Improves Reasoning in Large Language Models",
      "authors": [
        "Chuanyang Zheng",
        "Zhengying Liu",
        "Enze Xie",
        "Zhenguo Li",
        "Yu Li"
      ],
      "url": "https://openreview.net/pdf/e6106df113d9ae7f3ed2d023ebc8c8f9d8c3a3f5.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "VerityMath: Advancing Mathematical Reasoning by Self-Verification Through Unit Consistency\n\nStudy evaluates 7B LLaMA, Code Llama and Mistral on math word problems, introduces \"Unit Consistency Programs\" for multi-unit challenges, and reports initial results with the enhanced \"VerityMath\" model.\n\nLarge Language Models (LLMs), combined with program-based solving techniques, are increasingly demonstrating proficiency in mathematical reasoning. For example, closed-source models such as OpenAI GPT-4 and Claude show excellent results in solving math word problems. However, progress in math word problem-solving for open-source LLMs is limited, and the challenges these models face are not well-studied. In this paper, we study the performance of strong open-source LLMs, including Llama 2 (7B), Code Llama (7B), and Mistral (7B) on math word problems using program-based solving techniques. Specifically, we analyze the outputs of these models when applied to math word problems and identify a category of problems that pose a significant challenge, particularly those involving quantities spanning multiple units. To address this issue, we propose a systematic approach by defining the units for each quantity and ensuring the consistency of these units during mathematical operations. We developed Unit Consistency Programs (UCPs), an annotated dataset of math word problems, each paired with programs containing unit specifications and unit verification routines. We fine-tuned Llama 2 (7B), Code Llama (7B), and Mistral (7B) models with UCPs to produce their VerityMath variants. Our findings indicate that our approach, which incorporates unit consistency, currently slightly underperforms compared to an approach that does not. To understand the reasons behind this, we conducted an in-depth error analysis and suggested options for future improvements.",
    "metadata": {
      "title": "VerityMath: Advancing Mathematical Reasoning by Self-Verification Through Unit Consistency",
      "authors": [
        "Vernon Toh Yan Han",
        "Ratish Puduppully",
        "Nancy F. Chen"
      ],
      "url": "https://openreview.net/pdf/2e116b08e4c0d273b422594f60a3c16662b3d034.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "Smart Vision-Language Reasoners\n\nDeep learning innovations led to improvement in reasoning ability of vision-language models.\n\nIn this article, we investigate vision-language models (VLM) as reasoners. The ability to form abstractions underlies mathematical reasoning, problem-solving, and other Math AI tasks. Several formalisms have been given to these underlying abstractions and skills utilized by humans and intelligent systems for reasoning. Furthermore, human reasoning is inherently multimodal, and as such, we focus our investigations on multimodal AI. In this article, we employ the abstractions given in the SMART task (Simple Multimodal Algorithmic Reasoning Task) introduced in (Cherian et al., 2022) as meta-reasoning and problem-solving skills along eight axes: math, counting, path, measure, logic, spatial, and pattern. We investigate the ability of vision-language models to reason along these axes and seek avenues of improvement. Including composite representations with vision-language cross-attention enabled learning multimodal representations adaptively from fused frozen pretrained backbones for better visual grounding. Furthermore, proper hyperparameter and other training choices led to strong improvements (up to 48% gain in accuracy) on the SMART task, further underscoring the power of deep multimodal learning. The smartest VLM, which includes a novel QF multimodal layer, improves upon the best previous baselines in every one of the eight fundamental reasoning skills. End-to-end code is available at https://github.com/D-Roberts/smarter.",
    "metadata": {
      "title": "Smart Vision-Language Reasoners",
      "authors": [
        "Denisa Roberts",
        "Lucas Roberts"
      ],
      "url": "https://openreview.net/pdf/e71382c2b2bdd9fa5eaeae740459879c5613a036.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "Progress or Regress? Self-Improvement Reversal in Post-training\n\nA comprehensive evaluative framework to scrutinize the underlying mechanisms and outcomes of post-training self-improvement.\n\nSelf-improvement through post-training methods such as iterative preference learning has been acclaimed for enhancing the problem-solving capabilities (e.g., mathematical reasoning) of Large Language Models (LLMs) without human intervention. However, as exploration deepens, it becomes crucial to assess whether these improvements genuinely signify progress in solving more challenging problems or if they could lead to unintended regressions. To address this, we propose a comprehensive evaluative framework that goes beyond the superficial pass@1 metric to scrutinize the underlying enhancements of post-training paradigms for self-improvement. Through rigorous experimentation and analysis across diverse problem-solving tasks, the empirical results point out the phenomenon of *self-improvement reversal*, where models showing improved performance across benchmarks will paradoxically exhibit declines in broader, essential capabilities, like output diversity and out-of-distribution (OOD) generalization. These findings indicate that current self-improvement practices through post-training are inadequate for equipping models to tackle more complex problems. Furthermore, they underscore the necessity of our critical evaluation metrics in discerning the *progress or regress* dichotomy for self-improving LLMs.",
    "metadata": {
      "title": "Progress or Regress? Self-Improvement Reversal in Post-training",
      "authors": [
        "Ting Wu",
        "Xuefeng Li",
        "Pengfei Liu"
      ],
      "url": "https://openreview.net/pdf/2fb1e8e4e7c044ab018b903eb62e5ffdb089546a.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "Pre-Calc: Learning to Use the Calculator Improves Numeracy in Language Models\n\nWe propose Pre-Calc, our method to teach smaller language models how to use calculators. By pre-finetuning BERT, RoBERTa, and Flan-T5 on calculator use tasks, we improved these models' performance on tasks requiring numerical understanding.\n\nQuantitative and numerical comprehension in language is an important task in many fields like education and finance, but still remains a challenging task for language models. While tool and calculator usage has been shown to help improve mathematical reasoning in large pretrained decoder-only language models, this remains unexplored for smaller language models with encoders. In this paper, we propose Pre-Calc, a simple pre-finetuning objective of learning to use the calculator for both encoder-only and encoder-decoder architectures, formulated as a discriminative and generative task respectively. We pre-train BERT and RoBERTa for discriminative calculator use and Flan-T5 for generative calculator use on the MAWPS, SVAMP, and AsDiv-A datasets, which improves performance on downstream tasks that require numerical understanding.",
    "metadata": {
      "title": "Pre-Calc: Learning to Use the Calculator Improves Numeracy in Language Models",
      "authors": [
        "Vishruth Veerendranath",
        "Vishwa Shah",
        "Kshitish Ghate"
      ],
      "url": "https://openreview.net/pdf/4c53285eb3ec1484b1b90f4da18df757f653be53.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "Learning Efficient Recursive Numeral Systems via Reinforcement Learning\n\nHow did recursive numeral systems emerge? We take steps towards a mechanistic explanation via a Reinforcement Learning approach which optimizes a lexicon under a given meta-grammar.\n\nThe emergence of mathematical concepts, such as number systems, is an understudied area in AI for mathematics and reasoning. It has previously been shown (Carlsson et al., 2021) that by using reinforcement learning (RL), agents can derive simple approximate and exact-restricted numeral systems. However, it is a major challenge to show how more complex recursive numeral systems, similar to the one utilised in English, could arise via a simple learning mechanism such as RL. Here, we introduce an approach towards deriving a mechanistic explanation of the emergence of recursive number systems where we consider an RL agent which directly optimizes a lexicon under a given meta-grammar. Utilising a slightly modified version of the seminal meta-grammar of (Hurford, 1975), we demonstrate that our RL agent can effectively modify the lexicon towards Pareto-optimal configurations which are comparable to those observed within human numeral systems.",
    "metadata": {
      "title": "Learning Efficient Recursive Numeral Systems via Reinforcement Learning",
      "authors": [
        "Jonathan David Thomas",
        "Andrea Silvi",
        "Devdatt Dubhashi",
        "Emil Carlsson",
        "Moa Johansson"
      ],
      "url": "https://openreview.net/pdf/017a4ef5896e1abf13238bf6c96df91a351c84ea.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "More Details, Please: Improving Autoformalization with More Detailed Proofs\n\nWe propose SPADeR, an approach to autoformalization that uses language models to infer and explicitly incorporate implicit details from informal proofs\n\nThe formalization of mathematical theorems and their proofs is a time-consuming and tedious process which, despite recent advances in the reasoning capabilities of AI systems, remains a challenging task for computers. Existing attempts to automate the process with language models struggle with the difference in level of detail between formal and informal proofs. Successful autoformalization requires models to understand and be able to explain the nuances of logical arguments, a critical aspect of reasoning that is often overlooked in existing research. In this work, we introduce Sketch, Prove, Add Detail & Repeat (SPADeR), an approach that enhances proof autoformalizers by using language models to infer and explicitly incorporate implicit details from informal proofs. With the same number of autoformalization attempts, our method increases the percentage of successfully formalized problems in the miniF2F test dataset from 34.8% to 38.1%.",
    "metadata": {
      "title": "More Details, Please: Improving Autoformalization with More Detailed Proofs",
      "authors": [
        "Guillem Tarrach",
        "Albert Q. Jiang",
        "Daniel Raggi",
        "Wenda Li",
        "Mateja Jamnik"
      ],
      "url": "https://openreview.net/pdf/648d1ef62a6f6ea84faa78689bbf3a7ceec0fc9a.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "Technical Report for ICML 2024 Automated Math Reasoning Challenge: Solving Optimization Problems with Open Source Large Language Model\n\nWe investigated open-source Large Language Models' capabilities to solve optimization problems\n\nThis technical report presents an approach utilizing open-source Large Language Models for the Automated Optimization Problem-solving With Code Challenge at the ICML 2024 AI4Math Workshop. This challenge emphasizes the ability of Large Language Models (LLMs) to handle complex mathematical reasoning from formulating to solving the problem at hand. By exploring different prompting techniques, such as few-shot, self-consistency, chain-of-thought, and tree-of-thought, we aim to explore the current state-of-the-art LLMs' mathematical reasoning abilities.",
    "metadata": {
      "title": "Technical Report for ICML 2024 Automated Math Reasoning Challenge: Solving Optimization Problems with Open Source Large Language Model",
      "authors": [
        "Duc M. Nguyen",
        "Sungahn Ko"
      ],
      "url": "https://openreview.net/pdf/0587d424a1a041845ea791237cb2754d46157cc3.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "Advancing LLM Reasoning Generalists with Preference Trees\n\nWe present Eurus, state-of-the-art open LLM reasoning generalists and its recipe.\n\nWe introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins more than 13.3%. The strong performance of EURUS can be primarily attributed to ULTRAINTERACT, our newly-curated large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. ULTRAINTERACT can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. ULTRAINTERACT allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks compared to their effectiveness in general conversations. Inspired by this, we derive a novel reward modeling objective which, together with ULTRAINTERACT, leads to a strong reward model. All artifacts produced during this research will be made public.",
    "metadata": {
      "title": "Advancing LLM Reasoning Generalists with Preference Trees",
      "authors": [
        "Lifan Yuan",
        "Ganqu Cui",
        "Hanbin Wang",
        "Ning Ding",
        "Xingyao Wang",
        "Jia Deng",
        "Boji Shan",
        "Huimin Chen",
        "Ruobing Xie",
        "Yankai Lin",
        "Zhenghao Liu",
        "Bowen Zhou",
        "Hao Peng",
        "Zhiyuan Liu",
        "Maosong Sun"
      ],
      "url": "https://openreview.net/pdf/cdf640fc47f06c403bd674317fc554bcacd8d5e9.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning\n\nWe created a dataset of geometry problems and conducted a systematic evaluation of large models.\n\nLarge language models have shown impressive results for multi-hop mathematical reasoning when the input question is only textual. Many mathematical reasoning problems, however, contain both text and image. With the ever-increasing adoption of vision language models (VLMs), understanding their reasoning abilities for such problems is crucial. \nIn this paper, we evaluate the reasoning capabilities of VLMs along various axes through the lens of geometry problems. \nWe procedurally create a synthetic dataset of geometry questions with controllable difficulty levels along multiple axes, thus enabling a systematic evaluation.\nThe empirical results obtained using our benchmark for state-of-the-art VLMs indicate that these models are not as capable in subjects like geometry (and, by generalization, other topics requiring similar reasoning) as suggested by previous benchmarks. This is made especially clear by the construction of our benchmark at various depth levels, since solving higher-depth problems requires long chains of reasoning rather than additional memorized knowledge.",
    "metadata": {
      "title": "GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning",
      "authors": [
        "Mehran Kazemi",
        "Hamidreza Alvari",
        "Ankit Anand",
        "Jialin Wu",
        "Xi Chen",
        "Radu Soricut"
      ],
      "url": "https://openreview.net/pdf/3ba5283059bb755e01651618340073d09b23f233.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  },
  {
    "text": "Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving\n\nExtracting metacognitive knowledge from LLMs and using it to improve math reasoning.\n\n\\emph{Metacognitive knowledge} refers to humans' intuitive knowledge of their own thinking and reasoning processes. Today's best LLMs clearly possess some reasoning processes. The paper gives evidence that they also  have metacognitive knowledge, including ability to name skills and procedures to apply given a task. We explore this primarily in context of math reasoning, developing a prompt-guided interaction procedure  to get a powerful  LLM to assign sensible skill labels to math questions, followed by having it perform semantic clustering to obtain coarser families of skill labels. These coarse skill labels look interpretable to humans.\n\nTo validate that these skill labels are meaningful and relevant to the LLM's reasoning processes we perform the following experiments. (a) We ask GPT-4 to assign skill labels to training questions in math datasets GSM8K and MATH.  (b) When using an LLM to solve the test questions, we present it with the full list of skill labels and ask it to identify the skill needed. Then it is presented with randomly selected exemplar solved questions associated with that skill label.  This improves accuracy on GSM8k and MATH for several strong LLMs, including code-assisted models. The methodology presented is domain-agnostic,  even though this article applies it to math problems.",
    "metadata": {
      "title": "Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving",
      "authors": [
        "Aniket Rajiv Didolkar",
        "Anirudh Goyal",
        "Nan Rosemary Ke",
        "Siyuan Guo",
        "Michal Valko",
        "Timothy P Lillicrap",
        "Danilo Jimenez Rezende",
        "Yoshua Bengio",
        "Michael Curtis Mozer",
        "Sanjeev Arora"
      ],
      "url": "https://openreview.net/pdf/b658ad52e686b34e585fbe860bd4a1bbf08341ab.pdf",
      "source": "ICML 2024 AI4MATH"
    }
  }
]