- Paper: EMNLP 2025
- Authors: Adam Zahradník, Marek Šuppa
- Repository: github.com/gardenerik/trojsten-benchmark
- Contact: [email protected]
Large language models show promising performance on reasoning tasks, yet evaluation methods for low-resource languages remain limited, particularly for complex STEM problem-solving. We introduce Trojsten Benchmark, a Slovak-language dataset of 1,108 high-school competition problems with reference solutions across mathematics, physics, and programming, and a rubric-based LLM grading framework. Using GPT-4 to generate rubrics and grade solutions, we observe 1.05 average absolute deviation from human graders (5-point scale), while benchmarking GPT-3.5-Turbo, GPT-4, GPT-4o, and open-weight models (Llama 3, Phi-3). We quantify multistep reasoning performance by difficulty, show consistent underperformance on harder items, and demonstrate language sensitivity: accuracy drops on English translations of Slovak statements, evidencing challenges beyond translation. Trojsten Benchmark complements English-centric math datasets (e.g., MATH, GSM8K) by targeting open-response, rubric-gradable reasoning under low-resource linguistic framing. We release code and data to enable reproducible evaluation and human-aligned auto-grading for STEM in under-served languages.
Generate LLM Solutions:
python generate_solutions.py [list] [prefix] [model] [solver]
# Example:
python generate_solutions.py fks_test experiment1 gpt4 zeroshot_cot

Generate Evaluation Rubrics:
python generate_rubrics.py [list] [prefix] [model]
# Example:
python generate_rubrics.py fks_test experiment1 gpt4

Grade Generated Solutions:
python grade_solutions.py [list] [prefix] [grader] [rubricer]
# Example:
python grade_solutions.py fks_test experiment1 zeroshot_v1 zeroshot_v1
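For orientation, a minimal sketch of what a rubric-based grading call might look like, assuming an OpenAI-style chat client. The prompt wording, function name, and score parsing below are illustrative assumptions, not the repository's actual grader:

```python
# Hypothetical sketch of rubric-based grading on a 0-5 scale; the prompt
# wording and score parsing are assumptions, not this repository's grader.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_solution(problem: str, rubric: str, solution: str,
                   model: str = "gpt-4") -> int:
    """Ask the model to score a solution against a rubric (0-5)."""
    prompt = (
        f"Problem:\n{problem}\n\nRubric:\n{rubric}\n\n"
        f"Student solution:\n{solution}\n\n"
        "Grade the solution against the rubric. "
        "Answer with a single integer score from 0 to 5."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"\d+", response.choices[0].message.content)
    return int(match.group()) if match else 0
```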
Available Solvers:

- zeroshot_v1: Basic zero-shot approach
- zeroshot_cot: Zero-shot with chain-of-thought reasoning (see the sketch below)
- generated_knowledge: Uses generated knowledge for problem solving
- dualprompt_generated_knowledge: Dual-prompt variant of the generated-knowledge approach
- least_to_most: Least-to-most problem decomposition
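As a companion to the list above, a minimal sketch of a zero-shot chain-of-thought call, assuming an OpenAI-style chat client; the prompt wording and function name are illustrative assumptions, not the repository's implementation:

```python
# Hypothetical zero-shot chain-of-thought solver; the prompt wording and
# function name are illustrative, not this repository's implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def solve_zeroshot_cot(problem: str, model: str = "gpt-4") -> str:
    """Ask the model to reason step by step before answering."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You solve high-school STEM competition problems."},
            {"role": "user",
             "content": f"{problem}\n\nLet's think step by step."},
        ],
    )
    return response.choices[0].message.content
```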
Models Supported:
- OpenAI GPT models (gpt4, gpt3.5-turbo)
- Azure OpenAI deployments
- Custom model configurations
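The repository's own model-configuration format is not reproduced here; as a hedged illustration, the openai Python package can address an Azure OpenAI deployment as follows (endpoint, API version, and deployment name are placeholders):

```python
# Illustrative Azure OpenAI client setup; the endpoint, api_version, and
# deployment name are placeholders, not values from this repository.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_version="2024-02-01",
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
)

response = client.chat.completions.create(
    model="<your-gpt4-deployment>",  # Azure deployment name
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```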
Each problem in the dataset is stored in a CSV with the following columns:
id,competition,problem,solution,difficulty
Example:
1b56d285f84b378c,fks,"Sme na vysokom strome...","Zamyslime sa nad tym...",3
Fields:
- id: Unique problem identifier
- competition: Competition type (fks, kms, ksp)
- problem: Problem statement in Slovak
- solution: Detailed solution explanation
- difficulty: Numeric difficulty rating
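Assuming the first row of the CSV carries the column names above, a minimal loading sketch (the file name problems.csv is a placeholder for the released data file):

```python
# Load the benchmark CSV and group problems by competition.
# "problems.csv" is a placeholder path for the released data file.
import csv
from collections import defaultdict

by_competition = defaultdict(list)
with open("problems.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        row["difficulty"] = int(row["difficulty"])
        by_competition[row["competition"]].append(row)

for competition, problems in by_competition.items():
    print(competition, len(problems), "problems")
```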
Citation:

@inproceedings{zahradnik-suppa-2025-trojsten,
title = "Trojsten Benchmark: Evaluating {LLM} Problem-Solving in {S}lovak {STEM} Competition Problems",
author = "Zahradn{\'i}k, Adam and
Suppa, Marek",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1779/",
pages = "35094--35109",
ISBN = "979-8-89176-332-6"
}
This project is part of academic research. Please contact the authors for usage information.