# Ko-MuSR

Code for the paper **Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean**.
## Requirements

```
openai==1.78.0
lm-eval==0.4.8  # installed with the [api] configuration
```

```bash
pip install -r requirements.txt
```

## Installing LM Evaluation Harness (use the [api] configuration)
```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e ".[api]"
```

- For data synthesis, make your OpenAI API key available via the `OPENAI_API_KEY` environment variable:

  ```bash
  export OPENAI_API_KEY="<your-openai-api-key>"
  ```

- For evaluation, prepare your language model servers. Offline inference with lm-evaluation-harness might work, but this feature has not been tested.
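For example, an OpenAI-compatible endpoint can be hosted locally with a server such as vLLM. This is only a sketch: vLLM is not a dependency of this repository, and the model name and port below are placeholders.

```bash
# Minimal sketch, assuming vLLM is installed separately; any OpenAI-compatible server works.
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# The server then exposes an OpenAI-compatible API at http://localhost:8000/v1
```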
## Generating Random Domains

To generate random domains, use the following two scripts:

```bash
# Before running the scripts, make sure your OpenAI API key is set!
export OPENAI_API_KEY="<your-openai-api-key>"

python sample_madlib_op.py  # generate random domains for the object placements task
python sample_madlib_ta.py  # generate random domains for the team allocation task
```
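Both the sampling scripts above and the generation scripts below read the key from the environment, so it can help to fail fast if it is missing. The one-liner below is only an illustration, not part of the repository:

```bash
# Hypothetical sanity check: abort early if OPENAI_API_KEY is not set in the current shell.
python -c "import os, sys; sys.exit(0 if os.environ.get('OPENAI_API_KEY') else 'OPENAI_API_KEY is not set')"
```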
## Generating Data Instances

To generate data instances, use the following three scripts:

```bash
# Before running the scripts, make sure your OpenAI API key is set!
export OPENAI_API_KEY="<your-openai-api-key>"

bash create_mm.sh  # generate murder mysteries data
bash create_op.sh  # generate object placements data
bash create_ta.sh  # generate team allocations data
```
## Evaluation

We use LM Evaluation Harness for evaluation. Task descriptions are provided in the `musr-tasks` directory.

An evaluation example that uses local inference servers is provided in the `evaluation.sh` script. See the lm-evaluation-harness repository for advanced usage.
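For reference, a run against a local OpenAI-compatible endpoint could look like the sketch below. The model name, port, and task name are placeholders: the actual task names are defined by the configs in `musr-tasks`, and `evaluation.sh` remains the authoritative example.

```bash
# Minimal sketch, assuming a local OpenAI-compatible server on port 8000 (e.g., started as shown above).
# Replace <ko-musr-task> with a task name defined in the musr-tasks configs.
lm_eval \
  --model local-completions \
  --model_args model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v1/completions,num_concurrent=8 \
  --include_path ./musr-tasks \
  --tasks <ko-musr-task> \
  --output_path results/
```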
## License

This repository is licensed under the MIT license. For the original code, see the MuSR repository.