This repository contains the code for reproducible experiments from our paper "Know Or Not: a library for evaluating out-of-knowledge base robustness".
KnowOrNot helps you systematically evaluate LLM robustness when facing questions outside their knowledge base. Our library helps you create benchmarks, run experiments, and evaluate LLM responses through a clean, unified API.
The easiest way to install KnowOrNot is:
```bash
# Using uv (recommended)
uv add ../KnowOrNot

# Using pip
pip install knowornot
```
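To sanity-check the install, you can try importing the package (assuming it imports as `knowornot`, the same name as on pip):

```bash
python -c "import knowornot" && echo "KnowOrNot is importable"
# or, inside a uv-managed project:
uv run python -c "import knowornot"
```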
Scripts are organized into subdirectories based on their function. To run a script, use:

```bash
uv run -m subdirectory.script_name
```

Note: Do not include the `.py` extension or use slashes (i.e. not `subdirectory/script_name.py`).
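For example, to regenerate the CPF question set (assuming the script lives in the `create_questions/` directory described below):

```bash
uv run -m create_questions.create_CPF_questions
```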
Scripts to create diverse question-answer pairs from various data sources:

- `create_BTT_questions.py`: Generates questions from Basic Theory Test driving materials
- `create_CPF_questions.py`: Creates questions about the Central Provident Fund pension system
- `create_ICA_questions.py`: Builds questions from Immigration & Checkpoints Authority FAQ data
- `create_medishield_QA.py`: Generates health insurance questions from MediShield documents
- `get_ICA_links.py`: Extracts question links from ICA website HTML
Scripts to set up and run LLM experiments:

- `create_all_experiments.py`: Sets up experiment configurations across all datasets
- `run_experiments.py`: Executes experiments in sequence
- `abstention_evals.py`: Evaluates model abstention behaviors using GPT-4.1
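A typical invocation, assuming these scripts sit in the `experiment_run/` directory referenced in the workflow below, is to create the configurations first and then execute them:

```bash
# Set up experiment configurations across all datasets
uv run -m experiment_run.create_all_experiments

# Execute the configured experiments in sequence
uv run -m experiment_run.run_experiments
```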
Scripts for evaluating generated responses:

- `abstention_evals.py`: Evaluates whether models correctly abstain
- `factuality_evals_iter.py`: Determines factual accuracy of responses with tiered classification
- `factuality_label_final.py`: Finalizes factuality labels from human annotations
- `all_abstention_evals.py`: Batch evaluation of abstention across all experiments
- `all_factuality_evals.py`: Batch evaluation of factuality across all experiments
- `gemini_search_evals.py`: Uses Gemini's search capability to verify factuality
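To score all experiments in one pass, the batch scripts can be run back to back (assuming they live in the `run_evaluations/` directory referenced in the workflow below):

```bash
# Abstention: did the model decline to answer out-of-knowledge-base questions?
uv run -m run_evaluations.all_abstention_evals

# Factuality: were the answers that the model did give accurate?
uv run -m run_evaluations.all_factuality_evals
```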
Scripts to process and analyze experimental results:
analyse_csv.py
: Generates comprehensive analysis of evaluation resultsmake_csv.py
: Converts evaluation JSON files to CSV formatcorrect_gemini_factuality.py
: Compares factuality classifications from different evaluators
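The usual order, assuming these scripts live in the `analyse_data/` directory referenced in the workflow below, is to flatten the evaluation JSON to CSV before analysing it:

```bash
# Convert evaluation JSON files to CSV
uv run -m analyse_data.make_csv

# Generate the analysis of the evaluation results
uv run -m analyse_data.analyse_csv
```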
The end-to-end workflow is:

- Setup: Install the library and configure API keys in a `.env` file (see the example after this list)
- Generate Questions: Run scripts in `create_questions/` to build QA datasets
- Run Experiments: Execute experiments with `experiment_run/create_all_experiments.py` followed by `experiment_run/run_experiments.py`
- Evaluate Results: Use scripts in `run_evaluations/` to assess model performance
- Analyze Data: Process results with scripts in `analyse_data/`
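A minimal `.env` sketch for the Setup step, assuming the GPT-4.1 and Gemini evaluators read standard provider key names; the exact variable names KnowOrNot expects may differ, so check the library's configuration documentation:

```bash
# .env — illustrative only; the variable names are assumptions
OPENAI_API_KEY=...
GEMINI_API_KEY=...
```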
Our experiments created PolicyBench, a challenging benchmark for evaluating out-of-knowledge-base (OOKB) robustness across four Singapore government policy domains of varying complexity and domain specificity. The benchmark is available at https://huggingface.co/datasets/govtech/PolicyBench.
For more details, please refer to our paper. The full source code for KnowOrNot is available at https://github.com/govtech-responsibleai/KnowOrNot.