About the Paper — BengaliFig: A Benchmark for Figurative and Culturally Grounded Reasoning in Bengali
BengaliFig is the research contribution accompanying this repository, accepted for oral presentation at the AACL MMLoSo workshop. The paper introduces the first benchmark designed specifically to evaluate figurative, metaphorical, wordplay-based, and culturally grounded reasoning in Bengali, a major world language that remains under-resourced in NLP evaluation.
Large Language Models (LLMs) excel on standard comprehension benchmarks but are rarely tested on culturally embedded, non-literal tasks—especially outside high-resource languages. Bengali riddles, rich in metaphor, misdirection, cultural symbolism, and phonological cues, provide a natural diagnostic tool for probing these weaknesses.
Main contributions:

- BengaliFig Dataset (435 riddles): curated from oral and literary Bengali sources, deduplicated, normalized, and converted into a multiple-choice format.
- Five-Dimensional Annotation Framework: each riddle is annotated along:
  - Reasoning Type (metaphor, wordplay, compound, descriptive, etc.)
  - Trap Type (linguistic trick, misdirection, distractor type)
  - Cultural Depth (universal vs. culturally specific)
  - Answer Type (object, plant, concept, body part, etc.)
  - Difficulty (easy/medium/hard)
- Hybrid LLM-Assisted Annotation Pipeline: a human–LLM workflow that reduces annotation time while maintaining high agreement (Krippendorff's α ≈ 0.90).
- Evaluation of Frontier LLMs: eight state-of-the-art models, including GPT-5, GPT-4.1, Claude Opus 4.1, DeepSeek-V3, Qwen3-235B, and LLaMA-4, evaluated in:
  - Zero-shot settings
  - Few-shot Chain-of-Thought (CoT) settings

Key findings:
- Metaphor and wordplay riddles are the most challenging across models.
- Cultural specificity consistently lowers performance, revealing gaps in cultural grounding.
- Phonological/syllable constraints expose LLM weaknesses on non-Latin-script languages like Bengali.
- CoT improves weaker models, though top-tier models show limited gains due to small-sample ceiling effects.
- Overall, results highlight that LLMs still struggle with non-literal, culturally embedded reasoning, even when they perform strongly on standard multilingual tasks.
All dataset files, evaluation scripts, annotation prompts, and analysis utilities accompanying the paper are included in this repository to support full reproducibility and further research.
Scripts:

- `zero_shot_riddle_eval.py` — Zero-shot evaluator: asks models to return a single letter (A/B/C/D) in JSON format for each riddle.
- `cot.py` — Chain-of-Thought (CoT) evaluator: prompts models to show step-by-step reasoning and expects a JSON response with reasoning + answer.
- `result_analyzer.py` — Aggregates result JSON files (usually from `results/zero_shot`) and prints/saves a detailed analysis and comparison CSV.
- `cot_analyser.py` — Compares CoT runs to corresponding zero-shot runs and reports accuracy deltas and improvement/regression statistics.
Common outputs are saved under `results/`, and the analysis scripts additionally write CSV summaries.

Requirements:
- Python 3.9+
- Packages: `openai` (or the compatible client used in the repo), `python-dotenv`, `pandas`
Suggested install (create and activate a virtualenv first):
python -m venv .venv
source .venv/bin/activate
pip install pandas python-dotenv openai

or install from requirements.txt:

pip install -r requirements.txt

Place secrets in a `.env` file or export them in your shell. The scripts check for the following keys (depending on provider):
- `OPENAI_API_KEY` — for OpenAI
- `NOVITA_API_KEY` — for Novita (base_url is set in code)
- `DEEPSEEK_API_KEY` — for DeepSeek (base_url is set in code)
- `ANTHROPIC_API_KEY` — for Anthropic (base_url is set in code)
The repo uses `dotenv.load_dotenv()`, so a `.env` file in the repo root with lines like `OPENAI_API_KEY=sk-...` will work.
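For reference, the credential-loading pattern looks roughly like this (a minimal sketch of the standard `python-dotenv` flow; the repo's own key handling may differ slightly):

```python
# Minimal sketch of the credential-loading pattern described above (not the repo's exact code).
import os
from dotenv import load_dotenv

load_dotenv()  # picks up a .env file in the current working directory / repo root

api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or export it in your shell.")
```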
Data files:

- `data/mcq_dataset.json` (or `data/v4_patched_mcq_dataset.json`, depending on your dataset naming) — the MCQ riddle dataset used by the evaluators. Each entry is expected to contain fields like `id`, `question`, `options`, `correct_option`, `answer`, `reasoning_type`, `difficulty`, `trap_type`, `cultural_depth`, etc.
- `results/hardest_data_points.json` — used by `cot.py`'s `main()` to focus CoT runs on a subset of hard cases. If it is not present, update the `data_file` variable inside the script or pass a custom path when running programmatically.

Make sure the data file has the structure expected by the evaluators (see the `RiddleEvaluator` / `CoTRiddleEvaluator` constructors, which call `json.load`); a small validation sketch follows.
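As a quick sanity check, a short script like the one below can verify that each entry carries the fields listed above. The field names come from this README; the script itself is illustrative and not part of the repository:

```python
# Illustrative structure check for the MCQ dataset (not part of the repo).
import json

EXPECTED_FIELDS = {
    "id", "question", "options", "correct_option", "answer",
    "reasoning_type", "difficulty", "trap_type", "cultural_depth",
}

with open("data/mcq_dataset.json", encoding="utf-8") as f:
    riddles = json.load(f)  # the evaluators load the file the same way

for riddle in riddles:
    missing = EXPECTED_FIELDS - riddle.keys()
    if missing:
        print(f"Entry {riddle.get('id', '?')} is missing fields: {sorted(missing)}")

print(f"Checked {len(riddles)} riddles.")
```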
`zero_shot_riddle_eval.py`

Purpose: Run zero-shot evaluation on a dataset and save model responses.
What it does:
- Loads the dataset (the `data_file` variable in `main()` defaults to `data/v4_patched_mcq_dataset.json`).
- Builds a simple prompt asking the model to return JSON like `{"উত্তর": "A"}` (`উত্তর` means "answer"); a sketch of this call pattern appears after this list.
- Calls the provider/model defined in the `settings` list inside `main()`.
- Saves results to `results/zero_shot/` with a timestamped filename.
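For orientation, the request/response pattern looks roughly like the following. The prompt wording, the model name, and the shape of the `options` field are placeholders; the script's actual prompt is in Bengali and its details live in the code:

```python
# Sketch of the zero-shot call pattern; the prompt text, the model name, and the assumed
# letter -> text shape of "options" are illustrative, not the repo's exact code.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_zero_shot(riddle: dict, model: str = "gpt-4.1") -> str:
    options = "\n".join(f"{label}. {text}" for label, text in riddle["options"].items())
    prompt = (
        f"{riddle['question']}\n{options}\n"
        'Reply with JSON only, e.g. {"উত্তর": "A"}.'
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)["উত্তর"]  # "A"/"B"/"C"/"D"
```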
Inputs:
- Data file (JSON list of riddles).
- Provider & model definitions inside `main()` (edit the `settings` list).
- Environment API key(s).
Outputs:
- A JSON file in `results/zero_shot/` with the structure `{ "metadata": {...}, "results": [{...}, ...] }`.
How to run:
- Option A (quick, using defaults):
  - Ensure environment variables are set and the `data_file` path in `main()` is correct.
  - Run: `python zero_shot_riddle_eval.py`
  - The script runs the providers/models listed in the `settings` list sequentially.
- Option B (programmatic):
  - Import `RiddleEvaluator` in another script, instantiate it with your provider/model/data_file, and call `run_evaluation()` with custom delays and a start index, as in the sketch below.
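A minimal sketch of Option B, assuming the constructor takes the provider, model, and data file described above and that `run_evaluation()` accepts the same delay arguments as the CoT evaluator (the keyword-argument names are assumptions):

```python
# Sketch of programmatic use; keyword-argument names are assumptions based on this README.
from zero_shot_riddle_eval import RiddleEvaluator

evaluator = RiddleEvaluator(
    provider="openai",                             # assumed provider key
    model="gpt-4.1",                               # any model configured for that provider
    data_file="data/v4_patched_mcq_dataset.json",  # dataset path
)
evaluator.run_evaluation(delay=2, batch_delay=30, start_index=0)
```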
Notes:
- The script uses an internal `settings` list in `main()` instead of CLI flags. Edit that list to add/remove models or change delays.
- There are built-in sleeps to avoid rate limits; tune `delay` and `batch_delay` in `settings`.
`cot.py`

Purpose: Run Chain-of-Thought (CoT) evaluation on (usually) hard cases and save the model's reasoning and final answer.
What it does:
- Loads the dataset (defaults to `results/hardest_data_points.json` in `main()`).
- For each riddle, builds a CoT prompt in Bengali that asks the model to provide step-by-step reasoning and a JSON answer (with `যুক্তি`, "reasoning", and `উত্তর`, "answer").
- Calls the model and parses the response (attempts JSON parsing, with several regex fallbacks).
- Saves detailed results into `results/chain_of_thought_hard_cases/` as per-run JSON files, and prints progress and periodic checkpoints.
Inputs:
- `data_file` path (set in `main()` for hard cases).
- Provider & model list in `main()`.
- Environment API key(s).
Outputs:
- JSON files in `results/chain_of_thought_hard_cases/` containing `metadata` and `results` (each result includes `raw_response`, the parsed `reasoning`, the predicted option, and more).
How to run:
- Ensure API keys and `data_file` are correct.
- Run: `python cot.py`
- Or instantiate `CoTRiddleEvaluator` programmatically and call `run_evaluation(delay=..., batch_delay=..., start_index=...)`.
Notes:
- CoT prompts are verbose and in Bengali. The code attempts to parse structured JSON responses but also tolerates non-JSON outputs; a sketch of this parsing pattern follows below.
- You can change models in the `settings` list. Tweak delays to match the rate limits of your provider.
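To illustrate the "JSON first, regex fallback" parsing mentioned above, here is a hedged sketch; the repo's actual fallbacks are more extensive and its function names differ:

```python
# Illustrative parser: strict JSON first, then regex fallbacks (not the repo's implementation).
import json
import re
from typing import Optional

def extract_answer(raw_response: str) -> Optional[str]:
    # 1) Strict JSON parse, e.g. {"যুক্তি": "...", "উত্তর": "A"}
    try:
        parsed = json.loads(raw_response)
        if isinstance(parsed, dict) and "উত্তর" in parsed:
            return parsed["উত্তর"]
    except json.JSONDecodeError:
        pass
    # 2) Fallback: find an "উত্তর": "X" fragment anywhere in the text
    match = re.search(r'"উত্তর"\s*:\s*"([A-D])"', raw_response)
    if match:
        return match.group(1)
    # 3) Last resort: a lone option letter in the response
    match = re.search(r"\b([A-D])\b", raw_response)
    return match.group(1) if match else None
```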
`result_analyzer.py`

Purpose: Aggregate many result JSON files (e.g., those in `results/zero_shot`) and generate a rich analysis including accuracy by model, by reasoning type, difficulty, trap susceptibility, and confusing distractors.
What it does:
- Reads all JSON files from `results/zero_shot` (the default in `main()`) using glob.
- Computes primary metrics (overall accuracy, accuracy by reasoning type, difficulty, cultural depth).
- Computes secondary metrics (trap susceptibility, distractor confusion, confidence calibration).
- Prints an organized report to stdout and writes a comparison CSV to `result_analysis/zero_shot_model_comparison.csv`.
How to run:
- After you have result JSON files (from `zero_shot_riddle_eval.py`), run: `python result_analyzer.py`
- The script expects `.json` files in `results/zero_shot/`.
Notes:
- The analyzer currently uses a fixed `results_dir` variable in `main()`; modify it or call `ResultsAnalyzer` directly with a list of file paths.
`cot_analyser.py`

Purpose: Compare CoT (Chain-of-Thought) runs against zero-shot runs over the same riddle IDs and report improvements/regressions.
What it does:
- Loads JSON files from `results/zero_shot` and `results/chain_of_thought_hard_cases` and matches them by provider/model metadata.
- Builds indices mapping `riddle_id` → correctness and computes metrics only on the CoT subset, so comparisons are aligned to the cases CoT actually attempted; a sketch of this alignment logic appears below.
- Produces a pandas DataFrame summarizing improvements, regressions, and accuracy gains, and saves the CSV `cot_vs_zeroshot_final_summary.csv`.
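The alignment step amounts to intersecting the two runs on riddle ID and counting flips. A hedged sketch follows; the `riddle_id` / `is_correct` field names and the file paths are assumptions about the result schema:

```python
# Illustrative CoT vs. zero-shot comparison (field names and paths are assumed, not verified).
import json

def correctness_by_id(path: str) -> dict:
    """Map riddle_id -> correctness for one results file (schema assumed)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return {r["riddle_id"]: bool(r["is_correct"]) for r in data["results"]}

zero_shot = correctness_by_id("results/zero_shot/example_run.json")              # placeholder path
cot = correctness_by_id("results/chain_of_thought_hard_cases/example_run.json")  # placeholder path

shared = cot.keys() & zero_shot.keys()  # only the cases CoT actually attempted
improved = sum(cot[i] and not zero_shot[i] for i in shared)
regressed = sum(zero_shot[i] and not cot[i] for i in shared)
print(f"improved: {improved}, regressed: {regressed}, compared: {len(shared)}")
```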
How to run:
- Ensure you have matching zero-shot and CoT output files (same provider/model pairs under `results/zero_shot` and `results/chain_of_thought_hard_cases`).
- Run: `python cot_analyser.py`
- The script writes `cot_vs_zeroshot_final_summary.csv` in the repo root.
Notes:
- The analyzer picks the most recent zero-shot file if multiple are present for a provider/model pair.
Output files:

- `results/zero_shot/` — JSON result files from `zero_shot_riddle_eval.py`.
- `results/chain_of_thought_hard_cases/` — JSON result files from `cot.py`.
- `result_analysis/` — CSV summaries produced by `result_analyzer.py` (the script creates `result_analysis/zero_shot_model_comparison.csv`).
- `cot_vs_zeroshot_final_summary.csv` — summary CSV generated by `cot_analyser.py`.
Troubleshooting and configuration:

- If you get API errors, double-check the corresponding environment variable and the provider `base_url` used in the script.
- To add models/providers, edit the `settings` list in the `main()` function of the evaluator you want to run (zero-shot or CoT). Each entry should have at least `provider`, `model`, `delay`, and `batch_delay` keys; see the sketch below.
- If your dataset file has a different name/path than the script expects, edit the `data_file` variable in `main()` or run the evaluator programmatically.
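As an illustration, a `settings` entry might look like the following; the key names come from this README, while the values are placeholders:

```python
# Hypothetical settings entry inside main(); key names follow this README, values are placeholders.
settings = [
    {
        "provider": "openai",   # which API client / base_url to use
        "model": "gpt-4.1",     # model identifier passed to that provider
        "delay": 2,             # seconds to wait between individual requests
        "batch_delay": 30,      # longer pause between batches to respect rate limits
    },
]
```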
