jacklanda/SemanticQA

SemanticQA

Project Page | Paper (arXiv) | Data (Hugging Face) | Evals (GitHub) | Slides | Poster

This repository accompanies the research project "Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models".

Supported Tasks

SemanticQA supports 19 tasks across four phrase types, including 6 sequential (multi-step) tasks.

| Task | Abbr. | Eval Metrics | Phrase Type |
| --- | --- | --- | --- |
| Idiomatic Expression Detection | IED | MCQ Accuracy | Idiom |
| Idiomatic Expression Extraction | IEE | Exact Match | Idiom |
| Idiomatic Expression Interpretation | IEI | ROUGE-L, BERTScore-F1, METEOR, BLEU | Idiom |
| Noun Compound Compositionality | NCC | MCQ Accuracy | Noun Compound |
| Noun Compound Extraction | NCE | Exact Match | Noun Compound |
| Noun Compound Interpretation | NCI | ROUGE-L, BERTScore-F1, METEOR, BLEU | Noun Compound |
| Lexical Collocation Categorization | LCC | Accuracy, Macro/Micro/Weighted F1 | Collocation |
| Lexical Collocation Extraction | LCE | Exact Match | Collocation |
| Lexical Collocation Interpretation | LCI | ROUGE-L, BERTScore-F1, METEOR, BLEU | Collocation |
| Collocate Retrieval | CR | Exact Match | Collocation |
| Collocation Identification | CI | Accuracy | Collocation |
| Verbal Multiword Expression Extraction | VMWE | Exact Match | Verbal MWE |
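To make the two simplest metrics in the table concrete, here is a minimal sketch of Exact Match and MCQ accuracy. It is not the code in eval.py: the normalization step (lowercasing, stripping punctuation, collapsing whitespace) and all function names are assumptions chosen for illustration.

```python
import string


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation, collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold phrase."""
    return normalize(prediction) == normalize(gold)


def exact_match_score(predictions, golds) -> float:
    """Fraction of predictions that exactly match their gold phrase."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    return sum(exact_match(p, g) for p, g in zip(predictions, golds)) / len(golds)


def mcq_accuracy(predicted_options, gold_options) -> float:
    """Fraction of multiple-choice answers whose option letter matches the gold one."""
    if len(predicted_options) != len(gold_options):
        raise ValueError("predicted_options and gold_options must have the same length")
    if not gold_options:
        return 0.0
    hits = sum(
        p.strip().upper() == g.strip().upper()
        for p, g in zip(predicted_options, gold_options)
    )
    return hits / len(gold_options)
```

Because the normalization here removes hyphens and other punctuation, a stricter or looser rule may be needed to reproduce the scores reported by the repository.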

Sequential (multi-step) tasks combine extraction with judgment or interpretation:

  • Idiom / Collocation / Noun Compound Extraction + Judgment
  • Idiom / Collocation / Noun Compound Extraction + Interpretation
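The composition described above can be sketched as a two-step pipeline in which the second step only sees the phrase returned by the first. This is an illustrative sketch, not the repository's implementation: `run_sequential`, `toy_extract`, `toy_interpret`, and the tiny lexicon are hypothetical stand-ins for real model queries.

```python
from typing import Callable, Dict, Optional

# Hypothetical toy lexicon used only to make the sketch runnable offline.
TOY_LEXICON: Dict[str, str] = {"kick the bucket": "to die"}


def toy_extract(sentence: str) -> Optional[str]:
    """Stand-in for an extraction query: return the first known phrase, if any."""
    lowered = sentence.lower()
    for phrase in TOY_LEXICON:
        if phrase in lowered:
            return phrase
    return None


def toy_interpret(sentence: str, phrase: str) -> str:
    """Stand-in for an interpretation query: look the phrase up in the lexicon."""
    return TOY_LEXICON[phrase]


def run_sequential(
    sentence: str,
    extract: Callable[[str], Optional[str]],
    interpret: Callable[[str, str], str],
) -> dict:
    """Step 1 extracts a phrase; step 2 interprets it in context.

    If extraction fails, the second step is skipped, so an extraction
    error propagates to the final output.
    """
    phrase = extract(sentence)
    if not phrase:
        return {"sentence": sentence, "phrase": None, "interpretation": None}
    return {
        "sentence": sentence,
        "phrase": phrase,
        "interpretation": interpret(sentence, phrase),
    }
```

Passing the two steps in as callables keeps the sketch independent of any particular model client; in practice each callable would wrap a prompt and an API request.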

Supported Models

| Provider | Models |
| --- | --- |
| OpenAI | GPT-4 (0314/0613/Turbo/4o), GPT-5, o3 |
| Anthropic | Claude 3 (Sonnet/Opus), Claude Sonnet 4.5 |
| Google | Gemini 3.1 Pro, Gemma 3 27B IT |
| DeepSeek | DeepSeek-Chat, DeepSeek-R1 |
| Zhipu AI | GLM-4.6 |
| Alibaba | Qwen3 (8B/14B/32B/235B-A22B) |
| Moonshot | Kimi K2 Instruct |
| Others | Llama 3 (8B), Mistral 7B, Mixtral 8x7B |

Project Structure

SemanticQA/
├── resources/              # External datasets and source repositories
│   ├── dataset.zip         # Prepared benchmark data (unzip to use)
│   ├── AStitchInLanguageModels/  # Idiom & noun compound datasets
│   ├── ID10M/              # Idiom extraction dataset
│   ├── CollFrEn/           # French-English collocation data
│   ├── lexcomp/            # Lexical compositionality classifiers
│   ├── lexfunc/            # Lexical function data
│   ├── lexicalcollocations/# Collocation datasets
│   ├── LexNET/             # Lexical network corpus
│   ├── noun-compound-interpretation/  # NC interpretation data
│   ├── pronci/             # NC interpretation with transformers
│   └── graph-aware-collocation-recognition/
├── scripts/                # Data preparation & utility scripts
│   ├── data.py             # Data preparation for all tasks
│   ├── download_dataset.sh # Download raw datasets from sources
│   ├── calc_mean_sd.py     # Compute mean & std for BERT results
│   └── tsv2xlsx.py         # Convert TSV results to XLSX
├── semantic_qa/            # Main source code
│   ├── main.py             # Entry point for running evaluations
│   ├── args.py             # CLI argument definitions
│   ├── eval.py             # Evaluation metrics implementation
│   ├── utils.py            # I/O and prompt utilities
│   ├── data_utils.py       # Data loading & preprocessing
│   ├── model/              # Model query interfaces (OpenAI, Claude, Gemini, local)
│   ├── prompts/            # Zero-shot & few-shot prompt templates
│   ├── taxonomy/           # Semantic relation taxonomies (8/16 categories)
│   ├── training/           # Fine-tuning scripts (encoder LCC, T5 paraphrasing)
│   ├── type/               # Lexical function category mappings
│   ├── tests/              # Test scripts
│   └── results/            # Output directory
└── environment.yml         # Conda environment config

Getting Started

1. Preparing Data

Download raw datasets:

# Available: asilm, id10m, pie, ncc, nci, nce, vmwe
./scripts/download_dataset.sh asilm
./scripts/download_dataset.sh pie
./scripts/download_dataset.sh vmwe

Or unzip the prepared benchmark data:

unzip resources/dataset.zip -d SemanticQA/

2. Setting Up the Environment

conda env create -f environment.yml
conda activate lexbench

cd semantic_qa
pip install -r requirements.txt

3. Running Evaluation

All evaluations are launched from semantic_qa/ via main.py.

Example — idiom interpretation with gpt-5 (zero-shot):

python main.py \
  --task idiom-paraphrase \
  --api_key <YOUR_API_KEY> \
  --model gpt-5 \
  --prompt_path prompts/idiom_paraphrase_zeroshot.txt \
  --example_path dataset/idiom_paraphrase/prepared/examples.tsv \
  --input_path dataset/idiom_paraphrase/prepared/idiom_paraphrase_prepared.tsv \
  --output_path results/idiom-paraphrase_0-shot_gpt-5.json \
  --evaluate \
  --shot_num 0 \
  --max_query 1000 \
  --max_tokens 128 \
  --temperature 0 \
  --presence_penalty 0 \
  --frequency_penalty 0
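When the same run has to be repeated for several models, the command above can be assembled programmatically. The sketch below is a hypothetical helper, not part of the repository: `build_command` is an invented name, and the output file naming simply follows the pattern of the single example above. Only `gpt-5` is a model identifier confirmed by this README; check args.py for the others.

```python
import shlex
from typing import List


def build_command(
    model: str,
    max_query: int = 1000,
    api_key: str = "<YOUR_API_KEY>",
) -> List[str]:
    """Build the zero-shot idiom interpretation command as an argument list."""
    task = "idiom-paraphrase"
    shot_num = 0
    # Assumption: results are named <task>_<n>-shot_<model>.json, as in the example.
    output_path = f"results/{task}_{shot_num}-shot_{model}.json"
    return [
        "python", "main.py",
        "--task", task,
        "--api_key", api_key,
        "--model", model,
        "--prompt_path", "prompts/idiom_paraphrase_zeroshot.txt",
        "--example_path", "dataset/idiom_paraphrase/prepared/examples.tsv",
        "--input_path", "dataset/idiom_paraphrase/prepared/idiom_paraphrase_prepared.tsv",
        "--output_path", output_path,
        "--evaluate",
        "--shot_num", str(shot_num),
        "--max_query", str(max_query),
        "--max_tokens", "128",
        "--temperature", "0",
        "--presence_penalty", "0",
        "--frequency_penalty", "0",
    ]


if __name__ == "__main__":
    # Print the command for inspection; pass the list to subprocess.run to execute it
    # from inside semantic_qa/.
    print(shlex.join(build_command("gpt-5")))
```

Keeping the command as a list avoids shell quoting problems if it is later handed to `subprocess.run`.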

Key arguments:

| Argument | Description |
| --- | --- |
| --task | Task name (see supported tasks above) |
| --model | Model identifier |
| --api_key | API key for the model provider |
| --base_url | Custom API base URL (optional) |
| --prompt_path | Path to prompt template file |
| --example_path | Path to few-shot examples (optional) |
| --taxonomy_path | Path to taxonomy file (for LCC tasks) |
| --shot_num | Number of few-shot examples |
| --evaluate | Run evaluation after generation |
| --oracle_prompt | Use oracle prompt with ground truth hints |
| --max_query | Maximum number of API requests |
| --max_tokens | Max generated tokens per request |
| --temperature | Sampling temperature (0–2) |
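For readers who want to see how these flags fit together, the following is a rough `argparse` sketch that mirrors the arguments listed above. It is not the contents of args.py: the types, defaults, and which flags are required are all assumptions, and only the flag names and the 0 to 2 temperature range come from this README.

```python
import argparse


def temperature_value(raw: str) -> float:
    """Parse a temperature and enforce the documented 0-2 range."""
    value = float(raw)
    if not 0.0 <= value <= 2.0:
        raise argparse.ArgumentTypeError("temperature must be between 0 and 2")
    return value


def build_parser() -> argparse.ArgumentParser:
    """Assumed parser layout; defaults and required flags are illustrative."""
    parser = argparse.ArgumentParser(description="SemanticQA evaluation (sketch)")
    parser.add_argument("--task", required=True, help="Task name")
    parser.add_argument("--model", required=True, help="Model identifier")
    parser.add_argument("--api_key", default=None, help="API key for the provider")
    parser.add_argument("--base_url", default=None, help="Custom API base URL")
    parser.add_argument("--prompt_path", required=True, help="Prompt template file")
    parser.add_argument("--example_path", default=None, help="Few-shot examples file")
    parser.add_argument("--taxonomy_path", default=None, help="Taxonomy file (LCC)")
    parser.add_argument("--shot_num", type=int, default=0, help="Number of shots")
    parser.add_argument("--evaluate", action="store_true", help="Evaluate after generation")
    parser.add_argument("--oracle_prompt", action="store_true", help="Use oracle prompt")
    parser.add_argument("--max_query", type=int, default=1000, help="Max API requests")
    parser.add_argument("--max_tokens", type=int, default=128, help="Max generated tokens")
    parser.add_argument("--temperature", type=temperature_value, default=0.0,
                        help="Sampling temperature (0-2)")
    return parser
```

Validating the temperature at parse time means an out-of-range value is rejected before any API request is made.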

4. Scaling-category Experiments (LCC)

Run collocation categorization across different taxonomy sizes (1/2/4/8/16 categories):

cd semantic_qa
./run_lcc_scaling.sh

5. Standalone Evaluation

Evaluate existing result files directly:

python eval.py --task lcc --result_file_path results/lcc_result.json
python eval.py --task vmwe --result_file_path results/vmwe_result.json
python eval.py --task iep --result_file_path results/idiom_paraphrase_result.json
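For the LCC metrics (accuracy and macro/micro/weighted F1), the computation can be sketched in plain Python over parallel lists of gold and predicted labels. This is an illustration of the standard definitions, not the code in eval.py, and it makes no assumption about the schema of the result JSON files; `f1_report` and the placeholder labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def f1_report(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Accuracy and macro/micro/weighted F1 for single-label classification."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("gold and pred must not be empty")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed

    def f1(t: int, f_pos: int, f_neg: int) -> float:
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    labels = sorted(set(gold) | set(pred))
    support = Counter(gold)
    per_label = {label: f1(tp[label], fp[label], fn[label]) for label in labels}

    return {
        "accuracy": sum(tp.values()) / len(gold),
        # Macro: unweighted mean over every label seen in gold or pred.
        "macro_f1": sum(per_label.values()) / len(labels),
        # Micro: pool the counts across labels before computing F1.
        "micro_f1": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        # Weighted: mean of per-label F1 weighted by gold support.
        "weighted_f1": sum(per_label[l] * support[l] for l in labels) / len(gold),
    }


if __name__ == "__main__":
    # Placeholder labels; real LCC labels come from the taxonomy files.
    print(f1_report(["A", "A", "B", "C"], ["A", "B", "B", "C"]))
```

In the single-label setting, micro F1 equals accuracy, so the two only diverge from macro and weighted F1 when the label distribution is imbalanced.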

6. Fine-tuning

Fine-tuning scripts are available under semantic_qa/training/:

  • collocation-categorization/ — Fine-tune encoder models for LCC
  • semantic-phrases-interpretation/ — Fine-tune T5 for paraphrasing tasks

Citation

@inproceedings{liu-etal-2026-revisiting,
    title = "Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models",
    author = "Liu, Yang  and
      Li, Hongming  and
      Qin, Melissa Xiaohui  and
      Huang, Chao  and
      Liu, Qiankun",
    editor = "Liakata, Maria  and
      Moreira, Viviane P.  and
      Zhang, Jiajun  and
      Jurgens, David",
    booktitle = "Proceedings of the 64th Annual Meeting of the {A}ssociation for {C}omputational {L}inguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2026",
    address = "San Diego, California, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.acl-long.210/",
    pages = "4595--4618",
    ISBN = "979-8-89176-390-6",
    abstract = "We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales in extraction, classification, and interpretation tasks, as well as sequential task compositions. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning, highlighting differences in reasoning efficacy and semantic understanding of LMs, providing insights for pushing LMs with stronger comprehension on non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at https://github.com/jacklanda/SemanticQA."
}
@article{liu2024revisiting,
    title={Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models},
    author={Liu, Yang and Qin, Melissa Xiaohui and Li, Hongming and Huang, Chao},
    journal={arXiv preprint arXiv:2405.02861},
    year={2024}
}

License

MIT License - see LICENSE for details.

About

[ACL 2026 Oral] Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models
