This repository contains the source code and data for the paper "Assessing Web Search Credibility and Response Groundedness in Chat Assistants".
Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources. While this promises more reliable answers, it also introduces the risk of amplifying misinformation if low-credibility sources are cited. Existing factuality benchmarks evaluate the internal knowledge of large language models (LLMs) but do not focus on web search behavior or the credibility of the sources used to ground a response. In this paper, we introduce a novel evaluation methodology that analyzes web search behavior, measured through source credibility and the groundedness of assistant responses with respect to cited sources. Using 100 claims across five misinformation-prone domains, we evaluate GPT-4o, GPT-5, Perplexity, and Qwen Chat, proposing a methodology to quantify source reliability. Our findings reveal substantial differences between the assistants: Perplexity achieves the highest credibility, while GPT-4o shows elevated disinformation rates for sensitive topics. This work provides the first systematic assessment of the fact-checking behavior of commonly used chat assistants, establishing a critical methodology for evaluating AI systems in misinformation-prone contexts.
Set up the environment with Python 3.11.5 and install dependencies:
```
pip install -r requirements.txt
```

To evaluate the web-search-enabled assistants, responses must first be collected.
Supported assistants: GPT-4o (deprecated), GPT-5, Perplexity, and Qwen Chat. Run the following command to collect responses:
```
python -m scripts.collect_data
```

This launches a Selenium-based pipeline to interact with the assistants and collect metadata and raw HTML outputs.
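For orientation, the core of that pipeline amounts to driving a browser session and saving the rendered HTML together with some run metadata. The sketch below is illustrative only; the URL, output paths, and metadata fields are assumptions, not the script's actual schema:

```python
# Illustrative sketch of HTML + metadata collection with Selenium; the real
# scripts.collect_data pipeline handles prompting, logins, and per-assistant quirks.
import json
import time
from pathlib import Path

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.perplexity.ai")  # example assistant URL (assumption)
time.sleep(10)  # crude wait for the page and search results to render

out_dir = Path("data/raw")  # hypothetical output location
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / "response.html").write_text(driver.page_source, encoding="utf-8")
(out_dir / "metadata.json").write_text(json.dumps({
    "assistant": "Perplexity",
    "collected_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
}), encoding="utf-8")
driver.quit()
```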
Since the initial collection only stores metadata and HTML, you must extract the cited sources and assistant responses:
```
python -m scripts.run_processing
```
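Conceptually, this step parses the saved HTML for the answer text and the cited links. A rough sketch with BeautifulSoup (the generic selector and file path are placeholders; the actual script targets each assistant's own markup):

```python
# Rough sketch of pulling the response text and cited links out of a saved
# HTML file; scripts.run_processing works against the assistants' real markup.
from bs4 import BeautifulSoup

with open("data/raw/response.html", encoding="utf-8") as f:  # hypothetical path
    soup = BeautifulSoup(f.read(), "html.parser")

cited_sources = [a["href"] for a in soup.select("a[href^='http']")]
response_text = soup.get_text(" ", strip=True)
```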
Next, download the full content of all cited sources (required for the groundedness evaluation):

```
python -m scripts.download_articles
```

If some websites fail to download via `requests`, you can switch to a Selenium-based scraper. In `download_articles.py`, replace:

```python
loader = WebLoader(url=source)
```

with

```python
loader = SeleniumLoader(url=source)
```

to capture the missing pages.
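Alternatively, you can keep the requests-based loader as the default and fall back to Selenium only for pages that fail. A minimal sketch of that pattern, assuming both loader classes expose a `load()` method (adjust to the actual interface in `download_articles.py`):

```python
# Minimal fallback sketch: try the requests-based loader first, then Selenium.
# Assumes both loaders expose a load() method returning the page content,
# which may differ from the repository's actual loader interface.
def fetch_article(source: str):
    try:
        return WebLoader(url=source).load()
    except Exception:
        # Sites that block plain HTTP requests or require JavaScript rendering.
        return SeleniumLoader(url=source).load()
```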
To measure how many credible vs. non-credible sources each assistant cites, and their ratio, run:

```
python -m scripts.run_retrieval_eval
```

You can analyze the results in the provided Jupyter notebook:
Web Search Analysis.ipynb
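For reference, the credibility ratio reported by this step can be thought of as the share of cited sources whose domain appears on a credibility list. A minimal sketch, with a hypothetical domain list standing in for the actual source ratings used by the script:

```python
from urllib.parse import urlparse

# Hypothetical list for illustration; the evaluation uses its own source ratings.
CREDIBLE_DOMAINS = {"reuters.com", "apnews.com", "who.int"}

def credibility_ratio(cited_urls: list[str]) -> float:
    """Fraction of cited sources whose domain is on the credible list."""
    if not cited_urls:
        return 0.0
    credible = sum(
        urlparse(url).netloc.removeprefix("www.") in CREDIBLE_DOMAINS
        for url in cited_urls
    )
    return credible / len(cited_urls)

print(credibility_ratio([
    "https://www.reuters.com/world/example-article",
    "https://example-blog.net/post",
]))  # -> 0.5
```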
The groundedness evaluation requires GPUs capable of running the quantized Llama 3.3 70B model and an embedding model for retrieval.
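As a rough guide, the sketch below shows one way to load a 4-bit-quantized Llama 3.3 70B and a retrieval embedding model; the exact checkpoints, quantization scheme, and retrieval stack used by the scripts may differ:

```python
# Illustrative only: model IDs, 4-bit quantization, and the embedding model
# are assumptions, not necessarily what the evaluation scripts configure.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
llm = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",  # gated model; requires HF access
    quantization_config=quant_config,
    device_map="auto",  # spreads the ~40 GB 4-bit model across available GPUs
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

# Embedding model used to retrieve relevant passages from the cited sources.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
```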
We implement and compare three evaluation frameworks:
- Modified FActScore
- Modified VERIFY (FactBench)
- Our proposed method
Run them as follows:
```
# FActScore
python -m scripts.run_factscore

# VERIFY / FactBench
python -m scripts.run_factbench

# Our method
python -m scripts.our_method
```

To analyze the data for individual assistants, run the code in the Jupyter notebook: Groundedness Evaluation.ipynb
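At a high level, all three frameworks follow a similar claim-level pattern: split the assistant's response into atomic facts and check each one against the downloaded source content. The sketch below illustrates that pattern only; `is_supported` stands in for the LLM entailment judgment and is not part of the repository:

```python
# Simplified claim-level groundedness scoring; the actual scripts use the
# quantized Llama 3.3 70B as judge and embedding-based passage retrieval.
def groundedness_score(atomic_facts: list[str],
                       source_passages: list[str],
                       is_supported) -> float:
    """Fraction of atomic facts supported by at least one cited-source passage.

    `is_supported(fact, passage)` is a stand-in for an LLM entailment check.
    """
    if not atomic_facts:
        return 0.0
    supported = sum(
        any(is_supported(fact, passage) for passage in source_passages)
        for fact in atomic_facts
    )
    return supported / len(atomic_facts)
```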