Assessing Web Search Credibility and Response Groundedness in Chat Assistants

This repository contains the source code and data for the paper "Assessing Web Search Credibility and Response Groundedness in Chat Assistants".

📄 Abstract

Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources. While this promises more reliable answers, it also introduces the risk of amplifying misinformation if low-credibility sources are cited. Existing factuality benchmarks evaluate the internal knowledge of large language models (LLMs) but do not focus on web search behavior or the credibility of the sources used to ground responses. In this paper, we introduce a novel evaluation methodology that analyzes web search behavior, measured through source credibility and the groundedness of assistant responses with respect to cited sources. Using 100 claims across five misinformation-prone domains, we evaluate GPT-4o, GPT-5, Perplexity, and Qwen Chat, proposing a methodology to quantify source reliability. Our findings reveal substantial differences between the assistants: Perplexity achieves the highest credibility, while GPT-4o shows elevated disinformation rates for sensitive topics. This work provides the first systematic assessment of commonly used chat assistants' fact-checking behavior, establishing a critical methodology for evaluating AI systems in misinformation-prone contexts.

⚙️ Reproducibility

Installation

Set up the environment with Python 3.11.5 and install dependencies:

pip install -r requirements.txt
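If you prefer an isolated environment, the dependencies can also be installed into a fresh virtual environment (optional; any Python 3.11 environment manager works):

python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install -r requirements.txt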

Data & Response Collection

To evaluate the web-search-enabled assistants, responses must first be collected.

Supported assistants: GPT-4o (deprecated), GPT-5, Perplexity, and Qwen Chat. Run the following command to collect responses:

python -m scripts.collect_data

This launches a Selenium-based pipeline to interact with the assistants and collect metadata and raw HTML outputs.
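For orientation, the sketch below shows the general shape of such a collection loop. All URLs, selectors, claims, and file paths here are hypothetical placeholders; the actual pipeline in scripts/collect_data.py handles logins, per-assistant selectors, and waiting for streamed responses.

import json
from pathlib import Path
from selenium import webdriver

claims = ["Claim 1 ...", "Claim 2 ..."]   # placeholder claims
out_dir = Path("data/raw")                # placeholder output directory
out_dir.mkdir(parents=True, exist_ok=True)

driver = webdriver.Chrome()
try:
    for i, claim in enumerate(claims):
        driver.get("https://example-assistant.invalid/chat")  # placeholder URL
        # ... submit the claim through the assistant's input box and wait for the answer ...
        (out_dir / f"{i}.html").write_text(driver.page_source, encoding="utf-8")
        (out_dir / f"{i}.json").write_text(json.dumps({"claim": claim}), encoding="utf-8")
finally:
    driver.quit()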

Processing Collected Data

Since the initial collection only stores metadata and HTML, you must extract the cited sources and assistant responses:

python -m scripts.run_processing
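As a rough illustration of what this extraction step does (the parsing below is a hypothetical sketch; scripts/run_processing.py uses its own per-assistant logic):

from bs4 import BeautifulSoup

# Hypothetical sketch: pull the response text and cited links out of a saved page.
html = open("data/raw/0.html", encoding="utf-8").read()   # placeholder path
soup = BeautifulSoup(html, "html.parser")

response_text = soup.get_text(separator=" ", strip=True)
cited_sources = [a["href"] for a in soup.find_all("a", href=True)
                 if a["href"].startswith("http")]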

Scraping Source Content

Next, download the full content of all cited sources (required for groundedness evaluation):

python -m scripts.download_articles

If some websites fail to download via requests, you can switch to a Selenium-based scraper. In download_articles.py, replace:

loader = WebLoader(url=source)

with

loader = SeleniumLoader(url=source)

to capture missing pages.
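If you want to avoid the manual switch, the same idea can be expressed as an automatic fallback. The sketch below uses requests and Selenium directly rather than the repository's loader classes:

import requests
from selenium import webdriver

def fetch_html(url, timeout=20):
    # Try a plain HTTP request first; fall back to a Selenium-driven browser on failure.
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        driver = webdriver.Chrome()
        try:
            driver.get(url)
            return driver.page_source
        finally:
            driver.quit()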

Web Search Analysis

To measure how many credible vs. non-credible sources each assistant cites, and the ratio between them, run:

python -m scripts.run_retrieval_eval
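The core metric is straightforward. As a minimal illustration (the labels below are hypothetical placeholders; the actual computation lives in scripts/run_retrieval_eval.py):

# Hypothetical sketch: count credible vs. non-credible citations per assistant.
citations = {
    "Perplexity": ["credible", "credible", "non-credible"],   # placeholder labels
    "GPT-4o": ["credible", "non-credible", "non-credible"],
}

for assistant, labels in citations.items():
    credible = labels.count("credible")
    total = len(labels)
    print(f"{assistant}: {credible}/{total} credible ({credible / total:.0%})")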

You can analyze the results in the provided Jupyter notebook: Web Search Analysis.ipynb

Groundedness Analysis

The groundedness evaluation requires GPUs capable of running the quantized Llama 3.3 70B model and an embedding model for retrieval.
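For reference, a model of this size can be loaded in quantized form roughly as follows (4-bit loading via bitsandbytes is shown as one option; the quantization scheme and loading code actually used by the scripts may differ):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.3-70B-Instruct"
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # spread the model across available GPUs
)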

We implement and compare three evaluation frameworks:

  1. Modified FActScore
  2. Modified VERIFY (FactBench)
  3. Our proposed method

Run them as follows:

# FActScore
python -m scripts.run_factscore

# VERIFY / FactBench
python -m scripts.run_factbench

# Our method
python -m scripts.our_method
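As a rough intuition, a FActScore-style groundedness score is the fraction of atomic claims in a response that the cited sources support. A minimal illustration with hypothetical verdicts (each framework defines its own claim decomposition and judging prompts):

# Hypothetical sketch: groundedness as the share of supported atomic claims.
verdicts = ["supported", "supported", "unsupported", "supported"]  # placeholder judgments
groundedness = verdicts.count("supported") / len(verdicts)
print(f"Groundedness: {groundedness:.2f}")   # 0.75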

To analyze the data for individual assistants, run the code in the Jupyter notebook: Groundedness Evaluation.ipynb
