This repository contains the source code and data for the paper "Assessing Web Search Credibility and Response Groundedness in Chat Assistants".
Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources. While this promises more reliable answers, it also introduces the risk of amplifying misinformation if low-credibility sources are cited. Existing factuality benchmarks evaluate the internal knowledge of large language models (LLMs) but do not focus on web search behavior or the credibility of the sources used to ground a response. In this paper, we introduce a novel evaluation methodology that analyzes web search behavior, measured through source credibility and the groundedness of assistant responses with respect to cited sources. Using 100 claims across five misinformation-prone domains, we evaluate GPT-4o, GPT-5, Perplexity, and Qwen Chat, proposing a methodology to quantify source reliability. Our findings reveal substantial differences between the assistants: Perplexity achieves the highest credibility, while GPT-4o shows elevated disinformation rates for sensitive topics. This work provides the first systematic assessment of the fact-checking behavior of commonly used chat assistants, establishing a critical methodology for evaluating AI systems in misinformation-prone contexts.
Set up the environment with Python 3.11.5 and install dependencies:
```
pip install -r requirements.txt
```

To evaluate the web-search-enabled assistants, responses must first be collected.
Supported assistants: GPT-4o (deprecated), GPT-5, Perplexity, and Qwen Chat. Run the following command to collect responses:
```
python -m scripts.collect_data
```

This launches a Selenium-based pipeline to interact with the assistants and collect metadata and raw HTML outputs.
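For orientation, the core of that pipeline amounts to driving a browser session and saving the rendered HTML together with some run metadata. The sketch below is illustrative only; the URL, output paths, and metadata fields are assumptions, not the script's actual schema:

```python
# Illustrative sketch of HTML + metadata collection with Selenium; the real
# scripts.collect_data pipeline handles prompting, logins, and per-assistant quirks.
import json
import time
from pathlib import Path

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.perplexity.ai")  # example assistant URL (assumption)
time.sleep(10)  # crude wait for the page and search results to render

out_dir = Path("data/raw")  # hypothetical output location
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / "response.html").write_text(driver.page_source, encoding="utf-8")
(out_dir / "metadata.json").write_text(json.dumps({
    "assistant": "Perplexity",
    "collected_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
}), encoding="utf-8")
driver.quit()
```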
Since the initial collection only stores metadata and HTML, you must extract the cited sources and assistant responses:
```
python -m scripts.run_processing
```
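Conceptually, this step parses the saved HTML for the answer text and the cited links. A rough sketch with BeautifulSoup (the generic selector and file path are placeholders; the actual script targets each assistant's own markup):

```python
# Rough sketch of pulling the response text and cited links out of a saved
# HTML file; scripts.run_processing works against the assistants' real markup.
from bs4 import BeautifulSoup

with open("data/raw/response.html", encoding="utf-8") as f:  # hypothetical path
    soup = BeautifulSoup(f.read(), "html.parser")

cited_sources = [a["href"] for a in soup.select("a[href^='http']")]
response_text = soup.get_text(" ", strip=True)
```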
Next, download the full content of all cited sources (required for the groundedness evaluation):

```
python -m scripts.download_articles
```

If some websites fail to download via `requests`, you can switch to a Selenium-based scraper. In `download_articles.py`, replace:

```python
loader = WebLoader(url=source)
```

with

```python
loader = SeleniumLoader(url=source)
```

to capture the missing pages.
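Alternatively, you can keep the requests-based loader as the default and fall back to Selenium only for pages that fail. A minimal sketch of that pattern, assuming both loader classes expose a `load()` method (adjust to the actual interface in `download_articles.py`):

```python
# Minimal fallback sketch: try the requests-based loader first, then Selenium.
# Assumes both loaders expose a load() method returning the page content,
# which may differ from the repository's actual loader interface.
def fetch_article(source: str):
    try:
        return WebLoader(url=source).load()
    except Exception:
        # Sites that block plain HTTP requests or require JavaScript rendering.
        return SeleniumLoader(url=source).load()
```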
To measure how many credible vs. non-credible sources each assistant cites, and their ratio, run:

```
python -m scripts.run_retrieval_eval
```

You can analyze the results in the provided Jupyter notebook:
Web Search Analysis.ipynb
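For reference, the credibility ratio reported by this step can be thought of as the share of cited sources whose domain appears on a credibility list. A minimal sketch, with a hypothetical domain list standing in for the actual source ratings used by the script:

```python
from urllib.parse import urlparse

# Hypothetical list for illustration; the evaluation uses its own source ratings.
CREDIBLE_DOMAINS = {"reuters.com", "apnews.com", "who.int"}

def credibility_ratio(cited_urls: list[str]) -> float:
    """Fraction of cited sources whose domain is on the credible list."""
    if not cited_urls:
        return 0.0
    credible = sum(
        urlparse(url).netloc.removeprefix("www.") in CREDIBLE_DOMAINS
        for url in cited_urls
    )
    return credible / len(cited_urls)

print(credibility_ratio([
    "https://www.reuters.com/world/example-article",
    "https://example-blog.net/post",
]))  # -> 0.5
```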
The groundedness evaluation requires GPUs capable of running the quantized Llama 3.3 70B model and an embedding model for retrieval.
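As a rough guide, the sketch below shows one way to load a 4-bit-quantized Llama 3.3 70B and a retrieval embedding model; the exact checkpoints, quantization scheme, and retrieval stack used by the scripts may differ:

```python
# Illustrative only: model IDs, 4-bit quantization, and the embedding model
# are assumptions, not necessarily what the evaluation scripts configure.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
llm = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",  # gated model; requires HF access
    quantization_config=quant_config,
    device_map="auto",  # spreads the ~40 GB 4-bit model across available GPUs
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

# Embedding model used to retrieve relevant passages from the cited sources.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
```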
We implement and compare three evaluation frameworks:
- Modified FActScore
- Modified VERIFY (FactBench)
- Our proposed method
Run them as follows:
```
# FActScore
python -m scripts.run_factscore

# VERIFY / FactBench
python -m scripts.run_factbench

# Our method
python -m scripts.our_method
```

To analyze the data for individual assistants, run the code in the Jupyter notebook: Groundedness Evaluation.ipynb
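At a high level, all three frameworks follow a similar claim-level pattern: split the assistant's response into atomic facts and check each one against the downloaded source content. The sketch below illustrates that pattern only; `is_supported` stands in for the LLM entailment judgment and is not part of the repository:

```python
# Simplified claim-level groundedness scoring; the actual scripts use the
# quantized Llama 3.3 70B as judge and embedding-based passage retrieval.
def groundedness_score(atomic_facts: list[str],
                       source_passages: list[str],
                       is_supported) -> float:
    """Fraction of atomic facts supported by at least one cited-source passage.

    `is_supported(fact, passage)` is a stand-in for an LLM entailment check.
    """
    if not atomic_facts:
        return 0.0
    supported = sum(
        any(is_supported(fact, passage) for passage in source_passages)
        for fact in atomic_facts
    )
    return supported / len(atomic_facts)
```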