Uneven Ground: Analyzing Representational and Referencing Biases in Wikidata by Country Income Level

Abstract

Although Wikidata plays a significant role in semantic web and AI technologies, it is not free of inherent biases. Our work expands upon existing critiques and evaluates differences in centrality, characterization, and evidence-groundedness between countries classified as high-, upper-middle-, lower-middle-, and low-income. Our findings indicate that higher-income countries, and individuals holding citizenship of those countries, are generally more centrally represented in Wikidata. Furthermore, the types of relations used differ, and statements about higher-income countries are more often backed by references. The types of references used also differ markedly across income levels. Our findings illuminate a previously underexplored facet of bias in Wikidata, adding to concerns regarding its current fitness as a backbone for semantic web and AI technologies meant to serve users across continents and countries. We hope to contribute to a better understanding of Wikidata's limitations and, in the long run, to foster awareness and improved representational fairness.

Project Structure and Data Description

This project analyzes the part of the Wikidata graph that covers countries and the person entities linked to those countries via citizenship. Below is an overview of the key directories and files:

📂 Directory Structure

data/
Raw data extracted from Wikidata and results.
data_analysis/
Graph-based analysis of the extracted data.
url_analysis/
Retrieval of URLs from Wikidata references and web scraping.

📄 Triple Files

Each file contains tuples in the format:
[subject, predicate, object, time]

data/triples_urls.tsv
Machine-readable format using Wikidata QIDs (e.g., Q42) and PIDs (e.g., P31), with URLs as references.
data/triples_values_urls.tsv
Human-readable values using Wikidata labels, with URLs as references.

These files were generated using sample_triples.py. They include the outgoing claims for all countries, excluding "P1549" (demonym).

Run the following command to sample data with human-readable values (Wikidata labels) and with URLs as references:

python sample_triples.py --use_values True --urls True
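
Once a triple file exists, it can be inspected with the standard library alone. The sketch below is illustrative rather than part of the repository: the helper names `iter_rows` and `count_predicates` are hypothetical, the sample rows are made up, and it assumes that the files have no header row, that the predicate sits in the second column (following the [subject, predicate, object, time] order above), and that the reference URL comes after the time field.

```python
import csv
from collections import Counter
from typing import Iterable, Iterator, List


def iter_rows(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each non-empty tab-separated row as a list of strings."""
    # QUOTE_NONE keeps quotation marks inside labels as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row


def count_predicates(rows: Iterable[List[str]]) -> Counter:
    """Count rows per predicate, assuming the predicate is in column 2."""
    return Counter(row[1] for row in rows if len(row) > 1)


# Made-up rows for illustration only; real rows come from data/triples_urls.tsv.
sample = [
    "Q183\tP36\tQ64\t\thttps://example.org/a",
    "Q183\tP30\tQ46\t\thttps://example.org/b",
    "Q142\tP36\tQ90\t\thttps://example.org/c",
]
print(count_predicates(iter_rows(sample)).most_common())

# To read the real file, pass an open file handle instead of the list:
#   with open("data/triples_urls.tsv", newline="", encoding="utf-8") as f:
#       counts = count_predicates(iter_rows(f))
```

If the files turn out to contain a header row, skip the first row before counting.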

🌎 Extracted Country and Person Entities

data/country_data.json.gz
All country entities in Wikidata, including their outgoing claims.
data/person_statements
All sampled person entities.
data/pid_labels.json and data/pid_labels_persons.json
Map PIDs to human-readable labels.
data/person_level_degrees
Degree stats for the analyzed person entities.
data/plots
Violin plots visualizing degree stats.
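
The compressed country file and the label maps can be loaded with Python's standard library. The following is a minimal sketch under stated assumptions: it assumes data/country_data.json.gz holds a single JSON document (not JSON Lines) and that data/pid_labels.json is a flat object mapping PID strings to labels. Neither assumption is confirmed by this README, and the helper names `load_json_gz` and `label_for` are hypothetical.

```python
import gzip
import json
from typing import Any, Dict


def load_json_gz(path: str) -> Any:
    """Read a gzip-compressed file that contains one JSON document."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def label_for(pid: str, labels: Dict[str, str]) -> str:
    """Return the label for a PID, falling back to the PID itself."""
    return labels.get(pid, pid)


# Example usage (paths as listed above; file structure is an assumption):
#   countries = load_json_gz("data/country_data.json.gz")
#   with open("data/pid_labels.json", encoding="utf-8") as f:
#       labels = json.load(f)
#   print(label_for("P31", labels))
```

Falling back to the raw PID keeps downstream output readable even when a property is missing from the label map.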
