Skip to content

semantic-systems/Uneven_Ground

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

13 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Uneven Ground: Analyzing Representational and Referencing Biases in Wikidata by Country Income Level

Abstract

Although Wikidata has been playing a significant role in semantic web and AI technologies, it is not exempt from limitations regarding its inherent biases. Our work expands upon existing critiques and evaluates differences in centrality, characterization, and evidence-groundedness between countries considered high-, upper-middle-, lower-middle-, and low-income. Our findings indicate that higher-income countries and individuals with respective citizenship are generally more centrally represented in Wikidata. Furthermore, the types of relations used differ, and references are more commonly used for statements for higher-income countries. Additionally, there are prominent differences in the types of references utilized across countries with different income levels. Our findings illuminate a previously underexplored facet of bias in Wikidata, adding to concerns regarding its current fitness as a backbone to semantic web and AI technologies meant to serve users across continents and countries. We hope to contribute to a better understanding of Wikidata's limitations to foster awareness and improved representational fairness in the long run.

Project Structure and Data Description

This project includes analysis of the Wikidata graph regarding countries and person entities related to these countries via citizenship. Below is an overview of the key directories and files:

πŸ“‚ Directory Structure

  • data/
    Raw data extracted from Wikidata and results.

  • data_analysis/
    Graph-based analysis of the extracted data.

  • url_analysis/
    Retrieval of URLs from Wikidata references and web scraping.


πŸ“„ Triple Files

Each file contains tuples in the format:
[subject, predicate, object, time]

  • data/triples_urls.tsv
    Machine-readable format using Wikidata QIDs (e.g., Q42) and PIDs (e.g., P31), and URL as refernce.

  • data/triples_values_urls.tsv
    Human-readable values using Wikidata labels and URL as reference.

These files were generated using sample_triples.py. Includes outgoing claims for all countries, excluding "P1549" (demonym).

Run following command to sample data with human-readable values using Wikidata labels and URLs provided as references.

python sample_triples.py --use_values True --urls True

🌎 Extracted Country and Person Entities

  • data/country_data.json.gz All country entities in Wikidata, including their outgoing claims.

  • data/person_statements All sampled person entities.

  • data/pid_labels.json and data/pid_labels_persons.json Maps PIDs to human-readable labels.

  • data/person_level_degrees Degree stats for the analyzed person entities.

  • data/plots Violin plots visualizing degree stats.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •