🧩 DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems

This is the official GitHub repository for the paper "DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems", accepted for publication at the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2025).

What is DataRec

DataRec is a Python library that focuses on the data management phase of recommendation systems. It aims to promote standardization, interoperability, and best practices for processing and analyzing recommendation datasets.

Features

  • Dataset Management: Supports reading and writing various data formats and allows dynamic format specification.
  • Reference Datasets: Includes commonly used recommendation datasets with traceable sources and versioning.
  • Filtering Strategies: Implements popular filtering techniques.
  • Splitting Strategies: Implements widely used data splitting strategies.
  • Data Characteristics Analysis: Enables computing data characteristics that impact recommendation performance.
  • Interoperability: Designed to be modular and compatible with existing recommendation frameworks by allowing dataset export in various formats.
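To make the filtering and splitting features above more concrete, here is a minimal, library-independent sketch of two commonly used strategies: k-core filtering and a temporal hold-out split. It does not use DataRec's API. The function names `k_core` and `temporal_holdout` and the `(user, item, rating, timestamp)` tuple layout are assumptions made only for illustration.

```python
from collections import Counter

# Each interaction is a (user, item, rating, timestamp) tuple.
# This layout is an assumption for illustration, not DataRec's representation.


def k_core(interactions, k):
    """Repeatedly drop users and items with fewer than k interactions."""
    current = list(interactions)
    while True:
        user_counts = Counter(row[0] for row in current)
        item_counts = Counter(row[1] for row in current)
        kept = [
            row for row in current
            if user_counts[row[0]] >= k and item_counts[row[1]] >= k
        ]
        # Stop once a full pass removes nothing.
        if len(kept) == len(current):
            return kept
        current = kept


def temporal_holdout(interactions, test_ratio=0.2):
    """Sort by timestamp and hold out the most recent fraction as test data."""
    ordered = sorted(interactions, key=lambda row: row[3])
    n_test = int(len(ordered) * test_ratio)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


toy = [
    ("u1", "i1", 5, 10),
    ("u1", "i2", 4, 20),
    ("u2", "i1", 3, 30),
    ("u2", "i2", 2, 40),
    ("u3", "i3", 1, 50),  # removed by the 2-core: u3 and i3 appear only once
]
core = k_core(toy, k=2)
train, test = temporal_holdout(core, test_ratio=0.25)
```

The filter loops until nothing changes because removing a sparse item can push a user below the threshold, and vice versa. The split keeps the most recent interactions for testing, which avoids training on data that comes after the test period.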

Installation guidelines

Please make sure the following is installed on your system:

  • Python 3.9.0 or later

First, clone this repository:

git clone https://github.com/sisinflab/DataRec.git

You can then create a virtual environment and install the dependencies from the requirements file included in the repository, as follows:

$ python3.9 -m venv venv
$ source venv/bin/activate
$ pip install --upgrade pip
$ pip install -r requirements.txt
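If you are unsure whether your interpreter satisfies the Python 3.9 requirement stated above, a small check like the following can be run with the interpreter you plan to use. This is only a convenience sketch. The helper name `meets_requirement` is hypothetical and not part of DataRec.

```python
import sys

MIN_VERSION = (3, 9)  # minimum Python version stated in this README


def meets_requirement(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    if meets_requirement():
        print(f"Python {found} satisfies the requirement.")
    else:
        print(f"Python {found} is too old; 3.9.0 or later is required.")
```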

Datasets

DataRec includes several commonly used recommendation datasets to facilitate reproducibility and standardization. These datasets have been carefully curated, with traceable sources and versioning information maintained whenever possible. For each dataset, DataRec provides metadata such as the number of users, items, and interactions, as well as data characteristics known to impact recommendation performance (e.g., sparsity and user/item distribution shifts). The dataset collection is continuously updated to include more recent and widely used datasets from the recommender systems literature. When the original data source is no longer available, DataRec includes the most recent and widely used version to ensure backward compatibility.
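As an illustration of the metadata mentioned above, the following sketch computes user, item, and interaction counts, density and sparsity, and the Gini coefficient of item popularity from a list of `(user, item)` pairs. It does not use DataRec's API. The function names are hypothetical, and the Gini coefficient is an assumed example of a distribution-related characteristic, not necessarily the one DataRec reports.

```python
from collections import Counter


def basic_characteristics(pairs):
    """Count users, items, and interactions, and derive density and sparsity."""
    unique_pairs = set(pairs)  # count each (user, item) pair once
    n_users = len({user for user, _ in unique_pairs})
    n_items = len({item for _, item in unique_pairs})
    n_interactions = len(unique_pairs)
    possible = n_users * n_items
    density = n_interactions / possible if possible else 0.0
    return {
        "n_users": n_users,
        "n_items": n_items,
        "n_interactions": n_interactions,
        "density": density,
        "sparsity": 1.0 - density,
    }


def gini(counts):
    """Gini coefficient of non-negative counts (0 means perfectly uniform)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


toy_pairs = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u3", "i1")]
stats = basic_characteristics(toy_pairs)
item_popularity = Counter(item for _, item in set(toy_pairs))
item_gini = gini(item_popularity.values())
```

Here sparsity is defined as one minus the fraction of observed user-item pairs among all possible pairs. A higher Gini value means interactions are concentrated on fewer items.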

The following datasets are currently included in DataRec:

| Dataset Name | Source |
| --- | --- |
| Alibaba iFashion | https://drive.google.com/drive/folders/1xFdx5xuNXHGsUVG2VIohFTXf9S7G5veq |
| Amazon Beauty | https://amazon-reviews-2023.github.io |
| Amazon Books | https://amazon-reviews-2023.github.io/ |
| Amazon Clothing | https://amazon-reviews-2023.github.io/ |
| Amazon Sports and Outdoors | https://amazon-reviews-2023.github.io/ |
| Amazon Toys and Games | https://amazon-reviews-2023.github.io/ |
| Amazon Video Games | https://amazon-reviews-2023.github.io/ |
| Ciao | https://guoguibing.github.io/librec/datasets.html |
| Epinions | https://snap.stanford.edu/data/soc-Epinions1.html |
| Gowalla | https://snap.stanford.edu/data/loc-gowalla.html |
| LastFM | https://grouplens.org/datasets/hetrec-2011/ |
| MovieLens | https://grouplens.org/datasets/movielens/ |
| Tmall | https://tianchi.aliyun.com/dataset/53?t=1716541860503 |
| Yelp | https://www.yelp.com/dataset |

Next Updates

  • ⏳ improving logger
  • ⏳ improving signatures
  • ⏳ documentation

Authors
