Data Mining Project

Data Mining project for the Master's Degree in Computer Science, University of Pisa, A.Y. 2024/2025.

Overview

This repository contains the code and resources for a Data Mining project on professional cycling competitions. The goal is to analyze race dynamics and cyclist attributes, using machine learning and statistical methods to discover patterns and predict performance outcomes.

The analysis uses two main datasets comprising professional cycling races spanning several years:

Cyclists Dataset: Contains detailed characteristics and statistics of individual cyclists.
Races Dataset: Contains race-specific information, such as length, total climb, profile difficulty (e.g., flat, mountainous), and the starting list quality.

The project follows the complete Data Mining pipeline, exploring everything from initial data cleaning and dimensionality reduction to advanced clustering, classification models, and explainability techniques.
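As a rough illustration of the first stage of such a pipeline, the sketch below imputes missing values with the median and flags outliers with Tukey's IQR rule. It is not taken from the project notebooks: the IQR rule is a simpler stand-in for the Isolation Forest mentioned under TASK1, and the function names and the `race_length_km` values are hypothetical.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def iqr_outlier_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Hypothetical race lengths in km (not real data); None marks a missing value.
race_length_km = [150.0, 162.5, None, 171.0, 158.0, 1650.0, 166.0]

cleaned = impute_median(race_length_km)
flags = iqr_outlier_flags(cleaned)

for value, is_outlier in zip(cleaned, flags):
    print(f"{value:8.2f}  outlier={is_outlier}")
```

Imputing before flagging keeps the two lists aligned with the original rows, so a flagged entry can be traced back to its record.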

Project Structure

The repository is structured around the 5 main tasks specified in the assignment, along with supporting directories for data and utility functions:

data/: Contains the raw and processed datasets (Cyclists and Races) used throughout the experiments. (Note: make sure the data is unzipped here before running the notebooks.)
src/: Contains a single utils.py file with general-purpose helper functions used across multiple notebooks to avoid code duplication.
TASK1/: Focuses on exploring data distributions, handling missing values, outlier detection (e.g., using Isolation Forest), feature engineering, and dimensionality reduction (PCA and UMAP).
TASK2/: Applies unsupervised learning algorithms such as K-Means, DBSCAN, and Hierarchical Clustering to group similar cyclists and race profiles.
TASK3/: Trains and evaluates predictive models to classify race outcomes or cyclist performances.
TASK4/: Extracts frequent itemsets and association rules to find interesting relationships between race features and cyclist attributes.
TASK5/: Uses explainable AI (XAI) techniques to interpret the decisions made by the classification models and understand the most impactful features.
report.pdf: The final comprehensive report detailing the methodology, experiments, and conclusions of the project.
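To make the pattern-mining task (TASK4) more concrete, here is a minimal sketch of how support, confidence, and lift are computed for an association rule. It assumes nothing about the project's actual code: the itemset enumeration is brute force rather than a tuned algorithm such as Apriori, and the transactions and item labels (`profile=mountain`, `age=young`, `top20=yes`, ...) are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force enumeration of itemsets with support >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            s = support(combo, transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    confidence = both / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return both, confidence, lift


# Hypothetical transactions: one per (cyclist, race) result, not real data.
transactions = [
    frozenset({"profile=mountain", "age=young", "top20=yes"}),
    frozenset({"profile=mountain", "age=young", "top20=yes"}),
    frozenset({"profile=flat", "age=young", "top20=no"}),
    frozenset({"profile=mountain", "age=senior", "top20=no"}),
]

frequent = frequent_itemsets(transactions, min_support=0.5)
metrics = rule_metrics(
    {"profile=mountain", "age=young"}, {"top20=yes"}, transactions
)
print(len(frequent), metrics)
```

A lift above 1 means the antecedent and consequent co-occur more often than they would if they were independent, which is the usual first filter for "interesting" rules.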

Authors

Simone Marzeddu
Nicola Emmolo
Jacopo Raffi
