This project is an interactive web application developed for clustering analysis, built using Streamlit. It enables users to upload datasets, preprocess data, apply various clustering algorithms, and visualize results through an intuitive interface. The app supports K-Means, Agglomerative Clustering (AGNES), DIANA (custom implementation), and DBSCAN, with plans to include K-Medoids in future updates. It is designed for both novice and experienced users, making clustering accessible for applications like market segmentation, user behavior analysis, and scientific research.
- Data Preprocessing:
  - Upload datasets in CSV format with preview functionality.
  - Clean data by dropping irrelevant columns.
  - Handle missing values using strategies like mean, median, or mode imputation.
  - Normalize data with StandardScaler or Min-Max scaling.
- Clustering Algorithms:
  - K-Means: Partition-based clustering with elbow curve for optimal cluster selection.
  - AGNES: Hierarchical agglomerative clustering with dendrogram visualization.
  - DIANA: Custom divisive hierarchical clustering implementation.
  - DBSCAN: Density-based clustering with manual or auto-tuned parameters (eps, min_samples).
  - K-Medoids: Currently commented out in code but planned for future integration.
- Visualizations:
  - 2D and 3D cluster plots using PCA or t-SNE for dimensionality reduction.
  - Dendrograms for hierarchical methods (AGNES, DIANA).
  - Elbow curve for K-Means to determine optimal cluster count.
- Evaluation Metrics:
  - Inertia for K-Means (and K-Medoids when implemented).
  - Cluster distribution summaries.
  - Planned addition of Silhouette Score for enhanced evaluation.
- Interactive Interface:
  - User-friendly Streamlit interface with dynamic updates.
  - Flexible column selection and algorithm parameter tuning.
  - Real-time feedback and suggested parameters for DBSCAN.
- Scalability and Flexibility:
  - Supports diverse datasets and use cases, from small to moderately large datasets.
  - Customizable configurations for clustering and visualization.
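
The preprocessing, elbow-curve, and PCA features listed above can be sketched outside the app with pandas and scikit-learn. This is a minimal illustration, not the code in `app.py`: the helper names `preprocess` and `elbow_inertias`, the toy DataFrame, and its column names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def preprocess(df, drop_cols=(), strategy="mean"):
    """Drop columns, impute missing values, and standardize the rest.

    `strategy` is passed to SimpleImputer; scikit-learn names mode
    imputation "most_frequent".
    """
    features = df.drop(columns=list(drop_cols))
    imputed = SimpleImputer(strategy=strategy).fit_transform(features)
    return StandardScaler().fit_transform(imputed)


def elbow_inertias(X, k_values):
    """Return the K-Means inertia for each candidate k (the elbow curve's y-values)."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in k_values
    ]


# Hypothetical toy data standing in for an uploaded CSV.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": np.arange(60),
    "x": np.concatenate([rng.normal(0, 0.3, 30), rng.normal(5, 0.3, 30)]),
    "y": np.concatenate([rng.normal(0, 0.3, 30), rng.normal(5, 0.3, 30)]),
    "z": rng.normal(0, 1, 60),
})
df.loc[3, "x"] = np.nan  # simulate a missing value

X = preprocess(df, drop_cols=["id"], strategy="mean")
inertias = elbow_inertias(X, range(1, 6))
coords_2d = PCA(n_components=2).fit_transform(X)  # coordinates for a 2D cluster plot
```

Plotting `inertias` against `k` gives the elbow curve; the bend suggests a reasonable cluster count. Swapping `StandardScaler` for `MinMaxScaler` would correspond to the Min-Max option.
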
- Clone the Repository:

      git clone https://github.com/your-username/InteractiveClusteringApp.git
      cd InteractiveClusteringApp

- Set Up a Virtual Environment (recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install Dependencies:

      pip install -r requirements.txt

  Note: The `scikit-learn-extra` dependency is included for K-Medoids, which is currently commented out in `app.py`. You can skip it if not using K-Medoids.

- Run the App:

      streamlit run app.py

- Open the app in your browser (Streamlit typically runs at http://localhost:8501).
- Upload a CSV dataset via the drag-and-drop interface.
- Preprocess data:
  - Select columns to drop or use for clustering.
  - Handle missing values (mean, median, mode, or drop).
  - Apply normalization (StandardScaler or Min-Max).
- Configure clustering:
  - Choose an algorithm (K-Means, AGNES, DIANA, DBSCAN).
  - Adjust parameters (e.g., number of clusters, DBSCAN's eps and min_samples).
- Visualize results:
  - View 2D/3D cluster plots, dendrograms, or elbow curves.
  - Explore cluster distributions and evaluation metrics.
- Export results as images or CSV files with cluster labels.
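
How the app derives its suggested DBSCAN parameters is not described here. One common heuristic is to look at each point's distance to its k-th nearest neighbor and pick `eps` from the upper end of that distribution. The sketch below uses a simple percentile as a stand-in for reading the knee of the k-distance curve, then attaches the labels to a table for CSV export. The function name `suggest_eps`, the percentile default, and the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def suggest_eps(X, min_samples=5, percentile=90):
    """Suggest eps as a percentile of the k-distance distribution.

    Querying the fitted data returns each point as its own nearest
    neighbor, so n_neighbors=min_samples matches DBSCAN's convention
    that min_samples counts the point itself.
    """
    nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
    distances, _ = nn.kneighbors(X)
    k_dist = distances[:, -1]  # distance to the farthest of the k neighbors
    return float(np.percentile(k_dist, percentile))


# Hypothetical, already-normalized data: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (40, 2)), rng.normal(4, 0.2, (40, 2))])

eps = suggest_eps(X, min_samples=5)
labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)  # -1 marks noise

# Attach cluster labels and serialize, as in the CSV export step.
result = pd.DataFrame(X, columns=["f1", "f2"]).assign(cluster=labels)
csv_text = result.to_csv(index=False)
```

A lower percentile gives a stricter `eps` and more points labeled as noise; a higher one risks merging nearby clusters.
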
- `app.py`: Main Streamlit application for the interactive clustering interface.
- `clustering_algorithms.py`: Custom implementation of the DIANA clustering algorithm.
- `utils/`:
  - `preprocessing.py`: Functions for handling missing data and normalization.
  - `visualization.py`: Functions for generating 2D/3D plots, dendrograms, and elbow curves.
  - `dbscan_helper.py`: Helper functions for tuning DBSCAN parameters.
- `requirements.txt`: Lists project dependencies.
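
For readers unfamiliar with DIANA, the textbook procedure can be sketched in a few lines of NumPy. This is an illustrative sketch of the general algorithm and not the project's code in `clustering_algorithms.py`: the function names `diana` and `split_cluster` and the choice of Euclidean distance are assumptions. It repeatedly splits the cluster with the largest diameter by seeding a splinter group with the most dissimilar point and moving over every point that is, on average, closer to the splinter group than to the remainder.

```python
import numpy as np


def split_cluster(D, members):
    """Split one cluster (list of row indices) into (remainder, splinter)."""
    members = list(members)
    sub = D[np.ix_(members, members)]
    # Seed the splinter group with the point farthest, on average, from the others.
    avg = sub.sum(axis=1) / (len(members) - 1)
    seed = int(np.argmax(avg))
    splinter = [seed]
    rest = [i for i in range(len(members)) if i != seed]
    moved = True
    while moved and len(rest) > 1:
        moved = False
        # Mean distance to the other remaining points minus mean distance to the splinter group.
        d_rest = sub[np.ix_(rest, rest)].sum(axis=1) / (len(rest) - 1)
        d_spl = sub[np.ix_(rest, splinter)].mean(axis=1)
        gain = d_rest - d_spl
        best = int(np.argmax(gain))
        if gain[best] > 0:  # the point is closer to the splinter group
            splinter.append(rest.pop(best))
            moved = True
    return [members[i] for i in rest], [members[i] for i in splinter]


def diana(X, n_clusters):
    """Divisive clustering: split the widest cluster until n_clusters remain."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise Euclidean distances
    clusters = [list(range(len(X)))]
    while len(clusters) < n_clusters:
        diameters = [D[np.ix_(c, c)].max() if len(c) > 1 else 0.0 for c in clusters]
        widest = int(np.argmax(diameters))
        if diameters[widest] == 0.0:
            break  # no cluster can be split further
        clusters.extend(split_cluster(D, clusters.pop(widest)))
    labels = np.empty(len(X), dtype=int)
    for label, cluster in enumerate(clusters):
        labels[cluster] = label
    return labels


# Hypothetical example: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(10, 0.3, (20, 2))])
labels = diana(X, n_clusters=2)
```

The full pairwise distance matrix makes this sketch quadratic in memory, so it is only suitable for small datasets.
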
- Add support for additional clustering algorithms (e.g., graph-based or deep clustering).
- Optimize performance for large datasets.
- Enhance visualizations with more interactive features (e.g., zoom, export formats).
- Lamara Abdeldjalil
- Taleb Youcef