This repository contains an advanced implementation of the Self-Organizing Map (SOM) algorithm for unsupervised learning and clustering tasks. The SOM algorithm is particularly useful for visualizing high-dimensional data, performing dimensionality reduction, and clustering. This implementation is inspired by the paper "A novel self-organizing map (SOM) learning algorithm with nearest and farthest neurons" by Chaudhary, Bhatia, and Ahlawat (2014) and includes several unique features to enhance performance and flexibility.
- Overview
- Features
- Installation
- Usage
- Advanced Use Cases
- Performance Optimization
- Evaluation Metrics
- Error Handling and Debugging
- Contribution Guidelines
- Licensing and Acknowledgments
- References
The main purpose of this SOM implementation is to provide an efficient and flexible tool for clustering and visualizing high-dimensional data. This implementation includes various enhancements, such as multiple initialization methods, distance metrics, and evaluation criteria, making it suitable for a wide range of applications, from data visualization to anomaly detection.
The Self-Organizing Map (SOM) is a type of artificial neural network trained using unsupervised learning to produce a low-dimensional (typically two-dimensional) representation of input data. It uses competitive learning to find the best matching unit (BMU) and updates the neighborhood of the BMU using a Gaussian function to preserve the topological properties of the input space.
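As a rough illustration of the competitive-learning step described above, the following NumPy sketch performs one update for a single sample. It is not taken from this repository's code: the function name `som_step`, its arguments, and the `(m, n, dim)` weight layout are assumptions made for the example.

```python
import numpy as np

def som_step(weights, x, learning_rate, radius):
    """One illustrative SOM update for a single sample x (hypothetical helper).

    weights: float array of shape (m, n, dim), updated in place.
    Returns the grid coordinates of the best matching unit (BMU).
    """
    m, n, _ = weights.shape
    # BMU: the neuron whose weight vector is closest to x (Euclidean distance).
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), (m, n))
    # Squared distance on the grid from every neuron to the BMU.
    rows, cols = np.indices((m, n))
    grid_d2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    # Gaussian neighborhood: 1 at the BMU, decaying with grid distance.
    h = np.exp(-grid_d2 / (2.0 * radius ** 2))
    # Pull each neuron toward x in proportion to its neighborhood value.
    weights += learning_rate * h[..., None] * (x - weights)
    return bmu

rng = np.random.default_rng(0)
weights = rng.random((10, 10, 3))
x = np.array([0.2, 0.5, 0.8])
before = np.linalg.norm(weights - x, axis=2)
bmu = som_step(weights, x, learning_rate=0.5, radius=1.0)
after = np.linalg.norm(weights - x, axis=2)
```

With a learning rate of 0.5, the BMU ends up half as far from the sample as it started, and neurons farther away on the grid move less.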
- Initialization Methods: Random, KDE, KMeans, KDE-KMeans, KMeans++, som++
- Distance Functions: Euclidean, Cosine
- Evaluation Metrics: Silhouette score, Davies-Bouldin index, Calinski-Harabasz score, Dunn index
- Multiprocessing Support: Leveraging joblib for parallel processing to accelerate training
- Customizability: Allows for customization of initialization methods, distance functions, learning rate, neighborhood functions, and more
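To show why the choice of distance function matters, here is a small sketch comparing which neuron wins under Euclidean and cosine distance. The helper names and the toy neuron values are assumptions for illustration, and the library's own implementations may differ in detail.

```python
import numpy as np

def euclidean_distance(x, neurons):
    # Straight-line distance from x to each neuron (one row per neuron).
    return np.linalg.norm(neurons - x, axis=1)

def cosine_distance(x, neurons):
    # 1 - cosine similarity; eps guards against division by a zero norm.
    eps = 1e-12
    norms = np.linalg.norm(neurons, axis=1) * np.linalg.norm(x)
    return 1.0 - (neurons @ x) / (norms + eps)

# Toy neurons chosen so the two metrics disagree.
neurons = np.array([[1.0, 0.0], [10.0, 10.0], [0.0, 4.0]])
x = np.array([2.0, 2.0])

bmu_euclidean = int(np.argmin(euclidean_distance(x, neurons)))
bmu_cosine = int(np.argmin(cosine_distance(x, neurons)))
print(bmu_euclidean, bmu_cosine)
```

Euclidean distance picks the neuron nearest in position, while cosine distance picks the neuron pointing in the same direction regardless of magnitude, so cosine is usually the better fit when only the relative proportions of features matter.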
Clone the repository and install the required dependencies:
git clone https://github.com/Evintkoo/SOM_plus_clustering.git
cd SOM_plus_clustering
pip install -r requirements.txt
- Python 3.7 or higher
- Libraries: numpy, joblib, matplotlib, scipy (for KDE), and other dependencies listed in requirements.txt.
from som import SOM
som = SOM(
m=10,
n=10,
dim=3,
initiate_method='random',
learning_rate=0.5,
neighbour_rad=1.0,
distance_function='euclidean',
max_iter=1000
)
import numpy as np
data = np.random.random((100, 3)) # Example data
som.fit(x=data, epoch=100, shuffle=True, verbose=True)
labels = som.predict(data)
print(labels)
This implementation uses CuPy to perform computation on the GPU. If cupy is not installed, imports will fail. Install a CUDA-compatible CuPy wheel (see Installation) to use the GPU path.
silhouette_score = som.evaluate(data, method=['silhouette'])
print(silhouette_score)
all_scores = som.evaluate(data, method=['all'])
print(all_scores)
To visualize the trained SOM, you can use Python libraries like matplotlib:
import matplotlib.pyplot as plt
# Visualize the neurons
plt.imshow(som.cluster_center_.reshape(som.m, som.n, som.dim))
plt.title('Self-Organizing Map Neurons')
plt.show()
- Anomaly Detection: Use SOM to identify anomalies in time series data or financial transactions by detecting clusters that differ significantly from the norm.
- Customer Segmentation: Segment customers based on purchasing patterns, demographics, or behavior data.
- Dimensionality Reduction: Reduce high-dimensional data into a lower-dimensional space while preserving its topological properties.
- Integration with Machine Learning Tools: Use the SOM output as features for downstream machine learning tasks, such as classification or regression.
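For the anomaly detection use case, one common approach (not necessarily the one this library takes) is to score each sample by its distance to the nearest neuron, often called the quantization error, and flag samples whose score is unusually large. The sketch below uses a hand-written `neurons` array as a stand-in for trained weights, and the `threshold` value is a hypothetical cut-off that would need tuning on real data.

```python
import numpy as np

def quantization_errors(data, neurons):
    # Distance from every sample to every neuron, then keep the smallest.
    d = np.linalg.norm(data[:, None, :] - neurons[None, :, :], axis=2)
    return d.min(axis=1)

# Stand-in for trained SOM weights, flattened to (num_neurons, dim).
neurons = np.array([[0.0, 0.0], [1.0, 1.0]])
data = np.array([[0.1, 0.0], [0.9, 1.0], [5.0, 5.0]])

errors = quantization_errors(data, neurons)
threshold = 1.0  # hypothetical cut-off; choose from the error distribution
is_anomaly = errors > threshold
print(is_anomaly)
```

In practice the threshold is often set from a high percentile of the errors observed on data known to be normal.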
- Keep data and model on the GPU to avoid host-device transfers. This code uses CuPy end-to-end during training and prediction.
- Data Preprocessing: Normalize input data to ensure faster convergence and better clustering performance.
- Use the benchmarking script to get a quick idea of throughput:
python bench_som.py
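The preprocessing tip above can be sketched with per-feature min-max scaling. This is a minimal NumPy example under the assumption that the data fits in host memory as a 2-D array; `min_max_scale` is a hypothetical helper, not part of this library.

```python
import numpy as np

def min_max_scale(data):
    """Scale each feature (column) to the range [0, 1]."""
    lo = data.min(axis=0)
    span = data.max(axis=0) - lo
    # Constant columns have zero span; map them to 0 instead of dividing by 0.
    span[span == 0] = 1.0
    return (data - lo) / span

raw = np.array([
    [1.0, 200.0, 5.0],
    [2.0, 400.0, 5.0],
    [3.0, 800.0, 5.0],
])
scaled = min_max_scale(raw)
print(scaled)
```

Without scaling, the second column would dominate Euclidean distances simply because its values are larger.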
- Silhouette Score: Measures how similar each point is to its own cluster compared to other clusters.
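- Davies-Bouldin Index: Averages, over all clusters, the similarity between each cluster and the cluster most similar to it, where similarity is the ratio of within-cluster scatter to between-cluster separation. Lower values indicate better clustering.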
- Calinski-Harabasz Score: Evaluates the ratio of between-cluster variance to within-cluster variance.
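- Dunn Index: Commonly defined as the ratio of the smallest distance between points in different clusters to the largest within-cluster diameter. Higher values indicate compact, well-separated clusters.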
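As a worked example of one of these metrics, the sketch below computes the Dunn index under its common definition (minimum inter-cluster point distance divided by maximum cluster diameter). It is an independent illustration using SciPy's `cdist`; the `dunn_index` function is hypothetical, and `som.evaluate` may compute the score differently.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(data, labels):
    """Dunn index: min inter-cluster distance / max intra-cluster diameter.

    Assumes at least two clusters and at least one cluster with two
    distinct points (otherwise the diameter is zero).
    """
    clusters = [data[labels == k] for k in np.unique(labels)]
    # Largest pairwise distance inside any single cluster.
    max_diameter = max(cdist(c, c).max() for c in clusters)
    # Smallest pairwise distance between points of two different clusters.
    min_separation = min(
        cdist(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_separation / max_diameter

data = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])
score = dunn_index(data, labels)
print(score)
```

Here each cluster has diameter 1 and the clusters are 5 apart, so the index is 5. The pairwise computation is quadratic in the number of points, so it is only practical for modest data sizes.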
- Common Errors:
- ValueError: Raised when an invalid parameter is provided. Check your inputs against the valid options listed in the documentation.
- RuntimeError: Thrown if the SOM is used before fitting the data.
- Dimension Mismatch: Ensure that the input data dimensions match the expected dimensions specified during SOM initialization.
- Debugging Tips:
- Use verbose mode (verbose=True) during training to see progress and intermediate results.
- Check input data for NaN or infinite values, which may cause unexpected behavior.
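The dimension and NaN/infinity checks mentioned above can be run before calling fit. The following is a minimal sketch; `check_input` and its `expected_dim` parameter are hypothetical names, and the library may perform its own validation with different messages.

```python
import numpy as np

def check_input(data, expected_dim):
    """Validate data before training; raises ValueError on bad input."""
    data = np.asarray(data, dtype=float)
    if data.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {data.ndim} dimension(s)")
    if data.shape[1] != expected_dim:
        raise ValueError(
            f"expected {expected_dim} features per sample, got {data.shape[1]}"
        )
    if not np.isfinite(data).all():
        raise ValueError("input contains NaN or infinite values")
    return data

clean = check_input(np.ones((4, 3)), expected_dim=3)

# A NaN in the input should be rejected before it reaches training.
try:
    check_input(np.array([[1.0, np.nan, 2.0]]), expected_dim=3)
    nan_rejected = False
except ValueError:
    nan_rejected = True
```

Passing the same value for `expected_dim` as the `dim` argument used when constructing the SOM catches dimension mismatches early.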
We welcome contributions from the community! Please follow these steps to contribute:
- Fork the repository.
- Create a new branch (git checkout -b feature-branch).
- Make your changes and commit them (git commit -am 'Add new feature').
- Push to the branch (git push origin feature-branch).
- Open a Pull Request and describe the changes you made.
This project is licensed under the MIT License. See the LICENSE file for more details.
- This implementation is inspired by the paper: Chaudhary, V., Bhatia, R. S., & Ahlawat, A. K. (2014). "A novel self-organizing map (SOM) learning algorithm with nearest and farthest neurons." Alexandria Engineering Journal, 53(4), 827-831.
- Chaudhary, V., Bhatia, R. S., & Ahlawat, A. K. (2014). "A novel self-organizing map (SOM) learning algorithm with nearest and farthest neurons." Alexandria Engineering Journal, 53(4), 827-831.