Clustering at Massive Scale (ClaMS)

ClaMS is a distributed memory HPC clustering tool inspired by HDBSCAN. The goal of this work is to provide a tool for clustering datasets with billions of points in general metric spaces for users with access to HPC systems. The general approach taken is to take the standard HDBSCAN algorithm and swap components that do not scale or apply to non-Euclidean data for scalable primitives. This often requires resorting to algorithms that are approximations of what is done in HDBSCAN, sometimes without approximation guaranteees.

Build

git clone https://github.com/llnl/clams.git
cd clams
mkdir build
cd build

cmake -DCMAKE_BUILD_TYPE=Release ..
make -j

# If one runs into latomic link issue with Boost >= v1.87,
# try the following cmake option:
# -DBOOST_UUID_LINK_LIBATOMIC=OFF

# Optional
# Install the Python dependencies
# Python dependencies are used for clustering metric evaluation (non YGM-based),
# generating demo datasets, or benchmarking HDBSCAN.
# If one does not need these functionalities, the Python dependencies are not required.
python3 -m venv ./venv
source ./venv/bin/activate
pip install scikit-learn hdbscan numpy pandas

Demo Datasets

The dataset directory contains scripts to achieve some famous datasets.

Run

Currently, the benchmark scripts are designed for running jobs on slurm-based (srun and sbatch) HPC systems.

Run specifying dataset

cd clams/build
# Activate the Python virtual environment to use the Python dependencies, if needed
source ./venv/bin/activate

# Generate the shell script
# Use '-h' option to see the help
# '-S' option to submit a sbatch job using a generated batch script
# Shows the path to the generated script at the end

## Without '-y' option, uses Python script for clustering metric evaluation
python3 ../script/benchmark/run_clams_bench.py -p point_features.txt -g ground_truth_labels.txt -S

## To instead use YGM script for clustering metric evaluation, use '-y' option
## In this case, the labels need to be in a file with each row in the form "point_id label"
python3 ../script/benchmark/run_clams_bench.py -p point_features.txt -g ground_truth_labels_with_point_ids.txt -y -S

# To submit a job manually using the generated script
sbatch /path/to/generated_batch_script.sh

# Use a demo dataset
pushd ../dataset/mnist
python3 ./download_mnist.py
popd
python3 ../script/benchmark/run_clams_bench.py -S -p ../dataset/mnist/points.txt -g ../dataset/mnist/labels.txt

Running HDBSCAN

cd clams/build
source ./venv/bin/activate

# In clams/build
# -m: min cluster size
# -s: min samples
python3 ../script/benchmark/hdbscan/run_hdbscan.py -m 10 -s 5 -p ../dataset/fashion-mnist/points.txt -g ../dataset/fashion-mnist/labels.txt

License

This project is licensed under the BSD-Commercial license – see the LICENSE file for details.

Release

LLNL-CODE-2015147

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
cmake		cmake
dataset		dataset
script		script
src		src
test		test
.clang-format		.clang-format
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Clustering at Massive Scale (ClaMS)

Build

Demo Datasets

Run

Run specifying dataset

Running HDBSCAN

License

Release

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Clustering at Massive Scale (ClaMS)

Build

Demo Datasets

Run

Run specifying dataset

Running HDBSCAN

License

Release

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages