A modular and extensible framework for coreset selection algorithms, designed for remote sensing and other machine learning applications. This framework provides a clean interface for implementing and comparing different coreset selection strategies across various datasets.
- Modular Design: Clean separation between datasets, selection strategies, and evaluation
- Extensible: Easy to add new datasets, strategies, and evaluators
- Configuration-Driven: Support for YAML/JSON configuration files
- Multiple Strategies: Built-in implementations of common coreset selection methods
- Evaluation Tools: Comprehensive metrics for assessing coreset quality
- Type-Safe: Fully typed with Python type hints
- CLI Tool: One-command coreset creation from .npy/.npz or synthetic data
- Save Coreset Arrays: Optionally write the selected features (and labels) to an .npz
```bash
git clone https://github.com/MinasMayth/coreset_selection.git
cd coreset_selection
pip install -e .
```

To install only the runtime dependencies:

```bash
pip install -r requirements.txt
```

```python
import numpy as np

from coreset_selection.datasets.numpy_dataset import NumpyDataset
from coreset_selection.strategies.k_center import KCenterGreedy
from coreset_selection.evaluators.diversity_evaluator import DiversityEvaluator

# Create or load your data
data = np.random.randn(1000, 10)
labels = np.random.randint(0, 5, 1000)

# Create dataset
dataset = NumpyDataset(data=data, labels=labels)
dataset.load()

# Select a coreset using the K-Center greedy algorithm
strategy = KCenterGreedy(config={'coreset_size': 100})
coreset_data, indices, coreset_labels = strategy.get_coreset(
    data=dataset.get_data(),
    labels=dataset.get_labels()
)

# Evaluate coreset quality
evaluator = DiversityEvaluator()
metrics = evaluator.evaluate(
    full_data=dataset.get_data(),
    coreset_data=coreset_data,
    full_labels=dataset.get_labels(),
    coreset_labels=coreset_labels
)
print(metrics)
```

Create a coreset directly from a features array (.npy/.npz) without writing code. The CLI can also generate a synthetic dataset for a quick smoke test and save the selected coreset arrays.
Examples:

```bash
# 1) Quick demo on synthetic data (select 10% at random and save arrays)
python scripts/make_coreset.py \
    --demo-synthetic \
    --strategy random \
    --coreset-ratio 0.1 \
    --save-coreset --coreset-out out/coreset.npz \
    --out out/indices.npy --print-json

# 2) KMeans-based selection from an existing features file
#    (features.npz must contain the key 'features', or be a .npy array)
python scripts/make_coreset.py \
    --features path/to/features.npz \
    --strategy kmeans --num-clusters 8 \
    --coreset-size 500 \
    --save-coreset --coreset-out out/coreset_kmeans.npz \
    --out out/indices_kmeans.npy

# 3) Label-aware selection (requires a labels array)
python scripts/make_coreset.py \
    --features path/to/features.npy \
    --labels path/to/labels.npy \
    --strategy by_labels --num-clusters 5 \
    --coreset-size 200 \
    --save-coreset --coreset-out out/coreset_bylabels.npz
```

Available strategies:

Core strategies (recommended):

- `random` - Random sampling baseline (always compare against this!)
- `kmeans` - K-Means clustering with flexible sampling (general-purpose)
- `fd` / `feature_diversity` - Feature diversity with automatic cluster detection
- `ld` / `label_difficulty` - Label difficulty for classification (handles imbalance)
- `crest` - Gradient-based importance (strong theoretical foundation)

Segmentation-only (require pixel-level masks):

- `lc` / `label_complexity` - Label complexity via entropy
- `cb` / `class_balance` - Class balance optimization
- `lc_fd_hybrid` / `hybrid_lc_fd` - Combined LC+FD

Image-specific (require raw images):

- `fa` / `feature_activation` - ResNet activation-based selection

Deprecated:

- `by_labels` - Use `ld` instead (backward compatible)
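As an illustration of the clustering-based strategies above, the core of k-means-style selection fits in a short NumPy sketch: cluster the features, then keep the real sample nearest each centroid. This is a standalone sketch of the general technique, not the package's `kmeans` implementation:

```python
import numpy as np

def kmeans_select(X, k, iters=20, seed=0):
    """Run a small k-means, then return the indices of the sample
    closest to each centroid (duplicates removed)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (skip empty clusters)
        for j in range(k):
            members = X[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    # The nearest real sample to each centroid becomes a coreset point
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return np.unique(dists.argmin(axis=0))

X = np.random.default_rng(1).standard_normal((500, 8))
idx = kmeans_select(X, k=20)
coreset = X[idx]
```

Note that `np.unique` can return fewer than `k` indices when two centroids share a nearest sample; the package's implementation may handle this differently.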
See STRATEGIES.md for detailed documentation of each strategy.
```python
from coreset_selection.utils.config import ConfigLoader

# Load configuration
config = ConfigLoader.load('configs/my_experiment.yaml')

# Use config to initialize components
strategy = KCenterGreedy(config=config['strategy'])
```

The framework is organized into four main components:
Base interface for data loading and management. Implement `BaseDataset` to add new data sources:

```python
from coreset_selection.datasets.base import BaseDataset

class MyCustomDataset(BaseDataset):
    def load(self):
        # Load your data
        self._data = ...
        self._labels = ...

    def get_data(self):
        return self._data

    def get_labels(self):
        return self._labels
```

Coreset selection algorithms. Implement `BaseCoresetStrategy` to add new methods:

```python
from coreset_selection.strategies.base import BaseCoresetStrategy

class MyCustomStrategy(BaseCoresetStrategy):
    def select(self, data, labels=None):
        # Implement your selection logic
        indices = ...
        return indices
```

Built-in strategies:

- `RandomSampling`: Random subset selection
- `KCenterGreedy`: K-Center greedy algorithm for maximum coverage
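The idea behind K-Center greedy selection can be sketched in a few lines of NumPy: start from an arbitrary point, then repeatedly add the point farthest from everything selected so far. This is a standalone illustration of the algorithm, not the package's `KCenterGreedy` code:

```python
import numpy as np

def k_center_greedy(X, k, seed=0):
    """Greedy 2-approximation to the k-center problem: each step adds
    the point with the largest distance to its nearest selected point."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]
    # min_dist[i] = distance from X[i] to the closest selected point
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(min_dist.argmax())  # farthest point from the current coreset
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)

X = np.random.default_rng(0).standard_normal((300, 5))
idx = k_center_greedy(X, k=25)
```

Keeping the running `min_dist` array makes each step O(n·d) rather than recomputing all pairwise distances, which is what makes the greedy algorithm practical for large feature sets.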
Quality assessment tools. Implement `BaseEvaluator` to add new metrics:

```python
from coreset_selection.evaluators.base import BaseEvaluator

class MyCustomEvaluator(BaseEvaluator):
    def evaluate(self, full_data, coreset_data, **kwargs):
        # Compute your metrics
        metrics = {...}
        return metrics
```

Built-in evaluators:

- `DiversityEvaluator`: Measures coverage, diversity, and representation quality
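Coverage and diversity metrics of this kind are cheap to compute by hand. The sketch below shows two common ones (coverage radius and mean pairwise distance); they are illustrative examples, not necessarily the exact quantities `DiversityEvaluator` reports:

```python
import numpy as np

def coverage_radius(full, coreset):
    """Max distance from any full-set point to its nearest coreset point
    (lower = better coverage)."""
    d = np.linalg.norm(full[:, None, :] - coreset[None, :, :], axis=2)
    return float(d.min(axis=1).max())

def mean_pairwise_distance(coreset):
    """Average distance between coreset points (higher = more spread out)."""
    d = np.linalg.norm(coreset[:, None, :] - coreset[None, :, :], axis=2)
    n = len(coreset)
    return float(d.sum() / (n * (n - 1)))  # exclude the zero diagonal

rng = np.random.default_rng(0)
full = rng.standard_normal((400, 8))
coreset = full[rng.choice(400, size=40, replace=False)]
metrics = {
    "coverage_radius": coverage_radius(full, coreset),
    "mean_pairwise_distance": mean_pairwise_distance(coreset),
}
```

Both use full pairwise-distance broadcasting for clarity; for very large arrays a chunked or tree-based nearest-neighbour computation would be preferable.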
Helper functions for configuration management, logging, and (optional) feature extraction:

- `ConfigLoader`: Load/save YAML and JSON configurations
- `setup_logger`: Configure logging for experiments
- `extract_features` (optional): Stream model features to HDF5 and read back efficiently

Optional install for the feature extraction utilities:

```bash
pip install -e ".[features]"
```

See the examples/ directory for complete usage examples:

- `basic_usage.py`: Simple example with synthetic data
- `config_usage.py`: Configuration-driven workflow

Run the examples:

```bash
python examples/basic_usage.py
python examples/config_usage.py
```

```
coreset_selection/
├── coreset_selection/           # Main package
│   ├── __init__.py
│   ├── datasets/                # Dataset implementations
│   │   ├── base.py              # Base dataset interface
│   │   └── numpy_dataset.py     # NumPy dataset implementation
│   ├── strategies/              # Coreset selection strategies
│   │   ├── base.py              # Base strategy interface
│   │   ├── random_sampling.py
│   │   └── k_center.py
│   ├── evaluators/              # Evaluation metrics
│   │   ├── base.py              # Base evaluator interface
│   │   └── diversity_evaluator.py
│   └── utils/                   # Utility functions
│       ├── config.py            # Configuration management
│       ├── logger.py            # Logging utilities
│       └── extract_features.py  # (optional) Stream features to HDF5
├── scripts/                     # CLI utilities
│   └── make_coreset.py          # Create coreset indices and optionally save arrays
├── examples/                    # Usage examples
├── configs/                     # Configuration files
├── tests/                       # Unit tests
├── requirements.txt             # Dependencies
├── setup.py                     # Package setup
├── pyproject.toml               # Modern Python packaging
└── README.md                    # This file
```
To add a new dataset:

- Create a new file in `coreset_selection/datasets/`
- Inherit from `BaseDataset`
- Implement `load()`, `get_data()`, and `get_labels()`

To add a new strategy:

- Create a new file in `coreset_selection/strategies/`
- Inherit from `BaseCoresetStrategy`
- Implement the `select()` method

To add a new evaluator:

- Create a new file in `coreset_selection/evaluators/`
- Inherit from `BaseEvaluator`
- Implement the `evaluate()` method
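All three extension points follow the same subclass-and-override shape. The self-contained sketch below mimics that shape with a stand-in base class (the real `BaseCoresetStrategy` lives in the package and may define more hooks than shown here):

```python
import numpy as np

# Stand-in for the package's base class, just to show the pattern
class BaseCoresetStrategy:
    def __init__(self, config=None):
        self.config = config or {}

    def select(self, data, labels=None):
        raise NotImplementedError

# A toy strategy: keep every n-th sample
class EveryNthStrategy(BaseCoresetStrategy):
    def select(self, data, labels=None):
        step = self.config.get('step', 10)
        return np.arange(0, len(data), step)

data = np.random.default_rng(0).standard_normal((100, 4))
strategy = EveryNthStrategy(config={'step': 5})
indices = strategy.select(data)
coreset = data[indices]
```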
Example configuration file (configs/example_config.yaml):

```yaml
random_state: 42

data:
  n_samples: 1000
  n_features: 10
  n_classes: 5

strategy:
  name: kcenter
  params:
    coreset_size: 100
    metric: euclidean
    random_state: 42

evaluate: true
```

Install the development dependencies:

```bash
pip install -e ".[dev]"
```
Run tests:

```bash
pytest tests/
```

Format and lint:

```bash
black coreset_selection/
flake8 coreset_selection/
```

MIT License - see the LICENSE file for details.
If you use this framework in your research, please cite:
```bibtex
@software{coreset_selection,
  author = {Minas Mayth},
  title = {Coreset Selection Framework},
  year = {2025},
  url = {https://github.com/MinasMayth/coreset_selection}
}
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Contact
For questions and support, please open an issue on GitHub.