For my work on remote sensing coreset selection as part of my PhD

Coreset Selection Framework

A modular and extensible framework for coreset selection algorithms, designed for remote sensing and other machine learning applications. This framework provides a clean interface for implementing and comparing different coreset selection strategies across various datasets.

Features

  • Modular Design: Clean separation between datasets, selection strategies, and evaluation
  • Extensible: Easy to add new datasets, strategies, and evaluators
  • Configuration-Driven: Support for YAML/JSON configuration files
  • Multiple Strategies: Built-in implementations of common coreset selection methods
  • Evaluation Tools: Comprehensive metrics for assessing coreset quality
  • Type-Safe: Fully typed with Python type hints
  • CLI Tool: One-command coreset creation from .npy/.npz or synthetic data
  • Save Coreset Arrays: Optionally write the selected features (and labels) to an .npz

Installation

From Source

git clone https://github.com/MinasMayth/coreset_selection.git
cd coreset_selection
pip install -e .

Dependencies

pip install -r requirements.txt

Quick Start

Basic Usage

import numpy as np
from coreset_selection.datasets.numpy_dataset import NumpyDataset
from coreset_selection.strategies.k_center import KCenterGreedy
from coreset_selection.evaluators.diversity_evaluator import DiversityEvaluator

# Create or load your data
data = np.random.randn(1000, 10)
labels = np.random.randint(0, 5, 1000)

# Create dataset
dataset = NumpyDataset(data=data, labels=labels)
dataset.load()

# Select coreset using K-Center algorithm
strategy = KCenterGreedy(config={'coreset_size': 100})
coreset_data, indices, coreset_labels = strategy.get_coreset(
    data=dataset.get_data(),
    labels=dataset.get_labels()
)

# Evaluate coreset quality
evaluator = DiversityEvaluator()
metrics = evaluator.evaluate(
    full_data=dataset.get_data(),
    coreset_data=coreset_data,
    full_labels=dataset.get_labels(),
    coreset_labels=coreset_labels
)
print(metrics)

Command-line coreset selection (NumPy arrays)

Create a coreset directly from a features array (.npy/.npz) without writing code. The CLI can also generate a synthetic dataset for a quick smoke test and save the selected coreset arrays.

Examples:

# 1) Quick demo on synthetic data (select 10% at random and save arrays)
python scripts/make_coreset.py \
    --demo-synthetic \
    --strategy random \
    --coreset-ratio 0.1 \
    --save-coreset --coreset-out out/coreset.npz \
    --out out/indices.npy --print-json

# 2) KMeans-based selection from an existing features file
#    features.npz must contain key 'features' or be a .npy array
python scripts/make_coreset.py \
    --features path/to/features.npz \
    --strategy kmeans --num-clusters 8 \
    --coreset-size 500 \
    --save-coreset --coreset-out out/coreset_kmeans.npz \
    --out out/indices_kmeans.npy

# 3) Label-aware selection (requires labels array)
python scripts/make_coreset.py \
    --features path/to/features.npy \
    --labels path/to/labels.npy \
    --strategy by_labels --num-clusters 5 \
    --coreset-size 200 \
    --save-coreset --coreset-out out/coreset_bylabels.npz

Available strategies:

Core Strategies (recommended):

  • random - Random sampling baseline (always compare against this!)
  • kmeans - K-Means clustering with flexible sampling (general-purpose)
  • fd / featurediversity - Feature diversity with automatic cluster detection
  • ld / labeldifficulty - Label difficulty for classification (handles imbalance)
  • crest - Gradient-based importance (strong theoretical foundation)

Segmentation-Only (require pixel-level masks):

  • lc / labelcomplexity - Label complexity via entropy
  • cb / classbalance - Class balance optimization
  • lcfdhybrid / hybrid_lc_fd - Combined LC+FD

Image-Specific (require raw images):

  • fa / featureactivation - ResNet activation-based selection

Deprecated:

  • bylabels - Use ld instead (backward compatible)

See STRATEGIES.md for detailed documentation of each strategy.

Using Configuration Files

from coreset_selection.utils.config import ConfigLoader

# Load configuration
config = ConfigLoader.load('configs/my_experiment.yaml')

# Use config to initialize components
strategy = KCenterGreedy(config=config['strategy'])

Architecture

The framework is organized into four main components:

1. Datasets (coreset_selection.datasets)

Base interface for data loading and management. Subclass BaseDataset to add new data sources:

from coreset_selection.datasets.base import BaseDataset

class MyCustomDataset(BaseDataset):
    def load(self):
        # Load your data
        self._data = ...
        self._labels = ...
    
    def get_data(self):
        return self._data
    
    def get_labels(self):
        return self._labels

2. Strategies (coreset_selection.strategies)

Coreset selection algorithms. Subclass BaseCoresetStrategy to add new methods:

from coreset_selection.strategies.base import BaseCoresetStrategy

class MyCustomStrategy(BaseCoresetStrategy):
    def select(self, data, labels=None):
        # Implement your selection logic
        indices = ...
        return indices
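
As a concrete illustration of what a select() body might look like, here is a standalone NumPy sketch of uniform random selection (illustrative only; the framework's RandomSampling class is the reference implementation, and its exact signature may differ):

```python
import numpy as np

def select_random(data, coreset_size, random_state=0):
    # Uniform random baseline: draw coreset_size distinct row indices.
    rng = np.random.default_rng(random_state)
    return rng.choice(len(data), size=coreset_size, replace=False)

data = np.random.randn(200, 16)
indices = select_random(data, 20)
coreset = data[indices]
```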

Built-in strategies:

  • RandomSampling: Random subset selection
  • KCenterGreedy: K-Center greedy algorithm for maximum coverage
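
KCenterGreedy follows the classic greedy k-center scheme: start from one point, then repeatedly add the point farthest from the current selection, which maximizes coverage of the feature space. A standalone NumPy sketch of the idea (not the framework's actual implementation):

```python
import numpy as np

def k_center_greedy(data, coreset_size, seed=0):
    """Greedy k-center sketch: repeatedly pick the point farthest
    from the current selection to maximize coverage."""
    rng = np.random.default_rng(seed)
    n = len(data)
    selected = [int(rng.integers(n))]
    # Distance from every point to its nearest selected center so far.
    dists = np.linalg.norm(data - data[selected[0]], axis=1)
    while len(selected) < coreset_size:
        farthest = int(np.argmax(dists))
        selected.append(farthest)
        # Update nearest-center distances given the newly added point.
        dists = np.minimum(dists, np.linalg.norm(data - data[farthest], axis=1))
    return np.array(selected)

points = np.random.default_rng(1).standard_normal((200, 10))
chosen = k_center_greedy(points, 20, seed=1)
```

Each iteration costs O(n·d), so the whole selection is O(n·k·d), which is why k-center greedy scales well to large feature matrices.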

3. Evaluators (coreset_selection.evaluators)

Quality assessment tools. Subclass BaseEvaluator to add new metrics:

from coreset_selection.evaluators.base import BaseEvaluator

class MyCustomEvaluator(BaseEvaluator):
    def evaluate(self, full_data, coreset_data, **kwargs):
        # Compute your metrics
        metrics = {...}
        return metrics

Built-in evaluators:

  • DiversityEvaluator: Measures coverage, diversity, and representation quality
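
To give a flavor of the kind of coverage metric an evaluator might report, here is a standalone coverage-radius sketch (illustrative; DiversityEvaluator's actual metrics and names may differ):

```python
import numpy as np

def coverage_radius(full_data, coreset_data):
    # For each point in the full set, find the distance to its nearest
    # coreset point; the coverage radius is the worst case over all points.
    # Smaller values mean the coreset covers the full set more tightly.
    diffs = full_data[:, None, :] - coreset_data[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return float(dists.min(axis=1).max())

X = np.arange(10, dtype=float).reshape(-1, 1)
r_full = coverage_radius(X, X)          # a coreset of everything covers perfectly
r_ends = coverage_radius(X, X[[0, 9]])  # only the two endpoints as the coreset
```

Note the pairwise distance matrix is O(n·m) in memory; for large datasets a chunked or tree-based nearest-neighbor query would be preferable.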

4. Utilities (coreset_selection.utils)

Helper functions for configuration management, logging, and (optional) feature extraction:

  • ConfigLoader: Load/save YAML and JSON configurations
  • setup_logger: Configure logging for experiments
  • extract_features (optional): Stream model features to HDF5 and read back efficiently

Optional install for feature extraction utilities:

pip install -e ".[features]"

Examples

See the examples/ directory for complete usage examples:

  • basic_usage.py: Simple example with synthetic data
  • config_usage.py: Configuration-driven workflow

Run examples:

python examples/basic_usage.py
python examples/config_usage.py

Project Structure

coreset_selection/
├── coreset_selection/          # Main package
│   ├── __init__.py
│   ├── datasets/               # Dataset implementations
│   │   ├── base.py            # Base dataset interface
│   │   └── numpy_dataset.py   # NumPy dataset implementation
│   ├── strategies/            # Coreset selection strategies
│   │   ├── base.py           # Base strategy interface
│   │   ├── random_sampling.py
│   │   └── k_center.py
│   ├── evaluators/           # Evaluation metrics
│   │   ├── base.py          # Base evaluator interface
│   │   └── diversity_evaluator.py
│   └── utils/               # Utility functions
│       ├── config.py        # Configuration management
│       ├── logger.py        # Logging utilities
│       └── extract_features.py  # (optional) Stream features to HDF5
├── scripts/                 # CLI utilities
│   └── make_coreset.py      # Create coreset indices and optionally save arrays
├── examples/                # Usage examples
├── configs/                 # Configuration files
├── tests/                   # Unit tests
├── requirements.txt         # Dependencies
├── setup.py                # Package setup
├── pyproject.toml          # Modern Python packaging
└── README.md               # This file

Adding Your Own Components

Custom Dataset

  1. Create a new file in coreset_selection/datasets/
  2. Inherit from BaseDataset
  3. Implement load(), get_data(), and get_labels()

Custom Strategy

  1. Create a new file in coreset_selection/strategies/
  2. Inherit from BaseCoresetStrategy
  3. Implement select() method

Custom Evaluator

  1. Create a new file in coreset_selection/evaluators/
  2. Inherit from BaseEvaluator
  3. Implement evaluate() method

Configuration

Example configuration file (configs/example_config.yaml):

random_state: 42

data:
  n_samples: 1000
  n_features: 10
  n_classes: 5

strategy:
  name: kcenter
  params:
    coreset_size: 100
    metric: euclidean
    random_state: 42

evaluate: true
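
Since ConfigLoader handles both YAML and JSON, the same experiment can be expressed as JSON. A minimal round-trip sketch using only the standard library (the file name is arbitrary):

```python
import json
import os
import tempfile

# The example configuration above, expressed as a Python dict / JSON document.
cfg = {
    "random_state": 42,
    "data": {"n_samples": 1000, "n_features": 10, "n_classes": 5},
    "strategy": {
        "name": "kcenter",
        "params": {"coreset_size": 100, "metric": "euclidean", "random_state": 42},
    },
    "evaluate": True,
}

# Write it out and read it back, as a config loader would.
path = os.path.join(tempfile.gettempdir(), "config_demo.json")
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
with open(path) as f:
    loaded = json.load(f)
```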

Development

Running Tests

pip install -e ".[dev]"
pytest tests/

Code Style

black coreset_selection/
flake8 coreset_selection/

License

MIT License - See LICENSE file for details

Citation

If you use this framework in your research, please cite:

@software{coreset_selection,
  author = {Minas Mayth},
  title  = {Coreset Selection Framework},
  year   = {2025},
  url    = {https://github.com/MinasMayth/coreset_selection}
}


Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Contact

For questions and support, please open an issue on GitHub.
