@NicolasNoya

Add MemStream: Memory-Based Streaming Anomaly Detection

This PR introduces MemStream, a state-of-the-art online anomaly detection framework designed for high-dimensional data streams with concept drift, based on the paper "MemStream: Memory-Based Streaming Anomaly Detection" by Bhatia et al.

What's New

Core Implementation

  • MemStream (Base Class): Abstract base class providing the core framework for memory-based anomaly detection
  • MemStreamPCA: Concrete implementation using PCA-based feature encoding

Architecture

The implementation consists of two main components:

  1. Feature Encoder: Transforms high-dimensional inputs into lower-dimensional representations

    • Currently implements PCA-based projection
    • Extensible design leaves room for additional encoders; denoising autoencoder and information bottleneck encoders are not implemented in this PR due to compatibility issues
  2. Memory Module: Maintains a dynamic collection of encoded "normal" data representations

    • Adapts to concept drift without explicit labels
    • Configurable replacement strategies: FIFO, LRU, and Random
    • Prevents memory poisoning from anomalous samples
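
To make the memory behaviour above concrete, here is a minimal sketch of a fixed-size memory with the three replacement strategies. It is illustrative only: the class name `SketchMemory` and its methods `update` and `touch` are hypothetical and are not the classes added in this PR. It assumes that a sample is admitted only when its anomaly score is at or below `max_threshold`, and that "LRU" means the slot that was least recently written or matched.

```python
import random


class SketchMemory:
    """Hypothetical fixed-size memory; not the class added in this PR."""

    def __init__(self, memory_size, max_threshold, replace_strategy="FIFO", seed=0):
        self.memory_size = memory_size
        self.max_threshold = max_threshold
        self.replace_strategy = replace_strategy
        self.items = []        # stored encodings
        self.last_used = []    # step at which each slot was last written or matched
        self.next_fifo = 0     # index of the oldest slot once the memory is full
        self.step = 0
        self.rng = random.Random(seed)

    def touch(self, index):
        """Mark a slot as recently used (only relevant for LRU)."""
        self.step += 1
        self.last_used[index] = self.step

    def update(self, encoding, score):
        """Store the encoding only if it looks normal. Returns True if stored."""
        if score > self.max_threshold:
            return False  # likely anomalous: keep it out of memory
        self.step += 1
        if len(self.items) < self.memory_size:
            self.items.append(encoding)
            self.last_used.append(self.step)
            return True
        if self.replace_strategy == "FIFO":
            slot = self.next_fifo
            self.next_fifo = (self.next_fifo + 1) % self.memory_size
        elif self.replace_strategy == "LRU":
            slot = min(range(self.memory_size), key=self.last_used.__getitem__)
        elif self.replace_strategy == "RANDOM":
            slot = self.rng.randrange(self.memory_size)
        else:
            raise ValueError(f"unknown replace_strategy: {self.replace_strategy!r}")
        self.items[slot] = encoding
        self.last_used[slot] = self.step
        return True
```

Rejecting high-scoring samples before any replacement happens is what keeps anomalies from overwriting normal entries, whichever strategy is selected.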

Key Features

  • Online Learning: Processes data points one at a time
  • Unsupervised Detection: No labels required during inference (optional during training)
  • Concept Drift Adaptation: Memory evolves over time to handle distribution changes
  • Flexible Scoring: Uses k-nearest neighbors with exponential weighting to compute anomaly scores
  • Grace Period: Collects initial samples to bootstrap the encoder before scoring begins
  • Memory Management: Configurable size and replacement policies
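
As a rough illustration of the scoring feature listed above, the sketch below computes an exponentially weighted k-nearest-neighbor score with NumPy. The function name `knn_score` is hypothetical, and two details are assumptions rather than statements about this PR's code: distances are L1, and the i-th nearest neighbor (starting at 0) gets weight `gamma ** i`, normalized by the sum of the weights.

```python
import numpy as np


def knn_score(memory, encoding, k=5, gamma=0.1):
    """Hypothetical scoring helper: weighted mean of the k smallest L1 distances."""
    memory = np.asarray(memory, dtype=float)
    encoding = np.asarray(encoding, dtype=float)
    if memory.shape[0] == 0:
        raise ValueError("memory is empty; collect grace-period samples first")
    # L1 distance from the query encoding to every stored encoding (assumption).
    distances = np.abs(memory - encoding).sum(axis=1)
    k = min(k, distances.shape[0])
    nearest = np.sort(distances)[:k]
    # The closest neighbor gets weight 1, the next gamma, then gamma**2, and so on.
    weights = gamma ** np.arange(k)
    return float(np.dot(nearest, weights) / weights.sum())
```

With a small `gamma` the score is dominated by the single nearest neighbor; values closer to 1 approach a plain average over the k neighbors.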

Parameters

  • memory_size: Maximum number of encoded normal samples to store (default: 1,000 for PCA variant)
  • max_threshold: Threshold for accepting samples into memory (default: 0.1)
  • grace_period: Number of initial samples before scoring begins (default: 5,000)
  • n_components: Number of PCA components (default: 20); if the requested value is not valid for the data, the code falls back to a value that makes PCA possible
  • k: Number of nearest neighbors for scoring (default: 5)
  • gamma: Exponential weighting factor (default: 0.1)
  • replace_strategy: Memory replacement policy (FIFO, LRU, or RANDOM)
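
The interaction between `grace_period` and `n_components` can be sketched as follows: samples collected during the grace period are used to fit the projection, and the component count is reduced when the requested value is too large. The helpers `fit_pca` and `encode` are hypothetical, and the clamping rule shown, at most `min(n_samples, n_features)`, is an assumption about what "a value that makes PCA possible" means; the PR's code may use a different rule.

```python
import numpy as np


def fit_pca(samples, n_components=20):
    """Hypothetical helper: fit a PCA projection on the grace-period samples."""
    X = np.asarray(samples, dtype=float)
    n_samples, n_features = X.shape
    # Assumed fallback: PCA cannot yield more components than min(n_samples, n_features).
    n_components = max(1, min(n_components, n_samples, n_features))
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def encode(x, mean, components):
    """Project one sample onto the fitted components."""
    return (np.asarray(x, dtype=float) - mean) @ components.T
```

For example, fitting on 8-dimensional samples with the default `n_components=20` would keep 8 components under this assumed rule instead of raising an error.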
