FauzanLzrd/product-matcher

Product Matching System

A robust, modular, and scalable product matching pipeline designed to identify similar products across different data sources with high accuracy and efficiency.

Overview

This project implements a sophisticated product matching system built on four core principles:

  1. Robustness: Multi-layer normalization and data validation ensure high data quality even when processing messy or inconsistent input data.

  2. Modularity: The system architecture allows for easy addition or removal of pipelines for new domains or product types without disrupting existing functionality.

  3. Tunability: Every pipeline includes a configuration file that enables users to fine-tune system behavior, facilitating the incorporation of human feedback and domain expertise.

  4. Scalability: Designed to handle large volumes of data efficiently, ensuring performance does not degrade as datasets grow.

Features

  • Multi-layer Data Processing: Comprehensive pipeline including ingestion, normalization, guardrails, signal computation, and scoring
  • Fragrance-Specific Normalization: Specialized handling of fragrance product attributes including concentration, volume, and brand variations
  • Configurable Guardrails: Business logic filters to reduce computational load by eliminating unlikely matches early in the pipeline
  • Flexible Scoring System: Weighted scoring mechanism that can be customized for different use cases
  • Docker Support: Containerized deployment for consistent environments
  • Extensible Architecture: Easy to add new normalization rules, signals, and scoring methods

Installation

Option 1: Docker (Recommended)

The easiest way to run the project is using Docker:

docker-compose up

See DOCKER_INSTRUCTIONS.md for detailed Docker setup instructions.

Option 2: Local Installation

To run the project locally:

  1. Prerequisites: Python 3.11.9 or higher

  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate

  3. Install dependencies:

    pip install -r requirements.txt

  4. Run the main script:

    python main.py

Architecture

The system is organized into five main layers:

1. Ingestion Layer

For this prototype, ingestion is implemented as a simple CSV fetch from the dataset directory. In production, this layer can be extended to connect to various data sources including databases, APIs, or cloud storage.
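A minimal sketch of such a CSV fetch is shown below; `ingest_csvs` and the directory layout are illustrative, not the actual names used in this repository:

```python
import csv
from pathlib import Path

def ingest_csvs(dataset_dir: str) -> list[dict]:
    """Load every CSV in dataset_dir into one list of row dicts."""
    rows: list[dict] = []
    for path in sorted(Path(dataset_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as f:
            rows.extend(csv.DictReader(f))
    return rows
```

Swapping this layer for a database or API client only requires returning the same list-of-dicts shape to the downstream layers.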

2. Normalization Layer

Performs general data cleaning and normalization, with specialized handling for fragrance-specific attributes:

  • Brand normalization: Standardizes brand names and variations
  • Title normalization: Cleans and standardizes product titles
  • Volume normalization: Parses and standardizes product volumes and units
  • Financial normalization: Handles price and currency formatting
  • Identifier normalization: Processes GTIN, SKU, and other product identifiers
  • Text normalization: Applies text cleaning and standardization

3. Guardrails Layer

Contains business logic that filters product pairs before they reach the signal computation layer. This optimization significantly reduces computational load by eliminating unlikely matches early in the pipeline.
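A guardrail in this sense is just a cheap boolean check on a candidate pair. The sketch below shows the shape of such a filter; the field names and the price-ratio threshold are illustrative assumptions:

```python
# Hypothetical guardrail: reject a pair on deal-breaker criteria
# before any expensive signal computation runs.

def passes_guardrails(a: dict, b: dict, max_price_ratio: float = 3.0) -> bool:
    # Different concentrations (e.g. "EDT" vs "EDP") are never the same product.
    if a.get("concentration") and b.get("concentration"):
        if a["concentration"] != b["concentration"]:
            return False
    # Prices more than max_price_ratio apart are treated as a mismatch.
    pa, pb = a.get("price"), b.get("price")
    if pa and pb and max(pa, pb) / min(pa, pb) > max_price_ratio:
        return False
    return True
```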

4. Signal Computation Layer

Computes similarity scores between pairs of products using multiple signals:

  • Text similarity (title, description)
  • Brand matching
  • Volume/size similarity
  • Price similarity
  • Identifier matching
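Each signal produces a score in [0, 1] for a pair. A simplified sketch, using token-set Jaccard for titles and exact matching for brand and GTIN (the real signals in `signals_config.py` may differ):

```python
def title_similarity(t1: str, t2: str) -> float:
    """Jaccard overlap of lowercase title tokens, in [0, 1]."""
    s1, s2 = set(t1.lower().split()), set(t2.lower().split())
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)

def compute_signals(a: dict, b: dict) -> dict[str, float]:
    """One score per signal, each in [0, 1]; field names are illustrative."""
    return {
        "gtin": 1.0 if a.get("gtin") and a["gtin"] == b.get("gtin") else 0.0,
        "brand": 1.0 if a.get("brand") and a["brand"].lower() == b.get("brand", "").lower() else 0.0,
        "title": title_similarity(a.get("title", ""), b.get("title", "")),
    }
```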

5. Scoring Layer

Computes the final matching score between product pairs by combining signals with configurable weights, producing a confidence score for each potential match.
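The weighted combination itself is straightforward; a minimal sketch (the weight names and normalization are assumptions, not the exact logic in `scoring.py`):

```python
def combine(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of signal scores; weights need not sum to one."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(signals.get(name, 0.0) * w for name, w in weights.items()) / total
```

Raising a weight makes its signal dominate the confidence score, which is the main lever for incorporating human feedback.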

How Matching Works

The product matching process follows a three-step approach that progressively filters and evaluates potential matches:

  1. Guardrails: Quick filter that eliminates obviously mismatched products based on deal-breaker criteria like concentration, size, and price differences
  2. Signals: Detailed comparison that computes similarity scores across multiple product attributes (GTIN, brand, title, volume, price, etc.)
  3. Scoring: Final decision that combines all signal scores into a single confidence score using weighted averaging

For a complete explanation of how the matching system works, including examples and configuration tips, see matching.md.

Configuration

The system uses a hierarchical configuration structure:

  • Main Configuration: Located in main.py with default settings for all layers
  • Signal Configuration: Defined in signals_config.py for tuning signal weights and thresholds
  • Normalization Configuration: Located in normalization/config.py for domain-specific rules
  • Guardrails Configuration: Configured in guardrails.py for business logic filters
  • Scoring Configuration: Defined in scoring.py for final scoring parameters

Example Configuration

DEFAULT_CONFIG = {
    'normalization': {
        'enabled': True,
    },
    'guardrails': {
        'enabled': True,
        'config': guardrails.get_default_guardrail_config(),
    },
    'signals': {
        'enabled': True,
        'config': signals.get_default_signal_config(),
    },
    'scoring': {
        'enabled': True,
        'config': scoring.get_default_scoring_config(),
    },
    'output': {
        'min_confidence': 0.5,
        'format': 'json',  # 'json' or 'csv'
    },
}

Usage

Basic Usage

Run the complete pipeline with default settings:

python main.py

Custom Configuration

Modify the configuration in main.py or pass custom parameters:

# Enable sampling for testing
config = DEFAULT_CONFIG.copy()
config['sampling'] = {
    'enabled': True,
    'sample_size': 100,
    'random_seed': 42,
}

Output

Results are saved to the output/ directory in the specified format (JSON or CSV). Each match includes:

  • Product identifiers from both sources
  • Individual signal scores
  • Final confidence score
  • Match metadata

Scaling Architecture

For production deployment with large-scale datasets, the following architecture is recommended:

Main Architecture Components

  • Raw Data Storage: S3 or equivalent cloud storage
  • Data Pre-processing: Apache Spark, AWS Glue, or EMR for initial data processing
  • Data Indexing: OpenSearch or Elasticsearch for efficient candidate retrieval
  • Distributed Processing: Spark/Glue/EMR for parallel execution of normalization, guardrails, signal computation, and scoring layers

Blocking Strategy

To reduce the O(n²) complexity of pairwise comparisons, implement a blocking strategy:

  1. Pre-processing: Generate candidate blocks based on high-confidence, low-variety fields such as:

    • Product category
    • Price range
    • Brand
    • Key attributes
  2. Indexing: Store blocked candidates in a search index (OpenSearch/Elasticsearch)

This approach significantly reduces the number of candidates generated while maintaining matching accuracy.
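The effect of blocking can be sketched in a few lines; the blocking key below (brand plus a coarse price bucket) is purely illustrative:

```python
from collections import defaultdict
from itertools import combinations

def block_key(p: dict, price_bucket: float = 25.0) -> tuple:
    """Illustrative blocking key: brand plus a coarse price bucket."""
    return (p.get("brand", "").lower(), int(p.get("price", 0) // price_bucket))

def candidate_pairs(products: list[dict]):
    """Yield only pairs that share a block, instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for p in products:
        blocks[block_key(p)].append(p)
    for members in blocks.values():
        yield from combinations(members, 2)
```

In production the blocks would live in the search index rather than in memory, but the complexity argument is the same: comparisons happen only within a block.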

Distributed Workers

For simplicity, this prototype uses the pandas library. Production implementations should use a distributed computing framework such as Apache Spark.
