A robust, modular, and scalable product matching pipeline designed to identify similar products across different data sources with high accuracy and efficiency.
This project implements a sophisticated product matching system built on four core principles:
- Robustness: Multi-layer normalization and data validation ensure high data quality even when processing messy or inconsistent input data.
- Modularity: The system architecture allows for easy addition or removal of pipelines for new domains or product types without disrupting existing functionality.
- Tunability: Every pipeline includes a configuration file that enables users to fine-tune system behavior, facilitating the incorporation of human feedback and domain expertise.
- Scalability: Designed to handle large volumes of data efficiently, ensuring performance does not degrade as datasets grow.
- Multi-layer Data Processing: Comprehensive pipeline including ingestion, normalization, guardrails, signal computation, and scoring
- Fragrance-Specific Normalization: Specialized handling of fragrance product attributes including concentration, volume, and brand variations
- Configurable Guardrails: Business logic filters to reduce computational load by eliminating unlikely matches early in the pipeline
- Flexible Scoring System: Weighted scoring mechanism that can be customized for different use cases
- Docker Support: Containerized deployment for consistent environments
- Extensible Architecture: Easy to add new normalization rules, signals, and scoring methods
The easiest way to run the project is using Docker:
```
docker-compose up
```

See DOCKER_INSTRUCTIONS.md for detailed Docker setup instructions.
To run the project locally:
- Prerequisites: Python 3.11.9 or higher
- Create a virtual environment:

  ```
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```
  pip install -r requirements.txt
  ```

- Run the main script:

  ```
  python main.py
  ```
The system is organized into five main layers:
For this prototype, ingestion is implemented as a simple CSV fetch from the dataset directory. In production, this layer can be extended to connect to various data sources including databases, APIs, or cloud storage.
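A minimal sketch of such a CSV fetch with pandas (the file name `products.csv` is an assumption for illustration, not necessarily what the project's dataset directory contains):

```python
import pandas as pd

def ingest_products(dataset_dir: str) -> pd.DataFrame:
    """Load raw product records from a CSV file in the dataset directory."""
    # Hypothetical file name; the real dataset layout may differ.
    return pd.read_csv(f"{dataset_dir}/products.csv")
```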
Performs general data cleaning and normalization, with specialized handling for fragrance-specific attributes:
- Brand normalization: Standardizes brand names and variations
- Title normalization: Cleans and standardizes product titles
- Volume normalization: Parses and standardizes product volumes and units
- Financial normalization: Handles price and currency formatting
- Identifier normalization: Processes GTIN, SKU, and other product identifiers
- Text normalization: Applies text cleaning and standardization
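As an illustration, the volume-normalization step can be sketched roughly as follows; the unit table, function name, and rounding behavior here are assumptions for illustration, not the project's actual API:

```python
import re

# Conversion factors to millilitres for units commonly seen in
# fragrance listings (illustrative subset, not exhaustive).
_ML_PER_UNIT = {"ml": 1.0, "l": 1000.0, "oz": 29.5735, "fl oz": 29.5735}

def normalize_volume(raw: str):
    """Return the product volume in ml, or None if no volume is found."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*(fl\s*oz|ml|l|oz)\b", raw.lower())
    if not m:
        return None
    value = float(m.group(1))
    unit = re.sub(r"\s+", " ", m.group(2))  # collapse "fl  oz" -> "fl oz"
    return round(value * _ML_PER_UNIT[unit], 2)
```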
Contains business logic that filters product pairs before they reach the signal computation layer. This optimization significantly reduces computational load by eliminating unlikely matches early in the pipeline.
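A minimal sketch of the kind of deal-breaker check this layer applies; the field names (`concentration`, `price`) and the 3x price-ratio threshold are assumptions, not the project's configured values:

```python
def passes_guardrails(a: dict, b: dict, max_price_ratio: float = 3.0) -> bool:
    """Return False for pairs that cannot plausibly be the same product."""
    # Different concentrations (e.g. EDT vs EDP) are a deal-breaker.
    if a.get("concentration") and b.get("concentration") \
            and a["concentration"] != b["concentration"]:
        return False
    # Prices far apart make a match unlikely; skip expensive signals.
    pa, pb = a.get("price"), b.get("price")
    if pa and pb and max(pa, pb) / min(pa, pb) > max_price_ratio:
        return False
    return True
```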
Computes similarity scores between pairs of products using multiple signals:
- Text similarity (title, description)
- Brand matching
- Volume/size similarity
- Price similarity
- Identifier matching
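A minimal sketch of one such signal, using Jaccard overlap of title tokens; the real system may use a different text-similarity metric:

```python
def title_similarity(title_a: str, title_b: str) -> float:
    """Return a score in [0, 1] based on shared word tokens."""
    tokens_a = set(title_a.lower().split())
    tokens_b = set(title_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    # Jaccard index: shared tokens over all distinct tokens.
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```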
Computes the final matching score between product pairs by combining signals with configurable weights, producing a confidence score for each potential match.
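The weighted combination can be sketched like this; the weight keys are hypothetical stand-ins for what the scoring configuration would supply:

```python
def combine_signals(signal_scores: dict, weights: dict) -> float:
    """Combine per-signal scores into one confidence value in [0, 1]."""
    total_weight = sum(weights.get(name, 0.0) for name in signal_scores)
    if total_weight == 0:
        return 0.0
    # Weighted average: each signal contributes in proportion to its weight.
    return sum(score * weights.get(name, 0.0)
               for name, score in signal_scores.items()) / total_weight
```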
The product matching process follows a three-step approach that progressively filters and evaluates potential matches:
- Guardrails: Quick filter that eliminates obviously mismatched products based on deal-breaker criteria like concentration, size, and price differences
- Signals: Detailed comparison that computes similarity scores across multiple product attributes (GTIN, brand, title, volume, price, etc.)
- Scoring: Final decision that combines all signal scores into a single confidence score using weighted averaging
For a complete explanation of how the matching system works, including examples and configuration tips, see matching.md.
The system uses a hierarchical configuration structure:
- Main Configuration: Located in `main.py` with default settings for all layers
- Signal Configuration: Defined in `signals_config.py` for tuning signal weights and thresholds
- Normalization Configuration: Located in `normalization/config.py` for domain-specific rules
- Guardrails Configuration: Configured in `guardrails.py` for business logic filters
- Scoring Configuration: Defined in `scoring.py` for final scoring parameters
```python
DEFAULT_CONFIG = {
    'normalization': {
        'enabled': True,
    },
    'guardrails': {
        'enabled': True,
        'config': guardrails.get_default_guardrail_config(),
    },
    'signals': {
        'enabled': True,
        'config': signals.get_default_signal_config(),
    },
    'scoring': {
        'enabled': True,
        'config': scoring.get_default_scoring_config(),
    },
    'output': {
        'min_confidence': 0.5,
        'format': 'json',  # 'json' or 'csv'
    },
}
```

Run the complete pipeline with default settings:
```
python main.py
```

Modify the configuration in `main.py` or pass custom parameters:
```python
# Enable sampling for testing
config = DEFAULT_CONFIG.copy()
config['sampling'] = {
    'enabled': True,
    'sample_size': 100,
    'random_seed': 42,
}
```

Results are saved to the `output/` directory in the specified format (JSON or CSV). Each match includes:
- Product identifiers from both sources
- Individual signal scores
- Final confidence score
- Match metadata
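A hypothetical match record, showing one possible shape of the JSON output; the field names and values here are illustrative, not the project's actual schema:

```python
import json

# Illustrative match record: identifiers from both sources, per-signal
# scores, the combined confidence, and metadata (all values made up).
match = {
    "source_a_id": "SKU-123",
    "source_b_id": "SKU-987",
    "signals": {"title": 0.82, "brand": 1.0, "volume": 1.0, "price": 0.9},
    "confidence": 0.91,
    "metadata": {"matched_at": "2024-01-01T00:00:00Z"},
}
print(json.dumps(match, indent=2))
```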
For production deployment with large-scale datasets, the following architecture is recommended:
- Raw Data Storage: S3 or equivalent cloud storage
- Data Pre-processing: Apache Spark, AWS Glue, or EMR for initial data processing
- Data Indexing: OpenSearch or Elasticsearch for efficient candidate retrieval
- Distributed Processing: Spark/Glue/EMR for parallel execution of normalization, guardrails, signal computation, and scoring layers
To reduce the O(n²) complexity of pairwise comparisons, implement a blocking strategy:
-
Pre-processing: Generate candidate blocks based on high-confidence, low-variety fields such as:
- Product category
- Price range
- Brand
- Key attributes
-
Indexing: Store blocked candidates in a search index (OpenSearch/Elasticsearch)
This approach significantly reduces the number of candidates generated while maintaining matching accuracy.
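A blocking key along these lines can be sketched as follows; the brand-plus-price-bucket key and the 25-unit bucket size are illustrative assumptions:

```python
def blocking_key(product: dict, price_bucket_size: float = 25.0) -> str:
    """Return a coarse key; only products sharing a key are compared."""
    brand = product.get("brand", "").strip().lower()
    price = product.get("price") or 0.0
    # Coarse price bucket so small price differences stay in one block.
    bucket = int(price // price_bucket_size)
    return f"{brand}|{bucket}"
```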
For simplicity, this prototype uses the pandas library. Production implementations should use a distributed computing framework such as Apache Spark.