This project tackles the Smart Product Pricing Challenge from ML Challenge 2025, which aims to predict optimal product prices for e-commerce platforms. Pricing products correctly is critical for marketplace success and customer satisfaction. The dataset includes product details comprising textual descriptions, images, and item pack quantities, where the price depends on a complex interplay of attributes such as brand, specifications, and quantity. The goal is to build a machine learning model that holistically analyzes these attributes and predicts product prices accurately.
- Training dataset: 75,000 products with complete details including prices.
- Test dataset: 75,000 products without prices for evaluation.
- Columns:
  - `sample_id`: Unique identifier for each product.
  - `catalog_content`: Concatenated text containing the title, description, and item pack quantity.
  - `image_link`: URL to download the product image.
  - `price`: Target variable in the training data (to predict for the test set).
Submit a CSV file with two columns:
- `sample_id`: Matching ID from the test set.
- `price`: Predicted price as a positive float.
- Extract and clean textual information from `catalog_content`.
- Extract the item pack quantity using regex parsing.
- Download product images from `image_link` using the provided utility functions.
- Extract visual features from product images using a pretrained ResNet50 model (with the classifier head removed) to generate image embeddings.
- Combine textual and image features into a final feature set.
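The regex-based pack-quantity extraction can be sketched as below. The specific patterns (`"Pack of 6"`, `"6-Pack"`, `"6 Count"`) are illustrative assumptions; the actual phrases in `catalog_content` may differ.

```python
import re

# Hypothetical patterns for common pack-quantity phrasings; the real
# catalog text may use other wordings, so treat these as a starting point.
PACK_PATTERNS = [
    re.compile(r"pack\s*of\s*(\d+)", re.IGNORECASE),   # "Pack of 6"
    re.compile(r"(\d+)\s*[- ]\s*pack\b", re.IGNORECASE),  # "6-Pack", "6 Pack"
    re.compile(r"(\d+)\s*count\b", re.IGNORECASE),     # "6 Count"
]

def extract_pack_quantity(text: str, default: int = 1) -> int:
    """Return the first pack quantity found in the text, else the default."""
    for pattern in PACK_PATTERNS:
        match = pattern.search(text)
        if match:
            return int(match.group(1))
    return default

print(extract_pack_quantity("Organic Green Tea, Pack of 12"))  # -> 12
```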
- Trained LightGBM regression models separately for:
- Text features (TF-IDF vectors + numerical features).
- Image features from ResNet50 embeddings.
- Ensemble prediction by weighted averaging of text- and image-based model outputs.
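The weighted-average ensemble of the two model outputs can be sketched as follows. The 0.7/0.3 split is an illustrative assumption; the actual weights would be tuned on a validation set (e.g., by minimizing SMAPE).

```python
import numpy as np

def ensemble_predictions(text_preds, image_preds, text_weight=0.7):
    """Blend text-model and image-model price predictions by weighted average.

    text_weight is a hypothetical value; in practice it is chosen on
    held-out data. Prices must be positive floats, so the blend is
    clipped to a small positive floor.
    """
    text_preds = np.asarray(text_preds, dtype=float)
    image_preds = np.asarray(image_preds, dtype=float)
    blended = text_weight * text_preds + (1.0 - text_weight) * image_preds
    return np.clip(blended, a_min=0.01, a_max=None)

print(ensemble_predictions([10.0, 20.0], [20.0, 10.0]))  # -> [13. 17.]
```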
- Code implemented in Python with PyTorch, torchvision, LightGBM, scikit-learn, and other standard libraries.
- Used GPU acceleration for image feature extraction.
- Extensive preprocessing to handle missing data and feature normalization.
- Used Symmetric Mean Absolute Percentage Error (SMAPE) as the primary evaluation metric.
- Employed AWS SageMaker for scalable model training and experimentation.
- Utilized SageMaker compute instances to efficiently train LightGBM models and process large datasets.
- Leveraged SageMaker’s managed environment to handle data storage, image downloading, and batch processing.
- Model training steps, feature extraction, and inference pipelines orchestrated on SageMaker.
- The SMAPE metric is used for model evaluation, comparing predicted prices with actual prices on a relative percentage scale.
\[
\text{SMAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|P_i - A_i|}{(|A_i| + |P_i|)/2} \times 100\%
\]
where \(P_i\) is the predicted price and \(A_i\) is the actual price for sample \(i\).
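The metric above translates directly into a few lines of NumPy; this sketch assumes no price is exactly zero (which would zero the denominator):

```python
import numpy as np

def smape(actual, predicted):
    """Symmetric Mean Absolute Percentage Error, in percent (range 0-200)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    denom = (np.abs(actual) + np.abs(predicted)) / 2.0
    return float(np.mean(np.abs(predicted - actual) / denom) * 100.0)

print(round(smape([100.0, 50.0], [110.0, 45.0]), 2))  # -> 10.03
```

A perfect prediction gives 0; predicting double the true price gives roughly 66.7, so lower is better.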
- Prepare environment with required Python packages (see notebook for pip install commands).
- Use `src/utils.py` for image downloading with retry mechanisms.
- Run the notebook cells sequentially to:
- Load and preprocess data.
- Download and extract image features.
- Train LightGBM models on text and image features.
- Generate ensemble predictions.
- Save final predictions matching the test set sample IDs for submission.
- No external price lookup or data source outside the provided dataset was used.
- All predictions and features rely solely on the given training data and image URLs.
- Follow licensing terms (MIT/Apache 2.0) for final model use.