Retail Inventory Management System Using Big Data

🎯 Project Overview

This project implements a Retail Inventory Management System using Big Data technologies (Hadoop, Hive, and Apache Spark) to analyze and optimize stock levels for retail stores. The system provides demand forecasting, inventory optimization, and actionable insights to prevent stockouts and overstocking.

🏗️ System Architecture

┌──────────────────────────────────────────────────┐
│                   Data Sources                   │
│         (CSV Files - Retail Sales Data)          │
└────────────────────────┬─────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────┐
│                HDFS Storage Layer                │
│         (Hadoop Distributed File System)         │
└────────────────────────┬─────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────┐
│              Data Processing Layer               │
│                  (Apache Hive)                   │
│  - External Tables                               │
│  - HiveQL Queries for Insights                   │
└────────────────────────┬─────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────┐
│                 Analytics Layer                  │
│             (Apache Spark - PySpark)             │
│  - Demand Forecasting                            │
│  - Inventory Optimization                        │
│  - Advanced Analytics                            │
└────────────────────────┬─────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────┐
│         Visualization & Reporting Layer          │
│       (Matplotlib/Plotly/Power BI/Tableau)       │
└──────────────────────────────────────────────────┘

📋 Prerequisites

Software Requirements

  • Hadoop 3.x or later
  • Apache Hive 3.x or later
  • Apache Spark 3.x or later
  • Python 3.7+
  • Java 8 or 11

Python Libraries

pip install pyspark pandas numpy matplotlib plotly scikit-learn

📁 Project Structure

retail-inventory-management/
│
├── data/
│   ├── raw/
│   │   └── retail_inventory_data.csv
│   └── processed/
│       └── output files from Spark
│
├── scripts/
│   ├── data_generation/
│   │   └── generate_dataset.py
│   ├── hdfs/
│   │   └── load_to_hdfs.sh
│   ├── hive/
│   │   ├── create_tables.hql
│   │   └── analytical_queries.hql
│   └── spark/
│       ├── data_processing.py
│       ├── demand_forecasting.py
│       └── inventory_optimization.py
│
├── visualization/
│   └── dashboard.py
│
├── config/
│   └── spark_config.py
│
├── outputs/
│   └── results and reports
│
└── README.md

🚀 Setup Instructions

Step 1: Start Hadoop Services

# Start HDFS
start-dfs.sh

# Start YARN (optional, for resource management)
start-yarn.sh

# Verify HDFS is running
hdfs dfsadmin -report

Step 2: Generate Sample Dataset

cd scripts/data_generation
python generate_dataset.py

This creates a sample retail inventory dataset with the following fields (a minimal generator sketch follows the list):

  • Product IDs
  • Categories
  • Stock Quantity
  • Sales Quantity
  • Customer Region
  • Season
  • Date
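
The generator script itself is not reproduced in this README; below is a minimal sketch of what it could look like, assuming the column names product_id, category, stock_quantity, sales_quantity, customer_region, season, and date. The actual generate_dataset.py may use different fields, ranges, or distributions.

# Hypothetical sketch of a synthetic data generator (not the actual script)
import random
from datetime import date, timedelta

import pandas as pd

random.seed(42)
CATEGORIES = ["Electronics", "Grocery", "Apparel", "Home"]
REGIONS = ["North", "South", "East", "West"]

def season_of(d):
    # Map a calendar month to a season label
    return {12: "Winter", 1: "Winter", 2: "Winter",
            3: "Spring", 4: "Spring", 5: "Spring",
            6: "Summer", 7: "Summer", 8: "Summer"}.get(d.month, "Autumn")

rows = []
start = date(2024, 1, 1)
for day in range(365):
    d = start + timedelta(days=day)
    for pid in range(1, 101):  # 100 products
        stock = random.randint(0, 500)
        rows.append({
            "product_id": f"P{pid:03d}",
            "category": random.choice(CATEGORIES),
            "stock_quantity": stock,
            "sales_quantity": random.randint(0, min(stock, 50)),
            "customer_region": random.choice(REGIONS),
            "season": season_of(d),
            "date": d.isoformat(),
        })

# Path is relative to scripts/data_generation/ (where Step 2 runs the script)
pd.DataFrame(rows).to_csv("../../data/raw/retail_inventory_data.csv", index=False)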

Step 3: Load Data to HDFS

# Create HDFS directories
hdfs dfs -mkdir -p /user/retail/data/inventory

# Upload data to HDFS
hdfs dfs -put data/raw/retail_inventory_data.csv /user/retail/data/inventory/

# Verify data upload
hdfs dfs -ls /user/retail/data/inventory/

Step 4: Create Hive Tables

# Start the Hive CLI from the project root
hive

-- Inside the Hive CLI: run the table creation script
SOURCE scripts/hive/create_tables.hql;

-- Verify table creation
SHOW TABLES;
DESCRIBE retail_inventory;
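
The contents of create_tables.hql are not shown in this README; the following is a plausible external-table definition matching the dataset fields and the HDFS path from Step 3. The actual script may differ (for example, partitioning or ORC storage).

-- Illustrative sketch of an external table over the CSV loaded in Step 3
CREATE EXTERNAL TABLE IF NOT EXISTS retail_inventory (
    product_id       STRING,
    category         STRING,
    stock_quantity   INT,
    sales_quantity   INT,
    customer_region  STRING,
    season           STRING,
    sale_date        DATE    -- renamed from "date" to avoid the reserved keyword
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/retail/data/inventory/'
TBLPROPERTIES ('skip.header.line.count'='1');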

Step 5: Run Hive Analytical Queries

# Execute analytical queries
hive -f scripts/hive/analytical_queries.hql

Step 6: Run Spark Analytics

# Data Processing
spark-submit --master local[*] scripts/spark/data_processing.py

# Demand Forecasting
spark-submit --master local[*] scripts/spark/demand_forecasting.py

# Inventory Optimization
spark-submit --master local[*] scripts/spark/inventory_optimization.py

Step 7: Generate Visualizations

python visualization/dashboard.py
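
The dashboard script is only summarized here; below is a minimal sketch, assuming it reads the raw CSV and writes charts to outputs/. The actual dashboard.py may instead read Spark outputs from data/processed/ and use Plotly.

# Illustrative inventory-vs-sales chart (not the actual dashboard script)
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/raw/retail_inventory_data.csv", parse_dates=["date"])

# Aggregate stock and sales per month
monthly = (df.groupby(df["date"].dt.to_period("M"))[["stock_quantity", "sales_quantity"]]
             .sum())

fig, ax = plt.subplots(figsize=(10, 5))
monthly.plot(ax=ax, marker="o")
ax.set_title("Inventory vs Sales by Month")
ax.set_ylabel("Units")
fig.tight_layout()
fig.savefig("outputs/inventory_vs_sales.png")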

📊 Key Features

1. Data Processing Layer (Hive)

  • Fast-moving vs Slow-moving Products: Classifies products by sales velocity
  • Seasonal Sales Trends: Analyzes sales patterns across seasons
  • Regional Demand Analysis: Identifies high- and low-demand regions (see the example query after this list)
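
A representative query of this kind (column names follow the table sketch in Step 4; the actual analytical_queries.hql may differ):

-- Seasonal and regional sales breakdown (illustrative)
SELECT season, customer_region, SUM(sales_quantity) AS total_sales
FROM retail_inventory
GROUP BY season, customer_region
ORDER BY total_sales DESC;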

2. Analytics Layer (Spark)

  • Demand Forecasting: Predicts future demand using time-series analysis
  • Inventory Optimization: Suggests reorder quantities to maintain optimal stock levels (see the PySpark sketch after this list)
  • Stock Risk Assessment: Identifies products at risk of stockout or overstock
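
A sketch of the kind of logic these scripts implement: a 30-day moving-average demand estimate and a simple reorder-point rule. The window size, lead time, and safety factor below are illustrative assumptions, not necessarily the repository's actual parameters or models.

# Illustrative PySpark sketch: moving-average forecast plus reorder point
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("demand-forecast-sketch").getOrCreate()

df = (spark.read.option("header", True).option("inferSchema", True)
      .csv("hdfs:///user/retail/data/inventory/retail_inventory_data.csv"))

# Average daily sales per product over the trailing 30 rows (a naive demand forecast)
w = Window.partitionBy("product_id").orderBy("date").rowsBetween(-29, 0)
forecast = df.withColumn("avg_daily_demand", F.avg("sales_quantity").over(w))

# Reorder point: cover the supplier lead time plus a safety buffer (assumed values)
LEAD_TIME_DAYS = 7
SAFETY_FACTOR = 1.5
reorder = (forecast
           .withColumn("reorder_point",
                       F.col("avg_daily_demand") * LEAD_TIME_DAYS * SAFETY_FACTOR)
           .withColumn("needs_restock",
                       F.col("stock_quantity") < F.col("reorder_point")))

reorder.filter("needs_restock") \
       .select("product_id", "stock_quantity", "avg_daily_demand", "reorder_point") \
       .show(10)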

3. Visualization Layer

  • Inventory vs Sales comparison charts
  • Demand forecast graphs with confidence intervals
  • Stock level trend analysis
  • Regional performance heatmaps

📈 Expected Outputs

Insights Generated:

  1. Stockout Alerts: "Product X likely to go out of stock in next 10 days" (see the sketch after this list)
  2. Seasonal Trends: "Category Y sales peak in the winter season"
  3. Restock Recommendations: "Restock level suggestion for each product based on forecasted demand"
  4. Slow-moving Inventory: "Product Z has low sales velocity - consider promotions"
  5. Regional Insights: "Region A shows 30% higher demand for Category B"
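
For example, the stockout alert in item 1 can come from a simple days-of-cover calculation; the values below are made up purely to illustrate the arithmetic:

# days_of_stock = current stock / average daily demand
stock_quantity = 120      # assumed current stock for Product X
avg_daily_demand = 15.0   # assumed forecast daily demand

days_of_stock = stock_quantity / avg_daily_demand   # 8 days of cover
if days_of_stock < 10:
    print(f"Product X likely to go out of stock in ~{days_of_stock:.0f} days")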

🔧 Configuration

Spark Configuration

Edit config/spark_config.py (sketched after the list below) to adjust:

  • Memory allocation
  • Number of executors
  • Parallelism level
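
A hypothetical sketch of what config/spark_config.py might contain; the property keys are standard Spark settings, but the structure and values here are assumptions to adapt to your cluster.

# Hypothetical config/spark_config.py (illustrative values)
SPARK_CONF = {
    "spark.app.name": "retail-inventory-analytics",
    "spark.driver.memory": "4g",          # memory allocation
    "spark.executor.memory": "4g",
    "spark.executor.instances": "2",      # number of executors
    "spark.sql.shuffle.partitions": "8",  # parallelism level
}

def apply_conf(builder):
    # Apply the settings above to a SparkSession builder
    for key, value in SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder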

Forecasting Parameters

Adjust the forecasting window and confidence levels in scripts/spark/demand_forecasting.py.
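
The parameter names below are hypothetical placeholders; check demand_forecasting.py for the actual names and defaults.

FORECAST_WINDOW_DAYS = 30   # history used by the forecasting model
FORECAST_HORIZON_DAYS = 10  # how far ahead demand is projected
CONFIDENCE_LEVEL = 0.95     # confidence interval shown on forecast plots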

📝 Usage Examples

Query Fast-Moving Products

-- In Hive
SELECT product_id, category, SUM(sales_quantity) as total_sales
FROM retail_inventory
GROUP BY product_id, category
ORDER BY total_sales DESC
LIMIT 10;

Run Complete Pipeline

# Run the complete pipeline
./run_pipeline.sh
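
run_pipeline.sh is not listed in the project structure above; if you create it, a minimal version that chains the steps from this README could look like this:

#!/usr/bin/env bash
# Run the full pipeline end to end (illustrative; adjust paths as needed)
set -euo pipefail

python scripts/data_generation/generate_dataset.py
bash scripts/hdfs/load_to_hdfs.sh
hive -f scripts/hive/create_tables.hql
hive -f scripts/hive/analytical_queries.hql
spark-submit --master "local[*]" scripts/spark/data_processing.py
spark-submit --master "local[*]" scripts/spark/demand_forecasting.py
spark-submit --master "local[*]" scripts/spark/inventory_optimization.py
python visualization/dashboard.py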

🧪 Testing

Verify Data Quality

# Check record count in HDFS
hdfs dfs -cat /user/retail/data/inventory/retail_inventory_data.csv | wc -l

# Verify Hive table data
hive -e "SELECT COUNT(*) FROM retail_inventory;"

🐛 Troubleshooting

Common Issues

  1. HDFS Connection Error

    # Check if HDFS is running
    jps
    # Should show NameNode and DataNode
  2. Hive Metastore Error

    # Initialize metastore
    schematool -initSchema -dbType derby
  3. Spark Memory Issues

    # Increase driver memory
    spark-submit --driver-memory 4g --executor-memory 4g script.py

🀝 Contributing

Feel free to contribute by:

  • Adding new analytical features
  • Improving forecasting algorithms
  • Enhancing visualizations
  • Optimizing performance

📄 License

This project is for educational purposes.

👤 Author

Big Data Engineer - Retail Analytics Team

📞 Support

For issues or questions, please create an issue in the repository.


Note: Ensure all Hadoop, Hive, and Spark services are properly configured and running before executing the pipeline.
