Retail Inventory Management System Using Big Data

🎯 Project Overview

This project implements a Retail Inventory Management System using Big Data technologies (Hadoop, Hive, and Apache Spark) to analyze and optimize stock levels for retail stores. The system provides demand forecasting, inventory optimization, and actionable insights to prevent stockouts and overstocking.

🏗️ System Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Data Sources                             │
│              (CSV Files - Retail Sales Data)                 │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                HDFS Storage Layer                            │
│         (Hadoop Distributed File System)                     │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│              Data Processing Layer                           │
│                 (Apache Hive)                                │
│  - External Tables                                           │
│  - HiveQL Queries for Insights                               │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│              Analytics Layer                                 │
│             (Apache Spark - PySpark)                         │
│  - Demand Forecasting                                        │
│  - Inventory Optimization                                    │
│  - Advanced Analytics                                        │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│           Visualization & Reporting Layer                    │
│     (Matplotlib/Plotly/Power BI/Tableau)                     │
└─────────────────────────────────────────────────────────────┘

📋 Prerequisites

Software Requirements

Hadoop 3.x or later
Apache Hive 3.x or later
Apache Spark 3.x or later
Python 3.7+
Java 8 or 11

Python Libraries

pip install pyspark pandas numpy matplotlib plotly scikit-learn

📁 Project Structure

retail-inventory-management/
│
├── data/
│   ├── raw/
│   │   └── retail_inventory_data.csv
│   └── processed/
│       └── output files from Spark
│
├── scripts/
│   ├── data_generation/
│   │   └── generate_dataset.py
│   ├── hdfs/
│   │   └── load_to_hdfs.sh
│   ├── hive/
│   │   ├── create_tables.hql
│   │   └── analytical_queries.hql
│   └── spark/
│       ├── data_processing.py
│       ├── demand_forecasting.py
│       └── inventory_optimization.py
│
├── visualization/
│   └── dashboard.py
│
├── config/
│   └── spark_config.py
│
├── outputs/
│   └── results and reports
│
└── README.md

🚀 Setup Instructions

Step 1: Start Hadoop Services

# Start HDFS
start-dfs.sh

# Start YARN (optional, for resource management)
start-yarn.sh

# Verify HDFS is running
hdfs dfsadmin -report

Step 2: Generate Sample Dataset

cd scripts/data_generation
python generate_dataset.py

# Return to the project root; later steps use paths relative to it
cd ../..

This creates a sample retail inventory dataset with the following fields:

Product IDs
Categories
Stock Quantity
Sales Quantity
Customer Region
Season
Date
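
As a rough illustration of what such a generator might produce, here is a minimal sketch using only the Python standard library. The column names stock_quantity, customer_region, season, and date, as well as the category and region values, are assumptions made for this example; the actual generate_dataset.py may use different names and distributions.

```python
import csv
import io
import random
from datetime import date, timedelta
from typing import Dict, List

# Assumed column names and value sets (hypothetical, for illustration only).
COLUMNS = ["product_id", "category", "stock_quantity", "sales_quantity",
           "customer_region", "season", "date"]
CATEGORIES = ["Electronics", "Clothing", "Grocery"]
REGIONS = ["North", "South", "East", "West"]


def season_for(day: date) -> str:
    """Map a calendar month to a Northern Hemisphere season label."""
    if day.month in (12, 1, 2):
        return "Winter"
    if day.month in (3, 4, 5):
        return "Spring"
    if day.month in (6, 7, 8):
        return "Summer"
    return "Fall"


def generate_rows(n_rows: int, seed: int = 42) -> List[Dict[str, object]]:
    """Return n_rows synthetic inventory records (reproducible via seed)."""
    rng = random.Random(seed)
    start = date(2024, 1, 1)
    rows = []
    for _ in range(n_rows):
        day = start + timedelta(days=rng.randrange(365))
        stock = rng.randint(0, 500)
        rows.append({
            "product_id": "P{:04d}".format(rng.randint(1, 50)),
            "category": rng.choice(CATEGORIES),
            "stock_quantity": stock,
            # Sales can never exceed the stock that was available.
            "sales_quantity": rng.randint(0, stock),
            "customer_region": rng.choice(REGIONS),
            "season": season_for(day),
            "date": day.isoformat(),
        })
    return rows


def to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(generate_rows(5)))
```

Fixing the seed keeps the sample reproducible, which makes downstream Hive and Spark results easier to compare between runs.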

Step 3: Load Data to HDFS

# Create HDFS directories
hdfs dfs -mkdir -p /user/retail/data/inventory

# Upload data to HDFS
hdfs dfs -put data/raw/retail_inventory_data.csv /user/retail/data/inventory/

# Verify data upload
hdfs dfs -ls /user/retail/data/inventory/

Step 4: Create Hive Tables

# Start the Hive CLI from the project root
hive

-- Inside the Hive CLI: run the table creation script
source scripts/hive/create_tables.hql;

-- Verify table creation
SHOW TABLES;
DESCRIBE retail_inventory;

Step 5: Run Hive Analytical Queries

# Execute analytical queries
hive -f scripts/hive/analytical_queries.hql

Step 6: Run Spark Analytics

# Data Processing
spark-submit --master "local[*]" scripts/spark/data_processing.py

# Demand Forecasting
spark-submit --master "local[*]" scripts/spark/demand_forecasting.py

# Inventory Optimization (the quotes stop the shell from globbing local[*])
spark-submit --master "local[*]" scripts/spark/inventory_optimization.py

Step 7: Generate Visualizations

python visualization/dashboard.py

📊 Key Features

1. Data Processing Layer (Hive)

Fast-moving vs Slow-moving Products: Classifies products as fast- or slow-moving based on sales velocity
Seasonal Sales Trends: Analyzes sales patterns across seasons
Regional Demand Analysis: Identifies high and low demand regions
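
To make the fast-moving versus slow-moving idea concrete, the sketch below computes a sales velocity per product in plain Python. It assumes velocity means average units sold per distinct sales day, and the thresholds (20 and 5 units per day) are hypothetical; the project's HiveQL queries may define both differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sales_velocity(records: Iterable[Tuple[str, str, int]]) -> Dict[str, float]:
    """Average units sold per distinct sales day, per product.

    Each record is (product_id, date, sales_quantity).
    """
    totals = defaultdict(int)
    days = defaultdict(set)
    for product_id, day, quantity in records:
        totals[product_id] += quantity
        days[product_id].add(day)
    return {pid: totals[pid] / len(days[pid]) for pid in totals}


def classify(velocities: Dict[str, float],
             fast_threshold: float = 20.0,
             slow_threshold: float = 5.0) -> Dict[str, str]:
    """Label each product 'fast', 'slow', or 'medium' (assumed thresholds)."""
    labels = {}
    for pid, velocity in velocities.items():
        if velocity >= fast_threshold:
            labels[pid] = "fast"
        elif velocity <= slow_threshold:
            labels[pid] = "slow"
        else:
            labels[pid] = "medium"
    return labels


if __name__ == "__main__":
    sample = [
        ("P1", "2024-01-01", 30),
        ("P1", "2024-01-02", 10),
        ("P2", "2024-01-01", 2),
        ("P3", "2024-01-01", 10),
    ]
    print(classify(sales_velocity(sample)))
```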

2. Analytics Layer (Spark)

Demand Forecasting: Predicts future demand using time-series analysis
Inventory Optimization: Suggests reorder quantities to maintain optimal stock levels
Stock Risk Assessment: Identifies products at risk of stockout or overstock
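
The forecasting and reorder logic can be illustrated with a small standalone sketch. It uses a simple moving average as a stand-in for the time-series method (the README does not say which model the Spark job uses) and an order-up-to rule for the reorder quantity. The parameter names window, lead_time_days, cover_days, and safety_stock are assumptions introduced for this example.

```python
from typing import Sequence


def moving_average_forecast(daily_sales: Sequence[float], window: int = 7) -> float:
    """Forecast next-day demand as the mean of the last `window` observations."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not daily_sales:
        raise ValueError("daily_sales must not be empty")
    recent = daily_sales[-window:]
    return sum(recent) / len(recent)


def reorder_point(daily_demand: float, lead_time_days: float,
                  safety_stock: float = 0.0) -> float:
    """Stock level at which a new order should be placed."""
    return daily_demand * lead_time_days + safety_stock


def suggested_order_quantity(current_stock: float, daily_demand: float,
                             lead_time_days: float, cover_days: float,
                             safety_stock: float = 0.0) -> float:
    """Units to order so stock covers the lead time plus `cover_days` of demand."""
    target = daily_demand * (lead_time_days + cover_days) + safety_stock
    return max(0.0, target - current_stock)


if __name__ == "__main__":
    history = [10, 12, 8, 10, 14, 6, 10, 20]
    demand = moving_average_forecast(history, window=4)
    print("forecast per day:", demand)
    print("reorder point:", reorder_point(demand, lead_time_days=4, safety_stock=10))
    print("order quantity:", suggested_order_quantity(40, demand, 4, 6, 10))
```

A moving average reacts slowly to seasonality, so a production forecast would likely need a model that accounts for the season field.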

3. Visualization Layer

Inventory vs Sales comparison charts
Demand forecast graphs with confidence intervals
Stock level trend analysis
Regional performance heatmaps

📈 Expected Outputs

Insights Generated:

Stockout Alerts: "Product X is likely to go out of stock in the next 10 days"
Seasonal Trends: "Category Y sales peak in the winter season"
Restock Recommendations: a suggested restock level for each product, based on forecasted demand
Slow-moving Inventory: "Product Z has low sales velocity - consider promotions"
Regional Insights: "Region A shows 30% higher demand for Category B"
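
One simple way to derive a stockout alert like the one quoted above is to divide current stock by expected daily demand. The sketch below assumes that approach; the function names and the 10-day default horizon are illustrative and are not taken from the project's Spark scripts.

```python
import math
from typing import Optional


def days_until_stockout(current_stock: float, daily_demand: float) -> float:
    """Days of stock remaining at the given demand rate (inf if no demand)."""
    if daily_demand <= 0:
        return math.inf
    return current_stock / daily_demand


def stockout_alert(product_id: str, current_stock: float, daily_demand: float,
                   horizon_days: int = 10) -> Optional[str]:
    """Return an alert message if stock runs out within the horizon, else None."""
    if days_until_stockout(current_stock, daily_demand) <= horizon_days:
        return "Product {} is likely to go out of stock in the next {} days".format(
            product_id, horizon_days)
    return None


if __name__ == "__main__":
    print(stockout_alert("X", current_stock=80, daily_demand=10))
    print(stockout_alert("Y", current_stock=500, daily_demand=10))
```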

🔧 Configuration

Spark Configuration

Edit config/spark_config.py to adjust:

Memory allocation
Number of executors
Parallelism level
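
For orientation, the settings listed above correspond to standard Spark configuration keys. The sketch below shows one possible shape for such a config module; the values and the helper to_submit_args are hypothetical and may not match the real config/spark_config.py.

```python
from typing import Dict, List

# Standard Spark property names; the values are placeholders to tune per cluster.
SPARK_CONF: Dict[str, str] = {
    "spark.driver.memory": "4g",           # memory allocation (driver)
    "spark.executor.memory": "4g",         # memory allocation (executors)
    "spark.executor.instances": "2",       # number of executors (ignored in local mode)
    "spark.default.parallelism": "8",      # default partition count for RDD operations
    "spark.sql.shuffle.partitions": "8",   # partition count for DataFrame shuffles
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render settings as repeated `--conf key=value` arguments for spark-submit."""
    args: List[str] = []
    for key in sorted(conf):
        args.extend(["--conf", "{}={}".format(key, conf[key])])
    return args


if __name__ == "__main__":
    print(" ".join(["spark-submit"] + to_submit_args(SPARK_CONF) + ["script.py"]))
```

Passing memory settings on the spark-submit command line, rather than setting them inside the application, matters in client mode because the driver JVM has already started by the time application code runs.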

Forecasting Parameters

Adjust the forecasting window and confidence levels in scripts/spark/demand_forecasting.py.

📝 Usage Examples

Query Fast-Moving Products

-- In Hive
SELECT product_id, category, SUM(sales_quantity) as total_sales
FROM retail_inventory
GROUP BY product_id, category
ORDER BY total_sales DESC
LIMIT 10;
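
The same query can be sanity-checked locally without a Hive installation. The sketch below runs it against an in-memory SQLite table from Python; SQLite is only a stand-in here, and the three-column table is a reduced, assumed version of the real retail_inventory schema.

```python
import sqlite3
from typing import Iterable, List, Tuple

QUERY = """
SELECT product_id, category, SUM(sales_quantity) AS total_sales
FROM retail_inventory
GROUP BY product_id, category
ORDER BY total_sales DESC
LIMIT 10;
"""


def top_products(rows: Iterable[Tuple[str, str, int]]) -> List[Tuple[str, str, int]]:
    """Load rows into an in-memory table and return the top sellers."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE retail_inventory ("
            "product_id TEXT, category TEXT, sales_quantity INTEGER)"
        )
        conn.executemany("INSERT INTO retail_inventory VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("P1", "Toys", 5), ("P2", "Toys", 20), ("P1", "Toys", 7), ("P3", "Food", 1)]
    for row in top_products(sample):
        print(row)
```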

Run Complete Pipeline

# Run the complete pipeline (PowerShell script in the repository root)
./run_pipeline.ps1

🧪 Testing

Verify Data Quality

# Check line count in HDFS (includes the CSV header row, if the file has one)
hdfs dfs -cat /user/retail/data/inventory/retail_inventory_data.csv | wc -l

# Verify Hive table data
hive -e "SELECT COUNT(*) FROM retail_inventory;"
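
Beyond row counts, a quick local check of the raw CSV can catch missing columns and blank values before the file is uploaded. The following is a minimal sketch using the standard library; the function name check_csv and the choice of checks are assumptions, not part of the project. Note that the line count from wc -l and the Hive COUNT(*) can differ by one if the file has a header row and the table skips it (for example via the skip.header.line.count table property).

```python
import csv
import io
from typing import Dict, List


def _is_blank(value: object) -> bool:
    """True for missing fields (None) and empty or whitespace-only strings."""
    return value is None or (isinstance(value, str) and not value.strip())


def check_csv(text: str, required_columns: List[str]) -> Dict[str, object]:
    """Count data rows, rows with blank fields, and missing required columns."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing_columns = [c for c in required_columns if c not in header]
    data_rows = 0
    rows_with_blanks = 0
    for row in reader:
        data_rows += 1
        if any(_is_blank(value) for value in row.values()):
            rows_with_blanks += 1
    return {
        "data_rows": data_rows,
        "rows_with_blanks": rows_with_blanks,
        "missing_columns": missing_columns,
    }


if __name__ == "__main__":
    sample = "product_id,category,sales_quantity\nP1,Toys,5\nP2,,3\n"
    print(check_csv(sample, ["product_id", "sales_quantity", "season"]))
```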

🐛 Troubleshooting

Common Issues

HDFS Connection Error

# Check if HDFS is running
jps
# Should show NameNode and DataNode

Hive Metastore Error

# Initialize metastore
schematool -initSchema -dbType derby

Spark Memory Issues

# Increase driver and executor memory
spark-submit --driver-memory 4g --executor-memory 4g script.py

🤝 Contributing

Feel free to contribute by:

Adding new analytical features
Improving forecasting algorithms
Enhancing visualizations
Optimizing performance

📄 License

This project is for educational purposes.

👤 Author

Big Data Engineer - Retail Analytics Team

📞 Support

For issues or questions, please create an issue in the repository.

Note: Ensure all Hadoop, Hive, and Spark services are properly configured and running before executing the pipeline.


Tharun007-TK/retail-inventory-bigdata-analytics
