This project implements a Retail Inventory Management System using Big Data technologies (Hadoop, Hive, and Apache Spark) to analyze and optimize stock levels for retail stores. The system provides demand forecasting, inventory optimization, and actionable insights to prevent stockouts and overstocking.
┌─────────────────────────────────────────────────────────────┐
│                        Data Sources                         │
│               (CSV Files - Retail Sales Data)               │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                     HDFS Storage Layer                      │
│              (Hadoop Distributed File System)               │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                    Data Processing Layer                    │
│                        (Apache Hive)                        │
│                    - External Tables                        │
│                    - HiveQL Queries for Insights            │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                       Analytics Layer                       │
│                  (Apache Spark - PySpark)                   │
│                    - Demand Forecasting                     │
│                    - Inventory Optimization                 │
│                    - Advanced Analytics                     │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│               Visualization & Reporting Layer               │
│            (Matplotlib/Plotly/Power BI/Tableau)             │
└─────────────────────────────────────────────────────────────┘

- Hadoop 3.x or later
- Apache Hive 3.x or later
- Apache Spark 3.x or later
- Python 3.7+
- Java 8 or 11
pip install pyspark pandas numpy matplotlib plotly scikit-learn

retail-inventory-management/
│
├── data/
│   ├── raw/
│   │   └── retail_inventory_data.csv
│   └── processed/
│       └── output files from Spark
│
├── scripts/
│   ├── data_generation/
│   │   └── generate_dataset.py
│   ├── hdfs/
│   │   └── load_to_hdfs.sh
│   ├── hive/
│   │   ├── create_tables.hql
│   │   └── analytical_queries.hql
│   └── spark/
│       ├── data_processing.py
│       ├── demand_forecasting.py
│       └── inventory_optimization.py
│
├── visualization/
│   └── dashboard.py
│
├── config/
│   └── spark_config.py
│
├── outputs/
│   └── results and reports
│
└── README.md

# Start HDFS
start-dfs.sh
# Start YARN (optional, for resource management)
start-yarn.sh
# Verify HDFS is running
hdfs dfsadmin -report

cd scripts/data_generation
python generate_dataset.py

This will create a sample retail inventory dataset with:
- Product IDs
- Categories
- Stock Quantity
- Sales Quantity
- Customer Region
- Season
- Date
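
To make the shape of that dataset concrete, here is a minimal sketch of how such rows could be generated with the Python standard library. It is not the contents of generate_dataset.py: the column names, category and region values, value ranges, and the helper names `season_for`, `generate_rows`, and `write_csv` are all assumptions chosen for illustration.

```python
import csv
import io
import random
from datetime import date, timedelta

# Hypothetical column names and value sets; the real script may use others.
COLUMNS = ["product_id", "category", "stock_quantity", "sales_quantity",
           "customer_region", "season", "date"]
CATEGORIES = ["Electronics", "Clothing", "Groceries"]
REGIONS = ["North", "South", "East", "West"]


def season_for(day):
    """Map a calendar month to a northern-hemisphere season label."""
    if day.month in (12, 1, 2):
        return "Winter"
    if day.month in (3, 4, 5):
        return "Spring"
    if day.month in (6, 7, 8):
        return "Summer"
    return "Autumn"


def generate_rows(n_rows, start=date(2024, 1, 1), seed=42):
    """Yield n_rows random records spread over one year from `start`."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    for _ in range(n_rows):
        day = start + timedelta(days=rng.randrange(365))
        yield {
            "product_id": f"P{rng.randrange(1, 101):04d}",
            "category": rng.choice(CATEGORIES),
            "stock_quantity": rng.randint(0, 500),
            "sales_quantity": rng.randint(0, 50),
            "customer_region": rng.choice(REGIONS),
            "season": season_for(day),
            "date": day.isoformat(),
        }


def write_csv(rows, handle):
    """Write the records as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv(generate_rows(5), buffer)
    print(buffer.getvalue())
```

Writing to a file handle instead of `io.StringIO` would produce a CSV that can be uploaded to HDFS in the next step.
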
# Create HDFS directories
hdfs dfs -mkdir -p /user/retail/data/inventory
# Upload data to HDFS
hdfs dfs -put data/raw/retail_inventory_data.csv /user/retail/data/inventory/
# Verify data upload
hdfs dfs -ls /user/retail/data/inventory/

# Start Hive CLI
hive
# Run the table creation script
source scripts/hive/create_tables.hql;
# Verify table creation
SHOW TABLES;
DESCRIBE retail_inventory;

# Execute analytical queries
hive -f scripts/hive/analytical_queries.hql

# Data Processing
spark-submit --master local[*] scripts/spark/data_processing.py
# Demand Forecasting
spark-submit --master local[*] scripts/spark/demand_forecasting.py
# Inventory Optimization
spark-submit --master local[*] scripts/spark/inventory_optimization.py

python visualization/dashboard.py

- Fast-moving vs Slow-moving Products: Identifies products based on sales velocity
- Seasonal Sales Trends: Analyzes sales patterns across seasons
- Regional Demand Analysis: Identifies high and low demand regions
- Demand Forecasting: Predicts future demand using time-series analysis
- Inventory Optimization: Suggests reorder quantities to maintain optimal stock levels
- Stock Risk Assessment: Identifies products at risk of stockout or overstock
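
As an illustration of two of these features, the sketch below classifies products by sales velocity and suggests a reorder quantity with a simple order-up-to rule (expected demand over lead time plus review period, plus safety stock). The thresholds, lead time, safety stock, and all names here are assumptions for the example, not values taken from the project's Spark scripts.

```python
import math
from dataclasses import dataclass


@dataclass
class ProductStats:
    product_id: str
    avg_daily_sales: float  # units sold per day
    current_stock: int      # units on hand


def classify_velocity(stats, fast_threshold=20.0, slow_threshold=5.0):
    """Label a product 'fast', 'slow', or 'medium' by average daily sales.

    The thresholds are hypothetical and would normally be tuned per category.
    """
    if stats.avg_daily_sales >= fast_threshold:
        return "fast"
    if stats.avg_daily_sales <= slow_threshold:
        return "slow"
    return "medium"


def reorder_quantity(stats, lead_time_days=7, review_days=14, safety_stock=10):
    """Units to order so stock covers lead time plus review period.

    target = avg_daily_sales * (lead_time_days + review_days) + safety_stock
    The result is never negative: overstocked products get 0.
    """
    target = stats.avg_daily_sales * (lead_time_days + review_days) + safety_stock
    return max(0, math.ceil(target - stats.current_stock))


if __name__ == "__main__":
    for item in (ProductStats("P0001", 10.0, 50), ProductStats("P0002", 2.0, 500)):
        print(item.product_id, classify_velocity(item), reorder_quantity(item))
```
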
- Inventory vs Sales comparison charts
- Demand forecast graphs with confidence intervals
- Stock level trend analysis
- Regional performance heatmaps
- Stockout Alerts: "Product X likely to go out of stock in next 10 days"
- Seasonal Trends: "Category Y sales peak in the winter season"
- Restock Recommendations: "Restock level suggestion for each product based on forecasted demand"
- Slow-moving Inventory: "Product Z has low sales velocity - consider promotions"
- Regional Insights: "Region A shows 30% higher demand for Category B"
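
A stockout alert like the first example can be derived from days of cover, that is, current stock divided by average daily sales. The following is a rough sketch under that assumption; the 10-day horizon mirrors the example message, while the function names and the input layout are hypothetical.

```python
def days_until_stockout(current_stock, avg_daily_sales):
    """Estimate days of cover; products with no sales never run out."""
    if avg_daily_sales <= 0:
        return float("inf")
    return current_stock / avg_daily_sales


def stockout_alerts(inventory, horizon_days=10):
    """Return alert messages for products expected to run out within the horizon.

    `inventory` maps product_id -> (current_stock, avg_daily_sales).
    """
    alerts = []
    for product_id, (stock, rate) in sorted(inventory.items()):
        days = days_until_stockout(stock, rate)
        if days <= horizon_days:
            alerts.append(
                f"{product_id} likely to go out of stock in about {days:.0f} days"
            )
    return alerts


if __name__ == "__main__":
    sample = {"Product X": (40, 5.0), "Product Z": (300, 1.0)}
    for message in stockout_alerts(sample):
        print(message)
```
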
Edit config/spark_config.py to adjust:
- Memory allocation
- Number of executors
- Parallelism level
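
One possible layout for such a configuration module is sketched below. The property names are standard Spark configuration keys, but the values are placeholders and the `SPARK_SETTINGS` and `to_submit_args` names are assumptions; the project's actual config/spark_config.py may be organized differently.

```python
# Placeholder values; tune them to the memory and cores actually available.
SPARK_SETTINGS = {
    "spark.driver.memory": "4g",           # memory allocation (driver)
    "spark.executor.memory": "4g",         # memory allocation (executors)
    "spark.executor.instances": "2",       # number of executors
    "spark.default.parallelism": "8",      # parallelism for RDD operations
    "spark.sql.shuffle.partitions": "8",   # parallelism for DataFrame shuffles
}


def to_submit_args(settings):
    """Render settings as a list of `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(["spark-submit"] + to_submit_args(SPARK_SETTINGS)))
```

Passing the settings on the command line is a deliberate choice in this sketch: in client mode, driver memory generally has to be supplied at launch rather than set from inside the running application.
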
Adjust forecasting window and confidence levels in scripts/spark/demand_forecasting.py
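
To show what a forecasting window and a confidence level control, here is a small standard-library sketch that uses a moving average with a normal-approximation interval. This is a stand-in technique for illustration, not necessarily the method used in demand_forecasting.py, and the names `forecast_next_day`, `window`, and `confidence` are assumptions.

```python
from statistics import mean, stdev

# Two-sided z-scores for a few common confidence levels.
Z_SCORES = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}


def forecast_next_day(daily_sales, window=7, confidence=0.95):
    """Forecast the next day's demand from the last `window` observations.

    Returns (point, lower, upper). The point forecast is the moving average;
    the interval assumes roughly normal day-to-day variation.
    """
    if confidence not in Z_SCORES:
        raise ValueError(f"supported confidence levels: {sorted(Z_SCORES)}")
    recent = daily_sales[-window:]
    if len(recent) < 2:
        raise ValueError("need at least two observations in the window")
    point = mean(recent)
    margin = Z_SCORES[confidence] * stdev(recent)
    # Demand cannot be negative, so clamp the lower bound at zero.
    return point, max(0.0, point - margin), point + margin


if __name__ == "__main__":
    history = [10, 12, 11, 13, 9, 10, 12]
    point, lower, upper = forecast_next_day(history, window=7, confidence=0.95)
    print(f"forecast={point:.1f}, interval=({lower:.1f}, {upper:.1f})")
```

A longer window smooths out noise but reacts more slowly to trend changes, and a higher confidence level widens the interval.
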
-- In Hive
SELECT product_id, category, SUM(sales_quantity) as total_sales
FROM retail_inventory
GROUP BY product_id, category
ORDER BY total_sales DESC
LIMIT 10;

# Run the complete pipeline
./run_pipeline.sh

# Check record count in HDFS
hdfs dfs -cat /user/retail/data/inventory/retail_inventory_data.csv | wc -l
# Verify Hive table data
hive -e "SELECT COUNT(*) FROM retail_inventory;"

- HDFS Connection Error
  # Check if HDFS is running
  jps  # Should show NameNode and DataNode
- Hive Metastore Error
  # Initialize metastore
  schematool -initSchema -dbType derby
- Spark Memory Issues
  # Increase driver and executor memory
  spark-submit --driver-memory 4g --executor-memory 4g script.py

Feel free to contribute by:
- Adding new analytical features
- Improving forecasting algorithms
- Enhancing visualizations
- Optimizing performance
This project is for educational purposes.
Big Data Engineer - Retail Analytics Team
For issues or questions, please create an issue in the repository.
Note: Ensure all Hadoop, Hive, and Spark services are properly configured and running before executing the pipeline.