This project provides a containerized Hadoop ecosystem with integrated Big Data processing tools, designed for learning, development, and small-scale testing.
🚀 Stack Versions
| Component | Version |
|-----------|---------|
| Hadoop    | 2.7.4   |
| Spark     | 2.4.5   |
| Hive      | 2.3.2   |
| Pig       | 0.17.0  |
| Tez       | 0.9.2   |
| Zeppelin  | 0.9.0   |
🎯 Project Objectives
Configure a ready-to-use distributed Big Data environment
Perform analytical processing with Hive, Pig, Spark, and Tez
Compare performance across different processing engines
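One way to approach the comparison objective is to time the same job under each engine and compare wall-clock results. The following is a minimal sketch, not part of this project: `time_command` and `summarize` are hypothetical helper names, and the command shown is only a placeholder that must be replaced with the real job invocation for each engine.

```python
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command fails,
        # so a broken job is never recorded as a fast one.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (best, mean) of a list of timings in seconds."""
    return min(timings), sum(timings) / len(timings)


if __name__ == "__main__":
    # Placeholder command (an assumption): swap in the actual job
    # invocation for the engine being measured.
    cmd = [sys.executable, "-c", "pass"]
    best, mean = summarize(time_command(cmd))
    print(f"best={best:.3f}s mean={mean:.3f}s")
```

Running each job several times and reporting both the best and the mean helps separate engine behavior from one-off effects such as container warm-up. For Hive queries, the engine is normally selected with the `hive.execution.engine` property, so the same query can be timed once per setting.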
CREATE DATABASE customer_db;
USE customer_db;
CREATE TABLE users (
    user_id INT,
    user_name STRING,
    email STRING,
    age INT,
    country STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA INPATH '/input/customers-data/users.csv' INTO TABLE users;
SELECT * FROM users LIMIT 5;
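To make the effect of `ROW FORMAT DELIMITED FIELDS TERMINATED BY ','` more concrete, here is a rough Python approximation of how one text line maps onto the `users` columns. It is a sketch under stated assumptions rather than Hive's actual implementation: `parse_row` is a hypothetical helper, the sample rows in the usage note are made up, and details such as whitespace handling may differ from Hive.

```python
from typing import List, Optional, Tuple

# Column names and types copied from the CREATE TABLE statement above.
SCHEMA: List[Tuple[str, str]] = [
    ("user_id", "INT"),
    ("user_name", "STRING"),
    ("email", "STRING"),
    ("age", "INT"),
    ("country", "STRING"),
]


def parse_row(line: str, delimiter: str = ",") -> dict:
    """Approximate how a delimited text line maps onto the users table."""
    # Plain split with no quote handling: a comma inside a quoted value
    # still starts a new field.
    fields = line.rstrip("\n").split(delimiter)
    row = {}
    for i, (name, col_type) in enumerate(SCHEMA):
        raw: Optional[str] = fields[i] if i < len(fields) else None
        if raw is None:
            row[name] = None  # missing trailing field becomes NULL
        elif col_type == "INT":
            try:
                row[name] = int(raw)
            except ValueError:
                row[name] = None  # unparsable number becomes NULL
        else:
            row[name] = raw
    return row
```

For example, `parse_row("1,Alice,alice@example.com,30,France")` yields typed values for all five columns, while a header line such as `user_id,user_name,email,age,country` produces `None` for `user_id` and `age`. This is why a CSV header, if the file has one, tends to show up as a row with NULL numeric columns in the `SELECT` output.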
Pig Jobs
Open Pig container:
docker compose exec -it pig bash
pig
Run Pig script:
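-- Field names bind by position, so the schema lists all five columns of the Hive users table.
users = LOAD '/input/customers-data/users.csv' USING PigStorage(',')
    AS (user_id:int, user_name:chararray, email:chararray, age:int, country:chararray);
DUMP users;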
Apache Hadoop: Distributed storage (HDFS) and cluster resource management (YARN)
Apache Spark: Fast in-memory processing
Apache Hive: SQL-style querying over data stored in HDFS
Apache Pig: Script-based data transformations
Apache Tez: DAG-based execution engine that speeds up Hive and Pig jobs on YARN
About
A Dockerized Hadoop ecosystem featuring HDFS, Spark, Hive, Tez, Pig, and Zeppelin for distributed data processing and analytics. It simplifies learning about and experimenting with big data frameworks.