This guide helps you set up a multi-container environment using Docker for Hadoop (HDFS), Spark, and Hive. The setup is lightweight, without the large memory requirements of a Cloudera sandbox.
Before you begin, ensure you have the following installed:
- Docker: Install Docker Desktop (or Docker Engine with Docker Compose).
  IMPORTANT: If you are using Docker Desktop, enable the "Expose daemon on tcp://localhost:2375 without TLS" option for compatibility.
- Git: Download and install Git. It is used to clone the repository that contains the required files.
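You can verify both installations from a terminal before continuing (exact version numbers will vary):

docker --version          # Docker client version
docker-compose --version  # or: docker compose version
git --version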
Create a new folder and open a terminal in it (or navigate into it with the cd command).
First, clone the GitHub repository that contains the necessary Docker setup files.
git clone https://github.com/lovnishverma/bigdataecosystem.git
Alternatively, download the repository as a ZIP file directly from the repo and extract it.
Navigate to the directory:
cd bigdataecosystem
If you downloaded the ZIP instead, the folder name is different: cd bigdataecosystem-main
Use Docker Compose to start the containers in the background.
docker-compose up -d
This command will launch the Hadoop, Spark, and Hive containers.
To check if the containers are running, use the following command:
docker ps
When you are done, stop and remove the containers with:
docker-compose down
Enter the NameNode container to interact with Hadoop:
docker exec -it namenode bash
To view the NameNode UI, visit http://localhost:9870/
To view the YARN ResourceManager UI, visit http://localhost:8088/
From your host terminal (for example, Windows CMD or PowerShell), copy the code folder into the container:
docker cp code namenode:/
Inside the container, navigate to the /code/input directory, where the data.txt file is located.
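A quick check that the files are in place (assuming the folder was copied to /code as shown above):

cd /code/input
ls    # should list data.txt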
Run the following commands to set up directories in Hadoop's file system:
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/root
hdfs dfs -mkdir /user/root/input
Copy data.txt
into the Hadoop file system:
hdfs dfs -put /code/input/data.txt /user/root/input
Return to the directory where the wordCount.jar file is located:
cd /code/
You can verify the uploaded file in the NameNode UI at http://localhost:9870/ (Utilities → Browse the file system).
Run the WordCount program to process the input data:
hadoop jar wordCount.jar org.apache.hadoop.examples.WordCount input output
While the job runs, you can monitor it in the YARN ResourceManager UI at http://localhost:8088/
View the results of the WordCount program:
hdfs dfs -cat /user/root/output/*
or
hdfs dfs -cat /user/root/output/part-r-00000
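To see exactly which files the job produced (typically a _SUCCESS marker plus one part-r-* file per reducer), list the output directory:

hdfs dfs -ls /user/root/output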
This guide simplifies setting up and running Hadoop on Docker. Each step ensures a smooth experience, even for beginners without a technical background. Follow the instructions carefully, and you’ll have a working Hadoop setup in no time!
Here is an explanation of the MapReduce process using the example input lines DOG CAT RAT, CAR CAR RAT, and DOG CAR CAT.
The data.txt file contains the following lines:
DOG CAT RAT
CAR CAR RAT
DOG CAR CAT
This text file is processed by the MapReduce WordCount program to count the occurrences of each word.
- MapReduce is a two-step process:
- Map Phase 🗺️: Splits the input into key-value pairs.
- Reduce Phase ➕: Combines the key-value pairs to produce the final result.
It's like dividing a big task (word counting) into smaller tasks and then combining the results. 🧩
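As a rough shell analogy (not how Hadoop actually executes the job, just to illustrate the map → shuffle/sort → reduce flow), you can reproduce the same counts locally:

cat data.txt | tr ' ' '\n' | sort | uniq -c   # split into words (map), group identical words (shuffle/sort), count each group (reduce)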
The mapper processes each line of the input file, splits it into words, and assigns each word a count of 1.
For example:
DOG CAT RAT -> (DOG, 1), (CAT, 1), (RAT, 1)
CAR CAR RAT -> (CAR, 1), (CAR, 1), (RAT, 1)
DOG CAR CAT -> (DOG, 1), (CAR, 1), (CAT, 1)
Mapper Output:
(DOG, 1), (CAT, 1), (RAT, 1)
(CAR, 1), (CAR, 1), (RAT, 1)
(DOG, 1), (CAR, 1), (CAT, 1)
The shuffle and sort step then groups all values for the same key (word) together and sorts them.
For example:
(CAR, [1, 1, 1])
(CAT, [1, 1])
(DOG, [1, 1])
(RAT, [1, 1])
The reducer sums up the counts for each word to get the total number of occurrences.
Reducer Output:
CAR 3 🏎️
CAT 2 🐱
DOG 2 🐶
RAT 2 🐭
The final word count is saved in the HDFS output directory. You can view it using:
hdfs dfs -cat /user/root/output/*
Result:
CAR 3
CAT 2
DOG 2
RAT 2
Here are the basic HDFS commands you used and their purpose:
- Upload a file to HDFS 📤:
  hdfs dfs -put data.txt /user/root/input
  What it does: uploads data.txt to the HDFS directory /user/root/input.
  Output: none, but the file is now in HDFS.
- List files in a directory 📁:
  hdfs dfs -ls /user/root/input
  What it does: lists all files in the /user/root/input directory.
  Output: something like this:
  Found 1 items
  -rw-r--r--   1 root supergroup         50 2024-12-12 /user/root/input/data.txt
- View the contents of a file 📄:
  hdfs dfs -cat /user/root/input/data.txt
  What it does: displays the contents of the data.txt file in HDFS.
  Output:
  DOG CAT RAT
  CAR CAR RAT
  DOG CAR CAT
- Run the MapReduce job 🚀:
  hadoop jar wordCount.jar org.apache.hadoop.examples.WordCount input output
  What it does: runs the WordCount program on the input directory and saves the result in the output directory.
- View the final output 📊:
  hdfs dfs -cat /user/root/output/*
  What it does: displays the word count results.
  Output:
  CAR 3
  CAT 2
  DOG 2
  RAT 2
- Input: you uploaded a small text file (data.txt) to HDFS.
- Process: the WordCount program processed the file using MapReduce: the mapper broke the file into words and counted each occurrence, and the reducer aggregated the counts for each word.
- Output: the results were saved in HDFS and displayed using the cat command.
To recap the whole pipeline for this input:

Input lines:
DOG CAT RAT
CAR CAR RAT
DOG CAR CAT

Mapper output:
(DOG, 1), (CAT, 1), (RAT, 1)
(CAR, 1), (CAR, 1), (RAT, 1)
(DOG, 1), (CAR, 1), (CAT, 1)

Grouped after shuffle and sort:
(CAR, [1, 1, 1])
(CAT, [1, 1])
(DOG, [1, 1])
(RAT, [1, 1])

Reducer output:
CAR 3
CAT 2
DOG 2
RAT 2
- MapReduce splits the task into small, manageable pieces and processes them in parallel.
- It’s ideal for large datasets but works the same for smaller ones (like your example).
- Hadoop is designed for distributed systems, making it powerful for big data processing.
To stop the Docker containers when done:
docker-compose down
This will stop and remove the containers and networks created by docker-compose up.
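If you also want to delete the named volumes created by the stack (and with them the stored HDFS data), add the -v flag:

docker-compose down -v   # also removes volumes declared in the compose file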
If you face permission issues while copying files to containers, fix the directory permissions inside the container:
docker exec -it namenode bash
chmod -R 777 /your-directory
Sometimes containers do not start or throw errors related to the Hadoop configuration. A common cause is insufficient memory allocated to Docker, so check the container state and logs first.
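Standard Docker commands for a first diagnosis:

docker ps -a              # see which containers exited and with what status
docker logs namenode      # inspect a container's startup logs (replace namenode with any container name)
docker stats --no-stream  # check memory usage of the running containers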
The output of the WordCount job is written to /user/root/output/ in HDFS. Note that Hadoop refuses to run the job if the output directory already exists, so delete the old output directory before re-running the job.
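To re-run the job cleanly:

hdfs dfs -rm -r /user/root/output   # remove the previous output first
hadoop jar wordCount.jar org.apache.hadoop.examples.WordCount input output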
- Network Issues: If you can't access the NameNode UI, ensure that the Docker container's ports are correctly exposed. When running on a local machine, the UI should be reachable at http://localhost:9870.
- Stopping Containers:
  docker-compose down  # Stop and remove the containers
- Permissions Fix:
  docker exec -it namenode bash
  chmod -R 777 /your-directory  # If you face any permission errors
- Handling HDFS Directory Creation: If hdfs dfs -mkdir gives an error, it is usually because the directory already exists. Either remove it first or use the -p flag, which does not fail on existing directories:
  hdfs dfs -rm -r /user/root/input    # If the directory exists, remove it first
  hdfs dfs -mkdir -p /user/root/input
😊 References
https://data-flair.training/blogs/top-hadoop-hdfs-commands-tutorial/
https://medium.com/@traininghub.io/hadoop-mapreduce-architecture-7e167e264595
To copy a file (e.g., police.csv) to the Hadoop cluster:
- Copy the file into the namenode container:
docker cp police.csv namenode:/police.csv
- Access the namenode container's bash shell:
docker exec -it namenode bash
- Create a directory in HDFS and upload the file:
hdfs dfs -mkdir -p /data/crimerecord/police
hdfs dfs -put /police.csv /data/crimerecord/police/
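You can confirm the upload before moving on:

hdfs dfs -ls /data/crimerecord/police/   # should list police.csv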
To interact with Spark, start the Spark shell in the master container:
docker exec -it spark-master bash
spark/bin/spark-shell --master spark://spark-master:7077
- Open http://localhost:8080 in your web browser to view the Spark Master UI.
- You can monitor workers and running applications there.
- Overview: Apache Spark is an open-source distributed computing system known for its speed, ease of use, and general-purpose capabilities for big data processing.
- Key Features:
  - Fast processing using in-memory computation.
  - Supports multiple languages: Scala, Python, Java, and R.
  - Unified framework for batch and streaming data processing.
- What are DataFrames?
  - Distributed collections of data organized into named columns, similar to a table in a database or a DataFrame in Python's pandas.
  - Optimized for processing large datasets using Spark SQL.
- Key Operations:
  - Creating DataFrames from structured data sources (CSV, JSON, Parquet, etc.).
  - Performing transformations and actions on the data.
- Why Scala?
  - Apache Spark is written in Scala, offering the best compatibility and performance.
  - Concise syntax and functional programming support.
- Basic Syntax:
  val numbers = List(1, 2, 3, 4, 5)  // Creates a list of numbers.
  val doubled = numbers.map(_ * 2)   // Doubles each element in the list using map.
  println(doubled)                   // Prints the doubled list.
  The output will be: List(2, 4, 6, 8, 10)
- Need for Spark SQL:
  - Provides a declarative interface to query structured data using SQL-like syntax.
  - Supports seamless integration with other Spark modules.
  - Allows for optimization through the Catalyst Optimizer.
- Key Components:
  - SQL queries on DataFrames and temporary views.
  - Hive integration for legacy SQL workflows.
  - Support for structured data sources.
The following example creates a DataFrame, registers it as a temporary view, runs a SQL query against it, and writes the result to HDFS:
val data = Seq(
("Alice", 30, "HR"),
("Bob", 25, "Engineering"),
("Charlie", 35, "Finance")
)
val df = data.toDF("Name", "Age", "Department")
df.show()
df.createOrReplaceTempView("employees")
val result = spark.sql("SELECT Department, COUNT(*) as count FROM employees GROUP BY Department")
result.show()
result.write.option("header", "true").csv("hdfs://namenode:9000/output_employees")
Reading from HDFS: Once the data is written to HDFS, you can read it back into Spark using:
val outputDF = spark.read.option("header", "true").csv("hdfs://namenode:9000/output_employees")
outputDF.show()   // View the output_employees data read back from HDFS
// Load CSV from HDFS
val df = spark.read.option("header", "false").csv("hdfs://namenode:9000/data/crimerecord/police/police.csv")
df.show()
Copy File: Use docker cp to copy the file into one of the Hadoop containers. Use the following command to copy the data.txt file from your local system to the container (the nodemanager container is used here; the namenode also works):
docker cp data.txt nodemanager:/data.txt
Copy File to HDFS: Use hdfs dfs -put to move the file into the HDFS filesystem. Inside the container (docker exec -it nodemanager bash), run the following to put the data.txt file into HDFS:
hdfs dfs -mkdir /data
hdfs dfs -put data.txt /data
Scala WordCount program: the program reads the file from HDFS, splits it into words, and counts the occurrences of each word. Run the following lines in the Spark shell:
// spark-shell already provides a SparkContext as sc, so no extra SparkConf or SparkContext is needed here.
val input = sc.textFile("hdfs://namenode:9000/data/data.txt")
val wordPairs = input.flatMap(line => line.split(" ")).map(word => (word, 1))
val wordCounts = wordPairs.reduceByKey((a, b) => a + b)
wordCounts.collect().foreach { case (word, count) =>
println(s"$word: $count")
}
Output: The word counts will be printed to the console when the program is executed.
Stop Session:
sc.stop()
- Spark SQL simplifies working with structured data.
- DataFrames provide a flexible and powerful API for handling large datasets.
- Apache Spark is a versatile tool for distributed data processing, offering scalability and performance.
Access the Hive container and start the Hive CLI:
docker exec -it hive-server bash
hive
Check if Hive is listening on port 10000:
netstat -anp | grep 10000
Use Beeline to connect to the Hive server:
beeline -u jdbc:hive2://localhost:10000 -n root
Alternatively, use the following command for direct connection:
beeline
!connect jdbc:hive2://127.0.0.1:10000 scott tiger
- Create a new Hive database:
CREATE DATABASE punjab_police;
USE punjab_police;
- Create a table based on the schema of the police.csv dataset:
  CREATE TABLE police_data (
    Crime_ID INT,
    Crime_Type STRING,
    Location STRING,
    Reported_Date STRING,
    Status STRING
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;
- Load the data into the Hive table:
LOAD DATA INPATH '/data/crimerecord/police/police.csv' INTO TABLE police_data;
Run SQL queries to analyze the data in Hive:
- View the top 10 rows:
SELECT * FROM police_data LIMIT 10;
- Count total crimes:
SELECT COUNT(*) AS Total_Crimes FROM police_data;
- Find the most common crime types:
  SELECT Crime_Type, COUNT(*) AS Occurrences FROM police_data GROUP BY Crime_Type ORDER BY Occurrences DESC;
- Identify locations with the highest crime rates:
  SELECT Location, COUNT(*) AS Total_Crimes FROM police_data GROUP BY Location ORDER BY Total_Crimes DESC;
- Find unresolved cases:
SELECT Status, COUNT(*) AS Count FROM police_data WHERE Status != 'Closed' GROUP BY Status;
There you go: your private Hive server to play with.
show databases;
We will:
- Create a database.
- Create a table inside the database.
- Load data into the table.
- Run queries to retrieve data.
In the Beeline CLI:
CREATE DATABASE mydb;
USE mydb;
- 📝 mydb is the name of the database. Replace it with your preferred name.
Still in the Beeline CLI, create a simple table:
CREATE TABLE employees (
id INT,
name STRING,
age INT
);
- This creates a table named employees with columns id, name, and age.
Insert sample data into your table:
INSERT INTO employees VALUES (1, 'Prince', 30);
INSERT INTO employees VALUES (2, 'Ram Singh', 25);
Retrieve data from your table:
SELECT * FROM employees;
- Output:
+----+----------+-----+
| id | name | age |
+----+----------+-----+
| 2 | Ram Singh | 25 |
| 1 | Prince | 30 |
+----+----------+-----+
- What is Hive?
  - Hive is a data warehouse tool built on top of Hadoop.
  - It allows SQL-like querying over large datasets.
- Why Docker for Hive?
  - Simplifies setup by avoiding manual configuration.
  - Provides a pre-configured environment for running Hive.
- Beeline CLI:
  - A lightweight command-line tool for running Hive queries.
- Use Cases:
  - Data analysis: run analytics on large datasets.
  - ETL: extract, transform, and load data into your Hadoop ecosystem.
You’ve successfully:
- Set up Apache Hive.
- Created and queried a sample project. 🐝
Here is a collection of basic Apache Hive commands with explanations that can help you while working with Hive:
- Show Databases: Displays all the databases available in your Hive environment.
  SHOW DATABASES;
- Create a Database: Creates a new database.
  CREATE DATABASE <database_name>;
  Example:
  CREATE DATABASE mydb;
In Hive, you can find out which database you are currently using by running:
SELECT current_database();
This returns the name of the database currently in use. To switch to a specific database, or to make the database context explicit before running your queries, use:
USE database_name;
- Use a Database: Switches to the specified database.
  USE <database_name>;
  Example:
  USE mydb;
- Drop a Database: Deletes a database and its associated data.
  DROP DATABASE <database_name>;
- Show Tables: Lists all the tables in the current database.
  SHOW TABLES;
- Create a Table: Defines a new table with specific columns.
  CREATE TABLE <table_name> (
    column_name column_type,
    ...
  );
  Example:
  CREATE TABLE employees (
    id INT,
    name STRING,
    age INT
  );
- Describe a Table: Gets detailed information about a table, including column names and types.
  DESCRIBE <table_name>;
- Drop a Table: Deletes a table and its associated data.
  DROP TABLE <table_name>;
- Alter a Table: Modifies a table's structure, for example by adding new columns.
  ALTER TABLE <table_name> ADD COLUMNS (<new_column> <type>);
  Example:
  ALTER TABLE employees ADD COLUMNS (salary DOUBLE);
- Insert Data: Inserts data into a table.
  INSERT INTO <table_name> VALUES (<value1>, <value2>, ...);
  Example:
  INSERT INTO employees VALUES (1, 'John Doe', 30);
  You can also insert several rows in one statement:
  INSERT INTO employees VALUES (1, 'Prince', 30), (2, 'Ram Singh', 25), (3, 'John Doe', 28), (4, 'Jane Smith', 32);
- Select Data: Retrieves data from a table.
  SELECT * FROM <table_name>;
- Update Data: Updates existing data in a table (only supported on transactional/ACID tables).
  UPDATE <table_name> SET <column_name> = <new_value> WHERE <condition>;
- Delete Data: Deletes rows from a table based on a condition (only supported on transactional/ACID tables).
  DELETE FROM <table_name> WHERE <condition>;
- Select Specific Columns: Retrieves specific columns from a table.
  SELECT <column1>, <column2> FROM <table_name>;
- Filtering Data: Filters data based on conditions using the WHERE clause.
  SELECT * FROM <table_name> WHERE <column_name> <operator> <value>;
  Example:
  SELECT * FROM employees WHERE age > 25;
- Sorting Data: Sorts the result by a column in ascending or descending order.
  SELECT * FROM <table_name> ORDER BY <column_name> ASC|DESC;
  Example:
  SELECT * FROM employees ORDER BY age DESC;
  SELECT * FROM employees ORDER BY age ASC;
- Group By: Groups data by one or more columns and aggregates it using functions like COUNT, AVG, SUM, etc.
  SELECT <column_name>, COUNT(*) FROM <table_name> GROUP BY <column_name>;
  Example:
  SELECT age, COUNT(*) FROM employees GROUP BY age;
- Create External Table: Creates a table that references data stored externally (e.g., in HDFS).
  CREATE EXTERNAL TABLE <table_name> (<column_name> <data_type>, ...)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '<delimiter>'
  LOCATION '<file_path>';
  Example:
  CREATE EXTERNAL TABLE employees (
    id INT,
    name STRING,
    age INT
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/hive/warehouse/employees';
- Load Data into Table: Loads data from a file into an existing Hive table.
  LOAD DATA LOCAL INPATH '<file_path>' INTO TABLE <table_name>;
- Show Current User: Displays the current user running the Hive session.
  !whoami;
- Exit Hive: Exits from the Hive shell.
  EXIT;
- Set Hive Variables: Sets Hive session variables.
  SET <variable_name>=<value>;
- Show Hive Variables: Displays all the set variables.
  SET;
- Show the Status of Hive Jobs: Displays the status of running queries.
  SHOW JOBS;
- Partitioning Tables: When dealing with large datasets, partitioning your tables can help improve query performance.
  CREATE TABLE sales (id INT, amount DOUBLE) PARTITIONED BY (year INT, month INT);
- Bucketing: Bucketing splits your data into a fixed number of files, or "buckets."
  CREATE TABLE sales (id INT, amount DOUBLE) CLUSTERED BY (id) INTO 4 BUCKETS;
- Optimization: Use columnar formats like ORC or Parquet for efficient storage and performance.
  CREATE TABLE sales (id INT, amount DOUBLE) STORED AS ORC;
These basic commands will help you interact with Hive and perform common operations like creating tables, querying data, and managing your Hive environment efficiently.
While Hive and MySQL both use SQL-like syntax for querying data, there are some key differences in their commands, especially since Hive is designed for querying large datasets in a Hadoop ecosystem, while MySQL is a relational database management system (RDBMS).
Here is a comparison of Hive and MySQL commands for common operations:
- Hive:
  CREATE DATABASE mydb;
- MySQL:
  CREATE DATABASE mydb;
Both Hive and MySQL use the same syntax to create a database.
- Hive:
  USE mydb;
- MySQL:
  USE mydb;
The syntax is the same for selecting a database in both systems.
- Hive:
  CREATE TABLE employees (
    id INT,
    name STRING,
    age INT
  );
- MySQL:
  CREATE TABLE employees (
    id INT,
    name VARCHAR(255),
    age INT
  );
Differences:
- In Hive, STRING is used for text data, while in MySQL, VARCHAR is used.
- Hive also has specialized data types for distributed storage and performance, such as ARRAY, MAP, and STRUCT.
- Hive:
  INSERT INTO employees VALUES (1, 'John', 30);
  INSERT INTO employees VALUES (2, 'Alice', 25);
- MySQL:
  INSERT INTO employees (id, name, age) VALUES (1, 'John', 30);
  INSERT INTO employees (id, name, age) VALUES (2, 'Alice', 25);
Differences:
- Hive allows a direct INSERT INTO with values, while MySQL typically lists the column names explicitly in the insert statement (though this is optional in MySQL if the values match the column order).
- Hive:
  SELECT * FROM employees;
- MySQL:
  SELECT * FROM employees;
Querying data with SELECT is identical in both systems.
- Hive: Hive doesn't support traditional UPDATE or DELETE commands directly, as it is optimized for batch processing and is more suited to append operations. However, it does support INSERT and INSERT OVERWRITE.
  Example of replacing data:
  INSERT OVERWRITE TABLE employees SELECT * FROM employees WHERE age > 30;
- MySQL:
  UPDATE employees SET age = 31 WHERE id = 1;
  DELETE FROM employees WHERE id = 2;
Differences:
- Hive does not allow direct UPDATE or DELETE by default; instead, it uses INSERT OVERWRITE to modify data in batch operations (row-level updates require transactional/ACID tables).
- Hive:
  DROP TABLE IF EXISTS employees;
- MySQL:
  DROP TABLE IF EXISTS employees;
The syntax for dropping tables is the same in both systems.
- Hive: Hive is designed to run on large datasets using the Hadoop Distributed File System (HDFS), so it focuses on batch processing rather than real-time queries. Query performance in Hive may be slower than MySQL because it is optimized for scale, not for low-latency transaction processing.
- MySQL: MySQL is an RDBMS designed to handle transactional workloads with low-latency queries. It is better suited to OLTP (Online Transaction Processing) than OLAP (Online Analytical Processing) workloads.
- Hive: Hive doesn't support traditional indexing as MySQL does. However, you can create partitioned or bucketed tables in Hive to improve query performance for certain types of data.
- MySQL: MySQL supports indexes (e.g., PRIMARY KEY, UNIQUE, INDEX) to speed up query performance on large datasets.
- Hive:
  SELECT a.id, a.name, b.age FROM employees a JOIN employee_details b ON a.id = b.id;
- MySQL:
  SELECT a.id, a.name, b.age FROM employees a JOIN employee_details b ON a.id = b.id;
The syntax for JOIN is the same in both systems.
- Data Types: Hive uses types like STRING, TEXT, and BOOLEAN, while MySQL uses types like VARCHAR, CHAR, and TEXT.
- Data Modification: Hive does not support UPDATE or DELETE in the traditional way, and is generally used for batch processing.
- Performance: Hive is designed for querying large-scale datasets in Hadoop, so queries tend to be slower than MySQL.
- Indexing: Hive does not natively support indexing but can use partitioning and bucketing for performance optimization. MySQL supports indexing for faster queries.
- ACID Properties: MySQL supports full ACID compliance for transactional systems, whereas Hive is not transactional by default (but can support limited ACID features starting from version 0.14 with certain configurations).
In conclusion, while Hive and MySQL share SQL-like syntax, they are designed for very different use cases, and not all commands work the same way in both systems.
Export the query results to a file for analysis in visualization tools (hive -e writes tab-separated text by default):
hive -e "SELECT * FROM police_data;" > police_analysis_results.csv
You can use tools like Tableau, Excel, or Python (Matplotlib, Pandas) for data visualization.
If you need to customize the configuration, you can specify parameters in the hadoop.env file or as environment variables for the services (e.g., namenode, datanode, etc.). For example:
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
This will be transformed into the following property in the core-site.xml file:
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode:8020</value>
</property>
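The environment-variable prefix selects the target file (CORE_CONF_ for core-site.xml, and, assuming these images follow the common big-data-europe convention, HDFS_CONF_ for hdfs-site.xml and YARN_CONF_ for yarn-site.xml), and underscores in the rest of the name map to dots in the property name. A hypothetical example under that assumption:

HDFS_CONF_dfs_replication=1   # would become dfs.replication = 1 in hdfs-site.xml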
You now have a fully functional Hadoop, Spark, and Hive cluster running in Docker. This environment is great for experimenting with big data processing and analytics in a lightweight, containerized setup.
I hope you have fun with this Hadoop-Spark-Hive cluster.