This guide helps you set up a multi-container environment using Docker for Hadoop (HDFS), Spark, and Hive. The setup is lightweight, without the large memory requirements of a Cloudera sandbox.
Before you begin, ensure you have the following installed:
- Docker: Install Docker Desktop (or Docker Engine with Docker Compose).
  IMPORTANT: If you are using Docker Desktop, enable the "Expose daemon on tcp://localhost:2375 without TLS" option for compatibility.
- Git: Download and install Git. It is used to clone the repository that contains the required files.
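You can verify both installations from a terminal before continuing (exact version numbers will vary):

docker --version          # Docker client version
docker-compose --version  # or: docker compose version
git --version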
Create a new folder and open a terminal in it (or navigate into it with the cd command).
First, clone the GitHub repository that contains the necessary Docker setup files.
git clone https://github.com/lovnishverma/bigdataecosystem.git
Alternatively, download the repository as a ZIP file directly from the repo and extract it.
Navigate to the directory:
cd bigdataecosystem
If you downloaded the ZIP instead, the folder name is different: cd bigdataecosystem-main
Use Docker Compose to start the containers in the background.
docker-compose up -d
This command will launch the Hadoop, Spark, and Hive containers.
To check if the containers are running, use the following command:
docker ps
When you are done, stop and remove the containers with:
docker-compose down
Enter the NameNode container to interact with Hadoop:
docker exec -it namenode bash
To view the NameNode UI, visit http://localhost:9870/
To view the YARN ResourceManager UI, visit http://localhost:8088/
From your host terminal (for example, Windows CMD or PowerShell), copy the code folder into the container:
docker cp code namenode:/
Inside the container, navigate to the /code/input directory, where the data.txt file is located.
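A quick check that the files are in place (assuming the folder was copied to /code as shown above):

cd /code/input
ls    # should list data.txt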
Run the following commands to set up directories in Hadoop's file system:
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/root
hdfs dfs -mkdir /user/root/input
Copy data.txt
into the Hadoop file system:
hdfs dfs -put /code/input/data.txt /user/root/input
Return to the directory where the wordCount.jar file is located:
cd /code/
You can verify the uploaded file in the NameNode UI at http://localhost:9870/ (Utilities → Browse the file system).
Run the WordCount program to process the input data:
hadoop jar wordCount.jar org.apache.hadoop.examples.WordCount input output
While the job runs, you can monitor it in the YARN ResourceManager UI at http://localhost:8088/
View the results of the WordCount program:
hdfs dfs -cat /user/root/output/*
or
hdfs dfs -cat /user/root/output/part-r-00000
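To see exactly which files the job produced (typically a _SUCCESS marker plus one part-r-* file per reducer), list the output directory:

hdfs dfs -ls /user/root/output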
This guide simplifies setting up and running Hadoop on Docker. Each step ensures a smooth experience, even for beginners without a technical background. Follow the instructions carefully, and you’ll have a working Hadoop setup in no time!
Here is an explanation of the MapReduce process using the example input lines DOG CAT RAT, CAR CAR RAT, and DOG CAR CAT.
The data.txt file contains the following lines:
DOG CAT RAT
CAR CAR RAT
DOG CAR CAT
This text file is processed by the MapReduce WordCount program to count the occurrences of each word.
- MapReduce is a two-step process:
- Map Phase 🗺️: Splits the input into key-value pairs.
- Reduce Phase ➕: Combines the key-value pairs to produce the final result.
It's like dividing a big task (word counting) into smaller tasks and then combining the results. 🧩
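As a rough shell analogy (not how Hadoop actually executes the job, just to illustrate the map → shuffle/sort → reduce flow), you can reproduce the same counts locally:

cat data.txt | tr ' ' '\n' | sort | uniq -c   # split into words (map), group identical words (shuffle/sort), count each group (reduce)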
The mapper processes each line of the input file, splits it into words, and assigns each word a count of 1.
For example:
DOG CAT RAT -> (DOG, 1), (CAT, 1), (RAT, 1)
CAR CAR RAT -> (CAR, 1), (CAR, 1), (RAT, 1)
DOG CAR CAT -> (DOG, 1), (CAR, 1), (CAT, 1)
Mapper Output:
(DOG, 1), (CAT, 1), (RAT, 1)
(CAR, 1), (CAR, 1), (RAT, 1)
(DOG, 1), (CAR, 1), (CAT, 1)
The shuffle and sort step then groups all values for the same key (word) together and sorts them.
For example:
(CAR, [1, 1, 1])
(CAT, [1, 1])
(DOG, [1, 1])
(RAT, [1, 1])
The reducer sums up the counts for each word to get the total number of occurrences.
Reducer Output:
CAR 3 🏎️
CAT 2 🐱
DOG 2 🐶
RAT 2 🐭
The final word count is saved in the HDFS output directory. You can view it using:
hdfs dfs -cat /user/root/output/*
Result:
CAR 3
CAT 2
DOG 2
RAT 2
Here are the basic HDFS commands you used and their purpose:
- Upload a file to HDFS 📤:
  hdfs dfs -put data.txt /user/root/input
  What it does: uploads data.txt to the HDFS directory /user/root/input.
  Output: none, but the file is now in HDFS.
- List files in a directory 📁:
  hdfs dfs -ls /user/root/input
  What it does: lists all files in the /user/root/input directory.
  Output: something like this:
  Found 1 items
  -rw-r--r--   1 root supergroup         50 2024-12-12 /user/root/input/data.txt
- View the contents of a file 📄:
  hdfs dfs -cat /user/root/input/data.txt
  What it does: displays the contents of the data.txt file in HDFS.
  Output:
  DOG CAT RAT
  CAR CAR RAT
  DOG CAR CAT
- Run the MapReduce job 🚀:
  hadoop jar wordCount.jar org.apache.hadoop.examples.WordCount input output
  What it does: runs the WordCount program on the input directory and saves the result in the output directory.
- View the final output 📊:
  hdfs dfs -cat /user/root/output/*
  What it does: displays the word count results.
  Output:
  CAR 3
  CAT 2
  DOG 2
  RAT 2
- Input: you uploaded a small text file (data.txt) to HDFS.
- Process: the WordCount program processed the file using MapReduce: the mapper broke the file into words and counted each occurrence, and the reducer aggregated the counts for each word.
- Output: the results were saved in HDFS and displayed using the cat command.
To recap the whole pipeline for this input:

Input lines:
DOG CAT RAT
CAR CAR RAT
DOG CAR CAT

Mapper output:
(DOG, 1), (CAT, 1), (RAT, 1)
(CAR, 1), (CAR, 1), (RAT, 1)
(DOG, 1), (CAR, 1), (CAT, 1)

Grouped after shuffle and sort:
(CAR, [1, 1, 1])
(CAT, [1, 1])
(DOG, [1, 1])
(RAT, [1, 1])

Reducer output:
CAR 3
CAT 2
DOG 2
RAT 2
- MapReduce splits the task into small, manageable pieces and processes them in parallel.
- It’s ideal for large datasets but works the same for smaller ones (like your example).
- Hadoop is designed for distributed systems, making it powerful for big data processing.
To stop the Docker containers when done:
docker-compose down
This will stop and remove the containers and networks created by docker-compose up.
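If you also want to delete the named volumes created by the stack (and with them the stored HDFS data), add the -v flag:

docker-compose down -v   # also removes volumes declared in the compose file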
If you face permission issues while copying files to containers, fix the directory permissions inside the container:
docker exec -it namenode bash
chmod -R 777 /your-directory
Sometimes containers do not start or throw errors related to the Hadoop configuration. A common cause is insufficient memory allocated to Docker, so check the container state and logs first.
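Standard Docker commands for a first diagnosis:

docker ps -a              # see which containers exited and with what status
docker logs namenode      # inspect a container's startup logs (replace namenode with any container name)
docker stats --no-stream  # check memory usage of the running containers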
The output of the WordCount job is written to /user/root/output/ in HDFS. Note that Hadoop refuses to run the job if the output directory already exists, so delete the old output directory before re-running the job.
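To re-run the job cleanly:

hdfs dfs -rm -r /user/root/output   # remove the previous output first
hadoop jar wordCount.jar org.apache.hadoop.examples.WordCount input output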
- Network Issues: If you can't access the NameNode UI, ensure that the Docker container's ports are correctly exposed. When running on a local machine, the UI should be reachable at http://localhost:9870.
- Stopping Containers:
  docker-compose down  # Stop and remove the containers
- Permissions Fix:
  docker exec -it namenode bash
  chmod -R 777 /your-directory  # If you face any permission errors
- Handling HDFS Directory Creation: If hdfs dfs -mkdir gives an error, it is usually because the directory already exists. Either remove it first or use the -p flag, which does not fail on existing directories:
  hdfs dfs -rm -r /user/root/input    # If the directory exists, remove it first
  hdfs dfs -mkdir -p /user/root/input
😊 References
https://data-flair.training/blogs/top-hadoop-hdfs-commands-tutorial/
https://medium.com/@traininghub.io/hadoop-mapreduce-architecture-7e167e264595
To copy a file (e.g., police.csv) to the Hadoop cluster:
- Copy the file into the namenode container:
docker cp police.csv namenode:/police.csv
- Access the namenode container's bash shell:
docker exec -it namenode bash
- Create a directory in HDFS and upload the file:
hdfs dfs -mkdir -p /data/crimerecord/police
hdfs dfs -put /police.csv /data/crimerecord/police/
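You can confirm the upload before moving on:

hdfs dfs -ls /data/crimerecord/police/   # should list police.csv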
To interact with Spark, start the Spark shell in the master container:
docker exec -it spark-master bash
spark/bin/spark-shell --master spark://spark-master:7077
- Open http://localhost:8080 in your web browser to view the Spark Master UI.
- You can monitor workers and running applications there.
- Overview: Apache Spark is an open-source distributed computing system known for its speed, ease of use, and general-purpose capabilities for big data processing.
- Key Features:
  - Fast processing using in-memory computation.
  - Supports multiple languages: Scala, Python, Java, and R.
  - Unified framework for batch and streaming data processing.
- What are DataFrames?
  - Distributed collections of data organized into named columns, similar to a table in a database or a DataFrame in Python's pandas.
  - Optimized for processing large datasets using Spark SQL.
- Key Operations:
  - Creating DataFrames from structured data sources (CSV, JSON, Parquet, etc.).
  - Performing transformations and actions on the data.
- Why Scala?
  - Apache Spark is written in Scala, offering the best compatibility and performance.
  - Concise syntax and functional programming support.
- Basic Syntax:
  val numbers = List(1, 2, 3, 4, 5)  // Creates a list of numbers.
  val doubled = numbers.map(_ * 2)   // Doubles each element in the list using map.
  println(doubled)                   // Prints the doubled list.
  The output will be: List(2, 4, 6, 8, 10)
- Need for Spark SQL:
  - Provides a declarative interface to query structured data using SQL-like syntax.
  - Supports seamless integration with other Spark modules.
  - Allows for optimization through the Catalyst Optimizer.
- Key Components:
  - SQL queries on DataFrames and temporary views.
  - Hive integration for legacy SQL workflows.
  - Support for structured data sources.
The following example creates a DataFrame, registers it as a temporary view, runs a SQL query against it, and writes the result to HDFS:
val data = Seq(
("Alice", 30, "HR"),
("Bob", 25, "Engineering"),
("Charlie", 35, "Finance")
)
val df = data.toDF("Name", "Age", "Department")
df.show()
df.createOrReplaceTempView("employees")
val result = spark.sql("SELECT Department, COUNT(*) as count FROM employees GROUP BY Department")
result.show()
result.write.option("header", "true").csv("hdfs://namenode:9000/output_employees")
Reading from HDFS: Once the data is written to HDFS, you can read it back into Spark using:
val outputDF = spark.read.option("header", "true").csv("hdfs://namenode:9000/output_employees")
outputDF.show()   // View the output_employees data read back from HDFS
// Load CSV from HDFS
val df = spark.read.option("header", "false").csv("hdfs://namenode:9000/data/crimerecord/police/police.csv")
df.show()
Copy File: Use docker cp to copy the file into one of the Hadoop containers. Use the following command to copy the data.txt file from your local system to the container (the nodemanager container is used here; the namenode also works):
docker cp data.txt nodemanager:/data.txt
Copy File to HDFS: Use hdfs dfs -put to move the file into the HDFS filesystem. Inside the container (docker exec -it nodemanager bash), run the following to put the data.txt file into HDFS:
hdfs dfs -mkdir /data
hdfs dfs -put data.txt /data
Scala WordCount program: the program reads the file from HDFS, splits it into words, and counts the occurrences of each word. Run the following lines in the Spark shell:
// spark-shell already provides a SparkContext as sc, so no extra SparkConf or SparkContext is needed here.
val input = sc.textFile("hdfs://namenode:9000/data/data.txt")
val wordPairs = input.flatMap(line => line.split(" ")).map(word => (word, 1))
val wordCounts = wordPairs.reduceByKey((a, b) => a + b)
wordCounts.collect().foreach { case (word, count) =>
println(s"$word: $count")
}
Output: The word counts will be printed to the console when the program is executed.
Stop Session:
sc.stop()
- Spark SQL simplifies working with structured data.
- DataFrames provide a flexible and powerful API for handling large datasets.
- Apache Spark is a versatile tool for distributed data processing, offering scalability and performance.
Access the Hive container and start the Hive CLI:
docker exec -it hive-server bash
hive
Check if Hive is listening on port 10000:
netstat -anp | grep 10000
Use Beeline to connect to the Hive server:
beeline -u jdbc:hive2://localhost:10000 -n root
Alternatively, use the following command for direct connection:
beeline
!connect jdbc:hive2://127.0.0.1:10000 scott tiger
- Create a new Hive database:
CREATE DATABASE punjab_police;
USE punjab_police;
- Create a table based on the schema of the police.csv dataset:
  CREATE TABLE police_data (
    Crime_ID INT,
    Crime_Type STRING,
    Location STRING,
    Reported_Date STRING,
    Status STRING
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;
- Load the data into the Hive table:
LOAD DATA INPATH '/data/crimerecord/police/police.csv' INTO TABLE police_data;
Run SQL queries to analyze the data in Hive:
- View the top 10 rows:
SELECT * FROM police_data LIMIT 10;
- Count total crimes:
SELECT COUNT(*) AS Total_Crimes FROM police_data;
- Find the most common crime types:
  SELECT Crime_Type, COUNT(*) AS Occurrences FROM police_data GROUP BY Crime_Type ORDER BY Occurrences DESC;
- Identify locations with the highest crime rates:
  SELECT Location, COUNT(*) AS Total_Crimes FROM police_data GROUP BY Location ORDER BY Total_Crimes DESC;
- Find unresolved cases:
SELECT Status, COUNT(*) AS Count FROM police_data WHERE Status != 'Closed' GROUP BY Status;
There you go: your private Hive server to play with.
show databases;
We will:
- Create a database.
- Create a table inside the database.
- Load data into the table.
- Run queries to retrieve data.
In the Beeline CLI:
CREATE DATABASE mydb;
USE mydb;
- 📝 mydb is the name of the database. Replace it with your preferred name.
Still in the Beeline CLI, create a simple table:
CREATE TABLE employees (
id INT,
name STRING,
age INT
);
- This creates a table named employees with columns id, name, and age.
Insert sample data into your table:
INSERT INTO employees VALUES (1, 'Prince', 30);
INSERT INTO employees VALUES (2, 'Ram Singh', 25);
Retrieve data from your table:
SELECT * FROM employees;
- Output:
+----+----------+-----+
| id | name | age |
+----+----------+-----+
| 2 | Ram Singh | 25 |
| 1 | Prince | 30 |
+----+----------+-----+
- What is Hive?
  - Hive is a data warehouse tool built on top of Hadoop.
  - It allows SQL-like querying over large datasets.
- Why Docker for Hive?
  - Simplifies setup by avoiding manual configuration.
  - Provides a pre-configured environment for running Hive.
- Beeline CLI:
  - A lightweight command-line tool for running Hive queries.
- Use Cases:
  - Data analysis: run analytics on large datasets.
  - ETL: extract, transform, and load data into your Hadoop ecosystem.
You’ve successfully:
- Set up Apache Hive.
- Created and queried a sample project. 🐝
Here is a collection of basic Apache Hive commands with explanations that can help you while working with Hive:
- Show Databases: Displays all the databases available in your Hive environment.
  SHOW DATABASES;
- Create a Database: Creates a new database.
  CREATE DATABASE <database_name>;
  Example:
  CREATE DATABASE mydb;
In Hive, you can find out which database you are currently using by running:
SELECT current_database();
This returns the name of the database currently in use. To switch to a specific database, or to make the database context explicit before running your queries, use:
USE database_name;
- Use a Database: Switches to the specified database.
  USE <database_name>;
  Example:
  USE mydb;
- Drop a Database: Deletes a database and its associated data.
  DROP DATABASE <database_name>;
- Show Tables: Lists all the tables in the current database.
  SHOW TABLES;
- Create a Table: Defines a new table with specific columns.
  CREATE TABLE <table_name> (
    column_name column_type,
    ...
  );
  Example:
  CREATE TABLE employees (
    id INT,
    name STRING,
    age INT
  );
- Describe a Table: Gets detailed information about a table, including column names and types.
  DESCRIBE <table_name>;
- Drop a Table: Deletes a table and its associated data.
  DROP TABLE <table_name>;
- Alter a Table: Modifies a table's structure, for example by adding new columns.
  ALTER TABLE <table_name> ADD COLUMNS (<new_column> <type>);
  Example:
  ALTER TABLE employees ADD COLUMNS (salary DOUBLE);
- Insert Data: Inserts data into a table.
  INSERT INTO <table_name> VALUES (<value1>, <value2>, ...);
  Example:
  INSERT INTO employees VALUES (1, 'John Doe', 30);
  You can also insert several rows in one statement:
  INSERT INTO employees VALUES (1, 'Prince', 30), (2, 'Ram Singh', 25), (3, 'John Doe', 28), (4, 'Jane Smith', 32);
- Select Data: Retrieves data from a table.
  SELECT * FROM <table_name>;
- Update Data: Updates existing data in a table (only supported on transactional/ACID tables).
  UPDATE <table_name> SET <column_name> = <new_value> WHERE <condition>;
- Delete Data: Deletes rows from a table based on a condition (only supported on transactional/ACID tables).
  DELETE FROM <table_name> WHERE <condition>;
- Select Specific Columns: Retrieves specific columns from a table.
  SELECT <column1>, <column2> FROM <table_name>;
- Filtering Data: Filters data based on conditions using the WHERE clause.
  SELECT * FROM <table_name> WHERE <column_name> <operator> <value>;
  Example:
  SELECT * FROM employees WHERE age > 25;
- Sorting Data: Sorts the result by a column in ascending or descending order.
  SELECT * FROM <table_name> ORDER BY <column_name> ASC|DESC;
  Example:
  SELECT * FROM employees ORDER BY age DESC;
  SELECT * FROM employees ORDER BY age ASC;
- Group By: Groups data by one or more columns and aggregates it using functions like COUNT, AVG, SUM, etc.
  SELECT <column_name>, COUNT(*) FROM <table_name> GROUP BY <column_name>;
  Example:
  SELECT age, COUNT(*) FROM employees GROUP BY age;
- Create External Table: Creates a table that references data stored externally (e.g., in HDFS).
  CREATE EXTERNAL TABLE <table_name> (<column_name> <data_type>, ...)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '<delimiter>'
  LOCATION '<file_path>';
  Example:
  CREATE EXTERNAL TABLE employees (
    id INT,
    name STRING,
    age INT
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/hive/warehouse/employees';
- Load Data into Table: Loads data from a file into an existing Hive table.
  LOAD DATA LOCAL INPATH '<file_path>' INTO TABLE <table_name>;
- Show Current User: Displays the current user running the Hive session.
  !whoami;
- Exit Hive: Exits from the Hive shell.
  EXIT;
- Set Hive Variables: Sets Hive session variables.
  SET <variable_name>=<value>;
- Show Hive Variables: Displays all the set variables.
  SET;
- Show the Status of Hive Jobs: Displays the status of running queries.
  SHOW JOBS;
- Partitioning Tables: When dealing with large datasets, partitioning your tables can help improve query performance.
  CREATE TABLE sales (id INT, amount DOUBLE) PARTITIONED BY (year INT, month INT);
- Bucketing: Bucketing splits your data into a fixed number of files, or "buckets."
  CREATE TABLE sales (id INT, amount DOUBLE) CLUSTERED BY (id) INTO 4 BUCKETS;
- Optimization: Use columnar formats like ORC or Parquet for efficient storage and performance.
  CREATE TABLE sales (id INT, amount DOUBLE) STORED AS ORC;
These basic commands will help you interact with Hive and perform common operations like creating tables, querying data, and managing your Hive environment efficiently.
While Hive and MySQL both use SQL-like syntax for querying data, there are some key differences in their commands, especially since Hive is designed for querying large datasets in a Hadoop ecosystem, while MySQL is a relational database management system (RDBMS).
Here is a comparison of Hive and MySQL commands for common operations:
- Hive:
  CREATE DATABASE mydb;
- MySQL:
  CREATE DATABASE mydb;
Both Hive and MySQL use the same syntax to create a database.
- Hive:
  USE mydb;
- MySQL:
  USE mydb;
The syntax is the same for selecting a database in both systems.
- Hive:
  CREATE TABLE employees (
    id INT,
    name STRING,
    age INT
  );
- MySQL:
  CREATE TABLE employees (
    id INT,
    name VARCHAR(255),
    age INT
  );
Differences:
- In Hive, STRING is used for text data, while in MySQL, VARCHAR is used.
- Hive also has specialized data types for distributed storage and performance, such as ARRAY, MAP, and STRUCT.
- Hive:
  INSERT INTO employees VALUES (1, 'John', 30);
  INSERT INTO employees VALUES (2, 'Alice', 25);
- MySQL:
  INSERT INTO employees (id, name, age) VALUES (1, 'John', 30);
  INSERT INTO employees (id, name, age) VALUES (2, 'Alice', 25);
Differences:
- Hive allows a direct INSERT INTO with values, while MySQL typically lists the column names explicitly in the insert statement (though this is optional in MySQL if the values match the column order).
- Hive:
  SELECT * FROM employees;
- MySQL:
  SELECT * FROM employees;
Querying data with SELECT is identical in both systems.
- Hive: Hive doesn't support traditional UPDATE or DELETE commands directly, as it is optimized for batch processing and is more suited to append operations. However, it does support INSERT and INSERT OVERWRITE.
  Example of replacing data:
  INSERT OVERWRITE TABLE employees SELECT * FROM employees WHERE age > 30;
- MySQL:
  UPDATE employees SET age = 31 WHERE id = 1;
  DELETE FROM employees WHERE id = 2;
Differences:
- Hive does not allow direct UPDATE or DELETE by default; instead, it uses INSERT OVERWRITE to modify data in batch operations (row-level updates require transactional/ACID tables).
- Hive:
  DROP TABLE IF EXISTS employees;
- MySQL:
  DROP TABLE IF EXISTS employees;
The syntax for dropping tables is the same in both systems.
- Hive: Hive is designed to run on large datasets using the Hadoop Distributed File System (HDFS), so it focuses on batch processing rather than real-time queries. Query performance in Hive may be slower than MySQL because it is optimized for scale, not for low-latency transaction processing.
- MySQL: MySQL is an RDBMS designed to handle transactional workloads with low-latency queries. It is better suited to OLTP (Online Transaction Processing) than OLAP (Online Analytical Processing) workloads.
- Hive: Hive doesn't support traditional indexing as MySQL does. However, you can create partitioned or bucketed tables in Hive to improve query performance for certain types of data.
- MySQL: MySQL supports indexes (e.g., PRIMARY KEY, UNIQUE, INDEX) to speed up query performance on large datasets.
- Hive:
  SELECT a.id, a.name, b.age FROM employees a JOIN employee_details b ON a.id = b.id;
- MySQL:
  SELECT a.id, a.name, b.age FROM employees a JOIN employee_details b ON a.id = b.id;
The syntax for JOIN is the same in both systems.
- Data Types: Hive uses types like STRING, TEXT, and BOOLEAN, while MySQL uses types like VARCHAR, CHAR, and TEXT.
- Data Modification: Hive does not support UPDATE or DELETE in the traditional way, and is generally used for batch processing.
- Performance: Hive is designed for querying large-scale datasets in Hadoop, so queries tend to be slower than MySQL.
- Indexing: Hive does not natively support indexing but can use partitioning and bucketing for performance optimization. MySQL supports indexing for faster queries.
- ACID Properties: MySQL supports full ACID compliance for transactional systems, whereas Hive is not transactional by default (but can support limited ACID features starting from version 0.14 with certain configurations).
In conclusion, while Hive and MySQL share SQL-like syntax, they are designed for very different use cases, and not all commands work the same way in both systems.
Export the query results to a file for analysis in visualization tools (hive -e writes tab-separated text by default):
hive -e "SELECT * FROM police_data;" > police_analysis_results.csv
You can use tools like Tableau, Excel, or Python (Matplotlib, Pandas) for data visualization.
If you need to customize the configuration, you can specify parameters in the hadoop.env file or as environment variables for the services (e.g., namenode, datanode, etc.). For example:
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
This will be transformed into the following property in the core-site.xml file:
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode:8020</value>
</property>
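The environment-variable prefix selects the target file (CORE_CONF_ for core-site.xml, and, assuming these images follow the common big-data-europe convention, HDFS_CONF_ for hdfs-site.xml and YARN_CONF_ for yarn-site.xml), and underscores in the rest of the name map to dots in the property name. A hypothetical example under that assumption:

HDFS_CONF_dfs_replication=1   # would become dfs.replication = 1 in hdfs-site.xml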
You now have a fully functional Hadoop, Spark, and Hive cluster running in Docker. This environment is great for experimenting with big data processing and analytics in a lightweight, containerized setup.
I hope you have fun with this Hadoop-Spark-Hive cluster.