A modernized Apache Spark application that performs word counting on text files, featuring Game of Thrones character data from the HBO series.
- Modern Spark 4.0.1 with Scala 2.13 support
- Java 17 compatibility
- Two implementations: Classic RDD API and modern DataFrame/SparkSQL API
- Command-line arguments for flexible input/output paths
- Automated build and execution via shell scripts
- Word processing with case normalization and punctuation removal
- Frequency-sorted results (most common words first)
- Apache Spark 4.0.1 (installed via Homebrew)
- OpenJDK 17 (installed via Homebrew)
- Maven 3.x (installed via Homebrew)
Verify your installations:
spark-submit --version
java --version
mvn --version
.
├── README.md
├── pom.xml                          # Maven configuration
├── src/main/scala/com/morillo/spark/
│   ├── WordCount.scala              # RDD API implementation
│   └── WordCountDataFrame.scala     # DataFrame/SparkSQL API implementation
├── run-wordcount.sh                 # RDD API execution script
├── run-wordcount-dataframe.sh       # DataFrame API execution script
├── westeros.txt                     # Sample Game of Thrones data
└── target/                          # Build outputs (generated)
You must build the project before running the execution scripts:
# Set Java 17 for Maven (if your default JDK is newer, e.g. Java 24)
export JAVA_HOME=$(/usr/libexec/java_home -v 17)
# Build the project
mvn clean package
This creates `target/spark-wordcount-1.0.0.jar`, ready for Spark submission.
Note: The execution scripts (`run-wordcount.sh` and `run-wordcount-dataframe.sh`) do NOT build the project automatically. You must run the Maven build first.
# Build the project (required before running scripts)
mvn clean package
# Make script executable (first time only)
chmod +x run-wordcount.sh
# Run with Game of Thrones data
./run-wordcount.sh westeros.txt sevenkingdoms
# Run with custom input/output
./run-wordcount.sh /path/to/input.txt /path/to/output_directory
# Make script executable (first time only)
chmod +x run-wordcount-dataframe.sh
# Run with Game of Thrones data
./run-wordcount-dataframe.sh westeros.txt sevenkingdoms-dataframe
# Run with custom input/output
./run-wordcount-dataframe.sh /path/to/input.txt /path/to/output_directory
# Build first
mvn clean package
# Submit RDD API version to Spark
spark-submit \
--class com.morillo.spark.WordCount \
--master local[*] \
--deploy-mode client \
--driver-memory 2g \
--executor-memory 1g \
target/spark-wordcount-1.0.0.jar \
westeros.txt \
sevenkingdoms
# Submit DataFrame API version to Spark
spark-submit \
--class com.morillo.spark.WordCountDataFrame \
--master local[*] \
--deploy-mode client \
--driver-memory 2g \
--executor-memory 1g \
target/spark-wordcount-1.0.0.jar \
westeros.txt \
sevenkingdoms-dataframe
Running on the included `westeros.txt` produces frequency-sorted results:
(stark,6)
(baratheon,5)
(lannister,4)
(martell,4)
(tyrell,3)
(arryn,3)
(targaryen,3)
(robert,2)
(jon,2)
...
# View all output files
cat sevenkingdoms/part-*
# Or view individual partitions
ls sevenkingdoms/
cat sevenkingdoms/part-00000
- Spark Version: 4.0.1
- Scala Version: 2.13.12
- Java Target: 17
- Dependencies: spark-core, spark-sql (scope: provided, supplied by the Spark installation at runtime rather than bundled into the jar)
The application uses these default Spark settings:
- Master: `local[*]` (all available cores)
- Driver Memory: 2GB
- Executor Memory: 1GB
- Deploy Mode: client
- Input Validation: Ensures exactly 2 arguments (input path, output path)
- Spark Session: Creates session with descriptive app name
- Text Processing:
  - Reads input file(s) using `sc.textFile()`
  - Splits lines on whitespace using `flatMap`
  - Filters empty strings
  - Normalizes to lowercase and removes punctuation using `map`
  - Filters empty results after cleaning
- Word Counting: Uses `reduceByKey()` for distributed counting
- Sorting: Results sorted by frequency using `sortBy()`
- Output: Saves to specified directory path using `saveAsTextFile()`
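For orientation, here is a minimal sketch of the pipeline just described. Names and formatting are illustrative; the actual `WordCount.scala` may differ in structure and error handling:

```scala
package com.morillo.spark

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    require(args.length == 2, "Usage: WordCount <input-path> <output-path>")
    val Array(inputPath, outputPath) = args

    val spark = SparkSession.builder().appName("WordCount (RDD API)").getOrCreate()
    val sc = spark.sparkContext

    sc.textFile(inputPath)                            // read input file(s)
      .flatMap(_.split("\\s+"))                       // split lines on whitespace
      .filter(_.nonEmpty)                             // drop empty tokens
      .map(_.toLowerCase.replaceAll("[^a-z0-9]", "")) // lowercase, strip punctuation
      .filter(_.nonEmpty)                             // drop tokens emptied by cleaning
      .map(word => (word, 1))
      .reduceByKey(_ + _)                             // distributed counting
      .sortBy(_._2, ascending = false)                // most frequent first
      .saveAsTextFile(outputPath)

    spark.stop()
  }
}
```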
- Input Validation: Ensures exactly 2 arguments (input path, output path)
- Spark Session: Creates session with descriptive app name
- Text Processing:
  - Reads input file(s) using `spark.read.text()`
  - Splits lines using the `explode(split())` functions
  - Filters empty strings using DataFrame operations
  - Normalizes using the `lower()` and `regexp_replace()` functions
  - Filters empty results after cleaning
- Word Counting: Uses `groupBy().agg(count())` for aggregation
- Sorting: Results sorted using `orderBy(desc(), asc())`
- Output: Saves using DataFrame `write.text()` operations
- Bonus: Includes a SQL query alternative (commented out) and console output with statistics
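Likewise, a minimal sketch of the DataFrame pipeline (again illustrative; the actual `WordCountDataFrame.scala` may format its output and statistics differently):

```scala
package com.morillo.spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WordCountDataFrame {
  def main(args: Array[String]): Unit = {
    require(args.length == 2, "Usage: WordCountDataFrame <input-path> <output-path>")
    val Array(inputPath, outputPath) = args

    val spark = SparkSession.builder().appName("WordCount (DataFrame API)").getOrCreate()

    val counts = spark.read.text(inputPath)                    // one row per line, column "value"
      .select(explode(split(col("value"), "\\s+")).as("word")) // one row per token
      .filter(col("word") =!= "")                              // drop empty tokens
      .select(regexp_replace(lower(col("word")), "[^a-z0-9]", "").as("word"))
      .filter(col("word") =!= "")                              // drop tokens emptied by cleaning
      .groupBy("word")
      .agg(count("*").as("count"))                             // aggregate per word
      .orderBy(desc("count"), asc("word"))                     // frequency desc, then alphabetical

    // write.text() accepts a single string column, so render each row first
    counts
      .select(concat(lit("("), col("word"), lit(","), col("count").cast("string"), lit(")")).as("value"))
      .write.text(outputPath)

    spark.stop()
  }
}
```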
- Package: `com.morillo.spark`
- RDD API Class: `WordCount`
- DataFrame API Class: `WordCountDataFrame`
- Entry Points: `main(args: Array[String])` in both classes
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.13</artifactId>
  <version>4.0.1</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.13</artifactId>
  <version>4.0.1</version>
  <scope>provided</scope>
</dependency>
# Build first (always required)
mvn clean package
# Basic word count (RDD API)
./run-wordcount.sh westeros.txt got_results
# Basic word count (DataFrame API)
./run-wordcount-dataframe.sh westeros.txt got_results_df
# Process large text file (RDD API)
./run-wordcount.sh /data/books/complete_series.txt /results/word_analysis
# Process large text file (DataFrame API)
./run-wordcount-dataframe.sh /data/books/complete_series.txt /results/word_analysis_df
# Count words in log files (DataFrame API with better performance for complex queries)
./run-wordcount-dataframe.sh /var/log/application.log /analytics/log_words
| Feature | RDD API | DataFrame API |
|---|---|---|
| Performance | Good for simple operations | Better for complex operations with Catalyst optimizer |
| Code Style | Functional programming style | SQL-like operations |
| Type Safety | Compile-time type safety | Runtime schema validation |
| Optimization | Manual optimization needed | Automatic query optimization |
| Learning Curve | Steeper for beginners | Easier for SQL users |
| Best For | Complex transformations, legacy code | Analytics, SQL users, performance-critical apps |
# Ensure Java 17 is being used
export JAVA_HOME=$(/usr/libexec/java_home -v 17)
./run-wordcount.sh input.txt output
The script automatically removes existing output directories since Spark requires non-existent output paths.
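If you bypass the scripts and call `spark-submit` directly, delete any previous output directory yourself (e.g. `rm -rf sevenkingdoms`). Alternatively, an application can clear the path from the driver before writing; a hypothetical helper using the Hadoop FileSystem API (not part of this project) might look like:

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Hypothetical helper: remove an existing output directory before writing,
// since Spark refuses to write to a path that already exists.
object OutputPathUtil {
  def deleteIfExists(spark: SparkSession, outputPath: String): Unit = {
    val fs = FileSystem.get(URI.create(outputPath), spark.sparkContext.hadoopConfiguration)
    val dir = new Path(outputPath)
    if (fs.exists(dir)) fs.delete(dir, true) // recursive delete
  }
}
```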
# Clean and rebuild
mvn clean
mvn compile
mvn package
This project demonstrates Apache Spark capabilities using Game of Thrones character data for educational purposes.
Feel free to submit issues and pull requests to improve this Spark application example.