Local Spark Development Environment

A complete Docker-based Apache Spark 3.5.6 setup for local development, with Apache Iceberg, Amazon S3, and AWS Glue integration.

🚀 Quick Start

Prerequisites

Docker and Docker Compose installed
At least 8GB RAM available for Docker
AWS credentials configured (optional, for S3/Glue integration)

1. Start Spark Cluster

cd spark
docker-compose up -d

2. Verify Installation

Open these URLs in your browser:

Spark Master UI: http://localhost:8080
Spark Worker 1: http://localhost:8081
Spark Worker 2: http://localhost:8082
History Server: http://localhost:18080
Thrift Server (Spark application UI): http://localhost:4040
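
If you would rather check these endpoints from a terminal than a browser, a small probe script can do it. The following is a minimal sketch using only the Python standard library; the names `ENDPOINTS`, `http_probe`, and `check_endpoints` are illustrative and not part of this repository, and the URL list simply mirrors the ports listed above.

```python
import urllib.error
import urllib.request

# Endpoints listed above; adjust if you changed ports in docker-compose.yml.
ENDPOINTS = {
    "Spark Master UI": "http://localhost:8080",
    "Spark Worker 1": "http://localhost:8081",
    "Spark Worker 2": "http://localhost:8082",
    "History Server": "http://localhost:18080",
    "Thrift Server": "http://localhost:4040",
}


def http_probe(url, timeout=3.0):
    """Return True if the URL answers with any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just not with a 2xx status; it is still up.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


def check_endpoints(endpoints, probe=http_probe):
    """Map each service name to True (reachable) or False."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Calling `check_endpoints(ENDPOINTS)` returns a dictionary you can print or inspect; the `probe` parameter exists only so the logic can be exercised without a running cluster.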

3. Run Your First Spark Job

# Interactive Spark Shell
docker-compose exec spark-master spark-shell --master spark://spark-master:7077

# Or PySpark
docker-compose exec spark-master pyspark --master spark://spark-master:7077

# Or submit a Python script
docker-compose exec spark-master spark-submit \
    --master spark://spark-master:7077 \
    /opt/spark/scripts/test-iceberg.py

📁 Project Structure

.
├── README.md                 # This file
├── .gitignore               # Git ignore rules
└── spark/                   # Spark setup directory
    ├── README.md            # Detailed Spark documentation
    ├── Dockerfile           # Spark image with all dependencies
    ├── docker-compose.yml   # Multi-service setup
    ├── entrypoint.sh        # Service startup script
    ├── conf/                # Spark configuration files
    │   ├── spark-defaults.conf
    │   ├── thrift-server.conf
    │   └── log4j2.properties
    ├── scripts/             # Sample scripts and tests
    │   ├── test-iceberg.py
    │   ├── test-thrift-server.py
    │   └── JDBCExample.java
    ├── data/                # Local data storage (ignored in git)
    ├── logs/                # Spark logs (ignored in git)
    ├── notebooks/           # Jupyter notebooks
    └── work/                # Spark work directory (ignored in git)

🔧 Common Commands

Start/Stop Services

# Start all services
cd spark && docker-compose up -d

# Start minimal cluster (without Thrift Server)
cd spark && docker-compose up -d spark-master spark-worker-1 spark-worker-2 spark-history-server

# Stop all services
cd spark && docker-compose down

# View logs
cd spark && docker-compose logs -f

Running Spark Applications

Submit Python Script

cd spark
docker-compose exec spark-master spark-submit \
    --master spark://spark-master:7077 \
    --deploy-mode client \
    /opt/spark/scripts/your-script.py

Interactive Development

cd spark
# Scala Shell
docker-compose exec spark-master spark-shell --master spark://spark-master:7077

# Python Shell  
docker-compose exec spark-master pyspark --master spark://spark-master:7077

# Access container for debugging
docker-compose exec spark-master bash

JDBC/ODBC Connections (Thrift Server)

The Thrift Server lets BI tools and SQL clients connect to Spark over JDBC/ODBC:

JDBC URL: jdbc:hive2://localhost:10001/default
Host: localhost
Port: 10001
Username: spark
Password: (empty)

Test connection:

cd spark
python3 scripts/test-thrift-server.py
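
A JVM service such as the Thrift Server can take some time to accept connections after the containers start, so a connection test may fail if run too early. The sketch below polls the port until it opens; it assumes the host and port shown above, and `wait_for_port` is a hypothetical helper, not something shipped in spark/scripts/.

```python
import socket
import time


def wait_for_port(host="localhost", port=10001, timeout=60.0, interval=1.0):
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

An open port only shows that something is listening; it does not prove the server will accept your JDBC session, so treat this as a precondition check before running the actual connection test.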

🎯 Development Workflow

1. Add Your Scripts

Place your Spark applications in spark/scripts/:

# Copy your Python/Scala files
cp your-spark-app.py spark/scripts/

2. Test Locally

cd spark
docker-compose exec spark-master spark-submit \
    --master spark://spark-master:7077 \
    /opt/spark/scripts/your-spark-app.py

3. Monitor Execution

Check cluster status in the Spark Master UI: http://localhost:8080
View application details: http://localhost:4040
Check logs: docker-compose logs spark-master
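
Besides the web pages, Spark exposes a monitoring REST API under `/api/v1` on both the application UI and the History Server. The sketch below lists applications from the History Server; it assumes the default URL used in this README, and the helper names `fetch_applications` and `summarize` are made up for illustration.

```python
import json
import urllib.request


def fetch_applications(base_url="http://localhost:18080", timeout=5.0):
    """Fetch the application list from Spark's monitoring REST API."""
    url = f"{base_url}/api/v1/applications"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize(apps):
    """Reduce each application entry to (id, name, completed)."""
    rows = []
    for app in apps:
        attempts = app.get("attempts", [])
        # Treat an application as completed only when every recorded attempt is.
        completed = bool(attempts) and all(a.get("completed", False) for a in attempts)
        rows.append((app.get("id"), app.get("name"), completed))
    return rows
```

With the cluster running, `summarize(fetch_applications())` gives a compact list you can print after each test run.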

🗃️ Data Management

Local Development Data

Place test data files in spark/data/
Access from Spark: /opt/spark/data/your-file.csv
Data persists between container restarts

Iceberg Tables

The setup includes Apache Iceberg, an open table format for large analytic datasets:

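# Example: create an Iceberg table from a DataFrame
# (run in pyspark, where `spark` is predefined; the sample rows are illustrative)
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.write.format("iceberg") \
    .mode("overwrite") \
    .saveAsTable("local.db.my_table")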
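
After writing a table, it can help to confirm that it is readable through the catalog and that a snapshot was committed. The following sketch assumes the `local` catalog and table name from the example above and takes the active session as an argument (in the pyspark shell, pass the predefined `spark`); `inspect_table` is a hypothetical helper.

```python
def inspect_table(spark, table="local.db.my_table"):
    """Return (row_count, snapshot_count) for an existing Iceberg table."""
    # Reading through the catalog confirms the table is registered.
    row_count = spark.table(table).count()
    # Iceberg exposes commit history through the `snapshots` metadata table.
    snapshots = spark.sql(
        f"SELECT snapshot_id, committed_at, operation FROM {table}.snapshots"
    )
    return row_count, snapshots.count()
```

Each overwrite adds a snapshot, so the second number should grow as you rerun the write.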

S3 Integration

Configure AWS credentials for S3 access:

# Using AWS CLI (recommended)
aws configure

# Or set environment variables
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret
export AWS_DEFAULT_REGION=ap-southeast-1
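
If you go the environment-variable route, a quick check that all three variables are set can save a confusing S3 error later. This is a minimal sketch; `missing_aws_vars` is an illustrative name, and it only covers the variables shown above. It says nothing about credentials stored by `aws configure`, nor about how docker-compose.yml passes credentials into the containers.

```python
import os

# The three variables exported in the example above.
REQUIRED_AWS_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
)


def missing_aws_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]
```

An empty list from `missing_aws_vars()` means all three are present in the current shell environment.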

🔍 Monitoring & Debugging

Web UIs

Service	URL	Description
Spark Master	http://localhost:8080	Cluster overview, workers, applications
Worker 1	http://localhost:8081	Worker status and executors
Worker 2	http://localhost:8082	Worker status and executors
History Server	http://localhost:18080	Completed applications
Thrift Server	http://localhost:4040	Spark application UI for the Thrift Server (JDBC endpoint: port 10001)

Logs and Troubleshooting

cd spark

# View all logs
docker-compose logs

# View specific service logs
docker-compose logs spark-master
docker-compose logs spark-worker-1

# Follow logs in real-time
docker-compose logs -f

# Check container status
docker-compose ps

⚙️ Configuration

Performance Tuning

Edit spark/conf/spark-defaults.conf:

# Increase memory allocation
spark.driver.memory              4g
spark.executor.memory            4g
spark.executor.cores             2

# Enable adaptive query execution
spark.sql.adaptive.enabled       true
spark.sql.adaptive.coalescePartitions.enabled  true
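
To double-check which values a configuration file would set after an edit, you can parse it with a few lines of Python. The sketch below handles only the whitespace-separated `key value` form shown above, with `#` comments; `parse_spark_defaults` is an illustrative helper, not Spark's own loader.

```python
def parse_spark_defaults(text):
    """Parse 'key<whitespace>value' lines, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first run of whitespace: key, then the rest as value.
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings
```

For example, reading spark/conf/spark-defaults.conf and passing its contents to `parse_spark_defaults` yields a dictionary you can compare against the values you intended to set.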

AWS Configuration

For S3 and Glue integration, see the detailed configuration in spark/README.md.

🆘 Troubleshooting

Common Issues

Port conflicts: Change ports in docker-compose.yml
Memory issues: Increase Docker memory limits
Permission errors: Check file permissions in mounted volumes
AWS access: Verify credentials and IAM permissions
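
For the port-conflict case, it may be quicker to find out which host ports are already taken before starting the cluster. The sketch below tries to bind each port locally; the port list is collected from this README and is an assumption about what docker-compose.yml publishes, and the helper names are illustrative.

```python
import socket

# Host ports mentioned in this README; adjust to match docker-compose.yml.
DEFAULT_PORTS = (8080, 8081, 8082, 18080, 4040, 10001)


def port_is_free(port, host="127.0.0.1"):
    """Return True if the port can be bound on this host right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            # Typically "address already in use".
            return False


def busy_ports(ports=DEFAULT_PORTS):
    """Return the subset of ports that cannot currently be bound."""
    return [p for p in ports if not port_is_free(p)]
```

Run `busy_ports()` while the cluster is stopped; any port it reports is held by another process and would need to be remapped in docker-compose.yml.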

Getting Help

Check the detailed documentation: spark/README.md
View container logs: docker-compose logs [service-name]
Access container shell: docker-compose exec spark-master bash
Check Spark UI for application details

🧹 Cleanup

# Stop services
cd spark && docker-compose down

# Stop services and remove Docker volumes (careful: deletes data stored in them!)
cd spark && docker-compose down -v

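# Remove ALL unused Docker images, stopped containers, unused networks, and build cache on this machine (careful: not limited to this project)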
docker system prune -a

📖 For detailed configuration and advanced usage, see spark/README.md

🔗 Useful Links:
