A complete Docker-based Apache Spark 3.5.6 setup for local development with Iceberg, S3, and AWS integration.
- Docker and Docker Compose installed
- At least 8GB RAM available for Docker
- AWS credentials configured (optional, for S3/Glue integration)
cd spark
docker-compose up -d
Open these URLs in your browser:
- Spark Master UI: http://localhost:8080
- Spark Worker 1: http://localhost:8081
- Spark Worker 2: http://localhost:8082
- History Server: http://localhost:18080
- Thrift Server (Spark application UI): http://localhost:4040
# Interactive Spark Shell
docker-compose exec spark-master spark-shell --master spark://spark-master:7077
# Or PySpark
docker-compose exec spark-master pyspark --master spark://spark-master:7077
# Or submit a Python script
docker-compose exec spark-master spark-submit \
--master spark://spark-master:7077 \
/opt/spark/scripts/test-iceberg.py
.
├── README.md                      # This file
├── .gitignore                     # Git ignore rules
└── spark/                         # Spark setup directory
    ├── README.md                  # Detailed Spark documentation
    ├── Dockerfile                 # Spark image with all dependencies
    ├── docker-compose.yml         # Multi-service setup
    ├── entrypoint.sh              # Service startup script
    ├── conf/                      # Spark configuration files
    │   ├── spark-defaults.conf
    │   ├── thrift-server.conf
    │   └── log4j2.properties
    ├── scripts/                   # Sample scripts and tests
    │   ├── test-iceberg.py
    │   ├── test-thrift-server.py
    │   └── JDBCExample.java
    ├── data/                      # Local data storage (ignored in git)
    ├── logs/                      # Spark logs (ignored in git)
    ├── notebooks/                 # Jupyter notebooks
    └── work/                      # Spark work directory (ignored in git)
# Start all services
cd spark && docker-compose up -d
# Start minimal cluster (without Thrift Server)
cd spark && docker-compose up -d spark-master spark-worker-1 spark-worker-2 spark-history-server
# Stop all services
cd spark && docker-compose down
# View logs
cd spark && docker-compose logs -f
cd spark
docker-compose exec spark-master spark-submit \
--master spark://spark-master:7077 \
--deploy-mode client \
/opt/spark/scripts/your-script.py
cd spark
# Scala Shell
docker-compose exec spark-master spark-shell --master spark://spark-master:7077
# Python Shell
docker-compose exec spark-master pyspark --master spark://spark-master:7077
# Access container for debugging
docker-compose exec spark-master bash
The Thrift Server lets BI tools and database clients connect over JDBC:
- JDBC URL: jdbc:hive2://localhost:10001/default
- Host: localhost
- Port: 10001
- Username: spark
- Password: (empty)
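The connection details above can also be used from Python. The following is a minimal sketch, not the contents of scripts/test-thrift-server.py: it assumes the third-party PyHive package is installed (`pip install "pyhive[hive]"`), and the helper names `parse_jdbc_url` and `run_query` are made up for illustration. If the Thrift Server is configured for HTTP transport rather than binary transport, the PyHive connection arguments would need to change.

```python
from urllib.parse import urlparse

DEFAULT_JDBC_URL = "jdbc:hive2://localhost:10001/default"


def parse_jdbc_url(jdbc_url):
    """Split a jdbc:hive2:// URL into host, port, and database."""
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix):
        raise ValueError(f"not a JDBC URL: {jdbc_url!r}")
    # After stripping "jdbc:", the rest parses like an ordinary URL.
    parsed = urlparse(jdbc_url[len(prefix):])
    return {
        "host": parsed.hostname,
        "port": parsed.port,
        "database": parsed.path.lstrip("/") or "default",
    }


def run_query(sql, jdbc_url=DEFAULT_JDBC_URL, username="spark"):
    """Run one SQL statement against the Thrift Server and return all rows."""
    # Assumption: PyHive is installed. Imported lazily so that the URL
    # helper above stays usable without it.
    from pyhive import hive

    target = parse_jdbc_url(jdbc_url)
    connection = hive.connect(
        host=target["host"],
        port=target["port"],
        username=username,
        database=target["database"],
    )
    try:
        cursor = connection.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        connection.close()
```

With the cluster running, `run_query("SHOW DATABASES")` should return the databases visible to the Thrift Server.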
Test connection:
cd spark
python3 scripts/test-thrift-server.py
Place your Spark applications in spark/scripts/:
# Copy your Python/Scala files
cp your-spark-app.py spark/scripts/
cd spark
docker-compose exec spark-master spark-submit \
--master spark://spark-master:7077 \
/opt/spark/scripts/your-spark-app.py
- Check Spark UI: http://localhost:8080
- View application details: http://localhost:4040
- Check logs: docker-compose logs spark-master

- Place test data files in spark/data/
- Access from Spark: /opt/spark/data/your-file.csv
- Data persists between container restarts
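The mapping between the host directory and the container path is easy to get wrong when writing scripts, so a small helper can make it explicit. This is a sketch under the assumption that spark/data/ on the host is mounted at /opt/spark/data inside the containers, as the paths above suggest; `container_path` and `read_csv` are hypothetical names, and `spark` is the session object that the pyspark shell provides.

```python
from pathlib import PurePosixPath

HOST_DATA_DIR = PurePosixPath("spark/data")            # path on the host
CONTAINER_DATA_DIR = PurePosixPath("/opt/spark/data")  # path inside the containers


def container_path(host_path):
    """Translate a path under spark/data/ into the path Spark sees."""
    # relative_to raises ValueError if host_path is not under spark/data/.
    relative = PurePosixPath(host_path).relative_to(HOST_DATA_DIR)
    return str(CONTAINER_DATA_DIR / relative)


def read_csv(spark, host_path):
    """Load a CSV file with a header row into a DataFrame."""
    return (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(container_path(host_path))
    )
```

Inside the pyspark shell, `read_csv(spark, "spark/data/your-file.csv").show(5)` would print the first rows of the file.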
The setup includes Apache Iceberg, a modern table format:
# Example: Create Iceberg table
df.write.format("iceberg") \
.mode("overwrite") \
.saveAsTable("local.db.my_table")
Configure AWS credentials for S3 access:
# Using AWS CLI (recommended)
aws configure
# Or set environment variables
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret
export AWS_DEFAULT_REGION=ap-southeast-1

| Service | URL | Description |
|---|---|---|
| Spark Master | http://localhost:8080 | Cluster overview, workers, applications |
| Worker 1 | http://localhost:8081 | Worker status and executors |
| Worker 2 | http://localhost:8082 | Worker status and executors |
| History Server | http://localhost:18080 | Completed applications |
| Thrift Server | http://localhost:4040 | Spark application UI for the Thrift Server (the JDBC endpoint itself is on port 10001) |
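After starting the cluster, it can be handy to confirm that each UI port in the table is accepting connections before opening a browser. The sketch below uses only the Python standard library; the `WEB_UIS` mapping mirrors the table, and the function names are illustrative. It only checks that a TCP connection succeeds, not that the page behind it is healthy.

```python
import socket

# Service name -> host port, copied from the table above.
WEB_UIS = {
    "Spark Master": 8080,
    "Worker 1": 8081,
    "Worker 2": 8082,
    "History Server": 18080,
    "Thrift Server": 4040,
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_web_uis(host="localhost"):
    """Return a dict mapping each service name to its reachability."""
    return {name: is_port_open(host, port) for name, port in WEB_UIS.items()}
```

Calling `check_web_uis()` returns a dictionary with one boolean per service, so a `False` entry points at the container to inspect with docker-compose logs.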
cd spark
# View all logs
docker-compose logs
# View specific service logs
docker-compose logs spark-master
docker-compose logs spark-worker-1
# Follow logs in real-time
docker-compose logs -f
# Check container status
docker-compose ps
Edit spark/conf/spark-defaults.conf:
# Increase memory allocation
spark.driver.memory 4g
spark.executor.memory 4g
spark.executor.cores 2
# Enable adaptive query execution
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
For S3 and Glue integration, see the detailed configuration in spark/README.md.
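To sanity-check edits to spark/conf/spark-defaults.conf before restarting the cluster, the file can be parsed into a dictionary and inspected. The following is a simplified sketch: it handles only the whitespace-separated `key value` form and `#` comment lines shown in the performance settings above, not every syntax Spark itself accepts, and `parse_spark_defaults` is a hypothetical helper name.

```python
SAMPLE_CONF = """
# Increase memory allocation
spark.driver.memory 4g
spark.executor.memory 4g
spark.executor.cores 2
# Enable adaptive query execution
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
"""


def parse_spark_defaults(text):
    """Parse 'key value' lines into a dict, skipping blanks and comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first run of whitespace: key, then the rest as value.
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[key] = value  # a later duplicate key overrides an earlier one
    return settings


if __name__ == "__main__":
    for key, value in sorted(parse_spark_defaults(SAMPLE_CONF).items()):
        print(f"{key} = {value}")
```

To check the real file, pass its contents instead of the sample, for example `parse_spark_defaults(open("spark/conf/spark-defaults.conf").read())`.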
- Port conflicts: Change ports in docker-compose.yml
- Memory issues: Increase Docker memory limits
- Permission errors: Check file permissions in mounted volumes
- AWS access: Verify credentials and IAM permissions

- Check the detailed documentation: spark/README.md
- View container logs: docker-compose logs [service-name]
- Access container shell: docker-compose exec spark-master bash
- Check Spark UI for application details
# Stop services
cd spark && docker-compose down
# Remove all data (careful!)
cd spark && docker-compose down -v
# Remove all unused Docker images, stopped containers, networks, and build cache (not limited to this project)
docker system prune -a
For detailed configuration and advanced usage, see spark/README.md
Useful Links: