Local Spark Development Environment

A complete Docker-based Apache Spark 3.5.6 setup for local development, with Apache Iceberg, S3, and AWS Glue integration.

🚀 Quick Start

Prerequisites

  • Docker and Docker Compose installed
  • At least 8GB RAM available for Docker
  • AWS credentials configured (optional, for S3/Glue integration)

1. Start Spark Cluster

cd spark
docker-compose up -d

2. Verify Installation

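Open these URLs in your browser:

  • Spark Master UI: http://localhost:8080
  • Worker 1 UI: http://localhost:8081
  • Worker 2 UI: http://localhost:8082
  • History Server: http://localhost:18080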
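The same check can be scripted instead of done by hand in a browser. The following is a minimal sketch, assuming the ports listed in the Web UIs table of this README are published on localhost; the names `UI_URLS`, `ui_is_up`, and `check_all` are illustrative and not part of this repository.

```python
import urllib.error
import urllib.request

# URLs taken from the Web UIs table in this README (assumed to be published on localhost).
UI_URLS = {
    "Spark Master": "http://localhost:8080",
    "Worker 1": "http://localhost:8081",
    "Worker 2": "http://localhost:8082",
    "History Server": "http://localhost:18080",
}

# Bypass any HTTP proxy configured in the environment: the targets are local.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def ui_is_up(url, timeout=3.0):
    """Return True if an HTTP GET to `url` receives any HTTP response."""
    try:
        with _OPENER.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, name resolution failure, and similar.
        return False


def check_all(urls=UI_URLS):
    """Map each service name to True (reachable) or False (not reachable)."""
    return {name: ui_is_up(url) for name, url in urls.items()}
```

Calling `check_all()` shortly after `docker-compose up -d` may report some services as down while the containers are still starting, so retrying after a few seconds is reasonable.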

3. Run Your First Spark Job

# Interactive Spark Shell
docker-compose exec spark-master spark-shell --master spark://spark-master:7077

# Or PySpark
docker-compose exec spark-master pyspark --master spark://spark-master:7077

# Or submit a Python script
docker-compose exec spark-master spark-submit \
    --master spark://spark-master:7077 \
    /opt/spark/scripts/test-iceberg.py

📁 Project Structure

.
├── README.md                # This file
├── .gitignore               # Git ignore rules
└── spark/                   # Spark setup directory
    ├── README.md            # Detailed Spark documentation
    ├── Dockerfile           # Spark image with all dependencies
    ├── docker-compose.yml   # Multi-service setup
    ├── entrypoint.sh        # Service startup script
    ├── conf/                # Spark configuration files
    │   ├── spark-defaults.conf
    │   ├── thrift-server.conf
    │   └── log4j2.properties
    ├── scripts/             # Sample scripts and tests
    │   ├── test-iceberg.py
    │   ├── test-thrift-server.py
    │   └── JDBCExample.java
    ├── data/                # Local data storage (ignored in git)
    ├── logs/                # Spark logs (ignored in git)
    ├── notebooks/           # Jupyter notebooks
    └── work/                # Spark work directory (ignored in git)

🔧 Common Commands

Start/Stop Services

# Start all services
cd spark && docker-compose up -d

# Start minimal cluster (without Thrift Server)
cd spark && docker-compose up -d spark-master spark-worker-1 spark-worker-2 spark-history-server

# Stop all services
cd spark && docker-compose down

# View logs
cd spark && docker-compose logs -f

Running Spark Applications

Submit Python Script

cd spark
docker-compose exec spark-master spark-submit \
    --master spark://spark-master:7077 \
    --deploy-mode client \
    /opt/spark/scripts/your-script.py
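If you submit scripts often, the command above can be wrapped in a small helper. This is a sketch only: `build_submit_command` and `submit` are hypothetical names, and the `-T` flag (which disables pseudo-TTY allocation in `docker-compose exec`) is an addition that assumes the wrapper may run without an attached terminal.

```python
import shlex
import subprocess


def build_submit_command(script_name,
                         master="spark://spark-master:7077",
                         deploy_mode="client"):
    """Return the docker-compose/spark-submit argument list for a script
    located in spark/scripts/ (seen as /opt/spark/scripts/ in the container)."""
    return [
        "docker-compose", "exec",
        "-T",  # assumption: no TTY needed for a non-interactive submit
        "spark-master",
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        f"/opt/spark/scripts/{script_name}",
    ]


def submit(script_name, compose_dir="spark"):
    """Run spark-submit inside the spark-master container; return its exit code."""
    cmd = build_submit_command(script_name)
    print("Running:", shlex.join(cmd))
    # cwd must be the directory that contains docker-compose.yml.
    return subprocess.run(cmd, cwd=compose_dir).returncode
```

For example, `submit("your-script.py")` run from the repository root would execute the same command as shown above.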

Interactive Development

cd spark
# Scala Shell
docker-compose exec spark-master spark-shell --master spark://spark-master:7077

# Python Shell  
docker-compose exec spark-master pyspark --master spark://spark-master:7077

# Access container for debugging
docker-compose exec spark-master bash

JDBC/ODBC Connections (Thrift Server)

The Thrift Server lets BI tools and SQL clients connect to Spark over JDBC/ODBC:

  • JDBC URL: jdbc:hive2://localhost:10001/default
  • Host: localhost
  • Port: 10001
  • Username: spark
  • Password: (empty)
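Before pointing a BI tool at the Thrift Server, it can help to confirm that the port is accepting connections at all. The sketch below uses only the host, port, and database name listed above; `jdbc_url` and `port_accepts_connections` are illustrative helper names, and a successful TCP connection does not prove that authentication or queries will work.

```python
import socket

# Connection details as listed in this README.
THRIFT_HOST = "localhost"
THRIFT_PORT = 10001


def jdbc_url(host=THRIFT_HOST, port=THRIFT_PORT, database="default"):
    """Build the jdbc:hive2 URL that JDBC clients use for the Thrift Server."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def port_accepts_connections(host=THRIFT_HOST, port=THRIFT_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

If `port_accepts_connections()` returns False, checking `docker-compose ps` and the Thrift Server logs is a sensible next step.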

Test connection:

cd spark
python3 scripts/test-thrift-server.py

🎯 Development Workflow

1. Add Your Scripts

Place your Spark applications in spark/scripts/:

# Copy your Python/Scala files
cp your-spark-app.py spark/scripts/

2. Test Locally

cd spark
docker-compose exec spark-master spark-submit \
    --master spark://spark-master:7077 \
    /opt/spark/scripts/your-spark-app.py

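3. Monitor Execution

Follow running applications in the Spark Master UI (http://localhost:8080) and review completed ones in the History Server (http://localhost:18080).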

🗃️ Data Management

Local Development Data

  • Place test data files in spark/data/
  • Access from Spark: /opt/spark/data/your-file.csv
  • Data persists between container restarts
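The host-to-container path mapping described here (and the equivalent one for spark/scripts/) can be expressed as a small helper, which is handy when generating paths for Spark jobs from host-side tooling. This sketch assumes the two directory mappings implied by this README; the actual volume mounts are defined in docker-compose.yml, and `MOUNTS` and `container_path` are hypothetical names.

```python
from pathlib import PurePosixPath

# Host directory -> container directory, as implied by this README.
# Assumption: docker-compose.yml mounts these directories at these locations.
MOUNTS = {
    PurePosixPath("spark/data"): PurePosixPath("/opt/spark/data"),
    PurePosixPath("spark/scripts"): PurePosixPath("/opt/spark/scripts"),
}


def container_path(host_path):
    """Translate a repository-relative host path to its path inside the container."""
    path = PurePosixPath(host_path)
    for host_dir, container_dir in MOUNTS.items():
        try:
            relative = path.relative_to(host_dir)
        except ValueError:
            continue  # not under this mount, try the next one
        return str(container_dir / relative)
    raise ValueError(f"{host_path} is not under a known mounted directory")
```

For instance, `container_path("spark/data/your-file.csv")` yields the `/opt/spark/data/your-file.csv` path mentioned above.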

Iceberg Tables

The setup includes Apache Iceberg, a modern open table format:

# Example: Create Iceberg table
df.write.format("iceberg") \
    .mode("overwrite") \
    .saveAsTable("local.db.my_table")

S3 Integration

Configure AWS credentials for S3 access:

# Using AWS CLI (recommended)
aws configure

# Or set environment variables
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret
export AWS_DEFAULT_REGION=ap-southeast-1
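When using the environment-variable route, a quick check that all three variables are set can save a failed job. The sketch below only inspects the current process environment; it does not look at credentials written by `aws configure`, and whether the containers actually receive these variables depends on docker-compose.yml. `missing_aws_vars` is an illustrative name.

```python
import os

# The three variables exported in the example above.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
)


def missing_aws_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

An empty list from `missing_aws_vars()` means all three variables are present in the current shell session.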

🔍 Monitoring & Debugging

Web UIs

| Service | URL | Description |
|---|---|---|
| Spark Master | http://localhost:8080 | Cluster overview, workers, applications |
| Worker 1 | http://localhost:8081 | Worker status and executors |
| Worker 2 | http://localhost:8082 | Worker status and executors |
| History Server | http://localhost:18080 | Completed applications |
| Thrift Server | http://localhost:4040 | Thrift Server web UI for SQL sessions and queries (JDBC clients connect on port 10001) |

Logs and Troubleshooting

cd spark

# View all logs
docker-compose logs

# View specific service logs
docker-compose logs spark-master
docker-compose logs spark-worker-1

# Follow logs in real-time
docker-compose logs -f

# Check container status
docker-compose ps

⚙️ Configuration

Performance Tuning

Edit spark/conf/spark-defaults.conf:

# Increase memory allocation
spark.driver.memory              4g
spark.executor.memory            4g
spark.executor.cores             2

# Enable adaptive query execution
spark.sql.adaptive.enabled       true
spark.sql.adaptive.coalescePartitions.enabled  true
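To confirm that edits to spark-defaults.conf landed as intended, the file can be read back and compared with the values you expect. This is a simplified sketch: it handles only the whitespace-separated `key value` form shown above (Spark also accepts other properties-file separators), and `parse_spark_defaults`, `EXPECTED`, and `mismatches` are hypothetical names.

```python
# Values from the Performance Tuning example above.
EXPECTED = {
    "spark.driver.memory": "4g",
    "spark.executor.memory": "4g",
    "spark.executor.cores": "2",
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
}


def parse_spark_defaults(text):
    """Parse whitespace-separated 'key value' lines, skipping blanks and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def mismatches(settings, expected=EXPECTED):
    """Return {key: (expected, actual)} for every key whose value differs."""
    return {
        key: (want, settings.get(key))
        for key, want in expected.items()
        if settings.get(key) != want
    }
```

A typical use would be `mismatches(parse_spark_defaults(open("spark/conf/spark-defaults.conf").read()))`, where an empty dict means every expected setting is present.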

AWS Configuration

For S3 and Glue integration, see the detailed configuration in spark/README.md.

🆘 Troubleshooting

Common Issues

  1. Port conflicts: Change ports in docker-compose.yml
  2. Memory issues: Increase Docker memory limits
  3. Permission errors: Check file permissions in mounted volumes
  4. AWS access: Verify credentials and IAM permissions
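For the port-conflict case, it is usually quicker to find out which port is already taken before editing docker-compose.yml. The sketch below tries to bind each port mentioned in this README and reports the ones that fail. It assumes those are the host ports published by docker-compose.yml, and it should be run while the cluster is stopped (otherwise the cluster's own ports show up as busy). `README_PORTS`, `port_is_free`, and `busy_ports` are illustrative names.

```python
import socket

# Host ports mentioned in this README (assumed to match docker-compose.yml).
README_PORTS = {
    8080: "Spark Master UI",
    8081: "Worker 1 UI",
    8082: "Worker 2 UI",
    18080: "History Server",
    4040: "Thrift Server UI",
    10001: "Thrift Server JDBC",
}


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
        return True


def busy_ports(ports=README_PORTS):
    """Return the subset of `ports` that cannot currently be bound."""
    return {port: name for port, name in ports.items() if not port_is_free(port)}
```

Any entry returned by `busy_ports()` is a candidate for remapping to a different host port in docker-compose.yml.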

Getting Help

  1. Check the detailed documentation: spark/README.md
  2. View container logs: docker-compose logs [service-name]
  3. Access container shell: docker-compose exec spark-master bash
  4. Check Spark UI for application details

🧹 Cleanup

# Stop services
cd spark && docker-compose down

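# Stop services and delete their Docker volumes (careful: volume data is lost)
cd spark && docker-compose down -v

# Remove ALL unused Docker images, stopped containers, unused networks, and build cache on this machine
docker system prune -a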

📖 For detailed configuration and advanced usage, see spark/README.md

