Snowflake-Labs/apache-iceberg-from-zero

Apache Iceberg Course - Docker Setup

This Docker setup provides a complete, production-like environment for learning Apache Iceberg with:

  • MinIO: Local S3-compatible object storage for table data
  • Polaris: Apache Iceberg REST Catalog
  • Jupyter Notebook: Interactive Python notebook with PySpark and Iceberg support
  • Trino: Distributed SQL query engine

You should have found this repository along with the course videos here (TODO LINK); please check them out if you haven't.

Version Configuration

All versions are centrally managed in the .env file:

Current pinned versions:

  • Iceberg: 1.10.0 (released September 5, 2025)
  • Spark: 4.0.1 with Scala 2.13 (September 2, 2025)
  • Polaris: latest
  • Trino: 465

To update versions, simply edit the .env file and rebuild:

docker compose up -d --build
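For scripting around the setup, the pinned versions can be read straight out of .env; a minimal stdlib sketch, assuming simple KEY=VALUE lines with optional `#` comments (the variable names in the sample are illustrative, check your .env for the real ones):

```python
# Minimal .env reader: returns a dict of KEY=VALUE pairs,
# skipping blank lines and "#" comments.
def read_env(text: str) -> dict:
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env

sample = """
# pinned versions (illustrative variable names)
ICEBERG_VERSION=1.10.0
SPARK_VERSION=4.0.1
"""
versions = read_env(sample)
print(versions["ICEBERG_VERSION"])  # → 1.10.0
```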

Prerequisites

  • Docker Desktop installed and running
  • At least 8GB of RAM allocated to Docker
  • At least 10GB of free disk space

Quick Start

  1. Start all services:

    docker compose up -d
  2. Wait for services to be ready (approximately 1-2 minutes):

    docker compose logs -f

    Press Ctrl+C to stop following logs once services are running.

  3. Access the services:

     • Jupyter: http://localhost:8888
     • Trino UI: http://localhost:8080
     • MinIO Console: http://localhost:9001
     • Polaris API: http://localhost:8181

  4. Open the demo notebook: E1.1 - OpenLakehouse.ipynb in Jupyter.
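Step 2's wait on the logs can also be automated by polling each published port; a stdlib sketch (the ports are the ones this compose file publishes; the function name is ours):

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 120.0) -> bool:
    """Poll until a TCP connection succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(1)
    return False

# Ports published by this compose file.
services = {"jupyter": 8888, "trino": 8080, "minio": 9000, "polaris": 8181}
# With the stack running, each service can be checked like:
# for name, port in services.items():
#     print(name, "ready" if wait_for_port("localhost", port) else "not ready")
```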

Service Details

MinIO (S3-Compatible Storage)

MinIO provides S3-compatible object storage for Iceberg table data.

Configuration:

  • API Port: 9000
  • Console Port: 9001
  • Username: admin
  • Password: password
  • Bucket: warehouse
  • Data directory: ./data/minio

Access the Console:

  • URL: http://localhost:9001
  • Login with admin/password
  • Browse the warehouse bucket to see your Iceberg table files

Polaris Iceberg REST Catalog

The Polaris catalog provides a REST API for managing Iceberg table metadata. It's configured with in-memory persistence for catalog entries, while table data and metadata files are stored in MinIO.

Configuration:

  • Port: 8181
  • Data directory: ./data/polaris
  • Storage: MinIO S3 (s3://warehouse/)
  • OAuth2 credentials automatically generated on first start

Initialization: The Polaris catalog is automatically initialized with:

  • S3 storage configuration pointing to MinIO
  • OAuth2 credentials (root:s3cr3t defined in .env)

The polaris-setup service runs bootstrap-catalog.sh on startup to configure the catalog.
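The OAuth2 handshake the clients perform can be sketched with the stdlib. This only builds the token request without sending it; the endpoint path and scope follow the client_credentials flow of the Iceberg REST spec, so treat them as assumptions for your Polaris version:

```python
import urllib.parse
import urllib.request

# Client-credentials token request against Polaris
# (root:s3cr3t are the credentials defined in .env).
token_url = "http://localhost:8181/api/catalog/v1/oauth/tokens"  # assumed path
form = urllib.parse.urlencode({
    "grant_type": "client_credentials",
    "client_id": "root",
    "client_secret": "s3cr3t",
    "scope": "PRINCIPAL_ROLE:ALL",  # assumed scope
}).encode()

req = urllib.request.Request(token_url, data=form, method="POST")
req.add_header("Content-Type", "application/x-www-form-urlencoded")
# With the stack running, urllib.request.urlopen(req) would return
# a JSON body containing an "access_token" field.
print(req.get_method(), req.full_url)
```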

Trino

Trino is configured with an Iceberg connector that connects to the Polaris catalog.

Configuration:

  • Port: 8080
  • Catalog: iceberg (connected to Polaris)
  • Data directory: ./data/trino
  • Config files: ./trino/config/

Connect to Trino CLI:

docker exec -it trino trino

Example Trino queries:

-- Show catalogs
SHOW CATALOGS;

-- Create a namespace
CREATE SCHEMA iceberg.demo;

-- Show schemas
SHOW SCHEMAS IN iceberg;

-- Create a table
CREATE TABLE iceberg.demo.test (
    id BIGINT,
    name VARCHAR
) WITH (format = 'PARQUET');

-- Insert data
INSERT INTO iceberg.demo.test VALUES (1, 'Alice'), (2, 'Bob');

-- Query data
SELECT * FROM iceberg.demo.test;
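The same queries can be submitted over Trino's HTTP API; a stdlib sketch that only builds the request (POST to /v1/statement with an X-Trino-User header is Trino's documented client protocol, but verify against your Trino version):

```python
import urllib.request

# Build (but don't send) a Trino statement request.
sql = "SELECT * FROM iceberg.demo.test"
req = urllib.request.Request(
    "http://localhost:8080/v1/statement",
    data=sql.encode(),
    method="POST",
)
req.add_header("X-Trino-User", "demo")  # any user name works here
# With the stack running, urllib.request.urlopen(req) would return JSON
# containing a nextUri to poll for result pages.
```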

Jupyter Notebook with PySpark

The Jupyter environment comes pre-configured with:

  • PySpark with Iceberg support
  • PyIceberg library
  • Trino Python client
  • Pandas, Matplotlib, Seaborn

Access:

  • URL: http://localhost:8888 (no password or token required)

Catalog Configuration:

  • The demo notebook uses the Polaris REST catalog with MinIO S3 storage
  • Metadata managed by Polaris (centralized, REST API)
  • Table data stored in MinIO (s3://warehouse/)
  • This is a production-like pattern - same architecture as using Polaris with real S3/Azure/GCS
  • OAuth2 authentication configured automatically
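Those settings translate into a small set of client properties; a sketch of what a PyIceberg load_catalog call would be given, with the values taken from this README and the property names as commonly used by PyIceberg's REST catalog (treat both as assumptions for your version):

```python
# Connection properties mirroring this setup (values from the README).
catalog_props = {
    "uri": "http://localhost:8181/api/catalog",  # assumed Polaris REST path
    "credential": "root:s3cr3t",                 # OAuth2 credentials from .env
    "warehouse": "warehouse",
    "s3.endpoint": "http://localhost:9000",      # MinIO API port
    "s3.access-key-id": "admin",
    "s3.secret-access-key": "password",
}
# With PyIceberg installed, this dict would be passed along the lines of:
# from pyiceberg.catalog import load_catalog
# catalog = load_catalog("polaris", **catalog_props)
```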

Sample notebook: E1.1 - OpenLakehouse.ipynb is provided with examples of:

  • Creating Iceberg tables
  • Querying data
  • ACID transactions
  • Time travel
  • Schema evolution
  • Partitioning

Data Persistence

All data is stored locally in the ./data directory:

  • ./data/minio: MinIO object storage (Iceberg table data)
  • ./data/polaris: Polaris catalog metadata
  • ./data/trino: Trino working data
  • ./data/jupyter: Jupyter user data

This ensures that your data persists even when containers are stopped.
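Given the 10GB disk-space prerequisite, it can be handy to see how much each data directory is consuming; a stdlib sketch using the directory names listed above:

```python
import os

def dir_size(path: str) -> int:
    """Total size in bytes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total

# Report usage for each persisted data directory, if present.
for sub in ("minio", "polaris", "trino", "jupyter"):
    path = os.path.join("data", sub)
    if os.path.isdir(path):
        print(f"{sub}: {dir_size(path) / 1e6:.1f} MB")
```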

Common Commands

Start all services:

docker compose up -d

Stop all services:

docker compose down

View logs:

# All services
docker compose logs -f

# Specific service
docker compose logs -f jupyter
docker compose logs -f trino
docker compose logs -f polaris

Restart a service:

docker compose restart jupyter

Rebuild and restart (after config changes):

docker compose up -d --build

Troubleshooting

Services not starting

Check if ports are already in use:

# macOS/Linux
lsof -i :8888  # Jupyter
lsof -i :8080  # Trino
lsof -i :9000  # MinIO API
lsof -i :9001  # MinIO Console
lsof -i :8181  # Polaris
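A cross-platform alternative to lsof is to try binding each port yourself; a stdlib sketch (if the bind fails, something is already listening there):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """True if something is already bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return False
        except OSError:
            return True

# The ports this compose file publishes.
for port in (8888, 8080, 9000, 9001, 8181):
    print(port, "in use" if port_in_use(port) else "free")
```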

Check service health

# Check if containers are running
docker compose ps

# Check specific service logs
docker compose logs jupyter

Connection refused errors

  • Make sure all services are fully started (check logs)
  • Services may take 1-2 minutes to initialize
  • Verify network connectivity: docker network ls

Spark Session: ConnectionRefusedError: [Errno 111] Connection refused

If you see this error when initializing a Spark Session in a notebook, the Spark Connect server may have failed to start. Check the Docker container logs (docker logs jupyter-spark) for details. Common causes include insufficient Docker RAM or port conflicts. You can also try restarting the container (docker compose restart jupyter).

SSL / Corporate Proxy Errors Downloading JARs

If you are on a corporate network with a proxy or firewall that intercepts HTTPS traffic, Spark may fail to download its dependency JARs at startup. You will typically see SSL certificate errors in docker logs jupyter-spark.

Fix: Pre-download the JARs on your host machine using the provided script:

./manual-download-dependencies.sh --insecure

The --insecure flag tells curl to skip SSL certificate verification. The JARs are saved to the ./jars/ directory and automatically mounted into the container on the next startup. Spark will use these local JARs instead of downloading from Maven Central.

After downloading, restart the services:

docker compose down
docker compose up -d

You should see Using pre-downloaded JARs from /opt/spark-jars in the Jupyter container logs.

Permission errors

If you encounter permission errors with data directories:

chmod -R 755 data/

Reset everything

To start fresh:

docker compose down -v
rm -rf data/*
docker compose up -d --build

Architecture

┌───────────────────────────┐                  ┌───────────────────────────┐
│    Spark Connect + Jupyter│                  │    Trino (port 8080)      │
│                           │                  │                           │
│ - Connect server (:15002) │                  │ - SQL Query Engine        │
│ - Jupyter Lab (:8888)     │                  │                           │
│                           │                  │                           │
└─────────────┬─────────────┘                  └─────────────┬─────────────┘
              │                                              │
              │\                                            /│
              │ \                                          / │
              │  \          ┌─────────────────┐          /  │
              │   \         │     Polaris     │         /   │
              │    \        │   REST Catalog  │        /    │
              │     └──────▶│   (port 8181)   │◀──────┘     │
              │             └────────┬────────┘             │
              │                      │                      │
              └──────────────────────┼──────────────────────┘
                                     │
                                     ▼
              ┌───────────────────────────────────────┐
              │         MinIO (S3-compatible)         │
              │         (ports 9000/9001)             │
              │      - Bucket: warehouse              │
              │      - Parquet data files             │
              │      - Metadata files                 │
              └───────────────────────────────────────┘

All table data stored in: MinIO s3://warehouse/
All metadata managed by: Polaris REST catalog
A single Spark Connect server runs in the Jupyter container; notebooks connect as thin clients

Testing Notebooks

An automated test suite is available to validate all notebooks. See TESTING.md for instructions.

pip install -r requirements-test.txt
pytest test_notebooks.py -v

Notes

  • All services are configured to communicate via Docker's internal network (iceberg-net)
  • The Jupyter notebook is configured without password for ease of use (not recommended for production)
  • Polaris uses a default root password admin123 (change for production use)
  • Polaris is also set up with in-memory persistence; in production this should be a durable store

Support

If you encounter issues:

  1. Check the troubleshooting section above
  2. Review service logs: docker compose logs [service-name]
  3. Ensure your Docker has sufficient resources allocated
  4. Try rebuilding: docker compose up -d --build

License

Copyright (c) 2026 Snowflake Inc. All rights reserved.

Licensed under the Apache 2.0 license.
