This Docker setup provides a complete, production-like environment for learning Apache Iceberg with:
- MinIO: A local S3-compatible object storage for table data
- Polaris: Apache Iceberg REST Catalog
- Jupyter Notebook: Interactive Python notebook with PySpark and Iceberg support
- Trino: Distributed SQL query engine
You should have found this repository along with the course videos here (TODO LINK); please check them out if you haven't.
All versions are centrally managed in the .env file:
Current pinned versions:
- Iceberg: 1.10.0 (released September 5, 2025)
- Spark: 4.0.1 with Scala 2.13 (September 2, 2025)
- Polaris: latest
- Trino: 465
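For reference, a .env with those pins might look like the following (variable names are illustrative; check the repository's .env for the exact keys):

```
ICEBERG_VERSION=1.10.0
SPARK_VERSION=4.0.1
SCALA_VERSION=2.13
POLARIS_VERSION=latest
TRINO_VERSION=465
```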
To update versions, simply edit the .env file and rebuild:
```
docker compose up -d --build
```

Prerequisites:

- Docker Desktop installed and running
- At least 8GB of RAM allocated to Docker
- At least 10GB of free disk space
- Start all services:

  ```
  docker compose up -d
  ```

- Wait for services to be ready (approximately 1-2 minutes):

  ```
  docker compose logs -f
  ```

  Press `Ctrl+C` to stop following logs once services are running.

- Access the services:
  - Jupyter Notebook: http://localhost:8888 (no password) - Start here!
  - MinIO Console: http://localhost:9001 (admin/password) - View your data
  - Trino UI: http://localhost:8080 (username: admin, no password)
  - Polaris API: http://localhost:8181

- Open the demo notebook:
  - Direct link: http://localhost:8888/lab/tree/work/E1.1%20-%20OpenLakehouse.ipynb
  - Run through the cells to see Iceberg with Polaris and MinIO
MinIO provides S3-compatible object storage for Iceberg table data.
Configuration:
- API Port: 9000
- Console Port: 9001
- Username: admin
- Password: password
- Bucket: warehouse
- Data directory: `./data/minio`
Access the Console:
- URL: http://localhost:9001
- Login with admin/password
- Browse the `warehouse` bucket to see your Iceberg table files
The Polaris catalog provides a REST API for managing Iceberg table metadata. It's configured with in-memory persistence for catalog entries and MinIO for table metadata.
Configuration:
- Port: 8181
- Data directory: `./data/polaris`
- Storage: MinIO S3 (s3://warehouse/)
- OAuth2 credentials automatically generated on first start
Initialization: The Polaris catalog is automatically initialized with:
- S3 storage configuration pointing to MinIO
- OAuth2 credentials (root:s3cr3t defined in .env)
The polaris-setup service runs bootstrap-catalog.sh on startup to configure the catalog.
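Once Polaris is up, you can fetch an OAuth2 bearer token yourself. Below is a minimal sketch using only the Python standard library; the token endpoint path is an assumption based on the Iceberg REST catalog spec, and the root:s3cr3t credentials come from .env:

```python
import json
import urllib.parse
import urllib.request

# Assumption: Polaris serves the Iceberg REST spec under /api/catalog;
# host/port and root:s3cr3t come from this stack's .env.
POLARIS_TOKEN_URL = "http://localhost:8181/api/catalog/v1/oauth/tokens"

def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build the client_credentials POST request (not yet sent)."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "PRINCIPAL_ROLE:ALL",
    }).encode()
    return urllib.request.Request(
        POLARIS_TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )

def fetch_token(client_id: str = "root", client_secret: str = "s3cr3t") -> str:
    """Send the request and return the bearer token (requires Polaris running)."""
    with urllib.request.urlopen(build_token_request(client_id, client_secret)) as resp:
        return json.load(resp)["access_token"]
```

The returned token can then be sent as an `Authorization: Bearer ...` header to the catalog endpoints.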
Trino is configured with an Iceberg connector that connects to the Polaris catalog.
Configuration:
- Port: 8080
- Catalog: `iceberg` (connected to Polaris)
- Data directory: `./data/trino`
- Config files: `./trino/config/`
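The files in `./trino/config/` are authoritative, but for orientation, a Trino Iceberg catalog file pointing at a REST catalog over S3-compatible storage typically looks something like this (the values here are assumptions based on this stack's endpoints and credentials):

```properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://polaris:8181/api/catalog
iceberg.rest-catalog.security=OAUTH2
iceberg.rest-catalog.oauth2.credential=root:s3cr3t
fs.native-s3.enabled=true
s3.endpoint=http://minio:9000
s3.path-style-access=true
s3.aws-access-key=admin
s3.aws-secret-key=password
s3.region=us-east-1
```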
Connect to Trino CLI:

```
docker exec -it trino trino
```

Example Trino queries:

```sql
-- Show catalogs
SHOW CATALOGS;

-- Create a namespace
CREATE SCHEMA iceberg.demo;

-- Show schemas
SHOW SCHEMAS IN iceberg;

-- Create a table
CREATE TABLE iceberg.demo.test (
    id BIGINT,
    name VARCHAR
) WITH (format = 'PARQUET');

-- Insert data
INSERT INTO iceberg.demo.test VALUES (1, 'Alice'), (2, 'Bob');

-- Query data
SELECT * FROM iceberg.demo.test;
```

The Jupyter environment comes pre-configured with:
- PySpark with Iceberg support
- PyIceberg library
- Trino Python client
- Pandas, Matplotlib, Seaborn
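The Trino Python client in the list above wraps Trino's HTTP client protocol: POST the SQL text to `/v1/statement` with an `X-Trino-User` header, then follow `nextUri` links until all rows arrive. A stdlib-only sketch of that protocol, assuming the localhost:8080/admin settings from this setup:

```python
import json
import urllib.request

# Assumption: Trino reachable on localhost:8080 as user "admin",
# matching the docker-compose setup described above.
TRINO_URL = "http://localhost:8080/v1/statement"

def build_query_request(sql: str, user: str = "admin") -> urllib.request.Request:
    """Build the initial POST; the request body is just the SQL text."""
    return urllib.request.Request(
        TRINO_URL, data=sql.encode(), headers={"X-Trino-User": user}
    )

def run_query(sql: str, user: str = "admin"):
    """Submit sql and follow nextUri links until all rows are collected."""
    rows, req = [], build_query_request(sql, user)
    while req is not None:
        with urllib.request.urlopen(req) as resp:
            page = json.load(resp)
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        req = (urllib.request.Request(next_uri, headers={"X-Trino-User": user})
               if next_uri else None)
    return rows

# With the stack running: run_query("SHOW CATALOGS")
```

In practice you would use the `trino` package instead; this just shows what is happening on the wire.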
Access:
- URL: http://localhost:8888
- Notebooks directory: `./notebooks`
- Data directory: `./data/jupyter`
Catalog Configuration:
- The demo notebook uses the Polaris REST catalog with MinIO S3 storage
- Metadata managed by Polaris (centralized, REST API)
- Table data stored in MinIO (s3://warehouse/)
- This is a production-like pattern - same architecture as using Polaris with real S3/Azure/GCS
- OAuth2 authentication configured automatically
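The demo notebook configures all of this for you, but if you want to point PyIceberg at Polaris yourself, a `~/.pyiceberg.yaml` along these lines should work. The catalog name placeholder is deliberate (it depends on how bootstrap-catalog.sh named the catalog), and the exact keys may vary by PyIceberg version:

```yaml
catalog:
  polaris:
    uri: http://localhost:8181/api/catalog
    credential: root:s3cr3t
    warehouse: <catalog-name>        # as created by bootstrap-catalog.sh
    s3.endpoint: http://localhost:9000
    s3.access-key-id: admin
    s3.secret-access-key: password
```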
Sample notebook: E1.1 - OpenLakehouse.ipynb is provided with examples of:
- Creating Iceberg tables
- Querying data
- ACID transactions
- Time travel
- Schema evolution
- Partitioning
All data is stored locally in the ./data directory:
- `./data/minio`: MinIO object storage (Iceberg table data)
- `./data/polaris`: Polaris catalog metadata
- `./data/trino`: Trino working data
- `./data/jupyter`: Jupyter user data
This ensures that your data persists even when containers are stopped.
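Since everything persists under ./data, you can poke around on disk. Here is a small stdlib helper that prints a shallow file tree of each service's directory; note that MinIO may store objects in its own internal on-disk format, so the MinIO Console remains the friendlier way to browse table files:

```python
import os

def show_tree(root: str, max_depth: int = 3) -> list[str]:
    """Return an indented listing of directories and files under root."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth > max_depth:
            dirnames.clear()        # don't descend any further
            continue
        indent = "  " * depth
        lines.append(f"{indent}{os.path.basename(dirpath)}/")
        for name in sorted(filenames):
            lines.append(f"{indent}  {name}")
    return lines

# Only present after the stack has run at least once
for d in ("data/minio", "data/polaris", "data/trino", "data/jupyter"):
    if os.path.isdir(d):
        print("\n".join(show_tree(d)))
```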
Start all services:

```
docker compose up -d
```

Stop all services:

```
docker compose down
```

View logs:

```
# All services
docker compose logs -f

# Specific service
docker compose logs -f jupyter
docker compose logs -f trino
docker compose logs -f polaris
```

Restart a service:

```
docker compose restart jupyter
```

Rebuild and restart (after config changes):

```
docker compose up -d --build
```

Check if ports are already in use:
```
# macOS/Linux
lsof -i :8888  # Jupyter
lsof -i :8080  # Trino
lsof -i :9000  # MinIO API
lsof -i :9001  # MinIO Console
lsof -i :8181  # Polaris
```

Check if containers are running:
```
docker compose ps

# Check specific service logs
docker compose logs jupyter
```

- Make sure all services are fully started (check logs)
- Services may take 1-2 minutes to initialize
- Verify network connectivity:
```
docker network ls
```
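lsof is macOS/Linux only; the same port check can be done cross-platform with the Python standard library (ports as listed above):

```python
import socket

# Ports used by this stack (see the service sections above)
PORTS = {"Jupyter": 8888, "Trino": 8080, "MinIO API": 9000,
         "MinIO Console": 9001, "Polaris": 8181}

def port_open(port: int, host: str = "localhost") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket() as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0

for name, port in PORTS.items():
    status = "in use" if port_open(port) else "free"
    print(f"{name:13} :{port}  {status}")
```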
If initializing a Spark session in a notebook fails, the Spark Connect server may have failed to start. Check the Docker container logs (`docker logs jupyter-spark`) for details. Common causes include insufficient Docker RAM or port conflicts. You can also try restarting the container (`docker compose restart jupyter`).
If you are on a corporate network with a proxy or firewall that intercepts HTTPS traffic, Spark may fail to download its dependency JARs at startup. You will typically see SSL certificate errors in docker logs jupyter-spark.
Fix: Pre-download the JARs on your host machine using the provided script:

```
./manual-download-dependencies.sh --insecure
```

The `--insecure` flag tells curl to skip SSL certificate verification. The JARs are saved to the `./jars/` directory and automatically mounted into the container on the next startup. Spark will use these local JARs instead of downloading from Maven Central.
After downloading, restart the services:
```
docker compose down
docker compose up -d
```

You should see `Using pre-downloaded JARs from /opt/spark-jars` in the Jupyter container logs.
If you encounter permission errors with data directories:
```
chmod -R 755 data/
```

To start fresh:
```
docker compose down -v
rm -rf data/*
docker compose up -d --build
```

```
┌───────────────────────────┐      ┌───────────────────────────┐
│  Spark Connect + Jupyter  │      │    Trino (port 8080)      │
│                           │      │                           │
│ - Connect server (:15002) │      │ - SQL Query Engine        │
│ - Jupyter Lab (:8888)     │      │                           │
│                           │      │                           │
└─────────────┬─────────────┘      └─────────────┬─────────────┘
              │                                  │
              │\                                /│
              │ \                              / │
              │  \      ┌─────────────────┐   /  │
              │   \     │     Polaris     │  /   │
              │    \    │   REST Catalog  │ /    │
              │     └──▶│   (port 8181)   │◀─────┘
              │         └────────┬────────┘
              │                  │
              └──────────────────┼───────────────┘
                                 │
                                 ▼
             ┌───────────────────────────────────────┐
             │        MinIO (S3-compatible)          │
             │         (ports 9000/9001)             │
             │  - Bucket: warehouse                  │
             │  - Parquet data files                 │
             │  - Metadata files                     │
             └───────────────────────────────────────┘
```
All table data stored in: MinIO s3://warehouse/
All metadata managed by: Polaris REST catalog
A single Spark Connect server runs in the Jupyter container; notebooks connect to it as thin clients.
An automated test suite is available to validate all notebooks. See TESTING.md for instructions.
```
pip install -r requirements-test.txt
pytest test_notebooks.py -v
```

- Apache Iceberg Documentation
- Polaris Catalog Documentation
- Trino Iceberg Connector
- Spark Iceberg Integration
- All services are configured to communicate via Docker's internal network (`iceberg-net`)
- The Jupyter notebook is configured without a password for ease of use (not recommended for production)
- Polaris uses a default root password `admin123` (change for production use)
- Polaris is also set up with in-memory persistence; in production this should be a permanent store
If you encounter issues:
- Check the troubleshooting section above
- Review service logs: `docker compose logs [service-name]`
- Ensure your Docker has sufficient resources allocated
- Try rebuilding: `docker compose up -d --build`
Copyright (c) 2026 Snowflake Inc. All rights reserved.
Licensed under the Apache 2.0 license.