This project aims to make it easy to get started with Presto. It is based on Docker and Docker Compose. Currently, the following features are supported:
- Dedicated Presto coordinator node and a variable number of worker nodes
- Function Namespace Manager (for creating SQL functions; see the example after this list)
- Hive connector, Hive Metastore, and non-replicated HDFS (replication factor 1) with a variable number of data nodes
- Reading from S3 without additional configuration (if running in EC2 and with a properly configured instance profile)
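With the Function Namespace Manager in place, functions can be created with plain SQL. The following is a minimal sketch; the example.default namespace is an assumption and depends on how the function namespace catalog is configured in this setup:

    -- Assumes a function namespace catalog 'example' with schema 'default' is configured
    CREATE FUNCTION example.default.tan(x DOUBLE)
    RETURNS DOUBLE
    DETERMINISTIC
    RETURNS NULL ON NULL INPUT
    RETURN sin(x) / cos(x);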
The following should be enough to bring up all required services:
    docker-compose up
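To verify that all services came up, you can list the running containers and their state:

    docker-compose ps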
To change the number of Presto worker nodes or HDFS data nodes, use the --scale flag of docker-compose:

    docker-compose up --scale datanode=3 --scale presto-worker=3
The above commands use a pre-built Docker image. If you want the image to be built locally, run the following instead:

    docker-compose --file docker-compose-local.yml up

If you are behind a corporate firewall, you will have to configure Maven (which is used to build part of Presto) as follows before running the above command:
    export MAVEN_OPTS="-Dhttp.proxyHost=your.proxy.com -Dhttp.proxyPort=3128 -Dhttps.proxyHost=your.proxy.com -Dhttps.proxyPort=3128"

The data/ folder is mounted into the HDFS namenode container, from where you can upload files using the HDFS client in that container (docker-presto_namenode_1 may have a different name on your machine; run docker ps to find out):
    docker exec -it docker-presto_namenode_1 hadoop fs -mkdir /dataset
    docker exec -it docker-presto_namenode_1 hadoop fs -put /data/file.parquet /dataset/
    docker exec -it docker-presto_namenode_1 hadoop fs -ls /dataset
You can use the Presto CLI included in the Docker containers of this project (adapt the container name if necessary):

    docker exec -it docker-presto_presto_1 presto-cli --catalog hive --schema default
Alternatively, you can download the Presto CLI, rename it, make it executable, and run the following:

    ./presto-cli --server localhost:8080 --catalog hive --schema default
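Either way, a quick sanity check is to list the schemas and tables visible through the Hive catalog (standard Presto SQL; no project-specific assumptions):

    SHOW SCHEMAS;
    SHOW TABLES;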
{"s": "hello world", "i": 42}Upload it to /test/test.csv on HDFS as described above. Then run the following in the Presto CLI:
    CREATE TABLE test (s VARCHAR, i INTEGER) WITH (EXTERNAL_LOCATION = 'hdfs://namenode/test/', FORMAT = 'JSON');

For external tables from S3, spin up this service on an EC2 instance, set up an instance profile for that instance, and use the s3a:// protocol instead of hdfs://.
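For example, the statement above would become something like the following sketch, where your-bucket is a placeholder for a bucket that the instance profile can read:

    -- 'your-bucket' is a placeholder; requires an instance profile with read access
    CREATE TABLE test_s3 (s VARCHAR, i INTEGER) WITH (EXTERNAL_LOCATION = 's3a://your-bucket/test/', FORMAT = 'JSON');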
In case you need to make manual changes or want to inspect the MySQL databases, you can connect to the MySQL server like this:
    docker exec -it docker-presto_mysql_1 mysql -ppassword
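For example, to get an overview of what is stored there (the exact database names, such as the one backing the Hive Metastore, depend on this project's configuration):

    # Database names depend on this setup's configuration
    docker exec -it docker-presto_mysql_1 mysql -ppassword -e 'SHOW DATABASES;'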