
Airflow PySpark

This project contains the following containers:

  • postgres: PostgreSQL database, used as the Airflow metadata database and to hold test data (image postgres:9.6)
  • airflow-webserver: Airflow webserver and scheduler (image docker-airflow-pyspark, built below)
  • jupyter-spark: Jupyter Notebook with PySpark (image jupyter/pyspark-notebook)

Setup

Clone project

git clone https://github.com/cordon-thiago/airflow-pyspark

Download Images

sudo docker pull postgres:9.6
sudo docker pull jupyter/pyspark-notebook:latest

Build airflow Docker

Inside <project folder>/docker/docker-airflow, build the image:

sudo docker build --rm -t docker-airflow-pyspark .

Start containers

Navigate to <project folder>/docker and run:

sudo docker-compose up

To run the containers in the background:

sudo docker-compose up -d

Check that you can access the services

Airflow: http://localhost:8080

PostgreSQL - Database test:

  • Server: localhost:5432
  • Database: test
  • User: test
  • Password: postgres

PostgreSQL - Database airflow:

  • Server: localhost:5432
  • Database: airflow
  • User: airflow
  • Password: airflow
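The two sets of credentials above can be assembled into standard postgresql:// connection URLs for use from Python (a sketch; the variable and function names are illustrative, not part of the repo):

```python
# Connection settings for the two databases exposed on localhost:5432,
# taken from the lists above. Variable names are illustrative.
TEST_DB = {
    "host": "localhost", "port": 5432,
    "dbname": "test", "user": "test", "password": "postgres",
}
AIRFLOW_DB = {
    "host": "localhost", "port": 5432,
    "dbname": "airflow", "user": "airflow", "password": "airflow",
}

def dsn(cfg: dict) -> str:
    """Build a postgresql:// URL usable with psycopg2 or SQLAlchemy."""
    return "postgresql://{user}:{password}@{host}:{port}/{dbname}".format(**cfg)

print(dsn(TEST_DB))  # postgresql://test:postgres@localhost:5432/test
```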

Jupyter Notebook: http://127.0.0.1:8888

  • For Jupyter Notebook, copy the URL with the token generated when the container starts and paste it into your browser. The tokenized URL can be taken from the container logs with:

    docker logs -f docker_jupyter-spark_1
    

How to run a DAG test

  1. Access the Airflow web UI at http://localhost:8080 and go to Connections

  2. Edit the spark_default connection, setting the Host field to localhost

  3. Run the spark-test DAG

Adding Airflow Extra packages

Rebuild Dockerfile:

sudo docker build --rm --build-arg AIRFLOW_DEPS="gcp" -t docker-airflow-pyspark .

After the build succeeds, run docker-compose to start the containers:

sudo docker-compose up

More info at: https://github.com/puckel/docker-airflow#build

How to run the spark app using spark-submit

Inside the project directory (airflow-pyspark):

docker exec -it docker_jupyter-spark_1 spark-submit --master local /home/jovyan/work/spark-scripts/hello-world.py

Useful docker commands

List Images:
sudo docker images <repository_name>

List Containers:
sudo docker container ls

Check container logs:
sudo docker logs -f <container_name>

To rebuild an image after changing something (run inside the directory containing the Dockerfile):
sudo docker build --rm -t <tag_name> .

Access container bash:
sudo docker exec -i -t <container_name> /bin/bash

Useful docker-compose commands

Start Containers:
sudo docker-compose -f <compose-file.yml> up -d

Stop Containers:
sudo docker-compose -f <compose-file.yml> down --remove-orphans

About

Docker setup with Airflow, PySpark, and Jupyter Notebook.
