This repository is intended for orchestrating Spark applications using Apache Airflow.
Technologies used within the project: Apache Airflow, Apache Spark, Python.
```bash
# Clone this repo
$ git clone https://github.com/marianelaruiz/lakehouse-airflow

# Enter the project folder
$ cd lakehouse-airflow
```

Create and activate a virtual environment. On Windows:

```
# Create a virtual environment
$ python -m venv airflow_venv

# Activate your virtual environment
$ airflow_venv\Scripts\activate
```

On Linux/macOS:

```bash
# Create a virtual environment
python3 -m venv airflow_venv  # or: virtualenv airflow_venv

# Activate your virtual environment
source airflow_venv/bin/activate
```
Then run the setup script:

```bash
bash setup_airflow.sh
```

Ensure that Apache Airflow is installed on your machine.
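If you prefer to install Airflow by hand rather than via the script, the official constraint-based pip install follows this pattern (the version numbers below are examples, not the project's pinned versions):

```bash
# Example only: install Airflow with the official constraint file
AIRFLOW_VERSION=2.9.1
PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
pip install "apache-airflow==${AIRFLOW_VERSION}" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"

# The Spark provider supplies SparkSubmitOperator and the spark connection type
pip install apache-airflow-providers-apache-spark
```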
To start Airflow, you have two options:

1. Create an Airflow user (optional). If you prefer to create a custom user:

   ```bash
   airflow users create \
     --username admin \
     --firstname Admin \
     --lastname User \
     --role Admin \
     --email [email protected]
   ```

   Point AIRFLOW_HOME at the project's airflow folder:

   ```bash
   cd airflow
   export AIRFLOW_HOME=$(pwd)
   ```

   In a terminal:

   ```bash
   airflow webserver --port 8080
   ```

   In another terminal:

   ```bash
   airflow scheduler
   ```
2. Start Airflow in standalone mode. For a quicker setup without creating a user:

   ```bash
   cd airflow
   export AIRFLOW_HOME=$(pwd)
   airflow standalone
   ```
This command starts both the Airflow webserver and scheduler, and automatically creates an admin user. The username and password will be displayed in the terminal.
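If the credentials scroll by, recent Airflow 2.x releases also write the generated admin password to a file under AIRFLOW_HOME (an assumption worth verifying on your version):

```bash
cat "$AIRFLOW_HOME/standalone_admin_password.txt"
```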
Access the Airflow UI:

The webserver starts at http://127.0.0.1:8080. Open that address (or http://localhost:8080) in your browser and log in.
The first configuration step in the web UI should be to change the host of the Spark connection. Go to Admin > Connections, search for "spark_default", change the "Host" field from "yarn" to "local", and save.
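The same change can be scripted with the Airflow CLI, which is convenient for reproducible setups (a sketch; verify the flags against your Airflow version):

```bash
# Recreate the spark_default connection pointing at local instead of yarn
airflow connections delete spark_default
airflow connections add spark_default \
  --conn-type spark \
  --conn-host local
```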
Then go to the "Search Dags" field and search for "dag_lakehouse". Click the search result to open the DAG's page, and execute it with the "Trigger DAG" button in the upper-right corner of the screen. From there you can observe the execution order and whether each Spark application succeeded. You can also verify the results in the repository: each of the bronze, silver, and gold folders will contain a newly created subfolder called parquet.
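You can also trigger and inspect the DAG from the command line instead of the UI (a sketch, assuming the DAG id dag_lakehouse and the folder layout described above):

```bash
# Trigger the DAG and check the status of its runs
airflow dags trigger dag_lakehouse
airflow dags list-runs -d dag_lakehouse

# Verify that each layer produced its parquet output
ls bronze/parquet silver/parquet gold/parquet
```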
The Medallion Architecture is a data design pattern used to logically organize data in a lakehouse, aimed at progressively enhancing the structure and quality of data as it flows through each layer of the architecture. The layers include Bronze (raw data), Silver (cleaned and transformed data), and Gold (data ready for analysis).
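To make the layer ordering concrete, the DAG's tasks correspond to Spark jobs run in sequence, roughly like the following manual equivalent (the script names here are hypothetical, not the repository's actual file names):

```bash
# Hypothetical manual run of the three layers, in DAG order
spark-submit bronze_job.py   # ingest raw data        -> bronze/parquet
spark-submit silver_job.py   # clean and transform    -> silver/parquet
spark-submit gold_job.py     # aggregate for analysis -> gold/parquet
```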