The `spaceflights-pyspark` Kedro starter

Overview

This starter is a variation of the spaceflights tutorial project described in the online Kedro documentation, set up to work with PySpark.

The code in this repository demonstrates best practice when working with Kedro and PySpark. It contains a Kedro starter template with some initial configuration and two example pipelines, and originates from the Kedro documentation about how to work with PySpark.

To create a project based on this starter, first install Kedro into a virtual environment, then create the project from the starter:

pip install kedro
kedro new --starter=spaceflights-pyspark

After the project is created, navigate to the newly created project directory:

cd <my-project-name>  # change directory

Install the required dependencies:

pip install -r requirements.txt

Now you can run the project:

kedro run

To visualise the default pipeline, run:

kedro viz run

This opens your default browser and displays a visualisation of the pipeline.

Features

Single configuration in `/conf/base/spark.yml`

While Spark allows you to specify many different configuration options, this starter uses `/conf/base/spark.yml` as a single configuration location.
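As a rough illustration of what a single flat configuration file gives you, the sketch below turns a dictionary of options, as it might look after the YAML file is parsed, into the list of string pairs that `pyspark.SparkConf.setAll` accepts. The helper `to_spark_pairs` and the option values are assumptions made for this example; the starter's actual `spark.yml` may contain different entries.

```python
from typing import Any, Mapping


def to_spark_pairs(options: Mapping[str, Any]) -> list[tuple[str, str]]:
    """Convert parsed YAML options into (key, value) string pairs.

    Hypothetical helper for illustration; it is not part of Kedro or PySpark.
    """
    pairs = []
    for key, value in options.items():
        # YAML booleans parse to Python bools, while Spark expects the
        # lowercase strings "true" / "false".
        if isinstance(value, bool):
            value = str(value).lower()
        pairs.append((key, str(value)))
    return pairs


# Illustrative values only: the keys are real Spark options, but the
# starter's own spark.yml may list different ones.
spark_options = {
    "spark.driver.maxResultSize": "3g",
    "spark.sql.execution.arrow.pyspark.enabled": True,
    "spark.scheduler.mode": "FAIR",
}

pairs = to_spark_pairs(spark_options)
for key, value in pairs:
    print(f"{key} = {value}")
```

Keeping every option in one flat mapping means that changing Spark's behaviour is a one-line edit to the configuration file rather than a code change.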

`SparkSession` initialisation with `SparkHooks`

This Kedro starter contains the initialisation code for `SparkSession` in `hooks.py` and takes its configuration from `/conf/base/spark.yml`. Modify the `SparkHooks` code if you want to further customise your `SparkSession`, e.g. to use YARN.

Uses transcoding to handle the same data in different formats

In some cases it can be desirable to handle one dataset in different ways, for example to load a Parquet file into your pipeline using pandas and to save it using Spark. In this starter, one of the input datasets, `shuttles`, is an Excel file. It's not possible to load an Excel file directly into Spark, so we use transcoding to save the file as a `pandas.CSVDataset` first, which then allows us to load it as a `spark.SparkDataset` further on in the pipeline.
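Kedro expresses transcoding in the Data Catalog by giving two entries the same base name followed by `@` and a suffix, with both entries pointing at the same file. The sketch below only mimics that naming convention to show how such entries pair up; it does not use Kedro itself, and the entry names and the helpers `split_transcoded_name` and `group_by_base` are hypothetical.

```python
from collections import defaultdict

TRANSCODING_SEPARATOR = "@"

# Hypothetical catalog entries mapped to their dataset types. The two
# "shuttles" entries describe the same file, read with different libraries.
catalog_entries = {
    "shuttles@pandas": "pandas.CSVDataset",
    "shuttles@spark": "spark.SparkDataset",
    "companies": "spark.SparkDataset",
}


def split_transcoded_name(name: str) -> tuple[str, str]:
    """Split 'base@suffix' into (base, suffix); suffix is '' if absent."""
    base, _, suffix = name.partition(TRANSCODING_SEPARATOR)
    return base, suffix


def group_by_base(entries: dict[str, str]) -> dict[str, dict[str, str]]:
    """Group catalog entries that refer to the same underlying data."""
    grouped: dict[str, dict[str, str]] = defaultdict(dict)
    for name, dataset_type in entries.items():
        base, suffix = split_transcoded_name(name)
        grouped[base][suffix or "default"] = dataset_type
    return dict(grouped)


grouped = group_by_base(catalog_entries)
for base, variants in grouped.items():
    print(base, variants)
```

In a pipeline, one node would write to the pandas-flavoured entry and a later node would read from the Spark-flavoured entry, so the same file is saved with one library and loaded with another.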
