This repository is a complete reference for everyday PySpark functions and workflows, built and tested on Databricks. It contains code snippets and use cases you'll often encounter when working with PySpark in real-world data engineering scenarios.
- Data Reading: Load data from multiple formats (CSV, JSON, etc.)
- Schema Definitions: Explicit schemas for better control
- Transformations: Joins, aggregations, window functions, and user-defined functions (UDFs)
- Data Writing: Save data back in multiple formats
- Spark SQL: Querying data with SQL syntax
- Practical Use Cases: Examples of transformations and everyday PySpark operations (see the sketch right after this list)
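To give a flavor of the snippets, here is a minimal, self-contained sketch touching most of the items above. The file paths, column names, and schema are illustrative placeholders, not files shipped with the repo:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("pyspark-reference").getOrCreate()

# Schema definition: explicit types instead of relying on inference
schema = StructType([
    StructField("name", StringType(), True),
    StructField("dept", StringType(), True),
    StructField("salary", IntegerType(), True),
])

# Data reading: load a CSV using the schema above (path is a placeholder)
df = spark.read.csv("data/employees.csv", header=True, schema=schema)

# Transformation: rank employees by salary within each department
w = Window.partitionBy("dept").orderBy(F.col("salary").desc())
ranked = df.withColumn("rank", F.row_number().over(w))

# UDF: plain Python applied as a column expression
initial = F.udf(lambda s: s[0].upper() if s else None, StringType())
df_with_initial = df.withColumn("initial", initial(F.col("name")))

# Spark SQL: the same data queried with SQL syntax
df.createOrReplaceTempView("employees")
top_paid = spark.sql(
    "SELECT dept, MAX(salary) AS max_salary FROM employees GROUP BY dept"
)

# Data writing: save results back out in a different format
ranked.write.mode("overwrite").parquet("output/ranked_employees")
```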
This repo is designed to run on Databricks Notebooks, but you can also run it locally with PySpark.
If running locally, install PySpark with:

```bash
pip install pyspark
```

To run on Databricks:
- Clone or import this repo into your Databricks workspace.
- Attach your notebook to a running cluster.
- Run the cells to explore PySpark functions (a sample cell is sketched below).
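Databricks notebooks come with a ready-made `spark` session, so a typical cell needs no setup boilerplate. A sketch of what a cell might look like (the path is a placeholder):

```python
# Databricks pre-creates `spark` for every notebook, so no builder is needed
df = spark.read.option("header", True).csv("/FileStore/sample.csv")  # placeholder path

# Register a temp view and query it with SQL syntax
df.createOrReplaceTempView("sample")
counts = spark.sql("SELECT COUNT(*) AS n FROM sample")

display(counts)  # Databricks' rich table renderer; use counts.show() elsewhere
```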
To run locally:
- Clone the repo:

  ```bash
  git clone https://github.com/NdukaClara/my_pyspark_reference_repo.git
  ```

- Run the scripts in Jupyter Notebook or your IDE (with PySpark installed); see the session sketch below.
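Outside Databricks there is no pre-built session, so each script has to create its own SparkSession first. A minimal sketch:

```python
from pyspark.sql import SparkSession

# Create a local session explicitly; `local[*]` uses all available cores
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("my_pyspark_reference_repo")
    .getOrCreate()
)

# Small in-memory DataFrame just to verify the setup works
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
df.show()

spark.stop()
```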
```
my_pyspark_reference_repo/
├── notebooks/   # Databricks notebooks with examples
├── scripts/     # Python scripts for each concept
├── data/        # Sample datasets (if included)
└── README.md    # Project documentation
```
Instead of digging through documentation every time, this repo serves as a one-stop reference for PySpark on Databricks, perfect for learners and practitioners who want quick, practical examples.