My PySpark Reference Repo

This repository is a complete reference for everyday PySpark functions and workflows, built and tested on Databricks. It contains code snippets and use cases you’ll often encounter when working with PySpark in real-world data engineering scenarios.

🚀 What You’ll Find Inside

  • Data Reading: Load data from multiple formats (CSV, JSON, etc.)
  • Schema Definitions: Explicit schemas for better control
  • Transformations: Joins, aggregations, window functions, and user-defined functions (UDFs)
  • Data Writing: Save data back in multiple formats
  • Spark SQL: Querying data with SQL syntax
  • Practical Use Cases: Examples of transformations and everyday PySpark operations

βš™οΈ Requirements

This repo is designed to run on Databricks Notebooks, but you can also run it locally with PySpark.

If running locally, install PySpark with:

pip install pyspark

▶️ How to Use

Option 1: On Databricks

  1. Clone or import this repo into your Databricks workspace.
  2. Attach your notebook to a running cluster.
  3. Run the cells to explore PySpark functions.

Option 2: Locally

  1. Clone the repo:
    git clone https://github.com/NdukaClara/my_pyspark_reference_repo.git
  2. Run scripts in Jupyter Notebook or your IDE (with PySpark installed).

📂 Repo Structure

my_pyspark_reference_repo/
├── notebooks/           # Databricks notebooks with examples
├── scripts/             # Python scripts for each concept
├── data/                # Sample datasets (if included)
└── README.md            # Project documentation

💡 Why This Repo?

Rather than digging through the documentation every time you need a function, you can use this repo as a one-stop reference for PySpark on Databricks, perfect for learners and practitioners who want quick, practical examples.
