A Databricks Spark workflow that pulls data from the GitHub REST API and loads it into Databricks lakehouse tables.
Get source data from the following endpoints:
- Organizations
- Repositories
- Contributors
- Pull Requests
API response data is loaded into the github_api database in the Databricks warehouse using upsert logic: rows that are new are inserted into the tables, and rows whose data has been modified are overwritten.
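As a rough illustration of the upsert behaviour, the sketch below builds a `MERGE INTO` statement of the kind Databricks SQL accepts for Delta tables. It is not the project's actual implementation: the helper name `build_merge_sql`, the table name `github_api.pulls_source`, the view name `pulls_updates`, and the `id` key column are all hypothetical, and it assumes the target tables are Delta tables.

```python
from typing import Sequence


def build_merge_sql(target: str, source_view: str, keys: Sequence[str]) -> str:
    """Build an upsert (MERGE) statement for a Delta table.

    Rows in `source_view` whose key columns match a row in `target`
    overwrite that row; rows with no match are inserted.
    All names passed in are illustrative, not taken from the project.
    """
    if not keys:
        raise ValueError("at least one key column is required")
    # Join condition on the key columns, e.g. "t.id = s.id"
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source_view} AS s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical usage in a Databricks notebook, where `spark` is predefined:
#   updates_df.createOrReplaceTempView("pulls_updates")
#   spark.sql(build_merge_sql("github_api.pulls_source", "pulls_updates", ["id"]))
print(build_merge_sql("github_api.pulls_source", "pulls_updates", ["id"]))
```

Note that this version rewrites every matched row, whether or not its values changed. The end state is the same as overwriting only modified rows, at the cost of some extra writes.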
- config
- etl: set etl job config values (org names, repo names, etc) here
- etl
- github
- api_request: general framework for making api requests with `requests` library
- contributors: api response data from `stats/contributors` endpoint
- org: api response data from `orgs` endpoint
- pulls: api response data from `pulls` endpoint
- repo: api response data from `repos` endpoint
- spark
- date_utils: general datetime helpers
- etl_config: class for creating dynamic config
- spark_table: SparkTable class for working with source tables in our metastore
- spark_utills: general databricks spark helpers
- sql_table: SqlTable subclass of SparkTable for tables created from a sql query of source tables
- tables
- users: SqlTable for users dimension table
- github
- src
- dag_notebook: Databricks notebook with workflow code to be run with an orchestrator
- source_tables: etl logic for source tables (warehouse tables ending in `_source_`)
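To make the role of the api_request module in the list above more concrete, here is a minimal sketch of a paginated request helper. It is an assumption about how such a framework could look, not the project's code: `endpoint_url` and `paginate` are hypothetical names, and the URL paths are GitHub's documented REST paths for the four endpoints, which may differ from the ones the project calls. The `fetch` argument is expected to return an object that behaves like a `requests.Response` (with `raise_for_status()`, `json()`, and `links`).

```python
from typing import Callable, Iterator

GITHUB_API = "https://api.github.com"


def endpoint_url(kind: str, org: str, repo: str = "") -> str:
    """Build the URL for one of the four endpoints (paths are assumptions)."""
    paths = {
        "org": f"/orgs/{org}",
        "repos": f"/orgs/{org}/repos",
        "contributors": f"/repos/{org}/{repo}/stats/contributors",
        "pulls": f"/repos/{org}/{repo}/pulls",
    }
    return GITHUB_API + paths[kind]


def paginate(url: str, fetch: Callable) -> Iterator[dict]:
    """Yield every item from a list endpoint, following `next` page links."""
    while url:
        response = fetch(url)
        response.raise_for_status()
        yield from response.json()
        # requests parses the Link header into `response.links`;
        # when there is no "next" entry, url becomes None and the loop ends.
        url = response.links.get("next", {}).get("url")


# Hypothetical usage with the requests library ("my-org", "my-repo" and
# `token` are placeholders):
#   import requests
#   session = requests.Session()
#   session.headers["Authorization"] = f"Bearer {token}"
#   pulls = list(paginate(endpoint_url("pulls", "my-org", "my-repo"),
#                         lambda u: session.get(u, timeout=30)))
```

Passing `fetch` in as a callable keeps the pagination logic independent of the HTTP client, so it can be exercised with a stub response object without making network calls.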
