0.1.2 warehousing using dbt, duckdb and iceberg #55

Draft
wants to merge 37 commits into
base: v0.1.1

Conversation

Collaborator

@redpheonixx commented Oct 25, 2024


Title: Boiler Plate Code for DBT Transformations

Assignees: Amit Singh
Labels: Transformation layer

Description:

This pull request introduces boilerplate code for dbt transformations that follow the medallion architecture. The transformations are structured across three layers:

Bronze Layer: Initial raw data ingestion and storage.

Silver Layer: Intermediate transformations to refine and standardize data.

Gold Layer: Final transformations for analytical purposes, providing clean and consumable data.

The implementation uses the Iceberg REST catalog to manage the datasets and transformations. This setup aims to streamline the development and maintenance of dbt models and to keep the data pipeline robust and scalable.
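The three layers can be illustrated with a minimal sketch. It is not the code from this PR: Python's built-in sqlite3 module stands in for DuckDB so that the example runs without extra dependencies, and plain SQL views stand in for dbt models. The table, view, and column names (bronze_orders, silver_orders, gold_revenue_by_country) are hypothetical.

```python
import sqlite3

# In-memory database; sqlite3 is only a stand-in for DuckDB (assumption).
con = sqlite3.connect(":memory:")

# Bronze: raw rows stored exactly as ingested (all text, duplicates included).
con.execute("CREATE TABLE bronze_orders (order_id TEXT, amount TEXT, country TEXT)")
con.executemany(
    "INSERT INTO bronze_orders VALUES (?, ?, ?)",
    [
        ("1", "10.50", "in"),
        ("2", "4.00", "IN"),
        ("2", "4.00", "IN"),   # duplicate row
        ("3", "n/a", "us"),    # unparseable amount
    ],
)

# Silver: cast types, standardize values, drop duplicates and bad rows.
con.execute(
    """
    CREATE VIEW silver_orders AS
    SELECT DISTINCT
        CAST(order_id AS INTEGER) AS order_id,
        CAST(amount AS REAL)      AS amount,
        UPPER(country)            AS country
    FROM bronze_orders
    WHERE amount GLOB '[0-9]*'
    """
)

# Gold: aggregate the cleaned data for analytical use.
con.execute(
    """
    CREATE VIEW gold_revenue_by_country AS
    SELECT country, SUM(amount) AS revenue, COUNT(*) AS orders
    FROM silver_orders
    GROUP BY country
    """
)

gold_rows = con.execute(
    "SELECT country, revenue, orders FROM gold_revenue_by_country ORDER BY country"
).fetchall()
print(gold_rows)  # [('IN', 14.5, 2)]
```

In a dbt project each view would typically live in its own model file, with the silver and gold models selecting from the layer below them.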


#45

wrote dbt transformation layer as per medallion architecture

dbt's low learning curve and SQL-based workflow empower data teams without advanced programming knowledge to build and maintain transformations. This makes it a better fit for companies looking for agility and simplicity in data pipelines.

@tusharchou added the documentation, enhancement, help wanted, question, good first review, and dependencies labels on Oct 28, 2024
@tusharchou added this to the 0.1.2 Warehousing: DuckDB milestone on Oct 28, 2024
@tusharchou linked an issue that may be closed by this pull request on Oct 28, 2024
@redpheonixx changed the title from "boiler plate code for dbt transformation" to "boiler plate code for warehouse 0.1.2" on Oct 28, 2024
Owner

@redpheonixx what issue is this associated with?

Collaborator Author

@redpheonixx Oct 28, 2024

@redpheonixx changed the title from "boiler plate code for warehouse 0.1.2" to "0.1.2 warehousing using dbt, duckdb and iceberg" on Oct 28, 2024
@tusharchou marked this pull request as draft on October 29, 2024 12:19
brmhastra and others added 17 commits October 30, 2024 21:35
* Update publish.yml

* Update publish.yml

* Update publish.yml

* Update publish.yml

updated yaml file to copy distribution o/p from build to root directory

* Update publish.yml

added detailed copy from /ldf/dist to gihub/wo../dist

* Update publish.yml

added a code for creating dist directory

---------

Co-authored-by: Tushar Choudhary <[email protected]>
* Update publish.yml

* Update publish.yml

* Update publish.yml

* Update publish.yml

updated yaml file to copy distribution o/p from build to root directory

* Update publish.yml

added detailed copy from /ldf/dist to gihub/wo../dist

* Update publish.yml

added a code for creating dist directory

* Update publish.yml

relocating dots from ./github to /.github

---------

Co-authored-by: Tushar Choudhary <[email protected]>
* Update publish.yml

* Update pyproject.toml
* Update publish.yml

* Update pyproject.toml

* Update publish.yml
* Release v1.1 dist changes

* Release v1.1 publish.yml changes

* Release v1.1 publish.yml changes
* Release v1.1 dist changes

* Release v1.1 publish.yml changes

* Release v1.1 publish.yml changes

* Release v1.1
@redpheonixx changed the base branch from main to v0.1.1 on November 8, 2024 07:45
@tusharchou
Owner

Raising a pull request (PR) from a fork's main branch to a repository's main branch is generally discouraged for several reasons:

  1. Complexity and Confusion: Using the main branch of your fork for PRs can create confusion if you make other changes or updates to your fork's main branch. It becomes harder to track what specific changes the PR is addressing.

  2. Continuous Integration Issues: Many projects have automated checks and continuous integration (CI) workflows that run on branches. If you use the main branch for PRs, these workflows might get triggered unnecessarily, potentially causing conflicts or errors.

  3. Maintenance Difficulty: It's easier to maintain a clean and organized history if you create a separate feature branch for each set of changes. This approach avoids cluttering the main branch and makes it simpler to manage and merge changes.

  4. Conflict Management: Keeping changes in isolated branches helps in managing conflicts better. If your main branch is used for a PR and it diverges significantly from the upstream main branch, resolving conflicts can become more complicated.

  5. Best Practices: Following best practices, like creating feature branches for specific changes or issues, improves collaboration and code quality. It also aligns with how most open-source projects and development teams operate.

Best Practice Workflow:

  1. Create a Feature Branch:

    git checkout -b feature/your-feature-name

  2. Make Your Changes, Stage Them, and Commit:

    git add -A
    git commit -m "Describe your changes"

  3. Push Your Feature Branch to Your Fork:

    git push origin feature/your-feature-name

  4. Open a PR from Your Feature Branch to the Upstream Main Branch:

    • Go to the repository on GitHub.
    • Open a PR from your-fork:feature/your-feature-name to upstream:main.

This approach keeps your development process clean and organized. It ensures that the main branches remain stable and only contain code that is ready for production.

Feel free to ask if you need more details or help with Git workflows! 🚀✨

redpheonixx and others added 5 commits November 8, 2024 13:41
* fixed bug warehouse uri

Thu Oct 24 9:25 PM IST

* added logging to CSV.get()

* supported big query ts format

* refactored parameter to config from catalog to reduce confusion

* Fixed bug of logger in GCP

* replaced local path with dynamic path

* replaced local path with dynamic path

* replaced local path with dynamic path

* demo

* demo

* Release v1.1 dist changes

* Release v1.1 publish.yml changes

* Release v1.1 publish.yml changes

* added a class implementation for github issue hoping it is useful for guiding users to resolution ETAs eventually

* added a exception for PlanNotFound to ask users to raise issues on the repository for resolution

* Updated overview and milestones. Added directory structure under technical specifications.

* Updated components in technical specifications

* added testing for BigQueryToCSV.extract()

* added testing for BigQueryToCSV.extract()

* added testing for Iceberg.get()

* Release v1.1

* Release v1.1 bug fix

* Release v1.1 bug fix

* Release v1.1 bug fix

* Pytest Added for BigQuery Source

---------

Co-authored-by: Tushar Choudhary <[email protected]>
Collaborator

@rakhioza07 left a comment

Please fix review comments and move into draft PR until changes are made.

Great work on exception handling and clean coding efforts!

- name: Set up Python
  uses: actions/setup-python@v4
  with:
    python-version: "3.x"
Collaborator

Discuss and settle on a specific Python version, 3.6.5 or higher, instead of the open-ended "3.x"

Check for library dependencies

Collaborator

Please don't commit .tar.gz, .whl, or other heavy build artifacts

PRs should contain only code, scripts, config files, and sample data

Collaborator

Consider moving all SQL files to a central place

Consider keeping code, data, queries, and configs in separate directories

Collaborator

This script needs to be OOP

Collaborator

Does this file need to be part of the commit? If not, please add it to .gitignore


p=Path('C:/Users//singsina//Desktop//local-data-platform//local-data-platform//tmp//warehouse//')
print(p)
p_path="file:///"+str(p)
Collaborator

Use f-strings for string manipulation
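A minimal sketch of what this suggestion could look like, assuming the goal is a file URI for the warehouse directory. The function names and the tmp/warehouse layout under a base directory are hypothetical and not taken from the PR. The second function shows pathlib's Path.as_uri() as an alternative that avoids assembling the URI by hand and removes the hard-coded user path.

```python
from pathlib import Path


def warehouse_uri_fstring(warehouse: str) -> str:
    """f-string equivalent of the reviewed concatenation "file:///" + str(p)."""
    return f"file:///{warehouse}"


def warehouse_uri(base_dir: Path) -> str:
    """Build the URI from a base directory instead of a hard-coded path.

    The tmp/warehouse layout is an assumption made for this example.
    """
    warehouse = (base_dir / "tmp" / "warehouse").resolve()
    # as_uri() requires an absolute path, which resolve() provides.
    return warehouse.as_uri()


print(warehouse_uri_fstring("C:/data/warehouse"))  # file:///C:/data/warehouse
print(warehouse_uri(Path.cwd()))
```

Deriving the path from the current working directory (or from configuration) also keeps a machine-specific location out of the repository.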

Labels
dependencies, documentation, enhancement, good first review, help wanted, question
Projects
None yet
Development

Successfully merging this pull request may close these issues.

0.1.2 Warehousing: add duckdb, dbt and iceberg packages
5 participants