
Create dbt package generator using GenAI and PyAirbyte #6

Open
@aaronsteers


Summary

The goal of this application is to take the raw data schema of a source that is run with PyAirbyte and auto-generate a simple dbt project for that data.

This could be the foundation of a new type of integration opportunity for Airbyte users.

Definition of Done

These requirements are not specific to GenAI, but they form the foundation of the code generation:

  • Solution should be written in Python.
  • Solution can be written as a new feature in PyAirbyte or as a standalone project. (Author's preference, but probably easier/faster as a new project that just calls PyAirbyte.)
  • Solution should be able to generate a basic dbt project scaffold, including a basic "profiles" yaml file and a "dbt_project.yml" file. (Okay if these are hard-coded or hand-written as generic boilerplate.)
  • Solution should be able to generate a "sources" yaml file for one or more sources that are being extracted using PyAirbyte. This should describe the tables being used.
  • The generated project should run successfully with dbt run, as proof that the solution works.
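
As a rough illustration of the scaffold items above, here is a minimal sketch that writes hard-coded boilerplate plus a generated "sources" yaml. The project name `airbyte_dbt`, profile name `airbyte_duckdb`, source name `airbyte_raw`, file name `airbyte_cache.duckdb`, and the helper functions are all hypothetical, and the profile assumes the dbt-duckdb adapter. Table names would come from whichever streams are being extracted with PyAirbyte.

```python
from pathlib import Path

# Hard-coded boilerplate, as the issue allows. All names here are hypothetical.
DBT_PROJECT_YML = """\
name: airbyte_dbt
version: "1.0.0"
config-version: 2
profile: airbyte_duckdb
model-paths: ["models"]
"""

# Assumes the dbt-duckdb adapter and a local DuckDB file.
PROFILES_YML = """\
airbyte_duckdb:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: airbyte_cache.duckdb
      schema: main
"""


def render_sources_yml(source_name: str, schema: str, tables: list[str]) -> str:
    """Render a dbt sources file that lists the raw tables."""
    lines = [
        "version: 2",
        "sources:",
        f"  - name: {source_name}",
        f"    schema: {schema}",
        "    tables:",
    ]
    lines += [f"      - name: {table}" for table in tables]
    return "\n".join(lines) + "\n"


def write_scaffold(root: Path, source_name: str, schema: str, tables: list[str]) -> list[Path]:
    """Write dbt_project.yml, profiles.yml, and models/sources.yml under root."""
    models_dir = root / "models"
    models_dir.mkdir(parents=True, exist_ok=True)
    files = {
        root / "dbt_project.yml": DBT_PROJECT_YML,
        root / "profiles.yml": PROFILES_YML,
        models_dir / "sources.yml": render_sources_yml(source_name, schema, tables),
    }
    for path, text in files.items():
        path.write_text(text, encoding="utf-8")
    return sorted(files)
```

For example, `write_scaffold(Path("my_project"), "airbyte_raw", "main", ["users", "purchases"])` would produce a directory where `dbt run --profiles-dir .` can be attempted once at least one model exists.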

The GenAI "code gen" application portion of this project is:

  • Solution should be able to generate a dbt model (a .sql file) performing some basic transforms on top of the source table(s) defined in the "sources" yaml.
    • For instance, if the raw data is a 'sales' table, the LLM may create an aggregate table.
    • The LLM may also create 'stage' tables that map the raw schema to new column names with a conformed naming convention and/or conformed data types.
  • Solution should use an LLM to generate the SQL.
  • Instructions to the LLM can be hard-coded to one particular source, but the LLM should be doing the work of generating the SQL.
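
One possible shape for the code-gen step above, sketched with the LLM passed in as a plain callable so that no particular provider or client library is assumed. The function names, the prompt wording, and the fence-stripping step are illustrative assumptions, not a prescribed design.

```python
from pathlib import Path
from typing import Callable

FENCE = "`" * 3  # Markdown code fence marker that some LLMs wrap around SQL


def build_prompt(source: str, table: str, columns: dict[str, str], task: str) -> str:
    """Build hard-coded instructions for one source table."""
    source_ref = "{{ source('" + source + "', '" + table + "') }}"
    column_list = ", ".join(f"{name} ({dtype})" for name, dtype in columns.items())
    return (
        "You are writing a dbt model for DuckDB.\n"
        f"Read from {source_ref}.\n"
        f"Columns: {column_list}\n"
        f"Task: {task}\n"
        "Return only the SQL for the model, with no commentary.\n"
    )


def strip_code_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence if the LLM added one."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].startswith(FENCE):
        lines = lines[:-1]
    return "\n".join(lines).strip() + "\n"


def generate_model(
    llm: Callable[[str], str], models_dir: Path, model_name: str, prompt: str
) -> Path:
    """Ask the LLM for SQL and save it as models_dir/<model_name>.sql."""
    sql = strip_code_fence(llm(prompt))
    models_dir.mkdir(parents=True, exist_ok=True)
    path = models_dir / f"{model_name}.sql"
    path.write_text(sql, encoding="utf-8")
    return path
```

Here `llm` is any function that takes a prompt string and returns the model's text response. Keeping it injectable also makes the generator easy to exercise with a stub before wiring in a real LLM.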

In terms of documentation:

  • A README.md will be required for this project.
  • A walkthrough tutorial explaining usage is also required. The walkthrough can exist within the README.md or can be provided in any other format, such as a blog post.
  • A demo video walkthrough is optional.

Suggestions (Per Author's Discretion)

These are suggestions, not requirements:

  • We suggest using a simple source like source-faker, source-coin-api, or similar.
  • We suggest using DuckDB as a backend, since it is easy to replicate results locally, doesn't require a paid account, and has good SQL support.

Resources to Assist

  • PyAirbyte can be used to gather the JSON schema for each stream.
  • (@aaronsteers will add more resources and info here shortly.)
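
Once a stream's JSON schema is in hand (the PyAirbyte call that retrieves it is not shown here), it could be flattened into a column list to feed the sources file or the LLM prompt. A minimal sketch follows. The type mapping to DuckDB types and the function name are assumptions, and nested objects and arrays are simply treated as JSON.

```python
# Assumed mapping from JSON schema types to DuckDB column types.
JSON_TO_SQL = {
    "string": "VARCHAR",
    "integer": "BIGINT",
    "number": "DOUBLE",
    "boolean": "BOOLEAN",
    "object": "JSON",
    "array": "JSON",
}


def columns_from_json_schema(schema: dict) -> dict[str, str]:
    """Map each top-level property of a JSON schema to a SQL type name."""
    columns = {}
    for name, spec in schema.get("properties", {}).items():
        json_type = spec.get("type", "string")
        # Nullable fields are often declared as a list such as ["null", "string"].
        if isinstance(json_type, list):
            non_null = [t for t in json_type if t != "null"]
            json_type = non_null[0] if non_null else "string"
        if json_type == "string" and spec.get("format") == "date-time":
            columns[name] = "TIMESTAMP"
        else:
            columns[name] = JSON_TO_SQL.get(json_type, "VARCHAR")
    return columns
```

The resulting dictionary of column names and types is small enough to embed directly in hard-coded LLM instructions.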

Projects

  • Status

    Declined
