Summary
The goal of this application is to take a specific raw data schema for a source being run with PyAirbyte and to auto-generate a simple dbt project for that data.
This could be the foundation of a new type of integration opportunity for Airbyte users.
Definition of Done
These are not specifically related to GenAI, but are the foundation of the code-gen:
- Solution should be written in Python.
- Solution can be written as a new feature in PyAirbyte or as a standalone project. (Author's preference, but probably easier/faster as a new project that just calls PyAirbyte.)
- Solution should be able to generate a basic dbt project scaffold, including a basic "profiles" yaml and "dbt_project.yml". (Okay if these are hard-coded or hand-written as generic boilerplate.)
- Solution should be able to generate a "sources" yaml file for one or more sources that are being extracted using PyAirbyte. This should describe the tables being used (see the sketch after this list).
- Solution should be able to be executed with `dbt run`, as proof of the working solution.
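As a rough illustration, below is a minimal Python sketch of the scaffold and "sources" yaml generation, assuming DuckDB as the target warehouse and PyYAML for serialization. The function name `generate_scaffold` and the shape of the `streams` argument (stream name mapped to JSON schema) are illustrative assumptions, not part of PyAirbyte or dbt.

```python
# Minimal sketch: write generic dbt boilerplate plus a sources.yml built from
# stream JSON schemas. All file contents are hard-coded boilerplate per the
# requirements above; names like `generate_scaffold` are illustrative.
from pathlib import Path

import yaml  # PyYAML, assumed as a dependency


def generate_scaffold(project_dir: str, project_name: str, streams: dict) -> None:
    """`streams` maps stream name -> JSON schema gathered via PyAirbyte."""
    root = Path(project_dir)
    (root / "models").mkdir(parents=True, exist_ok=True)

    # Generic dbt_project.yml boilerplate.
    dbt_project = {
        "name": project_name,
        "version": "1.0.0",
        "profile": project_name,
        "model-paths": ["models"],
    }
    (root / "dbt_project.yml").write_text(yaml.safe_dump(dbt_project))

    # Generic profiles.yml pointing at a local DuckDB file.
    profiles = {
        project_name: {
            "target": "dev",
            "outputs": {
                "dev": {"type": "duckdb", "path": "airbyte.duckdb", "schema": "main"}
            },
        }
    }
    (root / "profiles.yml").write_text(yaml.safe_dump(profiles))

    # sources.yml describing the raw tables loaded by PyAirbyte.
    sources = {
        "version": 2,
        "sources": [
            {
                "name": "airbyte_raw",
                "schema": "main",
                "tables": [
                    {
                        "name": stream_name,
                        "columns": [{"name": col} for col in schema.get("properties", {})],
                    }
                    for stream_name, schema in streams.items()
                ],
            }
        ],
    }
    (root / "models" / "sources.yml").write_text(yaml.safe_dump(sources))
```

With these files in place, running `dbt run --project-dir <project_dir> --profiles-dir <project_dir>` against the DuckDB file that PyAirbyte loaded should satisfy the execution requirement above.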
The GenAI "code gen" application portion of this project is:
- Solution should be able to generate a dbt model (a .sql file) performing some basic transforms on top of the source table(s) defined in the "sources" yaml.
  - For instance, if the raw data is a 'sales' table, the LLM may create an aggregate table.
  - The LLM may also create 'stage' tables that take the raw schema and map it to new column names with a conformed naming convention and/or conformed data types.
- Solution should use an LLM to generate the SQL (a minimal sketch follows this list).
- Instructions to the LLM can be hard-coded to one particular source, but the LLM should be doing the work of generating the SQL.
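Below is a minimal sketch of the LLM step, assuming the OpenAI Python client; the prompt, model name, and helper function are illustrative, and any LLM provider could be substituted.

```python
# Minimal sketch: ask an LLM to write a staging model for one source table.
# Assumes the OpenAI Python client; the prompt and model name are illustrative.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_model_sql(source_name: str, table_name: str, json_schema: dict) -> str:
    prompt = (
        "Write a dbt model (SQL only, no markdown fences) that stages the table "
        f"{{{{ source('{source_name}', '{table_name}') }}}} with conformed snake_case "
        f"column names and sensible casts. The table's JSON schema is: {json_schema}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Write the generated model where dbt will pick it up.
sql = generate_model_sql("airbyte_raw", "sales", {"properties": {"id": {"type": "integer"}}})
Path("models/stg_sales.sql").write_text(sql)
```

Hard-coding the prompt to one source, as allowed above, keeps the sketch simple; a more general version would template the prompt from the generated "sources" yaml.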
In terms of documentation:
- A README.md will be required for this project.
- A walkthrough tutorial explaining usage is also required. The walkthrough can exist within the README.md or can be provided in any other format, such as a blog post.
- A demo video walkthrough is welcome, but not required.
Suggestions (Per Author's Discretion)
These are some suggestions, but they are not required:
- We suggest using a simple source like `source-faker`, `source-coin-api`, or similar.
- We suggest using DuckDB as a backend, since it is easy to replicate results locally, doesn't require a paid account, and has good SQL support (see the sketch after this list).
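For reference, a minimal sketch of loading source-faker into PyAirbyte's default local DuckDB cache; the `count` config key and the cache defaults are assumptions that may vary by connector and PyAirbyte version.

```python
# Minimal sketch: load source-faker into PyAirbyte's default DuckDB cache.
import airbyte as ab

source = ab.get_source(
    "source-faker",
    config={"count": 1_000},  # number of fake records to generate
    install_if_missing=True,
)
source.check()
source.select_all_streams()

cache = ab.get_default_cache()  # a local DuckDB file managed by PyAirbyte
result = source.read(cache=cache)

for stream_name, dataset in result.streams.items():
    print(stream_name, len(dataset.to_pandas()))
```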
Resources to Assist
- PyAirbyte can be used to gather the JSON schema for each stream (see the sketch below).
- (@aaronsteers will add more resources and info here shortly.)
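For example, here is a minimal sketch of collecting the per-stream JSON schemas, which can feed both the "sources" yaml and the LLM prompt; the `discovered_catalog` property is assumed here, and exact attribute names may differ between PyAirbyte versions.

```python
# Minimal sketch: collect the JSON schema for each stream so it can feed the
# "sources" yaml and the LLM prompt shown earlier.
import airbyte as ab

source = ab.get_source("source-faker", config={"count": 1_000}, install_if_missing=True)
source.select_all_streams()

stream_schemas = {
    stream.name: stream.json_schema  # AirbyteStream objects from the catalog
    for stream in source.discovered_catalog.streams
}
print(list(stream_schemas))
```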