Database Awareness - Chat #1679


Open · wants to merge 5 commits into base: dev

Conversation

@ngafar (Collaborator) commented Apr 25, 2025

Description

The Mito AI is now aware of database connections and can write SQL queries.

For now, this is limited to chat, and the connections must be hardcoded.

Testing

  1. In the .mito/db dir, add the new connections.json and schemas.json files.
  2. Then start a Jupyter server and ask a question that requires a db connection.

Also be sure to ask about data that sounds relevant to a table but does not actually exist.
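A minimal sketch of the setup in step 1. The exact JSON fields Mito expects in connections.json and schemas.json are not documented in this PR, so the shapes below are hypothetical placeholders, only the file locations come from the testing steps above:

```python
import json
import os

# Hypothetical setup sketch: the .mito/db directory and file names come from
# the testing steps; the JSON field names below are illustrative guesses.
db_dir = os.path.join(os.path.expanduser("~"), ".mito", "db")
os.makedirs(db_dir, exist_ok=True)

# Placeholder connection details (never commit real credentials).
with open(os.path.join(db_dir, "connections.json"), "w") as f:
    json.dump({"account": "XXX", "user": "XXX", "password": "XXX"}, f)

# Placeholder schema: database -> schema -> table -> columns.
with open(os.path.join(db_dir, "schemas.json"), "w") as f:
    json.dump(
        {"TELCO_CHURN": {"PUBLIC": {"LOCATION_DATA": ["CUSTOMER_ID", "STATE"]}}},
        f,
    )
```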

Documentation

N/A - We should add documentation, but only after we add the taskpane for adding new db connections.

vercel bot commented Apr 25, 2025

monorepo: ✅ Ready · Updated Apr 25, 2025 6:49pm (UTC)

@ngafar ngafar requested a review from aarondr77 April 25, 2025 15:40
@ngafar ngafar changed the title [WIP] Database Awareness - Chat Database Awareness - Chat Apr 25, 2025
@aarondr77 (Member)
The schema has column headers in uppercase, so if I ask it to write code for a question like: How does the location of a customer affect their likelihood to churn?

It ends up referencing the columns in pandas with all uppercase, even though the resulting dataframe has lowercase headers. If Snowflake SQLAlchemy always ends up with lowercase dataframe headers (idk if this is the case), then we should probably update the schema.

import pandas as pd
from sqlalchemy import create_engine

# Database connection setup
user = "XXX"
password = "XXX"
account = "XXX"
warehouse = "COMPUTE_WH"
database = "TELCO_CHRUN"
schema = "PUBLIC"

conn_str = (
    f"snowflake://{user}:{password}@{account}/"
    f"{database}/{schema}?warehouse={warehouse}"
)
engine = create_engine(conn_str)

# Query for churn status and location data
query = """
SELECT 
    l.COUNTRY,
    l.STATE,
    l.CITY,
    s.CHURN_LABEL
FROM LOCATION_DATA l
JOIN STATUS_ANALYSIS s
ON l.CUSTOMER_ID = s.CUSTOMER_ID
"""

df = pd.read_sql(query, engine)

# Analyze churn rate by state (you can modify to COUNTRY or CITY)
churn_by_state = (
    df.groupby('STATE')['CHURN_LABEL']
    .mean()
    .sort_values(ascending=False)
    .reset_index()
    .rename(columns={'CHURN_LABEL': 'churn_rate'})
)
churn_by_state
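One possible workaround for the case mismatch discussed above, sketched on a toy DataFrame rather than a real Snowflake result: normalize the columns to lowercase right after the query, so generated pandas code can rely on one consistent casing. The sample data here is made up for illustration:

```python
import pandas as pd

# Stand-in for a query result whose headers came back uppercase.
df = pd.DataFrame({"CHURN_LABEL": [1, 0], "STATE": ["CA", "NY"]})

# Normalize immediately after the query so downstream code can always
# use lowercase column names regardless of what the driver returned.
df.columns = df.columns.str.lower()

churn_by_state = df.groupby("state")["churn_label"].mean()
```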


APP_DIR_PATH: Final[str] = os.path.join(MITO_FOLDER)

with open(os.path.join(APP_DIR_PATH, 'db', 'connections.json'), 'r') as f:
Member

For this PoC I think this is okay, but instead of hardcoding the username and password in the notebook, we should be importing them from the config file. Hardcoding credentials in a notebook is obviously bad practice ...
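A sketch of what that could look like: the generated notebook reads the credentials from connections.json instead of inlining them. The key names ('user', 'password', etc.) are illustrative guesses, not the actual layout of Mito's config; the example writes its own sample file so it is self-contained:

```python
import json
import os
import tempfile

# Setup purely for this self-contained sketch: write a sample config file.
# In practice this file would already exist under .mito/db.
db_dir = os.path.join(tempfile.mkdtemp(), "db")
os.makedirs(db_dir)
sample = {
    "user": "XXX", "password": "XXX", "account": "XXX",
    "warehouse": "COMPUTE_WH", "database": "TELCO_CHURN", "schema": "PUBLIC",
}
with open(os.path.join(db_dir, "connections.json"), "w") as f:
    json.dump(sample, f)

# The notebook would then load credentials rather than hardcode them.
with open(os.path.join(db_dir, "connections.json")) as f:
    conn = json.load(f)

conn_str = (
    f"snowflake://{conn['user']}:{conn['password']}@{conn['account']}/"
    f"{conn['database']}/{conn['schema']}?warehouse={conn['warehouse']}"
)
```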

- Do not use a with statement when creating the SQLAlchemy engine. Instead, initialize it once so it can be reused for multiple queries.
- Always return the results of the query in a pandas DataFrame, unless instructed otherwise.
- Column names in query results may be returned in lowercase. Always refer to columns using their lowercase names in the resulting DataFrame (e.g., df['date'] instead of df['DATE']).
- If you think the requested data is stored in the database, but you are unsure, then ask the user for clarification.
Member

Nice! I've been having it respond with a question pretty often

@@ -37,6 +44,9 @@ def create_chat_system_message_prompt() -> str:

Notice in the example above that the citation uses line number 2 because citation line numbers are 0-indexed.

===
{get_database_rules()}
Member

Putting this in the system prompt so we don't send it over and over again makes sense.

Eventually, it might even make sense to move this to a tool that the agent and chat are allowed to use, called check_snowflake_schema or something like that, so the model can decide when to use it...
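A rough sketch of what such a tool definition might look like, using an OpenAI-style function-calling schema; the name check_snowflake_schema comes from the comment above, but the parameters and descriptions are hypothetical, nothing in this PR defines them:

```python
# Hypothetical tool definition (OpenAI-style function-calling format) that
# would let the model decide when to fetch the schema instead of always
# receiving it in the system prompt.
check_snowflake_schema_tool = {
    "type": "function",
    "function": {
        "name": "check_snowflake_schema",
        "description": (
            "Return the table and column schema for a Snowflake database "
            "so the model can write correct SQL."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "database": {
                    "type": "string",
                    "description": "Database to inspect, e.g. TELCO_CHURN.",
                },
            },
            "required": ["database"],
        },
    },
}
```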

@aarondr77 (Member)

I made a bit bigger of a schema (still, I think, a lot smaller than the real schemas our users will have) and the performance is definitely getting worse. I'll send you the schema so you can check it out.

I asked Claude to generate 12 questions that I could ask about the data and here is how it did:

Prompts it responded to with something like 'no data available':

  1. Which employee has the highest commission percentage?
  2. How many people are we paying more than 100k per year?
  3. Compare the sales performance of our top 5 products across different regions for the past 12 months.
  4. Which marketing campaigns delivered the highest ROI last year, and what channels were most effective for each?

Prompts it asked a targeted question about:

  1. Which departments have exceeded their annual budget allocations, and by how much? => "Is your budget/spending data stored in the database? If so, should I query the table FINANCE_DB.PUBLIC.tbl_budget_2024? Or is this data in one of your files (e.g., Prospects_16042025_500linesdataset.xlsx/csv), and if so, which one?"
  2. How has our S&P 500 stock portfolio performed compared to the sector averages over the past 3 years? => "What constitutes "your portfolio"? (e.g., do you have a list of stock symbols and weights, or is it an equally weighted portfolio of all S&P 500 stocks?) Should sector averages be equally weighted or weighted by market cap or sector weights?"

Prompts it identified the correct tables for:

  1. Find all of the employees who have worked at the company for more than 1 year
  2. Find the customer in each state who has made the single biggest purchase from our enterprise data
  3. What's the average monthly revenue per customer in our telecom business, broken down by service type?
  4. What's our customer churn rate for the telecom business, and how does it correlate with service plan types and monthly charges?
  5. Show me a breakdown of IoT device performance by location, highlighting any devices with abnormal temperature readings.
  6. Which employees have changed departments in the past year, and what was the impact on their salary?
