Database Awareness - Chat #1679


Merged · 16 commits into dev · May 13, 2025

Conversation

ngafar
Collaborator

@ngafar ngafar commented Apr 25, 2025

Description

The Mito AI is now aware of database connections, and can write SQL queries.

For now this is limited to chat, and the connections must be hardcoded.

Testing

  1. In the .mito/db dir, add the new connections.json and schemas.json files.
  2. Then start a Jupyter server and ask a question that requires a db connection.

Be sure to also ask about data that sounds relevant to a table but does not actually exist.
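For anyone setting this up: the PR doesn't show the contents of the two config files, so here is a minimal sketch of what hardcoded connections.json and schemas.json entries might look like. The key names and nesting are assumptions (only the warehouse/database/schema values appear in the generated code below); treat this as illustrative, not the actual file format.

```python
import json

# Hypothetical connections.json shape: one named Snowflake connection.
connections = {
    "telco_snowflake": {
        "type": "snowflake",
        "user": "XXX",
        "password": "XXX",
        "account": "XXX",
        "warehouse": "COMPUTE_WH",
        "database": "TELCO_CHRUN",
        "schema": "PUBLIC",
    }
}

# Hypothetical schemas.json shape: table -> column list, per connection.
schemas = {
    "telco_snowflake": {
        "LOCATION_DATA": ["CUSTOMER_ID", "COUNTRY", "STATE", "CITY"],
        "STATUS_ANALYSIS": ["CUSTOMER_ID", "CHURN_LABEL"],
    }
}

# These strings are what would be written to .mito/db/connections.json
# and .mito/db/schemas.json respectively.
connections_json = json.dumps(connections, indent=2)
schemas_json = json.dumps(schemas, indent=2)
```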

Documentation

N/A - We should add documentation, but after we add the taskpane to add new db connections.

vercel bot commented Apr 25, 2025

monorepo preview: ✅ Ready, updated May 13, 2025 5:04pm

@ngafar ngafar requested a review from aarondr77 April 25, 2025 15:40
@ngafar ngafar changed the title from [WIP] Database Awareness - Chat to Database Awareness - Chat Apr 25, 2025
@aarondr77
Member

The schema has column headers in uppercase, so if I ask it to write code for something like: How does the location of a customer affect their likelihood to churn?

It ends up referencing the columns in pandas with all uppercase, even though the resulting dataframe has lowercase headers. If the Snowflake SQLAlchemy dialect always produces lowercase dataframe headers (idk if this is the case), then we should probably update the schema.

import pandas as pd
from sqlalchemy import create_engine

# Database connection setup
user = "XXX"
password = "XXX"
account = "XXX"
warehouse = "COMPUTE_WH"
database = "TELCO_CHRUN"
schema = "PUBLIC"

conn_str = (
    f"snowflake://{user}:{password}@{account}/"
    f"{database}/{schema}?warehouse={warehouse}"
)
engine = create_engine(conn_str)

# Query for churn status and location data
query = """
SELECT 
    l.COUNTRY,
    l.STATE,
    l.CITY,
    s.CHURN_LABEL
FROM LOCATION_DATA l
JOIN STATUS_ANALYSIS s
ON l.CUSTOMER_ID = s.CUSTOMER_ID
"""

df = pd.read_sql(query, engine)

# Analyze churn rate by state (you can modify to COUNTRY or CITY)
churn_by_state = (
    df.groupby('STATE')['CHURN_LABEL']
    .mean()
    .sort_values(ascending=False)
    .reset_index()
    .rename(columns={'CHURN_LABEL': 'churn_rate'})
)
churn_by_state
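One low-touch workaround on the client side (rather than rewriting the schema) would be to normalize the dataframe's headers right after the query, so the generated pandas code and the returned dataframe agree on casing. A sketch, assuming pandas; the example dataframe just stands in for the query result above:

```python
import pandas as pd

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase all column names so generated code that uses
    lowercase identifiers matches what the driver returns."""
    return df.rename(columns=str.lower)

# Stand-in for df = pd.read_sql(query, engine) with uppercase headers:
df = normalize_columns(
    pd.DataFrame({"STATE": ["NY", "NY", "CA"], "CHURN_LABEL": [1, 0, 0]})
)

# Downstream code can now use lowercase names consistently.
churn_by_state = df.groupby("state")["churn_label"].mean()
```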

@aarondr77
Member

I made a bit bigger of a schema (still, I think, a lot smaller than the real schemas our users will have) and the performance is definitely getting worse. I'll send you the schema so you can check it out.

I asked Claude to generate 12 questions that I could ask about the data and here is how it did:

Prompts it responded with something like 'no data available'

  1. Which employee has the highest commission percentage?
  2. How many people are we paying more than 100k per year?
  3. Compare the sales performance of our top 5 products across different regions for the past 12 months.
  4. Which marketing campaigns delivered the highest ROI last year, and what channels were most effective for each?

Prompts it asked a targeted question about:

  1. Which departments have exceeded their annual budget allocations, and by how much? => "Is your budget/spending data stored in the database? If so, should I query the table FINANCE_DB.PUBLIC.tbl_budget_2024? Or is this data in one of your files (e.g., Prospects_16042025_500linesdataset.xlsx/csv), and if so, which one?"
  2. How has our S&P 500 stock portfolio performed compared to the sector averages over the past 3 years? => "What constitutes "your portfolio"? (e.g., do you have a list of stock symbols and weights, or is it an equally weighted portfolio of all S&P 500 stocks?) Should sector averages be equally weighted or weighted by market cap or sector weights?"

Prompts it identified the correct tables for:

  1. Find all of the employees who have worked at the company for more than 1 year
  2. Find the customer in each state who has made the single biggest purchase from our enterprise data
  3. What's the average monthly revenue per customer in our telecom business, broken down by service type?
  4. What's our customer churn rate for the telecom business, and how does it correlate with service plan types and monthly charges?
  5. Show me a breakdown of IoT device performance by location, highlighting any devices with abnormal temperature readings.
  6. Which employees have changed departments in the past year, and what was the impact on their salary?

Member

@aarondr77 aarondr77 left a comment


I'm noticing that the tool is struggling to know when to query the database more often than I anticipated based on the evals. For example:

Unless I am very specific, it often tries to import data from a file (even if that file doesn't exist). For example: whenever I say "Import the SP500 dataset", it tries to run "sp500_df = pd.read_csv('SP500.csv')". But if I say "Import the SP500 stock dataset", then it will choose to query the database.

Can we turn this into an eval so we can iterate on how we handle this? Maybe we should give it clearer instructions about how to check whether it should query the database or not.
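The eval suggested above could start very small: a list of (prompt, expected-route) pairs scored against whatever routing the agent chooses. The sketch below uses a deliberately naive stand-in heuristic (match prompt words against known table names) just to make the eval cases concrete; the real check would call the model, and should_query_database, KNOWN_TABLES, and EVAL_CASES are all hypothetical names.

```python
# Known table names would come from schemas.json in practice;
# this set is illustrative only.
KNOWN_TABLES = {"SP500", "LOCATION_DATA", "STATUS_ANALYSIS"}

def should_query_database(prompt: str) -> bool:
    """Naive stand-in for the agent's routing decision: route to the
    database iff the prompt mentions a known table name."""
    words = {w.strip(".,?!'\"").upper() for w in prompt.split()}
    return bool(words & KNOWN_TABLES)

# (prompt, expected route) pairs; True means "should query the DB".
EVAL_CASES = [
    ("Import the SP500 dataset", True),
    ("Import the SP500 stock dataset", True),
    ("Load my local results.csv file", False),
]

# Score the heuristic: a list of (prompt, passed) results.
results = [(p, should_query_database(p) == want) for p, want in EVAL_CASES]
passed = sum(ok for _, ok in results)
```

Note that both SP500 phrasings route correctly under this heuristic, which is exactly the failure mode the comment describes in the current prompt-driven behavior; an eval like this would catch regressions on it.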

@ngafar ngafar merged commit ab2e1a9 into dev May 13, 2025
6 of 10 checks passed