A database chatbot, utilizing:
- LLM (`Llama-3.1-70b`) for converting Natural Language to SQL and SQL results back to Natural Language
- SQLite3 database for storage & querying
- Vector embeddings for tag generation based on similarity, via Sentence Transformers
- SpaCy for query preprocessing & parsing semantic structure
- Python API (tested with Python 3.11):
  - (Optional) Create a virtual environment
  - Install requirements: `pip install -r requirements.txt`
  - Start the server: `python python_apisetup.py`
  - Please make sure that the `GROQ_API_KEY` environment variable is set. If not, please obtain a free API key from Groq.
- Frontend:
  - Go to the frontend directory: `cd frontend`
  - Install node modules: `npm i`
  - Start the frontend: `npm start`

You should be able to run it as below:
- Introduced a `similar_terms` column in the Events and Companies schema
  - Challenge: Company industry (and similarly, relevant industries for an event) are bad candidates for exact matching, e.g. the `oil & gas` industry and the `petroleum` industry are related.
  - Solution: Created exhaustive vector embeddings for each unique company industry derived from the `company_industries` column. For each event/company description, ran a similarity search using the embedding model `all-MiniLM-L6-v2` and created tags for each event and each company.
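The tagging step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the toy 3-d vectors and the `generate_tags` helper name are stand-ins; in the real pipeline the vectors would come from `all-MiniLM-L6-v2` via Sentence Transformers, and the similarity threshold would need tuning.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def generate_tags(description_vec, industry_vecs, threshold=0.5):
    """Tag a description with every industry whose embedding is close enough."""
    return [name for name, vec in industry_vecs.items()
            if cosine(description_vec, vec) >= threshold]

# Toy 3-d vectors standing in for real all-MiniLM-L6-v2 sentence embeddings.
industries = {
    "oil & gas": [0.9, 0.1, 0.0],
    "petroleum": [0.85, 0.15, 0.05],
    "finance":   [0.0, 0.2, 0.95],
}
tags = generate_tags([0.88, 0.12, 0.02], industries)  # tags related industries, not "finance"
```

This is how `oil & gas` and `petroleum` end up sharing tags even though their names never match exactly.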
- Handling Employee Ranges:
  - Challenge: Irregular formatting for the Employees column (e.g. `50-200`)
  - Solution: Stored the upper & lower limits of the employee count instead and deleted the previous column.
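A sketch of how such a column can be split — the `parse_employee_range` helper and the extra formats (`10,001+`, a bare number) are assumptions for illustration, not necessarily the exact formats in the dataset:

```python
import re

def parse_employee_range(raw):
    """Split an irregular employees value like '50-200' into (lower, upper).
    Upper is None for open-ended ranges like '10,001+'."""
    s = raw.replace(",", "").strip()
    m = re.fullmatch(r"(\d+)\s*-\s*(\d+)", s)
    if m:
        return int(m.group(1)), int(m.group(2))
    m = re.fullmatch(r"(\d+)\+", s)
    if m:
        return int(m.group(1)), None
    return int(s), int(s)  # a single number counts as both bounds
```

With separate bounds, a query like "companies with more than 100 employees" becomes a plain `employee_range_upper > 100` comparison instead of string matching.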
- Email Address Generation:
  - Challenge: Irregular email formatting, e.g. `[first][last]`, `[firstinitial][last]`, etc.
  - Solution: Computed the email IDs precisely by writing & running a simple Python script.
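The script could look something like the sketch below; `make_email` and the domain are hypothetical names, and only the two patterns mentioned above are handled:

```python
def make_email(first, last, pattern, domain):
    """Build an address for one of the observed local-part patterns,
    e.g. '[first][last]' -> 'janedoe@...' or '[firstinitial][last]' -> 'jdoe@...'."""
    first, last = first.lower(), last.lower()
    local = (pattern
             .replace("[firstinitial]", first[0])
             .replace("[first]", first)
             .replace("[last]", last))
    return f"{local}@{domain}"
```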
- Standardized revenue to filter out irrelevant comparisons:
  - Challenge: Revenue described in different denominations (e.g. `billions` vs `millions`)
  - Solution: Converted all revenue to `millions`.
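A minimal sketch of the normalization, assuming revenue strings of the form `"<number> million"` / `"<number> billion"` (the actual source formats may differ):

```python
def revenue_to_millions(raw):
    """Normalize strings like '$2.5 billion' or '300 million' to a number of millions."""
    s = raw.lower().replace("$", "").strip()
    value, unit = s.split()
    value = float(value)
    if unit.startswith("billion"):
        return value * 1000  # 1 billion = 1000 million
    if unit.startswith("million"):
        return value
    raise ValueError(f"unknown unit: {unit}")
```

Storing everything in a single `revenue_millions` column means generated SQL can compare revenues directly.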
- Natural Language Processing of the query using SpaCy:
  - Contextual Query Analysis: Uses `SpaCy` to analyze and determine whether the context is for companies or events.
    - Helps in queries like `The list of sales events being attended by finance companies`.
    - The LLM gets confused by such cases. SpaCy helps provide the relevant context:
      - Is a particular adjective for Events or for Companies? e.g. `Sales Events` and `Finance Companies`
      - When both events and companies are involved, should the result be a union or an intersection?
    - This context is then used by the LLM to make the right decision.
- Search Similar Chunks: Uses the tags column in the database to also search for similar terms when the search involves a particular industry.
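To make the adjective-to-noun attachment concrete, here is a rule-based stand-in for the SpaCy dependency parse (names and modifier list are illustrative only — the real project uses SpaCy's parse, not this heuristic). It shows the kind of context dictionary that could be handed to the LLM:

```python
# Stand-in for the SpaCy dependency parse: attach each known modifier
# to the first target noun ("events" or "companies") that follows it.
MODIFIERS = {"sales", "finance", "tech", "marketing"}

def extract_context(query):
    tokens = query.lower().replace(",", " ").split()
    context = {}
    for i, tok in enumerate(tokens):
        if tok in MODIFIERS:
            for nxt in tokens[i + 1:]:
                if nxt in ("events", "companies"):
                    context.setdefault(nxt, []).append(tok)
                    break
    return context

ctx = extract_context("The list of sales events being attended by finance companies")
# ctx pairs "sales" with events and "finance" with companies
```

Passing this mapping along with the query tells the LLM that `sales` qualifies the events and `finance` qualifies the companies, so it can join the two tables rather than filter one of them by both adjectives.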
- SQL Query Generation: Uses an LLM (`Llama-3.1-70b` via Groq, but plug-and-play with any OpenAI-compatible API) for `NL-2-SQL` generation and then to convert the SQL result back to Natural Language.
- Flask API Deployment:
  - Includes endpoints for processing natural language queries (`/api/query`) and retrieving results (`/api/result`), supporting JSON input and output.
- Irrelevant Results Without Vector Embeddings:
  - Challenge: When searching for a particular industry, the LLM generated queries for precise matching, leaving out industries with similar names.
  - Solution: Implemented vector embeddings for the relevant data and applied threshold-based filtering to capture nuanced relationships.
- Enhancing Contextual Understanding Using SpaCy:
  - Challenge: The LLM had problems understanding the relationship between adjectives (e.g. `finance`-related) and nouns (`events` vs `companies`)
  - Solution: Used SpaCy to analyze adjectives and nouns in queries, directing searches to the correct category and improving result relevance.
- Enhancing Synonym Handling with WordNet:
  - Plan: Integrate `WordNet` to identify synonyms for terms in the `similar_terms` column, creating a comprehensive list of related terms for more accurate search results.
- Query Matching (RAG) with Pre-Generated SQL Queries:
  - Maintain a list of pre-generated SQL queries to refine query processing and improve result accuracy.
  - What: Apply RAG for more accurate query context.
  - How: Queries will be matched against a pre-generated list of `Natural Language Query - SQL Query` tuples to find a close match. Providing further context (n-shot) can help with better generation.
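One way to sketch the planned matching step, using string similarity from the standard library. The example tuples, `closest_examples` helper, and the cutoff value are all hypothetical; a production version would likely match on embeddings rather than raw strings:

```python
import difflib

# Hypothetical pre-generated (natural-language, SQL) tuples.
EXAMPLES = [
    ("list all companies in the finance industry",
     "SELECT name FROM companies WHERE similar_terms LIKE '%finance%';"),
    ("which events start after 2024",
     "SELECT event_name FROM events WHERE event_start_date > '2024-01-01';"),
]

def closest_examples(query, n=1, cutoff=0.4):
    """Return up to n stored tuples whose NL part best matches the query;
    these are prepended to the LLM prompt as few-shot context."""
    nl_parts = [nl for nl, _ in EXAMPLES]
    matches = difflib.get_close_matches(query.lower(), nl_parts, n=n, cutoff=cutoff)
    return [(nl, sql) for nl, sql in EXAMPLES if nl in matches]
```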
- Query Submission: The user enters a query, which is then sent to the backend via a POST operation.
- Data Retrieval and Display:
  - Fetching Results: Makes a POST request to `/api/query` and fetches results from `/api/result`.
  - Displaying Results: Shows user queries and backend responses in a chat-like interface.
- Loading Indicators:
  - Shows a loading state until the answer is updated. For this I have a `useState` hook which stores the most recent answer. The functionality rechecks 5 times at constant intervals and shows the answer only once it has been updated from the backend.
- UI Styling:
  - Theming: Applies a consistent theme with Material-UI's ThemeProvider.
- POST Request to Submit Query:
  - Endpoint: `http://127.0.0.1:5000/api/query`
  - Method: POST
  - Body: JSON object with the user query.
- GET Request to Fetch Result:
  - Endpoint: `http://127.0.0.1:5000/api/result`
  - Method: GET
- Data Retrieval Challenges:
  - Outdated Data Issue: In the beginning, since the backend took time to refresh, the data for the previous query was retrieved and displayed. Implemented mechanisms to handle outdated data and show loading states: used `useState` for `currentAns` plus multiple retries, so that only the new query's data is loaded.
- Design Challenges:
  - User Experience: Making the UI user-friendly posed some challenges.
- Enhanced Data Retrieval:
  - Timestamp-Based Polling: Use exponential polling with timestamps to ensure accurate data retrieval. The display will not update until the backend refreshes, and an error state can be shown to the user if it never updates, because the check considers not only the current state of the answer but also its timestamp.
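The planned polling logic, sketched in Python for clarity (the real implementation would live in the React frontend; `poll_result`, the result shape, and the retry counts are illustrative assumptions):

```python
import time

def poll_result(fetch, sent_at, max_tries=5, base_delay=0.5):
    """Exponential backoff: accept a result only if its timestamp is newer
    than the query submission time; give up after max_tries."""
    delay = base_delay
    for _ in range(max_tries):
        result = fetch()  # e.g. GET /api/result
        if result and result["timestamp"] > sent_at:
            return result["answer"]
        time.sleep(delay)
        delay *= 2  # back off exponentially between retries
    return None  # caller shows an error state instead of a stale answer
```

Comparing timestamps rather than answer contents means a repeated query with an identical answer is still recognized as fresh, and a backend that never refreshes surfaces as an explicit error instead of a stale display.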
- Events Table:
  - Contains all columns from `event_info.csv`
  - `event_start_date` (DATE): The start date of the event, stored in 'YYYY-MM-DD' format
  - `event_end_date` (DATE): The end date of the event, stored in 'YYYY-MM-DD' format
  - `similar_terms`: containing all industry types pertaining to the event description
- Companies Table:
  - Columns from `company_info.csv`
  - `similar_terms`: containing all industry types pertaining to the company description
  - `employee_range_lower`: lower limit of the employee count range
  - `employee_range_upper`: upper limit of the employee count range
  - `revenue_millions`: revenue of the company in millions
- People Table:
  - Columns from `people_info.csv`
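The derived columns above can be sketched as DDL. Only the columns named in this README are shown; the `event_url` key, its use as a primary/foreign key, and the omission of the CSV-sourced columns are assumptions for illustration:

```python
import sqlite3

SCHEMA = """
CREATE TABLE events (
    event_url        TEXT PRIMARY KEY,  -- assumed key
    event_start_date DATE,              -- 'YYYY-MM-DD'
    event_end_date   DATE,              -- 'YYYY-MM-DD'
    similar_terms    TEXT               -- industry tags from the embedding step
    -- ...remaining columns from event_info.csv
);
CREATE TABLE companies (
    event_url            TEXT REFERENCES events(event_url),
    similar_terms        TEXT,
    employee_range_lower INTEGER,
    employee_range_upper INTEGER,
    revenue_millions     REAL
    -- ...remaining columns from company_info.csv
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```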
- Handling Diverse Data Formats:
  - Inconsistent Revenue Figures: Standardized revenue data (in millions) to avoid confusion.
  - Employee Range Confusion: Split ranges into lower and upper bounds to avoid irrelevant results.
- Generating and Managing Email Addresses:
  - Programmatically created email addresses, ensuring proper formatting.
- Irrelevant Industry Comparisons:
  - Removed the `company_industry` column to prevent misleading comparisons.
- Identifying Key Query Components and Indexing:
  - Optimize join operations and reduce query complexity by indexing relevant columns. For example, the `event_url` column in companies and the `event_url` column in events are joined in almost every other query; indexing them in the future would, I believe, greatly enhance performance.
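The planned index would be a one-liner per table (table and index names below are illustrative, matching the join described above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events    (event_url TEXT, event_name TEXT);
CREATE TABLE companies (company_name TEXT, event_url TEXT);
-- Index the join keys used by almost every generated query.
CREATE INDEX idx_events_url    ON events(event_url);
CREATE INDEX idx_companies_url ON companies(event_url);
""")
```

SQLite can then satisfy the `companies JOIN events ON event_url` lookup via an index search instead of a full table scan (verifiable with `EXPLAIN QUERY PLAN`).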
- Advanced Similarity Matching System:
  - Implement a dynamic similarity matching algorithm to improve accuracy.
