Description
Problem statement
Data.all currently requires users to have technical knowledge of data.all datasets, Glue tables, schemas, S3 buckets, folders, and SQL querying in order to access and derive insights from the diverse structured and unstructured data sets available across the organization. This creates a significant barrier for non-technical business users who need to query data quickly and easily to make informed decisions. Data.all lacks an intuitive, natural language interface that lets these users ask questions in plain English and receive relevant, contextual responses without SQL expertise or specialized data extraction skills.
Generative AI models offer a promising way to bridge this gap: they can understand user intent, extract data from both structured and unstructured sources, and generate dynamic responses tailored to the user's needs. This feature aims to empower non-technical users, such as business analysts and executive decision makers, to query and analyze structured and unstructured data in natural language by leveraging Generative AI (GenAI), improving data accessibility and data-driven decision-making within data.all.
User Stories
US1.
As a Data Consumer, whether a non-technical business user, business analyst, or executive decision maker, I want to be able to query structured data in data.all using natural language, so that I can quickly find and retrieve the insights I need for my applications and decision-making processes.
Acceptance Criteria:
- Data consumers can formulate natural language queries to retrieve information from structured data sources with a correction feedback loop to ensure valid SQL generation and execution
- Data consumers are presented with the query results in a clear and easy-to-understand format.
US2.
As a Data Consumer, whether a non-technical business user, business analyst, or executive decision maker, I want to be able to query unstructured data sources in data.all using natural language, so that I can quickly find and retrieve the insights I need for my applications and decision-making processes.
Acceptance Criteria:
- Data consumers can use natural language to query unstructured data sources containing text documents (e.g., PDF documents, text files), and the system will retrieve and summarize the relevant information.
- Data consumers are presented with the natural language query results in a clear and easy-to-understand format with references where appropriate
US3.
As a data.all developer and maintainer, I want the natural language query feature to be secure and respect data governance access permissions.
Acceptance Criteria:
- The natural language query feature employs a least privilege model to limit permissions and complies with data.all's security posture.
- The natural language query feature is available only to authenticated data.all users and has the same data access permissions as the user.
US4.
As a data.all developer and maintainer, I want the natural language query feature to be configurable, scalable, reliable, and seamlessly integrated into the data.all platform, so that I can ensure a smooth and efficient user experience for all data.all users.
Acceptance Criteria:
- The natural language query feature is modularized and can be turned on and off.
- The natural language query functionality is seamlessly integrated into the data.all user interface either in the worksheets module or a dedicated data.all module without significant changes in user experience.
US5.
As a data.all developer and maintainer, I want to be able to configure rate limits for the natural language query feature so that I can prevent overuse and ensure responsible access to the feature.
Acceptance Criteria:
- Maintainers can set thresholds for daily use metrics like dollar amount spent or number of queries executed per user.
- When a user reaches the configured threshold, the feature notifies them that they have hit the usage limit and remains restricted for that user until the next day (illustrated by the sketch below).
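The exact enforcement mechanism is an implementation detail; the following is a minimal sketch of a per-user daily threshold check. All names here (`UsageTracker`, `check_and_increment`, the default limit) are hypothetical, and a real implementation would persist counters in data.all's metadata database and could just as well track dollar amounts instead of query counts.

```python
# Minimal sketch of a per-user daily usage threshold, assuming the limit is a
# number of queries per day. Names are hypothetical; counters would be
# persisted in data.all's metadata database rather than held in memory.
from datetime import date


class UsageLimitExceeded(Exception):
    """Raised when a user has exhausted their daily quota."""


class UsageTracker:
    def __init__(self, daily_query_limit: int = 50):
        self.daily_query_limit = daily_query_limit
        self._counters = {}  # {(username, date): queries executed today}

    def check_and_increment(self, username: str) -> None:
        key = (username, date.today())
        count = self._counters.get(key, 0)
        if count >= self.daily_query_limit:
            # The UI surfaces this as a notification and the feature stays
            # restricted for the user until the next day.
            raise UsageLimitExceeded(
                f"{username} reached the daily limit of "
                f"{self.daily_query_limit} natural language queries."
            )
        self._counters[key] = count + 1
```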
US6.
As a data.all developer and maintainer, I want the natural language query feature to clearly display a disclaimer about the limitations and confidentiality of the responses, so that users understand the context and boundaries of the AI-generated information.
Acceptance Criteria:
- The natural language query feature UI always presents a disclaimer that cannot be easily missed by the user, stating the experimental nature, limitations, and confidentiality of the AI-generated responses.
US7. (Future Scope)
As a data.all developer and maintainer, I want the natural language query feature to provide feedback functionality so that users can easily indicate if the response was helpful or not, which can then be used to improve the quality of future responses.
Acceptance Criteria:
- The natural language query feature includes a thumbs up/down widget that users can click to provide feedback on a response; this feedback is captured and used to refine and improve the natural language query responses over time.
- Users receive a confirmation message after providing feedback, assuring them that their input will be used to enhance the feature.
Scope
Structured Data Query (SQL Generation & Execution)
- Data Selection: Select the table against which the natural language query will be executed.
- Natural Language to SQL Translation: Convert the natural language query into an SQL query.
- Query Execution: Run the SQL query on Amazon Athena to retrieve results.
- Result Presentation: Display the results to the user in an easily understandable format along with the generated SQL query (a sketch of this flow follows the example below).
- Example User Query: "Show me the total sales for the last quarter."
- Generated SQL executed:

```sql
SELECT SUM(sales) AS total_sales
FROM sales_data
WHERE quarter = 'last_quarter';
```
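A minimal sketch of the translate-execute-correct flow is shown below, assuming Amazon Bedrock (via the boto3 Converse API) for SQL generation and Amazon Athena for execution. The model ID, database name, and S3 output location are placeholders, and the retry loop is only one possible way to realize the correction feedback loop from US1, not a committed design.

```python
# Sketch of natural-language-to-SQL translation plus execution on Athena, with
# a simple correction feedback loop. Assumes Amazon Bedrock for generation;
# the model ID, database, and output location are placeholders.
import time

import boto3

bedrock = boto3.client("bedrock-runtime")
athena = boto3.client("athena")

MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"  # placeholder model


def generate_sql(question: str, table_schema: str, error: str | None = None) -> str:
    prompt = (
        f"Table schema:\n{table_schema}\n\n"
        f"Write a single Athena SQL query that answers: {question}\n"
        "Return only the SQL."
    )
    if error:
        prompt += f"\nThe previous query failed with: {error}. Return a corrected query."
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"].strip()


def run_athena(sql: str, database: str, output_location: str) -> dict:
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = execution["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if status["State"] != "SUCCEEDED":
        raise RuntimeError(status.get("StateChangeReason", "Athena query failed"))
    return athena.get_query_results(QueryExecutionId=query_id)


def answer_question(question: str, table_schema: str, database: str,
                    output_location: str, max_attempts: int = 3):
    """Correction feedback loop: regenerate the SQL whenever execution fails."""
    error = None
    for _ in range(max_attempts):
        sql = generate_sql(question, table_schema, error)
        try:
            return sql, run_athena(sql, database, output_location)
        except RuntimeError as exc:
            error = str(exc)
    raise RuntimeError(f"Could not produce a valid SQL query: {error}")
```

The returned SQL and result set would then be rendered in the UI to satisfy the result presentation requirement.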
Unstructured Data Query
- Data Selection: Select contents from a specified bucket (e.g., S3 bucket with PDF documents).
- Contextual Retrieval: Fetch relevant data chunks based on the user's query.
- Answer Generation: Generate a coherent response from the retrieved data chunks using an LLM (see the sketch after the example response below).
- Example User Query: "What are the key points in the latest project report relevant to key action for the dev team?"
- Example Generated Response (summarized key points extracted from the document stored in S3):

  "The latest project report highlighted several critical issues that require urgent attention from the dev team. First, the report identified a major performance bottleneck in the backend systems that is causing frequent crashes and slowdowns. Addressing this performance issue should be the top priority for the dev team. Secondly, the report noted several user-reported bugs in the mobile app that need to be fixed in the next release. Finally, the report recommended exploring new technologies and frameworks that could improve the overall system architecture and developer productivity."
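One possible implementation of the retrieval and answer-generation steps, sketched below, indexes the bucket's documents in an Amazon Bedrock Knowledge Base and calls the `retrieve_and_generate` API; the knowledge base ID and model ARN are placeholders, and a hand-rolled embedding-plus-LLM pipeline over the S3 contents would serve equally well.

```python
# Sketch of the unstructured query path using retrieval-augmented generation.
# Assumes the bucket's documents are indexed in an Amazon Bedrock Knowledge
# Base; the knowledge base ID and model ARN are placeholders.
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

KNOWLEDGE_BASE_ID = "EXAMPLEKBID"  # placeholder
MODEL_ARN = (
    "arn:aws:bedrock:us-east-1::foundation-model/"
    "anthropic.claude-3-sonnet-20240229-v1:0"
)


def query_unstructured(question: str) -> dict:
    response = agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": KNOWLEDGE_BASE_ID,
                "modelArn": MODEL_ARN,
            },
        },
    )
    # Citations point back to the retrieved document chunks in S3, which
    # supports the "references where appropriate" criterion in US2.
    references = [
        ref["location"]
        for citation in response.get("citations", [])
        for ref in citation.get("retrievedReferences", [])
    ]
    return {"answer": response["output"]["text"], "references": references}
```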
Out Of Scope
- Bring Your Own Model: The natural language querying feature will not support the ability for users to bring their own language models.
- Fine Tuning: This feature does not include fine-tuning an LLM to produce a customized model. Because data.all is deployed in customer environments, there is little data available on user-executed requests, and fine-tuning requires a significant amount of data to align a model to a particular domain or task.
- Chatbot style conversation: This feature currently supports single-question answering, where the user provides context and receives an answer. Each query execution is independent of previous query requests and depends only on the current context.
- Query Knowledge Management: Context provided by the user for one execution is not passed to other query executions. If the same context is needed again, it is the user's responsibility to provide it in each query execution request.
- Role Management: The natural language query feature will assume the role of a generic analyst persona and will not be customized for different user personas.
- Query response cache: Query responses will not be cached. The feature prioritizes delivery velocity over building and tuning a cache, accepting the trade-off of extra cost for repeated queries.
Guardrails
- Transparency and Disclosure: Positioning the feature as "experimental" and setting clear expectations with users about the limitations and potential uncertainties like hallucinations in the responses. This helps manage user expectations and encourages a "trust but verify" mindset.
- Truthfulness and Integrity: Aligning the GenAI models to "never lie" and avoid generating intentionally false or fabricated information. This is a crucial principle to establish credibility and build user trust.
- Continuous Monitoring and Refinement: Closely monitoring the performance and quality of the natural language querying feature, and using user feedback to iteratively improve the underlying GenAI models and response generation processes.
- Dynamic Threshold-based Safeguards: Dynamic thresholds and robust controls will be implemented to detect and flag potentially inappropriate, harmful, or biased responses from the GenAI model, especially for unstructured data querying.
- Clear and Informative Error Messages: The feature will provide clear and contextual guidance to users on how to reformulate their queries when issues are encountered, empowering them to refine their requests for better results.
- Cost: Usage will be restricted per user per day against a configurable metric (e.g., dollar amount spent or number of queries) to promote responsible use. The choice of model will be determined through a frugal evaluation of functionality and cost. Estimated usage costs will be published to allow customers to make informed decisions.