
Commit 22bfb76

Finalize Snowflake branch with the updated code
1 parent d28d42e commit 22bfb76

File tree

102 files changed: +2103 -7066 lines changed


README.md

Lines changed: 96 additions & 118 deletions
@@ -1,4 +1,3 @@
-# Snowflake Branch: Hyperledger Labs AIFAQ prototype

![Hyperledger Labs](https://img.shields.io/badge/Hyperledger-Labs-blue?logo=hyperledger)
![Apache License 2.0](https://img.shields.io/badge/license-Apache%202.0-green.svg)
@@ -7,140 +6,118 @@

[![GitHub Stars](https://img.shields.io/github/stars/hyperledger-labs/aifaq?style=social)](https://github.com/hyperledger-labs/aifaq/stargazers)
[![GitHub Forks](https://img.shields.io/github/forks/hyperledger-labs/aifaq?style=social)](https://github.com/hyperledger-labs/aifaq/network/members)
-[![Language Stats](https://img.shields.io/github/languages/top/hyperledger-labs/aifaq)](https://github.com/hyperledger-labs/aifaq)
[![Issues](https://img.shields.io/github/issues/hyperledger-labs/aifaq)](https://github.com/hyperledger-labs/aifaq/issues)
[![Pull Requests](https://img.shields.io/github/issues-pr/hyperledger-labs/aifaq)](https://github.com/hyperledger-labs/aifaq/pulls)

-![Language Stats](https://img.shields.io/github/languages/count/hyperledger-labs/aifaq)
-![Python](https://img.shields.io/badge/Python-85%25-blue?logo=python)
-![HTML](https://img.shields.io/badge/HTML-10%25-orange?logo=html5)
-![Other](https://img.shields.io/badge/Others-5%25-lightgrey?logo=github)

----
-## 🚀 Overview
+# Hyperledger Labs AIFAQ Prototype in Snowflake
+An Open-Source Conversational AI and Intelligence App Built on the Snowflake Cloud Environment
+
+## Overview
+
+The **Hyperledger Labs AIFAQ Prototype** is an open-source conversational intelligence system designed to deliver accurate, context-aware answers from enterprise documentation, technical references, and organizational knowledge bases. It integrates the governance strengths of Hyperledger with the scalability of **Snowflake** and the flexibility of **open-source LLMs** to create a secure, multi-tenant, production-grade enterprise knowledge assistant.
+
+The prototype demonstrates a complete pipeline for ingesting, embedding, storing, and querying documents using Snowflake's native capabilities and external AI inference. It supports open models such as **Llama**, **Mistral**, and **Snowflake Arctic**, offering a modular architecture suitable for production-grade deployments.
+
+## Features
+
+- **Multi-User Authentication**
+  Secure login and strict data isolation across document sets and chat histories.
+
+- **Hybrid LLM Support**
+  Route queries to Snowflake Cortex or external open-source LLMs through secure external functions (see the sketch after this list).
+
+- **Multi-Document Knowledge Retrieval**
+  Supports structured and unstructured data.
+
+- **Persistent Chat Sessions**
+  Full session history stored in Snowflake with easy retrieval.
+
+- **Streamlit Frontend**
+  Intuitive UI for uploading documents, interacting with the assistant, and browsing past conversations.
+
+- **Snowflake Vector Search**
+  High-performance similarity search using Cortex Vector Search and SQL APIs inside the Snowflake cloud environment.
+
+- **Automated Pipelines**
+  Re-embedding and re-indexing triggered by Snowflake Streams and Tasks when documents update.

-The **Hyperledger Labs AIFAQ Prototype** is an open-source conversational AI tool designed to answer questions from technical documentation, FAQs, and internal knowledge bases with high accuracy and context awareness. This implementation of AIFAQ integrates deeply with **Snowflake**, providing secure multi-user support, persistent chat history, and access to powerful LLMs like OpenAI, Anthropic, and Snowflake Cortex.
+- **Enterprise Governance**
+  RBAC, row-level security, and masking policies ensure protected data access.
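The hybrid LLM support above routes between in-platform and external models. A minimal sketch of the two call paths in Snowflake SQL, assuming the Cortex `mistral-large` model and an external function such as the `openai_complete` from the earlier `setup.sql` example (both names are illustrative; routing itself would normally happen in the application layer):

```
-- In-platform inference through Snowflake Cortex (model name is an assumption):
SELECT SNOWFLAKE.CORTEX.COMPLETE(
         'mistral-large',
         'Summarize the Hyperledger Labs AIFAQ project.') AS cortex_answer;

-- External model via a secure external function; openai_complete is illustrative
-- and must first be created with an API integration.
SELECT openai_complete('Summarize the Hyperledger Labs AIFAQ project.') AS external_answer;
```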

-👉 Official Wiki Pages:

-- [Hyperledger Labs Wiki](https://lf-hyperledger.atlassian.net/wiki/spaces/labs/pages/20290949/AI+FAQ+2025)
+## Architecture

-👉 Weekly Community Calls:
+### 1. Ingestion Layer
+- Accepts structured and unstructured formats including PDFs, HTML, plain text, and transcripts.
+- Uses Snowflake external tables, stages, Snowpipe, or cloud functions to store and extract metadata.
+- All raw inputs move through well-defined staging schemas.
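A minimal sketch of the staging step this layer describes, assuming an internal stage named `raw_docs` and a landing table `raw_documents` (both illustrative, not the project's actual object names):

```
-- Illustrative staging objects; names are assumptions.
CREATE STAGE IF NOT EXISTS raw_docs
  DIRECTORY = (ENABLE = TRUE);   -- directory table exposes file-level metadata

-- Files are uploaded from a client, e.g.:
--   PUT file://./docs/faq.pdf @raw_docs AUTO_COMPRESS = FALSE;
ALTER STAGE raw_docs REFRESH;    -- refresh the directory table after uploads

-- Capture file metadata into a landing table in the staging schema.
CREATE TABLE IF NOT EXISTS raw_documents (
  file_name     STRING,
  file_size     NUMBER,
  last_modified TIMESTAMP_LTZ
);

INSERT INTO raw_documents
SELECT relative_path, size, last_modified
FROM DIRECTORY(@raw_docs);
```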

-- Every Monday (public) — join via [Hyperledger Labs Calendar](https://wiki.hyperledger.org/display/HYP/Calendar+of+Public+Meetings).
+### 2. Preprocessing & Embedding
+- Snowpark UDFs handle chunking, cleaning, and tokenization.
+- Embeddings generated using Cortex or external open-source models.
+- Metadata and embedding vectors stored inside Snowflake as the unified knowledge base.
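A minimal sketch of the Cortex embedding path, assuming a `doc_chunks` table and the `snowflake-arctic-embed-m` model (chunking itself would live in a Snowpark UDF; names and dimensions are assumptions):

```
-- Illustrative chunk store with a 768-dimension vector column.
CREATE TABLE IF NOT EXISTS doc_chunks (
  doc_id     STRING,
  chunk_id   STRING,
  chunk_text STRING,
  embedding  VECTOR(FLOAT, 768)
);

-- Embed any chunks that do not have a vector yet.
UPDATE doc_chunks
SET embedding = SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', chunk_text)
WHERE embedding IS NULL;
```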
+
+### 3. Access Control & Security
+- Document and chat isolation enforced via Snowflake roles.
+- Row-level security restricts user visibility to their own data.
+- Sensitive fields are masked using policy-based governance.
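A minimal sketch of what row-level security and masking can look like on a per-user table, assuming a `documents` table with `user_id` and `chunk_text` columns and a `CHATBOT_ADMIN` role (all illustrative):

```
-- Row access policy: each user sees only rows they own (illustrative).
CREATE OR REPLACE ROW ACCESS POLICY user_rows AS (owner STRING)
  RETURNS BOOLEAN -> owner = CURRENT_USER();

ALTER TABLE documents ADD ROW ACCESS POLICY user_rows ON (user_id);

-- Masking policy: hide document text from roles without explicit access (illustrative).
CREATE OR REPLACE MASKING POLICY mask_chunks AS (val STRING)
  RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('CHATBOT_ADMIN') THEN val ELSE '***MASKED***' END;

ALTER TABLE documents MODIFY COLUMN chunk_text SET MASKING POLICY mask_chunks;
```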
+
+### 4. Retrieval-Augmented Generation (RAG)
+- User query → vector search → relevant context retrieval → model response.
+- Hybrid routing selects the most suitable LLM based on user preference and query context.
+- Ensures responses are grounded in user-provided documentation.
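A minimal sketch of that flow in SQL, reusing the illustrative `doc_chunks` table above; the model names, top-k value, and prompt wiring are assumptions rather than the project's actual pipeline:

```
-- Embed the question, rank chunks by cosine similarity, and ground the answer in the top hits.
SET question = 'How does AIFAQ keep each user''s documents isolated?';

WITH scored AS (
  SELECT chunk_text,
         VECTOR_COSINE_SIMILARITY(
           embedding,
           SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', $question)) AS score
  FROM doc_chunks
)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
         'mistral-large',
         'Answer using only this context:\n' || LISTAGG(chunk_text, '\n') ||
         '\n\nQuestion: ' || $question) AS answer
FROM (SELECT chunk_text FROM scored ORDER BY score DESC LIMIT 4);
```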
+
+### 5. Automation & Observability
+- Snowflake Streams detect document changes.
+- Tasks automate reprocessing and embedding updates.
+- Monitoring through Snoopy and event notifications for operational visibility.
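A minimal sketch of the stream-plus-task pattern, assuming an upstream `raw_chunks` table feeding the illustrative `doc_chunks` table and the `compute_wh` warehouse from the earlier setup script (all names and the schedule are assumptions):

```
-- Stream captures new or changed rows in the upstream chunk table.
CREATE OR REPLACE STREAM raw_chunks_stream ON TABLE raw_chunks;

-- Task runs only when the stream has data and writes fresh embeddings downstream.
CREATE OR REPLACE TASK reembed_changed_chunks
  WAREHOUSE = compute_wh
  SCHEDULE  = '15 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('RAW_CHUNKS_STREAM')
AS
  INSERT INTO doc_chunks (doc_id, chunk_id, chunk_text, embedding)
  SELECT doc_id, chunk_id, chunk_text,
         SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', chunk_text)
  FROM raw_chunks_stream;

ALTER TASK reembed_changed_chunks RESUME;
```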

---
-## Features

-- User Authentication: Secure, multi-user access with isolated document and chat histories
-- LLM Integration: Seamless access to Cortex, OpenAI, and Anthropic models via Snowflake external functions
-- Multi-Document Support: Upload and query multiple documents per session
-- Persistent Chat History: Retrieve and continue conversations across sessions
-- Streamlit UI: Intuitive document upload and chat interface
+## Getting Started
+
+1. Choose the appropriate implementation folder:
+   - **version_1** for stable production deployment
+   - **version_2** for advanced workflows with multi-cloud data ingestion
+
+2. Follow the README inside the selected folder to set up:
+   - Warehouses
+   - Stages and schemas
+   - Pipelines
+   - External LLM functions
+   - Streamlit deployment
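For orientation only, a minimal sketch of the kind of objects those READMEs walk through, reusing names from the earlier `setup.sql` example; the folder-specific scripts remain the authoritative source:

```
-- Illustrative bootstrap; the version_1 / version_2 scripts supersede this.
CREATE WAREHOUSE IF NOT EXISTS compute_wh
  WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60;
CREATE DATABASE IF NOT EXISTS llm_chatbot;
CREATE SCHEMA   IF NOT EXISTS llm_chatbot.chatbot;
CREATE STAGE    IF NOT EXISTS llm_chatbot.chatbot.raw_docs;
```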
+---
+
+## Folder Descriptions
+
+### `version_1/`
+A simplified demonstration build intended for quick Snowflake evaluation and hands-on exploration for beginners.
+
+Includes:
+- Lightweight ingestion + embedding flow
+- Basic Streamlit UI
+- Environment dependencies
+- Minimal Snowflake setup scripts
+- Ingestion and RAG flows

---
-## 🛠️ Architecture

-![Snowflake integration architecture](./images/snowflake-architecture.png)
+### `version_2/`
+A more advanced, optimized version improving modularity and performance.

-1. Flexible Document Ingestion: AIFAQ supports various source formats (PDFs, HTML, YouTube transcripts, etc.) ingested into Snowflake via external tables, raw storage, and pipelines using tools like Snowpipe and Lambda-based metadata extractors.
-2. Preprocessing & Embedding: Documents are chunked using Snowpark UDFs and embedded using LLM-based models. Embedding vectors are stored in Snowflake, forming the searchable knowledge base alongside metadata.
-3. Access Control & Governance: Fine-grained access is enforced through Snowflake's role-based permissions, row-level security, and data masking policies to protect sensitive content.
-4. LLM Query Augmentation & Retrieval: User queries are augmented with context by retrieving relevant chunks from the vector database (via Cortex Vector Search or SQL API), then sent to external LLMs (OpenAI, Anthropic) for response generation.
-5. Automation & Monitoring: Updates to documents automatically re-trigger embedding pipelines using Snowflake Streams and Tasks, while monitoring tools like Snoopy and event notifications ensure system observability and orchestration.
+Includes:
+- Refined RAG pipeline (improved data ingestion pipeline)
+- Cortex Vector Search & utilities
+- Advanced Role-Based Access Control (RBAC)
+- Enhanced logging/observability
+- Stronger multi-tenant isolation
+- Updated Streamlit interface
+- Multi-cloud ingestion

---
-## 📝 Setup Instructions (Snowflake Branch)
-Follow these steps to configure your Snowflake environment using the provided `setup.sql` script.
-
-1. Set up a role for the chatbot and grant access to required resources:
-
-```
-CREATE OR REPLACE ROLE chatbot_user;
-
-GRANT USAGE ON WAREHOUSE compute_wh TO ROLE chatbot_user;
-GRANT USAGE ON DATABASE llm_chatbot TO ROLE chatbot_user;
-
-```
-2. Initialize the database and schema for storing documents and chat data:
-
-```
-CREATE OR REPLACE DATABASE llm_chatbot;
-CREATE OR REPLACE SCHEMA chatbot;
-USE SCHEMA llm_chatbot.chatbot;
-
-```
-3. Create two core tables, one for document chunks and another for chat history:
-
-```
-CREATE OR REPLACE TABLE documents (
-  user_id STRING,
-  doc_id STRING,
-  doc_name STRING,
-  chunk_id STRING,
-  chunk_text STRING,
-  embedding VECTOR(FLOAT, 1536)
-);
-
-CREATE OR REPLACE TABLE chat_history (
-  user_id STRING,
-  session_id STRING,
-  doc_id STRING,
-  turn INT,
-  user_input STRING,
-  bot_response STRING,
-  timestamp TIMESTAMP
-);
-```
-4. External Function – OpenAI: Create an external function to call OpenAI's API:
-
-```
-CREATE OR REPLACE EXTERNAL FUNCTION openai_complete(prompt STRING)
-RETURNS STRING
-API_INTEGRATION = my_api_integration
-HEADERS = (
-  "Authorization" = 'Bearer <OPENAI_API_KEY>',
-  "Content-Type" = 'application/json'
-)
-URL = 'https://api.openai.com/v1/completions'
-POST_BODY = '{
-  "model": "gpt-3.5-turbo-instruct",
-  "prompt": "' || prompt || '",
-  "max_tokens": 200
-}';
-
-```
-> Replace <OPENAI_API_KEY> with your actual OpenAI API key.
-
-5. External Function – Anthropic: Similarly, set up a function to call Anthropic's Claude model:
-
-```
-CREATE OR REPLACE EXTERNAL FUNCTION anthropic_complete(prompt STRING)
-RETURNS STRING
-API_INTEGRATION = my_api_integration
-HEADERS = (
-  "x-api-key" = '<ANTHROPIC_API_KEY>',
-  "Content-Type" = 'application/json'
-)
-URL = 'https://api.anthropic.com/v1/complete'
-POST_BODY = '{
-  "model": "claude-3-opus-20240229",
-  "prompt": "Human: ' || prompt || '\nAssistant:",
-  "max_tokens": 200
-}';
-
-```
-> Replace <ANTHROPIC_API_KEY> with your actual key.
-
-6. Deploy the chatbot interface using the Streamlit app stored in your project:
-
-```
-CREATE OR REPLACE STREAMLIT chatbot_ui
-  FROM '/chatbot_app'
-  MAIN_FILE = '/app.py';
-```
-
----
+
+

## 🌐 Open Source License

@@ -159,8 +136,9 @@ We welcome contributions! Please check our [CONTRIBUTING](./docs/CONTRIBUTING.md
Join our weekly public calls every Monday! See the [Hyperledger Labs Calendar](https://wiki.hyperledger.org/display/HYP/Calendar+of+Public+Meetings) for details.


-## 📢 Stay Connected
+## Stay Connected

- [Slack Discussions](https://join.slack.com/t/aifaqworkspace/shared_invite/zt-337k74jsl-tvH_4ct3zLj99dvZaf9nZw)
- [Hyperledger Labs Community](https://lf-hyperledger.atlassian.net/wiki/spaces/labs/pages/20290949/AI+FAQ+2025)
- Official Website: [aifaq.pro](https://aifaq.pro)
+- Official Wiki Pages: [Hyperledger Labs Wiki](https://lf-hyperledger.atlassian.net/wiki/spaces/labs/pages/20290949/AI+FAQ+2025)

images/activate_gpu.png

-21.9 KB
Binary file not shown.

images/compress_files.png

-38.5 KB
Binary file not shown.

images/copy_paste_code.png

-9.22 KB
Binary file not shown.

images/curl_results.png

-157 KB
Binary file not shown.

images/move_command.png

-7.05 KB
Binary file not shown.

images/new_studio.png

-5.02 KB
Binary file not shown.

images/open_terminal.png

-3.72 KB
Binary file not shown.
-546 KB
Binary file not shown.

images/remove_command.png

-1.93 KB
Binary file not shown.

0 commit comments