|
| 1 | +# Document Question Answering System |
| 2 | + |
| 3 | +This system provides a document-based question answering capability for the Smalltalk application. It allows users to upload documents, which are processed, split into fragments, and converted into vector embeddings for semantic search. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The document QA system integrates with the existing Smalltalk application and enables: |
| 8 | + |
| 9 | +1. Document upload and processing |
| 10 | +2. Automatic splitting of documents into semantically meaningful fragments |
| 11 | +3. Generation of vector embeddings for each fragment |
| 12 | +4. Retrieval of relevant document fragments based on user queries |
| 13 | +5. Enhanced responses that incorporate information from uploaded documents |
| 14 | + |
| 15 | +## Key Components |
| 16 | + |
| 17 | +### 1. DocumentFragment |
| 18 | + |
| 19 | +This class represents a piece of text from a document, with metadata such as: |
| 20 | +- Document ID |
| 21 | +- Content |
| 22 | +- Fragment index |
| 23 | +- File path |
| 24 | +- MIME type |
| 25 | +- Creation timestamp |
| 26 | + |
| 27 | +### 2. DocumentEmbedding |
| 28 | + |
| 29 | +This class stores the vector representation of a document fragment: |
| 30 | +- Fragment ID (references DocumentFragment) |
| 31 | +- Embedding vector (stored as serialized byte array) |
| 32 | +- Embedding dimension |
| 33 | +- Creation timestamp |
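The README says only that the embedding vector is "stored as serialized byte array"; it does not specify the byte layout. The sketch below shows one plausible encoding for the `embedding` BLOB, assuming 32-bit floats in little-endian order. `EmbeddingCodec` and its methods are hypothetical names for illustration, not classes from the project.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: float32 little-endian layout is an assumption,
// not the project's confirmed serialization format.
final class EmbeddingCodec {

    // Pack each float into 4 bytes, in order.
    static byte[] toBytes(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES)
                                      .order(ByteOrder.LITTLE_ENDIAN);
        for (float value : vector) {
            buffer.putFloat(value);
        }
        return buffer.array();
    }

    // Inverse of toBytes: the dimension is recovered from the byte length.
    static float[] fromBytes(byte[] bytes) {
        if (bytes.length % Float.BYTES != 0) {
            throw new IllegalArgumentException("Byte length is not a multiple of 4");
        }
        ByteBuffer buffer = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        float[] vector = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = buffer.getFloat();
        }
        return vector;
    }
}
```

With a fixed-width layout like this, the stored embedding dimension can be cross-checked against `bytes.length / 4` when a row is read back.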
| 34 | + |
| 35 | +### 3. DocumentProcessor |
| 36 | + |
| 37 | +Handles the processing of uploaded documents: |
| 38 | +- Extracts textual content using Apache Tika for rich document formats |
| 39 | +- Splits content into appropriately sized fragments |
| 40 | +- Creates DocumentFragment objects for each fragment |
| 41 | +- Supports PDF, Word, Excel, PowerPoint, and other document formats through Tika |
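The README does not describe how fragment boundaries are chosen. As a rough illustration of the splitting step only, the sketch below cuts already-extracted text into fixed-size character windows with a small overlap. The class name `FragmentSplitter`, the parameters `maxChars` and `overlap`, and the windowing strategy itself are assumptions; the project's `DocumentProcessor` may split on sentences, paragraphs, or tokens instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical splitter: fixed-size character windows with overlap.
// This is a swapped-in strategy for illustration, not the project's algorithm.
final class FragmentSplitter {

    static List<String> split(String text, int maxChars, int overlap) {
        if (maxChars <= 0 || overlap < 0 || overlap >= maxChars) {
            throw new IllegalArgumentException("Require maxChars > 0 and 0 <= overlap < maxChars");
        }
        List<String> fragments = new ArrayList<>();
        int step = maxChars - overlap; // how far each window advances
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChars, text.length());
            fragments.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window already reached the end of the text
            }
        }
        return fragments;
    }
}
```

The position of each fragment in the returned list would correspond to the fragment index stored alongside it.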
| 42 | + |
| 43 | +### 4. EmbeddingManager |
| 44 | + |
| 45 | +Manages the generation and retrieval of embeddings: |
| 46 | +- Generates embeddings using OpenAI's embedding API |
| 47 | +- Caches query embeddings to reduce API calls |
| 48 | +- Implements the cosine similarity function for semantic search |
| 49 | +- Provides methods to find similar documents based on query |
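For reference, cosine similarity between two embedding vectors is their dot product divided by the product of their Euclidean norms. The sketch below is a minimal standalone version; the class name `CosineSimilarity` is hypothetical, and returning 0 for a zero-length vector is a design choice made here, not something the README specifies.

```java
// Hypothetical standalone helper, not the project's EmbeddingManager.
final class CosineSimilarity {

    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumption: treat a zero vector as dissimilar to everything
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Scores range from -1 (opposite direction) to 1 (same direction), so fragments can be ranked by descending score against the query embedding.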
| 50 | + |
| 51 | +### 5. DocumentQA |
| 52 | + |
| 53 | +Implements the question answering functionality: |
| 54 | +- Finds relevant document fragments for a given query |
| 55 | +- Formats document context for inclusion in AI responses |
| 56 | +- Enhances queries with document context |
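To make "formats document context" concrete, the sketch below assembles retrieved fragments and the user's question into a single prompt string. The prompt wording, the `[Fragment n]` labels, and the class name `ContextFormatter` are all assumptions for illustration; the README does not show the format that `DocumentQA` actually produces.

```java
import java.util.List;

// Hypothetical formatter: the prompt layout here is illustrative only.
final class ContextFormatter {

    static String format(List<String> fragments, String question) {
        StringBuilder prompt = new StringBuilder(
                "Use the following document excerpts to answer the question.\n\n");
        for (int i = 0; i < fragments.size(); i++) {
            // Number fragments from 1 so the answer can refer back to them.
            prompt.append("[Fragment ").append(i + 1).append("]\n")
                  .append(fragments.get(i)).append("\n\n");
        }
        prompt.append("Question: ").append(question);
        return prompt.toString();
    }
}
```

Labeling each fragment gives the model a handle for pointing at the excerpt it used.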
| 57 | + |
| 58 | +## Database Schema (SQLite) |
| 59 | + |
| 60 | +The system uses SQLite for storage with the following schema: |
| 61 | + |
| 62 | +```sql |
| 63 | +CREATE TABLE document_fragments ( |
| 64 | + id INTEGER PRIMARY KEY AUTOINCREMENT, |
| 65 | + document_id VARCHAR(36) NOT NULL, |
| 66 | + content TEXT NOT NULL, |
| 67 | + fragment_index INTEGER NOT NULL, |
| 68 | + file_path VARCHAR(255), |
| 69 | + mime_type VARCHAR(50), |
| 70 | + created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP |
| 71 | +); |
| 72 | + |
| 73 | +CREATE TABLE document_embeddings ( |
| 74 | + id INTEGER PRIMARY KEY AUTOINCREMENT, |
| 75 | + fragment_id VARCHAR(100) NOT NULL, |
| 76 | + embedding BLOB NOT NULL, |
| 77 | + embedding_dimension INTEGER NOT NULL, |
| 78 | + created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP |
| 79 | +); |
| 80 | + |
| 81 | +CREATE TABLE chat_history ( |
| 82 | + id INTEGER PRIMARY KEY AUTOINCREMENT, |
| 83 | + meeting_code VARCHAR(255) NOT NULL, |
| 84 | + user_name VARCHAR(255), |
| 85 | + message TEXT, |
| 86 | + session_id VARCHAR(255), |
| 87 | + message_type VARCHAR(50), |
| 88 | + image_url TEXT, |
| 89 | + created_at DATE DEFAULT CURRENT_DATE |
| 90 | +); |
| 91 | +``` |
1 | 92 |
|
2 | | -smalltalk |
3 | | -== |
4 | | -smalltalk is a tinystruct-based project that provides instant messaging functionality, It allows users to send text and share images, documents, and other content. |
5 | | -Also, It allows you to interact with ChatGPT which is a language model developed by OpenAI through a command-line interface (CLI) or a web interface. |
| 93 | +## Integration with Smalltalk |
6 | 94 |
|
7 | | -[](https://star-history.com/#tinystruct/smalltalk&Date) |
| 95 | +The document QA functionality is integrated into the Smalltalk application: |
| 96 | +- When a user uploads a document, it's automatically processed |
| 97 | +- When a user asks a question, the system searches for relevant document fragments |
| 98 | +- Retrieved fragments are included in the system prompt to provide context for the AI's response |
| 99 | +- The AI can reference the specific document when answering questions |
8 | 100 |
|
9 | | -Installation |
10 | | ---- |
11 | | -1. Download the project from GitHub by clicking the "Clone or download" button, then selecting "Download ZIP". |
12 | | -2. Extract the downloaded ZIP file to your local machine. |
13 | | -3. If you used to use git, then you should execute the following command to instead of above steps: |
14 | | -```bash |
15 | | -git clone https://github.com/tinystruct/smalltalk.git |
16 | | -``` |
17 | | -4. You will need to follow this [tutorial](https://openjdk.org/install/) to install the Java Development Kit (JDK 11+) on your computer first. If you choose to download and install it manually, please check it in this [OpenJDK Archive](https://jdk.java.net/archive/). And Java development environment such as Eclipse or IntelliJ IDEA is just better to have, not required. |
| 101 | +## Usage |
18 | 102 |
|
19 | | -If your current envirionment is using JDK 8, you can execute the below command to upgrade it quickly. |
20 | | -``` |
21 | | -bin/openjdk-upgrade |
22 | | -``` |
23 | | -5. Import the extracted / cloned project into your Java development environment. |
24 | | -6. Go to `src/main/resources/application.properties` file and update the `openai.api_key` with your own key or set the environment variable `OPENAI_API_KEY` with your own key. |
25 | | -7. Here is the last step for installation: |
26 | | -```tcsh |
27 | | -./mvnw compile |
28 | | -``` |
| 103 | +### Document Upload |
29 | 104 |
|
30 | | -Usage |
31 | | ---- |
32 | | -You can run smalltalk in different ways: |
| 105 | +Documents can be uploaded through the Smalltalk interface. Supported file types include: |
| 106 | +- Plain text files (text/plain) |
| 107 | +- Markdown files (text/markdown) |
| 108 | +- PDF documents (application/pdf) |
| 109 | +- Word documents (docx, doc) |
| 110 | +- Excel spreadsheets (xlsx, xls) |
| 111 | +- PowerPoint presentations (pptx, ppt) |
| 112 | +- Other text-based files (text/*) |
33 | 113 |
|
34 | | -CLI mode |
35 | | -1. Open a terminal and navigate to the project's root directory. |
36 | | -2. To execute it in CLI mode, run the following command: |
37 | | -```tcsh |
38 | | -bin/dispatcher --version |
39 | | -``` |
40 | | -To see the available commands, run the following command: |
41 | | -```tcsh |
42 | | -bin/dispatcher --help |
43 | | -``` |
44 | | -To interact with ChatGPT, use the chat command, for example: |
45 | | -```tcsh |
46 | | -bin/dispatcher chat |
47 | | -``` |
48 | | - |
| 114 | +### Asking Questions |
49 | 115 |
|
50 | | -Web mode |
| 116 | +Simply ask questions in the chat interface. The system will automatically: |
| 117 | +1. Convert your question to an embedding |
| 118 | +2. Find the most relevant document fragments |
| 119 | +3. Include those fragments in the context sent to the AI |
| 120 | +4. Return an answer that incorporates information from your documents |
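Step 2 above amounts to ranking fragments by their similarity to the question and keeping the best few. The sketch below shows that selection, assuming the similarity scores have already been computed and stored in an array aligned with the fragment list. `TopK` and the parameter `k` are hypothetical; the README does not state how many fragments the system retrieves.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: returns the indices of the k highest-scoring fragments.
final class TopK {

    static List<Integer> topK(double[] scores, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            indices.add(i);
        }
        // Sort indices by descending score; the sort is stable, so ties keep index order.
        indices.sort(Comparator.comparingDouble((Integer i) -> scores[i]).reversed());
        int limit = Math.min(Math.max(k, 0), indices.size());
        return new ArrayList<>(indices.subList(0, limit));
    }
}
```

Sorting every score is fine for a small number of fragments; a bounded priority queue would be the usual alternative for larger collections.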
51 | 121 |
|
52 | | -1. Run the project in a servlet container or in a HTTP server: |
53 | | -2. To run it in a servlet container, you need to compile the project first: |
| 122 | +## Testing |
54 | 123 |
|
55 | | -then you can run it on tomcat server by running the following command: |
56 | | - |
57 | | -```tcsh |
58 | | -sudo bin/dispatcher start --import org.tinystruct.system.TomcatServer --server-port 777 |
59 | | -``` |
60 | | -or run it on netty http server by running the following command: |
| 124 | +The system includes a test class and `main()` entry points to verify functionality: |
| 125 | +- DocumentEmbedding.main() - Tests embedding storage and retrieval |
| 126 | +- DocumentQATest - Tests the end-to-end document QA functionality |
| 127 | +- DocumentProcessor.main() - Tests document processing and fragmentation |
61 | 128 |
|
62 | | -```tcsh |
63 | | -sudo bin/dispatcher start --import org.tinystruct.system.NettyHttpServer --server-port 777 |
| 129 | +Run the tests on Windows using the provided batch file: |
64 | 130 | ``` |
65 | | -3. To run it in a Docker container, you can use the command below: |
66 | | - |
67 | | -```tcsh |
68 | | -docker run -d -p 777:777 -e "OPENAI_API_KEY=[YOUR-OPENAI-API-KEY]" -e "STABILITY_API_KEY=[YOUR-STABILITY-API-KEY]" m0ver/smalltalk |
| 131 | +runtest.bat |
69 | 132 | ``` |
70 | | -4. Access the application by navigating to http://localhost:777/?q=talk in your web browser |
71 | | -5. If you want to talk with ChatGPT, please type @ChatGPT in your topic of the conversation when you set up the topic. |
72 | | - |
73 | | - |
74 | | - |
75 | | -<img src="https://github.com/user-attachments/assets/32721443-b680-481b-b5ed-ae3c7e4c6908" width=500 /> |
76 | | - |
77 | | -Demonstration |
78 | | ---- |
79 | | -A demonstration for the comet technology, without any websocket and support any web browser: |
80 | | - |
81 | | -https://tinystruct.herokuapp.com/?q=talk |
82 | | - |
83 | | -Troubleshooting |
84 | | ---- |
85 | | -* If you encounter any problems during the installation or usage of the project, please check the project's documentation or build files for information about how to set up and run the project. |
86 | | -* If you still have problems, please open an issue on GitHub or contact the project maintainers for help. |
87 | | - |
88 | | -Contribution |
89 | | ---- |
90 | | -We welcome contributions to the smalltalk project. If you are interested in contributing, please read the CONTRIBUTING.md file for more information about the project's development process and coding standards. |
91 | | - |
92 | | -Acknowledgements |
93 | | ---- |
94 | | -smalltalk uses the OpenAI API to interact with the ChatGPT language model. We would like to thank OpenAI for providing this powerful tool to the community. |
95 | | - |
96 | | -License |
97 | | ---- |
98 | 133 |
|
99 | | -Licensed under the Apache License, Version 2.0 (the "License"); |
100 | | -you may not use this file except in compliance with the License. |
101 | | -You may obtain a copy of the License at |
| 134 | +This script will: |
| 135 | +1. Initialize the SQLite database if needed |
| 136 | +2. Run all test classes |
| 137 | +3. Test document processing for various file formats |
102 | 138 |
|
103 | | - http://www.apache.org/licenses/LICENSE-2.0 |
| 139 | +## Dependencies |
104 | 140 |
|
105 | | -Unless required by applicable law or agreed to in writing, software |
106 | | -distributed under the License is distributed on an "AS IS" BASIS, |
107 | | -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
108 | | -See the License for the specific language governing permissions and |
109 | | -limitations under the License. |
| 141 | +The system relies on the following key libraries: |
| 142 | +- SQLite - For database storage |
| 143 | +- Apache Tika (3.1.0) - For extracting text from various document formats |
| 144 | +- OpenAI API - For generating embeddings |
| 145 | +- tinystruct - For application framework and database access |
110 | 146 |
|