1 | | -# Document Question Answering System |
2 | | - |
3 | | -This system provides a document-based question answering capability for the Smalltalk application. It allows users to upload documents, which are processed, split into fragments, and converted into vector embeddings for semantic search. |
4 | | - |
5 | | -## Overview |
6 | | - |
7 | | -The document QA system integrates with the existing Smalltalk application and enables: |
8 | | - |
9 | | -1. Document upload and processing |
10 | | -2. Automatic splitting of documents into semantically meaningful fragments |
11 | | -3. Generation of vector embeddings for each fragment |
12 | | -4. Retrieval of relevant document fragments based on user queries |
13 | | -5. Enhanced responses that incorporate information from uploaded documents |
14 | | - |
15 | | -## Key Components |
16 | | - |
17 | | -### 1. DocumentFragment |
| 1 | +smalltalk |
| 2 | +== |
| 3 | +smalltalk is a tinystruct-based project that provides instant messaging. It allows users to send text and share images, documents, and other content.
| 4 | +It also lets you interact with ChatGPT, a language model developed by OpenAI, through a command-line interface (CLI) or a web interface.
| 5 | + |
| 6 | +[Star History Chart](https://star-history.com/#tinystruct/smalltalk&Date)
| 7 | + |
| 8 | +Installation |
| 9 | +--- |
| 10 | +1. Download the project from GitHub by clicking the "Clone or download" button, then selecting "Download ZIP". |
| 11 | +2. Extract the downloaded ZIP file to your local machine. |
| 12 | +3. If you use git, you can run the following command instead of the steps above:
| 13 | +```bash |
| 14 | +git clone https://github.com/tinystruct/smalltalk.git |
| 15 | +``` |
| 16 | +4. Follow this [tutorial](https://openjdk.org/install/) to install the Java Development Kit (JDK 11 or later) on your computer first. If you prefer to download and install it manually, see the [OpenJDK Archive](https://jdk.java.net/archive/). A Java development environment such as Eclipse or IntelliJ IDEA is helpful but not required.
| 17 | + |
| 18 | +If your current environment uses JDK 8, you can run the command below to upgrade it quickly.
| 19 | +``` |
| 20 | +bin/openjdk-upgrade |
| 21 | +``` |
| 22 | +5. Import the extracted / cloned project into your Java development environment. |
| 23 | +6. Open the `src/main/resources/application.properties` file and set `openai.api_key` to your own key, or set the environment variable `OPENAI_API_KEY` to your own key instead.
| 24 | +7. Finally, compile the project:
| 25 | +```tcsh |
| 26 | +./mvnw compile |
| 27 | +``` |
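Step 6 offers two places to put the key: the `openai.api_key` entry in `application.properties` or the `OPENAI_API_KEY` environment variable. The following is a minimal sketch of how such a lookup could work. It is not the project's actual code: the `ApiKeyResolver` class, the `resolve` method, and the choice to let the environment variable take precedence over the file are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;

public class ApiKeyResolver {
    // Hypothetical helper (not part of smalltalk or tinystruct).
    // Assumption: the environment variable wins over the properties entry.
    public static String resolve(Map<String, String> env, Properties props) {
        String fromEnv = env.get("OPENAI_API_KEY");
        if (fromEnv != null && !fromEnv.isBlank()) {
            return fromEnv.trim();
        }
        String fromFile = props.getProperty("openai.api_key");
        if (fromFile != null && !fromFile.isBlank()) {
            return fromFile.trim();
        }
        throw new IllegalStateException(
                "Set OPENAI_API_KEY or openai.api_key before starting the application");
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for src/main/resources/application.properties.
        Properties props = new Properties();
        props.load(new StringReader("openai.api_key=key-from-file\n"));

        // No environment variable: the properties entry is used.
        System.out.println(resolve(Map.of(), props)); // prints "key-from-file"

        // Environment variable present: it takes precedence in this sketch.
        System.out.println(resolve(Map.of("OPENAI_API_KEY", "key-from-env"), props)); // prints "key-from-env"
    }
}
```

In a real run, `System.getenv()` would be passed as the first argument. Taking the environment as a parameter keeps the sketch easy to test.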
18 | 28 |
|
19 | | -This class represents a piece of text from a document, with metadata such as: |
20 | | -- Document ID |
21 | | -- Content |
22 | | -- Fragment index |
23 | | -- File path |
24 | | -- MIME type |
25 | | -- Creation timestamp |
| 29 | +Usage |
| 30 | +--- |
| 31 | +You can run smalltalk in different ways: |
26 | 32 |
|
27 | | -### 2. DocumentEmbedding |
| 33 | +CLI mode |
| 34 | +1. Open a terminal and navigate to the project's root directory. |
| 35 | +2. To execute it in CLI mode, run the following command: |
| 36 | +```tcsh |
| 37 | +bin/dispatcher --version |
| 38 | +``` |
| 39 | +To see the available commands, run the following command: |
| 40 | +```tcsh |
| 41 | +bin/dispatcher --help |
| 42 | +``` |
| 43 | +To interact with ChatGPT, use the `chat` command. For example:
| 44 | +```tcsh |
| 45 | +bin/dispatcher chat |
| 46 | +``` |
| 47 | + |
28 | 48 |
|
29 | | -This class stores the vector representation of a document fragment: |
30 | | -- Fragment ID (references DocumentFragment) |
31 | | -- Embedding vector (stored as serialized byte array) |
32 | | -- Embedding dimension |
33 | | -- Creation timestamp |
| 49 | +Web mode |
34 | 50 |
|
35 | | -### 3. DocumentProcessor |
| 51 | +1. Run the project in a servlet container or in an HTTP server:
| 52 | +2. To run it in a servlet container, you need to compile the project first: |
36 | 53 |
|
37 | | -Handles the processing of uploaded documents: |
38 | | -- Extracts textual content using Apache Tika for rich document formats |
39 | | -- Splits content into appropriately sized fragments |
40 | | -- Creates DocumentFragment objects for each fragment |
41 | | -- Supports PDF, Word, Excel, PowerPoint, and other document formats through Tika |
| 54 | +then you can run it on a Tomcat server with the following command:
42 | 55 |
|
43 | | -### 4. EmbeddingManager |
| 56 | +```tcsh |
| 57 | +sudo bin/dispatcher start --import org.tinystruct.system.TomcatServer --server-port 777 |
| 58 | +``` |
| 59 | +or run it on a Netty HTTP server with the following command:
44 | 60 |
|
45 | | -Manages the generation and retrieval of embeddings: |
46 | | -- Generates embeddings using OpenAI's embedding API |
47 | | -- Caches query embeddings to reduce API calls |
48 | | -- Implements the cosine similarity function for semantic search |
49 | | -- Provides methods to find similar documents based on query |
| 61 | +```tcsh |
| 62 | +sudo bin/dispatcher start --import org.tinystruct.system.NettyHttpServer --server-port 777 |
| 63 | +``` |
| 64 | +3. To run it in a Docker container, you can use the command below: |
50 | 65 |
|
51 | | -### 5. DocumentQA |
| 66 | +```tcsh |
| 67 | +docker run -d -p 777:777 -e "OPENAI_API_KEY=[YOUR-OPENAI-API-KEY]" -e "STABILITY_API_KEY=[YOUR-STABILITY-API-KEY]" m0ver/smalltalk |
| 68 | +``` |
| 69 | +4. Access the application by navigating to http://localhost:777/?q=talk in your web browser.
| 70 | +5. To talk with ChatGPT, include @ChatGPT in the topic of the conversation when you set it up.
52 | 71 |
|
53 | | -Implements the question answering functionality: |
54 | | -- Finds relevant document fragments for a given query |
55 | | -- Formats document context for inclusion in AI responses |
56 | | -- Enhances queries with document context |
| 72 | + |
57 | 73 |
|
58 | 74 | ## Database Schema (SQLite) |
59 | 75 |
|
@@ -144,3 +160,36 @@ The system relies on the following key libraries: |
144 | 160 | - OpenAI API - For generating embeddings |
145 | 161 | - tinystruct - For application framework and database access |
146 | 162 |
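The embeddings generated through the OpenAI API are compared with cosine similarity to find the fragments most relevant to a query. Below is a minimal sketch of that comparison, assuming embeddings are available as `float[]` arrays of equal length. The `CosineSimilarity` class and the `cosine` method are illustrative names and are not taken from the project's `EmbeddingManager`.

```java
public class CosineSimilarity {
    // Illustrative helper: cosine of the angle between two vectors,
    // i.e. dot(a, b) / (|a| * |b|). Returns 0.0 if either vector is all zeros.
    public static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        float[] query = {1f, 0f};
        float[] sameDirection = {1f, 0f};
        float[] orthogonal = {0f, 1f};

        System.out.println(cosine(query, sameDirection)); // prints 1.0
        System.out.println(cosine(query, orthogonal));    // prints 0.0
    }
}
```

A retrieval step would compute this score between the query embedding and each stored fragment embedding, then keep the highest-scoring fragments.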
|
| 163 | +Demonstration |
| 164 | +--- |
| 165 | +A demonstration of the comet technology, which works without WebSocket and supports any web browser:
| 166 | + |
| 167 | +https://tinystruct.herokuapp.com/?q=talk |
| 168 | + |
| 169 | +Troubleshooting |
| 170 | +--- |
| 171 | +* If you encounter any problems during the installation or usage of the project, please check the project's documentation or build files for information about how to set up and run the project. |
| 172 | +* If you still have problems, please open an issue on GitHub or contact the project maintainers for help. |
| 173 | + |
| 174 | +Contribution |
| 175 | +--- |
| 176 | +We welcome contributions to the smalltalk project. If you are interested in contributing, please read the CONTRIBUTING.md file for more information about the project's development process and coding standards. |
| 177 | + |
| 178 | +Acknowledgements |
| 179 | +--- |
| 180 | +smalltalk uses the OpenAI API to interact with the ChatGPT language model. We would like to thank OpenAI for providing this powerful tool to the community. |
| 181 | + |
| 182 | +License |
| 183 | +--- |
| 184 | + |
| 185 | +Licensed under the Apache License, Version 2.0 (the "License"); |
| 186 | +you may not use this file except in compliance with the License. |
| 187 | +You may obtain a copy of the License at |
| 188 | + |
| 189 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 190 | + |
| 191 | +Unless required by applicable law or agreed to in writing, software |
| 192 | +distributed under the License is distributed on an "AS IS" BASIS, |
| 193 | +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 194 | +See the License for the specific language governing permissions and |
| 195 | +limitations under the License. |