Commit ee497f4

Resolved code conflicts.
2 parents 3a381d9 + 771b9e5 commit ee497f4

49 files changed: +12939 -1505 lines changed

.idea/libraries/ja_netfilter.xml

Lines changed: 0 additions & 9 deletions
This file was deleted.

.vscode/launch.json

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+{
+    // Use IntelliSense to learn about possible attributes.
+    // Hover to view descriptions of existing attributes.
+    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
+    "version": "0.2.0",
+    "configurations": [
+        {
+            "type": "java",
+            "name": "Dispatcher",
+            "request": "launch",
+            "mainClass": "org.tinystruct.system.Dispatcher",
+            "projectName": "smalltalk",
+            "args": ["start", "--import", "org.tinystruct.system.NettyHttpServer","--server-port","777"]
+        }
+    ]
+}

.vscode/settings.json

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+{
+    "java.configuration.updateBuildConfiguration": "interactive",
+    "java.compile.nullAnalysis.mode": "disabled",
+    "java.debug.settings.onBuildFailureProceed": true
+}

README.md

Lines changed: 130 additions & 94 deletions
@@ -1,110 +1,146 @@
+# Document Question Answering System
+
+This system provides a document-based question answering capability for the Smalltalk application. It allows users to upload documents, which are processed, split into fragments, and converted into vector embeddings for semantic search.
+
+## Overview
+
+The document QA system integrates with the existing Smalltalk application and enables:
+
+1. Document upload and processing
+2. Automatic splitting of documents into semantically meaningful fragments
+3. Generation of vector embeddings for each fragment
+4. Retrieval of relevant document fragments based on user queries
+5. Enhanced responses that incorporate information from uploaded documents
+
+## Key Components
+
+### 1. DocumentFragment
+
+This class represents a piece of text from a document, with metadata such as:
+- Document ID
+- Content
+- Fragment index
+- File path
+- MIME type
+- Creation timestamp
+
+### 2. DocumentEmbedding
+
+This class stores the vector representation of a document fragment:
+- Fragment ID (references DocumentFragment)
+- Embedding vector (stored as serialized byte array)
+- Embedding dimension
+- Creation timestamp
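
Editor's sketch (not part of this commit): the README says the embedding vector is stored as a serialized byte array but does not show the encoding. One plausible approach, assuming 32-bit floats packed back to back in `ByteBuffer`'s default big-endian order, is shown below. The class name `EmbeddingCodec` and its methods are hypothetical and are not taken from the repository.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper: packs a float vector into a byte array suitable for a BLOB column.
final class EmbeddingCodec {
    /** Encodes each component as 4 bytes (ByteBuffer's default big-endian order). */
    static byte[] toBytes(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES);
        for (float component : vector) {
            buffer.putFloat(component);
        }
        return buffer.array();
    }

    /** Decodes a byte array produced by toBytes back into a float vector. */
    static float[] fromBytes(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        float[] vector = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = buffer.getFloat();
        }
        return vector;
    }

    public static void main(String[] args) {
        float[] original = {0.25f, -1.5f, 3.0f};
        byte[] blob = toBytes(original);
        float[] restored = fromBytes(blob);
        System.out.println(blob.length + " bytes, round trip ok: " + Arrays.equals(original, restored));
    }
}
```

Under this encoding the stored dimension equals the BLOB length divided by four, which would make the separate `embedding_dimension` column a cheap consistency check.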
+
+### 3. DocumentProcessor
+
+Handles the processing of uploaded documents:
+- Extracts textual content using Apache Tika for rich document formats
+- Splits content into appropriately sized fragments
+- Creates DocumentFragment objects for each fragment
+- Supports PDF, Word, Excel, PowerPoint, and other document formats through Tika
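
Editor's sketch (not part of this commit): the README does not say how content is split into "appropriately sized fragments". A minimal version, assuming a character budget per fragment and a preference for paragraph boundaries, could look like this. `FragmentSplitter`, `split`, and `maxChars` are hypothetical names; the actual `DocumentProcessor` may use a different strategy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical splitter: packs whole paragraphs into fragments of at most maxChars characters.
final class FragmentSplitter {
    static List<String> split(String text, int maxChars) {
        List<String> fragments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Paragraphs are separated by one or more blank lines.
        for (String paragraph : text.split("\\R\\s*\\R")) {
            String p = paragraph.trim();
            if (p.isEmpty()) {
                continue;
            }
            // Flush the current fragment if this paragraph would overflow it.
            if (current.length() > 0 && current.length() + 1 + p.length() > maxChars) {
                fragments.add(current.toString());
                current.setLength(0);
            }
            // A paragraph larger than the budget is cut into fixed-size pieces.
            while (p.length() > maxChars) {
                fragments.add(p.substring(0, maxChars));
                p = p.substring(maxChars);
            }
            if (current.length() > 0) {
                current.append('\n');
            }
            current.append(p);
        }
        if (current.length() > 0) {
            fragments.add(current.toString());
        }
        return fragments;
    }

    public static void main(String[] args) {
        String text = "Alpha one.\n\nBeta two.\n\nGamma three.";
        for (String fragment : split(text, 22)) {
            System.out.println("[" + fragment.replace("\n", " / ") + "]");
        }
    }
}
```

Splitting on paragraph boundaries keeps related sentences in the same fragment, which tends to help retrieval. The hard cut for oversized paragraphs is a simplification that a sentence-aware splitter would avoid.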
+
+### 4. EmbeddingManager
+
+Manages the generation and retrieval of embeddings:
+- Generates embeddings using OpenAI's embedding API
+- Caches query embeddings to reduce API calls
+- Implements the cosine similarity function for semantic search
+- Provides methods to find similar documents based on query
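
Editor's sketch (not part of this commit): cosine similarity between a query embedding and each fragment embedding is the dot product divided by the product of the vector norms. The fragments with the highest scores are then returned. The class `SimilaritySearch` and its method names below are hypothetical, and the brute-force ranking is an assumption, not the repository's `EmbeddingManager` API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical brute-force semantic search over in-memory embeddings.
final class SimilaritySearch {
    /** Cosine similarity: dot(a, b) / (|a| * |b|); returns 0 for a zero vector. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the indices of the k fragments most similar to the query, best first. */
    static List<Integer> topK(float[] query, List<float[]> fragments, int k) {
        double[] scores = new double[fragments.size()];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            scores[i] = cosine(query, fragments.get(i));
            order.add(i);
        }
        // Sort indices by descending score.
        order.sort((x, y) -> Double.compare(scores[y], scores[x]));
        return new ArrayList<>(order.subList(0, Math.min(k, order.size())));
    }

    public static void main(String[] args) {
        List<float[]> fragments = new ArrayList<>();
        fragments.add(new float[] {0f, 1f});
        fragments.add(new float[] {1f, 0f});
        fragments.add(new float[] {1f, 1f});
        System.out.println(topK(new float[] {1f, 0f}, fragments, 2));
    }
}
```

A linear scan like this is adequate for a few thousand fragments held in SQLite. Larger collections would usually call for an approximate nearest-neighbor index instead.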
+
+### 5. DocumentQA
+
+Implements the question answering functionality:
+- Finds relevant document fragments for a given query
+- Formats document context for inclusion in AI responses
+- Enhances queries with document context
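
Editor's sketch (not part of this commit): "formatting document context" most likely means concatenating the retrieved fragments into the prompt sent to the model. The version below assumes numbered excerpts appended to a base system prompt. `ContextFormatter`, `buildSystemPrompt`, and the instruction wording are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical formatter: appends retrieved fragments to a base system prompt.
final class ContextFormatter {
    static String buildSystemPrompt(String basePrompt, List<String> fragments) {
        if (fragments.isEmpty()) {
            return basePrompt; // nothing retrieved, leave the prompt unchanged
        }
        StringBuilder prompt = new StringBuilder(basePrompt);
        prompt.append("\n\nUse the following document excerpts when they are relevant:\n");
        for (int i = 0; i < fragments.size(); i++) {
            // Number each excerpt so the model can refer to it in its answer.
            prompt.append('[').append(i + 1).append("] ").append(fragments.get(i)).append('\n');
        }
        return prompt.toString();
    }

    public static void main(String[] args) {
        List<String> retrieved = Arrays.asList("Invoices are due in 30 days.", "Late fees are 2% per month.");
        System.out.println(buildSystemPrompt("You are a helpful assistant.", retrieved));
    }
}
```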
+
+## Database Schema (SQLite)
+
+The system uses SQLite for storage with the following schema:
+
+```sql
+CREATE TABLE document_fragments (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    document_id VARCHAR(36) NOT NULL,
+    content TEXT NOT NULL,
+    fragment_index INTEGER NOT NULL,
+    file_path VARCHAR(255),
+    mime_type VARCHAR(50),
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+
+CREATE TABLE document_embeddings (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    fragment_id VARCHAR(100) NOT NULL,
+    embedding BLOB NOT NULL,
+    embedding_dimension INTEGER NOT NULL,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+
+CREATE TABLE chat_history (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    meeting_code VARCHAR(255) NOT NULL,
+    user_name VARCHAR(255),
+    message TEXT,
+    session_id VARCHAR(255),
+    message_type VARCHAR(50),
+    image_url TEXT,
+    created_at DATE DEFAULT CURRENT_DATE
+);
+```

-smalltalk
-==
-smalltalk is a tinystruct-based project that provides instant messaging functionality, It allows users to send text and share images, documents, and other content.
-Also, It allows you to interact with ChatGPT which is a language model developed by OpenAI through a command-line interface (CLI) or a web interface.
+## Integration with Smalltalk

-[![Star History Chart](https://api.star-history.com/svg?repos=tinystruct/smalltalk&type=Date)](https://star-history.com/#tinystruct/smalltalk&Date)
+The document QA functionality is integrated in the Smalltalk application:
+- When a user uploads a document, it's automatically processed
+- When a user asks a question, the system searches for relevant document fragments
+- Retrieved fragments are included in the system prompt to provide context for the AI's response
+- The AI can reference the specific document when answering questions

-Installation
----
-1. Download the project from GitHub by clicking the "Clone or download" button, then selecting "Download ZIP".
-2. Extract the downloaded ZIP file to your local machine.
-3. If you used to use git, then you should execute the following command to instead of above steps:
-```bash
-git clone https://github.com/tinystruct/smalltalk.git
-```
-4. You will need to follow this [tutorial](https://openjdk.org/install/) to install the Java Development Kit (JDK 11+) on your computer first. If you choose to download and install it manually, please check it in this [OpenJDK Archive](https://jdk.java.net/archive/). And Java development environment such as Eclipse or IntelliJ IDEA is just better to have, not required.
+## Usage

-If your current envirionment is using JDK 8, you can execute the below command to upgrade it quickly.
-```
-bin/openjdk-upgrade
-```
-5. Import the extracted / cloned project into your Java development environment.
-6. Go to `src/main/resources/application.properties` file and update the `openai.api_key` with your own key or set the environment variable `OPENAI_API_KEY` with your own key.
-7. Here is the last step for installation:
-```tcsh
-./mvnw compile
-```
+### Document Upload

-Usage
----
-You can run smalltalk in different ways:
+Documents can be uploaded through the Smalltalk interface. Supported file types include:
+- Plain text files (text/plain)
+- Markdown files (text/markdown)
+- PDF documents (application/pdf)
+- Word documents (docx, doc)
+- Excel spreadsheets (xlsx, xls)
+- PowerPoint presentations (pptx, ppt)
+- Other text-based files (text/*)

-CLI mode
-1. Open a terminal and navigate to the project's root directory.
-2. To execute it in CLI mode, run the following command:
-```tcsh
-bin/dispatcher --version
-```
-To see the available commands, run the following command:
-```tcsh
-bin/dispatcher --help
-```
-To interact with ChatGPT, use the chat command, for example:
-```tcsh
-bin/dispatcher chat
-```
-![CLI](https://github.com/tinystruct/smalltalk/assets/3631818/b49bab05-0135-4383-b252-0ca9c011f6e8)
+### Asking Questions

-Web mode
+Simply ask questions in the chat interface. The system will automatically:
+1. Convert your question to an embedding
+2. Find the most relevant document fragments
+3. Include those fragments in the context sent to the AI
+4. Return an answer that incorporates information from your documents

-1. Run the project in a servlet container or in a HTTP server:
-2. To run it in a servlet container, you need to compile the project first:
+## Testing

-then you can run it on tomcat server by running the following command:
-
-```tcsh
-sudo bin/dispatcher start --import org.tinystruct.system.TomcatServer --server-port 777
-```
-or run it on netty http server by running the following command:
+The system includes several test classes to verify functionality:
+- DocumentEmbedding.main() - Tests embedding storage and retrieval
+- DocumentQATest - Tests the end-to-end document QA functionality
+- DocumentProcessor.main() - Tests document processing and fragmentation

-```tcsh
-sudo bin/dispatcher start --import org.tinystruct.system.NettyHttpServer --server-port 777
+Run the tests using the provided batch file:
 ```
-3. To run it in a Docker container, you can use the command below:
-
-```tcsh
-docker run -d -p 777:777 -e "OPENAI_API_KEY=[YOUR-OPENAI-API-KEY]" -e "STABILITY_API_KEY=[YOUR-STABILITY-API-KEY]" m0ver/smalltalk
+runtest.bat
 ```
-4. Access the application by navigating to http://localhost:777/?q=talk in your web browser
-5. If you want to talk with ChatGPT, please type @ChatGPT in your topic of the conversation when you set up the topic.
-
-![Web](https://github.com/tinystruct/smalltalk/assets/3631818/32e50145-a5be-41d6-9cea-5b25e76e9f1b)
-
-<img src="https://github.com/user-attachments/assets/32721443-b680-481b-b5ed-ae3c7e4c6908" width=500 />
-
-Demonstration
----
-A demonstration for the comet technology, without any websocket and support any web browser:
-
-https://tinystruct.herokuapp.com/?q=talk
-
-Troubleshooting
----
-* If you encounter any problems during the installation or usage of the project, please check the project's documentation or build files for information about how to set up and run the project.
-* If you still have problems, please open an issue on GitHub or contact the project maintainers for help.
-
-Contribution
----
-We welcome contributions to the smalltalk project. If you are interested in contributing, please read the CONTRIBUTING.md file for more information about the project's development process and coding standards.
-
-Acknowledgements
----
-smalltalk uses the OpenAI API to interact with the ChatGPT language model. We would like to thank OpenAI for providing this powerful tool to the community.
-
-License
----

-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
+This script will:
+1. Initialize the SQLite database if needed
+2. Run all test classes
+3. Test document processing for various file formats

-http://www.apache.org/licenses/LICENSE-2.0
+## Dependencies

-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
+The system relies on the following key libraries:
+- SQLite - For database storage
+- Apache Tika (3.1.0) - For extracting text from various document formats
+- OpenAI API - For generating embeddings
+- tinystruct - For application framework and database access

bin/dispatcher

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 #!/usr/bin/env sh
 ROOT="$(pwd)"
-VERSION="1.5.5"
+VERSION="1.7.12"
 cd "$(dirname "$0")" || exit
 cd "../"
 # Navigate to the root directory

bin/dispatcher.cmd

Lines changed: 39 additions & 6 deletions
@@ -1,11 +1,11 @@
 @rem ***************************************************************************
-@rem Copyright (c) 2023 James Mover Zhou
+@rem Copyright (c) 2025 James M. Zhou
 @rem
 @rem Licensed under the Apache License, Version 2.0 (the "License");
 @rem you may not use this file except in compliance with the License.
 @rem You may obtain a copy of the License at
 @rem
-@rem http:\\www.apache.org\licenses\LICENSE-2.0
+@rem http://www.apache.org/licenses/LICENSE-2.0
 @rem
 @rem Unless required by applicable law or agreed to in writing, software
 @rem distributed under the License is distributed on an "AS IS" BASIS,
@@ -15,6 +15,23 @@
 @rem ***************************************************************************
 @echo off

+@REM Set the local Maven repository path for tinystruct.jar
+set "MAVEN_REPO=%USERPROFILE%\.m2\repository\org\tinystruct\tinystruct"
+@REM Consolidate classpath entries, initialize ROOT and VERSION
+set "ROOT=%~dp0.."
+set "VERSION=1.7.12"
+
+@REM Define the paths for tinystruct jars in the Maven repository
+set "DEFAULT_JAR_FILE=%MAVEN_REPO%\%VERSION%\tinystruct-%VERSION%.jar"
+set "DEFAULT_JAR_FILE_WITH_DEPENDENCIES=%MAVEN_REPO%\%VERSION%\tinystruct-%VERSION%-jar-with-dependencies.jar"
+
+REM Check which jar to use for extracting Maven Wrapper
+if exist "%DEFAULT_JAR_FILE_WITH_DEPENDENCIES%" (
+    set "JAR_PATH=%DEFAULT_JAR_FILE_WITH_DEPENDENCIES%"
+) else (
+    set "JAR_PATH=%DEFAULT_JAR_FILE%"
+)
+
 @REM Check if JAVA_HOME is set and valid
 if "%JAVA_HOME%" == "" (
     echo Error: JAVA_HOME not found in your environment. >&2
@@ -31,10 +48,26 @@ if not exist "%JAVA_HOME%\bin\java.exe" (

 set "JAVA_CMD=%JAVA_HOME%\bin\java.exe"

-@REM Consolidate classpath entries, initialize ROOT and VERSION
-set "ROOT=%~dp0..\"
-set "VERSION=1.5.5"
-set "classpath=%ROOT%target\classes;%ROOT%lib\tinystruct-%VERSION%-jar-with-dependencies.jar;%ROOT%lib\tinystruct-%VERSION%.jar;%ROOT%lib\*;%ROOT%WEB-INF\lib\*;%ROOT%WEB-INF\classes;%USERPROFILE%\.m2\repository\org\tinystruct\tinystruct\%VERSION%\tinystruct-%VERSION%-jar-with-dependencies.jar;%USERPROFILE%\.m2\repository\org\tinystruct\tinystruct\%VERSION%\tinystruct-%VERSION%.jar"
+@REM Check if the Maven Wrapper is already available
+if not exist "mvnw" (
+    echo Maven Wrapper not found. Extracting from JAR...
+
+    @REM Run Java code to extract the ZIP file from the JAR
+    %JAVA_CMD% -cp "%JAR_PATH%" org.tinystruct.system.Dispatcher maven-wrapper --jar-file-path "%JAR_PATH%" --destination-dir "%ROOT%"
+
+    if exist "%ROOT%\maven-wrapper.zip" (
+        @REM Now unzip the Maven Wrapper files
+        powershell -Command "Expand-Archive -Path '%ROOT%\maven-wrapper.zip' -DestinationPath '%ROOT%'"
+        @REM Delete the ZIP file after extraction
+        del /F /Q "%ROOT%\maven-wrapper.zip"
+        echo Maven wrapper setup completed.
+    ) else (
+        echo Error: Maven wrapper ZIP file not found in JAR.
+        exit /B 1
+    )
+)
+
+set "classpath=%ROOT%\target\classes;%ROOT%\lib\tinystruct-%VERSION%-jar-with-dependencies.jar;%ROOT%\lib\tinystruct-%VERSION%.jar;%ROOT%\lib\*;%ROOT%\WEB-INF\lib\*;%ROOT%\WEB-INF\classes;%USERPROFILE%\.m2\repository\org\tinystruct\tinystruct\%VERSION%\tinystruct-%VERSION%-jar-with-dependencies.jar;%USERPROFILE%\.m2\repository\org\tinystruct\tinystruct\%VERSION%\tinystruct-%VERSION%.jar"

 @REM Run Java application
 %JAVA_CMD% -cp "%classpath%" org.tinystruct.system.Dispatcher %*
