+# Set USER_AGENT_HEADER to define a custom User-Agent string for outgoing HTTP requests.
+# If not set, a robust default User-Agent is automatically applied.
+# Setting a clear and descriptive User-Agent helps external servers identify the application and
+# reduces the chance of requests being treated as bot traffic.
+export USER_AGENT_HEADER=<your_user_agent_string>
+
 # OPTIONAL - To push the built images to a remote container registry, the image names must include the registry URL. Set the following environment variable from the shell. This URL is prefixed to the application name and tag to form the final image name.
@@ -103,7 +110,9 @@ This method provides the fastest way to get started with the microservice.
 2. Examples of expected outputs for validation.
 -->
 
-## First Use: Running a Predefined Task
+## Application Usage:
+
+## Type 1: Upload Files
 
 Try uploading a sample PDF file and verify that the embeddings and files are stored. Run the commands from the same shell as where the environment variables are set.
 
@@ -135,6 +144,53 @@ Try uploading a sample PDF file and verify that the embeddings and files are sto
 rm -rf ./minimal-document.pdf
 ```
 
+## Type 2: Upload URLs
+
+Try uploading web page URLs and verify that the embeddings are created and stored. Run the commands from the same shell as where the environment variables are set.
+
+>**Note**: This URL ingestion microservice works best with pages that do not rely heavily on JavaScript, such as Wikipedia. For JavaScript-intensive pages (social media feeds, single-page applications), the API may report a successful request even though the actual content was not captured. Avoid such pages or handle them separately.
+
+1. **Get stored URLs**:
+    Retrieve a list of all URLs that have been processed and stored in the system.
+    ```bash
+    curl -X 'GET' \
+      "http://${host_ip}:${DATAPREP_HOST_PORT}/urls" \
+      -H 'accept: application/json'
+    ```
+
+2. **Upload URLs to create and store embeddings**:
+    Submit one or more URLs to be processed for embedding creation.
+    ```bash
+    curl -X 'POST' \
+      "http://${host_ip}:${DATAPREP_HOST_PORT}/urls" \
+      -H 'accept: application/json' \
+      -H 'Content-Type: application/json' \
+      -d '[
+      "https://en.wikipedia.org/wiki/Fiat",
+      "https://en.wikipedia.org/wiki/Lunar_eclipse"
+    ]'
+    ```
+
+3. **Verify the URLs were processed**:
+    Check that the URLs were successfully processed and stored.
+    ```bash
+    curl -X 'GET' \
+      "http://${host_ip}:${DATAPREP_HOST_PORT}/urls" \
+      -H 'accept: application/json'
+    ```
+    Expected output: A JSON response listing the processed URLs is printed.
+
+4. **Delete a specific URL or all URLs**:
+    Get the URL from the GET call response in step 3 and use it in the DELETE request below.
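
The upload and verification steps (2 and 3) above can also be scripted. The sketch below is only an illustration under stated assumptions: `build_url_payload` and `url_is_stored` are hypothetical helper names that are not part of the microservice, the URLs are assumed to contain no double quotes or backslashes, and the GET response is assumed to contain each stored URL as a JSON string.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the /urls endpoint shown above (not part of the microservice).

# Build a JSON array of strings from the URLs passed as arguments.
# Assumption: URLs contain no double quotes or backslashes, so no escaping is done.
build_url_payload() {
  local out="[" sep="" url
  for url in "$@"; do
    out="${out}${sep}\"${url}\""
    sep=","
  done
  printf '%s]\n' "$out"
}

# Read a JSON response on stdin and succeed if it contains the given URL
# as a quoted string. Assumption: the response lists URLs as plain JSON strings.
url_is_stored() {
  grep -qF "\"$1\""
}

# Usage (needs a running service, so it is left commented out):
# curl -X POST "http://${host_ip}:${DATAPREP_HOST_PORT}/urls" \
#   -H 'accept: application/json' -H 'Content-Type: application/json' \
#   -d "$(build_url_payload "https://en.wikipedia.org/wiki/Lunar_eclipse")"
# curl -s "http://${host_ip}:${DATAPREP_HOST_PORT}/urls" -H 'accept: application/json' \
#   | url_is_stored "https://en.wikipedia.org/wiki/Lunar_eclipse" && echo "stored"
```

Keeping the payload builder separate from the `curl` call makes it possible to check the request body without contacting the service.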
sample-applications/chat-question-and-answer/docs/user-guide/overview-architecture.md (3 additions, 0 deletions)
@@ -35,6 +35,9 @@ ChatQ&A application is a combination of the core LangChain application logic tha
 1. **Input Sources**:
    - **Documents**: The document ingestion microservice supports ingesting documents in various formats. Supported formats are Word and PDF.
    - **Web pages**: Contents of accessible web pages can also be parsed and used as input for the RAG pipeline.
+
+> **Note**: This application works best with pages that are not JavaScript-heavy (e.g., Wikipedia, blogs, news sites) and render most of their content directly in HTML. JavaScript-heavy pages (e.g., social media platforms, single-page applications) load content dynamically, so their raw HTML often lacks useful text. The current implementation parses only raw HTML and does not execute JavaScript, so such pages may yield incomplete or inaccurate results and should be avoided or handled separately.
+
 2. **Create the context**
    - **Upload input documents and web links**: The UI microservice allows the developer to interact with the ChatQ&A backend. It provides the interface to upload the documents and web links on which the RAG pipeline will be executed. The uploaded documents are stored in an object store; MinIO is used as the object store.
    - **Convert to embeddings space**: The ChatQ&A backend microservice creates embeddings from the uploaded documents and web pages using the document ingestion microservice, which in turn calls the Embeddings microservice. The embeddings are stored in a vector database. PGVector is used in the sample application.
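
The note above about JavaScript-heavy pages suggests a quick check before submitting a URL: see how much text survives in the raw HTML alone. The following is a rough sketch, not the parsing logic used by the application. `visible_word_count` is a hypothetical helper that drops `<script>` and `<style>` blocks, strips the remaining tags, and counts the words left; any threshold applied to the result is a judgment call.

```bash
#!/usr/bin/env bash
# Hypothetical pre-check (not part of the application): count the words that
# remain in raw HTML once script/style blocks and tags are removed.
# This is a heuristic, not an HTML parser.
visible_word_count() {
  tr '\n' ' ' | awk '
    # Remove every <tag ...>...</tag> block, matching tag names case-insensitively.
    function drop_blocks(s, tag,    low, start, stop, open_tag, close_tag) {
      open_tag = "<" tag
      close_tag = "</" tag ">"
      low = tolower(s)
      while ((start = index(low, open_tag)) > 0) {
        stop = index(substr(low, start), close_tag)
        if (stop == 0) { s = substr(s, 1, start - 1); break }
        s = substr(s, 1, start - 1) " " substr(s, start + stop - 1 + length(close_tag))
        low = tolower(s)
      }
      return s
    }
    { html = html $0 }
    END {
      html = drop_blocks(html, "script")
      html = drop_blocks(html, "style")
      gsub(/<[^>]*>/, " ", html)        # strip the remaining tags
      print split(html, words, " ")     # number of whitespace-separated words
    }'
}

# Usage (needs network access, so it is left commented out):
# curl -fsSL -A "$USER_AGENT_HEADER" "https://en.wikipedia.org/wiki/Lunar_eclipse" | visible_word_count
```

A page whose count is close to zero most likely renders its content with JavaScript and is a poor candidate for ingestion.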