Commit 572aca6

Added user-agent headers to avoid reject of default bot like python-requests (open-edge-platform#996)
Co-authored-by: Raghu Bhat <[email protected]>
1 parent 86ea88c commit 572aca6

4 files changed: 69 additions, 4 deletions

microservices/document-ingestion/pgvector/app/main.py

Lines changed: 1 addition & 1 deletion
@@ -389,7 +389,7 @@ async def delete_urls(
     url: Optional[str] = None, delete_all: Optional[bool] = False
 ) -> None:
     """
-    Delete a document or all documents from storage and their embeddings from Vector DB.
+    Delete a URL or all URLs from storage and their embeddings from Vector DB.

     Args:
         url (str): URL to be deleted

microservices/document-ingestion/pgvector/app/url.py

Lines changed: 8 additions & 2 deletions
@@ -5,6 +5,7 @@
 import psycopg
 import ipaddress
 import socket
+import os
 from urllib.parse import urlparse
 from http import HTTPStatus
 from fastapi import HTTPException
@@ -117,6 +118,11 @@ def ingest_url_to_pgvector(url_list: List[str]) -> None:
 HTML parsing, or any other errors during the ingestion process.
 """

+default_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
+
+headers = {
+    "User-Agent": os.getenv("USER_AGENT_HEADER", default_user_agent)
+}

 try:
     invalid_urls = 0
@@ -132,7 +138,7 @@ def ingest_url_to_pgvector(url_list: List[str]) -> None:
 adapter = requests.adapters.HTTPAdapter()
 session.mount("http://", adapter)
 session.mount("https://", adapter)
-response = session.get(url, timeout=5, allow_redirects=False)
+response = session.get(url, headers=headers, timeout=5, allow_redirects=True)

 if response.status_code != HTTPStatus.OK:
     logger.info(f"Failed to fetch URL: {url} with status code {response.status_code}")
@@ -182,7 +188,7 @@ def ingest_url_to_pgvector(url_list: List[str]) -> None:
     logger.error(f"Error while parsing HTML content for URL - {url}: {e}")
     raise HTTPException(status_code=HTTPStatus.INTERNAL_SERVER_ERROR, detail=f"Error while parsing URL")

-logger.info(f"[ ingest url ] url: {url} content: {content}")
+logger.info(f"[ ingest url ] url: {url} content: {content} headers: {headers}")
 metadata = [{"url": url}]

 chunks = text_splitter.split_text(content)

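The header selection added to `ingest_url_to_pgvector` can be looked at in isolation. The following is a minimal sketch, not the service code: the helper name `build_request_headers` and its `env` parameter are hypothetical, introduced only so the fallback can be exercised without touching the real process environment. The `USER_AGENT_HEADER` variable name and the default string are taken from the diff above.

```python
import os
from typing import Mapping, Optional

# Default User-Agent string copied from the diff above.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
)


def build_request_headers(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: use USER_AGENT_HEADER if set, else the default."""
    # Fall back to the real environment when no mapping is supplied.
    source = os.environ if env is None else env
    return {"User-Agent": source.get("USER_AGENT_HEADER", DEFAULT_USER_AGENT)}


if __name__ == "__main__":
    print(build_request_headers({"USER_AGENT_HEADER": "my-app/1.0"}))
    print(build_request_headers({}))
```

As with `os.getenv` in the diff, a variable that is set but empty is passed through unchanged rather than replaced by the default, so an empty `USER_AGENT_HEADER` would send an empty User-Agent.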
microservices/document-ingestion/pgvector/docs/get-started.md

Lines changed: 57 additions & 1 deletion
@@ -43,6 +43,13 @@ export PGDB_PASSWD=<user_db_password>
 export PGDB_NAME=<user_db_name>
 export PGDB_INDEX=<user_db_index>

+# HTTP request configuration (optional)
+# Set USER_AGENT_HEADER to define a custom User-Agent string for outgoing HTTP requests.
+# If not set, a robust default User-Agent is automatically applied
+# Setting a clear and descriptive User-Agent helps external servers identify the application and
+# reduces the chance of requests being treated as bot traffic.
+export USER_AGENT_HEADER=<your_user_agent_string>
+
 # OPTIONAL - If user wants to push the built images to a remote container registry, user needs to name the images accordingly. For this, image name should include the registry URL as well. To do this, set the following environment variable from shell. Please note that this URL will be prefixed to application name and tag to form the final image name.

 export CONTAINER_REGISTRY_URL=<user_container_registry_url>
@@ -103,7 +110,9 @@ This method provides the fastest way to get started with the microservice.
 2. Examples of expected outputs for validation.
 -->

-## First Use: Running a Predefined Task
+## Application Usage:
+
+## Type 1: Upload Files

 Try uploading a sample PDF file and verify that the embeddings and files are stored. Run the commands from the same shell as where the environment variables are set.

@@ -135,6 +144,53 @@ Try uploading a sample PDF file and verify that the embeddings and files are sto
 rm -rf ./minimal-document.pdf
 ```

+## Type 2: Upload URLs
+
+Try uploading web page URLs and verify that the embeddings are created and stored. Run the commands from the same shell as where the environment variables are set.
+
+> **Note**: This URL ingestion microservice works best with pages that are not heavily reliant on JavaScript such as Wikipedia, which serve as ideal URL input sources. For JavaScript-intensive pages (social media feeds, Single Page Applications), the API may indicate a successful request but the actual content might not be captured. Such pages should be avoided or handled separately.
+
+1. **Get stored URLs**:
+Retrieve a list of all URLs that have been processed and stored in the system.
+```bash
+curl -X 'GET' \
+"http://${host_ip}:${DATAPREP_HOST_PORT}/urls" \
+-H 'accept: application/json'
+```
+
+2. **Upload URLs to create and store embeddings**:
+Submit one or more URLs to be processed for embedding creation.
+```bash
+curl -X 'POST' \
+"http://${host_ip}:${DATAPREP_HOST_PORT}/urls" \
+-H 'accept: application/json' \
+-H 'Content-Type: application/json' \
+-d '[
+"https://en.wikipedia.org/wiki/Fiat",
+"https://en.wikipedia.org/wiki/Lunar_eclipse"
+]'
+```
+
+3. **Verify the URLs were processed**:
+Check that the URLs were successfully processed and stored.
+```bash
+curl -X 'GET' \
+"http://${host_ip}:${DATAPREP_HOST_PORT}/urls" \
+-H 'accept: application/json'
+```
+Expected output: A JSON response with the list of processed URLs should be printed.
+
+4. **Delete a specific URL or all URLs**:
+Get the URL from the GET call response in step 3 and use it in the DELETE request below.
+```bash
+curl -X 'DELETE' \
+"http://${host_ip}:${DATAPREP_HOST_PORT}/urls?url=<url_to_be_deleted>&delete_all=false" \
+-H 'accept: */*'
+```
+
+**Note**:
+- Optionally set `delete_all=true` if you want to delete all URLs from the database instead of a specific URL
+
 ## Advanced Setup Options

 To customize the microservice, refer to [customization documentation](./how-to-customize.md).

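The DELETE request documented in step 4 passes the target URL as a query parameter, so characters such as `:`, `/`, `?` and `&` inside it generally need percent-encoding before they are placed in the query string. A minimal sketch using only the standard library follows. The helper name `build_delete_url` and the example host and port are hypothetical; the `/urls` path and the `url` and `delete_all` parameters are taken from the documentation in the diff.

```python
from urllib.parse import urlencode


def build_delete_url(base: str, target: str = "", delete_all: bool = False) -> str:
    """Hypothetical helper: build the DELETE /urls request URL."""
    params = {}
    if target:
        # urlencode percent-encodes reserved characters in the value.
        params["url"] = target
    params["delete_all"] = "true" if delete_all else "false"
    return f"{base.rstrip('/')}/urls?{urlencode(params)}"


if __name__ == "__main__":
    # Host and port are placeholders, not values from the repository.
    print(build_delete_url("http://localhost:8000", "https://en.wikipedia.org/wiki/Fiat"))
    print(build_delete_url("http://localhost:8000", delete_all=True))
```

The resulting string can be passed to `curl -X DELETE` or to any HTTP client in place of the hand-written URL.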
sample-applications/chat-question-and-answer/docs/user-guide/overview-architecture.md

Lines changed: 3 additions & 0 deletions
@@ -35,6 +35,9 @@ ChatQ&A application is a combination of the core LangChain application logic tha
 1. **Input Sources**:
 - **Documents**: The document ingestion microservice supports ingesting documents in various formats. Supported formats are word and pdf.
 - **Web pages**: Contents of accessible web pages can also be parsed and used as input for the RAG pipeline.
+
+> **Note**: This application works best with non–JavaScript-heavy pages (e.g., Wikipedia, blogs, news sites) that render most of their content directly in HTML. JavaScript-heavy pages (e.g., social media platforms, single-page applications) load content dynamically via JavaScript, so their raw HTML often lacks useful text. The current implementation only parses raw HTML and does not execute JavaScript, so such pages may return incomplete or inaccurate results and should be avoided or handled separately.
+
 2. **Create the context**
 - **Upload input documents and web links**: The UI microservice allows the developer to interact with the ChatQ&A backend. It provides the interface to upload the documents and weblinks on which the RAG pipeline will be executed. The documents are uploaded and stored in object store. MinIO is the database used for object store.
 - **Convert to embeddings space**: The ChatQ&A backend microservice creates the embeddings out of the uploaded documents and web pages using the document ingestion microservice. The Embeddings microservice is used to create the embeddings. The embeddings are stored in a vector database. PGVector is used in the sample application.

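The limitation described in the note above can be illustrated with a small sketch. It makes no claim about which parser the microservice actually uses: the standard-library `html.parser` stands in for it here, and the class name `VisibleTextExtractor`, the function `extract_text`, and both sample pages are made up for illustration.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# A page whose content is present in the raw HTML.
STATIC_PAGE = (
    "<html><body><h1>Lunar eclipse</h1>"
    "<p>A lunar eclipse occurs when the Moon moves into Earth's shadow.</p>"
    "</body></html>"
)

# A page whose content would only appear after JavaScript runs in a browser.
JS_PAGE = (
    '<html><body><div id="root"></div>'
    '<script>document.getElementById("root").innerText = "Loaded later";</script>'
    "</body></html>"
)

if __name__ == "__main__":
    print(repr(extract_text(STATIC_PAGE)))
    print(repr(extract_text(JS_PAGE)))
```

The second page returns an empty string even though the request itself would succeed, which matches the behavior the note warns about: a successful fetch does not imply that usable text was captured.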