Add stuff to better avoid bot-detection in web connector #4479
PR Summary
This PR improves the web connector's ability to avoid bot detection by incorporating randomized human-like delays, enhancing Playwright context realism, and adding a robust retry mechanism.
- Modified /backend/onyx/connectors/web/connector.py to incorporate random delays and realistic browser settings.
- Added a cookie handling mechanism to mitigate bot detection.
- Implemented exponential backoff with Playwright restart on failures.
- Validated changes through tests in /backend/tests/daily/connectors/web/test_web_connector.py.
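The retry behaviour described in the summary (exponential backoff with a Playwright restart between attempts) could look roughly like the sketch below. This is not the code from the PR: `retry_with_backoff`, `operation`, `restart`, and the default values are hypothetical names and numbers chosen for illustration, and the jitter term is an added assumption.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    restart: Callable[[], None],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(), retrying with exponentially growing delays.

    restart() is a hypothetical hook that would tear down and recreate
    the Playwright browser/context before the next attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt; jitter avoids a fixed rhythm.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
            restart()
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to exercise in tests without real waiting.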
1 file(s) reviewed, no comment(s)
And so the fight of scraping bots against detection bots continues
session.headers.update(DEFAULT_HEADERS)

# Add a random delay to mimic human behavior
time.sleep(random.uniform(0.1, 0.5))
😭
}
)

# Add a script to modify navigator properties to avoid detection
🕵
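The hunk above is cut off before the script itself, so its contents are not visible here. As a rough illustration only, an init script of this kind is typically registered on a Playwright context with `add_init_script`. The script body and the names `NAVIGATOR_INIT_SCRIPT` and `apply_init_script` below are assumptions, not the PR's code.

```python
# Hypothetical script body; the PR's actual script is not shown in this excerpt.
NAVIGATOR_INIT_SCRIPT = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
"""


def apply_init_script(context) -> None:
    """Register the script so it runs before any page script.

    `context` is assumed to be a playwright.sync_api.BrowserContext,
    which provides add_init_script(script=...).
    """
    context.add_init_script(NAVIGATOR_INIT_SCRIPT)
```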
    except Exception as e:
        logger.debug(f"Failed to add cookie {cookie['name']} for {domain}: {e}")
except Exception as e:
    logger.debug(f"Failed to handle cookies for {url}: {e}")
Could be an exception log?
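For context on this suggestion: `logger.exception` logs at ERROR level and attaches the current traceback, whereas the `logger.debug` call in the hunk records only the formatted message. A minimal sketch of the change, assuming a hypothetical `handle_cookies` wrapper and `load_cookies` callable that are not part of the PR:

```python
import logging

logger = logging.getLogger("web_connector_sketch")  # hypothetical logger name


def handle_cookies(url, load_cookies):
    """Call load_cookies(url); on failure, log with traceback and return []."""
    try:
        return load_cookies(url)
    except Exception:
        # logger.exception must be called from an except block; it logs
        # at ERROR level and includes the active exception's traceback.
        logger.exception("Failed to handle cookies for %s", url)
        return []
```

Whether ERROR is the right level for a best-effort cookie step is a judgment call; `logger.debug(..., exc_info=True)` would keep the quieter level while still recording the traceback.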
timeout=30000,  # 30 seconds
)
# Add random mouse movements and scrolling to mimic human behavior
page.mouse.move(random.randint(100, 700), random.randint(100, 500))
🐍
Description
Fixes https://linear.app/danswer/issue/DAN-1782/resolve-parsing-of-certain-websites-w-web-connector
How Has This Been Tested?
Tested with a previously broken website.
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes; otherwise, resolve the conflicts manually and tag the patches.