
Add stuff to better avoid bot-detection in web connector #4479


Merged
Weves merged 2 commits into main from improve-web-connector
Apr 8, 2025

Conversation

Weves
Contributor

@Weves Weves commented Apr 8, 2025

Description

Fixes https://linear.app/danswer/issue/DAN-1782/resolve-parsing-of-certain-websites-w-web-connector

How Has This Been Tested?

Tested with a previously broken website.

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@Weves Weves requested a review from a team as a code owner April 8, 2025 18:05

vercel bot commented Apr 8, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| internal-search | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Apr 8, 2025 7:37pm |

Contributor

@greptile-apps greptile-apps bot left a comment


PR Summary

This PR improves the web connector's ability to avoid bot detection by incorporating randomized human-like delays, enhancing Playwright context realism, and adding a robust retry mechanism.

  • Modified /backend/onyx/connectors/web/connector.py to incorporate random delays and realistic browser settings.
  • Added a cookie handling mechanism to mitigate bot detection.
  • Implemented exponential backoff with Playwright restart on failures.
  • Validated changes through tests in /backend/tests/daily/connectors/web/test_web_connector.py.
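
For readers skimming the summary, the "exponential backoff with restart" item could look roughly like the sketch below. This is not the code from connector.py: `backoff_delays`, `fetch_with_backoff`, the `restart` callback, and the delay values are hypothetical names and numbers chosen for illustration, and the jitter range simply mirrors the `random.uniform(0.1, 0.5)` call quoted in the review.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Delays of base * 2**attempt seconds, capped at `cap` (illustrative values)."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def fetch_with_backoff(
    fetch: Callable[[], T],
    restart: Callable[[], None],
    retries: int = 3,
    base: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch`; on failure wait, call `restart`, and try again.

    `restart` stands in for whatever tears down and relaunches the browser.
    """
    delays = backoff_delays(retries, base)
    for attempt, delay in enumerate(delays):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            # Small random jitter so retries do not land on a fixed schedule.
            sleep(delay + random.uniform(0.1, 0.5))
            restart()
    raise RuntimeError("retries must be at least 1")
```

Passing `sleep` as a parameter keeps the helper easy to exercise in tests without actually waiting.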

1 file(s) reviewed, no comment(s)

Contributor

@evan-danswer evan-danswer left a comment


And so the fight of scraping bots against detection bots continues

session.headers.update(DEFAULT_HEADERS)

# Add a random delay to mimic human behavior
time.sleep(random.uniform(0.1, 0.5))
Contributor


😭

}
)

# Add a script to modify navigator properties to avoid detection
Contributor


🕵
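
The diff excerpt above cuts off before the script itself, so the following is only a guess at the general shape. It assumes the script is registered through Playwright's `add_init_script` on the browser context; `STEALTH_INIT_SCRIPT` and `apply_stealth_script` are hypothetical names, and the one-line JavaScript is a commonly seen example, not the script from this PR.

```python
from typing import Any

# Hypothetical script: hides the `navigator.webdriver` flag that automated
# browsers expose. The script actually used in the PR is not shown above.
STEALTH_INIT_SCRIPT = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
"""


def apply_stealth_script(context: Any) -> None:
    """Register the script so it runs in every page before the page's own scripts.

    `context` is expected to be a Playwright BrowserContext; it is typed as
    Any so this sketch does not require Playwright to be installed.
    """
    context.add_init_script(script=STEALTH_INIT_SCRIPT)
```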

    except Exception as e:
        logger.debug(f"Failed to add cookie {cookie['name']} for {domain}: {e}")
except Exception as e:
    logger.debug(f"Failed to handle cookies for {url}: {e}")
Contributor


Could be an exception log?
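
To make the suggestion concrete, here is a minimal sketch of what switching to an exception log might look like. `handle_cookies` and the logger name are hypothetical stand-ins, not the connector's real function; the point is only the difference between `logger.debug` and `logger.exception` from the standard `logging` module.

```python
import logging

logger = logging.getLogger("web_connector_sketch")  # hypothetical logger name


def handle_cookies(url: str) -> None:
    """Stand-in for the cookie step; always fails so the log path is exercised."""
    try:
        raise ValueError("cookie jar unavailable")
    except Exception:
        # logger.exception logs at ERROR level and attaches the traceback,
        # whereas logger.debug(f"... {e}") records only the message at DEBUG.
        logger.exception("Failed to handle cookies for %s", url)
```

The trade-off is noise: cookie failures that are expected on many sites would then show up as errors, which may be why the PR logs them at debug level.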

    timeout=30000,  # 30 seconds
)
# Add random mouse movements and scrolling to mimic human behavior
page.mouse.move(random.randint(100, 700), random.randint(100, 500))
Contributor


🐍

@Weves Weves merged commit 71839e7 into main Apr 8, 2025
8 of 11 checks passed
@Weves Weves deleted the improve-web-connector branch April 8, 2025 19:31
2 participants