Mitigate Bot/DDOS on Expensive Endpoints via Human Verification Challenge for Suspicious Visitors #11806

@mekarpeles

Description

Problem

We are currently experiencing DDoS traffic from bots targeting our expensive endpoints, especially /search?q= (with complex solr syntax, e.g. language:...) and /subjects/... pages. These attacks cause a significant resource drain.

Recent work (internetarchive/openlibrary#11621) improved bot detection for analytics, but our endpoints remain vulnerable to bot-driven resource consumption.

Proposal

Implement an additional human verification challenge for "suspicious visitors" (not logged in, with generic user-agents that aren't recognized bots) when:

  1. Performing a GET to /search?q= using specialized solr syntax (such as q=language:eng), or
  2. Accessing a /subjects/ page for the first time

Flow:

  • If the above conditions are met, and a vf=1 cookie is not present:
    • Instead of rendering the resource, show a templates/accounts/challenge.html page (or suitable alternative) with a minimal body and a standard Open Library button: Verify you are human.
    • When clicked, the button hits a new API endpoint (e.g. /account/verify_human), which sets the vf=1 cookie and reloads the page.
  • Share challenge logic between Search and Subjects for DRYness (suggest a shared bounce/check function; decorator may be overkill).
  • Template must follow i18n and remain minimal.
  • JS may be in a script tag within the template or implemented canonically in plugins/openlibrary/js.
  • Add a basic statsd metric (e.g. ol.stats.verify_human, following our current statsd patterns) to track challenge flow usage.
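The shared bounce/check described above could be sketched roughly as follows. This is a framework-agnostic sketch, not the actual implementation: `bounce_if_suspicious`, `verify_human`, and `VF_COOKIE` are hypothetical names, and the statsd call is left as a comment.

```python
# Sketch of the shared challenge bounce. All names here
# (bounce_if_suspicious, verify_human, VF_COOKIE) are hypothetical.

VF_COOKIE = "vf"

def bounce_if_suspicious(cookies: dict, is_suspicious: bool):
    """Return the challenge template name if the visitor should be
    bounced, or None to let the request through."""
    if not is_suspicious:
        return None
    if cookies.get(VF_COOKIE) == "1":
        return None  # already verified; do not re-prompt
    # A statsd metric (e.g. ol.stats.verify_human) would be recorded here.
    return "accounts/challenge.html"

def verify_human(set_cookie) -> None:
    """Handler for the new /account/verify_human endpoint: set vf=1 so
    the challenge is not shown again; the client then reloads the page."""
    set_cookie(VF_COOKIE, "1")
```

Because both Search and Subjects call the same `bounce_if_suspicious` function, the challenge logic stays DRY without needing a decorator.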

Definition

A suspicious_visitor is:

  • Not logged in
  • Presents a generic or nonspecific user-agent (not a known bot UA); see is_bot in plugins/openlibrary/code.py

Implementation goals

  • Simple, secure, easy to test and ship
  • DRY: minimal overhead and code duplication
  • Backend challenge logic, with suspicious_visitor detection based on upstream nginx JS rules (inferring this logic is out-of-scope)

See also

  • PR #11621 which shows how we've previously split out human and bot traffic.

Acceptance Criteria

  • Suspicious, unverified visitors hitting /search?q= with specialized solr syntax (like language:...) or making expensive requests for /subjects/* see a human verification page if vf=1 is not set
  • Verified users (via challenge or login/cookie) are not re-prompted
  • Metrics for verification flows are recorded via statsd
  • Search and Subjects remain DRY.
  • Solution tested and ready for production.

For human verification, we don't really care about the nginx js code -- we really care that:

  1. the endpoint is a candidate (e.g. expensive subject or /search?q= with specialized solr syntax or parameters)
  2. they are not logged in
  3. no vf cookie set
  4. not a known is_bot
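The four conditions above could be combined into a single predicate along these lines. This is a sketch under assumptions: `has_specialized_solr_syntax` and its regex are hypothetical (a real check would match Open Library's actual solr query grammar), and `is_bot` is assumed to behave like the helper in plugins/openlibrary/code.py.

```python
import re

# Hypothetical heuristic: treat any field-qualified query term
# (e.g. language:eng) as "specialized solr syntax".
SOLR_FIELD_RE = re.compile(r"\b\w+:\S")

def has_specialized_solr_syntax(q: str) -> bool:
    return bool(SOLR_FIELD_RE.search(q or ""))

def needs_challenge(path, q, logged_in, cookies, user_agent, is_bot):
    """Combine the four conditions: candidate endpoint, not logged in,
    no vf cookie, and not a known bot."""
    candidate = path.startswith("/subjects/") or (
        path.startswith("/search") and has_specialized_solr_syntax(q)
    )
    return (
        candidate
        and not logged_in
        and cookies.get("vf") != "1"
        and not is_bot(user_agent)
    )
```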
