
feat: add on_link_blocked_callback for robots.txt blocked URLs #394

Merged
j-mendez merged 2 commits into spider-rs:main from zanmato:feat/robots-blocked-callback
Jun 1, 2026


@zanmato zanmato commented Jun 1, 2026

Contributor

I'm using spider for SEO crawling and need to know which URLs are blocked by robots.txt.

What does this PR do?

Add an OnLinkBlockedCallback that fires when a link is denied by robots.txt rules, allowing consumers to observe and log blocked URLs without extra polling.
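To illustrate the idea, here is a minimal, self-contained sketch of a "link blocked" callback. It does not use spider's actual types or signatures: `Crawler`, `BlockedCallback`, `disallowed_prefixes`, and `collect_blocked` are hypothetical names, and the prefix match is a deliberately simplified stand-in for real robots.txt evaluation.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical callback type: invoked with the path that robots.txt denied.
type BlockedCallback = Arc<dyn Fn(&str) + Send + Sync>;

/// Hypothetical stand-in for a crawler; this is not spider's `Website`.
struct Crawler {
    /// Simplified robots.txt model: any path with one of these prefixes is denied.
    disallowed_prefixes: Vec<String>,
    /// Optional observer for denied links.
    on_link_blocked: Option<BlockedCallback>,
}

impl Crawler {
    /// Returns true if the link may be crawled; notifies the callback otherwise.
    fn is_allowed(&self, path: &str) -> bool {
        let blocked = self
            .disallowed_prefixes
            .iter()
            .any(|prefix| path.starts_with(prefix.as_str()));
        if blocked {
            if let Some(callback) = &self.on_link_blocked {
                callback(path);
            }
        }
        !blocked
    }
}

/// Checks a fixed list of paths and returns the ones reported as blocked.
fn collect_blocked(paths: &[&str]) -> Vec<String> {
    let seen: Arc<Mutex<Vec<String>>> = Arc::new(Mutex::new(Vec::new()));
    let sink = Arc::clone(&seen);
    let callback: BlockedCallback = Arc::new(move |path: &str| {
        sink.lock().unwrap().push(path.to_string());
    });
    let crawler = Crawler {
        disallowed_prefixes: vec!["/admin".to_string()],
        on_link_blocked: Some(callback),
    };
    for path in paths {
        crawler.is_allowed(path);
    }
    let blocked = seen.lock().unwrap().clone();
    blocked
}
```

The consumer receives blocked paths as they are encountered, so there is no need to re-check robots.txt or poll the crawler after the fact.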

Co-Authored by Claude Opus 4.7

Checklist

  • cargo test passes
  • cargo fmt applied
  • New public APIs have doc comments
  • Feature-gated behind a flag (if adding optional functionality)

Add an OnLinkBlockedCallback that fires when a link is denied by
robots.txt rules, allowing consumers to observe and log blocked
URLs without extra polling.
@zanmato zanmato force-pushed the feat/robots-blocked-callback branch from 15ef5f1 to 1dea006 on June 1, 2026 07:57
Restore the original short-circuit in is_allowed_default: is_allowed_robots
is only evaluated when whitelist/blacklist do not already block, instead of
the eager `let blocked_robots = !self.is_allowed_robots(...)` that ran the
robots check on every link. The on_link_blocked_callback now fires only when
robots.txt is the actual block reason (not when whitelist/blacklist blocks),
via a clean else-if branch. Robots condition semantics unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
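
The short-circuit described in this commit message can be sketched roughly as follows. This is an illustrative model only, not the code from the PR: `Rules`, `Decision`, `blocked_by_list`, `decide`, and `demo` are hypothetical names, and the prefix matching stands in for the real whitelist/blacklist and robots.txt logic. A counter records how often the robots check runs, to show that it is skipped when the list check already blocks.

```rust
use std::cell::Cell;

/// Outcome for a link in this simplified model.
#[derive(Debug, PartialEq)]
enum Decision {
    Allowed,
    BlockedByList,
    BlockedByRobots,
}

/// Hypothetical rule set; not spider's internal representation.
struct Rules<'a> {
    blacklist: Vec<&'a str>,
    robots_disallow: Vec<&'a str>,
    /// Counts how many times the robots check was evaluated.
    robots_checks: Cell<usize>,
}

impl<'a> Rules<'a> {
    fn blocked_by_list(&self, path: &str) -> bool {
        self.blacklist.iter().any(|p| path.starts_with(*p))
    }

    fn is_allowed_robots(&self, path: &str) -> bool {
        self.robots_checks.set(self.robots_checks.get() + 1);
        !self.robots_disallow.iter().any(|p| path.starts_with(*p))
    }

    /// The robots check runs only if the list check did not already block,
    /// and the callback fires only when robots.txt is the block reason.
    fn decide(&self, path: &str, on_blocked: &mut dyn FnMut(&str)) -> Decision {
        if self.blocked_by_list(path) {
            Decision::BlockedByList
        } else if !self.is_allowed_robots(path) {
            on_blocked(path);
            Decision::BlockedByRobots
        } else {
            Decision::Allowed
        }
    }
}

/// Runs three sample paths and returns the decisions, the paths reported to
/// the callback, and the number of robots checks performed.
fn demo() -> (Vec<Decision>, Vec<String>, usize) {
    let rules = Rules {
        blacklist: vec!["/private"],
        robots_disallow: vec!["/admin", "/private"],
        robots_checks: Cell::new(0),
    };
    let mut reported: Vec<String> = Vec::new();
    let mut decisions = Vec::new();
    {
        let mut callback = |path: &str| reported.push(path.to_string());
        for path in ["/private/a", "/admin/b", "/blog"] {
            decisions.push(rules.decide(path, &mut callback));
        }
    }
    (decisions, reported, rules.robots_checks.get())
}
```

In this sketch `/private/a` is rejected by the blacklist without touching the robots rules, so the robots check runs for only two of the three paths and the callback sees only `/admin/b`.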
@j-mendez j-mendez merged commit c69a224 into spider-rs:main Jun 1, 2026
j-mendez added a commit that referenced this pull request Jun 1, 2026
Includes #394 (on_link_blocked_callback for robots.txt-blocked URLs, with the
robots check kept short-circuited behind a clean else-if branch) and #393
(crawl_raw uses a raw sitemap chain so HTTP-only crawls never launch chrome).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@zanmato zanmato deleted the feat/robots-blocked-callback branch June 1, 2026 12:01