Add AI agent detection and automatic markdown rewrites #351
Merged
Commits (4):
- 29a464a Add AI agent detection and automatic markdown rewrites (molebox)
- 413fea5 Fix: Missing `detectionMethod` property in `TrackMdRequestParams` typ… (vercel[bot])
- 89bf42a format (dferber90)
- 8ce925a Merge branch 'main' into add-agent-rewrites (dferber90)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,168 @@ | ||
| /** | ||
| * AI Agent Detection Utility | ||
| * | ||
| * Multi-signal detection for AI agents/bots. Used to serve markdown | ||
| * responses when agents request docs pages. | ||
| * | ||
| * Three detection layers: | ||
| * 1. Known UA patterns (definitive): curated from https://bots.fyi/?tags=ai_assistant | ||
| * 2. Signature-Agent header (definitive): catches ChatGPT agent, which signs requests with RFC 9421 HTTP Message Signatures | ||
| * 3. Missing browser fingerprint heuristic: catches unknown bots | ||
| * | ||
| * Optimizes for recall over precision: serving markdown to a non-AI bot | ||
| * is low-harm; missing an AI agent means a worse experience. | ||
| * | ||
| * Last reviewed: 2026-03-20 against bots.fyi + official vendor docs | ||
| */ | ||
| | ||
| // Layer 1: Known AI agent UA substrings (lowercase). | ||
| const AI_AGENT_UA_PATTERNS = [ | ||
| // Anthropic — https://support.claude.com/en/articles/8896518 | ||
| 'claudebot', | ||
| 'claude-searchbot', | ||
| 'claude-user', | ||
| 'anthropic-ai', | ||
| 'claude-web', | ||
| | ||
| // OpenAI — https://platform.openai.com/docs/bots | ||
| 'chatgpt', | ||
| 'gptbot', | ||
| 'oai-searchbot', | ||
| 'openai', | ||
| | ||
| // Google AI | ||
| 'gemini', | ||
| 'bard', | ||
| 'google-cloudvertexbot', | ||
| 'google-extended', | ||
| | ||
| // Meta | ||
| 'meta-externalagent', | ||
| 'meta-externalfetcher', | ||
| 'meta-webindexer', | ||
| | ||
| // Search/Research AI | ||
| 'perplexity', | ||
| 'youbot', | ||
| 'you.com', | ||
| 'deepseekbot', | ||
| | ||
| // Coding assistants | ||
| 'cursor', | ||
| 'github-copilot', | ||
| 'codeium', | ||
| 'tabnine', | ||
| 'sourcegraph', | ||
| | ||
| // Other AI agents / data scrapers (low-harm to serve markdown) | ||
| 'cohere-ai', | ||
| 'bytespider', | ||
| 'amazonbot', | ||
| 'ai2bot', | ||
| 'diffbot', | ||
| 'omgili', | ||
| 'omgilibot', | ||
| ]; | ||
| | ||
| // Layer 2: Known AI service domains matched against the Signature-Agent header (sent with RFC 9421 HTTP Message Signatures). | ||
| const SIGNATURE_AGENT_DOMAINS = ['chatgpt.com']; | ||
| | ||
| // Layer 3: Traditional bot exclusion list — bots that should NOT trigger | ||
| // the heuristic layer (they're search engine crawlers, social previews, or | ||
| // monitoring tools, not AI agents). | ||
| const TRADITIONAL_BOT_PATTERNS = [ | ||
| 'googlebot', | ||
| 'bingbot', | ||
| 'yandexbot', | ||
| 'baiduspider', | ||
| 'duckduckbot', | ||
| 'slurp', | ||
| 'msnbot', | ||
| 'facebot', | ||
| 'twitterbot', | ||
| 'linkedinbot', | ||
| 'whatsapp', | ||
| 'telegrambot', | ||
| 'pingdom', | ||
| 'uptimerobot', | ||
| 'newrelic', | ||
| 'datadog', | ||
| 'statuspage', | ||
| 'site24x7', | ||
| 'applebot', | ||
| ]; | ||
| | ||
| // Broad regex for bot-like UA strings (used only in Layer 3 heuristic). | ||
| const BOT_LIKE_REGEX = /bot|agent|fetch|crawl|spider|search/i; | ||
| | ||
| export type DetectionMethod = 'ua-match' | 'signature-agent' | 'heuristic'; | ||
| | ||
| export interface DetectionResult { | ||
| detected: boolean; | ||
| method: DetectionMethod | null; | ||
| } | ||
| | ||
| /** | ||
| * Detects AI agents from HTTP request headers. | ||
| * | ||
| * Returns both whether the agent was detected and which signal triggered, | ||
| * so callers can log the detection method for accuracy tracking. | ||
| */ | ||
| export function isAIAgent(request: { | ||
| headers: { get(name: string): string | null }; | ||
| }): DetectionResult { | ||
| const userAgent = request.headers.get('user-agent'); | ||
| | ||
| // Layer 1: Known UA pattern match | ||
| if (userAgent) { | ||
| const lowerUA = userAgent.toLowerCase(); | ||
| if (AI_AGENT_UA_PATTERNS.some((pattern) => lowerUA.includes(pattern))) { | ||
| return { detected: true, method: 'ua-match' }; | ||
| } | ||
| } | ||
| | ||
| // Layer 2: Signature-Agent header (sent by ChatGPT agent with RFC 9421 signatures; only the header value is matched here) | ||
| const signatureAgent = request.headers.get('signature-agent'); | ||
| if (signatureAgent) { | ||
| const lowerSig = signatureAgent.toLowerCase(); | ||
| if (SIGNATURE_AGENT_DOMAINS.some((domain) => lowerSig.includes(domain))) { | ||
| return { detected: true, method: 'signature-agent' }; | ||
| } | ||
| } | ||
| | ||
| // Layer 3: Missing browser fingerprint heuristic | ||
| // Real browsers (Chrome 76+, Firefox 90+, Safari 16.4+) send sec-fetch-mode | ||
| // on navigation requests. Its absence signals a programmatic client. | ||
| const secFetchMode = request.headers.get('sec-fetch-mode'); | ||
| if (!secFetchMode && userAgent && BOT_LIKE_REGEX.test(userAgent)) { | ||
| const lowerUA = userAgent.toLowerCase(); | ||
| const isTraditionalBot = TRADITIONAL_BOT_PATTERNS.some((pattern) => | ||
| lowerUA.includes(pattern), | ||
| ); | ||
| if (!isTraditionalBot) { | ||
| return { detected: true, method: 'heuristic' }; | ||
| } | ||
| } | ||
| | ||
| return { detected: false, method: null }; | ||
| } | ||
| | ||
| /** | ||
| * Generates a markdown response for AI agents that hit non-existent URLs. | ||
| */ | ||
| export function generateAgentNotFoundResponse(requestedPath: string): string { | ||
| return `# Page Not Found | ||
| | ||
| The URL \`${requestedPath}\` does not exist in the documentation. | ||
| | ||
| ## How to find the correct page | ||
| | ||
| 1. **Browse the sitemap**: [/sitemap.md](/sitemap.md) — A structured index of all pages with URLs, content types, and descriptions | ||
| 2. **Browse the full index**: [/llms.txt](/llms.txt) — Complete documentation index | ||
| | ||
| ## Tips for requesting documentation | ||
| | ||
| - For markdown responses, append \`.md\` to URLs (e.g., \`/docs/getting-started.md\`) | ||
| - Use \`Accept: text/markdown\` header for content negotiation | ||
| `; | ||
| } | ||
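
To make the layer ordering in `isAIAgent` easier to follow, here is a minimal sketch that runs a few sample requests through a condensed stand-in for the detector. It is not the merged code: `detect`, `headersOf`, the sample user-agent strings, and the shortened pattern lists are all assumptions chosen for illustration, and only the three-layer order and the `sec-fetch-mode` check mirror the diff.

```typescript
// Condensed stand-in for the detector in the diff: same three layers, but the
// pattern lists are shortened for illustration and are NOT the full lists.
type DetectionMethod = 'ua-match' | 'signature-agent' | 'heuristic';

interface DetectionResult {
  detected: boolean;
  method: DetectionMethod | null;
}

interface HeaderBag {
  get(name: string): string | null;
}

const AI_UA_PATTERNS = ['claudebot', 'gptbot', 'perplexity'];
const SIGNATURE_AGENT_DOMAINS = ['chatgpt.com'];
const TRADITIONAL_BOT_PATTERNS = ['googlebot', 'bingbot'];
const BOT_LIKE_REGEX = /bot|agent|fetch|crawl|spider|search/i;

function detect(headers: HeaderBag): DetectionResult {
  const ua = (headers.get('user-agent') ?? '').toLowerCase();
  const sig = (headers.get('signature-agent') ?? '').toLowerCase();

  // Layer 1: known AI user-agent substring.
  if (AI_UA_PATTERNS.some((p) => ua.includes(p))) {
    return { detected: true, method: 'ua-match' };
  }
  // Layer 2: Signature-Agent names a known AI service (value matched, not verified).
  if (SIGNATURE_AGENT_DOMAINS.some((d) => sig.includes(d))) {
    return { detected: true, method: 'signature-agent' };
  }
  // Layer 3: bot-like UA, no sec-fetch-mode, and not a traditional crawler.
  const looksProgrammatic =
    !headers.get('sec-fetch-mode') && BOT_LIKE_REGEX.test(ua);
  if (looksProgrammatic && !TRADITIONAL_BOT_PATTERNS.some((p) => ua.includes(p))) {
    return { detected: true, method: 'heuristic' };
  }
  return { detected: false, method: null };
}

// Hypothetical helper: wraps a plain object whose keys are already lowercase.
function headersOf(raw: Record<string, string>): HeaderBag {
  return { get: (name) => raw[name.toLowerCase()] ?? null };
}

// Sample requests (made-up header values), one per expected outcome.
const samples: Array<[string, Record<string, string>]> = [
  ['known-ua', { 'user-agent': 'Mozilla/5.0 (compatible; ClaudeBot/1.0)' }],
  ['signed', {
    'user-agent': 'Mozilla/5.0 (Macintosh) Chrome/120.0',
    'signature-agent': '"https://chatgpt.com"',
    'sec-fetch-mode': 'navigate',
  }],
  ['unknown-fetcher', { 'user-agent': 'ExampleFetcher/1.0' }],
  ['googlebot', { 'user-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1)' }],
  ['browser', {
    'user-agent': 'Mozilla/5.0 (Macintosh) Chrome/120.0',
    'sec-fetch-mode': 'navigate',
  }],
];

const outcomes = samples.map(
  ([label, raw]) => `${label}: ${detect(headersOf(raw)).method ?? 'none'}`,
);
console.log(outcomes.join('\n'));
```

Two things stand out in this sketch. The Googlebot sample has a bot-like user agent and no `sec-fetch-mode`, yet it is not flagged because the exclusion list is checked before the heuristic fires. The Signature-Agent layer only compares the header value against a domain list, so any client that sets that header would be classified the same way, which fits the recall-over-precision stance stated in the file header.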