
Commit 22f198d

Add robots.txt and sitemaps reference page
1 parent a7e8459 commit 22f198d

2 files changed: +220 -0 lines changed

agents.md

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
# Agent Workflow Rules for cloudflare-docs

## Git Workflow Between Workstreams

### Critical Rules

1. **Always sync with main between workstreams**
   - Before starting new work, ensure you're on the latest main branch
   - Run `git checkout main && git pull origin main`

2. **Create clean branches for each workstream**
   - Each new workstream gets a fresh branch from main
   - Branch naming conventions:
     - Browser Rendering: `br-<descriptive-name>` (e.g., `br-update-playwright-docs`)
     - Zaraz: `zaraz-<descriptive-name>`
     - Google Tag Gateway: `gtg-<descriptive-name>`
     - General: Use descriptive names for other products

3. **Ensure PRs only contain relevant work**
   - Each PR should only include changes from its specific workstream
   - No leftover files or changes from previous workstreams
### Standard Workflow

#### Starting a New Workstream

```bash
# 1. Switch to main and update
git checkout main
git pull origin main

# 2. Create new branch for the workstream
git checkout -b <descriptive-branch-name>
```

#### During Work

```bash
# Stage and commit changes as you work
git add <files>
git commit -m "descriptive message"
```

#### Finishing a Workstream

```bash
# 1. Push branch
git push origin <branch-name>

# 2. Create PR (via GitHub UI or CLI)

# 3. After PR is merged, clean up
git checkout main
git pull origin main
git branch -d <branch-name>
```
#### Between Workstreams Checklist

- [ ] Current work is committed and pushed
- [ ] PR is created for current workstream
- [ ] Switched back to main: `git checkout main`
- [ ] Pulled latest changes: `git pull origin main`
- [ ] Ready to create new branch for next workstream
## Cloudflare Docs Specific Rules

### Changelog Locations

1. **Product-specific release notes** (routine updates): `src/content/release-notes/*.yaml`
   - Use for: version bumps, bug fixes, minor features
   - Example: `src/content/release-notes/browser-rendering.yaml`

2. **Cloudflare-wide changelog** (major announcements): `src/content/changelog/<product>/*.mdx`
   - Use for: major features, GA announcements, significant updates
   - Example: `src/content/changelog/browser-rendering/`
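For orientation, a product release-notes entry is a YAML record in the file named after the product. The sketch below is illustrative only; the field names are assumptions, so mirror an existing file such as `src/content/release-notes/browser-rendering.yaml` rather than copying this shape verbatim.

```yaml
# Illustrative sketch only - field names are assumptions; copy the structure
# of an existing file in src/content/release-notes/ instead.
---
link: "/browser-rendering/changelog/"
productName: Browser Rendering
entries:
  - publish_date: "2025-01-15"
    description: |-
      Fixed a session timeout issue and bumped the bundled Playwright version.
```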
### Content Guidelines

- Follow all rules in `.windsurf/rules/general-rules.md`
- Use absolute paths for links (e.g., `/1.1.1.1/check/`), not full URLs
- Always include a trailing slash for links without anchors
- Import components at the top of the file, below the frontmatter
- No contractions, exclamation marks, or non-standard quotes
### Common Components

- `DashButton` - Use in place of `https://dash.cloudflare.com` links in steps
- `APIRequest` - Use in place of `sh` blocks that contain API requests
- `FileTree` - Use in place of `txt` file tree blocks
- `PackageManagers` - Use in place of `sh` blocks that run npm commands
- `TypeScriptExample` - Use in place of `ts`/`typescript` code blocks (except in tutorials)
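As a minimal sketch of the import rule above, component imports in an `.mdx` page sit directly below the frontmatter. The `~/components` import path is assumed from existing pages in the repo; verify component names and props against a page that already uses them.

```mdx
---
title: Example page
pcx_content_type: how-to
---

import { DashButton, TypeScriptExample } from "~/components";

{/* Component props vary; copy usage from an existing page that already
    uses the component you need. */}
```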
## Cross-Reference Reminders

When making changes to one part of the docs, review existing pages to see if there are other locations that should also be updated. Common scenarios:

- **Updating a value** (e.g., version number, ID, limit) — Search the docs for other references
- **Adding new functionality** — Check if related pages (FAQ, tutorials, reference) need updates
- **Changing behavior** — Update any pages that describe the old behavior
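For example, before changing a value, a plain text search over the content tree finds every page that references it:

```bash
# Find every page that references the bot detection ID before updating it
grep -rn "128292352" src/content/
```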
Example values that appear in multiple places:

| Value | Source of Truth | Also Referenced In |
| ----- | --------------- | ------------------ |
| Bot detection ID (`128292352`) | `/browser-rendering/reference/automatic-request-headers.mdx` | `/browser-rendering/faq.mdx` |
## Deployment Preference

When building full-stack web apps:

- Use a single Cloudflare Worker for the frontend + API
- Configure `[assets]` in `wrangler.toml` for static files
- Use Hono for API routes
- Deploy with a single `wrangler deploy` command
- Do not use Cloudflare Pages separately
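A minimal sketch of this setup, assuming static files live in `./public` and the Worker entry point is `src/index.ts` (both names are placeholders):

```toml
# wrangler.toml - one Worker serves both static assets and the API
name = "my-app"
main = "src/index.ts"
compatibility_date = "2025-01-15"

[assets]
directory = "./public"
```

```ts
// src/index.ts - Hono handles API routes; requests that match a file in the
// [assets] directory are served as static assets before reaching the Worker
import { Hono } from "hono";

const app = new Hono();

app.get("/api/hello", (c) => c.json({ message: "Hello from the Worker" }));

export default app;
```

Running `wrangler deploy` from the project root then publishes the assets and the API together as one Worker.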
Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
---
title: robots.txt and sitemaps
pcx_content_type: reference
sidebar:
  order: 5
---

This page provides general guidance on configuring `robots.txt` and sitemaps for websites you plan to access with Browser Rendering.
## User-Agent

Browser Rendering uses a standard browser User-Agent by default:

```txt
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36
```

This means `robots.txt` rules targeting `User-agent: *` will apply to Browser Rendering requests. You can customize the User-Agent using the `userAgent` parameter in your API request.
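For example, a REST API request that overrides the default User-Agent could look like the sketch below. The `/content` endpoint, placeholder account ID, and token are assumptions; confirm the endpoint and the `userAgent` field against the REST API reference.

```bash
curl "https://api.cloudflare.com/client/v4/accounts/<ACCOUNT_ID>/browser-rendering/content" \
  --header "Authorization: Bearer <API_TOKEN>" \
  --header "Content-Type: application/json" \
  --data '{
    "url": "https://example.com/",
    "userAgent": "MyCrawler/1.0 (+https://example.com/bot-info)"
  }'
```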
## Identifying Browser Rendering requests

While Browser Rendering uses a standard browser User-Agent, requests can be identified by the [automatic headers](/browser-rendering/reference/automatic-request-headers/) that Cloudflare attaches:

- `cf-brapi-request-id` — Unique identifier for REST API requests
- `Signature-agent` — Points to Cloudflare's bot verification keys

For Cloudflare security products, Browser Rendering has a bot detection ID of `128292352`. Use this to create WAF rules that allow or block Browser Rendering traffic.
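For example, a WAF custom rule expression along these lines would match Browser Rendering traffic. Treat the exact field and syntax as an assumption and confirm it against the WAF custom rules documentation and the FAQ linked under Related resources:

```txt
any(cf.bot_management.detection_ids[*] eq 128292352)
```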
## Best practices for robots.txt

A well-configured `robots.txt` helps crawlers understand which parts of your site they can access.

### Reference your sitemap

Include a reference to your sitemap in `robots.txt` so crawlers can discover your URLs:

```txt title="robots.txt"
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```
You can list multiple sitemaps:

```txt title="robots.txt"
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog-sitemap.xml
```
### Set a crawl delay

Use `Crawl-delay` to control how frequently crawlers request pages from your server:

```txt title="robots.txt"
User-agent: *
Crawl-delay: 2
Allow: /

Sitemap: https://example.com/sitemap.xml
```

The value is in seconds. A `Crawl-delay` of 2 means the crawler waits 2 seconds between requests.
## Best practices for sitemaps

Structure your sitemap to help crawlers process your site efficiently:

```xml title="sitemap.xml"
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/important-page</loc>
    <lastmod>2025-01-15</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/other-page</loc>
    <lastmod>2025-01-10</lastmod>
    <priority>0.5</priority>
  </url>
</urlset>
```
| Attribute    | Purpose                       | Recommendation                                                                     |
| ------------ | ----------------------------- | ---------------------------------------------------------------------------------- |
| `<loc>`      | URL of the page               | Required. Use full URLs.                                                            |
| `<lastmod>`  | Last modification date        | Include to help the crawler identify updated content.                               |
| `<priority>` | Relative importance (0.0-1.0) | Set higher values for important pages. Higher-priority pages are processed first.   |
### Recommendations

- **Include `<lastmod>`** on all URLs to help identify which pages have changed.
- **Set `<priority>`** to control processing order. Pages with higher priority are processed first.
- **Use sitemap index files** for large sites with multiple sitemaps (see the example below).
- **Compress large sitemaps** using `.gz` format to reduce bandwidth.
- **Keep sitemaps under 50 MB** and 50,000 URLs per file (standard sitemap limits).
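A sitemap index is itself a small XML file that lists the individual sitemaps, following the same sitemaps.org schema:

```xml title="sitemap-index.xml"
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap.xml</loc>
    <lastmod>2025-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/blog-sitemap.xml</loc>
    <lastmod>2025-01-10</lastmod>
  </sitemap>
</sitemapindex>
```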
## Related resources

- [/crawl endpoint](/browser-rendering/rest-api/crawl-endpoint/) — Automate crawling multiple pages
- [FAQ: Will Browser Rendering bypass Cloudflare's Bot Protection?](/browser-rendering/faq/#will-browser-rendering-bypass-cloudflares-bot-protection) — Instructions for creating a WAF skip rule
