Complete migration guide for switching from Firecrawl to Scrape.do. Covers endpoint mapping, parameter translation, feature gaps, and compensation strategies with working code examples.
Before diving into code, understand the fundamental difference: Firecrawl is an AI/LLM-focused data pipeline. It abstracts away the web entirely and returns clean, structured content. Its /crawl, /map, /extract, and /agent endpoints assume you want processed, ready-to-use data — often to feed directly into a language model.
Scrape.do is a traditional scraping API. It returns raw HTML or markdown for any URL, handles anti-bot systems, rotates proxies, and renders JavaScript, but it stops there. Data structuring is your responsibility.
If your Firecrawl usage centered on /scrape with formats: ["markdown"] or formats: ["html"], migration is nearly one-to-one. If you relied heavily on /crawl (full-site spidering), /extract (AI-structured JSON), or the /agent endpoint (autonomous data gathering), you will need to build compensation layers — all of which are covered in this guide.
The tradeoff: you lose Firecrawl's higher-level AI abstractions, and you gain significantly lower per-request costs, 95M+ residential/mobile IPs, precise geo-targeting, structured Amazon and Google SERP APIs, and an async batch system with its own concurrency pool.
Firecrawl (before):
import requests

response = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={"url": "https://example.com", "formats": ["markdown"]}
)
markdown = response.json()["data"]["markdown"]

Scrape.do (after):

import requests

response = requests.get(
    "https://api.scrape.do",
    params={
        "token": "SDO_TOKEN",
        "url": "https://example.com",
        "output": "markdown"
    }
)
markdown = response.text

Key differences at a glance:
| | Firecrawl | Scrape.do |
|---|---|---|
| Base URL | https://api.firecrawl.dev/v2/ | https://api.scrape.do |
| Auth | Authorization: Bearer KEY header | token query parameter |
| Method | POST with JSON body | GET with query parameters |
| Default output | Markdown (LLM-ready) | Raw HTML |
| Markdown output | formats: ["markdown"] | output=markdown |
| JS rendering | Always on (proxy auto-routes) | render=true (off by default) |
| Residential proxy | Built-in (proxy: enhanced) | super=true |
| Geo-targeting | location.country in JSON body | geoCode=us query param |

| Firecrawl | Scrape.do |
|---|---|
| Authorization: Bearer FC_API_KEY HTTP header | token=SDO_TOKEN query parameter |
| API key from firecrawl.dev/app dashboard | Token from dashboard.scrape.do |
| Required on all requests | Required on all requests |
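
As a rough illustration of the auth change, the sketch below builds a Scrape.do request URL that carries the token and a percent-encoded target URL, using only the standard library. The helper name `build_sdo_url` is hypothetical; the base URL and the `token`/`url` parameters are the ones listed in the tables above.

```python
from urllib.parse import urlencode

SDO_BASE = "https://api.scrape.do/"


def build_sdo_url(token: str, target_url: str, **options: str) -> str:
    """Return a GET URL for Scrape.do with every value percent-encoded.

    Hypothetical helper: extra keyword arguments are passed through as
    additional query parameters (for example output="markdown").
    """
    params = {"token": token, "url": target_url, **options}
    return f"{SDO_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_sdo_url("SDO_TOKEN", "https://example.com"))
    print(build_sdo_url("SDO_TOKEN", "https://example.com", output="markdown"))
```

Because the token travels in the query string rather than in a header, it is worth keeping such URLs out of logs.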
Firecrawl:
curl -X POST "https://api.firecrawl.dev/v2/scrape" \
  -H "Authorization: Bearer FC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Scrape.do:

curl "https://api.scrape.do/?token=SDO_TOKEN&url=https%3A%2F%2Fexample.com"

Firecrawl exposes multiple specialized endpoints. Scrape.do uses a single endpoint for page fetching, with separate plugin URLs for structured data.
| Firecrawl Endpoint | Purpose | Scrape.do Equivalent |
|---|---|---|
| POST /v2/scrape | Scrape a single URL | GET api.scrape.do/?token=T&url=U |
| POST /v2/crawl | Spider entire site, return all pages | No direct equivalent; build a crawler loop (see section below) |
| POST /v2/map | Return all URLs on a domain | No direct equivalent; fetch sitemap.xml (see section below) |
| POST /v2/extract (legacy /v1/extract still works) | AI-structured JSON from URL(s) | No direct equivalent; use output=markdown then call an LLM (see section below) |
| POST /v2/search | Web search + scrape results | Use SDO's Google Scraper API (/plugin/google/search), which also covers Maps, Shopping, Flights, Hotels, News, and Trends (10 credits each) |
| POST /v2/batch/scrape | Scrape multiple URLs as async job | SDO Async API (https://q.scrape.do/api/v1/jobs) |

| Firecrawl (POST body) | Scrape.do (query param) | Notes |
|---|---|---|
| url | url | Must be URL-encoded when passed as a Scrape.do query parameter |
| formats: ["html"] | (default behavior) | Firecrawl html is processed/cleaned HTML; SDO default returns the target's raw HTML |
| formats: ["rawHtml"] | (default behavior) | SDO already returns the target's unprocessed HTML by default. (transparentResponse=true is unrelated; it only changes how status codes are reported.) |
| formats: ["markdown"] | output=markdown | Both return clean markdown text |
| formats: ["screenshot"] | render=true + screenShot=true + returnJSON=true | All three params required (verified via live API). SDO returns base64 in screenShots[0].image. Firecrawl returns an HTTPS URL to a PNG (not base64). |
| formats: ["links"] | (parse from HTML response) | No direct param; extract links from output=markdown or HTML |
| timeout (ms) | timeout (ms) | Same unit. SDO default: 60000. SDO max: 120000. |
| waitFor (ms) | customWait (ms) | Fixed delay after page load. Same concept, different name. |
| headers | extraHeaders=true + Sd- prefix headers | See Headers section |
| mobile | device=mobile | Renders as mobile browser |
| location.country | geoCode=us | ISO country code, lowercase in SDO |
| location.languages | (no direct equivalent) | Use extraHeaders=true with Sd-Accept-Language header |
| actions (Playwright-like) | playWithBrowser (JSON action array) | See Browser Actions section |
| proxy: "basic" | (default) | Datacenter proxy |
| proxy: "enhanced" | super=true | Residential/mobile proxy |
| onlyMainContent | (no direct equivalent) | Use output=markdown; SDO markdown omits boilerplate naturally |
| blockAds | (no direct equivalent) | Closest option: blockResources=true, which blocks images/CSS/fonts |
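
The mapping table lends itself to a small translation function. The sketch below is one possible shape for it, not an official client: `firecrawl_to_sdo_params` is a hypothetical name, it covers only the simple one-to-one fields from the table (markdown output, timeout, waitFor, mobile, location.country, proxy), and it assumes you want `render=true` by default because Firecrawl always renders.

```python
def firecrawl_to_sdo_params(body: dict, token: str, render: bool = True) -> dict:
    """Translate a Firecrawl /scrape JSON body into Scrape.do query params.

    Hypothetical helper. Headers, actions, screenshots and other formats
    are intentionally left out and need separate handling.
    """
    params = {"token": token, "url": body["url"]}
    if render:
        # Firecrawl renders every page; Scrape.do only does so on request.
        params["render"] = "true"
    # Firecrawl's default output is markdown, Scrape.do's is raw HTML.
    if "markdown" in body.get("formats", ["markdown"]):
        params["output"] = "markdown"
    if "timeout" in body:
        params["timeout"] = str(body["timeout"])      # same unit (ms)
    if "waitFor" in body:
        params["customWait"] = str(body["waitFor"])   # fixed delay (ms)
    if body.get("mobile"):
        params["device"] = "mobile"
    country = body.get("location", {}).get("country")
    if country:
        params["geoCode"] = country.lower()           # lowercase ISO code
    if body.get("proxy") == "enhanced":
        params["super"] = "true"                      # residential/mobile pool
    return params


if __name__ == "__main__":
    body = {"url": "https://example.de", "proxy": "enhanced",
            "location": {"country": "DE"}}
    print(firecrawl_to_sdo_params(body, "SDO_TOKEN"))
```

The returned dict can be passed directly as `params=` to `requests.get("https://api.scrape.do", ...)`.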
Firecrawl routes all requests through proxies by default — you just pick the tier. Scrape.do defaults to datacenter proxies; residential requires super=true.
| Firecrawl | Scrape.do | Notes |
|---|---|---|
| Default (proxy: "auto"), tries basic and escalates if needed | Default (datacenter) | SDO datacenter pool is also rotating and anti-bot-capable |
| proxy: "basic" (fast, 28 countries) | (default) | SDO datacenter: 150+ countries |
| proxy: "enhanced" (residential, US and DK only) | super=true | SDO residential/mobile: 95M+ IPs, 150+ countries |
| location.country: "US" | geoCode=us | Lowercase ISO code in SDO. With datacenter proxy: requires Pro Plan or higher. With super=true: requires Business Plan or higher. |
| (no continent-level targeting) | regionalGeoCode=europe | SDO supports: europe, asia, africa, oceania, northamerica, southamerica. Requires super=true (Business+ plan). |
| (no sticky sessions) | sessionId=12345 | SDO maintains the same IP for up to 5 min of inactivity; range 0-1000000 |
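
The constraints in this table (lowercase country codes, the fixed list of regions, `super=true` for regional targeting, and the 0-1000000 session range) can be checked client-side before a request is sent. The following is a minimal sketch under those stated constraints; `proxy_params` is a hypothetical helper, and whether `geoCode` and `regionalGeoCode` may be combined is not covered by the table, so the helper does not validate that.

```python
from typing import Optional

REGIONS = {"europe", "asia", "africa", "oceania", "northamerica", "southamerica"}


def proxy_params(
    geo_code: Optional[str] = None,
    region: Optional[str] = None,
    session_id: Optional[int] = None,
    residential: bool = False,
) -> dict:
    """Build Scrape.do proxy-related query params (hypothetical helper)."""
    params = {}
    if region is not None:
        if region not in REGIONS:
            raise ValueError(f"unsupported region: {region!r}")
        params["regionalGeoCode"] = region
        residential = True  # regional targeting requires super=true
    if geo_code:
        params["geoCode"] = geo_code.lower()
    if session_id is not None:
        if not 0 <= session_id <= 1_000_000:
            raise ValueError("sessionId must be between 0 and 1000000")
        params["sessionId"] = str(session_id)
    if residential:
        params["super"] = "true"
    return params


if __name__ == "__main__":
    print(proxy_params(geo_code="DE", residential=True))
    print(proxy_params(region="europe", session_id=12345))
```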
Firecrawl — residential proxy from Germany:
response = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={
        "url": "https://example.de",
        "proxy": "enhanced",
        "location": {"country": "DE"}
    }
)

Scrape.do equivalent:

curl "https://api.scrape.do/?token=SDO_TOKEN&url=https%3A%2F%2Fexample.de&super=true&geoCode=de"

Firecrawl always uses a browser (all requests are rendered). In Scrape.do, rendering is opt-in via render=true. This is the most important behavioral difference for direct /scrape migrations.
| Firecrawl | Scrape.do | Notes |
|---|---|---|
| Browser always active | render=true | Add render=true to all requests that relied on Firecrawl's default JS execution |
| waitFor: 2000 | customWait=2000 | Millisecond wait after page load |
| actions: [{type: "wait", selector: ".loaded"}] | waitSelector=.loaded | CSS selector wait |
| (always waits for load) | waitUntil=networkidle0 | SDO options: domcontentloaded, networkidle0, networkidle2, load |
| mobile: true | device=mobile | Mobile browser emulation |
| (no viewport control) | width=390&height=844 | SDO allows explicit viewport size |
| formats: ["screenshot"] | render=true + screenShot=true + returnJSON=true | All three required. SDO returns base64 in screenShots[0].image. Firecrawl returns a temporary HTTPS URL to the PNG (not base64); fetch it separately if you need the bytes. |
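
Since Scrape.do hands back the screenshot as base64 inside the JSON body rather than as a hosted file, code that previously downloaded Firecrawl's PNG URL needs a decode step instead. A minimal sketch, assuming the `screenShots[0].image` field holds plain base64 as described in the table (the function name `screenshot_bytes` is hypothetical):

```python
import base64


def screenshot_bytes(payload: dict) -> bytes:
    """Decode the first screenshot of a returnJSON response into raw bytes."""
    shots = payload.get("screenShots") or []
    if not shots:
        raise ValueError(
            "no screenshots in response; check render, screenShot and returnJSON"
        )
    # Assumption: the field is plain base64 with no data-URI prefix.
    return base64.b64decode(shots[0]["image"])


if __name__ == "__main__":
    # Stand-in payload; a real one would come from response.json().
    fake = {"screenShots": [{"image": base64.b64encode(b"\x89PNG").decode()}]}
    png = screenshot_bytes(fake)
    with open("page.png", "wb") as fh:
        fh.write(png)
```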
Firecrawl (render + fixed 3-second wait):

response = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={
        "url": "https://spa-app.example.com",
        "waitFor": 3000,
        "formats": ["html"]
    }
)

Scrape.do equivalent:

curl "https://api.scrape.do/?token=SDO_TOKEN&url=https%3A%2F%2Fspa-app.example.com&render=true&customWait=3000"

Firecrawl actions use Playwright-like objects. Scrape.do playWithBrowser uses a similar JSON array of named action objects. The structure is close enough that most action sequences translate directly.
Firecrawl actions format (valid v2 action types verified via live API: click, wait, screenshot, write, press, scroll, scrape, executeJavascript, pdf):
[
  {"type": "click", "selector": "#accept-cookies"},
  {"type": "wait", "milliseconds": 1000},
  {"type": "scroll", "direction": "down", "amount": 500},
  {"type": "write", "text": "laptop"},
  {"type": "press", "key": "Enter"},
  {"type": "screenshot"}
]

Scrape.do playWithBrowser equivalent:

[
  {"Action": "Click", "Selector": "#accept-cookies"},
  {"Action": "Wait", "Timeout": 1000},
  {"Action": "ScrollY", "Value": 500},
  {"Action": "Fill", "Selector": "#search", "Value": "laptop"},
  {"Action": "Execute", "Execute": "document.querySelector('#search').dispatchEvent(new KeyboardEvent('keydown',{key:'Enter'}))"},
  {"Action": "ScreenShot"}
]

Note: Firecrawl write writes into the focused element (no selector field), and press sends a single key. SDO's Fill requires a selector and replaces the value, so the equivalent of "type into focused" is to first Click the field, then Fill. There is no SDO Press action; use Execute to dispatch a KeyboardEvent for single-key presses.

| Firecrawl action type | Scrape.do Action | Notes |
|---|---|---|
| click (requires selector) | Click | selector -> Selector |
| wait with milliseconds | Wait | milliseconds -> Timeout |
| wait with selector (CSS wait) | WaitSelector | {"Action":"WaitSelector","WaitSelector":"#btn","Timeout":5000}; max wait ~10000ms |
| scroll (direction: down, amount: N) | ScrollY | amount -> Value (pixels) |
| scroll (direction: right, amount: N) | ScrollX | amount -> Value |
| write (typing into focused field) | Fill (preceded by Click) | Firecrawl write requires text; SDO Fill requires both Selector and Value |
| press (single key, requires key) | Execute (dispatch KeyboardEvent) | No direct SDO equivalent |
| screenshot | ScreenShot | Requires returnJSON=true AND render=true on the request |
| executeJavascript (requires script) | Execute | script -> Execute (PascalCase field) |
| pdf | (no equivalent) | SDO does not have a "save current page as PDF" action |
| scrape (sub-fetch from inside actions) | (no equivalent) | Make a separate SDO request |
| (no equivalent) | WaitForRequestCompletion | Wait for a network request URL pattern to complete (SDO-only) |
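
For projects with many stored Firecrawl action lists, the table can be applied mechanically. The sketch below converts only the rows that map one-to-one and raises on the rest (write, press, pdf, scrape), since those need a selector or a redesign. `translate_actions` is a hypothetical name, and the 5000 ms default for `WaitSelector` is an assumption taken from the example in the table, not a documented default.

```python
import json


def translate_actions(fc_actions: list) -> list:
    """Convert Firecrawl actions to Scrape.do playWithBrowser actions."""
    out = []
    for action in fc_actions:
        kind = action["type"]
        if kind == "click":
            out.append({"Action": "Click", "Selector": action["selector"]})
        elif kind == "wait" and "selector" in action:
            out.append({
                "Action": "WaitSelector",
                "WaitSelector": action["selector"],
                "Timeout": action.get("milliseconds", 5000),  # assumed default
            })
        elif kind == "wait":
            out.append({"Action": "Wait", "Timeout": action["milliseconds"]})
        elif kind == "scroll" and action.get("direction") == "down":
            out.append({"Action": "ScrollY", "Value": action["amount"]})
        elif kind == "scroll" and action.get("direction") == "right":
            out.append({"Action": "ScrollX", "Value": action["amount"]})
        elif kind == "executeJavascript":
            out.append({"Action": "Execute", "Execute": action["script"]})
        elif kind == "screenshot":
            out.append({"Action": "ScreenShot"})
        else:
            # write, press, pdf, scrape and other scroll directions
            # have no one-to-one mapping in the table.
            raise ValueError(f"no direct Scrape.do mapping for action: {action!r}")
    return out


if __name__ == "__main__":
    converted = translate_actions([
        {"type": "click", "selector": "#accept-cookies"},
        {"type": "wait", "milliseconds": 1000},
        {"type": "scroll", "direction": "down", "amount": 500},
    ])
    # The JSON string is what goes into the playWithBrowser query parameter.
    print(json.dumps(converted))
```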
Full Python example with browser actions:
import json
import requests

actions = [
    {"Action": "Click", "Selector": "#cookie-accept"},
    {"Action": "Wait", "Timeout": 500},
    {"Action": "ScrollY", "Value": 2000},
    {"Action": "WaitSelector", "WaitSelector": ".product-grid", "Timeout": 5000}
]
response = requests.get(
    "https://api.scrape.do",
    params={
        "token": "SDO_TOKEN",
        "url": "https://shop.example.com/category",
        "render": "true",
        "playWithBrowser": json.dumps(actions),
        "output": "markdown"
    }
)
print(response.text)

Firecrawl accepts a headers object in the POST body. Scrape.do uses HTTP headers with a prefix system.

| Firecrawl | Scrape.do | Notes |
|---|---|---|
| headers: {"Authorization": "Bearer T"} in body | extraHeaders=true + Sd-Authorization: Bearer T HTTP header | Sd- prefix: add/override headers on top of SDO defaults |
| (full header control) | customHeaders=true | Replace ALL headers with your own |
| (no equivalent) | forwardHeaders=true | Forward your request headers as-is to the target |
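
When an existing codebase already holds Firecrawl-style header dicts, the `Sd-` prefix can be applied in one place. A minimal sketch (the helper name `to_sd_headers` is hypothetical; it is meant to be used together with `extraHeaders=true`):

```python
def to_sd_headers(headers: dict) -> dict:
    """Prefix each header name with "Sd-" for use with extraHeaders=true."""
    return {f"Sd-{name}": value for name, value in headers.items()}


if __name__ == "__main__":
    firecrawl_headers = {
        "Authorization": "Bearer my-site-token",
        "X-Custom-Header": "value123",
    }
    print(to_sd_headers(firecrawl_headers))
```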
Firecrawl:
response = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={
        "url": "https://api.example.com/data",
        "headers": {
            "Authorization": "Bearer my-site-token",
            "X-Custom-Header": "value123"
        }
    }
)

Scrape.do:

response = requests.get(
    "https://api.scrape.do",
    params={
        "token": "SDO_TOKEN",
        "url": "https://api.example.com/data",
        "extraHeaders": "true"
    },
    headers={
        "Sd-Authorization": "Bearer my-site-token",
        "Sd-X-Custom-Header": "value123"
    }
)

| Firecrawl | Scrape.do | Notes |
|---|---|---|
| No built-in cookie parameter | setCookies=name=value; name2=value2 | URL-encode the cookie string |
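
The cookie string format in the table is easy to get subtly wrong by hand, so here is a small sketch that serializes a dict into the `name=value; name2=value2` form and percent-encodes it for use in a hand-built URL. The helper name `set_cookies_value` is hypothetical.

```python
from urllib.parse import quote


def set_cookies_value(cookies: dict) -> str:
    """Serialize cookies as 'name=value; name2=value2' (unencoded)."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


if __name__ == "__main__":
    raw = set_cookies_value({"session": "abc123", "token": "xyz789"})
    print(raw)
    # Percent-encode everything when placing the value in a URL by hand.
    print(quote(raw, safe=""))
```

When the value is passed through the `params=` argument of an HTTP client, the client normally performs the encoding itself, so only the unencoded string is needed there.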
Scrape.do with cookies:
curl "https://api.scrape.do/?token=SDO_TOKEN&url=https%3A%2F%2Fexample.com&setCookies=session%3Dabc123%3B%20token%3Dxyz789"

Firecrawl's primary value is returning multiple formats in one call. Scrape.do returns one format per request.

| Firecrawl formats value | Scrape.do equivalent |
|---|---|
| "markdown" | output=markdown (only effective when target content-type is text/html; PDFs/binary content are not converted) |
| "html" | Default response (no param needed) |
| "rawHtml" | Default response (no param needed); transparentResponse=true only changes how status codes are reported |
| "screenshot" | render=true + screenShot=true + returnJSON=true |
| "links" | Parse from the markdown or HTML response |
| "json" (LLM extraction) | Fetch markdown, then call LLM yourself (see Extract section) |
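
Because Scrape.do returns one format per request, a Firecrawl `formats` list has to be split into separate calls. The sketch below plans those calls; it follows the parameter mapping given earlier in this guide, where both "html" and "rawHtml" are served by the default response. `plan_requests` is a hypothetical helper and it deliberately rejects "json", which needs the LLM step rather than a request parameter.

```python
def plan_requests(formats: list) -> list:
    """Return one dict of extra Scrape.do params per request that is needed."""
    plans = []
    html_planned = False
    for fmt in formats:
        if fmt == "markdown":
            plans.append({"output": "markdown"})
        elif fmt in ("html", "rawHtml", "links"):
            # All three come from the default raw-HTML response,
            # so a single plain request covers them.
            if not html_planned:
                plans.append({})
                html_planned = True
        elif fmt == "screenshot":
            plans.append(
                {"render": "true", "screenShot": "true", "returnJSON": "true"}
            )
        else:
            raise ValueError(f"format {fmt!r} has no request-level equivalent")
    return plans


if __name__ == "__main__":
    for extra in plan_requests(["markdown", "html", "screenshot"]):
        print(extra)
```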
Firecrawl — multiple formats:
response = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={"url": "https://example.com", "formats": ["markdown", "html", "screenshot"]}
)
data = response.json()["data"]
markdown = data["markdown"]
html = data["html"]
screenshot_url = data["screenshot"]  # Note: HTTPS URL to a hosted PNG, NOT base64.
# To get bytes: png_bytes = requests.get(screenshot_url).content

Scrape.do, screenshot + markdown (two requests):

import requests

params_base = {"token": "SDO_TOKEN", "url": "https://example.com", "render": "true"}
# Get markdown
markdown = requests.get("https://api.scrape.do", params={**params_base, "output": "markdown"}).text
# Get screenshot
screenshot_resp = requests.get("https://api.scrape.do", params={**params_base, "screenShot": "true", "returnJSON": "true"})
screenshot_base64 = screenshot_resp.json()["screenShots"][0]["image"]

Firecrawl's /v2/batch/scrape runs multiple URLs asynchronously. Scrape.do has a dedicated Async API at https://q.scrape.do with its own concurrency pool (30% of your plan limit, separate from the main API pool, so it does not reduce your main concurrency).
SDO Async API now supports a Plugin mode that batches up to 1000 structured-data params per job (Amazon, Google search/maps/shopping/flights/hotels/news/trends, plus walmart/store and lowes/store). See async-api/plugins.
Firecrawl batch scrape:
import requests, time

# Submit batch
resp = requests.post(
    "https://api.firecrawl.dev/v2/batch/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={"urls": ["https://example.com/page1", "https://example.com/page2"], "formats": ["markdown"]}
)
batch_id = resp.json()["id"]
# Poll until done
while True:
    status = requests.get(
        f"https://api.firecrawl.dev/v2/batch/scrape/{batch_id}",
        headers={"Authorization": "Bearer FC_API_KEY"}
    ).json()
    if status["status"] == "completed":
        break
    time.sleep(3)
pages = status["data"]

Scrape.do Async API equivalent:

import requests, time

ASYNC_BASE = "https://q.scrape.do/api/v1"
HEADERS = {"X-Token": "SDO_TOKEN", "Content-Type": "application/json"}
# Submit job (raw HTML by default; per-target options like markdown are not exposed in the async body)
resp = requests.post(f"{ASYNC_BASE}/jobs", headers=HEADERS, json={
    "Targets": ["https://example.com/page1", "https://example.com/page2"],
    "Super": False,
    "GeoCode": "us"
})
job = resp.json()
job_id = job["JobID"]
task_ids = job["TaskIDs"]
# Poll for completion
while True:
    status = requests.get(f"{ASYNC_BASE}/jobs/{job_id}", headers=HEADERS).json()
    if status["Status"] in ("success", "error", "canceled"):
        break
    time.sleep(2)
# Retrieve results per task
for task_id in task_ids:
    result = requests.get(f"{ASYNC_BASE}/jobs/{job_id}/{task_id}", headers=HEADERS).json()
    print(result["Content"])

Async API with webhook (production pattern):

# Submit once; results are delivered to your server when ready
resp = requests.post(f"{ASYNC_BASE}/jobs", headers=HEADERS, json={
    "Targets": ["https://example.com/page1", "https://example.com/page2"],
    "WebhookURL": "https://your-server.com/webhook/sdo",
    "WebhookHeaders": {"Authorization": "Bearer your-webhook-secret"}
})
print("Job ID:", resp.json()["JobID"])
# No polling needed: your webhook endpoint receives results automatically

These capabilities are unavailable in Firecrawl or significantly superior in Scrape.do.

| Feature | Parameter | What you get | Docs |
|---|---|---|---|
| Residential/mobile proxies (95M+ IPs) | super=true | Access to residential and mobile IP pool across 150+ countries. Firecrawl's enhanced proxy is US/DK only. | docs |
| Geo-targeting: 150+ countries | geoCode=us | Country-level proxy routing. Firecrawl basic supports 28 countries; enhanced supports US and DK only. | docs |
| Geo-targeting: continent-level | regionalGeoCode=europe | Route via entire continent (super proxy only). Values: europe, asia, africa, oceania, northamerica, southamerica. No Firecrawl equivalent. | docs |
| Sticky sessions (same IP) | sessionId=12345 | Maintain same IP for multi-step flows. Same integer = same IP for up to 5 min inactivity. Firecrawl has no sticky IP. | docs |
| Postal/ZIP-code targeting | postalcode=10001 (or zipcode=) | Target a specific postal/ZIP code within a country. Requires super=true AND geoCode. Supported in 12 countries: us, gb, de, fr, ca, au, in, nl, it, es, br, jp. Send codes without spaces (e.g. SW1A1AA not SW1A 1AA). | docs |
| Google Scraper API | GET /plugin/google/search?q=... | Pre-parsed SERP JSON: organic results, ads, knowledge graph, local pack, AI Overview, and 10+ more types. 84 Google domains, 150+ languages, 240+ country codes. Now part of the broader Google Scraper API, which also covers Maps, Shopping, Flights, Hotels, News, and Trends (10 credits each). | docs |
| Google AI Mode | GET /plugin/google/search/ai-mode?q=... | Google's full conversational AI response with references and shopping results as structured JSON. 10 credits. | docs |
| Google Maps | GET /plugin/google/maps/search?q=... | Structured Maps places list with location pinning via ll=@lat,lng,zoom. Also /place and /reviews for place details and review pagination, a capability with no Firecrawl equivalent. 10 credits. | docs |
| Amazon Scraper API | GET /plugin/amazon/pdp?asin=... | Structured product data (ASIN, title, price, ratings, images, specs) from 21 Amazon marketplaces with ZIP-code geo-targeting. 1 credit per request; 1 concurrent request per token. | docs |
| Amazon offer listing | GET /plugin/amazon/offer-listing?asin=... | All seller offers with prices and shipping info for a given ASIN. | docs |
| Async API with separate concurrency pool | https://q.scrape.do | Job-based API with its own 30% concurrency pool; it runs independently from your main API and doesn't cut into your main quota. | docs |
| Device emulation | device=mobile, tablet, or desktop | Render as a specific device type. Controls both User-Agent and viewport. | docs |
| Viewport control | width=390&height=844 | Set exact browser viewport dimensions. | docs |
| Full-page screenshots | fullScreenShot=true | Capture the entire page (not just the viewport). | docs |
| Partial screenshots | particularScreenShot=.selector | Screenshot of a specific CSS selector element. | docs |
| Network idle waiting | waitUntil=networkidle0 | Wait until all network requests are finished before returning. Options: domcontentloaded, networkidle0, networkidle2, load. | docs |
| WebSocket capture | showWebsocketRequests=true | Capture WebSocket frames alongside XHR/Fetch in the JSON response. Requires render=true + returnJSON=true. | docs |
| Pure cookies | pureCookies=true | Returns raw Set-Cookie headers from target unmodified. | docs |
| Retry control | retryTimeout=15000, disableRetry=true | Configure or disable built-in retry mechanism. | docs |
| Disable redirect following | disableRedirection=true | Returns the raw 3xx response without following. | docs |
| WaitForRequestCompletion | playWithBrowser action | Wait until a specific network URL pattern completes; ideal for dynamically loaded data. | docs |
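
Postal-code targeting has the most preconditions of any row in this table (residential proxy, a country code from a fixed list, and a code without spaces), so it is a good candidate for a small guard function. A sketch under those stated rules; `postal_params` is a hypothetical helper name.

```python
POSTAL_COUNTRIES = {
    "us", "gb", "de", "fr", "ca", "au", "in", "nl", "it", "es", "br", "jp",
}


def postal_params(country: str, postal_code: str) -> dict:
    """Build Scrape.do params for postal/ZIP-code targeting."""
    geo = country.lower()
    if geo not in POSTAL_COUNTRIES:
        raise ValueError(f"postal targeting is not listed for country {geo!r}")
    return {
        "super": "true",                              # required
        "geoCode": geo,                               # required
        "postalcode": "".join(postal_code.split()),   # strip all whitespace
    }


if __name__ == "__main__":
    print(postal_params("GB", "SW1A 1AA"))
    print(postal_params("us", "10001"))
```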
Firecrawl has several AI/pipeline features with no direct Scrape.do counterpart. Each section below describes the gap and provides a working compensation strategy.
What /crawl does in Firecrawl: You submit a domain, and Firecrawl recursively follows all internal links, renders each page, and returns the full site as an array of markdown documents. Pagination and link discovery are handled automatically. 1 credit per page.
SDO equivalent: None. Scrape.do is a per-URL API; it fetches one URL at a time.
How to compensate: Build a crawl loop yourself. The pattern: seed URL -> extract links -> deduplicate -> queue -> fetch each URL via SDO. The example below is a working starting point that uses asyncio and aiohttp to fetch pages concurrently through SDO.
"""
sdo_crawler.py: async site crawler using Scrape.do
Usage:
    python sdo_crawler.py --url https://example.com --max-pages 100 --concurrency 5
"""
import argparse
import asyncio
import json
import re
from collections import deque
from urllib.parse import urljoin, urlparse

import aiohttp

SDO_TOKEN = "SDO_TOKEN"
SDO_BASE = "https://api.scrape.do"


def normalize_url(url: str) -> str:
    """Remove fragment and trailing slash inconsistencies."""
    parsed = urlparse(url)
    normalized = parsed._replace(fragment="").geturl()
    return normalized.rstrip("/")


def extract_links(base_url: str, html_or_markdown: str) -> list[str]:
    """Extract all internal links from HTML or markdown content."""
    base = urlparse(base_url)
    links = set()
    # Match href attributes in HTML
    href_pattern = re.compile(r'href=["\']([^"\'#][^"\']*)["\']', re.IGNORECASE)
    # Match markdown links [text](url), absolute or relative
    md_pattern = re.compile(r'\[[^\]]*\]\(([^)\s#][^)\s]*)')
    for pattern in (href_pattern, md_pattern):
        for match in pattern.finditer(html_or_markdown):
            # Resolve relative targets against the page URL
            url = urljoin(base_url, match.group(1))
            parsed = urlparse(url)
            if parsed.netloc == base.netloc and parsed.scheme in ("http", "https"):
                links.add(normalize_url(url))
    return list(links)


async def fetch_page(
    session: aiohttp.ClientSession,
    url: str,
    render: bool = True,
    super_proxy: bool = False,
    geo_code: str = None,
) -> dict:
    """Fetch a single URL via Scrape.do and return status + content."""
    params = {
        "token": SDO_TOKEN,
        "url": url,
        "output": "markdown",
    }
    if render:
        params["render"] = "true"
        params["waitUntil"] = "networkidle2"
    if super_proxy:
        params["super"] = "true"
    if geo_code:
        params["geoCode"] = geo_code
    try:
        async with session.get(SDO_BASE, params=params, timeout=aiohttp.ClientTimeout(total=90)) as resp:
            content = await resp.text()
            return {
                "url": url,
                "status": resp.status,
                "content": content if resp.status == 200 else "",
                "error": None if resp.status == 200 else f"HTTP {resp.status}",
            }
    except asyncio.TimeoutError:
        return {"url": url, "status": None, "content": "", "error": "timeout"}
    except Exception as e:
        return {"url": url, "status": None, "content": "", "error": str(e)}


async def crawl(
    seed_url: str,
    max_pages: int = 100,
    concurrency: int = 5,
    render: bool = True,
    super_proxy: bool = False,
    geo_code: str = None,
) -> list[dict]:
    """
    Crawl a website starting from seed_url.
    Returns a list of dicts: {url, status, content, error}
    """
    seed_url = normalize_url(seed_url)
    base_domain = urlparse(seed_url).netloc
    visited = set()
    queue = deque([seed_url])
    results = []
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded_fetch(session, url):
        async with semaphore:
            return await fetch_page(session, url, render, super_proxy, geo_code)

    connector = aiohttp.TCPConnector(limit=concurrency * 2)
    async with aiohttp.ClientSession(connector=connector) as session:
        while queue and len(visited) < max_pages:
            # Drain current batch
            batch = []
            while queue and len(batch) < concurrency and len(visited) + len(batch) < max_pages:
                url = queue.popleft()
                if url not in visited:
                    visited.add(url)
                    batch.append(url)
            if not batch:
                break
            print(f"Fetching batch of {len(batch)} URLs | Total visited: {len(visited)}/{max_pages}")
            tasks = [bounded_fetch(session, url) for url in batch]
            batch_results = await asyncio.gather(*tasks)
            for result in batch_results:
                results.append(result)
                if result["content"]:
                    # Discover new links and add to queue
                    new_links = extract_links(result["url"], result["content"])
                    for link in new_links:
                        if link not in visited and urlparse(link).netloc == base_domain:
                            queue.append(link)
    print(f"Crawl complete. {len(results)} pages fetched.")
    return results


# --- Main ---
async def main():
    parser = argparse.ArgumentParser(description="Crawl a site using Scrape.do")
    parser.add_argument("--url", required=True, help="Seed URL to start crawling from")
    parser.add_argument("--max-pages", type=int, default=100, help="Max pages to crawl")
    parser.add_argument("--concurrency", type=int, default=5, help="Concurrent requests")
    parser.add_argument("--no-render", action="store_true", help="Disable JS rendering (faster, cheaper)")
    parser.add_argument("--super", action="store_true", help="Use residential proxies")
    parser.add_argument("--geo", default=None, help="Geo code, e.g. us, de, gb")
    parser.add_argument("--output", default="crawl_results.json", help="Output JSON file")
    args = parser.parse_args()
    pages = await crawl(
        seed_url=args.url,
        max_pages=args.max_pages,
        concurrency=args.concurrency,
        render=not args.no_render,
        super_proxy=args.super,
        geo_code=args.geo,
    )
    with open(args.output, "w", encoding="utf-8") as f:
        json.dump(pages, f, ensure_ascii=False, indent=2)
    print(f"Results saved to {args.output}")
    success = sum(1 for p in pages if p["status"] == 200)
    print(f"Success: {success}/{len(pages)} pages")


if __name__ == "__main__":
    asyncio.run(main())

Usage:

# Basic crawl (JS rendering on, datacenter proxies)
python sdo_crawler.py --url https://example.com --max-pages 50
# Fast static site (no rendering, cheaper)
python sdo_crawler.py --url https://docs.example.com --max-pages 200 --no-render
# E-commerce with residential proxy from US
python sdo_crawler.py --url https://shop.example.com --max-pages 100 --super --geo us

Credit cost comparison:
- Firecrawl crawl: 1 credit/page
- SDO datacenter (no render): 1 credit/page — same cost
- SDO with render=true: 5 credits/page. More expensive, but you control when rendering is needed
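
To make the trade-off concrete, the arithmetic can be sketched as below. It uses only the two per-page figures quoted in this list (1 credit without rendering, 5 credits with `render=true`) and assumes datacenter proxies; residential pricing is not stated here, so it is left out. The function name `crawl_credits` is hypothetical.

```python
def crawl_credits(pages: int, rendered_fraction: float = 1.0) -> int:
    """Estimate Scrape.do credits for a crawl on datacenter proxies.

    rendered_fraction is the share of pages fetched with render=true.
    """
    rendered = round(pages * rendered_fraction)
    static = pages - rendered
    return rendered * 5 + static * 1


if __name__ == "__main__":
    print(crawl_credits(100, 1.0))   # every page rendered
    print(crawl_credits(100, 0.0))   # no rendering at all
    print(crawl_credits(100, 0.2))   # render only the 20% that need it
```

Rendering only the pages that actually need JavaScript is what brings the total close to the 1 credit/page baseline.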
What /map does in Firecrawl: Queries a domain and returns a comprehensive URL list using sitemap + SERP + previous crawl data. It costs 1 credit per call regardless of how many URLs are returned, which makes it very cheap.
SDO equivalent: None.
How to compensate: Fetch sitemap.xml and parse it. Most sites follow the standard; many have a robots.txt that points to multiple sitemap files.
import requests
from xml.etree import ElementTree as ET
from urllib.parse import urlparse


def discover_urls(domain: str, sdo_token: str, max_urls: int = 5000) -> list[str]:
    """
    Discover all URLs on a domain by fetching and parsing its sitemap(s).
    Reads Sitemap directives from robots.txt first and falls back to
    /sitemap.xml when none are found.
    """
    base = f"https://{domain}" if not domain.startswith("http") else domain
    parsed = urlparse(base)
    root = f"{parsed.scheme}://{parsed.netloc}"

    def fetch_via_sdo(url: str) -> requests.Response:
        return requests.get("https://api.scrape.do", params={
            "token": sdo_token,
            "url": url
        }, timeout=30)

    def parse_sitemap(xml_text: str) -> tuple[list[str], list[str]]:
        """Returns (page_urls, nested_sitemap_urls)."""
        try:
            root_el = ET.fromstring(xml_text)
        except ET.ParseError:
            return [], []
        ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        pages, sitemaps = [], []
        # sitemap index
        for loc in root_el.findall(".//sm:sitemap/sm:loc", ns):
            if loc.text:
                sitemaps.append(loc.text.strip())
        # regular sitemap
        for loc in root_el.findall(".//sm:url/sm:loc", ns):
            if loc.text:
                pages.append(loc.text.strip())
        return pages, sitemaps

    # Step 1: Check robots.txt for Sitemap directives
    sitemap_urls = []
    try:
        robots = fetch_via_sdo(f"{root}/robots.txt")
        if robots.status_code == 200:
            for line in robots.text.splitlines():
                line = line.strip()
                if line.lower().startswith("sitemap:"):
                    sitemap_urls.append(line.split(":", 1)[1].strip())
    except Exception:
        pass
    if not sitemap_urls:
        sitemap_urls = [f"{root}/sitemap.xml"]

    # Step 2: Fetch and parse sitemaps (handle sitemap indexes)
    all_urls = []
    visited_sitemaps = set()
    queue = sitemap_urls
    while queue and len(all_urls) < max_urls:
        sitemap_url = queue.pop(0)
        if sitemap_url in visited_sitemaps:
            continue
        visited_sitemaps.add(sitemap_url)
        try:
            resp = fetch_via_sdo(sitemap_url)
            if resp.status_code != 200:
                continue
            pages, nested = parse_sitemap(resp.text)
            all_urls.extend(pages)
            queue.extend(nested)
        except Exception as e:
            print(f"Failed to parse {sitemap_url}: {e}")
    # Deduplicate while keeping discovery order
    return list(dict.fromkeys(all_urls))[:max_urls]


# Usage
urls = discover_urls("example.com", sdo_token="SDO_TOKEN")
print(f"Discovered {len(urls)} URLs")
for url in urls[:10]:
    print(url)

What /extract does in Firecrawl: Takes a URL and a JSON schema (Pydantic/Zod compatible), scrapes the page, passes it through an LLM, and returns structured data matching the schema. Cost: 1 base credit + 4 LLM credits per page.
SDO equivalent: None — Scrape.do has no built-in LLM layer.
How to compensate: Fetch markdown from SDO, then pass it to your own LLM. You choose the model, schema enforcement library, and cost structure. This two-step approach is actually more flexible: you can use Claude, GPT-4o-mini, Gemini, or a local model.
import requests
import json
from openai import OpenAI
from pydantic import BaseModel
from typing import Optional


# Define your schema with Pydantic (same as Firecrawl's schema param)
class ProductSchema(BaseModel):
    name: str
    price: Optional[float]
    currency: Optional[str]
    rating: Optional[float]
    review_count: Optional[int]
    in_stock: Optional[bool]
    description: Optional[str]


def extract_structured(
    url: str,
    schema: type[BaseModel],
    sdo_token: str,
    openai_api_key: str,
    render: bool = True,
    model: str = "gpt-4o-mini"
) -> dict:
    """
    Firecrawl /extract equivalent using SDO + OpenAI.
    Returns parsed dict matching the schema.
    """
    # Step 1: Fetch page as clean markdown via SDO
    params = {
        "token": sdo_token,
        "url": url,
        "output": "markdown",
    }
    if render:
        params["render"] = "true"
    resp = requests.get("https://api.scrape.do", params=params, timeout=60)
    resp.raise_for_status()
    markdown = resp.text
    if not markdown.strip():
        raise ValueError(f"Empty response from {url}")

    # Step 2: Build schema description for the prompt
    schema_json = json.dumps(schema.model_json_schema(), indent=2)

    # Step 3: Pass to LLM for extraction
    client = OpenAI(api_key=openai_api_key)
    completion = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a data extraction assistant. "
                    "Extract structured data from the provided web page content. "
                    "Return only valid JSON matching the requested schema. "
                    "Use null for fields you cannot find."
                )
            },
            {
                "role": "user",
                "content": (
                    f"Extract data matching this JSON schema:\n{schema_json}\n\n"
                    f"Page content:\n{markdown[:8000]}"  # truncate to fit context window
                )
            }
        ]
    )
    raw_json = completion.choices[0].message.content
    parsed = json.loads(raw_json)
    # Validate with Pydantic (raises ValidationError on a schema mismatch)
    validated = schema.model_validate(parsed)
    return validated.model_dump()


# Usage: mirrors Firecrawl /extract usage
result = extract_structured(
    url="https://www.amazon.com/dp/B0ABCDEF",
    schema=ProductSchema,
    sdo_token="SDO_TOKEN",
    openai_api_key="sk-...",
    model="gpt-4o-mini"
)
print(json.dumps(result, indent=2))

Cost comparison (per page):
- Firecrawl /extract: 1 + 4 = 5 credits
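- SDO + GPT-4o-mini: 1 SDO credit (~$0.001) + ~$0.0005 GPT tokens = ~$0.0015 total
- SDO + Claude 3.5 Haiku: 1 SDO credit + ~$0.0008 = ~$0.0018 total
These SDO figures assume no rendering; with render=true (the default in the example above) the SDO side costs 5 credits per page. You end up paying roughly the same, but you choose your own model and retain full schema flexibility.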
What it does in Firecrawl: An LLM-powered agent that receives a prompt, autonomously navigates websites (no seed URL required), decides what pages to visit, and returns extracted data. The /agent endpoint (spark models) and legacy FIRE-1 both fit this pattern.
SDO equivalent: None. SDO is a tool, not an agent.
How to compensate: Build an agent loop using your preferred LLM + SDO as the "fetch" tool. The pattern: LLM decides what URL to fetch -> SDO fetches it -> markdown fed back to LLM -> LLM decides next step.
"""
Minimal SDO-powered web agent using Claude as the reasoning engine.
Replaces Firecrawl /agent for structured research tasks.
"""
import requests
import anthropic
import json
SDO_TOKEN = "SDO_TOKEN"
ANTHROPIC_KEY = "sk-ant-..."
def sdo_fetch(url: str, render: bool = True) -> str:
    """Fetch a URL as markdown via Scrape.do."""
    resp = requests.get("https://api.scrape.do", params={
        "token": SDO_TOKEN,
        "url": url,
        "output": "markdown",
        "render": "true" if render else "false",
        # requests drops parameters whose value is None, so waitUntil is only sent when rendering
        "waitUntil": "networkidle2" if render else None,
    }, timeout=60)
    # Truncate to keep the tool result small enough for the LLM context
    return resp.text[:6000] if resp.status_code == 200 else f"[Error {resp.status_code}]"
def run_agent(task: str, max_steps: int = 8) -> str:
    """
    Run a web research agent.
    The LLM decides which URLs to fetch; SDO fetches them.
    Returns the final extracted answer.
    """
    client = anthropic.Anthropic(api_key=ANTHROPIC_KEY)
    tools = [
        {
            "name": "fetch_webpage",
            "description": "Fetch any public webpage and return its content as markdown. Use this to read web pages, product pages, search results, etc.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to fetch"},
                    "render": {"type": "boolean", "description": "Whether to execute JavaScript (needed for SPAs)", "default": True}
                },
                "required": ["url"]
            }
        }
    ]
    messages = [{"role": "user", "content": task}]
    # Every iteration counts as a step, so the loop always terminates
    for _ in range(max_steps):
        response = client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=2048,
            tools=tools,
            messages=messages
        )
        # Check if agent wants to use the fetch tool
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use" and block.name == "fetch_webpage":
                    url = block.input["url"]
                    render = block.input.get("render", True)
                    print(f"  [Agent fetching via SDO] {url}")
                    content = sdo_fetch(url, render)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": content
                    })
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
        elif response.stop_reason == "end_turn":
            # Extract final text answer
            for block in response.content:
                if block.type == "text":
                    return block.text
            break
        else:
            # Any other stop reason (for example "max_tokens"): stop instead of retrying the same request
            break
    return "Agent did not produce a final answer within step limit."
# Usage: same style as Firecrawl /agent prompt usage
result = run_agent(
    "Find the current price and availability of the iPhone 16 Pro 256GB "
    "on apple.com and return as JSON with fields: price, currency, available."
)
print(result)
What it does in Firecrawl: Firecrawl is AGPL-3.0 open source at github.com/mendableai/firecrawl. You can self-host the entire stack (scraping workers, Redis, Postgres, Playwright browsers) on your own infrastructure with no per-request costs.
SDO equivalent: Scrape.do is a cloud-only SaaS. There is no self-hosted version.
If self-hosting is a requirement: You have two paths:
- Keep using Firecrawl OSS for self-hosted workloads, and route cloud-burst traffic through SDO for overflow or jurisdictions where you need different IPs.
- Evaluate alternatives like Crawlee (open-source JS crawling framework) paired with your own proxy pool.
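The first path (self-hosted by default, SDO for overflow) can be sketched as a small router. This is a minimal illustration, not part of either product: fetch_with_overflow and force_cloud are hypothetical names, and the two fetchers are plain callables you supply, for example a wrapper around your self-hosted Firecrawl instance and the sdo_fetch() helper from this guide.

```python
from typing import Callable, Optional, Tuple

# A fetcher takes a URL and returns page content as text.
Fetcher = Callable[[str], str]


def fetch_with_overflow(
    url: str,
    self_hosted: Fetcher,
    cloud: Fetcher,
    force_cloud: Optional[Callable[[str], bool]] = None,
) -> Tuple[str, str]:
    """Return (backend_name, content).

    Tries the self-hosted fetcher first and falls back to the cloud
    fetcher if it raises. force_cloud is an optional predicate for URLs
    that should always go through the cloud (for example, targets that
    need IPs from a region your own infrastructure does not cover).
    """
    if force_cloud is not None and force_cloud(url):
        return "cloud", cloud(url)
    try:
        return "self_hosted", self_hosted(url)
    except Exception:
        # Broad on purpose: any self-hosted failure (timeout, block,
        # worker crash) is treated as overflow and retried via the cloud.
        return "cloud", cloud(url)
```

Keeping the fetchers as injected callables means the routing rule can be unit-tested without network access, and the overflow backend can be swapped without touching call sites.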
What it does in Firecrawl: Parses PDFs (including scanned/OCR) and returns structured text. Three modes: auto, fast, and OCR. Priced at 1 credit per PDF page. The underlying engine was rewritten in Rust in Feb 2026 for 3x speed.
SDO equivalent: None — Scrape.do does not parse PDF content. It can fetch the raw PDF bytes.
How to compensate: Download the PDF bytes via SDO (no render needed), then parse client-side.
import requests
import io
# Option A: pdfplumber (good for text-heavy PDFs)
try:
    import pdfplumber
    def extract_pdf_text_plumber(pdf_url: str, sdo_token: str) -> str:
        """Download and extract text from a PDF using pdfplumber."""
        resp = requests.get("https://api.scrape.do", params={
            "token": sdo_token,
            "url": pdf_url,
            # No render=true: PDFs are binary, no JS needed
        }, timeout=60)
        resp.raise_for_status()
        pages = []
        with pdfplumber.open(io.BytesIO(resp.content)) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                if text:
                    pages.append(text)
        return "\n\n".join(pages)
    text = extract_pdf_text_plumber(
        "https://example.com/report.pdf",
        sdo_token="SDO_TOKEN"
    )
    print(text[:500])
except ImportError:
    print("Install pdfplumber: pip install pdfplumber")
# Option B: PyMuPDF / fitz (faster, also handles scanned PDFs with OCR)
try:
    import fitz  # PyMuPDF
    def extract_pdf_text_pymupdf(pdf_url: str, sdo_token: str) -> str:
        """Download and extract text from a PDF using PyMuPDF."""
        resp = requests.get("https://api.scrape.do", params={
            "token": sdo_token,
            "url": pdf_url,
        }, timeout=60)
        resp.raise_for_status()
        doc = fitz.open(stream=resp.content, filetype="pdf")
        return "\n\n".join(page.get_text() for page in doc)
    # Scanned PDFs have no text layer, so get_text() returns little or nothing.
    # For those, OCR each page with page.get_textpage_ocr() (requires a local
    # Tesseract installation), or render page.get_pixmap() to an image and run
    # pytesseract on it.
except ImportError:
    print("Install PyMuPDF: pip install pymupdf")
Packages:
pip install pdfplumber # text-focused, good table extraction
pip install pymupdf # faster, more complete, supports OCR via Tesseract
| Request Type | Firecrawl | Scrape.do |
|---|---|---|
| Basic fetch (datacenter, no browser) | 1 credit | 1 credit |
| Browser rendering + datacenter | 1 credit (always rendered) | 5 credits (render=true) |
| Residential proxy (no browser) | 5 credits (proxy: enhanced) | 10 credits (super=true) |
| Browser + residential proxy | 5 credits | 25 credits (render=true + super=true) |
| Markdown output | 1 credit | 1 credit (output=markdown) |
| LLM structured extraction | 1 + 4 = 5 credits | No built-in — SDO 1cr + LLM API cost |
| Full-site crawl (per page) | 1 credit/page | 1-5 credits/page (depends on render) |
| URL map/discovery | 1 credit/call | 0 credits (parse sitemap.xml yourself) |
| Batch/async scrape | Same as scrape | Same as scrape (async pool separate) |
Bottom line: For pure HTML fetching without rendering, costs are equal. If Firecrawl was charging you 1 credit per page and you relied on rendering being always-on, your SDO costs will be higher per-request — but plan pricing is substantially lower, which can offset this for moderate volumes.
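To see how the per-request multipliers in the table add up over a month, a rough estimator can help. The credit values below are copied from the table above; the monthly volumes and the monthly_credits helper are made-up placeholders, so substitute your own request counts.

```python
from typing import Dict

# Credits per request, taken from the comparison table above.
SDO_CREDITS: Dict[str, int] = {
    "basic": 1,          # datacenter, no browser
    "render": 5,         # render=true
    "super": 10,         # super=true
    "render_super": 25,  # render=true + super=true
}
FIRECRAWL_CREDITS: Dict[str, int] = {
    "basic": 1,
    "render": 1,         # Firecrawl always renders at 1 credit
    "super": 5,          # proxy: enhanced
    "render_super": 5,
}


def monthly_credits(volumes: Dict[str, int], price_table: Dict[str, int]) -> int:
    """Sum credits for a month given request counts per request type."""
    unknown = set(volumes) - set(price_table)
    if unknown:
        raise ValueError(f"Unknown request types: {sorted(unknown)}")
    return sum(count * price_table[kind] for kind, count in volumes.items())


# Hypothetical monthly request counts, for illustration only.
volumes = {"basic": 50_000, "render": 10_000, "super": 2_000, "render_super": 500}

print("Scrape.do credits:", monthly_credits(volumes, SDO_CREDITS))        # 132500
print("Firecrawl credits:", monthly_credits(volumes, FIRECRAWL_CREDITS))  # 72500
```

Credits are not directly comparable across vendors, so multiply each total by the per-credit price of the plan you are actually considering before drawing conclusions.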
| | Firecrawl | Scrape.do |
|---|---|---|
| Free | 500 credits (one-time) | Free-forever plan (check dashboard) |
| Entry paid | $16/mo for 3,000 cr | Pay-as-you-go or lower-tier plan |
| Mid tier | $83/mo for 100,000 cr | Comparable plan at lower cost |
| Credit rollover | Plan credits do not roll over | Check SDO plan terms |
| Extra credits | Auto-recharge packs (4/month max) | Pay-as-you-go top-up |
| Subscription required | Yes — no pure PAYG | Pay-as-you-go available |
Firecrawl's most significant pricing advantage is the /map endpoint at 1 credit for thousands of URLs, and the /crawl endpoint at 1 credit/page with no rendering surcharge (Firecrawl always renders but charges 1 credit). If your workflow is heavily crawl-based, factor this into your cost model.
Work through this list when migrating. Tick each item before deploying to production.
- Replace Authorization: Bearer FC_API_KEY header with token=SDO_TOKEN query parameter
- Change base URL from https://api.firecrawl.dev/v2/ to https://api.scrape.do
- Switch from POST with JSON body to GET with query parameters
- Replace formats: ["markdown"] with output=markdown
- formats: ["html"] needs no replacement; the default SDO response is already HTML
- Add render=true to requests that relied on Firecrawl's always-on browser (Firecrawl renders every request; SDO does not)
- Replace waitFor: N (ms) with customWait=N
- Replace mobile: true with device=mobile
- Replace location.country: "US" with geoCode=us (lowercase ISO code)
- Replace proxy: "enhanced" with super=true
- Replace headers: {...} body object with extraHeaders=true + Sd-{Header} HTTP headers
- Replace actions: [...] array with playWithBrowser=[{"Action":"..."}] JSON
- Replace formats: ["screenshot"] with render=true&screenShot=true&returnJSON=true (all three required) and read base64 from response.json()["screenShots"][0]["image"] (Firecrawl returns a hosted URL string instead, so any code that treated data["screenshot"] as base64 must also be updated)
- /v2/batch/scrape -> SDO Async API (https://q.scrape.do/api/v1/jobs)
  - Change auth from Authorization header to X-Token header
  - Change body format: urls array -> Targets array
  - Poll job at GET /api/v1/jobs/{jobID}, retrieve content at GET /api/v1/jobs/{jobID}/{taskID}
- /v2/crawl -> Build async crawler loop (use sdo_crawler.py from this guide)
  - Decide: do you need render=true on all pages? Turn it off for static content to cut costs to 1cr/page.
  - Decide: maximum page depth, excluded path patterns, concurrency
- /v2/map -> Fetch sitemap.xml + parse (use discover_urls() from this guide)
- /v2/extract (or legacy /v1/extract) -> output=markdown + LLM call (use extract_structured() from this guide)
  - Choose your LLM: GPT-4o-mini (cheapest), Claude 3.5 Haiku, Gemini Flash
  - Validate with Pydantic or Zod to maintain type safety
- /v2/search -> SDO Google Scraper API: GET https://api.scrape.do/plugin/google/search?token=T&q=query
  - Returns structured JSON (no HTML parsing needed)
  - Note: 84 Google domains supported
  - Now part of the broader Google Scraper API which also covers Maps, Shopping, Flights, Hotels, News, Trends (10cr each)
- /v2/agent or FIRE-1 -> Build agent loop (use run_agent() from this guide or similar)
- PDF parsing: download PDF bytes via SDO, parse with pdfplumber or pymupdf
- Word/Excel document parsing: download and parse client-side with python-docx, openpyxl
- Self-hosting: SDO is cloud-only; keep Firecrawl OSS for self-hosted requirements
- MCP server: SDO has no MCP endpoint; use SDO's n8n/Zapier integrations for AI workflow automation
- Test authentication: verify SDO token works and check the Scrape.do-Remaining-Credits response header
- Compare output: scrape the same 5-10 URLs with both APIs and diff the markdown
- Verify geo-targeting: confirm geoCode returns content from the correct region
- Test rendering: confirm JS-heavy pages return dynamic content with render=true
- Test async API: create a multi-URL job, poll to completion, verify all task results
- Run cost estimate: count monthly requests per type (datacenter/render/super) and compare to Firecrawl credits used
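The simple parameter swaps in the checklist can be collected into one translation helper, which is handy while both code paths coexist during migration. This is a sketch under stated assumptions: firecrawl_to_sdo_params and its always_render flag are hypothetical names, it covers only the one-to-one mappings listed above, and it deliberately leaves out headers and actions, which need request headers and a playWithBrowser payload that should be built by hand.

```python
from typing import Any, Dict


def firecrawl_to_sdo_params(
    body: Dict[str, Any], token: str, always_render: bool = True
) -> Dict[str, str]:
    """Translate a Firecrawl /scrape JSON body into Scrape.do query params.

    always_render=True mimics Firecrawl's always-on browser; set it to
    False for static pages to avoid the rendering surcharge.
    """
    params: Dict[str, str] = {"token": token, "url": body["url"]}
    formats = body.get("formats", [])

    if "markdown" in formats:
        params["output"] = "markdown"   # formats: ["html"] needs nothing
    if always_render:
        params["render"] = "true"
    if "waitFor" in body:
        params["customWait"] = str(body["waitFor"])  # milliseconds
    if body.get("mobile"):
        params["device"] = "mobile"
    country = body.get("location", {}).get("country")
    if country:
        params["geoCode"] = country.lower()          # lowercase ISO code
    if body.get("proxy") == "enhanced":
        params["super"] = "true"
    if "screenshot" in formats:
        # All three are required together per the checklist above.
        params.update({"render": "true", "screenShot": "true", "returnJSON": "true"})
    return params
```

The returned dict can be passed directly as params= to requests.get("https://api.scrape.do", ...), matching the request style used throughout this guide.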
- SDO Documentation: scrape.do/documentation
- SDO Async API: scrape.do/documentation/async-api
- SDO Google Scraper API: scrape.do/documentation/google-scraper-api/search (Maps, Shopping, Flights, Hotels, News, Trends also available — 10cr each)
- SDO Amazon Scraper API: scrape.do/documentation/amazon-scraper-api
- SDO Dashboard + Token: dashboard.scrape.do
- Firecrawl OSS (self-hosted): github.com/mendableai/firecrawl