Migrate from Firecrawl to Scrape.do

Complete migration guide for switching from Firecrawl to Scrape.do. Covers endpoint mapping, parameter translation, feature gaps, and compensation strategies with working code examples.

What You Are Migrating Between

Before diving into code, understand the fundamental difference: Firecrawl is an AI/LLM-focused data pipeline. It abstracts away the web entirely and returns clean, structured content. Its /crawl, /map, /extract, and /agent endpoints assume you want processed, ready-to-use data — often to feed directly into a language model.

Scrape.do is a traditional scraping API. It returns a page's raw HTML (or a markdown conversion of it) from any URL, handles anti-bot systems, rotates proxies, and renders JavaScript, but stops there. Data structuring is your responsibility.

If your Firecrawl usage centered on /scrape with formats: ["markdown"] or formats: ["html"], migration is nearly one-to-one. If you relied heavily on /crawl (full-site spidering), /extract (AI-structured JSON), or the /agent endpoint (autonomous data gathering), you will need to build compensation layers — all of which are covered in this guide.

The tradeoff: you lose Firecrawl's higher-level AI abstractions, and you gain significantly lower per-request costs, 95M+ residential/mobile IPs, precise geo-targeting, structured Amazon and Google SERP APIs, and an async batch system with its own concurrency pool.

Quick Start: Minimal Changes

Firecrawl (before):

import requests

response = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={"url": "https://example.com", "formats": ["markdown"]}
)
markdown = response.json()["data"]["markdown"]

Scrape.do (after):

import requests

response = requests.get(
    "https://api.scrape.do",
    params={
        "token": "SDO_TOKEN",
        "url": "https://example.com",
        "output": "markdown"
    }
)
markdown = response.text

Key differences at a glance:

Aspect	Firecrawl	Scrape.do
Base URL	`https://api.firecrawl.dev/v2/`	`https://api.scrape.do`
Auth	`Authorization: Bearer KEY` header	`token` query parameter
Method	POST with JSON body	GET with query parameters
Default output	Markdown (LLM-ready)	Raw HTML
Markdown output	`formats: ["markdown"]`	`output=markdown`
JS rendering	Always on (proxy auto-routes)	`render=true` (off by default)
Residential proxy	Built-in (`proxy: enhanced`)	`super=true`
Geo-targeting	`location.country` in JSON body	`geoCode=us` query param

Authentication

Firecrawl	Scrape.do
`Authorization: Bearer FC_API_KEY` HTTP header	`token=SDO_TOKEN` query parameter
API key from `firecrawl.dev/app` dashboard	Token from `dashboard.scrape.do`
Required on all requests	Required on all requests

Firecrawl:

curl -X POST "https://api.firecrawl.dev/v2/scrape" \
  -H "Authorization: Bearer FC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Scrape.do:

curl "https://api.scrape.do/?token=SDO_TOKEN&url=https%3A%2F%2Fexample.com"
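
The same request can be assembled in Python. This is a minimal sketch that assumes only the `token` and `url` parameters shown above; `build_sdo_url` is a hypothetical helper name, and the standard library's `urlencode` takes care of percent-encoding the target URL.

```python
from urllib.parse import urlencode

SDO_BASE = "https://api.scrape.do/"


def build_sdo_url(token: str, target_url: str, **options: str) -> str:
    """Build a Scrape.do GET URL with the target URL percent-encoded.

    `options` carries any extra query parameters, e.g. render="true".
    """
    query = {"token": token, "url": target_url, **options}
    return f"{SDO_BASE}?{urlencode(query)}"


# Prints the same URL as the curl example above.
print(build_sdo_url("SDO_TOKEN", "https://example.com"))
```

Passing a `params` dict to `requests.get` performs the same encoding, so a helper like this is mainly useful for logging or for clients that need a ready-made URL.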

Endpoint Mapping

Firecrawl exposes multiple specialized endpoints. Scrape.do uses a single endpoint for page fetching, with separate plugin URLs for structured data.

Firecrawl Endpoint	Purpose	Scrape.do Equivalent
`POST /v2/scrape`	Scrape a single URL	`GET api.scrape.do/?token=T&url=U`
`POST /v2/crawl`	Spider entire site, return all pages	No direct equivalent — build a crawler loop (see section below)
`POST /v2/map`	Return all URLs on a domain	No direct equivalent — fetch `sitemap.xml` (see section below)
`POST /v2/extract` (legacy `/v1/extract` still works)	AI-structured JSON from URL(s)	No direct equivalent — use `output=markdown` then call an LLM (see section below)
`POST /v2/search`	Web search + scrape results	Use SDO's Google Search endpoint (`/plugin/google/search`), part of the broader Google Scraper API that also covers Maps, Shopping, Flights, Hotels, News, and Trends (10 credits each)
`POST /v2/batch/scrape`	Scrape multiple URLs as async job	SDO Async API (`https://q.scrape.do/api/v1/jobs`)

Complete Parameter Mapping

Core Parameters

Firecrawl (POST body)	Scrape.do (query param)	Notes
`url`	`url`	Same name. Percent-encode the target URL when passing it as a Scrape.do query parameter
`formats: ["html"]`	(default behavior)	Firecrawl `html` is processed/cleaned HTML; SDO default returns the target's raw HTML
`formats: ["rawHtml"]`	(default behavior)	SDO already returns the target's unprocessed HTML by default. (`transparentResponse=true` is unrelated — it only changes how status codes are reported.)
`formats: ["markdown"]`	`output=markdown`	Both return clean markdown text
`formats: ["screenshot"]`	`render=true` + `screenShot=true` + `returnJSON=true`	All three params required (verified via live API). SDO returns base64 in `screenShots[0].image`. Firecrawl returns an HTTPS URL to a PNG (not base64).
`formats: ["links"]`	(parse from HTML response)	No direct param; extract links from `output=markdown` or HTML
`timeout` (ms)	`timeout` (ms)	Same unit. SDO default: 60000. SDO max: 120000.
`waitFor` (ms)	`customWait` (ms)	Fixed delay after page load. Same concept, different name.
`headers`	`extraHeaders=true` + `Sd-` prefix headers	See Headers section
`mobile`	`device=mobile`	Renders as mobile browser
`location.country`	`geoCode=us`	ISO country code, lowercase in SDO
`location.languages`	(no direct equivalent)	Use `extraHeaders=true` with `Sd-Accept-Language` header
`actions` (Playwright-like)	`playWithBrowser` (JSON action array)	See Browser Actions section
`proxy: "basic"`	(default)	Datacenter proxy
`proxy: "enhanced"`	`super=true`	Residential/mobile proxy
`onlyMainContent`	(no direct equivalent)	Use `output=markdown`; SDO markdown omits boilerplate naturally
`blockAds`	(no direct equivalent)	`blockResources=true` blocks images/CSS/fonts
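
The mappings in this table can be collected into one translation function. The sketch below is illustrative only: `firecrawl_to_sdo_params` is a hypothetical helper, it covers just the rows that map one-to-one, and it assumes `customWait` is only meaningful together with `render=true` (it is a delay after page load in a browser).

```python
def firecrawl_to_sdo_params(payload: dict, token: str, render: bool = False) -> dict:
    """Translate a Firecrawl /scrape JSON body into Scrape.do query params.

    Only keys with a direct mapping are handled; `actions` and `headers`
    need separate treatment and are ignored here.
    """
    params = {"token": token, "url": payload["url"]}
    if render:
        params["render"] = "true"

    formats = payload.get("formats", [])
    if "markdown" in formats and "screenshot" in formats:
        # Scrape.do returns one format per request, so split these upstream.
        raise ValueError("request markdown and screenshot separately")
    if "markdown" in formats:
        params["output"] = "markdown"
    if "screenshot" in formats:
        params.update(render="true", screenShot="true", returnJSON="true")

    if "waitFor" in payload:
        params["render"] = "true"  # assumption: customWait needs a browser
        params["customWait"] = str(payload["waitFor"])
    if "timeout" in payload:
        params["timeout"] = str(payload["timeout"])
    if payload.get("mobile"):
        params["device"] = "mobile"
    if payload.get("proxy") == "enhanced":
        params["super"] = "true"

    country = payload.get("location", {}).get("country")
    if country:
        params["geoCode"] = country.lower()  # SDO expects lowercase ISO codes
    return params
```

The returned dict can be passed straight to `requests.get("https://api.scrape.do", params=...)`.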

Proxy and Geo-Targeting

Firecrawl routes all requests through proxies by default — you just pick the tier. Scrape.do defaults to datacenter proxies; residential requires super=true.

Firecrawl	Scrape.do	Notes
Default (`proxy: "auto"`) — tries basic, escalates if needed	Default (datacenter)	SDO datacenter pool is also rotating and anti-bot-capable
`proxy: "basic"` — fast, 28 countries	(default)	SDO datacenter: 150+ countries
`proxy: "enhanced"` — residential, US and DK only	`super=true`	SDO residential/mobile: 95M+ IPs, 150+ countries
`location.country: "US"`	`geoCode=us`	Lowercase ISO code in SDO. With datacenter proxy: requires Pro Plan or higher. With `super=true`: requires Business Plan or higher.
(no continent-level targeting)	`regionalGeoCode=europe`	SDO supports: `europe`, `asia`, `africa`, `oceania`, `northamerica`, `southamerica`. Requires `super=true` (Business+ plan).
(no sticky sessions)	`sessionId=12345`	SDO maintains the same IP for up to 5 min of inactivity; range 0-1000000

Firecrawl — residential proxy from Germany:

response = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={
        "url": "https://example.de",
        "proxy": "enhanced",
        "location": {"country": "DE"}
    }
)

Scrape.do equivalent:

curl "https://api.scrape.do/?token=SDO_TOKEN&url=https%3A%2F%2Fexample.de&super=true&geoCode=de"

Browser Rendering

Firecrawl always uses a browser (all requests are rendered). In Scrape.do, rendering is opt-in via render=true. This is the most important behavioral difference for direct /scrape migrations.

Firecrawl	Scrape.do	Notes
Browser always active	`render=true`	Add `render=true` to all requests that relied on Firecrawl's default JS execution
`waitFor: 2000`	`customWait=2000`	Millisecond wait after page load
`actions: [{type: "wait", selector: ".loaded"}]`	`waitSelector=.loaded`	CSS selector wait
(always waits for load)	`waitUntil=networkidle0`	SDO options: `domcontentloaded`, `networkidle0`, `networkidle2`, `load`
`mobile: true`	`device=mobile`	Mobile browser emulation
(no viewport control)	`width=390&height=844`	SDO allows explicit viewport size
`formats: ["screenshot"]`	`render=true` + `screenShot=true` + `returnJSON=true`	All three required. SDO returns base64 in `screenShots[0].image`. Firecrawl returns a temporary HTTPS URL to the PNG (not base64) — fetch it separately if you need the bytes.

Firecrawl (render + fixed 3-second wait):

response = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={
        "url": "https://spa-app.example.com",
        "waitFor": 3000,
        "formats": ["html"]
    }
)

Scrape.do equivalent:

curl "https://api.scrape.do/?token=SDO_TOKEN&url=https%3A%2F%2Fspa-app.example.com&render=true&customWait=3000"

Browser Actions (`actions` -> `playWithBrowser`)

Firecrawl actions use Playwright-like objects. Scrape.do playWithBrowser uses a similar JSON array of named action objects. The structure is close enough that most action sequences translate directly.

Firecrawl actions format (valid v2 action types verified via live API: click, wait, screenshot, write, press, scroll, scrape, executeJavascript, pdf):

[
  {"type": "click", "selector": "#accept-cookies"},
  {"type": "wait", "milliseconds": 1000},
  {"type": "scroll", "direction": "down", "amount": 500},
  {"type": "write", "text": "laptop"},
  {"type": "press", "key": "Enter"},
  {"type": "screenshot"}
]

Scrape.do playWithBrowser equivalent:

[
  {"Action": "Click", "Selector": "#accept-cookies"},
  {"Action": "Wait", "Timeout": 1000},
  {"Action": "ScrollY", "Value": 500},
  {"Action": "Fill", "Selector": "#search", "Value": "laptop"},
  {"Action": "Execute", "Execute": "document.querySelector('#search').dispatchEvent(new KeyboardEvent('keydown',{key:'Enter'}))"},
  {"Action": "ScreenShot"}
]

Note: Firecrawl write writes into the focused element (no selector field), and press sends a single key. SDO's Fill requires a selector and replaces the value — the equivalent of "type into focused" is to first Click the field, then Fill. There is no SDO Press action; use Execute to dispatch a KeyboardEvent for single-key presses.

Action Reference

Firecrawl action type	Scrape.do Action	Notes
`click` (requires `selector`)	`Click`	`selector` -> `Selector`
`wait` with `milliseconds`	`Wait`	`milliseconds` -> `Timeout`
`wait` with `selector` (CSS wait)	`WaitSelector`	`{"Action":"WaitSelector","WaitSelector":"#btn","Timeout":5000}`; max wait ~10000ms
`scroll` (direction: down, amount: N)	`ScrollY`	`amount` -> `Value` (pixels)
`scroll` (direction: right, amount: N)	`ScrollX`	`amount` -> `Value`
`write` (typing into focused field)	`Fill` (preceded by `Click`)	Firecrawl `write` requires `text`; SDO `Fill` requires both `Selector` and `Value`
`press` (single key, requires `key`)	`Execute` (dispatch `KeyboardEvent`)	No direct SDO equivalent
`screenshot`	`ScreenShot`	Requires `returnJSON=true` AND `render=true` on the request
`executeJavascript` (requires `script`)	`Execute`	`script` -> `Execute` (PascalCase field)
`pdf`	(no equivalent)	SDO does not have a "save current page as PDF" action
`scrape` (sub-fetch from inside actions)	(no equivalent)	Make a separate SDO request
(no equivalent)	`WaitForRequestCompletion`	Wait for a network request URL pattern to complete (SDO-only)
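
The reference table can be turned into a small translator for existing action lists. This is a sketch under stated assumptions, not an official converter: `translate_actions` is a hypothetical name, `write` reuses the selector of the most recent `click`, `press` dispatches a `keydown` event on `document.activeElement`, selector waits get a 5000 ms timeout borrowed from the example in the table, and only `down`/`right` scrolls are handled because the table lists no mapping for the other directions.

```python
import json


def translate_actions(fc_actions: list[dict]) -> list[dict]:
    """Convert Firecrawl `actions` into a Scrape.do `playWithBrowser` list."""
    sdo_actions = []
    last_selector = None  # selector of the most recent click, reused by `write`

    for action in fc_actions:
        kind = action["type"]
        if kind == "click":
            last_selector = action["selector"]
            sdo_actions.append({"Action": "Click", "Selector": last_selector})
        elif kind == "wait" and "selector" in action:
            sdo_actions.append({"Action": "WaitSelector",
                                "WaitSelector": action["selector"],
                                "Timeout": 5000})  # assumed default
        elif kind == "wait":
            sdo_actions.append({"Action": "Wait", "Timeout": action["milliseconds"]})
        elif kind == "scroll":
            axis = {"down": "ScrollY", "right": "ScrollX"}.get(action.get("direction", "down"))
            if axis is None:
                raise ValueError(f"no mapping for scroll direction {action['direction']!r}")
            sdo_actions.append({"Action": axis, "Value": action["amount"]})
        elif kind == "write":
            if last_selector is None:
                raise ValueError("`write` needs a preceding `click` to supply a selector")
            sdo_actions.append({"Action": "Fill", "Selector": last_selector,
                                "Value": action["text"]})
        elif kind == "press":
            key = json.dumps(action["key"])  # safely quoted JS string literal
            script = ("document.activeElement.dispatchEvent("
                      f"new KeyboardEvent('keydown',{{key:{key},bubbles:true}}))")
            sdo_actions.append({"Action": "Execute", "Execute": script})
        elif kind == "screenshot":
            sdo_actions.append({"Action": "ScreenShot"})
        elif kind == "executeJavascript":
            sdo_actions.append({"Action": "Execute", "Execute": action["script"]})
        else:
            # `pdf` and `scrape` have no playWithBrowser counterpart.
            raise ValueError(f"no Scrape.do equivalent for action type {kind!r}")
    return sdo_actions
```

Serialize the result with `json.dumps(...)` and pass it as the `playWithBrowser` parameter. A synthetic `keydown` event may not trigger every site's form handlers, so test `press` translations against the real page.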

Full Python example with browser actions:

import json

import requests

actions = [
    {"Action": "Click", "Selector": "#cookie-accept"},
    {"Action": "Wait", "Timeout": 500},
    {"Action": "ScrollY", "Value": 2000},
    {"Action": "WaitSelector", "WaitSelector": ".product-grid", "Timeout": 5000}
]

response = requests.get(
    "https://api.scrape.do",
    params={
        "token": "SDO_TOKEN",
        "url": "https://shop.example.com/category",
        "render": "true",
        "playWithBrowser": json.dumps(actions),
        "output": "markdown"
    }
)
print(response.text)

Headers and Cookies

Headers

Firecrawl accepts a headers object in the POST body. Scrape.do uses HTTP headers with a prefix system.

Firecrawl	Scrape.do	Notes
`headers: {"Authorization": "Bearer T"}` in body	`extraHeaders=true` + `Sd-Authorization: Bearer T` HTTP header	`Sd-` prefix: add/override headers on top of SDO defaults
(full header control)	`customHeaders=true`	Replace ALL headers with your own
(no equivalent)	`forwardHeaders=true`	Forward your request headers as-is to the target

Firecrawl:

response = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={
        "url": "https://api.example.com/data",
        "headers": {
            "Authorization": "Bearer my-site-token",
            "X-Custom-Header": "value123"
        }
    }
)

Scrape.do:

response = requests.get(
    "https://api.scrape.do",
    params={
        "token": "SDO_TOKEN",
        "url": "https://api.example.com/data",
        "extraHeaders": "true"
    },
    headers={
        "Sd-Authorization": "Bearer my-site-token",
        "Sd-X-Custom-Header": "value123"
    }
)
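
If you already hold Firecrawl-style `headers` dicts, the prefixing can be automated. A minimal sketch, where `to_sd_headers` is a hypothetical helper name:

```python
def to_sd_headers(target_headers: dict[str, str]) -> dict[str, str]:
    """Prefix each header name with `Sd-` for use with extraHeaders=true."""
    return {f"Sd-{name}": value for name, value in target_headers.items()}


firecrawl_headers = {
    "Authorization": "Bearer my-site-token",
    "X-Custom-Header": "value123",
}
print(to_sd_headers(firecrawl_headers))
```

Pass the result as the `headers=` argument of the request, alongside `"extraHeaders": "true"` in the query parameters.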

Cookies

Firecrawl	Scrape.do	Notes
No built-in cookie parameter	`setCookies=name=value; name2=value2`	URL-encode the cookie string

Scrape.do with cookies:

curl "https://api.scrape.do/?token=SDO_TOKEN&url=https%3A%2F%2Fexample.com&setCookies=session%3Dabc123%3B%20token%3Dxyz789"
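
In Python, the cookie string can be built from a dict and encoded with the standard library. This sketch assumes simple name/value pairs with no attributes; `build_set_cookies` is a hypothetical helper name.

```python
from urllib.parse import quote


def build_set_cookies(cookies: dict[str, str]) -> str:
    """Join cookies into the `name=value; name2=value2` form used by setCookies."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


cookie_string = build_set_cookies({"session": "abc123", "token": "xyz789"})
print(cookie_string)                  # session=abc123; token=xyz789
print(quote(cookie_string, safe=""))  # session%3Dabc123%3B%20token%3Dxyz789
```

The second line of output matches the encoded value in the curl example above.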

Output Formats

Firecrawl can return multiple formats in one call. Scrape.do returns one format per request.

Firecrawl `formats` value	Scrape.do equivalent
`"markdown"`	`output=markdown` (only effective when target content-type is `text/html` — PDFs/binary content not converted)
`"html"`	Default response (no param needed)
`"rawHtml"`	Default response (no param needed); `transparentResponse=true` only changes how status codes are reported
`"screenshot"`	`render=true` + `screenShot=true` + `returnJSON=true`
`"links"`	Parse from the markdown or HTML response
`"json"` (LLM extraction)	Fetch markdown, then call LLM yourself (see Extract section)
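
For the `"links"` row, one way to recover a link list is to parse the default HTML response with the standard library. This is a sketch rather than a drop-in replacement for Firecrawl's output: `LinkCollector` and `links_from_html` are hypothetical names, and the filtering rules (skip fragments, `javascript:` and `mailto:` links, drop duplicates) are assumptions about what you want to keep.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags, in document order."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            return
        absolute = urljoin(self.base_url, href)  # resolve relative links
        if absolute not in self.links:
            self.links.append(absolute)


def links_from_html(html: str, base_url: str) -> list[str]:
    """Return de-duplicated absolute links found in an HTML document."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Call it with the body of a Scrape.do response and the URL you requested, for example `links_from_html(response.text, "https://example.com")`.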

Firecrawl — multiple formats:

response = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={"url": "https://example.com", "formats": ["markdown", "html", "screenshot"]}
)
data = response.json()["data"]
markdown = data["markdown"]
html = data["html"]
screenshot_url = data["screenshot"]   # Note: HTTPS URL to a hosted PNG, NOT base64.
# To get bytes: png_bytes = requests.get(screenshot_url).content

Scrape.do — screenshot + markdown (two requests):

import requests

params_base = {"token": "SDO_TOKEN", "url": "https://example.com", "render": "true"}

# Get markdown
markdown = requests.get("https://api.scrape.do", params={**params_base, "output": "markdown"}).text

# Get screenshot
screenshot_resp = requests.get("https://api.scrape.do", params={**params_base, "screenShot": "true", "returnJSON": "true"})
screenshot_base64 = screenshot_resp.json()["screenShots"][0]["image"]
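
To end up with the same PNG bytes you would get by downloading Firecrawl's screenshot URL, decode the base64 field. The sketch below assumes the `image` value is plain base64 with no `data:` URI prefix; `decode_screenshot` is a hypothetical helper name.

```python
import base64


def decode_screenshot(payload: dict) -> bytes:
    """Return the raw image bytes from a Scrape.do returnJSON screenshot payload."""
    image_b64 = payload["screenShots"][0]["image"]
    return base64.b64decode(image_b64)


# Usage with the response from the example above:
#   from pathlib import Path
#   Path("page.png").write_bytes(decode_screenshot(screenshot_resp.json()))
```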

Async API

Firecrawl's /v2/batch/scrape runs multiple URLs asynchronously. Scrape.do has a dedicated Async API at https://q.scrape.do with its own concurrency pool (30% of your plan limit, separate from the main API pool — it does not reduce your main concurrency).

SDO Async API now supports a Plugin mode that batches up to 1000 structured-data params per job (Amazon, Google search/maps/shopping/flights/hotels/news/trends, plus walmart/store and lowes/store). See async-api/plugins.

Firecrawl batch scrape:

import requests, time

# Submit batch
resp = requests.post(
    "https://api.firecrawl.dev/v2/batch/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={"urls": ["https://example.com/page1", "https://example.com/page2"], "formats": ["markdown"]}
)
batch_id = resp.json()["id"]

# Poll until done
while True:
    status = requests.get(
        f"https://api.firecrawl.dev/v2/batch/scrape/{batch_id}",
        headers={"Authorization": "Bearer FC_API_KEY"}
    ).json()
    if status["status"] == "completed":
        break
    time.sleep(3)

pages = status["data"]

Scrape.do Async API equivalent:

import requests, time

ASYNC_BASE = "https://q.scrape.do/api/v1"
HEADERS = {"X-Token": "SDO_TOKEN", "Content-Type": "application/json"}

# Submit job (raw HTML by default; per-target options like markdown are not exposed in the async body)
resp = requests.post(f"{ASYNC_BASE}/jobs", headers=HEADERS, json={
    "Targets": ["https://example.com/page1", "https://example.com/page2"],
    "Super": False,
    "GeoCode": "us"
})
job = resp.json()
job_id = job["JobID"]
task_ids = job["TaskIDs"]

# Poll for completion
while True:
    status = requests.get(f"{ASYNC_BASE}/jobs/{job_id}", headers=HEADERS).json()
    if status["Status"] in ("success", "error", "canceled"):
        break
    time.sleep(2)

# Retrieve results per task
for task_id in task_ids:
    result = requests.get(f"{ASYNC_BASE}/jobs/{job_id}/{task_id}", headers=HEADERS).json()
    print(result["Content"])
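
The polling loop above runs until the job reaches a terminal state, which can mean forever if a job stalls. A bounded variant is sketched below. `wait_for_job` is a hypothetical helper, the terminal states are the three checked in the loop above, and the 300-second ceiling is an arbitrary assumption to tune for your workload.

```python
import time
from typing import Callable

TERMINAL_STATES = {"success", "error", "canceled"}


def wait_for_job(
    get_status: Callable[[], dict],
    interval: float = 2.0,
    max_wait: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll `get_status` until the job finishes or `max_wait` seconds pass.

    `get_status` is any zero-argument callable returning the job JSON,
    which keeps this helper independent of the HTTP client in use.
    """
    deadline = time.monotonic() + max_wait
    while True:
        status = get_status()
        if status.get("Status") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still running after {max_wait} seconds")
        sleep(interval)


# Usage with the names from the example above:
#   status = wait_for_job(
#       lambda: requests.get(f"{ASYNC_BASE}/jobs/{job_id}", headers=HEADERS).json()
#   )
```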

Async API with webhook (production pattern):

# Submit once — results delivered to your server when ready
resp = requests.post(f"{ASYNC_BASE}/jobs", headers=HEADERS, json={
    "Targets": ["https://example.com/page1", "https://example.com/page2"],
    "WebhookURL": "https://your-server.com/webhook/sdo",
    "WebhookHeaders": {"Authorization": "Bearer your-webhook-secret"}
})
print("Job ID:", resp.json()["JobID"])
# No polling needed — your webhook endpoint receives results automatically

Features You Gain in Scrape.do

These capabilities are unavailable in Firecrawl or significantly superior in Scrape.do.

Feature	Parameter	What you get	Docs
Residential/mobile proxies (95M+ IPs)	`super=true`	Access to residential and mobile IP pool across 150+ countries. Firecrawl's enhanced proxy is US/DK only.	docs
Geo-targeting: 150+ countries	`geoCode=us`	Country-level proxy routing. Firecrawl basic supports 28 countries; enhanced supports US and DK only.	docs
Geo-targeting: continent-level	`regionalGeoCode=europe`	Route via entire continent (super proxy only). Values: `europe`, `asia`, `africa`, `oceania`, `northamerica`, `southamerica`. No Firecrawl equivalent.	docs
Sticky sessions (same IP)	`sessionId=12345`	Maintain same IP for multi-step flows. Same integer = same IP for up to 5 min inactivity. Firecrawl has no sticky IP.	docs
Postal/ZIP-code targeting	`postalcode=10001` (or `zipcode=`)	Target a specific postal/ZIP code within a country. Requires `super=true` AND `geoCode`. Supported in 12 countries: us, gb, de, fr, ca, au, in, nl, it, es, br, jp. Send codes without spaces (e.g. `SW1A1AA` not `SW1A 1AA`).	docs
Google Scraper API	`GET /plugin/google/search?q=...`	Pre-parsed SERP JSON: organic results, ads, knowledge graph, local pack, AI Overview, and 10+ more types. 84 Google domains, 150+ languages, 240+ country codes. Part of the broader Google Scraper API, which also covers Maps, Shopping, Flights, Hotels, News, and Trends (10 credits each).	docs
Google AI Mode	`GET /plugin/google/search/ai-mode?q=...`	Google's full conversational AI response with references and shopping results as structured JSON. 10 credits.	docs
Google Maps	`GET /plugin/google/maps/search?q=...`	Structured Maps places list with location pinning via `ll=@lat,lng,zoom`. Also `/place` and `/reviews` for place details and review pagination, which Firecrawl has no equivalent for. 10 credits.	docs
Amazon Scraper API	`GET /plugin/amazon/pdp?asin=...`	Structured product data (ASIN, title, price, ratings, images, specs) from 21 Amazon marketplaces with ZIP-code geo-targeting. 1 credit per request; 1 concurrent request per token.	docs
Amazon offer listing	`GET /plugin/amazon/offer-listing?asin=...`	All seller offers with prices and shipping info for a given ASIN.	docs
Async API with separate concurrency pool	`https://q.scrape.do`	Job-based API with its own 30% concurrency pool — runs independently from your main API, doesn't cut into your main quota.	docs
Device emulation	`device=mobile\|tablet\|desktop`	Render as a specific device type. Controls both User-Agent and viewport.	docs
Viewport control	`width=390&height=844`	Set exact browser viewport dimensions.	docs
Full-page screenshots	`fullScreenShot=true`	Capture the entire page (not just the viewport).	docs
Partial screenshots	`particularScreenShot=.selector`	Screenshot of a specific CSS selector element.	docs
Network idle waiting	`waitUntil=networkidle0`	Wait until all network requests are finished before returning. Options: `domcontentloaded`, `networkidle0`, `networkidle2`, `load`.	docs
WebSocket capture	`showWebsocketRequests=true`	Capture WebSocket frames alongside XHR/Fetch in the JSON response. Requires `render=true` + `returnJSON=true`.	docs
Pure cookies	`pureCookies=true`	Returns raw `Set-Cookie` headers from target unmodified.	docs
Retry control	`retryTimeout=15000`, `disableRetry=true`	Configure or disable built-in retry mechanism.	docs
Disable redirect following	`disableRedirection=true`	Returns the raw 3xx response without following.	docs
WaitForRequestCompletion	`playWithBrowser` action	Wait until a specific network URL pattern completes — ideal for dynamically loaded data.	docs

Features You Leave Behind (and How to Compensate)

Firecrawl has several AI/pipeline features with no direct Scrape.do counterpart. Each section below describes the gap and provides a working compensation strategy.

1. Crawl Endpoint — Full-Site Spider (`POST /v2/crawl`)

What it does in Firecrawl: Submits a domain, Firecrawl recursively follows all internal links, renders each page, and returns the full site as an array of markdown documents. Pagination and link discovery are handled automatically. 1 credit per page.

SDO equivalent: None — Scrape.do is a per-URL API. It fetches one URL at a time.

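How to compensate: Build a crawl loop yourself. The pattern: seed URL -> extract links -> deduplicate -> queue -> fetch each URL via SDO. The example below is a working starting point that uses asyncio + aiohttp to fetch pages through SDO concurrently; it does not yet honor robots.txt, limit crawl depth, or retry failed pages.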

"""
sdo_crawler.py — Async site crawler using Scrape.do

Usage:
    python sdo_crawler.py --url https://example.com --max-pages 100 --concurrency 5
"""

import asyncio
import aiohttp
import re
import json
import argparse
from collections import deque
from urllib.parse import urljoin, urlparse

SDO_TOKEN = "SDO_TOKEN"
SDO_BASE = "https://api.scrape.do"


def normalize_url(url: str) -> str:
    """Remove fragment and trailing slash inconsistencies."""
    parsed = urlparse(url)
    normalized = parsed._replace(fragment="").geturl()
    return normalized.rstrip("/")


def extract_links(base_url: str, html_or_markdown: str) -> list[str]:
    """Extract all internal links from HTML or markdown content."""
    base = urlparse(base_url)
    links = set()

    # Match href attributes in HTML
    href_pattern = re.compile(r'href=["\']([^"\'#][^"\']*)["\']', re.IGNORECASE)
    # Match markdown links [text](url), absolute or relative
    md_pattern = re.compile(r'\[[^\]]*\]\(([^)\s#][^)\s]*)')

    candidates = [m.group(1) for m in href_pattern.finditer(html_or_markdown)]
    candidates += [m.group(1) for m in md_pattern.finditer(html_or_markdown)]

    for candidate in candidates:
        # Resolve relative links against the page URL before filtering
        url = urljoin(base_url, candidate)
        parsed = urlparse(url)
        if parsed.netloc == base.netloc and parsed.scheme in ("http", "https"):
            links.add(normalize_url(url))

    return list(links)


async def fetch_page(
    session: aiohttp.ClientSession,
    url: str,
    render: bool = True,
    super_proxy: bool = False,
    geo_code: str = None,
) -> dict:
    """Fetch a single URL via Scrape.do and return status + content."""
    params = {
        "token": SDO_TOKEN,
        "url": url,
        "output": "markdown",
    }
    if render:
        params["render"] = "true"
        params["waitUntil"] = "networkidle2"
    if super_proxy:
        params["super"] = "true"
    if geo_code:
        params["geoCode"] = geo_code

    try:
        async with session.get(SDO_BASE, params=params, timeout=aiohttp.ClientTimeout(total=90)) as resp:
            content = await resp.text()
            return {
                "url": url,
                "status": resp.status,
                "content": content if resp.status == 200 else "",
                "error": None if resp.status == 200 else f"HTTP {resp.status}",
            }
    except asyncio.TimeoutError:
        return {"url": url, "status": None, "content": "", "error": "timeout"}
    except Exception as e:
        return {"url": url, "status": None, "content": "", "error": str(e)}


async def crawl(
    seed_url: str,
    max_pages: int = 100,
    concurrency: int = 5,
    render: bool = True,
    super_proxy: bool = False,
    geo_code: str = None,
) -> list[dict]:
    """
    Crawl a website starting from seed_url.
    Returns a list of dicts: {url, status, content, error}
    """
    seed_url = normalize_url(seed_url)
    base_domain = urlparse(seed_url).netloc

    visited = set()
    queue = deque([seed_url])
    results = []
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded_fetch(session, url):
        async with semaphore:
            return await fetch_page(session, url, render, super_proxy, geo_code)

    connector = aiohttp.TCPConnector(limit=concurrency * 2)
    async with aiohttp.ClientSession(connector=connector) as session:
        while queue and len(visited) < max_pages:
            # Drain current batch
            batch = []
            # visited already includes the URLs added to this batch
            while queue and len(batch) < concurrency and len(visited) < max_pages:
                url = queue.popleft()
                if url not in visited:
                    visited.add(url)
                    batch.append(url)

            if not batch:
                break

            print(f"Fetching batch of {len(batch)} URLs | Total visited: {len(visited)}/{max_pages}")

            tasks = [bounded_fetch(session, url) for url in batch]
            batch_results = await asyncio.gather(*tasks)

            for result in batch_results:
                results.append(result)
                if result["content"]:
                    # Discover new links and add to queue
                    new_links = extract_links(result["url"], result["content"])
                    for link in new_links:
                        if link not in visited and urlparse(link).netloc == base_domain:
                            queue.append(link)

    print(f"Crawl complete. {len(results)} pages fetched.")
    return results


# --- Main ---

async def main():
    parser = argparse.ArgumentParser(description="Crawl a site using Scrape.do")
    parser.add_argument("--url", required=True, help="Seed URL to start crawling from")
    parser.add_argument("--max-pages", type=int, default=100, help="Max pages to crawl")
    parser.add_argument("--concurrency", type=int, default=5, help="Concurrent requests")
    parser.add_argument("--no-render", action="store_true", help="Disable JS rendering (faster, cheaper)")
    parser.add_argument("--super", action="store_true", help="Use residential proxies")
    parser.add_argument("--geo", default=None, help="Geo code, e.g. us, de, gb")
    parser.add_argument("--output", default="crawl_results.json", help="Output JSON file")
    args = parser.parse_args()

    pages = await crawl(
        seed_url=args.url,
        max_pages=args.max_pages,
        concurrency=args.concurrency,
        render=not args.no_render,
        super_proxy=args.super,
        geo_code=args.geo,
    )

    with open(args.output, "w", encoding="utf-8") as f:
        json.dump(pages, f, ensure_ascii=False, indent=2)

    print(f"Results saved to {args.output}")
    success = sum(1 for p in pages if p["status"] == 200)
    print(f"Success: {success}/{len(pages)} pages")


if __name__ == "__main__":
    asyncio.run(main())

Usage:

# Basic crawl (JS rendering on, datacenter proxies)
python sdo_crawler.py --url https://example.com --max-pages 50

# Fast static site (no rendering, cheaper)
python sdo_crawler.py --url https://docs.example.com --max-pages 200 --no-render

# E-commerce with residential proxy from US
python sdo_crawler.py --url https://shop.example.com --max-pages 100 --super --geo us

Credit cost comparison:

Firecrawl crawl: 1 credit/page
SDO datacenter (no render): 1 credit/page — same cost
SDO with render=true: 5 credits/page — more expensive, but you control when rendering is needed
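
Because rendering is opt-in, the total depends on how many pages actually need a browser. The arithmetic is sketched below using only the per-page figures listed above (1 credit without rendering, 5 with `render=true`, datacenter proxies); `estimate_crawl_credits` is a hypothetical helper, and the costs for `super=true` are not included because this comparison does not list them.

```python
STATIC_CREDITS = 1    # datacenter, no render
RENDERED_CREDITS = 5  # datacenter, render=true


def estimate_crawl_credits(static_pages: int, rendered_pages: int) -> int:
    """Estimate Scrape.do credits for a crawl that mixes static and rendered fetches."""
    return static_pages * STATIC_CREDITS + rendered_pages * RENDERED_CREDITS


# A 200-page site where only 20 pages need JavaScript:
print(estimate_crawl_credits(static_pages=180, rendered_pages=20))  # 280
# The same site with rendering enabled on every page:
print(estimate_crawl_credits(static_pages=0, rendered_pages=200))   # 1000
```

At 1 credit per page, the same 200-page crawl costs 200 Firecrawl credits, so in raw credit counts SDO only matches that figure when every page is fetched without rendering.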

2. Map Endpoint — URL Discovery (`POST /v2/map`)

What it does in Firecrawl: Queries a domain and returns a comprehensive URL list using sitemap + SERP + previous crawl data. 1 credit per call regardless of how many URLs are returned — extraordinarily cheap.

SDO equivalent: None.

How to compensate: Fetch sitemap.xml and parse it. Most sites follow the standard; many have a robots.txt that points to multiple sitemap files.

import requests
from xml.etree import ElementTree as ET
from urllib.parse import urlparse

def discover_urls(domain: str, sdo_token: str, max_urls: int = 5000) -> list[str]:
    """
    Discover all URLs on a domain by fetching and parsing its sitemap(s).
    Reads Sitemap directives from robots.txt first and falls back to /sitemap.xml if none are found.
    """
    base = f"https://{domain}" if not domain.startswith("http") else domain
    parsed = urlparse(base)
    root = f"{parsed.scheme}://{parsed.netloc}"

    def fetch_via_sdo(url: str) -> requests.Response:
        return requests.get("https://api.scrape.do", params={
            "token": sdo_token,
            "url": url
        }, timeout=30)

    def parse_sitemap(xml_text: str) -> tuple[list[str], list[str]]:
        """Returns (page_urls, nested_sitemap_urls)."""
        try:
            root_el = ET.fromstring(xml_text)
        except ET.ParseError:
            return [], []

        ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        pages, sitemaps = [], []

        # sitemap index
        for loc in root_el.findall(".//sm:sitemap/sm:loc", ns):
            sitemaps.append(loc.text.strip())

        # regular sitemap
        for loc in root_el.findall(".//sm:url/sm:loc", ns):
            pages.append(loc.text.strip())

        return pages, sitemaps

    # Step 1: Check robots.txt for Sitemap directives
    sitemap_urls = []
    try:
        robots = fetch_via_sdo(f"{root}/robots.txt")
        if robots.status_code == 200:
            for line in robots.text.splitlines():
                if line.lower().startswith("sitemap:"):
                    sitemap_urls.append(line.split(":", 1)[1].strip())
    except Exception:
        pass

    if not sitemap_urls:
        sitemap_urls = [f"{root}/sitemap.xml"]

    # Step 2: Fetch and parse sitemaps (handle sitemap indexes)
    all_urls = []
    visited_sitemaps = set()
    queue = sitemap_urls

    while queue and len(all_urls) < max_urls:
        sitemap_url = queue.pop(0)
        if sitemap_url in visited_sitemaps:
            continue
        visited_sitemaps.add(sitemap_url)

        try:
            resp = fetch_via_sdo(sitemap_url)
            if resp.status_code != 200:
                continue
            pages, nested = parse_sitemap(resp.text)
            all_urls.extend(pages)
            queue.extend(nested)
        except Exception as e:
            print(f"Failed to parse {sitemap_url}: {e}")

    return list(set(all_urls))[:max_urls]


# Usage
urls = discover_urls("example.com", sdo_token="SDO_TOKEN")
print(f"Discovered {len(urls)} URLs")
for url in urls[:10]:
    print(url)
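
Large sites often publish gzip-compressed sitemaps (`sitemap.xml.gz`), which the parser above would reject as invalid XML. A small decoding step is sketched below. `decode_sitemap_body` is a hypothetical helper, and it assumes the compressed bytes reach you unchanged in `resp.content`, which you should confirm against a real response.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_sitemap_body(body: bytes) -> str:
    """Return sitemap XML as text, decompressing first if the body is gzipped."""
    if body[:2] == GZIP_MAGIC:
        body = gzip.decompress(body)
    return body.decode("utf-8", errors="replace")


# Usage inside discover_urls, in place of resp.text:
#   pages, nested = parse_sitemap(decode_sitemap_body(resp.content))
```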

3. Extract Endpoint — LLM-Structured JSON (`POST /v2/extract`, legacy `/v1/extract`)

What it does in Firecrawl: Takes a URL and a JSON schema (Pydantic/Zod compatible), scrapes the page, passes it through an LLM, and returns structured data matching the schema. Cost: 1 base credit + 4 LLM credits per page.

SDO equivalent: None — Scrape.do has no built-in LLM layer.

How to compensate: Fetch markdown from SDO, then pass it to your own LLM. You choose the model, schema enforcement library, and cost structure. This two-step approach is actually more flexible: you can use Claude, GPT-4o-mini, Gemini, or a local model.

import requests
import json
from openai import OpenAI
from pydantic import BaseModel
from typing import Optional

# Define your schema with Pydantic (same as Firecrawl's schema param)
# Optional fields default to None so a key the LLM omits does not fail validation
class ProductSchema(BaseModel):
    name: str
    price: Optional[float] = None
    currency: Optional[str] = None
    rating: Optional[float] = None
    review_count: Optional[int] = None
    in_stock: Optional[bool] = None
    description: Optional[str] = None

def extract_structured(
    url: str,
    schema: type[BaseModel],
    sdo_token: str,
    openai_api_key: str,
    render: bool = True,
    model: str = "gpt-4o-mini"
) -> dict:
    """
    Firecrawl /extract equivalent using SDO + OpenAI.
    Returns parsed dict matching the schema.
    """
    # Step 1: Fetch page as clean markdown via SDO
    params = {
        "token": sdo_token,
        "url": url,
        "output": "markdown",
    }
    if render:
        params["render"] = "true"

    resp = requests.get("https://api.scrape.do", params=params, timeout=60)
    resp.raise_for_status()
    markdown = resp.text

    if not markdown.strip():
        raise ValueError(f"Empty response from {url}")

    # Step 2: Build schema description for the prompt
    schema_json = json.dumps(schema.model_json_schema(), indent=2)

    # Step 3: Pass to LLM for extraction
    client = OpenAI(api_key=openai_api_key)
    completion = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a data extraction assistant. "
                    "Extract structured data from the provided web page content. "
                    "Return only valid JSON matching the requested schema. "
                    "Use null for fields you cannot find."
                )
            },
            {
                "role": "user",
                "content": (
                    f"Extract data matching this JSON schema:\n{schema_json}\n\n"
                    f"Page content:\n{markdown[:8000]}"  # truncate to fit context window
                )
            }
        ]
    )

    raw_json = completion.choices[0].message.content
    parsed = json.loads(raw_json)

    # Validate with Pydantic; raises ValidationError if the LLM output does not fit the schema
    validated = schema.model_validate(parsed)
    return validated.model_dump()


# Usage — mirrors Firecrawl /extract usage
result = extract_structured(
    url="https://www.amazon.com/dp/B0ABCDEF",
    schema=ProductSchema,
    sdo_token="SDO_TOKEN",
    openai_api_key="sk-...",
    model="gpt-4o-mini"
)
print(json.dumps(result, indent=2))

Cost comparison (per page):

Firecrawl /extract: 1 + 4 = 5 credits
SDO + GPT-4o-mini: 1 SDO credit (~$0.001) + ~$0.0005 GPT tokens = ~$0.0015 total
SDO + Claude 3.5 Haiku: 1 SDO credit + ~$0.0008 = ~$0.0018 total

You end up paying roughly the same per page, but you choose your own model and keep full control over the schema.

4. FIRE-1 AI Agent / `/v2/agent` — Autonomous Navigation

What it does in Firecrawl: An LLM-powered agent that receives a prompt, autonomously navigates websites (no seed URL required), decides what pages to visit, and returns extracted data. The /agent endpoint (spark models) and legacy FIRE-1 both fit this pattern.

SDO equivalent: None. SDO is a tool, not an agent.

How to compensate: Build an agent loop using your preferred LLM + SDO as the "fetch" tool. The pattern: LLM decides what URL to fetch -> SDO fetches it -> markdown fed back to LLM -> LLM decides next step.

"""
Minimal SDO-powered web agent using Claude as the reasoning engine.
Replaces Firecrawl /agent for structured research tasks.
"""

import requests
import anthropic
import json

SDO_TOKEN = "SDO_TOKEN"
ANTHROPIC_KEY = "sk-ant-..."


def sdo_fetch(url: str, render: bool = True) -> str:
    """Fetch a URL as markdown via Scrape.do."""
    resp = requests.get("https://api.scrape.do", params={
        "token": SDO_TOKEN,
        "url": url,
        "output": "markdown",
        "render": "true" if render else "false",
        "waitUntil": "networkidle2" if render else None,
    }, timeout=60)
    return resp.text[:6000] if resp.status_code == 200 else f"[Error {resp.status_code}]"


def run_agent(task: str, max_steps: int = 8) -> str:
    """
    Run a web research agent. 
    The LLM decides which URLs to fetch; SDO fetches them.
    Returns the final extracted answer.
    """
    client = anthropic.Anthropic(api_key=ANTHROPIC_KEY)
    tools = [
        {
            "name": "fetch_webpage",
            "description": "Fetch any public webpage and return its content as markdown. Use this to read web pages, product pages, search results, etc.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to fetch"},
                    "render": {"type": "boolean", "description": "Whether to execute JavaScript (needed for SPAs)", "default": True}
                },
                "required": ["url"]
            }
        }
    ]

    messages = [{"role": "user", "content": task}]
    steps = 0

    while steps < max_steps:
        response = client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=2048,
            tools=tools,
            messages=messages
        )

        # Check if agent wants to use the fetch tool
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use" and block.name == "fetch_webpage":
                    url = block.input["url"]
                    render = block.input.get("render", True)
                    print(f"  [Agent fetching via SDO] {url}")
                    content = sdo_fetch(url, render)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": content
                    })

            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
            steps += 1

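        else:
            # "end_turn" or any other stop reason (e.g. "max_tokens") ends the loop;
            # without this catch-all an unhandled stop reason would repeat the same
            # request forever, because steps is only incremented on tool use
            for block in response.content:
                if block.type == "text":
                    return block.text
            break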

    return "Agent did not produce a final answer within step limit."


# Usage — same style as Firecrawl /agent prompt usage
result = run_agent(
    "Find the current price and availability of the iPhone 16 Pro 256GB "
    "on apple.com and return as JSON with fields: price, currency, available."
)
print(result)

5. Open-Source / Self-Hosting

What it does in Firecrawl: Firecrawl is AGPL-3.0 open source at github.com/mendableai/firecrawl. You can self-host the entire stack (scraping workers, Redis, Postgres, Playwright browsers) on your own infrastructure with no per-request costs.

SDO equivalent: Scrape.do is a cloud-only SaaS. There is no self-hosted version.

If self-hosting is a requirement: You have two paths:

Keep using Firecrawl OSS for self-hosted workloads, and route cloud-burst traffic through SDO for overflow or jurisdictions where you need different IPs.
Evaluate alternatives like Crawlee (open-source JS crawling framework) paired with your own proxy pool.
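
The first path (self-hosted first, SDO for overflow) can be sketched as a small router. This is a minimal illustration, not a production pattern: `fetch_with_overflow` and the stub fetchers are hypothetical names, and the "any exception overflows" policy is an assumption you would likely narrow to timeouts or specific status codes.

```python
from typing import Callable, Tuple

Fetcher = Callable[[str], str]  # takes a URL, returns page content


def fetch_with_overflow(url: str, primary: Fetcher, fallback: Fetcher) -> Tuple[str, str]:
    """Try the primary fetcher; on any exception use the fallback.

    Returns (source, content), where source is "primary" or "fallback".
    """
    try:
        return "primary", primary(url)
    except Exception as exc:
        # Assumed policy: every failure overflows to the fallback
        print(f"primary failed for {url}: {exc}; falling back")
        return "fallback", fallback(url)


# Stub fetchers standing in for real HTTP calls (hypothetical, for illustration only)
def stub_self_hosted(url: str) -> str:
    if "blocked" in url:
        raise RuntimeError("self-hosted worker was blocked")
    return f"# self-hosted content for {url}"


def stub_sdo(url: str) -> str:
    return f"# SDO content for {url}"


print(fetch_with_overflow("https://example.com/ok", stub_self_hosted, stub_sdo))
print(fetch_with_overflow("https://example.com/blocked", stub_self_hosted, stub_sdo))
```

In practice, `primary` would wrap the call to your self-hosted Firecrawl instance and `fallback` would wrap the Scrape.do request shown in the Quick Start.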

6. PDF Parsing

What it does in Firecrawl: Parses PDFs (including scanned/OCR) and returns structured text. Three modes: auto, fast, and OCR. Priced at 1 credit per PDF page. The underlying engine was rewritten in Rust in Feb 2026 for 3x speed.

SDO equivalent: None — Scrape.do does not parse PDF content. It can fetch the raw PDF bytes.

How to compensate: Download the PDF bytes via SDO (no render needed), then parse client-side.

import requests
import io

# Option A: pdfplumber (good for text-heavy PDFs)
try:
    import pdfplumber

    def extract_pdf_text_plumber(pdf_url: str, sdo_token: str) -> str:
        """Download and extract text from a PDF using pdfplumber."""
        resp = requests.get("https://api.scrape.do", params={
            "token": sdo_token,
            "url": pdf_url,
            # No render=true — PDFs are binary, no JS needed
        }, timeout=60)
        resp.raise_for_status()

        pages = []
        with pdfplumber.open(io.BytesIO(resp.content)) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                if text:
                    pages.append(text)
        return "\n\n".join(pages)

    text = extract_pdf_text_plumber(
        "https://example.com/report.pdf",
        sdo_token="SDO_TOKEN"
    )
    print(text[:500])

except ImportError:
    print("Install pdfplumber: pip install pdfplumber")


# Option B: PyMuPDF / fitz (faster, also handles scanned PDFs with OCR)
try:
    import fitz  # PyMuPDF

    def extract_pdf_text_pymupdf(pdf_url: str, sdo_token: str) -> str:
        """Download and extract text from a PDF using PyMuPDF."""
        resp = requests.get("https://api.scrape.do", params={
            "token": sdo_token,
            "url": pdf_url,
        }, timeout=60)
        resp.raise_for_status()

        # Document is a context manager, so it is closed once the text is extracted
        with fitz.open(stream=resp.content, filetype="pdf") as doc:
            return "\n\n".join(page.get_text() for page in doc)

    # For scanned PDFs, get_text() returns little or no text. PyMuPDF can run OCR
    # through Tesseract, which must be installed separately:
    #   page.get_text(textpage=page.get_textpage_ocr())
    # Or render the page with page.get_pixmap() and pass the image to pytesseract

except ImportError:
    print("Install PyMuPDF: pip install pymupdf")

Packages:

pip install pdfplumber     # text-focused, good table extraction
pip install pymupdf        # faster, more complete, supports OCR via Tesseract
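
Before handing the bytes to a parser, it can help to confirm the response is actually a PDF, since a blocked or redirected request may return an HTML error page instead. A minimal sketch, assuming only that PDF files begin with the `%PDF-` signature; `looks_like_pdf` is a hypothetical helper name.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF file signature."""
    # Only inspect the head of the payload and tolerate leading whitespace
    return data[:1024].lstrip().startswith(b"%PDF-")


print(looks_like_pdf(b"%PDF-1.7\n%binary..."))          # True
print(looks_like_pdf(b"<!DOCTYPE html><html></html>"))  # False
```

Calling `looks_like_pdf(resp.content)` right after `resp.raise_for_status()` lets you fail with a clear error instead of a parser exception.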

Credit and Cost Comparison

Per-Request Credit Costs

Request Type	Firecrawl	Scrape.do
Basic fetch (datacenter, no browser)	1 credit	1 credit
Browser rendering + datacenter	1 credit (always rendered)	5 credits (`render=true`)
Residential proxy (no browser)	5 credits (`proxy: enhanced`)	10 credits (`super=true`)
Browser + residential proxy	5 credits	25 credits (`render=true` + `super=true`)
Markdown output	1 credit	1 credit (`output=markdown`)
LLM structured extraction	1 + 4 = 5 credits	No built-in — SDO 1cr + LLM API cost
Full-site crawl (per page)	1 credit/page	1-5 credits/page (depends on render)
URL map/discovery	1 credit/call	0 credits (parse sitemap.xml yourself)
Batch/async scrape	Same as scrape	Same as scrape (async pool separate)

Bottom line: For plain HTML fetching without rendering, both services charge 1 credit per request. If you relied on Firecrawl's always-on rendering at 1 credit per page, the same request costs 5 credits on SDO with `render=true`, so your per-request credit cost will be higher. SDO plan pricing is substantially lower, which can offset this for moderate volumes.
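
To turn the table into a monthly figure, multiply each request type by its credit cost. A rough sketch using the Scrape.do column above; the category names and the request mix are illustrative assumptions, not real usage data.

```python
from typing import Dict

# Scrape.do credits per request type, taken from the table above
SDO_CREDITS = {
    "basic": 1,          # datacenter, no browser
    "render": 5,         # render=true
    "super": 10,         # super=true
    "render_super": 25,  # render=true + super=true
}


def estimate_sdo_credits(monthly_requests: Dict[str, int]) -> int:
    """Sum the credits for a monthly request mix keyed by request type."""
    unknown = set(monthly_requests) - set(SDO_CREDITS)
    if unknown:
        raise ValueError(f"Unknown request types: {sorted(unknown)}")
    return sum(SDO_CREDITS[kind] * count for kind, count in monthly_requests.items())


# Illustrative mix: mostly static pages, some rendered, a few behind residential proxies
mix = {"basic": 50_000, "render": 10_000, "super": 2_000, "render_super": 500}
print(estimate_sdo_credits(mix))  # 132500
```

Comparing this total against the Firecrawl credits you currently consume shows how much of the difference comes from rendering alone.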

Plan-Level Comparison

	Firecrawl	Scrape.do
Free	500 credits (one-time)	Free-forever plan (check dashboard)
Entry paid	$16/mo for 3,000 cr	Pay-as-you-go or lower-tier plan
Mid tier	$83/mo for 100,000 cr	Comparable plan at lower cost
Credit rollover	Plan credits do not roll over	Check SDO plan terms
Extra credits	Auto-recharge packs (4/month max)	Pay-as-you-go top-up
Subscription required	Yes — no pure PAYG	Pay-as-you-go available

Firecrawl's most significant pricing advantages are the /map endpoint, at 1 credit for thousands of URLs, and the /crawl endpoint, at 1 credit/page with no rendering surcharge (Firecrawl always renders but charges 1 credit). If your workflow is heavily crawl-based, factor this into your cost model.

Migration Checklist

Work through this list when migrating. Tick each item before deploying to production.

Core Migration

Endpoint Replacement

/v2/batch/scrape -> SDO Async API (https://q.scrape.do/api/v1/jobs)
- Change auth from Authorization header to X-Token header
- Change body format: urls array -> Targets array
- Poll job at GET /api/v1/jobs/{jobID}, retrieve content at GET /api/v1/jobs/{jobID}/{taskID}
/v2/crawl -> Build async crawler loop (use sdo_crawler.py from this guide)
- Decide: do you need render=true on all pages? Turn it off for static content to cut costs to 1cr/page.
- Decide: maximum page depth, excluded path patterns, concurrency
/v2/map -> Fetch sitemap.xml + parse (use discover_urls() from this guide)
/v2/extract (or legacy /v1/extract) -> output=markdown + LLM call (use extract_structured() from this guide)
- Choose your LLM: GPT-4o-mini (cheapest), Claude 3.5 Haiku, Gemini Flash
- Validate with Pydantic or Zod to maintain type safety
/v2/search -> SDO Google Scraper API: GET https://api.scrape.do/plugin/google/search?token=T&q=query
- Returns structured JSON (no HTML parsing needed)
- Note: 84 Google domains supported
- Now part of the broader Google Scraper API which also covers Maps, Shopping, Flights, Hotels, News, Trends (10cr each)
/v2/agent or FIRE-1 -> Build agent loop (use run_agent() from this guide or similar)
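
For the /v2/search replacement, the request is a plain GET with `token` and `q` query parameters. A minimal sketch that builds the URL with the standard library so the query is encoded correctly; `build_search_url` is a hypothetical helper, and the response shape is not shown here because it should be taken from the SDO documentation.

```python
from urllib.parse import urlencode

SDO_GOOGLE_SEARCH = "https://api.scrape.do/plugin/google/search"


def build_search_url(token: str, query: str) -> str:
    """Build the SDO Google search request URL with an encoded query string."""
    return f"{SDO_GOOGLE_SEARCH}?{urlencode({'token': token, 'q': query})}"


print(build_search_url("SDO_TOKEN", "best coffee grinder"))
# https://api.scrape.do/plugin/google/search?token=SDO_TOKEN&q=best+coffee+grinder
```

The resulting URL can be passed to `requests.get(url, timeout=60)` in the same way as the other examples in this guide.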

Features Without Migration Path

PDF parsing — Download PDF bytes via SDO, parse with pdfplumber or pymupdf
Word/Excel document parsing — Download and parse client-side with python-docx, openpyxl
Self-hosting — SDO is cloud-only; keep Firecrawl OSS for self-hosted requirements
MCP server — SDO has no MCP endpoint; use SDO's n8n/Zapier integrations for AI workflow automation

Testing

Test authentication — verify SDO token works and check Scrape.do-Remaining-Credits response header
Compare output — scrape the same 5-10 URLs with both APIs and diff the markdown
Verify geo-targeting — confirm geoCode returns content from the correct region
Test rendering — confirm JS-heavy pages return dynamic content with render=true
Test async API — create a multi-URL job, poll to completion, verify all task results
Run cost estimate — count monthly requests per type (datacenter/render/super) and compare to Firecrawl credits used
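
Two of the checks above are easy to script. The helpers below are a sketch with hypothetical names: `remaining_credits` assumes the Scrape.do-Remaining-Credits header carries a plain integer, and `markdown_similarity` is just one reasonable way to quantify how far two markdown outputs diverge.

```python
import difflib
from typing import Mapping, Optional


def remaining_credits(headers: Mapping[str, str]) -> Optional[int]:
    """Read the Scrape.do-Remaining-Credits header; None if absent or non-numeric."""
    # Lowercase the keys so this also works with a plain dict of headers
    lowered = {key.lower(): value for key, value in headers.items()}
    value = lowered.get("scrape.do-remaining-credits")
    if value is None:
        return None
    try:
        return int(value)  # assumption: the header value is a plain integer
    except ValueError:
        return None


def markdown_similarity(a: str, b: str) -> float:
    """Line-level similarity ratio in [0, 1] between two markdown documents."""
    return difflib.SequenceMatcher(None, a.splitlines(), b.splitlines()).ratio()


print(remaining_credits({"Scrape.do-Remaining-Credits": "4200"}))  # 4200
print(markdown_similarity("a\nb\nc\nd", "a\nb\nc\nx"))             # 0.75
```

With `requests`, pass `resp.headers` to `remaining_credits`. Pages whose similarity falls well below the rest of your sample are the ones worth inspecting by hand.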

Additional Resources

SDO Documentation: scrape.do/documentation
SDO Async API: scrape.do/documentation/async-api
SDO Google Scraper API: scrape.do/documentation/google-scraper-api/search (Maps, Shopping, Flights, Hotels, News, Trends also available — 10cr each)
SDO Amazon Scraper API: scrape.do/documentation/amazon-scraper-api
SDO Dashboard + Token: dashboard.scrape.do
Firecrawl OSS (self-hosted): github.com/mendableai/firecrawl
