Migrate from Firecrawl to Scrape.do

Complete migration guide for switching from Firecrawl to Scrape.do. Covers endpoint mapping, parameter translation, feature gaps, and compensation strategies with working code examples.


What You Are Migrating Between

Before diving into code, understand the fundamental difference: Firecrawl is an AI/LLM-focused data pipeline. It abstracts away the web entirely and returns clean, structured content. Its /crawl, /map, /extract, and /agent endpoints assume you want processed, ready-to-use data — often to feed directly into a language model.

Scrape.do is a traditional scraping API. It gives you raw HTML or markdown from any URL, handles anti-bot systems, rotates proxies, and renders JavaScript, but stops there. Data structuring is your responsibility.

If your Firecrawl usage centered on /scrape with formats: ["markdown"] or formats: ["html"], migration is nearly one-to-one. If you relied heavily on /crawl (full-site spidering), /extract (AI-structured JSON), or the /agent endpoint (autonomous data gathering), you will need to build compensation layers — all of which are covered in this guide.

The tradeoff: you lose Firecrawl's higher-level AI abstractions, and you gain significantly lower per-request costs, 95M+ residential/mobile IPs, precise geo-targeting, structured Amazon and Google SERP APIs, and an async batch system with its own concurrency pool.


Quick Start: Minimal Changes

Firecrawl (before):

import requests

response = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={"url": "https://example.com", "formats": ["markdown"]}
)
markdown = response.json()["data"]["markdown"]

Scrape.do (after):

import requests

response = requests.get(
    "https://api.scrape.do",
    params={
        "token": "SDO_TOKEN",
        "url": "https://example.com",
        "output": "markdown"
    }
)
markdown = response.text

Key differences at a glance:

| | Firecrawl | Scrape.do |
|---|---|---|
| Base URL | https://api.firecrawl.dev/v2/ | https://api.scrape.do |
| Auth | Authorization: Bearer KEY header | token query parameter |
| Method | POST with JSON body | GET with query parameters |
| Default output | Markdown (LLM-ready) | Raw HTML |
| Markdown output | formats: ["markdown"] | output=markdown |
| JS rendering | Always on (proxy auto-routes) | render=true (off by default) |
| Residential proxy | Built-in (proxy: enhanced) | super=true |
| Geo-targeting | location.country in JSON body | geoCode=us query param |

Authentication

| Firecrawl | Scrape.do |
|---|---|
| Authorization: Bearer FC_API_KEY HTTP header | token=SDO_TOKEN query parameter |
| API key from firecrawl.dev/app dashboard | Token from dashboard.scrape.do |
| Required on all requests | Required on all requests |

Firecrawl:

curl -X POST "https://api.firecrawl.dev/v2/scrape" \
  -H "Authorization: Bearer FC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Scrape.do:

curl "https://api.scrape.do/?token=SDO_TOKEN&url=https%3A%2F%2Fexample.com"

Endpoint Mapping

Firecrawl exposes multiple specialized endpoints. Scrape.do uses a single endpoint for page fetching, with separate plugin URLs for structured data.

| Firecrawl Endpoint | Purpose | Scrape.do Equivalent |
|---|---|---|
| POST /v2/scrape | Scrape a single URL | GET api.scrape.do/?token=T&url=U |
| POST /v2/crawl | Spider entire site, return all pages | No direct equivalent: build a crawler loop (see section below) |
| POST /v2/map | Return all URLs on a domain | No direct equivalent: fetch sitemap.xml (see section below) |
| POST /v2/extract (legacy /v1/extract still works) | AI-structured JSON from URL(s) | No direct equivalent: use output=markdown, then call an LLM (see section below) |
| POST /v2/search | Web search + scrape results | SDO's Google Scraper API (/plugin/google/search), which also covers Maps, Shopping, Flights, Hotels, News, and Trends (10 credits each) |
| POST /v2/batch/scrape | Scrape multiple URLs as async job | SDO Async API (https://q.scrape.do/api/v1/jobs) |

Complete Parameter Mapping

Core Parameters

| Firecrawl (POST body) | Scrape.do (query param) | Notes |
|---|---|---|
| url | url | Must be URL-encoded in Scrape.do's API mode |
| formats: ["html"] | (default behavior) | Firecrawl html is processed/cleaned HTML; SDO default returns the target's raw HTML |
| formats: ["rawHtml"] | (default behavior) | SDO already returns the target's unprocessed HTML by default. (transparentResponse=true is unrelated: it only changes how status codes are reported.) |
| formats: ["markdown"] | output=markdown | Both return clean markdown text |
| formats: ["screenshot"] | render=true + screenShot=true + returnJSON=true | All three params required (verified via live API). SDO returns base64 in screenShots[0].image. Firecrawl returns an HTTPS URL to a PNG (not base64). |
| formats: ["links"] | (parse from HTML response) | No direct param; extract links from output=markdown or HTML |
| timeout (ms) | timeout (ms) | Same unit. SDO default: 60000. SDO max: 120000. |
| waitFor (ms) | customWait (ms) | Fixed delay after page load. Same concept, different name. |
| headers | extraHeaders=true + Sd- prefix headers | See Headers section |
| mobile | device=mobile | Renders as mobile browser |
| location.country | geoCode=us | ISO country code, lowercase in SDO |
| location.languages | (no direct equivalent) | Use extraHeaders=true with Sd-Accept-Language header |
| actions (Playwright-like) | playWithBrowser (JSON action array) | See Browser Actions section |
| proxy: "basic" | (default) | Datacenter proxy |
| proxy: "enhanced" | super=true | Residential/mobile proxy |
| onlyMainContent | (no direct equivalent) | Use output=markdown; SDO markdown omits boilerplate naturally |
| blockAds | (no direct equivalent) | Closest option: blockResources=true, which blocks images/CSS/fonts |
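
The mapping above can be applied mechanically when porting existing call sites. The sketch below is a hypothetical helper (firecrawl_to_sdo is not part of either API) that converts a Firecrawl /scrape body into Scrape.do query parameters. It covers only some rows of the table, and it assumes that customWait should be paired with render=true, as in the rendering examples in this guide.

```python
SDO_MAX_TIMEOUT_MS = 120_000  # SDO maximum listed in the table above


def firecrawl_to_sdo(body: dict, token: str) -> dict:
    """Translate a Firecrawl /scrape body into Scrape.do query params (hypothetical helper)."""
    formats = body.get("formats", [])
    if len(formats) > 1:
        # Scrape.do returns one format per request, so multi-format calls must be split.
        raise ValueError("request one format at a time")

    params = {"token": token, "url": body["url"]}
    if "markdown" in formats:
        params["output"] = "markdown"
    elif "screenshot" in formats:
        # Screenshots need all three flags on the Scrape.do side.
        params.update(render="true", screenShot="true", returnJSON="true")

    if "timeout" in body:
        params["timeout"] = str(min(body["timeout"], SDO_MAX_TIMEOUT_MS))
    if "waitFor" in body:
        # Assumption: customWait is a browser-side delay, so rendering is enabled too.
        params["render"] = "true"
        params["customWait"] = str(body["waitFor"])
    if body.get("mobile"):
        params["device"] = "mobile"
    if body.get("proxy") == "enhanced":
        params["super"] = "true"

    country = body.get("location", {}).get("country")
    if country:
        params["geoCode"] = country.lower()  # SDO expects lowercase ISO codes
    return params
```

Passing the result as params= to requests.get("https://api.scrape.do", ...) takes care of URL-encoding the target, so no manual quoting is needed in Python.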

Proxy and Geo-Targeting

Firecrawl routes all requests through proxies by default — you just pick the tier. Scrape.do defaults to datacenter proxies; residential requires super=true.

| Firecrawl | Scrape.do | Notes |
|---|---|---|
| Default (proxy: "auto"): tries basic, escalates if needed | Default (datacenter) | SDO datacenter pool is also rotating and anti-bot-capable |
| proxy: "basic" (fast, 28 countries) | (default) | SDO datacenter: 150+ countries |
| proxy: "enhanced" (residential, US and DK only) | super=true | SDO residential/mobile: 95M+ IPs, 150+ countries |
| location.country: "US" | geoCode=us | Lowercase ISO code in SDO. With datacenter proxy: requires Pro Plan or higher. With super=true: requires Business Plan or higher. |
| (no continent-level targeting) | regionalGeoCode=europe | SDO supports: europe, asia, africa, oceania, northamerica, southamerica. Requires super=true (Business+ plan). |
| (no sticky sessions) | sessionId=12345 | SDO maintains the same IP for up to 5 min of inactivity; range 0-1000000 |

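Firecrawl (enhanced proxy with a German location; the table above lists enhanced coverage as US and DK only, so confirm the country is honored):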

response = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={
        "url": "https://example.de",
        "proxy": "enhanced",
        "location": {"country": "DE"}
    }
)

Scrape.do equivalent:

curl "https://api.scrape.do/?token=SDO_TOKEN&url=https%3A%2F%2Fexample.de&super=true&geoCode=de"

Browser Rendering

Firecrawl always uses a browser (all requests are rendered). In Scrape.do, rendering is opt-in via render=true. This is the most important behavioral difference for direct /scrape migrations.

| Firecrawl | Scrape.do | Notes |
|---|---|---|
| Browser always active | render=true | Add render=true to all requests that relied on Firecrawl's default JS execution |
| waitFor: 2000 | customWait=2000 | Millisecond wait after page load |
| actions: [{type: "wait", selector: ".loaded"}] | waitSelector=.loaded | CSS selector wait |
| (always waits for load) | waitUntil=networkidle0 | SDO options: domcontentloaded, networkidle0, networkidle2, load |
| mobile: true | device=mobile | Mobile browser emulation |
| (no viewport control) | width=390&height=844 | SDO allows explicit viewport size |
| formats: ["screenshot"] | render=true + screenShot=true + returnJSON=true | All three required. SDO returns base64 in screenShots[0].image. Firecrawl returns a temporary HTTPS URL to the PNG (not base64); fetch it separately if you need the bytes. |

Firecrawl (render, then a fixed 3-second wait):

response = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={
        "url": "https://spa-app.example.com",
        "waitFor": 3000,
        "formats": ["html"]
    }
)

Scrape.do equivalent:

curl "https://api.scrape.do/?token=SDO_TOKEN&url=https%3A%2F%2Fspa-app.example.com&render=true&customWait=3000"
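
Because Scrape.do returns screenshots as base64 inside a JSON body rather than as a hosted URL, code that previously downloaded Firecrawl's PNG needs a decode step instead. A minimal sketch, assuming the screenShots[0].image field described in the table above; decode_screenshot is a hypothetical helper name, and the data-URI handling is a defensive assumption rather than documented behavior.

```python
import base64


def decode_screenshot(response_json: dict) -> bytes:
    """Return PNG bytes from a Scrape.do returnJSON screenshot response."""
    image = response_json["screenShots"][0]["image"]
    # Defensive assumption: tolerate a "data:image/png;base64," prefix if one is present.
    if image.startswith("data:"):
        image = image.split(",", 1)[1]
    return base64.b64decode(image)
```

The returned bytes can be written straight to disk with open("page.png", "wb").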

Browser Actions (actions -> playWithBrowser)

Firecrawl actions use Playwright-like objects. Scrape.do playWithBrowser uses a similar JSON array of named action objects. The structure is close enough that most action sequences translate directly.

Firecrawl actions format (valid v2 action types verified via live API: click, wait, screenshot, write, press, scroll, scrape, executeJavascript, pdf):

[
  {"type": "click", "selector": "#accept-cookies"},
  {"type": "wait", "milliseconds": 1000},
  {"type": "scroll", "direction": "down", "amount": 500},
  {"type": "write", "text": "laptop"},
  {"type": "press", "key": "Enter"},
  {"type": "screenshot"}
]

Scrape.do playWithBrowser equivalent:

[
  {"Action": "Click", "Selector": "#accept-cookies"},
  {"Action": "Wait", "Timeout": 1000},
  {"Action": "ScrollY", "Value": 500},
  {"Action": "Fill", "Selector": "#search", "Value": "laptop"},
  {"Action": "Execute", "Execute": "document.querySelector('#search').dispatchEvent(new KeyboardEvent('keydown',{key:'Enter'}))"},
  {"Action": "ScreenShot"}
]

Note: Firecrawl write writes into the focused element (no selector field), and press sends a single key. SDO's Fill requires a selector and replaces the value — the equivalent of "type into focused" is to first Click the field, then Fill. There is no SDO Press action; use Execute to dispatch a KeyboardEvent for single-key presses.

Action Reference

| Firecrawl action type | Scrape.do Action | Notes |
|---|---|---|
| click (requires selector) | Click | selector -> Selector |
| wait with milliseconds | Wait | milliseconds -> Timeout |
| wait with selector (CSS wait) | WaitSelector | {"Action":"WaitSelector","WaitSelector":"#btn","Timeout":5000}; max wait ~10000ms |
| scroll (direction: down, amount: N) | ScrollY | amount -> Value (pixels) |
| scroll (direction: right, amount: N) | ScrollX | amount -> Value |
| write (typing into focused field) | Fill (preceded by Click) | Firecrawl write requires text; SDO Fill requires both Selector and Value |
| press (single key, requires key) | Execute (dispatch KeyboardEvent) | No direct SDO equivalent |
| screenshot | ScreenShot | Requires returnJSON=true AND render=true on the request |
| executeJavascript (requires script) | Execute | script -> Execute (PascalCase field) |
| pdf | (no equivalent) | SDO does not have a "save current page as PDF" action |
| scrape (sub-fetch from inside actions) | (no equivalent) | Make a separate SDO request |
| (no equivalent) | WaitForRequestCompletion | Wait for a network request URL pattern to complete (SDO-only) |
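
If you have many stored Firecrawl action sequences, the table can be turned into a small converter. The sketch below is a hypothetical helper (translate_actions is not provided by either service). It assumes that a write always targets the most recently clicked element, uses the 5000 ms WaitSelector timeout from the table's example, and dispatches key presses on document.activeElement, which is an illustrative choice rather than a documented pattern.

```python
import json


def translate_actions(fc_actions: list[dict]) -> list[dict]:
    """Convert Firecrawl actions into a Scrape.do playWithBrowser array (sketch)."""
    sdo_actions = []
    focused = None  # selector of the last clicked element, reused as the Fill target
    for action in fc_actions:
        kind = action["type"]
        if kind == "click":
            focused = action["selector"]
            sdo_actions.append({"Action": "Click", "Selector": focused})
        elif kind == "wait" and "selector" in action:
            sdo_actions.append({"Action": "WaitSelector",
                                "WaitSelector": action["selector"], "Timeout": 5000})
        elif kind == "wait":
            sdo_actions.append({"Action": "Wait", "Timeout": action["milliseconds"]})
        elif kind == "scroll" and action.get("direction") in ("down", "right"):
            name = "ScrollY" if action["direction"] == "down" else "ScrollX"
            sdo_actions.append({"Action": name, "Value": action["amount"]})
        elif kind == "write":
            if focused is None:
                raise ValueError("write needs a preceding click to supply a selector")
            sdo_actions.append({"Action": "Fill", "Selector": focused, "Value": action["text"]})
        elif kind == "press":
            # No Press action in SDO: dispatch a KeyboardEvent through Execute instead.
            script = ("document.activeElement.dispatchEvent("
                      f"new KeyboardEvent('keydown', {{key: {json.dumps(action['key'])}, bubbles: true}}))")
            sdo_actions.append({"Action": "Execute", "Execute": script})
        elif kind == "screenshot":
            sdo_actions.append({"Action": "ScreenShot"})
        elif kind == "executeJavascript":
            sdo_actions.append({"Action": "Execute", "Execute": action["script"]})
        else:
            # pdf, scrape, and unsupported scroll directions have no mapping in the table.
            raise ValueError(f"no playWithBrowser equivalent for {action!r}")
    return sdo_actions
```

The result can be passed as json.dumps(translate_actions(...)) in the playWithBrowser query parameter, together with render=true.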

Full Python example with browser actions:

import requests, json

actions = [
    {"Action": "Click", "Selector": "#cookie-accept"},
    {"Action": "Wait", "Timeout": 500},
    {"Action": "ScrollY", "Value": 2000},
    {"Action": "WaitSelector", "WaitSelector": ".product-grid", "Timeout": 5000}
]

response = requests.get(
    "https://api.scrape.do",
    params={
        "token": "SDO_TOKEN",
        "url": "https://shop.example.com/category",
        "render": "true",
        "playWithBrowser": json.dumps(actions),
        "output": "markdown"
    }
)
print(response.text)

Headers and Cookies

Headers

Firecrawl accepts a headers object in the POST body. Scrape.do uses HTTP headers with a prefix system.

| Firecrawl | Scrape.do | Notes |
|---|---|---|
| headers: {"Authorization": "Bearer T"} in body | extraHeaders=true + Sd-Authorization: Bearer T HTTP header | Sd- prefix: add/override headers on top of SDO defaults |
| (full header control) | customHeaders=true | Replace ALL headers with your own |
| (no equivalent) | forwardHeaders=true | Forward your request headers as-is to the target |

Firecrawl:

response = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={
        "url": "https://api.example.com/data",
        "headers": {
            "Authorization": "Bearer my-site-token",
            "X-Custom-Header": "value123"
        }
    }
)

Scrape.do:

response = requests.get(
    "https://api.scrape.do",
    params={
        "token": "SDO_TOKEN",
        "url": "https://api.example.com/data",
        "extraHeaders": "true"
    },
    headers={
        "Sd-Authorization": "Bearer my-site-token",
        "Sd-X-Custom-Header": "value123"
    }
)
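
When many call sites pass a Firecrawl-style headers dict, the Sd- prefixing can be centralized. A minimal sketch, assuming the extraHeaders mode shown above; build_sdo_request is a hypothetical helper name.

```python
def build_sdo_request(token: str, url: str, target_headers: dict) -> tuple[dict, dict]:
    """Return (params, headers) for requests.get, prefixing target headers with Sd-."""
    params = {"token": token, "url": url}
    headers = {}
    if target_headers:
        # extraHeaders=true tells SDO to add or override these on top of its defaults.
        params["extraHeaders"] = "true"
        headers = {f"Sd-{name}": value for name, value in target_headers.items()}
    return params, headers
```

Usage: params, headers = build_sdo_request("SDO_TOKEN", "https://api.example.com/data", {"X-Custom-Header": "value123"}), then requests.get("https://api.scrape.do", params=params, headers=headers).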

Cookies

| Firecrawl | Scrape.do | Notes |
|---|---|---|
| No built-in cookie parameter | setCookies=name=value; name2=value2 | URL-encode the cookie string |

Scrape.do with cookies:

curl "https://api.scrape.do/?token=SDO_TOKEN&url=https%3A%2F%2Fexample.com&setCookies=session%3Dabc123%3B%20token%3Dxyz789"
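
Hand-encoding the cookie string is error-prone, so it can help to build the query string from a dict. A minimal sketch using only the standard library; build_cookie_query is a hypothetical helper, and the "; " separator follows the setCookies format shown in the table above.

```python
from urllib.parse import quote, urlencode


def build_cookie_query(token: str, url: str, cookies: dict) -> str:
    """Build a Scrape.do query string with a URL-encoded setCookies value."""
    cookie_string = "; ".join(f"{name}={value}" for name, value in cookies.items())
    # quote_via=quote encodes spaces as %20 (not +), matching the curl example above.
    return urlencode(
        {"token": token, "url": url, "setCookies": cookie_string},
        quote_via=quote,
    )
```

Appending the result to "https://api.scrape.do/?" yields the same request as the curl command above when called with session=abc123 and token=xyz789.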

Output Formats

Firecrawl's primary value is returning multiple formats in one call. Scrape.do returns one format per request.

| Firecrawl formats value | Scrape.do equivalent |
|---|---|
| "markdown" | output=markdown (only effective when target content-type is text/html; PDFs/binary content are not converted) |
| "html" | Default response (no param needed) |
| "rawHtml" | Default response (no param needed); transparentResponse=true is unrelated and only changes how status codes are reported |
| "screenshot" | render=true + screenShot=true + returnJSON=true |
| "links" | Parse from the markdown or HTML response |
| "json" (LLM extraction) | Fetch markdown, then call LLM yourself (see Extract section) |

Firecrawl — multiple formats:

response = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={"url": "https://example.com", "formats": ["markdown", "html", "screenshot"]}
)
data = response.json()["data"]
markdown = data["markdown"]
html = data["html"]
screenshot_url = data["screenshot"]   # Note: HTTPS URL to a hosted PNG, NOT base64.
# To get bytes: png_bytes = requests.get(screenshot_url).content

Scrape.do — screenshot + markdown (two requests):

import requests

params_base = {"token": "SDO_TOKEN", "url": "https://example.com", "render": "true"}

# Get markdown
markdown = requests.get("https://api.scrape.do", params={**params_base, "output": "markdown"}).text

# Get screenshot
screenshot_resp = requests.get("https://api.scrape.do", params={**params_base, "screenShot": "true", "returnJSON": "true"})
screenshot_base64 = screenshot_resp.json()["screenShots"][0]["image"]
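
Neither request above returns a link list, so the Firecrawl "links" format has to be rebuilt client-side from the HTML response. A minimal sketch using the standard-library HTML parser; extract_links_from_html is a hypothetical helper, and skipping fragment, javascript:, and mailto: links is an assumption about what you want to keep.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect absolute, de-duplicated href values from <a> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            return
        absolute = urljoin(self.base_url, href)  # resolve relative links
        if absolute not in self.links:
            self.links.append(absolute)


def extract_links_from_html(base_url: str, html: str) -> list[str]:
    collector = _LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Call it with the page URL and the default (HTML) Scrape.do response body, for example extract_links_from_html("https://example.com", response.text).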

Async API

Firecrawl's /v2/batch/scrape runs multiple URLs asynchronously. Scrape.do has a dedicated Async API at https://q.scrape.do with its own concurrency pool (30% of your plan limit, separate from the main API pool — it does not reduce your main concurrency).

SDO Async API now supports a Plugin mode that batches up to 1000 structured-data params per job (Amazon, Google search/maps/shopping/flights/hotels/news/trends, plus walmart/store and lowes/store). See async-api/plugins.

Firecrawl batch scrape:

import requests, time

# Submit batch
resp = requests.post(
    "https://api.firecrawl.dev/v2/batch/scrape",
    headers={"Authorization": "Bearer FC_API_KEY"},
    json={"urls": ["https://example.com/page1", "https://example.com/page2"], "formats": ["markdown"]}
)
batch_id = resp.json()["id"]

# Poll until done
while True:
    status = requests.get(
        f"https://api.firecrawl.dev/v2/batch/scrape/{batch_id}",
        headers={"Authorization": "Bearer FC_API_KEY"}
    ).json()
    if status["status"] == "completed":
        break
    time.sleep(3)

pages = status["data"]

Scrape.do Async API equivalent:

import requests, time

ASYNC_BASE = "https://q.scrape.do/api/v1"
HEADERS = {"X-Token": "SDO_TOKEN", "Content-Type": "application/json"}

# Submit job (raw HTML by default; per-target options like markdown are not exposed in the async body)
resp = requests.post(f"{ASYNC_BASE}/jobs", headers=HEADERS, json={
    "Targets": ["https://example.com/page1", "https://example.com/page2"],
    "Super": False,
    "GeoCode": "us"
})
job = resp.json()
job_id = job["JobID"]
task_ids = job["TaskIDs"]

# Poll for completion
while True:
    status = requests.get(f"{ASYNC_BASE}/jobs/{job_id}", headers=HEADERS).json()
    if status["Status"] in ("success", "error", "canceled"):
        break
    time.sleep(2)

# Retrieve results per task
for task_id in task_ids:
    result = requests.get(f"{ASYNC_BASE}/jobs/{job_id}/{task_id}", headers=HEADERS).json()
    print(result["Content"])
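
The polling loop above never exits if a job stalls before reaching a terminal state. The sketch below bounds the wait with a deadline. poll_until_done is a hypothetical helper; the terminal status names are the ones used in the example above, and the 300-second default is an arbitrary assumption.

```python
import time
from typing import Callable

TERMINAL_STATES = {"success", "error", "canceled"}  # as used in the polling example above


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_status until the job is terminal or the deadline passes."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status.get("Status") in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status.get('Status')!r} after {timeout}s")
        sleep(interval)
```

Usage: poll_until_done(lambda: requests.get(f"{ASYNC_BASE}/jobs/{job_id}", headers=HEADERS).json()). The sleep and clock parameters exist only to make the helper testable without real delays.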

Async API with webhook (production pattern):

# Submit once — results delivered to your server when ready
resp = requests.post(f"{ASYNC_BASE}/jobs", headers=HEADERS, json={
    "Targets": ["https://example.com/page1", "https://example.com/page2"],
    "WebhookURL": "https://your-server.com/webhook/sdo",
    "WebhookHeaders": {"Authorization": "Bearer your-webhook-secret"}
})
print("Job ID:", resp.json()["JobID"])
# No polling needed — your webhook endpoint receives results automatically

Features You Gain in Scrape.do

These capabilities are unavailable in Firecrawl or significantly superior in Scrape.do.

| Feature | Parameter | What you get | Docs |
|---|---|---|---|
| Residential/mobile proxies (95M+ IPs) | super=true | Access to residential and mobile IP pool across 150+ countries. Firecrawl's enhanced proxy is US/DK only. | docs |
| Geo-targeting: 150+ countries | geoCode=us | Country-level proxy routing. Firecrawl basic supports 28 countries; enhanced supports US and DK only. | docs |
| Geo-targeting: continent-level | regionalGeoCode=europe | Route via entire continent (super proxy only). Values: europe, asia, africa, oceania, northamerica, southamerica. No Firecrawl equivalent. | docs |
| Sticky sessions (same IP) | sessionId=12345 | Maintain same IP for multi-step flows. Same integer = same IP for up to 5 min inactivity. Firecrawl has no sticky IP. | docs |
| Postal/ZIP-code targeting | postalcode=10001 (or zipcode=) | Target a specific postal/ZIP code within a country. Requires super=true AND geoCode. Supported in 12 countries: us, gb, de, fr, ca, au, in, nl, it, es, br, jp. Send codes without spaces (e.g. SW1A1AA not SW1A 1AA). | docs |
| Google Scraper API | GET /plugin/google/search?q=... | Pre-parsed SERP JSON: organic results, ads, knowledge graph, local pack, AI Overview, and 10+ more types. 84 Google domains, 150+ languages, 240+ country codes. The broader Google Scraper API also covers Maps, Shopping, Flights, Hotels, News, and Trends (10 credits each). | docs |
| Google AI Mode | GET /plugin/google/search/ai-mode?q=... | Google's full conversational AI response with references and shopping results as structured JSON. 10 credits. | docs |
| Google Maps | GET /plugin/google/maps/search?q=... | Structured Maps places list with location pinning via ll=@lat,lng,zoom. Also /place and /reviews for place details and review pagination; Firecrawl has no equivalent. 10 credits. | docs |
| Amazon Scraper API | GET /plugin/amazon/pdp?asin=... | Structured product data (ASIN, title, price, ratings, images, specs) from 21 Amazon marketplaces with ZIP-code geo-targeting. 1 credit per request; 1 concurrent request per token. | docs |
| Amazon offer listing | GET /plugin/amazon/offer-listing?asin=... | All seller offers with prices and shipping info for a given ASIN. | docs |
| Async API with separate concurrency pool | https://q.scrape.do | Job-based API with its own 30% concurrency pool; it runs independently from your main API and doesn't cut into your main quota. | docs |
| Device emulation | device=mobile\|tablet\|desktop | Render as a specific device type. Controls both User-Agent and viewport. | docs |
| Viewport control | width=390&height=844 | Set exact browser viewport dimensions. | docs |
| Full-page screenshots | fullScreenShot=true | Capture the entire page (not just the viewport). | docs |
| Partial screenshots | particularScreenShot=.selector | Screenshot of a specific CSS selector element. | docs |
| Network idle waiting | waitUntil=networkidle0 | Wait until all network requests are finished before returning. Options: domcontentloaded, networkidle0, networkidle2, load. | docs |
| WebSocket capture | showWebsocketRequests=true | Capture WebSocket frames alongside XHR/Fetch in the JSON response. Requires render=true + returnJSON=true. | docs |
| Pure cookies | pureCookies=true | Returns raw Set-Cookie headers from target unmodified. | docs |
| Retry control | retryTimeout=15000, disableRetry=true | Configure or disable built-in retry mechanism. | docs |
| Disable redirect following | disableRedirection=true | Returns the raw 3xx response without following. | docs |
| WaitForRequestCompletion | playWithBrowser action | Wait until a specific network URL pattern completes; ideal for dynamically loaded data. | docs |

Features You Leave Behind (and How to Compensate)

Firecrawl has several AI/pipeline features with no direct Scrape.do counterpart. Each section below describes the gap and provides a working compensation strategy.


1. Crawl Endpoint — Full-Site Spider (POST /v2/crawl)

What it does in Firecrawl: Submits a domain, Firecrawl recursively follows all internal links, renders each page, and returns the full site as an array of markdown documents. Pagination and link discovery are handled automatically. 1 credit per page.

SDO equivalent: None — Scrape.do is a per-URL API. It fetches one URL at a time.

How to compensate: Build a crawl loop yourself. The pattern: seed URL -> extract links -> deduplicate -> queue -> fetch each URL via SDO. The example below is a working starting point that uses asyncio and aiohttp to fetch pages concurrently through SDO.

"""
sdo_crawler.py — Async site crawler using Scrape.do

Usage:
    python sdo_crawler.py --url https://example.com --max-pages 100 --concurrency 5
"""

import asyncio
import aiohttp
import re
import json
import argparse
from collections import deque
from urllib.parse import urljoin, urlparse

SDO_TOKEN = "SDO_TOKEN"
SDO_BASE = "https://api.scrape.do"


def normalize_url(url: str) -> str:
    """Remove fragment and trailing slash inconsistencies."""
    parsed = urlparse(url)
    normalized = parsed._replace(fragment="").geturl()
    return normalized.rstrip("/")


def extract_links(base_url: str, html_or_markdown: str) -> list[str]:
    """Extract all internal links from HTML or markdown content."""
    base = urlparse(base_url)
    links = set()

    # Match href attributes in HTML
    href_pattern = re.compile(r'href=["\']([^"\'#][^"\']*)["\']', re.IGNORECASE)
    # Match markdown links [text](url), absolute or relative
    md_pattern = re.compile(r'\[[^\]]*\]\(([^)\s#][^)\s]*)')

    candidates = [m.group(1) for m in href_pattern.finditer(html_or_markdown)]
    candidates += [m.group(1) for m in md_pattern.finditer(html_or_markdown)]

    for candidate in candidates:
        # Resolve relative links against the page URL before filtering
        url = urljoin(base_url, candidate)
        parsed = urlparse(url)
        # Keep only http(s) links on the same host as the page
        if parsed.netloc == base.netloc and parsed.scheme in ("http", "https"):
            links.add(normalize_url(url))

    return list(links)


async def fetch_page(
    session: aiohttp.ClientSession,
    url: str,
    render: bool = True,
    super_proxy: bool = False,
    geo_code: str = None,
) -> dict:
    """Fetch a single URL via Scrape.do and return status + content."""
    params = {
        "token": SDO_TOKEN,
        "url": url,
        "output": "markdown",
    }
    if render:
        params["render"] = "true"
        params["waitUntil"] = "networkidle2"
    if super_proxy:
        params["super"] = "true"
    if geo_code:
        params["geoCode"] = geo_code

    try:
        async with session.get(SDO_BASE, params=params, timeout=aiohttp.ClientTimeout(total=90)) as resp:
            content = await resp.text()
            return {
                "url": url,
                "status": resp.status,
                "content": content if resp.status == 200 else "",
                "error": None if resp.status == 200 else f"HTTP {resp.status}",
            }
    except asyncio.TimeoutError:
        return {"url": url, "status": None, "content": "", "error": "timeout"}
    except Exception as e:
        return {"url": url, "status": None, "content": "", "error": str(e)}


async def crawl(
    seed_url: str,
    max_pages: int = 100,
    concurrency: int = 5,
    render: bool = True,
    super_proxy: bool = False,
    geo_code: str = None,
) -> list[dict]:
    """
    Crawl a website starting from seed_url.
    Returns a list of dicts: {url, status, content, error}
    """
    seed_url = normalize_url(seed_url)
    base_domain = urlparse(seed_url).netloc

    visited = set()
    queue = deque([seed_url])
    results = []
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded_fetch(session, url):
        async with semaphore:
            return await fetch_page(session, url, render, super_proxy, geo_code)

    connector = aiohttp.TCPConnector(limit=concurrency * 2)
    async with aiohttp.ClientSession(connector=connector) as session:
        while queue and len(visited) < max_pages:
            # Drain current batch
            batch = []
            while queue and len(batch) < concurrency and len(visited) + len(batch) < max_pages:
                url = queue.popleft()
                if url not in visited:
                    visited.add(url)
                    batch.append(url)

            if not batch:
                break

            print(f"Fetching batch of {len(batch)} URLs | Total visited: {len(visited)}/{max_pages}")

            tasks = [bounded_fetch(session, url) for url in batch]
            batch_results = await asyncio.gather(*tasks)

            for result in batch_results:
                results.append(result)
                if result["content"]:
                    # Discover new links and add to queue
                    new_links = extract_links(result["url"], result["content"])
                    for link in new_links:
                        if link not in visited and urlparse(link).netloc == base_domain:
                            queue.append(link)

    print(f"Crawl complete. {len(results)} pages fetched.")
    return results


# --- Main ---

async def main():
    parser = argparse.ArgumentParser(description="Crawl a site using Scrape.do")
    parser.add_argument("--url", required=True, help="Seed URL to start crawling from")
    parser.add_argument("--max-pages", type=int, default=100, help="Max pages to crawl")
    parser.add_argument("--concurrency", type=int, default=5, help="Concurrent requests")
    parser.add_argument("--no-render", action="store_true", help="Disable JS rendering (faster, cheaper)")
    parser.add_argument("--super", action="store_true", help="Use residential proxies")
    parser.add_argument("--geo", default=None, help="Geo code, e.g. us, de, gb")
    parser.add_argument("--output", default="crawl_results.json", help="Output JSON file")
    args = parser.parse_args()

    pages = await crawl(
        seed_url=args.url,
        max_pages=args.max_pages,
        concurrency=args.concurrency,
        render=not args.no_render,
        super_proxy=args.super,
        geo_code=args.geo,
    )

    with open(args.output, "w", encoding="utf-8") as f:
        json.dump(pages, f, ensure_ascii=False, indent=2)

    print(f"Results saved to {args.output}")
    success = sum(1 for p in pages if p["status"] == 200)
    print(f"Success: {success}/{len(pages)} pages")


if __name__ == "__main__":
    asyncio.run(main())

Usage:

# Basic crawl (JS rendering on, datacenter proxies)
python sdo_crawler.py --url https://example.com --max-pages 50

# Fast static site (no rendering, cheaper)
python sdo_crawler.py --url https://docs.example.com --max-pages 200 --no-render

# E-commerce with residential proxy from US
python sdo_crawler.py --url https://shop.example.com --max-pages 100 --super --geo us

Credit cost comparison:

  • Firecrawl crawl: 1 credit/page
  • SDO datacenter (no render): 1 credit/page — same cost
  • SDO with render=true: 5 credits/page — more expensive, but you control when rendering is needed
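
To see how the render surcharge affects a mixed crawl, the per-page rates above can be combined into a quick estimate. A sketch using only the datacenter rates listed in this comparison (1 credit without rendering, 5 with render=true); residential rates are not covered, and estimate_sdo_credits is a hypothetical helper.

```python
CREDITS_STATIC = 1    # SDO datacenter, no render
CREDITS_RENDERED = 5  # SDO datacenter with render=true


def estimate_sdo_credits(pages: int, rendered_fraction: float) -> int:
    """Estimate SDO credits when only some pages need JS rendering."""
    rendered = round(pages * rendered_fraction)
    static = pages - rendered
    return static * CREDITS_STATIC + rendered * CREDITS_RENDERED
```

For example, a 1,000-page crawl where 20% of pages need rendering comes to 800 + 200 x 5 = 1,800 SDO credits, against 1,000 Firecrawl credits at 1 credit/page.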

2. Map Endpoint — URL Discovery (POST /v2/map)

What it does in Firecrawl: Queries a domain and returns a comprehensive URL list using sitemap + SERP + previous crawl data. 1 credit per call regardless of how many URLs are returned — extraordinarily cheap.

SDO equivalent: None.

How to compensate: Fetch sitemap.xml and parse it. Most sites follow the standard; many have a robots.txt that points to multiple sitemap files.

import requests
from xml.etree import ElementTree as ET
from urllib.parse import urlparse

def discover_urls(domain: str, sdo_token: str, max_urls: int = 5000) -> list[str]:
    """
    Discover all URLs on a domain by fetching and parsing its sitemap(s).
    Reads Sitemap directives from robots.txt first and falls back to
    /sitemap.xml when none are found.
    """
    base = f"https://{domain}" if not domain.startswith("http") else domain
    parsed = urlparse(base)
    root = f"{parsed.scheme}://{parsed.netloc}"

    def fetch_via_sdo(url: str) -> requests.Response:
        return requests.get("https://api.scrape.do", params={
            "token": sdo_token,
            "url": url
        }, timeout=30)

    def parse_sitemap(xml_text: str) -> tuple[list[str], list[str]]:
        """Returns (page_urls, nested_sitemap_urls)."""
        try:
            root_el = ET.fromstring(xml_text)
        except ET.ParseError:
            return [], []

        ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        pages, sitemaps = [], []

        # sitemap index
        for loc in root_el.findall(".//sm:sitemap/sm:loc", ns):
            sitemaps.append(loc.text.strip())

        # regular sitemap
        for loc in root_el.findall(".//sm:url/sm:loc", ns):
            pages.append(loc.text.strip())

        return pages, sitemaps

    # Step 1: Check robots.txt for Sitemap directives
    sitemap_urls = []
    try:
        robots = fetch_via_sdo(f"{root}/robots.txt")
        if robots.status_code == 200:
            for line in robots.text.splitlines():
                if line.lower().startswith("sitemap:"):
                    sitemap_urls.append(line.split(":", 1)[1].strip())
    except Exception:
        pass

    if not sitemap_urls:
        sitemap_urls = [f"{root}/sitemap.xml"]

    # Step 2: Fetch and parse sitemaps (handle sitemap indexes)
    all_urls = []
    visited_sitemaps = set()
    queue = sitemap_urls

    while queue and len(all_urls) < max_urls:
        sitemap_url = queue.pop(0)
        if sitemap_url in visited_sitemaps:
            continue
        visited_sitemaps.add(sitemap_url)

        try:
            resp = fetch_via_sdo(sitemap_url)
            if resp.status_code != 200:
                continue
            pages, nested = parse_sitemap(resp.text)
            all_urls.extend(pages)
            queue.extend(nested)
        except Exception as e:
            print(f"Failed to parse {sitemap_url}: {e}")

    return list(set(all_urls))[:max_urls]


# Usage
urls = discover_urls("example.com", sdo_token="SDO_TOKEN")
print(f"Discovered {len(urls)} URLs")
for url in urls[:10]:
    print(url)

3. Extract Endpoint — LLM-Structured JSON (POST /v2/extract, legacy /v1/extract)

What it does in Firecrawl: Takes a URL and a JSON schema (Pydantic/Zod compatible), scrapes the page, passes it through an LLM, and returns structured data matching the schema. Cost: 1 base credit + 4 LLM credits per page.

SDO equivalent: None — Scrape.do has no built-in LLM layer.

How to compensate: Fetch markdown from SDO, then pass it to your own LLM. You choose the model, schema enforcement library, and cost structure. This two-step approach is actually more flexible: you can use Claude, GPT-4o-mini, Gemini, or a local model.

import requests
import json
from openai import OpenAI
from pydantic import BaseModel
from typing import Optional

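# Define your schema with Pydantic (plays the same role as Firecrawl's schema param)
class ProductSchema(BaseModel):
    name: str
    # Default optional fields to None so validation still passes
    # when the LLM omits a key instead of returning null
    price: Optional[float] = None
    currency: Optional[str] = None
    rating: Optional[float] = None
    review_count: Optional[int] = None
    in_stock: Optional[bool] = None
    description: Optional[str] = None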

def extract_structured(
    url: str,
    schema: type[BaseModel],
    sdo_token: str,
    openai_api_key: str,
    render: bool = True,
    model: str = "gpt-4o-mini"
) -> dict:
    """
    Firecrawl /extract equivalent using SDO + OpenAI.
    Returns parsed dict matching the schema.
    """
    # Step 1: Fetch page as clean markdown via SDO
    params = {
        "token": sdo_token,
        "url": url,
        "output": "markdown",
    }
    if render:
        params["render"] = "true"

    resp = requests.get("https://api.scrape.do", params=params, timeout=60)
    resp.raise_for_status()
    markdown = resp.text

    if not markdown.strip():
        raise ValueError(f"Empty response from {url}")

    # Step 2: Build schema description for the prompt
    schema_json = json.dumps(schema.model_json_schema(), indent=2)

    # Step 3: Pass to LLM for extraction
    client = OpenAI(api_key=openai_api_key)
    completion = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a data extraction assistant. "
                    "Extract structured data from the provided web page content. "
                    "Return only valid JSON matching the requested schema. "
                    "Use null for fields you cannot find."
                )
            },
            {
                "role": "user",
                "content": (
                    f"Extract data matching this JSON schema:\n{schema_json}\n\n"
                    f"Page content:\n{markdown[:8000]}"  # truncate to fit context window
                )
            }
        ]
    )

    raw_json = completion.choices[0].message.content
    parsed = json.loads(raw_json)

    # Validate against the Pydantic schema (raises ValidationError on a mismatch)
    validated = schema.model_validate(parsed)
    return validated.model_dump()


# Usage — mirrors Firecrawl /extract usage
result = extract_structured(
    url="https://www.amazon.com/dp/B0ABCDEF",
    schema=ProductSchema,
    sdo_token="SDO_TOKEN",
    openai_api_key="sk-...",
    model="gpt-4o-mini"
)
print(json.dumps(result, indent=2))

Cost comparison (per page):

  • Firecrawl /extract: 1 + 4 = 5 credits
  • SDO + GPT-4o-mini: 1 SDO credit (~$0.001) + ~$0.0005 GPT tokens = ~$0.0015 total
  • SDO + Claude 3.5 Haiku: 1 SDO credit + ~$0.0008 = ~$0.0018 total

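You end up paying roughly the same, but you choose your own model and keep full schema flexibility. The figures above assume a 1-credit SDO fetch; extract_structured() defaults to render=True, which costs 5 SDO credits per page, so pass render=False for static pages.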
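The arithmetic in this comparison can be wrapped in a small estimator. This is a minimal sketch, not part of either API: `estimate_extract_cost` is a hypothetical helper, and its default dollar figures are the approximate ones quoted in the list above. Substitute your actual plan and model pricing.

```python
def estimate_extract_cost(
    pages: int,
    sdo_credits_per_page: int = 1,      # use 5 when fetching with render=true
    usd_per_sdo_credit: float = 0.001,  # approximate figure quoted above
    llm_usd_per_page: float = 0.0005,   # approximate GPT-4o-mini figure quoted above
) -> float:
    """Estimated USD cost of fetching `pages` pages via SDO and extracting with an LLM."""
    per_page = sdo_credits_per_page * usd_per_sdo_credit + llm_usd_per_page
    return round(pages * per_page, 6)


# Compare static and rendered fetches for the same monthly volume
print(estimate_extract_cost(10_000))
print(estimate_extract_cost(10_000, sdo_credits_per_page=5))
```

Running both variants side by side shows how much of the per-page cost comes from rendering rather than from the LLM call.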


4. FIRE-1 AI Agent / /v2/agent — Autonomous Navigation

What it does in Firecrawl: An LLM-powered agent that receives a prompt, autonomously navigates websites (no seed URL required), decides what pages to visit, and returns extracted data. The /agent endpoint (spark models) and legacy FIRE-1 both fit this pattern.

SDO equivalent: None. SDO is a tool, not an agent.

How to compensate: Build an agent loop using your preferred LLM + SDO as the "fetch" tool. The pattern: LLM decides what URL to fetch -> SDO fetches it -> markdown fed back to LLM -> LLM decides next step.

"""
Minimal SDO-powered web agent using Claude as the reasoning engine.
Replaces Firecrawl /agent for structured research tasks.
"""

import requests
import anthropic
import json

SDO_TOKEN = "SDO_TOKEN"
ANTHROPIC_KEY = "sk-ant-..."


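def sdo_fetch(url: str, render: bool = True) -> str:
    """Fetch a URL as markdown via Scrape.do (truncated to 6,000 characters)."""
    params = {"token": SDO_TOKEN, "url": url, "output": "markdown"}
    if render:
        # Only send browser options when rendering is requested
        params["render"] = "true"
        params["waitUntil"] = "networkidle2"
    try:
        resp = requests.get("https://api.scrape.do", params=params, timeout=60)
    except requests.RequestException as e:
        # Report network failures to the LLM instead of crashing the agent loop
        return f"[Error: {e}]"
    return resp.text[:6000] if resp.status_code == 200 else f"[Error {resp.status_code}]"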


def run_agent(task: str, max_steps: int = 8) -> str:
    """
    Run a web research agent. 
    The LLM decides which URLs to fetch; SDO fetches them.
    Returns the final extracted answer.
    """
    client = anthropic.Anthropic(api_key=ANTHROPIC_KEY)
    tools = [
        {
            "name": "fetch_webpage",
            "description": "Fetch any public webpage and return its content as markdown. Use this to read web pages, product pages, search results, etc.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "The URL to fetch"},
                    "render": {"type": "boolean", "description": "Whether to execute JavaScript (needed for SPAs)", "default": True}
                },
                "required": ["url"]
            }
        }
    ]

    messages = [{"role": "user", "content": task}]

    # Bound the number of LLM calls so an unexpected stop reason cannot loop forever
    for _ in range(max_steps):
        response = client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=2048,
            tools=tools,
            messages=messages
        )

        # Any stop reason other than tool_use (end_turn, max_tokens, ...) ends the run
        if response.stop_reason != "tool_use":
            text = "".join(b.text for b in response.content if b.type == "text")
            return text or "Agent stopped without producing a text answer."

        # The agent requested one or more fetches: run each through SDO
        tool_results = []
        for block in response.content:
            if block.type == "tool_use" and block.name == "fetch_webpage":
                url = block.input["url"]
                render = block.input.get("render", True)
                print(f"  [Agent fetching via SDO] {url}")
                content = sdo_fetch(url, render)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": content
                })

        # Feed the assistant turn and the tool results back for the next step
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

    return "Agent did not produce a final answer within step limit."


# Usage — same style as Firecrawl /agent prompt usage
result = run_agent(
    "Find the current price and availability of the iPhone 16 Pro 256GB "
    "on apple.com and return as JSON with fields: price, currency, available."
)
print(result)

5. Open-Source / Self-Hosting

What it does in Firecrawl: Firecrawl is AGPL-3.0 open source at github.com/mendableai/firecrawl. You can self-host the entire stack (scraping workers, Redis, Postgres, Playwright browsers) on your own infrastructure with no per-request costs.

SDO equivalent: Scrape.do is a cloud-only SaaS. There is no self-hosted version.

If self-hosting is a requirement: You have two paths:

  1. Keep using Firecrawl OSS for self-hosted workloads, and route cloud-burst traffic through SDO for overflow or jurisdictions where you need different IPs.
  2. Evaluate alternatives like Crawlee (open-source JS crawling framework) paired with your own proxy pool.
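Path 1 can be sketched as a thin router that tries the self-hosted instance first and falls back to SDO. This is an illustration only: `fetch_self_hosted` and `fetch_sdo` are hypothetical callables that you would implement against your own Firecrawl deployment and the SDO API, and the routing rule (fall back on any exception, or go straight to SDO for listed country codes) is an assumption about what "overflow" means for your workload.

```python
from typing import Callable, Iterable, Optional

Fetcher = Callable[[str], str]  # takes a URL, returns page content


def make_router(
    fetch_self_hosted: Fetcher,
    fetch_sdo: Fetcher,
    sdo_only_countries: Iterable[str] = (),
) -> Callable[..., str]:
    """Build a fetch function that prefers self-hosted and overflows to SDO."""
    sdo_only = {c.lower() for c in sdo_only_countries}

    def fetch(url: str, country: Optional[str] = None) -> str:
        # Requests that need IPs in specific regions go straight to SDO
        if country and country.lower() in sdo_only:
            return fetch_sdo(url)
        try:
            return fetch_self_hosted(url)
        except Exception:
            # Self-hosted worker failed or is saturated: overflow to SDO
            return fetch_sdo(url)

    return fetch
```

Keeping the two fetchers injectable means the routing policy can be tested with stubs before any real traffic is switched over.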

6. PDF Parsing

What it does in Firecrawl: Parses PDFs (including scanned/OCR) and returns structured text. Three modes: auto, fast, and OCR. Priced at 1 credit per PDF page. The underlying engine was rewritten in Rust in Feb 2026 for 3x speed.

SDO equivalent: None — Scrape.do does not parse PDF content. It can fetch the raw PDF bytes.

How to compensate: Download the PDF bytes via SDO (no render needed), then parse client-side.

import requests
import io

# Option A: pdfplumber (good for text-heavy PDFs)
try:
    import pdfplumber

    def extract_pdf_text_plumber(pdf_url: str, sdo_token: str) -> str:
        """Download and extract text from a PDF using pdfplumber."""
        resp = requests.get("https://api.scrape.do", params={
            "token": sdo_token,
            "url": pdf_url,
            # No render=true — PDFs are binary, no JS needed
        }, timeout=60)
        resp.raise_for_status()

        pages = []
        with pdfplumber.open(io.BytesIO(resp.content)) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                if text:
                    pages.append(text)
        return "\n\n".join(pages)

    text = extract_pdf_text_plumber(
        "https://example.com/report.pdf",
        sdo_token="SDO_TOKEN"
    )
    print(text[:500])

except ImportError:
    print("Install pdfplumber: pip install pdfplumber")


# Option B: PyMuPDF / fitz (faster, also handles scanned PDFs with OCR)
try:
    import fitz  # PyMuPDF

    def extract_pdf_text_pymupdf(pdf_url: str, sdo_token: str) -> str:
        """Download and extract text from a PDF using PyMuPDF."""
        resp = requests.get("https://api.scrape.do", params={
            "token": sdo_token,
            "url": pdf_url,
        }, timeout=60)
        resp.raise_for_status()

        # Close the document when done to release the in-memory buffer
        with fitz.open(stream=resp.content, filetype="pdf") as doc:
            return "\n\n".join(page.get_text() for page in doc)

    # Scanned PDFs need OCR. PyMuPDF can call a locally installed Tesseract:
    #   tp = page.get_textpage_ocr(language="eng", dpi=300)
    #   text = page.get_text(textpage=tp)
    # Alternatively, render each page with page.get_pixmap() and pass the image to pytesseract.

except ImportError:
    print("Install PyMuPDF: pip install pymupdf")

Packages:

pip install pdfplumber     # text-focused, good table extraction
pip install pymupdf        # faster, more complete; OCR requires a local Tesseract install

Credit and Cost Comparison

Per-Request Credit Costs

| Request Type | Firecrawl | Scrape.do |
|---|---|---|
| Basic fetch (datacenter, no browser) | 1 credit | 1 credit |
| Browser rendering + datacenter | 1 credit (always rendered) | 5 credits (render=true) |
| Residential proxy (no browser) | 5 credits (proxy: enhanced) | 10 credits (super=true) |
| Browser + residential proxy | 5 credits | 25 credits (render=true + super=true) |
| Markdown output | 1 credit | 1 credit (output=markdown) |
| LLM structured extraction | 1 + 4 = 5 credits | No built-in; SDO 1 credit + LLM API cost |
| Full-site crawl (per page) | 1 credit/page | 1-5 credits/page (depends on render) |
| URL map/discovery | 1 credit/call | 0 credits (parse sitemap.xml yourself) |
| Batch/async scrape | Same as scrape | Same as scrape (async pool separate) |

Bottom line: For plain HTML fetches without rendering, per-request credit costs are equal. If you relied on Firecrawl's always-on rendering at 1 credit per page, the same requests cost 5 credits each on SDO with render=true, so your per-request cost will be higher. SDO plan pricing is substantially lower, which can offset this for moderate volumes.
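To see how the per-request multipliers add up for your own traffic mix, the Scrape.do column of the table above can be encoded directly. This is a rough sketch: `estimate_sdo_credits` and the request-type keys are hypothetical names chosen for illustration, and the multipliers are copied from the table rather than read from the API.

```python
# Credit multipliers copied from the Scrape.do column of the table above
SDO_CREDITS = {
    "datacenter": 1,     # basic fetch, no browser
    "render": 5,         # render=true
    "super": 10,         # super=true (residential proxy)
    "render_super": 25,  # render=true + super=true
}


def estimate_sdo_credits(monthly_requests: dict[str, int]) -> int:
    """Sum the SDO credits for a monthly request count per request type."""
    total = 0
    for kind, count in monthly_requests.items():
        if kind not in SDO_CREDITS:
            raise ValueError(f"Unknown request type: {kind}")
        total += SDO_CREDITS[kind] * count
    return total


# Example mix: mostly static pages, some rendered, a few behind residential proxies
mix = {"datacenter": 50_000, "render": 10_000, "render_super": 500}
print(estimate_sdo_credits(mix))
```

Compare the result against the Firecrawl credits you used for the same period to judge whether turning off render for static pages is worth the effort.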

Plan-Level Comparison

| Plan feature | Firecrawl | Scrape.do |
|---|---|---|
| Free | 500 credits (one-time) | Free-forever plan (check dashboard) |
| Entry paid | $16/mo for 3,000 cr | Pay-as-you-go or lower-tier plan |
| Mid tier | $83/mo for 100,000 cr | Comparable plan at lower cost |
| Credit rollover | Plan credits do not roll over | Check SDO plan terms |
| Extra credits | Auto-recharge packs (4/month max) | Pay-as-you-go top-up |
| Subscription required | Yes (no pure PAYG) | Pay-as-you-go available |

Firecrawl's most significant pricing advantage is the /map endpoint at 1 credit for thousands of URLs, and the /crawl endpoint at 1 credit/page with no rendering surcharge (Firecrawl always renders but charges 1 credit). If your workflow is heavily crawl-based, factor this into your cost model.


Migration Checklist

Work through this list when migrating. Tick each item before deploying to production.

Core Migration

  • Replace Authorization: Bearer FC_API_KEY header with token=SDO_TOKEN query parameter
  • Change base URL from https://api.firecrawl.dev/v2/ to https://api.scrape.do
  • Switch from POST with JSON body to GET with query parameters
  • Replace formats: ["markdown"] with output=markdown
  • Remove formats: ["html"]; SDO returns HTML by default, so no replacement parameter is needed
  • Add render=true to requests that relied on Firecrawl's always-on browser (Firecrawl renders every request; SDO does not)
  • Replace waitFor: N (ms) with customWait=N
  • Replace mobile: true with device=mobile
  • Replace location.country: "US" with geoCode=us (lowercase ISO code)
  • Replace proxy: "enhanced" with super=true
  • Replace headers: {...} body object with extraHeaders=true + Sd-{Header} HTTP headers
  • Replace actions: [...] array with playWithBrowser=[{"Action":"..."}] JSON
  • Replace formats: ["screenshot"] with render=true&screenShot=true&returnJSON=true (all three required) and read base64 from response.json()["screenShots"][0]["image"] (Firecrawl returns a hosted URL string instead, so any code that treated data["screenshot"] as base64 must also be updated)
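The simpler mappings in this checklist can be collected into one translation helper. The sketch below is illustrative, not an official converter: `firecrawl_to_sdo` is a hypothetical name, it always sets render=true on the assumption that you want to keep Firecrawl's always-on rendering, and it deliberately leaves out actions and screenshots, which need the manual handling described above.

```python
def firecrawl_to_sdo(payload: dict, token: str) -> tuple[dict, dict]:
    """Translate a Firecrawl /scrape JSON body into SDO query params and HTTP headers."""
    # Firecrawl renders every request, so keep rendering on by default
    params = {"token": token, "url": payload["url"], "render": "true"}
    headers: dict = {}

    if "markdown" in payload.get("formats", []):
        params["output"] = "markdown"      # formats: ["markdown"] -> output=markdown
    if "waitFor" in payload:
        params["customWait"] = str(payload["waitFor"])  # milliseconds
    if payload.get("mobile"):
        params["device"] = "mobile"
    country = (payload.get("location") or {}).get("country")
    if country:
        params["geoCode"] = country.lower()  # lowercase ISO code
    if payload.get("proxy") == "enhanced":
        params["super"] = "true"
    if payload.get("headers"):
        # Custom headers travel as Sd-prefixed HTTP headers with extraHeaders=true
        params["extraHeaders"] = "true"
        headers = {f"Sd-{name}": value for name, value in payload["headers"].items()}
    return params, headers


params, headers = firecrawl_to_sdo(
    {"url": "https://example.com", "formats": ["markdown"], "mobile": True},
    token="SDO_TOKEN",
)
print(params, headers)
# Then: requests.get("https://api.scrape.do", params=params, headers=headers)
```

Drop the default render=true for endpoints that serve static HTML to bring those requests back to 1 credit each.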

Endpoint Replacement

  • /v2/batch/scrape -> SDO Async API (https://q.scrape.do/api/v1/jobs)

    • Change auth from Authorization header to X-Token header
    • Change body format: urls array -> Targets array
    • Poll job at GET /api/v1/jobs/{jobID}, retrieve content at GET /api/v1/jobs/{jobID}/{taskID}
  • /v2/crawl -> Build async crawler loop (use sdo_crawler.py from this guide)

    • Decide: do you need render=true on all pages? Turn it off for static content to cut costs to 1cr/page.
    • Decide: maximum page depth, excluded path patterns, concurrency
  • /v2/map -> Fetch sitemap.xml + parse (use discover_urls() from this guide)

  • /v2/extract (or legacy /v1/extract) -> output=markdown + LLM call (use extract_structured() from this guide)

    • Choose your LLM: GPT-4o-mini (cheapest), Claude 3.5 Haiku, Gemini Flash
    • Validate with Pydantic or Zod to maintain type safety
  • /v2/search -> SDO Google Scraper API: GET https://api.scrape.do/plugin/google/search?token=T&q=query

    • Returns structured JSON (no HTML parsing needed)
    • Note: 84 Google domains supported
    • Now part of the broader Google Scraper API which also covers Maps, Shopping, Flights, Hotels, News, Trends (10cr each)
  • /v2/agent or FIRE-1 -> Build agent loop (use run_agent() from this guide or similar)
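For the batch replacement, the request-side changes listed above (X-Token header, Targets array, job and task URLs) can be expressed without touching the network. This is a minimal sketch under stated assumptions: the helper names are hypothetical, POST as the job-creation method is an assumption, and the response field names are intentionally not modeled because this guide section does not specify them.

```python
ASYNC_BASE = "https://q.scrape.do/api/v1/jobs"


def build_job_request(firecrawl_batch_body: dict, sdo_token: str) -> dict:
    """Convert a Firecrawl /v2/batch/scrape body into kwargs for an SDO async job request."""
    return {
        "method": "POST",  # assumption: jobs are created with POST
        "url": ASYNC_BASE,
        "headers": {"X-Token": sdo_token, "Content-Type": "application/json"},
        "json": {"Targets": list(firecrawl_batch_body["urls"])},  # urls -> Targets
    }


def job_url(job_id: str) -> str:
    """URL to poll for job status."""
    return f"{ASYNC_BASE}/{job_id}"


def task_url(job_id: str, task_id: str) -> str:
    """URL to retrieve the content of one task in a job."""
    return f"{ASYNC_BASE}/{job_id}/{task_id}"


req = build_job_request({"urls": ["https://example.com/a", "https://example.com/b"]}, "SDO_TOKEN")
print(req["json"])
# Then: requests.request(**req), followed by requests.get(job_url(...), headers=req["headers"])
```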

Features Without Migration Path

  • PDF parsing — Download PDF bytes via SDO, parse with pdfplumber or pymupdf
  • Word/Excel document parsing — Download and parse client-side with python-docx, openpyxl
  • Self-hosting — SDO is cloud-only; keep Firecrawl OSS for self-hosted requirements
  • MCP server — SDO has no MCP endpoint; use SDO's n8n/Zapier integrations for AI workflow automation

Testing

  • Test authentication — verify SDO token works and check Scrape.do-Remaining-Credits response header
  • Compare output — scrape the same 5-10 URLs with both APIs and diff the markdown
  • Verify geo-targeting — confirm geoCode returns content from the correct region
  • Test rendering — confirm JS-heavy pages return dynamic content with render=true
  • Test async API — create a multi-URL job, poll to completion, verify all task results
  • Run cost estimate — count monthly requests per type (datacenter/render/super) and compare to Firecrawl credits used
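The "compare output" step can be automated with the standard library. The sketch below assumes you have already saved the markdown returned by each API for the same URL; `compare_markdown` is a hypothetical helper, and ignoring blank lines and surrounding whitespace is a design choice meant to keep formatting noise out of the diff.

```python
import difflib


def compare_markdown(firecrawl_md: str, sdo_md: str) -> tuple[float, list[str]]:
    """Return a line-level similarity ratio (0.0-1.0) and a unified diff of two markdown documents."""
    # Normalize: strip each line and drop blank lines so spacing differences are ignored
    a = [line.strip() for line in firecrawl_md.splitlines() if line.strip()]
    b = [line.strip() for line in sdo_md.splitlines() if line.strip()]

    ratio = difflib.SequenceMatcher(None, a, b).ratio()
    diff = list(difflib.unified_diff(a, b, "firecrawl", "scrape.do", lineterm=""))
    return ratio, diff


ratio, diff = compare_markdown("# Title\n\nHello\n", "# Title\nHello world\n")
print(f"similarity: {ratio:.2f}")
print("\n".join(diff))
```

A low ratio on a page that looks the same in a browser usually points at a missing render=true or a different wait setting rather than a real content difference.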

Additional Resources