Thordata Crawl – Turn any website into AI‑ready data with a single API.
Thordata Crawl is a Firecrawl‑like Web Data API service built on top of Thordata’s AI‑native web data infrastructure. It turns websites into structured, AI‑ready data (Markdown / JSON / HTML / screenshots) for LLMs, RAG systems, and agents.
- LLM‑ready output: Directly returns Markdown, structured JSON, raw HTML, and screenshot URLs to minimize post‑processing.
- Simple unified API: A single client/service covers `scrape`, `crawl`, `map`, `search`, and agent-style operations.
- Firecrawl‑inspired interface: Request and response shapes are designed to be as close as reasonably possible to `firecrawl/firecrawl` for easy migration.
- Powered by Thordata: Leverages Thordata’s Web Scraper, Scraping Browser, SERP API, and proxy network for higher reliability and success rates.
- Self‑hostable: Can be deployed locally or in your own environment via Docker / docker‑compose.
- AI & Agent ecosystem: Designed to work smoothly with `thordata-mcp-server`, `thordata-rag-pipeline`, LangChain tools, and more.
- `README.md`: Project overview and usage documentation
- `SELF_HOST.md`: Self-hosting guide with Docker and production deployment instructions
- `docker-compose.yml`: One-command Docker deployment
- `Dockerfile`: Docker image for the HTTP API service
- `run_server.py`: Simple script to run the API server locally
- `test_api.py`: Quick test script to verify API functionality
- `.env.example`: Environment variable template
- `openapi.json` / `openapi.yaml`: OpenAPI specification exported from the FastAPI app
- `render.yaml`: Render Blueprint for one-click cloud deployment
- `src/thordata_firecrawl/`: Core Python package
  - `__init__.py`: Package exports
  - `client.py`: High-level Python client (`ThordataCrawl`)
  - `api.py`: FastAPI HTTP server with REST endpoints
  - `cli.py`: Command-line interface
  - `_crawler.py`: Internal crawler utilities (BFS, link discovery)
  - `_llm.py`: LLM integration for agent functionality
- `examples/`: Usage examples
  - `basic_crawl.py`: Basic crawl examples (including URL filtering)
  - `search_and_agent.py`: Search and agent examples
  - `agent_with_llm.py`: LLM-powered structured extraction examples
  - `http_api_examples.py`: HTTP API usage examples (Python `requests`)
Install:

```bash
pip install thordata-firecrawl
```

Basic scrape example:

```python
from thordata_firecrawl import ThordataCrawl

client = ThordataCrawl(api_key="td-YOUR_API_KEY")

doc = client.scrape(
    url="https://www.thordata.com",
    formats=["markdown"]
)
print(doc.markdown)
```

Site‑level crawl example:
```python
job = client.crawl(
    url="https://doc.thordata.com",
    limit=100,
    formats=["markdown"]
)
for page in job["data"]:
    print(page["metadata"]["sourceUrl"], page["markdown"][:200])
```
Start the server:

```bash
# Using Docker
docker-compose up -d

# Or locally
pip install -e ".[server]"
python run_server.py

# If port is in use, change it:
# python run_server.py --port 3003

# Easiest way to try (no need to learn Swagger first)
# - Home page:  http://127.0.0.1:3002/
# - Playground: http://127.0.0.1:3002/playground
# - Swagger UI: http://127.0.0.1:3002/docs

# OpenAPI spec (for SDK generation / client integration)
python export_openapi.py  # writes openapi.json + openapi.yaml in repo root
```

Scrape a single page:
```bash
# Scrape a single page
curl -X POST "http://localhost:3002/v1/scrape" \
  -H "Authorization: Bearer YOUR_SCRAPER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.thordata.com",
    "formats": ["markdown", "html", "screenshot"]
  }'

# Batch scrape multiple pages
curl -X POST "http://localhost:3002/v1/batch-scrape" \
  -H "Authorization: Bearer YOUR_SCRAPER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://www.thordata.com",
      "https://www.thordata.com/about"
    ],
    "formats": ["markdown"]
  }'

# Crawl a website (async job)
curl -X POST "http://localhost:3002/v1/crawl?clientJobId=my-unique-job-id" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://doc.thordata.com",
    "limit": 10,
    "maxDepth": 2,
    "includePaths": ["/docs/*"],
    "excludePaths": ["/privacy*", "/terms*"],
    "webhook": {
      "url": "https://example.com/webhook",
      "headers": {"Authorization": "Bearer YOUR_WEBHOOK_TOKEN"},
      "secret": "YOUR_WEBHOOK_SECRET",
      "timeout": 10,
      "maxRetries": 3,
      "includeData": true
    },
    "scrapeOptions": {
      "formats": ["markdown"],
      "maxRetries": 3
    }
  }'

# Poll crawl job status/results
# (Replace JOB_ID with the id returned by POST /v1/crawl)
curl -X GET "http://localhost:3002/v1/crawl/JOB_ID" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Poll with pagination (avoid huge responses)
curl -X GET "http://localhost:3002/v1/crawl/JOB_ID?offset=0&limit=50" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Cancel a running crawl job (best-effort)
curl -X POST "http://localhost:3002/v1/crawl/JOB_ID/cancel" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Map (discover links)
curl -X POST "http://localhost:3002/v1/map" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "search": "pricing"
  }'

# Search the web
curl -X POST "http://localhost:3002/v1/search" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "best web scraping tools 2026",
    "limit": 10
  }'

# Agent (structured extraction)
curl -X POST "http://localhost:3002/v1/agent" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Extract company founders information",
    "urls": ["https://example.com/about"],
    "schema": {
      "type": "object",
      "properties": {
        "founders": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": {"type": "string"},
              "role": {"type": "string"}
            }
          }
        }
      }
    }
  }'
```

Interactive API documentation is available at http://localhost:3002/docs (Swagger UI) and http://localhost:3002/redoc (ReDoc).
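Since `/v1/crawl` is an async job, a typical client submits the job and then polls until it finishes. A minimal polling sketch using Python `requests` against the endpoints above (base URL and token are placeholders; the terminal statuses besides `completed` are assumptions based on the cancel/failure behavior documented here):

```python
import time

import requests

BASE = "http://localhost:3002"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Submit the crawl job
resp = requests.post(
    f"{BASE}/v1/crawl",
    headers=HEADERS,
    json={"url": "https://doc.thordata.com", "limit": 10, "maxDepth": 2},
)
job_id = resp.json()["id"]

# Poll until the job reaches a terminal state
# ("failed"/"cancelled" are assumed names; "completed" is documented below)
while True:
    job = requests.get(f"{BASE}/v1/crawl/{job_id}", headers=HEADERS).json()
    if job["status"] in ("completed", "failed", "cancelled"):
        break
    time.sleep(2)

for page in job.get("data", []):
    print(page["metadata"]["sourceUrl"])
```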
If you want an HTTPS endpoint (so your GitHub Pages playground can call it directly), follow DEPLOY_RENDER.md.
```bash
# Scrape a single page
thordata-firecrawl scrape https://www.thordata.com \
  --format markdown --format html --format screenshot \
  --out thordata.md

# Batch scrape multiple URLs
thordata-firecrawl batch-scrape https://www.thordata.com https://docs.thordata.com \
  --format markdown \
  --out batch-result.json

# Crawl a website (discovers and scrapes multiple pages)
thordata-firecrawl crawl https://doc.thordata.com \
  --limit 50 \
  --max-depth 3 \
  --include-subdomains \
  --concurrency 5 \
  --format markdown \
  --out ./data/crawl-result.json

# Map (discover links without full content)
thordata-firecrawl map https://example.com \
  --search "pricing" \
  --include-subdomains \
  --out links.json

# Search the web
thordata-firecrawl search "best web scraping tools 2026" \
  --limit 10 \
  --engine google \
  --country us \
  --out search-results.json

# Search then scrape the top results (combined)
thordata-firecrawl search-and-scrape "Thordata web data API" \
  --search-limit 3 \
  --format markdown \
  --out search-and-scrape.json

# Agent (structured extraction - MVP)
thordata-firecrawl agent "Extract company founders" \
  --url https://example.com/about \
  --schema schema.json \
  --out extracted-data.json
```

This section documents the target API surface. The canonical spec is exported to `openapi.json` / `openapi.yaml` via `export_openapi.py`.
| Capability | Firecrawl hosted API (conceptual) | Thordata Firecrawl HTTP API | Notes |
|---|---|---|---|
| Single page scrape | `POST /v1/scrape` | `POST /v1/scrape` | Request/response shape is intentionally very similar. |
| Site crawl (async) | `POST /v1/crawl` + `GET /v1/crawl/:id` | `POST /v1/crawl` + `GET /v1/crawl/{id}` | Async job model with polling, pagination, and cancel. |
| Map / URL discovery | `POST /v1/map` | `POST /v1/map` | Discovers URLs from a seed page. |
| Search | `POST /v1/search` | `POST /v1/search` | Backed by Thordata SERP API; returns `data.web[]`. |
| Agent / Extract | `POST /v1/agent` | `POST /v1/agent` | Uses an OpenAI-compatible LLM with an optional JSON schema. |
Key request fields

| Field (Firecrawl) | Field (Thordata Firecrawl) | Status |
|---|---|---|
| `url` | `url` | ✅ Same semantics. |
| `formats` | `formats` | ✅ Same; supports `markdown`, `html`, `screenshot`. |
| `scrapeOptions.waitFor` | `scrapeOptions.waitFor` (mapped to `wait`, in ms, internally) | ✅ Supported; converted in server/client. |
| `scrapeOptions.javascript` | `scrapeOptions.javascript` | ✅ Supported (`true` enables JS rendering). |
| `scrapeOptions.timeout` | `scrapeOptions.timeout` | ✅ Passed through to the underlying Thordata SDK where applicable. |
| `includeSubdomains` | `includeSubdomains` | ✅ Same for crawl/map. |
| `maxDepth` | `maxDepth` | ✅ Same for crawl. |
| `metadata` | `metadata` | ✅ Accepted and echoed in results where available. |
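The `waitFor` → `wait` conversion in the table above is essentially a field rename on the way to the underlying SDK. A hypothetical sketch of what such a mapping looks like (the helper name is illustrative, not the project's actual internal function):

```python
def to_thordata_options(scrape_options: dict) -> dict:
    """Illustrative only: rename a Firecrawl-style field to the Thordata one."""
    opts = dict(scrape_options)
    if "waitFor" in opts:
        opts["wait"] = opts.pop("waitFor")  # value in milliseconds
    return opts
```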
Authentication

- Firecrawl: `Authorization: Bearer fc-...` (Firecrawl API key).
- Thordata Firecrawl: `Authorization: Bearer <token>`, where:
  - For Universal API scrapes (markdown/html/screenshot) you should prefer `THORDATA_SCRAPER_TOKEN` (`scraper_token`). `THORDATA_API_KEY` may work for some operations, but for Firecrawl-like behavior use `THORDATA_SCRAPER_TOKEN`.

Practical mapping:

- Firecrawl:

```bash
curl -X POST "https://api.firecrawl.dev/v1/scrape" \
  -H "Authorization: Bearer fc-XXX" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"]
  }'
```

- Thordata Firecrawl (self-hosted):

```bash
curl -X POST "http://localhost:3002/v1/scrape" \
  -H "Authorization: Bearer YOUR_SCRAPER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"]
  }'
```
POST /v1/scrape

- Purpose: Scrape a single URL and return content in the requested formats.
- Example request body:

```json
{
  "url": "https://example.com",
  "formats": ["markdown", "html", "screenshot"],
  "scrapeOptions": {
    "waitFor": "selector-or-time",
    "timeout": 30000,
    "javascript": true
  },
  "metadata": {
    "includeHeaders": false
  }
}
```

- Example response body:

```json
{
  "success": true,
  "data": {
    "markdown": "...",
    "html": "...",
    "screenshot": "https://cdn.thordata.com/.../shot.png",
    "metadata": {
      "title": "...",
      "sourceUrl": "https://example.com"
    }
  }
}
```

POST /v1/crawl

- Purpose: Discover and scrape multiple pages starting from a seed URL using BFS traversal (a simplified sketch of the traversal appears at the end of this section).
- Features:
  - Automatic link discovery from HTML content
  - Domain/subdomain filtering
  - Depth and limit controls
  - Concurrent requests for better performance
  - Webhook callbacks for job completion/failure
  - URL path filtering (include/exclude patterns)
- Example request body:

```json
{
  "url": "https://docs.example.com",
  "limit": 100,
  "scrapeOptions": {
    "formats": ["markdown"]
  },
  "includeSubdomains": false,
  "maxDepth": 3,
  "includePaths": ["/docs/*"],
  "excludePaths": ["/privacy*", "/terms*"],
  "webhook": {
    "url": "https://example.com/webhook",
    "headers": {"Authorization": "Bearer YOUR_WEBHOOK_TOKEN"},
    "secret": "YOUR_WEBHOOK_SECRET",
    "timeout": 10,
    "maxRetries": 3,
    "includeData": true
  }
}
```

- Idempotency: Use the `?clientJobId=your-unique-id` query parameter to ensure idempotent job submission. If a job with the same `clientJobId` already exists, the existing job ID is returned.
- Webhook Configuration:
  - `url` (required): Webhook endpoint URL
  - `headers` (optional): Extra HTTP headers for the webhook request
  - `secret` (optional): Secret for HMAC-SHA256 signature verification
  - `timeout` (optional): Request timeout in seconds (default: 10, max: 60)
  - `maxRetries` (optional): Maximum retry attempts with exponential backoff (default: 3, max: 10)
  - `includeData` (optional): Include the full `data` array in the payload (default: `true`). Set to `false` for large crawls to reduce payload size.
- Retry Strategy: Scrape operations automatically retry on network errors, 5xx responses, and timeouts with exponential backoff. Configure via `scrapeOptions.maxRetries` (default: 3).
- Webhook Payload Format:
```json
{
  "event": "crawl.completed",
  "id": "job-123",
  "status": "completed",
  "total": 50,
  "completed": 50,
  "failed": 0,
  "data": [...],   // Only included if includeData: true
  "dataCount": 50, // Included if includeData: false
  "error": null
}
```

- Webhook Signature Verification (Python example):
```python
import hashlib
import hmac


def verify_webhook_signature(secret: str, body: bytes, signature: str) -> bool:
    """Verify an HMAC-SHA256 webhook signature."""
    expected = hmac.new(
        secret.encode("utf-8"),
        body,
        hashlib.sha256
    ).hexdigest()
    # Signature format: "sha256=<hex>"
    received = signature.replace("sha256=", "")
    return hmac.compare_digest(expected, received)


# In your webhook handler:
signature = request.headers.get("X-Thordata-Signature")
body = request.body  # Raw bytes
if not verify_webhook_signature("YOUR_WEBHOOK_SECRET", body, signature):
    return {"error": "Invalid signature"}, 401
```
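For a complete receiver, the same verification can be wired into a FastAPI endpoint. A minimal sketch (the route path and app layout are your choice; only the `X-Thordata-Signature` header name comes from the example above):

```python
from fastapi import FastAPI, Header, HTTPException, Request

app = FastAPI()


@app.post("/webhook")
async def thordata_webhook(
    request: Request,
    x_thordata_signature: str = Header(default=""),
):
    body = await request.body()  # raw bytes, exactly as sent
    if not verify_webhook_signature("YOUR_WEBHOOK_SECRET", body, x_thordata_signature):
        raise HTTPException(status_code=401, detail="Invalid signature")
    payload = await request.json()
    # e.g. payload["event"] == "crawl.completed"
    return {"ok": True}
```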
- Example response (job submission):

```json
{
  "success": true,
  "id": "job-123",
  "url": "https://api.thordata.com/crawl/v1/crawl/job-123"
}
```

- Example response (poll job status):
```json
{
  "status": "completed",
  "total": 50,
  "completed": 50,
  "creditsUsed": 50,
  "data": [
    {
      "markdown": "# Page Title\n\nContent...",
      "metadata": {
        "title": "Page Title",
        "sourceUrl": "https://..."
      }
    }
  ]
}
```
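Conceptually, the crawl behaves like a bounded BFS over discovered links. A simplified sketch of the traversal logic (illustrative only; the real implementation lives in `_crawler.py` and adds concurrency, rate limiting, and dedup):

```python
from collections import deque
from fnmatch import fnmatch
from urllib.parse import urlparse


def bfs_crawl(seed, limit, max_depth, include_paths, exclude_paths,
              scrape, extract_links):
    """Illustrative bounded BFS; scrape() and extract_links() are injected."""
    seen = {seed}
    results = []
    queue = deque([(seed, 0)])
    while queue and len(results) < limit:
        url, depth = queue.popleft()
        path = urlparse(url).path or "/"
        # Glob-style include/exclude filtering, as in includePaths/excludePaths
        if any(fnmatch(path, pat) for pat in exclude_paths):
            continue
        if include_paths and not any(fnmatch(path, pat) for pat in include_paths):
            continue
        page = scrape(url)
        results.append(page)
        if depth < max_depth:
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return results
```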
POST /v1/map

- Purpose: Discover URLs within a site without fetching full page content.
- Features:
  - Extracts all links from the seed page HTML
  - Filters by domain/subdomain
  - Optional keyword-based filtering/ranking
- Example request body:

```json
{
  "url": "https://example.com",
  "search": "pricing"
}
```

- Example response body:
```json
{
  "success": true,
  "links": [
    {
      "url": "https://example.com",
      "title": "Example",
      "description": "..."
    },
    {
      "url": "https://example.com/pricing",
      "title": "Pricing",
      "description": "..."
    }
  ]
}
```

POST /v1/search

- Purpose: Firecrawl‑style `search` interface backed by the Thordata SERP API.
- Example request body:
```json
{
  "query": "best web scraping tools 2026",
  "limit": 10
}
```

- Example response body:
```json
{
  "success": true,
  "data": {
    "web": [
      {
        "title": "...",
        "url": "...",
        "snippet": "..."
      }
    ]
  }
}
```
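Consuming the response is straightforward since results live under `data.web[]`. A quick `requests`-based sketch against the endpoint above:

```python
import requests

resp = requests.post(
    "http://localhost:3002/v1/search",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"query": "best web scraping tools 2026", "limit": 10},
)
for item in resp.json()["data"]["web"]:
    print(item["title"], item["url"])
```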
POST /v1/agent

- LLM config (recommended `.env`):
```bash
# OpenAI
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=YOUR_OPENAI_KEY
OPENAI_MODEL=gpt-4o-mini

# SiliconFlow (example)
# OPENAI_API_BASE=https://api.siliconflow.cn/v1
# OPENAI_API_KEY=YOUR_SILICONFLOW_KEY
# OPENAI_MODEL=Qwen/Qwen2.5-7B-Instruct

# DeepSeek (example)
# OPENAI_API_BASE=https://api.deepseek.com/v1
# OPENAI_API_KEY=YOUR_DEEPSEEK_KEY
# OPENAI_MODEL=deepseek-chat
```

- Purpose: Use an LLM + JSON schema to extract structured data from web content.
- Features:
  - Scrapes the provided URLs (or searches if no URLs are given)
  - Uses OpenAI-compatible LLM APIs for extraction
  - Supports JSON schema validation
  - Works with OpenAI, SiliconFlow, DeepSeek, and other compatible providers
- Example request body:

```json
{
  "urls": ["https://example.com/about"],
  "prompt": "Extract company founders information",
  "schema": {
    "type": "object",
    "properties": {
      "founders": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "role": { "type": "string" }
          }
        }
      }
    }
  },
  "model": "spark-1-mini"
}
```

- Example response body:
```json
{
  "success": true,
  "data": {
    "founders": [
      {
        "name": "Alice",
        "role": "CEO"
      }
    ]
  },
  "sources": ["https://example.com/about"]
}
```
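The same extraction can be driven from the Python client. A minimal sketch, assuming `client.agent(...)` accepts the same fields as the HTTP endpoint (prompt, urls, schema):

```python
# Assumes client.agent mirrors the /v1/agent HTTP fields shown above.
result = client.agent(
    prompt="Extract company founders information",
    urls=["https://example.com/about"],
    schema={
        "type": "object",
        "properties": {
            "founders": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "role": {"type": "string"},
                    },
                },
            }
        },
    },
)
print(result)
```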
High‑level architecture:

- Entry points
  - ✅ HTTP API (FastAPI)
  - ✅ Python SDK (`ThordataCrawl`)
  - ✅ CLI tool (`thordata-firecrawl`)
- Middle layer
  - ✅ Job management / queue (in-memory, async crawl)
  - ✅ URL discovery / deduplication / rate limiting
  - ✅ Content cleaning (HTML → Markdown / JSON)
  - ✅ Structured extraction module (Agent / LLM)
- Underlying Thordata infrastructure
  - Proxy Network
  - Web Scraper API
  - Scraping Browser
  - SERP API
  - RAG Pipeline
This repository does not contain proxy network or anti‑bot core logic. It only calls official Thordata APIs/SDKs, allowing this project to use a permissive open‑source license (MIT) without exposing commercial internals.
- Requires Python 3.10+.
- After cloning the repo, install in editable mode:

```bash
pip install -e .
```

- Install optional LLM dependencies (for `agent` functionality):

```bash
pip install -e ".[llm]"
```

- Install server dependencies:

```bash
pip install -e ".[server]"
```

- Configure environment variables (or a `.env` file):
  - `THORDATA_API_KEY`: Thordata API key for scraping (required)
  - `THORDATA_BASE_URL`: Optional Thordata API base URL
  - `OPENAI_API_KEY`: OpenAI-compatible API key (required for `agent` functionality)
  - `OPENAI_API_BASE`: API base URL (default: `https://api.openai.com/v1`)
  - `OPENAI_MODEL`: Model name (default: `auto`, auto-detects based on `OPENAI_API_BASE`)
  - `LOG_LEVEL`: Logging level (default: `INFO`; options: `DEBUG`, `INFO`, `WARNING`, `ERROR`)
- Build the service image via the `Dockerfile`.
- Start via `docker-compose.yml`:

```bash
docker-compose up --build
```

The service listens on http://localhost:3002 by default.
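Once the container is up, a quick smoke test from Python confirms the service responds (uses the home page path listed earlier):

```python
import requests

# The home page should return 200 once the service is up
r = requests.get("http://localhost:3002/")
print(r.status_code)
```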
Deploy to Render (free tier available) using the included `render.yaml`:

- Push this repo to GitHub
- In the Render dashboard, choose New → Blueprint
- Select your GitHub repo
- Set environment variables:
  - `THORDATA_API_KEY` (required)
  - `CORS_ALLOW_ORIGINS` (recommended, e.g., `https://thordata.github.io`)
- Deploy and get your HTTPS URL

See `SELF_HOST.md` for detailed deployment instructions.
- Can be deployed to K8s / ECS / VMs / Render / Fly.io
- Recommended to front the service with an API Gateway for auth, rate limiting, and auditing
- Task metadata and results can be stored in Redis / a database for better resilience and observability (optional)
- `THORDATA_API_KEY` (required): Thordata API key for scraping
- `THORDATA_BASE_URL` (optional): Custom Thordata API base URL; defaults to the official endpoint if omitted
- `PORT` (optional): Server port (default: 3002)
- `HOST` (optional): Server host (default: 0.0.0.0)
- `RATE_LIMIT_TOKEN_RPM` (optional): Requests per minute per API token (default: 60)
- `RATE_LIMIT_IP_RPM` (optional): Requests per minute per IP address (default: 120)
- `RATE_LIMIT_WINDOW_SECONDS` (optional): Rate limit window size in seconds (default: 60)
- `MAX_RESPONSE_SIZE` (optional): Max response size in bytes, to prevent OOM (default: 10 MB = 10485760)
- `LOG_LEVEL` (optional): Logging level - `DEBUG`, `INFO`, `WARNING`, `ERROR` (default: `INFO`)
- `MAX_CONCURRENT_CRAWLS` (optional): Maximum concurrent crawl jobs (default: 2)
- `JOB_TTL_SECONDS` (optional): Job TTL in seconds (default: 3600)
- `OPENAI_API_KEY` (optional): OpenAI-compatible API key
- `OPENAI_API_BASE` (optional): LLM API base URL (default: `https://api.openai.com/v1`)
- `OPENAI_MODEL` (optional): Model name (default: `auto`, auto-detects based on `OPENAI_API_BASE`)
- `REDIS_URL` / `DATABASE_URL`: Storage for tasks and results (for horizontal scaling) - not yet implemented
Current production-ready features:

- Retry strategies: ✅ Exponential backoff for 5xx / network errors / timeouts (configurable via `scrapeOptions.maxRetries`)
- Idempotency: ✅ Optional `clientJobId` query parameter for crawl jobs to avoid duplicates
- Observability: ✅ Structured logging around key operations (scrape, crawl, webhook); configure via the `LOG_LEVEL` env var
- Webhook reliability: ✅ Exponential backoff retries with configurable timeout and max attempts
- Rate limiting: ✅ Per-token and per-IP rate limiting with configurable limits (default: 60 req/min per token, 120 req/min per IP)
- Response size limits: ✅ Maximum response size protection to prevent OOM (default: 10 MB, configurable via `MAX_RESPONSE_SIZE`)
- Request validation: ✅ URL format validation, format checking, limit bounds, and required-field validation
- Enhanced error handling: ✅ Proper HTTP status codes (400 for validation errors, 401 for auth, 429 for rate limits, 500 for server errors)
Interface & DX

- Aligns with Firecrawl’s `/scrape` / `/crawl` / `/map` / `/search` / `/agent`-style APIs to lower migration cost.
- Provides a Python client and CLI with examples that feel familiar to Firecrawl users.

Infrastructure differences

- Firecrawl’s internal infra is not fully public.
- Thordata Crawl is explicitly powered by Thordata’s Proxy Network, Web Scraper, Scraping Browser, SERP API, and RAG Pipeline.

Licensing model

- Firecrawl's main repo is licensed under AGPL‑3.0.
- Thordata Crawl uses the MIT License (see `LICENSE`), making it easy to integrate into commercial projects.
- thordata-python-sdk (`thordata-sdk`)
  - This project is a higher‑level wrapper on top of the official Python SDK; all HTTP calls to Thordata go through it.
- thordata-rag-pipeline
  - Provides a full RAG pipeline from web content to retrieval. Crawl/agent outputs from this project can flow directly into it.
- thordata-mcp-server
  - MCP bridge for clients like Claude / Cursor / OpenAI.
  - Future versions of this project may expose MCP tools so agents can call `/scrape` / `/crawl` directly.
- v0.1 ✅ Completed
  - ✅ Python client based on `thordata-sdk` with `scrape`, `crawl`, `map`, `search`, `agent`
  - ✅ CLI with all subcommands
  - ✅ HTTP API service (FastAPI) with all endpoints
  - ✅ Docker / docker-compose support
  - ✅ Job queue and status polling
  - ✅ Production-ready features (rate limiting, retry, idempotency, webhooks)
  - ✅ Cloud deployment (Render)
  - ✅ Frontend website (GitHub Pages)
- v0.2+ (Future)
  - Redis-backed job store for horizontal scaling
  - MCP tools integration
  - LangChain / LlamaIndex / OpenAI Tools integrations
  - Performance tuning and advanced scheduling
Issues and Pull Requests are welcome.
Before submitting code, please:
- Run the existing tests (planned: `pytest`).
- Follow the basic code style (planned: `ruff` / `black` / `mypy`).
A full contributing guide will be added later in CONTRIBUTING.md.
This project is licensed under the MIT License. See LICENSE file for details.