Skip to content

Latest commit

 

History

History
110 lines (82 loc) · 2.44 KB

File metadata and controls

110 lines (82 loc) · 2.44 KB
title CRW integration
description Integrate with the CRW document loader using LangChain Python.

CRW is an open-source, Firecrawl-compatible web scraper written in Rust. It ships as a single binary, runs with zero config in subprocess mode, and returns clean markdown, HTML, or JSON. Works self-hosted or via the fastcrw.com cloud API.

Overview

Integration details

Class Package Local Serializable
CrwLoader langchain-crw

Loader features

Source Document Lazy Loading Native Async Support
CrwLoader

Setup

pip install langchain-crw

No server required — the SDK spawns the crw binary as a local subprocess on first use. For cloud mode, get an API key at fastcrw.com.

Usage

Scrape a single page

from langchain_crw import CrwLoader

loader = CrwLoader(url="https://example.com", mode="scrape")
docs = loader.load()

print(docs[0].page_content)   # clean markdown
print(docs[0].metadata)       # {'title': ..., 'sourceURL': ..., 'statusCode': 200}

Cloud mode (fastcrw.com)

loader = CrwLoader(
    url="https://example.com",
    mode="scrape",
    api_url="https://fastcrw.com/api",
    api_key="YOUR_API_KEY",  # or CRW_API_KEY env var
)
docs = loader.load()

Self-hosted server

loader = CrwLoader(url="https://example.com", api_url="http://localhost:3000")
docs = loader.load()

Modes

  • scrape: Scrape a single URL and return markdown.
  • crawl: Crawl a URL and all accessible sub-pages.
  • map: Discover URLs on a site via sitemap and link extraction.
  • search: Web search plus content scraping (cloud only).

Crawl

loader = CrwLoader(
    url="https://docs.example.com",
    mode="crawl",
    params={"max_depth": 3, "max_pages": 50},
)
docs = loader.load()

Map

loader = CrwLoader(url="https://example.com", mode="map")
urls = [doc.page_content for doc in loader.load()]

JS rendering

loader = CrwLoader(
    url="https://spa-app.example.com",
    mode="scrape",
    params={
        "render_js": True,
        "wait_for": 3000,
        "css_selector": "article.main-content",
    },
)
docs = loader.load()

API reference

For full configuration options, see the langchain-crw README.