-
Notifications
You must be signed in to change notification settings - Fork 2.2k
docs: add CRW integration (provider + document loader pages) #3681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,110 @@ | ||
| --- | ||
| title: "CRW integration" | ||
| description: "Integrate with the CRW document loader using LangChain Python." | ||
| --- | ||
|
|
||
| [CRW](https://github.com/us/crw) is an open-source, Firecrawl-compatible web | ||
| scraper written in Rust. It ships as a single binary, runs with zero config in | ||
| subprocess mode, and returns clean markdown, HTML, or JSON. Works self-hosted | ||
| or via the [fastcrw.com](https://fastcrw.com) cloud API. | ||
|
Comment on lines
+6
to
+9
|
||
|
|
||
| ## Overview | ||
|
|
||
| ### Integration details | ||
|
|
||
| | Class | Package | Local | Serializable | | ||
| | :--- | :--- | :---: | :---: | | ||
| | `CrwLoader` | `langchain-crw` | ✅ | ❌ | | ||
|
|
||
|
Comment on lines
+15
to
+18
|
||
| ### Loader features | ||
|
|
||
| | Source | Document Lazy Loading | Native Async Support | | ||
| | :---: | :---: | :---: | | ||
| | `CrwLoader` | ✅ | ❌ | | ||
|
Comment on lines
+21
to
+23
|
||
|
|
||
| ## Setup | ||
|
|
||
| ```bash | ||
| pip install langchain-crw | ||
| ``` | ||
|
|
||
| No server required — the SDK spawns the `crw` binary as a local subprocess on | ||
| first use. For cloud mode, get an API key at [fastcrw.com](https://fastcrw.com). | ||
|
|
||
| ## Usage | ||
|
|
||
| ### Scrape a single page | ||
|
|
||
| ```python | ||
| from langchain_crw import CrwLoader | ||
|
|
||
| loader = CrwLoader(url="https://example.com", mode="scrape") | ||
| docs = loader.load() | ||
|
|
||
| print(docs[0].page_content) # clean markdown | ||
| print(docs[0].metadata) # {'title': ..., 'sourceURL': ..., 'statusCode': 200} | ||
| ``` | ||
|
|
||
| ### Cloud mode (fastcrw.com) | ||
|
|
||
| ```python | ||
| loader = CrwLoader( | ||
| url="https://example.com", | ||
| mode="scrape", | ||
| api_url="https://fastcrw.com/api", | ||
| api_key="YOUR_API_KEY", # or CRW_API_KEY env var | ||
| ) | ||
| docs = loader.load() | ||
| ``` | ||
|
|
||
| ### Self-hosted server | ||
|
|
||
| ```python | ||
| loader = CrwLoader(url="https://example.com", api_url="http://localhost:3000") | ||
| docs = loader.load() | ||
| ``` | ||
|
|
||
| ## Modes | ||
|
|
||
| - `scrape`: Scrape a single URL and return markdown. | ||
| - `crawl`: Crawl a URL and all accessible sub-pages. | ||
| - `map`: Discover URLs on a site via sitemap and link extraction. | ||
| - `search`: Web search plus content scraping (cloud only). | ||
|
|
||
| ### Crawl | ||
|
|
||
| ```python | ||
| loader = CrwLoader( | ||
| url="https://docs.example.com", | ||
| mode="crawl", | ||
| params={"max_depth": 3, "max_pages": 50}, | ||
| ) | ||
| docs = loader.load() | ||
| ``` | ||
|
|
||
| ### Map | ||
|
|
||
| ```python | ||
| loader = CrwLoader(url="https://example.com", mode="map") | ||
| urls = [doc.page_content for doc in loader.load()] | ||
| ``` | ||
|
|
||
| ### JS rendering | ||
|
|
||
| ```python | ||
| loader = CrwLoader( | ||
| url="https://spa-app.example.com", | ||
| mode="scrape", | ||
| params={ | ||
| "render_js": True, | ||
| "wait_for": 3000, | ||
| "css_selector": "article.main-content", | ||
| }, | ||
| ) | ||
| docs = loader.load() | ||
| ``` | ||
|
|
||
| ## API reference | ||
|
|
||
| For full configuration options, see the | ||
| [`langchain-crw` README](https://github.com/us/langchain-crw). | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| --- | ||
| title: "CRW integrations" | ||
| description: "Integrate with CRW using LangChain Python." | ||
| --- | ||
|
|
||
| >[CRW](https://github.com/us/crw) is an open-source, Firecrawl-compatible web | ||
| > scraper written in Rust. It ships as a single ~6 MB binary, runs self-hosted | ||
| > or via the [fastcrw.com](https://fastcrw.com) cloud API, and returns clean | ||
| > LLM-ready markdown, HTML, or structured JSON. | ||
|
|
||
| ## Installation and setup | ||
|
|
||
| Install the partner integration package: | ||
|
|
||
| <CodeGroup> | ||
| ```bash pip | ||
| pip install langchain-crw | ||
| ``` | ||
|
|
||
| ```bash uv | ||
| uv add langchain-crw | ||
| ``` | ||
| </CodeGroup> | ||
|
|
||
| ## Document loader | ||
|
|
||
| See a [usage example](/oss/integrations/document_loaders/crw). | ||
|
Comment on lines
+25
to
+27
|
||
|
|
||
| ```python | ||
| from langchain_crw import CrwLoader | ||
| ``` | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partner_pkg_table.pyderives a display title fromnamewhenname_titleis omitted, which would renderlangchain-crwas “Crw” (losing the intended all-caps acronym). Addname_title: CRWto preserve branding/capitalization in generated tables.