---
title: Crawleo Crawler
---

[Crawleo](https://crawleo.dev) is a privacy-first web search and crawler API. The Crawler endpoint can be used to extract content from URLs with support for raw HTML and Markdown output.

## Overview

### Integration details

| Class | Package | Serializable | JS support | Version |
|:-----------------------------------------------------------------|:-------------------------------------------------------------------|:---:|:---:|:---:|
| [CrawleoCrawler](https://github.com/crawleo/langchain-crawleo) | [langchain-crawleo](https://pypi.org/project/langchain-crawleo/) | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-crawleo?style=flat-square&label=%20) |

### Tool features

| [Returns artifact](/oss/langchain/tools) | Native async | Return data | Pricing |
|:---:|:---:|:---:|:---:|
| ❌ | ✅ | raw_content, markdown | Pay-per-use credits |

## Setup

The integration lives in the `langchain-crawleo` package.

```bash
pip install -qU langchain-crawleo
```

### Credentials

We need to set our Crawleo API key. You can get an API key by visiting [Crawleo](https://crawleo.dev) and creating an account.

```python
import getpass
import os

if not os.environ.get("CRAWLEO_API_KEY"):
os.environ["CRAWLEO_API_KEY"] = getpass.getpass("Crawleo API key:\n")
```
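If you prefer not to use environment variables, the key can likely be passed directly at construction. A minimal sketch, assuming `langchain-crawleo` follows the common LangChain convention of an `api_key` constructor argument (verify against the package docs before relying on it):

```python
from langchain_crawleo import CrawleoCrawler

# Hypothetical: assumes the constructor accepts an `api_key` argument,
# as is conventional for LangChain integrations. Check the
# langchain-crawleo documentation to confirm the parameter name.
tool = CrawleoCrawler(api_key="...", markdown=True)
```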

## Instantiation

The tool accepts various parameters during instantiation:

- `raw_html` (optional, bool): Whether to return raw HTML content. Default is False.
- `markdown` (optional, bool): Whether to return content in markdown format. Default is False.

For a comprehensive overview of the available parameters, refer to the [Crawleo API documentation](https://crawleo.dev/docs).

```python
from langchain_crawleo import CrawleoCrawler

tool = CrawleoCrawler(
    markdown=True,
    # raw_html=False,
)
```

## Invocation

### [Invoke directly with args](/oss/langchain/tools)

The Crawleo crawler tool accepts the following arguments during invocation:

- `urls` (required): A list of 1 to 20 URLs to crawl
- Both `raw_html` and `markdown` can also be set during invocation

NOTE: The optional arguments are exposed to agents, which can set them dynamically per call. If you set an argument during instantiation and then invoke the tool with a different value, the tool uses the value passed during invocation (see the override example after the output below).

```python
tool.invoke({"urls": ["https://python.langchain.com"]})
```

```output
{
"status": "success",
"data": {
"results": [
{
"url": "https://python.langchain.com",
"raw_content": "...",
"markdown": "# LangChain\n\nLangChain is a framework for developing applications..."
}
],
"credits_used": 1
}
}
```
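Since invocation-time arguments take precedence, a single tool instance can produce different output formats per call. For example, the tool above was instantiated with `markdown=True`, but this call requests raw HTML instead:

```python
# Invocation-time values override the instantiation-time settings
# for this call only: request raw HTML instead of markdown.
result = tool.invoke(
    {
        "urls": ["https://python.langchain.com"],
        "raw_html": True,
        "markdown": False,
    }
)
print(result["data"]["results"][0]["raw_content"][:200])
```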

### Crawl multiple URLs

```python
from langchain_crawleo import CrawleoCrawler

tool = CrawleoCrawler(markdown=True)

result = tool.invoke({
"urls": [
"https://python.langchain.com",
"https://js.langchain.com"
]
})

for item in result["data"]["results"]:
print(f"URL: {item['url']}")
print(f"Content preview: {item['markdown'][:200]}...")
print("---")
```
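Because a single call accepts at most 20 URLs, larger lists need to be split into batches. A minimal sketch, using the per-call limit stated in the invocation docs above:

```python
# Split a long URL list into batches of 20, the per-call maximum.
all_urls = [f"https://example.com/page-{i}" for i in range(50)]  # placeholder URLs

all_results = []
for i in range(0, len(all_urls), 20):
    batch = all_urls[i : i + 20]
    response = tool.invoke({"urls": batch})
    all_results.extend(response["data"]["results"])

print(f"Crawled {len(all_results)} pages")
```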

### [Invoke with ToolCall](/oss/langchain/tools)

We can also invoke the tool with a model-generated ToolCall, in which case a ToolMessage will be returned:

```python
# This is usually generated by a model, but we'll create a tool call directly for demo purposes.
model_generated_tool_call = {
"args": {"urls": ["https://en.wikipedia.org/wiki/Artificial_intelligence"]},
"id": "1",
"name": "crawleo_crawler",
"type": "tool_call",
}
tool_msg = tool.invoke(model_generated_tool_call)

# The content is a JSON string of results
print(tool_msg.content[:400])
```

```output
{"status": "success", "data": {"results": [{"url": "https://en.wikipedia.org/wiki/Artificial_intelligence", "raw_content": "Artificial intelligence - Wikipedia\nJump to content\nMain menu...", "markdown": "# Artificial intelligence\n\nArtificial intelligence (AI) is the intelligence of machines...
```
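Because the `ToolMessage` content is a JSON string rather than a dict, parse it with `json.loads` before indexing into the results:

```python
import json

# The ToolMessage carries the crawl response as a JSON string.
parsed = json.loads(tool_msg.content)
for item in parsed["data"]["results"]:
    print(item["url"])
```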

## Use within an agent

We can use the tool within an agent by passing it in the agent's tool list. This gives the agent the ability to dynamically crawl URLs.

```python
if not os.environ.get("OPENAI_API_KEY"):
os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API_KEY:\n")
```

```python
from langchain.chat_models import init_chat_model

model = init_chat_model(model="gpt-4o", model_provider="openai", temperature=0)
```

```bash
pip install -qU langgraph
```

```python
from langchain_crawleo import CrawleoCrawler
from langgraph.prebuilt import create_react_agent

crawleo_crawler_tool = CrawleoCrawler(markdown=True)

agent = create_react_agent(model, [crawleo_crawler_tool])

user_input = "Extract the content from https://python.langchain.com and summarize what LangChain is."

for step in agent.stream(
{"messages": [("user", user_input)]},
stream_mode="values",
):
step["messages"][-1].pretty_print()
```

```output
================================ Human Message =================================

Extract the content from https://python.langchain.com and summarize what LangChain is.
================================== Ai Message ==================================
Tool Calls:
  crawleo_crawler (call_xyz789)
 Call ID: call_xyz789
  Args:
    urls: ['https://python.langchain.com']
================================= Tool Message =================================
Name: crawleo_crawler

{"status": "success", "data": {"results": [{"url": "https://python.langchain.com", "markdown": "# LangChain..."}]}}
================================== Ai Message ==================================

Based on the extracted content, LangChain is a framework for developing applications powered by language models...
```

## Advanced usage

### Combining search and crawler

Use CrawleoSearch to find relevant URLs, then CrawleoCrawler to extract full content:

```python
from langchain_crawleo import CrawleoSearch, CrawleoCrawler

search = CrawleoSearch(max_pages=1)
crawler = CrawleoCrawler(markdown=True)

# Step 1: Search for relevant pages
search_results = search.invoke({"query": "LangChain documentation"})
urls = [
item["link"]
for item in search_results["data"]["pages"]["1"]["search_results"][:3]
]

# Step 2: Crawl the top results
crawl_results = crawler.invoke({"urls": urls})

for result in crawl_results["data"]["results"]:
print(f"URL: {result['url']}")
print(f"Content length: {len(result.get('markdown', ''))}")
print("---")
```

### Async crawling
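The crawler supports native async execution through `ainvoke`: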

```python
import asyncio
from langchain_crawleo import CrawleoCrawler

async def crawl_pages():
    tool = CrawleoCrawler(markdown=True)

    urls = [
        "https://python.langchain.com",
        "https://js.langchain.com",
    ]

    result = await tool.ainvoke({"urls": urls})

    for item in result["data"]["results"]:
        print(f"Crawled: {item['url']}")
        print(f"Preview: {item['markdown'][:100]}...")

asyncio.run(crawl_pages())
```
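Since each call is awaitable, several crawls can also run concurrently with `asyncio.gather`, for example one call per batch. A short sketch:

```python
import asyncio
from langchain_crawleo import CrawleoCrawler

async def crawl_batches(batches: list[list[str]]):
    tool = CrawleoCrawler(markdown=True)
    # Fire off one crawl per batch and wait for all of them.
    responses = await asyncio.gather(
        *(tool.ainvoke({"urls": batch}) for batch in batches)
    )
    return [
        item
        for response in responses
        for item in response["data"]["results"]
    ]

results = asyncio.run(crawl_batches([
    ["https://python.langchain.com"],
    ["https://js.langchain.com"],
]))
print(f"Crawled {len(results)} pages")
```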

### Extract and summarize with LLM

```python
from langchain_crawleo import CrawleoCrawler
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

crawler = CrawleoCrawler(markdown=True)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Crawl a page
result = crawler.invoke({"urls": ["https://python.langchain.com/docs/concepts/"]})
content = result["data"]["results"][0]["markdown"]

# Summarize with LLM
prompt = ChatPromptTemplate.from_template("""
Summarize the following web page content in 3-5 bullet points:

{content}
""")

chain = prompt | llm
summary = chain.invoke({"content": content[:10000]}) # Limit content length
print(summary.content)
```
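Naive truncation can cut mid-sentence and silently drop the rest of the page. One alternative, assuming the `langchain-text-splitters` package is installed, is to split the content into chunks and summarize only as many as fit your budget:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the crawled markdown into overlapping chunks instead of
# truncating at an arbitrary character offset.
splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=200)
chunks = splitter.split_text(content)

# Summarize the first chunk (or loop over chunks for a map-reduce summary).
summary = chain.invoke({"content": chunks[0]})
print(summary.content)
```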

### Building a research pipeline

```python
from langchain_crawleo import CrawleoSearch, CrawleoCrawler
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

search = CrawleoSearch(max_pages=1, markdown=True)
crawler = CrawleoCrawler(markdown=True)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def research(topic: str) -> str:
    # Step 1: Search for relevant pages
    search_results = search.invoke({"query": topic})
    urls = [
        item["link"]
        for item in search_results["data"]["pages"]["1"]["search_results"][:2]
    ]

    # Step 2: Crawl the pages
    crawl_results = crawler.invoke({"urls": urls})

    # Step 3: Combine content
    combined_content = "\n\n---\n\n".join([
        f"Source: {r['url']}\n{r.get('markdown', '')[:3000]}"
        for r in crawl_results["data"]["results"]
    ])

    # Step 4: Generate research summary
    prompt = ChatPromptTemplate.from_template("""
Based on the following sources, provide a comprehensive answer about: {topic}

Sources:
{content}

Provide a detailed answer with citations to the sources.
""")

    chain = prompt | llm
    return chain.invoke({"topic": topic, "content": combined_content}).content

# Run research
answer = research("What is LangChain and how does it work?")
print(answer)
```

---

## API reference

For detailed documentation of all Crawleo Crawler API features and configurations, head to the API reference: [crawleo.dev/docs](https://crawleo.dev/docs)
