diff --git a/src/oss/python/integrations/providers/crawleo_crawler.mdx b/src/oss/python/integrations/providers/crawleo_crawler.mdx new file mode 100644 index 0000000000..41fcc7722c --- /dev/null +++ b/src/oss/python/integrations/providers/crawleo_crawler.mdx @@ -0,0 +1,315 @@ +--- +title: Crawleo Crawler +--- + +[Crawleo](https://crawleo.dev) is a privacy-first web search and crawler API. The Crawler endpoint can be used to extract content from URLs with support for raw HTML and Markdown output. + +## Overview + +### Integration details + +| Class | Package | Serializable | JS support | Version | +|:-----------------------------------------------------------------|:-------------------------------------------------------------------|:---:|:---:|:---:| +| [CrawleoCrawler](https://github.com/crawleo/langchain-crawleo) | [langchain-crawleo](https://pypi.org/project/langchain-crawleo/) | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-crawleo?style=flat-square&label=%20) | + +### Tool features + +| [Returns artifact](/oss/langchain/tools) | Native async | Return data | Pricing | +|:---:|:---:|:---:|:---:| +| ❌ | ✅ | raw_content, markdown | Pay-per-use credits | + +## Setup + +The integration lives in the `langchain-crawleo` package. + +```bash +pip install -qU langchain-crawleo +``` + +### Credentials + +We need to set our Crawleo API key. You can get an API key by visiting [Crawleo](https://crawleo.dev) and creating an account. + +```python +import getpass +import os + +if not os.environ.get("CRAWLEO_API_KEY"): + os.environ["CRAWLEO_API_KEY"] = getpass.getpass("Crawleo API key:\n") +``` + +## Instantiation + +The tool accepts various parameters during instantiation: + +- `raw_html` (optional, bool): Whether to return raw HTML content. Default is False. +- `markdown` (optional, bool): Whether to return content in markdown format. Default is False. 
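To make the effect of these two flags concrete, here is a small, hedged sketch: it assumes the per-result field names `raw_content` and `markdown` from the sample response shown under Invocation (the live schema may differ), and simply prefers Markdown when both are present.

```python
# Hedged sketch: field names ("raw_content", "markdown") are assumed from
# the sample response in the Invocation section; the live schema may differ.

def pick_content(result: dict) -> str:
    """Prefer Markdown output, falling back to raw content, then empty."""
    return result.get("markdown") or result.get("raw_content") or ""

sample_result = {"url": "https://python.langchain.com", "raw_content": "<html>...</html>"}
print(pick_content(sample_result))  # no "markdown" key, so the raw content is returned
```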
+ +For a comprehensive overview of the available parameters, refer to the [Crawleo API documentation](https://crawleo.dev/docs). + +```python +from langchain_crawleo import CrawleoCrawler + +tool = CrawleoCrawler( + markdown=True, + # raw_html=False, +) +``` + +## Invocation + +### [Invoke directly with args](/oss/langchain/tools) + +The Crawleo crawler tool accepts the following arguments during invocation: + +- `urls` (required): A list of URLs to crawl (1-20 URLs maximum) +- Both `raw_html` and `markdown` can also be set during invocation + +NOTE: The optional arguments are available for agents to dynamically set. If you set an argument during instantiation and then invoke the tool with a different value, the tool will use the value you passed during invocation. + +```python +tool.invoke({"urls": ["https://python.langchain.com"]}) +``` + +```output +{ + "status": "success", + "data": { + "results": [ + { + "url": "https://python.langchain.com", + "raw_content": "...", + "markdown": "# LangChain\n\nLangChain is a framework for developing applications..." + } + ], + "credits_used": 1 + } +} +``` + +### Crawl multiple URLs + +```python +from langchain_crawleo import CrawleoCrawler + +tool = CrawleoCrawler(markdown=True) + +result = tool.invoke({ + "urls": [ + "https://python.langchain.com", + "https://js.langchain.com" + ] +}) + +for item in result["data"]["results"]: + print(f"URL: {item['url']}") + print(f"Content preview: {item['markdown'][:200]}...") + print("---") +``` + +### [Invoke with ToolCall](/oss/langchain/tools) + +We can also invoke the tool with a model-generated ToolCall, in which case a ToolMessage will be returned: + +```python +# This is usually generated by a model, but we'll create a tool call directly for demo purposes. 
+model_generated_tool_call = { + "args": {"urls": ["https://en.wikipedia.org/wiki/Artificial_intelligence"]}, + "id": "1", + "name": "crawleo_crawler", + "type": "tool_call", +} +tool_msg = tool.invoke(model_generated_tool_call) + +# The content is a JSON string of results +print(tool_msg.content[:400]) +``` + +```output +{"status": "success", "data": {"results": [{"url": "https://en.wikipedia.org/wiki/Artificial_intelligence", "raw_content": "Artificial intelligence - Wikipedia\nJump to content\nMain menu...", "markdown": "# Artificial intelligence\n\nArtificial intelligence (AI) is the intelligence of machines... +``` + +## Use within an agent + +We can use our tools directly with an agent executor by binding the tool to the agent. This gives the agent the ability to dynamically crawl URLs. + +```python +if not os.environ.get("OPENAI_API_KEY"): + os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API_KEY:\n") +``` + +```python +from langchain.chat_models import init_chat_model + +model = init_chat_model(model="gpt-4o", model_provider="openai", temperature=0) +``` + +```bash +pip install -qU langgraph +``` + +```python +from langchain_crawleo import CrawleoCrawler +from langgraph.prebuilt import create_react_agent + +crawleo_crawler_tool = CrawleoCrawler(markdown=True) + +agent = create_react_agent(model, [crawleo_crawler_tool]) + +user_input = "Extract the content from https://python.langchain.com and summarize what LangChain is." + +for step in agent.stream( + {"messages": [("user", user_input)]}, + stream_mode="values", +): + step["messages"][-1].pretty_print() +``` + +```output +================================ Human Message ================================= + +Extract the content from https://python.langchain.com and summarize what LangChain is. 
+================================== Ai Message ================================== +Tool Calls: + crawleo_crawler (call_xyz789) + Call ID: call_xyz789 + Args: + urls: ['https://python.langchain.com'] +================================= Tool Message ================================= +Name: crawleo_crawler + +{"status": "success", "data": {"results": [{"url": "https://python.langchain.com", "markdown": "# LangChain..."}]}} +================================== Ai Message ================================== + +Based on the extracted content, LangChain is a framework for developing applications powered by language models... +``` + +## Advanced Usage + +### Combining Search and Crawler + +Use CrawleoSearch to find relevant URLs, then CrawleoCrawler to extract full content: + +```python +from langchain_crawleo import CrawleoSearch, CrawleoCrawler + +search = CrawleoSearch(max_pages=1) +crawler = CrawleoCrawler(markdown=True) + +# Step 1: Search for relevant pages +search_results = search.invoke({"query": "LangChain documentation"}) +urls = [ + item["link"] + for item in search_results["data"]["pages"]["1"]["search_results"][:3] +] + +# Step 2: Crawl the top results +crawl_results = crawler.invoke({"urls": urls}) + +for result in crawl_results["data"]["results"]: + print(f"URL: {result['url']}") + print(f"Content length: {len(result.get('markdown', ''))}") + print("---") +``` + +### Async crawling + +```python +import asyncio +from langchain_crawleo import CrawleoCrawler + +async def crawl_pages(): + tool = CrawleoCrawler(markdown=True) + + urls = [ + "https://python.langchain.com", + "https://js.langchain.com", + ] + + result = await tool.ainvoke({"urls": urls}) + + for item in result["data"]["results"]: + print(f"Crawled: {item['url']}") + print(f"Preview: {item['markdown'][:100]}...") + +asyncio.run(crawl_pages()) +``` + +### Extract and summarize with LLM + +```python +from langchain_crawleo import CrawleoCrawler +from langchain_openai import ChatOpenAI +from 
langchain_core.prompts import ChatPromptTemplate + +crawler = CrawleoCrawler(markdown=True) +llm = ChatOpenAI(model="gpt-4o-mini", temperature=0) + +# Crawl a page +result = crawler.invoke({"urls": ["https://python.langchain.com/docs/concepts/"]}) +content = result["data"]["results"][0]["markdown"] + +# Summarize with LLM +prompt = ChatPromptTemplate.from_template(""" +Summarize the following web page content in 3-5 bullet points: + +{content} +""") + +chain = prompt | llm +summary = chain.invoke({"content": content[:10000]}) # Limit content length +print(summary.content) +``` + +### Building a research pipeline + +```python +from langchain_crawleo import CrawleoSearch, CrawleoCrawler +from langchain_openai import ChatOpenAI +from langchain_core.prompts import ChatPromptTemplate + +search = CrawleoSearch(max_pages=1, markdown=True) +crawler = CrawleoCrawler(markdown=True) +llm = ChatOpenAI(model="gpt-4o-mini", temperature=0) + +def research(topic: str) -> str: + # Step 1: Search for relevant pages + search_results = search.invoke({"query": topic}) + urls = [ + item["link"] + for item in search_results["data"]["pages"]["1"]["search_results"][:2] + ] + + # Step 2: Crawl the pages + crawl_results = crawler.invoke({"urls": urls}) + + # Step 3: Combine content + combined_content = "\n\n---\n\n".join([ + f"Source: {r['url']}\n{r.get('markdown', '')[:3000]}" + for r in crawl_results["data"]["results"] + ]) + + # Step 4: Generate research summary + prompt = ChatPromptTemplate.from_template(""" + Based on the following sources, provide a comprehensive answer about: {topic} + + Sources: + {content} + + Provide a detailed answer with citations to the sources. 
+ """) + + chain = prompt | llm + return chain.invoke({"topic": topic, "content": combined_content}).content + +# Run research +answer = research("What is LangChain and how does it work?") +print(answer) +``` + +--- + +## API reference + +For detailed documentation of all Crawleo Crawler API features and configurations head to the API reference: [crawleo.dev/docs](https://crawleo.dev/docs) + diff --git a/src/oss/python/integrations/providers/crawleo_search.mdx b/src/oss/python/integrations/providers/crawleo_search.mdx new file mode 100644 index 0000000000..6b5f9998af --- /dev/null +++ b/src/oss/python/integrations/providers/crawleo_search.mdx @@ -0,0 +1,313 @@ +--- +title: Crawleo Search +--- + +[Crawleo's Search API](https://crawleo.dev) is a privacy-first web search engine built for AI agents (LLMs), delivering real-time, accurate search results with AI-enhanced content extraction. + +## Overview + +### Integration details + +| Class | Package | Serializable | JS support | Version | +|:----------------------------------------------------------------|:-------------------------------------------------------------------|:---:|:---:|:---:| +| [CrawleoSearch](https://github.com/crawleo/langchain-crawleo) | [langchain-crawleo](https://pypi.org/project/langchain-crawleo/) | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-crawleo?style=flat-square&label=%20) | + +### Tool features + +| [Returns artifact](/oss/langchain/tools) | Native async | Return data | Pricing | +|:---:|:---:|:---:|:---:| +| ❌ | ✅ | title, URL, snippet, enhanced_html, markdown | Pay-per-use credits | + +## Setup + +The integration lives in the `langchain-crawleo` package. + +```bash +pip install -qU langchain-crawleo +``` + +### Credentials + +We need to set our Crawleo API key. You can get an API key by visiting [Crawleo](https://crawleo.dev) and creating an account. 
+
+```python
+import getpass
+import os
+
+if not os.environ.get("CRAWLEO_API_KEY"):
+    os.environ["CRAWLEO_API_KEY"] = getpass.getpass("Crawleo API key:\n")
+```
+
+## Instantiation
+
+Here we show how to instantiate the Crawleo search tool and then invoke it with a simple query. The tool accepts a number of parameters that customize the search.
+
+### Parameters
+
+The following parameters can be set during instantiation:
+
+- `max_pages` (optional, int): Maximum number of result pages to fetch. Each page costs 1 credit. Default is 1.
+- `setLang` (optional, str): Language code for the search interface (e.g., "en", "es", "fr"). Default is "en".
+- `cc` (optional, str): Country code for search results (e.g., "US", "GB", "DE").
+- `geolocation` (optional, str): Geo location for search. Allowed: random, pl, gb, jp, de, fr, es, us. Default is "random".
+- `device` (optional, str): Device simulation: "desktop", "mobile", or "tablet". Default is "desktop".
+- `enhanced_html` (optional, bool): Return AI-enhanced, cleaned HTML optimized for processing. Default is True.
+- `raw_html` (optional, bool): Return original, unprocessed HTML of the page. Default is False.
+- `page_text` (optional, bool): Return extracted plain text without HTML tags. Default is False.
+- `markdown` (optional, bool): Return content in Markdown format for easy parsing. Default is True.
+
+For a comprehensive overview of the available parameters, refer to the [Crawleo API documentation](https://crawleo.dev/docs).
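When `max_pages` is greater than 1, results come back grouped per page under string keys (`"1"`, `"2"`, …) — see the sample response under Invocation. A hedged helper for flattening them into a single list (the nesting is assumed from that sample):

```python
# Hedged sketch: the data -> pages -> "1" -> search_results nesting is
# assumed from the sample response in the Invocation section.

def flatten_results(response: dict) -> list:
    """Merge search_results from every fetched page, in page order."""
    pages = response.get("data", {}).get("pages", {})
    merged = []
    for page_no in sorted(pages, key=int):  # keys are strings: "1", "2", ...
        merged.extend(pages[page_no].get("search_results", []))
    return merged

sample = {
    "data": {
        "pages": {
            "2": {"search_results": [{"title": "B", "link": "https://example.com/b"}]},
            "1": {"search_results": [{"title": "A", "link": "https://example.com/a"}]},
        }
    }
}
print([r["title"] for r in flatten_results(sample)])  # ['A', 'B']
```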
+ +```python +from langchain_crawleo import CrawleoSearch + +tool = CrawleoSearch( + max_pages=1, + setLang="en", + cc="US", + # geolocation="random", + # device="desktop", + # enhanced_html=True, + # raw_html=False, + # page_text=False, + # markdown=True, +) +``` + +## Invocation + +### [Invoke directly with args](/oss/langchain/tools) + +The Crawleo search tool accepts the following arguments during invocation: + +- `query` (required): A natural language search query +- The following arguments can also be set during invocation: `max_pages`, `setLang`, `cc`, `geolocation`, `device`, `enhanced_html`, `raw_html`, `page_text`, `markdown` + +NOTE: The optional arguments are available for agents to dynamically set. If you set an argument during instantiation and then invoke the tool with a different value, the tool will use the value you passed during invocation. + +```python +tool.invoke({"query": "What are the latest AI trends?"}) +``` + +```output +{ + "status": "success", + "data": { + "query": "What are the latest AI trends?", + "pages_fetched": 1, + "time_used": 1.23, + "pages": { + "1": { + "total_results": "About 1,234,000 results", + "search_results": [ + { + "title": "AI Trends 2024", + "link": "https://example.com/ai-trends", + "snippet": "The latest AI trends include...", + "source": "example.com" + } + ], + "page_content": { + "enhanced_html": "...", + "page_markdown": "..." + } + } + }, + "credits_used": 1 + } +} +``` + +### [Invoke with ToolCall](/oss/langchain/tools) + +We can also invoke the tool with a model-generated ToolCall, in which case a ToolMessage will be returned: + +```python +# This is usually generated by a model, but we'll create a tool call directly for demo purposes. 
+model_generated_tool_call = { + "args": {"query": "LangChain framework features"}, + "id": "1", + "name": "crawleo_search", + "type": "tool_call", +} +tool_msg = tool.invoke(model_generated_tool_call) + +# The content is a JSON string of results +print(tool_msg.content[:400]) +``` + +```output +{"status": "success", "data": {"query": "LangChain framework features", "pages_fetched": 1, "time_used": 0.98, "pages": {"1": {"total_results": "About 523,000 results", "search_results": [{"title": "LangChain Introduction", "link": "https://python.langchain.com", "snippet": "LangChain is a framework for developing applications powered by language models... +``` + +## Use within an agent + +We can use our tools directly with an agent executor by binding the tool to the agent. This gives the agent the ability to dynamically set the available arguments to the Crawleo search tool. + +In the below example when we ask the agent to find information, the agent will dynamically set the arguments and invoke the Crawleo search tool. + +```python +if not os.environ.get("OPENAI_API_KEY"): + os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API_KEY:\n") +``` + +```python +from langchain.chat_models import init_chat_model + +model = init_chat_model(model="gpt-4o", model_provider="openai", temperature=0) +``` + +We will need to install langgraph: + +```bash +pip install -qU langgraph +``` + +```python +from langchain_crawleo import CrawleoSearch +from langgraph.prebuilt import create_react_agent + +# Initialize Crawleo Search Tool +crawleo_search_tool = CrawleoSearch( + max_pages=1, + setLang="en", + markdown=True, +) + +agent = create_react_agent(model, [crawleo_search_tool]) + +user_input = "What are the main features of LangChain? Search for recent information." 
+ +for step in agent.stream( + {"messages": [("user", user_input)]}, + stream_mode="values", +): + step["messages"][-1].pretty_print() +``` + +```output +================================ Human Message ================================= + +What are the main features of LangChain? Search for recent information. +================================== Ai Message ================================== +Tool Calls: + crawleo_search (call_abc123) + Call ID: call_abc123 + Args: + query: LangChain main features 2024 +================================= Tool Message ================================= +Name: crawleo_search + +{"status": "success", "data": {"query": "LangChain main features 2024", "pages_fetched": 1, ...}} +================================== Ai Message ================================== + +Based on the search results, LangChain is a framework for developing applications powered by language models. Its main features include... +``` + +## Advanced Usage + +### Geo-targeted search + +```python +from langchain_crawleo import CrawleoSearch + +# Search with German localization +search = CrawleoSearch( + setLang="de", + cc="DE", + geolocation="de", +) + +result = search.invoke({"query": "KI Nachrichten"}) # "AI news" in German + +for item in result["data"]["pages"]["1"]["search_results"][:5]: + print(f"{item.get('date', 'N/A')} | {item['title']}") +``` + +### Device-specific search results + +```python +from langchain_crawleo import CrawleoSearch + +# Compare desktop vs mobile results +desktop_search = CrawleoSearch(device="desktop", max_pages=1) +mobile_search = CrawleoSearch(device="mobile", max_pages=1) + +query = "best restaurants near me" + +desktop_results = desktop_search.invoke({"query": query}) +mobile_results = mobile_search.invoke({"query": query}) + +print("Desktop first result:", desktop_results["data"]["pages"]["1"]["search_results"][0]["title"]) +print("Mobile first result:", mobile_results["data"]["pages"]["1"]["search_results"][0]["title"]) +``` + +### 
Concurrent async searches + +```python +import asyncio +from langchain_crawleo import CrawleoSearch + +async def multi_search(): + tool = CrawleoSearch(max_pages=1, markdown=True) + + queries = [ + "artificial intelligence trends", + "machine learning frameworks", + "LLM applications", + ] + + tasks = [tool.ainvoke({"query": q}) for q in queries] + results = await asyncio.gather(*tasks) + + for query, result in zip(queries, results): + print(f"\n=== {query} ===") + first = result["data"]["pages"]["1"]["search_results"][0] + print(f"Top result: {first['title']}") + +asyncio.run(multi_search()) +``` + +### Search-then-answer (RAG pattern) + +```python +from langchain_crawleo import CrawleoSearch +from langchain_openai import ChatOpenAI +from langchain_core.prompts import ChatPromptTemplate + +search = CrawleoSearch(max_pages=1, markdown=True) +llm = ChatOpenAI(model="gpt-4o-mini", temperature=0) + +question = "How do I create a custom tool in LangChain?" + +# Step 1: Search +result = search.invoke({"query": question}) +page = result["data"]["pages"]["1"] + +# Step 2: Build context +context = "\n".join([ + f"Source: {item['link']}\n{item.get('snippet', '')}" + for item in page["search_results"][:5] +]) + +# Step 3: Answer with sources +prompt = ChatPromptTemplate.from_template(""" +Answer the question based on the search results. Cite sources. 
+ +Search results: +{context} + +Question: {question} +""") + +chain = prompt | llm +answer = chain.invoke({"context": context, "question": question}) +print(answer.content) +``` + +--- + +## API reference + +For detailed documentation of all Crawleo Search API features and configurations head to the API reference: [crawleo.dev/docs](https://crawleo.dev/docs) + diff --git a/src/oss/python/integrations/tools/crawleo_crawler.mdx b/src/oss/python/integrations/tools/crawleo_crawler.mdx new file mode 100644 index 0000000000..41fcc7722c --- /dev/null +++ b/src/oss/python/integrations/tools/crawleo_crawler.mdx @@ -0,0 +1,315 @@ +--- +title: Crawleo Crawler +--- + +[Crawleo](https://crawleo.dev) is a privacy-first web search and crawler API. The Crawler endpoint can be used to extract content from URLs with support for raw HTML and Markdown output. + +## Overview + +### Integration details + +| Class | Package | Serializable | JS support | Version | +|:-----------------------------------------------------------------|:-------------------------------------------------------------------|:---:|:---:|:---:| +| [CrawleoCrawler](https://github.com/crawleo/langchain-crawleo) | [langchain-crawleo](https://pypi.org/project/langchain-crawleo/) | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-crawleo?style=flat-square&label=%20) | + +### Tool features + +| [Returns artifact](/oss/langchain/tools) | Native async | Return data | Pricing | +|:---:|:---:|:---:|:---:| +| ❌ | ✅ | raw_content, markdown | Pay-per-use credits | + +## Setup + +The integration lives in the `langchain-crawleo` package. + +```bash +pip install -qU langchain-crawleo +``` + +### Credentials + +We need to set our Crawleo API key. You can get an API key by visiting [Crawleo](https://crawleo.dev) and creating an account. 
+ +```python +import getpass +import os + +if not os.environ.get("CRAWLEO_API_KEY"): + os.environ["CRAWLEO_API_KEY"] = getpass.getpass("Crawleo API key:\n") +``` + +## Instantiation + +The tool accepts various parameters during instantiation: + +- `raw_html` (optional, bool): Whether to return raw HTML content. Default is False. +- `markdown` (optional, bool): Whether to return content in markdown format. Default is False. + +For a comprehensive overview of the available parameters, refer to the [Crawleo API documentation](https://crawleo.dev/docs). + +```python +from langchain_crawleo import CrawleoCrawler + +tool = CrawleoCrawler( + markdown=True, + # raw_html=False, +) +``` + +## Invocation + +### [Invoke directly with args](/oss/langchain/tools) + +The Crawleo crawler tool accepts the following arguments during invocation: + +- `urls` (required): A list of URLs to crawl (1-20 URLs maximum) +- Both `raw_html` and `markdown` can also be set during invocation + +NOTE: The optional arguments are available for agents to dynamically set. If you set an argument during instantiation and then invoke the tool with a different value, the tool will use the value you passed during invocation. + +```python +tool.invoke({"urls": ["https://python.langchain.com"]}) +``` + +```output +{ + "status": "success", + "data": { + "results": [ + { + "url": "https://python.langchain.com", + "raw_content": "...", + "markdown": "# LangChain\n\nLangChain is a framework for developing applications..." 
+ } + ], + "credits_used": 1 + } +} +``` + +### Crawl multiple URLs + +```python +from langchain_crawleo import CrawleoCrawler + +tool = CrawleoCrawler(markdown=True) + +result = tool.invoke({ + "urls": [ + "https://python.langchain.com", + "https://js.langchain.com" + ] +}) + +for item in result["data"]["results"]: + print(f"URL: {item['url']}") + print(f"Content preview: {item['markdown'][:200]}...") + print("---") +``` + +### [Invoke with ToolCall](/oss/langchain/tools) + +We can also invoke the tool with a model-generated ToolCall, in which case a ToolMessage will be returned: + +```python +# This is usually generated by a model, but we'll create a tool call directly for demo purposes. +model_generated_tool_call = { + "args": {"urls": ["https://en.wikipedia.org/wiki/Artificial_intelligence"]}, + "id": "1", + "name": "crawleo_crawler", + "type": "tool_call", +} +tool_msg = tool.invoke(model_generated_tool_call) + +# The content is a JSON string of results +print(tool_msg.content[:400]) +``` + +```output +{"status": "success", "data": {"results": [{"url": "https://en.wikipedia.org/wiki/Artificial_intelligence", "raw_content": "Artificial intelligence - Wikipedia\nJump to content\nMain menu...", "markdown": "# Artificial intelligence\n\nArtificial intelligence (AI) is the intelligence of machines... +``` + +## Use within an agent + +We can use our tools directly with an agent executor by binding the tool to the agent. This gives the agent the ability to dynamically crawl URLs. 
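Before wiring up a real model, it can help to see what the agent loop amounts to. Below is a minimal hand-rolled sketch that runs offline: `fake_crawl` is a stub standing in for the crawler (it is not part of the real API), and the ToolCall dict uses the same shape as the ToolCall example above.

```python
# Offline sketch of the agent's tool-calling step. `fake_crawl` is a stub
# standing in for CrawleoCrawler; it is NOT part of the real API.

def fake_crawl(tool_call: dict) -> dict:
    """Pretend to crawl each requested URL and return a Crawleo-shaped payload."""
    urls = tool_call["args"]["urls"]
    return {
        "status": "success",
        "data": {"results": [{"url": u, "markdown": f"# Page at {u}"} for u in urls]},
    }

# 1. The model emits a ToolCall (same shape as the ToolCall example above).
tool_call = {
    "args": {"urls": ["https://python.langchain.com"]},
    "id": "1",
    "name": "crawleo_crawler",
    "type": "tool_call",
}

# 2. The agent invokes the tool; the result goes back to the model as a
#    ToolMessage, and the model then writes the final answer.
tool_result = fake_crawl(tool_call)
print(tool_result["data"]["results"][0]["markdown"])  # '# Page at https://python.langchain.com'
```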
+ +```python +if not os.environ.get("OPENAI_API_KEY"): + os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API_KEY:\n") +``` + +```python +from langchain.chat_models import init_chat_model + +model = init_chat_model(model="gpt-4o", model_provider="openai", temperature=0) +``` + +```bash +pip install -qU langgraph +``` + +```python +from langchain_crawleo import CrawleoCrawler +from langgraph.prebuilt import create_react_agent + +crawleo_crawler_tool = CrawleoCrawler(markdown=True) + +agent = create_react_agent(model, [crawleo_crawler_tool]) + +user_input = "Extract the content from https://python.langchain.com and summarize what LangChain is." + +for step in agent.stream( + {"messages": [("user", user_input)]}, + stream_mode="values", +): + step["messages"][-1].pretty_print() +``` + +```output +================================ Human Message ================================= + +Extract the content from https://python.langchain.com and summarize what LangChain is. +================================== Ai Message ================================== +Tool Calls: + crawleo_crawler (call_xyz789) + Call ID: call_xyz789 + Args: + urls: ['https://python.langchain.com'] +================================= Tool Message ================================= +Name: crawleo_crawler + +{"status": "success", "data": {"results": [{"url": "https://python.langchain.com", "markdown": "# LangChain..."}]}} +================================== Ai Message ================================== + +Based on the extracted content, LangChain is a framework for developing applications powered by language models... 
+``` + +## Advanced Usage + +### Combining Search and Crawler + +Use CrawleoSearch to find relevant URLs, then CrawleoCrawler to extract full content: + +```python +from langchain_crawleo import CrawleoSearch, CrawleoCrawler + +search = CrawleoSearch(max_pages=1) +crawler = CrawleoCrawler(markdown=True) + +# Step 1: Search for relevant pages +search_results = search.invoke({"query": "LangChain documentation"}) +urls = [ + item["link"] + for item in search_results["data"]["pages"]["1"]["search_results"][:3] +] + +# Step 2: Crawl the top results +crawl_results = crawler.invoke({"urls": urls}) + +for result in crawl_results["data"]["results"]: + print(f"URL: {result['url']}") + print(f"Content length: {len(result.get('markdown', ''))}") + print("---") +``` + +### Async crawling + +```python +import asyncio +from langchain_crawleo import CrawleoCrawler + +async def crawl_pages(): + tool = CrawleoCrawler(markdown=True) + + urls = [ + "https://python.langchain.com", + "https://js.langchain.com", + ] + + result = await tool.ainvoke({"urls": urls}) + + for item in result["data"]["results"]: + print(f"Crawled: {item['url']}") + print(f"Preview: {item['markdown'][:100]}...") + +asyncio.run(crawl_pages()) +``` + +### Extract and summarize with LLM + +```python +from langchain_crawleo import CrawleoCrawler +from langchain_openai import ChatOpenAI +from langchain_core.prompts import ChatPromptTemplate + +crawler = CrawleoCrawler(markdown=True) +llm = ChatOpenAI(model="gpt-4o-mini", temperature=0) + +# Crawl a page +result = crawler.invoke({"urls": ["https://python.langchain.com/docs/concepts/"]}) +content = result["data"]["results"][0]["markdown"] + +# Summarize with LLM +prompt = ChatPromptTemplate.from_template(""" +Summarize the following web page content in 3-5 bullet points: + +{content} +""") + +chain = prompt | llm +summary = chain.invoke({"content": content[:10000]}) # Limit content length +print(summary.content) +``` + +### Building a research pipeline + +```python 
+from langchain_crawleo import CrawleoSearch, CrawleoCrawler +from langchain_openai import ChatOpenAI +from langchain_core.prompts import ChatPromptTemplate + +search = CrawleoSearch(max_pages=1, markdown=True) +crawler = CrawleoCrawler(markdown=True) +llm = ChatOpenAI(model="gpt-4o-mini", temperature=0) + +def research(topic: str) -> str: + # Step 1: Search for relevant pages + search_results = search.invoke({"query": topic}) + urls = [ + item["link"] + for item in search_results["data"]["pages"]["1"]["search_results"][:2] + ] + + # Step 2: Crawl the pages + crawl_results = crawler.invoke({"urls": urls}) + + # Step 3: Combine content + combined_content = "\n\n---\n\n".join([ + f"Source: {r['url']}\n{r.get('markdown', '')[:3000]}" + for r in crawl_results["data"]["results"] + ]) + + # Step 4: Generate research summary + prompt = ChatPromptTemplate.from_template(""" + Based on the following sources, provide a comprehensive answer about: {topic} + + Sources: + {content} + + Provide a detailed answer with citations to the sources. + """) + + chain = prompt | llm + return chain.invoke({"topic": topic, "content": combined_content}).content + +# Run research +answer = research("What is LangChain and how does it work?") +print(answer) +``` + +--- + +## API reference + +For detailed documentation of all Crawleo Crawler API features and configurations head to the API reference: [crawleo.dev/docs](https://crawleo.dev/docs) + diff --git a/src/oss/python/integrations/tools/crawleo_search.mdx b/src/oss/python/integrations/tools/crawleo_search.mdx new file mode 100644 index 0000000000..6b5f9998af --- /dev/null +++ b/src/oss/python/integrations/tools/crawleo_search.mdx @@ -0,0 +1,313 @@ +--- +title: Crawleo Search +--- + +[Crawleo's Search API](https://crawleo.dev) is a privacy-first web search engine built for AI agents (LLMs), delivering real-time, accurate search results with AI-enhanced content extraction. 
+ +## Overview + +### Integration details + +| Class | Package | Serializable | JS support | Version | +|:----------------------------------------------------------------|:-------------------------------------------------------------------|:---:|:---:|:---:| +| [CrawleoSearch](https://github.com/crawleo/langchain-crawleo) | [langchain-crawleo](https://pypi.org/project/langchain-crawleo/) | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-crawleo?style=flat-square&label=%20) | + +### Tool features + +| [Returns artifact](/oss/langchain/tools) | Native async | Return data | Pricing | +|:---:|:---:|:---:|:---:| +| ❌ | ✅ | title, URL, snippet, enhanced_html, markdown | Pay-per-use credits | + +## Setup + +The integration lives in the `langchain-crawleo` package. + +```bash +pip install -qU langchain-crawleo +``` + +### Credentials + +We need to set our Crawleo API key. You can get an API key by visiting [Crawleo](https://crawleo.dev) and creating an account. + +```python +import getpass +import os + +if not os.environ.get("CRAWLEO_API_KEY"): + os.environ["CRAWLEO_API_KEY"] = getpass.getpass("Crawleo API key:\n") +``` + +## Instantiation + +Here we show how to instantiate an instance of the Crawleo search tool. The tool accepts various parameters to customize the search. After instantiation we invoke the tool with a simple query. + +### Parameters + +The tool accepts various parameters during instantiation: + +- `max_pages` (optional, int): Maximum number of result pages to fetch. Each page costs 1 credit. Default is 1. +- `setLang` (optional, str): Language code for search interface (e.g., "en", "es", "fr"). Default is "en". +- `cc` (optional, str): Country code for search results (e.g., "US", "GB", "DE"). +- `geolocation` (optional, str): Geo location for search. Allowed: random, pl, gb, jp, de, fr, es, us. Default is "random". +- `device` (optional, str): Device simulation: "desktop", "mobile", or "tablet". Default is "desktop". 
+- `enhanced_html` (optional, bool): Return AI-enhanced, cleaned HTML optimized for processing. Default is True. +- `raw_html` (optional, bool): Return original, unprocessed HTML of the page. Default is False. +- `page_text` (optional, bool): Return extracted plain text without HTML tags. Default is False. +- `markdown` (optional, bool): Return content in Markdown format for easy parsing. Default is True. + +For a comprehensive overview of the available parameters, refer to the [Crawleo API documentation](https://crawleo.dev/docs). + +```python +from langchain_crawleo import CrawleoSearch + +tool = CrawleoSearch( + max_pages=1, + setLang="en", + cc="US", + # geolocation="random", + # device="desktop", + # enhanced_html=True, + # raw_html=False, + # page_text=False, + # markdown=True, +) +``` + +## Invocation + +### [Invoke directly with args](/oss/langchain/tools) + +The Crawleo search tool accepts the following arguments during invocation: + +- `query` (required): A natural language search query +- The following arguments can also be set during invocation: `max_pages`, `setLang`, `cc`, `geolocation`, `device`, `enhanced_html`, `raw_html`, `page_text`, `markdown` + +NOTE: The optional arguments are available for agents to dynamically set. If you set an argument during instantiation and then invoke the tool with a different value, the tool will use the value you passed during invocation. + +```python +tool.invoke({"query": "What are the latest AI trends?"}) +``` + +```output +{ + "status": "success", + "data": { + "query": "What are the latest AI trends?", + "pages_fetched": 1, + "time_used": 1.23, + "pages": { + "1": { + "total_results": "About 1,234,000 results", + "search_results": [ + { + "title": "AI Trends 2024", + "link": "https://example.com/ai-trends", + "snippet": "The latest AI trends include...", + "source": "example.com" + } + ], + "page_content": { + "enhanced_html": "...", + "page_markdown": "..." 
+                }
+            }
+        },
+        "credits_used": 1
+    }
+}
+```
+
+### [Invoke with ToolCall](/oss/langchain/tools)
+
+We can also invoke the tool with a model-generated ToolCall, in which case a ToolMessage will be returned:
+
+```python
+# This is usually generated by a model, but we'll create a tool call directly for demo purposes.
+model_generated_tool_call = {
+    "args": {"query": "LangChain framework features"},
+    "id": "1",
+    "name": "crawleo_search",
+    "type": "tool_call",
+}
+tool_msg = tool.invoke(model_generated_tool_call)
+
+# The content is a JSON string of results
+print(tool_msg.content[:400])
+```
+
+```output
+{"status": "success", "data": {"query": "LangChain framework features", "pages_fetched": 1, "time_used": 0.98, "pages": {"1": {"total_results": "About 523,000 results", "search_results": [{"title": "LangChain Introduction", "link": "https://python.langchain.com", "snippet": "LangChain is a framework for developing applications powered by language models...
+```
+
+## Use within an agent
+
+We can use the tool within an agent by binding it to the model. This gives the agent the ability to set the Crawleo search tool's arguments dynamically.
+
+In the example below, when we ask the agent to find information, it dynamically sets the arguments and invokes the Crawleo search tool.
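If you want to post-process the tool's raw output yourself instead of leaving interpretation to the agent, the nested `pages` structure can be flattened with a small helper. Below is a minimal sketch — `flatten_results` is a hypothetical helper (not part of `langchain-crawleo`), and `sample` simply mirrors the response shape documented above rather than a live API call:

```python
def flatten_results(payload: dict) -> list[dict]:
    """Collect title/link/snippet records from every fetched page, in page order."""
    records = []
    for page_num in sorted(payload["data"]["pages"], key=int):
        page = payload["data"]["pages"][page_num]
        for item in page.get("search_results", []):
            records.append(
                {
                    "title": item.get("title", ""),
                    "link": item.get("link", ""),
                    "snippet": item.get("snippet", ""),
                }
            )
    return records


# Sample payload mirroring the documented response shape (not a real API response)
sample = {
    "data": {
        "pages": {
            "1": {
                "search_results": [
                    {
                        "title": "AI Trends 2024",
                        "link": "https://example.com/ai-trends",
                        "snippet": "The latest AI trends include...",
                        "source": "example.com",
                    }
                ]
            }
        }
    }
}

for record in flatten_results(sample):
    print(f"{record['title']} -> {record['link']}")
# prints: AI Trends 2024 -> https://example.com/ai-trends
```

Because the helper only relies on the documented keys, the same function should work on a real `tool.invoke(...)` payload.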
+ +```python +if not os.environ.get("OPENAI_API_KEY"): + os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API_KEY:\n") +``` + +```python +from langchain.chat_models import init_chat_model + +model = init_chat_model(model="gpt-4o", model_provider="openai", temperature=0) +``` + +We will need to install langgraph: + +```bash +pip install -qU langgraph +``` + +```python +from langchain_crawleo import CrawleoSearch +from langgraph.prebuilt import create_react_agent + +# Initialize Crawleo Search Tool +crawleo_search_tool = CrawleoSearch( + max_pages=1, + setLang="en", + markdown=True, +) + +agent = create_react_agent(model, [crawleo_search_tool]) + +user_input = "What are the main features of LangChain? Search for recent information." + +for step in agent.stream( + {"messages": [("user", user_input)]}, + stream_mode="values", +): + step["messages"][-1].pretty_print() +``` + +```output +================================ Human Message ================================= + +What are the main features of LangChain? Search for recent information. +================================== Ai Message ================================== +Tool Calls: + crawleo_search (call_abc123) + Call ID: call_abc123 + Args: + query: LangChain main features 2024 +================================= Tool Message ================================= +Name: crawleo_search + +{"status": "success", "data": {"query": "LangChain main features 2024", "pages_fetched": 1, ...}} +================================== Ai Message ================================== + +Based on the search results, LangChain is a framework for developing applications powered by language models. Its main features include... 
+```
+
+## Advanced usage
+
+### Geo-targeted search
+
+```python
+from langchain_crawleo import CrawleoSearch
+
+# Search with German localization
+search = CrawleoSearch(
+    setLang="de",
+    cc="DE",
+    geolocation="de",
+)
+
+result = search.invoke({"query": "KI Nachrichten"})  # "AI news" in German
+
+for item in result["data"]["pages"]["1"]["search_results"][:5]:
+    print(f"{item.get('date', 'N/A')} | {item['title']}")
+```
+
+### Device-specific search results
+
+```python
+from langchain_crawleo import CrawleoSearch
+
+# Compare desktop vs mobile results
+desktop_search = CrawleoSearch(device="desktop", max_pages=1)
+mobile_search = CrawleoSearch(device="mobile", max_pages=1)
+
+query = "best restaurants near me"
+
+desktop_results = desktop_search.invoke({"query": query})
+mobile_results = mobile_search.invoke({"query": query})
+
+print("Desktop first result:", desktop_results["data"]["pages"]["1"]["search_results"][0]["title"])
+print("Mobile first result:", mobile_results["data"]["pages"]["1"]["search_results"][0]["title"])
+```
+
+### Concurrent async searches
+
+```python
+import asyncio
+from langchain_crawleo import CrawleoSearch
+
+async def multi_search():
+    tool = CrawleoSearch(max_pages=1, markdown=True)
+
+    queries = [
+        "artificial intelligence trends",
+        "machine learning frameworks",
+        "LLM applications",
+    ]
+
+    tasks = [tool.ainvoke({"query": q}) for q in queries]
+    results = await asyncio.gather(*tasks)
+
+    for query, result in zip(queries, results):
+        print(f"\n=== {query} ===")
+        first = result["data"]["pages"]["1"]["search_results"][0]
+        print(f"Top result: {first['title']}")
+
+asyncio.run(multi_search())
+```
+
+### Search-then-answer (RAG pattern)
+
+```python
+from langchain_crawleo import CrawleoSearch
+from langchain_openai import ChatOpenAI
+from langchain_core.prompts import ChatPromptTemplate
+
+search = CrawleoSearch(max_pages=1, markdown=True)
+llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
+
+question = "How do I create a custom tool in LangChain?"
+
+# Step 1: Search
+result = search.invoke({"query": question})
+page = result["data"]["pages"]["1"]
+
+# Step 2: Build context
+context = "\n".join([
+    f"Source: {item['link']}\n{item.get('snippet', '')}"
+    for item in page["search_results"][:5]
+])
+
+# Step 3: Answer with sources
+prompt = ChatPromptTemplate.from_template("""
+Answer the question based on the search results. Cite sources.
+
+Search results:
+{context}
+
+Question: {question}
+""")
+
+chain = prompt | llm
+answer = chain.invoke({"context": context, "question": question})
+print(answer.content)
+```
+
+---
+
+## API reference
+
+For detailed documentation of all Crawleo Search API features and configurations, head to the API reference: [crawleo.dev/docs](https://crawleo.dev/docs)
+
diff --git a/src/oss/python/integrations/tools/index.mdx b/src/oss/python/integrations/tools/index.mdx
index d4acc4193e..535c321dec 100644
--- a/src/oss/python/integrations/tools/index.mdx
+++ b/src/oss/python/integrations/tools/index.mdx
@@ -21,6 +21,7 @@ The following table shows tools that execute online searches in some shape or fo
| [Jina Search](/oss/integrations/tools/jina_search) | 1M Response Tokens Free | URL, Snippet, Title, Page Content |
| [Mojeek Search](/oss/integrations/tools/mojeek_search) | Paid | URL, Snippet, Title |
| [Parallel Search](/oss/integrations/tools/parallel_search) | Paid | URL, Title, Excerpts |
+| [Crawleo](/oss/integrations/tools/crawleo_search) | 1000 Free Searches on Sign-Up, $1 Per 1K Searches Thereafter | URL, Snippet, Title, Page Content |
| [SearchApi](/oss/integrations/tools/searchapi) | 100 Free Searches on Sign Up | URL, Snippet, Title, Search Rank, Site Links, Authors |
| [SearxNG Search](/oss/integrations/tools/searx_search) | Free | URL, Snippet, Title, Category |
| [SerpAPI](/oss/integrations/tools/serpapi) | 100 Free Searches/Month | Answer |
@@ -225,6 +226,8 @@ The following platforms provide access to multiple tools and services through a
+
+