
Commit 2ad578c

Merge branch 'feat/strands-core-apify-tools' into feat/strands-search-crawling-actor-tools
2 parents ab0d675 + 38ccda7 commit 2ad578c

3 files changed

Lines changed: 56 additions & 44 deletions


README.md

Lines changed: 1 addition & 1 deletion
@@ -971,7 +971,7 @@ from strands_tools.apify import APIFY_ALL_TOOLS
 
 agent = Agent(tools=APIFY_ALL_TOOLS)
 
-# Scrape a single URL and get markdown content
+# Scrape a single URL and get Markdown content
 content = agent.tool.apify_scrape_url(url="https://example.com")
 
 # Run an Actor and get results in one step

docs/apify_tool.md

Lines changed: 3 additions & 3 deletions
@@ -344,9 +344,9 @@ At least one of `search_query` or `urls` must be provided.
 | `APIFY_API_TOKEN environment variable is not set` | Token not configured | Set the `APIFY_API_TOKEN` environment variable |
 | `apify-client package is required` | Optional dependency not installed | Run `pip install strands-agents-tools[apify]` |
 | `Actor ... finished with status FAILED` | Actor execution error | Check Actor input parameters and run logs in [Apify Console](https://console.apify.com) |
-| `Task ... finished with status FAILED` | task execution error | Check task configuration and run logs in [Apify Console](https://console.apify.com) |
-| `Actor/task ... finished with status TIMED-OUT` | Timeout too short for the workload | Increase the `timeout_secs` parameter; `apify_website_content_crawler` with large `max_pages` may need 600+ seconds |
-| `Task ... returned no run data` | task `call()` returned `None` (wait timeout) | Increase the `timeout_secs` parameter |
+| `Task ... finished with status FAILED` | Task execution error | Check task configuration and run logs in [Apify Console](https://console.apify.com) |
+| `Actor/task ... finished with status TIMED-OUT` | Timeout too short for the workload | Increase the `timeout_secs` parameter |
+| `Task ... returned no run data` | Task `call()` returned `None` (wait timeout) | Increase the `timeout_secs` parameter |
 | `No content returned for URL` | Website Content Crawler returned empty results | Verify the URL is accessible and returns content |
 | `At least one of 'search_query' or 'urls' must be provided` | YouTube Scraper called without input | Provide a `search_query`, `urls`, or both |
 
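
For illustration, a minimal sketch of the `timeout_secs` remedy described above, assuming an `agent` built with the Apify tools as in the README example:

# TIMED-OUT runs: allow more time than the 120-second default of apify_scrape_url.
content = agent.tool.apify_scrape_url(url="https://example.com", timeout_secs=300)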

src/strands_tools/apify.py

Lines changed: 52 additions & 40 deletions
@@ -1,8 +1,10 @@
 """Apify platform tools for Strands Agents.
 
-This module provides web scraping, data extraction, and automation capabilities
-using the Apify platform. It lets you run any Actor, task, fetch dataset
-results, scrape individual URLs, and perform specialized search and crawling.
+
+Apify is the world's largest marketplace of tools for web scraping, crawling, data extraction, and web automation.
+These tools are called Actors, serverless cloud programs that take JSON input and store results
+in a dataset (structured, tabular output) or key-value store (files and unstructured data).
+Get structured data from social media, e-commerce, search engines, maps, travel sites, or any other website.
 
 Available Tools:
 ---------------
@@ -24,7 +26,7 @@
 Setup Requirements:
 ------------------
 1. Create an Apify account at https://apify.com
-2. Obtain your API token: Apify Console > Settings > API & Integrations > Personal API tokens
+2. Get your API token: Apify Console > Settings > API & Integrations > Personal API tokens
 3. Install the optional dependency: pip install strands-agents-tools[apify]
 4. Set the environment variable:
    APIFY_API_TOKEN=your_api_token_here
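
A minimal setup sketch tying the four steps together; it assumes the usual `from strands import Agent` entry point and that the token would normally be exported in the shell rather than hard-coded:

import os

# Step 4: the token must be available in the environment before any Apify tool is called.
os.environ.setdefault("APIFY_API_TOKEN", "your_api_token_here")

from strands import Agent
from strands_tools.apify import APIFY_ALL_TOOLS

# Register every Apify tool on the agent, as in the README snippet above.
agent = Agent(tools=APIFY_ALL_TOOLS)
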
@@ -361,7 +363,7 @@ def scrape_url(
         timeout_secs: int = DEFAULT_SCRAPE_TIMEOUT_SECS,
         crawler_type: CrawlerType = "cheerio",
     ) -> str:
-        """Scrape a single URL using Website Content Crawler and return markdown."""
+        """Scrape a single URL using Website Content Crawler and return Markdown."""
         self._validate_url(url)
         self._validate_positive(timeout_secs, "timeout_secs")
         if crawler_type not in WEBSITE_CONTENT_CRAWLER_TYPES:
@@ -408,20 +410,24 @@ def apify_run_actor(
 ) -> Dict[str, Any]:
     """Run any Apify Actor and return the run metadata as JSON.
 
-    Executes the Actor synchronously - blocks until the Actor run finishes or the timeout
-    is reached. Use this when you need to run a specific Actor and then inspect or process
-    the results separately.
+    An Actor is a serverless cloud app on the Apify platform — it takes JSON input,
+    runs the scraping or automation job, and writes results to a dataset. This tool
+    executes the Actor synchronously and returns run metadata only (run_id, status,
+    dataset_id, timestamps). Use apify_run_actor_and_get_dataset to also fetch the
+    output data in one call, or apify_scrape_url for quick single-URL extraction.
 
     Common Actors:
-    - "apify/website-content-crawler" - scrape websites and extract content
-    - "apify/web-scraper" - general-purpose web scraper
-    - "apify/google-search-scraper" - scrape Google search results
+    - "apify/website-content-crawler" - scrape websites and extract content as Markdown
+    - "apify/web-scraper" - general-purpose web scraper with JS rendering
+    - "apify/google-search-scraper" - scrape Google search results
 
     Args:
-        actor_id: Actor identifier, e.g. "apify/website-content-crawler" or "username/actor-name".
-        run_input: JSON-serializable input for the Actor. Each Actor defines its own input schema.
+        actor_id: Actor identifier in "username/actor-name" format,
+            e.g. "apify/website-content-crawler". Find Actors at https://apify.com/store.
+        run_input: JSON-serializable input for the Actor. Each Actor defines its own
+            input schema - check the Actor README on Apify Store for required fields.
         timeout_secs: Maximum time in seconds to wait for the Actor run to finish. Defaults to 300.
-        memory_mbytes: Memory allocation in MB for the Actor run. Uses Actor default if not set.
+        memory_mbytes: Memory allocation in MB for the Actor run. Uses Actor default `memory` value if not set.
         build: Actor build tag or number to run a specific version. Uses latest build if not set.
 
     Returns:
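
A hedged usage sketch for this tool, reusing the `agent` from the setup sketch above; the `queries` field is an assumed example input for the Google Search Scraper and should be verified against that Actor's README:

# Start a run and keep only the metadata; results can be fetched later by dataset_id.
run = agent.tool.apify_run_actor(
    actor_id="apify/google-search-scraper",
    run_input={"queries": "strands agents sdk"},  # assumed input field; check the Actor README
    timeout_secs=300,
)
# The returned dict carries run_id, status and dataset_id, per the Returns section above.
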
@@ -461,8 +467,9 @@ def apify_get_dataset_items(
 ) -> Dict[str, Any]:
     """Fetch items from an existing Apify dataset and return them as JSON.
 
-    Use this after running an Actor to retrieve the structured results from its
-    default dataset, or to access any dataset by ID.
+    Every Actor run writes its output to a dataset — a structured, append-only store
+    for tabular data. Use the dataset_id from the run metadata returned by apify_run_actor
+    or apify_run_task. Use offset for pagination through large datasets.
 
     Args:
         dataset_id: The Apify dataset ID to fetch items from.
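
A short follow-up sketch, with a placeholder dataset ID that would come from the run metadata of a previous tool call:

# Fetch the structured results of an earlier run from its default dataset.
items = agent.tool.apify_get_dataset_items(
    dataset_id="YOUR_DATASET_ID",  # placeholder; use the dataset_id returned by apify_run_actor or apify_run_task
)
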
@@ -499,15 +506,17 @@ def apify_run_actor_and_get_dataset(
 ) -> Dict[str, Any]:
     """Run an Apify Actor and fetch its dataset results in one step.
 
-    Convenience tool that combines running an Actor and fetching its default
-    dataset items into a single call. Use this when you want both the run metadata and the
+    Convenience tool that combines running an Actor and fetching its default dataset
+    items into a single call. Use this when you want both the run metadata and the
     result data without making two separate tool calls.
 
     Args:
-        actor_id: Actor identifier, e.g. "apify/website-content-crawler" or "username/actor-name".
-        run_input: JSON-serializable input for the Actor.
+        actor_id: Actor identifier in "username/actor-name" format,
+            e.g. "apify/website-content-crawler". Find Actors at https://apify.com/store.
+        run_input: JSON-serializable input for the Actor. Each Actor defines its own
+            input schema - check the Actor README on Apify Store for required fields.
         timeout_secs: Maximum time in seconds to wait for the Actor run to finish. Defaults to 300.
-        memory_mbytes: Memory allocation in MB for the Actor run.
+        memory_mbytes: Memory allocation in MB for the Actor run. Uses Actor default `memory` value if not set.
         build: Actor build tag or number to run a specific version. Uses latest build if not set.
         dataset_items_limit: Maximum number of dataset items to return. Defaults to 100.
         dataset_items_offset: Number of dataset items to skip for pagination. Defaults to 0.
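
A hedged one-call sketch; the Website Content Crawler input shape here is an assumption to be checked against the Actor README on Apify Store:

# Run the crawler and pull the first 50 dataset items in a single tool call.
result = agent.tool.apify_run_actor_and_get_dataset(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://example.com"}]},  # assumed input shape
    timeout_secs=600,
    dataset_items_limit=50,
)
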
@@ -551,17 +560,18 @@ def apify_run_task(
     timeout_secs: int = DEFAULT_TIMEOUT_SECS,
     memory_mbytes: Optional[int] = None,
 ) -> Dict[str, Any]:
-    """Run an Apify task and return the run metadata as JSON.
+    """Run a saved Apify task and return the run metadata as JSON.
 
-    Tasks are saved Actor configurations with preset inputs. Use this when a task
-    has already been configured in Apify Console, so you don't need to specify
-    the full Actor input every time.
+    Tasks are saved Actor configurations with preset inputs, managed in Apify Console.
+    Use this when a task has already been configured, so you don't need to specify
+    the full Actor input every time. Use apify_run_task_and_get_dataset to also fetch
+    the output data in one call.
 
     Args:
-        task_id: Task identifier, e.g. "user/my-task" or a task ID string.
-        task_input: Optional JSON-serializable input to override the task's default input.
+        task_id: Task identifier in "username/task-name" format or a task ID string.
+        task_input: Optional JSON-serializable input to override the task's default input fields.
         timeout_secs: Maximum time in seconds to wait for the task run to finish. Defaults to 300.
-        memory_mbytes: Memory allocation in MB for the task run. Uses task default if not set.
+        memory_mbytes: Memory allocation in MB for the task run. Uses task default `memory` value if not set.
 
     Returns:
         Dict with status and content containing run metadata: run_id, status, dataset_id,
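
A usage sketch with a hypothetical task name; the override fields depend entirely on the Actor behind the task:

# Run a task preconfigured in Apify Console, overriding one preset input field.
run = agent.tool.apify_run_task(
    task_id="username/my-task",   # hypothetical task identifier
    task_input={"maxItems": 10},  # hypothetical override of a preset field
    timeout_secs=300,
)
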
@@ -600,17 +610,17 @@ def apify_run_task_and_get_dataset(
     dataset_items_limit: int = DEFAULT_DATASET_ITEMS_LIMIT,
     dataset_items_offset: int = 0,
 ) -> Dict[str, Any]:
-    """Run an Apify task and fetch its dataset results in one step.
+    """Run a saved Apify task and fetch its dataset results in one step.
 
-    Convenience tool that combines running a task and fetching its default
-    dataset items into a single call. Use this when you want both the run metadata and the
+    Convenience tool that combines running a task and fetching its default dataset
+    items into a single call. Use this when you want both the run metadata and the
     result data without making two separate tool calls.
 
     Args:
-        task_id: Task identifier, e.g. "user/my-task" or a task ID string.
-        task_input: Optional JSON-serializable input to override the task's default input.
+        task_id: Task identifier in "username/task-name" format or a task ID string.
+        task_input: Optional JSON-serializable input to override the task's default input fields.
         timeout_secs: Maximum time in seconds to wait for the task run to finish. Defaults to 300.
-        memory_mbytes: Memory allocation in MB for the task run.
+        memory_mbytes: Memory allocation in MB for the task run. Uses task default `memory` value if not set.
         dataset_items_limit: Maximum number of dataset items to return. Defaults to 100.
         dataset_items_offset: Number of dataset items to skip for pagination. Defaults to 0.
 
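
And the one-call variant, again with a hypothetical task identifier:

# Run the same hypothetical task and fetch its dataset items in one step.
result = agent.tool.apify_run_task_and_get_dataset(
    task_id="username/my-task",  # hypothetical task identifier
    dataset_items_limit=25,
)
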
@@ -651,21 +661,23 @@ def apify_scrape_url(
     timeout_secs: int = DEFAULT_SCRAPE_TIMEOUT_SECS,
     crawler_type: CrawlerType = "cheerio",
 ) -> Dict[str, Any]:
-    """Scrape a single URL and return its content as markdown.
+    """Scrape a single URL and return its content as Markdown.
 
     Uses the Website Content Crawler Actor under the hood, pre-configured for
     fast single-page scraping. This is the simplest way to extract readable content
-    from any web page.
+    from any web page — no Actor input schema needed. For multi-page crawls, use
+    apify_run_actor_and_get_dataset with "apify/website-content-crawler" directly.
 
     Args:
         url: The URL to scrape, e.g. "https://example.com".
         timeout_secs: Maximum time in seconds to wait for scraping to finish. Defaults to 120.
-        crawler_type: Crawler engine to use. One of "cheerio" (fastest, no JS rendering,
-            default), "playwright:adaptive" (fast, renders JS if present), or
-            "playwright:firefox" (reliable, renders JS, best at avoiding blocking but slower).
+        crawler_type: Crawler engine to use. One of:
+            - "cheerio" (default): Fastest, no JavaScript rendering. Best for static HTML.
+            - "playwright:adaptive": Renders JS only when needed. Good general-purpose choice.
+            - "playwright:firefox": Full JS rendering, best at bypassing anti-bot protection but slowest.
 
     Returns:
-        Dict with status and content containing the markdown content of the scraped page.
+        Dict with status and content containing the Markdown content of the scraped page.
     """
     try:
         _check_dependency()
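
Finally, a hedged sketch of the crawler_type choice documented above:

# Static page: the default "cheerio" engine is fastest.
page = agent.tool.apify_scrape_url(url="https://example.com")

# JavaScript-heavy page: switch to the adaptive Playwright engine.
page = agent.tool.apify_scrape_url(
    url="https://example.com",
    crawler_type="playwright:adaptive",
)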
