|
1 | 1 | """Apify platform tools for Strands Agents. |
2 | 2 |
|
3 | | -This module provides web scraping, data extraction, and automation capabilities |
4 | | -using the Apify platform. It lets you run any Actor, task, fetch dataset |
5 | | -results, scrape individual URLs, and perform specialized search and crawling. |
| 3 | +
|
| 4 | +Apify is the world's largest marketplace of tools for web scraping, crawling, data extraction, and web automation. |
| 5 | +These tools are called Actors, serverless cloud programs that take JSON input and store results |
| 6 | +in a dataset (structured, tabular output) or key-value store (files and unstructured data). |
| 7 | +Get structured data from social media, e-commerce, search engines, maps, travel sites, or any other website. |
6 | 8 |
|
7 | 9 | Available Tools: |
8 | 10 | --------------- |
|
24 | 26 | Setup Requirements: |
25 | 27 | ------------------ |
26 | 28 | 1. Create an Apify account at https://apify.com |
27 | | -2. Obtain your API token: Apify Console > Settings > API & Integrations > Personal API tokens |
| 29 | +2. Get your API token: Apify Console > Settings > API & Integrations > Personal API tokens |
28 | 30 | 3. Install the optional dependency: pip install strands-agents-tools[apify] |
29 | 31 | 4. Set the environment variable: |
30 | 32 | APIFY_API_TOKEN=your_api_token_here |
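A minimal pre-flight check for step 4 might look like this (the variable name comes from the steps above; the error message is illustrative):

```python
import os

# The Apify tools read the token from this environment variable (step 4 above).
token = os.environ.get("APIFY_API_TOKEN")
if not token:
    raise RuntimeError(
        "APIFY_API_TOKEN is not set. Create a token in Apify Console "
        "(Settings > API & Integrations) and export it before using these tools."
    )
```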
@@ -361,7 +363,7 @@ def scrape_url( |
361 | 363 | timeout_secs: int = DEFAULT_SCRAPE_TIMEOUT_SECS, |
362 | 364 | crawler_type: CrawlerType = "cheerio", |
363 | 365 | ) -> str: |
364 | | - """Scrape a single URL using Website Content Crawler and return markdown.""" |
| 366 | + """Scrape a single URL using Website Content Crawler and return Markdown.""" |
365 | 367 | self._validate_url(url) |
366 | 368 | self._validate_positive(timeout_secs, "timeout_secs") |
367 | 369 | if crawler_type not in WEBSITE_CONTENT_CRAWLER_TYPES: |
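For reference, the names used in the validation above are assumed to be module-level constants roughly like the following (types and values are inferred from the docstrings further down in this diff, not taken from the actual source):

```python
from typing import Literal

# Crawler engines accepted by Website Content Crawler (see apify_scrape_url below).
CrawlerType = Literal["cheerio", "playwright:adaptive", "playwright:firefox"]
WEBSITE_CONTENT_CRAWLER_TYPES = ("cheerio", "playwright:adaptive", "playwright:firefox")

# Defaults as stated in the docstrings of this diff.
DEFAULT_SCRAPE_TIMEOUT_SECS = 120
DEFAULT_TIMEOUT_SECS = 300
DEFAULT_DATASET_ITEMS_LIMIT = 100
```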
@@ -408,20 +410,24 @@ def apify_run_actor( |
408 | 410 | ) -> Dict[str, Any]: |
409 | 411 | """Run any Apify Actor and return the run metadata as JSON. |
410 | 412 |
|
411 | | - Executes the Actor synchronously - blocks until the Actor run finishes or the timeout |
412 | | - is reached. Use this when you need to run a specific Actor and then inspect or process |
413 | | - the results separately. |
| 413 | + An Actor is a serverless cloud app on the Apify platform — it takes JSON input, |
| 414 | + runs the scraping or automation job, and writes results to a dataset. This tool |
| 415 | + executes the Actor synchronously and returns run metadata only (run_id, status, |
| 416 | + dataset_id, timestamps). Use apify_run_actor_and_get_dataset to also fetch the |
| 417 | + output data in one call, or apify_scrape_url for quick single-URL extraction. |
414 | 418 |
|
415 | 419 | Common Actors: |
416 | | - - "apify/website-content-crawler" - scrape websites and extract content |
417 | | - - "apify/web-scraper" - general-purpose web scraper |
418 | | - - "apify/google-search-scraper" - scrape Google search results |
| 420 | + - "apify/website-content-crawler" - scrape websites and extract content as Markdown |
| 421 | + - "apify/web-scraper" - general-purpose web scraper with JS rendering |
| 422 | +    - "apify/google-search-scraper" - scrape Google search results |
419 | 423 |
|
420 | 424 | Args: |
421 | | - actor_id: Actor identifier, e.g. "apify/website-content-crawler" or "username/actor-name". |
422 | | - run_input: JSON-serializable input for the Actor. Each Actor defines its own input schema. |
| 425 | + actor_id: Actor identifier in "username/actor-name" format, |
| 426 | + e.g. "apify/website-content-crawler". Find Actors at https://apify.com/store. |
| 427 | + run_input: JSON-serializable input for the Actor. Each Actor defines its own |
| 428 | + input schema - check the Actor README on Apify Store for required fields. |
423 | 429 | timeout_secs: Maximum time in seconds to wait for the Actor run to finish. Defaults to 300. |
424 | | - memory_mbytes: Memory allocation in MB for the Actor run. Uses Actor default if not set. |
| 430 | + memory_mbytes: Memory allocation in MB for the Actor run. Uses Actor default `memory` value if not set. |
425 | 431 | build: Actor build tag or number to run a specific version. Uses latest build if not set. |
426 | 432 |
|
427 | 433 | Returns: |
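As a rough illustration only, a run through this tool corresponds approximately to the following direct call with the official apify-client library (the input fields shown are specific to website-content-crawler; other Actors define their own schema):

```python
import os

from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_API_TOKEN"])

# Synchronous run: blocks until the Actor finishes or timeout_secs elapses.
run = client.actor("apify/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://example.com"}], "maxCrawlPages": 5},
    timeout_secs=300,
)

# call() returns None if the wait timed out; error handling omitted here.
print(run["id"], run["status"], run["defaultDatasetId"])
```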
@@ -461,8 +467,9 @@ def apify_get_dataset_items( |
461 | 467 | ) -> Dict[str, Any]: |
462 | 468 | """Fetch items from an existing Apify dataset and return them as JSON. |
463 | 469 |
|
464 | | - Use this after running an Actor to retrieve the structured results from its |
465 | | - default dataset, or to access any dataset by ID. |
| 470 | + Every Actor run writes its output to a dataset — a structured, append-only store |
| 471 | + for tabular data. Use the dataset_id from the run metadata returned by apify_run_actor |
| 472 | + or apify_run_task. Use offset for pagination through large datasets. |
466 | 473 |
|
467 | 474 | Args: |
468 | 475 | dataset_id: The Apify dataset ID to fetch items from. |
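A sketch of the equivalent direct dataset read with apify-client (the dataset ID is a placeholder; limit and offset mirror the tool's pagination arguments):

```python
import os

from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_API_TOKEN"])

# dataset_id comes from the run metadata of apify_run_actor / apify_run_task.
dataset_id = "YOUR_DATASET_ID"  # placeholder
page = client.dataset(dataset_id).list_items(limit=100, offset=0)
print(f"{page.total} items total, {len(page.items)} fetched in this page")
```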
@@ -499,15 +506,17 @@ def apify_run_actor_and_get_dataset( |
499 | 506 | ) -> Dict[str, Any]: |
500 | 507 | """Run an Apify Actor and fetch its dataset results in one step. |
501 | 508 |
|
502 | | - Convenience tool that combines running an Actor and fetching its default |
503 | | - dataset items into a single call. Use this when you want both the run metadata and the |
| 509 | + Convenience tool that combines running an Actor and fetching its default dataset |
| 510 | + items into a single call. Use this when you want both the run metadata and the |
504 | 511 | result data without making two separate tool calls. |
505 | 512 |
|
506 | 513 | Args: |
507 | | - actor_id: Actor identifier, e.g. "apify/website-content-crawler" or "username/actor-name". |
508 | | - run_input: JSON-serializable input for the Actor. |
| 514 | + actor_id: Actor identifier in "username/actor-name" format, |
| 515 | + e.g. "apify/website-content-crawler". Find Actors at https://apify.com/store. |
| 516 | + run_input: JSON-serializable input for the Actor. Each Actor defines its own |
| 517 | + input schema - check the Actor README on Apify Store for required fields. |
509 | 518 | timeout_secs: Maximum time in seconds to wait for the Actor run to finish. Defaults to 300. |
510 | | - memory_mbytes: Memory allocation in MB for the Actor run. |
| 519 | + memory_mbytes: Memory allocation in MB for the Actor run. Uses Actor default `memory` value if not set. |
511 | 520 | build: Actor build tag or number to run a specific version. Uses latest build if not set. |
512 | 521 | dataset_items_limit: Maximum number of dataset items to return. Defaults to 100. |
513 | 522 | dataset_items_offset: Number of dataset items to skip for pagination. Defaults to 0. |
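Roughly the two steps this tool combines, shown with apify-client (the query input is specific to google-search-scraper and is illustrative):

```python
import os

from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_API_TOKEN"])

# Step 1: run the Actor and wait for it to finish.
run = client.actor("apify/google-search-scraper").call(
    run_input={"queries": "web scraping frameworks"},
    timeout_secs=300,
)

# Step 2: read the run's default dataset, the part this tool folds into one call.
items = client.dataset(run["defaultDatasetId"]).list_items(limit=100).items
print(len(items), "dataset items")
```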
@@ -551,17 +560,18 @@ def apify_run_task( |
551 | 560 | timeout_secs: int = DEFAULT_TIMEOUT_SECS, |
552 | 561 | memory_mbytes: Optional[int] = None, |
553 | 562 | ) -> Dict[str, Any]: |
554 | | - """Run an Apify task and return the run metadata as JSON. |
| 563 | + """Run a saved Apify task and return the run metadata as JSON. |
555 | 564 |
|
556 | | - Tasks are saved Actor configurations with preset inputs. Use this when a task |
557 | | - has already been configured in Apify Console, so you don't need to specify |
558 | | - the full Actor input every time. |
| 565 | + Tasks are saved Actor configurations with preset inputs, managed in Apify Console. |
| 566 | + Use this when a task has already been configured, so you don't need to specify |
| 567 | + the full Actor input every time. Use apify_run_task_and_get_dataset to also fetch |
| 568 | + the output data in one call. |
559 | 569 |
|
560 | 570 | Args: |
561 | | - task_id: Task identifier, e.g. "user/my-task" or a task ID string. |
562 | | - task_input: Optional JSON-serializable input to override the task's default input. |
| 571 | + task_id: Task identifier in "username/task-name" format or a task ID string. |
| 572 | + task_input: Optional JSON-serializable input to override the task's default input fields. |
563 | 573 | timeout_secs: Maximum time in seconds to wait for the task run to finish. Defaults to 300. |
564 | | - memory_mbytes: Memory allocation in MB for the task run. Uses task default if not set. |
| 574 | + memory_mbytes: Memory allocation in MB for the task run. Uses task default `memory` value if not set. |
565 | 575 |
|
566 | 576 | Returns: |
567 | 577 | Dict with status and content containing run metadata: run_id, status, dataset_id, |
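An approximate apify-client equivalent (the task identifier and the override field are placeholders):

```python
import os

from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_API_TOKEN"])

# Run a saved task; task_input only overrides the fields you pass explicitly.
run = client.task("username/my-task").call(  # placeholder task identifier
    task_input={"maxCrawlPages": 10},        # hypothetical override field
    timeout_secs=300,
)
print(run["status"], run["defaultDatasetId"])
```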
@@ -600,17 +610,17 @@ def apify_run_task_and_get_dataset( |
600 | 610 | dataset_items_limit: int = DEFAULT_DATASET_ITEMS_LIMIT, |
601 | 611 | dataset_items_offset: int = 0, |
602 | 612 | ) -> Dict[str, Any]: |
603 | | - """Run an Apify task and fetch its dataset results in one step. |
| 613 | + """Run a saved Apify task and fetch its dataset results in one step. |
604 | 614 |
|
605 | | - Convenience tool that combines running a task and fetching its default |
606 | | - dataset items into a single call. Use this when you want both the run metadata and the |
| 615 | + Convenience tool that combines running a task and fetching its default dataset |
| 616 | + items into a single call. Use this when you want both the run metadata and the |
607 | 617 | result data without making two separate tool calls. |
608 | 618 |
|
609 | 619 | Args: |
610 | | - task_id: Task identifier, e.g. "user/my-task" or a task ID string. |
611 | | - task_input: Optional JSON-serializable input to override the task's default input. |
| 620 | + task_id: Task identifier in "username/task-name" format or a task ID string. |
| 621 | + task_input: Optional JSON-serializable input to override the task's default input fields. |
612 | 622 | timeout_secs: Maximum time in seconds to wait for the task run to finish. Defaults to 300. |
613 | | - memory_mbytes: Memory allocation in MB for the task run. |
| 623 | + memory_mbytes: Memory allocation in MB for the task run. Uses task default `memory` value if not set. |
614 | 624 | dataset_items_limit: Maximum number of dataset items to return. Defaults to 100. |
615 | 625 | dataset_items_offset: Number of dataset items to skip for pagination. Defaults to 0. |
616 | 626 |
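A sketch of the run-then-paginate pattern this tool wraps, using apify-client (the task identifier is a placeholder; the loop mirrors dataset_items_limit and dataset_items_offset):

```python
import os

from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_API_TOKEN"])

run = client.task("username/my-task").call(timeout_secs=300)  # placeholder task

# Page through the task's default dataset in fixed-size chunks.
offset, limit = 0, 100
while True:
    page = client.dataset(run["defaultDatasetId"]).list_items(offset=offset, limit=limit)
    if not page.items:
        break
    for item in page.items:
        ...  # process each item
    offset += len(page.items)
```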
|
@@ -651,21 +661,23 @@ def apify_scrape_url( |
651 | 661 | timeout_secs: int = DEFAULT_SCRAPE_TIMEOUT_SECS, |
652 | 662 | crawler_type: CrawlerType = "cheerio", |
653 | 663 | ) -> Dict[str, Any]: |
654 | | - """Scrape a single URL and return its content as markdown. |
| 664 | + """Scrape a single URL and return its content as Markdown. |
655 | 665 |
|
656 | 666 | Uses the Website Content Crawler Actor under the hood, pre-configured for |
657 | 667 | fast single-page scraping. This is the simplest way to extract readable content |
658 | | - from any web page. |
| 668 | + from any web page — no Actor input schema needed. For multi-page crawls, use |
| 669 | + apify_run_actor_and_get_dataset with "apify/website-content-crawler" directly. |
659 | 670 |
|
660 | 671 | Args: |
661 | 672 | url: The URL to scrape, e.g. "https://example.com". |
662 | 673 | timeout_secs: Maximum time in seconds to wait for scraping to finish. Defaults to 120. |
663 | | - crawler_type: Crawler engine to use. One of "cheerio" (fastest, no JS rendering, |
664 | | - default), "playwright:adaptive" (fast, renders JS if present), or |
665 | | - "playwright:firefox" (reliable, renders JS, best at avoiding blocking but slower). |
| 674 | + crawler_type: Crawler engine to use. One of: |
| 675 | + - "cheerio" (default): Fastest, no JavaScript rendering. Best for static HTML. |
| 676 | + - "playwright:adaptive": Renders JS only when needed. Good general-purpose choice. |
| 677 | + - "playwright:firefox": Full JS rendering, best at bypassing anti-bot protection but slowest. |
666 | 678 |
|
667 | 679 | Returns: |
668 | | - Dict with status and content containing the markdown content of the scraped page. |
| 680 | + Dict with status and content containing the Markdown content of the scraped page. |
669 | 681 | """ |
670 | 682 | try: |
671 | 683 | _check_dependency() |
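For comparison, a single-page scrape configured directly against website-content-crawler might look roughly like this with apify-client (the input field names are the Actor's, but the exact configuration this tool uses under the hood is an assumption):

```python
import os

from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_API_TOKEN"])

# One start URL, no crawling beyond it, and the chosen crawler engine.
run = client.actor("apify/website-content-crawler").call(
    run_input={
        "startUrls": [{"url": "https://example.com"}],
        "crawlerType": "cheerio",  # or "playwright:adaptive" / "playwright:firefox"
        "maxCrawlPages": 1,
    },
    timeout_secs=120,
)
item = client.dataset(run["defaultDatasetId"]).list_items(limit=1).items[0]
print(item.get("markdown") or item.get("text"))
```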
|