
# Crawl Endpoint

The Crawl endpoint discovers and extracts content from multiple pages starting from a given URL.

API Reference: https://spider.cloud/docs/api#crawl

## Basic Usage

```ruby
# Always use a limit to control credit usage
response = SpiderCloud.crawl( 'https://example.com', limit: 5 )

response.result.each do | page |
  puts "#{ page.url }: #{ page.content&.length } chars"
end
```

## With Options

```ruby
options = SpiderCloud::CrawlOptions.build do
  limit 5                    # max pages to crawl
  depth 2                    # max link depth
  return_format :markdown
  readability true
end

response = SpiderCloud.crawl( 'https://example.com', options )
```

## Options Reference

### Core Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `limit` | Integer | 0 | Max pages to crawl (0 = unlimited) |
| `depth` | Integer | 25 | Max crawl depth |
| `return_format` | Symbol | `:raw` | Output format |
| `request` | Symbol | `:smart` | Request type |

### Crawl Scope

| Option | Type | Description |
| --- | --- | --- |
| `subdomains` | Boolean | Include subdomains |
| `tld` | Boolean | Include TLD variations |
| `external_domains` | Array | External domains to include (`["*"]` for all) |
| `redirect_policy` | Symbol | One of `:loose`, `:strict`, `:none` |
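A sketch combining the scope options above. The option names and types come from the table; the behavior described in the comments (in particular what `:strict` redirects do) is an assumption about the API, not documented here:

```ruby
options = SpiderCloud::CrawlOptions.build do
  limit 50
  subdomains true                         # also crawl e.g. docs.example.com
  external_domains [ 'cdn.example.com' ]  # follow links into this external domain
  redirect_policy :strict                 # assumed: stay on the start domain across redirects
end

response = SpiderCloud.crawl( 'https://example.com', options )
```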

### URL Filtering

| Option | Type | Description |
| --- | --- | --- |
| `blacklist` | Array | Paths to exclude (regex supported) |
| `whitelist` | Array | Paths to include only |
| `budget` | Hash | Path-based page limits |
| `link_rewrite` | Hash | URL rewrite rules |
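`link_rewrite` is the one option here without an example elsewhere in this document. A hedged sketch, assuming the hash maps a matched URL prefix to its replacement (only the option name and Hash type come from the table; the key/value semantics are an assumption):

```ruby
options = SpiderCloud::CrawlOptions.build do
  limit 50
  # Assumed shape: discovered links whose prefix matches the key
  # are rewritten to use the value before crawling.
  link_rewrite( { 'http://example.com' => 'https://example.com' } )
end
```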

### Budget Example

Control how many pages to crawl per path:

```ruby
options = SpiderCloud::CrawlOptions.build do
  limit 100
  budget( {
    '*' => 5,        # default: 5 pages per path
    '/docs/' => 50,  # up to 50 pages under /docs/
    '/blog/' => 20   # up to 20 pages under /blog/
  } )
end
```

### Sitemap Options

| Option | Type | Description |
| --- | --- | --- |
| `sitemap` | Boolean | Use sitemap for discovery |
| `sitemap_only` | Boolean | Only crawl sitemap URLs |
| `sitemap_path` | String | Custom sitemap path |

### Content Extraction

| Option | Type | Description |
| --- | --- | --- |
| `readability` | Boolean | Safari Reader Mode extraction |
| `root_selector` | String | CSS selector for content |
| `exclude_selector` | String | CSS selector to ignore |
| `css_extraction_map` | Hash | Structured data extraction |
| `filter_main_only` | Boolean | Main content only |
| `full_resources` | Boolean | Download images, videos |
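A sketch of scoped extraction with `css_extraction_map`. The option names come from the table above; the field-name-to-selector hash shape is an assumption about how the structured extraction map is expressed:

```ruby
options = SpiderCloud::CrawlOptions.build do
  limit 10
  root_selector 'main'              # extract only within the <main> element
  exclude_selector 'nav, footer'    # drop navigation and footer content
  # Assumed shape: output field name => CSS selector to extract.
  css_extraction_map( {
    'title' => 'h1',
    'price' => '.product-price'
  } )
end
```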

### Output Options

| Option | Type | Description |
| --- | --- | --- |
| `return_json_data` | Boolean | Return SSR JSON data |
| `return_headers` | Boolean | Include HTTP headers |
| `return_cookies` | Boolean | Include cookies |
| `return_page_links` | Boolean | Include discovered links |
| `return_embeddings` | Boolean | Include embeddings |
| `metadata` | Boolean | Collect page metadata |
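These flags enrich each crawled page in the response. A sketch enabling a few of them (option names are from the table; how the extra fields are exposed on the response objects is not documented in this section, so no accessors are shown):

```ruby
options = SpiderCloud::CrawlOptions.build do
  limit 10
  return_format :markdown
  return_headers true      # attach each page's HTTP response headers
  return_page_links true   # attach the links discovered on each page
  metadata true            # collect page metadata
end

response = SpiderCloud.crawl( 'https://example.com', options )
```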

### Performance

| Option | Type | Description |
| --- | --- | --- |
| `request_timeout` | Integer | Timeout per page (5-255 seconds) |
| `cache` | Boolean | Enable caching |
| `concurrency_limit` | Integer | Concurrent requests |
| `delay` | Integer | Delay between requests (ms) |
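A sketch of a deliberately throttled crawl using the options above (names, units, and the 5-255 second timeout range are from the table; the specific values are illustrative):

```ruby
options = SpiderCloud::CrawlOptions.build do
  limit 100
  request_timeout 30      # seconds per page, within the 5-255 range
  concurrency_limit 5     # at most 5 requests in flight
  delay 250               # wait 250 ms between requests
  cache true              # enable caching
end
```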

### Cost Control

| Option | Type | Description |
| --- | --- | --- |
| `max_credits_per_page` | Integer | Max credits per page |
| `max_credits_allowed` | Integer | Total credit limit |
| `crawl_timeout` | Hash | Max crawl duration (`{ seconds:, nanoseconds: }`) |
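A sketch capping both credits and wall-clock time. The option names and the `{ seconds:, nanoseconds: }` hash shape come from the table; the exact stop behavior described in the comments is an assumption:

```ruby
options = SpiderCloud::CrawlOptions.build do
  limit 500
  max_credits_per_page 1                             # cap per-page spend at 1 credit
  max_credits_allowed 100                            # assumed: crawl stops once 100 credits are spent
  crawl_timeout( { seconds: 300, nanoseconds: 0 } )  # hard 5-minute cap on the whole crawl
end
```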

## Webhooks

```ruby
options = SpiderCloud::CrawlOptions.build do
  limit 100
  webhooks do
    destination 'https://your-server.com/webhook'
    on_credits_depleted true
    on_find true
  end
end
```

## Response

```ruby
response = SpiderCloud.crawl( 'https://example.com', limit: 5 )

response.result.success?    # => true
response.result.count       # => 5
response.result.urls        # => ["https://...", ...]
response.result.contents    # => ["...", ...]
response.result.total_cost  # => 0.0002

# Iterate over pages
response.result.each do | page |
  page.url                  # => "https://..."
  page.content              # => "..."
  page.status               # => 200
  page.costs.total_cost     # => 0.00004
end

# Filter by success
response.result.succeeded   # => [successful pages]
response.result.failed      # => [failed pages]
```
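The `succeeded`/`failed` split can be illustrated in plain Ruby. A minimal sketch, assuming success means an HTTP status in the 2xx range; `Page` here is a stand-in Struct, not the SDK's page class:

```ruby
# Minimal stand-in for a crawled page; the real SDK returns richer objects.
Page = Struct.new( :url, :status, :content )

pages = [
  Page.new( 'https://example.com/', 200, '...' ),
  Page.new( 'https://example.com/missing', 404, nil ),
  Page.new( 'https://example.com/docs', 200, '...' )
]

# Partition pages by whether the HTTP status is in the 2xx range.
succeeded, failed = pages.partition { | page | ( 200..299 ).cover?( page.status ) }

puts succeeded.length  # the two 200 responses
puts failed.length     # the 404
```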

## Examples

### Crawl Documentation

```ruby
options = SpiderCloud::CrawlOptions.build do
  limit 50
  whitelist [ '/docs/' ]
  return_format :markdown
  readability true
end

response = SpiderCloud.crawl( 'https://example.com', options )
```

### Crawl with Depth Limit

```ruby
options = SpiderCloud::CrawlOptions.build do
  limit 20
  depth 2
end

response = SpiderCloud.crawl( 'https://example.com', options )
```

### Exclude Paths

```ruby
options = SpiderCloud::CrawlOptions.build do
  limit 50
  blacklist [ '/admin/', '/private/', '/api/' ]
end
```

### Use Sitemap

```ruby
options = SpiderCloud::CrawlOptions.build do
  limit 100
  sitemap true
  sitemap_only true
end
```

### With Automation

```ruby
options = SpiderCloud::CrawlOptions.build do
  limit 10
  automation_scripts( {
    '/login' => [
      { 'Fill' => { 'selector' => '#email', 'value' => 'user@example.com' } },
      { 'Click' => 'button[type=submit]' },
      { 'WaitForNavigation' => true }
    ]
  } )
end
```