Resilient web scraper with automatic retry, proxy rotation, and strategy cascade. Supports fetch, Playwright browsers, and custom fetch implementations.
When scraping multiple websites, you face unpredictable challenges:
- **Unknown accessibility requirements**: You never know whether a page is accessible through a simple HTTP fetch or whether it blocks requests and requires a headless browser
- **IP-based restrictions**: Sites might block your server's IP address, requiring proxy rotation as a fallback
Skrobak solves these problems by:
- **Optimizing for cost and efficiency**: It starts with the cheapest method (a simple fetch) and falls back to resource-intensive options (a headless browser) or paid solutions (proxies) only when necessary
- **Eliminating trial and error**: It automatically tests different browser engines and proxy combinations to find the ones that work, without manual intervention
Define your own list of strategies to try in order, and Skrobak automatically cascades through them until one succeeds.

## Installation

```bash
npm install skrobak
```

## Quick start

```ts
import { scrape } from 'skrobak'

const result = await scrape('https://example.com', {
  strategies: [{ mechanism: 'fetch' }]
})

// When response is HTML, use Cheerio for parsing
if (result.mechanism === 'fetch') {
  const title = result.$('title').text()
  console.log(title)
}

// When response is JSON, use response.json() to retrieve data
if (result.mechanism === 'fetch') {
  const data = await result.response.json()
  console.log(data)
}
```

Skrobak tries strategies in order until one succeeds. If a strategy fails, it automatically moves to the next one:

```ts
const result = await scrape('https://example.com', {
  strategies: [
    { mechanism: 'fetch' },   // Try simple fetch first
    { mechanism: 'browser' }, // Fallback to browser if fetch fails
  ]
})
```

### Mechanisms

| Mechanism | Description |
|---|---|
| `fetch` | Fast HTTP requests with lazy-loaded Cheerio for HTML parsing |
| `browser` | Full browser rendering with Playwright (chromium/firefox/webkit) |
| `custom` | Your own fetch implementation |

## scrape()

Main scraping function with automatic retry and strategy cascade.

```ts
scrape(url: string, config: ScrapeConfig): Promise<ScrapeResult>
```

### Parameters

| Parameter | Type | Description |
|---|---|---|
| `url` | `string` | The URL to scrape |
| `config` | `ScrapeConfig` | Configuration object |

### Returns

`Promise<ScrapeResult>`: a result object with mechanism-specific properties (see Return Types below).

### Full configuration example

```ts
const result = await scrape('https://example.com', {
  // Strategy cascade - tries in order until one succeeds
  strategies: [
    { mechanism: 'fetch', useProxy: true },
    { mechanism: 'fetch', useProxy: false },
    { mechanism: 'browser' }
  ],

  // Global options applied to all strategies
  options: {
    timeout: 30000,
    retries: {
      count: 3,
      delay: 2000,
      type: 'exponential',
      statusCodes: [408, 429, 500, 502, 503, 504]
    },
    proxies: [
      'http://proxy1.example.com:8080',
      'http://proxy2.example.com:8080'
    ],
    userAgents: [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    ],
    headers: {
      'Accept-Language': 'en-US,en;q=0.9'
    },
    validateResponse: ({ mechanism, response }) => {
      if (mechanism === 'fetch') return response.ok
      if (mechanism === 'browser') return response.status() === 200
      return true
    }
  },

  // Browser-specific configuration
  browser: {
    engine: 'chromium',
    waitUntil: 'networkidle',
    resources: ['document', 'script', 'xhr', 'fetch']
  },

  // Custom fetch implementation
  custom: {
    fn: async (url, options) => {
      // Your custom fetch logic
      const response = await customFetch(url, options)
      return response
    }
  },

  // Event hooks for monitoring and logging
  hooks: {
    onRetryAttempt: ({ attempt, maxAttempts, nextRetryDelay }) => {
      console.log(`Retry ${attempt}/${maxAttempts}, waiting ${nextRetryDelay}ms`)
    },
    onRetryExhausted: ({ totalAttempts }) => {
      console.log(`All ${totalAttempts} retries exhausted`)
    },
    onStrategyFailed: ({ strategy, strategyIndex, totalStrategies }) => {
      console.log(`Strategy ${strategyIndex + 1}/${totalStrategies} (${strategy.mechanism}) failed`)
    },
    onAllStrategiesFailed: ({ strategies }) => {
      console.error(`All ${strategies.length} strategies failed`)
    }
  }
})
```

## Configuration

### ScrapeConfig

Root configuration object passed to `scrape()`.

| Property | Type | Required | Description |
|---|---|---|---|
| `strategies` | `ScrapeStrategy[]` | Yes | List of strategies to try in order |
| `options` | `ScrapeOptions` | No | Global scraping options |
| `browser` | `BrowserConfig` | No | Browser-specific configuration |
| `custom` | `CustomConfig` | No | Custom fetch function configuration |
| `hooks` | `ScrapeHooks` | No | Event hooks for monitoring and logging |
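
For instance, a small sketch combining the required `strategies` list with a few of the optional groups (the specific values here are arbitrary):

```ts
// Only `strategies` is required; options, browser, custom, and hooks are optional
const result = await scrape('https://example.com', {
  strategies: [{ mechanism: 'fetch' }, { mechanism: 'browser' }],
  options: { timeout: 15000 },
  browser: { engine: 'firefox' }
})
```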

### ScrapeStrategy

Individual strategy in the cascade. Skrobak tries each strategy in order until one succeeds.

| Property | Type | Default | Description |
|---|---|---|---|
| `mechanism` | `'fetch'` \| `'browser'` \| `'custom'` | - | Scraping mechanism to use |
| `useProxy` | `boolean` | `false` | Whether to use a proxy for this strategy |

Example:

```ts
strategies: [
  { mechanism: 'fetch', useProxy: false }, // Try without proxy first (if available)
  { mechanism: 'fetch', useProxy: true },  // Fallback with proxy
  { mechanism: 'browser' }                 // Last resort: full browser
]
```

### ScrapeOptions

Global options applied across all strategies.

| Property | Type | Default | Description |
|---|---|---|---|
| `timeout` | `number` | - | Request timeout in milliseconds |
| `retries` | `RetryConfig` | `{ count: 0, delay: 5000, type: 'exponential' }` | Retry configuration |
| `proxies` | `string[]` | - | Proxy pool (randomly selected per request) |
| `userAgents` | `string[]` | - | User agent pool (randomly selected per request) |
| `viewports` | `ViewportSize[]` | - | Viewport pool (randomly selected per request) |
| `headers` | `object` | - | HTTP headers as key-value pairs |
| `validateResponse` | `ValidateResponse` | - | Custom response validation function |

Example:

```ts
options: {
  timeout: 30000,
  retries: { count: 3, delay: 2000, type: 'exponential' },
  proxies: ['http://proxy1.com:8080', 'http://proxy2.com:8080'],
  userAgents: ['Mozilla/5.0...'],
  headers: { 'Accept-Language': 'en-US' }
}
```

### RetryConfig

Controls retry behavior when requests fail.

| Property | Type | Default | Description |
|---|---|---|---|
| `count` | `number` | `0` | Number of retry attempts |
| `delay` | `number` | `5000` | Base delay between retries in milliseconds |
| `type` | `'exponential'` \| `'linear'` \| `'constant'` | `'exponential'` | Retry delay calculation strategy |
| `statusCodes` | `number[]` | `[408, 429, 500, 502, 503, 504]` | HTTP status codes that trigger a retry |

Retry delay calculation, for a base delay of 1000ms (see the sketch below):

- `exponential`: 1000ms → 2000ms → 4000ms
- `linear`: 1000ms → 2000ms → 3000ms
- `constant`: 1000ms → 1000ms → 1000ms
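
The formula below is inferred from these documented sequences, not from skrobak's actual source:

```ts
// Inferred sketch of the delay progressions; skrobak's internals may differ
function retryDelay(base: number, attempt: number, type: 'exponential' | 'linear' | 'constant'): number {
  switch (type) {
    case 'exponential': return base * 2 ** (attempt - 1) // 1000 → 2000 → 4000
    case 'linear':      return base * attempt            // 1000 → 2000 → 3000
    case 'constant':    return base                      // 1000 → 1000 → 1000
  }
}
```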
Status code retry behavior:
Skrobak automatically retries only on specific HTTP status codes:

- Retriable status codes (in the `statusCodes` list) → retry with the same mechanism
- Non-retriable status codes (not in the list, e.g., 404, 401) → skip to the next strategy immediately
- Network errors (no status code) → always retry (temporary issues)

Example:

```ts
retries: {
  count: 3,
  delay: 2000,
  type: 'exponential',
  statusCodes: [429, 503] // Only retry on rate limiting and service unavailable
}
```

### ViewportSize

Viewport dimensions for browser-based scraping.

| Property | Type | Description |
|---|---|---|
| `width` | `number` | Viewport width in pixels |
| `height` | `number` | Viewport height in pixels |

Example:

```ts
viewports: [
  { width: 1920, height: 1080 },
  { width: 1366, height: 768 },
  { width: 390, height: 844 }
]
```

### ValidateResponse

Custom validation function to verify a response before accepting it.

Type: `(context) => boolean`

The function receives a context object with `mechanism` and `response` properties. Return `true` to accept the response, `false` to retry or move to the next strategy.

Example:

```ts
validateResponse: ({ mechanism, response }) => {
  if (mechanism === 'fetch') {
    return response.status === 200 && response.headers.get('content-type')?.includes('json')
  }
  if (mechanism === 'browser') {
    return response.status() === 200
  }
  return true
}
```

### BrowserConfig

Browser-specific configuration for the `browser` mechanism.

| Property | Type | Default | Description |
|---|---|---|---|
| `engine` | `'chromium'` \| `'firefox'` \| `'webkit'` | `'chromium'` | Browser engine to use |
| `resources` | `ResourceType[]` | - (allows all) | Allowed resource types (all others are blocked) |
| `waitUntil` | `'load'` \| `'domcontentloaded'` \| `'networkidle'` \| `'commit'` | - | When to consider navigation successful |

Example:

```ts
browser: {
  engine: 'chromium',
  waitUntil: 'networkidle',
  resources: ['document', 'script', 'xhr', 'fetch']
}
```

### ResourceType

Types of resources the browser may load. When specified, all other resource types are blocked.

Type: Playwright's `ResourceType` (string)

Common values: `'document'`, `'stylesheet'`, `'image'`, `'script'`, `'xhr'`, `'fetch'`
See Playwright's ResourceType for all available options.

Example:

```ts
// Only allow essential resources, block images/CSS for faster loading
resources: ['document', 'script', 'xhr', 'fetch']
```

### CustomConfig

Configuration for a custom fetch implementation when using `mechanism: 'custom'`.

| Property | Type | Description |
|---|---|---|
| `fn` | `(url, options) => Promise<TCustomResponse>` | Custom fetch function |
Function parameters:

- `url` (string): The URL to fetch
- `options` (object): Request options composed from the global config
  - `proxy?` (string): Proxy URL (when `useProxy: true`)
  - `userAgent?` (string): User agent string
  - `viewport?` (object): Viewport dimensions with `width` and `height`
  - `headers?` (object): HTTP headers as key-value pairs
  - `timeout?` (number): Request timeout in milliseconds

Example: see the Custom Fetch Function example below.
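
For reference, a minimal sketch of the expected shape; the `CustomFetchOptions` interface is hypothetical, written out from the parameter list above:

```ts
// Hypothetical type based on the documented parameters;
// the library's real exported type names may differ
interface CustomFetchOptions {
  proxy?: string
  userAgent?: string
  viewport?: { width: number; height: number }
  headers?: Record<string, string>
  timeout?: number
}

const custom = {
  fn: async (url: string, options: CustomFetchOptions) => {
    // Forward whichever options your HTTP client understands
    const response = await fetch(url, {
      headers: options.headers,
      signal: options.timeout ? AbortSignal.timeout(options.timeout) : undefined
    })
    return response.json()
  }
}
```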

### ScrapeHooks

Event hooks for monitoring scraping progress, logging, metrics, or debugging. All hooks are optional.

| Property | Type | Description |
|---|---|---|
| `onRetryAttempt` | `(context) => void` | Called when a retry attempt fails |
| `onRetryExhausted` | `(context) => void` | Called when all retry attempts are exhausted |
| `onStrategyFailed` | `(context) => void` | Called when a strategy fails |
| `onAllStrategiesFailed` | `(context) => void` | Called when all strategies fail |

`onRetryAttempt` context object:

```ts
{
  error: unknown           // The error that occurred
  attempt: number          // Current attempt number (1-indexed)
  maxAttempts: number      // Total number of attempts
  nextRetryDelay: number   // Delay before the next retry in ms
  retryConfig: RetryConfig // Retry configuration
}
```

`onRetryExhausted` context object:

```ts
{
  error: unknown           // The final error
  totalAttempts: number    // Total number of attempts made
  retryConfig: RetryConfig // Retry configuration
}
```

`onStrategyFailed` context object:

```ts
{
  error: unknown           // The error that occurred
  strategy: ScrapeStrategy // The strategy that failed
  strategyIndex: number    // Index of the failed strategy (0-indexed)
  totalStrategies: number  // Total number of strategies
}
```

`onAllStrategiesFailed` context object:

```ts
{
  lastError: unknown                // The last error encountered
  strategies: Array<ScrapeStrategy> // All strategies that were tried
  totalAttempts: number             // Number of strategies attempted
}
```

Example:

```ts
const result = await scrape('https://example.com', {
  strategies: [
    { mechanism: 'fetch', useProxy: true },
    { mechanism: 'browser' }
  ],
  options: {
    retries: { count: 3, delay: 1000, type: 'exponential' }
  },
  hooks: {
    onRetryAttempt: ({ attempt, maxAttempts, nextRetryDelay, error }) => {
      console.log(`Retry ${attempt}/${maxAttempts} failed, waiting ${nextRetryDelay}ms`)
      console.error('Error:', error)
    },
    onRetryExhausted: ({ totalAttempts }) => {
      console.log(`All ${totalAttempts} retries exhausted`)
    },
    onStrategyFailed: ({ strategy, strategyIndex, totalStrategies }) => {
      console.log(`Strategy ${strategyIndex + 1}/${totalStrategies} (${strategy.mechanism}) failed`)
    },
    onAllStrategiesFailed: ({ strategies }) => {
      console.error(`All ${strategies.length} strategies failed`)
    }
  }
})
```

## Return Types

The result of `scrape()` depends on which mechanism succeeded. Use the `mechanism` property to determine the type.

### Fetch result

When a `mechanism: 'fetch'` strategy succeeds.

| Property | Type | Description |
|---|---|---|
| `mechanism` | `'fetch'` | Indicates the fetch mechanism was used |
| `response` | `Response` | Standard fetch Response object |
| `$` | `CheerioAPI` | Lazy-loaded Cheerio instance for HTML parsing |

Example:

```ts
if (result.mechanism === 'fetch') {
  const title = result.$('title').text()
  const links = result.$('a').map((_, element) => result.$(element).attr('href')).get()
}
```

### Browser result

When a `mechanism: 'browser'` strategy succeeds.

| Property | Type | Description |
|---|---|---|
| `mechanism` | `'browser'` | Indicates the browser mechanism was used |
| `response` | `PlaywrightResponse` | Playwright Response object |
| `page` | `Page` | Playwright Page instance for interaction |
| `cleanup` | `() => Promise<void>` | Closes the browser context (must be called) |

Example:

```ts
if (result.mechanism === 'browser') {
  await result.page.screenshot({ path: 'screenshot.png' })
  const text = await result.page.textContent('h1')
  // Important: cleanup resources after use to avoid memory leaks
  await result.cleanup()
}
```
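
Because `cleanup()` must run even when page interaction throws, wrapping the work in try/finally is a sensible defensive pattern (a suggestion, not an API requirement):

```ts
// Suggested pattern: guarantee cleanup() runs even if an interaction throws
if (result.mechanism === 'browser') {
  try {
    const text = await result.page.textContent('h1')
    console.log(text)
  } finally {
    await result.cleanup()
  }
}
```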

### Custom result

When a `mechanism: 'custom'` strategy succeeds.

| Property | Type | Description |
|---|---|---|
| `mechanism` | `'custom'` | Indicates the custom mechanism was used |
| `response` | `TCustomResponse` | Your custom response type |

Example:

```ts
if (result.mechanism === 'custom') {
  // Your custom response type
  console.log(result.response)
}
```

## Examples

### Basic fetch

```ts
const result = await scrape('https://example.com', {
  strategies: [{ mechanism: 'fetch' }]
})

if (result.mechanism === 'fetch') {
  const links = result.$('a').map((_, el) => result.$(el).attr('href')).get()
}
```

### Browser scraping

```ts
const result = await scrape('https://example.com', {
  strategies: [{ mechanism: 'browser' }],
  options: {
    viewports: [{ width: 1920, height: 1080 }]
  },
  browser: {
    engine: 'chromium',
    waitUntil: 'networkidle'
  }
})

if (result.mechanism === 'browser') {
  await result.page.screenshot({ path: 'screenshot.png' })
  // Important: cleanup resources after use to avoid memory leaks
  await result.cleanup()
}
```

### Custom fetch function

```ts
import axios from 'axios'

const result = await scrape('https://api.example.com/data', {
  strategies: [{ mechanism: 'custom' }],
  custom: {
    fn: async (url, options) => {
      // Use any HTTP client: axios, got, ofetch, etc.
      // Note: axios expects host/port separately rather than a proxy URL
      const proxyUrl = options.proxy ? new URL(options.proxy) : undefined
      const response = await axios.get(url, {
        headers: options.headers,
        timeout: options.timeout,
        proxy: proxyUrl ? { host: proxyUrl.hostname, port: Number(proxyUrl.port) } : undefined
      })
      return response.data
    }
  }
})

if (result.mechanism === 'custom') {
  console.log(result.response) // Your custom response data
}
```

### Proxy rotation

```ts
const result = await scrape('https://example.com', {
  strategies: [
    { mechanism: 'fetch', useProxy: true }
  ],
  options: {
    proxies: [
      'http://proxy1.example.com:8080',
      'http://proxy2.example.com:8080'
    ],
    retries: {
      count: 3,
      delay: 2000,
      type: 'exponential'
    }
  }
})
```

### User agent rotation

```ts
const result = await scrape('https://example.com', {
  strategies: [{ mechanism: 'fetch' }],
  options: {
    userAgents: [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
    ]
  }
})
```

### Resource blocking

```ts
const result = await scrape('https://example.com', {
  strategies: [{ mechanism: 'browser' }],
  browser: {
    // Only allow these resource types
    resources: ['document', 'script', 'xhr', 'fetch']
  }
})

if (result.mechanism === 'browser') {
  // Images and media are blocked, page loads faster
  await result.cleanup()
}
```

### Response validation

```ts
const result = await scrape('https://api.example.com/data', {
  strategies: [{ mechanism: 'fetch' }],
  options: {
    validateResponse: ({ mechanism, response }) => {
      if (mechanism === 'fetch') {
        return response.status === 200 && response.headers.get('content-type')?.includes('json')
      }
      return true
    },
    retries: { count: 3, delay: 1000 }
  }
})
```

### Monitoring with hooks

```ts
const result = await scrape('https://example.com', {
  strategies: [
    { mechanism: 'fetch', useProxy: true },
    { mechanism: 'browser' }
  ],
  options: {
    retries: { count: 3, delay: 1000, type: 'exponential' }
  },
  hooks: {
    onRetryAttempt: ({ attempt, maxAttempts, nextRetryDelay }) => {
      console.log(`Retry ${attempt}/${maxAttempts}, waiting ${nextRetryDelay}ms`)
    },
    onStrategyFailed: ({ strategy, strategyIndex, totalStrategies }) => {
      console.log(`Strategy ${strategyIndex + 1}/${totalStrategies} failed: ${strategy.mechanism}`)
    }
  }
})
```

### Custom fetch with ofetch

```ts
import { ofetch } from 'ofetch'

const result = await scrape('https://example.com', {
  strategies: [{ mechanism: 'custom' }],
  custom: {
    fn: async (url, options) => {
      const response = await ofetch(url, {
        headers: options.headers,
        timeout: options.timeout
      })
      return { data: response }
    }
  }
})

if (result.mechanism === 'custom') {
  console.log(result.response.data)
}
```

### Full strategy cascade

```ts
const result = await scrape('https://example.com', {
  strategies: [
    { mechanism: 'fetch', useProxy: true },  // Try with proxy first
    { mechanism: 'fetch', useProxy: false }, // Fallback to no proxy
    { mechanism: 'browser' }                 // Last resort: full browser
  ],
  options: {
    proxies: ['http://proxy.example.com:8080'],
    retries: { count: 2, delay: 1000 }
  }
})
```

### Complete example

```ts
const result = await scrape('https://example.com/products', {
  strategies: [
    { mechanism: 'fetch', useProxy: true },
    { mechanism: 'browser' }
  ],
  options: {
    timeout: 30000,
    retries: {
      count: 3,
      delay: 2000,
      type: 'exponential'
    },
    proxies: [
      'http://proxy1.example.com:8080',
      'http://proxy2.example.com:8080'
    ],
    userAgents: [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    ],
    headers: {
      'Accept-Language': 'en-US,en;q=0.9'
    },
    validateResponse: ({ mechanism, response }) => {
      if (mechanism === 'fetch') {
        return response.ok
      }
      if (mechanism === 'browser') {
        return response.status() === 200
      }
      return true
    }
  },
  browser: {
    engine: 'chromium',
    waitUntil: 'networkidle',
    resources: ['document', 'script', 'xhr', 'fetch']
  }
})

if (result.mechanism === 'fetch') {
  const products = result.$('.product').map((_, element) => ({
    title: result.$(element).find('.title').text(),
    price: result.$(element).find('.price').text()
  })).get()
}

if (result.mechanism === 'browser') {
  const products = await result.page.$$eval('.product', (elements) =>
    elements.map((element) => ({
      title: element.querySelector('.title')?.textContent,
      price: element.querySelector('.price')?.textContent
    }))
  )
  // Important: cleanup resources after use to avoid memory leaks
  await result.cleanup()
}
```

## scrapeMany()

Scrape multiple URLs sequentially with automatic browser cleanup, random delays, dynamic URL discovery, and error resilience.
Perfect for:
- Batch scraping multiple pages
- Web crawling with link discovery

```ts
scrapeMany<TCustomResponse = unknown>(
  urls: string[],
  scrapeConfig: ScrapeConfig<TCustomResponse>,
  scrapeManyConfig?: ScrapeManyConfig<TCustomResponse>
): Promise<ScrapeManyResult>
```

### Parameters

| Parameter | Type | Description |
|---|---|---|
| `urls` | `string[]` | Array of URLs to scrape |
| `scrapeConfig` | `ScrapeConfig` | Same config as `scrape()`: strategies, options, etc. |
| `scrapeManyConfig` | `ScrapeManyConfig` | Optional batch scraping configuration |

### ScrapeManyConfig

| Property | Type | Description |
|---|---|---|
| `onSuccess` | `(context) => Promise<void>` | Called after each successful scrape |
| `onError` | `(context) => Promise<void>` | Called after each failed scrape |
| `delays` | `{ min: number; max: number }` | Random delay range between requests (in milliseconds) |

Success context object:

```ts
{
  result: ScrapeResult    // Scrape result (fetch/browser/custom)
  url: string             // Current URL
  index: number           // URL index (0-based)
  addUrls: (urls) => void // Add more URLs to the queue (for crawling)
  stats: {
    initial: number    // Initial URL count
    discovered: number // URLs added via addUrls()
    processed: number  // Completed (success + failed)
    remaining: number  // URLs left in the queue
    succeeded: number  // Successful scrapes
    failed: number     // Failed scrapes
  }
}
```

Error context object:

```ts
{
  error: unknown          // The error that occurred
  url: string             // Current URL
  index: number           // URL index (0-based)
  addUrls: (urls) => void // Add more URLs to the queue
  stats: { /* same as above */ }
}
```

### ScrapeManyResult

Returned once all URLs have been processed:

```ts
{
  total: number     // Total URLs processed
  succeeded: number // Successful scrapes
  failed: number    // Failed scrapes
}
```

### Basic batch scraping

```ts
import { scrapeMany } from 'skrobak'

const result = await scrapeMany(
  [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
  ],
  {
    strategies: [{ mechanism: 'fetch' }],
    options: { retries: { count: 3 } }
  },
  {
    delays: { min: 2000, max: 5000 }, // 2-5 second delays
    onSuccess: async ({ result, url }) => {
      if (result.mechanism === 'fetch') {
        const title = result.$('title').text()
        await saveToDatabase({ url, title })
      }
    }
  }
)

console.log(`Scraped ${result.succeeded}/${result.total} pages`)
```

### Web crawling with link discovery

```ts
const result = await scrapeMany(
  ['https://example.com/category'],
  { strategies: [{ mechanism: 'fetch' }] },
  {
    delays: { min: 1000, max: 3000 },
    onSuccess: async ({ result, addUrls }) => {
      if (result.mechanism === 'fetch') {
        // Extract and save data
        // ...

        // Discover more URLs to scrape
        const productLinks = result.$('.product a').map((_, el) => result.$(el).attr('href')).get()
        addUrls(productLinks)
      }
    }
  }
)

console.log(`Crawl complete: ${result.total} pages`)
```

### Error handling

```ts
const errors: Array<{ url: string; error: unknown }> = []

const result = await scrapeMany(
  [/* … */],
  {
    strategies: [
      { mechanism: 'fetch' },
      { mechanism: 'browser' } // Fallback to browser
    ],
    options: { retries: { count: 3 } }
  },
  {
    delays: { min: 2000, max: 5000 },
    onSuccess: async ({ result, url, stats }) => {
      console.log(`✓ ${url} (${stats.succeeded}/${stats.processed})`)
      if (result.mechanism === 'fetch') {
        await processData(result.$('body').html())
      }
      if (result.mechanism === 'browser') {
        await processData(await result.page.content())
        // Browser cleanup happens automatically!
      }
    },
    onError: async ({ error, url, stats }) => {
      console.error(`✗ ${url} (${stats.failed} failures)`)
      errors.push({ url, error })
    }
  }
)

console.log(`Completed: ${result.succeeded} succeeded, ${result.failed} failed`)
```