---
id: trace-and-monitor-crawlers
title: Trace and monitor crawlers
description: How to use OpenTelemetry to trace and monitor your crawlers
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

import SetupSource from '!!raw-loader!./trace_and_monitor_setup.ts';
import BasicExampleSource from '!!raw-loader!./trace_and_monitor_basic.ts';
import WrapWithSpanSource from '!!raw-loader!./trace_and_monitor_wrap_with_span.ts';
import CustomInstrumentationSource from '!!raw-loader!./trace_and_monitor_custom.ts';

OpenTelemetry is a collection of APIs, SDKs, and tools to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software's performance and behavior. You can learn more about its basic concepts in the OpenTelemetry documentation.
In this guide, we'll show you how to set up OpenTelemetry and instrument your Crawlee crawlers to see traces of individual requests as they are processed. OpenTelemetry on its own does not provide visualization tools, so we'll use Jaeger as our tracing backend. Feel free to use any other OpenTelemetry-compatible backend. Check the OpenTelemetry vendors list for more options.
The steps below set up a local environment where you can run the example code and view the resulting telemetry in Jaeger, which runs in a Docker container.
To start the preconfigured Docker container, create a docker-compose.yml file:

    services:
      jaeger:
        image: jaegertracing/all-in-one:1.53
        container_name: jaeger
        ports:
          # Jaeger UI
          - "16686:16686"
          # OTLP gRPC
          - "4317:4317"
          # OTLP HTTP
          - "4318:4318"
        environment:
          - COLLECTOR_OTLP_ENABLED=true
        restart: unless-stopped

Then start it with:

    docker compose up -d

For more details about the Jaeger setup, see the getting started section in the Jaeger documentation. You can open the Jaeger UI in your browser by navigating to http://localhost:16686.
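
The port mapping above determines which endpoint an exporter should target. As a minimal sketch, assuming Jaeger runs on the local machine with the ports shown, the mapping can be written down as a small helper. `otlpTraceEndpoint` is a hypothetical name used only for illustration; it is not part of Crawlee or OpenTelemetry.

```typescript
// Hypothetical helper (not part of Crawlee or OpenTelemetry): maps an OTLP
// transport to the endpoint exposed by the docker-compose.yml above.
type OtlpTransport = 'grpc' | 'http';

function otlpTraceEndpoint(transport: OtlpTransport, host = 'localhost'): string {
    // 4317 is the OTLP gRPC port published by the container.
    if (transport === 'grpc') {
        return `http://${host}:4317`;
    }
    // 4318 is the OTLP HTTP port; OTLP over HTTP sends traces to /v1/traces.
    return `http://${host}:4318/v1/traces`;
}

const grpcEndpoint = otlpTraceEndpoint('grpc'); // 'http://localhost:4317'
const httpEndpoint = otlpTraceEndpoint('http'); // 'http://localhost:4318/v1/traces'
```

This guide installs the gRPC exporter packages, so the gRPC endpoint is the one that matters for the examples that follow.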
To instrument your Crawlee crawler, you need to install the @crawlee/otel package along with the OpenTelemetry SDK packages:

    npm install @crawlee/otel @opentelemetry/api @opentelemetry/api-logs @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-grpc @opentelemetry/exporter-logs-otlp-grpc @opentelemetry/sdk-logs

OpenTelemetry instrumentation must be set up before importing Crawlee or any other instrumented modules. The easiest way to do this is to create a separate setup file and import it first using Node.js's --import flag.
Create a setup file that initializes OpenTelemetry with the Crawlee instrumentation:
{SetupSource}

Now create your crawler. The CrawleeInstrumentation will automatically instrument the core crawler methods:
{BasicExampleSource}

Run your crawler with the setup file imported first:

    npx tsx --import ./src/setup.ts ./src/main.ts

The --import flag ensures the OpenTelemetry setup runs before any other code, which is required for the automatic instrumentation to work properly.
In the Jaeger UI, you can search for different traces, apply filtering, compare traces, view their detailed attributes, view timing details, and more. For a detailed description of the tool's capabilities, please refer to the Jaeger documentation.
If another tool suits your needs better, you can use it to consume the OpenTelemetry data instead. See the list of known vendors in the OpenTelemetry documentation.
The CrawleeInstrumentation class provides several configuration options to customize what gets instrumented:
| Option | Default | Description |
|---|---|---|
| enabled | true | Enable or disable the instrumentation entirely |
| requestHandlingInstrumentation | true | Instrument core crawler methods like run, _runTaskFunction, navigation handlers |
| logInstrumentation | true | Forward Crawlee logs to OpenTelemetry logs |
| customInstrumentation | [] | Array of custom class methods to instrument |

    import { CrawleeInstrumentation } from '@crawlee/otel';

    const crawleeInstrumentation = new CrawleeInstrumentation({
        // Disable automatic request handling instrumentation
        requestHandlingInstrumentation: false,
        // Disable log forwarding
        logInstrumentation: false,
        // Add custom instrumentation
        customInstrumentation: [
            {
                moduleName: '@crawlee/basic',
                className: 'BasicCrawler',
                methodName: 'run',
                spanName: 'my-custom-span-name',
            },
        ],
    });

For more fine-grained control, you can use the wrapWithSpan utility to wrap specific functions with OpenTelemetry spans. This is particularly useful for instrumenting request handlers, hooks, and error handlers.

{WrapWithSpanSource}

The wrapWithSpan function accepts these options:
| Option | Type | Description |
|---|---|---|
| spanName | string \| ((...args) => string) | Static name or function that receives the handler arguments and returns a span name |
| spanOptions | SpanOptions \| ((...args) => SpanOptions) | Static options or function that returns OpenTelemetry SpanOptions including attributes |
| tracer | Tracer | Custom tracer instance (defaults to trace.getTracer('crawlee')) |
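
The function forms of spanName and spanOptions are easiest to see with a small example. The sketch below assumes the handler receives a context whose request has url, retryCount, and an optional label; `SketchRequest`, `SketchContext`, and the `app.*` attribute keys are hypothetical names chosen for illustration, not part of @crawlee/otel.

```typescript
// Minimal stand-ins for the handler arguments (assumption: the real Crawlee
// context carries many more fields than shown here).
interface SketchRequest {
    url: string;
    retryCount: number;
    label?: string;
}

interface SketchContext {
    request: SketchRequest;
}

// Builds a span name from the handler arguments. The request label keeps the
// number of distinct span names small, unlike using the full URL.
function spanNameFromContext({ request }: SketchContext): string {
    return `request-handler ${request.label ?? 'default'}`;
}

// Builds span options with per-request attributes (keys are illustrative).
function spanOptionsFromContext({ request }: SketchContext) {
    return {
        attributes: {
            'app.request.url': request.url,
            'app.request.retry_count': request.retryCount,
        },
    };
}

// Usage sketch, left as a comment because it needs @crawlee/otel installed:
// requestHandler: wrapWithSpan(handler, {
//     spanName: spanNameFromContext,
//     spanOptions: spanOptionsFromContext,
// }),
```

Keeping these as standalone functions makes them easy to unit test without starting a crawler or a tracing backend.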
Inside a wrapped function, you can access the current span to add additional attributes or events:

    // Assumes a CheerioCrawler, since the handler uses the Cheerio `$` helper
    import { CheerioCrawler } from 'crawlee';
    import { wrapWithSpan } from '@crawlee/otel';
    import { context, trace } from '@opentelemetry/api';

    const crawler = new CheerioCrawler({
        requestHandler: wrapWithSpan(
            async ({ request, $ }) => {
                // Look up the span that is active for this handler call
                const span = trace.getSpan(context.active());
                const title = $('title').text();
                if (span) {
                    span.setAttribute('page.title', title);
                    span.addEvent('page_scraped', { url: request.url });
                }
                // ... rest of your handler
            },
            { spanName: 'request-handler' },
        ),
    });

You can also create your own instrumentation by selecting only the methods you want to instrument. Here's an example of adding custom instrumentation for specific crawler methods:
{CustomInstrumentationSource}

When requestHandlingInstrumentation is enabled (the default), the following methods are automatically instrumented:
| Crawler | Method | Span Name |
|---|---|---|
| BasicCrawler | run | crawlee.crawler.run |
| BasicCrawler | _runTaskFunction | crawlee.crawler.runTaskFunction |
| BasicCrawler | _requestFunctionErrorHandler | crawlee.crawler.requestFunctionErrorHandler |
| BasicCrawler | _handleFailedRequestHandler | crawlee.crawler.handleFailedRequestHandler |
| BasicCrawler | _executeHooks | crawlee.crawler.executeHooks |
| BrowserCrawler | _handleNavigation | crawlee.browser.handleNavigation |
| BrowserCrawler | _runRequestHandler | crawlee.browser.runRequestHandler |
| HttpCrawler | _handleNavigation | crawlee.http.handleNavigation |
| HttpCrawler | _runRequestHandler | crawlee.http.runRequestHandler |

Request handler spans include these attributes automatically:

- crawlee.request.id
- crawlee.request.url
- crawlee.request.method
- crawlee.request.retry_count
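
If you create spans manually, reusing the same attribute keys lets you filter them in the tracing backend alongside the automatically created ones. The following is a minimal sketch, not the library's implementation: `TracedRequest` and `requestSpanAttributes` are hypothetical names, and the mapping from request fields to attribute keys is an assumption based on the key names listed above.

```typescript
// Minimal stand-in for a crawler request (assumption: only the fields
// needed for the four attributes listed above).
interface TracedRequest {
    id?: string;
    url: string;
    method: string;
    retryCount: number;
}

// Returns an attribute map that uses the same keys as the automatic spans.
function requestSpanAttributes(request: TracedRequest): Record<string, string | number> {
    const attributes: Record<string, string | number> = {
        'crawlee.request.url': request.url,
        'crawlee.request.method': request.method,
        'crawlee.request.retry_count': request.retryCount,
    };
    // Only set the id when the request has one.
    if (request.id !== undefined) {
        attributes['crawlee.request.id'] = request.id;
    }
    return attributes;
}

// Usage sketch, left as a comment because it needs @opentelemetry/api:
// span.setAttributes(requestSpanAttributes(request));
```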
