---
id: trace-and-monitor-crawlers
title: Trace and monitor crawlers
description: How to use OpenTelemetry to trace and monitor your crawlers
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

import SetupSource from '!!raw-loader!./trace_and_monitor_setup.ts';
import BasicExampleSource from '!!raw-loader!./trace_and_monitor_basic.ts';
import WrapWithSpanSource from '!!raw-loader!./trace_and_monitor_wrap_with_span.ts';
import CustomInstrumentationSource from '!!raw-loader!./trace_and_monitor_custom.ts';

OpenTelemetry is a collection of APIs, SDKs, and tools to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software's performance and behavior. You can learn more about its basic concepts in the OpenTelemetry documentation.

In this guide, we'll show you how to set up OpenTelemetry and instrument your Crawlee crawlers to see traces of individual requests as they are processed. OpenTelemetry on its own does not provide visualization tools, so we'll use Jaeger as our tracing backend. Feel free to use any other OpenTelemetry-compatible backend. Check the OpenTelemetry vendors list for more options.

Set up Jaeger

This guide will show you how to set up the environment locally to run the example code and visualize the telemetry data in Jaeger running in a Docker container.

To start the preconfigured Docker container, create a docker-compose.yml file:

services:
  jaeger:
    image: jaegertracing/all-in-one:1.53
    container_name: jaeger
    ports:
      # Jaeger UI
      - "16686:16686"
      # OTLP gRPC
      - "4317:4317"
      # OTLP HTTP
      - "4318:4318"
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    restart: unless-stopped

Then start it with:

docker compose up -d

For more details about the Jaeger setup, see the getting started section of the Jaeger documentation. Once the container is running, open the Jaeger UI in your browser at http://localhost:16686.

Install dependencies

To instrument your Crawlee crawler, you need to install the @crawlee/otel package along with the OpenTelemetry SDK packages:

npm install @crawlee/otel @opentelemetry/api @opentelemetry/api-logs @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-grpc @opentelemetry/exporter-logs-otlp-grpc @opentelemetry/sdk-logs

Instrument the crawler

OpenTelemetry instrumentation must be set up before importing Crawlee or any other instrumented modules. The easiest way to do this is to create a separate setup file and import it first using Node.js's --import flag.

Setup file

Create a setup file that initializes OpenTelemetry with the Crawlee instrumentation:

{SetupSource}

Main crawler file

Now create your crawler. The CrawleeInstrumentation will automatically instrument the core crawler methods:

{BasicExampleSource}

Run the crawler

Run your crawler with the setup file imported first:

npx tsx --import ./src/setup.ts ./src/main.ts

The --import flag ensures the OpenTelemetry setup runs before any other code, which is required for the automatic instrumentation to work properly.
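
As a rough analogy for why load order matters, the sketch below patches a method on a stand-in object and shows that a reference taken before the patch never reaches the wrapper. Everything in it (`pageLibrary`, `fetchPage`, `tracedCalls`) is hypothetical and unrelated to the real Crawlee or OpenTelemetry internals, which hook into module loading rather than patching a plain object.

```typescript
// Stand-in for a library module; nothing here comes from Crawlee or OpenTelemetry.
const pageLibrary = {
    fetchPage: (url: string): string => `fetched ${url}`,
};

// A consumer that runs BEFORE instrumentation keeps a direct reference
// to the original, unpatched function.
const earlyReference = pageLibrary.fetchPage;

// Hypothetical instrumentation step: wrap the method and record each call.
const tracedCalls: string[] = [];
const originalFetchPage = pageLibrary.fetchPage;
pageLibrary.fetchPage = (url: string): string => {
    tracedCalls.push(url);
    return originalFetchPage(url);
};

// A consumer that runs AFTER instrumentation picks up the wrapper.
const lateReference = pageLibrary.fetchPage;

earlyReference('https://example.com/a'); // bypasses the wrapper
lateReference('https://example.com/b'); // goes through the wrapper

console.log(tracedCalls); // [ 'https://example.com/b' ]
```

Only the call made through the late reference is recorded, which mirrors what happens when the setup file is loaded too late: code still runs, but some of it is invisible to tracing.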

Analyze the results

In the Jaeger UI, you can search for different traces, apply filtering, compare traces, view their detailed attributes, view timing details, and more. For a detailed description of the tool's capabilities, please refer to the Jaeger documentation.

Jaeger search view

Other tools may suit your needs better for consuming the OpenTelemetry data. See the list of known vendors in the OpenTelemetry documentation.

Customize the instrumentation

The CrawleeInstrumentation class provides several configuration options to customize what gets instrumented:

| Option | Default | Description |
| --- | --- | --- |
| `enabled` | `true` | Enable or disable the instrumentation entirely |
| `requestHandlingInstrumentation` | `true` | Instrument core crawler methods like `run`, `_runTaskFunction`, navigation handlers |
| `logInstrumentation` | `true` | Forward Crawlee logs to OpenTelemetry logs |
| `customInstrumentation` | `[]` | Array of custom class methods to instrument |

Configuration example

import { CrawleeInstrumentation } from '@crawlee/otel';

const crawleeInstrumentation = new CrawleeInstrumentation({
    // Disable automatic request handling instrumentation
    requestHandlingInstrumentation: false,
    // Disable log forwarding
    logInstrumentation: false,
    // Add custom instrumentation
    customInstrumentation: [
        {
            moduleName: '@crawlee/basic',
            className: 'BasicCrawler',
            methodName: 'run',
            spanName: 'my-custom-span-name',
        },
    ],
});
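
To make the effect of the defaults concrete, here is a minimal sketch of how omitted options might fall back to the values in the table. The interfaces and the `effectiveConfig` helper are hypothetical stand-ins written for this illustration; they are not imported from `@crawlee/otel`, and the real class may merge its options differently.

```typescript
// Hypothetical shapes mirroring the options table; not the library's typings.
interface CustomMethodSketch {
    moduleName: string;
    className: string;
    methodName: string;
    spanName: string;
}

interface InstrumentationConfigSketch {
    enabled: boolean;
    requestHandlingInstrumentation: boolean;
    logInstrumentation: boolean;
    customInstrumentation: CustomMethodSketch[];
}

// Defaults as listed in the options table.
const DEFAULT_CONFIG: InstrumentationConfigSketch = {
    enabled: true,
    requestHandlingInstrumentation: true,
    logInstrumentation: true,
    customInstrumentation: [],
};

// Options the caller omits fall back to the defaults.
function effectiveConfig(
    overrides: Partial<InstrumentationConfigSketch> = {},
): InstrumentationConfigSketch {
    return { ...DEFAULT_CONFIG, ...overrides };
}

// Same overrides as the configuration example above.
const config = effectiveConfig({
    requestHandlingInstrumentation: false,
    logInstrumentation: false,
    customInstrumentation: [
        {
            moduleName: '@crawlee/basic',
            className: 'BasicCrawler',
            methodName: 'run',
            spanName: 'my-custom-span-name',
        },
    ],
});

console.log(config.enabled); // true
```

Under this reading, the configuration example leaves the instrumentation enabled but limits it to the single custom span on `BasicCrawler.run`.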

Manual span instrumentation with wrapWithSpan

For more fine-grained control, you can use the wrapWithSpan utility to wrap specific functions with OpenTelemetry spans. This is particularly useful for instrumenting request handlers, hooks, and error handlers.

{WrapWithSpanSource}

wrapWithSpan options

The wrapWithSpan function accepts these options:

| Option | Type | Description |
| --- | --- | --- |
| `spanName` | `string \| ((...args) => string)` | Static name or function that receives the handler arguments and returns a span name |
| `spanOptions` | `SpanOptions \| ((...args) => SpanOptions)` | Static options or function that returns OpenTelemetry `SpanOptions` including attributes |
| `tracer` | `Tracer` | Custom tracer instance (defaults to `trace.getTracer('crawlee')`) |
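
The "static value or function of the handler arguments" pattern used by `spanName` and `spanOptions` can be sketched as follows. The types and the `describeSpan` helper are hypothetical and exist only to show how such options could be resolved per call; the actual `wrapWithSpan` typings and internals may differ.

```typescript
// Hypothetical stand-ins; the real wrapWithSpan typings may differ.
type StaticOrFn<T, Args extends unknown[]> = T | ((...args: Args) => T);

interface SpanOptionsSketch {
    attributes?: Record<string, string | number | boolean>;
}

interface WrapOptionsSketch<Args extends unknown[]> {
    spanName: StaticOrFn<string, Args>;
    spanOptions?: StaticOrFn<SpanOptionsSketch, Args>;
}

// Resolve both options against the arguments of one handler call.
function describeSpan<Args extends unknown[]>(
    options: WrapOptionsSketch<Args>,
    ...args: Args
): { name: string; spanOptions: SpanOptionsSketch } {
    const name = typeof options.spanName === 'function'
        ? options.spanName(...args)
        : options.spanName;
    const raw = options.spanOptions;
    const spanOptions = typeof raw === 'function' ? raw(...args) : (raw ?? {});
    return { name, spanOptions };
}

// Simplified handler context for the example.
interface HandlerContextSketch {
    request: { url: string; label?: string };
}

const options: WrapOptionsSketch<[HandlerContextSketch]> = {
    // Dynamic name derived from the request label.
    spanName: ({ request }) => `handle ${request.label ?? 'default'}`,
    // Dynamic attributes derived from the request URL.
    spanOptions: ({ request }) => ({ attributes: { 'page.url': request.url } }),
};

const ctx: HandlerContextSketch = {
    request: { url: 'https://example.com/products', label: 'LIST' },
};

const result = describeSpan(options, ctx);
console.log(result.name); // handle LIST
```

A function-valued `spanName` is useful when one handler serves several request labels and you want them to show up as distinct operations in the tracing backend.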

Accessing the current span

Inside a wrapped function, you can access the current span to add additional attributes or events:

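import { context, trace } from '@opentelemetry/api';
// Assumes wrapWithSpan is exported from the @crawlee/otel package.
import { wrapWithSpan } from '@crawlee/otel';

// Fragment: this property goes inside your crawler's options object.
requestHandler: wrapWithSpan(
    async ({ request, $ }) => {
        // The span created by wrapWithSpan is active while the handler runs.
        const span = trace.getSpan(context.active());

        const title = $('title').text();

        // getSpan returns undefined when no span is active, so guard the calls.
        if (span) {
            span.setAttribute('page.title', title);
            span.addEvent('page_scraped', { url: request.url });
        }

        // ... rest of your handler
    },
    { spanName: 'request-handler' }
),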

Custom class instrumentation

You can also define your own instrumentation by selecting only the methods you want to instrument. Here's an example that adds custom instrumentation for specific crawler methods:

{CustomInstrumentationSource}

What gets instrumented automatically

When requestHandlingInstrumentation is enabled (the default), the following methods are automatically instrumented:

| Crawler | Method | Span Name |
| --- | --- | --- |
| `BasicCrawler` | `run` | `crawlee.crawler.run` |
| `BasicCrawler` | `_runTaskFunction` | `crawlee.crawler.runTaskFunction` |
| `BasicCrawler` | `_requestFunctionErrorHandler` | `crawlee.crawler.requestFunctionErrorHandler` |
| `BasicCrawler` | `_handleFailedRequestHandler` | `crawlee.crawler.handleFailedRequestHandler` |
| `BasicCrawler` | `_executeHooks` | `crawlee.crawler.executeHooks` |
| `BrowserCrawler` | `_handleNavigation` | `crawlee.browser.handleNavigation` |
| `BrowserCrawler` | `_runRequestHandler` | `crawlee.browser.runRequestHandler` |
| `HttpCrawler` | `_handleNavigation` | `crawlee.http.handleNavigation` |
| `HttpCrawler` | `_runRequestHandler` | `crawlee.http.runRequestHandler` |

Request handler spans include these attributes automatically:

  • crawlee.request.id
  • crawlee.request.url
  • crawlee.request.method
  • crawlee.request.retry_count
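
If you create spans manually and want them to carry the same attribute names, a small helper along these lines could build them from a request. This is a sketch only: `RequestLike` is a simplified stand-in for Crawlee's request object and `requestAttributes` is a hypothetical helper, not part of `@crawlee/otel`.

```typescript
// Simplified stand-in for a Crawlee request; only the fields used here.
interface RequestLike {
    id?: string;
    url: string;
    method: string;
    retryCount: number;
}

type AttributeValue = string | number | boolean;

// Hypothetical helper: builds the attribute names listed above from a request.
function requestAttributes(request: RequestLike): Record<string, AttributeValue> {
    const attributes: Record<string, AttributeValue> = {
        'crawlee.request.url': request.url,
        'crawlee.request.method': request.method,
        'crawlee.request.retry_count': request.retryCount,
    };
    // Only set the ID when the request has one.
    if (request.id !== undefined) {
        attributes['crawlee.request.id'] = request.id;
    }
    return attributes;
}

const exampleAttributes = requestAttributes({
    id: 'abc123',
    url: 'https://example.com/',
    method: 'GET',
    retryCount: 2,
});

console.log(exampleAttributes['crawlee.request.retry_count']); // 2
```

The result could then be returned from a function-valued `spanOptions`, for example `({ request }) => ({ attributes: requestAttributes(request) })`, so that manual and automatic spans can be filtered by the same keys.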