Quarry Public API

User-facing guide for Quarry. Normative behavior is defined by contracts under docs/contracts/.


Prerequisites

Tool Version Source
Go 1.25.6 quarry/go.mod
Node.js 23+ or 22.6+ Native TS support (see note)
pnpm 10.28.2 package.json#packageManager

Node.js version note: Quarry scripts are TypeScript and use Node's native type-stripping. Node 23+ enables this by default. Node 22.6–22.x requires the --experimental-strip-types flag (set via NODE_OPTIONS).
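
For Node 22.6–22.x, one way to set the flag for the current shell (illustrative):

export NODE_OPTIONS="--experimental-strip-types"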

Quarry is TypeScript-first and ESM-only.


Installation

Via mise (recommended)

mise install github:pithecene-io/quarry@0.13.4

Or pin in your mise.toml:

[tools]
"github:pithecene-io/quarry" = "0.13.4"

Via Docker

# Full image — includes Chrome for Testing + fonts (amd64 only, recommended)
docker pull ghcr.io/pithecene-io/quarry:0.13.4

# Slim image — no browser, multi-arch (BYO Chromium via --browser-ws-endpoint)
docker pull ghcr.io/pithecene-io/quarry:0.13.4-slim

See docs/guides/container.md for docker run and Docker Compose examples.

SDK

Install the SDK in your script project:

npx jsr add @pithecene-io/quarry-sdk

Or via pnpm with GitHub Packages:

pnpm add @pithecene-io/quarry-sdk

From source

git clone https://github.com/pithecene-io/quarry.git
cd quarry
pnpm install
task build

Quick Start

1. Clone and Build

git clone https://github.com/pithecene-io/quarry.git
cd quarry
pnpm install
task build

2. Run an Example

mkdir -p ./quarry-data

./quarry/quarry run \
  --script ./examples/demo.ts \
  --run-id my-first-run \
  --source demo \
  --storage-backend fs \
  --storage-path ./quarry-data

Expected output:

run_id=my-first-run, attempt=1, outcome=success, duration=...

Script Authoring

Export Shape

Scripts must export a default async function:

import type { QuarryContext } from "@pithecene-io/quarry-sdk";

export default async function run(ctx: QuarryContext): Promise<void> {
  // Your extraction logic here
}

This is the only supported entry point shape.

Lifecycle Hooks

Scripts may export lifecycle hooks alongside the default function. All hooks are optional; scripts that export none behave exactly as before.

import type { QuarryContext, PrepareResult, TerminalSignal, RunMeta } from "@pithecene-io/quarry-sdk";

// Runs before browser launch. Can skip or transform the job.
export function prepare(job: unknown, run: RunMeta): PrepareResult {
  if (isStale(job)) return { action: "skip", reason: "stale job" };
  return { action: "continue", job: enrichJob(job) };
}

// Runs before the main script function
export async function beforeRun(ctx: QuarryContext): Promise<void> { }

// Main script (required)
export default async function run(ctx: QuarryContext): Promise<void> { }

// Runs after script completes successfully (not called on error)
export async function afterRun(ctx: QuarryContext): Promise<void> { }

// Runs when the script throws (not called after manual terminal emit)
export async function onError(error: unknown, ctx: QuarryContext): Promise<void> { }

// Runs after script execution, before terminal event. Emit is still open.
export async function beforeTerminal(signal: TerminalSignal, ctx: QuarryContext): Promise<void> { }

// Runs after the terminal emission attempt. Context is still available, but emit calls throw.
export async function cleanup(ctx: QuarryContext): Promise<void> { }

Execution order:

load script
    │
    ▼
[prepare]  ──skip──▶ emit run_complete({skipped}) → run_result → return
    │
    │ continue (optionally with transformed job)
    ▼
acquire browser → create context
    │
    ▼
[beforeRun] → script() → [afterRun] (success) / [onError] (error)
    │
    ▼
[beforeTerminal]  (emit still open)
    │
    ▼
auto-emit terminal → [cleanup] → run_result → return

Hook rules:

Hook When it runs Receives Error handling
prepare Before browser launch (job, run) Throw → crash (no browser launched)
beforeRun Before script (ctx) Throw → enters error path
afterRun After script success (ctx) Throw → enters error path
onError After script error (error, ctx) Swallowed
beforeTerminal Before terminal emit (signal, ctx) Swallowed
cleanup After terminal attempt (ctx) Swallowed

prepare return values:

Return Behavior
{ action: "continue" } Proceed with original job
{ action: "continue", job } Proceed with transformed job
{ action: "skip" } Skip run — emit run_complete({skipped: true})
{ action: "skip", reason } Skip run — emit run_complete({skipped: true, reason})

Note: When prepare returns skip, no other hooks run — there is no browser, no context, and no cleanup. The run completes immediately.

Context Object

The QuarryContext provides:

Property Type Description
job unknown Job payload (from --job or --job-json)
run RunMeta Run metadata (run_id, attempt, job_id, etc.)
page Page Puppeteer page instance
browser Browser Puppeteer browser instance
browserContext BrowserContext Puppeteer browser context
emit EmitAPI Output emission interface
storage StorageAPI Sidecar file upload interface
memory MemoryAPI Memory pressure monitoring
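
A minimal sketch tying these properties together (the URL and item shape are illustrative):

import type { QuarryContext } from "@pithecene-io/quarry-sdk";

export default async function run(ctx: QuarryContext): Promise<void> {
  // Run metadata is useful for logging and correlation
  await ctx.emit.info(`run ${ctx.run.run_id}, attempt ${ctx.run.attempt}`);

  // Drive the provided Puppeteer page
  await ctx.page.goto("https://example.com");
  const title = await ctx.page.title();

  // Emit structured output
  await ctx.emit.item({ item_type: "page", data: { title } });
}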

Terminal Behavior

A script terminates in one of these ways:

  1. Normal return — Executor emits run_complete automatically
  2. Throw error — Executor emits run_error automatically
  3. Explicit emit.runComplete() — Script signals success
  4. Explicit emit.runError() — Script signals failure

After a terminal event, further emit calls are undefined behavior.
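
A sketch of the explicit path (using only the calls named above; normally a plain return is enough):

import type { QuarryContext } from "@pithecene-io/quarry-sdk";

export default async function run(ctx: QuarryContext): Promise<void> {
  await ctx.emit.item({ item_type: "demo", data: { ok: true } });

  // Explicit terminal event instead of relying on the automatic run_complete on return
  await ctx.emit.runComplete();

  // Any further emit.* call from here on is undefined behavior
}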


Emit API

emit.* is the primary output mechanism for scripts. For sidecar file uploads, see the Storage API below.

Primary Methods

// Emit structured data (primary output)
await ctx.emit.item({
  item_type: "product",
  data: { name: "Widget", price: 9.99 }
});

// Emit binary artifact (screenshot, PDF, etc.)
const artifactId = await ctx.emit.artifact({
  name: "screenshot.png",
  content_type: "image/png",
  data: buffer  // Buffer or Uint8Array
});

// Emit progress checkpoint
await ctx.emit.checkpoint({
  checkpoint_id: "page-2",
  note: "Completed page 2 of 10"
});

Logging

await ctx.emit.debug("Debug message", { field: "value" });
await ctx.emit.info("Info message");
await ctx.emit.warn("Warning message");
await ctx.emit.error("Error message");

Advisory Methods (Not Guaranteed)

// Suggest enqueueing additional work
await ctx.emit.enqueue({
  target: "detail-page",
  params: { url: "https://example.com/item/123" },
  // Optional partition overrides (default: inherit from root run)
  // source: "other-source",
  // category: "other-category",
});

// Suggest proxy rotation
await ctx.emit.rotateProxy({ reason: "rate limited" });

Batching Enqueue Calls

For high fan-out workloads that discover hundreds or thousands of items, createBatcher groups items into fewer, larger enqueue events — reducing child run count and scheduling overhead.

import { createBatcher } from "@pithecene-io/quarry-sdk";

export default async function run(ctx: QuarryContext): Promise<void> {
  const batcher = createBatcher(ctx.emit, {
    size: 50,          // items per batch
    target: "detail.ts",
  });

  for (const url of discoveredUrls) {
    await batcher.add({ url });  // auto-flushes every 50 items
  }
  await batcher.flush();          // emit any remaining items
}

  • size must be a positive integer (throws RangeError otherwise)
  • Each flush emits a single enqueue event with params: { items: T[] }
  • Call flush() before runComplete() — items buffered after a terminal event are lost
  • Optional source and category in options override partition keys for all batched enqueue events
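
For example, to route every batched child run to a different partition, pass the overrides in the batcher options (values are illustrative):

const batcher = createBatcher(ctx.emit, {
  size: 100,
  target: "detail.ts",
  source: "other-source",      // overrides the root run's --source for these enqueues
  category: "other-category",  // overrides the root run's --category
});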

Storage API

storage.put() writes sidecar files directly to addressable storage paths, bypassing the event pipeline.

// Write a file to storage (e.g., images, CSVs, raw HTML)
const result = await ctx.storage.put({
  filename: "product-image.png",
  content_type: "image/png",
  data: buffer  // Buffer or Uint8Array
});

// result.key is the resolved Hive-partitioned storage path
// e.g. "datasets/quarry/partitions/source=my-source/category=default/day=2026-02-23/run_id=run-001/files/product-image.png"
console.log(result.key);

// Include the storage key in emitted items for downstream correlation
await ctx.emit.item({
  item_type: "product",
  data: { name: "Example", image_key: result.key }
});

Rules:

  • Filename must be flat (no /, \, or ..)
  • Maximum file size: 8 MiB
  • Shares ordering and fail-fast with emit.*
  • Cannot be called after a terminal event (run_complete / run_error)
  • Returns StoragePutResult with the resolved key (v0.11.0+)
  • Promise rejects on backend write failure (v0.12.0+) — errors are recoverable

Files land at Hive-partitioned paths under the storage root:

<storage-path>/datasets/<dataset>/partitions/source=<s>/category=<c>/day=<d>/run_id=<r>/files/<filename>

Sidecar File Inventory (v0.13.4+)

Files written via storage.put() are automatically tracked in Lode snapshot Metadata under the sidecar_files key. Downstream consumers can enumerate files for a run via snapshot.Manifest.Metadata["sidecar_files"] instead of prefix-scanning storage.

Each entry includes path, filename, content_type, and size. File refs are flushed into the snapshot on event and metrics writes only (not chunk writes). To reconstruct the full file inventory for a run, union sidecar_files from all snapshots for that run_id.

See docs/contracts/CONTRACT_LODE.md § Sidecar File Inventory for the normative specification.

Batching File Uploads

For workloads that write many files, createStorageBatcher dispatches multiple storage.put() calls with bounded concurrency, so user code does not wait on each put individually and the number of in-flight promises stays bounded.

import { createStorageBatcher } from "@pithecene-io/quarry-sdk";

export default async function run(ctx: QuarryContext): Promise<void> {
  const batch = createStorageBatcher(ctx.storage, { concurrency: 16 });

  for (const img of images) {
    batch.add({
      filename: img.name,
      content_type: "image/png",
      data: img.buffer,
    });
  }
  await batch.flush();  // wait for all writes before terminal event
}

  • concurrency defaults to 16 (must be a positive integer)
  • flush() must be called before terminal events — buffered writes are lost otherwise
  • Fail-fast: first error rejects all queued writes and poisons the batcher
  • Reusable after a successful flush()

Memory Pressure API

ctx.memory provides pull-based memory monitoring for container-constrained workloads. Reports node heap, browser heap, and cgroup usage with pressure classification.

export default async function run(ctx: QuarryContext): Promise<void> {
  // Check memory pressure before heavy operations
  if (await ctx.memory.isAbove("high")) {
    await ctx.emit.warn("Memory pressure is high, reducing batch size");
  }

  // Full snapshot with all sources
  const snap = await ctx.memory.snapshot();
  console.log(snap.node.ratio);     // 0.0–1.0
  console.log(snap.cgroup?.ratio);   // null if not in container
  console.log(snap.pressure);        // 'low' | 'moderate' | 'high' | 'critical'

  // Opt out of browser metrics (avoids CDP call)
  const nodeOnly = await ctx.memory.snapshot({ browser: false });
}

Pressure levels (default thresholds):

Ratio Level
< 0.5 low
0.5–0.7 moderate
0.7–0.9 high
≥ 0.9 critical

CLI Reference

Run Command

quarry run [options]

Config file:

Flag Description
--config <path> Path to YAML config file with project-level defaults (see docs/guides/configuration.md)

CLI flags always override config file values.

Required flags:

Flag Description
--script <path> Path to TypeScript script
--run-id <id> Unique run identifier
--source <name> Source identifier for partitioning
--storage-backend <fs|s3> Storage backend
--storage-path <path> Storage location

Note: --source, --storage-backend, and --storage-path can also be provided via --config file. They are required at runtime but do not need to appear on the CLI if set in the config.

Optional flags:

Flag Default Description
--attempt <n> 1 Attempt number
--job <json> {} Job payload as inline JSON object
--job-json <path> Path to JSON file containing job payload (must be object)
--category <name> default Category for partitioning
--policy <strict|buffered|streaming> strict Ingestion policy
--flush-count <n> Flush after N events (streaming policy)
--flush-interval <duration> Flush every interval, e.g. 5s (streaming policy)
--report <path> Write structured JSON report to path on exit (use - for stderr)
--quiet false Suppress non-error output

Note: --job and --job-json are mutually exclusive. Using both is an error. The payload must be a top-level JSON object. Arrays, primitives, and null are rejected.
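
For example, passing an inline payload (the object shape is illustrative):

quarry run ... --job '{"url": "https://example.com/item/123"}'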

Streaming policy: At least one of --flush-count or --flush-interval must be specified when --policy=streaming. Both may be specified; the first trigger to fire wins. See docs/guides/policy.md for details.
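
For example, to flush every 100 events or every 5 seconds, whichever fires first:

quarry run ... --policy streaming --flush-count 100 --flush-interval 5s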

Adapter flags (event-bus notification):

Flag Default Description
--adapter <type> Adapter type (webhook, redis)
--adapter-url <url> Endpoint URL (required when --adapter set)
--adapter-header <key=value> Custom HTTP header (repeatable, webhook only)
--adapter-channel <name> quarry:run_completed Pub/sub channel name (redis only)
--adapter-timeout <duration> 10s Notification timeout
--adapter-retries <n> 3 Retry attempts

Event sink flags (real-time event delivery, v0.13.0+):

Flag Default Description
--event-sink <type> Event sink type (lode, redis); repeatable
--event-sink-lode-delivery mandatory Delivery mode for Lode event sink
--event-sink-redis-url <url> Redis URL (required when --event-sink includes redis)
--event-sink-redis-stream-key quarry:events Redis Streams key
--event-sink-redis-max-len 100000 Max stream length (-1 disables)
--event-sink-redis-ttl 24h Stream key expiry (-1 disables)
--event-sink-redis-timeout 2s Per-operation timeout
--event-sink-redis-retries 2 Retry attempts
--event-sink-redis-delivery mandatory Delivery mode for Redis event sink

Event sinks vs adapters: Event sinks deliver events in real time during a run. Adapters fire once after a run completes. Both can be used together. When no --event-sink flags are set, events go to Lode only (default). See docs/contracts/CONTRACT_INTEGRATION.md and docs/guides/integration.md.

Fan-out flags (derived work execution):

Flag Default Description
--depth <n> 0 Maximum recursion depth (0 = disabled)
--max-runs <n> Total child run cap (required when --depth > 0)
--parallel <n> 1 Maximum concurrent child runs
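
For example, one level of derived work capped at 20 child runs, four at a time:

quarry run ... --depth 1 --max-runs 20 --parallel 4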

Browser reuse:

Flag Env Var Description
--browser-ws-endpoint <url> QUARRY_BROWSER_ENDPOINT Connect to an externally managed browser via WebSocket URL (skips per-run Chromium launch; see docs/guides/cli.md)
--no-browser-reuse Disable transparent browser reuse across runs (forces per-run Chromium launch)

By default, Quarry transparently reuses a persistent Chromium process across sequential runs. The browser auto-terminates after idle timeout (default 60s, configurable via QUARRY_BROWSER_IDLE_TIMEOUT). --browser-ws-endpoint / QUARRY_BROWSER_ENDPOINT takes priority over transparent reuse when set. When an external endpoint is configured, a pre-run health check verifies the browser is reachable. See Container Usage for multi-crawler deployment patterns.
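
For example, to attach runs to an externally managed Chromium (the WebSocket URL is illustrative):

quarry run ... --browser-ws-endpoint ws://chromium:9222/devtools/browser/<browser-id>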

Module resolution:

Flag Description
--resolve-from <path> Path to node_modules directory for ESM resolution fallback (monorepo/container support)

When scripts import workspace packages that are not resolvable from the script's directory, --resolve-from registers an ESM resolve hook to fall back to the specified node_modules. See docs/contracts/CONTRACT_CLI.md.

Validation:

Flag Description
--dry-run Validate script loadability without executing a run (no browser, no storage)

When --dry-run is set, only --script and --run-id are required. Storage, policy, proxy, and adapter configuration are skipped entirely. The executor loads the script module (with --resolve-from hook if configured), validates the default export and hook signatures, and reports the result. Exit 0 = valid, exit 1 = invalid.

# Validate a script loads correctly
quarry run --dry-run --script ./my-script.ts --run-id test --source test

# Validate with --resolve-from (monorepo/container)
quarry run --dry-run --script ./my-script.ts --run-id test --source test \
  --resolve-from /app/node_modules

Advanced flags (development only):

Flag Description
--executor <path> Override executor path (auto-resolved in bundled binary; only needed for local development builds)

Storage flags (S3 / S3-compatible):

Flag Description
--storage-region <region> AWS region (uses default chain if omitted)
--storage-endpoint <url> Custom S3 endpoint URL (e.g. Cloudflare R2, MinIO)
--storage-s3-path-style Force path-style addressing (required by R2, MinIO)

Inspect Command

quarry inspect run <run-id>     # Inspect a specific run
quarry inspect job <job-id>     # Inspect a specific job

Stats Command

quarry stats runs               # Run statistics
quarry stats jobs               # Job statistics
quarry stats metrics            # Contract metrics (requires --storage-backend, --storage-path)
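
For example, against the filesystem backend used in the quick start:

quarry stats metrics --storage-backend fs --storage-path ./quarry-data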

Examples

All examples use local fixtures (no external network calls).

Example Description Events
examples/demo.ts Minimal item emission 1 item
examples/static-html-list/ Parse HTML fixture 3 items
examples/toy-pagination/ Multi-page pagination 6 items
examples/artifact-snapshot/ Screenshot artifact 1 artifact
examples/intentional-failure/ Error handling test 1 item + error
examples/hooks-prepare/ Job filtering with prepare skipped or 1 item
examples/hooks-before-terminal/ Summary emission via beforeTerminal items + summary

Run all examples:

task examples

Integration Pattern Examples

See examples/integration-patterns/ for downstream ETL trigger examples:

  • Event-bus pattern (SNS/SQS, handler)
  • Polling pattern (filesystem, S3)

Example: Minimal Item Emission

// examples/demo.ts
import type { QuarryContext } from "@pithecene-io/quarry-sdk";

export default async function run(ctx: QuarryContext): Promise<void> {
  await ctx.emit.item({
    item_type: "demo",
    data: { message: "hello from quarry" }
  });
}

Example: Screenshot Artifact

// examples/artifact-snapshot/script.ts
import type { QuarryContext } from "@pithecene-io/quarry-sdk";

export default async function run(ctx: QuarryContext): Promise<void> {
  await ctx.page.setContent("<h1>Hello</h1>");
  const data = await ctx.page.screenshot({ type: "png" });
  await ctx.emit.artifact({
    name: "snapshot.png",
    content_type: "image/png",
    data
  });
}

Example: Skip Stale Jobs with prepare

// examples/hooks-prepare/script.ts
import type { PrepareResult, QuarryContext, RunMeta } from "@pithecene-io/quarry-sdk";

type Job = { url: string; fetched_at?: string };

export function prepare(job: Job, _run: RunMeta): PrepareResult<Job> {
  if (job.fetched_at) {
    const age = Date.now() - new Date(job.fetched_at).getTime();
    if (age < 3_600_000) {
      return { action: "skip", reason: "fetched less than 1h ago" };
    }
  }
  return { action: "continue" };
}

export default async function run(ctx: QuarryContext<Job>): Promise<void> {
  await ctx.page.goto(ctx.job.url);
  const title = await ctx.page.title();
  await ctx.emit.item({ item_type: "page", data: { url: ctx.job.url, title } });
}

Example: Emit Summary with beforeTerminal

// examples/hooks-before-terminal/script.ts
import type { QuarryContext, TerminalSignal } from "@pithecene-io/quarry-sdk";

let itemCount = 0;

export default async function run(ctx: QuarryContext): Promise<void> {
  for (const product of ["Widget", "Gadget", "Gizmo"]) {
    await ctx.emit.item({ item_type: "product", data: { name: product } });
    itemCount++;
  }
}

export async function beforeTerminal(
  signal: TerminalSignal,
  ctx: QuarryContext
): Promise<void> {
  if (signal.outcome === "completed") {
    await ctx.emit.item({
      item_type: "run_summary",
      data: { total_items: itemCount }
    });
  }
}

Troubleshooting

"Module not found" for SDK

Ensure SDK is built:

pnpm -C sdk run build

"Executor not found"

The executor is auto-resolved but must be built first:

pnpm -C executor-node run build

If auto-resolution fails, specify the path manually:

./quarry/quarry run --executor ./executor-node/dist/bin/executor.js ...

Chrome/Puppeteer sandbox errors (CI)

Set environment variable:

QUARRY_NO_SANDBOX=1 quarry run ...

"Contract version mismatch"

SDK and runtime versions must match. Rebuild both:

task build

Storage path errors

  • FS backend: Path must be a writable directory
  • S3 backend: Format is bucket/prefix, credentials via AWS default chain

JSON import attribute error

If the executor crashes with:

Module "file:///path/to/data.json" needs an import attribute of "type: json"

Node ESM requires an explicit import attribute for JSON modules:

// ✗ crashes under Node ESM
import data from './data.json'

// ✓ correct
import data from './data.json' with { type: 'json' }

This applies to your script and all of its transitive dependencies. Other runtimes (Bun, bundlers, tsx) may accept bare JSON imports, but the Quarry executor runs under Node ESM.


Container Usage

Quarry ships container images via GHCR:

  • Full (amd64 only): ghcr.io/pithecene-io/quarry:0.13.4 — includes Chrome for Testing + fonts
  • Slim (amd64 + arm64): ghcr.io/pithecene-io/quarry:0.13.4-slim — no browser (BYO via --browser-ws-endpoint)

For docker run, Docker Compose, and sidecar patterns, see docs/guides/container.md.


Known Limitations (v0.13.4)

  1. Single executor type: Only Node.js executor supported
  2. No built-in retries: Retry logic is caller's responsibility
  3. No streaming reads: Artifacts must fit in memory (note: the streaming ingestion policy is unrelated — it controls write batching, not read access)
  4. No transactional storage writes: S3 and S3-compatible providers (R2, MinIO) do not provide transactional guarantees across writes
  5. No job scheduling: Quarry supports in-process derived work via --depth but is not a scheduler; external orchestration is the caller's responsibility
  6. Puppeteer required: All scripts run in a browser context
  7. Integration adapters and sinks: Webhook, Redis Pub/Sub, and Redis Streams are available. Temporal, NATS, and SNS adapters are planned. Event sink read-side interfaces are tabled for future work. See docs/guides/integration.md.
  8. Fan-out partition defaults: Child runs inherit the root run's --source and --category unless overridden via emit.enqueue({ source, category }). Overrides apply to individual child runs only and do not propagate to grandchildren.
  9. Fan-out target resolution: target in emit.enqueue() is resolved as a file path relative to CWD (same as --script). Target resolution semantics may change in a future release; config-based logical names are under consideration.

Storage Configuration

Storage is a Quarry-level concern. The underlying Lode subsystem is internal.

Filesystem Backend

--storage-backend fs \
--storage-path /var/quarry/data

Data is stored in Hive-partitioned layout:

/var/quarry/data/
  source=my-source/
    category=default/
      day=2026-02-03/
        run_id=run-001/
          event_type=item/
            data.jsonl

S3 Backend

--storage-backend s3 \
--storage-path my-bucket/quarry-data \
--storage-region us-east-1

Uses AWS default credential chain. IAM permissions required:

  • s3:PutObject
  • s3:GetObject
  • s3:ListBucket

S3-Compatible Providers (R2, MinIO)

Quarry supports S3-compatible storage providers via custom endpoint and path-style flags:

--storage-backend s3 \
--storage-path my-bucket/quarry-data \
--storage-endpoint https://ACCOUNT_ID.r2.cloudflarestorage.com \
--storage-s3-path-style

Set credentials via environment variables:

export AWS_ACCESS_KEY_ID=<access-key>
export AWS_SECRET_ACCESS_KEY=<secret-key>

Downstream Integration

Quarry is an extraction runtime, not a full pipeline. For triggering downstream processing after runs complete, see docs/guides/integration.md.

Built-in adapters (v0.5.0+): --adapter webhook sends an HTTP POST, and --adapter redis publishes to a Redis pub/sub channel after each run completes. See adapter flags above.

Event sinks (v0.13.0+): --event-sink redis streams every event to a Redis Stream in real time during the run. Can be combined with adapters for both real-time event streaming and post-run notification.
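
For example, combining a post-run webhook with real-time Redis Streams delivery (URLs are illustrative):

quarry run ... \
  --adapter webhook --adapter-url https://etl.example.com/hooks/quarry \
  --event-sink redis --event-sink-redis-url redis://redis:6379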

Fallback pattern: Polling-based triggers with idempotent checkpoints.


Version Information

quarry version
# 0.13.4 (commit: ...)

SDK and runtime versions must match (lockstep versioning).

Distribution Channels

Component Channel Install
CLI binary GitHub Releases mise install github:pithecene-io/quarry@0.13.4
Container (full, amd64) GHCR docker pull ghcr.io/pithecene-io/quarry:0.13.4
Container (slim, multi-arch) GHCR docker pull ghcr.io/pithecene-io/quarry:0.13.4-slim
SDK JSR npx jsr add @pithecene-io/quarry-sdk
SDK GitHub Packages pnpm add @pithecene-io/quarry-sdk