Correctly utilize the assumeSoleOwner flag

janbuchar · janbuchar · commit c8bb798b39f2 · 2026-06-19T13:42:04.000+02:00
diff --git a/docs/guides/parallel-scraping/parallel-scraper.mjs b/docs/guides/parallel-scraping/parallel-scraper.mjs
@@ -1,5 +1,6 @@
 import { fork } from 'node:child_process';
 
+import { FileSystemStorageClient } from '@crawlee/fs-storage';
 import { Configuration, Dataset, PlaywrightCrawler, log } from 'crawlee';
 
 import { router } from './routes.mjs';
@@ -76,13 +77,19 @@ if (!process.env.IN_WORKER_THREAD) {
     // Get the request queue
     const requestQueue = await getOrInitQueue(false);
 
-    // Disable the automatic purge on start and configure crawlee to store the worker-specific data in a separate directory
-    // (needs to be done AFTER the queue is initialized when running locally)
+    // Disable the automatic purge on start, so we don't lose the queue we prepared
     const config = new Configuration({
         purgeOnStart: false,
-        storageClientOptions: {
-            localDataDirectory: `./storage/worker-${process.env.WORKER_INDEX}`,
-        },
+    });
+
+    // Store the worker's own internal state (its default dataset, key-value store, etc.) in a separate
+    // directory so the workers don't collide with each other (needs to be done AFTER the queue is
+    // initialized when running locally). This directory is private to a single worker, so we set
+    // `assumeSoleOwner: true` — the concurrency-safe locking only matters for the shared `shop-urls`
+    // queue, which gets its own storage client in `requestQueue.mjs`.
+    const storageClient = new FileSystemStorageClient({
+        localDataDirectory: `./storage/worker-${process.env.WORKER_INDEX}`,
+        assumeSoleOwner: true,
     });
 
     workerLogger.debug('Setting up crawler.');
@@ -98,6 +105,10 @@ if (!process.env.IN_WORKER_THREAD) {
             // highlight-end
             // Let's also limit the crawler's concurrency, we don't want to overload a single process 🐌
             maxConcurrency: 5,
+            // Use the worker-specific storage client we created above (private to this worker, so sole-owner mode)
+            // highlight-start
+            storageClient,
+            // highlight-end
         },
         config,
     );
diff --git a/docs/guides/parallel-scraping/parallel-scraping.mdx b/docs/guides/parallel-scraping/parallel-scraping.mdx
@@ -60,6 +60,16 @@ The first step in our conversion process will be creating a common file (let's c
 
 The exported function, `getOrInitQueue`, might seem like it does a lot. In essence, it just ensures the request queue is initialized, and if requested, ensures it starts off with an empty state.
 
+:::caution Make the shared queue concurrency-safe with `assumeSoleOwner: false`
+
+Because every worker process opens this same `shop-urls` queue at the same time, it **must** use the concurrency-safe locking behavior of `FileSystemStorageClient`. That's why `getOrInitQueue` opens the queue with a storage client constructed with `assumeSoleOwner: false`.
+
+By default, `FileSystemStorageClient` assumes it is the *sole* consumer of a queue (`assumeSoleOwner: true`). On open it immediately reclaims any requests left *in progress* — great for a single-process crawl recovering after a crash, but disastrous when workers run side by side: each worker would happily grab requests another worker is still processing, so the same URL gets scraped multiple times.
+
+Setting `assumeSoleOwner: false` tells the client to treat an in-progress request as a potential live peer's lock and only reclaim it once the lock expires on the wall clock, so two workers never process the same request at once.
+
+:::
+
 ### Adapting our previous scraper to enqueue the product URLs to the new queue
 
 In the `src/routes.mjs` file of the scraper we previously built, we have a handler for the `CATEGORY` label. Let's adapt that handler to enqueue the product URLs to the new queue we created.
@@ -122,34 +132,44 @@ This will check how the script is executed as. If this value has _any_ value, it
 
 We use this to ensure the parent process stays alive until all the worker processes exit. Otherwise, the worker processes would just get spawned, and lose the ability to communicate with the parent. You might not need this depending on your use case (maybe you just need to spawn workers and let them process).
 
-#### What's with all those `Configuration` calls?
+#### What's with all the `Configuration` and storage client setup?
 
-There are three steps we want to do for the worker processes:
+There are two things we want to do for the worker processes:
 
-- get the queue that supports locking from the same location as the parent process
-- ensure the default storages do **not** get purged on start, as otherwise we'd lose the queue we prepared, and initialize a special storage for worker processes so they do not collide with each other
+- get the shared queue from the same location as the parent process (it already comes with the concurrency-safe storage client we set up in `requestQueue.mjs`)
+- ensure the default storages do **not** get purged on start, as otherwise we'd lose the queue we prepared, and give each worker its own private storage directory for its internal state so the workers don't collide with each other
 
 In order, that's what these lines do:
 
 ```javascript title="src/parallel-scraper.mjs"
-// Get the request queue from the parent process (step 1)
+import { FileSystemStorageClient } from '@crawlee/fs-storage';
+
+// Get the shared request queue from the parent process (step 1)
 const requestQueue = await getOrInitQueue(false);
 
-// Disable the automatic purge on start and configure crawlee to store the worker-specific data
-// in a separate directory (needs to be done AFTER the queue is initialized when running locally) (step 2)
-const config = new Configuration({
-    purgeOnStart: false,
-    storageClientOptions: {
-        localDataDirectory: `./storage/worker-${process.env.WORKER_INDEX}`,
-    },
+// Disable the automatic purge on start, so we don't lose the queue we prepared (step 2)
+const config = new Configuration({ purgeOnStart: false });
+
+// Store the worker's own internal state in a separate directory so workers don't collide (step 2,
+// cont.). Needs to be done AFTER the queue is initialized when running locally. This directory is
+// private to a single worker, so we explicitly set `assumeSoleOwner: true`.
+const storageClient = new FileSystemStorageClient({
+    localDataDirectory: `./storage/worker-${process.env.WORKER_INDEX}`,
+    assumeSoleOwner: true,
 });
 ```
 
+:::note Why no `assumeSoleOwner: false` here?
+
+Each worker's `./storage/worker-N` directory is private to that single worker (nothing else opens it), so `assumeSoleOwner: true`, which is also the default and is set explicitly above for clarity, is exactly right. The concurrency-safe locking only matters for storage that is genuinely shared across processes, which is the `shop-urls` queue in `requestQueue.mjs`, not this per-worker internal state.
+
+:::
+
 #### Telling the crawler to use the worker configuration
 
 You might have noticed several lines highlighted in the code above. Those show how you provide the shared request queue to the crawler.
 
-You might have also noticed we passed in a second parameter to the constructor of the crawler, the `config` variable we created earlier. This is needed to ensure the crawler uses the worker-specific storages for internal states, and that they do not collide with each other.
+You might have also noticed that we passed the `storageClient` we created earlier in the crawler options, and the `config` as the second constructor argument. Together, these ensure the crawler uses the worker-specific storages for its own internal state (so the workers do not collide with each other), while still consuming the shared, concurrency-safe `shop-urls` queue we provided explicitly.
 
 #### Why do we use `process.send` instead of `context.pushData`?
 
diff --git a/docs/guides/parallel-scraping/shared.mjs b/docs/guides/parallel-scraping/shared.mjs
@@ -1,8 +1,16 @@
+import { FileSystemStorageClient } from '@crawlee/fs-storage';
 import { RequestQueue } from 'crawlee';
 
 // The request queue shared by all the parallel workers
 let queue;
 
+// The `shop-urls` queue is opened concurrently by every worker process, so it must use the
+// concurrency-safe locking behavior. With `assumeSoleOwner: false`, a request another worker is
+// still processing is treated as a live peer's lock and is not handed out again until that lock
+// expires — so two workers never scrape the same URL at once. (We point at the default `./storage`
+// location, which is where this shared queue lives.)
+const sharedStorageClient = new FileSystemStorageClient({ assumeSoleOwner: false });
+
 /**
  * @param {boolean} makeFresh Whether the queue should be cleared before returning it
  * @returns The queue
@@ -12,11 +20,11 @@ export async function getOrInitQueue(makeFresh = false) {
         return queue;
     }
 
-    queue = await RequestQueue.open('shop-urls');
+    queue = await RequestQueue.open('shop-urls', { storageClient: sharedStorageClient });
 
     if (makeFresh) {
         await queue.drop();
-        queue = await RequestQueue.open('shop-urls');
+        queue = await RequestQueue.open('shop-urls', { storageClient: sharedStorageClient });
     }
 
     return queue;
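
Reviewer note: to make the two reopen modes described in the comment above concrete, here is a minimal in-memory sketch. It is a toy model, not the native `@crawlee/fs-storage-native` implementation: `ToyQueue`, `makeStore`, `lockedUntil`, and the 60-second lock duration are all hypothetical names and values chosen for illustration, and handled requests are not modeled at all.

```javascript
// Toy model only: every name here is hypothetical, not part of Crawlee.
class ToyQueue {
    constructor(requests, { assumeSoleOwner = true, now = () => Date.now() } = {}) {
        this.requests = requests; // Map<uniqueKey, { lockedUntil: number | null }>
        this.now = now;
        if (assumeSoleOwner) {
            // Sole owner: any leftover lock must be stale, so clear it immediately.
            for (const entry of requests.values()) entry.lockedUntil = null;
        }
    }

    fetchNext(lockMillis = 60000) {
        const now = this.now();
        for (const [key, entry] of this.requests) {
            // A lock is respected until its expiry time has passed.
            if (entry.lockedUntil === null || entry.lockedUntil <= now) {
                entry.lockedUntil = now + lockMillis;
                return key;
            }
        }
        return null;
    }
}

// Request '1' was left locked (until t = 1000) by some other process.
const makeStore = () => new Map([
    ['1', { lockedUntil: 1000 }],
    ['2', { lockedUntil: null }],
]);

// Shared mode: the peer's lock is respected until it expires.
let clock = 0;
const shared = new ToyQueue(makeStore(), { assumeSoleOwner: false, now: () => clock });
const first = shared.fetchNext(); // '2'
const second = shared.fetchNext(); // null, because '1' is still locked
clock = 1000;
const afterExpiry = shared.fetchNext(); // '1', because its lock has expired

// Sole-owner mode: the leftover lock is cleared on open.
const sole = new ToyQueue(makeStore(), { assumeSoleOwner: true, now: () => 0 });
const soleFirst = sole.fetchNext(); // '1'
```

In this model, the sole-owner branch is what would let a second live worker pick up request `'1'` while the first is still processing it, which is the duplicate-scrape scenario the shared client is meant to avoid.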
diff --git a/packages/fs-storage/src/file-system-storage.ts b/packages/fs-storage/src/file-system-storage.ts
@@ -25,6 +25,22 @@ export interface FileSystemStorageOptions {
      * Optional logger for FileSystemStorageClient warnings.
      */
     logger?: CrawleeLogger;
+
+    /**
+     * Assert that this process is the *sole* consumer of every request queue it opens.
+     *
+     * When `true` (the default), opening a queue immediately reclaims any requests that a previous
+     * run left *in progress* (e.g. after a crash), so they become fetchable again right away. This is
+     * the right behavior for the common single-process crawl.
+     *
+     * Set this to `false` if multiple processes share the same on-disk request queue concurrently
+     * (for example, the {@apilink parallel scraping setup | "Parallel Scraping Guide"}). In that mode
+     * an in-progress request is treated as a potential live peer's lock and is only reclaimed once
+     * that lock expires on the wall clock, so two workers won't process the same request at once.
+     *
+     * @default true
+     */
+    assumeSoleOwner?: boolean;
 }
 
 /**
@@ -41,6 +57,7 @@ export class FileSystemStorageClient implements storage.StorageClient {
     readonly keyValueStoresDirectory: string;
     readonly requestQueuesDirectory: string;
     readonly logger?: CrawleeLogger;
+    readonly assumeSoleOwner: boolean;
 
     readonly keyValueStoreCache: KeyValueStoreClient[] = [];
     readonly datasetClientCache: DatasetClient[] = [];
@@ -49,9 +66,11 @@ export class FileSystemStorageClient implements storage.StorageClient {
     constructor(options: FileSystemStorageOptions = {}) {
         s.object({
             localDataDirectory: s.string().optional(),
+            assumeSoleOwner: s.boolean().optional(),
         }).parse(options);
 
         this.logger = options.logger;
+        this.assumeSoleOwner = options.assumeSoleOwner ?? true;
 
         // v3.0.0 used `crawlee_storage` as the default, we changed this in v3.0.1 to just `storage`,
         // this function handles it without making BC breaks - it respects existing `crawlee_storage`
@@ -165,7 +184,15 @@ export class FileSystemStorageClient implements storage.StorageClient {
             }
         }
 
-        const nativeClient = await NativeRequestQueueClient.open(id, name, alias, this.localDataDirectory);
+        const nativeClient = await NativeRequestQueueClient.open(
+            id,
+            name,
+            alias,
+            this.localDataDirectory,
+            // useTestClock — always real wall-clock outside of native tests.
+            undefined,
+            this.assumeSoleOwner,
+        );
         const newStore = await RequestQueueClient.create({
             name: alias ? undefined : (name ?? cacheKey),
             cacheKey: cacheKey ?? '',
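
Reviewer note: the constructor resolves the default with `options.assumeSoleOwner ?? true`. The choice of `??` over `||` matters for a boolean option whose default is `true`, because an explicit `false` must survive. A small sketch of the difference, using two hypothetical helpers (`resolveWithNullish` and `resolveWithOr` are illustrative names, not part of the package):

```javascript
// Hypothetical helpers, for illustration only.
// `??` falls back only when the value is null or undefined.
const resolveWithNullish = (options = {}) => options.assumeSoleOwner ?? true;
// `||` falls back on any falsy value, so an explicit `false` is lost.
const resolveWithOr = (options = {}) => options.assumeSoleOwner || true;

const omitted = resolveWithNullish({}); // true
const explicitFalse = resolveWithNullish({ assumeSoleOwner: false }); // false
const brokenFalse = resolveWithOr({ assumeSoleOwner: false }); // true (the bug `??` avoids)
```

With `||`, the shared `shop-urls` client would silently run in sole-owner mode even when constructed with `assumeSoleOwner: false`.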
diff --git a/packages/fs-storage/test/request-queue/assume-sole-owner.test.ts b/packages/fs-storage/test/request-queue/assume-sole-owner.test.ts
@@ -0,0 +1,85 @@
+import { rm } from 'node:fs/promises';
+import { resolve } from 'node:path';
+
+import { FileSystemStorageClient } from '@crawlee/fs-storage';
+
+// `assumeSoleOwner` controls how the native `@crawlee/fs-storage-native` extension treats requests
+// left *in progress* by a previous run (a dangling `orderNo` lock on disk) when a queue is reopened.
+// The reclaim/respect-peer-lock semantics are owned by the native extension; these tests verify the
+// adapter's contract on top of it: the option defaults to `true`, is honored when set, and that the
+// resulting behavior reaches all the way down to the native queue.
+describe('FileSystemStorageClient assumeSoleOwner', () => {
+    const tmpLocation = resolve(import.meta.dirname, './tmp/assume-sole-owner');
+
+    afterEach(async () => {
+        await rm(tmpLocation, { force: true, recursive: true });
+    });
+
+    test('defaults to true', () => {
+        const storage = new FileSystemStorageClient({ localDataDirectory: tmpLocation });
+        expect(storage.assumeSoleOwner).toBe(true);
+    });
+
+    test('respects an explicit false', () => {
+        const storage = new FileSystemStorageClient({ localDataDirectory: tmpLocation, assumeSoleOwner: false });
+        expect(storage.assumeSoleOwner).toBe(false);
+    });
+
+    // Seed a queue with two requests, fetch (lock) one without handling it or tearing down — leaving a
+    // dangling in-progress lock on disk, exactly the "process died mid-flight" situation.
+    async function seedQueueWithDanglingLock(dir: string) {
+        const storage = new FileSystemStorageClient({ localDataDirectory: dir });
+        const queue = await storage.createRequestQueueClient({ name: 'default' });
+        await queue.addBatchOfRequests([
+            { url: 'http://example.com/1', uniqueKey: '1' },
+            { url: 'http://example.com/2', uniqueKey: '2' },
+        ]);
+        const locked = await queue.fetchNextRequest();
+        expect(locked).not.toBeNull();
+        // Intentionally NO markRequestAsHandled and NO teardown/persistState — the lock is left dangling.
+        return locked!;
+    }
+
+    test('true (default): reopening preserves contents but relinquishes the dangling lock', async () => {
+        const dir = resolve(tmpLocation, 'sole-owner-true');
+        const locked = await seedQueueWithDanglingLock(dir);
+
+        // Reopen the same directory as sole owner, without purging.
+        const reopened = new FileSystemStorageClient({ localDataDirectory: dir, assumeSoleOwner: true });
+        const queue = await reopened.createRequestQueueClient({ name: 'default' });
+
+        // Contents preserved: both requests still present, none handled.
+        const metadata = await queue.getMetadata();
+        expect(metadata.totalRequestCount).toBe(2);
+        expect(metadata.handledRequestCount).toBe(0);
+        expect(metadata.pendingRequestCount).toBe(2);
+
+        // Lock relinquished: BOTH requests are fetchable again, including the one locked before.
+        const a = await queue.fetchNextRequest();
+        const b = await queue.fetchNextRequest();
+        expect([a?.uniqueKey, b?.uniqueKey].sort()).toStrictEqual(['1', '2']);
+        // The previously-locked request survived with its data intact.
+        const reFetched = await queue.getRequest(locked.uniqueKey);
+        expect(reFetched?.url).toBe(locked.url);
+    });
+
+    test('false: reopening keeps the dangling lock (concurrency-safe mode)', async () => {
+        const dir = resolve(tmpLocation, 'sole-owner-false');
+        await seedQueueWithDanglingLock(dir);
+
+        // Reopen in concurrency-safe mode: an in-progress request is treated as a potential live peer's
+        // lock and is NOT reclaimed until it expires.
+        const reopened = new FileSystemStorageClient({ localDataDirectory: dir, assumeSoleOwner: false });
+        const queue = await reopened.createRequestQueueClient({ name: 'default' });
+
+        // Contents are still preserved...
+        const metadata = await queue.getMetadata();
+        expect(metadata.totalRequestCount).toBe(2);
+        expect(metadata.pendingRequestCount).toBe(2);
+
+        // ...but only the un-locked request is handed out; the locked one stays in progress.
+        const a = await queue.fetchNextRequest();
+        expect(a?.uniqueKey).toBe('2');
+        expect(await queue.fetchNextRequest()).toBeNull();
+    });
+});