
TerminalError stacktrace reconstruction during replay can result in a journal mismatch #656

@MohamedBassem

Context

I have handler logic that looks roughly like this:

    // Call the runner service
    const res = await tryCatch(runner.run(jobData));

    // Handle RPC-level errors (e.g., runner service unavailable)
    // ...
    if (res.error instanceof restate.CancelledError) {
      throw res.error;
    }
    // Notify the runner of the RPC error
    await tryCatch(
      runner.onError({
        job: jobData,
        error: {
          name:
            res.error instanceof Error ? res.error.name : "RPCError",
          message: res.error instanceof Error ? res.error.message : String(res.error),
          stack:
            res.error instanceof Error ? res.error.stack : undefined,
        },
      }),
    );

The handler calls a Restate service and, if that call throws, calls another service with the details of the error (name, message, and stack trace). The main assumption here is that the error details are deterministic during replay.
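
For readers unfamiliar with the pattern, here is a minimal sketch of what a helper like tryCatch could look like. This is an assumption for illustration only: the actual helper used in the handler is not shown in this issue, and the Result type and its field names are hypothetical (only the error field is implied by the snippet above).

```typescript
// Hypothetical helper: the real tryCatch used in the handler is not shown in this issue.
type Result<T> =
  | { data: T; error: null }
  | { data: null; error: unknown };

// Awaits the promise and returns the outcome as a value instead of throwing,
// so the caller can inspect `error` and decide what to rethrow.
async function tryCatch<T>(promise: PromiseLike<T>): Promise<Result<T>> {
  try {
    return { data: await promise, error: null };
  } catch (error) {
    return { data: null, error };
  }
}
```

With a helper of this shape, whatever the awaited call throws (including its stack) ends up in res.error, which is how the stack trace reaches the second call's parameters.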

Problem

When the main service handler fails with a terminal error (e.g. a network error, or the service is unavailable), this seems to result in a journal entry mismatch, caused by a difference in the stack trace returned during replay:

[570] Found a mismatch between the code paths taken during the previous execution and the paths taken during this execution.
This typically happens when some parts of the code are non-deterministic.
- The mismatch happened while executing 'call' (index '14')
- Difference:
   parameter: '{"job":{"id":"inv_1bDIFRtrYMPl0DXqSC9RamS0cM5c2nXNe1","data":{"bookmarkId":"ft760zlpdg46o1n49e2cg0vz"},"priority":50,"runNumber":0,"numRetriesLeft":5,"timeoutSecs":60},"error":{"name":"TerminalError","message":"unexpected error while reading the response body: error reading a body from connection","stack":"TerminalError: unexpected error while reading the response body: error reading a body from connection\n    at Failure (/app/node_modules/@restatedev/restate-sdk/dist/context_impl.js:421:15)\n    at RestateSinglePromise.completer (/app/node_modules/@restatedev/restate-sdk/dist/context_impl.js:380:50)\n    at runNextTicks (node:internal/process/task_queues:65:5)\n    at process.processImmediate (node:internal/timers:472:9)\n    at async RestateSinglePromise.tryComplete (/app/node_modules/@restatedev/restate-sdk/dist/promises.js:73:3)\n    at async PromisesExecutor.doProgressInner (/app/node_modules/@restatedev/restate-sdk/dist/promises.js:197:4)\n    at async <anonymous> (/app/node_modules/@restatedev/restate-sdk/dist/promises.js:216:5)"}}' != '{"job":{"id":"inv_1bDIFRtrYMPl0DXqSC9RamS0cM5c2nXNe1","data":{"bookmarkId":"ft760zlpdg46o1n49e2cg0vz"},"priority":50,"runNumber":0,"numRetriesLeft":5,"timeoutSecs":60},"error":{"name":"TerminalError","message":"unexpected error while reading the response body: error reading a body from connection","stack":"TerminalError: unexpected error while reading the response body: error reading a body from connection\n    at Failure (/app/node_modules/@restatedev/restate-sdk/dist/context_impl.js:421:15)\n    at RestateSinglePromise.completer (/app/node_modules/@restatedev/restate-sdk/dist/context_impl.js:380:50)\n    at async RestateSinglePromise.tryComplete (/app/node_modules/@restatedev/restate-sdk/dist/promises.js:73:3)\n    at async PromisesExecutor.doProgressInner (/app/node_modules/@restatedev/restate-sdk/dist/promises.js:197:4)\n    at async <anonymous> (/app/node_modules/@restatedev/restate-sdk/dist/promises.js:216:5)"}}'

Where the calls are the following:

Image

After beautifying the stacks:


  Original execution:
  TerminalError: unexpected error while reading the response body: error reading a body from connection
      at Failure (/app/node_modules/@restatedev/restate-sdk/dist/context_impl.js:421:15)
      at RestateSinglePromise.completer (/app/node_modules/@restatedev/restate-sdk/dist/context_impl.js:380:50)
      at runNextTicks (node:internal/process/task_queues:65:5)          ← EXTRA
      at process.processImmediate (node:internal/timers:472:9)          ← EXTRA
      at async RestateSinglePromise.tryComplete (/app/node_modules/@restatedev/restate-sdk/dist/promises.js:73:3)
      at async PromisesExecutor.doProgressInner (/app/node_modules/@restatedev/restate-sdk/dist/promises.js:197:4)
      at async <anonymous> (/app/node_modules/@restatedev/restate-sdk/dist/promises.js:216:5)

  Replay execution:
  TerminalError: unexpected error while reading the response body: error reading a body from connection
      at Failure (/app/node_modules/@restatedev/restate-sdk/dist/context_impl.js:421:15)
      at RestateSinglePromise.completer (/app/node_modules/@restatedev/restate-sdk/dist/context_impl.js:380:50)
      at async RestateSinglePromise.tryComplete (/app/node_modules/@restatedev/restate-sdk/dist/promises.js:73:3)
      at async PromisesExecutor.doProgressInner (/app/node_modules/@restatedev/restate-sdk/dist/promises.js:197:4)
      at async <anonymous> (/app/node_modules/@restatedev/restate-sdk/dist/promises.js:216:5)

Notice that the original execution contains two Node.js internal frames (runNextTicks and process.processImmediate) that the replay does not. Everything else in the two stacks is identical.
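
To make the comparison reproducible, here is a small sketch of how the two stacks can be diffed frame by frame. The function name extraFrames is hypothetical and this is only a debugging aid, not something the SDK provides. It compares frames as a set, so it reports which frames are missing but not where they sat in the stack.

```typescript
// Hypothetical debugging helper, not part of the Restate SDK.
// Returns the stack frames ("at ..." lines) that appear in `a` but not in `b`.
function extraFrames(a: string, b: string): string[] {
  // Keep only frame lines; the first line ("TerminalError: ...") is the message.
  const frames = (stack: string): string[] =>
    stack
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line.startsWith("at "));

  const inB = new Set(frames(b));
  return frames(a).filter((frame) => !inB.has(frame));
}
```

Calling extraFrames(originalStack, replayStack) on the two stacks above returns the runNextTicks and process.processImmediate frames, and calling it with the arguments swapped returns an empty array.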

It seems that the SDK doesn't store the stack traces of terminal failures in the journal (only the code and message) and instead reconstructs them during replay. This looks like a persistent source of non-determinism, so I wonder whether the SDK could nullify the stack (or something similar) to prevent users from shooting themselves in the foot.
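
In the meantime, a user-side workaround could look roughly like the sketch below. All names here (ErrorPayload, stripNodeInternalFrames, toReplaySafeError) are hypothetical and not SDK APIs. The idea is to build the error payload without the stack by default, and, if a stack is wanted, to drop the Node.js internal frames first. Stripping those frames removes the difference observed in this issue, but I have no guarantee that the remaining frames are always stable across replays, so omitting the stack entirely is the conservative option.

```typescript
// Hypothetical workaround sketch, not an SDK API.
interface ErrorPayload {
  name: string;
  message: string;
  stack?: string;
}

// Drops frames that point into Node.js internals, e.g.
// "    at runNextTicks (node:internal/process/task_queues:65:5)".
function stripNodeInternalFrames(stack: string): string {
  return stack
    .split("\n")
    .filter((line) => !/^\s*at\s(?:.*\()?node:/.test(line))
    .join("\n");
}

// Builds the payload passed to the second call. The stack is omitted unless
// explicitly requested, because it is the part that differed during replay.
function toReplaySafeError(error: unknown, includeStack = false): ErrorPayload {
  if (!(error instanceof Error)) {
    // Same fallback name as in the handler snippet above.
    return { name: "RPCError", message: String(error) };
  }
  const payload: ErrorPayload = { name: error.name, message: error.message };
  if (includeStack && error.stack !== undefined) {
    payload.stack = stripNodeInternalFrames(error.stack);
  }
  return payload;
}
```

In the handler above, this would replace the inline error object, e.g. runner.onError({ job: jobData, error: toReplaySafeError(res.error) }).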

Thanks!

Env

Node: 24
SDK version: 1.10.3
Server version: 1.6.1
