Skip to content

Commit 1dba1da

Browse files
leifericfclaude
andcommitted
Fix mino_pcall re-throw via set_eval_diag (STM lock leak)
mino_pcall's catch arm called set_eval_diag to publish the caught error's message via mino_last_error. But set_eval_diag itself longjmps to the next-outer try frame when one exists -- so any pcall caller that ran inside a Clojure try would see its catch path hijacked by the longjmp and unwind to the outer frame, leaking any bookkeeping the caller depended on. In the STM commit path, that bookkeeping was the global commit lock: a validator throw inside a (try (dosync ...) (catch ...)) would longjmp out of run_ref_validator past stm_unlock, leaving S->stm_commit_lock permanently held. The next dosync deadlocked. Extract a non-throwing variant record_eval_diag from set_eval_diag and route mino_pcall's catch arm through it. The throw-conversion path of set_eval_diag is preserved for its primary use case (turning eval-phase diagnostics into catchable exceptions). Adds tests/stm_test.clj validator-throw-does-not-deadlock-stm-lock: the first tx errors via retry exhaustion; the second tx completes instead of hanging. The agent code's agent_try_call (added in E.4 as a workaround) is left in place; a follow-up commit can fold it back into mino_pcall if preferred. Internal suite 1513 / 7169 / 0. External baseline unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent dfe7e7b commit 1dba1da

5 files changed

Lines changed: 94 additions & 17 deletions

File tree

CHANGELOG.md

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -361,6 +361,40 @@ fire watches anywhere.
361361
upstream, and the atom / ref / var arms pass cleanly. The agent
362362
arm of each still errors (out of scope until agents land).
363363

364+
### Fix `mino_pcall` re-throw via `set_eval_diag`
365+
366+
`mino_pcall`'s catch arm called `set_eval_diag` to publish the
367+
caught error's message via `mino_last_error` /
368+
`mino_last_error_map`. But `set_eval_diag` itself longjmps to the
369+
next-outer try frame when one exists -- so any pcall caller that
370+
ran inside a Clojure `try` would see its catch path hijacked by
371+
the longjmp and unwind to the outer frame, **leaking any
372+
bookkeeping the caller depended on**.
373+
374+
In the STM commit path, that bookkeeping was the global commit
375+
lock: a validator throw inside a `try (dosync ...) (catch ...)`
376+
would longjmp out of `run_ref_validator` past `stm_unlock`,
377+
leaving `S->stm_commit_lock` permanently held. The next
378+
`dosync` deadlocked.
379+
380+
Fix: extracted a non-throwing variant `record_eval_diag` from
381+
`set_eval_diag` and routed `mino_pcall`'s catch arm through it.
382+
The throw-conversion path of `set_eval_diag` is preserved for
383+
its primary use case (turning eval-phase diagnostics into
384+
catchable exceptions).
385+
386+
`tests/stm_test.clj`'s new
387+
`validator-throw-does-not-deadlock-stm-lock` regression test
388+
proves the lock is released after a validator throw (the first
389+
tx errors via retry exhaustion; the second tx completes
390+
normally instead of hanging).
391+
392+
The agent code's `agent_try_call` (added in E.4 to work around
393+
the same pcall bug) is left in place. It's not strictly
394+
necessary now but documents the historical pcall contract; a
395+
follow-up commit can fold it back into `mino_pcall` if the
396+
agent maintainers prefer one mechanism.
397+
364398
### Agents (MVP)
365399

366400
mino now ships agents: `agent`, `agent?`, `send`, `send-off`,

src/runtime/error.c

Lines changed: 28 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -87,6 +87,33 @@ void set_error_at(mino_state_t *S, const mino_val_t *form, const char *msg)
8787
}
8888
}
8989

90+
/* Store an eval-phase diagnostic without converting to a thrown
91+
* exception. Use this from places that need to record an error
92+
* without disturbing control flow -- in particular from a catch
93+
* arm that has already absorbed a thrown exception and now needs
94+
* to publish a message via mino_last_error / mino_last_error_map.
95+
*
96+
* set_eval_diag's longjmp path is the right behavior for primitive
97+
* error reporting (an error inside a try block becomes a catchable
98+
* exception). It is the wrong behavior in a catch arm: a second
99+
* longjmp would unwind further than the catcher intends, leaving
100+
* locks held and bookkeeping unrun. */
101+
void record_eval_diag(mino_state_t *S, const mino_val_t *form,
102+
const char *kind, const char *code, const char *msg)
103+
{
104+
mino_diag_t *d = diag_new(kind, code, "eval", msg);
105+
if (d != NULL && form != NULL && form->type == MINO_CONS
106+
&& form->as.cons.file != NULL && form->as.cons.line > 0) {
107+
mino_span_t span;
108+
memset(&span, 0, sizeof(span));
109+
span.file = form->as.cons.file;
110+
span.line = form->as.cons.line;
111+
span.column = form->as.cons.column;
112+
diag_set_span(d, span);
113+
}
114+
set_diag(S, d);
115+
}
116+
90117
/* Classified eval-phase diagnostic from a form with source info. */
91118
void set_eval_diag(mino_state_t *S, const mino_val_t *form,
92119
const char *kind, const char *code, const char *msg)
@@ -98,20 +125,7 @@ void set_eval_diag(mino_state_t *S, const mino_val_t *form,
98125
mino_current_ctx(S)->try_stack[mino_current_ctx(S)->try_depth - 1].exception = ex;
99126
longjmp(mino_current_ctx(S)->try_stack[mino_current_ctx(S)->try_depth - 1].buf, 1);
100127
}
101-
102-
{
103-
mino_diag_t *d = diag_new(kind, code, "eval", msg);
104-
if (d != NULL && form != NULL && form->type == MINO_CONS
105-
&& form->as.cons.file != NULL && form->as.cons.line > 0) {
106-
mino_span_t span;
107-
memset(&span, 0, sizeof(span));
108-
span.file = form->as.cons.file;
109-
span.line = form->as.cons.line;
110-
span.column = form->as.cons.column;
111-
diag_set_span(d, span);
112-
}
113-
set_diag(S, d);
114-
}
128+
record_eval_diag(S, form, kind, code, msg);
115129
}
116130

117131
/* Return a short human-readable label for a value's type. */

src/runtime/internal.h

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -891,6 +891,13 @@ const char *source_cache_get_line(mino_state_t *S, const char *file,
891891
void set_eval_diag(mino_state_t *S, const mino_val_t *form,
892892
const char *kind, const char *code,
893893
const char *msg);
894+
/* Non-throwing variant: stores the diagnostic without converting it
895+
* to a thrown exception. Use from catch arms that have already
896+
* absorbed a throw and just need to publish the message via
897+
* mino_last_error / mino_last_error_map. */
898+
void record_eval_diag(mino_state_t *S, const mino_val_t *form,
899+
const char *kind, const char *code,
900+
const char *msg);
894901
const char *type_tag_str(const mino_val_t *v); /* static string */
895902
void push_frame(mino_state_t *S, const char *name, /* name: borrowed */
896903
const char *file, int line, /* file: borrowed */

src/runtime/state.c

Lines changed: 8 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -767,14 +767,19 @@ int mino_pcall(mino_state_t *S, mino_val_t *fn, mino_val_t *args, mino_env_t *en
767767
load_stack_truncate(S, mino_current_ctx(S)->try_stack[saved_try].saved_load_len);
768768
mino_current_ctx(S)->try_depth = saved_try;
769769
/* Populate last_error from the exception value so the host
770-
* can inspect it via mino_last_error(). */
770+
* can inspect it via mino_last_error(). Use record_eval_diag
771+
* (non-throwing) -- set_eval_diag would re-throw to any
772+
* outer try frame, leaking out of pcall's catch arm and
773+
* skipping bookkeeping the caller depends on (e.g. the STM
774+
* commit lock). */
771775
if (mino_last_error(S) == NULL || mino_last_error(S)[0] == '\0') {
772776
const char *s = NULL;
773777
size_t slen = 0;
774778
if (ex != NULL && mino_to_string(ex, &s, &slen)) {
775-
set_eval_diag(S, mino_current_ctx(S)->eval_current_form, "eval/contract", "MCT001", s);
779+
(void)slen;
780+
record_eval_diag(S, mino_current_ctx(S)->eval_current_form, "eval/contract", "MCT001", s);
776781
} else {
777-
set_eval_diag(S, mino_current_ctx(S)->eval_current_form, "eval/contract", "MCT001", "unhandled exception");
782+
record_eval_diag(S, mino_current_ctx(S)->eval_current_form, "eval/contract", "MCT001", "unhandled exception");
778783
}
779784
}
780785
if (out != NULL) {

tests/stm_test.clj

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -154,3 +154,20 @@
154154

155155
(deftest type-keyword
156156
(is (= :ref (type (ref 1)))))
157+
158+
(deftest validator-throw-does-not-deadlock-stm-lock
159+
;; Regression: a validator that throws would previously leak the
160+
;; STM commit lock because mino_pcall's catch arm called set_eval_diag,
161+
;; which longjmps to any enclosing try frame -- bypassing tx_commit's
162+
;; stm_unlock. The next dosync would then hang on the leaked lock.
163+
;; Fixed by routing pcall's catch arm through record_eval_diag (the
164+
;; non-throwing variant). The first tx errors via the retry path; the
165+
;; second tx must complete instead of hanging.
166+
(let [r (ref 1)]
167+
(set-validator! r (fn [v] (if (zero? v)
168+
(throw (ex-info "validator-boom" {}))
169+
(pos? v))))
170+
(is (thrown? (dosync (alter r dec))))
171+
(is (= 1 @r))
172+
(dosync (alter r inc))
173+
(is (= 2 @r))))

0 commit comments

Comments
 (0)