-
Notifications
You must be signed in to change notification settings - Fork 0
Nico/eng 2799 fix erlcass segfault3 #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4 issues found across 19 files
Prompt for AI agents (all 4 issues)
Understand the root cause of the following 4 issues and fix them.
<file name="c_src/nif_cass_session.cc">
<violation number="1" location="c_src/nif_cass_session.cc:90">
The async session callbacks now dereference cb->session->connected, but the callback only keeps a raw enif_cass_session* without enif_keep_resource. If Erlang GC drops the session term before the future completes, nif_cass_session_free can free that memory, so this write touches freed memory and can segfault.</violation>
</file>
<file name="src/erlcass.app.src">
<violation number="1" location="src/erlcass.app.src:11">
Replacing the application callback with `{type, library}` removes the `erlcass` application module. OTP will now treat `erlcass` as a pure library, so `application:start(erlcass)` fails with `{error,{bad_type,library}}` and the supervisor in `erlcass_sup` never starts. Please keep the `{mod, {erlcass_app, []}}` entry so the application can boot.</violation>
</file>
<file name="benchmarks/load_test.erl">
<violation number="1" location="benchmarks/load_test.erl:87">
Stopping and restarting the supervisor here clears the newly prepared `execute_query` statement, yet the code never re-prepares it before `async_execute/2` runs, so the benchmark will crash once the driver restarts.</violation>
</file>
<file name="src/erlcass.erl">
<violation number="1" location="src/erlcass.erl:522">
When `session_prepare_cached_statements/1` fails part-way, we close the new session but leave any already-inserted prepared statement entries in ETS pointing at that closed session. Please clear `erlcass_stm_sessions` (e.g., call `erlcass_stm_sessions:reset()` before returning) in both error branches so callers cannot bind/execute against stale session pointers.</violation>
</file>
React with 👍 or 👎 to teach cubic. Mention @cubic-dev-ai to give feedback, ask questions, or re-run the review.
| { | ||
| result = cass_future_error_to_nif_term(cb->env, future); | ||
| if(cb->session != NULL) | ||
| cb->session->connected = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The async session callbacks now dereference cb->session->connected, but the callback only keeps a raw enif_cass_session* without enif_keep_resource. If Erlang GC drops the session term before the future completes, nif_cass_session_free can free that memory, so this write touches freed memory and can segfault.
Prompt for AI agents
Address the following comment on c_src/nif_cass_session.cc at line 90:
<comment>The async session callbacks now dereference cb->session->connected, but the callback only keeps a raw enif_cass_session* without enif_keep_resource. If Erlang GC drops the session term before the future completes, nif_cass_session_free can free that memory, so this write touches freed memory and can segfault.</comment>
<file context>
@@ -80,9 +84,17 @@ void on_session_connect(CassFuture* future, void* user_data)
+ {
result = cass_future_error_to_nif_term(cb->env, future);
+ if(cb->session != NULL)
+ cb->session->connected = false;
+ }
else
</file context>
| stdlib | ||
| ]}, | ||
| {mod, {erlcass_app, []}}, | ||
| {type, library}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replacing the application callback with {type, library} removes the erlcass application module. OTP will now treat erlcass as a pure library, so application:start(erlcass) fails with {error,{bad_type,library}} and the supervisor in erlcass_sup never starts. Please keep the {mod, {erlcass_app, []}} entry so the application can boot.
Prompt for AI agents
Address the following comment on src/erlcass.app.src at line 11:
<comment>Replacing the application callback with `{type, library}` removes the `erlcass` application module. OTP will now treat `erlcass` as a pure library, so `application:start(erlcass)` fails with `{error,{bad_type,library}}` and the supervisor in `erlcass_sup` never starts. Please keep the `{mod, {erlcass_app, []}}` entry so the application can boot.</comment>
<file context>
@@ -8,7 +8,7 @@
stdlib
]},
- {mod, {erlcass_app, []}},
+ {type, library},
{env, []},
</file context>
| {type, library}, | |
| {mod, {erlcass_app, []}}, |
| eprof:start_profiling([self()]), | ||
| ok = application:set_env(erlcass, keyspace, ?KEYSPACE), | ||
| start(), | ||
| ok = erlcass_sup:stop(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stopping and restarting the supervisor here clears the newly prepared execute_query statement, yet the code never re-prepares it before async_execute/2 runs, so the benchmark will crash once the driver restarts.
Prompt for AI agents
Address the following comment on benchmarks/load_test.erl at line 87:
<comment>Stopping and restarting the supervisor here clears the newly prepared `execute_query` statement, yet the code never re-prepares it before `async_execute/2` runs, so the benchmark will crash once the driver restarts.</comment>
<file context>
@@ -81,8 +84,8 @@ do_profiling(NrProc, RequestsNr, RepeatNumber) ->
eprof:start_profiling([self()]),
- ok = application:set_env(erlcass, keyspace, ?KEYSPACE),
- start(),
+ ok = erlcass_sup:stop(),
+ ok = start(#{keyspace => ?KEYSPACE}),
</file context>
| ok -> | ||
| {ok, SessionRef}; | ||
| {error, _} = Error -> | ||
| _ = do_close(SessionRef, self(), ?CONNECT_TIMEOUT), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When session_prepare_cached_statements/1 fails part-way, we close the new session but leave any already-inserted prepared statement entries in ETS pointing at that closed session. Please clear erlcass_stm_sessions (e.g., call erlcass_stm_sessions:reset() before returning) in both error branches so callers cannot bind/execute against stale session pointers.
Prompt for AI agents
Address the following comment on src/erlcass.erl at line 522:
<comment>When `session_prepare_cached_statements/1` fails part-way, we close the new session but leave any already-inserted prepared statement entries in ETS pointing at that closed session. Please clear `erlcass_stm_sessions` (e.g., call `erlcass_stm_sessions:reset()` before returning) in both error branches so callers cannot bind/execute against stale session pointers.</comment>
<file context>
@@ -479,6 +512,45 @@ session_prepare_cached_statements(SessionRef) ->
+ ok ->
+ {ok, SessionRef};
+ {error, _} = Error ->
+ _ = do_close(SessionRef, self(), ?CONNECT_TIMEOUT),
+ Error;
+ Other ->
</file context>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No issues found across 3 files
|
@codex review this PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| schedule_retry(State0, Reason) -> | ||
| State1 = cancel_timer(clear_child(State0)), | ||
| ok = erlcass_sup:release_cluster(), | ||
| Delay = maps:get(backoff, State1), | ||
| Next = next_backoff(Delay, maps:get(max_backoff, State1)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid releasing cluster while erlcass session keeps retrying
When a connection attempt fails, schedule_retry/2 releases the global Cassandra cluster but leaves the existing erlcass gen_server running and reconnecting on its own timer. The reconnect path (handle_info({'timeout',…, reconnect}, State) → attempt_connect/0) uses cass_session_connect/2 which expects cassandra_data->cluster to be non‑NULL. After release_cluster/0 that pointer is freed and set to NULL, so the next retry from the still‑running session will call into the NIF with a freed cluster pointer, leading to a VM crash when Cassandra is down at startup. Either stop the erlcass child before calling release_cluster/0 or pause its reconnect timer until a new cluster is created.
Useful? React with 👍 / 👎.
Note
Converts erlcass into an embeddable supervisor with exponential-backoff connection management and safer NIF prepare behavior, updates config to persistent_term, improves reconnection and ETS handling, and adjusts tests/benchmarks.
erlcass_sup(embeddable viachild_spec/1), addstop/0andrelease_cluster/0.erlcass_connection_managerfor cluster setup, monitoring, and exponential-backoff reconnect; prepares configured statements at startup.src/erlcass.erl: add reconnect scheduling (retry_ref,reconnect_delay), handle not-connected replies, and prepare cached statements on (re)connect.enif_cass_session; set on connect/close callbacks.cass_session_preparewhen not connected, returning"no Cassandra connections available"; add extra allocation/error checks.kNoConnectionsAvailableMsg.erlcass_utils(set_config/1,get_config/0,clear_config/0), supports maps/proplists; defaultreconnect_delay(?RECONNECT_DELAY).erlcass_stm_cachecreate idempotent; adderlcass_stm_sessions:reset/0and idempotent create.erlcass_sup:start_link/1andstop/0; addno_connection_prepare_SUITEasserting prepare fails without connections.erlcass.erlcass.app.src: set{type, library}; add?RECONNECT_DELAYin headers.AGENTS.mdand extend.gitignorefor CT artifacts.Written by Cursor Bugbot for commit cd1afec. This will update automatically on new commits. Configure here.