Go: Clean up most panics in FFI layer #3886

Open
wants to merge 52 commits into
base: main

Conversation

jonathanl-bq
Collaborator

Issue link

This Pull Request is linked to issue (URL): #3650

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one issue.
  • Commit message has a detailed description of what changed and why.
  • Tests are added or updated.
  • CHANGELOG.md and documentation files are updated.
  • Destination branch is correct - main or release
  • Create merge commit if merging release branch into main, squash otherwise.

Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: prateek-kumar-improving <[email protected]>

@yipin-chen yipin-chen requested a review from barshaul May 21, 2025 17:22
Collaborator

@barshaul barshaul left a comment

Starting to review, adding more points here:

  1. Instead of using "C-unwind", which AFAIK isn't supported in Go or most wrapper languages, and delegating the handling to the calling binding, I think we should change all functions to guarantee that they don't panic. Wherever we call unwrap or can otherwise panic, we should wrap the body in std::panic::catch_unwind() and convert a caught panic into a returned ClosingError (and close the internal client) on fatal errors, or into another relevant error on non-fatal errors (e.g. script_add shouldn't close the whole client, just fail the command), instead of crashing the app.
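
As a rough illustration of that suggestion, here is a minimal sketch of an entry point that catches a Rust panic and reports it as a status code. The status constants, `utf8_char_count`, and `checked_char_count` are hypothetical names invented for this example; the real FFI layer would return its own error type (such as a ClosingError) instead. A real export would also carry `#[no_mangle]`, which is left out here to keep the sketch standalone.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Hypothetical status codes, for illustration only.
pub const STATUS_OK: i32 = 0;
pub const STATUS_INVALID_ARG: i32 = 1;
pub const STATUS_PANIC: i32 = 2;

// Hypothetical inner logic that can panic: unwrap fails on invalid UTF-8.
fn utf8_char_count(bytes: &[u8]) -> usize {
    std::str::from_utf8(bytes).unwrap().chars().count()
}

/// Counts the characters in a UTF-8 buffer without letting a panic escape.
///
/// # Safety
/// If non-null, `ptr` must point to `len` readable bytes and `out` must be
/// valid for writing one `usize`.
pub unsafe extern "C" fn checked_char_count(
    ptr: *const u8,
    len: usize,
    out: *mut usize,
) -> i32 {
    // Reject null pointers up front instead of dereferencing them.
    if ptr.is_null() || out.is_null() {
        return STATUS_INVALID_ARG;
    }
    let bytes = unsafe { std::slice::from_raw_parts(ptr, len) };
    // Any panic raised by the closure is caught here and mapped to a code.
    match catch_unwind(AssertUnwindSafe(|| utf8_char_count(bytes))) {
        Ok(count) => {
            unsafe { *out = count };
            STATUS_OK
        }
        Err(_) => STATUS_PANIC,
    }
}
```

Note that `catch_unwind` only catches anything when the crate is built with the unwinding panic runtime; with `panic = "abort"` the process aborts before the `Err` arm is reached.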

ffi/src/lib.rs Outdated
pub unsafe extern "C-unwind" fn store_script(
script_bytes: *const u8,
script_len: usize,
) -> *mut c_char {
Member

Rust's primitive types have the same size on every OS, while C types are platform-dependent. I'm not sure it matters for char in this case: c_char is 1 byte everywhere, but its signedness differs between targets (i8 on some, u8 on others), so you may not be returning what you think you are. We need to check the size, or use per-platform builds, so that we use the right type on each platform. This is mainly a problem with c_long: on 64-bit Unix it is 8 bytes (i64), and everywhere else it is 4 bytes. That can cause a memory violation on Windows: when a value is sent back to Rust and declared as usize or i64, Rust will treat 4 extra bytes as its own; in the opposite direction, C on Windows will leave 4 bytes hanging.
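
To make the size concern concrete, here is a small sketch that reports the C type sizes for the current target and shows a fixed-width alternative. `FixedWidthResponse` and its fields are hypothetical and are not the project's actual CommandResponse layout.

```rust
use std::mem::size_of;
use std::os::raw::{c_char, c_long};

/// Returns (size of c_char, size of c_long) in bytes for the current target.
pub fn c_type_sizes() -> (usize, usize) {
    (size_of::<c_char>(), size_of::<c_long>())
}

/// Hypothetical response struct that uses fixed-width fields, so its layout
/// does not depend on how wide the target's C `long` is.
#[repr(C)]
pub struct FixedWidthResponse {
    pub int_value: i64,
    pub array_len: u64,
}

// Compile-time check: two 8-byte fields, 16 bytes on every target.
const _: () = assert!(size_of::<FixedWidthResponse>() == 16);
```

On a 64-bit Linux target `c_type_sizes()` is expected to report 8 bytes for `c_long`, while on 64-bit Windows it reports 4, which is the mismatch described above.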

Collaborator Author

This function returns binary data now, since returning a C string here was the wrong thing to do anyway. The hash could contain a NUL byte in the Rust string, so it makes more sense to return binary here.
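
A minimal sketch of why a pointer-plus-length return is safer than a C string when the data may contain a NUL byte. The helper names (`fits_in_c_string`, `into_raw_bytes`, `free_raw_bytes`) are hypothetical and are not the functions used in this PR.

```rust
use std::ffi::CString;

/// Returns true when `bytes` has no interior NUL byte and can therefore be
/// represented as a C string.
pub fn fits_in_c_string(bytes: &[u8]) -> bool {
    CString::new(bytes).is_ok()
}

/// Hypothetical helper: hands ownership of a buffer to the caller as a raw
/// pointer and writes its length to `out_len`.
pub fn into_raw_bytes(data: Vec<u8>, out_len: &mut usize) -> *mut u8 {
    let boxed: Box<[u8]> = data.into_boxed_slice();
    *out_len = boxed.len();
    Box::into_raw(boxed) as *mut u8
}

/// Reclaims and frees a buffer produced by `into_raw_bytes`.
///
/// # Safety
/// `ptr` and `len` must come from a single earlier call to `into_raw_bytes`,
/// and the buffer must not be freed twice.
pub unsafe fn free_raw_bytes(ptr: *mut u8, len: usize) {
    if !ptr.is_null() {
        let slice = std::ptr::slice_from_raw_parts_mut(ptr, len);
        drop(unsafe { Box::from_raw(slice) });
    }
}
```

With this shape the foreign caller never has to scan for a terminator, so embedded NUL bytes survive the round trip.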

ffi/src/lib.rs Outdated
@@ -28,7 +28,7 @@ use std::sync::Arc;
use std::{
ffi::{c_void, CString},
mem,
- os::raw::{c_char, c_double, c_long, c_ulong},
+ os::raw::{c_char, c_double, c_long},
Member

See my comments below on type size.

Member

I checked CommandResponse; we have problematic cases.

Collaborator Author

As discussed during our meeting, this will be addressed in a different PR. Issue: #3959

@@ -185,7 +188,7 @@ pub type FailureCallback = unsafe extern "C" fn(
/// The pointers are only valid during the callback execution and will be freed
/// automatically when the callback returns. Any data needed beyond the callback's
/// execution must be copied.
- pub type PubSubCallback = unsafe extern "C" fn(
+ pub type PubSubCallback = unsafe extern "C-unwind" fn(
Member

Are we addressing only panic issues in this PR?

Member

Can we create a struct (or similar) for the length and the parameter, instead of passing so many parameters?

Collaborator Author

@jonathanl-bq jonathanl-bq May 22, 2025

Yes, but then we will have to pass structs from the foreign language, which will exclude support for some languages. See here: https://jakegoulding.com/rust-ffi-omnibus/tuples/

Haskell doesn't support passing/returning raw structs/tuples and I suspect other languages might be in the same boat. We probably can't support every language anyway, but I wanted to be as inclusive as possible with my changes. We can discuss this though, and maybe outline what our generalized FFI layer should or should not support.

ffi/src/lib.rs Outdated
@@ -66,7 +69,7 @@ pub unsafe extern "C" fn store_script(script_bytes: *const u8, script_len: usize
///
/// * `hash` must be a valid null-terminated C string created by [`store_script`].
#[no_mangle]
- pub unsafe extern "C" fn drop_script(hash: *const c_char) {
+ pub unsafe extern "C-unwind" fn drop_script(hash: *mut c_char) {
let hash_str = unsafe { CStr::from_ptr(hash).to_str().unwrap_or("") };
Member

Why do we use unwrap_or("")? What does "" mean here? I'm trying to understand whether we have a problem and are abusing the fallback, or whether this is a valid case where we shouldn't panic. If an empty hash can legitimately reach drop_script (how?), or the hash can legitimately fail to convert to str (also, how?), then this is fine. Otherwise we are covering up a potential problem: we don't report that it happened, we might have UB, and we leave behind a script that was never cleaned up, which bloats the system.

Collaborator Author

This is addressed now. I've updated the store_script and remove_script functions to better try to avoid UB and panicking.

@@ -361,6 +364,7 @@ impl ClientAdapter {
///
/// For async clients, invokes the appropriate callback and returns null.
/// For sync clients, returns a `CommandResult`.
#[must_use]
Member

@barshaul Why did we mix async and sync in every function instead of separating the behavior? It's messy, and it will cause someone pain later. Reading this code is painful.

Collaborator

Because the logic between the async and sync client is the same other than its blocking/non blocking nature. I don't find it painful, but if you have a better suggestion please open a PR

Member

Collaborator Author

The PushInfo comes from the Valkey server, so this doesn't come from user input and shouldn't be problematic.

@@ -624,7 +629,7 @@ fn create_client_internal(
/// * Both the `success_callback` and `failure_callback` function pointers need to live while the client is open/active. The caller is responsible for freeing both callbacks.
Member

Collaborator Author

My mistake, I missed some of the doc changes. Will update.

@@ -624,7 +629,7 @@ fn create_client_internal(
/// * Both the `success_callback` and `failure_callback` function pointers need to live while the client is open/active. The caller is responsible for freeing both callbacks.
// TODO: Consider making this async
#[no_mangle]
- pub unsafe extern "C" fn create_client(
+ pub unsafe extern "C-unwind" fn create_client(
connection_request_bytes: *const u8,
connection_request_len: usize,
client_type: *const ClientType,
Member

std::slice::from_raw_parts: should we crash here? There is no client yet; can't we at least crash with an error message?

Collaborator Author

Do you mean for the null check that should be here for std::slice::from_raw_parts? I think I can add an assert here. Otherwise, not sure what else we can do.
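
One possible shape for such a null check, as a sketch only: a helper that returns `None` rather than asserting, so the caller can decide how to report the failure. `bytes_from_raw` is a hypothetical name, not something that exists in the PR.

```rust
/// Builds a byte slice from a raw pointer and length, returning `None` for
/// inputs that `std::slice::from_raw_parts` must never be given.
///
/// # Safety
/// If `ptr` is non-null, it must point to `len` readable bytes that stay
/// valid and unmodified for the lifetime `'a`.
pub unsafe fn bytes_from_raw<'a>(ptr: *const u8, len: usize) -> Option<&'a [u8]> {
    // from_raw_parts requires a non-null pointer, even when len is 0.
    if ptr.is_null() {
        return None;
    }
    // The total size of a slice must not exceed isize::MAX bytes.
    if len > isize::MAX as usize {
        return None;
    }
    Some(unsafe { std::slice::from_raw_parts(ptr, len) })
}
```

A caller such as create_client could then turn `None` into an error result instead of hitting undefined behaviour or a panic.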

ffi/src/lib.rs Outdated
@@ -1271,17 +1372,25 @@ pub unsafe extern "C" fn request_cluster_scan(
/// * `channel` must be valid until it is passed in a call to [`free_command_response`].
/// * Both the `success_callback` and `failure_callback` function pointers need to live while the client is open/active. The caller is responsible for freeing both callbacks.
#[no_mangle]
Member

Ignore, for myself.
I took a break somewhere above in the no comments zone.

@jonathanl-bq
Collaborator Author

Fair enough, but catch_unwind's behaviour isn't defined in all cases. See here: https://rust-lang.github.io/rfcs/2945-c-unwind-abi.html#guide-level-explanation

We have callbacks into foreign code, which could potentially unwind. If we try to use catch_unwind on those, it's UB. Best practice is to mark it C-unwind anyway, according to the guide level explanation, since it'll at least prevent UB from occurring. Correct me if I'm wrong, but here it states that there is no case in which a C-unwind ABI should result in UB: https://rust-lang.github.io/rfcs/2945-c-unwind-abi.html#abi-boundaries-and-unforced-unwinding. There's little we can do about those cases for most languages, but it's better than nothing. I think it makes sense to add additional panic guards though in cases where we can safely catch an unwind originating from Rust code.

@yipin-chen yipin-chen mentioned this pull request May 23, 2025
34 tasks
Collaborator

@barshaul barshaul left a comment

Initial comments that weren't published last week; continuing today.

ffi/src/lib.rs Outdated
Comment on lines 1259 to 1260
for i in 0..arg_count {
match arg_vec[i] {
Collaborator

@barshaul barshaul May 22, 2025

This is bad practice. Although arg_count should equal arg_vec.len(), since arg_vec was built from it, the loop isn't bounded by the vector's length, so we are exposed to index-out-of-range panics. The debug assertion you added doesn't help here, as it only runs in debug builds and does nothing in release builds. Also, the if arg_count < usize::MAX check is redundant when using an iterator.
Instead, iterate over the vector the idiomatic Rust way to avoid these issues.
For example:

let mut iter = arg_vec.iter().peekable();
while let Some(arg) = iter.next() {
    match *arg {
        b"MATCH" => {
            match iter.next() {
                Some(pat) => pattern = Some(pat),
                None => {
                    let err = RedisError::from((
                        ErrorKind::ClientError,
                        "No argument following MATCH.",
                    ));
                    return client_adapter.handle_error(err, channel);
                }
            }
        }
        b"TYPE" => {
            match iter.next() {
                Some(obj_type) => object_type = Some(obj_type),
                None => {
                    let err = RedisError::from((
                        ErrorKind::ClientError,
                        "No argument following TYPE.",
                    ));
                    return client_adapter.handle_error(err, channel);
                }
            }
        }
        b"COUNT" => {
            match iter.next() {
                Some(c) => count = Some(c),
                None => {
                    let err = RedisError::from((
                        ErrorKind::ClientError,
                        "No argument following COUNT.",
                    ));
                    return client_adapter.handle_error(err, channel);
                }
            }
        }
        _ => {
            // Unknown or unsupported arg — safely skip or log
            continue;
        }
    }
}
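
For reference, a self-contained variant of the same iterator pattern that compiles on its own. It is only a sketch: `ScanArgs` and `parse_scan_args` are hypothetical names, and a plain `String` error stands in for the RedisError and client_adapter.handle_error call used above.

```rust
/// Hypothetical container for the optional scan arguments.
#[derive(Debug, Default, PartialEq)]
pub struct ScanArgs<'a> {
    pub pattern: Option<&'a [u8]>,
    pub object_type: Option<&'a [u8]>,
    pub count: Option<&'a [u8]>,
}

/// Parses MATCH/TYPE/COUNT pairs without any index arithmetic.
pub fn parse_scan_args<'a>(args: &[&'a [u8]]) -> Result<ScanArgs<'a>, String> {
    let mut parsed = ScanArgs::default();
    let mut iter = args.iter().copied();
    while let Some(arg) = iter.next() {
        // Pick the field this keyword fills; unknown arguments are skipped.
        let slot = match arg {
            b"MATCH" => &mut parsed.pattern,
            b"TYPE" => &mut parsed.object_type,
            b"COUNT" => &mut parsed.count,
            _ => continue,
        };
        // The keyword's value is simply the next item, if there is one.
        match iter.next() {
            Some(value) => *slot = Some(value),
            None => {
                return Err(format!(
                    "No argument following {}.",
                    String::from_utf8_lossy(arg)
                ))
            }
        }
    }
    Ok(parsed)
}
```

Because the value is always fetched with `iter.next()`, a trailing keyword with no value produces an error rather than an out-of-bounds index.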

Collaborator Author

You're right, that's a lot cleaner than doing ugly and error-prone index math.

@@ -340,7 +340,7 @@ func (client *baseClient) executeCommandWithRoute(
client.coreClient,
C.uintptr_t(pinnedChannelPtr),
uint32(requestType),
- C.size_t(len(args)),
+ C.ulong(len(args)),
Collaborator

Why was this changed? The args length is usize in Rust. This change should fail on 32-bit platforms.

Collaborator

line 322: routeBytesPtr = (*C.uchar)(C.CBytes(msg))

// Go []byte slice to C array
// The C array is allocated in the C heap using malloc.
// It is the caller's responsibility to arrange for it to be
// freed, such as by calling C.free (be sure to include stdlib.h
// if C.free is needed).
func C.CBytes([]byte) unsafe.Pointer

C.CBytes allocates memory, so we should free it. Let's discuss the best option here.

Collaborator Author

I'm not sure why this change is here. @Yury-Fridlyand had it in his commits when I based my changes off of his. I ended up reverting a bunch of Go changes when merging main into this branch though, so it's gone now.

Collaborator Author

As discussed in our meeting today, I have created a new issue for the memory leaks here: #3983

This will be top priority after this PR.

@barshaul
Collaborator

barshaul commented May 25, 2025

catch_unwind is well-defined when handling panics originating from Rust.

The limitation from the RFC is this:

If a foreign exception (e.g., from C++, Go, etc.) enters a Rust frame and lands inside a catch_unwind, the behavior is undefined.

However, for a foreign exception to reach a catch_unwind in Rust, the foreign language must support unwinding across FFI boundaries — which the languages we use (like Go or Python) do not.

For example, in Go:

  • Go does not support stack unwinding across FFI.
  • If a Go function panics when called from C or Rust, the Go runtime detects the foreign stack and immediately aborts the process.

From Go’s cgo docs:
“If a Go function called by C panics and the panic is not recovered in the Go code, the process crashes.”

So if the callback panics, the app crashes—unwinding is never even attempted, and extern "C-unwind" becomes irrelevant for that scenario.

Suggestion

  1. Revert all functions to extern "C" (from extern "C-unwind").
  2. Make all FFI-facing Rust functions panic-safe using catch_unwind.
    Even if a function doesn’t directly use unwrap, expect, or contain obvious panic points, it may still call internal code that could panic. The best practice is to wrap every pub extern Rust function in catch_unwind to prevent panics from propagating across FFI boundaries.
  3. Make foreign callbacks panic-safe on their side, e.g., using recover() in Go.
  4. If a panic is caught on either of the sides, we should gracefully close the client and return a ClosingError to indicate an unrecoverable error that requires creating a new client. This is what we do on parsing errors in the wrappers based on the UDS.

For example

Wrapping all Rust functions with safe_ffi macro (from cpgt, needs to be verified):

#[macro_export]
macro_rules! safe_ffi {
    ($body:block, $on_panic:expr) => {{
        use std::panic::{catch_unwind, AssertUnwindSafe};
        match catch_unwind(AssertUnwindSafe(|| $body)) {
            Ok(result) => result,
            Err(_) => {
                // type_name::<fn()>() would only ever print "fn()", so log a fixed message instead.
                eprintln!("Panic caught in FFI function");
                $on_panic
            }
        }
    }};
}

#[no_mangle]
pub extern "C" fn command(...) -> *mut CommandResult {
    safe_ffi!({
       // command function logic
    }, std::ptr::null_mut()) // TODO: return something like CommandResult::ClosingError instead of null
}

Safe callback for Go (from cpgt, needs to be verified):

func successCallback(channelPtr unsafe.Pointer, cResponse *C.struct_CommandResponse) {
	defer func() {
		if r := recover(); r != nil {
			// DO NOT let the panic escape.
			// Add logic to gracefully close the client and return a closing error.
		}
	}()

	response := cResponse
	resultChannel := *(*chan payload)(getPinnedPtr(channelPtr))
	resultChannel <- payload{value: response, error: nil}
}

ffi/src/lib.rs Outdated
Comment on lines 1266 to 1268
fn drop(&mut self) {
let c_str = unsafe { CStr::from_ptr(self.cursor) };
let temp_str = c_str.to_str().expect("Must be UTF-8");
Collaborator

This can still panic, suggestion:

fn drop(&mut self) {
    if let Ok(temp_str) = unsafe { CStr::from_ptr(self.cursor).to_str() } {
        glide_core::cluster_scan_container::remove_scan_state_cursor(temp_str.to_string());
    }
}

Collaborator Author

It shouldn't anyway, since the only way to construct this is via the smart constructor provided, but I suppose this is still safer.

@jonathanl-bq jonathanl-bq changed the base branch from go/script-eval-load to main May 26, 2025 00:36
@jonathanl-bq
Collaborator Author

jonathanl-bq commented May 26, 2025

I discussed this with @avifenesh and he suggested not using catch_unwind since we're using the panic = "abort" runtime for release builds to get code size optimizations. I can add it anyway, in case we want to switch runtimes at some point or specifically add this for debug builds, but we have to come to an agreement.

I'm well aware that catch_unwind's behaviour is only undefined for unwinds from foreign languages (I've even talked about it in one of my videos: https://www.youtube.com/watch?v=5SgbUaiJBs4). You mention Go and Python specifically, but my understanding was that this is a generalized FFI layer, not specific to just Go and Python. Again, we need to come to an agreement on what languages we intend to support. I'll gladly change the ABI back, but we need to decide whether or not we want to exclude languages that do support stack unwinding across FFI boundaries. Or maybe we can make this a conditional compilation flag depending on the wrapper language using this layer.

EDIT: Adding on to this, I do agree that in some cases we can probably just leave it as extern "C", such as when we don't call a foreign callback. Let's discuss this with the rest of the team tomorrow and come to an agreement, though.

Signed-off-by: Jonathan Louie <[email protected]>
@jonathanl-bq
Collaborator Author

jonathanl-bq commented May 27, 2025

Some findings on unwinding with Go:

Raw Findings

All of the following were run with panic = "unwind" (default) panic runtime.

Calling Rust function that panics from Go with extern "C-unwind":

[jlouie@nixos:~/MixedLanguage/go-ffi-test]$ go run .

thread '<unnamed>' panicked at src/lib.rs:3:5:
unwinding from Rust
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
SIGABRT: abort
PC=0x7ffff7c99cdc m=0 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 1 gp=0xc000002380 m=0 mp=0x567560 [syscall]:
runtime.cgocall(0x493a60, 0xc00008cf08)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/cgocall.go:167 +0x4b fp=0xc00008cee0 sp=0xc00008cea8 pc=0x46794b
main._Cfunc_unwind()
_cgo_gotypes.go:48 +0x3a fp=0xc00008cf08 sp=0xc00008cee0 pc=0x49397a
main.main()
/home/jlouie/MixedLanguage/go-ffi-test/main.go:10 +0x13 fp=0xc00008cf50 sp=0xc00008cf08 pc=0x4939b3
runtime.main()
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:283 +0x28b fp=0xc00008cfe0 sp=0xc00008cf50 pc=0x43ad4b
runtime.goexit({})
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00008cfe8 sp=0xc00008cfe0 pc=0x46f8c1

goroutine 2 gp=0xc000002e00 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:435 +0xce fp=0xc00007cfa8 sp=0xc00007cf88 pc=0x46956e
runtime.goparkunlock(...)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:441
runtime.forcegchelper()
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:348 +0xb3 fp=0xc00007cfe0 sp=0xc00007cfa8 pc=0x43b093
runtime.goexit({})
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00007cfe8 sp=0xc00007cfe0 pc=0x46f8c1
created by runtime.init.7 in goroutine 1
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:336 +0x1a

goroutine 3 gp=0xc000003340 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:435 +0xce fp=0xc00007d780 sp=0xc00007d760 pc=0x46956e
runtime.goparkunlock(...)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:441
runtime.bgsweep(0xc00002c080)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mgcsweep.go:276 +0x94 fp=0xc00007d7c8 sp=0xc00007d780 pc=0x426c14
runtime.gcenable.gowrap1()
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mgc.go:204 +0x25 fp=0xc00007d7e0 sp=0xc00007d7c8 pc=0x41b345
runtime.goexit({})
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00007d7e8 sp=0xc00007d7e0 pc=0x46f8c1
created by runtime.gcenable in goroutine 1
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mgc.go:204 +0x66

goroutine 4 gp=0xc000003500 m=nil [GC scavenge wait]:
runtime.gopark(0xc00002c080?, 0x4dc570?, 0x1?, 0x0?, 0xc000003500?)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:435 +0xce fp=0xc00007df78 sp=0xc00007df58 pc=0x46956e
runtime.goparkunlock(...)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:441
runtime.(*scavengerState).park(0x566780)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc00007dfa8 sp=0xc00007df78 pc=0x4246c9
runtime.bgscavenge(0xc00002c080)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mgcscavenge.go:653 +0x3c fp=0xc00007dfc8 sp=0xc00007dfa8 pc=0x424c3c
runtime.gcenable.gowrap2()
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mgc.go:205 +0x25 fp=0xc00007dfe0 sp=0xc00007dfc8 pc=0x41b2e5
runtime.goexit({})
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00007dfe8 sp=0xc00007dfe0 pc=0x46f8c1
created by runtime.gcenable in goroutine 1
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mgc.go:205 +0xa5

goroutine 18 gp=0xc000102700 m=nil [finalizer wait]:
runtime.gopark(0x1b8?, 0xc000002380?, 0x1?, 0x23?, 0xc00007c688?)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:435 +0xce fp=0xc00007c630 sp=0xc00007c610 pc=0x46956e
runtime.runfinq()
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mfinal.go:196 +0x107 fp=0xc00007c7e0 sp=0xc00007c630 pc=0x41a307
runtime.goexit({})
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00007c7e8 sp=0xc00007c7e0 pc=0x46f8c1
created by runtime.createfing in goroutine 1
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mfinal.go:166 +0x3d

rax    0x0
rbx    0xa24a
rcx    0x7ffff7c99cdc
rdx    0x6
rdi    0xa24a
rsi    0xa24a
rbp    0x7ffff7f31300
rsp    0x7fffffffa320
r8     0x0
r9     0x0
r10    0x8
r11    0x246
r12    0x7ffff7fb9fc0
r13    0x6
r14    0x0
r15    0x7fffffffa568
rip    0x7ffff7c99cdc
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
exit status 2

Calling Rust function from Go with extern "C":

[jlouie@nixos:~/MixedLanguage/go-ffi-test]$ go run .

thread '<unnamed>' panicked at src/lib.rs:3:5:
unwinding from Rust
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread '<unnamed>' panicked at library/core/src/panicking.rs:218:5:
panic in a function that cannot unwind
stack backtrace:
   0:     0x7ffff7f8faa0 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h9edbd6e38a8b0805
   1:     0x7ffff7fa9553 - core::fmt::write::h7b1248e5e0c79c78
   2:     0x7ffff7f969a3 - std::io::Write::write_fmt::h5e301665499081bf
   3:     0x7ffff7f8f943 - std::sys::backtrace::BacktraceLock::print::h4a386d2ef944f43e
   4:     0x7ffff7f8c86c - std::panicking::default_hook::{{closure}}::h61b7aa0fc15f236b
   5:     0x7ffff7f8c779 - std::panicking::default_hook::h2d21379b0b23a14f
   6:     0x7ffff7f8ccdf - std::panicking::rust_panic_with_hook::h100726ba9570b85a
   7:     0x7ffff7f8fe36 - std::panicking::begin_panic_handler::{{closure}}::h141712493bfacf0c
   8:     0x7ffff7f8fca9 - std::sys::backtrace::__rust_end_short_backtrace::h891003731531c924
   9:     0x7ffff7f8c90d - rust_begin_unwind
  10:     0x7ffff7f6bdcd - core::panicking::panic_nounwind_fmt::ha2f9a57c040716ff
  11:     0x7ffff7f6be62 - core::panicking::panic_nounwind::h9817f69376d9bbc4
  12:     0x7ffff7f6bf25 - core::panicking::panic_cannot_unwind::h0197944997fd9fc5
  13:     0x7ffff7f6c380 - unwind
                               at /home/jlouie/MixedLanguage/go-ffi-test/rust/unwind/src/lib.rs:2:1
  14:           0x46f544 - <unknown>
thread caused non-unwinding panic. aborting.
SIGABRT: abort
PC=0x7ffff7c99cdc m=0 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 1 gp=0xc000002380 m=0 mp=0x567560 [syscall]:
runtime.cgocall(0x493a60, 0xc00008cf08)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/cgocall.go:167 +0x4b fp=0xc00008cee0 sp=0xc00008cea8 pc=0x46794b
main._Cfunc_unwind()
_cgo_gotypes.go:48 +0x3a fp=0xc00008cf08 sp=0xc00008cee0 pc=0x49397a
main.main()
/home/jlouie/MixedLanguage/go-ffi-test/main.go:10 +0x13 fp=0xc00008cf50 sp=0xc00008cf08 pc=0x4939b3
runtime.main()
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:283 +0x28b fp=0xc00008cfe0 sp=0xc00008cf50 pc=0x43ad4b
runtime.goexit({})
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00008cfe8 sp=0xc00008cfe0 pc=0x46f8c1

goroutine 2 gp=0xc000002e00 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:435 +0xce fp=0xc00007cfa8 sp=0xc00007cf88 pc=0x46956e
runtime.goparkunlock(...)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:441
runtime.forcegchelper()
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:348 +0xb3 fp=0xc00007cfe0 sp=0xc00007cfa8 pc=0x43b093
runtime.goexit({})
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00007cfe8 sp=0xc00007cfe0 pc=0x46f8c1
created by runtime.init.7 in goroutine 1
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:336 +0x1a

goroutine 3 gp=0xc000003340 m=nil [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:435 +0xce fp=0xc00007d780 sp=0xc00007d760 pc=0x46956e
runtime.goparkunlock(...)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:441
runtime.bgsweep(0xc00002c080)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mgcsweep.go:276 +0x94 fp=0xc00007d7c8 sp=0xc00007d780 pc=0x426c14
runtime.gcenable.gowrap1()
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mgc.go:204 +0x25 fp=0xc00007d7e0 sp=0xc00007d7c8 pc=0x41b345
runtime.goexit({})
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00007d7e8 sp=0xc00007d7e0 pc=0x46f8c1
created by runtime.gcenable in goroutine 1
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mgc.go:204 +0x66

goroutine 4 gp=0xc000003500 m=nil [GC scavenge wait]:
runtime.gopark(0xc00002c080?, 0x4dc570?, 0x1?, 0x0?, 0xc000003500?)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:435 +0xce fp=0xc00007df78 sp=0xc00007df58 pc=0x46956e
runtime.goparkunlock(...)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:441
runtime.(*scavengerState).park(0x566780)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mgcscavenge.go:425 +0x49 fp=0xc00007dfa8 sp=0xc00007df78 pc=0x4246c9
runtime.bgscavenge(0xc00002c080)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mgcscavenge.go:653 +0x3c fp=0xc00007dfc8 sp=0xc00007dfa8 pc=0x424c3c
runtime.gcenable.gowrap2()
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mgc.go:205 +0x25 fp=0xc00007dfe0 sp=0xc00007dfc8 pc=0x41b2e5
runtime.goexit({})
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00007dfe8 sp=0xc00007dfe0 pc=0x46f8c1
created by runtime.gcenable in goroutine 1
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mgc.go:205 +0xa5

goroutine 18 gp=0xc000102700 m=nil [finalizer wait]:
runtime.gopark(0x1b8?, 0xc000002380?, 0x1?, 0x23?, 0xc00007c688?)
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/proc.go:435 +0xce fp=0xc00007c630 sp=0xc00007c610 pc=0x46956e
runtime.runfinq()
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mfinal.go:196 +0x107 fp=0xc00007c7e0 sp=0xc00007c630 pc=0x41a307
runtime.goexit({})
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc00007c7e8 sp=0xc00007c7e0 pc=0x46f8c1
created by runtime.createfing in goroutine 1
/nix/store/5xvi25nqmbrg58aixp4zgczilfnp7pwg-go-1.24.3/share/go/src/runtime/mfinal.go:166 +0x3d

rax    0x0
rbx    0xa72b
rcx    0x7ffff7c99cdc
rdx    0x6
rdi    0xa72b
rsi    0xa72b
rbp    0x7ffff7f31300
rsp    0x7fffffffa2f0
r8     0x0
r9     0x0
r10    0x8
r11    0x246
r12    0x7ffff7fb9fc0
r13    0x6
r14    0x0
r15    0x7fffffffa4d8
rip    0x7ffff7c99cdc
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
exit status 2

Regular panic in Go (control test):

[jlouie@nixos:~/MixedLanguage/go-ffi-test]$ go run .
panic: oh no

goroutine 1 [running]:
main.main()
/home/jlouie/MixedLanguage/go-ffi-test/main.go:10 +0x25
exit status 2

Sandwiched Rust frames between Go frames, where Go code panics and unwinds through Rust back into Go (output was the same regardless of extern "C" or extern "C-unwind"):

[jlouie@nixos:~/MixedLanguage/go-ffi-test]$ go run .
panic: oh no

goroutine 1 [running]:
main.goCallback(...)
/home/jlouie/MixedLanguage/go-ffi-test/callbacks.go:8
main._Cfunc_unwind(0x493b00)
_cgo_gotypes.go:54 +0x3a
main.main()
/home/jlouie/MixedLanguage/go-ffi-test/main.go:14 +0x1e
exit status 2

Conclusions

It appears that the Go runtime will simply abort the process if a Rust panic unwinds into it, so extern "C-unwind" shouldn't cause undefined behaviour in this case.

However, when Go panics and unwinds through Rust back into Go, the backtrace stays intact. I did a bit of digging to see whether unwinding across the FFI boundary in Go is actually considered undefined behaviour, and there isn't much written about it (Google's Gemini result doesn't cite any real sources when it says this is UB). Most sources I found state that almost nothing is considered UB in Go, aside from data races. Go does have a notion of "implementation defined behaviour", but I couldn't find any standard or guideline on UB itself.

As a result, I don't think it's necessarily wrong to use extern "C-unwind" here for Go, especially if the backtraces come out just fine. The lack of a real standard does make this kind of iffy, though. I think the best solution is to implement a conditional compilation flag anyway, which would make it very easy to switch between the two ABIs. There's also the chance that a new language could become very popular and explicitly support unwinding across FFI boundaries, so we should probably at least allow the option to specify extern "C-unwind" for sandwiched frames.
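
To make the conditional compilation idea concrete, here is a minimal sketch of how the ABI switch could look on the Rust side. The feature name `c-unwind`, the `ffi_fn!` macro, the `Callback` alias, and both function names are hypothetical and are not taken from this PR; the feature would also need to be declared under `[features]` in `Cargo.toml`.

```rust
// Hypothetical Cargo feature "c-unwind" selects the ABI for sandwiched frames.
// With the feature off (the default here), everything uses plain extern "C".

// The callback type must switch together with the functions that receive it.
#[cfg(feature = "c-unwind")]
pub type Callback = extern "C-unwind" fn(i32) -> i32;
#[cfg(not(feature = "c-unwind"))]
pub type Callback = extern "C" fn(i32) -> i32;

// Hypothetical helper macro: emits the same function body under either ABI.
// A real export would additionally need `#[unsafe(no_mangle)]`
// (or `#[no_mangle]` on editions before 2024) so the foreign side can link to it.
macro_rules! ffi_fn {
    (fn $name:ident($($arg:ident: $ty:ty),*) -> $ret:ty $body:block) => {
        #[cfg(feature = "c-unwind")]
        pub extern "C-unwind" fn $name($($arg: $ty),*) -> $ret $body

        #[cfg(not(feature = "c-unwind"))]
        pub extern "C" fn $name($($arg: $ty),*) -> $ret $body
    };
}

ffi_fn! {
    // Sandwiched frame: foreign code calls this, and it calls back out.
    fn call_through(cb: Callback, x: i32) -> i32 {
        cb(x) + 1
    }
}

ffi_fn! {
    // Stand-in for a callback supplied by the foreign side.
    fn sample_callback(x: i32) -> i32 {
        x * 2
    }
}
```

With this layout, building with `--features c-unwind` would switch every function declared through the macro, along with the callback type, in one place. Whether the default should be "C" or "C-unwind" is left open here.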
