-
Notifications
You must be signed in to change notification settings - Fork 90
Go: Clean up most panics in FFI layer #3886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: prateek-kumar-improving <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: prateek-kumar-improving <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: prateek-kumar-improving <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
Signed-off-by: Prateek Kumar <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Starting to review, adding more points here:
- Instead of using "C-unwind" which AFAIK isn't supported in Go or most wrapper languages and delegating the handling to the calling binding, I think we should change all functions to guarantee that they don't panic. All places we do unwrap and we can potentially panic we should wrap the body in std::panic::catch_unwind() and convert it to a return ClosingError (&closing the internal client) on fatal errors, or other relevant error on non fatal errors (e.g. script_add shouldn't close the whole client but just fail the command) — instead of crashing the app.
ffi/src/lib.rs
Outdated
pub unsafe extern "C-unwind" fn store_script( | ||
script_bytes: *const u8, | ||
script_len: usize, | ||
) -> *mut c_char { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rust has permanent types between OSs, while C is OS dependent. Im not sure if it matter with char in this case, but char is 1 byte on all, but is i8 on unix and u8 on windows, so you are not returning what you think you are. We need to check size of, or feature build so we use the right type per platform. This is mainly problematic with c_long since on unix 64 it is 8byte, i64, and on everything else 4, which will cause memory violation in windows when sending back to rust and telling it this is usize or this is i64 and rust will treat extra 4 bytes as its own, or the opposite, c, on windows, will leave 4 bytes hanging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is returning binary now, since returning a C string here was the wrong thing to do anyway. The hash could contain a Nul byte in the Rust string, so it makes more sense to return binary here.
ffi/src/lib.rs
Outdated
@@ -28,7 +28,7 @@ use std::sync::Arc; | |||
use std::{ | |||
ffi::{c_void, CString}, | |||
mem, | |||
os::raw::{c_char, c_double, c_long, c_ulong}, | |||
os::raw::{c_char, c_double, c_long}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comments bellow on type size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i checked CommandResponse
, we have problematic cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed during our meeting, this will be addressed in a different PR. Issue: #3959
@@ -185,7 +188,7 @@ pub type FailureCallback = unsafe extern "C" fn( | |||
/// The pointers are only valid during the callback execution and will be freed | |||
/// automatically when the callback returns. Any data needed beyond the callback's | |||
/// execution must be copied. | |||
pub type PubSubCallback = unsafe extern "C" fn( | |||
pub type PubSubCallback = unsafe extern "C-unwind" fn( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we addressing in this pr only panics issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we create a struct or so for length and param, instead of passing so many parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but then we will have to pass structs from the foreign language, which will exclude support for some languages. See here: https://jakegoulding.com/rust-ffi-omnibus/tuples/
Haskell doesn't support passing/returning raw structs/tuples and I suspect other languages might be in the same boat. We probably can't support every language anyway, but I wanted to be as inclusive as possible with my changes. We can discuss this though, and maybe outline what our generalized FFI layer should or should not support.
ffi/src/lib.rs
Outdated
@@ -66,7 +69,7 @@ pub unsafe extern "C" fn store_script(script_bytes: *const u8, script_len: usize | |||
/// | |||
/// * `hash` must be a valid null-terminated C string created by [`store_script`]. | |||
#[no_mangle] | |||
pub unsafe extern "C" fn drop_script(hash: *const c_char) { | |||
pub unsafe extern "C-unwind" fn drop_script(hash: *mut c_char) { | |||
let hash_str = unsafe { CStr::from_ptr(hash).to_str().unwrap_or("") }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we use or "", what is the meaning of ""? im trying to understand if we have a problem here, and we abusing the or, or thats a valid case and we shouldn't panic. If its possible that empty drop script arrive (how) or that its not convertible to str (also, how?), its ok, otherwise, we are covering optional problem, we dont notifying that its happened, we might have ub, and we have a script that we didn't clean and we bloat the system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is addressed now. I've updated the store_script and remove_script functions to better try to avoid UB and panicking.
@@ -361,6 +364,7 @@ impl ClientAdapter { | |||
/// | |||
/// For async clients, invokes the appropriate callback and returns null. | |||
/// For sync clients, returns a `CommandResult`. | |||
#[must_use] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@barshaul Why we mixed async and sync in all function instead of separating behavior? its messy, and somebody will have pain on it later. Reading this code is painful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the logic between the async and sync client is the same other than its blocking/non blocking nature. I don't find it painful, but if you have a better suggestion please open a PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cant comment there directly
https://github.com/valkey-io/valkey-glide/pull/3886/files#diff-9f8aacd1a2531f6bbaed2e3290b7c5f18fbfb82147bcdafabb807cab66383715L523
Base on what we assume that it is 2 or 3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PushInfo comes from the Valkey server, so this doesn't come from user input and shouldn't be problematic.
@@ -624,7 +629,7 @@ fn create_client_internal( | |||
/// * Both the `success_callback` and `failure_callback` function pointers need to live while the client is open/active. The caller is responsible for freeing both callbacks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake, I missed some of the doc changes. Will update.
@@ -624,7 +629,7 @@ fn create_client_internal( | |||
/// * Both the `success_callback` and `failure_callback` function pointers need to live while the client is open/active. The caller is responsible for freeing both callbacks. | |||
// TODO: Consider making this async | |||
#[no_mangle] | |||
pub unsafe extern "C" fn create_client( | |||
pub unsafe extern "C-unwind" fn create_client( | |||
connection_request_bytes: *const u8, | |||
connection_request_len: usize, | |||
client_type: *const ClientType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::slice::from_raw_parts
should we crash? there is no client yet, cant we crash with error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean for the null check that should be here for std::slice::from_raw_parts
? I think I can add an assert here. Otherwise, not sure what else we can do.
ffi/src/lib.rs
Outdated
@@ -1271,17 +1372,25 @@ pub unsafe extern "C" fn request_cluster_scan( | |||
/// * `channel` must be valid until it is passed in a call to [`free_command_response`]. | |||
/// * Both the `success_callback` and `failure_callback` function pointers need to live while the client is open/active. The caller is responsible for freeing both callbacks. | |||
#[no_mangle] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignore, for myself.
I took a break somewhere above in the no comments zone.
Fair enough, but We have callbacks into foreign code, which could potentially unwind. If we try to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial comments that wasn't published last week, continuing today
ffi/src/lib.rs
Outdated
for i in 0..arg_count { | ||
match arg_vec[i] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this's a bad practice. Although arg_count should be equal to arg_vec.len(), as arg_vec was built based on it, it isn't bounded to it so we are exposed to index-out-of-range panics. the debug assertion you added isn't helpful here as it only runs on debug builds and wouldn't help in runtime. Also the if arg_count < usize::MAX
check is redundant if using iterator.
Instead, use the rust-way of iterating over the vector to avoid memory issues.
For example:
let mut iter = arg_vec.iter().peekable();
while let Some(arg) = iter.next() {
match *arg {
b"MATCH" => {
match iter.next() {
Some(pat) => pattern = Some(pat),
None => {
let err = RedisError::from((
ErrorKind::ClientError,
"No argument following MATCH.",
));
return client_adapter.handle_error(err, channel);
}
}
}
b"TYPE" => {
match iter.next() {
Some(obj_type) => object_type = Some(obj_type),
None => {
let err = RedisError::from((
ErrorKind::ClientError,
"No argument following TYPE.",
));
return client_adapter.handle_error(err, channel);
}
}
}
b"COUNT" => {
match iter.next() {
Some(c) => count = Some(c),
None => {
let err = RedisError::from((
ErrorKind::ClientError,
"No argument following COUNT.",
));
return client_adapter.handle_error(err, channel);
}
}
}
_ => {
// Unknown or unsupported arg — safely skip or log
continue;
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, that's a lot cleaner than doing ugly and error-prone index math.
go/api/base_client.go
Outdated
@@ -340,7 +340,7 @@ func (client *baseClient) executeCommandWithRoute( | |||
client.coreClient, | |||
C.uintptr_t(pinnedChannelPtr), | |||
uint32(requestType), | |||
C.size_t(len(args)), | |||
C.ulong(len(args)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this change? args len is usize in rust. this change should fail on 32-bit platform
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 322: routeBytesPtr = (*C.uchar)(C.CBytes(msg))
// Go []byte slice to C array
// The C array is allocated in the C heap using malloc.
// It is the caller's responsibility to arrange for it to be
// freed, such as by calling C.free (be sure to include stdlib.h
// if C.free is needed).
func C.CBytes([]byte) unsafe.Pointer
C.CBytes allocates memory - we should free it. Lets discuss the best option here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why this change is here. @Yury-Fridlyand had it in his commits when I based my changes off of his. I ended up reverting a bunch of Go changes when merging main into this branch though, so it's gone now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed in our meeting today, I have created a new issue for the memory leaks here: #3983
This will be top priority after this PR.
The limitation from the RFC is this:
However, for a foreign exception to reach a catch_unwind in Rust, the foreign language must support unwinding across FFI boundaries — which the languages we use (like Go or Python) do not. For example, in Go:
From Go’s cgo docs: So if the callback panics, the app crashes—unwinding is never even attempted, and extern "C-unwind" becomes irrelevant for that scenario. Suggestion
For exampleWrapping all Rust functions with safe_ffi macro (from cpgt, needs to be verified):#[macro_export]
macro_rules! safe_ffi {
($body:block, $on_panic:expr) => {{
use std::panic::{catch_unwind, AssertUnwindSafe};
match catch_unwind(AssertUnwindSafe(|| $body)) {
Ok(result) => result,
Err(_) => {
eprintln!("Panic caught in FFI function: {}", std::any::type_name::<fn()>());
$on_panic
}
}
}};
}
#[no_mangle]
pub extern "C" fn command(...) -> *mut CommandResult {
safe_ffi!({
// command function logic
}, std::ptr::null_mut()) // Change to return something like CommandResult::ClosingError,
} Safe callback for Go (from cpgt, needs to be verified):func successCallback(channelPtr unsafe.Pointer, cResponse *C.struct_CommandResponse) {
defer func() {
if r := recover(); r != nil {
// DO NOT let panic escape
// Add logic to gracefully close the client and return a closing error
}
}()
response := cResponse
resultChannel := *(*chan payload)(getPinnedPtr(channelPtr))
resultChannel <- payload{value: response, error: nil}
} |
ffi/src/lib.rs
Outdated
fn drop(&mut self) { | ||
let c_str = unsafe { CStr::from_ptr(self.cursor) }; | ||
let temp_str = c_str.to_str().expect("Must be UTF-8"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can still panic, suggestion:
fn drop(&mut self) {
if let Ok(temp_str) = unsafe { CStr::from_ptr(self.cursor).to_str() } {
glide_core::cluster_scan_container::remove_scan_state_cursor(temp_str.to_string());
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't anyway, since the only way to construct this is via the smart constructor provided, but I suppose this is still safer.
I discussed this with @avifenesh and he suggested not using I'm well aware that EDIT: Adding on to this, I do agree that some cases we probably can just leave it as |
Signed-off-by: Jonathan Louie <[email protected]>
Some findings on unwinding with Go: Raw FindingsAll of the following were run with Calling Rust function that panics from Go with
Calling Rust function from Go with
Regular panic in Go (control test):
Sandwiched Rust frames between Go frames, where Go code panics and unwinds through Rust back into Go (output was the same regardless of
ConclusionsIt appears that the Go runtime will simply abort the process if a Rust panic unwinds into it, so However, when Go panics and unwinds through Rust back into Go, the backtrace stays intact. I did a bit of digging to see if unwinding across the FFI boundary in Go is actually considered undefined behaviour, and there isn't really much on it (Google's Gemini result doesn't actually cite any real sources when it says this is UB). Most sources I found state that almost nothing is considered UB in Go, aside from data races. They do have a notion of "implementation defined behaviour", but they largely don't seem to have any kind of standard or guideline on UB itself that I could find. As a result, I don't think it's necessarily wrong to use |
Issue link
This Pull Request is linked to issue (URL): #3650
Checklist
Before submitting the PR make sure the following are checked: