-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Open
Description
Workin on #18192 it became clear to me that there are both shortcomings in the current design and competing interests in terms of what is supported / what state is passed around. I thought it would be worth sitting down for a bit and trying to come up with a new design that satisfies these concerns.
My design goals were:
- More flexible pre/post processing. The current
Codecsystem only operates as a "fallback" and doesn't support hooking into the moment before trying to deserialize or the moment after a successful deserialization. I have a lot of example use cases for this that I can share. - Optionality of features. Some use cases demand more information / capabilities for serialization / deserialization (e.g. distributed systems want a function registry, a cache similar to Deduplicate PhysicalExpr on proto ser/de using Arc pointer addresses #18192, etc. while something that is just looking to cross the FFI boundary might be okay with no registry/state and just failing if it encounters a UDF).
The design I landed on is something like this:
// DataFusion code
trait Deserializer {
fn deserialize_physical_expr(&mut self, proto: PhysicalExprNode) -> Result<Arc<dyn PhysicalExpr>> {
// most methods have default implementations so you only have to implement those you want to override
deserialize_physical_expr(self, proto)
}
fn deserialize_udf(&mut self, proto: ScalarUdfExprNode) -> Result<()> {
// default implementation errors, we don't have a registry to call. you must override if you do.
internal_datafusion_err!("you must implement deserialize_udf yourself and put the function into a function registry")
}
}
/// Both the top level entry point (if what you have is a PhysicalExprNode)
/// and the default implementation that will get delegated back to recursively
fn deserialize_physical_expr<D: Deserializer + ?Sized>(deserializer: &mut D, proto: PhysicalExprNode) -> Result<Arc<dyn PhysicalExpr>> {
let children: Vec<PhysicalExprNode> = vec![]; // Placeholder for child nodes, from proto.children()?
let deserialized_children = children
.into_iter()
// it's important here that we call back into the trait!
.map(|child_proto| deserializer.deserialize_physical_expr(child_proto))
.collect::<Result<Vec<_>>>()?;
// Reconstruct the node with it's children
todo!()
}
fn deserialize_udf<D: Deserializer + ?Sized>(deserializer: &mut D, registry: &dyn FunctionRegistry, proto: ScalarUdfExprNode) -> Result<()> {
todo!();
}
// Users code: wants to override deserialization for PhysicalExpr
struct MySerializer {
field: String,
cache: HashMap<String, Arc<dyn PhysicalExpr>>,
}
impl Deserializer for MySerializer {
fn deserialize_physical_expr(&mut self, proto: PhysicalExprNode) -> Result<Arc<dyn PhysicalExpr>> {
// can run arbitrary stuff before, short circuit, etc.
println!("Deserializing using MySerializer {}", self.field);
let res = deserialize_physical_expr(&mut *self, proto)?;
// can run arbitrary stuff after and replace the result if necessary
println!("Deserialized result: {:?}", res);
Ok(res)
}
}
// Users code: wants to support deserialization of UDFs
struct MyDeserializerWithUDF<'a> {
registry: &'a dyn FunctionRegistry,
}
impl Deserializer for MyDeserializerWithUDF<'_> {
fn deserialize_physical_expr(&mut self, proto: PhysicalExprNode) -> Result<Arc<dyn PhysicalExpr>> {
deserialize_physical_expr(self, proto)
}
fn deserialize_udf(&mut self, proto: ScalarUdfExprNode) -> Result<()> {
deserialize_udf(self, self.registry, proto)
}
}Before continuing down this path I want go gauge appetite for this change, and understand if it satisfies the use cases currently out there or falls short.
@timsaucer @milenkovicm @Jefffrey any thoughts?
Metadata
Metadata
Assignees
Labels
No labels