Skip to content

Redesign protobuf encode/decode hooks & state #18477

@adriangb

Description

@adriangb

Workin on #18192 it became clear to me that there are both shortcomings in the current design and competing interests in terms of what is supported / what state is passed around. I thought it would be worth sitting down for a bit and trying to come up with a new design that satisfies these concerns.

My design goals were:

  • More flexible pre/post processing. The current Codec system only operates as a "fallback" and doesn't support hooking into the moment before trying to deserialize or the moment after a successful deserialization. I have a lot of example use cases for this that I can share.
  • Optionality of features. Some use cases demand more information / capabilities for serialization / deserialization (e.g. distributed systems want a function registry, a cache similar to Deduplicate PhysicalExpr on proto ser/de using Arc pointer addresses #18192, etc. while something that is just looking to cross the FFI boundary might be okay with no registry/state and just failing if it encounters a UDF).

The design I landed on is something like this:

// DataFusion code
trait Deserializer {
    fn deserialize_physical_expr(&mut self, proto: PhysicalExprNode) -> Result<Arc<dyn PhysicalExpr>> {
        // most methods have default implementations so you only have to implement those you want to override
        deserialize_physical_expr(self, proto)
    }

    fn deserialize_udf(&mut self, proto: ScalarUdfExprNode) -> Result<()> {
        // default implementation errors, we don't have a registry to call. you must override if you do.
        internal_datafusion_err!("you must implement deserialize_udf yourself and put the function into a function registry")
    }
}

/// Both the top level entry point (if what you have is a PhysicalExprNode)
/// and the default implementation that will get delegated back to recursively
fn deserialize_physical_expr<D: Deserializer + ?Sized>(deserializer: &mut D, proto: PhysicalExprNode) -> Result<Arc<dyn PhysicalExpr>> {
    let children: Vec<PhysicalExprNode> = vec![]; // Placeholder for child nodes, from proto.children()?
    let deserialized_children = children
        .into_iter()
        // it's important here that we call back into the trait!
        .map(|child_proto| deserializer.deserialize_physical_expr(child_proto))
        .collect::<Result<Vec<_>>>()?;
    // Reconstruct the node with it's children
    todo!()
}


fn deserialize_udf<D: Deserializer + ?Sized>(deserializer: &mut D, registry: &dyn FunctionRegistry, proto: ScalarUdfExprNode) -> Result<()> {
    todo!();
}

// Users code: wants to override deserialization for PhysicalExpr
struct MySerializer {
    field: String,
    cache: HashMap<String, Arc<dyn PhysicalExpr>>,
}

impl Deserializer for MySerializer {
    fn deserialize_physical_expr(&mut self, proto: PhysicalExprNode) -> Result<Arc<dyn PhysicalExpr>> {
        // can run arbitrary stuff before, short circuit, etc.
        println!("Deserializing using MySerializer {}", self.field);
        let res = deserialize_physical_expr(&mut *self, proto)?;
        // can run arbitrary stuff after and replace the result if necessary
        println!("Deserialized result: {:?}", res);
        Ok(res)
    }
}

// Users code: wants to support deserialization of UDFs
struct MyDeserializerWithUDF<'a> {
    registry: &'a dyn FunctionRegistry,
}

impl Deserializer for MyDeserializerWithUDF<'_> {
    fn deserialize_physical_expr(&mut self, proto: PhysicalExprNode) -> Result<Arc<dyn PhysicalExpr>> {
        deserialize_physical_expr(self, proto)
    }

    fn deserialize_udf(&mut self, proto: ScalarUdfExprNode) -> Result<()> {
        deserialize_udf(self, self.registry, proto)
    }
}

Before continuing down this path I want go gauge appetite for this change, and understand if it satisfies the use cases currently out there or falls short.

@timsaucer @milenkovicm @Jefffrey any thoughts?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions