Skip to content

[#5202] feat(client-python): Support Gravitino Type Serdes - serializer #6903

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

tsungchih
Copy link
Contributor

What changes were proposed in this pull request?

This is the second part (totally 4 planned) of implementation to the following classes from Java to support Column and its default value, including:

  • JsonUtils
  • TypeSerializer

The TypeSerializer will be used in the incoming ColumnDTO implementation to serialize data_type field.

Why are the changes needed?

We need to support Column and its default value in python client.

#5202

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

for implementing various customized DataClassJson serializer/deserializer

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
aimed at facilitating customized DataClassJson serializer/descrializer

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
add write data type in SerdesUtils

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
add TypeSerializer

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
add unit tests for TypesSerializer

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
add write struct type in SerdesUtils

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
add unit tests for serializing struct type

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
def test_list_type_of_primitive_and_none_types_serdes(self):         for simple_string, type_ in self._primitive_and_none_types.items():             list_type = Types.ListType.of(element_type=type_, element_nullable=False)             serialized_result = TypesSerdes.serialize(list_type)             self.assertEqual(serialized_result.get(SerdesUtils.TYPE), SerdesUtils.LIST)             self.assertEqual(                 serialized_result.get(SerdesUtils.LIST_ELEMENT_TYPE), simple_string             )             self.assertEqual(                 serialized_result.get(SerdesUtils.LIST_ELEMENT_NULLABLE), False             )

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
add unit tests for serializing list type

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
add write map type in SerdesUtils

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
add unit tests for serializing map type

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
add write union type in SerdesUtils

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
add unit tests for serializing union type

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
add write external type in SerdesUtils

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
add unit tests for serializing external type

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
add write unparsed type in SerdesUtils

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
add unit tests for serializing unparsed type

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
revise TypeVar names to conform with naming patterns

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
@tsungchih tsungchih marked this pull request as ready for review April 13, 2025 05:44
@jerryshao jerryshao requested a review from xunliu April 18, 2025 09:13
@xunliu xunliu requested a review from unknowntpo April 22, 2025 03:55
@unknowntpo
Copy link
Collaborator

unknowntpo commented Apr 23, 2025

@tsungchih Could you explain why you need to implement a json_serdes util ? See what MetalakeDTO did, simply inherit fromDataClassJsonMixin in dataclasses_json library.

from dataclasses_json import DataClassJsonMixin, config

...
@dataclass
class MetalakeDTO(Metalake, DataClassJsonMixin):
     ...

Reference:

@tsungchih
Copy link
Contributor Author

tsungchih commented Apr 23, 2025

@tsungchih Could you explain why you need to implement a json_serdes util ? See what MetalakeDTO did, simply inherit fromDataClassJsonMixin in dataclasses_json library.

from dataclasses_json import DataClassJsonMixin, config

...
@dataclass
class MetalakeDTO(Metalake, DataClassJsonMixin):
     ...

Reference:

@unknowntpo Thanks for your comments.

That is because all of the current Gravitino Type defined in gravitino.api.types.types.Types, such as NullType, BooleanType, etc., do not support dataclass-json. Namely they do not inherit from DataClassJsonMixin in dataclasses_json library.

IMHO, here comes with two solutions to serialize/deserialize Type listed as follows.

  1. make all of the Types support for dataclass-json; (this will be a significant change to the interface and I'm not sure if it is acceptable)
  2. remain Types unchanged and define serializer/deserializer for them (this is the current PR)

I'm going to elaborate more by taking ShortType as an example. The current class definition is

class Types:
    class ShortType(IntegralType):
        _instance: Types.ShortType = None
        _unsigned_instance: Types.ShortType = None
        ...

We may have it to support for dataclass_json listed as follows.

class Types:
    @dataclass
    class ShortType(IntegralType, DataClassJsonMixin):
        _instance: Types.ShortType = None
        _unsigned_instance: Types.ShortType = None
        ...

Note that, however, the ShortType has two forms, signed and unsigned. We need to choose different serialize/deserialize methods for the instance of ShortType based on which form it is. I haven't figured out how to fulfill this case by simply inherit fromDataClassJsonMixin in dataclasses_json library from the documents.

Regards

tsungchih added 5 commits May 3, 2025 13:36
due to moving json_serdes

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
as one JsonSerializable interface

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
due to interface change

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
for TypeSerdes

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
@tsungchih
Copy link
Contributor Author

tsungchih commented May 4, 2025

@xunliu @unknowntpo Based on the comments, I made the following changes:

  • moved json_serdes into gravitino.api.types;
  • combined GenericJsonSerializer and GenericJsonDeserializer into one JasonSerializable interface;
  • rename TypeSerializer as TypeSerdes to implement the JsonSerializable interface;

In the end, the TypeSerdes will be used by ColumnDTO in the future such as

from gravitino.api.types.json_serdes.type_serdes import TypeSerdes

class ColumnDTO(Column):
    _data_type: Type = field(
        metadata=config(
            field_name="data_type",
            encoder=TypeSerdes.serialize,
            decoder=TypeSerdes.deserialize,
        )
    )

Please take a look at it when you're available. Thanks!


@classmethod
@abstractmethod
def serialize(cls, data_type: GravitinoTypeT) -> DeserializedTypeT:
Copy link
Collaborator

@unknowntpo unknowntpo May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsungchih Can we use Json to add some constraints to GravitinoTypeT and DeserializedTypeT ?

I found that in config.encoder, and config.decoder, it has some comments about the type of Callable, it says B is bound as a JSON type, and in dataclasses_json.core, it has Json type.
So I'm wondering we can use it to add constraints to GravitinoTypeT and DeserializedTypeT.

# in dataclasses-json cfg.py
def config(metadata: Optional[dict] = None, *,
           # TODO: these can be typed more precisely
           # Specifically, a Callable[A, B], where `B` is bound as a JSON type
           encoder: Optional[Callable] = None,
           decoder: Optional[Callable] = None,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@unknowntpo Sure, it would be better if we could have appropriate type bound. I didn’t know the Json. Allow me to investigate the type you mentioned. Thanks for your prompt comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@unknowntpo I submitted a commit to add for DeserializedTypeT to be bound as Json type and for GravitinoTypeT to be bound as Type. Thanks for the comments.

Copy link
Collaborator

@unknowntpo unknowntpo May 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsungchih Ok, but I found another problem,

def serialize(cls, data_type: GravitinoTypeT) -> DeserializedTypeT:

This is for serializing data_type to Json type, so I think DeserializedTypeT should be named as SerializedTypeT. Or we can simply use Json type.

def serialize(cls, data_type: GravitinoTypeT) -> Json:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@unknowntpo Yes, you're right. I didn't notice that. I changed it to use Json instead. Thanks for the comments.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

for GravitinoTypeT and DeserializedTypeT

apache#5202

Signed-off-by: George T. C. Lai <[email protected]>
@xunliu
Copy link
Member

xunliu commented May 10, 2025

hi @tsungchih Thank you for your contributions.
I read the PR for this and it mainly talks about how to do json serialization and deserialization, right?
But the title of the PR is column handling. Is it possible that the title of the PR should be changed?

@tsungchih tsungchih changed the title [#5202] feat(client-python): Support Column and its default value part2 - type serializer [#5202] feat(client-python): Support Gravitino Type Serdes - serializer May 12, 2025
@tsungchih
Copy link
Contributor Author

hi @tsungchih Thank you for your contributions. I read the PR for this and it mainly talks about how to do json serialization and deserialization, right? But the title of the PR is column handling. Is it possible that the title of the PR should be changed?

The title has been changed. Thanks for the suggestion.

@unknowntpo
Copy link
Collaborator

@xunliu LGTM.

Copy link
Contributor

@jerryshao jerryshao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @tsungchih for your work.

@jerryshao jerryshao merged commit cd78753 into apache:main May 23, 2025
27 checks passed
vishnu-chalil pushed a commit to vishnu-chalil/gravitino that referenced this pull request Jul 14, 2025
…rializer (apache#6903)

### What changes were proposed in this pull request?

This is the second part (totally 4 planned) of implementation to the
following classes from Java to support Column and its default value,
including:

- JsonUtils
- TypeSerializer

The `TypeSerializer` will be used in the incoming `ColumnDTO`
implementation to serialize `data_type` field.

### Why are the changes needed?

We need to support Column and its default value in python client.

apache#5202 

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

---------

Signed-off-by: George T. C. Lai <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants