Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TODO in Metafile.get_class() (storage/model/metafile.py) #501

Open
wants to merge 1 commit into
base: 2.0
Choose a base branch
from

Conversation

sahil-03
Copy link
Contributor

@sahil-03 sahil-03 commented Mar 9, 2025

Summary

Addressing TODO in storage/model/metafile.py in Metafile.get_class() function:

# TODO: more robust implementation. Right now this relies on the
#  assumption that XLocator key will only be present in class X, and
#  is brittle to renames. On the other hand, this implementation does
#  not require any marker fields to be persisted, and a regression
#  will be quickly detected by test_metafile.io or other unit tests

Rationale

A more robust get_class method. Right now, it is vulnerable to name changes and is somewhat "hard-coded". To make the implementation more reliable, I suggest adding a new field to each object where it can be reliably and easily recognized.

Changes

Changes will (potentially) include a new field being added to each object. So far, I have provided a design suggestion and sample implementation.

Impact

Discuss any potential impacts the changes may have on existing functionalities.

Testing

Describe how the changes have been tested, including both automated and manual testing strategies.
If this is a bugfix, explain how the fix has been tested to ensure the bug is resolved without introducing new issues.

Regression Risk

If this is a bugfix, assess the risk of regression caused by this fix and steps taken to mitigate it.

Checklist

  • Unit tests covering the changes have been added

    • If this is a bugfix, regression tests have been added
  • E2E testing has been performed

Additional Notes

Any additional information or context relevant to this PR.

@sahil-03 sahil-03 changed the title resolving get_class TODO; updated with suggestion + sample impl TODO in Metafile.get_class() (storage/model/metafile.py) Mar 9, 2025
Comment on lines +531 to +539
# Sample new implementation:
# if "_type" in serialized_dict:
# type_code = serialized_dict["_type"]
# if type_code in SOME_MAPPING:
# return SOME_MAPPING[type_code]
# else:
# raise an error
# else:
# raise and error
Copy link
Member

@pdames pdames Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the proposal - it's definitely an improvement on what's currently here. Space optimality of metafiles isn't the biggest concern right now, since we generally expect things like S3 put/get latency to outweigh the cost of downloading the bytes as long as they all stay in the KiB size range.

With that being said, I think I'd still prefer to just use an integer-based enum here to not waste much more space than we need to when serializing with msgpack. This will probably be most relevant for Deltas, which will commonly be present in the millions or billions per table, so extra bytes will start to add up here (which reminds me of another important TODO to store the largest part of the Delta - the Manifest - in a separate file whenever it's in-memory size is larger than, say, 128 KiB).

WDYT about an enum like this that we add to storage/model/types.py:

class MetafileType(int, Enum):
    NAMESPACE = 1
    TABLE = 2
    TABLE_VERSION = 3
    STREAM = 4
    PARTITION = 5
    DELTA = 6
    
    # usage examples
    >>> MetafileType(1)
    <MetafileType.NAMESPACE: 1>
    >>> MetafileType(1).value
    1
    >>> MetafileType(1).name
    'NAMESPACE'
    >>> MetafileType.NAMESPACE
    <MetafileType.NAMESPACE: 1>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants