Non-roundtrippable objects #931

Open
@progval

Hi,

Continuing on this comment:

"It looks like some of these will break roundtripping, which is a property I'd like to maintain."

Here are some other sources of non-roundtrippability unrelated to timezones (I listed non-roundtrippable timezones in that comment):

dir entry modes

A tree entry mode written as 040000 instead of 40000 (found in 37k old directories generated by GitHub, and by a Ruby library that GitHub may or may not have been using) is rewritten to 40000 by Dulwich on serialization, so the tree does not round-trip:

>>> b = b'040000 example\x00\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa'
>>> t = dulwich.objects.Tree.from_string(b)
>>> t._needs_serialization = True
>>> t.as_raw_string()
b'40000 example\x00\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa'

I have also seen about 1.4k commits with other types of broken permissions, most of which don't round-trip.
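To spot such entries without going through Dulwich, a minimal sketch along these lines could work. It assumes SHA-1 trees (20-byte ids), and the helper names `parse_tree` and `non_canonical_modes` are made up for this illustration; they are not part of Dulwich.

```python
def parse_tree(raw: bytes, hash_len: int = 20):
    """Yield (mode_bytes, name, sha) for each entry of a raw git tree body."""
    pos = 0
    while pos < len(raw):
        space = raw.index(b" ", pos)       # the mode ends at the first space
        nul = raw.index(b"\x00", space)    # the name ends at the first NUL
        mode = raw[pos:space]
        name = raw[space + 1:nul]
        sha = raw[nul + 1:nul + 1 + hash_len]
        yield mode, name, sha
        pos = nul + 1 + hash_len


def non_canonical_modes(raw: bytes):
    """Return (name, mode, canonical) for modes that change when re-encoded as octal."""
    bad = []
    for mode, name, _sha in parse_tree(raw):
        canonical = format(int(mode, 8), "o").encode("ascii")
        if mode != canonical:
            bad.append((name, mode, canonical))
    return bad


padded = b"040000 example\x00" + b"\xaa" * 20
print(non_canonical_modes(padded))
```

This only catches modes whose text changes when re-encoded (such as zero padding); other kinds of broken permissions would need their own checks.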

dir entry order

2k trees with various types of disordered entries:

>>> b = b'10644 example0\x00\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa40000 example\x00\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb'
>>> t = dulwich.objects.Tree.from_string(b)
>>> t._needs_serialization = True
>>> t.as_raw_string()
b'40000 example\x00\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb\xbb10644 example0\x00\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa'

This usually happened when people committed trees created with home-made git implementations that sorted entries incorrectly (did not sort at all, used a naive sort instead of git's, wrote all file entries before all dir entries, etc.)
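For illustration, here is a rough sketch of how a naive sort and git's tree ordering can disagree. It relies on the assumption that git compares directory names as if they ended with a `/`; the function name `git_tree_sort_key` and the sample entries are hypothetical.

```python
import stat


def git_tree_sort_key(entry):
    """Sort key for a (mode, name) pair: directory names compare as name + b'/'."""
    mode, name = entry
    return name + b"/" if stat.S_ISDIR(mode) else name


entries = [
    (0o100644, b"example.txt"),
    (0o040000, b"example"),
    (0o100644, b"example0"),
]

naive_order = [name for _mode, name in sorted(entries, key=lambda e: e[1])]
git_order = [name for _mode, name in sorted(entries, key=git_tree_sort_key)]

print(naive_order)  # [b'example', b'example.txt', b'example0']
print(git_order)    # [b'example.txt', b'example', b'example0']
```

The two orders differ because `.` (0x2e) sorts before `/` (0x2f), which sorts before `0` (0x30), so the directory `example` lands between the two files under git's ordering.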

disordered commit headers

There are many types of these, e.g. with nonce or encoding added after gpgsig (I don't have stats on these, but I don't think there are more than 10k):

>>> b = b'tree aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa \nparent bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb\nauthor John Doe <[email protected]> 1614159930 +0100\ncommitter John Doe <[email protected]> 1614159930 +0100\ngpgsig abcd\nnonce efgh\n\nfoo'
>>> c = dulwich.objects.Commit.from_string(b)
>>> c.author = c.author
>>> c.as_raw_string()
b'tree aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa \nparent bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb\nauthor John Doe <[email protected]> 1614159930 +0100\ncommitter John Doe <[email protected]> 1614159930 +0100\nnonce efgh\ngpgsig abcd\n\nfoo'
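A quick way to compare header order before and after re-serialization, without depending on any library, might look like the sketch below. The helper `header_keys` is hypothetical, and it assumes that continuation lines of multi-line headers start with a space and that the first blank line separates the headers from the message.

```python
def header_keys(raw: bytes):
    """Return commit header names in the order they appear in a raw commit."""
    headers, _sep, _message = raw.partition(b"\n\n")
    keys = []
    for line in headers.split(b"\n"):
        if line.startswith(b" "):
            continue  # continuation of a multi-line header value (e.g. a signature)
        keys.append(line.split(b" ", 1)[0])
    return keys


before = b"tree a\nparent b\nauthor x\ncommitter y\ngpgsig abcd\nnonce efgh\n\nfoo"
after = b"tree a\nparent b\nauthor x\ncommitter y\nnonce efgh\ngpgsig abcd\n\nfoo"

print(header_keys(before))
print(header_keys(before) == header_keys(after))  # False: gpgsig and nonce swapped
```

The `before` and `after` values are shortened stand-ins for the raw commits shown above, keeping only the header order.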

Keep in mind these stats are over almost all publicly available git commits (i.e. about 2 billion), so each of them is a very small fraction of a percent and may not be worth caring about.

Either way, they are not an issue for me; I just thought they may be relevant to you.
