shrinking Entry size #153

@CosmicHorrorDev

(Entry is entirely public, so any changes would have to be a breaking change)

this is a follow-up to:

just the size_of::<rc_zip::parse::Entry>() alone is 144 bytes, which is pretty large when there are very very many of them. there are quite a few ways that it can be pruned to be a bit smaller, so let's have a look at things (by my best guessing)

struct Entry {
    // beeg
    name: String, // <---\
    comment: String, // <-x- 24 bytes each
    // medium
    modified: DateTime<Utc>, // <-------\
    created: Option<DateTime<Utc>>, // <-x
    accessed: Option<DateTime<Utc>>, // <-x- 12 bytes each
    uid: Option<u32>, // <--x- 8 bytes each (probably bc alignment? 🥴)
    gid: Option<u32>, // <-/
    // smol
    method: Method, // <---------\
    reader_version: Version, // <-x- 2 bytes (by itself 4, but variants are stored in niches)
    // fixed, no realistic gains to be had
    compressed_size: u64, // <--\
    uncompressed_size: u64, // <-x- 8 bytes each
    header_offset: u64, // <----/
    mode: Mode, // <--x- 4 bytes each
    crc32: u32, // <-/
    flags: u16, // 2 bytes
}

in short that's

  • beeg: 48 bytes (!!)
  • medium: 52 bytes
  • smol: 4 bytes
  • fixed: 34 bytes

which comes out to a total of 138. round that up to the next multiple of 8 for alignment and you land on 144, so it looks like things make sense
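
a quick way to sanity-check those guesses is to ask the compiler. the sketch below only uses std, so `DateTimeStandIn` is a hypothetical stand-in that assumes chrono's DateTime<Utc> really is one NonZeroI32 plus two u32s; the numbers are what a typical 64-bit target gives, not something the language guarantees.

```rust
use std::mem::size_of;
use std::num::NonZeroI32;

/// Hypothetical stand-in for `DateTime<Utc>`: assumes a NonZeroI32 date plus
/// two u32s for the time, with a zero-sized offset.
#[allow(dead_code)]
pub struct DateTimeStandIn {
    date: NonZeroI32,
    secs: u32,
    frac: u32,
}

/// Sizes of the building blocks that the tally above relies on.
pub fn field_sizes() -> [(&'static str, usize); 5] {
    [
        ("String", size_of::<String>()),
        ("DateTime stand-in", size_of::<DateTimeStandIn>()),
        ("Option<DateTime stand-in>", size_of::<Option<DateTimeStandIn>>()),
        ("Option<u32>", size_of::<Option<u32>>()),
        ("u64", size_of::<u64>()),
    ]
}

/// Round `n` up to the next multiple of `align` (struct sizes are always a
/// multiple of the struct's alignment).
pub fn round_up(n: usize, align: usize) -> usize {
    (n + align - 1) / align * align
}
```

printing `field_sizes()` and calling `round_up(48 + 52 + 4 + 34, 8)` is enough to compare against the breakdown above; running the same `size_of` calls on the real rc_zip and chrono types would be the authoritative check.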

to get the simple ones out of the way:

  • fixed - what can ya do 🤷
  • medium - mostly types using DateTime. it looks like internally this is 2 u32s and one NonZeroI32 which provides the niche for Option. the uid and gid can potentially be packed down a bit
  • smol - same deal as uid and gid, although the best you can expect is some modest gains
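
one possible way to pack uid and gid, as a rough sketch: keep the raw u32s and move both "is it present" bits into a single byte. `Ids`, `HAS_UID` and `HAS_GID` are made-up names for illustration, not anything that exists in rc_zip.

```rust
/// Hypothetical packed replacement for `uid: Option<u32>` + `gid: Option<u32>`.
#[derive(Clone, Copy, Default)]
pub struct Ids {
    uid: u32,
    gid: u32,
    /// bit 0 set: uid is present, bit 1 set: gid is present
    present: u8,
}

const HAS_UID: u8 = 0b01;
const HAS_GID: u8 = 0b10;

impl Ids {
    pub fn new(uid: Option<u32>, gid: Option<u32>) -> Self {
        let mut present = 0;
        if uid.is_some() {
            present |= HAS_UID;
        }
        if gid.is_some() {
            present |= HAS_GID;
        }
        Ids {
            uid: uid.unwrap_or(0),
            gid: gid.unwrap_or(0),
            present,
        }
    }

    pub fn uid(&self) -> Option<u32> {
        ((self.present & HAS_UID) != 0).then_some(self.uid)
    }

    pub fn gid(&self) -> Option<u32> {
        ((self.present & HAS_GID) != 0).then_some(self.gid)
    }
}
```

on its own this is 12 bytes instead of 16 once padding is counted. if the three fields were flattened directly into Entry, the presence byte could probably share padding with the other small fields, but that depends on how the compiler lays out the final struct.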

beeg

sooo that leaves the two Strings for name and comment taking up 48 bytes. both of those could have the internals hidden away in a newtype that exposes a &str to give more freedom to change things down the line. there are a lot of options

Method 1: Box<str>

who needs the capacity anyways. it would drop 8 bytes per field and works for both the name and comment
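
a minimal sketch of what that could look like, combined with the newtype idea from above. `EntryName` is a hypothetical name; the point is just that callers only ever see a &str, so the inner representation can change later.

```rust
/// Hypothetical newtype hiding the string representation behind `&str`.
pub struct EntryName(Box<str>);

impl EntryName {
    /// Drops the capacity field. Note that `into_boxed_str` may reallocate
    /// if the String has spare capacity.
    pub fn new(s: String) -> Self {
        EntryName(s.into_boxed_str())
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

impl AsRef<str> for EntryName {
    fn as_ref(&self) -> &str {
        self.as_str()
    }
}
```

`Box<str>` is a pointer plus a length, so on a 64-bit target this should be 16 bytes versus 24 for `String`.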

Method 2: comment is almost always empty

beyond a Box<str>, in theory the length could be stored at the start of the allocation. with that you can get away with a single pointer where null is empty and non-null can be used to fetch the length and construct the &str. that would shave off 16 bytes in total compared to a String
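
to make the single-pointer idea concrete, here is a rough sketch under stated assumptions: `ThinStr` is a hypothetical type, the allocation is laid out as a usize length followed by the utf-8 bytes, and the empty string is represented by `None` so it never allocates. it needs unsafe and is not Send or Sync as written, so treat it as an illustration rather than something ready to merge.

```rust
use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};
use std::mem::{align_of, size_of};
use std::ptr::{self, NonNull};

/// Hypothetical single-pointer string: `None` is the empty string, otherwise
/// the pointer targets an allocation laid out as `[len: usize][utf-8 bytes]`.
pub struct ThinStr(Option<NonNull<u8>>);

const HEADER: usize = size_of::<usize>();

fn layout_for(len: usize) -> Layout {
    Layout::from_size_align(HEADER + len, align_of::<usize>()).expect("string too long")
}

impl ThinStr {
    pub fn new(s: &str) -> Self {
        if s.is_empty() {
            return ThinStr(None); // the common case: no allocation at all
        }
        let layout = layout_for(s.len());
        // SAFETY: the layout has non-zero size, the allocation is aligned for
        // usize, and both writes stay within HEADER + s.len() bytes.
        unsafe {
            let Some(ptr) = NonNull::new(alloc(layout)) else {
                handle_alloc_error(layout)
            };
            ptr.as_ptr().cast::<usize>().write(s.len());
            ptr::copy_nonoverlapping(s.as_ptr(), ptr.as_ptr().add(HEADER), s.len());
            ThinStr(Some(ptr))
        }
    }

    pub fn as_str(&self) -> &str {
        match self.0 {
            None => "",
            // SAFETY: `new` wrote the length header followed by exactly that
            // many bytes copied from a valid &str.
            Some(ptr) => unsafe {
                let len = ptr.as_ptr().cast::<usize>().read();
                let bytes = std::slice::from_raw_parts(ptr.as_ptr().add(HEADER), len);
                std::str::from_utf8_unchecked(bytes)
            },
        }
    }
}

impl Drop for ThinStr {
    fn drop(&mut self) {
        if let Some(ptr) = self.0 {
            // SAFETY: the pointer came from `alloc` with this same layout.
            unsafe {
                let len = ptr.as_ptr().cast::<usize>().read();
                dealloc(ptr.as_ptr(), layout_for(len));
            }
        }
    }
}
```

`Option<NonNull<u8>>` is pointer-sized, so the field shrinks from 24 bytes to 8 on a 64-bit target. the trade-off is an extra pointer chase to read the length of a non-empty comment.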

Method 3: name is like... never empty

but it's often very short. it could be a good candidate for small-string-optimization which could allow avoiding extra allocations on pretty common Entrys. a lot of rust crates take advantage of invalid UTF-8 to store >16 bytes inline, but we would probably want one that can be created from some kind of inline Vec<u8>, which can't take advantage of that trick (and #148 exploits being able to convert the Vec<u8> form directly to the String form, so something similar would be ideal)

considering that this would likely involve pulling in some third-party crates it can be feature-gated off to be something simple like Box<str> when disabled
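
for a feel of the small-string idea without any third-party crate, here is a deliberately simple sketch. `SmallName` and `INLINE_CAP` are hypothetical, it uses a plain tagged enum rather than the invalid UTF-8 trick, and it builds from raw bytes so that short names never touch the allocator.

```rust
use std::str::Utf8Error;

/// Hypothetical inline capacity, picked so the enum should be no larger than
/// a String on typical 64-bit targets (worth confirming with size_of).
pub const INLINE_CAP: usize = 22;

/// Hypothetical small-string type for entry names.
pub enum SmallName {
    /// Short names are stored inline, no heap allocation.
    Inline { len: u8, buf: [u8; INLINE_CAP] },
    /// Longer names fall back to a boxed str.
    Heap(Box<str>),
}

impl SmallName {
    /// Validates the bytes as UTF-8 once, then picks a representation.
    pub fn from_bytes(bytes: &[u8]) -> Result<Self, Utf8Error> {
        let s = std::str::from_utf8(bytes)?;
        Ok(if s.len() <= INLINE_CAP {
            let mut buf = [0u8; INLINE_CAP];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SmallName::Inline { len: s.len() as u8, buf }
        } else {
            SmallName::Heap(Box::from(s))
        })
    }

    pub fn as_str(&self) -> &str {
        match self {
            // Re-validating keeps the sketch free of unsafe; a real
            // implementation might skip the check since `from_bytes`
            // already validated the bytes.
            SmallName::Inline { len, buf } => std::str::from_utf8(&buf[..*len as usize])
                .expect("validated at construction"),
            SmallName::Heap(s) => s,
        }
    }
}
```

with the feature disabled, the fallback described above would amount to keeping only the `Heap` side behind the same &str-returning API.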
