Reorganize Meta #41

@tgross35

Currently we have something like the following. Min sizes assume empty storage; averages are a best guess of the common case:

// We count the data behind Arcs as free because we have to store that information anyway

// Ignoring string: 24 local, 56ish remote
// String: 24 local, about 64 remote
type WordList = HashMap<String, Vec<Meta>>;

// 40 local
pub struct Meta {
    stem: Arc<str>, // 16 local
    source: Source,  // 24 local
}

// 24 local
pub enum Source {
    Affix(Arc<AfxRule>, usize), // 16 local, 32 pointee
    Dict(Box<[Arc<MorphInfo>]>), // 16 local, 24 pointee
    Personal(Box<PersonalMeta>), // 8 local, 40 pointee
    Raw,
}

// 40 local, extra meta in personal is uncommon
pub struct PersonalMeta {
    friend: Option<Arc<str>>, // 16 local
    morph: Vec<Arc<MorphInfo>>, // 24 local
}

// 24 local, ~8 pointee
pub enum MorphInfo {
    Stem(MorphStr), /* ... */
}

// 32 local
pub struct AfxRule {
    kind: RuleType,
    can_combine: bool,
    patterns: Vec<AfxRulePattern>,
}

// 88 local
pub struct AfxRulePattern {
    affix: Box<str>,
    condition: Option<ReWrapper>,
    strip: Option<Arc<str>>,
    morph_info: Vec<Arc<MorphInfo>>,
}
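
The local size estimates in the comments above can be checked mechanically with `std::mem::size_of`. The following is a minimal sketch, not the crate's actual code: `MorphInfo` and `AfxRule` are replaced by opaque stand-ins (only pointers to them are stored, so their contents do not affect the sizes measured), and `local_sizes` is a hypothetical helper name. Rust does not guarantee the layout of these types, so the printed numbers are whatever the current compiler and target produce.

```rust
use std::mem::size_of;
use std::sync::Arc;

// Opaque stand-ins (assumption): only pointers to these are stored below,
// so their real fields do not change the sizes being measured.
#[allow(dead_code)]
struct MorphInfo(u8);
#[allow(dead_code)]
struct AfxRule(u8);

#[allow(dead_code)]
struct Meta {
    stem: Arc<str>,
    source: Source,
}

#[allow(dead_code)]
enum Source {
    Affix(Arc<AfxRule>, usize),
    Dict(Box<[Arc<MorphInfo>]>),
    Personal(Box<PersonalMeta>),
    Raw,
}

#[allow(dead_code)]
struct PersonalMeta {
    friend: Option<Arc<str>>,
    morph: Vec<Arc<MorphInfo>>,
}

/// Returns (type name, local size in bytes) pairs for the types above.
fn local_sizes() -> Vec<(&'static str, usize)> {
    vec![
        ("Arc<str>", size_of::<Arc<str>>()),
        ("Source", size_of::<Source>()),
        ("Meta", size_of::<Meta>()),
        ("PersonalMeta", size_of::<PersonalMeta>()),
        ("Vec<Meta>", size_of::<Vec<Meta>>()),
    ]
}

fn main() {
    for (name, bytes) in local_sizes() {
        println!("{name}: {bytes} bytes local");
    }
}
```

This only covers the "local" column. Pointee sizes include the `Arc` reference counts and whatever the allocator rounds up to, so they need a heap profiler rather than `size_of`.
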

The current layout is really not terrible at ~80 bytes of meta per entry, but I think we can simplify things, and not only for storage reasons:

// Ignoring string: 24 local, 32ish remote
// String: 24 local, about 64 remote
type WordList = HashMap<String, Vec<Meta>>;

// 16 local, 16 remote max
struct Meta(MetaInner);

// 16 local
enum MetaInner {
    DictStem(Arc<str>),
    DictMorph(Arc<MorphInfo>),
    PersonalStem(Arc<str>),
    PersonalFriend(Arc<str>),
    AfxRule(Box<AfxMeta>),
    Raw,
}

// 16 local
struct AfxMeta {
    rule: Arc<AfxRule>,
    pat_idx: usize,
}

This would mean more entries in a single vector rather than multiple entries in multiple vectors, which is probably a good thing. A flat structure instead of a nested one should also mean fewer pointer hops, which will probably make the CPU a bit happier too.
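
To make the flat shape concrete, here is a rough sketch of how one word's metadata might look as several entries in a single `Vec<Meta>`, and how a consumer could filter it with one `match`. The stand-in bodies for `MorphInfo` and `AfxRule`, the sample values (`"walk"`, `"po:verb"`, `"ed"`), and the helpers `example_metas` and `stems` are all assumptions made for illustration. The sketch also prints `size_of::<Meta>()`: whether the enum really stays at 16 bytes depends on whether the compiler can avoid a separate discriminant, and with several variants each carrying a 16-byte `Arc<str>` it may come out larger, so this is worth measuring rather than assuming.

```rust
use std::mem::size_of;
use std::sync::Arc;

// Stand-in types (assumption): the real definitions live elsewhere.
#[allow(dead_code)]
#[derive(Debug)]
struct MorphInfo(String);
#[allow(dead_code)]
#[derive(Debug)]
struct AfxRule {
    patterns: Vec<String>,
}

#[allow(dead_code)]
#[derive(Debug)]
struct AfxMeta {
    rule: Arc<AfxRule>,
    pat_idx: usize,
}

#[allow(dead_code)]
#[derive(Debug)]
enum MetaInner {
    DictStem(Arc<str>),
    DictMorph(Arc<MorphInfo>),
    PersonalStem(Arc<str>),
    PersonalFriend(Arc<str>),
    AfxRule(Box<AfxMeta>),
    Raw,
}

#[derive(Debug)]
struct Meta(MetaInner);

/// Builds the flat entry list for one hypothetical word: a dictionary stem,
/// one piece of morph info, and one affix rule reference.
fn example_metas() -> Vec<Meta> {
    let stem: Arc<str> = Arc::from("walk");
    let rule = Arc::new(AfxRule { patterns: vec!["ed".to_owned()] });
    vec![
        Meta(MetaInner::DictStem(stem)),
        Meta(MetaInner::DictMorph(Arc::new(MorphInfo("po:verb".to_owned())))),
        Meta(MetaInner::AfxRule(Box::new(AfxMeta { rule, pat_idx: 0 }))),
    ]
}

/// Collects every stem recorded for a word, regardless of where it came from.
fn stems(metas: &[Meta]) -> Vec<&str> {
    metas
        .iter()
        .filter_map(|m| match &m.0 {
            MetaInner::DictStem(s) | MetaInner::PersonalStem(s) => Some(&**s),
            _ => None,
        })
        .collect()
}

fn main() {
    let metas = example_metas();
    println!("stems: {:?}", stems(&metas));
    println!("size_of::<Meta>() = {} bytes", size_of::<Meta>());
}
```

The trade-off is that related facts about one word (say, a stem and its morph info) are no longer grouped by the type system, so any pairing between entries would have to be implied by order or reconstructed by the consumer.
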

I would like to run all of this through valgrind before actually making the change, to get a good idea of how much we save.
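
Valgrind (for example its Massif or DHAT tools) measures the real binary and is the measurement meant here. As a lighter-weight complement, and explicitly a different technique, a counting global allocator can give a quick in-process number for how many bytes building a structure requests. The sketch below is an assumption-laden illustration: `CountingAlloc`, `ALLOCATED`, and `allocated_during` are hypothetical names, the word list is a tiny stand-in, and the counter is cumulative bytes requested (including regrowth), not peak or live usage.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Running total of bytes requested from the allocator (never decremented).
static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

/// Wraps the system allocator and counts every requested byte.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `build` and returns its result plus the bytes requested while it ran.
/// Allocations made by other threads during the call are counted too.
fn allocated_during<T>(build: impl FnOnce() -> T) -> (T, usize) {
    let before = ALLOCATED.load(Ordering::Relaxed);
    let value = build();
    let after = ALLOCATED.load(Ordering::Relaxed);
    (value, after - before)
}

fn main() {
    // Stand-in word list (assumption): the real map stores Vec<Meta> values.
    let (words, bytes) = allocated_during(|| {
        let mut map: HashMap<String, Vec<String>> = HashMap::new();
        for word in ["walk", "walked", "walking"] {
            map.insert(word.to_owned(), vec!["stem".to_owned()]);
        }
        map
    });
    println!("{} entries requested {} bytes in total", words.len(), bytes);
}
```

Running the same builder against the old and new `Meta` definitions would give two comparable totals, but allocator rounding and fragmentation are invisible to this counter, which is where valgrind remains the better tool.
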
