Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indirect table slices #207

Closed
wants to merge 4 commits into from
Closed

Indirect table slices #207

wants to merge 4 commits into from

Conversation

CasualX
Copy link

@CasualX CasualX commented Feb 21, 2018

Hello,

I apologize in advance for this unsolicited PR, I am using the PR functionality to demonstrate a potential improvement to the codebase but its contents do not adhere to the contributing guidelines.

CAD1997 posted this reddit thread asking for help with some problems.

While I cannot directly help fix the problem I did notice some potential for improvement:

CharDataTable is rather inefficient when the mapped to value V has more than one value. Typically V then becomes for some property T: &'static [T] incurring quite a hefty overhead of 16 bytes per char for the slice reference as well as creating 10,000s of static slices that the compiler needs to deal with.

To improve this situation you can create a second table indexed by V being a Range<u32> or even Range<u16>. On x64 this means V is now 8 or 4 bytes respectively (compared to 16 bytes) and instead of 10,000s of small static slices the compiler just deals with one huge static slice.

In cases where V ends up being &'static [&'static str] and there are many string literal repetitions it can be further improved by substituting an enum for the &'static str and leaving conversion from the enum to string to users of the crate. However this has impact on user facing APIs (you cannot return the slice of string slices directly any more).

That is what I attempt to implement in this PR, but I didn't get very far.

  1. I introduced a trait to replace the CharDataTable enum which will allow me to create dedicated structs to the different ways the tables can be stored. Now that I think about it this probably isn't too essential. I rename the old methods so I can easily find the places I need to use the new trait.

  2. I create dedicated structs for the two variants for the CharDataTable enum and implement the previous trait.

  3. I selectively replace some of the CharDataTable instances to the new struct, to get an idea of how feasible this approach is. Hacked together by editing the generated code because I don't know how the code generation step works.

  4. I want to show how this indirect table can work. Actually updating the generated data was too daunting so I stopped there.

Anyway that's where I got after a few hours. If the compiler's problem is that literal 10,000s of small static slices are created then instead concatenating them in one big table from which they're indexed may help. This indexing approach should still be able to reduce memory overhead of all the slice references.

@CAD97
Copy link
Collaborator

CAD97 commented Mar 6, 2018

Sorry this took so long to get to; it looks interesting! I think the idea of an indirect slice will definitely benefit us in some cases.

I've been playing with the tables off-and-on ever since the original Reddit post, and came across some additional info that I'm trying to integrate.

I think the eventual target currently looks something like this:

// we want inherent impls but traits for this pseudo-rust example
trait CharSet {
    fn contains(&self, needle: char) -> bool;
}
trait CharMap {
    type Item: Copy;
    fn find(&self, needle: char) -> Option<Self::Item>;
}

impl CharSet for CharTrieSet {} // Something like stdlib's BoolTrie
impl CharSet for T: CharMap<Item=()> {
    fn contains(&self, needle: char) -> bool { self.find(needle).is_some() }
}

impl CharMap<Item=T> for CharRangeMap<T>         {} // eytzinger-ordered `&[(CharRange ,   T)]`
impl CharMap<Item=T> for CharMap<T>              {} // eytzinger-ordered `&[(char      ,   T)]`
impl CharMap<Item=T> for IndirectCharRangeMap<T> {} // eytzinger-ordered `(&[CharRange], &[T])`
impl CharMap<Item=T> for IndirectCharMap<T>      {} // eytzinger-ordered `(&[char]     , &[T])`
impl CharMap<Item=T> for CharPhfMap<T>           {} // Perfect Hash Function (maybe) (+indirect?)
impl CharMap<Item=T> for CharNameMap             {} // Uses unique compression techniques

// And FST for Name->char

Each individual table would then get to choose which format suits its individual needs best.

@CasualX
Copy link
Author

CasualX commented Mar 9, 2018

Those names sound far better than I could ever come up with :P

That's the general idea yes, with some comments:

Using traits instead of an enum generic over V will be necessary for indirect data tables since you'll need to differentiate between the returned value (eg. &'static [V]) and the backing store array type (eg. [V; X]). This cannot be done with the enum as it exists now.

Further I'm not sure why you'd want to split it into two traits unless it's absolutely necessary. I'm not a fan of generalizing a solution before I understand and implemented a specialized version for my own use case. So I'd keep just one trait CharMap, contains can always just have a default implementation.

Unfortunately I don't understand the code generation aspect, with some help I'll see if I can make this PR work.

@CAD97 CAD97 mentioned this pull request Mar 19, 2018
@behnam
Copy link
Member

behnam commented Mar 21, 2018

Thanks, @CasualX, for submitting this work. We love to get new collaborators or random submissions!

I finally found the time to take a look here and all sound good. Now, since @CAD97 is already implementing these ideas, along other things, in GH-218, I'm going to close this PR and we can keep the rest of the conversation there.

I mentioned some rationals about the table generation process in GH-218 already, but there's one more thing to add here. We didn't spend much on optimizing these steps so far, mainly because we were expanding the libraries to help us figure out what are the abstractions we want. Now that we have many different data types already in place, I think it's a very good time to work on it.

@behnam behnam closed this Mar 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants