-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Indirect table slices #207
Conversation
Remove the find_or_default method as it can easily be called on the Option.
Sorry this took so long to get to; it looks interesting! I think the idea of an indirect slice will definitely benefit us in some cases. I've been playing with the tables off-and-on ever since the original Reddit post, and came across some additional info that I'm trying to integrate. I think the eventual target currently looks something like this: // we want inherent impls but traits for this pseudo-rust example
trait CharSet {
fn contains(&self, needle: char) -> bool;
}
trait CharMap {
type Item: Copy;
fn find(&self, needle: char) -> Option<Self::Item>;
}
impl CharSet for CharTrieSet {} // Something like stdlib's BoolTrie
impl CharSet for T: CharMap<Item=()> {
fn contains(&self, needle: char) -> bool { self.find(needle).is_some() }
}
impl CharMap<Item=T> for CharRangeMap<T> {} // eytzinger-ordered `&[(CharRange , T)]`
impl CharMap<Item=T> for CharMap<T> {} // eytzinger-ordered `&[(char , T)]`
impl CharMap<Item=T> for IndirectCharRangeMap<T> {} // eytzinger-ordered `(&[CharRange], &[T])`
impl CharMap<Item=T> for IndirectCharMap<T> {} // eytzinger-ordered `(&[char] , &[T])`
impl CharMap<Item=T> for CharPhfMap<T> {} // Perfect Hash Function (maybe) (+indirect?)
impl CharMap<Item=T> for CharNameMap {} // Uses unique compression techniques
// And FST for Name->char Each individual table would then get to choose which format suits its individual needs best. |
Those names sound far better than I could ever come up with :P That's the general idea yes, with some comments: Using traits instead of an enum generic over Further I'm not sure why you'd want to split it into two traits unless it's absolutely necessary. I'm not a fan of generalizing a solution before I understand and implemented a specialized version for my own use case. So I'd keep just one Unfortunately I don't understand the code generation aspect, with some help I'll see if I can make this PR work. |
Thanks, @CasualX, for submitting this work. We love to get new collaborators or random submissions! I finally found the time to take a look here and all sound good. Now, since @CAD97 is already implementing these ideas, along other things, in GH-218, I'm going to close this PR and we can keep the rest of the conversation there. I mentioned some rationals about the table generation process in GH-218 already, but there's one more thing to add here. We didn't spend much on optimizing these steps so far, mainly because we were expanding the libraries to help us figure out what are the abstractions we want. Now that we have many different data types already in place, I think it's a very good time to work on it. |
Hello,
I apologize in advance for this unsolicited PR, I am using the PR functionality to demonstrate a potential improvement to the codebase but its contents do not adhere to the contributing guidelines.
CAD1997 posted this reddit thread asking for help with some problems.
While I cannot directly help fix the problem I did notice some potential for improvement:
CharDataTable
is rather inefficient when the mapped to valueV
has more than one value. TypicallyV
then becomes for some propertyT
:&'static [T]
incurring quite a hefty overhead of 16 bytes perchar
for the slice reference as well as creating 10,000s of static slices that the compiler needs to deal with.To improve this situation you can create a second table indexed by
V
being aRange<u32>
or evenRange<u16>
. On x64 this meansV
is now 8 or 4 bytes respectively (compared to 16 bytes) and instead of 10,000s of small static slices the compiler just deals with one huge static slice.In cases where
V
ends up being&'static [&'static str]
and there are many string literal repetitions it can be further improved by substituting an enum for the&'static str
and leaving conversion from the enum to string to users of the crate. However this has impact on user facing APIs (you cannot return the slice of string slices directly any more).That is what I attempt to implement in this PR, but I didn't get very far.
I introduced a trait to replace the
CharDataTable
enum which will allow me to create dedicated structs to the different ways the tables can be stored. Now that I think about it this probably isn't too essential. I rename the old methods so I can easily find the places I need touse
the new trait.I create dedicated structs for the two variants for the
CharDataTable
enum and implement the previous trait.I selectively replace some of the
CharDataTable
instances to the new struct, to get an idea of how feasible this approach is. Hacked together by editing the generated code because I don't know how the code generation step works.I want to show how this indirect table can work. Actually updating the generated data was too daunting so I stopped there.
Anyway that's where I got after a few hours. If the compiler's problem is that literal 10,000s of small static slices are created then instead concatenating them in one big table from which they're indexed may help. This indexing approach should still be able to reduce memory overhead of all the slice references.