-
Notifications
You must be signed in to change notification settings - Fork 2
Implement Matryoshka #11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This is because it's not very compatible with the new Matryoshka implementation. May put back later, let's see.
Refer to Matryoshka paper Figure 12 in Appendix D for some more inspiration on that and take the median in the figure.
Used to call from python for matryoshka dicts.
About 10x slower than MP for 2k dimensions, 4k dictionaries. But recovers true dictionary.
Brings down time from 120ms to 72ms for test workload...
RomeoV
commented
May 19, 2025
RomeoV
commented
May 19, 2025
We had previously removed minibatching because it's a bit tricky to get to work with Matryoshka. However, we managed to bring it back. To this end, we also introduced `maybeview`: A function that constructs a view if the column indexing is contiguous, and otherwise copies. The new minibatching continues to only update `D` on some subset of the columns of `Y`, but returns `X` for all columns of `Y`, as it's good to update the structure.
Because these types hold big matrices, whenever they are printed, the terminal blows up. This occurs for example in the tests... Now it should be fine.
Sometimes matryoshka fails budget assignment for too small problems. Use `ceil` so that we always have at least one nnz. May lead to a total nnz count that's slightly out of budget though...
When the error is precomputed, we had issues where we had preallocated an error buffer for the entire data matrix, but then tried to copy in only the minibatch slice. Now we preallocate only an error buffer according to the minibatch size.
This is used a lot in testing, and I've been coding it new every time... Time to make a proper function!
Added `Hungarian.jl` dependency, which however, doesn't have further dependencies.
Now, after fitting all the coefficients, we "refit" them once more by solving $A x = b$ for only the nonzero elements of $x$ and columns of $A$.
I benchmarked once more, and it's typically not faster than just `Y - D*X` for our problem sizes.
RomeoV
commented
May 20, 2025
Owner
Author
RomeoV
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some feedback...
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Comes with:
fasterrorandfasterror!