Replace SHA function by something very fast #619
base: master
Conversation
My estimate is that this should increase total markdown processing speed by ~2.5X. Fixes issue trentm#618. Tests pass.
Hello! Thanks for this, looks sane. The only thing I'm wondering about is collision potential versus the SHAs we were using. I'm guessing minimal to none. Do you have any perspective on that? Thank you
There are two ways of thinking about it.
If we are worried about users with more than 200K strings, first of all I'd say: improve performance! But to avoid the chance of failure and speed things up, I see three options:
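As a back-of-the-envelope check on the collision question above, the birthday bound gives the rough probability of any collision among n uniformly random hash values. The 64-bit width here is an assumption for illustration; the thread does not pin down an output size.

```python
# Birthday bound: P(collision) ~= n*(n-1) / 2^(bits+1) for n values drawn
# uniformly from a space of 2^bits. 64 bits is an assumed hash width.
n = 200_000   # the string count discussed above
bits = 64     # assumption, not stated in the PR
p = n * (n - 1) / 2 ** (bits + 1)
print(f"~{p:.1e}")  # prints ~1.1e-09
```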
Oh, one reasonable thing you might want to do is run some kind of before-and-after benchmark? I didn't actually verify that it got a lot faster.
Looking into this again. I used timeit to evaluate the different hash functions. As-is in this PR, your new method was doing about 30% better. But this looks to have the same injection issue as mentioned in #599. We've since fixed the escaping so it doesn't really matter if you can guess a hash, but I don't want to introduce this in case of regression.

We could possibly do something with this new hash method + SECRET_SALT, though. I tried adding the salt back in, but because of the bytes + string concat with urandom, the conversions ended up taking enough time to negate any wins. So I think this is going in the right direction; it just needs something random added back in.
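A rough timeit sketch of the comparison described above. Neither function is the PR's actual code: `_hash_sha` only approximates the existing salted-digest scheme, and `_hash_fast` is a hypothetical salted replacement. The point illustrated is decoding the urandom salt to a string once at import time, so the hot path stays pure-str and avoids the per-call bytes/str conversions mentioned above.

```python
import hashlib
import os
import timeit

SECRET_SALT = os.urandom(16)
SALT_STR = SECRET_SALT.hex()  # one-time conversion, outside the hot path

def _hash_sha(text):
    # approximation of the existing scheme: salted cryptographic digest
    return hashlib.sha256(SECRET_SALT + text.encode("utf-8")).hexdigest()

def _hash_fast(text):
    # hypothetical fast path: built-in str hash over a salted string;
    # no encode() call, so no bytes/str concatenation per invocation
    return '%016x' % (hash(SALT_STR + text) & 0xFFFFFFFFFFFFFFFF)

sample = "https://example.com/some/long/link/to/hash"
for fn in (_hash_sha, _hash_fast):
    elapsed = timeit.timeit(lambda: fn(sample), number=200_000)
    print('%s: %.3fs' % (fn.__name__, elapsed))
```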
Well... just going to point out again: there's no actual reason you need to do hashing at all. This whole method of replacing something with a hash and then later re-adding it is a hack to avoid building a new data structure. You're just leaving a hole which won't be processed, and then replacing it later; hashes were simply the mechanism the Markdown author picked in Perl, and that got copied here.

So you could fundamentally rewrite the code to not need any of this logic in the slightest. Sequential IDs or UUIDs would work in most respects. That breaks because (IIRC) you also hash URLs for use in Markdown footnotes at the end, and repeated URLs need to map to the same thing. What if you used a Python dict that looked up a key and, if the key wasn't in the dict, assigned the next sequential ID? See the sketch below.
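A minimal sketch of that dict-based idea, with made-up names rather than markdown2's actual internals: `intern()` hands out sequential placeholder IDs, and repeated text (e.g. a footnote URL seen twice) gets the same ID back, so later substitution still works.

```python
class PlaceholderTable:
    def __init__(self):
        self._id_by_text = {}   # text -> placeholder
        self._text_by_id = {}   # placeholder -> text

    def intern(self, text):
        placeholder = self._id_by_text.get(text)
        if placeholder is None:
            # \x02/\x03 delimiters keep the marker from colliding with
            # ordinary document text, mirroring the hole-punching idea
            placeholder = '\x02%d\x03' % len(self._id_by_text)
            self._id_by_text[text] = placeholder
            self._text_by_id[placeholder] = text
        return placeholder

    def lookup(self, placeholder):
        return self._text_by_id[placeholder]

# Repeated input yields the same placeholder, as footnote URLs require.
table = PlaceholderTable()
assert table.intern("http://a.example") == table.intern("http://a.example")
```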