Replace SHA function by something very fast #619


Open
za3k wants to merge 1 commit into master
Conversation

@za3k commented Feb 28, 2025

My estimate is that this should increase total markdown processing speed by ~2.5X.

Fixes issue #618.

Tests pass.

@nicholasserra
Collaborator

Hello! Thanks for this, looks sane. The only thing I'm wondering about is the collision potential versus the SHAs we were using. Guessing minimal to none. Do you have any perspective on that? Thank you

@za3k
Author

za3k commented Mar 3, 2025

Two ways of thinking about it.

  1. This fails as often as a Python dict or set fails, in terms of string hash collisions. Personally, I have never had that happen.

  2. You can do some math. It looks like the hash is a 64-bit number in CPython (on my 64-bit machine, at least). The birthday paradox says we would need about 2**32 ≈ 4 billion strings before reaching 1 expected collision in an ideal hash system.
    However, someone did some empirical tests, and it's not an ideal hash system, so the real answer might be more like 200K strings? (A quick check of the hash width and the birthday bound is sketched after this list.)
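
For what it's worth, here is a quick, hedged sanity check of that math. It only shows the hash width reported by the interpreter and the idealized birthday bound, not CPython's real-world collision behavior:

```python
import sys

# sys.hash_info.width is the number of bits used by the built-in hash()
# (64 on a typical 64-bit CPython build).
bits = sys.hash_info.width

# Idealized birthday bound: with a b-bit hash you expect roughly
# 2 ** (b / 2) distinct inputs before the first collision.
expected = 2 ** (bits // 2)
print(f"hash() width: {bits} bits")
print(f"~{expected:,} strings before one expected collision (ideal hash)")
```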

If you are worried about users with more than 200K strings, first of all I'd say: improve performance! But to avoid the chance of failure and speed things up, I see three options:

  1. The ideal answer would be to stop doing this, not to find a better hash function. You can escape a segment of HTML code without hashing it at all. But that would be a lot more rewriting work.

  2. You could use random IDs (I tried this). It breaks something in the URL-escaping logic for images and end-material references, because they rely on getting the same result when the text is looked up again later. I forget the details, sorry. But maybe you could fix just that.

  3. You could hash both the string and some variant of the string, which would give you a 128-bit number. That should avoid any collisions, I hope.

>>> s="hello, world"; (hash(s) << 64) + hash(s+"also")
-42389628753142344245553632286555727257

@za3k
Author

za3k commented Mar 3, 2025

Oh, one reasonable thing you might want to do is run some kind of before-and-after benchmark. I didn't actually verify that it got a lot faster.

@nicholasserra
Collaborator

Looking into this again.

I used timeit to evaluate the different hash functions. As-is in this PR, your new method was doing about 30% better.
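
For reference, a rough sketch of the kind of timeit comparison described above; the digest actually used by markdown2 and the sample input are stand-ins here, so treat the numbers as illustrative only:

```python
import hashlib
import timeit

SAMPLE = "<pre>some escaped block of HTML</pre>" * 10

def sha_key(text):
    # Stand-in for the existing digest-based key.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def fast_key(text):
    # The proposed replacement: the built-in hash(), stringified.
    return str(hash(text))

for fn in (sha_key, fast_key):
    elapsed = timeit.timeit(lambda: fn(SAMPLE), number=100_000)
    print(f"{fn.__name__}: {elapsed:.3f}s for 100k calls")
```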

But this looks to have the same injection issue as mentioned in #599.

We've since fixed the escaping so that it doesn't really matter if you can guess a hash, but I don't really want to introduce this in case of regression.

But we could possibly do something with this new hash method + SECRET_SALT. I tried adding the salt back in, but because of the bytes + string concatenation with urandom, the conversions ended up taking enough time to negate any wins.
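
One possible variant along those lines (untested against markdown2 itself; the `_hash_text` name and salt length are just illustrative): convert the random salt to a str once at import time, so each call is only a string concatenation plus the built-in hash(), with no per-call bytes/str conversion:

```python
import os

# Generate the salt once, as a str, so no bytes/str conversion happens per call.
_SECRET_SALT = os.urandom(16).hex()

def _hash_text(text):
    # Per-call cost is one string concatenation plus the built-in hash().
    return str(hash(_SECRET_SALT + text))
```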

So I think this is going in the right direction. Just need to add in something random again.

@za3k
Author

za3k commented May 19, 2025

Well... just going to point out again: there's no actual reason you need to do hashing at all. This whole method of replacing something with a hash and then later re-adding it is a hack to avoid building a new data structure. You're just leaving a hole which won't be processed and replacing it later; hashes were simply the mechanism the original Markdown author picked in Perl, and that has been copied here. So you could fundamentally rewrite the code to not need any of this logic at all.

You could use sequential IDs or UUIDs in most places.

This breaks because (IIRC) you also hash URLs for use in Markdown footnotes at the end, and repeated inputs need to produce the same value.

What if you used a Python dict that looks up a key and, if the key isn't in the dict, assigns the next sequential ID? Something like the sketch below.
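
A minimal sketch of that idea (the class name and the placeholder delimiters are made up, and markdown2's real placeholder plumbing is more involved):

```python
class PlaceholderTable:
    """Map each escaped chunk of text to a stable, sequential placeholder."""

    def __init__(self):
        self._ids = {}      # text -> placeholder string
        self._chunks = []   # placeholder index -> original text

    def key_for(self, text):
        # The same text always gets the same placeholder (so repeated
        # footnote/URL lookups still match); new text gets the next ID.
        if text not in self._ids:
            self._ids[text] = "\x02md-placeholder-%d\x03" % len(self._chunks)
            self._chunks.append(text)
        return self._ids[text]

    def restore(self, html):
        # Swap every placeholder back for its original text.
        for i, chunk in enumerate(self._chunks):
            html = html.replace("\x02md-placeholder-%d\x03" % i, chunk)
        return html
```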
