Replace SHA function by something very fast #619


Open
za3k wants to merge 1 commit into master
Conversation

@za3k commented Feb 28, 2025

My estimate is that this should increase total markdown processing speed by ~2.5X.

Fixes issue #618.

Tests pass.

@nicholasserra
Collaborator

Hello! Thanks for this, looks sane. The only thing I'm wondering about is the collision potential versus the SHAs we were using. Guessing minimal to none. Do you have any perspective on that? Thank you

@za3k
Author

za3k commented Mar 3, 2025

Two ways of thinking about it.

  1. This fails as often as a Python dict or set fails, in terms of string hash collisions. Personally, I have never had that happen.

  2. You can do some math. It looks like the hash is a 64-bit number in CPython (on my 64-bit machine, at least). The birthday paradox says we would need about 2**32 ≈ 4 billion strings before reaching 1 expected collision in an ideal hash system.
    However, someone did some empirical tests, and it's not an ideal hash system, so the real answer might be more like 200K strings? (A quick check of the hash width and the birthday bound is sketched after this list.)
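
For what it's worth, here is a quick, hedged sanity check of that math. It only shows the hash width reported by the interpreter and the idealized birthday bound, not CPython's real-world collision behavior:

```python
import sys

# sys.hash_info.width is the number of bits used by the built-in hash()
# (64 on a typical 64-bit CPython build).
bits = sys.hash_info.width

# Idealized birthday bound: with a b-bit hash you expect roughly
# 2 ** (b / 2) distinct inputs before the first collision.
expected = 2 ** (bits // 2)
print(f"hash() width: {bits} bits")
print(f"~{expected:,} strings before one expected collision (ideal hash)")
```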

If you are worried about users with more than 200K strings, first of all I'd say: improve performance! But to avoid the chance of failure and speed things up, I see three options:

  1. The ideal answer would be to stop doing this, not to find a better hash function. You can escape a segment of HTML code without hashing it at all. But that would be a lot more rewriting work.

  2. You could use random IDs (I tried this). It breaks something in the URL-escaping logic for images and end-material references, because they rely on getting the same result when the text is looked up again later. I forget the details, sorry. But maybe you could fix just that.

  3. You could hash both the string and some variant of the string, which would give you a 128-bit number. That should avoid any collisions, I hope.

>>> s="hello, world"; (hash(s) << 64) + hash(s+"also")
-42389628753142344245553632286555727257

@za3k
Author

za3k commented Mar 3, 2025

Oh, one reasonable thing you might want to do is run some kind of before-and-after benchmark. I didn't actually verify that it got a lot faster.

@nicholasserra
Collaborator

Looking into this again.

I used timeit to evaluate the different hash functions. As-is in this PR, your new method was doing about 30% better.
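
For reference, a rough sketch of the kind of timeit comparison described above; the digest actually used by markdown2 and the sample input are stand-ins here, so treat the numbers as illustrative only:

```python
import hashlib
import timeit

SAMPLE = "<pre>some escaped block of HTML</pre>" * 10

def sha_key(text):
    # Stand-in for the existing digest-based key.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def fast_key(text):
    # The proposed replacement: the built-in hash(), stringified.
    return str(hash(text))

for fn in (sha_key, fast_key):
    elapsed = timeit.timeit(lambda: fn(SAMPLE), number=100_000)
    print(f"{fn.__name__}: {elapsed:.3f}s for 100k calls")
```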

But this looks to have the same injection issue as mentioned in #599.

We've since fixed the escaping so that it doesn't really matter if you can guess a hash, but I don't really want to introduce this in case of regression.

But we could possibly do something with this new hash method + SECRET_SALT. I tried adding the salt back in, but because of the bytes + string concatenation with urandom, the conversions ended up taking enough time to negate any wins.
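
One possible variant along those lines (untested against markdown2 itself; the `_hash_text` name and salt length are just illustrative): convert the random salt to a str once at import time, so each call is only a string concatenation plus the built-in hash(), with no per-call bytes/str conversion:

```python
import os

# Generate the salt once, as a str, so no bytes/str conversion happens per call.
_SECRET_SALT = os.urandom(16).hex()

def _hash_text(text):
    # Per-call cost is one string concatenation plus the built-in hash().
    return str(hash(_SECRET_SALT + text))
```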

So I think this is going in the right direction. Just need to add in something random again.

@za3k
Author

za3k commented May 19, 2025

Well... just going to point out again: there's no actual reason you need to do hashing at all. This whole method of replacing something with a hash and then later re-adding it is a hack to avoid building a new data structure. You're just leaving a hole which won't be processed and replacing it later; hashes were simply the mechanism the original Markdown author picked in Perl, and that has been copied here. So you could fundamentally rewrite the code to not need any of this logic at all.

You could use sequential IDs or UUIDs in most places.

This breaks because (IIRC) you also hash URLs for use in Markdown footnotes at the end, and repeated inputs need to produce the same value.

What if you used a Python dict that looks up a key and, if the key isn't in the dict, assigns the next sequential ID? Something like the sketch below.
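
A minimal sketch of that idea (the class name and the placeholder delimiters are made up, and markdown2's real placeholder plumbing is more involved):

```python
class PlaceholderTable:
    """Map each escaped chunk of text to a stable, sequential placeholder."""

    def __init__(self):
        self._ids = {}      # text -> placeholder string
        self._chunks = []   # placeholder index -> original text

    def key_for(self, text):
        # The same text always gets the same placeholder (so repeated
        # footnote/URL lookups still match); new text gets the next ID.
        if text not in self._ids:
            self._ids[text] = "\x02md-placeholder-%d\x03" % len(self._chunks)
            self._chunks.append(text)
        return self._ids[text]

    def restore(self, html):
        # Swap every placeholder back for its original text.
        for i, chunk in enumerate(self._chunks):
            html = html.replace("\x02md-placeholder-%d\x03" % i, chunk)
        return html
```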
