Skip to content

Risks of hash collisions due to Int encoding #428

@artwyman

Description

@artwyman

Context:

This is a spin-off of #426 for a deeper discussion of a related issue. #426 is mostly about the specific constants chosen in value hashing, and could be fixed by changing that constants. In this issue I want to prompt a broader discussion about the encoding of Ints direclty in a RawValue without hashing. I think that might have security implications which are worth exploring. I have 2 risks in mind which I'll document below. I welcome input on whether each of them is real and worth worrying about.

I'm coming at this largely from the direction that any ability to construct hash collisions is dangerous, and we need to ensure doing so is hard within some number of bits of security. I don't have specific vulnerabilities in mind, but could try to come up with some if it would be helpful. I'm also not sure of the specific security level we assume from our Poseidon hash against specific attacks - I'd appreciate input there.

Int Encoding:

I'll restate the details as I understand them, mostly based on documentation, so please correct any mistakes here:

  • Raw values are represented as 4 Goldilocks field elements containing ~256 bits (~64 bits each). I'll write these little-endian as (h0, h1, h2, h3)
  • In most cases there are the result of a Poseidon hash which we assume to be strong, with some number of effective bits of security N<256
  • Integers of 64 bits are not hashed, but are instead embedded directly in a raw value as (i0, i1, 0, 0) where i0 and i1 are the lowest and highest 32 bits of the integer, and the two constant 0s could be any other arbitrary constant.

This encoding of ints has obvious efficiency benefits of not requiring circuits to take extra inputs or verify hashes in order to prove statements like SumOf and Lt

Risk 1:

Since int-encoding is known, and 2 of the 4 elements are known, it becomes easier to construct a hash collision between an Int and some other type. For instance, if I can somehow find a String whose hash is (s0, s1, 0, 0) I immediately know how to construct a colliding Int. Thus if we think our Poseidon hash is collision-resistant with N bits of effective security, this approach to integer encoding reduces that to N/2, at least for any attack involving colliding an Int against some other type.

Is N/2 good enough? That's where I'd like to have more input or pointers from people who understand the math and security of Poseidon better. My surface-level logic is that if N/2 were good enough, we'd have picked a RawValue type with 2 field elements instead of 4, and saved a lot of circuit space.

Risk 2:

This risk isn't about direct hash collisions, but about the related collisions of key in a Merkle Tree.

The int encoding above interacts further with the encoding of Sparse Merkle Trees for containers (Dict/Set/Array), documented here. The path through the tree is fixed by the bits of the raw value in little-endian order, up to some max (usually 32 in our use cases). Note that leaves of the Merkle tree are always hashed in full, so a collision of the 32 key bits doesn't allow construction of false proofs. But it means that 2 values which share the same lowest 32 bits cannot be contained in the same Merkle tree.

For Int values, that means the lowest 32-bit bits of the integer directly determine its path in the tree. Since they're in reverse order, this gives the nice property that sequential integers (such as the indexes into an array) result in a perfectly balanced tree. However it also means that certain integer values can never be present as keys in the same container (Dict/Set). E.g. a Set of Ints cannot contain 0x0 and 0x100000000.

This can happen by accident, and be quite unexpected in a POD-based system. Integers aren't randomly distributed, so the likelihood of this happening will depend on the specifics of a given use case.

Furthermore, as an attacker, if I know the value of one element in a Set, it's easy for me to construct another value which cannot be inserted into the same Set without causing all future proofs to fail. This could result in a denial-of-service effect in some scenarios. E.g. imagine a PODB service which cannot accept a proven-valid request to insert data.

This is the kind of "sharp edge" which would cause me to avoid using Ints as keys when designing a system, which would eliminate certain possible designs. That may be sufficient to avoid problems if everyone does the same, but it means a part of the system's expressiveness isn't really usable.

Metadata

Metadata

Assignees

No one assigned

    Labels

    discussionTopic that requires discussionquestionFurther information is requested

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions