BUG CEP-19: Different directory trees may end up with the same hash value #150

@hunger

Checklist

  • I added a descriptive title
  • I searched open reports and couldn't find a duplicate

What happened?

I have set up two different directory trees:

|-- testdata1
|   `-- testFhello-world
`-- testdata2
    |-- test
    `-- world

Hashing these two trees with the Python script from CEP-19 gives the same hash value for both:

CEP19 hash of testdata1: e91a9f9adcb3561a7a78a04d6f33b391beb92491f9ed99663b455867b031d30a
CEP19 hash of testdata2: e91a9f9adcb3561a7a78a04d6f33b391beb92491f9ed99663b455867b031d30a

The hash stream does not unambiguously mark where a file name ends and where file contents begin. The file name testFhello-world in testdata1 therefore enters the stream as the same bytes as a file named test with contents hello, followed by a file named world with whatever contents testFhello-world has. This is not the only way to confuse the algorithm: file contents can also be crafted to "pretend" that there are more files.
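The collision can be reproduced in memory with a simplified model of the hash stream. This is only a sketch, not the CEP-19 script: it assumes each file contributes its name, a one-byte type marker F, its contents, and a - separator, which is inferred from the file name used above. The helper names naive_stream and naive_hash and the payload bytes are made up for illustration.

```python
import hashlib


def naive_stream(files):
    """Build the byte stream for a list of (name, contents) pairs.

    Assumption: every file adds name + b"F" + contents + b"-" with no
    length information. This layout is inferred from the example file
    name, not copied from the CEP-19 script.
    """
    stream = b""
    for name, contents in sorted(files):
        stream += name + b"F" + contents + b"-"
    return stream


def naive_hash(files):
    # SHA-256 over the concatenated stream.
    return hashlib.sha256(naive_stream(files)).hexdigest()


# Hypothetical contents shared by testFhello-world and world.
payload = b"same bytes"

testdata1 = [(b"testFhello-world", payload)]
testdata2 = [(b"test", b"hello"), (b"world", payload)]

# Both trees produce the identical byte stream, hence the identical hash.
print(naive_stream(testdata1) == naive_stream(testdata2))
print(naive_hash(testdata1) == naive_hash(testdata2))
```

Under these assumptions both trees feed exactly the same bytes to the hash function, so any hash algorithm gives the same result for them.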

One way to stop these filesystem trees from having the same hash value is to hash the length of each input before the input itself. The length can be hashed either as an integer of a defined bit length or "stringified" and followed by a separator like :. The separator is needed after a stringified length because without it user-provided contents that start with digits could change the apparent length, which might open another way to confuse the algorithm.
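Both length encodings can be sketched as follows. This is an illustration of the idea, not a proposed replacement for the CEP-19 script: the function names, the 8-byte big-endian width, the F type marker, and the payload bytes are all assumptions chosen for the example.

```python
import hashlib
import struct


def frame_fixed(field):
    # Length as a fixed-width integer (8 bytes, big-endian), then the bytes.
    return struct.pack(">Q", len(field)) + field


def frame_text(field):
    # Length as decimal text, a ":" separator, then the bytes.
    return str(len(field)).encode("ascii") + b":" + field


def framed_hash(files, frame):
    """Hash (name, contents) pairs with every field length-prefixed."""
    hasher = hashlib.sha256()
    for name, contents in sorted(files):
        hasher.update(frame(name))
        hasher.update(b"F")  # assumed type marker for a regular file
        hasher.update(frame(contents))
    return hasher.hexdigest()


# Hypothetical contents shared by testFhello-world and world.
payload = b"same bytes"

testdata1 = [(b"testFhello-world", payload)]
testdata2 = [(b"test", b"hello"), (b"world", payload)]

for frame in (frame_fixed, frame_text):
    different = framed_hash(testdata1, frame) != framed_hash(testdata2, frame)
    print(frame.__name__, "separates the two trees:", different)
```

With either encoding a reader of the stream always knows how many bytes belong to the current name or contents, so bytes from a file name can no longer be read as a type marker, contents, or the start of another entry.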

cep-19-fail.tar.gz contains the script from CEP-19 and the two directories so you can try it for yourself.

Additional Context

No response
