Description
Checklist
- I added a descriptive title
- I searched open reports and couldn't find a duplicate
What happened?
I have set up two different directory trees:
|-- testdata1
|   `-- testFhello-world
`-- testdata2
    |-- test
    `-- world
Hashing these two trees with the Python script from CEP-19 gives the same hash value for both:
CEP19 hash of testdata1: e91a9f9adcb3561a7a78a04d6f33b391beb92491f9ed99663b455867b031d30a
CEP19 hash of testdata2: e91a9f9adcb3561a7a78a04d6f33b391beb92491f9ed99663b455867b031d30a
The file name testFhello-world in testdata1 is fed into the hash stream unframed, so it is indistinguishable from a file named test with contents hello followed by a file named world with whatever contents testFhello-world has. This is not the only way to confuse the algorithm: file contents can also be crafted to "pretend" that there are more files.
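The collision can be reproduced in a few lines. This is a minimal sketch, not the CEP-19 script itself: it assumes a simplified model of the hash stream (file name, then an F type marker, then the contents, then a - separator), and naive_tree_hash with its in-memory dicts is a hypothetical stand-in for walking real directories.

```python
import hashlib


def naive_tree_hash(files):
    """Hash a {name: contents} mapping by plain concatenation.

    Assumed, simplified stream per file: name + b"F" + contents + b"-".
    Nothing marks where a name or a file's contents end.
    """
    hasher = hashlib.sha256()
    for name, contents in sorted(files.items()):
        hasher.update(name.encode("utf-8"))
        hasher.update(b"F")       # type marker for a regular file
        hasher.update(contents)
        hasher.update(b"-")       # separator between entries
    return hasher.hexdigest()


# One file whose name imitates "name + marker + contents + separator + name".
tree1 = {"testFhello-world": b"data"}
# Two files that produce the same byte stream.
tree2 = {"test": b"hello", "world": b"data"}

print(naive_tree_hash(tree1) == naive_tree_hash(tree2))
```

Both trees feed the byte stream testFhello-worldFdata- to the hasher, so the comparison prints True.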
One way to stop these filesystem trees from having the same hash value is to hash the length of each input (file name, file contents) before the input itself. The length can be hashed either as an integer of a defined bit length, or "stringified" and followed by a separator such as :. The separator is needed after a stringified length because without it, user-provided contents that start with digits would change the apparent length. That might allow for another way to confuse the algorithm.
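A rough sketch of the fixed-width variant, under the same simplified model of the hash stream as assumed above (name, F marker, contents). The names framed_tree_hash and _update_framed are hypothetical, and the 8-byte big-endian length is just one possible choice of a defined bit length:

```python
import hashlib
import struct


def _update_framed(hasher, data):
    """Feed len(data) as a fixed 8-byte big-endian integer, then data."""
    hasher.update(struct.pack(">Q", len(data)))
    hasher.update(data)


def framed_tree_hash(files):
    """Hash a {name: contents} mapping with length-prefixed fields."""
    hasher = hashlib.sha256()
    for name, contents in sorted(files.items()):
        _update_framed(hasher, name.encode("utf-8"))
        hasher.update(b"F")       # type marker for a regular file
        _update_framed(hasher, contents)
    return hasher.hexdigest()


tree1 = {"testFhello-world": b"data"}
tree2 = {"test": b"hello", "world": b"data"}

print(framed_tree_hash(tree1) == framed_tree_hash(tree2))
```

Because every field is preceded by its length, the 16-byte name in tree1 can no longer be read as a 4-byte name plus marker plus contents, and the two trees hash differently. The stringified variant would replace struct.pack with something like str(len(data)).encode() + b":".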
The attached cep-19-fail.tar.gz contains the script from CEP-19 and the two directories, so you can try it for yourself.
Additional Context
No response