File type validation on upload through web browser#624
Conversation
|
I'm ignoring the build failure for now - it's because the test uploaded a txt file - but since the list of file types supported will need to be changed there's no point adding just that right now. |
|
We could certainly maintain an allow list of suffixes, though it seems likely that even with exhaustive comparison against already published files we are likely to encounter requests to add to the suffixes often. I doubt that I did think about using magika in conjunction with |
|
Here's a survey of puremagic detection results - issue #553Run against real files from
84 file types tested.
Summary of issuesNOT DETECTED (38) - all expectedPlain-text files have no magic byte, so every type that is
For all of these the only viable strategy is extension-based classification (or, for MISMATCH (11) - broken down by action neededExpected from ChatGPT is wrong, detection is fine:
The example file on the mirror is not what the extension implies:
Likely a 64 KB truncation artefact:
puremagic false positive:
|
|
Here's the same for magika: magika detection results - issue #553Same files and 64 KB Range-request downloads as the puremagic run. Detection uses 84 file types tested.
Summary of mismatchesExpected set too narrow or uses legacy MIME - detection is correct (13)These are not magika failures; the expected result (from the ChatGPT table in the issue)
The example file on the mirror is not what the extension implies (5)Same five files that tripped puremagic for the same reason.
64 KB truncation artefact (1)
magika does not recognise the format (5)
magika false positives / PGP-PEM confusion (6)PGP ASCII-armored blocks (
|
|
It feels to me like magika doesn't add much since all its real wins are on the plaintext files - could we just validate that something puremagic doesn't recognise is at least valid UTF-8? Maybe that's still not sufficient for the upload page where we don't necessarily know all the types we'll get! What do you think @sbp? |
aa3364a to
61f166f
Compare
|
I was thinking more about looking at the implementation to figure out the confidence we can have in the result for each type. For example, you ran Obviously that would be pretty complicated, and there are a lot of file types, so I was suggesting that we only do this for the most common types that we find on the server. Does it always detect ZIP correctly? Always? 100%? Then we can use it to detect ZIP formats. What about TGZ? 100%? Then we can use it. The suggestion about magika was that we use it when |
|
Makes sense. So you think we should just work through the largest from Dave's survey and investigate those to see if this is achievable or not? I did notice there's a note in ASVS saying level 1 only requires verification of files used for specific sensitive purposes, but level 2 requires them all. |
You can make use of the find file and then access via https://dlcdn.apache.org the files and test as many as needed. |
|
Most (but not all) archives look relatively good. The below file types were 100% correctly picked up by puremagic - the numbers still aren't high but I can up my search volumes (this is discovered from the first 3 directory levels on dlcdn.apache.org, max 800 of each file type for test purposes): .tar.gz (797 files tested) |
|
@alitheg Are you doing further testing? If not then it looks |
|
I'm happy to do more - or not. What this shows is that puremagic seems to me to be good enough for tar.gz, zip, jar, tgz, whl and probably tar.bz2. It's likely as good on all the entries in that list but my test didn't find more to test on. For other file types, we don't have enough data to say we can definitely safely block files that don't pass puremagic validation, so we could change it to a warning for other filetypes and an error for these, for example. |
|
This processing fits into the new quarantine as a first step. It's important to proceed as follows.
|
133ab83 to
929a8c3
Compare
For issue #553. If this is a good approach we probably want to run it for all other upload types, and it needs other supported filetypes added as in Dave's investigation. But I thought I'd get a quick PR raised to validate the approach.