Description
Some AV implementations (E.g. McAfee) hook in post-write, opening and scanning the file after it is written to disk. No implication of TOCTOU race, since they could also have a JIT check in place. But an IO pattern of:
-
creat
-
write
-
close
-
rename (either the file or a containing directory)
will race with the AV scanners -
open
-
read
-
close
calls on Windows OS's.
Note that the AV scanner is multi threaded, making many of these calls overlapped: trying to thread the needle between them with high frequency attempts is impossible and just increases CPU utilisation.
To fix that, either: don't perform that IO pattern, or use a retry-with-backoff strategy on the rename. We currently use the retry approach.
However that results in up to 143 second delays per rename that we do; which is roughly one-per-component, and components like docs
are particularly affected.
If rather than renaming after close, we renamed before closing, we could avoid this condition. e.g.:
- create
- write
- flush
- SetFileInformationByHandle on NT, renameat() on posix
- close
on a per file basis, into the final location, then the AV scan would take place on the final file path, rather than on a temporary location, and no rename retry would be required.
There are two broad strategies we could use to achieve this:
One: "indirect"
we could write all files to one or more centralised stores (e.g. perhaps a content addressable store), and then build a link farm from there to a directory representing the component, and rename the component into place. Error handling could be done by asserting content - every unpack just replaces content in the store - if its by full name (tar + expanded path) - we replace it; if its by hash, we can validate and replace if wrong. Or use fsync and trust no corruption ever occurs. Or variations could be made up.
Two: "direct"
we could write the files directly to their final locations, with additional error handling needed for partially unpacked components. Writes might be lost in a power failure at any point, and similarly out of disk space; and due to concurrency behaviours and disk out-of-order capabilities we can't assume that having written X then started on Y, the presence of Y implies X, unless we introduce barriers via sync calls. We could use the same strategy as for indirect: just unpack everything for anything that is suspect; and treat it all as suspect. Making sure we write metadata last and syncing that is perhaps a good common path; alternatively, checking the content exists, trusting that in the common case the full write was successful, and only linking in the manifest at the end might be good - and using locking on the metadata file to solve #988 at the same time.