Description
Currently simpleblob is quite restrictive about the characters that are allowed in names, constraining users to alphanumeric names with a few special characters (".", "-", "_"). The constraint primarily follows from using unescaped filenames in the `fs` backend.
We have a use case where we want to use a version-specific prefix within a bucket (e.g. "v5/"), and also to be able to write a program that can discover all versions in use. This requires listing blobs with "/" in the name.
Amazon writes the following about S3 limitations: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
Basically they allow any UTF-8 character in the name, with a few caveats and exceptions.
Ideally we would like to allow as wide a range of names as possible, while also restricting ourselves to names that any current and future backend can safely support. These are conflicting goals. Perhaps we should use the same approach as the S3 documentation: define and recommend 'safe characters' that must work with any backend, while allowing almost all characters.
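A minimal sketch of what a 'safe characters' check could look like. The function name `IsSafeName` is hypothetical, and the safe set used here is an assumption: the characters simpleblob currently allows (alphanumerics, ".", "-", "_") plus "/" for prefixes. The real set would need to be agreed on.

```go
package main

import "fmt"

// IsSafeName reports whether name consists only of characters from the
// assumed safe set: ASCII letters, ASCII digits, '.', '-', '_' and '/'.
// Both the function name and the exact set are illustrative assumptions.
func IsSafeName(name string) bool {
	if name == "" {
		return false
	}
	for _, r := range name {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9':
			// ASCII alphanumeric: safe
		case r == '.', r == '-', r == '_', r == '/':
			// allowed punctuation: safe
		default:
			return false
		}
	}
	return true
}

func main() {
	for _, name := range []string{"v5/zones.json", "naïve", "a b"} {
		fmt.Printf("%q safe: %v\n", name, IsSafeName(name))
	}
}
```

Names outside the safe set would still be allowed under this proposal; the check only tells a caller whether a name is guaranteed to work unmodified on every backend.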
Potential solution
- Allow almost all UTF-8 strings
- But reject non-canonical and non-local paths like `../../foo///bar`
- Perform escaping in the `fs` backend to make the allowed names safe
Escaping and validation algorithm for `fs`:
- Reject any backslash `\` in the name
- Call `path.Clean(name)` to check for non-canonical paths. Reject if the output differs from the input.
- Reject if the name starts with `..`, because it could be something like `../../../etc/passwd` (`path.Clean` will not touch this)
- Call `url.QueryEscape(name)` to escape unsafe characters
- If the name starts with a `.`, replace that character by `%2E` to avoid hidden files on UNIX
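
The steps above can be sketched in Go roughly as follows. This is an illustration, not a final implementation: `ValidateName`, `EscapeName` and `ErrInvalidName` are hypothetical names, and the error messages are placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"path"
	"strings"
)

// ErrInvalidName is a hypothetical sentinel error for rejected names.
var ErrInvalidName = errors.New("invalid blob name")

// ValidateName applies the validation steps from the list above.
func ValidateName(name string) error {
	// Step 1: reject any backslash.
	if strings.Contains(name, `\`) {
		return fmt.Errorf("%w: contains backslash", ErrInvalidName)
	}
	// Step 2: reject non-canonical paths (repeated slashes, "./", trailing "/", empty name).
	if path.Clean(name) != name {
		return fmt.Errorf("%w: not in canonical form", ErrInvalidName)
	}
	// Step 3: path.Clean leaves a leading ".." in place, so check for it explicitly.
	if strings.HasPrefix(name, "..") {
		return fmt.Errorf("%w: starts with \"..\"", ErrInvalidName)
	}
	return nil
}

// EscapeName turns an already validated name into a single safe filename.
func EscapeName(name string) string {
	// Step 4: escape unsafe characters, including "/" (which becomes "%2F").
	escaped := url.QueryEscape(name)
	// Step 5: url.QueryEscape leaves "." untouched, so escape a leading dot by hand.
	if strings.HasPrefix(escaped, ".") {
		escaped = "%2E" + escaped[1:]
	}
	return escaped
}

func main() {
	for _, name := range []string{"v5/zones.json", ".hidden", "../../foo///bar", `a\b`} {
		if err := ValidateName(name); err != nil {
			fmt.Printf("%q rejected: %v\n", name, err)
			continue
		}
		fmt.Printf("%q -> %q\n", name, EscapeName(name))
	}
}
```

With this sketch, `v5/zones.json` is stored as `v5%2Fzones.json` and `.hidden` as `%2Ehidden`, while the other two names are rejected. Note that `url.QueryEscape` encodes a space as `+`, which is harmless for filenames but worth knowing when reading the directory by hand.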
The validation and escaping functions can be exposed for other backends to reuse. The validation function should be called by every backend; the escaping function is optional.
There is another issue with special reserved device names on Windows. Go 1.20 introduces a new `IsLocal` function to check for these, but I don't think we want to depend on this, and it's only available in `filepath`. Perhaps always prepend `_` to the filename to avoid this? This would also solve the UNIX hidden files issue, but it would be a breaking change, and it could be useful for the fs backend to produce unescaped files when restricting oneself to safe characters.
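
For comparison, a rough sketch of the "always prepend `_`" variant. `ToFilename` and `FromFilename` are hypothetical names, and combining the prefix with `url.QueryEscape` is an assumption carried over from the algorithm above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// filePrefix is prepended to every stored filename (hypothetical constant).
const filePrefix = "_"

// ToFilename escapes a validated blob name and prepends the prefix, so the
// result never starts with "." and never equals a bare device name like "NUL".
func ToFilename(name string) string {
	return filePrefix + url.QueryEscape(name)
}

// FromFilename reverses ToFilename. Files without the prefix are reported as
// errors, which is where the breaking change for existing stores shows up.
func FromFilename(filename string) (string, error) {
	if !strings.HasPrefix(filename, filePrefix) {
		return "", fmt.Errorf("filename %q lacks %q prefix", filename, filePrefix)
	}
	return url.QueryUnescape(strings.TrimPrefix(filename, filePrefix))
}

func main() {
	for _, name := range []string{"NUL", ".hidden", "v5/zones.json"} {
		filename := ToFilename(name)
		back, err := FromFilename(filename)
		fmt.Printf("%q -> %q -> %q (err: %v)\n", name, filename, back, err)
	}
}
```

The trade-off is visible here: every file on disk gets a prefix, so even names made only of safe characters no longer map to identical filenames.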