Skip to content

Deprecate None for virtual chunk authorization #2194

@TomNicholas

Description

@TomNicholas

Problem

Currently when authorizing access to virtual chunk containers for a private bucket during Repository.open() these two kwargs values are equivalent:

authorize_virtual_chunk_access={"s3://my-private-bucket": None}
authorize_virtual_chunk_access={"s3://my-private-bucket/": icechunk.S3Credentials(from_env=True)}

but only the latter actually indicates clearly what you are doing: allowing Icechunk to look for local S3 credentials (which may of course be sensitive), which can then be used to read anything from the entirety of s3://my-private-bucket.

Given that the purpose of the authorize_virtual_chunk_access arg is to force users to explicitly opt-in to accessing potentially-sensitive storage locations for safety reasons, the auto-magicness of None seems undesirable.

It's also not ideal that None could be interpreted to mean "false" instead of the intended meaning here of "default". So a user might reasonably read the code above as meaning "do not authorize access to s3://my-private-bucket" when it actually means the opposite.

Consistency with other explicit API

We already have some existing API for being much more explicit about this. As well as icechunk.S3Credentials(from_env=True) we also have s3_anonymous_credentials(), allowing the (much safer) anon-only case to be clearly distinguished:

authorize_virtual_chunk_access={"s3://my-company-bucket": s3_anonymous_credentials()}

This kind of explicit sentinel is preferable, but not all our API forces you to use this explicit option, and some paths do not have a corresponding sentinel defined.

Proposed changes

I want the only way to authorize virtual chunk access to be to use an explicit sentinel. So I want None to never be an allowed value.

That would mean that all types of virtual chunk access required use of the correct sentinel object:

# anonymous S3 bucket
authorize_virtual_chunk_access={"s3://my-private-bucket/": icechunk.s3_anonymous_credentials()}

# private S3 bucket - credentials from env
authorize_virtual_chunk_access={"s3://my-private-bucket/": icechunk.S3Credentials(from_env=True)}

# private S3 bucket - explicit credentials
authorize_virtual_chunk_access={"s3://my-private-bucket/": icechunk.S3Credentials(access_key=...)}

# local filesystem - explicitly no credentials required
authorize_virtual_chunk_access={"file:///my-home-dir/": icechunk.credentials.LocalFilesystemAccess}  # a new sentinel type

# local filesystem - explicitly no credentials required
authorize_virtual_chunk_access={"http://some-endpoint/": icechunk.credentials.HttpAccess}  # a new sentinel type

Note the last two are new sentinel types, with names chosen to make it clear what is happening.

Deprecation cycle

Eventually we want passing authorize_virtual_chunk_access={"s3://priate-bucket/": None} to raise, but that's a breaking change for many virtual chunk users, so instead we do a deprecation cycle, where initially we merely warn on None, with the new sentinel suggested instead.

Overall the users still have to put in a similar amount of work before and after the change - they have to write store-specific boilerplate either way. The only loss of ergonomics is that technically None worked for various types of storage, whereas now their sentinel has to be storage-specific.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions