Description
Some libraries, such as polars and pandas, have an almost seamless method for interacting with cloud storage paths, e.g.:
import polars as pl
pl.scan_csv('az://container/path/to/file.csv', storage_options={'account_name': 'mystorageaccount'}).collect()
This is nice because I don't need to import any other libraries, set up credentials or blob clients, etc.
It automatically finds any available credentials in my local environment, presumably with something like DefaultAzureCredential.
This means that when testing locally, I just need to be authenticated with Azure CLI, and everything just works.
I don't even need to manually specify environment variables.
It also means that I can deploy the same code to the server, and it will automatically find the appropriate environment variables to authenticate as a service principal with AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, etc.
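As a quick local diagnostic, the sketch below reports which service principal variables are missing from the environment. It assumes the client-secret variant of service principal authentication, which as far as I know also needs AZURE_TENANT_ID. The helper name is made up for illustration and is not part of any Azure library.

```python
import os

# Variables used for client-secret service principal auth (assumption:
# certificate-based auth relies on a different set of variables).
SERVICE_PRINCIPAL_VARS = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")


def missing_service_principal_vars(env=None):
    """Return the names of service principal variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in SERVICE_PRINCIPAL_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_service_principal_vars()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("Service principal variables are set.")
```

If any of these are missing, a credential chain like DefaultAzureCredential would have to fall back to another source, such as the Azure CLI login.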
I may have missed something, but it seems that cloudpathlib has not enabled this kind of automatic credential detection with DefaultAzureCredential. Instead, I need to do the following to get an authenticated, working CloudPath:
from azure.identity import DefaultAzureCredential
from cloudpathlib import CloudPath, AzureBlobClient
credential = DefaultAzureCredential()
client = AzureBlobClient(account_url="https://mystorageaccount.blob.core.windows.net", credential=credential)
path = CloudPath('az://container/path/to/file.csv', client=client)
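As a stopgap, the boilerplate can be wrapped in a small helper. This is only a sketch of a workaround, not a proposal for the cloudpathlib API: account_url_for and default_credential_client are hypothetical names, and the URL format assumes the public Azure cloud endpoint.

```python
def account_url_for(account_name: str) -> str:
    """Build the blob endpoint URL (assumes the public Azure cloud)."""
    return f"https://{account_name}.blob.core.windows.net"


def default_credential_client(account_name: str):
    """Return an AzureBlobClient authenticated via DefaultAzureCredential."""
    # Imported lazily so the URL helper works without the Azure packages.
    from azure.identity import DefaultAzureCredential
    from cloudpathlib import AzureBlobClient

    return AzureBlobClient(
        account_url=account_url_for(account_name),
        credential=DefaultAzureCredential(),
    )


if __name__ == "__main__":
    from cloudpathlib import CloudPath

    client = default_credential_client("mystorageaccount")
    # Register the client so later CloudPath("az://...") calls pick it up.
    client.set_as_default_client()
    path = CloudPath("az://container/path/to/file.csv")
```

Registering the client once with set_as_default_client() avoids passing client= on every path, but the account name and credential still have to be wired up by hand.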
Ideally, it would be nice to be able to do the setup automatically.
I'm imagining the following future state:
from cloudpathlib import CloudPath
path = CloudPath('az://container/path/to/file.csv', storage_options={'account_name': 'mystorageaccount'})
(There may be a nicer way to specify the account name. I'm just copying the API from polars and pandas here. I kind of wish that it was standard to include the account name in the path somehow, as passing the account name in separately feels clunky to me. It would be nice if we could use az://mystorageaccount/container/...)
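To make the account-in-path idea concrete, here is a minimal sketch of how such a URL could be split. The az://account/container/blob layout is hypothetical and is not something cloudpathlib, polars, or pandas supports today. split_account_url is an illustrative name.

```python
from urllib.parse import urlsplit


def split_account_url(url: str):
    """Split a hypothetical az://<account>/<container>/<blob> URL into parts."""
    parts = urlsplit(url)
    if parts.scheme != "az":
        raise ValueError(f"expected an az:// URL, got {url!r}")
    account = parts.netloc
    # The first path segment is the container; the remainder is the blob name.
    container, _, blob = parts.path.lstrip("/").partition("/")
    if not account or not container:
        raise ValueError(f"missing account or container in {url!r}")
    return account, container, blob


if __name__ == "__main__":
    print(split_account_url("az://mystorageaccount/container/path/to/file.csv"))
```

One caveat: this reading conflicts with the existing az://container/... convention, so the two forms could not be told apart from the URL alone.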
See the documentation for DefaultAzureCredential (there's a reason it's called Default!):
- DefaultAzureCredential documentation
- Azure Storage DefaultAzureCredential examples
- Azure Identity Overview (including DefaultAzureCredential examples)
Note: If you are using fsspec + adlfs, adlfs requires the storage option anon=False to be set to enable DefaultAzureCredential.
For example, when using pandas, you must specify storage_options={'anon': False}.
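A full pandas call might look like the sketch below. It reuses the account name and path from the earlier examples. The two helper functions are illustrative names, and running the read itself needs pandas, fsspec, and adlfs installed plus valid Azure credentials.

```python
def adlfs_storage_options(account_name: str) -> dict:
    """Storage options that ask adlfs to use DefaultAzureCredential."""
    return {"account_name": account_name, "anon": False}


def read_blob_csv(url: str, account_name: str):
    """Read a CSV from Azure Blob Storage with pandas via fsspec + adlfs."""
    # Imported lazily; needs pandas, fsspec and adlfs installed.
    import pandas as pd

    return pd.read_csv(url, storage_options=adlfs_storage_options(account_name))


if __name__ == "__main__":
    df = read_blob_csv("az://container/path/to/file.csv", "mystorageaccount")
    print(df.head())
```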
When using fsspec directly, you need to pass it as follows:
import fsspec
fs = fsspec.filesystem('az', account_name='mystorageaccount', anon=False)
For more details, see:
https://github.com/fsspec/adlfs#setting-credentials