Speed up is_dir on entries returned by iterdir and glob #176

Open
@analog-cbarber

The is_dir check is fairly expensive. However, at least for S3 and Azure, when entries are created by the client's _list_dir method, the listing already tells you whether each entry is a directory or a file, so the result can be set immediately on the created CloudPath instance.

For example, in S3Client._list_dir you could write something like:

            paginator = self.client.get_paginator("list_objects_v2")

            for result in paginator.paginate(
                Bucket=cloud_path.bucket, Prefix=prefix, Delimiter="/", MaxKeys=1000
            ):

                # sub directory names
                for result_prefix in result.get("CommonPrefixes", []):
                    path = S3Path(f"s3://{cloud_path.bucket}/{result_prefix.get('Prefix')}")
                    path._is_dir = True
                    yield path

                # files in the directory
                for result_key in result.get("Contents", []):
                    path = S3Path(f"s3://{cloud_path.bucket}/{result_key.get('Key')}")
                    path._is_dir = False
                    yield path

and modify S3Path.is_dir to use the cached value (this assumes _is_dir is initialized to None when the path is constructed):

    def is_dir(self) -> bool:
        if self._is_dir is None:
            self._is_dir = self.client._is_file_or_dir(self) == "dir"
        return self._is_dir

This makes a HUGE performance difference if you need to call is_dir on the entries returned from iterdir or glob (in my case, when implementing a file dialog that works for cloud paths).
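
To make the saving concrete, here is a minimal self-contained sketch of the same caching idea. FakeClient, FakePath, and the lookups counter are hypothetical stand-ins invented for illustration, not cloudpathlib classes; only the _is_dir attribute and the _list_dir / _is_file_or_dir method names mirror the snippets above.

```python
from typing import Dict, Iterator, Optional


class FakeClient:
    """Hypothetical in-memory stand-in for a cloud client (not the real API)."""

    def __init__(self, entries: Dict[str, bool]) -> None:
        # Maps a key to True (directory) or False (file).
        self._entries = entries
        self.lookups = 0  # counts simulated per-path round trips

    def _is_file_or_dir(self, path: "FakePath") -> Optional[str]:
        # Stands in for the expensive per-path check.
        self.lookups += 1
        is_dir = self._entries.get(path.key)
        if is_dir is None:
            return None
        return "dir" if is_dir else "file"

    def _list_dir(self, path: "FakePath") -> Iterator["FakePath"]:
        # The listing already knows each entry's kind, so record it on the child.
        prefix = path.key.rstrip("/") + "/"
        for key, is_dir in self._entries.items():
            rest = key[len(prefix):]
            if key.startswith(prefix) and rest and "/" not in rest:
                child = FakePath(key, self)
                child._is_dir = is_dir
                yield child


class FakePath:
    """Hypothetical path type that caches the result of is_dir."""

    def __init__(self, key: str, client: FakeClient) -> None:
        self.key = key
        self.client = client
        self._is_dir: Optional[bool] = None  # None means "not known yet"

    def is_dir(self) -> bool:
        # Fall back to the expensive check only when nothing was cached.
        if self._is_dir is None:
            self._is_dir = self.client._is_file_or_dir(self) == "dir"
        return self._is_dir

    def iterdir(self) -> Iterator["FakePath"]:
        return self.client._list_dir(self)


client = FakeClient({"root/a": True, "root/b.txt": False, "root/a/c.txt": False})
root = FakePath("root", client)

# Entries that come from a listing answer is_dir without any lookup.
flags = {p.key: p.is_dir() for p in root.iterdir()}
lookups_after_listing = client.lookups

# A path built directly pays for one lookup, then reuses the cached answer.
direct = FakePath("root/a", client)
direct_is_dir = direct.is_dir() and direct.is_dir()
lookups_after_direct = client.lookups
```

One trade-off of this sketch: the cached value can go stale if the object is deleted or replaced after listing, so a real implementation might need a way to invalidate it.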

Not sure if this particular implementation is the best way to do this, but something like this is needed.
