Description
I have a list of GeoPackages, one per urban area, that I need to read into a dask_geopandas.GeoDataFrame. Since the files are already essentially spatially partitioned, the optimal way would be to read each one directly as a chunk. Currently I have to read them one by one via GeoPandas, concatenate them, and then create the dask_geopandas.GeoDataFrame from the resulting geopandas.GeoDataFrame, which loses the spatial partitioning.
For cases like this, it may be useful to have a dask_geopandas.read_files(list) function that would call geopandas.read_file for each chunk and create a chunked GeoDataFrame directly. It would be helpful to be able to pass either a list or a path to a folder (as we do with Parquet), since a list lets you specify a path inside a zip archive, for example (my case).
This is the code I am currently using:
import dask_geopandas
import geopandas as gpd
import pandas as pd

paths = ["foo/bar/one.zip!data/file.gpkg", "foo/bar/two.zip!data/file.gpkg"]

gdfs = []
for file in paths:
    gdf = gpd.read_file(file)
    gdfs.append(gdf)

gdf = pd.concat(gdfs)
ddf = dask_geopandas.from_geopandas(gdf, npartitions=2)  # non-spatial chunks
And this would be optimal:
paths = ["foo/bar/one.zip!data/file.gpkg", "foo/bar/two.zip!data/file.gpkg"]
ddf = dask_geopandas.read_files(paths) # one chunk per file
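In the meantime, something close to this can be approximated with dask.delayed and dask.dataframe.from_delayed. This is a minimal sketch, assuming that importing dask_geopandas registers the geopandas backend with dask (so from_delayed returns a dask_geopandas.GeoDataFrame) and that reading a single row per file is an acceptable way to build the meta; the read_files name here is just the hypothetical one proposed above:

import dask.dataframe as dd
import dask_geopandas  # assumed to register the GeoDataFrame backend with dask
import geopandas as gpd
from dask import delayed

def read_files(paths):
    # One delayed geopandas.read_file call per path -> one partition per file,
    # preserving the per-file spatial coherence of the source data.
    parts = [delayed(gpd.read_file)(path) for path in paths]
    # Build the meta (an empty frame with the right schema) by reading a single
    # row of the first file, so nothing large is loaded eagerly.
    meta = gpd.read_file(paths[0], rows=1).iloc[:0]
    return dd.from_delayed(parts, meta=meta)

paths = ["foo/bar/one.zip!data/file.gpkg", "foo/bar/two.zip!data/file.gpkg"]
ddf = read_files(paths)  # one partition per file

A built-in read_files could do the same internally and additionally compute the spatial partitioning information (e.g. the total bounds of each file), which this sketch does not attempt.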