ddf._meta_nonempty doesn't instantiate correctly when calling from_dask_dataframe #286

Open
@taneugene

When I load a csv into a dask dataframe first, and then convert it to a dask_geopandas dataframe using .from_dask_dataframe, ._meta_nonempty does not exist, which causes downstream problems in analysis (e.g. with spatial_shuffle). My hackish solution below takes the head, uses from_geopandas to get the meta, and then replaces the meta in the original. It would be nice to make this just work directly! Not sure if it replicates for other people.

import dask.dataframe as dd
import dask_geopandas as dg
import geopandas as gpd
import shapely.wkt

# Load a csv file
df = dd.read_csv(
    fname,
    dtype={'longitude': float, 'latitude': float, 'geometry': 'object'},
).repartition(npartitions=njobs)  # njobs is the number of workers I have

# Translate the WKT strings to geometries using shapely
df['geometry'] = df.geometry.map(shapely.wkt.loads, meta=('geometry', 'object'))

# Create a tmp dataframe using a GeoDataFrame and from_geopandas
tmp = dg.from_geopandas(
    gpd.GeoDataFrame(df.head(), geometry='geometry', crs='EPSG:4326'),
    npartitions=1,
)

# Now create the dask_geopandas df
df = dg.from_dask_dataframe(df)

# Need to set metadata here, otherwise spatial_shuffle won't run.
df._meta = tmp.compute()
df = df.spatial_shuffle()
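
For anyone trying to replicate this, a small diagnostic helper might make it easier to compare results. The sketch below is not part of dask or dask_geopandas; describe_meta is a hypothetical name, and it only assumes that the collection exposes the _meta and _meta_nonempty attributes mentioned above. It uses plain getattr, so it should not depend on any particular library version.

```python
def describe_meta(ddf):
    """Report which metadata attributes a dask-style collection exposes.

    Hypothetical helper for debugging: it only looks up the two attribute
    names discussed in this issue and records the type of each value, or
    the error raised while accessing it.
    """
    report = {"collection_type": type(ddf).__name__}
    for attr in ("_meta", "_meta_nonempty"):
        try:
            value = getattr(ddf, attr)
        except Exception as exc:  # property lookups can raise more than AttributeError
            report[attr] = f"error: {type(exc).__name__}: {exc}"
        else:
            report[attr] = type(value).__name__
    return report
```

Calling describe_meta(df) right after from_dask_dataframe, and again after replacing df._meta, should show whether the metadata type changes between the two steps.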
