
raster_to_grid not working when using retile with GeoTIFF file #558

@carlosg-m

Description

  • DBR 13.3 LTS ML (includes Apache Spark 3.4.1, Scala 2.12)
  • Standard_DS13_v2 (driver and 2 workers)
  • Photon disabled
  • databricks-mosaic 0.4.1
  • GDAL init script installed on the cluster
  • Dataset is a relatively large GeoTIFF raster file with one layer. Each pixel is a uint16 value representing a land use or land cover category (forest, industrial, agricultural, and so on). Out-of-bounds values are masked, represented by the largest uint16 value, 2^16 - 1 (65535).
  • Dataset source: https://geo2.dgterritorio.gov.pt/cosc/COSc2023.zip
  • Use case: efficiently read and process a raster file in a projected coordinate system, then convert it to a grid index so it can be intersected with points in a geographic coordinate system (WGS84).
  • The GeoTIFF file seems fine as far as I can tell; I've tested it with Rasterio and NumPy.
  • I'm trying to automate the process with Mosaic.
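
The masking convention described in the list above can be sketched without any geospatial library. This is only an illustration: the sample pixel values and the helper name `count_categories` are hypothetical, and only the sentinel 2^16 - 1 comes from the dataset description.

```python
from collections import Counter

# Largest value representable in uint16; the dataset description says
# masked (out-of-bounds) pixels carry this value.
NODATA = 2**16 - 1

def count_categories(pixels, nodata=NODATA):
    """Count land-cover categories, skipping masked pixels."""
    return Counter(v for v in pixels if v != nodata)

# Hypothetical pixel values, not taken from the real raster.
sample = [1, 1, 3, NODATA, 3, 3, NODATA]
counts = count_categories(sample)
print(counts)
```

Any per-category statistic computed on the grid would need the same filter, otherwise the sentinel shows up as a spurious category.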

This example is very slow and seems to suffer from heavy data skew (it gets stuck on the last task):

import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)

df = (
    mos.read().format("raster_to_grid")
    .option("resolution", "2")
    .load("dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif")
)
df.show()

When I try the "retile" option, it throws an error:

import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)

df = (
    mos.read().format("raster_to_grid")
    .option("resolution", "2")
    .option("retile", "true")
    .option("tileSize", "1000")
    .load("dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif")
)
df.show()

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 81.0 failed 4 times, most recent failure: Lost task 0.3 in stage 81.0 (TID 2434) (10.208.237.16 executor 4): com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif.
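
To make the intent of the retile option concrete: splitting a raster into fixed-size tiles amounts to enumerating pixel windows like the ones below. This is a sketch of the general idea only, not Mosaic's implementation. The function `tile_windows` and the raster dimensions are hypothetical, and it assumes the tile size is expressed in pixels.

```python
import math

def tile_windows(width, height, tile_size):
    """Yield (col_off, row_off, tile_width, tile_height) for each tile.

    Tiles on the right and bottom edges are clipped to the raster bounds.
    """
    for row_off in range(0, height, tile_size):
        for col_off in range(0, width, tile_size):
            yield (
                col_off,
                row_off,
                min(tile_size, width - col_off),
                min(tile_size, height - row_off),
            )

# Hypothetical raster dimensions, chosen only for illustration.
windows = list(tile_windows(width=2500, height=1200, tile_size=1000))
print(len(windows))   # number of tiles
print(windows[-1])    # the clipped bottom-right tile
```

With many small tiles the work can be spread across tasks instead of landing on a single one, which is why retiling looked like the natural fix for the skew seen in the first example.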

Don't take this the wrong way: it is a pleasure to work with Shapely/Pygeos/GeoPandas, and even Rasterio, together with Spark and Pandas UDFs. However, navigating Databricks-Mosaic has been an absolute pain (the same happened with Sedona and GeoSpark).
