`raster_to_grid` not working when using `retile` with `GeoTIFF` file

- DBR 13.3 LTS ML (includes Apache Spark 3.4.1, Scala 2.12)
- Standard_DS13_v2 (driver and 2 workers)
- Photon disabled
- databricks-mosaic 0.4.1
- GDAL init script installed in cluster
- Dataset is a GeoTIFF relatively large raster file, it has one layer. Each pixel in uint16 dtype represents a category that describes the land use or land cover (forest, industrial, agricultural, and so on). Values out of bounds are masked (represented by the last integer 2^16 - 1)
- Dataset source: https://geo2.dgterritorio.gov.pt/cosc/COSc2023.zip
- Use case: need to efficiently read and process raster file represented in a projected coordinate system, convert to grid index to intersect with points represented in a geographic coordinate system (WGS84).
-  The GeoTiff file seems ok as far as I know, I've tested it with Rasterio and NumPy.
- I'm trying to automate the process with Mosaic.

**This example is very slow and seems to have a lot of data skew (it gets stuck on the last task):**
```
import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)

df = mos.read().format("raster_to_grid")  \
        .option("resolution", "2") \
        .load("dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif")
df.show()
```

**When trying to use "retile" option it throws an error:**
```
import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)

df = mos.read().format("raster_to_grid")  \
        .option("resolution", "2") \
        .option("retile", "true")\
        .option("tileSize", "1000")\
        .load("dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif")
df.show()
```

`
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 81.0 failed 4 times, most recent failure: Lost task 0.3 in stage 81.0 (TID 2434) (10.208.237.16 executor 4): com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif.
`

Don't take this the wrong way, it is a pleasure to work with Shapely/Pygeos/GeoPandas and even Rasterio together with Spark and Pandas UDFs, however it is being an absolute pain navigating through Databricks-Mosaic (the same happened with Sedona and GeoSpark).



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`raster_to_grid` not working when using `retile` with `GeoTIFF` file #558

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

raster_to_grid not working when using retile with GeoTIFF file #558

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`raster_to_grid` not working when using `retile` with `GeoTIFF` file #558