Closed
Description
I'm trying to write a large file (300 to 600 MB as .Rds) to disk. It saved in about 5 minutes in the .Rds format and took around 10 minutes to read in from a set of compressed .gml files using this mini package developed for the job: https://github.com/ITSLeeds/mastermapr
sf::write_sf(mm_highway_uk, "destination.gpkg")
This has been running for over an hour now and I am wondering if it will ever finish! I know this is likely to be an upstream issue with GDAL, but I'm raising it here in case others have had similar problems and in case it's of use. It's related to the wider question of which geographic file format to save data in.
This is my set-up:
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
Created on 2020-05-28 by the reprex package (v0.3.0)
Session info
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 3.6.3 (2020-02-29)
#> os Ubuntu 18.04.4 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language en_GB:en
#> collate en_GB.UTF-8
#> ctype en_GB.UTF-8
#> tz Europe/London
#> date 2020-05-28
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [2] CRAN (R 3.6.0)
#> backports 1.1.7 2020-05-13 [1] CRAN (R 3.6.3)
#> callr 3.4.3 2020-03-28 [1] CRAN (R 3.6.3)
#> class 7.3-17 2020-04-26 [2] CRAN (R 3.6.3)
#> classInt 0.4-3 2020-04-06 [1] Github (r-spatial/classInt@d024051)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.2)
#> crayon 1.3.4 2017-09-16 [2] standard (@1.3.4)
#> DBI 1.1.0 2019-12-15 [2] CRAN (R 3.6.2)
#> desc 1.2.0 2018-05-01 [2] standard (@1.2.0)
#> devtools 2.3.0 2020-04-10 [1] CRAN (R 3.6.3)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.2)
#> e1071 1.7-3 2019-11-26 [2] CRAN (R 3.6.1)
#> ellipsis 0.3.1 2020-05-15 [3] CRAN (R 3.6.3)
#> evaluate 0.14 2019-05-28 [2] CRAN (R 3.6.0)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.2)
#> fs 1.4.1 2020-04-04 [2] CRAN (R 3.6.3)
#> glue 1.4.1 2020-05-13 [2] CRAN (R 3.6.3)
#> highr 0.8 2019-03-20 [3] CRAN (R 3.5.3)
#> htmltools 0.4.0.9003 2020-04-09 [1] Github (rstudio/htmltools@1a7d0dc)
#> KernSmooth 2.23-17 2020-04-26 [4] CRAN (R 3.6.3)
#> knitr 1.28 2020-02-06 [1] CRAN (R 3.6.2)
#> magrittr 1.5 2014-11-22 [2] CRAN (R 3.5.2)
#> memoise 1.1.0 2017-04-21 [3] CRAN (R 3.5.0)
#> pkgbuild 1.0.8 2020-05-07 [1] CRAN (R 3.6.3)
#> pkgload 1.0.2 2018-10-29 [3] CRAN (R 3.5.1)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.2)
#> processx 3.4.2 2020-02-09 [1] CRAN (R 3.6.3)
#> ps 1.3.3 2020-05-08 [1] CRAN (R 3.6.3)
#> R6 2.4.1 2019-11-12 [2] CRAN (R 3.6.1)
#> Rcpp 1.0.4.6 2020-04-09 [1] CRAN (R 3.6.3)
#> remotes 2.1.1 2020-02-15 [1] CRAN (R 3.6.2)
#> rlang 0.4.6.9000 2020-05-05 [1] Github (r-lib/rlang@4bea875)
#> rmarkdown 2.1.2 2020-04-09 [1] Github (rstudio/rmarkdown@65dd144)
#> rprojroot 1.3-2 2018-01-03 [2] CRAN (R 3.5.3)
#> rstudioapi 0.11 2020-02-07 [2] CRAN (R 3.6.2)
#> sessioninfo 1.1.1 2018-11-05 [3] CRAN (R 3.5.1)
#> sf * 0.9-3 2020-05-04 [1] CRAN (R 3.6.3)
#> stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.2)
#> stringr 1.4.0 2019-02-10 [2] standard (@1.4.0)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 3.6.3)
#> units 0.6-6 2020-03-16 [1] CRAN (R 3.6.3)
#> usethis 1.6.1 2020-04-29 [1] CRAN (R 3.6.3)
#> withr 2.2.0 2020-04-20 [2] CRAN (R 3.6.3)
#> xfun 0.14 2020-05-20 [1] CRAN (R 3.6.3)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.2)
#>
#> [1] /home/robin/R/x86_64-pc-linux-gnu-library/3.6
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
Activity
edzer commented on May 28, 2020
Have you tried with the layer creation option SPATIAL_INDEX set to NO?
Robinlovelace commented on May 28, 2020
No. Will try now and aim to put in a PR documenting that feature if it works. Many thanks for fast reply!
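For reference, a minimal sketch of how the layer creation option can be passed from R via the layer_options argument of st_write() / write_sf(). The data here is the North Carolina sample shipped with sf, used only as a stand-in (an assumption, not the dataset from this issue), so the timings say nothing about the large-file case.

```r
library(sf)

# Stand-in data: the nc shapefile bundled with sf (assumption for illustration).
nc <- read_sf(system.file("shape/nc.shp", package = "sf"))

with_index <- tempfile(fileext = ".gpkg")
without_index <- tempfile(fileext = ".gpkg")

# Default behaviour: the GPKG driver creates a spatial index for the layer.
t_with <- system.time(write_sf(nc, with_index))

# Layer creation option handed through to GDAL: skip the spatial index.
t_without <- system.time(
  write_sf(nc, without_index, layer_options = "SPATIAL_INDEX=NO")
)

# Elapsed seconds for each variant.
c(with_index = t_with[["elapsed"]], without_index = t_without[["elapsed"]])
```

On a dataset this small the difference is likely to be lost in noise; the comparison only becomes informative on larger inputs.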
Robinlovelace commented on May 30, 2020
I gave it a go on a smaller dataset (61k vs ~6m rows) and the spatial index seemed to make it a bit faster. Assuming the impact of that option increases with dataset size that could solve it (gave up trying the other day):
Created on 2020-05-30 by the reprex package (v0.3.0)
Robinlovelace commented on May 30, 2020
Update: building on the previous example I explored the impact of the layer option on different sized datasets, no clear finding:
Robinlovelace commented on May 30, 2020
Trying on the full dataset, which takes over a minute to load as an .Rds file:
Waiting for results...
Robinlovelace commented on May 30, 2020
Seems that the relative speed-up associated with SPATIAL_INDEX=NO may increase with dataset size:
Robinlovelace commented on May 30, 2020
Final benchmark on 10% sample:
I get:
So writing to .Rds is about 70 and 50 times faster than writing to .gpkg with and without the spatial index, respectively, from R on my computer. I will try writing this same 10% sample with QGIS as a test. I'm also tempted to try .shp as an output and to upgrade to GDAL 3.1.0 for FlatGeobuf outputs.
Robinlovelace commented on May 30, 2020
Test results from QGIS: it saved the object as a .gpkg file with a spatial index in 18 seconds, around the same impressive write speed as saving as an .Rds file.
Without the spatial index the same object was written by QGIS in 12s, around 80 times faster than in R.
Robinlovelace commented on Jun 1, 2020
Minor update on this: I left it running over the weekend and 33.5 hours later the file still hasn't finished writing. The output file is still growing in size, currently it is:
bytes. A few minutes later it is 1801400320 bytes. I think something strange is going on with the memory allocation, which fluctuates by several GB every few seconds as shown in the .gif of the system monitor below:
If you'd like any further info on this let me know. I'm not sure if this issue is specific to the dataset I have, which has many variables and XYZ geometry. I can share a sample securely if need be, but my guess is that this isn't dataset specific. Happy to provide further details and tests to support development of R so its I/O capabilities for spatial data are comparable with desktop GIS.
Jo-Schie commented on Mar 2, 2022
I can confirm this issue. Other file types are also affected (e.g. geojson). I tried to explore the issue a little and noticed that the problem (in my case) was writing logical columns from an object of class sf and data.frame to disk. A quick fix for me was to convert logical to e.g. 1/0 dummy coding (see code below). Not sure if this helps you to further nail down the problem, but here is some code that is hopefully reproducible:
barryrowlingson commented on Feb 7, 2023
Whatever is causing this is in the C(++?) code. I just did some R profiling and 98% of the time in my tests was in the CPL_write_ogr function, which is .Call("_sf_CPL_write_ogr",... .
Test code attached: sp.txt
Usage:
returns a data frame of timings, number of rows, and logical, which indicates whether the data was written as logical or numeric, e.g.:
Feed it into ggplot if you want to plot it and see the difference.
If I knew how to profile C++ code within R I'd go deeper...
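The logical-to-integer workaround mentioned earlier in the thread can be sketched in base R as below. The helper name logical_to_integer is hypothetical (it is not part of sf), and the example runs on a plain data.frame; since sf objects are also data frames, the same column-wise approach should apply to them, though that is untested here.

```r
# Hypothetical helper: convert every logical column to integer
# (TRUE -> 1L, FALSE -> 0L, NA stays NA); other columns are left untouched.
logical_to_integer <- function(x) {
  is_lgl <- vapply(x, is.logical, logical(1))
  for (nm in names(x)[is_lgl]) {
    x[[nm]] <- as.integer(x[[nm]])
  }
  x
}

# Small illustrative data frame (made up for this sketch).
df <- data.frame(id = 1:3, flag = c(TRUE, FALSE, NA))
out <- logical_to_integer(df)
str(out)
```

With an sf object x, the result could then be passed to the writer, e.g. sf::write_sf(logical_to_integer(x), "out.gpkg"), to compare timings against writing x directly.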
rsbivand commented on Feb 7, 2023
These are points, so see #2059 and maybe try the pointx branch? Or #2036 for a different take using GDAL-devel?
edzer commented on Feb 7, 2023
with pointx branch:
kadyb commented on Feb 7, 2023
Out of curiosity, I also checked {terra} and it seems there is no overhead for the logical type.