Skip to content

Memory consumption with collect() #434

Open
@larry77

Description

@larry77

Hello,
Please have a look at the reprex at the end of the file.
I have a seasoned laptop with 8Gb of RAM which runs debian stable.
When I carry out an aggregation and then a left join with arrow, I need to use some swap when I collect the result (a long dataframe), but I can run the computation.
Instead, if just run the part in duckplyr (commented out in the second part of the reprex), my memory is so insufficient that the laptop freezes. Can someone take a look into this? Is there a much higher memory consumption in duckplyr wrt arrow? Thanks a lot.

library(tidyverse)

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:lubridate':
#> 
#>     duration
#> The following object is masked from 'package:utils':
#> 
#>     timestamp


# Uncomment and run this only once
## dd <- tibble(x=1:100000000, y=rep(LETTERS[1:20], 5000000))


## write_csv(dd, "test.csv")




df2  <- open_dataset("test.csv",
  format = "csv",
  skip_rows = 0)

system.time({
    df_stat2 <- df2 |>
        group_by(y) |> 
        summarise(total=sum(x)) |>
        ungroup()

    df_out2 <- df2 |>
        left_join(y=df_stat2, by=c("y")) |>
        collect()

    
})
#>    user  system elapsed 
#>  36.892   5.439  26.582


df_out2|>glimpse()
#> Rows: 100,000,000
#> Columns: 3
#> $ x     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1…
#> $ y     <chr> "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",…
#> $ total <int64> 249999955000000, 249999960000000, 249999965000000, 24999997000…


## uncomment to run --this takes a lot of memory on my system

## library(duckplyr)


## df <- duck_csv("test.csv")

## system.time({
## df_stat <- df |>
##     summarise(total=sum(x), .by = y) 



## df_out <- df |>
##     left_join(y=df_stat,  by=c("y")) |>
##     collect()

## })


sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Debian GNU/Linux 12 (bookworm)
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Brussels
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] arrow_15.0.1    lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1  
#>  [5] dplyr_1.1.4     purrr_1.0.2     readr_2.1.5     tidyr_1.3.1    
#>  [9] tibble_3.2.1    ggplot2_3.5.1   tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] bit_4.0.5          gtable_0.3.5       compiler_4.4.2     reprex_2.1.0      
#>  [5] tidyselect_1.2.1   assertthat_0.2.1   scales_1.3.0       yaml_2.3.8        
#>  [9] fastmap_1.1.1      R6_2.5.1           generics_0.1.3     knitr_1.46        
#> [13] munsell_0.5.1      R.cache_0.16.0     pillar_1.10.1.9000 tzdb_0.4.0        
#> [17] R.utils_2.12.3     rlang_1.1.4        stringi_1.8.4      xfun_0.43         
#> [21] fs_1.6.4           bit64_4.0.5        timechange_0.3.0   cli_3.6.3         
#> [25] withr_3.0.0        magrittr_2.0.3     digest_0.6.35      grid_4.4.2        
#> [29] hms_1.1.3          lifecycle_1.0.4    R.methodsS3_1.8.2  R.oo_1.26.0       
#> [33] vctrs_0.6.5        evaluate_0.23      glue_1.7.0         styler_1.10.3     
#> [37] colorspace_2.1-0   rmarkdown_2.26     tools_4.4.2        pkgconfig_2.0.3   
#> [41] htmltools_0.5.8.1

Created on 2025-01-01 with reprex v2.1.0

Metadata

Metadata

Assignees

No one assigned

    Labels

    duckdb 🦆Issues where work in the duckb package is needed

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions