Too many open files when mapping more than 509 pages #1

@trotsiuk

It is a great pleasure working with warc; however, I'm getting an error when mapping over a larger number of entries. It seems that the connections to the files are not being closed. Please find a minimal reproducible example below:

library(warc)
library(tidyverse)

# download the Common Crawl example file if it does not exist
# (mustWork = FALSE avoids a warning while the file is still missing)
warc_big <- normalizePath("~/cc.warc.gz", mustWork = FALSE)
if(!file.exists(warc_big)){
  download.file(
    "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/warc/CC-MAIN-20161202170900-00000-ip-10-31-129-80.ec2.internal.warc.gz",
    warc_big
  )
}

# create the index if it does not exist
warc_cdx <- normalizePath("~/cc.cdx", mustWork = FALSE)
if(!file.exists(warc_cdx)){
  create_cdx(
    warc_big,
    cdx_path = warc_cdx
  )
}

# read the index and map over the data
cdx <- read_cdx(warc_cdx)

# this works
sites <- map(1:100,
             ~read_warc_entry(file.path(cdx$warc_path[.],
                                        cdx$file_name[.]), 
                              cdx$compressed_arc_file_offset[.]))                     

# this fails
sites_large <- map(1:1000,
                   ~read_warc_entry(file.path(cdx$warc_path[.],
                                              cdx$file_name[.]),
                                    cdx$compressed_arc_file_offset[.]))
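
To pin down the exact entry at which reads start to fail, a small helper along these lines could be used. This is only a sketch: `first_failing_index` and `fake_reader` are hypothetical names that are not part of warc, and the stand-in reader simply mimics a failure after 509 successful calls.

```r
# Hypothetical helper: call `reader(i)` for i = 1..n and return the first
# index whose call raises an error, or NA if every call succeeds.
first_failing_index <- function(n, reader) {
  for (i in seq_len(n)) {
    ok <- tryCatch({
      reader(i)
      TRUE
    }, error = function(e) FALSE)
    if (!ok) return(i)
  }
  NA_integer_
}

# Stand-in reader (an assumption for illustration): succeeds for the first
# 509 calls and raises an error afterwards.
fake_reader <- function(i) {
  if (i > 509) stop("Too many open files")
  i
}

first_failing_index(1000, fake_reader)
```

With the real data, `reader` would be a function wrapping the `read_warc_entry()` call from the example above, for instance `function(i) read_warc_entry(file.path(cdx$warc_path[i], cdx$file_name[i]), cdx$compressed_arc_file_offset[i])`.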

The error I'm receiving is the following:

Using the hard way
7593104
Error in gz_open(wf, "read") : object 'wf' not found

And when I then try to perform other operations, I get:

> ?read_cdx
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
  In gzfile(file, "rb") :
  cannot open compressed file 'C:/Program Files/R/R-3.4.1/library/reshape2/Meta/package.rds', probable reason 'Too many open files'
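
The 'Too many open files' warning is consistent with a file handle being opened on every call and never released. For comparison only, the sketch below shows the usual base R pattern for avoiding this with ordinary connections: pair every `file()` with `on.exit(close(con))`. `read_first_line` is a hypothetical helper and says nothing about how warc opens files internally; its handles may live at the C level and may not appear in `showConnections()` at all.

```r
# Hypothetical helper, not part of warc: read the first line of a file and
# always release the connection, even if readLines() raises an error.
read_first_line <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con), add = TRUE)
  readLines(con, n = 1L, warn = FALSE)
}

# Small temporary file to read from repeatedly.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first", "second"), tmp)

# Count user-level connections before and after 1000 reads.
open_before <- nrow(showConnections())
lines <- vapply(1:1000, function(i) read_first_line(tmp), character(1))
open_after <- nrow(showConnections())

unlink(tmp)
```

Because each call closes its own connection on exit, the number of open connections is the same before and after the loop. Dropping the `on.exit()` line would leave one connection open per call.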

Session info:

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2    dplyr_0.7.2     purrr_0.2.2.2   readr_1.1.1     tidyr_0.6.3     tibble_1.3.3    ggplot2_2.2.1   tidyverse_1.1.1
[9] warc_0.1.0     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12     cellranger_1.1.0 compiler_3.4.1   plyr_1.8.4       bindr_0.1        forcats_0.2.0    tools_3.4.1     
 [8] uuid_0.1-2       lubridate_1.6.0  jsonlite_1.5     nlme_3.1-131     gtable_0.2.0     lattice_0.20-35  pkgconfig_2.0.1 
[15] rlang_0.1.1      psych_1.7.5      parallel_3.4.1   haven_1.1.0      xml2_1.1.1       httr_1.2.1       stringr_1.2.0   
[22] hms_0.3          grid_3.4.1       glue_1.1.1       R6_2.2.2         readxl_1.0.0     foreign_0.8-69   modelr_0.1.0    
[29] reshape2_1.4.2   magrittr_1.5     scales_0.4.1     rvest_0.3.2      assertthat_0.2.0 mnormt_1.5-5     colorspace_1.3-2
[36] stringi_1.1.5    lazyeval_0.2.0   munsell_0.4.3    broom_0.4.2   

Thanks in advance
