claimed parsing issues don't flow through purrr::map() #1571

@twest820

Guidance in #1472 is to read heterogeneous data with purrr::map(). However, with at least some datasets, this approach either blocks vroom's mechanisms for reporting parsing issues or produces spurious warnings about parse errors.

This makes it hard to tell whether any file in a set of hundreds actually has an issue. It'd therefore be helpful to have a mechanism that passes the problems through and indicates which files they occurred in.

# download the four quarters of 2018 data from https://www.backblaze.com/cloud-storage/resources/hard-drive-test-data and extract the 365 .csv files
# .csv files for 2013-2017 are homogeneous and don't raise a warning, 2019 is also heterogeneous
library(readr)
yearFiles = list.files("2018", "\\.csv$", full.names = TRUE) # pattern is a regular expression, not a glob
yearData = purrr::map(yearFiles, \(file) read_csv(file, col_types = cols_only(date = "D", serial_number = "c", model = "c", failure = "l"), col_select = c("date", "serial_number", "model", "failure")))
# Warning message:
# One or more parsing issues, call `problems()` on your data frame for details, e.g.:
#   dat <- vroom(...)
#   problems(dat) 

problems(yearData) # doesn't print anything, not surprising since yearData's a list
lapply(yearData, problems) # returns only empty tibbles
for (index in seq_along(yearData))
{
  problems(yearData[[index]]) # doesn't print anything for any day of the year
}

# also produces a warning message but no problems are printed
for (index in seq_along(yearFiles))
{
  dayData = read_csv(yearFiles[index], col_types = cols_only(date = "D", serial_number = "c", model = "c", failure = "l"), col_select = c("date", "serial_number", "model", "failure"))
  problems(dayData)
}

It'd probably be good to also support problem flow through bind_rows(purrr::map()) as I suspect that's a common pattern.
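
For illustration, the kind of per-file problem collection being asked for might look roughly like the sketch below. It uses only existing readr, purrr, and dplyr functions, but the temporary files and the names `all_data`, `all_problems`, and `source_file` are made up for this example. It has not been run against the Backblaze data, so if problems() returns empty tibbles there, as reported above, this will return an empty result too.

```r
library(readr)
library(purrr)
library(dplyr)

# Two small temporary files stand in for the daily .csv files; the second
# contains a value that cannot be parsed as a date.
dir <- tempfile("problems-demo")
dir.create(dir)
writeLines(c("date,model", "2018-01-01,A"), file.path(dir, "day1.csv"))
writeLines(c("date,model", "not-a-date,B"), file.path(dir, "day2.csv"))
files <- list.files(dir, "\\.csv$", full.names = TRUE)

# Read every file into a list named by its path. The per-file warning is
# silenced because the problems are collected explicitly below.
all_data <- map(set_names(files), \(file) {
  suppressWarnings(
    read_csv(file, col_types = cols(date = "D", model = "c"))
  )
})

# Query each data frame separately, then stack the results. `.id` records
# which list element (here: which file path) each row came from.
all_problems <- bind_rows(map(all_data, problems), .id = "source_file")
print(all_problems)
```

The result is stored and printed explicitly because R does not auto-print the value of an expression evaluated inside a for loop or a function body. Note that calling bind_rows() on the data itself would yield a new data frame, so in this sketch the problems have to be gathered from the individual data frames before they are combined.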
