Skip to content

Conversation

@jennybc
Copy link
Member

@jennybc jennybc commented Jan 11, 2026

Fixes #400
Fixes #553
Fixes tidyverse/readr#1555
Fixes tidyverse/readr#1553 (speculative, but there's no reprex)

Download remote compressed files (.gz, .bz2, .xz, .zip) to a temp file, then treat them like a local compressed file.

  • For .bz2, .xz and .zip, this is a new capability. Previously this errored with "not supported, download locally first".
  • For .gz, this is a change of approach. vroom already did support remote .gz, but it optimistically used base::gzcon() to avoid downloading the compressed file. Problem is that gzcon() has a bug (https://bugs.r-project.org/show_bug.cgi?id=18887) that was briefly fixed, but the fix got reverted. Remote .gz files that are concatenated gzip archives (multiple gzip members per RFC 1952) silently return truncated data. gzcon() only reads the first gzip member (boo), whereas gzfile() reads all of them (yay).

It's easy to convince yourself that the unsuccessful vroom() call is due to something else (3 of the 4 issues above start out barking up the wrong tree, all different trees), but it's really about a concatenated .gz file.


From the Chesterton's Fence perspective ... I cannot find any evidence in the git history of any specific reason to not download to a temp file. Nor can I (or Claude) think of some gotcha. I think it was just an accident of history and maybe the fact that, at least for .gz, you could stream compressed data with gzcon(). (Also, FWIW, data.table downloads the whole file in the background, then decompresses it.)


There was some question about where to put the new pre-download functionality. My first instinct was in connection_or_filepath(), but that is part of a call chain that starts in R, then goes to C++, then back to R. Given the current wiring, it's not possible to schedule cleanup of the temp file at the right time. I see how to fix that (something like jennybc/parentframecpp#1), but I'm going to open a separate issue (#601) and PR for this, because there are several areas in vroom that would benefit from a similar approach.

standardise_path() is another reasonable location and it's more favorable now, in terms of cleanup. So that's where I introduce this.

@jennybc jennybc force-pushed the download-remote-compressed-files branch from 2c24bae to 40bf057 Compare January 12, 2026 21:40
Even though it might not "match" vroom's messaging, it's better than looking like things might just be hung
It is favorable to do this before ever hitting C++, so we can register the deferred cleanup on the desired environment. If we do it in connection_or_filepath(), when parent.frame() is called from C++, the global environment is used and messaging about the deferred cleanup bubbles up to the user.
},
logical(1)
)
if (all(is_remote_compressed)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not strictly necessary to be so "all or nothing", but we have a similar mindset re: multiple inputs for, e.g., connections and it makes things less fiddly. I could entertain better handling of mixed inputs if real users complain.

return(lapply(path, function(p) {
ext <- tolower(tools::file_ext(p))
local_path <- download_file(p, ext, call = call)
withr::defer(unlink(local_path), envir = call)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the defer() that we want to place before we ever head into C++, given the current design.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is envir = call the right frame? Or is it the environment of standardise_path()?

call is (typically) one frame above that, right?

Or, or, wow, is the "one frame above" standardise_path() the place where you actually read in the file? So you're trying to say "please clean up this file later on, after ive fully read it in, and i promise the timing of that will all work out"

If that is the case, then this definitely deserves a comment because this feels like a big old footgun that could easily be triggered (like if you add a layer of indirection between the call to standardise_path() and where the file actually gets read in, then you might clean it up too soon)

I'd also consider any other approach that might better remove this footgun

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleanup is definitely what required the most thought and is related to what I say in the intro re: where to put this "download to temporary file and schedule cleanup" in the overview. This thinking lead to #601 and jennybc/parentframecpp#1. There are multiple places in vroom where a better approach to passing the calling environment around will benefit cleanup and error messaging.

But, yes, what's here currently does the cleanup at the right time. It goes down like so:

vroom()
      └── standardise_path(call = caller_env())  # call = vroom()'s env
              └── download_file() + withr::defer(..., envir = call)
      └── vroom_()  # reads the file
      └── <vroom returns, cleanup runs>

So the cleanup runs when vroom() exits, which is after the file has been consumed by vroom_() (as far as vroom_() knows, this is just a normal non-remote compressed file).

This fits an existing pattern in chr_to_file() and reencode_file(). Inspired by this convo, I'm tweaking those sites to be even more consistent and then all of them will get another round of 👀 in #601.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if I were implementing this, knowing how annoyingly complex this is, I think I would pass through both call and another env called env_cleanup or env_defer. Even if it is the same environment as call, it feels like it would express the meaning better

When I see call, I'm conditioned to think "this has to do with error messaging", so seeing it used for other things confuses me and smells off to me.

I'd also be happy if some other approach nukes the need for this entirely, but I also wouldn't be mad if this were the final solution, where vroom() itself just has this at the very top:

vroom <- function(...) {
  # The frame we tie all temporary file cleanup to
  env_cleanup <- environment()
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing I'd probaby do is make env_cleanup a required argument to each helper, rather than setting to caller_env(), because that is never going to be the correct environment and would lull me into a false sense of correctness

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are great ideas and I'm going to add them as part of the mandate in #601.

withr::defer(unlink(local_path), envir = call)
switch(
ext,
gz = gzfile(local_path),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.gz just gets treated like the rest now. Bye bye base::gzcon().

}

if (is_url(path)) {
ext <- tolower(tools::file_ext(path))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a better location for the call to download_file() but for now we need to take care of it earlier.

Comment on lines -225 to -228
gz = gzfile(path, ""),
bz2 = bzfile(path, ""),
xz = xzfile(path, ""),
zip = zipfile(path, ""),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"" is the default value for the second argument open anyway. (I had to make that true for zipfile(), which is ours, but also seemed logical for consistency.)

}

download_file <- function(url, ext, call = caller_env()) {
local_path <- vroom_tempfile(fileext = ext, pattern = "vroom-download-url-")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vroom promises to use VROOM_TEMP_PATH for all such needs, which is the main point of vroom_tempfile().


download_file <- function(url, ext, call = caller_env()) {
local_path <- vroom_tempfile(fileext = ext, pattern = "vroom-download-url-")
show_progress <- vroom_progress()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empirically, it's pretty important to show progress, even if it's not styled like vroom progress. The real-world examples from the motivating issues can take time to download and otherwise you think things have just hung.

#'
#' # Read datasets across multiple files ---------------------------------------
#' mtcars_by_cyl <- vroom_example(vroom_examples("mtcars-"))
#' mtcars_by_cyl <- vroom_example(vroom_examples("mtcars-[468]"))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regex had to tighten up in order to not match the new test fixture, mtcars-concatenated.csv.gz.

skip_on_cran()

mt <- vroom(vroom_example("mtcars.csv"), col_types = list())
# FIXME: switch the branch to main or HEAD here
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll either do this right before merge (and fail last CI on PR) or right after merge (and fail first CI on main).

@jennybc jennybc requested a review from DavisVaughan January 14, 2026 01:08
@jennybc
Copy link
Member Author

jennybc commented Jan 14, 2026

In case I ever need it in the future, here's a function for detecting a concatenated .gz. It's probably gross -- I really did not refine Claude's work -- but it was good enough to guide development of this PR, which all I needed.

gzip_members()
gzip_members <- function(path) {
bytes <- readBin(path, "raw", file.size(path))
n <- length(bytes)

# Find candidate positions where gzip magic number (0x1f 0x8b) appears
candidates <- which(
  bytes[-n] == as.raw(0x1f) & bytes[-1] == as.raw(0x8b)
)

if (length(candidates) == 0) {
  stop("No gzip members found in ", path)
}

# Validate each candidate by checking:
# - Byte 3 (index +2) must be 0x08 (deflate compression method)
# - Try to actually decompress it
starts <- integer()
for (pos in candidates) {
  # Check compression method byte
  if (pos + 2 > n) next
  if (bytes[pos + 2] != as.raw(0x08)) next

  # Try to decompress from this position to EOF
  # This validates it's actually a gzip stream
  test_bytes <- bytes[pos:n]
  tmp <- tempfile(fileext = ".gz")
  writeBin(test_bytes, tmp)
  result <- tryCatch(
    {
      readLines(gzfile(tmp), n = 1, warn = FALSE)
      TRUE
    },
    error = function(e) FALSE,
    warning = function(w) TRUE
  )
  unlink(tmp)

  if (result) {
    starts <- c(starts, pos)
  }
}

if (length(starts) == 0) {
  stop("No valid gzip members found in ", path)
}

# End of each member is byte before next member starts (or EOF)
ends <- c(starts[-1] - 1, n)

# For each member, decompress and count lines
n_lines <- integer(length(starts))
for (i in seq_along(starts)) {
  member_bytes <- bytes[starts[i]:ends[i]]
  tmp <- tempfile(fileext = ".gz")
  writeBin(member_bytes, tmp)
  n_lines[i] <- tryCatch(
    length(readLines(gzfile(tmp), warn = FALSE)),
    error = function(e) NA_integer_
  )
  unlink(tmp)
}

cat(basename(path), "has", length(starts), "gzip member(s):\n")

print(data.frame(
  member = seq_along(starts),
  start_byte = starts - 1,
  end_byte = ends - 1,
  size_bytes = ends - starts + 1,
  n_lines = n_lines
), row.names = FALSE)

invisible(NULL)
}

Here's what it reports for the new test fixture/example file (concatenated), an existing example file (not concatenated), the original file from #400 (concatenated):

gzip_members("inst/extdata/mtcars-concatenated.csv.gz")
#> mtcars-concatenated.csv.gz has 2 gzip member(s):
#>  member start_byte end_byte size_bytes n_lines
#>       1          0       72         73       1
#>       2         73      931        859      33

gzip_members("inst/extdata/mtcars.csv.gz")
#> mtcars.csv.gz has 1 gzip member(s):
#>  member start_byte end_byte size_bytes n_lines
#>       1          0      869        870      33

#remote_file <- "https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/monthly/bejab/2023/bejab_vpts_202303.csv.gz"
#tf1 <- tempfile()
#curl::curl_download(remote_file, tf1, quiet = FALSE)
gzip_members(tf1)
#> file143058063f60 has 16 gzip member(s):
#>  member start_byte end_byte size_bytes n_lines
#>       1          0      132        133       1
#>       2        133   782350     782218   11475
#>       3     782351  1647510     865160   11475
#>       4    1647511  2339652     692142   11475
#>       5    2339653  3069782     730130   11475
#>       6    3069783  3942194     872412   11475
#>       7    3942195  4802812     860618   11475
#>       8    4802813  5697795     894983   11475
#>       9    5697796  6593209     895414   11475
#>      10    6593210  7397292     804083   11475
#>      11    7397293  8224468     827176   11475
#>      12    8224469  9030857     806389   11475
#>      13    9030858  9919755     888898   11475
#>      14    9919756 10765979     846224   11475
#>      15   10765980 11535286     769307   11475
#>      16   11535287 11617208      81922    1750

Copy link
Member

@DavisVaughan DavisVaughan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some concerns around when unlink() happens but otherwise looks nice to me

return(lapply(path, function(p) {
ext <- tolower(tools::file_ext(p))
local_path <- download_file(p, ext, call = call)
withr::defer(unlink(local_path), envir = call)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is envir = call the right frame? Or is it the environment of standardise_path()?

call is (typically) one frame above that, right?

Or, or, wow, is the "one frame above" standardise_path() the place where you actually read in the file? So you're trying to say "please clean up this file later on, after ive fully read it in, and i promise the timing of that will all work out"

If that is the case, then this definitely deserves a comment because this feels like a big old footgun that could easily be triggered (like if you add a layer of indirection between the call to standardise_path() and where the file actually gets read in, then you might clean it up too soon)

I'd also consider any other approach that might better remove this footgun

Comment on lines +361 to +362
parent = NA,
error = cnd,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what error = is?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intent is to pass the original condition along as data, if someone really wanted to catch it and dig in. Inspired by examples seen in Taking full ownership of a causal error.

When a low-level error is overtaken, it is good practice to store it in the high-level error object, so that it can be inspected for debugging purposes. In the snippet above, we stored it in the error field.

@jennybc jennybc merged commit c244c98 into main Jan 16, 2026
6 of 16 checks passed
@jennybc jennybc deleted the download-remote-compressed-files branch January 16, 2026 20:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

3 participants