Download remote compressed files #599

jennybc · 2026-01-11T02:07:53Z

Fixes #400
Fixes #553
Fixes tidyverse/readr#1555
Fixes tidyverse/readr#1553 (speculative, but there's no reprex)

Download remote compressed files (.gz, .bz2, .xz, .zip) to a temp file, then treat them like a local compressed file.

For .bz2, .xz and .zip, this is a new capability. Previously this errored with "not supported, download locally first".
For .gz, this is a change of approach. vroom already did support remote .gz, but it optimistically used base::gzcon() to avoid downloading the compressed file. Problem is that gzcon() has a bug (https://bugs.r-project.org/show_bug.cgi?id=18887) that was briefly fixed, but the fix got reverted. Remote .gz files that are concatenated gzip archives (multiple gzip members per RFC 1952) silently return truncated data. gzcon() only reads the first gzip member (boo), whereas gzfile() reads all of them (yay).
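
To make the concatenated-archive case concrete, here is a minimal local sketch (not vroom's code) that builds a two-member .gz file and reads it back. The helper name `write_gz()` is hypothetical. What `gzcon()` returns depends on the R version, given the bug report above, so no output is claimed for it.

```r
# Hypothetical helper: compress some lines into their own single-member .gz file.
write_gz <- function(lines) {
  path <- tempfile(fileext = ".gz")
  con <- gzfile(path, "w")
  writeLines(lines, con)
  close(con)
  path
}

# Concatenate two gzip members byte-for-byte into one file.
header <- write_gz("x,y")
body <- write_gz(c("1,2", "3,4"))
concatenated <- tempfile(fileext = ".gz")
writeBin(
  c(
    readBin(header, "raw", file.size(header)),
    readBin(body, "raw", file.size(body))
  ),
  concatenated
)

# gzfile() decompresses every member in the file.
con <- gzfile(concatenated)
via_gzfile <- readLines(con)
close(con)

# gzcon() may stop after the first member, depending on the R version.
con <- gzcon(file(concatenated, "rb"))
via_gzcon <- readLines(con)
close(con)

via_gzfile
length(via_gzcon)
```

Comparing `length(via_gzfile)` with `length(via_gzcon)` is a quick way to check whether a given R installation is affected.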

It's easy to convince yourself that the unsuccessful vroom() call is due to something else (3 of the 4 issues above start out barking up the wrong tree, all different trees), but it's really about a concatenated .gz file.

From the Chesterton's Fence perspective ... I cannot find any evidence in the git history of any specific reason to not download to a temp file. Nor can I (or Claude) think of some gotcha. I think it was just an accident of history and maybe the fact that, at least for .gz, you could stream compressed data with gzcon(). (Also, FWIW, data.table downloads the whole file in the background, then decompresses it.)

There was some question about where to put the new pre-download functionality. My first instinct was in connection_or_filepath(), but that is part of a call chain that starts in R, then goes to C++, then back to R. Given the current wiring, it's not possible to schedule cleanup of the temp file at the right time. I see how to fix that (something like jennybc/parentframecpp#1), but I'm going to open a separate issue (#601) and PR for this, because there are several areas in vroom that would benefit from a similar approach.

standardise_path() is another reasonable location and it's more favorable now, in terms of cleanup. So that's where I introduce this.

It is favorable to do this before ever hitting C++, so we can register the deferred cleanup on the desired environment. If we do it in connection_or_filepath(), when parent.frame() is called from C++, the global environment is used and messaging about the deferred cleanup bubbles up to the user.

jennybc · 2026-01-14T00:59:24Z

R/path.R

+    },
+    logical(1)
+  )
+  if (all(is_remote_compressed)) {


It's not strictly necessary to be so "all or nothing", but we have a similar mindset re: multiple inputs for, e.g., connections and it makes things less fiddly. I could entertain better handling of mixed inputs if real users complain.
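
A rough sketch of the "all or nothing" check, for readers without the full diff in front of them. The `is_url()` below is a hypothetical stand-in for vroom's internal helper, and the function name `is_remote_compressed_sketch()` is made up for illustration.

```r
# Hypothetical stand-in for vroom's internal URL check.
is_url <- function(x) grepl("^(https?|ftp|ftps)://", x)

compressed_exts <- c("gz", "bz2", "xz", "zip")

# TRUE for each input that is both a URL and has a compressed extension.
is_remote_compressed_sketch <- function(path) {
  vapply(
    path,
    function(p) is_url(p) && tolower(tools::file_ext(p)) %in% compressed_exts,
    logical(1),
    USE.NAMES = FALSE
  )
}

all_remote <- c("https://example.com/a.csv.gz", "https://example.com/b.csv.GZ")
mixed <- c("https://example.com/a.csv.gz", "local.csv")

# Only the first set would take the pre-download path.
all(is_remote_compressed_sketch(all_remote))
all(is_remote_compressed_sketch(mixed))
```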

jennybc · 2026-01-14T01:00:10Z

R/path.R

+    return(lapply(path, function(p) {
+      ext <- tolower(tools::file_ext(p))
+      local_path <- download_file(p, ext, call = call)
+      withr::defer(unlink(local_path), envir = call)


This is the defer() that we want to place before we ever head into C++, given the current design.

Is envir = call the right frame? Or is it the environment of standardise_path()?

call is (typically) one frame above that, right?

Or, or, wow, is the "one frame above" standardise_path() the place where you actually read in the file? So you're trying to say "please clean up this file later on, after I've fully read it in, and I promise the timing of that will all work out"

If that is the case, then this definitely deserves a comment because this feels like a big old footgun that could easily be triggered (like if you add a layer of indirection between the call to standardise_path() and where the file actually gets read in, then you might clean it up too soon)

I'd also consider any other approach that might better remove this footgun

Cleanup definitely required the most thought, and it relates to what I say in the overview about where to put this "download to temporary file and schedule cleanup" step. This thinking led to #601 and jennybc/parentframecpp#1. There are multiple places in vroom where a better approach to passing the calling environment around will benefit cleanup and error messaging.

But, yes, what's here currently does the cleanup at the right time. It goes down like so:

vroom()
  └── standardise_path(call = caller_env())  # call = vroom()'s env
        └── download_file() + withr::defer(..., envir = call)
  └── vroom_()  # reads the file
  └── <vroom returns, cleanup runs>

So the cleanup runs when vroom() exits, which is after the file has been consumed by vroom_() (as far as vroom_() knows, this is just a normal non-remote compressed file).
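
The timing can be checked in isolation with a small sketch. The names `stage_file()` and `read_staged()` are hypothetical stand-ins for standardise_path() and vroom(); only `withr::defer()` is the real call from the diff.

```r
# Hypothetical helper: creates a temp file and ties its deletion to `call`.
stage_file <- function(call) {
  path <- tempfile(fileext = ".csv")
  writeLines(c("x,y", "1,2"), path)
  withr::defer(unlink(path), envir = call)
  path
}

# Stand-in for vroom(): stages the file, reads it, then returns.
read_staged <- function() {
  path <- stage_file(call = environment())
  # The file still exists here, because the deferred unlink() is tied
  # to this frame and has not run yet.
  list(path = path, lines = readLines(path))
}

res <- read_staged()
res$lines
file.exists(res$path)
```

The read succeeds inside `read_staged()`, and the file is gone once that function has returned.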

This fits an existing pattern in chr_to_file() and reencode_file(). Inspired by this convo, I'm tweaking those sites to be even more consistent and then all of them will get another round of 👀 in #601.

I think if I were implementing this, knowing how annoyingly complex this is, I think I would pass through both call and another env called env_cleanup or env_defer. Even if it is the same environment as call, it feels like it would express the meaning better

When I see call, I'm conditioned to think "this has to do with error messaging", so seeing it used for other things confuses me and smells off to me.

I'd also be happy if some other approach nukes the need for this entirely, but I also wouldn't be mad if this were the final solution, where vroom() itself just has this at the very top:

vroom <- function(...) {
  # The frame we tie all temporary file cleanup to
  env_cleanup <- environment()
}

Another thing I'd probably do is make env_cleanup a required argument to each helper, rather than defaulting it to caller_env(), because that is never going to be the correct environment and would lull me into a false sense of correctness
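
A sketch of the footgun and of the explicit-argument alternative, under the assumption that an extra helper sits between the reader and the function that registers cleanup. All function names here are hypothetical; `env_cleanup` is the argument name proposed above.

```r
# Hypothetical helper with a required cleanup environment.
download_stub <- function(env_cleanup) {
  path <- tempfile()
  file.create(path)
  withr::defer(unlink(path), envir = env_cleanup)
  path
}

# Footgun: the intermediate layer hands over its own frame, so the file is
# deleted as soon as this helper returns.
prepare_too_soon <- function() {
  download_stub(env_cleanup = environment())
}
reader_broken <- function() {
  path <- prepare_too_soon()
  file.exists(path)
}

# Explicit: the reader owns the cleanup frame and threads it through.
prepare <- function(env_cleanup) {
  download_stub(env_cleanup = env_cleanup)
}
reader_ok <- function() {
  env_cleanup <- environment()
  path <- prepare(env_cleanup)
  file.exists(path)
}

reader_broken()
reader_ok()
```

With a required argument there is no default to fall back on, so every layer has to state which frame owns the file.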

These are great ideas and I'm going to add them as part of the mandate in #601.

jennybc · 2026-01-14T01:00:53Z

R/path.R

+      withr::defer(unlink(local_path), envir = call)
+      switch(
+        ext,
+        gz = gzfile(local_path),


.gz just gets treated like the rest now. Bye bye base::gzcon().

jennybc · 2026-01-14T01:01:45Z

R/path.R

  }

  if (is_url(path)) {
+    ext <- tolower(tools::file_ext(path))


This is probably a better location for the call to download_file() but for now we need to take care of it earlier.

jennybc · 2026-01-14T01:03:02Z

R/path.R

-    gz = gzfile(path, ""),
-    bz2 = bzfile(path, ""),
-    xz = xzfile(path, ""),
-    zip = zipfile(path, ""),


"" is the default value for the second argument open anyway. (I had to make that true for zipfile(), which is ours, but also seemed logical for consistency.)

jennybc · 2026-01-14T01:04:00Z

R/path.R

 }

+download_file <- function(url, ext, call = caller_env()) {
+  local_path <- vroom_tempfile(fileext = ext, pattern = "vroom-download-url-")


vroom promises to use VROOM_TEMP_PATH for all such needs, which is the main point of vroom_tempfile().
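
For context, a helper honoring VROOM_TEMP_PATH could look roughly like this. It is a sketch, not vroom's actual vroom_tempfile(): the name `temp_path_sketch()`, the fallback to `tempdir()`, and the way the dot is added to the extension are all assumptions.

```r
# Hypothetical sketch: prefer VROOM_TEMP_PATH, fall back to the session temp dir.
temp_path_sketch <- function(fileext, pattern = "vroom-") {
  dir <- Sys.getenv("VROOM_TEMP_PATH", unset = "")
  if (!nzchar(dir)) {
    dir <- tempdir()
  }
  tempfile(pattern = pattern, tmpdir = dir, fileext = paste0(".", fileext))
}

# Point VROOM_TEMP_PATH at a scratch directory and restore it afterwards.
scratch <- file.path(tempdir(), "vroom-scratch")
dir.create(scratch, showWarnings = FALSE)
old <- Sys.getenv("VROOM_TEMP_PATH", unset = NA)
Sys.setenv(VROOM_TEMP_PATH = scratch)

p <- temp_path_sketch("gz", pattern = "vroom-download-url-")

if (is.na(old)) Sys.unsetenv("VROOM_TEMP_PATH") else Sys.setenv(VROOM_TEMP_PATH = old)

basename(dirname(p))
tools::file_ext(p)
```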

jennybc · 2026-01-14T01:05:23Z

R/path.R


+download_file <- function(url, ext, call = caller_env()) {
+  local_path <- vroom_tempfile(fileext = ext, pattern = "vroom-download-url-")
+  show_progress <- vroom_progress()


Empirically, it's pretty important to show progress, even if it's not styled like vroom progress. The real-world examples from the motivating issues can take time to download and otherwise you think things have just hung.

jennybc · 2026-01-14T01:07:57Z

R/vroom.R

 #'
 #' # Read datasets across multiple files ---------------------------------------
-#' mtcars_by_cyl <- vroom_example(vroom_examples("mtcars-"))
+#' mtcars_by_cyl <- vroom_example(vroom_examples("mtcars-[468]"))


This regex had to tighten up in order to not match the new test fixture, mtcars-concatenated.csv.gz.
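
The effect of the tighter pattern can be seen with plain `grep()`. The file list is an illustrative subset, not the full output of vroom_examples().

```r
# Illustrative subset of example file names (assumed, not exhaustive).
files <- c(
  "mtcars-4.csv",
  "mtcars-6.csv",
  "mtcars-8.csv",
  "mtcars-concatenated.csv.gz",
  "mtcars.csv"
)

# The old pattern also picks up the new fixture.
old_match <- grep("mtcars-", files, value = TRUE)

# The tightened pattern keeps only the per-cylinder files.
new_match <- grep("mtcars-[468]", files, value = TRUE)

old_match
new_match
```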

jennybc · 2026-01-14T01:08:36Z

tests/testthat/test-path.R

+  skip_on_cran()
+
+  mt <- vroom(vroom_example("mtcars.csv"), col_types = list())
+  # FIXME: switch the branch to main or HEAD here


I'll either do this right before merge (and fail last CI on PR) or right after merge (and fail first CI on main).

jennybc · 2026-01-14T01:20:44Z

In case I ever need it in the future, here's a function for detecting a concatenated .gz. It's probably gross -- I really did not refine Claude's work -- but it was good enough to guide development of this PR, which is all I needed.

gzip_members()

gzip_members <- function(path) {
  bytes <- readBin(path, "raw", file.size(path))
  n <- length(bytes)

  # Find candidate positions where gzip magic number (0x1f 0x8b) appears
  candidates <- which(
    bytes[-n] == as.raw(0x1f) & bytes[-1] == as.raw(0x8b)
  )

  if (length(candidates) == 0) {
    stop("No gzip members found in ", path)
  }

  # Validate each candidate by checking:
  # - Byte 3 (index +2) must be 0x08 (deflate compression method)
  # - Try to actually decompress it
  starts <- integer()
  for (pos in candidates) {
    # Check compression method byte
    if (pos + 2 > n) next
    if (bytes[pos + 2] != as.raw(0x08)) next

    # Try to decompress from this position to EOF
    # This validates it's actually a gzip stream
    test_bytes <- bytes[pos:n]
    tmp <- tempfile(fileext = ".gz")
    writeBin(test_bytes, tmp)
    result <- tryCatch(
      {
        readLines(gzfile(tmp), n = 1, warn = FALSE)
        TRUE
      },
      error = function(e) FALSE,
      warning = function(w) TRUE
    )
    unlink(tmp)

    if (result) {
      starts <- c(starts, pos)
    }
  }

  if (length(starts) == 0) {
    stop("No valid gzip members found in ", path)
  }

  # End of each member is byte before next member starts (or EOF)
  ends <- c(starts[-1] - 1, n)

  # For each member, decompress and count lines
  n_lines <- integer(length(starts))
  for (i in seq_along(starts)) {
    member_bytes <- bytes[starts[i]:ends[i]]
    tmp <- tempfile(fileext = ".gz")
    writeBin(member_bytes, tmp)
    n_lines[i] <- tryCatch(
      length(readLines(gzfile(tmp), warn = FALSE)),
      error = function(e) NA_integer_
    )
    unlink(tmp)
  }

  cat(basename(path), "has", length(starts), "gzip member(s):\n")

  # Report byte offsets as 0-based
  print(data.frame(
    member = seq_along(starts),
    start_byte = starts - 1,
    end_byte = ends - 1,
    size_bytes = ends - starts + 1,
    n_lines = n_lines
  ), row.names = FALSE)

  invisible(NULL)
}

Here's what it reports for the new test fixture/example file (concatenated), an existing example file (not concatenated), and the original file from #400 (concatenated):

gzip_members("inst/extdata/mtcars-concatenated.csv.gz")
#> mtcars-concatenated.csv.gz has 2 gzip member(s):
#>  member start_byte end_byte size_bytes n_lines
#>       1          0       72         73       1
#>       2         73      931        859      33

gzip_members("inst/extdata/mtcars.csv.gz")
#> mtcars.csv.gz has 1 gzip member(s):
#>  member start_byte end_byte size_bytes n_lines
#>       1          0      869        870      33

#remote_file <- "https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/monthly/bejab/2023/bejab_vpts_202303.csv.gz"
#tf1 <- tempfile()
#curl::curl_download(remote_file, tf1, quiet = FALSE)
gzip_members(tf1)
#> file143058063f60 has 16 gzip member(s):
#>  member start_byte end_byte size_bytes n_lines
#>       1          0      132        133       1
#>       2        133   782350     782218   11475
#>       3     782351  1647510     865160   11475
#>       4    1647511  2339652     692142   11475
#>       5    2339653  3069782     730130   11475
#>       6    3069783  3942194     872412   11475
#>       7    3942195  4802812     860618   11475
#>       8    4802813  5697795     894983   11475
#>       9    5697796  6593209     895414   11475
#>      10    6593210  7397292     804083   11475
#>      11    7397293  8224468     827176   11475
#>      12    8224469  9030857     806389   11475
#>      13    9030858  9919755     888898   11475
#>      14    9919756 10765979     846224   11475
#>      15   10765980 11535286     769307   11475
#>      16   11535287 11617208      81922    1750

DavisVaughan

Some concerns around when unlink() happens but otherwise looks nice to me

DavisVaughan · 2026-01-16T15:29:09Z

R/path.R

+    return(lapply(path, function(p) {
+      ext <- tolower(tools::file_ext(p))
+      local_path <- download_file(p, ext, call = call)
+      withr::defer(unlink(local_path), envir = call)


Is envir = call the right frame? Or is it the environment of standardise_path()?

call is (typically) one frame above that, right?

Or, or, wow, is the "one frame above" standardise_path() the place where you actually read in the file? So you're trying to say "please clean up this file later on, after I've fully read it in, and I promise the timing of that will all work out"

If that is the case, then this definitely deserves a comment because this feels like a big old footgun that could easily be triggered (like if you add a layer of indirection between the call to standardise_path() and where the file actually gets read in, then you might clean it up too soon)

I'd also consider any other approach that might better remove this footgun

DavisVaughan · 2026-01-16T15:32:19Z

R/path.R

+        parent = NA,
+        error = cnd,


I don't know what error = is?

My intent is to pass the original condition along as data, in case someone really wants to catch it and dig in. Inspired by the examples in rlang's "Taking full ownership of a causal error" documentation:

"When a low-level error is overtaken, it is good practice to store it in the high-level error object, so that it can be inspected for debugging purposes. In the snippet above, we stored it in the error field."
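
A minimal sketch of that pattern, assuming rlang is available. The function name `download_or_abort()` and the messages are hypothetical; the `parent = NA` and `error = cnd` arguments are the ones shown in the diff.

```r
# Hypothetical wrapper: replace a low-level error with a high-level one,
# keeping the original condition in the `error` field.
download_or_abort <- function(url) {
  tryCatch(
    stop("low-level failure for ", url),
    error = function(cnd) {
      rlang::abort(
        "Can't download the remote file.",
        parent = NA,   # take full ownership: no chained parent error
        error = cnd    # stored as data on the condition object
      )
    }
  )
}

err <- tryCatch(
  download_or_abort("https://example.com/x.csv.gz"),
  error = identity
)

conditionMessage(err)
conditionMessage(err$error)
```

The user sees only the high-level message, while `err$error` still holds the original condition for anyone who catches the error and wants to inspect it.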

jennybc added 2 commits January 10, 2026 17:51

Finesse the presence/abscence of . like fs does

68e62cc

Add another test fixture

40bf057

jennybc force-pushed the download-remote-compressed-files branch from 2c24bae to 40bf057 Compare January 12, 2026 21:40

jennybc added 7 commits January 12, 2026 13:49

Still working on the test fixture

6e2ca93

Still figuring out the test fixture

1abf960

I think this is the test fixture I want

3b92a65

Download remote compressed files behind the scenes

363289c

Update documentation

c221b70

Add NEWS bullet

cb91c13

Note a TODO for around merge time

80b2688

jennybc marked this pull request as ready for review January 13, 2026 00:35

jennybc mentioned this pull request Jan 13, 2026

read_csv returns empty data.frame for remote gzipped csv files tidyverse/readr#1555

Closed

jennybc added 2 commits January 12, 2026 18:25

I think I want progress reporting

6303d96

Even though it might not "match" vroom's messaging, it's better than looking like things might just be hung

jennybc commented Jan 14, 2026

jennybc requested a review from DavisVaughan January 14, 2026 01:08

jennybc mentioned this pull request Jan 14, 2026

Cannot read gz files larger than 4 GB tidyverse/readr#1553

Closed

DavisVaughan approved these changes Jan 16, 2026

jennybc added 2 commits January 16, 2026 11:25

More convergent evolution to remove inessential inconsistency

cdbb582

URL that should work once I merge this 🤞

1076a40

jennybc mentioned this pull request Jan 16, 2026

Level up the passing of the caller environment across R <--> C++ boundary #601

Open

jennybc merged commit c244c98 into main Jan 16, 2026
6 of 16 checks passed

jennybc deleted the download-remote-compressed-files branch January 16, 2026 20:02
