Summary
When reading from connections (compressed files, URLs, etc.), comment lines that intersect chunk boundaries cause parsing failures (indexing failures, to be more specific). The same file reads correctly when uncompressed (and therefore memory-mapped), or often simply with a different (usually larger) chunk size.
This whole investigation was inspired by poking at tidyverse/readr#1523.
Minimal Reproduction
library(vroom)
# Force tiny chunks to hit boundary conditions
withr::local_envvar(VROOM_CONNECTION_SIZE = "10")
content <- "a,b\n1,2\n#comment\n3,4\n"
# VROOM_CONNECTION_SIZE = 10
# ==> 10 bytes per chunk, but 1 byte for null terminator ==> 9 bytes for data
#
# Byte: 1 2 3 4 5 6 7 8 9 ...
# Content: a , b \n 1 , 2 \n # c ...
# ───────────────────┘
# "#" falls in the last byte of the first chunk
con <- rawConnection(charToRaw(content), "rb")
result <- vroom(con, delim = ",", comment = "#", col_types = "dd")
# Expected: 2 rows (data rows only, comment skipped)
# Actual: 3 rows with parsing problems
result
#> # A tibble: 3 × 2
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
#> a b
#> <dbl> <dbl>
#> 1 1 2
#> 2 NA NA
#> 3 3 4
problems(result)
#> # A tibble: 2 × 5
#> row col expected actual file
#> <int> <int> <chr> <chr> <chr>
#> 1 3 1 2 columns 1 columns ""
#> 2     3     1 a double   #comment ""
Created on 2026-01-22 with reprex v2.1.1
Root Cause
Connection-based reading processes data in fixed-size chunks. Two related issues arise when comments intersect chunk boundaries:
1. Comment detected, but terminating newline in next chunk
Chunk 1: [a,b\n1,2\n#] <- comment detected at "#"
Chunk 2: [comment\n3,4\n...] <- but newline is here
skip_rest_of_line() can't find \n, returns end-of-buffer. State incorrectly becomes RECORD_START, and "comment\n" in the next chunk is parsed as data.
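A minimal sketch of this failure mode (the name `skip_rest_of_line` follows the issue, but the signature and logic here are illustrative, not vroom's actual internals): the scan returns the same "end of buffer" position whether it consumed a complete line or ran out of bytes mid-line, so the caller cannot distinguish the two cases.

```cpp
#include <string>

// Hypothetical sketch of a skip_rest_of_line()-style scan: advance to just
// past the next '\n', or to the end of the buffer if no newline is visible.
// Returning buf.size() is ambiguous: the caller cannot tell "line fully
// consumed" from "line continues in the next chunk", so it may wrongly
// reset to RECORD_START.
static size_t skip_rest_of_line(const std::string& buf, size_t pos) {
  size_t nl = buf.find('\n', pos);
  return (nl == std::string::npos) ? buf.size() : nl + 1;
}
```

With the 9-byte chunk from the reprex, scanning from the `#` finds no newline and returns the buffer end, indistinguishable from a successfully skipped line.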
2. Multi-char comment split across chunks
Chunk 1: [...data\n#] <- only first "#" visible
Chunk 2: [#rest...\n] <- second "#" here
With comment "##" and only 1 byte visible, we can't determine if it's a comment or data.
Why Memory-Mapped Files Work
For multi-threaded file reading, chunk boundaries are aligned to newlines via find_next_newline(). Comment prefixes appear at line starts, so they're never split.
Connection reading has arbitrary boundaries (wherever the buffer fills), causing the problem.
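The alignment step can be sketched as follows (illustrative only, not vroom's actual `find_next_newline()` code): each shard's tentative end is pushed forward to just past the next `\n`, so no line, and hence no comment prefix, ever spans two shards.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of newline-aligned shard boundaries: split `buf` into
// shards of roughly `target` bytes, extending each boundary to just past the
// next '\n'. Returns (begin, end) offset pairs.
static std::vector<std::pair<size_t, size_t>> shard(const std::string& buf,
                                                    size_t target) {
  std::vector<std::pair<size_t, size_t>> out;
  size_t begin = 0;
  while (begin < buf.size()) {
    size_t end = begin + target;
    if (end >= buf.size()) {
      end = buf.size();  // final shard: take the remainder
    } else {
      size_t nl = buf.find('\n', end);
      end = (nl == std::string::npos) ? buf.size() : nl + 1;
    }
    out.push_back({begin, end});
    begin = end;
  }
  return out;
}
```

On the reprex content with a 9-byte target, the first shard is extended past the entire `#comment\n` line, so the comment is always seen whole.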
Prior Art in vroom: CRLF Boundary Fix
5fc54e6 fixed an analogous problem with \r\n spanning chunks (#331). From that commit message:
"The file would read fine until a line ending happened to fall on the connection buffer boundary, then the rest of the file would have a garbled index."
This was a much simpler problem, admitting a relatively straightforward solution. But the general description of the bad stuff that happens applies here too.
Proposed Solution: Pending Token Buffer
Leverage the existing double-buffer structure. When bytes at the end of a chunk can't be fully evaluated:
- Detect "limbo" bytes - partial comment prefix, or comment detected but no newline found
- Copy to next buffer - prepend limbo bytes to the start of the other buffer
- Read after them - fill the rest of the buffer from the connection
- Exclude from write - don't write limbo bytes to temp file (they'll be written with next chunk)
Chunk N in buf[0]: [...data...][#] <- can't decide, copy "#"
↓
Chunk N+1 in buf[1]: [#][...new data from connection...]
^
read starts here, after copied byte
This is the standard "pending token buffer" pattern used by streaming parsers (SAX, simdjson, etc.).
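The carry-over can be sketched as below. All names (`limbo_bytes`, `rechunk`) are hypothetical, and the code models only the chunk-boundary bookkeeping, not vroom's actual indexer: ambiguous trailing bytes (a partial comment prefix, or a detected comment with no newline yet) are held back and prepended to the next chunk before it is processed.

```cpp
#include <string>
#include <vector>

// How many trailing bytes of `chunk` are in "limbo": the final (incomplete)
// line is either a partial match of the comment marker, or starts with the
// full marker but its terminating '\n' has not been seen yet. Partial data
// lines are not carried; they are handled by normal tokenizer state.
static size_t limbo_bytes(const std::string& chunk, const std::string& comment) {
  size_t nl = chunk.rfind('\n');
  size_t line_start = (nl == std::string::npos) ? 0 : nl + 1;
  std::string tail = chunk.substr(line_start);
  if (!tail.empty() &&
      (comment.compare(0, tail.size(), tail) == 0 ||   // tail is a prefix of the marker
       tail.compare(0, comment.size(), comment) == 0)) // marker seen, newline pending
    return tail.size();
  return 0;
}

// Split `content` into fixed-size chunks, carrying limbo bytes forward so
// every emitted chunk ends on a safely classifiable boundary. (A carried
// prefix makes the next chunk slightly longer than chunk_size, mirroring
// the "read after them" step above.)
static std::vector<std::string> rechunk(const std::string& content,
                                        size_t chunk_size,
                                        const std::string& comment) {
  std::vector<std::string> out;
  std::string pending;
  for (size_t pos = 0; pos < content.size(); pos += chunk_size) {
    std::string chunk = pending + content.substr(pos, chunk_size);
    size_t limbo = limbo_bytes(chunk, comment);
    if (pos + chunk_size >= content.size()) limbo = 0;  // last chunk: flush everything
    pending = chunk.substr(chunk.size() - limbo);
    out.push_back(chunk.substr(0, chunk.size() - limbo));
  }
  return out;
}
```

With the reprex content and 9-byte chunks, the lone `#` at the first chunk's edge is carried, so the next chunk begins with the complete `#comment\n` line; with marker `##`, the split `#`/`#` pair is reunited the same way. A real implementation would also cap the carried bytes (e.g. a very long comment line with no newline in sight) rather than let `pending` grow unboundedly.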
Advantages:
- No new parser states needed
- Bytes carry their own state (no "going backwards")
- Decisions made with complete information
- Uses existing double-buffer infrastructure
Open Questions
- Are multi-byte delimiters affected by the same issue? My money is on YES. E.g. poke at tidyverse/readr#1534 (read_delim parsing issue with a compressed file).
- Are there other open issues that might be symptoms of this underlying problem?
Tests
I developed 2 tests during my exploration that will be useful (both currently fail):
test_that("comment at exact buffer boundary is handled correctly", {
withr::local_envvar(VROOM_CONNECTION_SIZE = "10")
content <- "a,b\n1,2\n#comment\n3,4\n"
# VROOM_CONNECTION_SIZE = 10
# ==> 10 bytes per chunk, but 1 byte for null terminator ==> 9 bytes for data
#
# Byte: 1 2 3 4 5 6 7 8 9 ...
# Content: a , b \n 1 , 2 \n # c ...
# ───────────────────┘
# "#" falls in the last byte of the first chunk
con <- rawConnection(charToRaw(content), "rb")
result <- vroom(con, delim = ",", comment = "#", col_types = "dd")
expect_equal(result, tibble::tibble(a = c(1, 3), b = c(2, 4)))
expect_equal(nrow(problems(result)), 0)
})
test_that("multi-char comment split across buffer boundary is handled correctly", {
withr::local_envvar(VROOM_CONNECTION_SIZE = "10")
content <- "a,b\n1,2\n##comment\n3,4\n"
# VROOM_CONNECTION_SIZE = 10
# ==> 10 bytes per chunk, but 1 byte for null terminator ==> 9 bytes for data
#
# Byte: 1 2 3 4 5 6 7 8 9 | 10 ...
# Content: a , b \n 1 , 2 \n # | # c ...
# ──────────────────┘
# first "#" in chunk 1, second "#" in chunk 2
con <- rawConnection(charToRaw(content), "rb")
result <- vroom(con, delim = ",", comment = "##", col_types = "dd")
expect_equal(result, tibble::tibble(a = c(1, 3), b = c(2, 4)))
expect_equal(nrow(problems(result)), 0)
})