Comment/token handling fails at connection chunk boundaries #608

@jennybc

Description

Summary

When reading from connections (compressed files, URLs, etc.), comment lines that intersect a chunk boundary cause indexing failures, which surface as parsing problems. The same file reads correctly when uncompressed (memory-mapped), and often also with just a different (usually larger) chunk size.

This whole investigation was inspired by poking at tidyverse/readr#1523.

Minimal Reproduction

library(vroom)

# Force tiny chunks to hit boundary conditions
withr::local_envvar(VROOM_CONNECTION_SIZE = "10")
content <- "a,b\n1,2\n#comment\n3,4\n"

# VROOM_CONNECTION_SIZE = 10
# ==> 10 bytes per chunk, but 1 byte for null terminator ==> 9 bytes for data
#
# Byte:     1 2 3  4 5 6 7  8 9 ...
# Content:  a , b \n 1 , 2 \n #  c ...
#           ───────────────────┘
#           first chunk      "#" falls in last byte of chunk

con <- rawConnection(charToRaw(content), "rb")
result <- vroom(con, delim = ",", comment = "#", col_types = "dd")

# Expected: 2 rows (data rows only, comment skipped)
# Actual: 3 rows with parsing problems
result
#> # A tibble: 3 × 2
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2
#> 2    NA    NA
#> 3     3     4

problems(result)
#> # A tibble: 2 × 5
#>     row   col expected  actual    file 
#>   <int> <int> <chr>     <chr>     <chr>
#> 1     3     1 2 columns 1 columns ""   
#> 2     3     1 a double  #comment  ""

Created on 2026-01-22 with reprex v2.1.1
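
For contrast, and consistent with the summary above: with a chunk size large enough that the comment line never straddles a boundary, the same content should parse cleanly. A sketch continuing the reprex session (con_big is just a fresh connection to the same content; I'm not showing output here):

# Same content, chunk large enough to hold the whole file,
# so the comment line never straddles a boundary
withr::local_envvar(VROOM_CONNECTION_SIZE = "1024")
con_big <- rawConnection(charToRaw(content), "rb")
vroom(con_big, delim = ",", comment = "#", col_types = "dd")
# Expected: a 2 × 2 tibble (rows 1,2 and 3,4), no parsing problems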

Root Cause

Connection-based reading processes data in fixed-size chunks. Two related issues arise when comments intersect chunk boundaries:

1. Comment detected, but terminating newline in next chunk

Chunk 1: [a,b\n1,2\n#]        <- comment detected at "#"
Chunk 2: [comment\n3,4\n...]  <- but newline is here

skip_rest_of_line() can't find the \n and returns end-of-buffer. The parser state incorrectly becomes RECORD_START, so "comment\n" in the next chunk is parsed as data.

2. Multi-char comment split across chunks

Chunk 1: [...data\n#]   <- only first "#" visible
Chunk 2: [#rest...\n]   <- second "#" here

With comment "##" and only one of its bytes visible at the end of the chunk, we can't determine whether this is a comment or data.

Why Memory-Mapped Files Work

For multi-threaded file reading, chunk boundaries are aligned to newlines via find_next_newline(). Comment prefixes appear at line starts, so they're never split.

Connection reading has arbitrary boundaries (wherever the buffer fills), causing the problem.
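
A rough sketch of the alignment idea (find_next_newline() itself is C++; align_to_newline() below is an invented R stand-in):

# Advance a tentative chunk boundary to just past the next newline, so no
# line (and hence no comment prefix) is ever split between chunks.
align_to_newline <- function(bytes, pos) {
  while (pos <= length(bytes) && bytes[pos] != charToRaw("\n")) {
    pos <- pos + 1L
  }
  min(pos + 1L, length(bytes) + 1L)  # first byte after the newline
}

bytes <- charToRaw("a,b\n1,2\n#comment\n3,4\n")
align_to_newline(bytes, 9L)  # tentative boundary lands on the "#" (byte 9)
#> [1] 18                     # aligned chunk keeps "#comment\n" intact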

Prior Art in vroom: CRLF Boundary Fix

5fc54e6 fixed an analogous problem with \r\n spanning chunks (#331). From that commit message:

"The file would read fine until a line ending happened to fall on the connection buffer boundary, then the rest of the file would have a garbled index."

This was a much simpler problem, admitting a relatively straightforward solution, but the general description of the resulting failure applies here too.

Proposed Solution: Pending Token Buffer

Leverage the existing double-buffer structure. When bytes at the end of a chunk can't be fully evaluated:

  1. Detect "limbo" bytes - partial comment prefix, or comment detected but no newline found
  2. Copy to next buffer - prepend limbo bytes to the start of the other buffer
  3. Read after them - fill the rest of the buffer from the connection
  4. Exclude from write - don't write limbo bytes to temp file (they'll be written with next chunk)
Chunk N in buf[0]:   [...data...][#]    <- can't decide, copy "#"
                                  ↓
Chunk N+1 in buf[1]: [#][...new data from connection...]
                     ^
                     read starts here, after copied byte

This is the standard "pending token buffer" pattern used by streaming parsers (SAX, simdjson, etc.).
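
To make the carry-over mechanics concrete, here is a minimal sketch in plain R (vroom's reader is C++; stream_noncomment_lines() is an invented name, and for simplicity it holds back the entire unterminated tail line rather than only the ambiguous comment-prefix bytes):

# Bytes after the last newline in a chunk are "limbo": we can't yet tell
# whether they start a comment, so carry them over and re-examine them
# together with the next chunk.
stream_noncomment_lines <- function(con, chunk_size = 10, comment = "#") {
  limbo <- raw(0)
  out <- character(0)
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    buf <- c(limbo, chunk)          # prepend limbo bytes, then fresh read
    if (length(chunk) == 0) {       # EOF: flush any unterminated final line
      if (length(buf) > 0) {
        line <- rawToChar(buf)
        if (!startsWith(line, comment)) out <- c(out, line)
      }
      break
    }
    nl <- which(buf == charToRaw("\n"))
    if (length(nl) == 0) {          # no complete line yet; carry everything
      limbo <- buf
      next
    }
    lines <- strsplit(rawToChar(buf[seq_len(max(nl))]), "\n", fixed = TRUE)[[1]]
    out <- c(out, lines[!startsWith(lines, comment)])
    limbo <- buf[-seq_len(max(nl))] # limbo bytes: after the last newline
  }
  out
}

con <- rawConnection(charToRaw("a,b\n1,2\n#comment\n3,4\n"), "rb")
stream_noncomment_lines(con, chunk_size = 10)
#> [1] "a,b" "1,2" "3,4"
close(con)

The real fix only needs to hold back the truly ambiguous bytes (a partial comment prefix, or a detected comment with no newline yet); data bytes can be indexed as soon as they are seen. The carry mechanics are the same either way.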

Advantages:

  • No new parser states needed
  • Bytes carry their own state (no "going backwards")
  • Decisions made with complete information
  • Uses existing double-buffer infrastructure

Open Questions

  1. Are multi-byte delimiters affected by the same issue? My money is on YES; see, e.g., readr#1534 ("read_delim parsing issue with compressed file").
  2. Are there other open issues that might be symptoms of this underlying problem?

Tests

I developed two tests during my exploration that should be useful (both currently fail):

test_that("comment at exact buffer boundary is handled correctly", {
  withr::local_envvar(VROOM_CONNECTION_SIZE = "10")
  content <- "a,b\n1,2\n#comment\n3,4\n"

  # VROOM_CONNECTION_SIZE = 10
  # ==> 10 bytes per chunk, but 1 byte for null terminator ==> 9 bytes for data
  #
  # Byte:     1 2 3  4 5 6 7  8 9 ...
  # Content:  a , b \n 1 , 2 \n #  c ...
  #           ───────────────────┘
  #           first chunk      "#" falls in last byte of chunk

  con <- rawConnection(charToRaw(content), "rb")
  result <- vroom(con, delim = ",", comment = "#", col_types = "dd")

  expect_equal(result, tibble::tibble(a = c(1, 3), b = c(2, 4)))
  expect_equal(nrow(problems(result)), 0)
})

test_that("multi-char comment split across buffer boundary is handled correctly", {
  withr::local_envvar(VROOM_CONNECTION_SIZE = "10")
  content <- "a,b\n1,2\n##comment\n3,4\n"

  # VROOM_CONNECTION_SIZE = 10
  # ==> 10 bytes per chunk, but 1 byte for null terminator ==> 9 bytes for data
  #
  # Byte:     1 2 3  4 5 6 7  8 9 | 10 ...
  # Content:  a , b \n 1 , 2 \n # |  # c ...
  #           ──────────────────┘
  #           first chunk       first "#" in chunk 1, second "#" in chunk 2

  con <- rawConnection(charToRaw(content), "rb")
  result <- vroom(con, delim = ",", comment = "##", col_types = "dd")

  expect_equal(result, tibble::tibble(a = c(1, 3), b = c(2, 4)))
  expect_equal(nrow(problems(result)), 0)
})
