Summary
When reading from connections (compressed files, URLs, etc.), comment lines that intersect chunk boundaries cause parsing failures (indexing failures, to be more specific). The same file reads correctly when uncompressed (and therefore memory-mapped), or often simply with a different (usually larger) chunk size.
This whole investigation was inspired by poking at tidyverse/readr#1523.
Minimal Reproduction
library(vroom)
# Force tiny chunks to hit boundary conditions
withr::local_envvar(VROOM_CONNECTION_SIZE = "10")
content <- "a,b\n1,2\n#comment\n3,4\n"
# VROOM_CONNECTION_SIZE = 10
# ==> 10 bytes per chunk, but 1 byte for null terminator ==> 9 bytes for data
#
# Byte: 1 2 3 4 5 6 7 8 9 ...
# Content: a , b \n 1 , 2 \n # c ...
# ───────────────────┘
# "#" falls in the last byte of the first chunk
con <- rawConnection(charToRaw(content), "rb")
result <- vroom(con, delim = ",", comment = "#", col_types = "dd")
# Expected: 2 rows (data rows only, comment skipped)
# Actual: 3 rows with parsing problems
result
#> # A tibble: 3 × 2
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
#> a b
#> <dbl> <dbl>
#> 1 1 2
#> 2 NA NA
#> 3 3 4
problems(result)
#> # A tibble: 2 × 5
#> row col expected actual file
#> <int> <int> <chr> <chr> <chr>
#> 1 3 1 2 columns 1 columns ""
#> 2     3     1 a double   #comment ""
Created on 2026-01-22 with reprex v2.1.1
Root Cause
Connection-based reading processes data in fixed-size chunks. Two related issues arise when comments intersect chunk boundaries:
1. Comment detected, but terminating newline in next chunk
Chunk 1: [a,b\n1,2\n#] <- comment detected at "#"
Chunk 2: [comment\n3,4\n...] <- but newline is here
skip_rest_of_line() can't find \n, returns end-of-buffer. State incorrectly becomes RECORD_START, and "comment\n" in the next chunk is parsed as data.
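A minimal sketch of this failure mode (the name `skip_rest_of_line` follows the issue, but the signature and logic here are illustrative, not vroom's actual internals): the scan returns the same "end of buffer" position whether it consumed a complete line or ran out of bytes mid-line, so the caller cannot distinguish the two cases.

```cpp
#include <string>

// Hypothetical sketch of a skip_rest_of_line()-style scan: advance to just
// past the next '\n', or to the end of the buffer if no newline is visible.
// Returning buf.size() is ambiguous: the caller cannot tell "line fully
// consumed" from "line continues in the next chunk", so it may wrongly
// reset to RECORD_START.
static size_t skip_rest_of_line(const std::string& buf, size_t pos) {
  size_t nl = buf.find('\n', pos);
  return (nl == std::string::npos) ? buf.size() : nl + 1;
}
```

With the 9-byte chunk from the reprex, scanning from the `#` finds no newline and returns the buffer end, indistinguishable from a successfully skipped line.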
2. Multi-char comment split across chunks
Chunk 1: [...data\n#] <- only first "#" visible
Chunk 2: [#rest...\n] <- second "#" here
With comment "##" and only 1 byte visible, we can't determine if it's a comment or data.
Why Memory-Mapped Files Work
For multi-threaded file reading, chunk boundaries are aligned to newlines via find_next_newline(). Comment prefixes appear at line starts, so they're never split.
Connection reading has arbitrary boundaries (wherever the buffer fills), causing the problem.
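The alignment step can be sketched as follows (illustrative only, not vroom's actual `find_next_newline()` code): each shard's tentative end is pushed forward to just past the next `\n`, so no line, and hence no comment prefix, ever spans two shards.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of newline-aligned shard boundaries: split `buf` into
// shards of roughly `target` bytes, extending each boundary to just past the
// next '\n'. Returns (begin, end) offset pairs.
static std::vector<std::pair<size_t, size_t>> shard(const std::string& buf,
                                                    size_t target) {
  std::vector<std::pair<size_t, size_t>> out;
  size_t begin = 0;
  while (begin < buf.size()) {
    size_t end = begin + target;
    if (end >= buf.size()) {
      end = buf.size();  // final shard: take the remainder
    } else {
      size_t nl = buf.find('\n', end);
      end = (nl == std::string::npos) ? buf.size() : nl + 1;
    }
    out.push_back({begin, end});
    begin = end;
  }
  return out;
}
```

On the reprex content with a 9-byte target, the first shard is extended past the entire `#comment\n` line, so the comment is always seen whole.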
Prior Art in vroom: CRLF Boundary Fix
5fc54e6 fixed an analogous problem with \r\n spanning chunks (#331). From that commit message:
"The file would read fine until a line ending happened to fall on the connection buffer boundary, then the rest of the file would have a garbled index."
This was a much simpler problem, admitting a relatively straightforward solution. But the general description of the bad stuff that happens applies here too.
Proposed Solution: Pending Token Buffer
Leverage the existing double-buffer structure. When bytes at the end of a chunk can't be fully evaluated:
- Detect "limbo" bytes - partial comment prefix, or comment detected but no newline found
- Copy to next buffer - prepend limbo bytes to the start of the other buffer
- Read after them - fill the rest of the buffer from the connection
- Exclude from write - don't write limbo bytes to temp file (they'll be written with next chunk)
Chunk N in buf[0]: [...data...][#] <- can't decide, copy "#"
↓
Chunk N+1 in buf[1]: [#][...new data from connection...]
^
read starts here, after copied byte
This is the standard "pending token buffer" pattern used by streaming parsers (SAX, simdjson, etc.).
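The carry-over can be sketched as below. All names (`limbo_bytes`, `rechunk`) are hypothetical, and the code models only the chunk-boundary bookkeeping, not vroom's actual indexer: ambiguous trailing bytes (a partial comment prefix, or a detected comment with no newline yet) are held back and prepended to the next chunk before it is processed.

```cpp
#include <string>
#include <vector>

// How many trailing bytes of `chunk` are in "limbo": the final (incomplete)
// line is either a partial match of the comment marker, or starts with the
// full marker but its terminating '\n' has not been seen yet. Partial data
// lines are not carried; they are handled by normal tokenizer state.
static size_t limbo_bytes(const std::string& chunk, const std::string& comment) {
  size_t nl = chunk.rfind('\n');
  size_t line_start = (nl == std::string::npos) ? 0 : nl + 1;
  std::string tail = chunk.substr(line_start);
  if (!tail.empty() &&
      (comment.compare(0, tail.size(), tail) == 0 ||   // tail is a prefix of the marker
       tail.compare(0, comment.size(), comment) == 0)) // marker seen, newline pending
    return tail.size();
  return 0;
}

// Split `content` into fixed-size chunks, carrying limbo bytes forward so
// every emitted chunk ends on a safely classifiable boundary. (A carried
// prefix makes the next chunk slightly longer than chunk_size, mirroring
// the "read after them" step above.)
static std::vector<std::string> rechunk(const std::string& content,
                                        size_t chunk_size,
                                        const std::string& comment) {
  std::vector<std::string> out;
  std::string pending;
  for (size_t pos = 0; pos < content.size(); pos += chunk_size) {
    std::string chunk = pending + content.substr(pos, chunk_size);
    size_t limbo = limbo_bytes(chunk, comment);
    if (pos + chunk_size >= content.size()) limbo = 0;  // last chunk: flush everything
    pending = chunk.substr(chunk.size() - limbo);
    out.push_back(chunk.substr(0, chunk.size() - limbo));
  }
  return out;
}
```

With the reprex content and 9-byte chunks, the lone `#` at the first chunk's edge is carried, so the next chunk begins with the complete `#comment\n` line; with marker `##`, the split `#`/`#` pair is reunited the same way. A real implementation would also cap the carried bytes (e.g. a very long comment line with no newline in sight) rather than let `pending` grow unboundedly.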
Advantages:
- No new parser states needed
- Bytes carry their own state (no "going backwards")
- Decisions made with complete information
- Uses existing double-buffer infrastructure
Open Questions
- Are multi-byte delimiters affected by the same issue? My money is on YES. E.g. poke at tidyverse/readr#1534 (read_delim parsing issue with a compressed file).
- Are there other open issues that might be symptoms of this underlying problem?
Tests
I developed 2 tests during my exploration that will be useful (both currently fail):
test_that("comment at exact buffer boundary is handled correctly", {
withr::local_envvar(VROOM_CONNECTION_SIZE = "10")
content <- "a,b\n1,2\n#comment\n3,4\n"
# VROOM_CONNECTION_SIZE = 10
# ==> 10 bytes per chunk, but 1 byte for null terminator ==> 9 bytes for data
#
# Byte: 1 2 3 4 5 6 7 8 9 ...
# Content: a , b \n 1 , 2 \n # c ...
# ───────────────────┘
# "#" falls in the last byte of the first chunk
con <- rawConnection(charToRaw(content), "rb")
result <- vroom(con, delim = ",", comment = "#", col_types = "dd")
expect_equal(result, tibble::tibble(a = c(1, 3), b = c(2, 4)))
expect_equal(nrow(problems(result)), 0)
})
test_that("multi-char comment split across buffer boundary is handled correctly", {
withr::local_envvar(VROOM_CONNECTION_SIZE = "10")
content <- "a,b\n1,2\n##comment\n3,4\n"
# VROOM_CONNECTION_SIZE = 10
# ==> 10 bytes per chunk, but 1 byte for null terminator ==> 9 bytes for data
#
# Byte: 1 2 3 4 5 6 7 8 9 | 10 ...
# Content: a , b \n 1 , 2 \n # | # c ...
# ──────────────────┘
# first "#" in chunk 1, second "#" in chunk 2
con <- rawConnection(charToRaw(content), "rb")
result <- vroom(con, delim = ",", comment = "##", col_types = "dd")
expect_equal(result, tibble::tibble(a = c(1, 3), b = c(2, 4)))
expect_equal(nrow(problems(result)), 0)
})