read_csv2 fails within certain files if compressed #1620

I have hundreds of zip archives, each containing a single CSV data file. Most of them read fine, but for some `read_csv2()` does not read the file completely: from a certain point in the file onward, all columns that originally contained data are filled with `NA`.

``` r
filedata <- readr::read_csv2("./problemfile.zip")
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> Rows: 7179 Columns: 17
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (2): DateTime, Battery2: Voltage [mV]
#> dbl (6): AgvId, System Battery Level [%], System Battery Current [mA], Batte...
#> lgl (9): System Battery Voltage [mV], Battery1: Voltage [mV], Battery1: Char...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(length(filedata$DateTime[is.na(filedata$DateTime) == TRUE]))
#> [1] 2995
```

<sup>Created on 2026-03-07 with [reprex v2.1.1](https://reprex.tidyverse.org)</sup>

Example file: [problemfile.zip](https://github.com/user-attachments/files/25813318/problemfile.zip)

The files are not well formatted (not my fault): the header has more columns than the data rows. That is not the cause, though, because if I unzip the file first and read the CSV directly, no problem occurs and the data is complete:

``` r
filedata <- readr::read_csv2("./problemfile.csv")
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> Rows: 7180 Columns: 17
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr  (1): DateTime
#> dbl  (3): AgvId, System Battery Level [%], System Battery Current [mA]
#> lgl (13): System Battery Voltage [mV], Battery1: Voltage [mV], Battery1: Cha...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(length(filedata$DateTime[is.na(filedata$DateTime) == TRUE]))
#> [1] 0
```

<sup>Created on 2026-03-07 with [reprex v2.1.1](https://reprex.tidyverse.org)</sup>

Interestingly, `read_csv2()` reports a different number of rows when reading the unzipped file directly (7180 instead of 7179). In the attached file, the line that goes missing or gets misaligned is line 4184: when read from the compressed file, it contains its own data plus malformed data from the following line. From there on, only malformed data appears, placed in the wrong columns:

|DateTime            | AgvId| System Battery Level [%]|System Battery Voltage [mV] | System Battery Current [mA]|Battery1: Voltage [mV] |Battery1: Charging Cycles |Battery1: State |Battery2: Voltage [mV] | Battery2: Charging Cycles| Battery2: State|Battery3: Voltage [mV] | Battery3: Charging Cycles|Battery3: State |Battery4: Voltage [mV] |Battery4: Charging Cycles |Battery4: State |
|:-------------------|-----:|------------------------:|:---------------------------|---------------------------:|:----------------------|:-------------------------|:---------------|:----------------------|-------------------------:|---------------:|:----------------------|-------------------------:|:---------------|:----------------------|:-------------------------|:---------------|
|10.10.2025 13:58:43 |     3|                      100|NA                          |                           0|NA                     |NA                        |NA              |NA                     |                        NA|              NA|NA                     |                        NA|NA              |NA                     |NA                        |NA              |
|10.10.2025 13:58:43 |     4|                        0|NA                          |                          NA|NA                     |NA                        |NA              |0.10.2025 13:58:43     |                         5|               0|NA                     |                        NA|NA              |NA                     |NA                        |NA              |
|NA                  |    NA|                       NA|NA                          |                          NA|NA                     |NA                        |NA              |0.10.2025 13:59:43     |                         1|              95|NA                     |                         0|NA              |NA                     |NA                        |NA              |
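
To locate the point where the misaligned rows start, a small helper like the following could be used. This is only a sketch: `first_na_row()` is a hypothetical name, and it assumes that `DateTime` is never legitimately missing in the source data, so the first `NA` marks the first broken row.

``` r
# Index of the first NA in a vector, or NA_integer_ if there is none.
first_na_row <- function(x) {
  idx <- which(is.na(x))
  if (length(idx) == 0) NA_integer_ else idx[[1]]
}

# Usage, assuming `filedata` was read from the zip file as shown above:
# first_na_row(filedata$DateTime)
```

The returned value is a row index in the data frame, so the corresponding line in the CSV file is offset by the header line.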

Changing the `guess_max` or `lazy` options changes neither the position nor the result. My original code used `col_types = readr::cols_only()` definitions, which even hid the malformed data (I only noticed this while creating the reprex).

Environment variations I tried, none of which changed the result:
- Different readr versions: 2.2.0, 2.1.6, 2.0.0
- Different R versions (Win 11): 4.5.2, 4.4.3, 4.3.3, 4.2.3
- Different OS: Win 11 (R 4.5.2) vs. Linux (R 4.5.0)

What I tried with the file itself:
- Zipping the working CSV file again: still fails, regardless of zip tool or compression level
- Unzipping programmatically works as long as the output is written to an actual text file that `read_csv2()` then reads. Passing just a connection to the zipped file does not work
- Re-saving the CSV file and zipping it again:
  - Without changes: no effect
  - Converting from UTF-8 to ANSI: no effect
  - Deleting lines after the faulty one: no effect
  - Deleting or adding just one character in a line before the problematic one (including in the header): makes it work
  - Replacing characters without changing the total number of characters before the problem (e.g. moving characters between lines): no effect
- Using other compression formats fails the same way: gzip, bzip2, xz
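
The workaround of unzipping to a real file first can be sketched roughly as follows. The function name `read_zipped_csv2()` and its `reader` argument are hypothetical and exist only for illustration; the sketch assumes each archive contains exactly one CSV file, as is the case for my data.

``` r
# Extract the single CSV from a zip archive into a temporary directory
# and read it from the plain file instead of from the archive.
# `reader` is a hypothetical argument; it defaults to readr::read_csv2().
read_zipped_csv2 <- function(zip_path, reader = readr::read_csv2, ...) {
  exdir <- tempfile("unzipped_")
  dir.create(exdir)
  # Remove the extracted copy once reading has finished.
  on.exit(unlink(exdir, recursive = TRUE), add = TRUE)

  # utils::unzip() returns the paths of the extracted files.
  csv_path <- utils::unzip(zip_path, exdir = exdir)
  stopifnot(length(csv_path) == 1)

  reader(csv_path, ...)
}

# Usage:
# filedata <- read_zipped_csv2("./problemfile.zip")
```

Because the temporary copy is deleted when the function returns, this only makes sense with eager reading (that is, without `lazy = TRUE`).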

The problem seems to be tied to a specific byte offset: as soon as the number of bytes before the problematic line changes, the problem disappears.

With that insight, I found one thing the failing files have in common: the line where the problem occurs (a different line in each file, since the data differs) ends, including its terminating `\r\n`, 2^17 bytes (131072) after the start of the decompressed file.
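
A check along these lines can be used to test whether a given decompressed file has a line ending at that offset. It is a minimal sketch in base R: `line_ends_at()` is a hypothetical helper, and it reads the whole file into memory, which is fine for files of this size.

``` r
# TRUE if some line of the file ends (counting its "\n" terminator)
# exactly `offset` bytes from the start of the file.
line_ends_at <- function(path, offset = 2^17) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  # The 1-based position of each "\n" byte equals the number of bytes
  # up to and including that line terminator.
  newline_pos <- which(bytes == as.raw(0x0a))
  offset %in% newline_pos
}

# Usage on the unzipped example file:
# line_ends_at("./problemfile.csv")
```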
