read_csv2 fails within certain files if compressed #1620
Description
I have hundreds of zipped files, each containing a single CSV data file. Most of them work fine, but some are problematic: read_csv2() does not read them completely and instead, from a certain point in the file onward, fills all columns that originally contained data with NA.
filedata <- readr::read_csv2("./problemfile.zip")
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
#> Rows: 7179 Columns: 17
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (2): DateTime, Battery2: Voltage [mV]
#> dbl (6): AgvId, System Battery Level [%], System Battery Current [mA], Batte...
#> lgl (9): System Battery Voltage [mV], Battery1: Voltage [mV], Battery1: Char...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(length(filedata$DateTime[is.na(filedata$DateTime) == TRUE]))
#> [1] 2995
Created on 2026-03-07 with reprex v2.1.1
Example file: problemfile.zip
The files are not well formatted (not my fault): the header has more columns than the data rows. That is not the cause in general, though, because if I unzip the file beforehand and read the CSV directly, no problem occurs and the data is complete:
filedata <- readr::read_csv2("./problemfile.csv")
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
#> Rows: 7180 Columns: 17
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (1): DateTime
#> dbl (3): AgvId, System Battery Level [%], System Battery Current [mA]
#> lgl (13): System Battery Voltage [mV], Battery1: Voltage [mV], Battery1: Cha...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(length(filedata$DateTime[is.na(filedata$DateTime) == TRUE]))
#> [1] 0
Created on 2026-03-07 with reprex v2.1.1
Interestingly, read_csv2() reports a different number of rows when reading the file directly. In the attached file, the line that goes missing or gets misaligned is 4184: when read from the compressed file, it contains its own data plus malformed data from the next line. From then on, only malformed data is placed into the wrong columns:
| DateTime | AgvId | System Battery Level [%] | System Battery Voltage [mV] | System Battery Current [mA] | Battery1: Voltage [mV] | Battery1: Charging Cycles | Battery1: State | Battery2: Voltage [mV] | Battery2: Charging Cycles | Battery2: State | Battery3: Voltage [mV] | Battery3: Charging Cycles | Battery3: State | Battery4: Voltage [mV] | Battery4: Charging Cycles | Battery4: State |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10.10.2025 13:58:43 | 3 | 100 | NA | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 10.10.2025 13:58:43 | 4 | 0 | NA | NA | NA | NA | NA | 0.10.2025 13:58:43 | 5 | 0 | NA | NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA | NA | NA | NA | 0.10.2025 13:59:43 | 1 | 95 | NA | 0 | NA | NA | NA | NA |
Playing around with the guess_max or lazy options changes neither the position nor the result. My original code used col_types = readr::cols_only() definitions, which even made the malformed data invisible (I only found that while creating the reprex).
What I tried regarding the environment, without any change in behavior:
- Different readr versions: 2.2.0, 2.1.6, 2.0.0
- Different R versions (Win 11): 4.5.2, 4.4.3, 4.3.3, 4.2.3
- Different OS: Win 11 (R 4.5.2) vs. Linux (R 4.5.0)
What I tried with the file itself:
- Zipping the working CSV file again: no success with different zip compression tools or compression levels
- Unzipping programmatically works if the output really is a text file that then gets read by read_csv2(). Handing over just the connection to the zipped file does not work.
- Re-saving the CSV file and zipping:
  - Without changes, the problem persists
  - Converting from UTF-8 to ANSI does not change anything
- Deleting lines after the faulty one: No change either
- Deleting or adding just 1 character in a line before the problematic one (also in the header) makes it work
- Replacing characters without changing number of total chars before problem (e.g. moving chars between lines) has no effect
- Using other compression algorithms fails the same way: gzip, bzip2, xz
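Based on the observation in the list above that reading a really decompressed text file works, a possible workaround can be sketched as follows. This is only a sketch: `read_csv2_decompressed()` is a hypothetical helper name, not part of readr, and it assumes each zip archive contains exactly one CSV file.

```r
library(readr)

# Hypothetical helper (not part of readr): decompress to a temporary plain
# file first, then let read_csv2() parse that file instead of the archive.
read_csv2_decompressed <- function(path, ...) {
  if (tolower(tools::file_ext(path)) == "zip") {
    # Assumption: the archive holds exactly one CSV, as described above.
    exdir <- tempfile("unzipped_")
    dir.create(exdir)
    on.exit(unlink(exdir, recursive = TRUE), add = TRUE)
    csv_path <- utils::unzip(path, exdir = exdir)[1]
  } else {
    # gzfile() opened for reading also handles bzip2 and xz files.
    csv_path <- tempfile(fileext = ".csv")
    on.exit(unlink(csv_path), add = TRUE)
    con <- gzfile(path, open = "rb")
    out <- file(csv_path, open = "wb")
    repeat {
      chunk <- readBin(con, what = "raw", n = 1e6)
      if (length(chunk) == 0) break
      writeBin(chunk, out)
    }
    close(con)
    close(out)
  }
  # lazy = FALSE so all data is read before the temporary file is removed.
  readr::read_csv2(csv_path, lazy = FALSE, ...)
}
```

Calling `read_csv2_decompressed("./problemfile.zip")` should then follow the same code path as reading the unzipped CSV directly.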
The problem seems to be tied to a specific number of bytes: as soon as the byte count before the problematic line changes, the problem no longer occurs.
With this insight, I found one thing the failing files have in common: the line where the problem occurs (which differs between files because the data differs) ends, including its terminating \r\n, exactly 2^17 bytes (131072) after the start of the decompressed file.
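To check other files for the same pattern, the byte offsets of all line ends in the decompressed CSV can be compared against 2^17. The following is a minimal sketch in base R; `line_end_offsets()` and `line_ending_at()` are hypothetical helper names introduced here for illustration.

```r
# Hypothetical helper: 1-based byte position of every "\n" in a file, i.e.
# the number of bytes from the file start up to and including each line end.
line_end_offsets <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  which(bytes == as.raw(0x0a))
}

# Hypothetical helper: number of the line (header included) whose terminator
# ends exactly `boundary` bytes into the file, or NA if there is none.
line_ending_at <- function(path, boundary = 2^17) {
  hit <- which(line_end_offsets(path) == boundary)
  if (length(hit) == 0) NA_integer_ else hit
}
```

Running `line_ending_at("./problemfile.csv")` on an unzipped file returns a line number if some line ends exactly at byte 131072, and `NA` otherwise. The returned number counts the header as line 1.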