Description
System
OS: Ubuntu 5.4.0-211-generic
R.version: 4.3.1
nanoparquet version: 0.4.2
polars: 0.22.0
Main issue
While running some benchmarks to test ways to improve the efficiency of my pipeline, I encountered an issue:
Couldn't deserialize thrift: don't know what type:
The code I used, simplified, is divided into three steps:
Loading packages
library(nanoparquet)
library(data.table)
library(polars)
library(ids)
library(rcorpora)
Building the dataset
file_name <- tempfile()
single_size <- 134217728  # 2^27, roughly 134 million rows

tmp <- data.table::data.table(
  a = sample(letters, single_size, replace = TRUE),
  b = sample(1L:10000000L, single_size, replace = TRUE),
  c = ids::ids(single_size, tolower(rcorpora::corpora("words/adjs")$adjs),
               tolower(rcorpora::corpora("games/pokemon")$pokemon$name)),
  d = sample(1:10000000000, single_size, replace = TRUE),
  e = ids::ids(single_size, rcorpora::corpora("games/dark_souls_iii_messages")$templates),
  f = ids::ids(single_size, tolower(rcorpora::corpora("colors/xkcd")$colors$color),
               tolower(rcorpora::corpora("colors/xkcd")$colors$hex)),
  g = ids::ids(single_size, rcorpora::corpora("words/spells")$spells$incantation),
  h = NA,       # all-NA logical column
  i = NA_real_  # all-NA double column
)
Simplified benchmarking
nanoparquet::write_parquet(tmp, file_name, compress = "snappy")
nanoparquet::read_parquet(file_name)
Further findings
After some trial and error, I made some discoveries:
- Not all compression methods generated the same error (see the first sketch after this list):
  a) snappy: Couldn't deserialize thrift: don't know what type:
  b) zstd: Couldn't deserialize thrift: TProtocolException: Invalid data
  c) uncompressed: Couldn't deserialize thrift: TProtocolException: Invalid data
  d) gzip: no error; the script worked
- The machine has more than enough resources:
  a) There is space available on disk: df -h showed 783GB free
  b) Same story for the RAM: free -g reported 472GB total, with over 90% not allocated
- Creating a minimal working example failed to reproduce the issue: generating 100 datasets of 10M rows each by cycling the RNG seed and running write_parquet with compress = "uncompressed" on each of them didn't raise any error (second sketch below)
- Running write_parquet on columns b, c, d did not work; however, the script works on each pair subset of that triplet ((b, c), (c, d), (b, d)) (third sketch below)
- Writing and reading a parquet file of the same dataset using polars works:
  as_polars_df(tmp)$write_parquet(file_name, compress = "uncompressed", statistics = F)
  pl$read_parquet(file_name)
- Files unopenable by nanoparquet are also unopenable by polars (fourth sketch below); the script errors out with:
  parquet: File out of specification: Invalid thrift: protocol error
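
For reference, here is a minimal sketch of the per-codec check behind the first finding. It assumes the packages loaded above and the tmp object built earlier, and mirrors the compress argument used in the benchmark; the error reportedly surfaces on the read-back step:

for (codec in c("snappy", "zstd", "uncompressed", "gzip")) {
  out <- tempfile(fileext = ".parquet")
  # write with the given codec, then try to read the file back
  res <- tryCatch({
    nanoparquet::write_parquet(tmp, out, compress = codec)
    nanoparquet::read_parquet(out)
    "ok"
  }, error = function(e) conditionMessage(e))
  cat(codec, "->", res, "\n")
  unlink(out)
}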
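
The minimal-working-example attempt from the third finding looked roughly like the following. make_dataset() is a hypothetical helper standing in for the full column recipe above (abbreviated here to three columns); the loop never raised an error at this size:

# hypothetical helper: rebuilds the dataset above with n rows
# (abbreviated to three columns; the real attempt used the full recipe)
make_dataset <- function(n) {
  data.table::data.table(
    b = sample(1L:10000000L, n, replace = TRUE),
    c = ids::ids(n, tolower(rcorpora::corpora("words/adjs")$adjs),
                 tolower(rcorpora::corpora("games/pokemon")$pokemon$name)),
    d = sample(1:10000000000, n, replace = TRUE)
  )
}

for (seed in 1:100) {
  set.seed(seed)
  small <- make_dataset(10000000)  # 10M rows instead of 134M
  out <- tempfile(fileext = ".parquet")
  nanoparquet::write_parquet(small, out, compress = "uncompressed")
  nanoparquet::read_parquet(out)
  unlink(out)
}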
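
The column-subset test from the fourth finding, again only a sketch assuming tmp from above: the (b, c, d) triplet fails while every pair succeeds:

subsets <- list(c("b", "c", "d"), c("b", "c"), c("c", "d"), c("b", "d"))
for (cols in subsets) {
  out <- tempfile(fileext = ".parquet")
  res <- tryCatch({
    # data.table's ..cols selects the columns named in the character vector
    nanoparquet::write_parquet(tmp[, ..cols], out, compress = "uncompressed")
    nanoparquet::read_parquet(out)
    "ok"
  }, error = function(e) conditionMessage(e))
  cat(paste(cols, collapse = ", "), "->", res, "\n")
  unlink(out)
}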
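
Finally, the cross-check behind the last finding, sketched with snappy as one of the failing codecs: the file nanoparquet cannot read back is rejected by polars as well.

broken <- tempfile(fileext = ".parquet")
nanoparquet::write_parquet(tmp, broken, compress = "snappy")

# reportedly fails with: Couldn't deserialize thrift: don't know what type:
nanoparquet::read_parquet(broken)

# reportedly fails with: parquet: File out of specification: Invalid thrift: protocol error
pl$read_parquet(broken)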
Closing remarks
Taking everything into account, the problem is probably related to writing large datasets; however, I couldn't pinpoint the location of the issue or a solution to it.