
write_parquet generates corrupted files on big datasets: "Couldn't deserialize thrift" #143

@DavideMessinaARS

Description


System

OS: Ubuntu (kernel 5.4.0-211-generic)
R version: 4.3.1

nanoparquet version: 0.4.2
polars version: 0.22.0

Main issue

While running some benchmarks to test ways of improving the efficiency of my pipeline, I ran into the following error when reading back a file written by write_parquet:

Couldn't deserialize thrift: don't know what type:

The code I used, simplified here, is divided into three steps:

Loading packages

library(nanoparquet)
library(data.table)
library(polars)
library(ids)
library(rcorpora)

Building the dataset

file_name <- tempfile()
single_size <- 134217728  # 2^27 rows

# Nine columns: short strings, integers, long composite id strings,
# large numerics, and two all-NA columns.
tmp <- data.table::data.table(
  a = sample(letters, single_size, replace = T),
  b = sample(1L:10000000L, single_size, replace = T),
  c = ids::ids(single_size,
               tolower(rcorpora::corpora("words/adjs")$adjs),
               tolower(rcorpora::corpora("games/pokemon")$pokemon$name)),
  d = sample(1:10000000000, single_size, replace = T),
  e = ids::ids(single_size,
               rcorpora::corpora("games/dark_souls_iii_messages")$templates),
  f = ids::ids(single_size,
               tolower(rcorpora::corpora("colors/xkcd")$colors$color),
               tolower(rcorpora::corpora("colors/xkcd")$colors$hex)),
  g = ids::ids(single_size,
               rcorpora::corpora("words/spells")$spells$incantation),
  h = NA,
  i = NA_real_
)

Simplified benchmarking

nanoparquet::write_parquet(tmp, file_name, compress = "snappy")
nanoparquet::read_parquet(file_name)

Further findings

After some trial and error, I made a few discoveries:

  1. Not all compression methods produce the same error:
    a) snappy: Couldn't deserialize thrift: don't know what type:
    b) zstd: Couldn't deserialize thrift: TProtocolException: Invalid data
    c) uncompressed: Couldn't deserialize thrift: TProtocolException: Invalid data
    d) gzip: no error; the script worked

  2. The machine has more than enough resources:
    a) There is plenty of space on disk: df -h showed 783GB free
    b) Same story for the RAM: free -g reported 472GB total, with over 90% of it unallocated

  3. My attempt at a minimal working example failed to reproduce the problem: I created 100 datasets of 10M rows by cycling the RNG seed and ran write_parquet with compress = "uncompressed" on each of them, and no error was raised (roughly as in the sketch after this list)

  4. Running write_parquet on the subset of columns b, c, d does reproduce the error; however, the script works on every pair from that triplet ((b, c), (c, d), (b, d))

  5. Writing and reading a parquet file of the same dataset using polars works:

as_polars_df(tmp)$write_parquet(file_name, compress = "uncompressed", statistics = F)
pl$read_parquet(file_name)
  6. The files that nanoparquet cannot read back are also unreadable by polars. The script errors out with:

parquet: File out of specification: Invalid thrift: protocol error
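
For reference, points 1, 3, and 4 above were run with loops roughly like the sketch below. This is a reconstruction, not the exact benchmark code: build_dataset() is a hypothetical stand-in for the data.table construction in "Building the dataset", with single_size replaced by n, and the outcomes in the comments are those described in the list.

library(nanoparquet)
library(data.table)

# Point 1: same dataset, each compression method -- only gzip round-trips cleanly.
for (comp in c("snappy", "zstd", "uncompressed", "gzip")) {
  f <- tempfile()
  nanoparquet::write_parquet(tmp, f, compress = comp)
  tryCatch(nanoparquet::read_parquet(f),
           error = function(e) message(comp, ": ", conditionMessage(e)))
}

# Point 3: 100 smaller datasets (10M rows each), cycling the RNG seed.
# build_dataset(n) is a hypothetical helper wrapping the construction above;
# write_parquet never raised an error on these.
for (seed in 1:100) {
  set.seed(seed)
  nanoparquet::write_parquet(build_dataset(1e7), tempfile(),
                             compress = "uncompressed")
}

# Point 4: with the full dataset, the (b, c, d) subset reproduces the error,
# while every pair from that triplet is fine.
f <- tempfile()
nanoparquet::write_parquet(tmp[, .(b, c, d)], f)
nanoparquet::read_parquet(f)   # errors
nanoparquet::write_parquet(tmp[, .(b, d)], f)
nanoparquet::read_parquet(f)   # works (same for (b, c) and (c, d))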

Closing remarks

Taking everything into account, the problem is probably related to writing large datasets; however, I couldn't pinpoint the exact location of the issue or find a solution.
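
If it helps with triage, one way to narrow this down would be to bisect the row count between the known-good 10M rows and the known-bad 134217728 rows. A minimal sketch, assuming the full tmp dataset from above is still in memory, that repro_fails() is a helper defined only here, and that the corruption depends mainly on row count:

# Bisect the row count to find the smallest size that produces a corrupted file.
repro_fails <- function(n) {
  f <- tempfile()
  nanoparquet::write_parquet(tmp[1:n], f, compress = "uncompressed")
  failed <- tryCatch({ nanoparquet::read_parquet(f); FALSE },
                     error = function(e) TRUE)
  unlink(f)
  failed
}

lo <- 10000000    # largest size known to work (point 3)
hi <- 134217728   # size known to fail (full dataset)
while (hi - lo > 1) {
  mid <- lo + (hi - lo) %/% 2
  if (repro_fails(mid)) hi <- mid else lo <- mid
}
hi   # smallest row count that reproduces the corruption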
