
write_parquet generates corrupted files on big datasets: "Couldn't deserialize thrift" #143

@DavideMessinaARS

Description


System

OS: Ubuntu (kernel 5.4.0-211-generic)
R version: 4.3.1

nanoparquet version: 0.4.2
polars version: 0.22.0

Main issue

While running some benchmarks to test ways of improving the efficiency of my pipeline, I ran into the following error when reading back a file written by write_parquet:

Couldn't deserialize thrift: don't know what type:

The code I used, simplified here, is divided into three steps:

Loading packages

library(nanoparquet)
library(data.table)
library(polars)
library(ids)
library(rcorpora)

Building the dataset

file_name <- tempfile()
single_size <- 134217728  # 2^27 rows

# Nine columns: short strings, integers, long composite id strings,
# large numerics, and two all-NA columns.
tmp <- data.table::data.table(
  a = sample(letters, single_size, replace = T),
  b = sample(1L:10000000L, single_size, replace = T),
  c = ids::ids(single_size,
               tolower(rcorpora::corpora("words/adjs")$adjs),
               tolower(rcorpora::corpora("games/pokemon")$pokemon$name)),
  d = sample(1:10000000000, single_size, replace = T),
  e = ids::ids(single_size,
               rcorpora::corpora("games/dark_souls_iii_messages")$templates),
  f = ids::ids(single_size,
               tolower(rcorpora::corpora("colors/xkcd")$colors$color),
               tolower(rcorpora::corpora("colors/xkcd")$colors$hex)),
  g = ids::ids(single_size,
               rcorpora::corpora("words/spells")$spells$incantation),
  h = NA,
  i = NA_real_
)

Simplified benchmarking

nanoparquet::write_parquet(tmp, file_name, compress = "snappy")
nanoparquet::read_parquet(file_name)

Further findings

After some trial and error, I made a few discoveries:

  1. Not all compression methods produce the same error:
    a) snappy: Couldn't deserialize thrift: don't know what type:
    b) zstd: Couldn't deserialize thrift: TProtocolException: Invalid data
    c) uncompressed: Couldn't deserialize thrift: TProtocolException: Invalid data
    d) gzip: no error; the script worked

  2. The machine has more than enough resources:
    a) There is plenty of space on disk: df -h showed 783GB free
    b) Same story for the RAM: free -g reported 472GB total, with over 90% of it unallocated

  3. My attempt at a minimal working example failed to reproduce the problem: I created 100 datasets of 10M rows by cycling the RNG seed and ran write_parquet with compress = "uncompressed" on each of them, and no error was raised (roughly as in the sketch after this list)

  4. Running write_parquet on the subset of columns b, c, d does reproduce the error; however, the script works on every pair from that triplet ((b, c), (c, d), (b, d))

  5. Writing and reading a parquet file of the same dataset using polars works:

as_polars_df(tmp)$write_parquet(file_name, compress = "uncompressed", statistics = F)
pl$read_parquet(file_name)
  6. The files that nanoparquet cannot read back are also unreadable by polars. The script errors out with:

parquet: File out of specification: Invalid thrift: protocol error
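
For reference, points 1, 3, and 4 above were run with loops roughly like the sketch below. This is a reconstruction, not the exact benchmark code: build_dataset() is a hypothetical stand-in for the data.table construction in "Building the dataset", with single_size replaced by n, and the outcomes in the comments are those described in the list.

library(nanoparquet)
library(data.table)

# Point 1: same dataset, each compression method -- only gzip round-trips cleanly.
for (comp in c("snappy", "zstd", "uncompressed", "gzip")) {
  f <- tempfile()
  nanoparquet::write_parquet(tmp, f, compress = comp)
  tryCatch(nanoparquet::read_parquet(f),
           error = function(e) message(comp, ": ", conditionMessage(e)))
}

# Point 3: 100 smaller datasets (10M rows each), cycling the RNG seed.
# build_dataset(n) is a hypothetical helper wrapping the construction above;
# write_parquet never raised an error on these.
for (seed in 1:100) {
  set.seed(seed)
  nanoparquet::write_parquet(build_dataset(1e7), tempfile(),
                             compress = "uncompressed")
}

# Point 4: with the full dataset, the (b, c, d) subset reproduces the error,
# while every pair from that triplet is fine.
f <- tempfile()
nanoparquet::write_parquet(tmp[, .(b, c, d)], f)
nanoparquet::read_parquet(f)   # errors
nanoparquet::write_parquet(tmp[, .(b, d)], f)
nanoparquet::read_parquet(f)   # works (same for (b, c) and (c, d))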

Closing remarks

Taking everything into account, the problem is probably related to writing large datasets; however, I couldn't pinpoint the exact location of the issue or find a solution.
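
If it helps with triage, one way to narrow this down would be to bisect the row count between the known-good 10M rows and the known-bad 134217728 rows. A minimal sketch, assuming the full tmp dataset from above is still in memory, that repro_fails() is a helper defined only here, and that the corruption depends mainly on row count:

# Bisect the row count to find the smallest size that produces a corrupted file.
repro_fails <- function(n) {
  f <- tempfile()
  nanoparquet::write_parquet(tmp[1:n], f, compress = "uncompressed")
  failed <- tryCatch({ nanoparquet::read_parquet(f); FALSE },
                     error = function(e) TRUE)
  unlink(f)
  failed
}

lo <- 10000000    # largest size known to work (point 3)
hi <- 134217728   # size known to fail (full dataset)
while (hi - lo > 1) {
  mid <- lo + (hi - lo) %/% 2
  if (repro_fails(mid)) hi <- mid else lo <- mid
}
hi   # smallest row count that reproduces the corruption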
