[R] Arrow write_parquet() breaks data.table ability to set columns by reference #45300
Description
As of version 17.0 of arrow, the `.internal.selfref` attribute is no longer saved with a data.table when using `write_parquet()`. This breaks data.table's ability to set columns by reference unless you re-cast the object to a data.table (using `setDT()`) after reading it back in with `read_parquet()`.
The bug can be replicated as follows (I'm on Windows 11, using version 4.4.2 of R).
```r
library(arrow)      # version 18.1.0.1
library(data.table) # version 1.16.4

dt <- data.table(x = 1:3)
names(attributes(dt))
# returns
# "names" "row.names" "class" ".internal.selfref"

# works, creating a new column by reference
dt[, y := letters[1:3]]

# save file using write_parquet
write_parquet(dt, "test.parquet")

# read file back in using read_parquet
dt_after_parquet <- read_parquet("test.parquet")

# this has stripped away the .internal.selfref attribute
names(attributes(dt_after_parquet))
# returns
# "names" "row.names" "class"

# meaning that this works, but with the following warning message
dt_after_parquet[, z := 4:6]
# Warning message:
# In `[.data.table`(dt_after_parquet, , `:=`(z, 4:6)) :
#   Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the
#   data.table so that := can add this new column by reference. At an earlier point,
#   this data.table has been copied by R (or was created manually using structure()
#   or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy
#   the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames
#   and ?setattr. If this message doesn't help, please report your use case to the
#   data.table issue tracker so the root cause can be fixed or this message improved.
```
The behaviour of the `.internal.selfref` attribute is described on Stack Overflow, here.
Despite not being an error, that warning message is significant: a main strength of data.table is that, unlike other R packages, it does not copy a data frame when manipulating it or creating columns within it. Our workflow involves large data.tables in a targets pipeline, with interim files saved as Parquet, which targets reads using `arrow::read_parquet()` - this bug slows down our projects, as data.table takes shallow copies of our large data.tables.

Note that in arrow version 16.0 and earlier this issue did not occur: data.tables made the round trip through `write_parquet()`/`read_parquet()` successfully, maintaining the ability to set columns by reference.
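For anyone hitting this in the meantime, the `setDT()` re-cast mentioned above can be applied immediately after reading; a sketch, assuming the `test.parquet` file from the reproduction above:

```r
library(arrow)
library(data.table)

dt_after_parquet <- read_parquet("test.parquet")

# setDT() rebuilds the .internal.selfref attribute in place,
# restoring := semantics without a deep copy of the data
setDT(dt_after_parquet)
".internal.selfref" %in% names(attributes(dt_after_parquet))
# [1] TRUE

dt_after_parquet[, z := 4:6]  # no warning; column added by reference
```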
Component(s)
R