Description
The features currently planned for fst:
version 0.8.4:
Intermediate release to fix the Clang 6.0 build errors (#118)
version 0.8.6:
- Multi-threaded serialization of character columns
- Basic data.table interface:
ft <- fst("1.fst")
ft[1:1000, .(ColA, ColB)] # on-disk row subsetting + column selection
ft[ColA > 50, .(ColA)] # on-disk subsetting using simple expression + column selection
ft[ColA == median(ColB), .(ColB)] # subsetting using custom expression + column selection
ft[ColA == ColB, .(ColSum = ColA + ColB)] # subsetting + compute on column selection
Note that there is no grouping functionality in this basic interface (yet), but there will be:
- on-disk (random) logical row sub-setting (requiring only memory for selected rows) (argument i)
- on-disk row sub-setting using an expression (requiring only memory for columns used in the expression) (argument i)
- on-disk column selection (argument j)
- computations on column selection (compute on j)
- Basic dplyr interface: filter, select, slice, collect, sample_n and sample_frac (only needs memory for data in the returned sample)
- Hashing of column data
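The memory claims made for the data.table interface above (an expression in i only needs the columns it mentions, and j only needs the selected rows) can be illustrated with a small base R sketch. This is not the fst implementation: `make_lazy_table` and `lazy_subset` are hypothetical helpers, and an in-memory data frame with per-column loader functions stands in for an on-disk file.

```r
# Hypothetical stand-in for an on-disk table: a column is only "read"
# when its loader function is called (illustrative, not the fst API).
make_lazy_table <- function(df) {
  reads <- new.env()
  reads$log <- character(0)
  loaders <- lapply(names(df), function(nm) {
    force(nm)
    function(rows = NULL) {
      reads$log <- c(reads$log, nm)   # record which column was touched
      if (is.null(rows)) df[[nm]] else df[[nm]][rows]
    }
  })
  names(loaders) <- names(df)
  list(loaders = loaders, reads = reads)
}

# Evaluate the row filter `i` while loading only the columns it mentions,
# then load the columns named in `j` for the matching rows only.
lazy_subset <- function(tbl, i, j) {
  needed <- intersect(all.vars(i), names(tbl$loaders))
  cols <- lapply(tbl$loaders[needed], function(load) load())
  keep <- which(eval(i, cols, enclos = globalenv()))
  out <- lapply(tbl$loaders[j], function(load) load(keep))
  as.data.frame(out, stringsAsFactors = FALSE)
}

df <- data.frame(ColA = c(10, 60, 75, 40),
                 ColB = c(1, 2, 3, 4),
                 ColC = letters[1:4])
tbl <- make_lazy_table(df)

# Rough equivalent of ft[ColA > 50, .(ColB)]: ColC is never loaded.
res <- lazy_subset(tbl, quote(ColA > 50), "ColB")
```

The key step is `all.vars()`, which extracts the variable names from the unevaluated expression so that only those columns have to be brought into memory before the filter is evaluated.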
version 0.8.8:
- For the data.table interface:
  - operator := for column binding to an existing fst file
  - rbindlist to row bind multiple fst files into a single file
- For the dplyr interface: add_row, add_column, mutate
version 0.8.8 and later
Later features (in random order):
- Add on-disk grouping functionality. That requires on-disk sorting, which can be done using a merge sort algorithm.
- lapply-like functionality creating an fst file using a list of inputs (csv's, custom methods, etc.)
- interoperability:
  a) import data from Apache Parquet files
  b) Python interface
  c) C++ interface library
  d) Julia interface, ...
- advanced operations:
  a) Parallel grouping for specific methods (like +, -, *, /, sum, mean, etc.; these methods need a C++ implementation for parallel operations)
  b) binary search on table key columns (extremely fast sub-setting of a key range)
  c) Merge operations on multiple fst files (right join to start with, like in data.table)
  d) multiple fst files represent a single data set
  e) a set of fst files can be sorted in parallel into a new set of fst files. This avoids the slow end-phase of sorting algorithms like merge sort.
  f) user-defined map-reduce operations that can be used on the fst file(s) in parallel. Simple example: a custom mean method that 1) computes the sum and count of each chunk and 2) combines the results from 1) to calculate the mean.
  g) fill a data set range with specific rows from an fst file, overwriting data in-memory (Fill a data.table range with specific rows from read.fst #29).
- performance and security enhancements:
  a) encryption
  b) SIMD upgrades to the bit-shifters and pre-serialization filters used in fst
  c) a plug-in system (C++) for custom compressors to allow users to come up with faster or better compressors
  d) better compression of character columns
  e) high compression mode for slow IO (network) speeds (Very slow writing to network drive when using compression (Windows 7) #23).
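The user-defined map-reduce idea listed under advanced operations can be sketched in base R. The sketch below uses a mean, since that is the statistic that per-chunk sums and counts combine into exactly. The function names `map_chunk` and `reduce_mean` are hypothetical, and an in-memory vector split into pieces stands in for chunks read from one or more fst files.

```r
# Map step: reduce one chunk to partial results that can be combined later.
map_chunk <- function(x) c(sum = sum(x), count = length(x))

# Reduce step: add up the partial sums and counts, then derive the mean.
reduce_mean <- function(partials) {
  totals <- Reduce(`+`, partials)
  unname(totals["sum"] / totals["count"])
}

# Stand-in for chunks read from file (hypothetical data, two values per chunk).
values <- c(4, 8, 15, 16, 23, 42)
chunks <- split(values, ceiling(seq_along(values) / 2))

# Each chunk is mapped independently, so this lapply is the part that a
# parallel back-end could distribute over threads or processes.
partials <- lapply(chunks, map_chunk)
chunked_mean <- reduce_mean(partials)
```

Because the map step only returns two numbers per chunk, memory use stays independent of the chunk size, and the reduce step is cheap enough to run on a single thread.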
This list is subject to a lot of change depending on the features and issues requested or reported by users of the fst package :-)