Planned milestones for future releases #117

Open
@MarcusKlik

The features currently planned for fst:

version 0.8.4:

Intermediate release to fix the Clang 6.0 build errors (#118)

version 0.8.6:

  • Multi-threaded serialization of character columns
  • Basic data.table interface:
ft <- fst("1.fst")

ft[1:1000, .(ColA, ColB)]  # on-disk row subsetting + column selection

ft[ColA > 50, .(ColA)]  # on-disk subsetting using simple expression + column selection

ft[ColA == median(ColB), .(ColB)]  # subsetting using custom expression + column selection

ft[ColA == ColB, .(ColSum = ColA + ColB)]  # subsetting + compute on column selection

Note that there is no grouping functionality in this basic interface (yet), but there will be:
- on-disk (random) logical row sub-setting (requiring only memory for selected rows) (argument i)
- on-disk row sub-setting using an expression (requiring only memory for the columns used in the expression) (argument i)
- on-disk column selection (argument j)
- computations on column selection (compute on j)

  • Basic dplyr interface:

    • filter
    • select
    • slice
    • collect
    • sample_n and sample_frac (only needs memory for data in the returned sample)
  • Hashing of column data
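
For comparison, here is a minimal sketch of how the row and column subsetting from the data.table examples above can be approximated with the existing `read_fst()` arguments `columns`, `from` and `to`. The example data and the temporary file are made up for illustration. The expression-based filter is emulated in memory after loading one column, so it does not show the planned on-disk behaviour.

```r
library(fst)

# Hypothetical example data; the column names ColA and ColB mirror the examples above.
path <- tempfile(fileext = ".fst")
df <- data.frame(ColA = 1:5000, ColB = 5000:1)
write_fst(df, path)

# Row range + column selection: only rows 1..1000 of the two columns are read.
part <- read_fst(path, columns = c("ColA", "ColB"), from = 1, to = 1000)

# Expression-based subsetting, emulated in two steps:
# 1) read only the column used in the expression, 2) filter it in memory.
col_a <- read_fst(path, columns = "ColA")
selected <- col_a[col_a$ColA > 50, , drop = FALSE]
```

Unlike the planned interface, the second step still loads the full ColA column before filtering.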

version 0.8.8:

  • For the data.table interface:

    • operator := for column binding to an existing fst file
    • rbindlist to row bind multiple fst files into a single file
  • For the dplyr interface:

    • add_row
    • add_column
    • mutate

version 0.8.8 and later:

Later features (in no particular order):

  • Add on-disk grouping functionality. That requires on-disk sorting, which can be done using a merge sort algorithm.

  • lapply-like functionality that creates a fst file from a list of inputs (CSV files, custom methods, etc.)

  • interoperability:
    a) import data from Apache Parquet files
    b) Python interface
    c) C++ interface library
    d) Julia interface, ...

  • advanced operations:
    a) Parallel grouping for specific methods (like +, -, *, /, sum, mean, etc.; these methods need a C++ implementation for parallel operation)
    b) binary search on table key columns (extremely fast sub-setting of a key range)
    c) Merge operations on multiple fst files (right join to start with, like in data.table)
    d) multiple fst-files represent a single data set
    e) a set of fst files can be sorted in parallel into a new set of fst files. This avoids the slow end phase of sorting algorithms like merge sort.
    f) user-defined map-reduce operations that can be applied to the fst file(s) in parallel. Simple example: a custom mean method that 1) computes the sum and count of each chunk and 2) combines the results from 1) to calculate the mean.
    g) fill a data set range with specific rows from a fst file, overwriting data in-memory (Fill a data.table range with specific rows from read.fst #29).

  • performance and security enhancements:
    a) encryption
    b) SIMD upgrades to the bit-shifters and pre-serialization filters used in fst
    c) a plug-in system (C++) for custom compressors to allow users to come up with faster or better compressors
    d) better character columns compression
    e) high compression mode for slow IO (network) speeds (Very slow writing to network drive when using compression (Windows 7) #23).
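
The sum-and-count map-reduce pattern mentioned under the advanced operations can be sketched in plain R as follows. This is only an illustration, not code from fst: the helper names `split_chunks`, `map_chunk` and `reduce_mean` are hypothetical, and an in-memory vector stands in for the chunks of one or more fst files. Partial sums and counts combine exactly into a mean; other statistics would need different partial results.

```r
# Hypothetical helper: split a vector into chunks of at most chunk_size elements,
# standing in for the row chunks of one or more fst files.
split_chunks <- function(x, chunk_size) {
  split(x, ceiling(seq_along(x) / chunk_size))
}

# Map step: reduce each chunk to a partial (sum, count) pair.
map_chunk <- function(chunk) {
  c(sum = sum(chunk), count = length(chunk))
}

# Reduce step: add up the partial results and derive the overall mean.
reduce_mean <- function(partials) {
  totals <- Reduce(`+`, partials)
  unname(totals["sum"] / totals["count"])
}

values <- c(4, 8, 15, 16, 23, 42)
partials <- lapply(split_chunks(values, 4), map_chunk)
chunked_mean <- reduce_mean(partials)
```

Because each chunk is reduced independently, the map step could run in parallel, with only the small partial results kept in memory.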

This list is subject to a lot of change depending on features and issues requested/reported by users of the fst package :-)
