Planned milestones for future releases

The features currently planned for `fst`:

**version 0.8.4:**

Intermediate release to fix the Clang 6.0 build errors (#118)

**version 0.8.6:**

* Multi-threaded serialization of `character` columns
* Basic `data.table` interface:

```r
ft <- fst("1.fst")

ft[1:1000, .(ColA, ColB)]  # on-disk row subsetting + column selection

ft[ColA > 50, .(ColA)]  # on-disk subsetting using simple expression + column selection

ft[ColA == median(ColB), .(ColB)]  # subsetting using custom expression + column selection

ft[ColA == ColB, .(ColSum = ColA + ColB)]  # subsetting + compute on column selection
```

Note that this basic interface has no grouping functionality (yet), but it will offer:

- _on-disk_ (random) logical row sub-setting, requiring memory only for the selected rows (_argument i_)
- _on-disk_ row sub-setting using an expression, requiring memory only for the columns used in the expression (_argument i_)
- _on-disk_ column selection (_argument j_)
- computations on the column selection (_compute on j_)
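
As a rough illustration of the memory behaviour described above, the two-pass pattern below relies only on `write_fst` and `read_fst` (with its `columns`, `from` and `to` arguments), which `fst` already exports. It is not the planned interface: the helper `subset_fst`, the temporary file and the column names are made up for this sketch, and the second pass still loads each selected column in full.

```r
# Sketch only: emulate ft[ColA > 50, .(ColB)] with two reads.
# `subset_fst` is a hypothetical helper, not part of the fst package.
subset_fst <- function(path, filter_col, predicate, select_cols) {
  # Pass 1: load only the column used in the filter expression
  filter_values <- fst::read_fst(path, columns = filter_col)[[1]]
  keep <- which(predicate(filter_values))

  # Pass 2: load only the requested columns, then keep the matching rows.
  # This still reads each selected column completely; true on-disk
  # row sub-setting would avoid that.
  selected <- fst::read_fst(path, columns = select_cols)
  selected[keep, , drop = FALSE]
}

if (requireNamespace("fst", quietly = TRUE)) {
  path <- tempfile(fileext = ".fst")
  df <- data.frame(ColA = 1:100, ColB = 100:1, ColC = runif(100))
  fst::write_fst(df, path)

  # Contiguous row range + column selection, read straight from disk
  first_rows <- fst::read_fst(path, columns = c("ColA", "ColB"), from = 1, to = 10)

  # Expression-based sub-setting via the two-pass helper
  result <- subset_fst(path, "ColA", function(x) x > 50, "ColB")
}
```

Peak memory here is bounded by the filter column plus the selected columns, never the whole table, which is the property the planned `i` and `j` arguments aim to provide directly.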

* Basic `dplyr` interface:
    - `filter`
    - `select`
    - `slice`
    - `collect`
    - `sample_n` and `sample_frac` (only needs memory for data in the returned sample)

* Hashing of column data

**version 0.8.8:**

* For the `data.table` interface:
    - operator `:=` for column binding to an existing `fst` file
    - `rbindlist` to row bind multiple `fst` files into a single file

* For the `dplyr` interface:
    - `add_row`
    - `add_column`
    - `mutate`

**version 0.8.8 and later:**

Later features (in random order):

* Add _on-disk_ grouping functionality. This requires _on-disk_ sorting, which can be done with a merge sort algorithm.

* `lapply`-like functionality that creates a `fst` file from a list of inputs (CSV files, custom methods, etc.)

* _interoperability:_
    a) import data from Apache Parquet files
    b) Python interface
    c) C++ interface library
    d) Julia interface, ...

* _advanced operations:_
    a) parallel grouping for specific methods (such as `+`, `-`, `*`, `/`, `sum` and `mean`; these methods need a C++ implementation to run in parallel)
    b) binary search on table key columns (extremely fast sub-setting of a key range)
    c) merge operations on multiple `fst` files (right join to start with, like in `data.table`)
    d) multiple `fst` files representing a single data set
    e) sorting a set of `fst` files in parallel into a new set of `fst` files, which avoids the slow end phase of sorting algorithms like merge sort
    f) user-defined map-reduce operations that can run on the `fst` file(s) in parallel. Simple example: a custom mean method that 1) computes the sum and count of each chunk and 2) combines the results from 1) to calculate the mean.
    g) fill a data set range with specific rows from a `fst` file, overwriting data in-memory (#29)

* _performance and security enhancements:_
    a) encryption
    b) SIMD upgrades to the bit-shifters and pre-serialization filters used in `fst`
    c) a plug-in system (C++) for custom compressors, so users can supply faster or better compressors
    d) better compression of character columns
    e) high compression mode for slow IO (network) speeds (#23)
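
The user-defined map-reduce item under advanced operations can be pictured with a small base R sketch. Everything here is an assumption made for illustration: the chunks are in-memory vectors standing in for row ranges of a `fst` file, and `map_chunk` and `reduce_mean` are hypothetical names, not part of any existing or planned API.

```r
# Sketch only: in-memory chunks stand in for row ranges of an fst file.
# All function names here are hypothetical.

# Map step: reduce one chunk to a small summary (sum and count)
map_chunk <- function(chunk) {
  c(sum = sum(chunk), count = length(chunk))
}

# Reduce step: combine the per-chunk summaries into a global mean
reduce_mean <- function(summaries) {
  totals <- Reduce(`+`, summaries)
  unname(totals["sum"] / totals["count"])
}

values <- c(4, 8, 15, 16, 23, 42)
chunks <- split(values, ceiling(seq_along(values) / 2))  # three chunks of two values

# lapply could be swapped for a parallel variant to process chunks concurrently
summaries <- lapply(chunks, map_chunk)
chunked_mean <- reduce_mean(summaries)

# The chunk-wise result matches the mean computed over all values at once
stopifnot(isTRUE(all.equal(chunked_mean, mean(values))))
```

Per-chunk sums and counts are enough to recover a mean because both combine by simple addition. Statistics such as a median do not decompose this way and would need a different reduce step.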

This list is subject to a lot of change depending on features and issues requested/reported by users of the `fst` package :-)

  
