Fill a data.table range with specific rows from read.fst #29

Open
@MarcusKlik

Description

With this feature you can populate, say, rows 1001:2000 of a 1e6-row data.table with 1000 rows read by read.fst. All of this is done in memory. The feature is very useful for combining data from multiple (fst) sources into a single result table without the overhead of intermediate copies. For example, when performing a merge sort on a set of data files, you need to:

  1. read first x rows from all files
  2. sort the resulting table
  3. write some rows to disk
  4. read next x rows from the file with the smallest first chunk
  5. sort resulting table
  6. go to step 3
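The steps above can be sketched with the existing API roughly as follows. This is only an illustration under several assumptions: every input file is already sorted on a numeric, NA-free column named `key`, "the file with the smallest first chunk" is read as the file whose loaded rows end at the smallest key, and output is written as numbered part files. `merge_sorted_fst` and the part file naming are hypothetical, and `read_fst` / `write_fst` / `metadata_fst` are the current names of `read.fst` and friends in recent fst releases. Note that `rbindlist` copies the buffer on every refill, which is exactly the overhead the proposed feature would remove.

```r
library(data.table)
library(fst)

# Hypothetical helper: merge fst files that are each sorted on `key` into
# sorted part files, keeping only a few chunks in memory at a time.
merge_sorted_fst <- function(paths, out_dir, key = "key", chunk = 1000L) {
  n_rows <- vapply(paths, function(p) metadata_fst(p)$nrOfRows, numeric(1))
  pos <- numeric(length(paths))         # rows consumed from each file
  last_key <- rep(-Inf, length(paths))  # largest key loaded from each file
  part <- 0L

  read_next <- function(i) {
    to <- min(pos[i] + chunk, n_rows[i])
    rows <- read_fst(paths[i], from = pos[i] + 1, to = to, as.data.table = TRUE)
    pos[i] <<- to
    last_key[i] <<- rows[[key]][nrow(rows)]
    rows
  }

  # Step 1: read the first chunk of every non-empty file.
  buffer <- rbindlist(lapply(which(n_rows > 0), read_next))

  repeat {
    setorderv(buffer, key)              # steps 2 and 5: sort the buffer
    open <- pos < n_rows                # files that still have unread rows
    # Unread rows cannot have a key below this bound, so rows at or below
    # it are safe to write out.
    bound <- if (any(open)) min(last_key[open]) else Inf
    ready <- buffer[[key]] <= bound
    if (any(ready)) {                   # step 3: write some rows to disk
      part <- part + 1L
      write_fst(buffer[ready], file.path(out_dir, sprintf("part_%05d.fst", part)))
      buffer <- buffer[!ready]
    }
    if (!any(open)) break
    # Step 4: refill from the open file whose loaded rows end earliest.
    i <- which(open)[which.min(last_key[open])]
    buffer <- rbindlist(list(buffer, read_next(i)))
  }
  invisible(part)                       # number of part files written
}
```

Reading the part files back in name order yields the fully sorted table, and peak memory is bounded by roughly one chunk per input file.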

This can be performed efficiently in R by using data.table's fast sorting and populating the result table in memory. With such an algorithm operating on a collection of fst files, we basically have a method of sorting arbitrarily large fst files without running out of memory (and it can be done with multiple threads!).
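For comparison, here is a minimal sketch of how a row range can be filled today, assuming the target table is pre-allocated with matching column names and types. `fill_rows` is a hypothetical helper, not part of fst, and `read_fst` / `write_fst` are the current names of `read.fst` / `write.fst` in recent fst releases. The chunk is still materialized as a temporary table before `data.table::set` assigns it by reference; the requested feature would let fst write straight into the target rows and skip that intermediate copy.

```r
library(data.table)
library(fst)

# Hypothetical helper: overwrite rows `rows` of `target` (by reference)
# with rows from:to of the fst file at `path`.
fill_rows <- function(target, rows, path, from, to) {
  chunk <- read_fst(path, columns = names(target), from = from, to = to)
  stopifnot(nrow(chunk) == length(rows))
  for (col in names(target)) {
    # set() assigns in place, so `target` itself is never copied.
    set(target, i = rows, j = col, value = chunk[[col]])
  }
  invisible(target)
}

# Usage: fill rows 1001:2000 of a 1e6-row table with the first 1000 rows
# of an fst file.
path <- tempfile(fileext = ".fst")
write_fst(data.frame(id = 1:5000, x = as.numeric(1:5000)), path)
result <- data.table(id = integer(1e6), x = numeric(1e6))
fill_rows(result, 1001:2000, path, from = 1, to = 1000)
```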
