Fast factorization of character columns #38

Open
@MarcusKlik

Character columns are by far the slowest data type to process. For character columns that contain repeated values (that is, columns that are not completely random), we can solve this problem by first converting the vector into a factor. A factor stores each distinct string once as a level and represents the rows as integer codes, so it can be serialized efficiently, provided the number of levels is significantly smaller than the number of rows. Random access will suffer, because all levels have to be loaded even when only a small subset of rows is requested. This can be partly solved by reading through a streaming object that caches the levels after the first read, so that subsequent reads are faster.
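To make the idea concrete, here is a minimal sketch in Python. It is not the package's implementation. The names `factorize`, `CachedFactorReader`, and `level_loads` are hypothetical, the "on-disk" blocks are plain in-memory lists, and missing values are not handled. It only illustrates splitting a character column into levels plus integer codes, and a reader that loads the levels once and reuses them for later range reads.

```python
from typing import Dict, List, Optional, Tuple


def factorize(values: List[str]) -> Tuple[List[str], List[int]]:
    """Split a character column into unique levels and integer codes."""
    index: Dict[str, int] = {}   # level -> code, for O(1) lookup
    levels: List[str] = []       # distinct strings in order of first appearance
    codes: List[int] = []        # one integer per row
    for value in values:
        code = index.get(value)
        if code is None:
            code = len(levels)
            index[value] = code
            levels.append(value)
        codes.append(code)
    return levels, codes


class CachedFactorReader:
    """Hypothetical streaming reader that loads the levels only once."""

    def __init__(self, levels: List[str], codes: List[int]) -> None:
        # Assumption: these lists stand in for the serialized level and code blocks.
        self._stored_levels = levels
        self._stored_codes = codes
        self._cache: Optional[List[str]] = None
        self.level_loads = 0     # counts how often the level block was loaded

    def _levels(self) -> List[str]:
        if self._cache is None:
            # The expensive step: every level is loaded, whatever the row range.
            self._cache = list(self._stored_levels)
            self.level_loads += 1
        return self._cache

    def read(self, start: int, stop: int) -> List[str]:
        """Return rows [start, stop) as strings."""
        levels = self._levels()
        return [levels[code] for code in self._stored_codes[start:stop]]


column = ["NL", "DE", "NL", "FR", "DE", "NL"]
levels, codes = factorize(column)

reader = CachedFactorReader(levels, codes)
first = reader.read(0, 2)    # loads the levels
second = reader.read(3, 6)   # reuses the cached levels
```

In this sketch the first `read` pays the full cost of loading the levels, and every later `read` only touches the integer codes for the requested rows. The benefit depends on the number of levels being much smaller than the number of rows, as noted above.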
