Releases: SebKrantz/collapse
collapse version 2.0.1
- `%in%` with `set_collapse(mask = "%in%")` does not warn about overidentification when used with data frames.
- Fixed several typos in the documentation.
collapse version 2.0.0
collapse 2.0, released in mid-October 2023, introduces fast table joins and data reshaping capabilities alongside other convenience functions, and enhances the package's global configurability, including interactive namespace control.
Potentially breaking changes
- In a grouped setting, if `.data` is used inside `fsummarise()` and `fmutate()`, and `.cols = NULL`, `.data` will contain all columns except for grouping columns (in line with the `.SD` syntax of data.table). Before, `.data` contained all columns. The selection in `.cols` still refers to all columns, thus it is still possible to select all columns using e.g. `grouped_data %>% fsummarise(some_expression_involving(.data), .cols = seq_col(.))`.
Other changes
- In `qsu()`, argument `vlabels` was renamed to `labels`. But `vlabels` will continue to work.
Bug Fixes
- Fixed a bug in the integer methods of `fsum()`, `fmean()` and `fprod()` that returned `NA` if and only if there was a single integer followed by `NA`'s, e.g. `fsum(c(1L, NA, NA))` erroneously gave `NA`. This was caused by a C-level shortcut that returned `NA` when the first element of the vector had been reached (moving from back to front) without encountering any non-NA values. The bug consisted in the content of the first element not being evaluated in this case. Note that this bug did not occur with real numbers, and also not in grouped execution. Thanks @blset for reporting (#432).
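
A minimal sketch of the fixed behavior, assuming collapse >= 2.0.0 is installed and the package default `na.rm = TRUE` has not been changed in the session:

```r
library(collapse)

x <- c(1L, NA, NA)        # a single integer followed only by NA's

s_int <- fsum(x)          # sum of the one non-missing value
m_int <- fmean(x)         # mean of the one non-missing value
p_int <- fprod(x)         # product of the one non-missing value

# Real (double) input was not affected by the bug
s_dbl <- fsum(c(1, NA, NA))
```

With the fix in place, all four results should equal 1 rather than `NA`.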
Additions
- Added `join()`: class-agnostic, vectorized, and (default) verbose joins for R, modeled after the polars API. Two different join algorithms are implemented: a hash-join (default, if `sort = FALSE`) and a sort-merge-join (if `sort = TRUE`).
- Added `pivot()`: fast and easy data reshaping! It supports longer, wider and recast pivoting, including handling of variable labels, through a uniform and parsimonious API. It does not perform data aggregation, and by default does not check if the data is uniquely identified by the supplied ids. Underidentification for 'wide' and 'recast' pivots results in the last value being taken within each group. Users can toggle a duplicates check by setting `check.dups = TRUE`.
- Added `rowbind()`: a fast class-agnostic alternative to `rbind.data.frame()` and `data.table::rbindlist()`.
- Added `fmatch()`: a fast `match()` function for vectors and data frames/lists. It is the workhorse function of `join()`, and also benefits `ckmatch()`, `%!in%`, and new operators `%iin%` and `%!iin%` (see below). It is also possible to `set_collapse(mask = "%in%")` to replace `base::"%in%"` using `fmatch()`. Thanks to `fmatch()`, these operators also all support data frames/lists of vectors, which are compared row-wise.
- Added operators `%iin%` and `%!iin%`: these directly return indices, i.e. `%[!]iin%` is equivalent to `which(x %[!]in% table)`. This is useful especially for subsetting where directly supplying indices is more efficient e.g. `x[x %[!]iin% table]` is faster than `x[x %[!]in% table]`. Similarly `fsubset(wlddev, iso3c %iin% c("DEU", "ITA", "FRA"))` is very fast.
- Added `vec()`: efficiently turn matrices or data frames / lists into a single atomic vector. I am aware of multiple implementations in other packages, which are mostly inefficient. With atomic objects, `vec()` simply removes the attributes without copying the object, and with lists it directly calls `C_pivot_longer`.
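
The additions above can be tried together in a short sketch. The tables `sales`, `region` and `wide` are hypothetical toy data made up for illustration, and collapse >= 2.0.0 is assumed:

```r
library(collapse)

# Hypothetical toy tables, for illustration only
sales  <- data.frame(id = c(1L, 2L, 3L, 4L), amount = c(10, 20, 30, 40))
region <- data.frame(id = c(2L, 3L, 5L), region = c("north", "south", "east"))

# Left join on 'id' (hash join by default); verbose = 0 silences the join summary
joined <- join(sales, region, on = "id", how = "left", verbose = 0)

# fmatch() works like match(); %iin% returns the matching positions directly
pos <- fmatch(sales$id, region$id)   # position of each sales$id in region$id, NA if absent
idx <- sales$id %iin% region$id      # same as which(sales$id %in% region$id)

# Reshape a wide table to long format, checking that 'id' identifies the rows
wide <- data.frame(id = 1:2, a = c(1, 2), b = c(3, 4))
long <- pivot(wide, ids = "id", how = "longer", check.dups = TRUE)

# rowbind() stacks data frames; vec() flattens a matrix column by column
stacked <- rowbind(sales, sales)
flat    <- vec(matrix(1:6, nrow = 2))
```

Rows of `sales` without a match in `region` keep a missing `region` value in the left join, and `idx` can be used directly for subsetting, e.g. `sales[idx, ]`.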
Improvements
- `set_collapse()` now supports options 'mask' and 'remove', giving collapse a flexible namespace in the broadest sense that can be changed at any point within the active session:
  - 'mask' supports base R or dplyr functions that can be masked into the faster collapse versions. E.g. `library(collapse); set_collapse(mask = "unique")` (or, equivalently, `set_collapse(mask = "funique")`) will create `unique <- funique` in the collapse namespace, export `unique()` from the namespace, and detach and attach the namespace again so R can find it. The re-attaching also ensures that collapse comes right after the global environment, implying that all its functions will take priority over other libraries. Users can use `fastverse::fastverse_conflicts()` to check which functions are masked after using `set_collapse(mask = ...)`. The option can be changed at any time. Using `set_collapse(mask = NULL)` removes all masked functions from the namespace, and can also be called simply to ensure collapse is at the top of the search path.
  - 'remove' allows removing arbitrary functions from the collapse namespace. E.g. `set_collapse(remove = "D")` will remove the difference operator `D()`, which also exists in stats to calculate symbolic and algorithmic derivatives (this is a convenient example but not necessary since `collapse::D` is S3 generic and will call `stats::D()` on R calls, expressions or names). This is safe to do as it only modifies which objects are exported from the namespace (it does not truly remove objects from the namespace). This option can also be changed at any time. `set_collapse(remove = NULL)` will restore the exported namespace.
  For both options there exist a number of convenient keywords to bulk-mask / remove functions. For example `set_collapse(mask = "manip", remove = "shorthand")` will mask all data manipulation functions such as `mutate <- fmutate` and remove all function shorthands such as `mtt` (i.e. abbreviations for frequently used functions that collapse supplies for faster coding / prototyping).
- `set_collapse()` also supports options 'digits', 'verbose' and 'stable.algo', enhancing the global configurability of collapse.
- `qM()` now also has a `row.names.col` argument in the second position allowing generation of rownames when converting data frame-like objects to matrix e.g. `qM(iris, "Species")` or `qM(GGDC10S, 1:5)` (interaction of id's).
- `as_factor_GRP()` and `finteraction()` now have an argument `sep = "."` denoting the separator used for compound factor labels.
- `alloc()` now has an additional argument `simplify = TRUE`. `FALSE` always returns list output.
- `frename()` supports both `new = old` (pandas, used so far) and `old = new` (dplyr) style renaming conventions.
- `across()` supports negative indices, also in grouped settings: these will select all variables apart from grouping variables.
- `TRA()` allows shorthands `"NA"` for `"replace_NA"` and `"fill"` for `"replace_fill"`.
- `group()` experienced a minor speedup with >= 2 vectors as the first two vectors are now hashed jointly.
- `fquantile()` with `names = TRUE` adds up to 1 digit after the comma in the percent-names, e.g. `fquantile(airmiles, probs = 0.001)` generates appropriate names (not 0% as in the previous version).
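
Two of the smaller improvements above, sketched on made-up vectors and assuming collapse >= 2.0.0: the `sep` argument of `finteraction()` and the `"NA"` shorthand of `TRA()`.

```r
library(collapse)

# Compound factor labels joined with "_" instead of the default "."
f <- finteraction(c("a", "b", "a"), c("x", "y", "x"), sep = "_")
labels_used <- as.character(f)

# "NA" is shorthand for "replace_NA": missing values in x are replaced by the supplied statistic
x <- c(1, NA, 3)
short <- TRA(x, 0, "NA")
long  <- TRA(x, 0, "replace_NA")
```

Both `TRA()` calls are expected to return the same vector, with the missing value replaced by 0.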
collapse version 1.9.6
- New vignette on collapse's handling of R objects.
- `print.descr()` with groups and option `perc = TRUE` (the default) also shows percentages of the group frequencies for each variable.
- `funique(mtcars[NULL, ], sort = TRUE)` gave an error (for data frame with zero rows). Thanks @NicChr (#406).
- Added SIMD vectorization for `fsubset()`.
- `vlengths()` now also works for strings, and is hence a much faster version of both `lengths()` and `nchar()`. Also for atomic vectors the behavior is like `lengths()`, e.g. `vlengths(rnorm(10))` gives `rep(1L, 10)`.
- In `collap[v/g]()`, the `...` argument is now placed after the `custom` argument instead of after the last argument, in order to better guard against unwanted partial argument matching. In particular, previously the `n` argument passed to `fnth` was partially matched to `na.last`. Thanks @ummel for alerting me of this (#421).
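
A small sketch of the `vlengths()` behavior described above, assuming collapse >= 1.9.6 and plain ASCII strings (the example vectors are made up):

```r
library(collapse)

words <- c("a", "bcd", "hello")

n_chars <- vlengths(words)                    # per-string lengths, compare nchar(words)
n_elems <- vlengths(list(1:3, letters, 1))    # per-element lengths, compare lengths()
n_atom  <- vlengths(rnorm(10))                # one entry per element of an atomic vector
```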
collapse version 1.9.5
- Using `DATAPTR_RO` to point to R lists because of the use of `ALTLISTS` on R-devel.
- Replacing `!=` loop controls for SIMD loops with `<` to ensure compatibility on all platforms. Thanks @albertus82 (#399).
collapse version 1.9.4
- Improvements in `get_elem()/has_elem()`: Option `invert = TRUE` is implemented more robustly, and a function passed to `get_elem()/has_elem()` is now applied to all elements in the list, including elements that are themselves list-like. This enables the use of `inherits` to find list-like objects inside a broader list structure e.g. `get_elem(l, inherits, what = "lm")` fetches all linear model objects inside `l`.
- Fixed a small bug in `descr()` introduced in v1.9.0, producing an error if a data frame contained no numeric columns - because an internal function was not defined in that case. Also, POSIXct columns are handled better in print - preserving the time zone (thanks @cdignam-chwy #392).
- `fmean()` and `fsum()` with `g = NULL`, as well as `TRA()`, `setop()`, and related operators `%r+%`, `%+=%` etc., `setv()` and `fdist()` now utilize Single Instruction Multiple Data (SIMD) vectorization by default (if OpenMP is enabled), enabling potentially very fast computing speeds. Whether these instructions are utilized during compilation depends on your system. In general, if you want to max out collapse on your system, consider compiling from source with `CFLAGS += -O3 -march=native -fopenmp` and `CXXFLAGS += -O3 -march=native` in your `.R/Makevars`.
collapse version 1.9.3
- Added functions `fduplicated()` and `any_duplicated()`, for vectors and lists / data frames. Thanks @NicChr (#373)
- `sort` option added to `set_collapse()` to be able to set unordered grouping as a default. E.g. setting `set_collapse(sort = FALSE)` will affect `collap()`, `BY()`, `GRP()`, `fgroup_by()`, `qF()`, `qG()`, `finteraction()`, `qtab()` and internal use of these functions for ad-hoc grouping in fast statistical functions. Other uses of `sort`, for example in `funique()` where the default is `sort = FALSE`, are not affected by the global default setting.
- Fixed a small bug in `group()` / `funique()` resulting in an unnecessary memory allocation error in rare cases. Thanks @NicChr (#381).
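
A brief sketch of the duplicate helpers on made-up inputs, assuming collapse >= 1.9.3; the comparison with base R's `duplicated()` is there for orientation only:

```r
library(collapse)

x  <- c("a", "b", "a", "c", "b")
df <- data.frame(g = c(1, 1, 2), h = c("x", "x", "y"))

dup_vec <- fduplicated(x)      # flags repeated values, like duplicated(x)
dup_df  <- fduplicated(df)     # flags repeated rows, like duplicated(df)
has_dup <- any_duplicated(x)   # single TRUE/FALSE answer
```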
collapse version 1.9.2
- Further fix to an Address Sanitizer issue as required by CRAN (eliminating an unused out of bounds access at the end of a loop).
- `qsu()` finally has a grouped_df method.
- Added options `option("collapse_nthreads")` and `option("collapse_na.rm")`, which allow you to load collapse with different defaults e.g. through an `.Rprofile` or `.fastverse` configuration file. Once collapse is loaded, these options have no effect, and users need to use `set_collapse()` to change `.op[["nthreads"]]` and `.op[["na.rm"]]` interactively.
- Exported method `plot.psmat()` (can be useful to plot time series matrices).
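
As a sketch of the startup-option route, assuming the option names given above and a user-level `.Rprofile` (the values are illustrative), the defaults could be set before collapse is loaded:

```r
# In ~/.Rprofile: read once, at the time collapse is loaded
options(collapse_nthreads = 4, collapse_na.rm = FALSE)

# After library(collapse) these options are no longer consulted;
# change the defaults with set_collapse() instead, e.g.
# set_collapse(nthreads = 2, na.rm = TRUE)
```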
collapse version 1.9.1
- Fixed minor C/C++ issues flagged by CRAN's detailed checks.
- Added functions `set_collapse()` and `get_collapse()`, allowing you to globally set defaults for the `nthreads` and `na.rm` arguments to all functions in the package. E.g. `set_collapse(nthreads = 4, na.rm = FALSE)` could be a suitable setting for larger data without missing values. This is implemented using an internal environment by the name of `.op`, such that these defaults are received using e.g. `.op[["nthreads"]]`, at the computational cost of a few nanoseconds (8-10x faster than `getOption("nthreads")` which would take about 1 microsecond). `.op` is not accessible by the user, so function `get_collapse()` can be used to retrieve settings. Exempt from this are functions `.quantile`, and a new function `.range` (alias of `frange`), which go directly to C for maximum performance in repeated executions, and are not affected by these global settings. Function `descr()`, which internally calls a bunch of statistical functions, is also not affected by these settings.
- Further improvements in thread safety for `fsum()` and `fmean()` in grouped computations across data frame columns. All OpenMP enabled functions in collapse can now be considered thread safe i.e. they pass the full battery of tests in multithreaded mode.
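
A minimal sketch of changing and restoring the global defaults, assuming a recent collapse version in which `set_collapse()` invisibly returns the previous values of the modified options and accepts such a list again:

```r
library(collapse)

x <- c(1, NA, 3)

# Change the defaults, keeping the old values for later
old <- set_collapse(nthreads = 2, na.rm = FALSE)

narm_setting <- get_collapse("na.rm")   # retrieve a single setting
m_no_rm <- fmean(x)                     # missing values are no longer skipped by default

# Restore the previous defaults
set_collapse(old)
m_restored <- fmean(x)                  # missing values are skipped again
```

Storing and restoring the old options keeps the change local, which matters in scripts or packages that should not alter a user's session settings.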
collapse version 1.9.0
collapse 1.9.0, released in mid-January 2023, provides improvements in performance and versatility in many areas, as well as greater statistical capabilities, most notably efficient (grouped, weighted) estimation of sample quantiles.
Changes to functionality
- All functions renamed in collapse 1.6.0 are now deprecated, to be removed end of 2023. These functions had already been giving messages since v1.6.0. See `help("collapse-renamed")`.
- The lead operator `F()` is not exported anymore from the package namespace, to avoid clashes with `base::F` flagged by multiple people. The operator is still part of the package and can be accessed using `collapse:::F`. I have also added an option `"collapse_export_F"`, such that setting `options(collapse_export_F = TRUE)` before loading the package exports the operator as before. Thanks @matthewross07 (#100), @edrubin (#194), and @arthurgailes (#347).
- Function `fnth()` has a new default `ties = "q7"`, which gives the same result as `quantile(..., type = 7)` (R's default). More details below.
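
The new `fnth()` default can be compared with base R in a short sketch, assuming collapse >= 1.9.0 and using the built-in `mtcars` data:

```r
library(collapse)

x <- mtcars$mpg

q_fnth <- fnth(x, 0.25)                           # default ties = "q7"
q_base <- unname(quantile(x, 0.25, type = 7))     # R's default quantile type

same <- isTRUE(all.equal(q_fnth, q_base))
```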
Bug Fixes
- `fmode()` gave wrong results for singleton groups (groups of size 1) on unsorted data. I had optimized `fmode()` for singleton groups to directly return the corresponding element, but it did not access the element through the (internal) ordering vector, so the first element/row of the entire vector/data was taken. The same mistake occurred for `fndistinct` if singleton groups were `NA`, which were counted as `1` instead of `0` under the `na.rm = TRUE` default (provided the first element of the vector/data was not `NA`). The mistake did not occur with data sorted by the groups, because here the data pointer already pointed to the first element of the group. (My apologies for this bug, it took me more than half a year to discover it, using collapse on a daily basis, and it escaped 700 unit tests as well).
- Function `groupid(x, na.skip = TRUE)` returned uninitialized first elements if the first values in `x` were `NA`. Thanks for reporting @Henrik-P (#335).
- Fixed a bug in the `.names` argument to `across()`. Passing a naming function such as `.names = function(c, f) paste0(c, "-", f)` now works as intended i.e. the function is applied to all combinations of columns (c) and functions (f) using `outer()`. Previously this was just internally evaluated as `.names(cols, funs)`, which did not work if there were multiple cols and multiple funs. There is also now a possibility to set `.names = "flip"`, which names columns `f_c` instead of `c_f`.
- `fnrow()` was rewritten in C and also supports data frames with 0 columns. Similarly for `seq_row()`. Thanks @NicChr (#344).
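
The `.names` options of `across()` described above can be sketched as follows, assuming collapse >= 1.9.0, R >= 4.1 for the native pipe, and the built-in `mtcars` data:

```r
library(collapse)

# Default naming: column_function, e.g. mpg_mean
res_default <- mtcars |> fgroup_by(cyl) |>
  fsummarise(across(c(mpg, wt), list(mean = fmean, sd = fsd)))

# "flip" naming: function_column, e.g. mean_mpg
res_flip <- mtcars |> fgroup_by(cyl) |>
  fsummarise(across(c(mpg, wt), list(mean = fmean, sd = fsd), .names = "flip"))

# Custom naming function, applied to every column/function combination
res_custom <- mtcars |> fgroup_by(cyl) |>
  fsummarise(across(c(mpg, wt), list(mean = fmean, sd = fsd),
                    .names = function(c, f) paste0(c, "-", f)))
```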
Additions
- Added functions `fcount()` and `fcountv()`: a versatile and blazing fast alternative to `dplyr::count`. It also works with vectors, matrices, as well as grouped and indexed data.
- Added function `fquantile()`: Fast (weighted) continuous quantile estimation (methods 5-9 following Hyndman and Fan (1996)), implemented fully in C based on quickselect and radixsort algorithms, and also supports an ordering vector as optional input to speed up the process. It is up to 2x faster than `stats::quantile` on larger vectors, but also especially fast on smaller data, where the R overhead of `stats::quantile` becomes burdensome. For maximum performance during repeated executions, a programmer's version `.quantile()` with different defaults is also provided.
- Added function `fdist()`: A fast and versatile replacement for `stats::dist`. It computes a full Euclidean distance matrix around 4x faster than `stats::dist` in serial mode, with additional gains possible through multithreading along the distance matrix columns (decreasing thread loads as the matrix is lower triangular). It also supports computing the distance of a matrix with a single row-vector, or simply between two vectors. E.g. `fdist(mat, mat[1, ])` is the same as `sqrt(colSums((t(mat) - mat[1, ])^2))`, but about 20x faster in serial mode, and `fdist(x, y)` is the same as `sqrt(sum((x-y)^2))`, about 3x faster in serial mode. In both cases (sub-column level) multithreading is available. Note that `fdist` does not skip missing values i.e. `NA`'s will result in `NA` distances. There is also no internal implementation for integers or data frames. Such inputs will be coerced to numeric matrices.
- Added function `GRPid()` to easily fetch the group id from a grouping object, especially inside grouped `fmutate()` calls. This addition was warranted especially by the new improved `fnth.default()` method which allows orderings to be supplied for performance improvements. See comments on `fnth()` and the example provided below.
- `fsummarize()` was added as a synonym to `fsummarise`. Thanks @arthurgailes for the PR.
- C API: collapse exports around 40 C functions that provide functionality that is either convenient or rather complicated to implement from scratch. The exported functions can be found at the bottom of `src/ExportSymbols.c`. The API does not include the Fast Statistical Functions, which I thought are too closely related to how collapse works internally to be of much use to a C programmer (e.g. they expect grouping objects or certain kinds of integer vectors). But you are free to request the export of additional functions, including C++ functions.
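
The equivalences stated above for `fdist()` and `fquantile()` can be checked with a short sketch on simulated data (the seed and dimensions are arbitrary choices), assuming collapse >= 1.9.0:

```r
library(collapse)

set.seed(1)
mat <- matrix(rnorm(20), nrow = 5)
x <- rnorm(10)
y <- rnorm(10)

# Distance of every row of 'mat' to its first row
d_row      <- fdist(mat, mat[1, ])
d_row_base <- sqrt(colSums((t(mat) - mat[1, ])^2))

# Distance between two vectors
d_xy      <- fdist(x, y)
d_xy_base <- sqrt(sum((x - y)^2))

# Sample quantiles (type 7 by default, as in stats::quantile)
q_fast <- fquantile(x, c(0.1, 0.5, 0.9))
q_base <- quantile(x, c(0.1, 0.5, 0.9))

# Group counts, similar in spirit to dplyr::count(mtcars, cyl)
counts <- fcount(mtcars, cyl)
```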
Improvements
- `fnth()` and `fmedian()` were rewritten in C, with significant gains in performance and versatility. Notably, `fnth()` now supports (grouped, weighted) continuous quantile estimation like `fquantile()` (`fmedian()`, which is a wrapper around `fnth()`, can also estimate various quantile based weighted medians). The new default for `fnth()` is `ties = "q7"`, which gives the same result as `(f)quantile(..., type = 7)` (R's default). OpenMP multithreading across groups is also much more effective in both the weighted and unweighted case. Finally, `fnth.default` gained an additional argument `o` to pass an ordering vector, which can dramatically speed up repeated invocations of the function on the same data:

  ```r
  # Estimating multiple weighted-grouped quantiles on mpg: pre-computing an ordering provides extra speed.
  mtcars %>% fgroup_by(cyl, vs, am) %>%
    fmutate(o = radixorder(GRPid(), mpg)) %>% # On grouped data, need to account for GRPid()
    fsummarise(mpg_Q1 = fnth(mpg, 0.25, o = o, w = wt),
               mpg_median = fmedian(mpg, o = o, w = wt),
               mpg_Q3 = fnth(mpg, 0.75, o = o, w = wt))
  # Note that without weights this is not always faster. Quickselect can be very efficient, so it depends
  # on the data, the number of groups, whether they are sorted (which speeds up radixorder), etc...
  ```
- `BY` now supports data-length arguments to be passed e.g. `BY(mtcars, mtcars$cyl, fquantile, w = mtcars$wt)`, making it effectively a generic grouped `mapply` function as well. Furthermore, the grouped_df method now also expands grouping columns for output length > 1.
- `collap()`, which internally uses `BY` with non-Fast Statistical Functions, now also supports arbitrary further arguments passed down to functions to be split by groups. Thus users can also apply custom weighted functions with `collap()`. Furthermore, the parsing of the `FUN`, `catFUN` and `wFUN` arguments was improved and brought in line with the parsing of `.fns` in `across()`. The main benefit of this is that Fast Statistical Functions are now also detected and optimizations carried out when passed in a list providing a new name e.g. `collap(data, ~ id, list(mean = fmean))` is now optimized! Thanks @ttrodrigz (#358) for requesting this.
- `descr()`, by virtue of `fquantile` and the improvements to `BY`, supports full-blown grouped and weighted descriptions of data. This is implemented through additional `by` and `w` arguments. The function has also been turned into an S3 generic, with a default and a 'grouped_df' method. The 'descr' methods `as.data.frame` and `print` also feature various improvements, and a new `compact` argument to `print.descr`, allowing a more compact printout. Users will also notice improved performance, mainly due to `fquantile`: on the M1 `descr(wlddev)` is now 2x faster than `summary(wlddev)`, and 41x faster than `Hmisc::describe(wlddev)`. Thanks @statzhero for the request (#355).
- `radixorder` is about 25% faster on characters and doubles. This also benefits grouping performance. Note that `group()` may still be substantially faster on unsorted data, so if performance is critical try the `sort = FALSE` argument to functions like `fgroup_by` and compare.
- Most list processing functions are noticeably faster, as checking the data types of elements in a list is now also done in C, and I have made some improvements to collapse's version of `rbindlist()` (used in `unlist2d()`, and various other places).
- `fsummarise` and `fmutate` gained an ability to evaluate arbitrary expressions that result in lists / data frames without the need to use `across()`. For example: `mtcars |> fgroup_by(cyl, vs, am) |> fsummarise(mctl(cor(cbind(mpg, wt, carb)), names = TRUE))` or `mtcars |> fgroup_by(cyl) |> fsummarise(mctl(lmtest::coeftest(lm(mpg ~ wt + carb)), names = TRUE))`. There is also the possibility to compute expressions using `.data` e.g. `mtcars |> fgroup_by(cyl) |> fsummarise(mctl(lmtest::coeftest(lm(mpg ~ wt + carb, .data)), names = TRUE))` yields the same thing, but is less efficient because the whole dataset (including 'cyl') is split by groups. For greater efficiency and convenience, you can pre-select columns using a global `.cols` argument, e.g. `mtcars |> fgroup_by(cyl, vs, am) |> fsummarise(mctl(cor(.data), names = TRUE), .cols = .c(mpg, wt, carb))` gives the same as above. Three Notes about this:
  - No grouped vectorizations for fast statistical functions i.e. the entire expression is evaluated for each group. (Let m...
collapse version 1.8.9
- Fixed some warnings on rchk and newer C compilers (LLVM clang 10+).
- `.pseries` / `.indexed_series` methods also change the implicit class of the vector (attached after `"pseries"`), if the data type changed. E.g. calling a function like `fgrowth` on an integer pseries changed the data type to double, but the "integer" class was still attached after "pseries".
- Fixed bad testing for SE inputs in `fgroup_by()` and `findex_by()`. See #320.
- Added `rsplit.matrix` method.
- `descr()` now by default also reports 10% and 90% quantiles for numeric variables (in line with STATA's detailed summary statistics), and can also be applied to 'pseries' / 'indexed_series'. Furthermore, `descr()` itself now has an argument `stepwise` such that `descr(big_data, stepwise = TRUE)` yields computation of summary statistics on a variable-by-variable basis (and the finished 'descr' object is returned invisibly). The printed result is thus identical to `print(descr(big_data), stepwise = TRUE)`, with the difference that the latter first does the entire computation whereas the former computes statistics on demand.
- Function `ss()` has a new argument `check = TRUE`. Setting `check = FALSE` allows subsetting data frames / lists with positive integers without checking whether integers are positive or in-range. For programmers.
- Function `get_vars()` has a new argument `rename` allowing select-renaming of columns in standard evaluation programming, e.g. `get_vars(mtcars, c(newname = "cyl", "vs", "am"), rename = TRUE)`. The default is `rename = FALSE`, to warrant full backwards compatibility. See #327.
- Added helper function `setattrib()`, to set a new attribute list for an object by reference + invisible return. This is different from the existing function `setAttrib()` (note the capital A), which takes a shallow copy of list-like objects and returns the result.
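
A short sketch of the two programming helpers above, using the built-in `mtcars` data and assuming collapse >= 1.8.9:

```r
library(collapse)

# Select and rename in one step; unnamed entries keep their original names
sel <- get_vars(mtcars, c(newname = "cyl", "vs", "am"), rename = TRUE)
sel_names <- names(sel)

# Unchecked row subsetting: only safe when the indices are known to be valid
first_rows <- ss(mtcars, 1:3, check = FALSE)
```

Skipping the check only makes sense inside code that has already validated the indices, since out-of-range values are no longer caught.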