
Commit 7319c69

Merge pull request #750 from SebKrantz/development
Development
2 parents 3524fe2 + a50995f commit 7319c69

15 files changed, +1751 -30 lines changed

NEWS.md

+7-4
Original file line number | Diff line number | Diff line change
@@ -8,11 +8,12 @@
88

99
* `num_vars()` (and thus also `cat_vars()` and `collap()`) were changed to a simpler C-definition of numeric data types which is more in-line with `is.numeric()`: `is_numeric_C <- function(x) typeof(x) %in% c("integer", "double") && !inherits(x, c("factor", "Date", "POSIXct", "yearmon", "yearqtr"))`. The previous definition was: `is_numeric_C_old <- function(x) typeof(x) %in% c("integer", "double") && (!is.object(x) || inherits(x, c("ts", "units", "integer64")))`. Thus, the definition changed from including only certain classes to excluding the most important classes. Thanks @maouw for flagging this (#727).
1010

11-
* New improved quantile algorithm in `fquantile()` and `fnth()` (see below) does not support zero weights anymore, i.e. the code runs through, but elements with zero weights are no longer ignored by the algorithm. Thus is because the new algorithm makes it difficult to skip zero weight elements 'on the fly'.
12-
1311
### Bug Fixes
1412

15-
* Fixed some issues using *collapse* and the *tidyverse* together, particularly regarding tidyverse methods for 'grouped_df'.
13+
* Fixed some issues using *collapse* and the *tidyverse* together, particularly regarding tidyverse methods for 'grouped_df' - thanks @NicChr (#645).
14+
15+
* More consistent handling of zero-length inputs - they are now also returned in `fmean()` and `fmedian()`/`fnth()` instead of returning `NA` (#628).
16+
1617

1718
### Additions
1819

@@ -37,7 +38,9 @@ join(df1, df2, require = list(x = 0.8, fail = "warning"))
3738

3839
### Improvements
3940

40-
* The weighted quantile algorithm in `fquantile()` was changed and now uses a more theoretically sound method following [excellent notes](https://htmlpreview.github.io/?https://github.com/mjskay/uncertainty-examples/blob/master/weighted-quantiles.html) by [Matthew Kay](https://github.com/mjskay). It now also supports quantile type 4, but it does not support zero weights anymore (see above). *Note* that the existing *collapse* algorithm [already had very goood](https://github.com/mjskay/uncertainty-examples/issues/2) properties after a bug fix in v2.0.17, but the new algorithm is more theoretically sound and also faster.
41+
* The weighted quantile algorithm in `fquantile()`/`fnth()` was improved to a more theoretically sound method following [excellent notes](https://htmlpreview.github.io/?https://github.com/mjskay/uncertainty-examples/blob/master/weighted-quantiles.html) by [Matthew Kay](https://github.com/mjskay). It now also supports quantile type 4, but it does not skip zero weights anymore, as the new algorithm makes it difficult to skip them 'on the fly'. *Note* that the existing *collapse* algorithm [already had very good](https://github.com/mjskay/uncertainty-examples/issues/2) properties after a bug fix in v2.0.17, but the new algorithm is more exact and also faster.
42+
43+
* The *collapse* [**arXiv article**](https://arxiv.org/abs/2403.05038) has been updated and significantly enhanced. It is an excellent resource to get an overview of the package.
4144

4245

4346
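
Note on the `num_vars()` entry in the NEWS.md hunk above: the two definitions quoted there can be pasted into base R to see which objects change classification. This is a minimal sketch that only mirrors the R-level definitions given in the entry; the actual check is implemented in C inside *collapse*, and the test objects (`x_dt`, `x_ts`, `x_f`, `x_d`) are illustrative assumptions, not part of the commit.

```r
# Definitions copied from the NEWS entry (R-level mirrors of the C check)
is_numeric_C <- function(x) typeof(x) %in% c("integer", "double") &&
  !inherits(x, c("factor", "Date", "POSIXct", "yearmon", "yearqtr"))
is_numeric_C_old <- function(x) typeof(x) %in% c("integer", "double") &&
  (!is.object(x) || inherits(x, c("ts", "units", "integer64")))

# Illustrative objects (hypothetical, chosen to exercise both definitions)
x_dt <- as.difftime(c(1, 2), units = "hours")  # double with class "difftime"
x_ts <- ts(1:3)                                # integer with class "ts"
x_f  <- factor(c("a", "b"))                    # integer with class "factor"
x_d  <- Sys.Date()                             # double with class "Date"

# A classed double outside the old include-list is now treated as numeric
is_numeric_C(x_dt)      # TRUE
is_numeric_C_old(x_dt)  # FALSE

# "ts" was on the old include-list and is not on the new exclude-list
is_numeric_C(x_ts)      # TRUE
is_numeric_C_old(x_ts)  # TRUE

# Factors and dates are excluded under both definitions
c(is_numeric_C(x_f), is_numeric_C_old(x_f))  # FALSE FALSE
c(is_numeric_C(x_d), is_numeric_C_old(x_d))  # FALSE FALSE
```

The practical effect is that classed integer or double columns not on the exclusion list (here a `difftime`) would now be picked up by `num_vars()`, whereas before only unclassed vectors and the three listed classes were.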

README.md

+3-3
@@ -62,9 +62,9 @@ install.packages("collapse", repos = "https://fastverse.r-universe.dev")
6262
remotes::install_github("SebKrantz/collapse")
6363

6464
# Install previous versions from the CRAN Archive (requires compilation)
65-
install.packages("https://cran.r-project.org/src/contrib/Archive/collapse/collapse_1.9.6.tar.gz",
65+
install.packages("https://cran.r-project.org/src/contrib/Archive/collapse/collapse_2.0.19.tar.gz",
6666
repos = NULL, type = "source")
67-
# Older stable versions: 1.8.9, 1.7.6, 1.6.5, 1.5.3, 1.4.2, 1.3.2, 1.2.1
67+
# Older stable versions: 1.9.6, 1.8.9, 1.7.6, 1.6.5, 1.5.3, 1.4.2, 1.3.2, 1.2.1
6868
```
6969

7070
## Documentation
@@ -80,7 +80,7 @@ In addition there are several [vignettes](<https://sebkrantz.github.io/collapse/
8080

8181
### Article on arXiv
8282

83-
An [**article**](https://arxiv.org/abs/2403.05038) on *collapse* has been submitted to the [Journal of Statistical Software](https://www.jstatsoft.org/) in March 2024.
83+
An [**article**](https://arxiv.org/abs/2403.05038) on *collapse* was submitted to the [Journal of Statistical Software](https://www.jstatsoft.org/) in March 2024 and updated/revised in February 2025.
8484

8585
### Presentation at [useR 2022](https://user2022.r-project.org)
8686

man/fnth_fmedian.Rd

+7-7
@@ -13,9 +13,9 @@
1313
Fast (Grouped, Weighted) N'th Element/Quantile for Matrix-Like Objects
1414
}
1515
\description{
16-
\code{fnth} (column-wise) returns the n'th smallest element from a set of unsorted elements \code{x} corresponding to an integer index (\code{n}), or to a probability between 0 and 1. If \code{n} is passed as a probability, ties can be resolved using the lower, upper, or average of the possible elements, or, since v1.9.0, continuous quantile estimation. The new default is quantile type 7 (as in \code{\link{quantile}}). For \code{n > 1}, the lower element is always returned (as in \code{sort(x, partial = n)[n]}). See Details.
16+
\code{fnth} (column-wise) returns the n'th smallest element from a set of unsorted elements \code{x} corresponding to an integer index (\code{n}), or to a probability between 0 and 1. If \code{n} is passed as a probability, ties can be resolved using the lower, upper, or average of the possible elements, or (default) continuous quantile estimation. For \code{n > 1}, the lower element is always returned (as in \code{sort(x, partial = n)[n]}). See Details.
1717

18-
\code{fmedian} is a simple wrapper around \code{fnth}, which fixes \code{n = 0.5} and (default) \code{ties = "mean"} i.e. it averages eligible elements. See Details. %Users may prefer a quantile based definition of the weighted median.
18+
\code{fmedian} is a simple wrapper around \code{fnth}, which fixes \code{n = 0.5} and (default) \code{ties = "mean"}, i.e., it averages eligible elements. See Details. %Users may prefer a quantile based definition of the weighted median.
1919
}
2020
\usage{
2121
fnth(x, n = 0.5, \dots)
@@ -63,7 +63,7 @@ fmedian(x, \dots)
6363
1 \tab\tab "mean" \tab\tab take the arithmetic mean of all qualifying elements. \cr
6464
2 \tab\tab "min" \tab\tab take the smallest of the elements. \cr
6565
3 \tab\tab "max" \tab\tab take the largest of the elements. \cr
66-
5-9 \tab\tab "qn" \tab\tab continuous quantile types 5-9, see \code{\link{fquantile}}. \cr
66+
4-9 \tab\tab "qn" \tab\tab continuous quantile types 4-9, see \code{\link{fquantile}}. \cr
6767
}
6868
}
6969
@@ -85,10 +85,10 @@ fmedian(x, \dots)
8585

8686
}
8787
\details{
88-
For v1.9.0 \code{fnth} was completely rewritten in C and offers significantly enhanced speed and functionality. It uses a combination of quickselect, quicksort, and radixsort algorithms, combined with several (weighted) quantile estimation methods and, where possible, OpenMP multithreading. This synthesis can be summarised as follows:
88+
\code{fnth} uses a combination of quickselect, quicksort, and radixsort algorithms, combined with several (weighted) quantile estimation methods and, where possible, OpenMP multithreading:
8989

9090
\itemize{
91-
\item without weights, quickselect is used to determine a (lower) order statistic. If \code{ties \%!in\% c("min", "max")} a second order statistic is found by taking the max of the upper part of the partitioned array, and the two statistics are averaged using a simple mean (\code{ties = "mean"}), or weighted average according to a \code{\link{quantile}} method (\code{ties = "q5"-"q9"}). For \code{n = 0.5}, all supported quantile methods give the sample median. With matrices, multithreading is always across columns, for vectors and data frames it is across groups unless \code{is.null(g)} for data frames.
91+
\item without weights, quickselect is used to determine a (lower) order statistic. If \code{ties \%!in\% c("min", "max")} a second order statistic is found by taking the max of the upper part of the partitioned array, and the two statistics are averaged using a simple mean (\code{ties = "mean"}), or weighted average according to a \code{\link{quantile}} method (\code{ties = "q4"-"q9"}). For \code{n = 0.5}, all supported quantile methods give the sample median. With matrices, multithreading is always across columns, for vectors and data frames it is across groups unless \code{is.null(g)} for data frames.
9292

9393
\item with weights and no groups (\code{is.null(g)}), \code{\link{radixorder}} is called internally (on each column of \code{x}). The ordering is used to sum the weights in order of \code{x} and determine weighted order statistics or quantiles. See details below. Multithreading is disabled as \code{\link{radixorder}} cannot be called concurrently on the same memory stack.
9494

@@ -104,12 +104,12 @@ For v1.9.0 \code{fnth} was completely rewritten in C and offers significantly en
104104
If \code{n > 1}, the result is equivalent to (column-wise) \code{sort(x, partial = n)[n]}. Internally, \code{n} is converted to a probability using \code{p = (n-1)/(NROW(x)-1)}, and that probability is applied to the set of non-missing elements to find the \code{as.integer(p*(fnobs(x)-1))+1L}'th element (which corresponds to option \code{ties = "min"}). % Note that it is necessary to subtract and add 1 so that \code{n = 1} corresponds to \code{p = 0} and \code{n = NROW(x)} to \code{p = 1}. %So if \code{n > 1} is used in the presence of missing values, and the default \code{ties = "mean"} is enabled, the resulting element could be the average of two elements.
105105
When using grouped computations with \code{n > 1}, \code{n} is transformed to a probability \code{p = (n-1)/(NROW(x)/ng-1)} (where \code{ng} contains the number of unique groups in \code{g}).
106106
107-
If weights are used and \code{ties = "q5"-"q9"}, weighted continuous quantile estimation is done as described in \code{\link{fquantile}}.
107+
If weights are used and \code{ties = "q4"-"q9"}, weighted continuous quantile estimation is done as described in \code{\link{fquantile}}.
108108
109109
For \code{ties \%in\% c("mean", "min", "max")}, a target partial sum of weights \code{p*sum(w)} is calculated, and the weighted n'th element is the element k such that all elements smaller than k have a sum of weights \code{<= p*sum(w)}, and all elements larger than k have a sum of weights \code{<= (1 - p)*sum(w)}. If the partial-sum of weights (\code{p*sum(w)}) is reached exactly for some element k, then (summing from the lower end) both k and k+1 would qualify as the weighted n'th element. If the weight of element k+1 is zero, k, k+1 and k+2 would qualify... . If \code{n > 1}, k is chosen (consistent with the unweighted behavior). %(ensuring that \code{fnth(x, n)}) and \code{fnth(x, n, w = rep(1, NROW(x)))}, always provide the same outcome)
110110
If \code{0 < n < 1}, the \code{ties} option regulates how to resolve such conflicts, yielding lower (\code{ties = "min"}: k), upper (\code{ties = "max"}: k+2) or average weighted (\code{ties = "mean"}: mean(k, k+1, k+2)) n'th elements.
111111

112-
Thus, in the presence of zero weights, the weighted median (default \code{ties = "mean"}) can be an arithmetic average of >2 qualifying elements. Users may prefer a quantile based weighted median by setting \code{ties = "q5"-"q9"}, which is a continuous function of \code{p} and ignores elements with zero weights.
112+
Thus, in the presence of zero weights, the weighted median (default \code{ties = "mean"}) can be an arithmetic average of >2 qualifying elements.
113113

114114
For data frames, column-attributes and overall attributes are preserved if \code{g} is used or \code{drop = FALSE}.
115115

man/fquantile.Rd

+3-3
@@ -32,7 +32,7 @@ frange(x, na.rm = .op[["na.rm"]], finite = FALSE)
3232
\item{o}{integer. A vector giving the ordering of the elements in \code{x}, such that \code{identical(x[o], sort(x))}. If available this considerably speeds up the estimation.}
3333
\item{na.rm}{logical. Remove missing values, default \code{TRUE}. }
3434
\item{finite}{logical. Omit all non-finite values.}
35-
\item{type}{integer. Quantile types 4-9. See \code{\link{quantile}}. Further details are provided in \href{https://doi.org/10.2307/2684934}{Hyndman and Fan (1996)} who recommended type 8. The default method is type 7.}
35+
\item{type}{integer. Quantile types 4-9. See \code{\link{quantile}}. Further details are provided in \href{https://www.tandfonline.com/doi/abs/10.1080/00031305.1996.10473566}{Hyndman and Fan (1996)} who recommended type 8. The default method is type 7.}
3636
\item{names}{logical. Generates names of the form \code{paste0(round(probs * 100, 1), "\%")} (in C). Set to \code{FALSE} for speedup. }
3737
\item{check.o}{logical. If \code{o} is supplied, \code{TRUE} runs through \code{o} once and checks that it is valid, i.e. that each element is in \code{[1, length(x)]}. Set to \code{FALSE} for significant speedup if \code{o} is known to be valid. }
3838
}
@@ -41,7 +41,7 @@ frange(x, na.rm = .op[["na.rm"]], finite = FALSE)
4141
4242
\code{frange} is considerably more efficient than \code{\link{range}}, requiring only one pass through the data instead of two. For probabilities 0 and 1, \code{fquantile} internally calls \code{frange}.
4343
44-
Following \href{https://doi.org/10.2307/2684934}{Hyndman and Fan (1996)}, the quantile type-\eqn{i} quantile function of the sample \eqn{X} can be written as a weighted average of two order statistics:
44+
Following \href{https://www.tandfonline.com/doi/abs/10.1080/00031305.1996.10473566}{Hyndman and Fan (1996)}, the quantile type-\eqn{i} quantile function of the sample \eqn{X} can be written as a weighted average of two order statistics:
4545
4646
\deqn{\hat{Q}_{X,i}(p) = (1 - \gamma) X_{(j)} + \gamma X_{(j + 1)}}
4747
@@ -57,7 +57,7 @@ We can then first find the largest value \eqn{l} such that the cumulative normal
5757
For a more detailed exposition \href{https://htmlpreview.github.io/?https://github.com/mjskay/uncertainty-examples/blob/master/weighted-quantiles.html}{see these excellent notes} by Matthew Kay. See also the R implementation of weighted quantiles type 7 in the Examples below.
5858
}
5959
\note{
60-
The new weighted quantile algorithm from v2.1.0 does not skip zero weights anymore as this is technically very difficult (it is not clear if \eqn{j} hits a zero weight element whether one should move forward or backward to find an alternative). Thus, all non-missing elements are considered and weights should be strictily positive.
60+
The new weighted quantile algorithm from v2.1.0 does not skip zero weights anymore as this is technically very difficult (it is not clear if \eqn{j} hits a zero weight element whether one should move forward or backward to find an alternative). Thus, all non-missing elements are considered and weights should be strictly positive.
6161
}
6262
\value{
6363
A vector of quantiles. If \code{names = TRUE}, \code{fquantile} generates names as \code{paste0(round(probs * 100, 1), "\%")} (in C).
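
Note on the `fquantile.Rd` hunk above: the statement that a type-i quantile is a weighted average of two order statistics can be checked in base R for the unweighted case. The sketch below is not the *collapse* C implementation and does not handle weights or missing values; the function name `q7_sketch` is hypothetical, and the formulas for the index `j` and the weight `gamma` are the standard type 7 definition used by base R's `quantile()`, stated here as an assumption because the corresponding lines fall outside the visible hunks.

```r
# Unweighted type 7 quantile as (1 - gamma) * X_(j) + gamma * X_(j + 1)
q7_sketch <- function(x, p) {
  x <- sort(x)              # order statistics X_(1) <= ... <= X_(n)
  n <- length(x)
  h <- (n - 1) * p + 1      # continuous position for type 7
  j <- floor(h)             # index of the lower order statistic
  gamma <- h - j            # interpolation weight in [0, 1)
  # pmin() guards the p = 1 case, where j + 1 would exceed n (gamma is 0 there)
  (1 - gamma) * x[j] + gamma * x[pmin(j + 1, n)]
}

x <- c(2.5, 1, 7, 4, 3)
p <- c(0, 0.1, 0.5, 0.9, 1)

# Compare against base R's reference implementation
all.equal(q7_sketch(x, p), unname(quantile(x, p, type = 7)))
```

Because every non-missing element enters the sort, a weighted variant built on the same idea has no natural way to drop an element afterwards, which is consistent with the note in the hunk that weights should be strictly positive.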

man/join.Rd

+1-1
@@ -48,7 +48,7 @@ join(x, y,
4848

4949
\item{verbose}{integer. Prints information about the join. One of 0 (off), 1 (default, see Details) or 2 (additionally prints the classes of the \code{on} columns). \emph{Note:} \code{verbose > 0} or \code{validate != "m:m"} invoke the \code{count} argument to \code{\link{fmatch}}, so \code{verbose = 0} is slightly more efficient. }
5050

51-
\item{require}{(optional) named list of the form \code{list(x = 1, y = 0.5, fail = "warning")} giving proportions of records that need to be matched and the action if any requirement fails (\code{"message"}, \code{"warning"}, or \code{"error"}). Any elements of the list can be omitted, the default action is \code{"error"}.}
51+
\item{require}{(optional) named list of the form \code{list(x = 1, y = 0.5, fail = "warning")} (or \code{fail.with} if you want to be more expressive) giving proportions of records that need to be matched and the action if any requirement fails (\code{"message"}, \code{"warning"}, or \code{"error"}). Any elements of the list can be omitted, the default action is \code{"error"}.}
5252

5353
\item{column}{(optional) name for an extra column to generate in the output indicating which dataset a record came from. \code{TRUE} calls this column \code{".join"} (inspired by STATA's '_merge' column). By default this column is generated as the last column, but, if \code{keep.col.order = FALSE}, it is placed after the 'on' columns. The column is a factor variable with levels corresponding to the dataset names (inferred from the input) or \code{"matched"} for matched records. Alternatively, it is possible to specify a list of 2, where the first element is the column name, and the second a length 3 (!) vector of levels e.g. \code{column = list("joined", c("x", "y", "x_y"))}, where \code{"x_y"} replaces \code{"matched"}. The column has an additional attribute \code{"on.cols"} giving the join columns corresponding to the factor levels. See Examples. }
5454
