SebKrantz
diff --git a/NEWS.md b/NEWS.md
Lines changed: 7 additions & 4 deletions
diff --git a/README.md b/README.md
Lines changed: 3 additions & 3 deletions
diff --git a/man/fnth_fmedian.Rd b/man/fnth_fmedian.Rd
Lines changed: 7 additions & 7 deletions
diff --git a/man/fquantile.Rd b/man/fquantile.Rd
Lines changed: 3 additions & 3 deletions
diff --git a/man/join.Rd b/man/join.Rd
Lines changed: 1 addition & 1 deletion
@@ -8,11 +8,12 @@
 
 * `num_vars()` (and thus also `cat_vars()` and `collap()`) were changed to a simpler C-definition of numeric data types which is more in-line with `is.numeric()`: `is_numeric_C <- function(x) typeof(x) %in% c("integer", "double") && !inherits(x, c("factor", "Date", "POSIXct", "yearmon", "yearqtr"))`. The previous definition was: `is_numeric_C_old <- function(x) typeof(x) %in% c("integer", "double") && (!is.object(x) || inherits(x, c("ts", "units", "integer64")))`. Thus, the definition changed from including only certain classes to excluding the most important classes. Thanks @maouw for flagging this (#727).
 
-* New improved quantile algorithm in `fquantile()` and `fnth()` (see below) does not support zero weights anymore, i.e. the code runs through, but elements with zero weights are no longer ignored by the algorithm. Thus is because the new algorithm makes it difficult to skip zero weight elements 'on the fly'. 
-
 ### Bug Fixes
 
-* Fixed some issues using *collapse* and the *tidyverse* together, particularly regarding tidyverse methods for 'grouped_df'.
+* Fixed some issues using *collapse* and the *tidyverse* together, particularly regarding tidyverse methods for 'grouped_df' - thanks @NicChr (#645).
+
+* More consistent handling of zero-length inputs - they are now also returned in `fmean()` and `fmedian()`/`fnth()` instead of returning `NA` (#628).
+
 
 ### Additions
 
@@ -37,7 +38,9 @@ join(df1, df2, require = list(x = 0.8, fail = "warning"))
 
 ### Improvements
 
-* The weighted quantile algorithm in `fquantile()` was changed and now uses a more theoretically sound method following [excellent notes](https://htmlpreview.github.io/?https://github.com/mjskay/uncertainty-examples/blob/master/weighted-quantiles.html) by [Matthew Kay](https://github.com/mjskay). It now also supports quantile type 4, but it does not support zero weights anymore (see above). *Note* that the existing *collapse* algorithm [already had very goood](https://github.com/mjskay/uncertainty-examples/issues/2) properties after a bug fix in v2.0.17, but the new algorithm is more theoretically sound and also faster.
+* The weighted quantile algorithm in `fquantile()`/`fnth()` was improved to a more theoretically sound method following [excellent notes](https://htmlpreview.github.io/?https://github.com/mjskay/uncertainty-examples/blob/master/weighted-quantiles.html) by [Matthew Kay](https://github.com/mjskay). It now also supports quantile type 4, but it does not skip zero weights anymore, as the new algorithm makes it difficult to skip them 'on the fly'. *Note* that the existing *collapse* algorithm [already had very good](https://github.com/mjskay/uncertainty-examples/issues/2) properties after a bug fix in v2.0.17, but the new algorithm is more exact and also faster.
+
+* The *collapse* [**arXiv article**](https://arxiv.org/abs/2403.05038) has been updated and significantly enhanced. It is an excellent resource to get an overview of the package.  
 
 
 
 
@@ -62,9 +62,9 @@ install.packages("collapse", repos = "https://fastverse.r-universe.dev")
 remotes::install_github("SebKrantz/collapse")
 
 # Install previous versions from the CRAN Archive (requires compilation)
-install.packages("https://cran.r-project.org/src/contrib/Archive/collapse/collapse_1.9.6.tar.gz", 
+install.packages("https://cran.r-project.org/src/contrib/Archive/collapse/collapse_2.0.19.tar.gz", 
                  repos = NULL, type = "source") 
-# Older stable versions: 1.8.9, 1.7.6, 1.6.5, 1.5.3, 1.4.2, 1.3.2, 1.2.1
+# Older stable versions: 1.9.6, 1.8.9, 1.7.6, 1.6.5, 1.5.3, 1.4.2, 1.3.2, 1.2.1
 ```
 
 ## Documentation
@@ -80,7 +80,7 @@ In addition there are several [vignettes](<https://sebkrantz.github.io/collapse/
 
 ### Article on arXiv
 
-An [**article**](https://arxiv.org/abs/2403.05038) on *collapse* has been submitted to the [Journal of Statistical Software](https://www.jstatsoft.org/) in March 2024. 
+An [**article**](https://arxiv.org/abs/2403.05038) on *collapse* was submitted to the [Journal of Statistical Software](https://www.jstatsoft.org/) in March 2024 and updated/revised in February 2025. 
 
 ### Presentation at [useR 2022](https://user2022.r-project.org)
 
 
@@ -13,9 +13,9 @@
 Fast (Grouped, Weighted) N'th Element/Quantile for Matrix-Like Objects
 }
 \description{
-\code{fnth} (column-wise) returns the n'th smallest element from a set of unsorted elements \code{x} corresponding to an integer index (\code{n}), or to a probability between 0 and 1. If \code{n} is passed as a probability, ties can be resolved using the lower, upper, or average of the possible elements, or, since v1.9.0, continuous quantile estimation. The new default is quantile type 7 (as in \code{\link{quantile}}). For \code{n > 1}, the lower element is always returned (as in \code{sort(x, partial = n)[n]}). See Details.
+\code{fnth} (column-wise) returns the n'th smallest element from a set of unsorted elements \code{x} corresponding to an integer index (\code{n}), or to a probability between 0 and 1. If \code{n} is passed as a probability, ties can be resolved using the lower, upper, or average of the possible elements, or (default) continuous quantile estimation. For \code{n > 1}, the lower element is always returned (as in \code{sort(x, partial = n)[n]}). See Details.
 
-\code{fmedian} is a simple wrapper around \code{fnth}, which fixes \code{n = 0.5} and (default) \code{ties = "mean"} i.e. it averages eligible elements. See Details. %Users may prefer a quantile based definition of the weighted median.
+\code{fmedian} is a simple wrapper around \code{fnth}, which fixes \code{n = 0.5} and (default) \code{ties = "mean"}, i.e., it averages eligible elements. See Details. %Users may prefer a quantile based definition of the weighted median.
 }
 \usage{
 fnth(x, n = 0.5, \dots)
@@ -63,7 +63,7 @@ fmedian(x, \dots)
                  1 \tab\tab "mean"   \tab\tab take the arithmetic mean of all qualifying elements. \cr
                  2 \tab\tab "min" \tab\tab take the smallest of the elements. \cr
                  3 \tab\tab "max"   \tab\tab take the largest of the elements. \cr
-                 5-9 \tab\tab "qn" \tab\tab continuous quantile types 5-9, see \code{\link{fquantile}}. \cr
+                 4-9 \tab\tab "qn" \tab\tab continuous quantile types 4-9, see \code{\link{fquantile}}. \cr
                 }
 }
 
@@ -85,10 +85,10 @@ fmedian(x, \dots)
 
 }
 \details{
-For v1.9.0 \code{fnth} was completely rewritten in C and offers significantly enhanced speed and functionality. It uses a combination of quickselect, quicksort, and radixsort algorithms, combined with several (weighted) quantile estimation methods and, where possible, OpenMP multithreading. This synthesis can be summarised as follows:
+\code{fnth} uses a combination of quickselect, quicksort, and radixsort algorithms, combined with several (weighted) quantile estimation methods and, where possible, OpenMP multithreading:
 
 \itemize{
-\item without weights, quickselect is used to determine a (lower) order statistic. If \code{ties \%!in\% c("min", "max")} a second order statistic is found by taking the max of the upper part of the partitioned array, and the two statistics are averaged using a simple mean (\code{ties = "mean"}), or weighted average according to a \code{\link{quantile}} method (\code{ties = "q5"-"q9"}). For \code{n = 0.5}, all supported quantile methods give the sample median. With matrices, multithreading is always across columns, for vectors and data frames it is across groups unless \code{is.null(g)} for data frames.
+\item without weights, quickselect is used to determine a (lower) order statistic. If \code{ties \%!in\% c("min", "max")} a second order statistic is found by taking the max of the upper part of the partitioned array, and the two statistics are averaged using a simple mean (\code{ties = "mean"}), or weighted average according to a \code{\link{quantile}} method (\code{ties = "q4"-"q9"}). For \code{n = 0.5}, all supported quantile methods give the sample median. With matrices, multithreading is always across columns, for vectors and data frames it is across groups unless \code{is.null(g)} for data frames.
 
 \item with weights and no groups (\code{is.null(g)}), \code{\link{radixorder}} is called internally (on each column of \code{x}). The ordering is used to sum the weights in order of \code{x} and determine weighted order statistics or quantiles. See details below. Multithreading is disabled as \code{\link{radixorder}} cannot be called concurrently on the same memory stack.
 
@@ -104,12 +104,12 @@ For v1.9.0 \code{fnth} was completely rewritten in C and offers significantly en
 If \code{n > 1}, the result is equivalent to (column-wise) \code{sort(x, partial = n)[n]}. Internally, \code{n} is converted to a probability using \code{p = (n-1)/(NROW(x)-1)}, and that probability is applied to the set of non-missing elements to find the \code{as.integer(p*(fnobs(x)-1))+1L}'th element (which corresponds to option \code{ties = "min"}). % Note that it is necessary to subtract and add 1 so that \code{n = 1} corresponds to \code{p = 0} and \code{n = NROW(x)} to \code{p = 1}. %So if \code{n > 1} is used in the presence of missing values, and the default \code{ties = "mean"} is enabled, the resulting element could be the average of two elements.
 When using grouped computations with \code{n > 1}, \code{n} is transformed to a probability \code{p = (n-1)/(NROW(x)/ng-1)} (where \code{ng} contains the number of unique groups in \code{g}).
 
-If weights are used and \code{ties = "q5"-"q9"}, weighted continuous quantile estimation is done as described in \code{\link{fquantile}}.
+If weights are used and \code{ties = "q4"-"q9"}, weighted continuous quantile estimation is done as described in \code{\link{fquantile}}.
 
 For \code{ties \%in\% c("mean", "min", "max")}, a target partial sum of weights \code{p*sum(w)} is calculated, and the weighted n'th element is the element k such that all elements smaller than k have a sum of weights \code{<= p*sum(w)}, and all elements larger than k have a sum of weights \code{<= (1 - p)*sum(w)}. If the partial-sum of weights (\code{p*sum(w)}) is reached exactly for some element k, then (summing from the lower end) both k and k+1 would qualify as the weighted n'th element. If the weight of element k+1 is zero, k, k+1 and k+2 would qualify... . If \code{n > 1}, k is chosen (consistent with the unweighted behavior). %(ensuring that \code{fnth(x, n)}) and \code{fnth(x, n, w = rep(1, NROW(x)))}, always provide the same outcome)
 If \code{0 < n < 1}, the \code{ties} option regulates how to resolve such conflicts, yielding lower (\code{ties = "min"}: k), upper (\code{ties = "max"}: k+2) or average weighted (\code{ties = "mean"}: mean(k, k+1, k+2)) n'th elements.
 
-Thus, in the presence of zero weights, the weighted median (default \code{ties = "mean"}) can be an arithmetic average of >2 qualifying elements. Users may prefer a quantile based weighted median by setting \code{ties = "q5"-"q9"}, which is a continuous function of \code{p} and ignores elements with zero weights.
+Thus, in the presence of zero weights, the weighted median (default \code{ties = "mean"}) can be an arithmetic average of >2 qualifying elements.
 
 For data frames, column-attributes and overall attributes are preserved if \code{g} is used or \code{drop = FALSE}.
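
The qualifying-element rule for `ties = "mean"`, `"min"` and `"max"` described in this hunk can be illustrated with a minimal base-R sketch. `weighted_nth()` is a hypothetical helper written for this illustration only; it is not part of *collapse* and does not reproduce its C implementation. It assumes distinct, non-missing values in `x` and weights for which the comparisons are exact (e.g. integers).

```r
# Hypothetical helper (not a collapse function): weighted n'th element for a
# probability p, following the rule stated in the documentation text above.
weighted_nth <- function(x, w, p, ties = c("mean", "min", "max")) {
  ties <- match.arg(ties)
  o <- order(x)
  x <- x[o]
  w <- w[o]
  sumw <- sum(w)
  cw <- cumsum(w)
  below <- cw - w      # sum of weights of all elements sorted before position i
  above <- sumw - cw   # sum of weights of all elements sorted after position i
  # An element qualifies if neither side exceeds its share of the total weight
  qualifies <- below <= p * sumw & above <= (1 - p) * sumw
  cand <- x[qualifies]
  switch(ties, mean = mean(cand), min = min(cand), max = max(cand))
}

x <- c(10, 20, 30, 40, 50)
w <- c(1, 1, 0, 1, 1)            # the middle element has zero weight
weighted_nth(x, w, 0.5, "min")   # 20
weighted_nth(x, w, 0.5, "max")   # 40
weighted_nth(x, w, 0.5, "mean")  # 30, the average of three qualifying elements
```

With the zero-weight element in the middle, three elements qualify, which is the case where the weighted median under `ties = "mean"` averages more than two values.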
 
 
@@ -32,7 +32,7 @@ frange(x, na.rm = .op[["na.rm"]], finite = FALSE)
   \item{o}{integer. An vector giving the ordering of the elements in \code{x}, such that \code{identical(x[o], sort(x))}. If available this considerably speeds up the estimation.}
   \item{na.rm}{logical. Remove missing values, default \code{TRUE}. }
   \item{finite}{logical. Omit all non-finite values.}
-  \item{type}{integer. Quantile types 4-9. See \code{\link{quantile}}. Further details are provided in \href{https://doi.org/10.2307/2684934}{Hyndman and Fan (1996)} who recommended type 8. The default method is type 7.}
+  \item{type}{integer. Quantile types 4-9. See \code{\link{quantile}}. Further details are provided in \href{https://www.tandfonline.com/doi/abs/10.1080/00031305.1996.10473566}{Hyndman and Fan (1996)} who recommended type 8. The default method is type 7.}
   \item{names}{logical. Generates names of the form \code{paste0(round(probs * 100, 1), "\%")} (in C). Set to \code{FALSE} for speedup. }
   \item{check.o}{logical. If \code{o} is supplied, \code{TRUE} runs through \code{o} once and checks that it is valid, i.e. that each element is in \code{[1, length(x)]}. Set to \code{FALSE} for significant speedup if \code{o} is known to be valid. }
 }
@@ -41,7 +41,7 @@ frange(x, na.rm = .op[["na.rm"]], finite = FALSE)
 
 \code{frange} is considerably more efficient than \code{\link{range}}, requiring only one pass through the data instead of two. For probabilities 0 and 1, \code{fquantile} internally calls \code{frange}.
 
-Following \href{https://doi.org/10.2307/2684934}{Hyndman and Fan (1996)}, the quantile type-\eqn{i} quantile function of the sample \eqn{X} can be written as a weighted average of two order statistics:
+Following \href{https://www.tandfonline.com/doi/abs/10.1080/00031305.1996.10473566}{Hyndman and Fan (1996)}, the quantile type-\eqn{i} quantile function of the sample \eqn{X} can be written as a weighted average of two order statistics:
 
 \deqn{\hat{Q}_{X,i}(p) = (1 - \gamma) X_{(j)} + \gamma X_{(j + 1)}}
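
As an unweighted illustration of the formula above, the following base-R sketch computes the type 7 quantile as a weighted average of two order statistics. The definitions of `j` and `gamma` for type 7 are taken from the base R `quantile` documentation (Hyndman and Fan's `m = 1 - p`), not from this diff, and `quantile_type7()` is a hypothetical name used only here.

```r
# Hypothetical helper: type 7 sample quantile written as
# (1 - gamma) * X_(j) + gamma * X_(j + 1), without weights.
quantile_type7 <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  h <- (n - 1) * p + 1   # fractional position on the 1..n scale
  j <- floor(h)          # index of the lower order statistic
  gamma <- h - j         # interpolation weight on the upper order statistic
  # pmin() guards the upper index when p = 1 (gamma is 0 there)
  (1 - gamma) * x[j] + gamma * x[pmin(j + 1, n)]
}

x <- c(2, 7, 1, 9, 4)
p <- c(0.1, 0.3, 0.5, 1)
quantile_type7(x, p)
# Should agree with base R up to floating-point tolerance:
all.equal(quantile_type7(x, p), quantile(x, p, type = 7, names = FALSE))
```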
 
@@ -57,7 +57,7 @@ We can then first find the largest value \eqn{l} such that the cumulative normal
 For a more detailed exposition \href{https://htmlpreview.github.io/?https://github.com/mjskay/uncertainty-examples/blob/master/weighted-quantiles.html}{see these excellent notes} by Matthew Kay. See also the R implementation of weighted quantiles type 7 in the Examples below.
 }
 \note{
-The new weighted quantile algorithm from v2.1.0 does not skip zero weights anymore as this is technically very difficult (it is not clear if \eqn{j} hits a zero weight element whether one should move forward or backward to find an alternative). Thus, all non-missing elements are considered and weights should be strictily positive.
+The new weighted quantile algorithm from v2.1.0 does not skip zero weights anymore as this is technically very difficult (it is not clear if \eqn{j} hits a zero weight element whether one should move forward or backward to find an alternative). Thus, all non-missing elements are considered and weights should be strictly positive.
 }
 \value{
 A vector of quantiles. If \code{names = TRUE}, \code{fquantile} generates names as \code{paste0(round(probs * 100, 1), "\%")} (in C).
 
@@ -48,7 +48,7 @@ join(x, y,
 
   \item{verbose}{integer. Prints information about the join. One of 0 (off), 1 (default, see Details) or 2 (additionally prints the classes of the \code{on} columns). \emph{Note:} \code{verbose > 0} or \code{validate != "m:m"} invoke the \code{count} argument to \code{\link{fmatch}}, so \code{verbose = 0} is slightly more efficient. }
 
-  \item{require}{(optional) named list of the form \code{list(x = 1, y = 0.5, fail = "warning")} giving proportions of records that need to be matched and the action if any requirement fails (\code{"message"}, \code{"warning"}, or \code{"error"}). Any elements of the list can be omitted, the default action is \code{"error"}.}
+  \item{require}{(optional) named list of the form \code{list(x = 1, y = 0.5, fail = "warning")} (or \code{fail.with} if you want to be more expressive) giving proportions of records that need to be matched and the action if any requirement fails (\code{"message"}, \code{"warning"}, or \code{"error"}). Any elements of the list can be omitted, the default action is \code{"error"}.}
 
   \item{column}{(optional) name for an extra column to generate in the output indicating which dataset a record came from. \code{TRUE} calls this column \code{".join"} (inspired by STATA's '_merge' column). By default this column is generated as the last column, but, if \code{keep.col.order = FALSE}, it is placed after the 'on' columns. The column is a factor variable with levels corresponding to the dataset names (inferred from the input) or \code{"matched"} for matched records. Alternatively, it is possible to specify a list of 2, where the first element is the column name, and the second a length 3 (!) vector of levels e.g. \code{column = list("joined", c("x", "y", "x_y"))}, where \code{"x_y"} replaces \code{"matched"}. The column has an additional attribute \code{"on.cols"} giving the join columns corresponding to the factor levels. See Examples. }