Skip to content

Proposal for descriptive statistics #113

Open
@jvdp1

Description

@jvdp1

Overview

It would be nice to have a module in stdlib that provides functions for computing means,variances, medians, ... of vectors, and of rows (columns) of 2D-arrays (at least).
E.g.,

real :: res
real, allocatable :: res1(:)
real :: vector(5), mat(6,5)
...
res = mean(vector)
res1 = mean(mat) !returns the mean of each row of mat
res1 = mean(mat, dim = 2) !returns the mean of each column of mat

The same could be implemented for variance, median, ... So the API of all functions would be (almost) the same.

API

Let 's discuss the API of only mean for a vector first, and then for an array.

For a vector:

function mean_sp_sp(vector) result(res)
    real(sp), intent(in) :: vector(:)
    real(sp) :: res
end function

For a 2D array:

function mean_sp_sp(mat, dim) result(res)
    real(sp), intent(in) :: mat(:,:)
    integer, intent(in), optional :: dim !dim = 1 or dim = 2 
    real(sp), allocatable :: res(:)
end function

If dim = 1, it returns the mean of each row (so res(1:size(mat,1))).
If dim = 2, it returns the mean of each column (so res(1:size(mat,2))).

Here (generated manually with fypp) is an example for mean in stdlib.

The same API could be used for variance, median, cumulative sum, geometric mean, ...

Should we support arrays of rank > 2? E.g., what would return mean(mat(:,:,:,:), dim =3)?

Should we use functions or subroutine (and overload =)?:

The result of the procedure would be of the same kind as the input, and (implicit) conversion would be performed by the user. Functions could then be used.

Alternatively:
For real arrays, procedures would return a result of the same kind, or of a lower kind, of the argument (e.g., a mean of a dp array would return the result in sp or dp). All computations inside the procedure would be performed in the same kind as the input array, and the result would be converted just before the function returns the result.
For integer arrays, procedures would return a result of a real kind (e.g., a mean of a int64 array would return the result in sp, dp, or qp). All computations inside the procedure would be performed in the same kind as the result.

Implementation

Probably most of us have some implementations. @leonfoks has also an implementation for 1D array on Github.
I would think about a module called stdlib_experimental_stat.f90 and multiple submodules (one per stat, e.g., stdlib_experimental_stat_mean.f90, that contains all functions related with that stat).
The first PR would contain only one stat, e.g. mean to facilitate the discussion.

Currently in stdlib

mean (mean)
variance (var)
central moment (moment)

Possible additional functions

standard deviation (std)
median (median)
mode (mode)

Others

covariance (cov)
correlation (corr)

Other languages

Matlab
Numpy
Octave
R

Metadata

Metadata

Assignees

No one assigned

    Labels

    implementationImplementation in experimental and submission of a PRspecificationDiscussion and iteration over the APItopic: mathematicslinear algebra, sparse matrices, special functions, FFT, random numbers, statistics, ...

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions