Skip to content

Commit 12289ac

Browse files
authored
Introduce Welford's online algorithm for variance (#116)
1 parent 19dada1 commit 12289ac

12 files changed

+821
-7
lines changed

.github/workflows/main.yml

+4
Original file line numberDiff line numberDiff line change
@@ -113,6 +113,10 @@ jobs:
113113
run: |
114114
make yamllint
115115
116+
- name: Non-code tests
117+
run: |
118+
php vendor/bin/phpunit --group=documentation
119+
116120
- name: Static Analysis
117121
run: |
118122
make ci-analyze --keep-going

README.md

+62
Original file line numberDiff line numberDiff line change
@@ -125,6 +125,8 @@ All entry points always return an instance of the pipeline.
125125
| `max()` | Finds the highest value. | `max` |
126126
| `min()` | Finds the lowest value. | `min` |
127127
| `toArray()` | Returns an array with all values. Eagerly executed. | `dict`, `ToDictionary` |
128+
| `runningVariance()` | Computes online statistics: sample mean, sample variance, standard deviation. | [Welford's method](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm) |
129+
| `finalVariance()` | Computes final statistics for the sequence. | |
128130
| `__construct()` | Can be provided with an optional initial iterator. Used in the `take()` function from above. | |
129131

130132
Pipeline is an iterator and can be used as any other iterable.
@@ -394,6 +396,66 @@ foreach ($pipeline as $value) {
394396

395397
This allows to skip type checks for return values if one has no results to return: instead of `false` or `null` it is safe to return an unprimed pipeline.
396398

399+
## `$pipeline->runningVariance()`
400+
401+
Computes online statistics for the sequence: counts, sample mean, sample variance, standard deviation. You can access these numbers on the fly with methods such as `getCount()`, `getMean()`, `getVariance()`, `getStandardDeviation()`.
402+
403+
This method also accepts an optional cast callback that should return `float|null`: `null` values are discarded. Therefore you can have several running variances computing numbers for different parts of the data.
404+
405+
```php
406+
$pipeline->runningVariance($varianceForShippedOrders, static function (order $order): ?float {
407+
if (!$order->isShipped()) {
408+
// This order will be excluded from the computation.
409+
return null;
410+
}
411+
412+
return $order->getTotal();
413+
});
414+
415+
$pipeline->runningVariance($varianceForPaidOrders, static function (order $order): ?float {
416+
if ($order->isUnpaid()) {
417+
// This order will be excluded from the computation.
418+
return null;
419+
}
420+
421+
return $order->getProjectedTotal();
422+
});
423+
```
424+
425+
As you process orders, you will be able to access `$varianceForShippedOrders->getMean()` and `$varianceForPaidOrders->getMean()`.
426+
427+
This computation uses [Welford's online algorithm](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm), therefore it can handle very large numbers of data points.
428+
429+
## `$pipeline->finalVariance()`
430+
431+
A convenience method to computes the final statistics for the sequence. Accepts an optional cast method, else assumes the sequence countains valid numbers.
432+
433+
```php
434+
// Fibonacci numbers generator
435+
$fibonacci = map(function () {
436+
yield 0;
437+
438+
$prev = 0;
439+
$current = 1;
440+
441+
while (true) {
442+
yield $current;
443+
$next = $prev + $current;
444+
$prev = $current;
445+
$current = $next;
446+
}
447+
});
448+
449+
// Statistics for the second hundred Fibonacci numbers.
450+
$variance = $fibonacci->slice(101, 100)->finalVariance();
451+
452+
$variance->getStandardDeviation();
453+
// float(3.5101061922557E+40)
454+
455+
$variance->getCount();
456+
// int(100)
457+
```
458+
397459
# Contributions
398460

399461
Contributions to documentation and test cases are welcome. Bug reports are welcome too.

example.php

+34
Original file line numberDiff line numberDiff line change
@@ -66,6 +66,9 @@
6666
yield $i * $j;
6767
});
6868

69+
// Collect online statistics
70+
$pipeline->runningVariance($variance);
71+
6972
// one way to filter
7073
$pipeline->map(function ($i) {
7174
if ($i > 50) {
@@ -87,6 +90,13 @@
8790
var_dump($value);
8891
// int(104)
8992

93+
var_dump($variance->getCount());
94+
// int(12)
95+
var_dump($variance->getMean());
96+
// float(22)
97+
var_dump($variance->getStandardDeviation());
98+
// float(30.3794188704967)
99+
90100
$sum = take([1, 2, 3])->reduce();
91101
var_dump($sum);
92102
// int(6)
@@ -137,3 +147,27 @@
137147
// [1] =>
138148
// int(3)
139149
// }
150+
151+
// Fibonacci numbers generator
152+
$fibonacci = map(function () {
153+
yield 0;
154+
155+
$prev = 0;
156+
$current = 1;
157+
158+
while (true) {
159+
yield $current;
160+
$next = $prev + $current;
161+
$prev = $current;
162+
$current = $next;
163+
}
164+
});
165+
166+
// Statistics for the second hundred Fibonacci numbers
167+
$variance = $fibonacci->slice(101, 100)->finalVariance();
168+
169+
var_dump($variance->getStandardDeviation());
170+
// float(3.5101061922557E+40)
171+
172+
var_dump($variance->getCount());
173+
// int(100)

infection.json.dist

+2-1
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@
1111
],
1212
"@default": true,
1313
"IdenticalEqual": false,
14-
"NotIdenticalNotEqual": false
14+
"NotIdenticalNotEqual": false,
15+
"DecrementInteger": false
1516
}
1617
}

phpunit.xml.dist

+5-6
Original file line numberDiff line numberDiff line change
@@ -18,10 +18,9 @@
1818
</include>
1919
</source>
2020

21-
<coverage>
22-
<report>
23-
<clover outputFile="build/logs/clover.xml"/>
24-
<text outputFile="php://stdout"/>
25-
</report>
26-
</coverage>
21+
<groups>
22+
<exclude>
23+
<group>documentation</group>
24+
</exclude>
25+
</groups>
2726
</phpunit>

psalm.xml.dist

+1
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,7 @@
2020
<file name="example.php" />
2121
</errorLevel>
2222
</ForbiddenCode>
23+
2324
<MissingClosureParamType errorLevel="suppress" />
2425
<MissingClosureReturnType errorLevel="suppress" />
2526

src/Helper/RunningVariance.php

+141
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,141 @@
1+
<?php
2+
/**
3+
* Copyright 2017, 2018 Alexey Kopytko <[email protected]>
4+
*
5+
* Licensed under the Apache License, Version 2.0 (the "License");
6+
* you may not use this file except in compliance with the License.
7+
* You may obtain a copy of the License at
8+
*
9+
* http://www.apache.org/licenses/LICENSE-2.0
10+
*
11+
* Unless required by applicable law or agreed to in writing, software
12+
* distributed under the License is distributed on an "AS IS" BASIS,
13+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
* See the License for the specific language governing permissions and
15+
* limitations under the License.
16+
*/
17+
18+
declare(strict_types=1);
19+
20+
namespace Pipeline\Helper;
21+
22+
use function sqrt;
23+
24+
/**
25+
* Computes statistics (such as standard deviation) in real time.
26+
*
27+
* @see https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
28+
*
29+
* @final
30+
*/
31+
class RunningVariance
32+
{
33+
/**
34+
* The number of observed values.
35+
*
36+
* @var int<0, max>
37+
*/
38+
private int $count = 0;
39+
40+
/** First moment: the mean value. */
41+
private float $mean = 0.0;
42+
43+
/** Second moment: the aggregated squared distance from the mean. */
44+
private float $m2 = 0.0;
45+
46+
public function __construct(self ...$spiesToMerge)
47+
{
48+
foreach ($spiesToMerge as $spy) {
49+
$this->merge($spy);
50+
}
51+
}
52+
53+
public function observe(float $value): float
54+
{
55+
++$this->count;
56+
57+
$delta = $value - $this->mean;
58+
59+
$this->mean += $delta / $this->count;
60+
$this->m2 += $delta * ($value - $this->mean);
61+
62+
return $value;
63+
}
64+
65+
/**
66+
* The number of observed values.
67+
*/
68+
public function getCount(): int
69+
{
70+
return $this->count;
71+
}
72+
73+
/**
74+
* Get the mean value.
75+
*/
76+
public function getMean(): float
77+
{
78+
if (0 === $this->count) {
79+
// For no values the variance is undefined.
80+
return NAN;
81+
}
82+
83+
return $this->mean;
84+
}
85+
86+
/**
87+
* Get the variance.
88+
*/
89+
public function getVariance(): float
90+
{
91+
if (0 === $this->count) {
92+
// For no values the variance is undefined.
93+
return NAN;
94+
}
95+
96+
if (1 === $this->count) {
97+
// Avoiding division by zero: variance for one value is zero.
98+
return 0.0;
99+
}
100+
101+
// https://en.wikipedia.org/wiki/Bessel%27s_correction
102+
return $this->m2 / ($this->count - 1);
103+
}
104+
105+
/**
106+
* Compute the standard deviation.
107+
*/
108+
public function getStandardDeviation(): float
109+
{
110+
return sqrt($this->getVariance());
111+
}
112+
113+
/**
114+
* Merge another instance into this instance.
115+
*
116+
* @see https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
117+
*/
118+
private function merge(self $other): void
119+
{
120+
// Shortcut a no-op
121+
if (0 === $other->count) {
122+
return;
123+
}
124+
125+
// Avoid division by zero by copying values
126+
if (0 === $this->count) {
127+
$this->count = $other->count;
128+
$this->mean = $other->mean;
129+
$this->m2 = $other->m2;
130+
131+
return;
132+
}
133+
134+
$count = $this->count + $other->count;
135+
$delta = $other->mean - $this->mean;
136+
137+
$this->mean = ($this->count * $this->mean) / $count + ($other->count * $other->mean) / $count;
138+
$this->m2 = $this->m2 + $other->m2 + ($delta ** 2 * $this->count * $other->count / $count);
139+
$this->count = $count;
140+
}
141+
}

src/Standard.php

+69
Original file line numberDiff line numberDiff line change
@@ -987,4 +987,73 @@ private static function flipKeysAndValues(iterable $previous): iterable
987987
yield $value => $key;
988988
}
989989
}
990+
991+
private function feedRunningVariance(Helper\RunningVariance $variance, ?callable $castFunc): self
992+
{
993+
if (null === $castFunc) {
994+
$castFunc = 'floatval';
995+
}
996+
997+
return $this->cast(static function ($value) use ($variance, $castFunc) {
998+
$float = $castFunc($value);
999+
1000+
if (null !== $float) {
1001+
$variance->observe($float);
1002+
}
1003+
1004+
// Returning the original value here
1005+
return $value;
1006+
});
1007+
}
1008+
1009+
/**
1010+
* Feeds in an instance of RunningVariance.
1011+
*
1012+
* @param ?Helper\RunningVariance &$variance the instance of RunningVariance; initialized unless provided
1013+
* @param ?callable $castFunc the cast callback, returning ?float; null values are not counted
1014+
*
1015+
* @param-out Helper\RunningVariance $variance
1016+
*
1017+
* @return $this
1018+
*/
1019+
public function runningVariance(
1020+
?Helper\RunningVariance &$variance,
1021+
?callable $castFunc = null
1022+
): self {
1023+
$variance ??= new Helper\RunningVariance();
1024+
1025+
$this->feedRunningVariance($variance, $castFunc);
1026+
1027+
return $this;
1028+
}
1029+
1030+
/**
1031+
* Computes final statistics for the sequence.
1032+
*
1033+
* @param ?callable $castFunc the cast callback, returning ?float; null values are not counted
1034+
* @param ?Helper\RunningVariance $variance the optional instance of RunningVariance
1035+
*/
1036+
public function finalVariance(
1037+
?callable $castFunc = null,
1038+
?Helper\RunningVariance $variance = null
1039+
): Helper\RunningVariance {
1040+
$variance ??= new Helper\RunningVariance();
1041+
1042+
if ($this->empty()) {
1043+
// No-op: an empty array.
1044+
return $variance;
1045+
}
1046+
1047+
$this->feedRunningVariance($variance, $castFunc);
1048+
1049+
if (is_array($this->pipeline)) {
1050+
// We are done!
1051+
return $variance;
1052+
}
1053+
1054+
// Consume every available item
1055+
$_ = iterator_count($this->pipeline);
1056+
1057+
return $variance;
1058+
}
9901059
}

tests/DocumentationTest.php

+2
Original file line numberDiff line numberDiff line change
@@ -35,6 +35,8 @@
3535
/**
3636
* Test documentation has sections for all public methods.
3737
*
38+
* @group documentation
39+
*
3840
* @coversNothing
3941
*
4042
* @internal

0 commit comments

Comments
 (0)