Correctness, simplicity, and test coverage improvements
- Fix clustersGivenK to have N elements (was N+1 with a trailing empty array)
- Avoid intermediate array allocations in clustersGivenK building; mutate
  membership arrays in place
- Remove {} as ClusterNode typecasts in fromNewick via newNode() helper,
  eliminating the fillDefaults post-pass entirely
- Simplify treeToJSON to return ClusterNode directly
- Add explicit case ';' in fromNewick switch
- Add integration tests: K=3 partition, order permutation, progress callbacks,
  equal-distance determinism, clusterObject label propagation
- Fix README Algorithm section (was describing old O(n³) pure-JS version;
  current C code uses Lance-Williams recurrence, same as R hclust)
- Add UPGMA and Lance-Williams citations to distance.c and README
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
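
The `newNode()` change listed in the commit message can be pictured with a small sketch. The `ClusterNode` fields below (`name`, `height`, `children`) are assumptions for illustration; the library's real type and helper signature are not shown in this diff and may differ. The idea is that a factory returning a fully initialized object removes the need for a `{} as ClusterNode` cast and for a later pass that fills in defaults.

```typescript
// Hypothetical node shape: the real ClusterNode in this library may differ.
interface ClusterNode {
  name: string;
  height: number;
  children: ClusterNode[];
}

// Factory that always returns a complete node. Callers override only the
// fields they know, so no field is ever left undefined behind a type cast.
function newNode(overrides: Partial<ClusterNode> = {}): ClusterNode {
  return { name: "", height: 0, children: [], ...overrides };
}

// Build a two-leaf tree without any casts or default-filling pass.
const leafA = newNode({ name: "sampleA" });
const leafB = newNode({ name: "sampleB" });
const root = newNode({ height: 1.5, children: [leafA, leafB] });
```

Because every node passes through the factory, the compiler checks that all fields exist, which a `{} as ClusterNode` cast would silently bypass.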
README.md: 25 additions & 15 deletions
@@ -6,22 +6,26 @@ WebAssembly, with JavaScript/TypeScript wrappers for easy integration.
 ## Algorithm
 
 **Agglomerative hierarchical clustering with average linkage (UPGMA).** Each
-sample starts as its own cluster; at each step the two clusters with the
+sample starts as its own cluster. At each step the two clusters with the
 smallest mean pairwise Euclidean distance are merged, until one cluster remains.
-
-Average linkage measures inter-cluster distance as the mean of all pairwise
-distances. This is a middle ground between single linkage (minimum distance,
-prone to chaining) and complete linkage (maximum distance, forces compact
-clusters). For the genomics track use case — ordering samples by similarity for
-a heatmap — average linkage is a good default. Note that R's `hclust` defaults
-to `method="complete"`; use `method="average"` to get equivalent behavior.
-
-This is equivalent to R's `hclust(method="average")`, with two differences: R
-uses the Lance-Williams recurrence for an O(n²) merge step, whereas this
-recomputes average distances from the original matrix each iteration (O(n³)).
-For the tens-to-hundreds of samples typical in genomics tracks, this is
-negligible and WASM more than compensates. R also accepts a precomputed distance
-matrix; this library computes Euclidean distances from raw vectors internally.
+The result is a binary tree (dendrogram) whose internal node heights record the
+distance at which each merge occurred.
+
+Average linkage (UPGMA) measures inter-cluster distance as the mean of all
+pairwise distances between members of the two clusters. It is a middle ground
+between single linkage (minimum distance, prone to chaining) and complete
+linkage (maximum distance, forces compact clusters). For the genomics use case —
+ordering samples by similarity for a heatmap — average linkage is a reliable
+default. Note that R's `hclust` defaults to `method="complete"`; use
+`method="average"` to match this library's output.
+
+This produces equivalent results to R's `hclust(method="average")`. The
+implementation uses the Lance-Williams recurrence (Lance & Williams 1967) to
+update inter-cluster distances in O(n) per merge step, giving O(n²) total for
+clustering after the O(n²) initial distance matrix computation. The one
+difference from R is that this library computes Euclidean distances from raw
+vectors internally and uses Float32 precision; R accepts a precomputed distance
+matrix and uses double precision.
 
 ## Features
 
@@ -149,6 +153,12 @@ clusterData({
 | SharedArrayBuffer + Atomics | yes | yes | yes |
 | Blob URL + sync XHR | no | yes | no |
 
+## References
+
+- **UPGMA**: Sokal, R.R. & Michener, C.D. (1958). "A statistical method for evaluating systematic relationships." *University of Kansas Science Bulletin*, 38, 1409–1438.
+- **Lance-Williams recurrence**: Lance, G.N. & Williams, W.T. (1967). "A general theory of classificatory sorting strategies. 1. Hierarchical systems." *Computer Journal*, 9(4), 373–380.
+- **Newick format**: Olsen, G.J. (1990). "Interpretation of the 'Newick's 8:45' tree format standard." http://evolution.genetics.washington.edu/phylip/newicktree.html
+
 ## Note
 
 Generated with the help of Claude Code AI, you might be able to tell from the
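
The updated Algorithm section says average-linkage distances are maintained with the Lance-Williams recurrence instead of being recomputed from the original matrix. As a minimal sketch of that one update step (not the library's C code; the function names and the 1-D sample points here are made up for illustration), the recurrence for average linkage is d(k, i∪j) = (nᵢ·d(k,i) + nⱼ·d(k,j)) / (nᵢ + nⱼ), and it can be checked against the brute-force definition:

```typescript
// Lance-Williams update for average linkage (UPGMA): distance from cluster k
// to the merged cluster (i ∪ j), using only the two previous distances and
// the sizes of i and j. Illustrative helper, not the library's API.
function averageLinkageUpdate(dki: number, dkj: number, ni: number, nj: number): number {
  return (ni * dki + nj * dkj) / (ni + nj);
}

// Brute-force definition: mean of all pairwise distances between two
// clusters. Points are 1-D here, so Euclidean distance is |x - y|.
function meanPairwise(a: number[], b: number[]): number {
  let sum = 0;
  for (let p = 0; p < a.length; p++) {
    for (let q = 0; q < b.length; q++) {
      sum += Math.abs(a[p] - b[q]);
    }
  }
  return sum / (a.length * b.length);
}

// Made-up example clusters.
const clusterI = [0, 1];
const clusterJ = [4];
const clusterK = [10, 12];

// Distance from K to the merged cluster, computed two ways.
const viaRecurrence = averageLinkageUpdate(
  meanPairwise(clusterK, clusterI),
  meanPairwise(clusterK, clusterJ),
  clusterI.length,
  clusterJ.length
);
const viaDefinition = meanPairwise(clusterK, clusterI.concat(clusterJ));

console.log(viaRecurrence, viaDefinition);
```

Both values agree (28/3 for these points), which is why the update costs O(1) per remaining cluster and O(n) per merge. This sketch covers only the distance update; how the closest pair is located at each step is a separate part of the algorithm and is not shown here.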