Skip to content

Commit aa6feff

Browse files
cmdcolinclaude
andcommitted
Correctness, simplicity, and test coverage improvements
- Fix clustersGivenK to have N elements (was N+1 with a trailing empty array) - Avoid intermediate array allocations in clustersGivenK building; mutate membership arrays in place - Remove {} as ClusterNode typecasts in fromNewick via newNode() helper, eliminating the fillDefaults post-pass entirely - Simplify treeToJSON to return ClusterNode directly - Add explicit case ';' in fromNewick switch - Add integration tests: K=3 partition, order permutation, progress callbacks, equal-distance determinism, clusterObject label propagation - Fix README Algorithm section (was describing old O(n³) pure-JS version; current C code uses Lance-Williams recurrence, same as R hclust) - Add UPGMA and Lance-Williams citations to distance.c and README Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 2ae0785 commit aa6feff

8 files changed

Lines changed: 150 additions & 68 deletions

File tree

.prettierignore

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1 +1,2 @@
11
pnpm-lock.yaml
2+
src/distance.js

README.md

Lines changed: 31 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -6,22 +6,26 @@ WebAssembly, with JavaScript/TypeScript wrappers for easy integration.
66
## Algorithm
77

88
**Agglomerative hierarchical clustering with average linkage (UPGMA).** Each
9-
sample starts as its own cluster; at each step the two clusters with the
9+
sample starts as its own cluster. At each step the two clusters with the
1010
smallest mean pairwise Euclidean distance are merged, until one cluster remains.
11-
12-
Average linkage measures inter-cluster distance as the mean of all pairwise
13-
distances. This is a middle ground between single linkage (minimum distance,
14-
prone to chaining) and complete linkage (maximum distance, forces compact
15-
clusters). For the genomics track use case — ordering samples by similarity for
16-
a heatmap — average linkage is a good default. Note that R's `hclust` defaults
17-
to `method="complete"`; use `method="average"` to get equivalent behavior.
18-
19-
This is equivalent to R's `hclust(method="average")`, with two differences: R
20-
uses the Lance-Williams recurrence for an O(n²) merge step, whereas this
21-
recomputes average distances from the original matrix each iteration (O(n³)).
22-
For the tens-to-hundreds of samples typical in genomics tracks, this is
23-
negligible and WASM more than compensates. R also accepts a precomputed distance
24-
matrix; this library computes Euclidean distances from raw vectors internally.
11+
The result is a binary tree (dendrogram) whose internal node heights record the
12+
distance at which each merge occurred.
13+
14+
Average linkage (UPGMA) measures inter-cluster distance as the mean of all
15+
pairwise distances between members of the two clusters. It is a middle ground
16+
between single linkage (minimum distance, prone to chaining) and complete
17+
linkage (maximum distance, forces compact clusters). For the genomics use case —
18+
ordering samples by similarity for a heatmap — average linkage is a reliable
19+
default. Note that R's `hclust` defaults to `method="complete"`; use
20+
`method="average"` to match this library's output.
21+
22+
This produces equivalent results to R's `hclust(method="average")`. The
23+
implementation uses the Lance-Williams recurrence (Lance & Williams 1967) to
24+
update inter-cluster distances in O(n) per merge step, giving O(n²) total for
25+
clustering after the O(n²) initial distance matrix computation. The one
26+
difference from R is that this library computes Euclidean distances from raw
27+
vectors internally and uses Float32 precision; R accepts a precomputed distance
28+
matrix and uses double precision.
2529

2630
## Features
2731

@@ -149,6 +153,18 @@ clusterData({
149153
| SharedArrayBuffer + Atomics | yes | yes | yes |
150154
| Blob URL + sync XHR | no | yes | no |
151155

156+
## References
157+
158+
- **UPGMA**: Sokal, R.R. & Michener, C.D. (1958). "A statistical method for
159+
evaluating systematic relationships." _University of Kansas Science Bulletin_,
160+
38, 1409–1438.
161+
- **Lance-Williams recurrence**: Lance, G.N. & Williams, W.T. (1967). "A general
162+
theory of classificatory sorting strategies. 1. Hierarchical systems."
163+
_Computer Journal_, 9(4), 373–380.
164+
- **Newick format**: Olsen, G.J. (1990). "Interpretation of the 'Newick's 8:45'
165+
tree format standard."
166+
http://evolution.genetics.washington.edu/phylip/newicktree.html
167+
152168
## Note
153169

154170
Generated with the help of Claude Code AI, you might be able to tell from the

src/cluster.ts

Lines changed: 22 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -23,25 +23,38 @@ export async function clusterData({
2323

2424
// Build clustersGivenK from stable-slot merge sequence.
2525
// mergeA[i] and mergeB[i] are stable slot indices; slot mergeA[i] absorbs mergeB[i].
26+
// clustersGivenK[k] = cluster partitions when there are k+1 clusters (k=0..N-1).
2627
const numSamples = data.length
27-
const clustersGivenK: number[][][] = [[]]
28+
const clustersGivenK: number[][][] = []
2829

29-
const membership = Array.from(
30-
{ length: numSamples },
31-
(_, i) => [i] as number[],
32-
)
33-
const activeSlots = new Set(Array.from({ length: numSamples }, (_, i) => i))
30+
const membership: number[][] = Array.from({ length: numSamples }, (_, i) => [
31+
i,
32+
])
33+
const activeSlots = new Set<number>()
34+
for (let i = 0; i < numSamples; i++) {
35+
activeSlots.add(i)
36+
}
3437

3538
for (let i = 0; i < numSamples - 1; i++) {
3639
const [a, b] = result.merges[i]!
3740

38-
clustersGivenK.push([...activeSlots].map(id => [...membership[id]!]))
41+
const snapshot: number[][] = []
42+
for (const id of activeSlots) {
43+
snapshot.push([...membership[id]!])
44+
}
45+
clustersGivenK.push(snapshot)
3946

40-
membership[a] = [...membership[a]!, ...membership[b]!]
47+
for (const m of membership[b]!) {
48+
membership[a]!.push(m)
49+
}
4150
activeSlots.delete(b)
4251
}
4352

44-
clustersGivenK.push([...activeSlots].map(id => [...membership[id]!]))
53+
const finalSnapshot: number[][] = []
54+
for (const id of activeSlots) {
55+
finalSnapshot.push([...membership[id]!])
56+
}
57+
clustersGivenK.push(finalSnapshot)
4558

4659
return {
4760
tree: result.tree,

src/distance.js

-4.61 KB
Binary file not shown.

src/tree-utils.ts

Lines changed: 16 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,9 @@ export function printTree(
1919
return output
2020
}
2121

22+
// Newick format: Olsen (1990) http://evolution.genetics.washington.edu/phylip/newicktree.html
23+
// Note: this library encodes internal node height as the label (e.g. "(A,B)1.2345"),
24+
// not as a branch length (":"). fromNewick handles both forms on input.
2225
export function toNewick(node: ClusterNode): string {
2326
if (!node.children || node.children.length === 0) {
2427
return node.name
@@ -28,14 +31,17 @@ export function toNewick(node: ClusterNode): string {
2831
return `(${childStrings.join(',')})${node.height.toFixed(4)}`
2932
}
3033

34+
function newNode(): ClusterNode {
35+
return { name: '', height: 0 }
36+
}
37+
3138
export function fromNewick(s: string): ClusterNode {
3239
const ancestors: ClusterNode[] = []
33-
34-
let tree = {} as ClusterNode
40+
let tree = newNode()
3541
const tokens = s.split(/\s*(;|\(|\)|,|:)\s*/)
3642
for (let i = 0; i < tokens.length; i++) {
3743
const token = tokens[i]!
38-
const subtree = {} as ClusterNode
44+
const subtree = newNode()
3945
switch (token) {
4046
case '(':
4147
tree.children = [subtree]
@@ -49,6 +55,7 @@ export function fromNewick(s: string): ClusterNode {
4955
case ')':
5056
tree = ancestors.pop()!
5157
break
58+
case ';':
5259
case ':':
5360
break
5461
default: {
@@ -69,38 +76,16 @@ export function fromNewick(s: string): ClusterNode {
6976
}
7077
}
7178

72-
function fillDefaults(node: ClusterNode) {
73-
if (!node.name) {
74-
node.name = ''
75-
}
76-
// eslint-disable-next-line @typescript-eslint/no-unnecessary-condition
77-
if (node.height === undefined) {
78-
node.height = 0
79-
}
80-
if (node.children) {
81-
for (const child of node.children) {
82-
fillDefaults(child)
83-
}
84-
}
85-
}
86-
87-
fillDefaults(tree)
8879
return tree
8980
}
9081

91-
export function treeToJSON(node: ClusterNode) {
92-
const result: {
93-
name: string
94-
height: number
95-
children?: ReturnType<typeof treeToJSON>[]
96-
} = {
82+
export function treeToJSON(node: ClusterNode): ClusterNode {
83+
if (!node.children?.length) {
84+
return { name: node.name, height: node.height }
85+
}
86+
return {
9787
name: node.name,
9888
height: node.height,
89+
children: node.children.map(treeToJSON),
9990
}
100-
101-
if (node.children && node.children.length > 0) {
102-
result.children = node.children.map(child => treeToJSON(child))
103-
}
104-
105-
return result
10691
}

src/wasm/distance.c

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,14 @@
22
* High-performance hierarchical clustering (UPGMA / average-linkage)
33
* Compiled to WebAssembly using Emscripten
44
*
5+
* Algorithm: UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
6+
* Sokal & Michener (1958). "A statistical method for evaluating systematic
7+
* relationships." University of Kansas Science Bulletin, 38, 1409-1438.
8+
*
9+
* Distance update: Lance-Williams recurrence for average linkage
10+
* Lance & Williams (1967). "A general theory of classificatory sorting
11+
* strategies." Computer Journal, 9(4), 373-380.
12+
*
513
* Key design:
614
* - Stable slot IDs: slot mergeA[i] absorbs mergeB[i]; mergeA[i] < mergeB[i] always.
715
* Slot 0 is always the final root.

test/cluster.test.ts

Lines changed: 7 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -145,10 +145,9 @@ describe('clusterData', () => {
145145

146146
const result = await clusterData({ data })
147147

148-
expect(result.clustersGivenK).toHaveLength(3)
148+
expect(result.clustersGivenK).toHaveLength(2)
149149
expect(result.clustersGivenK[0]).toEqual([[0, 1]])
150150
expect(result.clustersGivenK[1]).toEqual([[0], [1]])
151-
expect(result.clustersGivenK[2]).toEqual([])
152151
})
153152

154153
it('should build clustersGivenK correctly for 3 samples', async () => {
@@ -186,14 +185,13 @@ describe('clusterData', () => {
186185

187186
const result = await clusterData({ data })
188187

189-
expect(result.clustersGivenK).toHaveLength(4)
188+
expect(result.clustersGivenK).toHaveLength(3)
190189
expect(result.clustersGivenK[0]?.length).toBe(1)
191190
expect(result.clustersGivenK[0]?.[0]).toContain(0)
192191
expect(result.clustersGivenK[0]?.[0]).toContain(1)
193192
expect(result.clustersGivenK[0]?.[0]).toContain(2)
194193
expect(result.clustersGivenK[1]).toHaveLength(2)
195194
expect(result.clustersGivenK[2]).toHaveLength(3)
196-
expect(result.clustersGivenK[3]).toEqual([])
197195
})
198196

199197
it('should handle single sample case', async () => {
@@ -212,9 +210,8 @@ describe('clusterData', () => {
212210

213211
expect(result.tree).toEqual({ name: 'Sample 0', height: 0 })
214212
expect(result.order).toEqual([0])
215-
expect(result.clustersGivenK).toHaveLength(2)
213+
expect(result.clustersGivenK).toHaveLength(1)
216214
expect(result.clustersGivenK[0]).toEqual([[0]])
217-
expect(result.clustersGivenK[1]).toEqual([])
218215
})
219216

220217
it('should handle complex merge sequences', async () => {
@@ -243,9 +240,11 @@ describe('clusterData', () => {
243240

244241
const result = await clusterData({ data })
245242

246-
expect(result.clustersGivenK.length).toBe(5)
243+
expect(result.clustersGivenK.length).toBe(4)
247244
expect(result.clustersGivenK[0]?.length).toBe(1)
248-
expect(result.clustersGivenK[result.clustersGivenK.length - 1]).toEqual([])
245+
expect(
246+
result.clustersGivenK[result.clustersGivenK.length - 1],
247+
).toHaveLength(4)
249248
})
250249

251250
it('should propagate error thrown by checkCancellation', async () => {

test/integration.test.ts

Lines changed: 65 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
import { describe, expect, it } from 'vitest'
22

3-
import { clusterData } from '../src/cluster.js'
3+
import { clusterData, clusterObject } from '../src/cluster.js'
44

55
function sortedClusters(clusters: number[][]) {
66
return clusters
@@ -18,10 +18,9 @@ describe('clusterData integration', () => {
1818

1919
expect(result.order).toEqual([0, 1])
2020

21-
expect(result.clustersGivenK).toHaveLength(3)
21+
expect(result.clustersGivenK).toHaveLength(2)
2222
expect(sortedClusters(result.clustersGivenK[0]!)).toEqual([[0, 1]])
2323
expect(sortedClusters(result.clustersGivenK[1]!)).toEqual([[0], [1]])
24-
expect(result.clustersGivenK[2]).toEqual([])
2524
})
2625

2726
it('clusters 4 samples into correct groups', async () => {
@@ -42,8 +41,7 @@ describe('clusterData integration', () => {
4241

4342
expect(result.order).toEqual([0, 1, 2, 3])
4443

45-
expect(result.clustersGivenK).toHaveLength(5)
46-
expect(result.clustersGivenK[4]).toEqual([])
44+
expect(result.clustersGivenK).toHaveLength(4)
4745
expect(sortedClusters(result.clustersGivenK[0]!)).toEqual([[0, 1, 2, 3]])
4846
expect(sortedClusters(result.clustersGivenK[1]!)).toEqual([
4947
[0, 1],
@@ -80,6 +78,68 @@ describe('clusterData integration', () => {
8078
)
8179
})
8280

81+
it('includes K=3 partition in clustersGivenK for 4 samples', async () => {
82+
// After first merge {0,1}, before second merge {2,3}, K=3 = {0,1}, {2}, {3}
83+
const data = [[1], [2], [5], [7]]
84+
const result = await clusterData({ data })
85+
86+
expect(sortedClusters(result.clustersGivenK[2]!)).toEqual([
87+
[0, 1],
88+
[2],
89+
[3],
90+
])
91+
})
92+
93+
it('order is a valid permutation of sample indices', async () => {
94+
const data = [
95+
[1, 2],
96+
[3, 4],
97+
[5, 1],
98+
[2, 8],
99+
]
100+
const result = await clusterData({ data })
101+
102+
expect([...result.order].sort((a, b) => a - b)).toEqual([0, 1, 2, 3])
103+
})
104+
105+
it('fires progress callbacks during real clustering', async () => {
106+
const data = Array.from({ length: 10 }, (_, i) => [i])
107+
const messages: string[] = []
108+
109+
await clusterData({ data, onProgress: msg => messages.push(msg) })
110+
111+
expect(messages.length).toBeGreaterThan(0)
112+
expect(messages[0]).toBe('Running hierarchical clustering in WASM...')
113+
})
114+
115+
it('handles equal distances deterministically', async () => {
116+
// Sample 1 and 2 are both distance 1 from sample 0 — ties should resolve consistently
117+
const data = [
118+
[0, 0],
119+
[1, 0],
120+
[0, 1],
121+
]
122+
const result1 = await clusterData({ data })
123+
const result2 = await clusterData({ data })
124+
125+
expect(result1.order).toEqual(result2.order)
126+
expect(result1.clustersGivenK).toEqual(result2.clustersGivenK)
127+
})
128+
129+
it('clusterObject propagates labels to leaf nodes', async () => {
130+
const result = await clusterObject({
131+
data: { alpha: [1, 2], beta: [1, 3], gamma: [9, 9] },
132+
})
133+
134+
const leafNames = (node: {
135+
name: string
136+
children?: (typeof node)[]
137+
}): string[] =>
138+
node.children ? node.children.flatMap(leafNames) : [node.name]
139+
140+
expect(leafNames(result.tree).sort()).toEqual(['alpha', 'beta', 'gamma'])
141+
})
142+
83143
it('returns deterministic results for the same input', async () => {
84144
const data = [
85145
[1, 2],

0 commit comments

Comments
 (0)