
First draft of silhouette_samples that works (according to my test) #2199

Draft
Shabasovich wants to merge 3 commits into main from features/2190-Implement_silhouette_score

Conversation

@Shabasovich
Collaborator

Description

  • tried to implement silhouette_samples from scikit-learn

Type of change

  • new function

Additional

  • I have a file with explanations containing the logic I tried to implement to calculate a(i) and b(i). I can provide that as well if needed.
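For readers following along, a(i) and b(i) here refer to the standard silhouette definition: a(i) is the mean distance from sample i to the other members of its own cluster, b(i) is the smallest mean distance from sample i to the members of any other cluster, and s(i) = (b(i) - a(i)) / max(a(i), b(i)), with s(i) = 0 for a cluster containing a single sample. The following is a minimal NumPy sketch of that definition for cross-checking small inputs. It is not the implementation in this PR; the function name is hypothetical, and it assumes the Euclidean metric and at least two distinct labels.

```python
import numpy as np


def silhouette_samples_reference(X, labels):
    """Naive O(n^2) silhouette coefficients (Euclidean), for cross-checking only."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise Euclidean distance matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    unique = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # single-sample cluster: s(i) is defined as 0
        # a(i): mean distance to the other members of the own cluster
        # (dist[i, i] is 0, so it does not affect the sum).
        a = dist[i, same].sum() / (n_same - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in unique if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores
```

For the three one-dimensional points 0, 1 and 10 with labels 0, 0 and 1, this gives a(0) = 1 and b(0) = 10, so s(0) = 0.9, while the lone point in cluster 1 gets 0.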

Collaborator

@brownbaerchen left a comment

This is already looking quite good. I had a brief look and highlighted some general things, mainly that the tests should use assert statements.

Since you mentioned you had trouble debugging, I want to advertise the debugger once more. Just write breakpoint() in the code whenever you want to stop and explore. There can be trouble with parallel debugging. So when running in parallel, I only call a breakpoint on rank 0:

if ht.comm.rank == 0:
    breakpoint()

If you're already familiar with this stuff, just ignore this. But I know many Python programmers, myself included, who were introduced to this way too late, so it doesn't hurt to mention it :D

Collaborator

We have moved the tests to a different directory in #2172. Please move this file there.

Comment on lines +25 to +32
ht_results_np = ht_results.numpy()

if np.allclose(sk_results, ht_results_np, atol=1e-5):
    print("✅ Test Passed: HeAT matches Scikit-Learn results.")
else:
    max_diff = np.max(np.abs(sk_results - ht_results_np))
    print(f"❌ Test Failed: Results differ. Max diff: {max_diff}")
    #print(f"sk_results are {np.abs(sk_results)}; ht_results are {np.abs(ht_results_np)}")
Collaborator

Suggested change
- ht_results_np = ht_results.numpy()
- if np.allclose(sk_results, ht_results_np, atol=1e-5):
-     print("✅ Test Passed: HeAT matches Scikit-Learn results.")
- else:
-     max_diff = np.max(np.abs(sk_results - ht_results_np))
-     print(f"❌ Test Failed: Results differ. Max diff: {max_diff}")
-     #print(f"sk_results are {np.abs(sk_results)}; ht_results are {np.abs(ht_results_np)}")
+ ht_results_np = ht_results.resplit(None).numpy()
+ assert np.allclose(sk_results, ht_results_np, atol=1e-5), f'Max diff between Heat and scikit-learn: {np.max(np.abs(sk_results - ht_results_np))}'
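The point of the suggestion is that a failing comparison should raise instead of printing, so the test runner registers the failure, and that the message should carry the maximum deviation. A minimal NumPy-only sketch of that pattern follows. The helper name `assert_matches_reference` is hypothetical and the arrays are stand-in values; in the actual test they would be the scikit-learn result and the gathered Heat result.

```python
import numpy as np


def assert_matches_reference(result, reference, atol=1e-5):
    """Fail with the maximum deviation in the message if the arrays differ."""
    result = np.asarray(result)
    reference = np.asarray(reference)
    max_diff = np.max(np.abs(reference - result))
    assert np.allclose(reference, result, atol=atol), (
        f"Max diff between Heat and scikit-learn: {max_diff}"
    )


# Stand-in values, not real silhouette results.
sk_results = np.array([0.90, 0.88, 0.00])
ht_results_np = sk_results + 1e-7
assert_matches_reference(ht_results_np, sk_results)
```

Note that the message is an f-string with the expression inside braces; without the braces the expression text itself would be printed instead of its value.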

Comment on lines +39 to +40
if res_edge[3] == 0:
    print("✅ Edge Case Passed: Single-sample cluster correctly assigned 0.0")
Collaborator

Suggested change
- if res_edge[3] == 0:
-     print("✅ Edge Case Passed: Single-sample cluster correctly assigned 0.0")
+ assert res_edge[3] == 0

Comment on lines +60 to +63
if res_np[0] > 0.8:
    print(f"✅ Success! Point 0 is {res_np[0]:.4f}")
else:
    print(f"❌ Failure! Point 0 is {res_np[0]:.4f}")
Collaborator

@brownbaerchen Mar 13, 2026

Suggested change
- if res_np[0] > 0.8:
-     print(f"✅ Success! Point 0 is {res_np[0]:.4f}")
- else:
-     print(f"❌ Failure! Point 0 is {res_np[0]:.4f}")
+ assert res_np[0] > 0.8, f"Point 0 is {res_np[0]:.4f}"



def silhouette_samples(X, labels, *, metric="euclidean", **kwds):
    X_distributed = ht.array(X, split=0)
Collaborator

You want the input to be a Heat array that is split according to the user's need. Have a look at heat.sanitation.sanitize_in to make sure the input is a valid heat DNDarray and then don't split.


def silhouette_samples(X, labels, *, metric="euclidean", **kwds):
    X_distributed = ht.array(X, split=0)
    labels_distributed = ht.array(labels, split=0)
Collaborator

Same comment as for X

Comment on lines +32 to +38
if X_distributed.dtype.kind == "f":
    atol = ht.finfo(X_distributed.dtype).eps * 100  # tolerance based on machine accuracy

    if ht.any(ht.abs(diag_elements) > atol):
        raise error_msg
elif ht.any(diag_elements != 0):  # integral dtype
    raise error_msg
Collaborator

I don't think single precision deserves special treatment here. You can check against the machine eps for any datatype.
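One possible reading of this suggestion is a single comparison whose tolerance is derived from the dtype, instead of two separate branches that each raise. The sketch below illustrates that idea with NumPy only; it is not Heat code, and the function name `check_zero_diagonal` and the error text are hypothetical. It assumes a precomputed square distance matrix. Since `np.finfo` is only defined for floating-point dtypes, the sketch assumes a tolerance of exactly zero for integer input; whether Heat's `finfo` needs the same guard would have to be checked.

```python
import numpy as np


def check_zero_diagonal(dist):
    """Raise ValueError if a precomputed distance matrix has a non-zero diagonal."""
    dist = np.asarray(dist)
    diag = np.diagonal(dist)
    # Tolerance scaled to the machine epsilon of the matrix dtype.
    # Assumption: integer matrices must have an exactly zero diagonal.
    if np.issubdtype(dist.dtype, np.floating):
        atol = np.finfo(dist.dtype).eps * 100
    else:
        atol = 0
    # One comparison and one raise, regardless of dtype.
    if np.any(np.abs(diag) > atol):
        raise ValueError(
            "The precomputed distance matrix contains non-zero elements on the diagonal."
        )
```

The check raises an exception instance (`ValueError(...)`) and not a bare message string, since raising a plain string is itself a `TypeError` in Python 3.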
