Commit 0edfa26

notes
1 parent 6ce5dd2 commit 0edfa26

1 file changed

Lines changed: 19 additions & 17 deletions

decisions.md
@@ -4,13 +4,14 @@
 
 TLDR: We invested in and trust the labels.
 
-The data were sampled from the previous (v1) embedding model's decision boundary. These pairs are more interesting and
-are inherently the ones where a successor stands to gain. Easy stacktrace pairs are likely easy for every model. These
-pairs typically have completely different errors and frames. Sampling weights decreased as the distance from v1's
-boundary increased. Pairs were formed via a self-join per Sentry project. There are no other constraints around the
-data's structure besides a limit on the number of stacktraces fetched from table B for each stacktrace in table A of the
-self-join, and a distance range which was determined via eye-test: no pairs have a v1 cosine distance outside the range
-[0.001, 0.50]. A stacktrace can be similar or dissimilar to any number of other stacktraces. This dataset has:
+The data were largely sampled from the previous (v1) embedding model's decision boundary. These pairs are more
+interesting and are the ones where a successor stands to gain the most. Easy stacktrace pairs are likely easy for every
+model. Easy negatives typically have completely different errors and frames, and easy positives are lexically close.
+Sampling weights decreased as the distance from v1's boundary increased. Pairs were formed via a self-join per Sentry
+project. There are no other constraints around the data's structure besides a limit on the number of stacktraces fetched
+from table B for each stacktrace in table A of the self-join, and a distance range which was determined via eye-test: no
+pairs have a v1 cosine distance outside the range [0.001, 0.50]. A stacktrace can be similar or dissimilar to any number
+of other stacktraces. This dataset has:
 
 - Some stacktraces which are similar to 20 others
 - Some which are dissimilar to 20
@@ -24,12 +25,12 @@ generally err on the side of avoiding overgrouping—wrongly grouping two errors
 costly; Sentry's job is to tell you that something in prod broke.
 
 There are many ways to train an embeddings model from here. You could, e.g., collate in-batch negatives and train using
-symmetric MNRL—the "standard" loss today. The immediate problem is that in-batch negatives can easily be false negatives
-in this dataset. Across pairs and w/in a project, there are plenty of similar stacktraces that aren't explicitly labeled
-as similar. MNRL is especially sensitive to the false negative rate. This sensitivity is good when you're training for
-the globally sparse similarity structures encountered in RAG, but bad when you're training for fine-grained stacktrace
-similarity w/in a Sentry project. GISTEmbed aims to address this kind of issue, but there aren't good guide models for
-this niche task. `gemini-embedding-2`, e.g.,
+symmetric MNRL—the standard embeddings loss today. The immediate problem is that in-batch negatives can easily be false
+negatives in this dataset. Across pairs and w/in a project, there are plenty of similar stacktraces that aren't
+explicitly labeled as similar. MNRL is especially sensitive to the false negative rate. This sensitivity is good when
+you're training for the globally sparse similarity structures encountered in RAG, but bad when you're training for
+fine-grained stacktrace similarity w/in a Sentry project. GISTEmbed aims to address this kind of issue, but there aren't
+good guide models for this niche task. `gemini-embedding-2`, e.g.,
 [flops](https://github.com/getsentry/grouping-trainer/tree/main/eval/comparisons/test_full3/gemini-cluster_dim3072_vs_large-no-prefix_dim64).
 
 You could instead have the batch sampler pair up stacktraces from different Sentry projects to avoid false negatives.
@@ -38,12 +39,13 @@ test-time, the DB query doesn't compare stacktraces across projects. A more basi
 procedure needs to handle stacktraces that aren't similar to anything in the dataset or are similar to multiple
 stacktraces by padding and masking the loss, or adding a pairwise loss term over lone pairs.
 
-Another idea is to use a triplet loss. Triplet sampling is well-defined when an input is classified. It's ambiguous for
-our dataset which has variable numbers of similar and dissimilar stacktraces. I'm not sure there's a principled way to
-select from the many combinations of triplets without dropping explicitly labeled stacktrace relationships.
+Another standard approach is to use a triplet loss. Triplet sampling is well-defined when an input is classified. It's
+ambiguous for our dataset which has variable numbers of similar and dissimilar stacktraces. I'm not sure there's a
+principled way to select from the many combinations of triplets without dropping explicitly labeled stacktrace
+relationships.
 
 In general, non-pairwise losses coerce a jagged but rich similarity structure into a rectangular one. The data
-intentionally contains hard positives and hard negatives. Pairwise losses put their faith in the data and accomodate the
+intentionally contains hard positives and hard negatives. Pairwise losses put their faith in our data and accomodate the
 jagged structure by melting it into a rectangular one.
 
 Pairwise losses do have downsides. A statistical downside is that we don't have many negatives per positive, which is