@@ -4,13 +4,14 @@

 TLDR: We invested in and trust the labels.

-The data were sampled from the previous (v1) embedding model's decision boundary. These pairs are more interesting and
-are inherently the ones where a successor stands to gain. Easy stacktrace pairs are likely easy for every model. These
-pairs typically have completely different errors and frames. Sampling weights decreased as the distance from v1's
-boundary increased. Pairs were formed via a self-join per Sentry project. There are no other constraints around the
-data's structure besides a limit on the number of stacktraces fetched from table B for each stacktrace in table A of the
-self-join, and a distance range which was determined via eye-test: no pairs have a v1 cosine distance outside the range
-[0.001, 0.50]. A stacktrace can be similar or dissimilar to any number of other stacktraces. This dataset has:
+The data were largely sampled from the previous (v1) embedding model's decision boundary. These pairs are more
+interesting and are the ones where a successor stands to gain the most. Easy stacktrace pairs are likely easy for every
+model. Easy negatives typically have completely different errors and frames, and easy positives are lexically close.
+Sampling weights decreased as the distance from v1's boundary increased. Pairs were formed via a self-join per Sentry
+project. There are no other constraints around the data's structure besides a limit on the number of stacktraces fetched
+from table B for each stacktrace in table A of the self-join, and a distance range which was determined via eye-test: no
+pairs have a v1 cosine distance outside the range [0.001, 0.50]. A stacktrace can be similar or dissimilar to any number
+of other stacktraces. This dataset has:

 - Some stacktraces which are similar to 20 others
 - Some which are dissimilar to 20
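
The boundary-weighted sampling described in this hunk could look roughly like the sketch below. It is an illustration under stated assumptions, not the pipeline's actual code: the threshold `V1_BOUNDARY`, the exponential decay, and the `scale` parameter are hypothetical, and drawing with replacement is a simplification. Only the [0.001, 0.50] distance range comes from the text.

```python
import math
import random

# Only the distance range comes from the text; everything else is hypothetical.
MIN_DIST, MAX_DIST = 0.001, 0.50
V1_BOUNDARY = 0.10  # hypothetical v1 decision threshold (cosine distance)


def sampling_weight(v1_distance, boundary=V1_BOUNDARY, scale=0.05):
    """Weight that shrinks as a pair's v1 distance moves away from the boundary."""
    if not (MIN_DIST <= v1_distance <= MAX_DIST):
        return 0.0  # pairs outside the allowed range are never sampled
    return math.exp(-abs(v1_distance - boundary) / scale)


def sample_pairs(pairs, k, seed=0):
    """Draw k pairs (with replacement) in proportion to their weights.

    `pairs` is a list of (stacktrace_id_a, stacktrace_id_b, v1_distance).
    """
    rng = random.Random(seed)
    weights = [sampling_weight(d) for _, _, d in pairs]
    return rng.choices(pairs, weights=weights, k=k)
```

Any decay that falls off monotonically with distance from the boundary would fit the description equally well; the exponential is just one convenient choice.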
@@ -24,12 +25,12 @@ generally err on the side of avoiding overgrouping—wrongly grouping two errors
 costly; Sentry's job is to tell you that something in prod broke.

 There are many ways to train an embeddings model from here. You could, e.g., collate in-batch negatives and train using
-symmetric MNRL—the "standard" loss today. The immediate problem is that in-batch negatives can easily be false negatives
-in this dataset. Across pairs and w/in a project, there are plenty of similar stacktraces that aren't explicitly labeled
-as similar. MNRL is especially sensitive to the false negative rate. This sensitivity is good when you're training for
-the globally sparse similarity structures encountered in RAG, but bad when you're training for fine-grained stacktrace
-similarity w/in a Sentry project. GISTEmbed aims to address this kind of issue, but there aren't good guide models for
-this niche task. `gemini-embedding-2`, e.g.,
+symmetric MNRL—the standard embeddings loss today. The immediate problem is that in-batch negatives can easily be false
+negatives in this dataset. Across pairs and w/in a project, there are plenty of similar stacktraces that aren't
+explicitly labeled as similar. MNRL is especially sensitive to the false negative rate. This sensitivity is good when
+you're training for the globally sparse similarity structures encountered in RAG, but bad when you're training for
+fine-grained stacktrace similarity w/in a Sentry project. GISTEmbed aims to address this kind of issue, but there aren't
+good guide models for this niche task. `gemini-embedding-2`, e.g.,
 [flops](https://github.com/getsentry/grouping-trainer/tree/main/eval/comparisons/test_full3/gemini-cluster_dim3072_vs_large-no-prefix_dim64).

 You could instead have the batch sampler pair up stacktraces from different Sentry projects to avoid false negatives.
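
To make the false-negative concern concrete, here is a minimal pure-Python sketch of a symmetric in-batch-negatives loss. It is an illustration, not the trainer's implementation: the function names are hypothetical, `scale=20.0` is an assumed temperature-like constant, and embeddings are assumed to be unit-normalized.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_entropy_row(scores, target):
    """-log softmax(scores)[target], computed stably."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]


def symmetric_mnrl(anchors, positives, scale=20.0):
    """Symmetric in-batch-negatives loss over unit-normalized embeddings.

    Row i treats positives[i] as the only match for anchors[i]; every other
    item in the batch is assumed to be a negative, labeled or not.
    """
    n = len(anchors)
    sims = [[scale * dot(a, p) for p in positives] for a in anchors]
    # Anchor -> positive direction: each row of the similarity matrix.
    a_to_p = sum(cross_entropy_row(sims[i], i) for i in range(n)) / n
    # Positive -> anchor direction: each column of the similarity matrix.
    p_to_a = sum(
        cross_entropy_row([sims[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (a_to_p + p_to_a) / 2


# Two unrelated pairs: the in-batch negatives really are negatives.
clean = symmetric_mnrl([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Two pairs of near-identical stacktraces from one project: the "negatives"
# are false negatives, and the loss penalizes embeddings that are correct.
noisy = symmetric_mnrl([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
```

In the second batch the loss sits at about log 2 even though every embedding is placed exactly where the labels say it should be, which is the failure mode the paragraph above describes.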
@@ -38,12 +39,13 @@ test-time, the DB query doesn't compare stacktraces across projects. A more basi
 procedure needs to handle stacktraces that aren't similar to anything in the dataset or are similar to multiple
 stacktraces by padding and masking the loss, or adding a pairwise loss term over lone pairs.

-Another idea is to use a triplet loss. Triplet sampling is well-defined when an input is classified. It's ambiguous for
-our dataset which has variable numbers of similar and dissimilar stacktraces. I'm not sure there's a principled way to
-select from the many combinations of triplets without dropping explicitly labeled stacktrace relationships.
+Another standard approach is to use a triplet loss. Triplet sampling is well-defined when an input is classified. It's
+ambiguous for our dataset which has variable numbers of similar and dissimilar stacktraces. I'm not sure there's a
+principled way to select from the many combinations of triplets without dropping explicitly labeled stacktrace
+relationships.

 In general, non-pairwise losses coerce a jagged but rich similarity structure into a rectangular one. The data
-intentionally contains hard positives and hard negatives. Pairwise losses put their faith in the data and accommodate the
+intentionally contains hard positives and hard negatives. Pairwise losses put their faith in our data and accommodate the
 jagged structure by melting it into a rectangular one.

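One way to picture the "melting" described above: flatten each stacktrace's variable-length lists of similar and dissimilar neighbors into long-format rows, then score each row independently. The sketch below is illustrative only. The input layout, the `melt` helper, and the `margin` value are assumptions, and the contrastive loss is just one common pairwise loss, not necessarily the one used here.

```python
def melt(labels):
    """Flatten {anchor: {"similar": [...], "dissimilar": [...]}} into rows.

    Each row is (anchor_id, other_id, label) with label 1 for similar and 0
    for dissimilar, so anchors with 20 neighbors and anchors with 1 neighbor
    live in the same table without padding or masking.
    """
    rows = []
    for anchor, groups in labels.items():
        rows += [(anchor, other, 1) for other in groups.get("similar", [])]
        rows += [(anchor, other, 0) for other in groups.get("dissimilar", [])]
    return rows


def contrastive_loss(distance, label, margin=0.5):
    """Pairwise contrastive loss for a single row (margin is hypothetical)."""
    if label == 1:
        return distance ** 2  # pull similar pairs together
    return max(0.0, margin - distance) ** 2  # push dissimilar pairs past the margin


labels = {
    "a": {"similar": ["b", "c"], "dissimilar": ["d"]},
    "e": {"dissimilar": ["f"]},
}
rows = melt(labels)
```

Every explicitly labeled relationship survives as exactly one row, so nothing has to be dropped or sampled away to fit a fixed batch shape.
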
 Pairwise losses do have downsides. A statistical downside is that we don't have many negatives per positive, which is