Skip to content

fix(grouping): Only run grouping calculation once #78630

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Oct 7, 2024

Conversation

lobsterkatie
Copy link
Member

@lobsterkatie lobsterkatie commented Oct 4, 2024

During grouping, the slowest part of the process is calculating the variants. Before Seer and before grouphash metadata, we only needed them once, but now we use them in two places in the Seer flow and are about to use them in another place for grouphash metadata. To avoid calculating them now potentially up to four times, this PR refactors things so that they're passed from the place where they're initially calculated (in event.get_hashes) through the various intermediate functions to the spots in the Seer flow where they're currently used. Along this path is the place where we'll need them from grouphash metadata also.

Notes:

  • In order to not have to change unrelated uses of get_hashes, instead of changing its return value I instead extracted most of its inner logic into a separate get_hashes_and_variants method. Now get_hashes calls get_hashes_and_variants (and just ignores the variants) and in the spot in ingest where we used to call get_hashes, we now call get_hashes_and_variants.

  • We have a pair of helpers, run_primary_grouping and maybe_run_secondary_grouping, for calculating primary and secondary hashes, respectively, which need to have matching signatures, because they're both passed to get_hashes_and_grouphashes as the hash_calculation_function parameter. We don't ever need (or want) variants from the secondary hash calculation, so in place of real variants data maybe_run_secondary_grouping just passes an empty dictionary.

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Oct 4, 2024
Base automatically changed from kmclb-remove-try-catch-around-getting-hashes-during-event-grouping to master October 4, 2024 18:29
An error occurred while trying to automatically change base from kmclb-remove-try-catch-around-getting-hashes-during-event-grouping to master October 4, 2024 18:29
@lobsterkatie lobsterkatie force-pushed the kmclb-only-run-grouping-once-during-ingest branch from ae3a2ec to 3cabc55 Compare October 4, 2024 18:33
@lobsterkatie lobsterkatie marked this pull request as ready for review October 4, 2024 19:58
@lobsterkatie lobsterkatie requested a review from a team as a code owner October 4, 2024 19:58
except Exception as err:
sentry_sdk.capture_exception(err)

return secondary_hashes
# Return an empty variants dictionary because we need the signature of this function to match
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does the signature have to match?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so that's half a lie, and half true. The two pairs of functions in question are _calculate_primary_hashes/_calculate_secondary_hashes and run_primary_grouping/maybe_run_secondary_grouping. Technically, the signatures of the first pair don't have to match, because they're each only used in the corresponding member of the second pair. The signatures of the secondary pair really do have to match, though, because they're both passed to get_hashes_and_grouphashes as hash_calculation_function.

def get_hashes_and_grouphashes(
job: Job,
hash_calculation_function: Callable[
[Project, Job, MutableTags],
tuple[GroupingConfig, list[str]],
],
metric_tags: MutableTags,
) -> GroupHashInfo:

It sort of seemed easier to just keep the structures parallel, but you're right, I could be more accurate in how I describe what's happening.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep that makes sense. only Q then is it a bit more explicit to use pythons unwrapping syntax instead of grabbing the 0th element? so it'd be

secondary_hashes, _ = _calculate_event_grouping(
                project, event_copy, secondary_grouping_config
            )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK, do you think it is? It kinda seems six-of-one-half-a-dozen-of-the-other to me. Happy to change it if you think the other is better, since I'm already going back to clean up the comment.

Copy link
Member

@JoshFerge JoshFerge Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like it because its more explicit, as i was reviewing the PR i went to go make sure "is that the hashes being returned, that's the interface right, ah yep okay". but totally a small nit and feel free to keep it the way that it is if you like it better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you're saying. Okay, sure, I'll switch it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UPDATE: In the end, I decided the most explicit thing would be to leave _calculate_secondary_hashes as is (not have it return variants, and not say it's doing so in the name, the way _calculate_primary_hashes - now called _calculate_primary_hashes_and_variants - does). I did still make maybe_run_secondary_grouping return an empty dictionary for the variants (because the matching-signature constraint there is real), but now the inner helpers actually tell the truth about what they do/don't do.

(As you suggested, I did still switch the [0] to be unpacking in _calculate_secondary_hashes, though.)

Copy link
Member

@JoshFerge JoshFerge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything makes sense and looks well factored. only q. around the interface and calling of _calculate_secondary_hashes

@lobsterkatie lobsterkatie force-pushed the kmclb-only-run-grouping-once-during-ingest branch from 88ccde3 to 64f2f90 Compare October 4, 2024 21:41
@lobsterkatie lobsterkatie merged commit 9e36c51 into master Oct 7, 2024
51 checks passed
@lobsterkatie lobsterkatie deleted the kmclb-only-run-grouping-once-during-ingest branch October 7, 2024 19:25
@github-actions github-actions bot locked and limited conversation to collaborators Oct 23, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Scope: Backend Automatically applied to PRs that change backend components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants