Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(grouping): Only run grouping calculation once #78630

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

lobsterkatie
Copy link
Member

During grouping, the slowest part of the process is calculating the variants. Before Seer and before grouphash metadata, we only needed them once, but now we use them in two places in the Seer flow and are about to use them in another place for grouphash metadata. To avoid calculating them now potentially up to four times, this PR refactors things so that they're passed from the place where they're initially calculated (in event.get_hashes) through the various intermediate functions to the spots in the Seer flow where they're currently used. Along this path is the place where we'll need them from grouphash metadata also.

Notes:

  • In order to not have to change unrelated uses of get_hashes, instead of changing its return value I instead extracted most of its inner logic into a separate get_hashes_and_variants method. Now get_hashes calls get_hashes_and_variants (and just ignores the variants) and in the spot in ingest where we used to call get_hashes, we now call get_hashes_and_variants.

  • We have a few pairs of helpers, for calculating primary and secondary hashes, respectively, which need to have matching signatures - meaning if the primary-hash version returns variants, the secondary-hash version must, too. That said, we don't ever actually want to use the secondary variants, so rather than having the secondary version of each helper returning the real variants, I instead chose to return an empty dictionary. Since we ignore that part of the result it doesn't really matter, but I figured debugging-wise, it's easier to keep track of "this one I want, this one I don't" if one has real data and one is empty.

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Oct 4, 2024
Base automatically changed from kmclb-remove-try-catch-around-getting-hashes-during-event-grouping to master October 4, 2024 18:29
An error occurred while trying to automatically change base from kmclb-remove-try-catch-around-getting-hashes-during-event-grouping to master October 4, 2024 18:29
@lobsterkatie lobsterkatie force-pushed the kmclb-only-run-grouping-once-during-ingest branch from ae3a2ec to 3cabc55 Compare October 4, 2024 18:33
@lobsterkatie lobsterkatie marked this pull request as ready for review October 4, 2024 19:58
@lobsterkatie lobsterkatie requested a review from a team as a code owner October 4, 2024 19:58
except Exception as err:
sentry_sdk.capture_exception(err)

return secondary_hashes
# Return an empty variants dictionary because we need the signature of this function to match
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does the signature have to match?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so that's half a lie, and half true. The two pairs of functions in question are _calculate_primary_hashes/_calculate_secondary_hashes and run_primary_grouping/maybe_run_secondary_grouping. Technically, the signatures of the first pair don't have to match, because they're each only used in the corresponding member of the second pair. The signatures of the secondary pair really do have to match, though, because they're both passed to get_hashes_and_grouphashes as hash_calculation_function.

def get_hashes_and_grouphashes(
job: Job,
hash_calculation_function: Callable[
[Project, Job, MutableTags],
tuple[GroupingConfig, list[str]],
],
metric_tags: MutableTags,
) -> GroupHashInfo:

It sort of seemed easier to just keep the structures parallel, but you're right, I could be more accurate in how I describe what's happening.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep that makes sense. only Q then is it a bit more explicit to use pythons unwrapping syntax instead of grabbing the 0th element? so it'd be

secondary_hashes, _ = _calculate_event_grouping(
                project, event_copy, secondary_grouping_config
            )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK, do you think it is? It kinda seems six-of-one-half-a-dozen-of-the-other to me. Happy to change it if you think the other is better, since I'm already going back to clean up the comment.

Copy link
Member

@JoshFerge JoshFerge Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like it because its more explicit, as i was reviewing the PR i went to go make sure "is that the hashes being returned, that's the interface right, ah yep okay". but totally a small nit and feel free to keep it the way that it is if you like it better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you're saying. Okay, sure, I'll switch it.

Copy link
Member

@JoshFerge JoshFerge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything makes sense and looks well factored. only q. around the interface and calling of _calculate_secondary_hashes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Scope: Backend Automatically applied to PRs that change backend components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants