chore(jobsdb): cache distinct parameters query result for all datasets except last #5752

Open
wants to merge 11 commits into
base: master

Sidddddarth
Member

@Sidddddarth Sidddddarth commented Apr 21, 2025

Description

Reduce the number of queries needed to find the distinct parameter values (source_id, destination_id, workspace_id).
Cache the results per dataset and recompute them only for the last dataset (occasionally for more than one, right after a migration or the creation of a new dataset).

The algorithm

workspace_id is treated as just another parameter.
This deprecates the additional filtering by custom_val when querying for workspaces.

Linear Ticket

Resolves PIPE-2046

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

@Sidddddarth Sidddddarth requested a review from atzoum April 21, 2025 15:48

codecov bot commented Apr 21, 2025

Codecov Report

Attention: Patch coverage is 84.42623% with 38 lines in your changes missing coverage. Please review.

Project coverage is 76.96%. Comparing base (fbc4abe) to head (8957d73).

Files with missing lines                     Patch %   Lines
app/apphandlers/processorAppHandler.go       0.00%     13 Missing ⚠️
jobsdb/jobsdb.go                             85.71%    4 Missing and 2 partials ⚠️
jobsdb/distinct_values_cache.go              94.52%    3 Missing and 1 partial ⚠️
jobsdb/jobsdb_parameters_cache.go            90.00%    2 Missing and 1 partial ⚠️
router/batchrouter/handle_lifecycle.go       86.36%    1 Missing and 2 partials ⚠️
router/batchrouter/isolation/isolation.go    72.72%    2 Missing and 1 partial ⚠️
router/handle_lifecycle.go                   85.71%    1 Missing and 2 partials ⚠️
router/isolation/isolation.go                72.72%    2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5752      +/-   ##
==========================================
+ Coverage   76.90%   76.96%   +0.05%     
==========================================
  Files         491      493       +2     
  Lines       67183    67293     +110     
==========================================
+ Hits        51667    51789     +122     
+ Misses      12692    12686       -6     
+ Partials     2824     2818       -6     

@Sidddddarth Sidddddarth force-pushed the chore.cacheParametersQuery branch from 8bff521 to 8d0a868 Compare April 30, 2025 06:08
@Sidddddarth Sidddddarth changed the title chore: cache distinct parameters query result for all datasets except last chore(jobsdb): cache distinct parameters query result for all datasets except last Apr 30, 2025
@@ -57,7 +57,7 @@ type workspaceStrategy struct {

 // ActivePartitions returns the list of active workspaceIDs in jobsdb
 func (ws workspaceStrategy) ActivePartitions(ctx context.Context, db jobsdb.JobsDB) ([]string, error) {
-	return db.GetActiveWorkspaces(ctx, ws.customVal)
+	return db.GetDistinctParameterValues(ctx, jobsdb.WorkspaceID)
Contributor

How do we filter out workspace ids that don't have any entry for that customVal?

@@ -64,7 +64,7 @@ type workspaceStrategy struct {

 // ActivePartitions returns the list of active workspaceIDs in jobsdb
 func (ws workspaceStrategy) ActivePartitions(ctx context.Context, db jobsdb.JobsDB) ([]string, error) {
-	return db.GetActiveWorkspaces(ctx, ws.customVal)
+	return db.GetDistinctParameterValues(ctx, jobsdb.WorkspaceID)
Contributor

How do we filter out workspace ids that don't have any entry for that customVal?

UNION ALL
(SELECT s.* FROM t, LATERAL(
SELECT workspace_id FROM %[1]q f
WHERE custom_val = '%[2]s' AND f.workspace_id > t.workspace_id
Contributor

I believe we should retain the option of filtering by custom_val:

// GetDistinctParameterValues returns the list of distinct parameter ("source_id", "destination_id", "workspace_id") values inside the jobsdb, optionally filtering the results by `custom_val`.
GetDistinctParameterValues(ctx context.Context, parameter ParameterName, customVal string) (values []string, err error)

Member Author

@Sidddddarth Sidddddarth May 2, 2025

Should we use the pending event registry to filter instead?
If we cache on the combination of parameter and custom_val, we'd end up making one query per custom_val for each parameter.

Member Author

Alternatively, we could filter workspace IDs the same way destinationIDs are already filtered in router and batch_rt, using a map lookup.

Member Author

Used the second approach. After one query, the cache can help for up to two hours.

@Sidddddarth Sidddddarth force-pushed the chore.cacheParametersQuery branch 2 times, most recently from 6a00157 to ae53e1a Compare May 5, 2025 08:24
@Sidddddarth Sidddddarth force-pushed the chore.cacheParametersQuery branch from ae53e1a to 8957d73 Compare May 5, 2025 08:44

func NewDistinctValuesCache() *distinctValuesCache {
return &distinctValuesCache{
cache: make(map[string]map[string][]string),
Member

Can we use sync.Map instead? It might be a better fit than having multiple locks.

Member

@cisse21 cisse21 left a comment

We need a way to disable the cache and fall back to the older, non-caching behaviour. If one already exists, can you point me to it?

3 participants