-
Notifications
You must be signed in to change notification settings - Fork 4
Optimize 'Leaderboard - FIL Earned' query: add migration for participant_id and update query using ROW_NUMBER() for faster performance #324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
Hany-Almnaem
wants to merge
2
commits into
CheckerNetwork:main
Choose a base branch
from
Hany-Almnaem:optimize-query-performance
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
-- Add participant_id columns to both tables | ||
ALTER TABLE daily_scheduled_rewards ADD COLUMN participant_id INTEGER; | ||
ALTER TABLE daily_reward_transfers ADD COLUMN participant_id INTEGER; | ||
|
||
-- Backfill participant_id in daily_scheduled_rewards based on participant_address | ||
WITH address_mapping AS ( | ||
SELECT DISTINCT participant_address AS address, | ||
ROW_NUMBER() OVER (ORDER BY participant_address) AS generated_id | ||
FROM daily_scheduled_rewards | ||
) | ||
UPDATE daily_scheduled_rewards dsr | ||
SET participant_id = am.generated_id | ||
FROM address_mapping am | ||
WHERE dsr.participant_address = am.address; | ||
|
||
-- Backfill participant_id in daily_reward_transfers using the same mapping | ||
WITH address_mapping AS ( | ||
SELECT DISTINCT participant_address AS address, | ||
ROW_NUMBER() OVER (ORDER BY participant_address) AS generated_id | ||
FROM daily_scheduled_rewards | ||
) | ||
UPDATE daily_reward_transfers drt | ||
SET participant_id = am.generated_id | ||
FROM address_mapping am | ||
WHERE drt.to_address = am.address; | ||
|
||
-- Create indexes for better performance | ||
CREATE INDEX idx_daily_scheduled_rewards_pid_day ON daily_scheduled_rewards (participant_id, day DESC); | ||
CREATE INDEX idx_daily_reward_transfers_pid_day ON daily_reward_transfers (participant_id, day); | ||
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am concerned about using
ROW_NUMBER()
as the generated id:Is it guaranteed that we will always get the same generated id for the same participant address? For example, in the following query from your pull request, were are limiting which rows we look at:
In this particular case, since the
WHERE
condition removes recent deals, I think the query should work. However, if we need to write a different query in the future, one that will look at recent rows only (e.g.WHERE day >= $2
) - would we still get the same participant id?What is the performance cost of running
ROW_NUMBER() OVER (ORDER BY participant_address)
?Have you considered creating a new table mapping participant addresses to IDs and introducing a foreign key constraint/relation?
We have had a good experience with such schema design in spark-evaluate, see e.g. here:
https://github.com/CheckerNetwork/spark-evaluate/blob/733c5b974426027f7b0d100f2f7bf427ee1a1868/migrations/003.do.daily-participants.sql#L1-L4
My knowledge of advanced PostgreSQL is limited. It's possible your solution is correct and I just don't understand enough details to see that.
Let's discuss!
@juliangruber do you have any thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I was giving this a quick look, I was also concerned about using row number here. But I was not sure whether I understood enough about the situation to leave a comment about it.
My understanding is that this is a one-time migration, so we don't need to worry about the query behaving consistently - as long as it takes care of assigning unique ids once, and as long as any future rows will have ids not conflicting with these ones.
Would https://stackoverflow.com/a/78355281 be safer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the previous solution, I aimed to minimize schema changes. However, if future queries rely on filtering conditions like WHERE day >= $2, using ROW_NUMBER() becomes risky because it dynamically assigns IDs based on available addresses. those IDs could shift around if the dataset changes in some cases like the filtering condition focuses on recent records only, which isn't great for reliability.
I think a more robust solution as you suggest is to create a separate mapping table that assigns stable and efficiently retrievable IDs to participant_address values and introduce a foreign key constraint to enforce consistency.
This approach would mean updating both daily_scheduled_rewards and daily_reward_transfers tables to work with these foreign keys, but it's worth it.
From a performance angle, there's a clear win here too. ROW_NUMBER() has to sort everything each time (O(n log n) worst-case scenarios), while looking up IDs in a mapping table is faster with proper indexing (O(1)).
Happy to walk through the migration steps if you want to give it a shot!
@juliangruber, @bajtos
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposed solution sounds great to me, but let's wait for a confirmation from @bajtos
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @bajtos
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, I was on a vacation and took me a while to process all notifications.
I agree with this plan 👍🏻