
Optimize 'Leaderboard - FIL Earned' query: add migration for participant_id and update query using ROW_NUMBER() for faster performance #324

Open
wants to merge 2 commits into base: main
30 changes: 30 additions & 0 deletions db/migrations/023.do.add-participant-id.sql
@@ -0,0 +1,30 @@
-- Add participant_id columns to both tables
ALTER TABLE daily_scheduled_rewards ADD COLUMN participant_id INTEGER;
ALTER TABLE daily_reward_transfers ADD COLUMN participant_id INTEGER;

-- Backfill participant_id in daily_scheduled_rewards based on participant_address
WITH address_mapping AS (
SELECT DISTINCT participant_address AS address,
ROW_NUMBER() OVER (ORDER BY participant_address) AS generated_id
FROM daily_scheduled_rewards
)
Comment on lines +5 to +10
Member:

I am concerned about using ROW_NUMBER() as the generated id:

Is it guaranteed that we will always get the same generated id for the same participant address? For example, in the following query from your pull request, we are limiting which rows we look at:

        SELECT participant_id, scheduled_rewards, participant_address,
               ROW_NUMBER() OVER (PARTITION BY participant_id ORDER BY day DESC) AS rn
        FROM daily_scheduled_rewards
        WHERE day <= $2

In this particular case, since the WHERE condition removes recent deals, I think the query should work. However, if we need to write a different query in the future, one that will look at recent rows only (e.g. WHERE day >= $2) - would we still get the same participant id?
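The concern above can be reproduced in miniature. Below is a sketch using Python's stdlib sqlite3 (SQLite supports the same ROW_NUMBER() window function); the table and column names follow the migration, the data and the `day >= ?` filter are made up for illustration. It shows that the generated id for the same address changes when the WHERE filter changes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE daily_scheduled_rewards (day TEXT, participant_address TEXT);
  INSERT INTO daily_scheduled_rewards VALUES
    ('2024-01-01', '0xaaa'),
    ('2024-01-02', '0xbbb'),
    ('2024-01-03', '0xccc');
""")

# Same shape as the CTE in the migration, but filtering to recent days only
MAPPING_QUERY = """
  WITH address_mapping AS (
    SELECT DISTINCT participant_address AS address,
           ROW_NUMBER() OVER (ORDER BY participant_address) AS generated_id
    FROM daily_scheduled_rewards
    WHERE day >= ?
  )
  SELECT address, generated_id FROM address_mapping
"""

all_rows = dict(conn.execute(MAPPING_QUERY, ('2024-01-01',)))
recent_rows = dict(conn.execute(MAPPING_QUERY, ('2024-01-02',)))

print(all_rows)     # {'0xaaa': 1, '0xbbb': 2, '0xccc': 3}
print(recent_rows)  # {'0xbbb': 1, '0xccc': 2} -- '0xbbb' now maps to a different id
```

Once '0xaaa' falls outside the filter, every later address shifts down by one, so the same address maps to two different ids depending on the query.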

What is the performance cost of running ROW_NUMBER() OVER (ORDER BY participant_address)?

Have you considered creating a new table mapping participant addresses to IDs and introducing a foreign key constraint/relation?

We have had a good experience with such schema design in spark-evaluate, see e.g. here:

https://github.com/CheckerNetwork/spark-evaluate/blob/733c5b974426027f7b0d100f2f7bf427ee1a1868/migrations/003.do.daily-participants.sql#L1-L4

My knowledge of advanced PostgreSQL is limited. It's possible your solution is correct and I just don't understand enough details to see that.

Let's discuss!

@juliangruber do you have any thoughts on this?
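The mapping-table design suggested above (see the spark-evaluate migration linked in the comment) might look roughly like this in miniature. This is a sketch using Python's stdlib sqlite3 rather than PostgreSQL; the `participants` table name and columns are assumptions borrowed from spark-evaluate, and in PostgreSQL the id column would be SERIAL or GENERATED ALWAYS AS IDENTITY:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE daily_scheduled_rewards (day TEXT, participant_address TEXT);
  INSERT INTO daily_scheduled_rewards VALUES
    ('2024-01-01', '0xaaa'),
    ('2024-01-02', '0xbbb');

  -- Hypothetical mapping table: each address gets a stable id exactly once
  CREATE TABLE participants (
    id INTEGER PRIMARY KEY,
    participant_address TEXT NOT NULL UNIQUE
  );

  -- One-time backfill of the addresses already known
  INSERT INTO participants (participant_address)
    SELECT DISTINCT participant_address
    FROM daily_scheduled_rewards
    ORDER BY participant_address;
""")

ids_before = dict(conn.execute("SELECT participant_address, id FROM participants"))

# Data arriving later never renumbers existing participants
conn.execute("INSERT INTO participants (participant_address) VALUES ('0x000')")
ids_after = dict(conn.execute("SELECT participant_address, id FROM participants"))

print(ids_before)  # ids assigned once...
print(ids_after)   # ...and unchanged after a new address is added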

Member:

When I was giving this a quick look, I was also concerned about using row number here. But I was not sure whether I understood enough about the situation to leave a comment about it.

My understanding is that this is a one-time migration, so we don't need to worry about the query behaving consistently - as long as it takes care of assigning unique ids once, and as long as any future rows will have ids not conflicting with these ones.

Would https://stackoverflow.com/a/78355281 be safer?

Author:

In the previous solution, I aimed to minimize schema changes. However, if future queries rely on filtering conditions like WHERE day >= $2, using ROW_NUMBER() becomes risky because it dynamically assigns IDs based on the addresses visible to the query. Those IDs could shift whenever the dataset changes, for example when the filtering condition looks at recent records only, which isn't great for reliability.

I think a more robust solution, as you suggested, is to create a separate mapping table that assigns stable, efficiently retrievable IDs to participant_address values and to introduce a foreign key constraint to enforce consistency.

This approach would mean updating both daily_scheduled_rewards and daily_reward_transfers tables to work with these foreign keys, but it's worth it.

From a performance angle, there's a clear win here too. ROW_NUMBER() has to sort the whole table every time the query runs (O(n log n)), while looking up an ID in a mapping table backed by a unique index is a single B-tree probe per row (O(log n), effectively constant in practice).
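The lookup side of this plan could be sketched as follows, again with stdlib sqlite3; `get_or_create_participant_id` is a hypothetical helper for illustration, not code from this PR. In PostgreSQL the equivalent would be `INSERT ... ON CONFLICT DO NOTHING` followed by a `SELECT`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
  CREATE TABLE participants (
    id INTEGER PRIMARY KEY,
    participant_address TEXT NOT NULL UNIQUE
  )
""")

def get_or_create_participant_id(conn, address):
    """Return the stable id for an address, assigning a new one on first sight.
    The UNIQUE constraint backs the lookup with an index, so the SELECT is an
    index probe rather than a re-sort of the whole table."""
    conn.execute(
        "INSERT OR IGNORE INTO participants (participant_address) VALUES (?)",
        (address,),
    )
    (pid,) = conn.execute(
        "SELECT id FROM participants WHERE participant_address = ?",
        (address,),
    ).fetchone()
    return pid

first = get_or_create_participant_id(conn, '0xaaa')
get_or_create_participant_id(conn, '0xbbb')  # more participants arrive later
again = get_or_create_participant_id(conn, '0xaaa')
print(first, again)  # same id both times
```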

Happy to walk through the migration steps if you want to give it a shot!
@juliangruber, @bajtos

Member:

The proposed solution sounds great to me, but let's wait for a confirmation from @bajtos

Member:

ping @bajtos

Member:

Sorry for the delay, I was on vacation and it took me a while to process all notifications.

I think a more robust solution, as you suggested, is to create a separate mapping table that assigns stable, efficiently retrievable IDs to participant_address values and to introduce a foreign key constraint to enforce consistency.

This approach would mean updating both daily_scheduled_rewards and daily_reward_transfers tables to work with these foreign keys, but it's worth it.

I agree with this plan 👍🏻

UPDATE daily_scheduled_rewards dsr
SET participant_id = am.generated_id
FROM address_mapping am
WHERE dsr.participant_address = am.address;

-- Backfill participant_id in daily_reward_transfers using the same mapping
WITH address_mapping AS (
SELECT DISTINCT participant_address AS address,
ROW_NUMBER() OVER (ORDER BY participant_address) AS generated_id
FROM daily_scheduled_rewards
)
UPDATE daily_reward_transfers drt
SET participant_id = am.generated_id
FROM address_mapping am
WHERE drt.to_address = am.address;

-- Create indexes for better performance
CREATE INDEX idx_daily_scheduled_rewards_pid_day ON daily_scheduled_rewards (participant_id, day DESC);
CREATE INDEX idx_daily_reward_transfers_pid_day ON daily_reward_transfers (participant_id, day);

23 changes: 16 additions & 7 deletions stats/lib/platform-stats-fetchers.js
@@ -131,25 +131,34 @@ export const fetchAccumulativeDailyParticipantCount = async (pgPool, filter) =>
  */
 export const fetchTopEarningParticipants = async (pgPool, filter) => {
   // The query combines "transfers until filter.to" with "latest scheduled rewards as of today".
   // As a result, it produces incorrect result if `to` is different from `now()`.
   // See https://github.com/filecoin-station/spark-stats/pull/170#discussion_r1664080395
   assert(filter.to === today(), 400, 'filter.to must be today, other values are not supported')
 
   const { rows } = await pgPool.query(`
     WITH latest_scheduled_rewards AS (
-      SELECT DISTINCT ON (participant_address) participant_address, scheduled_rewards
-      FROM daily_scheduled_rewards
-      ORDER BY participant_address, day DESC
+      SELECT participant_id, scheduled_rewards, participant_address
+      FROM (
+        SELECT participant_id, scheduled_rewards, participant_address,
+               ROW_NUMBER() OVER (PARTITION BY participant_id ORDER BY day DESC) AS rn
+        FROM daily_scheduled_rewards
+        WHERE day <= $2
+      ) ranked
+      WHERE rn = 1
     )
     SELECT
+      COALESCE(drt.participant_id, lsr.participant_id) AS participant_id,
       COALESCE(drt.to_address, lsr.participant_address) as participant_address,
       COALESCE(SUM(drt.amount), 0) + COALESCE(lsr.scheduled_rewards, 0) as total_rewards
     FROM daily_reward_transfers drt
     FULL OUTER JOIN latest_scheduled_rewards lsr
-      ON drt.to_address = lsr.participant_address
+      ON drt.participant_id = lsr.participant_id
     WHERE (drt.day >= $1 AND drt.day <= $2) OR drt.day IS NULL
-    GROUP BY COALESCE(drt.to_address, lsr.participant_address), lsr.scheduled_rewards
+    GROUP BY
+      COALESCE(drt.participant_id, lsr.participant_id),
+      COALESCE(drt.to_address, lsr.participant_address),
+      lsr.scheduled_rewards
     ORDER BY total_rewards DESC
   `, [filter.from, filter.to])
 
   return rows
 }
