
Optimize DB Schema & Query for Top-Earning Leaderboard #340


Open · wants to merge 17 commits into main from Optimise-Leaderboard-FIL-Earned

Conversation

Hany-Almnaem

Refactor DB schema: Add participant_id & foreign keys for performance

Changes Made:

  • Added address_mapping table and migrations.
  • Updated daily_scheduled_rewards and daily_reward_transfers to reference participant_id.
  • Moved to a stable, ID-based lookup approach (sketched after the migration list below).

Performance:

EXPLAIN ANALYZE (query-plan output collapsed in the original thread)
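For reference, a comparison like this can be reproduced with a query-plan check. The sketch below is an assumption, not the PR's exact code: it supposes the leaderboard aggregates daily_reward_transfers per participant through the new mapping table, with names following the migrations listed under Migration Steps.

import pg from 'pg'

const client = new pg.Client({ connectionString: process.env.DATABASE_URL })
await client.connect()

// Ask Postgres to plan and execute the leaderboard query, reporting timings.
const { rows } = await client.query(`
  EXPLAIN ANALYZE
  SELECT m.participant_address, SUM(t.amount) AS total_earned
  FROM daily_reward_transfers t
  JOIN address_mapping m ON m.id = t.participant_id
  GROUP BY m.participant_address
  ORDER BY total_earned DESC
  LIMIT 100
`)
console.log(rows.map(r => r['QUERY PLAN']).join('\n'))
await client.end()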

Migration Steps:
new migration files/
├── ...
├── 008.do.create-address-mapping-table.sql
├── 009.do.backfill-address-mapping.sql
├── 010.do.add-participant_id-columns.sql
├── 011.do.populate-participant-ids.sql
├── 012.do.add-foreign-keys-and-indexes.sql
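For illustration, the gist of migrations 008 and 010 might look roughly like this. This is a sketch only: the real files are plain SQL executed by the migration runner, and the exact identifiers and column types are assumptions based on the file names above.

// 008.do.create-address-mapping-table.sql (sketch, types assumed)
await pgPools.stats.query(`
  CREATE TABLE address_mapping (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    participant_address TEXT NOT NULL UNIQUE
  )
`)

// 010.do.add-participant_id-columns.sql (sketch)
await pgPools.stats.query(`
  ALTER TABLE daily_scheduled_rewards ADD COLUMN participant_id BIGINT;
  ALTER TABLE daily_reward_transfers ADD COLUMN participant_id BIGINT;
`)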

Please review and let me know if any further changes are needed.

Closes CheckerNetwork/roadmap#178

Contributor

@pyropy pyropy left a comment

Great work, @Hany-Almnaem! 🚀 Thanks for your submission! 🙏🏻

I'm wondering if we should reduce the number of migrations by combining some of them into a single file. For example, we could merge new table creation with backfilling or group the changes to daily_scheduled_rewards and daily_reward_transfers tables along with their backfilling. What do you think?

Let's also wait for feedback from others, but overall, this is looking great! 👍🏻

@Hany-Almnaem
Author

@pyropy Thanks for your feedback!

> I'm wondering if we should reduce the number of migrations by combining some of them into a single file.

It's a great idea to reduce the number of migrations, but I recommend keeping each logical change in its own migration – it's usually easier to debug and revert that way. That said, combining steps is possible if it won't cause issues later. In general, small, incremental migrations are the safest bet.

@bajtos
Member

bajtos commented Mar 12, 2025

@Hany-Almnaem thank you for the pull request. To clarify: is this superseding your earlier PR #324? Can we close #324 now?

There are several failed CI checks; please take a look and fix them. (You should be able to reproduce them locally by running npm run test.)

Member

@bajtos bajtos left a comment

The high-level direction looks good to me 👍🏻

Let's discuss the implementation details now.

@bajtos
Member

bajtos commented Mar 12, 2025

@Hany-Almnaem I have a question about the performance.

  • In the previous PR, the current query takes 26ms to plan and 988ms to execute.
  • In the previous PR, the new query takes 2ms to plan and 580ms to execute.
  • In this PR, the new query takes 44ms to plan and 1327ms to execute.

That looks like a step in the wrong direction to me. We want to improve the performance of this query, not make it worse.

@Hany-Almnaem
Author

@bajtos You're right.
The increased execution time is likely due to the query modifications; the main factor could be inefficient index usage.
I'll review the indexing strategy, investigate further, and update the PR with improvements.

@Hany-Almnaem
Author

Supersedes Earlier PR #324

Summary of Changes

  1. Unified Table Definitions

Aligned the schema with spark-evaluate to keep everything consistent and reduce confusion.

  2. Combined Migrations and Clear Documentation

Merged multiple smaller migrations into one, adding detailed comments to explain each step and keep the migration history concise.

  3. Removed Old Columns

Dropped legacy columns that are no longer needed after the new schema changes.

  4. Added Composite Indexes

Added indexes for improved query performance when filtering by day and address.

  5. Updated Test Cases

Adjusted existing tests to match the new schema, ensuring they accurately reflect the latest logic and structures.

  6. Optimized Query Logic in API Fetchers

Switched to using participant IDs instead of participant addresses in joins, and leveraged the new indexes for faster lookups (see the sketch below).

With these changes, we achieve a more efficient database schema, better performance on large datasets, and clearer migration steps.
New EXPLAIN ANALYZE (output collapsed in the original thread)
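For concreteness, here is a sketch of what items 4 and 6 amount to. The index definition, function name, and query shape are assumptions for illustration, not the PR's exact code.

// Composite index from the migrations (names assumed):
//   CREATE INDEX daily_reward_transfers_day_participant_idx
//     ON daily_reward_transfers (day, participant_id);

// Leaderboard fetcher joining on the integer participant_id
// instead of comparing address strings.
export const fetchTopEarningParticipants = async (pgPool, from, to) => {
  const { rows } = await pgPool.query(`
    SELECT p.participant_address, SUM(t.amount) AS total_earned
    FROM daily_reward_transfers t
    JOIN participants p ON p.id = t.participant_id
    WHERE t.day >= $1 AND t.day <= $2
    GROUP BY p.participant_address
    ORDER BY total_earned DESC
  `, [from, to])
  return rows
}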

Feel free to review and let me know if anything needs further adjustment.

@bajtos bajtos requested review from bajtos and pyropy March 20, 2025 08:01
Member

@bajtos bajtos left a comment

Great progress!

Member

@bajtos bajtos left a comment

I cleaned up the first part of the patch, see the commits above. I need to take a closer look at the second part.

Comment on lines 118 to 157
{ args: { to: 'address1', amount: 250 }, blockNumber: 2000 }
{ args: { to: 'address1', amount: 150 }, blockNumber: 2000 }
Member

Ditto, let's revert.

@Hany-Almnaem force-pushed the Optimise-Leaderboard-FIL-Earned branch from 490af82 to 4081ba7 on April 7, 2025 at 11:47
@bajtos
Member

bajtos commented Apr 11, 2025


@Hany-Almnaem please don't force-push to pull requests; it makes it more difficult to incrementally review only what's changed since the last review. We use merge commits to bring new changes from the main branch: git merge main or the [Update branch] button in the GitHub UI.

Comment on lines 380 to 384
await pgPools.stats.query(`
INSERT INTO participants (id, participant_address)
VALUES (1, '0x20'), (2, '0x00')
`)
})
Member

We need to use mapParticipantsToIds instead of hard-coded participant ids.
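For example, assuming the helper's signature matches the spark-evaluate version, taking a database client and a Set of addresses and returning a Map from address to id, the setup could become:

// Import path is an assumption for illustration.
import { mapParticipantsToIds } from '@filecoin-station/spark-stats-db'

// Map the addresses to ids (inserting missing ones) instead of hard-coding them.
const idMap = await mapParticipantsToIds(pgPools.stats, new Set(['0x20', '0x00']))
const id1 = idMap.get('0x20')
const id2 = idMap.get('0x00')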

Member

I recommend creating a helper that will ensure the address is mapped to an ID and then call the sql query to insert into daily_scheduled_rewards.

Comment on lines 216 to 225
await pgPools.stats.query(`
INSERT INTO participants (id, participant_address)
VALUES
(1, 'to1'),
(2, 'to2'),
(3, 'to3'),
(4, 'address1'),
(5, 'address2'),
(6, 'address3')
`)
Member

Same here - use mapParticipantsToIds

Member

Do we actually need to prepare this mapping in advance? I would expect givenDailyRewardTransferMetrics to take care for that.

Comment on lines 299 to 308
await pgPools.stats.query(`
INSERT INTO participants (id, participant_address)
VALUES
(1, 'to1'),
(2, 'to2'),
(3, 'to3'),
(4, 'address1'),
(5, 'address2'),
(6, 'address3')
`)
Member

Ditto.

I recommend creating a helper that will update both participants and daily_scheduled_rewards tables, e.g. givenDailyScheduledRewards(day, participantAddress, scheduledRewards).
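A sketch of that helper, with the column names, import path, and the mapper's behaviour all assumed:

// Hypothetical test helper: ensures the address is mapped to a participant id,
// then inserts the scheduled-rewards row keyed by that id.
import { mapParticipantsToIds } from '@filecoin-station/spark-stats-db' // path assumed

export const givenDailyScheduledRewards = async (pgPool, day, participantAddress, scheduledRewards) => {
  const idMap = await mapParticipantsToIds(pgPool, new Set([participantAddress]))
  await pgPool.query(`
    INSERT INTO daily_scheduled_rewards (day, participant_id, scheduled_rewards)
    VALUES ($1, $2, $3)
  `, [day, idMap.get(participantAddress), scheduledRewards])
}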

Comment on lines 565 to 572
const participantResult = await pgPoolStats.query(
'SELECT id FROM participants WHERE participant_address = $1',
[transfer.toAddress]
)

if (participantResult.rows.length === 0) {
throw new Error(`Participant address ${transfer.toAddress} not found`)
}
Member

Let's use mapParticipantsToIds to map new addresses to participant ids, so that we don't have to do it manually in tests.
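Roughly, assuming the mapper inserts unknown addresses and returns their ids:

// Replace the manual SELECT above with the shared mapper.
const idMap = await mapParticipantsToIds(pgPoolStats, new Set([transfer.toAddress]))
const participantId = idMap.get(transfer.toAddress)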

Member

@bajtos bajtos left a comment

Thank you for the updates, @Hany-Almnaem. I appreciate your perseverance!

I did a bit of cleanup in 2475264 to speed things up.

I identified three more areas in tests that need improving to use mapParticipantsToIds, please take a look at my comments above.

@Hany-Almnaem
Author

@bajtos Thanks for your earlier feedback and the cleanup. Much appreciated!

I noticed the opportunity to unify the helper functions, but I decided to keep them separate for now to preserve clarity between different test contexts. Just wanted to call that out in case it comes up; happy to refactor further if needed.

Member

@bajtos bajtos left a comment

Can you please fix the linting errors?


[day, id, amount, lastCheckedBlock],
);
};
export { mapParticipantsToIds } from "../observer/lib/map-participants-to-ids.js";
Member

I find it confusing to have two functions called mapParticipantsToIds. Can we find a way to discriminate between the function to map participants in spark_evaluate database and the function to map participants in spark_stats database?

Author

Hey @bajtos, thanks for flagging that.
In the last commit, I’ve now stopped importing mapParticipantsToIds directly in tests and wrapped each use-case in its own helper:

givenDailyParticipants still uses the evaluate version under the hood

givenRewardTransfer & givenScheduledRewards wrap the stats version

Tests only pull in those descriptive helpers, and only the stats-specific mapper is re-exported from spark-stats-db/test-helpers.js. That way, there’s no longer any ambiguity about which database we’re mapping against. Let me know if you’d prefer alternate names or locations.
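For example, one way the remaining re-export could be made unambiguous (a sketch; the rename is an illustration, not necessarily what the PR does):

// spark-stats-db/test-helpers.js
// Re-export the stats-side mapper under a database-specific name so it can
// never be confused with the spark-evaluate mapper of the same name.
export { mapParticipantsToIds as mapStatsParticipantsToIds }
  from '../observer/lib/map-participants-to-ids.js'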

/** @type {import('@filecoin-station/spark-stats-db').PgPools} */
let pgPools
let pgPools;
Member

This file contains a lot of unrelated changes, can you please revert them?

I think npm run lint:fix may be all you need.


Successfully merging this pull request may close these issues.

Station Public Dashboard - optimise "Leaderboard - FIL Earned"