MTG-703 Adding peer to peer consistency checks #316
snorochevskiy wants to merge 6 commits into main from
Conversation
```rust
pub account_pubkey: Pubkey,
pub slot: u64,
pub write_version: u64,
pub data_hash: u64,
```
Since we are tracking not only slot+write_version for accounts but also data_hash, we need to discuss the consensus mechanism.
If I understood correctly, the current implementation will identify accounts with the same slot+write_version but different data_hashes as different updates and will try to synchronize them between nodes. However, the current account processing mechanism will not allow updating existing accounts with the same slot+write_version. So nodes will try to receive these "updates" from each other and send them to each other, but will not process them correctly.
Maybe this sounds like a separate task, but we need to discuss which approach we think is better.
This is purely for account NFTs: for accounts we don't have fork cleaning, so we cannot rely on the slot. That's why we ignore slots and take only pubkey+write_version+data_hash when we calculate the account changes checksum, and when we compare a peer's account change with the local one.
If the slot is different but the data_hash is the same, we assume it is the same account change.
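A minimal sketch of that comparison rule, with hypothetical struct and function names (the real record type is the one quoted above):

```rust
/// Hypothetical mirror of the change record quoted above.
struct AccChangeRecord {
    account_pubkey: [u8; 32], // Pubkey bytes
    slot: u64,
    write_version: u64,
    data_hash: u64,
}

/// Two account changes are considered the same when pubkey, write_version
/// and data_hash match; the slot is intentionally ignored because account
/// updates are not protected by the fork cleaner.
fn is_same_account_change(a: &AccChangeRecord, b: &AccChangeRecord) -> bool {
    a.account_pubkey == b.account_pubkey
        && a.write_version == b.write_version
        && a.data_hash == b.data_hash
}
```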
For the consensus part - that's for us to define. Essentially, we should expect the following (a rough decision sketch follows the list):
- same write_version, same slot, same hash - all good
- same write_version, same slot, different hash - we need the consensus to define what the correct version is. I'd suggest we ask the RPC for the current state (if it has not changed yet, of course)
- same write_version, different slot, same hash - all good, take the highest slot on merge, probably
- same write_version, different slot, different hashes - looks like a fork, take the highest slot on merge
- different write_version, same/different slot, same/different hash - take the highest write_version
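A rough decision-table sketch of those rules, with hypothetical names; the triples stand for (slot, write_version, data_hash):

```rust
enum MergeDecision {
    Keep,                    // states already agree
    AskRpc,                  // same slot + write_version but different hash
    TakeHighestSlot,         // same write_version, diverging slots (possible fork)
    TakeHighestWriteVersion, // write_versions differ
}

/// `local` and `peer` are (slot, write_version, data_hash) triples.
fn decide(local: (u64, u64, u64), peer: (u64, u64, u64)) -> MergeDecision {
    let ((l_slot, l_wv, l_hash), (p_slot, p_wv, p_hash)) = (local, peer);
    if l_wv != p_wv {
        MergeDecision::TakeHighestWriteVersion
    } else if l_slot == p_slot {
        if l_hash == p_hash {
            MergeDecision::Keep
        } else {
            MergeDecision::AskRpc
        }
    } else {
        // Different slot with the same write_version: take the highest slot,
        // regardless of whether the hashes match.
        MergeDecision::TakeHighestSlot
    }
}
```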
```rust
pub account_pubkey: Pubkey,
pub slot: u64,
pub write_version: u64,
pub data_hash: u64,
```
Also, we do not track data_hashes for the transaction-based protocol. Maybe we need them here as well? I think it makes sense, because it would give us additional resistance to Solana forks.
IMO, we won't gain anything by adding data_hash to bubblegum, since we already have the fork cleaner and sequence consistency checks for it.
```rust
slot_updated: account_update.slot as i64,
amount: ta.amount as i64,
write_version: account_update.write_version,
data_hash: calc_solana_account_data_hash(&account_update.data),
```
Could we approximately estimate the performance impact of calculating hashes "in place" for each account? We have many account updates, a much larger volume than we have for transactions, so it would be great to understand whether this increases account processing time.
According to my tests, it takes ~9 microseconds to calculate the hash of 1 KB of data, which looks acceptable.
(In any case, xxhash is one of the most performant hashing algorithms.)
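For reference, a back-of-the-envelope measurement of that kind could look like the sketch below. It assumes the xxhash-rust crate with the xxh3 feature; the actual calc_solana_account_data_hash implementation may differ.

```rust
use std::time::Instant;
use xxhash_rust::xxh3::xxh3_64;

fn main() {
    let data = vec![0u8; 1024]; // 1 KB of fake account data
    let iterations: u32 = 100_000;

    let start = Instant::now();
    let mut acc = 0u64;
    for _ in 0..iterations {
        // wrapping_add keeps the compiler from optimizing the hashing away
        acc = acc.wrapping_add(xxh3_64(&data));
    }
    let per_hash = start.elapsed() / iterations;
    println!("avg time per 1 KB hash: {per_hash:?} (checksum: {acc})");
}
```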
```rust
    metrics_state.checksum_calculation_metrics.clone(),
);

if let Some(peer_urls_file) = config.peer_urls_file.as_ref() {
```
Do we need to add a mount for this file in docker compose?
Probably, we do...
```rust
let mut missing_bbgm_changes: HashMap<BbgmChangeRecord, HashSet<usize>> = HashMap::new();
let trusted_peers = peers_provider.list_trusted_peers().await;

for (peer_ind, trusted_peer) in trusted_peers.iter().enumerate() {
```
Maybe we should try to make this process more async? There is a lot of I/O with other peers here, so maybe it makes sense to spawn a separate task for communicating with each peer?
This could be a good improvement in the future, but for "phase 1" I'd prefer to run it sequentially and collect CPU and disk metrics to understand whether we have spare bandwidth for parallelization.
Otherwise we could easily fall into a scenario where we stress the disk with a bunch of concurrent peer-to-peer consistency checks, and as a result the indexing part lacks the resources to fulfill its main duty.
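If the parallel variant does happen later, a sketch of the shape it could take (hypothetical function; the per-peer body is elided):

```rust
use tokio::task::JoinSet;

/// Spawn one task per trusted peer and wait for all of them,
/// so a slow or failing peer does not block the others.
async fn check_peers_concurrently(peer_urls: Vec<String>) {
    let mut tasks = JoinSet::new();
    for url in peer_urls {
        tasks.spawn(async move {
            // ... open a client to `url`, compare checksums, collect missing changes ...
            url
        });
    }
    while let Some(joined) = tasks.join_next().await {
        if let Err(e) = joined {
            tracing::warn!("peer consistency task failed: {e}");
        }
    }
}
```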
```rust
    .chain(ge_cmp_res.different.iter())
    .map(|&a| a.tree_pubkey)
    .collect::<Vec<_>>();
for tree_pk in ge_trees_to_check {
```
Maybe processing each grand epoch also deserves to be spawned in a separate task, because there is a lot of I/O here too.
Same here: a potentially good improvement for the future.
```rust
}

#[allow(clippy::while_let_on_iterator)]
async fn handle_missing_bbgm_changes(
```
Point to discuss: the consensus mechanism.
If we have many hosts in the network, we need to think about not just fetching updates from every host, but finding consensus between them and rejecting updates from hosts that contradict the majority of the network.
Right, this could be a good thing for interaction with non-trusted peers, but we've decided not to make it part of the initial implementation.
```rust
    clients.get_mut(peer_ind).unwrap()
};
if let Ok(block) = client
    .get_block(change.slot, Option::<Arc<grpc::client::Client>>::None)
```
We are fetching the whole block without indicating which trees we want to receive. In the future we need to add methods for syncing specific trees.
```rust
    .db_bubblegum_get_grand_epochs_latency
    .observe(start.elapsed().as_secs_f64());

let ge_cmp_res = cmp(&my_ge_chksms, &peer_ge_chksms);
```
Maybe we need to add a config option describing which trees we are indexing. Because for now, if I understood right, we may mark trees that we do not want to index as missing.
```rust
loop {
    let calc_msg = tokio::select! {
        msg = rcv.recv() => msg,
        _ = shutdown_signal.recv() => {
```
nit: since this runs inside a plain tokio task that is not tied to any JoinSet, the shutdown_signal is not really needed here, because we do not wait for this task to complete anywhere.
```rust
pub fn solana_change_info(&self) -> (Pubkey, u64, u64, u64) {
    let (slot, write_version, data_hash) = match &self.account {
        UnprocessedAccount::MetadataInfo(v) => (v.slot_updated, v.write_version, v.data_hash),
        UnprocessedAccount::Token(v) => (v.slot_updated as u64, v.write_version, v.data_hash),
```
Are we OK with converting i64 to u64? If it cannot be negative, why does TokenAccount store it as i64 in the first place? Some kind of restriction from the DB?
```proto
rpc GetAccsInBucket(GetAccReq) returns (AccList);

rpc ProposeMissingAccChanges(AccList) returns (google.protobuf.Empty);
}
```
```rust
pub async fn connect(peer_discovery: impl PeerDiscovery) -> Result<Self, GrpcError> {
    let url = Uri::from_str(peer_discovery.get_gapfiller_peer_addr().as_str())
        .map_err(|e| GrpcError::UriCreate(e.to_string()))?;
    Client::connect_to_url(peer_discovery.get_gapfiller_peer_addr().as_str()).await
```
```diff
- Client::connect_to_url(peer_discovery.get_gapfiller_peer_addr().as_str()).await
+ Client::connect_to_url(&peer_discovery.get_gapfiller_peer_addr()).await
```
Just a preference of style, feel free to ignore.
```rust
/// Interface for querying bubblegum checksums from peer
/// or local storage.
#[async_trait]
```
Isn't async trait stabilized?
Yes, but #[async_trait] is still required to build a trait object.
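For context: native async fn in traits is stable since Rust 1.75, but such traits are not dyn-compatible yet, so boxing still needs the macro. A tiny illustration with a hypothetical trait name:

```rust
use async_trait::async_trait;

#[async_trait]
trait ChecksumSource {
    async fn get_checksum(&self, key: u64) -> Option<[u8; 32]>;
}

// The macro desugars the method into one returning Pin<Box<dyn Future + Send>>,
// which is what makes a boxed trait object like this possible.
type DynChecksumSource = Box<dyn ChecksumSource + Send + Sync>;
```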
```rust
    pub found_missing_accounts: Gauge,
}

impl Default for Peer2PeerConsistencyMetricsConfig {
```
It seems to me this Default has no use, since new() can be freely called instead. It feels like only one of them should survive, IMO.
From my perspective, new() with no parameters = default (as long as the function doesn't provoke side effects).
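One common way to resolve this (a sketch, not necessarily what the PR ends up doing) is to keep a single construction path and derive the other from it:

```rust
pub struct Peer2PeerConsistencyMetricsConfig {
    // histograms, counters, gauges ...
}

impl Peer2PeerConsistencyMetricsConfig {
    pub fn new() -> Self {
        Self {
            // build the metrics here
        }
    }
}

// Default simply delegates to new(), so the two can never diverge
// (this is also what clippy's new_without_default lint nudges towards).
impl Default for Peer2PeerConsistencyMetricsConfig {
    fn default() -> Self {
        Self::new()
    }
}
```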
metrics_utils/src/lib.rs
```rust
impl Peer2PeerConsistencyMetricsConfig {
    pub fn new() -> Peer2PeerConsistencyMetricsConfig {
        let mk_histogram = || Histogram::new(exponential_buckets(20.0, 1.8, 10));
```
Those numbers are slightly magical. What do they mean?
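Assuming these are prometheus-client's exponential_buckets(start, factor, length) semantics, the three numbers expand to ten bucket upper bounds starting at 20 and growing 1.8x per step:

```rust
use prometheus_client::metrics::histogram::{exponential_buckets, Histogram};

fn mk_histogram() -> Histogram {
    // Approximate upper bounds: 20, 36, 65, 117, 210, 378, 680, 1224, 2204, 3967
    // (in whatever unit is passed to observe()), i.e. a bit over two orders of
    // magnitude of resolution.
    Histogram::new(exponential_buckets(20.0, 1.8, 10))
}
```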
```rust
/// Type of checksum for bubblegum epochs and account NFT buckets.
/// It is technically a SHA3 hash.
pub type Chksm = [u8; 32];
```
I thought the type above was for this:
```diff
- pub type Chksm = [u8; 32];
+ pub checksum: Option<Chksm>,
```
`pub type Chksm = [u8; 32];` is just an alias that should make a potential future change of the checksum type easier (which will never happen).
I guess it might be more convenient/idiomatic to use a newtype instead of the alias?
I believe it would just introduce a hell of wrap and unwrap calls 🥲
Deref to the rescue?
Just thinking aloud though, not a call to action.
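For illustration, the newtype variant being discussed might look like the sketch below; Deref gives back most of the ergonomics of the plain array:

```rust
use std::ops::Deref;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct Chksm([u8; 32]);

impl From<[u8; 32]> for Chksm {
    fn from(bytes: [u8; 32]) -> Self {
        Chksm(bytes)
    }
}

// Deref lets callers use &Chksm wherever &[u8; 32] is expected,
// which removes most of the explicit wrap/unwrap noise.
impl Deref for Chksm {
    type Target = [u8; 32];
    fn deref(&self) -> &Self::Target {
        &self.0
    }
}
```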
metrics_utils/src/lib.rs
```rust
pub db_bubblegum_get_grand_epochs_latency: Histogram,
pub db_bubblegum_get_epochs_latency: Histogram,
pub db_bubblegum_get_changes_latency: Histogram,
pub db_account_get_grand_buckets_latency: Histogram,
pub db_account_get_buckets_latency: Histogram,
pub db_account_get_latests_latency: Histogram,
```
Why not a single Family<MetricLabel, Histogram>?
I usually prefer to have a set of plain metrics instead of one labeled family, because it is much easier to query them from other monitoring systems. But sure, I can turn these into a family. Should I?
Yes, please. With our stack of Prometheus + Grafana, the primary flow is to put the metrics onto a dashboard and have some simple alerts. With every new metric added there is an increased chance it won't be added at all, because it requires a dedicated query for every chart and every alert rule. These metrics are super generic; reusing the existing RED approach is even more favorable. If it doesn't fit into RED, a dedicated family is the next best choice. Please don't leave us with the need to create multiple charts to monitor every request. Metrics should be kept as simple as possible.
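A sketch of the single-family shape being asked for, using the prometheus-client crate; the label set here is hypothetical:

```rust
use prometheus_client::encoding::EncodeLabelSet;
use prometheus_client::metrics::family::Family;
use prometheus_client::metrics::histogram::{exponential_buckets, Histogram};

#[derive(Clone, Debug, Hash, PartialEq, Eq, EncodeLabelSet)]
pub struct ConsistencyCheckLabels {
    pub protocol: String, // "bubblegum" | "account"
    pub method: String,   // "get_grand_epochs" | "get_epochs" | "get_changes" | ...
}

pub fn db_consistency_latency() -> Family<ConsistencyCheckLabels, Histogram> {
    // Every label combination gets its own histogram with the same buckets,
    // but the dashboard only needs a single metric name to query.
    let constructor: fn() -> Histogram = || Histogram::new(exponential_buckets(20.0, 1.8, 10));
    Family::new_with_constructor(constructor)
}
```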
metrics_utils/src/lib.rs
```rust
pub peers_bubblegum_get_grand_epochs_for_tree_errors: Family<MetricLabel, Counter>,
pub peers_bubblegum_get_grand_epochs_errors: Family<MetricLabel, Counter>,
pub peers_bubblegum_get_epochs_errors: Family<MetricLabel, Counter>,
pub peers_bubblegum_get_changes_errors: Family<MetricLabel, Counter>,
```
This looks more like an added label to me: peers_sync_latency(protocol: "bubblegum/account", method/endpoint: "get_grand_epochs/get_epochs/get_changes").
```rust
// prepare
let tree1 = Pubkey::new_unique();

// This change is for epoch we won't calculate in the test,
```
The comment is not valid in this context.
```rust
    .put(k1_2.clone(), v1_2.clone())
    .unwrap();

// This will be also ignored
```
StanChe left a comment:
Great work, thank you. Several open questions need clarification and the metrics should be simplified/moved to more appropriate measuring places.
```rust
// Verify account last state updated
let latest_acc1_key = AccountNftKey::new(acc1_pubkey);
let latest_acc1_val = storage
    .acc_nft_last
```
acc_nft_last holds the last calculated epoch value?
It is the last seen change of the account with the given pubkey.
```rust
#[async_trait::async_trait]
impl AuraPeersProvides for FileSrcAuraPeersProvides {
    async fn list_trusted_peers(&self) -> Vec<String> {
```
Why not a list of URLs directly? Those are parsed every time anyway.
That's to make it possible to change the list of peers without restarting the application.
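A hedged sketch of that behavior, assuming a simple newline-separated peers file and a hypothetical file_path field (the real implementation may differ):

```rust
use async_trait::async_trait;

#[async_trait]
pub trait AuraPeersProvides {
    async fn list_trusted_peers(&self) -> Vec<String>;
}

pub struct FileSrcAuraPeersProvides {
    pub file_path: String,
}

#[async_trait]
impl AuraPeersProvides for FileSrcAuraPeersProvides {
    async fn list_trusted_peers(&self) -> Vec<String> {
        // Re-read the file on every call, so the peer list can be edited
        // without restarting the application.
        match tokio::fs::read_to_string(&self.file_path).await {
            Ok(body) => body
                .lines()
                .map(str::trim)
                .filter(|l| !l.is_empty())
                .map(str::to_owned)
                .collect(),
            Err(_) => Vec::new(),
        }
    }
}
```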
rocks-db/src/storage_consistency.rs
```rust
/// To prevent such inconsistency of a checksum, roght before the calulating,
/// we mark the epoch checksum to be calculated is "Calculating",
/// and after the checksum is calculated, we write this value only in case
/// if the previous value is still in "Calculated" state.
```
```diff
- /// if the previous value is still in "Calculated" state.
+ /// if the previous value is still in the same "Calculating" state.
```
rocks-db/src/storage_consistency.rs
```rust
/// if the previous value is still in "Calculated" state.
///
/// At the same time, when the Bubblegum updated processor receives
/// a new update with slot that epoch is from the previous epoch perioud,
```
```diff
- /// a new update with slot that epoch is from the previous epoch perioud,
+ /// a new update with slot that epoch is from the previous epoch period,
```
rocks-db/src/storage_consistency.rs
```rust
///
/// At the same time, when the Bubblegum updated processor receives
/// a new update with slot that epoch is from the previous epoch perioud,
/// it not only writed the bubblegum change, but also updated
```
```diff
- /// it not only writed the bubblegum change, but also updated
+ /// it not only writes the bubblegum change, but also updates
```
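A minimal sketch of the invalidation protocol described in the doc comments above, with hypothetical types: the calculator marks the epoch as Calculating and only commits its result if that marker is still intact, so a late bubblegum update touching the same epoch can invalidate the in-flight calculation.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum EpochChecksumState {
    Calculating,
    Invalidated,
    Calculated([u8; 32]),
}

/// Called by the checksum calculator after hashing the epoch's changes.
/// The result is kept only if nothing invalidated the epoch in the meantime.
fn commit_checksum(stored: &mut EpochChecksumState, computed: [u8; 32]) {
    if *stored == EpochChecksumState::Calculating {
        *stored = EpochChecksumState::Calculated(computed);
    }
}

/// Called by the bubblegum update processor when it writes a change whose
/// slot belongs to an epoch that is already being (or has been) calculated.
fn invalidate_epoch(stored: &mut EpochChecksumState) {
    *stored = EpochChecksumState::Invalidated;
}
```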
```rust
/// This flag is set to true before bubblegum epoch calculation is started,
/// and set to false after the calculation is finished.
static IS_CALCULATING_BBGM_EPOCH: AtomicI32 = AtomicI32::new(-1);
```
```rust
) {
    tracing::info!("Starting bubblegum changes peer-to-peer exchange for epoch={epoch}");
    while get_calculating_bbgm_epoch()
        .map(|e| e == epoch)
```
Can we end up calculating some other epoch - a previous one, or the next one?
Theoretically that should not happen, but I've changed it to compare the current epoch with the last calculated one.
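For context, a minimal sketch of how such an atomic marker can be read and written; the meaning of the -1 sentinel ("nothing is being calculated") is an assumption based on the initializer shown in the diff:

```rust
use std::sync::atomic::{AtomicI32, Ordering};

/// -1 means no bubblegum epoch checksum calculation is currently running.
static IS_CALCULATING_BBGM_EPOCH: AtomicI32 = AtomicI32::new(-1);

fn set_calculating_bbgm_epoch(epoch: u32) {
    IS_CALCULATING_BBGM_EPOCH.store(epoch as i32, Ordering::Relaxed);
}

fn get_calculating_bbgm_epoch() -> Option<u32> {
    let v = IS_CALCULATING_BBGM_EPOCH.load(Ordering::Relaxed);
    (v >= 0).then_some(v as u32)
}
```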
```rust
metrics
    .found_missing_bubblegums
    .set(changes_we_miss.len() as i64);
```
I'd suggest incrementing this rather than setting it. A set value will be way more flickery given the periodic nature of metrics collectors.
```rust
        return result;
    }
};
metrics
```
This metric collection should be tied closely to the actual I/O - the gRPC client in our case - not to the business-logic level.
This one measures how much time it takes on our side to fetch the data.
On the gRPC client side (BbgmConsistencyApiClientImpl and AccConsistencyApiClientImpl) there are separate metrics that measure how long the call to the peer takes.
This pull request contains the full implementation of peer-to-peer consistency checking and fetching of missing blocks for bubblegum and account NFTs.
It includes:
Design document:
https://github.com/metaplex-foundation/aura/wiki/Data-consistency-for-peer%E2%80%90to%E2%80%90peer-indexers