F3 passive testing #12287
Replies: 22 comments 1 reply
-
Note that the F3 implementation team will not deploy anything that initiates F3 in the mainnet environment until the vast majority of the network has stabilized from the network upgrade. The current plan is to bootstrap the initial F3 instance among a small number of nodes on Aug 7th/8th (>24 hours after the upgrade). We will monitor the network, adjust the deployment plan accordingly, and keep you updated on what to expect in this discussion!
-
## 🚀 F3 Passive Testing Kick-off — Aug 7th, 2024

We're excited to announce the soft launch of Fast Finality for Filecoin (F3) as part of FIP-0086! 🏎️ If you've been following our progress, you might already know that this has been in the works for some time. Fast Finality aims to dramatically reduce the finality time from 900 epochs to just around 2, a roughly 450× speed-up, ensuring that "finality" truly means finality. This means transactions will be completed reliably and quickly.

With the mandatory release of Lotus v1.28.1 and above, we're now able to conduct passive testing on the mainnet. This testing is a critical step toward fully implementing Fast Finality in the upcoming network upgrade (nv24) later this year. After the full rollout, we expect significant UX improvements for both token holders and dApp users. Additionally, F3 will enable the creation of trustless clients and bridges that can send messages over the network and verify finality without needing to run a full node, making it both efficient and cost-effective. To learn more about Fast Finality and its implications for bridging to other networks, check out this talk.

We're thrilled to take this step forward and look forward to your feedback as we move closer to full implementation.

### F3 Passive Testing

F3 has undergone extensive testing in both code simulation and on the Butterfly testnet. However, to ensure it's ready for mainnet production, we need to perform large-scale testing that mirrors the number of nodes on the mainnet. This will help verify the performance of F3, GPBFT, and the Lotus integration. Our goal is to execute this testing as passively as possible, minimizing disruptions. The F3 engineering team plans to start with a small group of F3 nodes to confirm everything works as expected. We will gradually increase the number of F3 participants, aiming to reach a scale close to mainnet production.
### Why Passive Testing?

The transition to F3 marks a significant advancement for Filecoin and its ecosystem, offering faster transaction speeds and enabling trustless bridges. However, this evolution introduces complexities in testing, verification, and ensuring confidence in the system's implementation across the network. Passive testing is essential for several reasons:
Passive testing allows the F3 engineering team to test a full-scale rollout on a real network without affecting consensus. This approach ensures that the testing does not disrupt existing network operations.

### Testing Plan Summary

A more thorough testing plan is laid out in this issue; in a nutshell:
It's important to note that none of the above affects Expected Consensus (EC) in nv23. F3 runs in the background and finalizes tipsets independently from EC for testing purposes, hence the term "passive."

### Observable Metrics for Monitoring

The F3 team has implemented several metrics to monitor node and consensus performance:
We will share updates and results with the community at least bi-weekly, if not more frequently. You can also monitor F3 metrics via the Prometheus debug metrics endpoint.

### Initial F3 Deployment

The F3 team is excited to announce the initial deployment of Fast Finality (F3) starting on August 7th. We will begin with 50 randomly selected participants, including some of our dedicated alpha testers from the community 💙. Once the deployment begins, you will start seeing F3 logs similar to the following:
This log indicates that F3 has successfully finalized a tipset.

### Adjusting Log Levels

If you wish to adjust the log level for F3 in Lotus or Lotus Miner, you can do so by running the following commands:
### Monitoring and Reporting

The F3 engineering team will be closely monitoring all participating nodes. If you notice any irregular behavior on your nodes, please report it in the #fil-fast-finality channel.

### Mainnet Verifications

Below is a list of verifications that will be performed on the mainnet during the initial F3 deployment:
We look forward to this exciting phase and appreciate your support and feedback as we work towards fully implementing F3 across the network.

### 🚢 Launch Time!

Fast Finality has been one of the most anticipated features for Filecoin participants since the network's inception. We are thrilled to collaborate with YOU to finally bring this feature to the Filecoin mainnet by the end of the year! We want to extend our gratitude to all community members participating in this testing phase. Special thanks go to our early testers (Slack handles: TippyFlits, Reiers, stuberman, beck) and @marco-storswift for their invaluable support. Stay tuned for updates in the #fil-fast-finality channel. Happy finalizing fast! 🏎️💨🏁
-
## Testing update - Aug 14th

In the past week…
### 🐛 Bandwidth Usage Spike

Lotus node operators reported an irregular traffic spike starting Wednesday morning, and we have confirmed it was caused by the F3 implementation. The excessive F3 bandwidth was caused by a routing loop in pubsub. The "manifest" server (used to facilitate testing) broadcasts a small message every 20 seconds, which shouldn't have introduced excessive load. Unfortunately, pubsub's routing-loop prevention mechanism appears to have been ineffective in this case, so each message cycled around the network over and over. We've fixed this issue by:
Note that even though bandwidth usage on nodes increased, we believe it did not impact node synchronization. However, we would also like to avoid any unexpected node performance degradation as much as possible. Therefore, we will release a Lotus patch during NA working hours (the fix is already available in #12390; we will merge it after more testing).

### Next round…

Next week, we aim to scale our testing to 500-1000 nodes. We will gradually increase the number of nodes participating in F3 and continue monitoring their performance throughout the process. One piece of good news is that we haven't encountered any issues so far that impact node synchronization or block production, and we will remain conservative with our testing to avoid any potential problems. If you notice anything irregular, please don't hesitate to reach out to us in #fil-fast-finality!
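For background on the routing-loop issue above: gossipsub normally suppresses loops by refusing to re-forward a message ID it has already seen. The following is a toy sketch of that seen-cache idea, not the libp2p implementation (which, among other things, expires entries after a TTL):

```go
package main

import "fmt"

// seenCache is a minimal sketch of the message de-duplication pubsub
// relies on to break routing loops: a node forwards a message only the
// first time it observes its ID. The real libp2p seen-cache also expires
// entries over time; that is omitted here for brevity.
type seenCache struct {
	seen map[string]bool
}

func newSeenCache() *seenCache {
	return &seenCache{seen: make(map[string]bool)}
}

// shouldForward records the message ID and reports whether this is the
// first time it has been observed.
func (c *seenCache) shouldForward(msgID string) bool {
	if c.seen[msgID] {
		return false // already seen once; drop to prevent a loop
	}
	c.seen[msgID] = true
	return true
}

func main() {
	c := newSeenCache()
	fmt.Println(c.shouldForward("manifest-001")) // first sighting: forward
	fmt.Println(c.shouldForward("manifest-001")) // duplicate: drop
}
```

When this suppression fails or a cache entry expires while copies are still circulating, a small 20-second broadcast can keep cycling around the network, which matches the symptom described above.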
-
## F3 (Fast Finality) passive testing update - 2024-09-05

### 🗂 F3 Readiness Review and Timeline Adjustment

With the proposed nv24 upgrade date rapidly approaching, we conducted a thorough review of the tasks necessary to complete our "Hardening and Mainnet Deployment Readiness" milestone. Our assessment revealed a time deficit of approximately three weeks. In light of this finding, we've taken two important steps:
This time extension is important for completing our "Hardening and Mainnet Deployment Readiness" milestone, ensuring we address all critical items in our backlog. The additional time will help us maintain quality and minimize risks as we approach this significant network upgrade.

### ⏮️ Since the last update: Progress over the past weeks

Let's dive into the specifics of the recent hardening, fixes, and testing we have done in the past weeks:

**Hardening and Fixes:**
**Testing Efforts:**
### ⏭️ Upcoming Week's Focus

**Hardening and Fixes:**
**Testing Efforts:**
Stay tuned for more updates in the #fil-fast-finality channel as we continue to harden F3 and move closer to the nv24 rollout! 💪
-
## F3 (Fast Finality) passive testing update - 2024-09-15

Hey all! 👋 Another week has passed, so here is another weekly update on the F3 passive testing and hardening efforts in preparation for the mainnet launch. As mentioned in last week's update, the team requested a time adjustment for the nv24 timeline which would allow us to complete all tasks in our "Hardening and Mainnet Deployment Readiness" milestone and complete critical items in our backlog. This timeline adjustment has now been accepted, and the nv24 timeline and F3 rollout now look like this:
### ⏮️ Since the last update: Progress over the past week

**Hardening and Fixes:**
**Testing Efforts:**
### ⏭️ Upcoming Week's Focus

**Hardening and Fixes:**
**Testing Efforts:**
Next week the F3 team will be co-locating. This focused sprint aims to complete the remaining tasks in Milestone 2: "Hardening and Mainnet Deployment Readiness". By working side-by-side, we expect to accelerate our progress and ensure we're fully prepared for the upcoming mainnet deployment 🚀. Additionally, we are preparing an operators' guide to F3 to help get the community ready for the F3 launch. This guide will cover key topics such as setup, configuration, and best practices. We aim to have it published by the end of next week. Stay tuned for more updates in the #fil-fast-finality channel!
-
Hey everyone! 👋 Here are some quick updates on the passive testing efforts:

### ⏮️ On Friday

The F3 team resumed work on the passive testing tools after the nv24 upgrade was completed. All node operators have now upgraded to the latest go-f3 version (v0.7.2), which includes the latest bug fixes and enhancements. After addressing some infrastructure tasks, the team deployed F3 passive testing to around 5 MinerIDs. The testing ran smoothly over the weekend without any issues.

### ⏮️ On Monday

We began increasing the number of participants in the passive testing, starting with around 100 MinerIDs, which bootstrapped successfully, and we let it run for about 1 hour. We then increased to 200 MinerIDs, which also bootstrapped successfully, and let it run for a couple of hours. During the 200-MinerID testing round, we observed a CPU spike on our observer node. The CPU profile dump indicated that the spike was caused by the SplitStore running compaction. Testing was paused around 20:00 UTC to allow the team to rest.

Additionally, we created redirects for our public F3 Grafana dashboards to make them easier to find. These redirects use a 308 status code, allowing us to change the backend URL without breaking previously announced links:

- Calibnet: https://grafana.f3.eng.filoz.org/public/calibnet

### ⏭️ Today's plan

We plan to investigate data from yesterday's testing rounds, focusing on fluctuations between different senders during various phases and rounds on our observer node. After analyzing yesterday's data, we aim to run a test with 600 MinerIDs (approximately 30% of the network).

### 📣 Other noteworthy callouts:
-
Hey everyone! 👋 Quick EOD update:

### ⏮️ What happened today (2024-11-26)

### 🟡 Known open issues (as of EOD 2024-11-26)

### ⏭️ Plan for tomorrow (2024-11-27)
-
Hey everyone! 👋 Quick update:

### ⏮️ What happened today (2024-11-27)

### 🟡 Known open issues (as of EOD 2024-11-27)

### ⏭️ Plan for tomorrow (2024-11-28)
-
Hey everyone! 👋 Update from today's passive testing round:

### ⏮️ What happened today (2024-11-28)

### 🟡 Known open issues (as of EOD 2024-11-28)

### ⏭️ Plan for tomorrow (2024-11-29)
-
Hey everyone! 👋 Update from Friday's passive testing round:

### ⏮️ What happened today (2024-11-29)

To close out today's (and the week's) update, here are some significant achievements of the week:

### 🟡 Known open issues (as of EOD 2024-11-29)
-
Hey all! 👋 Here is an update from today's passive testing round:

### ⏮️ What happened today (2024-12-02)

Some of the wins from today's passive testing:

### 🟡 Known open issues (as of EOD 2024-12-02)

### ⏭️ Plan for tomorrow (2024-12-03)
-
Hey all! 👋 Here is an update from today's passive testing round:

### ⏮️ What happened today (2024-12-03)

### ⏭️ Plan for tomorrow (2024-12-04)
-
Hey! 👋 Here is the update from today's passive testing rounds:

### ⏮️ What happened today (2024-12-05)

### 🟡 Known open issues (as of EOD 2024-12-05)

### ⏭️ Plan for tomorrow (2024-12-06)
-
Hey! 👋 Here is the update from Friday's (2024-12-06) passive testing rounds:

### ⏮️ What happened today (2024-12-06)

### 🟡 Known open issues (as of EOD 2024-12-06)

### ⏭️ Plan for Monday (2024-12-09)
-
Hey everyone! 👋 Here is an update from today's (2024-12-09) passive testing rounds:

### ⏮️ What happened today (2024-12-09)

### 🟡 Known open issues (as of EOD 2024-12-09)

### ⏭️ Plan for tomorrow (2024-12-10)
-
Hey everyone! 👋 Here is an update from today's (2024-12-10) passive testing rounds:

### ⏮️ What happened today (2024-12-10)

### 🟡 Known open issues (as of EOD 2024-12-10)

### ⏭️ Plan for Wednesday (2024-12-11)
-
Hey everyone! 👋 The first rounds of passive testing after the nv25 upgrade kicked off yesterday (2025-04-15). Here is a quick summary of the plans going forward and the results from the day:

### ⏮️ What happened on 2025-04-15

We started with round 49 at 20% scale (about 310 nodes), with the default F3 configuration and no power override. We quickly observed that the network was consistently deciding on base without progressing to the quality phase. Looking at the metrics, we noticed that the 99th-percentile committee fetch time was quite steep - around 5s on ArchiOz. After discussing, we identified that our configuration needed adjustment, specifically around power table handling. For round 50, we enabled

With these changes, we saw:
Given how well round 50 was performing at 20% scale, we went directly to round 51 at 50% scale (773 nodes), keeping all other parameters the same. However, at this scale we hit another roadblock - back to repeated base decisions. 😕 For round 52, we increased the quality timeout multiplier from 2 to 3, but still observed repeated base decisions. Monitoring the metrics more closely, we could see:
This suggested messages were getting dropped during the quality phase! We suspected the chain-exchange timestamp age might be a factor, so for round 53 we doubled it to 16s. Unfortunately, we still saw base decisions and started observing "queue full" errors from PubSub. Our hypothesis by the end of the day: we're dropping the initial burst of quality messages because the PubSub buffer size (currently 128) is insufficient at 50% network scale.

### 🟡 Known open issues (as of EOD 2025-04-15)
### ⏭️ Plan for 2025-04-16

⏰ Note: Testing will continue on 2025-04-16 @ 08:30 UTC
-
Hey everyone! 👋 Here is an update from the 2025-04-16 passive testing rounds:

### ⏮️ What happened on 2025-04-16

This network showed immediate improvement! 🎉 We observed:
Encouraged by these results, we moved to round 57, scaling up to 80% of the network (1236 participants). We kept the same config as in round 56 but shuffled participants by changing the seed in the explicit power selection. In this network, we observed base decisions dropping and the quorum of senders in the QUALITY phase decreasing. The pattern suggested our queue was too small relative to the processing velocity of GPBFT. For passive testing round 58, we doubled the validated non-partial message channel size to 512. These changes brought some improvements:
The big moment came with passive testing rounds 59 and 60, where we pushed to 100% network scale (1548 participants)! For round 60, we:
And the results were excellent: we had an almost perfect bootstrap phase, catching up to the head of the chain in just 10 instances. In the steady state we were consistently 5-6 epochs behind the head, and on top of that we only used about 1 MiB/s during both bootstrap and steady state. This is a huge improvement over our previous round of full-scale testing, where we consumed about 10× more bandwidth and had much longer catch-up times. Based on that, we decided to leave the network running overnight to gather more data on the steady state at 100% scale.

### 🟡 Known open issues (as of EOD 2025-04-16)
### ⏭️ Plan for tomorrow (2025-04-17)
The key achievement today: we successfully ran F3 at 100% network scale with excellent performance! 🎯
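To build intuition for the queue-sizing effect described above (the "queue full" drops at capacity 128 and the doubling of the validated message channel to 512), here is a toy Go model of a burst of messages hitting a bounded channel while the consumer is momentarily stalled. The burst size and capacities are illustrative, not taken from go-f3's internals:

```go
package main

import "fmt"

// deliverBurst simulates n messages arriving in a burst on a bounded
// channel of the given capacity while the consumer is stalled. Sends are
// non-blocking, so anything beyond the buffer is dropped -- the behaviour
// behind "queue full" errors.
func deliverBurst(n, capacity int) (delivered, dropped int) {
	ch := make(chan int, capacity)
	for i := 0; i < n; i++ {
		select {
		case ch <- i: // buffer has room
			delivered++
		default: // buffer full: message is dropped
			dropped++
		}
	}
	return delivered, dropped
}

func main() {
	// Illustrative numbers: a burst of 773 QUALITY messages (one per
	// participant at 50% scale) against the old and new buffer sizes.
	for _, capSize := range []int{128, 512} {
		d, x := deliverBurst(773, capSize)
		fmt.Printf("capacity=%d delivered=%d dropped=%d\n", capSize, d, x)
	}
}
```

With capacity 128, 645 of the 773 messages in the burst are lost; quadrupling the buffer to 512 cuts that to 261, and in the real system the consumer also drains the queue concurrently, so a larger buffer only needs to absorb the transient peak.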
-
Hey everyone! 👋 Here is an update from our weekend passive testing at full network scale 🎯:

Over the weekend we left testing round 62 running continuously. This round was at full scale, with evolving power from EC, meaning it was as close to a "real world" scenario as we can get on mainnet. The data looks promising, and the largest distance from head during the 3-day period was only 9 epochs.

We're starting the week by analyzing the data from this long-running network more deeply, to determine whether there are even better parameters we can use for F3, while keeping stability as our highest-priority requirement prior to activation. If all goes according to plan, we hope to begin discussions with the implementation teams about the activation date for F3.

### 🟡 Plan for the day, and known issues to investigate:
Thanks to everyone who has been involved in the F3 passive testing round during the weekend! 🙏
-
Hey everyone! 👋 Here is an update from our passive testing over the past few days, where we've been fine-tuning parameters for F3 at full network scale! 🚀

### ⏮️ What happened since 2025-04-21

After our successful 100% scale testing in round 62, we've continued to refine the parameters to make F3 as stable and performant as possible before activation. We've been methodically testing various configuration tweaks across multiple networks. In
This reduced the bootstrap bandwidth by nearly half, to ~600 KiB/s (vs ~1 MiB/s). However, we observed that in steady state, bandwidth usage returned to ~1 MiB/s because instances progress faster and chain changes trigger immediate broadcasts. In
Unfortunately, this network showed significantly worse performance. We observed many multi-round instances and eventually a strange "converge loop" at instance 38. Participation was still high, but we weren't making forward progress. With
The performance improved from the previous round 64, but it was still slower than our best networks. Catch-up took 1h45m (vs ~1h10m in our best configurations), and we think this test confirmed that both the EC delay multiplier and catch-up alignment contribute to network stability. Finally, in round 66, we returned to our most stable configuration and increased buffer sizes:
This network has been running very well! 💪 Catch-up completed in just 1h10m, and it has been in a steady state since then. The network has also been handling EC null blocks well, with F3 smoothly recovering back to -5 distance.

The key learning: both the EC delay multiplier and catch-up alignment help maintain network synchrony by effectively "slowing down" F3 slightly, giving nodes time to fetch committees and proposals. Our experiments with disabling or reducing these parameters showed they're important for stable operation at scale.

### 🟡 Known open issues
### ⏭️ Plan for 2025-04-24
Special thanks to everyone who's been helping monitor and analyze these passive testing networks! 🙏 We are down to single-digit days before F3 is activated on Mainnet.
-
Hey everyone! 👋 Here is the final update from our passive testing as we approach F3 activation on Mainnet! 🚀

### ⏮️ What happened since 2025-04-24

Key developments:
The bootstrap epoch for F3 has been set to 4920480, which corresponds to 2025-04-29T10:00:00Z - just a few days away! 📅 This will be our last passive testing update. The current passive testing round 66 will continue running until activation to provide additional observability data, and the switchover to the actual activation manifest will be automatic for node operators.

### 🟡 Known open issues

### ⏭️ Plan for activation
Special thanks to everyone who has contributed to making F3 a reality! 🙏 After years of development, testing, and refinement, we're now just days away from bringing F3's improvements to the Filecoin Mainnet.
-
filecoin-project/go-f3#213