[Milestone] Waku Network Can Support 10K Users #12

@jm-clius

Description

Priority Tracks: Secure Scalability
Due date: 31 May 2023
Milestone: https://github.com/waku-org/pm/milestone/5

Note: this deadline assumes that the target of 1 million users by end-June 2023 can for the largest part build on the solutions designed for the problem space defined below.

Summary

  • Scale to 10K Status Community users, spread across ~10 to ~100 communities
  • This milestone focuses on 100% Desktop users, primarily using relay, but with experimental/beta support for filter and lightpush for clients with poor connectivity
  • Communities, private group chats and 1:1 chats should be considered. Public chats are excluded.

Tasks / Epics


Extracted questions

  • Are the number of users and number of communities realistic? Answer on 2023-01-19: yes, makes sense as an initial goal
  • What is the proportion (in message rate and bandwidth) of community messages vs community control messages vs store query-responses?
  • Does message rate increase linearly with network size? Answer on 2023-01-19: generally this should be the case (could have a multiplicative factor, but not combinatorial or exponential)
  • What bandwidth upper bound should we target for Desktop nodes? One possible answer: ADSL2+ limit of 3.5 Mb/s?
  • Can this MVP consider participation in only one Community at a time? Answer on 2023-01-24: nodes will be part of multiple communities from the beginning.
  • What store query rate should we target for 10K users?
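The bandwidth and message-rate questions above can be sanity-checked with a back-of-envelope estimate. The sketch below uses purely illustrative assumptions (active-user count, message rate, message size, and a gossipsub duplication factor are all made up); it only shows the shape of the calculation against the proposed ADSL2+ bound.

```python
# Back-of-envelope estimate of per-node relay bandwidth for one shard.
# All parameters are illustrative assumptions, not measured values.

def relay_bandwidth_mbps(users, msgs_per_user_per_hour, avg_msg_bytes, gossip_factor):
    """Estimate downstream relay bandwidth per node on a single shard.

    gossip_factor approximates duplicate delivery overhead in gossipsub
    (a message may be received more than once from different mesh peers).
    """
    msgs_per_sec = users * msgs_per_user_per_hour / 3600
    bytes_per_sec = msgs_per_sec * avg_msg_bytes * gossip_factor
    return bytes_per_sec * 8 / 1_000_000  # bytes/s -> Mb/s

# Hypothetical community: 1,000 active users, 10 msgs/user/hour,
# 1 KiB average envelope, ~4x gossip duplication.
estimate = relay_bandwidth_mbps(1000, 10, 1024, 4)
print(f"{estimate:.2f} Mb/s")  # prints "0.09 Mb/s"
```

Under these assumed numbers a single busy community stays well under a 3.5 Mb/s bound; the open question is how the total grows when a node relays for multiple communities and shards.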

Network requirements

Note: this gathers the minimal set of requirements the Waku network must meet to support Status Communities scaling to 10K users. It does not propose a design.

1. Message Delivery and Sharding

Note: this section, especially, depends on app-defined minimal user-experience requirements. E.g. the app knows what (sub)set of messages is necessary "for a consistent experience" and this will feed into a pubsub topic, content topic and sharding design that does not compromise on UX. This process should also define when messages should be received "live" (relay) or opportunistically via history queries (store).

  1. Nodes should be able to receive (via relay or store) all community messages of the community they're part of.
  2. Nodes should receive live (via relay) all chat messages that are necessary for a consistent experience. A chat message is content sent by users either in a community channel, 1:1 or private group.
  3. Nodes should receive live (via relay) all control messages that are necessary for a consistent experience. Control messages are mostly used for community reasons, with some for 1:1 and private groups (e.g. online presence and X3DH bundle).
  4. Each community can utilize a single or multiple shards for control and community messages, as long as requirements (1) - (3) still hold.
  5. Nodes should participate in shards in such a way that resource usage (especially bandwidth) is minimized, while requirements (1) - (3) still hold.
  6. Peer and connection management should be sufficient to allow nodes to maintain a healthy set of connections within each shard they participate in.

Assumptions:

  • connectivity, NAT traversal capability, NAT hole punching, etc. is similar to that described in Status MVP: Status Core Contributors use Status #7. No further work is required within the context of this MVP.
  • it is possible to be part of several communities simultaneously
  • we assume that community size is such that community desktop nodes can realistically be expected to relay the messages for all community traffic. That is, communities can be responsible for their own relay infrastructure.
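Requirements (4) and (5) leave open how communities map onto shards. One minimal scheme, sketched below, derives the shard deterministically from the community ID so all members arrive at the same pubsub topic without coordination. The `/waku/2/rs/<cluster>/<shard>` topic format follows Waku relay sharding; the cluster ID, shard count and hash-based assignment are illustrative assumptions, not a proposed design.

```python
# Sketch: deterministically map a community to a relay shard so that all
# members derive the same pubsub topic without coordination.
import hashlib

CLUSTER_ID = 16   # assumed cluster ID reserved for this network
NUM_SHARDS = 8    # assumed shard count; requirement (4) allows one or more

def shard_for_community(community_id: str) -> int:
    """Hash the community ID into a stable shard index."""
    digest = hashlib.sha256(community_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def pubsub_topic(community_id: str) -> str:
    """Relay-sharding pubsub topic for the community's shard."""
    return f"/waku/2/rs/{CLUSTER_ID}/{shard_for_community(community_id)}"

# Every member computes the same topic for the same community.
print(pubsub_topic("example-community"))
```

A scheme like this trivially satisfies requirement (1) (all community traffic is on a known shard) but ignores load balancing; a real design would need to handle hot shards and multi-shard communities.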

2. Discovery

  1. Nodes should be able to discover peers within each shard they're interested in.
  2. Discovery method(s) can operate within a single or multiple shards, as long as:
  • requirement (1) still holds
  • nodes can bootstrap the chosen discovery method(s) for shards they're interested in
  • the chosen discovery method(s) does not add an unreasonable resource burden on nodes, especially if this method is shared between shards

Assumptions:

  • nodes are able to use discv5 as their main discovery method
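If a single discovery method is shared between shards, each node must filter discovered peers down to the shards it cares about. The sketch below models this with plain dicts; in practice the shard list would be read from a field in the peer's ENR populated via discv5, which is an assumption here, not a specified mechanism.

```python
# Sketch: select discovered peers that advertise a shard of interest.
# Peer records are modelled as plain dicts standing in for ENRs.

def peers_for_shard(discovered, shard):
    """Filter discovered peer records down to those serving `shard`."""
    return [p for p in discovered if shard in p.get("shards", [])]

discovered = [
    {"id": "peer-a", "shards": [0, 3]},
    {"id": "peer-b", "shards": [5]},
    {"id": "peer-c", "shards": [3, 5]},
]
print([p["id"] for p in peers_for_shard(discovered, 3)])  # ['peer-a', 'peer-c']
```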

3. Bootstrapping

  1. Nodes should be able to initiate connection to bootstrap nodes within the shards they're interested in.
  2. Bootstrap nodes can serve a single or multiple shards, as long as they can handle the added resource burden.

Assumptions:

  • Status initially provides bootstrapping infrastructure.
  • DNS discovery is sufficient to find initial bootstrap nodes.
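DNS discovery locators take the EIP-1459 `enrtree://<public-key>@<domain>` form. The sketch below only splits such a locator into its parts; the URL shown is a made-up placeholder, not a real fleet, and full enrtree resolution (DNS TXT lookups, signature checks) is out of scope.

```python
# Sketch: split an EIP-1459 enrtree locator into public key and domain.
# The example URL is a placeholder, not a real bootstrap fleet.

def parse_enrtree(url: str):
    prefix = "enrtree://"
    if not url.startswith(prefix) or "@" not in url[len(prefix):]:
        raise ValueError("not an enrtree URL")
    pubkey, domain = url[len(prefix):].split("@", 1)
    return pubkey, domain

pubkey, domain = parse_enrtree("enrtree://EXAMPLEKEY@nodes.example.org")
print(domain)  # nodes.example.org
```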

4. Store nodes (Waku Archive)

  1. Nodes should be able to find capable store nodes and query history within the shards they're interested in.
  2. Store nodes can serve a single or multiple shards, as long as:
  • they can handle the query rate and resource burden
  • they are discoverable as stated in requirement (1)

Assumptions:

  • Status provides initial store infrastructure, including a performant Waku Archive implementation.
  • PostgreSQL implementations exist for Waku Archive that can deal with the required rate of parallel queries to support 10K users
  • DNS discovery is sufficient to discover capable store nodes (these may or may not be the same nodes as used for bootstrapping, but discovery will be simpler if they are).
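The open question on store query rate for 10K users can be framed with the same kind of back-of-envelope arithmetic as the bandwidth question. The per-user query count below is a guessed placeholder (app starts, per-channel catch-up after reconnects, scroll-back), not a measurement.

```python
# Back-of-envelope store query rate for 10K users.
# The queries-per-user-per-day figure is an illustrative assumption.

def store_queries_per_sec(users, queries_per_user_per_day):
    """Average history-query rate across the whole store fleet."""
    return users * queries_per_user_per_day / 86_400  # seconds per day

# Hypothetical: each desktop node issues ~50 history queries a day.
rate = store_queries_per_sec(10_000, 50)
print(f"{rate:.1f} queries/s across the store fleet")
```

Averages hide the real risk: queries cluster around reconnection storms, so the PostgreSQL backend must be sized for peak parallel load, not the mean.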

5. Security

  1. Community members should not be vulnerable to simple DoS/spam attacks as defined in (3) and (4) below.
  2. Each community should be unaffected by failures and DoS/spam attacks in other communities. This implies some isolation/sharding in the messaging domain.
  3. Store/Archive:
    • store nodes for a community should only archive messages actually originating from the community
    • store nodes for a community should not be vulnerable to being taken down by a high rate of history queries
  4. Relay:
    • community relay nodes should only relay messages actually originating from the community.

Assumptions:

  • Community members (i.e. the application) are able to validate messages against community membership.
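Requirements 5(3) and 5(4) both reduce to a per-message membership check at relay and store nodes. The sketch below shows the shape of such a validator in the style of a gossipsub validation callback; the message layout, key set and accept/reject strings are illustrative assumptions layered on the stated assumption that the application can validate messages against community membership.

```python
# Sketch: a relay/store validator that accepts only messages from known
# community members. Message layout and lookup are illustrative; real
# validation would also verify the signature, which is elided here.

ACCEPT, REJECT = "accept", "reject"

def make_validator(member_pubkeys: set):
    """Build a validation callback closed over the community member set."""
    def validate(msg: dict) -> str:
        # Reject anything not attributed to a known community member.
        if msg.get("sender_pubkey") not in member_pubkeys:
            return REJECT
        # A real validator would verify msg's signature here.
        return ACCEPT
    return validate

validate = make_validator({"pk-alice", "pk-bob"})
print(validate({"sender_pubkey": "pk-alice"}))    # accept
print(validate({"sender_pubkey": "pk-mallory"}))  # reject
```

Gating both relay forwarding and archive writes on the same check gives the isolation requirement 5(2) asks for: spam from outside a community never propagates through, or persists on, that community's nodes.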

Other requirements

Note: this gathers the minimal set of requirements outside the Waku network (e.g. operational, testing, etc.) to support Status Communities scaling to 10K users.

1. Kurtosis network testing

A simulation framework and initial set of tests that can approximate:

  • the protocols
  • the discovery methods
  • the traffic rates for a typical community

in such a way as to prove the viability of any scaling design proposed to achieve the Network Requirements.

2. Community Protocol hardening

The Community Chat Protocol specifications are moved to the Vac RFC repo.

  • What else is required within this MVP time frame, e.g. including Community Chat in Kurtosis testing?

3. Nwaku integration testing

Nwaku requires integration testing and automated regression testing for releases to improve trust in the stability of each release.

4. Fleet ownership

Ownership of the infrastructure provided to Status communities should be established. This may require training and a transfer of responsibilities, which currently lie de facto with the nwaku team.
Fleet ownership comprises the responsibility for:

  • establishing a sensible upgrade process (may require some nodes for staging)
  • upgrading fleets
  • monitoring existing fleets and protocol behavior
  • providing support and logging bugs when noticed

Initial work

The requirements above will lead to a design and task breakdown. The rough order of work:

Ownership for all three items below is shared between Vac, Waku and Status teams:

(1) Agree on requirements above as the complete and minimal set to achieve the 10K scaling goal.
(2) A viable, KISS network design adhering to "Network requirements"
(3) Task breakdown of each item and ownership assignment
